I’m sure you all know the party game Telephone, where a message is given to a person at one end of a line and is whispered to the next, then the next, until it reaches its eventual destination at the other end. Small changes ensure that the final message is totally different to how it started.
World War One gave us a tragic example of what can happen when a message suffers from miscommunication. A message sent from the trenches to British headquarters started as:
• “Send reinforcements, we’re going to advance”
By the time the message reached HQ it had become:
• “Send three and fourpence, we’re going to a dance”
So what has this got to do with data integrity?
Well, shared datasets can suffer the same end as the fateful British message – as your dataset is passed around, small changes and errors introduced over time can kill its accuracy, rendering it unfit for purpose.
Here, we’re going to look at data integrity in the context of shared data and introduce some procedures that will guard against miscommunication.
What is Data Integrity?
The principle of data integrity is that data should be recorded exactly as intended and, when later retrieved, should be the same as when it was recorded. To achieve this, data handling procedures must ensure the accuracy and consistency of data over its entire life cycle.
To maintain data integrity standards, the FDA uses the acronym ALCOA, where data should be:
• Attributable – Data should demonstrate when it was observed and recorded, by whom, and who or what it is about
• Legible – Data should be easy to understand, recorded permanently and with original entries preserved
• Contemporaneous – Data should be recorded at the same time as it was observed
• Original – Source data should be preserved in its original form
• Accurate – Data should be error free
Why is Data Integrity Important?
Central to data integrity is the principle that there should be an authoritative dataset that serves as a single source of truth. Here, we need to think SMaRRT, where having a well-defined data integrity system increases:
• Stability – all data integrity operations are performed in one centralized system, ensuring consistency and repeatability
• Maintainability – one centralized system makes all data integrity administration simpler
• Reusability – all applications benefit from a single centralized data integrity system
• Recoverability – a single, centralized data source can be backed up regularly
• Traceability – every data point should be traceable back to its origin
This last point – traceability – is particularly important here. When we think of a dataset, we typically consider it static and unchanging, but it isn’t. When we collect data there will be errors. The “Original” principle of ALCOA states that we should preserve those data in their original state. But when we clean out these errors, we change the data from the original state to something else. Keeping a dataset in its original state, changing it, and still having access to the original means holding multiple copies of the same dataset at various stages of handling. This is called version control, where we keep a chronological record of everything that has been done to the dataset at every stage. It has the benefit of acting as a natural back-up system, although you will also need a separate back-up policy.
“Version control is extremely important,” says Dr Deans Buchanan, Palliative Medicine Consultant and Clinical Lead at NHS Tayside. “You need to know what’s updated and when”. He recommends naming files with date prefixes in this format:
• “year.month.day – name of file”
• “2018.08.28 – Version Control Your Files.docx”
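As a minimal sketch of this naming convention, the Python snippet below builds a date-prefixed file name. The function name `versioned_name` and its `separator` parameter are illustrative assumptions, and the default separator is a plain hyphen rather than the dash shown in the example above.

```python
from datetime import date


def versioned_name(name: str, on: date, separator: str = " - ") -> str:
    """Prefix a file name with a zero-padded year.month.day date."""
    # Year-first, zero-padded dates sort chronologically when sorted as text.
    return f"{on:%Y.%m.%d}{separator}{name}"


example = versioned_name("Version Control Your Files.docx", date(2018, 8, 28))
print(example)  # 2018.08.28 - Version Control Your Files.docx
```

Because the year comes first and each part is zero-padded, an ordinary alphabetical sort of the folder lists the versions in chronological order.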
Dr Steven Wall, Director of SJW Bio-Consulting Limited, echoes this, maintaining that you should “version control every change, and with each change highlight what was changed and who performed the up-revision”. He also insists that “full transparency and openness is key to building trust with all partners”, including reporting all decisions and operations “whether bad or good”.
In summary, every time you create a new version of a file you should immediately make a backup copy, so that every file has both a history and a backup, all listed in chronological order.
What Can Go Wrong?
If you don’t have an effective data integrity system, your data might suffer from miscommunication, changing over time until it bears little or no resemblance to the original dataset.
Briefly, data integrity may be compromised through:
• Human error – whether malicious or unintentional
• Transfer errors – unintended alterations during transfer between devices
• Compromised hardware – such as a disk crash
• Missing metadata – the information needed to understand the data may be missing, rendering the data useless
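One common way to catch transfer errors is to compare a checksum of the file before and after it is moved. The sketch below uses SHA-256 from Python's standard library. The helper names `sha256_of` and `transfer_is_intact` and the file names are hypothetical, and the demonstration simulates a corrupted copy rather than a real transfer.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source: Path, copy: Path) -> bool:
    """True when the copy has exactly the same bytes as the source."""
    return sha256_of(source) == sha256_of(copy)


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "dataset.csv"
    original.write_text("id,score\n1,42\n")
    good_copy = Path(tmp) / "copy.csv"
    shutil.copyfile(original, good_copy)
    bad_copy = Path(tmp) / "corrupted.csv"
    bad_copy.write_text("id,score\n1,24\n")  # simulated alteration in transit
    intact = transfer_is_intact(original, good_copy)
    corrupted = transfer_is_intact(original, bad_copy)

print(intact, corrupted)  # True False
```

A matching checksum only shows that the copy is byte-for-byte identical to the source. It says nothing about whether the source itself was correct.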
“If you put rubbish in, you get rubbish out” Deans says, “but most dangerous of all” he adds, “is when you have good data that is turned to rubbish via error – if you don’t recognize that error then your rubbish out – the incorrect results of your analyses – is
How to Safeguard Against Errors and Data Corruption
All data, whether a departmental database, an Excel spreadsheet, passwords or documents, should have a single source of truth. These are the minimum requirements for maintaining data integrity:
• A single authoritative data source
• Version control
• Back-up systems
• A gate-keeper (a source of responsibility)
• Maintenance procedures, including adequate training
• Documentation of data handling procedures
• An access policy which determines who may access the data
• A user record-keeping strategy, detailing who, what, when, where and why
• A reporting system to report errors back to the authoritative source
• An auditing procedure, to ensure accountability for inaccuracies entered into the system
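The record-keeping item in the list above (who, what, when, where and why) could be sketched as an append-only change log. This is one illustrative design, not a prescribed one: the `ChangeRecord` class, the `record_change` function and the example values are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    who: str    # person who made the change
    what: str   # what was changed
    when: str   # ISO 8601 timestamp
    where: str  # file or dataset affected
    why: str    # reason for the change


def record_change(log, who, what, where, why, when=None):
    """Append one change to the log as a JSON line and return the record."""
    stamp = (when or datetime.now(timezone.utc)).isoformat()
    entry = ChangeRecord(who=who, what=what, when=stamp, where=where, why=why)
    log.append(json.dumps(asdict(entry)))
    return entry


audit_log = []
fixed_time = datetime(2018, 8, 28, 9, 30, tzinfo=timezone.utc)
entry = record_change(
    audit_log,
    who="operator_a",
    what="corrected age in row 12",
    where="master_data.csv",
    why="typo found during cleaning",
    when=fixed_time,
)
print(audit_log[0])
```

Entries are only ever appended, never edited, so the log itself stays traceable. In practice the list would be replaced by a file or database table held alongside the authoritative dataset.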
While maintaining data integrity in this way may seem onerous, these procedures are flexible – the size and cost of the procedures you put in place should be proportional to the value of your data.
Dr Catherine Paterson, a Lecturer in the School of Nursing and Midwifery at Robert Gordon University, gives some insights into how her team collected and shared data in a UK-Australia study: “We developed an agreed coding book for the whole research team and a master data file with precisely the same variables and labels. This was distributed to all those involved in data entry, and facilitated easy merging of the UK and Australian datasets”. Although they did experience some disparities, “the clear and transparent coding of the variables from the outset minimized problems and data failure”.
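A shared coding book can also be enforced in code before datasets are merged. The sketch below is not the team's actual procedure. The variable names in `CODEBOOK`, the example rows and both function names are hypothetical, and it checks only variable names and basic types.

```python
# Hypothetical coding book: variable name -> expected Python type.
CODEBOOK = {"patient_id": str, "age": int, "site": str}


def check_against_codebook(rows, codebook=CODEBOOK):
    """Return a list of problems; an empty list means the rows conform."""
    problems = []
    expected = set(codebook)
    for index, row in enumerate(rows):
        if set(row) != expected:
            problems.append(f"row {index}: variables {sorted(row)} do not match the coding book")
            continue
        for name, kind in codebook.items():
            if not isinstance(row[name], kind):
                problems.append(f"row {index}: {name} should be {kind.__name__}")
    return problems


def merge_sites(*datasets, codebook=CODEBOOK):
    """Merge datasets only if every row matches the coding book."""
    merged = []
    for rows in datasets:
        problems = check_against_codebook(rows, codebook)
        if problems:
            raise ValueError("; ".join(problems))
        merged.extend(rows)
    return merged


uk_rows = [{"patient_id": "UK001", "age": 64, "site": "UK"}]
au_rows = [{"patient_id": "AU001", "age": 58, "site": "AU"}]
merged = merge_sites(uk_rows, au_rows)

# A mislabelled variable ("Age" instead of "age") is reported, not merged.
mislabelled = check_against_codebook([{"patient_id": "AU002", "Age": 71, "site": "AU"}])
print(len(merged), len(mislabelled))  # 2 1
```

Rejecting a non-conforming file at merge time sends the error back to whoever entered the data, instead of letting it pass silently into the master file.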
On data entry, Steven says that “if possible, implement electronic, rather than manual recording”, and in terms of ensuring correctness, he told me that you should “have all data checked and signed off as accurate by an operator, then verified by a second operator”.
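The two-operator rule can be enforced in an electronic system instead of being left to convention. The sketch below is one possible model, with a hypothetical `Entry` class and made-up operator names: an entry counts as accepted only once it has been signed off by one operator and verified by a different one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    value: str
    checked_by: Optional[str] = None
    verified_by: Optional[str] = None

    def sign_off(self, operator: str) -> None:
        """First operator confirms that the entry is accurate."""
        self.checked_by = operator

    def verify(self, operator: str) -> None:
        """Second operator verifies; it must not be the one who signed off."""
        if self.checked_by is None:
            raise ValueError("entry must be signed off before it is verified")
        if operator == self.checked_by:
            raise ValueError("verification needs a second, different operator")
        self.verified_by = operator

    @property
    def accepted(self) -> bool:
        return self.checked_by is not None and self.verified_by is not None


entry = Entry(value="systolic_bp=128")
entry.sign_off("operator_a")
try:
    entry.verify("operator_a")  # same person: rejected
    same_operator_rejected = False
except ValueError:
    same_operator_rejected = True
entry.verify("operator_b")
print(same_operator_rejected, entry.accepted)  # True True
```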
Deans, speaking from bitter personal experience, insists that you should “never have only one person who knows the password”. After all, passwords need to be backed up too. Unless you like losing access to entire datasets that you’ve carefully built and maintained for several years…
The three most important take-away messages for maintaining integrity in shared data are that:
1. there should be a single source of truth for that data
2. all changes should be traceable back to the original via version control
3. every version should be backed up
Summing up, Deans points out that shared wisdom itself is also version controlled and backed up:
“You’ve got to plan ahead. Seek advice from those who have experience and made all the usual mistakes. Listen to them! Then commit to data entry and sharing being the foundation of the work”.
Wise words indeed…