Enhancing Data Provenance To Safeguard Biomedical Research Integrity
The growing use of public data repositories has transformed how researchers access and analyze vast datasets.
In today's biomedical research landscape, data integrity and provenance have become critical components of successful scientific investigations. The growing use of public data repositories has transformed how researchers access and analyze vast datasets – facilitating breakthroughs in genomics, transcriptomics and bioinformatics. However, these advancements have also introduced challenges, particularly regarding the reliability and traceability of shared data.
Data provenance – which tracks the origin, movement and transformations of data throughout its lifecycle – plays a crucial role in ensuring data integrity. Without robust provenance, researchers face difficulties assessing the quality of data, potentially leading to inaccurate conclusions and wasted resources. These issues are particularly pressing as more research relies on pooled datasets from multiple sources, where even minor inconsistencies can have far-reaching consequences.
Technology Networks recently spoke with Jonathan Jacobs, PhD, senior director of Bioinformatics and BioNexus principal scientist at the American Type Culture Collection (ATCC), a nonprofit organization that collects, stores and distributes standard reference microorganisms, cell lines and other materials for research and development. Jacobs, who leads ATCC’s Sequencing & Bioinformatics Center, discussed how data provenance influences data integrity in scientific research and how a lack of standardization can lead to issues like data poisoning.
Can you discuss why data repositories are important in biomedical research?
Over the last decade, data repositories have become an increasingly important driver of biomedical research. Breakthroughs in next-generation sequencing have made it possible to produce massive amounts of genomic, transcriptomic and epigenomic data, while advancements in bioinformatics analytical methods, artificial intelligence (AI) and machine learning have given us an unprecedented ability to make sense of it all. By bringing multiple sources of related data together into accessible data repositories, researchers are empowered to perform statistically robust comparative genomics studies and discover new relationships between otherwise disparate datasets – something that is often impossible for a single lab to carry out.
Can you explain how data provenance influences data integrity, and why these two concepts are so critical for biomedical researchers using data repositories?
The value of any genomics data repository depends on the quality of its underlying data. To make use of this data, researchers need to know the context surrounding it. For example: Who collected it, where was it collected and what techniques were used to produce it? Answers to questions such as these are critical, because they inform researchers on how genomics data can be reliably used, interpreted and reproduced when needed. Information about the origin, movement and post-processing of data throughout its lifecycle (i.e., data provenance) helps researchers evaluate the accuracy and reliability (i.e., integrity) of a data set.
As concepts, the value of data provenance and data integrity is easy to understand. However, in practice, ensuring the integrity of all data sets contained within a repository is far more difficult. There are many reasons for this. Often there is no standardized language to guide researchers when describing data sets, meaning researchers may have to go through a laborious and time-consuming process to reconcile similar data sets with discordant naming conventions. It is also common for data to be submitted with incomplete or inaccurate documentation.
A recent study looked at the reported cell lines used across 420 different research papers. The provenance of a cell line is critical information to submit to a repository – knowing where the cells came from and how they have been manipulated over time greatly affects how the resulting data is interpreted. More than half of the included studies (235) cited the use of a non-verifiable cell line, meaning there is no record of how the cell line was produced. Not only does this lack of provenance affect the reproducibility of each study, but it also undermines our ability to assess the integrity of the data.
From a data repository perspective, data produced from mislabeled or unknown cell lines can be a significant problem because it may lead to data poisoning – misleading researchers or producing statistical noise that obscures the data’s underlying patterns. Put another way, bad data can poison the well by distracting researchers, which in turn leads to wasted time, resources and careers.
For these reasons, it’s vital for data repositories to have good data integrity, meaning all the data contained therein is accompanied by strong documentation of its provenance.
What are the positives of using public data repositories for scientific research?
Public data repositories, such as the National Center for Biotechnology Information’s (NCBI) GenBank database, are essential to modern biomedical research. Many discoveries and advancements in the life sciences over the last 30 years would not have been possible without resources like these. There are numerous benefits of these international resources; a significant example is that they facilitate comparative studies and reproducibility at scale in the life sciences.
Researchers typically compromise on the scale of their studies owing to the costs, resources and precious time spent collecting data. Because of this, subtle or rare biological phenomena can be difficult to observe and even harder to conclude from when limited to individual studies performed in a single lab. Data repositories allow researchers to pool data sets together and massively expand the scale of their study. This allows for statistically well-powered studies that produce more robust conclusions, whilst conserving time and resources.
Additionally, the use of public data repositories facilitates greater reproducibility, in part by making raw data readily accessible to other researchers.
And drawbacks?
While valuable, public repositories also have notable drawbacks, including a lack of control or insight into data quality and completeness. Not all datasets undergo rigorous quality control checks, which can lead to issues with data integrity, reliability and usability. If an error is made when submitting data to a public database, this may be propagated several times as researchers use this data in subsequent publications. One particularly damaging form of this could come in the circular reuse and duplication of data – wherein the same data set is used and republished to the repository under a different name. Such duplication could greatly skew distribution statistics and bias analyses.
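As a rough sketch of how a repository might catch this kind of duplication, the Python example below fingerprints submissions by their normalized sequence content rather than by file or record names, so the same data resubmitted under a different name produces the same hash. It is a minimal illustration with invented function names, not any repository’s actual pipeline.

```python
import hashlib
from pathlib import Path

def content_fingerprint(fasta_path: Path) -> str:
    """Hash a FASTA file by its sequence content only, ignoring record
    names, line wrapping and case, so a renamed resubmission of the
    same data produces the same fingerprint."""
    sequences, current = [], []
    for line in fasta_path.read_text().splitlines():
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.strip().upper())
    if current:
        sequences.append("".join(current))
    # Sort records so their order does not change the fingerprint
    canonical = "\n".join(sorted(sequences))
    return hashlib.sha256(canonical.encode()).hexdigest()

def find_duplicates(submissions: list[Path]) -> dict[str, list[Path]]:
    """Group submitted files that share a content fingerprint."""
    groups: dict[str, list[Path]] = {}
    for path in submissions:
        groups.setdefault(content_fingerprint(path), []).append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Flagging such collisions before a submission is accepted would prevent renamed copies of the same dataset from inflating distribution statistics.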
As described above, if data lacking provenance and integrity is included in a knowledge base, it has the potential to cause data poisoning, significantly affecting research funding, resource allocation and overall progress.
Sharing data publicly, especially in biomedical research, also raises concerns about patient privacy and data security. Data generated from patients must be properly anonymized and managed to comply with ethical and legal standards, such as GDPR and HIPAA.
You’ve mentioned data poisoning – can you explain what it is and how it occurs?
In this context, data poisoning is the intentional or unintentional addition of false data to a database, with the potential for significant downstream effects.
Consider the COVID-19 pandemic. To rapidly develop therapeutics and public health tools, researchers around the world relied on shared data repositories. This data included genomic data, viral evolutionary trees and epidemiological data, such as contact tracing. The integrity of this data was paramount: considerable resources would be committed – to vaccine development, for example – if evidence surfaced of a new viral strain. Researchers had to be certain that the data describing those strains was authentic and generated through robust means.
Data poisoning can come about through many different means. Malicious actors may tamper with data to skew results or sabotage research, such as by altering genetic sequences or patient records. In a competitive biotechnological landscape, there might even be attempts to undermine competitors by corrupting shared datasets. By far the most common form of data poisoning, however, is simple human error, which can lead to mistakes in data entry or labeling. With the rise of large language models, data poisoning could also take the form of false publications that obscure the relevance of genuine studies.
Regardless of the source, the consequences of data poisoning on the bioeconomy can be far-reaching. Erroneous data can lead to false research conclusions, wasted resources spent on follow-up studies and abandoned projects, and can further exacerbate the reproducibility crisis in science. Economically, the development of ineffective products based on poisoned data can result in significant financial losses for companies, including product recalls and legal liabilities, while eroding public trust and investor confidence in the field.
Defending against data poisoning should be a significant priority for data repositories. Unfortunately, many common public repositories have a low bar for data integrity and traceability, leaving them highly susceptible to data tampering, falsification and “sloppy” science.
There are fortunately some simple steps that can be taken to improve data provenance and integrity in the biomedical space.
Researchers should, for example, always confirm that the cell line they believe they are using has been authenticated. For researchers using human or mouse cell lines, validation is frequently done by submitting a sample of the cell line to a qualified service organization (such as ATCC) to have short tandem repeat profiling performed on the material – a widely accepted method for verifying the identity of a cell line.
On the microbial side, 16S or 18S/ITS sequencing is often an inexpensive way to confirm the species identity of a bacterial or fungal strain. A more comprehensive and detailed approach, however, should include genome and/or transcriptome profiling of the materials, followed by comparison to a reference dataset that serves as ground truth. For example, ATCC has been expanding a database of fully traceable genome references for thousands of organisms held within its physical biorepository. The ATCC Genome Portal, which recently surpassed 5,000 authenticated reference genomes produced in-house at ATCC directly from reference materials, provides a single source of truth for ATCC materials.
It can be important to check a cell line’s genomic profile against these databases because subtle genetic changes can accumulate over time in many different types of biological systems. Genomics offers the best approach to detect these shifts before they negatively impact research conclusions and reproducibility. This level of data provenance can go a long way towards ensuring the integrity of your data.
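As a simple illustration of what comparing material against a trusted reference can look like, the sketch below estimates how similar an isolate’s 16S sequence is to a reference sequence using k-mer containment. The file names and the 95% threshold are assumptions made for this example; a real workflow would typically rely on an established alignment tool and a curated reference, such as a type-strain sequence or an authenticated reference genome.

```python
from pathlib import Path

def read_fasta(path: Path) -> str:
    """Return the concatenated sequence from a single-record FASTA file."""
    return "".join(
        line.strip().upper()
        for line in path.read_text().splitlines()
        if not line.startswith(">")
    )

def kmers(seq: str, k: int = 21) -> set[str]:
    """All overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def containment(query: str, reference: str, k: int = 21) -> float:
    """Fraction of the query's k-mers found in the reference,
    a quick proxy for sequence similarity."""
    q, r = kmers(query, k), kmers(reference, k)
    return len(q & r) / len(q) if q else 0.0

# Hypothetical file names used for illustration only.
query = read_fasta(Path("isolate_16S.fasta"))
reference = read_fasta(Path("reference_16S.fasta"))

score = containment(query, reference)
print(f"k-mer containment vs. reference: {score:.2%}")
if score < 0.95:  # illustrative threshold, not a validated cutoff
    print("Warning: isolate may not match the expected species.")
```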
To enhance public data repositories in biomedical research, one crucial development would be requiring a degree of data provenance before accepting data into these knowledge bases.
Ensuring that researchers provide detailed records of the data's origin, processing steps and any transformations it has undergone would greatly improve transparency and traceability. This level of documentation allows others to understand the complete history of the data, helping to verify its integrity and reproducibility.
By making data provenance a prerequisite for repository submission, we can significantly reduce the risk of data poisoning and ensure that the data used in biomedical research is reliable and trustworthy.
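A minimal sketch of what such a submission prerequisite could look like in practice is shown below; the required fields and field names are illustrative assumptions, not an actual repository schema.

```python
# Illustrative required provenance fields, not an actual repository schema.
REQUIRED_PROVENANCE_FIELDS = [
    "source_material",      # e.g., cell line or strain identifier
    "collected_by",         # submitting lab or institution
    "collection_date",
    "sequencing_platform",
    "processing_steps",     # ordered list of transformations applied
]

def validate_submission(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the record
    documents enough provenance to be accepted."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_PROVENANCE_FIELDS
        if not metadata.get(field)
    ]
    steps = metadata.get("processing_steps") or []
    if steps and not all(isinstance(step, str) and step.strip() for step in steps):
        problems.append("processing_steps must be non-empty descriptions")
    return problems

example = {
    "source_material": "example cell line identifier",
    "collected_by": "Example Lab",
    "collection_date": "2024-03-01",
    "sequencing_platform": "Illumina NovaSeq",
    "processing_steps": ["adapter trimming", "alignment to GRCh38"],
}
print(validate_submission(example))  # [] means the record would be accepted
```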
Another valuable development would be the establishment of cross-industry working groups dedicated to standardizing the language used in metadata that describes the materials and how the data was produced. Metadata is essential for providing context and understanding of a dataset, but inconsistencies in terminology and format can lead to misinterpretation or misuse of data. By bringing together experts from various sectors – academia, industry and regulatory bodies – these working groups can develop unified standards for metadata, ensuring clarity across different repositories. Standardized metadata would not only enhance data quality but also facilitate data sharing and interoperability, making it easier for researchers to integrate and compare datasets from different sources.
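To make the benefit of a shared vocabulary concrete, here is a small, hypothetical sketch of metadata normalization; the synonym table is invented for illustration and is not drawn from any existing standard.

```python
# Hypothetical synonym table, invented for illustration only.
CANONICAL_TERMS = {
    "platform": {
        "novaseq": "Illumina NovaSeq",
        "illumina novaseq 6000": "Illumina NovaSeq",
        "minion": "Oxford Nanopore MinION",
        "ont minion": "Oxford Nanopore MinION",
    },
    "material_type": {
        "cell-line": "cell line",
        "cellline": "cell line",
        "bacterial isolate": "microbial strain",
    },
}

def normalize(field: str, value: str) -> str:
    """Map a free-text metadata value onto a shared vocabulary term,
    falling back to the original value when no mapping exists."""
    synonyms = CANONICAL_TERMS.get(field, {})
    return synonyms.get(value.strip().lower(), value)

print(normalize("platform", "NovaSeq"))        # -> Illumina NovaSeq
print(normalize("material_type", "CellLine"))  # -> cell line
```

Agreeing on the canonical terms themselves is the hard, cross-industry part; once they exist, normalization like this makes datasets from different repositories directly comparable.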
Together, these initiatives would have a profound effect on overall data integrity in biomedical research.