How Will Genome UK Securely Handle 150 Petabases of Genomic Data?
Industry Insight Dec 04, 2020
Genome UK, the UK government’s recently announced genomic healthcare strategy, represents an ambitious effort to leverage sequencing power to solve healthcare problems. With a headline goal of creating “the most advanced genomic healthcare system in the world”, Genome UK hopes to amass genomic data from hundreds of thousands of individuals for applications in pharmaceutical development and cancer research.
PetaGene, a genomic data storage company based in Cambridge, UK, recently warned that Genome UK will produce a staggering 150 petabases of genomic data and that the government needs to address four important issues to ensure that this data is handled securely and with its integrity intact. To find out more, Technology Networks spoke to PetaGene’s co-founder and chief commercial officer, Vaughan Wittorff, PhD, and co-founder and chief executive officer, Dan Greenfield, PhD.
Ruairi Mackenzie (RM): Genome UK is an incredibly ambitious data effort. What are the main challenges that the project faces from a bioinformatics perspective?
Vaughan Wittorff (VW): We believe Genome UK will enable the NHS to leverage genomics research and precision medicine in a multitude of ways for the future of healthcare. The strategy can improve patient outcomes and reduce costs, drive research into new treatments and diagnostics and foster an ecosystem for the UK Genomics Industry to thrive.
It’s important to note that the weight of this strategy lies in genome sequencing and the resultant data. For example, there is already a national effort to sequence 500,000 whole genomes by 2024. The NHS England Genomic Medicine Service is currently implementing whole-genome sequencing as routine care to supplement this effort.
The success of Genome UK lies in the public’s willingness to share highly sensitive data for research and clinical applications – data which must be properly managed (with controlled and secure access) to gain and maintain the public’s trust in the initiative. As such, the correct execution of Genome UK involves successfully navigating potential ‘Bio-IT issues’ which arise when handling massive biological datasets. In the past, similar genomics projects have failed to achieve set goals due to these technical challenges.
Due to the scope of this initiative and our vast experience in addressing issues in this domain, we have laid out recommendations that can help prevent some foreseeable problems and ensure that Genome UK is successfully implemented.
Our recommendations cover key issues in:
● Data security and privacy, including regional encryption and data minimization
● Reducing technical barriers to ensure efficient IT and lower costs
● The need for computational reproducibility, and supporting existing pipelines
● Data integrity and information loss
RM: Which of these bioinformatics challenges is most pressing?
Dan Greenfield (DG): Each of these is critically important; however, the most sensitive area is data security and privacy, since this is real NHS patient data.
Genome UK will be making data from about 500,000 samples available to a variety of organizations and researchers. This poses considerable data security and privacy challenges. Genomic data is unique to each individual and is the most identifiable piece of information about them, so access to it needs to be controlled. While we understand that Genome UK will allow researchers to access the data only by running compute instances inside its platform, this is insufficient to prevent information leakage or de-anonymization. It is still vital that data is accessed according to a “minimum necessary” rule, in which the data remains encrypted and researchers are authorized according to their actual usage needs. Detailed auditing must also be in place so that data stewards know exactly 1) who has accessed their data, 2) when, 3) which regions of the genomic data they accessed, and 4) for which purposes (with which tools, commands and options). Such measures fill a key security gap in preventing data leaks, and also help to build public trust.
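To make the “minimum necessary” rule and the four audit questions concrete, here is a minimal Python sketch. It is an illustration only, not PetaGene’s or Genome UK’s implementation: the class names (`Region`, `AccessRecord`, `RegionAccessPolicy`), the researcher name and the coordinates are all hypothetical, and a real system would also keep the data encrypted and enforce the check at the storage layer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Region:
    """A half-open genomic interval [start, end) on one chromosome."""
    chrom: str
    start: int
    end: int

    def contains(self, other: "Region") -> bool:
        # True when `other` lies entirely inside this region.
        return (self.chrom == other.chrom
                and self.start <= other.start
                and other.end <= self.end)


@dataclass(frozen=True)
class AccessRecord:
    """One audit entry: who, when, which region, for what purpose."""
    who: str
    when: datetime
    region: Region
    purpose: str
    granted: bool


class RegionAccessPolicy:
    """Hypothetical policy: researchers may read only regions granted to them."""

    def __init__(self):
        self._grants = {}    # researcher name -> list of granted Regions
        self.audit_log = []  # every request is recorded, granted or not

    def grant(self, researcher: str, region: Region) -> None:
        self._grants.setdefault(researcher, []).append(region)

    def request(self, researcher: str, region: Region, purpose: str) -> bool:
        # Simplification: the request must fit inside a single granted region.
        allowed = any(g.contains(region)
                      for g in self._grants.get(researcher, []))
        self.audit_log.append(AccessRecord(
            who=researcher,
            when=datetime.now(timezone.utc),
            region=region,
            purpose=purpose,
            granted=allowed,
        ))
        return allowed


policy = RegionAccessPolicy()
policy.grant("alice", Region("chr17", 43_000_000, 43_200_000))

# Inside the granted window: allowed, and logged.
in_scope = policy.request(
    "alice", Region("chr17", 43_050_000, 43_100_000), "samtools view")
# A different chromosome: refused, and still logged.
out_of_scope = policy.request(
    "alice", Region("chr13", 32_000_000, 32_100_000), "samtools view")
```

Logging refused requests as well as granted ones is a deliberate choice in this sketch: a data steward reviewing the log can then see attempts to reach beyond an authorization, not just successful reads.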
RM: How can these challenges be overcome?
DG: The ideal way to overcome the security challenges is to ensure that Patient Health Information is stored in an encrypted state, in which only specific data, down to a fine-grained level within each file where applicable, is made accessible on a need-to-know basis. Furthermore, the data steward needs to be able to ensure compliance with regulations such as GDPR and HIPAA, as well as with patient consent agreements, so all accesses need to be tracked in an easy-to-search, tamper-evident ledger that can be reported on for internal and external auditing. It is essential that these measures be fully compatible with all bioinformatics analysis tools and neither cause inconvenience nor degrade the tools’ efficiency. PetaGene put together the Protect solution, which addresses the compliance requirements while maintaining compatibility, efficiency and performance.
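One common way to make an access log tamper-evident is a hash chain, in which each entry’s hash covers the previous entry’s hash. The Python sketch below illustrates that general technique under stated assumptions; it is not a description of how PetaGene’s Protect product works, and the `AuditLedger` class and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    # Serialize deterministically so the same record always hashes the same.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLedger:
    """Hypothetical append-only ledger whose entries are hash-chained."""

    def __init__(self):
        self.entries = []

    def append(self, record: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        entry = {
            "record": record,
            "prev_hash": prev_hash,
            "hash": _entry_hash(prev_hash, record),
        }
        self.entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        # Recompute every hash; any edited, removed or reordered entry
        # breaks the chain from that point on.
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
                return False
            prev_hash = entry["hash"]
        return True


ledger = AuditLedger()
ledger.append({"who": "alice", "when": "2020-12-04T10:00:00Z",
               "region": "chr17:43050000-43100000", "purpose": "samtools view"})
ledger.append({"who": "bob", "when": "2020-12-04T11:30:00Z",
               "region": "chr13:32000000-32100000", "purpose": "bcftools view"})
intact = ledger.verify()

# Rewriting history after the fact is detected on the next verification.
ledger.entries[0]["record"]["who"] = "mallory"
after_tampering = ledger.verify()
```

A chain like this is tamper-evident rather than tamper-proof: someone able to rewrite every entry could recompute the whole chain, so in practice the latest hash would also need to be stored or published somewhere the ledger’s operator cannot alter.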
Molly Campbell (MC): PetaGene outlined a call-to-action for Genome UK. Are there any plans for the company to work with the UK Government to address some of the bioinformatics challenges highlighted?
VW: Genome UK is an exciting new phase for genomics in the UK. As a UK-based company located in the home of genomics in Cambridge, we are looking forward to playing a part in this important initiative. PetaGene’s experience with the large-scale challenges posed by genomic data places us in an ideal position to offer critical technological solutions that will reduce the costs of storing data and enhance the security and accessibility of these data. In short, PetaGene is always ready and willing to engage with UK Government efforts to address some of the bioinformatics challenges highlighted here.
Vaughan Wittorff and Dan Greenfield were speaking to Ruairi J Mackenzie and Molly Campbell, Science Writers for Technology Networks.