It is a modern-day scandal that, while developing therapies and potentially life-saving cures, oncology scientists face challenges in identifying and accessing research data. Now in the public spotlight, only because of the late Baroness Tessa Jowell’s selfless decision to donate her medical data, the problem is clear: advancements in cancer research are being hampered by issues of data sharing, security and ownership, both ethical and commercial. Repositive was founded in 2014 to tackle this exact problem by creating the tools required for more efficient and ethical sharing of relevant data, and we recently authored the FAIR Guide for Data Providers to Maximize Sharing of Human Genomic Data, published in PLOS Computational Biology.
It is generally acknowledged that data sharing is critical for the reproducibility and progress of human genomic research. Each successful sharing transaction is an exchange between a data user and a data provider. Providers of human genomic data (e.g., publicly or privately funded repositories and data archives, such as the Universal Cancer Databank to which Baroness Jowell donated her medical data) fulfill their social contract with data donors when their shareable data conforms to FAIR (findable, accessible, interoperable, reusable) principles. Our platforms index data held across numerous repositories so that researchers can search in one central place for global data to power their research. Based on our experience of developing them, we wanted to propose guidelines for data providers wishing to maximize the FAIRness of their data.
The act of sharing is a two-way process. A data producer might first delegate the provision of their data to a trusted repository, or data provider, where a data consumer will, in turn, find and access the data. Whether the data is donated free of charge for public benefit or made available to interested parties at a cost, the managers of these repositories have a responsibility to ensure that those searching for the data are able to find and use it. We propose tips for providers of human genomic data wishing to use FAIR principles, which include the following:
1. Establish a FAIR-aware patient consent framework. Consent frameworks dictate the extent to which human genomic data can be accessed and reused, so it is essential to describe clearly whether the data is intended to be shared beyond the scope of the current project – i.e. for general research. If so, consent forms should set out the potential risks and benefits for participants, as well as any procedures to anonymize the data. The Global Alliance for Genomics and Health (GA4GH) has developed consent codes that facilitate the integration of distinct consent types across different legal systems, and it is always advised that ethical and genetic counselling experts are consulted when choosing appropriate consent forms.
2. To maximize the potential for data reuse, specify the intended uses and limitations of the data and provide sufficient information about the type of data being shared, including what format the data is in and what size the files are. It is also important to clearly define the technologies the data originated from, the experimental conditions, and any limitations as to how the data can be reused. There are further considerations too. For instance, it is important to distinguish between raw and processed data types for human genome-based data, as raw sequencing data must be processed before it can be interpreted.
3. Use machine-readable data and metadata, with complete, coherent and standard descriptions, to maximize the likelihood that the data will be found. A good template is the PGP-Harvard data collection. An increasing number of data repositories exist around the world, and a number of metadata catalogues (like the Repositive platforms) have been created to increase the discoverability of data in these individual repositories – but this makes it even more critical to standardize metadata descriptions for researchers searching the catalogues. Because researchers are concerned about not being recognized for their work, it is also essential to ensure that data is shared in a citable way, so that they have an incentive to continue sharing their data.
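To make these tips more concrete, here is a minimal sketch of what a machine-readable metadata record covering them might look like. It is an illustration only: the `DatasetMetadata` class, its field names, the helper functions and the example values are all hypothetical, invented for this sketch, and do not represent the Repositive platforms, the PGP-Harvard template or any GA4GH standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    consent_scope: str            # tip 1: e.g. project-specific or general research
    file_format: str              # tip 2: format of the shared files
    file_size_bytes: int          # tip 2: size of the shared files
    technology: str               # tip 2: technology the data originated from
    experimental_conditions: str  # tip 2
    processing_level: str         # tip 2: "raw" or "processed"
    reuse_limitations: str        # tip 2: limits on how the data can be reused
    citation: str                 # tip 3: how the dataset should be cited


def missing_fields(record):
    """Return the names of fields left empty, in declaration order."""
    return [name for name, value in asdict(record).items()
            if value in ("", None, 0)]


def to_json(record):
    """Serialize the record to JSON so that a catalogue could index it."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Made-up example values, not a real dataset.
record = DatasetMetadata(
    title="Example tumour sequencing cohort",
    consent_scope="general research",
    file_format="FASTQ",
    file_size_bytes=2_500_000_000,
    technology="whole-genome sequencing",
    experimental_conditions="",
    processing_level="raw",
    reuse_limitations="non-commercial research only",
    citation="",
)

print(missing_fields(record))  # fields the provider still needs to complete
print(to_json(record))
```

A provider could run a check like `missing_fields` before publishing, so that incomplete descriptions are caught before they reach a metadata catalogue. A real deployment would more likely adopt an established community schema than define its own fields.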
The potential for progress that results from the sharing of human health data is, of course, not limited to oncology research, and the Farr Institute has been flying the flag for data sharing for some time. Its 2014 #DataSavesLives campaign highlighted that, without data, we would not be able to learn why more children die in England than in Sweden – the answer to which may help to reduce deaths in this country. Nor would we be able to make predictions about what’s going to happen to patients in order to make better decisions about when to perform procedures.
In the UK we also have a unique resource: we have data on everyone from cradle to grave, thanks to the NHS. If we wish to ensure that publicly funded services like the NHS are being run effectively, and if we want to offer future generations the best treatments, we must be able to use this data for the public benefit. As precision medicine starts to impact patient lives more and more, it is expected that sharing of datasets containing potentially sensitive information will become more widespread – hence it is crucial to have these FAIR tips on how to keep patient genomic data reusable whilst complying with consent frameworks.
Since 2014, however, when the #DataSavesLives campaign kicked off and Repositive was founded, the security of individuals’ data has become an increasingly hot topic. High-profile data breaches are becoming more commonplace, and stories about data held on individuals being used against them are causing people to think twice about sharing their data (e.g. genetic data has been used to convict, and could be used to hike insurance premiums or to reveal non-paternity). With GDPR coming into effect, the NHS is also developing a tool to enable individuals to opt out of their confidential patient information being used for planning and research purposes.
So while we create, store, analyse and act upon more data from around the globe each year, it’s certainly right that data security be at the forefront of our agenda. If we want to advance our culture of data sharing, in order to speed up discovery of cancer cures and therapies, it is more important than ever that data providers not only ensure their data is FAIR, but that it is secure, too.
At Repositive we increase access to data whilst simultaneously ensuring individuals’ and industry’s concerns about privacy, security and IP protection are addressed. Others in the sector are similarly focused on increasing data security without inhibiting sharing and, positively, a team of computer scientists and mathematicians at MIT recently discovered a way to encrypt genetic data, so that up to 23,000 people’s genetic codes can be analysed at once, while keeping them anonymous to as many people as possible.
But data brokers like Repositive can only do so much. To achieve better and faster discovery of therapies and cures, individual patients, researchers and pharma need to follow Baroness Jowell’s example and share their data – so that we can play our part by making sure it is findable, accessible, interoperable and reusable to researchers.
Fiona Nielsen is founder and CEO of Repositive