Publishing Open Data in the Plant Sciences
Industry Insight Aug 18, 2015
RD: The decreasing costs of data generation, coupled with the increasing demands for open and reproducible science, pose a very pressing issue: How do we move from a scenario where data are described poorly and publicly deposited infrequently to a point where the description and deposition of data become commonplace? There is a high time cost associated with enriching data with information about all the experimental and analytical processes that went into producing it (metadata), and this is one of the things that COPO is aiming to improve through simpler user interfaces, consolidated access to public repositories, and wizards to guide annotating data with metadata. The most important benefits of standardisation are that data become easier to find, integrate and reuse in other scientific research. Standardising data descriptions across a variety of experimental organisms will also make it easier to transfer knowledge between researchers who work on different plants. Researchers who study a particular crop species might therefore be able to benefit from work performed in a different crop if they understand how the experiments were performed, and can access the right datasets and tools to reproduce and expand on the work carried out in those different organisms.
An exciting by-product of these efforts is the improvement in recognition for depositing data in the public domain. Providing the means to cite and track data is vital to understanding the value and impact of research outputs in this digital age. A paper in a hard-copy journal isn't a good way of assessing scientific impact in the current climate of fast-paced, data-intensive plant science, and a lot of the information that underpins the results of a study (often relegated to supplemental appendices) just isn't available to other researchers in a usable form. Making data and analyses first-class citizens in plant science brings obvious benefits: it provides a clear, open and interconnected knowledge base of research to support efforts to address the global grand challenges of food security and plant health.
RD: Access to data is a key part of modern science. This paradigm spans all the way from experimental metadata, through raw data, to processed datasets that are relevant to downstream biological problems. "Big data" is also a hot topic at the moment, and the sheer amount and complexity of the data now generated daily make it harder and harder for scientists to find the right datasets to contribute to their research, let alone analyse them. COPO aims to tackle the central challenge of preparing the groundwork for discovery, reuse and recognition of data-intensive research outputs.
JR: The Genome Analysis Centre has been a part of the Collaborative Open Data Plant Omics Consortium (COPO) since its inception in 2014. What were some of the main drivers behind forming this consortium?
RD: The COPO project comprises a number of partners: the University of Warwick, TGAC, the University of Oxford e-Research Centre, and EMBL-EBI. The grant awarded by the Biotechnology and Biological Sciences Research Council (BBSRC) allowed the partners to formally collaborate on this important research area. The partners each have clear experience in infrastructure development, data management and handling, metadata specifications, experimental design, and community interactions. None of these is a small domain of research, so each needs focused and clear coordination, which the consortium and its associated supporters provide.
JR: COPO recently met for their first workshop, hosted at The Genome Analysis Centre. How will this meeting help to move the project on to the next stage?
RD: The success of COPO depends on frequent and in-depth conversations with its key stakeholders, i.e. the biologists and bioinformaticians who will use the system. This initial workshop helped us understand the current state of the nation with regard to the data needs of users. We discussed data repositories and services that already exist for certain domains of plant science, and also those areas where suitable repositories are not available or not ready to cope with the new data deluge in domains such as high-throughput phenotyping. A GARNet Community blog post describing the workshop has been published, if readers would like to know more: http://blog.garnetcommunity.org.uk/copo-2015-meeting/