7 Projects That Are Harnessing the Power of Big Data
Big data. It seems like the phrase is everywhere. Scientists across many fields have been early adopters of big data production, management, and analysis, a move driven by the rapid generation of large and complex scientific datasets by instruments and devices in labs across the globe. Ever smaller, more readily available instruments can produce huge volumes of data, leaving some concerned that this deluge may escalate out of control, making it too difficult to find relevant data and to derive the meaningful patterns and insights that solve the problem in question. But what does big data mean in the real world? Who is producing all this data? And, importantly, what are they doing with it?
This list brings together 7 enormous projects that are harnessing the power of big data to solve big problems in science.
1. Broad Genomics
Broad Institute researchers generate on the order of 20 terabytes of sequence data every day [1] (roughly equivalent to more than 6.6 billion tweets or 3,300 high-definition feature-length movies), making them the largest producer of human genomic information in the world. To date, they have processed more than 1.5 million samples from over 1,400 groups in 50 countries [2]. One of the core labs at the Broad, the world-famous Zhang lab, is pioneering the development and application of CRISPR-Cas9 and CRISPR-Cpf1 [3]. To support ground-breaking projects like this, the Broad employs a dedicated LIMS and Analytics group who develop and maintain a bespoke blend of custom software and off-the-shelf solutions [4]. After years of depending on in-house storage, the Broad has partnered with Google to leverage the (essentially limitless) Google Cloud Platform. From there, they utilise open-source Java-based tools developed in-house, including the Genome Analysis Toolkit (GATK) and Picard, for data processing and analysis [5,6].
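The headline figures above are easy to sanity-check. A back-of-envelope sketch in Python (the per-tweet and per-movie sizes are inferred from the article's own equivalences, not official Broad numbers):

```python
# Back-of-envelope check of the Broad's daily output equivalences.
DAILY_OUTPUT_TB = 20
bytes_per_day = DAILY_OUTPUT_TB * 10**12  # using decimal terabytes

# Implied size of one "tweet" if 20 TB ~ 6.6 billion tweets
implied_tweet_bytes = bytes_per_day / 6.6e9       # ~3 KB incl. metadata
# Implied size of one HD movie if 20 TB ~ 3,300 movies
implied_movie_gb = bytes_per_day / 3_300 / 10**9  # ~6 GB

print(f"~{implied_tweet_bytes:.0f} bytes per tweet")
print(f"~{implied_movie_gb:.1f} GB per movie")
```

Both implied sizes are plausible (a tweet plus its metadata runs to a few kilobytes; an HD feature film is a handful of gigabytes), so the comparisons hold together.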
2. Nestlé – Food Safety and Quality Testing
For Nestlé, the world’s largest food company [7], big data is a big issue. Speaking at “The Future of the Food Industry” last year, Professor Guy Poppy explained that the company carries out around 100 million analytical tests every year. This equates to around 200,000 tests every day at the factory level, plus around 10,000 safety tests conducted at regional labs each day [8]. Tests are carried out to verify that every batch of every product leaving a factory complies with internal and external standards, including checks for harmful compounds or microorganisms in the raw materials, the production environment, and the product itself. The regional labs alone are staffed by over 950 people, including 30 group and regional experts, working in 25 ISO-accredited labs in countries right across the world [9]. Since 2015, Nestlé has been involved in a movement to improve big data sharing between companies like itself and regulatory authorities such as the FSA, enabling data mining to track emerging food safety issues.
3. AstraZeneca – Sequencing 2 million genomes
Last year, AstraZeneca launched a massive effort to compile genome sequences and health records from two million people over the next decade [10]. Menelas Pangalos, executive vice-president of the company's innovative medicines programme, stated that this will cost “hundreds of millions of dollars”. He went on to explain that this project alone would produce about 5 petabytes of data, saying, “If you put 5 petabytes on DVDs, it would be four times the height of the 310-metre-tall London Shard”. Much of this data will be produced and managed by their partner, Human Longevity, whose ultimate aim is to sequence 10 million human genomes and pair them with medical records. Powered by improved bioinformatics, the project aims to identify rare genetic sequences associated with disease and treatment response.
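Pangalos's Shard comparison checks out arithmetically. A quick sketch, assuming standard single-layer 4.7 GB DVDs at 1.2 mm thick (standard disc figures, not from the article):

```python
# Stack height of 5 PB written to single-layer DVDs vs the London Shard.
DVD_CAPACITY_BYTES = 4.7e9   # single-layer DVD
DVD_THICKNESS_M = 0.0012     # 1.2 mm per disc
SHARD_HEIGHT_M = 310
DATA_BYTES = 5e15            # 5 petabytes (decimal)

n_dvds = DATA_BYTES / DVD_CAPACITY_BYTES   # ~1.06 million discs
stack_height_m = n_dvds * DVD_THICKNESS_M  # ~1,280 m
print(f"{stack_height_m / SHARD_HEIGHT_M:.1f}x the height of the Shard")
```

About 1.06 million discs stack to roughly 1,280 metres, a little over four Shards, matching the quote.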
4. EMBL-EBI – PRIDE Archive
The PRoteomics IDEntifications (PRIDE) database is a centralised, standards-compliant, public data repository for proteomics data, including protein and peptide identifications, post-translational modifications, and supporting spectral evidence. Speaking at ISAS 2016 in Dortmund, Juan Antonio Vizcaino, Proteomics Team Leader at EMBL-EBI, described how the archive is made up of over 4,000 datasets from over 50 countries and includes data produced by over 1,700 groups [11]. At the time, this one database, one of many that EMBL-EBI is responsible for, consisted of over 560,000 files occupying 225 terabytes of storage. Around 150 new datasets are submitted every month, a rate that is only going to increase [11]. To add to the challenge, over half of the database is publicly available, and users download around 200 terabytes of data every single year [11]. Currently, EMBL-EBI is powered by a 20 Gbit/s internet connection and more than 40,000 CPU cores, with access to 70 petabytes of storage [12]. Databases like PRIDE are playing a key role in mapping the human proteome by enabling researchers to access, download, and build on previously published data. Current projects focus on identifying the roughly 75% of spectra that typically go unidentified in any proteomics mass spectrometry experiment [13].
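The PRIDE figures imply some useful averages. A quick sketch derived purely from the numbers quoted above (nothing here beyond simple division):

```python
# Averages implied by the PRIDE figures quoted above.
TOTAL_FILES = 560_000
TOTAL_STORAGE_TB = 225
ANNUAL_DOWNLOAD_TB = 200
MONTHLY_SUBMISSIONS = 150

avg_file_mb = TOTAL_STORAGE_TB * 10**6 / TOTAL_FILES  # TB -> MB
print(f"average file size: ~{avg_file_mb:.0f} MB")
# Users pull nearly the archive's entire volume each year.
print(f"annual downloads / holdings: {ANNUAL_DOWNLOAD_TB / TOTAL_STORAGE_TB:.2f}")
print(f"datasets submitted per year: {MONTHLY_SUBMISSIONS * 12}")
```

An average file of roughly 400 MB, and yearly downloads amounting to nearly 90% of the archive's total volume, give a sense of why both storage and outbound bandwidth are challenges here.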
5. The Human Brain Project
The HBP Flagship was launched by the European Commission's Future and Emerging Technologies (FET) scheme in October 2013 and is scheduled to run for ten years [14]. The project aims to build a collaborative, information and communications technology-based scientific research infrastructure allowing researchers across Europe to advance knowledge in the fields of neuroscience, computing, and brain-related medicine. The core data that fuels the project is generated by slicing human brains into several thousand 60-micrometre-thick sections and scanning them using 3D polarised light imaging. These scans are then combined to create a 3D digital reconstruction of individual nerve fibres, which will eventually be assembled on a larger scale into a digital map of the human brain. Each slice generates around 40 gigabytes of data, which equates to several petabytes of raw data for the entire brain [15].
The project relies on four high-performance computing infrastructures. One of these, the HBP Massive Data Analytics Supercomputer at Cineca, provides 2 petaflops of peak computational power and 200 terabytes of main memory, integrated with a mass storage facility offering more than 5 petabytes of working space. This system will also be integrated with another data facility providing an additional 5 petabytes of online disk storage and a further 10 petabytes for long-term data preservation [16]. The service's architecture has been carefully designed to scale to millions of files and petabytes of data, combining robustness with versatility.
6. NCI – Genomic Data Commons
The Genomic Data Commons (GDC) is a unified data system that promotes sharing of genomic and clinical data between researchers [17]. An initiative of the National Cancer Institute (NCI), the GDC is a core component of the National Cancer Moonshot and the President’s Precision Medicine Initiative (PMI), and benefits from $70 million allocated to the NCI to lead efforts in cancer genomics as part of PMI for Oncology. The GDC aims to centralise, standardise, and make accessible data from large-scale NCI programs such as The Cancer Genome Atlas (TCGA) and its paediatric equivalent, Therapeutically Applicable Research to Generate Effective Treatments (TARGET) [18,19]. Together, TCGA and TARGET represent some of the largest and most comprehensive cancer genomics datasets in the world, comprising more than two petabytes of data (one petabyte is roughly equivalent to 223,000 DVDs filled to capacity).
On top of this, the GDC has been tasked with creating a standardised data submission process, ensuring data quality, harmonising large genomic datasets, and providing secure access to the data. Three Cancer Genomics Cloud (CGC) Pilots have also been launched to give cancer researchers access to genomic data while harnessing the elastic computational power of the cloud [20]. This eliminates the need for researchers to download petabytes of data, along with the prohibitive cost and time such downloads require. The Cloud Pilots also allow researchers to take advantage of hosted cutting-edge analysis pipelines or to bring their own tools to the cloud. Through cooperation and collaboration within and between academia, government, and private industry, the GDC, along with the technology and lessons learned from the CGC Pilots, will continue to enhance the democratisation of cancer data and further the mission of the NCI.
7. Swiss Institute of Bioinformatics – VITAL-IT
The SIB Swiss Institute of Bioinformatics (SIB), set up 18 years ago, aims to foster excellence in data science to support progress in biological research and health [21]. Made up of 750 scientists in 60 groups spread across Switzerland, the SIB supplies and maintains more than 150 high-quality databases and software platforms for the global life science research community. Within the SIB, a smaller group called VITAL-IT is responsible for providing expertise in data storage and analysis [22]. Over the last five years, this group has handled more than 75 research projects on a vast array of topics, ranging from ecology to pharmacodynamics, and has been involved in more than 90 publications to date.
To achieve this, VITAL-IT harnesses 7,000 CPUs and 7.5 petabytes of storage located across five different sites [23]. They use this infrastructure to archive around 30 terabytes of raw sequencing, imaging, serotyping, and behavioural data per week, plus a further 120 terabytes per week of results from analysing those data. All of this is carried out whilst enabling high-speed access to the data for up to 900,000 scientists and future-proofing the storage so that it can be reliably accessed for decades to come [23].
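The scale of the archiving challenge is clear from simple arithmetic on the figures above. A sketch that ignores compression, deduplication, and data retirement, all of which stretch capacity in practice:

```python
# How quickly VITAL-IT's quoted storage would fill at the quoted ingest rate.
RAW_TB_PER_WEEK = 30       # raw sequencing, imaging, serotyping, behavioural
RESULTS_TB_PER_WEEK = 120  # archived analysis results
CAPACITY_PB = 7.5

weekly_tb = RAW_TB_PER_WEEK + RESULTS_TB_PER_WEEK  # 150 TB/week
yearly_pb = weekly_tb * 52 / 1000                  # ~7.8 PB/year
weeks_to_fill = CAPACITY_PB * 1000 / weekly_tb     # 50 weeks
print(f"ingest: {yearly_pb:.1f} PB/year; capacity fills in ~{weeks_to_fill:.0f} weeks")
```

At the quoted rates, the raw capacity would fill in under a year, which illustrates why managing data for decades, not just storing it, is the group's stated focus.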
1. Broad Institute. Data Sciences. Available at https://www.broadinstitute.org/data-sciences (Accessed 25 August 2017).
2. Broad Institute. Genomics. Available at https://www.broadinstitute.org/genomics (Accessed 25 August 2017).
3. Broad Institute. Zhang Lab – Areas of Focus. Available at https://www.broadinstitute.org/zhang-lab/areas-focus (Accessed 25 August 2017).
4. Broad Institute. LIMS and Analytics. Available at https://www.broadinstitute.org/genomics/lims-and-analytics (Accessed 25 August 2017).
5. Broad Institute. Genome Analysis Toolkit. Available at https://software.broadinstitute.org/gatk/ (Accessed 25 August 2017).
6. Broad Institute. Picard. Available at https://broadinstitute.github.io/picard/ (Accessed 25 August 2017).
7. Forbes. Nestlé Tops the List of Largest Food & Beverage Companies in the World. Available at https://www.forbes.com/pictures/gimf45klj/nestle-tops-the-list-of/#70bb04924398 (Accessed 25 August 2017).
8. Nestlé. How Nestlé Ensures Safe Food: Our Global Standards. Available at http://www.nestle.com/asset-library/documents/about_us/ask-nestle/nestle-ensures-safe-food-lead.pdf (Accessed 25 August 2017).
9. Nestlé. Food safety at Nestlé combining foresight, vigilance and harmonised standards. Available at http://www.nestle.com/asset-library/documents/investors/nis-2013-vevey/john-obrien-randd-food-safety.pdf (Accessed 25 August 2017).
10. Ledford, H. (2016). AstraZeneca launches project to sequence 2 million genomes. Nature, 532(7600), 427.
11. EMBL – European Bioinformatics Institute. Proteomics and the “big data” trend: challenges and new possibilities (Talk at ISAS Dortmund). Available at https://www.slideshare.net/JuanAntonioVizcaino/proteomics-and-the-big-data-trend-challenges-and-new-possibilitites-talk-at-isas-dortmund (Accessed 25 August 2017).
12. EMBL-EBI. European Genome Phenome Archive at the European Bioinformatics Institute. Available at https://www.turing-gateway.cam.ac.uk/sites/default/files/asset/doc/1609/Helen-parkinson.pdf (Accessed 25 August 2017).
13. Griss, J., Perez-Riverol, Y., Lewis, S., Tabb, D. L., Dianes, J. A., Del-Toro, N., ... & Wang, R. (2016). Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nature methods, 13(8), 651-656.
14. Human Brain Project – Overview. Available at https://www.humanbrainproject.eu/en/science/overview/ (Accessed 25 August 2017).
15. IEEE Spectrum. The Human Brain Project Reboots: A Search Engine for the Brain Is in Sight. Available at http://spectrum.ieee.org/computing/hardware/the-human-brain-project-reboots-a-search-engine-for-the-brain-is-in-sight (Accessed 25 August 2017).
16. Cineca. Available at https://www.cineca.it/en (Accessed 25 August 2017).
17. National Cancer Institute – Genomic Data Commons. Available at https://gdc.cancer.gov/ (Accessed 25 August 2017).
18. National Cancer Institute – The Cancer Genome Atlas. Available at https://cancergenome.nih.gov/ (Accessed 25 August 2017).
19. National Cancer Institute – TARGET: Therapeutically Applicable Research to Generate Effective Treatments. Available at https://ocg.cancer.gov/programs/target (Accessed 25 August 2017).
20. National Cancer Institute – Center for Biomedical Informatics & Information Technology. NCI Cloud Resources. Available at https://cbiit.nci.nih.gov/ncip/cloudresources (Accessed 25 August 2017).
21. Swiss Institute of Bioinformatics. Available at http://www.sib.swiss/ (Accessed 25 August 2017).
22. Vital-IT – Competence Centre in Bioinformatics and Computational Biology. Available at https://www.vital-it.ch/services (Accessed 25 August 2017).
23. Bright talk – Data for Decades: Managing Bioinformatics for the Long Term at SIB. Available at https://www.brighttalk.com/webcast/13139/186673/data-for-decades-managing-bioinformatics-for-the-long-term-at-sib (Accessed 25 August 2017).