The Many Silver Linings of Cloud-Based Data Analytics
Blog Mar 08, 2018 | by Ruairi J Mackenzie, Science Writer for Technology Networks
Riffyn, who specialize in cloud-based R&D software, have recently been awarded a patent on a hypergraph data model for experiment design and data analysis. We caught up with Riffyn's CEO, Timothy Gardner, to discuss cloud-based data analytics' role in improving discovery, reproducibility, and scalability in life sciences research.
Ruairi Mackenzie (RM): Why should scientists turn to cloud-based solutions for their research data management?
Timothy Gardner (TG): R&D is a deeply collaborative undertaking. Scientists’ work depends on ideas, methods and data collected from colleagues across the world, over months or years. Project teams at some of our customers are spread across multiple sites on three continents. This lends itself naturally to cloud-based solutions because they provide a means to centralize access to data, along with its sharing and integration.
More specifically, cloud-based systems have these advantages:
- Single source of truth for global sample (material, equipment) identification and measurement data.
- Real-time data sharing.
- Greater scalability and higher performance due to the ability to add servers on demand for parallelization of computing.
- Continuous upgrades of hardware and software delivering a system that is always at the forefront of technology advances.
- Lower maintenance and operating costs than on-premises systems.
- Greater security due to concentration of resources into a deeply protected virtual environment isolated from human access.
- Greater uptime and disaster protection due to built-in failover redundancy and multi-layer backup systems.
Some companies may fear that cloud means reduced data accessibility or customization. But a good architecture can address this, as Riffyn does, by providing APIs for customization and adopting vendor-neutral data standards.
Cloud solutions may not be appropriate for every situation, for example where real-time control of equipment is required. But in most situations, the benefits of cloud-based software are quite profound, and we see a rapid shift of companies to this approach for scientific data.
RM: Riffyn has been awarded a patent for hypergraph data modelling. Has this patent been essential for moving into the US market?
TG: To date, the technology claimed by this patent has played a fundamental role in our business growth. It gives the Riffyn SDE an unmatched capability to describe R&D processes, to adapt to process changes, and to automatically integrate data. Our sales have been driven by customers’ recognition of Riffyn’s unique approach, and by Riffyn’s success in solving really difficult data and analytics problems in R&D. However, we have also seen intensifying market activity and interest in the Riffyn approach, and therefore a greater potential for copycat approaches. The issuing of this patent will help to solidify our position as the leading provider of software for experiment design and data analysis.
RM: What are the problems currently facing researchers in their use of data and how can Riffyn’s SDE benefit these researchers?
TG: There is much hype about how machine learning could significantly accelerate the pace of discovery, but data need to be annotated, collated, and cleaned before machine learning can be performed. Right now these data cleaning activities are the bottleneck that controls the pace of discovery. Research shows that data scientists spend up to 80% of their time performing these low-value activities, and only 20% of their time doing the important part: analyzing and learning from that data. So even if machine learning significantly accelerates the pace of analysis (the 20% part), the pace of discovery will not be appreciably impacted until the pace of the other 80% (data collation, organization, and cleaning) is significantly improved.
Traditional Excel, ELN and LIMS based approaches to data management do not deliver the annotation, structuring and linking of data needed for statistical analysis and machine learning. As a result, scientists get stuck in “spreadsheet hell” as they attempt to bring data together manually. Related datasets are recorded by different people separated in time and location, with little to no indication of how the data are related. In order to learn from the datasets, they need to be annotated and joined together across unit operations and across experiments. However, this data joining and annotation is most commonly performed manually, often requiring scientists to physically track down the people who understand the data connections (if such people exist).
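The 80/20 argument can be made concrete with a quick calculation. The sketch below is only an illustration: it assumes the 80%/20% split quoted above and applies the standard Amdahl's-law form for the overall speedup when only one share of the work gets faster. The function name `overall_speedup` is made up for this example.

```python
def overall_speedup(fraction_improved: float, speedup: float) -> float:
    """Overall speedup when only `fraction_improved` of the work gets `speedup` times faster."""
    untouched = 1.0 - fraction_improved
    return 1.0 / (untouched + fraction_improved / speedup)


# Assumed split from the text: 20% analysis, 80% data preparation.
ANALYSIS, PREPARATION = 0.20, 0.80

faster_analysis = overall_speedup(ANALYSIS, 10.0)           # analysis 10x faster
instant_analysis = overall_speedup(ANALYSIS, float("inf"))  # analysis takes no time at all
faster_preparation = overall_speedup(PREPARATION, 10.0)     # preparation 10x faster

print(f"10x faster analysis:    {faster_analysis:.2f}x overall")
print(f"instant analysis:       {instant_analysis:.2f}x overall")
print(f"10x faster preparation: {faster_preparation:.2f}x overall")
```

Under these assumptions, making analysis ten times faster yields only about a 1.22x overall gain, and even instantaneous analysis caps out at 1.25x. Making the preparation work ten times faster gives roughly 3.57x.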
Riffyn’s goal is to free up scientists and data scientists to spend their time learning from the data instead of cleaning it. In other words, Riffyn aims to shift the rate limiting step from data organization and cleaning to data analysis, thus enabling machine learning to actually increase the pace of discovery.
The Riffyn SDE achieves this aim by automatically annotating and integrating data to produce structured data tables, ready for immediate statistical analysis across experimental steps, samples and time. This is enabled by the hypergraph data model. Users simply draw a process flow diagram describing their process, and record their data on the appropriate step in the diagram. The Riffyn SDE then translates the visual representation of experimental processes into a flexible hypergraph data store that tracks the relationships between experimental parameters, samples and data.
When a user is ready to perform data analysis, the Riffyn SDE analytics engine walks the hypergraph to assemble comprehensive data tables in 30 seconds or less. This allows users to explore cause-and-effect relationships in their experimental data in real time. These tables can then be used for machine learning analyses in statistical tools such as JMP, Python, R and Spotfire to significantly accelerate the pace of a “design, measure, discover, iterate” cycle.
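To give a feel for what "walking a hypergraph to assemble a table" can mean, here is a toy sketch. It is not Riffyn's implementation or data model. It assumes a simplified structure in which each run of a process step is a hyperedge linking a set of input samples to a set of output samples. The names `StepRun` and `lineage_row`, the sample IDs and all values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StepRun:
    """One hyperedge: a run of a process step linking several inputs to several outputs."""
    step: str
    inputs: frozenset
    outputs: frozenset
    params: dict = field(default_factory=dict)


def lineage_row(sample, runs, measurements):
    """Walk upstream from `sample`, gathering parameters and measurements into one flat row."""
    row = {"sample": sample}
    stack, seen = [sample], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Measurements recorded directly on this sample (names assumed unique in this toy).
        row.update(measurements.get(current, {}))
        # Every step run that produced this sample contributes its parameters,
        # and its inputs become the next samples to visit.
        for run in runs:
            if current in run.outputs:
                for name, value in run.params.items():
                    row[f"{run.step}.{name}"] = value
                stack.extend(run.inputs)
    return row


# Made-up example: two fermentations followed by two purifications.
runs = [
    StepRun("ferment", frozenset({"strain_A", "media_1"}), frozenset({"broth_1"}), {"temp_C": 30}),
    StepRun("ferment", frozenset({"strain_A", "media_2"}), frozenset({"broth_2"}), {"temp_C": 37}),
    StepRun("purify", frozenset({"broth_1"}), frozenset({"product_1"}), {"column": "C18"}),
    StepRun("purify", frozenset({"broth_2"}), frozenset({"product_2"}), {"column": "C18"}),
]
measurements = {
    "media_1": {"glucose_g_per_L": 20},
    "media_2": {"glucose_g_per_L": 40},
    "broth_1": {"od600": 4.1},
    "broth_2": {"od600": 6.3},
    "product_1": {"yield_mg": 12.0},
    "product_2": {"yield_mg": 19.5},
}

# One row per final product, with upstream parameters and measurements joined in.
table = [lineage_row(s, runs, measurements) for s in ("product_1", "product_2")]
for row in table:
    print(row)
```

Each row ends up holding the final yield next to the upstream fermentation temperature and media composition. This is the kind of join that would otherwise be done by hand across spreadsheets.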
RM: How can the SDE help solve the age-old problem of reproducibility in science?
TG: Irreproducibility stems from two main sources: 1) a lack of understanding of which experimental parameters affect key outcomes, leading scientists to incorrectly identify cause and effect, and 2) a lack of clarity around experimental methods, i.e., “I don’t understand what you did, so I can’t possibly reproduce it.” The Riffyn SDE addresses both problems.
The lack of understanding of which experimental parameters affect key outcomes is addressed by allowing scientists to record all experimental parameters as structured data alongside their measurement data. With the Riffyn SDE, the data are no longer spread across batch records, Excel worksheets, and databases, so it is straightforward to perform correlation analyses on all your variables to determine which ones affect key outcomes, not just the variables that you think matter or those that are most convenient to test.
In addition, it is our belief that the process designs at the heart of the Riffyn SDE will augment or even replace written Materials and Methods sections, in much the same way that CAD drawings are used to define engineered products. This will eliminate the second source of irreproducibility by providing unambiguous visual records of experimental processes, equipment, parameters and reagents. Both methods and associated data sets can then be shared with colleagues with just a click, and then further extended by those colleagues, who can revise the processes and add their own data sets.
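As a rough illustration of screening every recorded parameter against an outcome, the sketch below ranks parameters by the absolute value of their Pearson correlation with a result column. It assumes the data already sit in one flat table with numeric columns. The helper names `pearson` and `rank_parameters` and all the numbers are invented for this example. A strong correlation only flags a candidate for follow-up; it does not establish cause and effect.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, reported as 0 here
    return cov / sqrt(var_x * var_y)


def rank_parameters(rows, outcome):
    """Return (parameter, correlation) pairs, strongest absolute correlation first."""
    ys = [row[outcome] for row in rows]
    scores = {
        name: pearson([row[name] for row in rows], ys)
        for name in rows[0]
        if name != outcome
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up table: one row per experiment, all parameters recorded alongside the outcome.
rows = [
    {"temp_C": 30, "ph": 7.0, "stir_rpm": 200, "yield_mg": 12.0},
    {"temp_C": 32, "ph": 6.8, "stir_rpm": 200, "yield_mg": 14.1},
    {"temp_C": 34, "ph": 7.1, "stir_rpm": 200, "yield_mg": 15.9},
    {"temp_C": 36, "ph": 6.9, "stir_rpm": 200, "yield_mg": 18.2},
]

ranking = rank_parameters(rows, "yield_mg")
for name, r in ranking:
    print(f"{name:10s} r = {r:+.3f}")
```

In this invented data set, temperature tracks yield closely and pH barely does. Stirring speed was never varied, so the table cannot say anything about its effect.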