

How To Build a Data-Centric Future for Biotechnology

Data and network icons in front of a group of people in discussion.
Credit: Gerd Altmann / Pixabay


The following article is an opinion piece written by Markus Gershater. The views and opinions expressed in this article are those of the author and do not necessarily reflect the official position of Technology Networks.

In biology, harvesting large amounts of data is insufficient. Data must be comprehensive and interconnected so that it can identify the interactions at the heart of biology.

This is because biology’s complexity stems from it being an emergent phenomenon. It has novel and intricate patterns, behaviors and properties, which come from the interplay of simpler component parts. You can’t predict emergent properties based on individual components. Predicting them often means understanding a system's interactions and dynamics in a holistic way.

Here, I’ll delve into different aspects of the data we’ll need to generate to help us truly understand the biological systems we deal with.

Datasets should be both large and diverse

The data-centric biotechnology of the future will likely look very different from the way biology operates today. Today the emphasis is on gathering large and varied datasets, such as multiomic data for large numbers of patient groups. These large “observational” datasets are unsuitable for some stages of the drug discovery process, but they can prove immensely helpful for specific parts, like target identification. So while these large datasets are key to biological insight, they aren’t enough on their own. The validation of targets, for instance, partly depends on the perturbation of clinically relevant cellular models. As we disrupt the activity or expression of our target of interest, we must get to grips with how these cells behave. This type of data isn’t just “observational”. It’s actively produced as part of the experiment.

Datasets must span large numbers of dimensions

Building a meaningful model of a system’s behavior requires data that can portray the dynamic nature of living systems. Apart from genomes, which mostly remain fixed, all other aspects of living systems change significantly over time, and they depend on a host of factors related to the conditions that a system is exposed to. This means that sheer quantity of data won’t fit the bill. Gathering data under a limited set of conditions can be insufficient or, in the worst case, seriously misleading. Instead, we need to create multi-dimensional datasets that capture the changes in timings and conditions that could impact a biological system. Blending this multi-dimensional approach with accumulating as much data as possible for each set of conditions can help us better understand the biology we work with.
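To make the idea of a multi-dimensional dataset concrete, here is a minimal sketch in Python that lists every combination of conditions and time points an experiment would need to cover. The factor names and levels are purely illustrative assumptions, not taken from any real study.

```python
from itertools import product

def design_grid(factors):
    """Return one dict per combination of factor levels (a full factorial grid)."""
    names = list(factors)
    levels = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*levels)]

# Hypothetical factors and levels, chosen only for illustration.
factors = {
    "compound_dose_uM": [0.0, 0.1, 1.0, 10.0],
    "temperature_C": [30, 37],
    "timepoint_h": [0, 6, 24],
}

grid = design_grid(factors)
print(len(grid), "conditions to measure")
print(grid[0])
```

Even three modest factors multiply into 24 distinct conditions here, which is one reason such datasets tend to be planned systematically rather than gathered ad hoc.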

Taking the right measures to preserve quality

Multi-dimensional and dynamic data is all well and good. But without getting the fundamentals of quality right, your data will be useless. While there are different ways to measure quality, I’d say that there are two things worth keeping an eye on.

The first is making sure we’re measuring the right things. It’s all too tempting to take the easy route, like someone who drops their keys in the dark undergrowth but chooses to look under the lamp-post instead. You’ll end up with models that look sophisticated but aren’t at all relevant.

The second thing worth mentioning is making sure our assays are of the highest possible quality. Assays are responsible for one-third of pre-clinical expenditure for a sample drug discovery program, according to the NIH Assay Guidance Manual, so making sure these assays yield the cleanest possible data is very important.

Context is key

Our near-perfect view of data (large, varied, dynamic and high-quality) is still missing one key ingredient for enduring value: context. Without context, data is worthless. There’s a reason all data is produced. It has a place in the bigger picture. Think about it like this: an important experiment’s key output could be a csv file holding 96 numbers. Those 96 numbers could identify a brilliant lead compound for an as-yet untreatable disease. But without context, it’s just a bunch of numbers.

This is what happens to any data that the scientist who produced it leaves behind without metadata, the part describing why and how the data was generated. In an ideal world, we’d have as much context as humanly possible. This way, AI could help power analyses of today’s data in ways we aren’t yet able to imagine.
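As a rough illustration of keeping context attached to results, the sketch below bundles 96 plate readings with the metadata that explains them, and refuses to create the record if that context is missing. Everything here is an assumption made for the example: the `PlateResult` class, the required metadata keys, the row-major well layout and the placeholder values are all hypothetical.

```python
from dataclasses import dataclass

ROWS = "ABCDEFGH"  # assumed layout: 8 rows x 12 columns = 96 wells
COLS = 12
# Hypothetical minimum context; a real lab would define its own fields.
REQUIRED_KEYS = ("purpose", "assay", "operator", "date")

@dataclass
class PlateResult:
    values: list    # 96 readings, assumed row-major (A1..A12, B1..B12, ...)
    metadata: dict  # why and how the data was generated

    def __post_init__(self):
        if len(self.values) != len(ROWS) * COLS:
            raise ValueError("expected exactly 96 values")
        missing = [k for k in REQUIRED_KEYS if k not in self.metadata]
        if missing:
            raise ValueError(f"missing metadata: {missing}")

    def well(self, name):
        """Look up a reading by well name, e.g. 'B7'."""
        row = ROWS.index(name[0].upper())
        col = int(name[1:]) - 1
        if not 0 <= col < COLS:
            raise ValueError(f"no such well: {name}")
        return self.values[row * COLS + col]

# Placeholder readings and made-up metadata, for illustration only.
plate = PlateResult(
    values=[float(i) for i in range(96)],
    metadata={
        "purpose": "example screen",
        "assay": "example assay",
        "operator": "example operator",
        "date": "2024-01-01",
    },
)
print(plate.well("B1"), plate.metadata["purpose"])
```

The design choice worth noting is that the numbers cannot exist in this form without their context: a record with no stated purpose or assay is rejected at creation rather than discovered to be uninterpretable years later.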

Where the future of data-centric biology could lead us

Our recent research revealed that an astonishing 43% of R&D leadership lacks confidence in the quality of their experiment data. This worrying statistic calls not only for improvements to our methods of recording experimental data, but also for producing higher quality experimental data to begin with. Understanding our data, therefore, also requires a granular level of detail on how it was generated: we should be collecting as much metadata about experimentation as we can.

Companies like Recursion and Insitro are already looking towards the future when thinking about their approach to gathering data about biological systems. They have constructed entire automated platforms around this concept. Fully digitized, they are engineered to systematically foster a deeper understanding of biological systems.

They offer us a snapshot into the future: the routine production of large, high-quality, diverse and multi-dimensional data, with full context thanks to comprehensive metadata. This data forms the basis for AI, and a transformational leap in our capacity to work with and understand biological systems.