Are Modular Approaches The Way Forward for Bioinformatics?
Blog Jul 02, 2018 | by Ruairi J Mackenzie, Science Writer for Technology Networks
In the first of a two-part blog series, we discuss the advantages of a modular approach to bioinformatics with Dr. Misha Kapushesky, CEO and Founder of Genestack. Misha was previously a Team Leader at the European Bioinformatics Institute, and we also learn about his move from Institute to industry.
Ruairi Mackenzie (RM): When you were a team leader at the European Bioinformatics Institute was there a challenge you encountered that inspired you to set up Genestack?
Misha Kapushesky (MK): Yes. When I came to the Institute in 2001, it was only a few years old and my first role was as a programmer. One of my first tasks was to take a set of modules, which are what we call data managers, and make them available to, in the wider sense, the community, so that they could run simple analyses themselves. My second task, done in parallel, was to organize public data so that people could get access to the data and do the analyses. And a major challenge turned out to be joining those two things. On the one hand we had tools to analyze data; on the other hand we had the data itself, which was an archive of, essentially, zip files. And to join the two of them, people would download the zip files and then re-upload them to those tools. When I became team leader, one of the first things that we set up was a mechanism that would allow them to interrogate the data that’s sitting in the archives without having to push it through this analytical process; it was a tool called Expression Atlas. This tool was really a major success, and it was one of the first tools where a large public data collection became not just an archive of files to download but really something that you could interrogate, you know, ask “When and where is this gene active?” “Under what conditions?” and so on.
The challenge that arose there was “How do I view my own data in this context?” Pharma companies started asking us whether we could make this tool, this environment, available to analyze locally. We made that possible and we were distributing the tool together with the data and providing some support. Pharma companies paid us a bit to do this at EBI.
And I realized that it would be great if I could make tools like this easy to use, and that the infrastructure for putting data together with the right analytical tools and building nice interactive data mining interfaces was missing. And we were at the best place in the world for doing it.
So those were the sort of challenges that led me to start up this company, because ultimately the job of an SaaS provider is not to develop replicable infrastructures; its job is really to be optimal at serving data up. But what I noticed is that as data volumes in the world increase and the costs of data production drop, every pharma company, every life science research organization, every biotech, every consumer goods company, every medical institution, will have the same challenges.
It took us three years to build this first atlas, and I wanted to have an infrastructure where I could do it in a demo session, you know, in 30 minutes. Just grab these modules, pull them together, and tada, you have your own expression atlas. That was the impetus behind it.
RM: You’ve suggested that a modular approach is the answer to the data woes of many scientists – how can this approach help?
MK: Having modules is critical. It’s one of those things that I think by now pretty much everybody in the industry recognizes, and the reason is that a system that’s composed of relatively independent and replaceable modules gives you flexibility, gives you longevity and gives you control over the information in the system.
If you look at what happens in different R&D organizations, which are at the cutting edge of science and tech in the way that they do things, their data mining processes tend to be dated and complex, and so they have, up until now, had essentially two options.
One option is to build everything in-house. Maybe outsourcing some key elements, but let’s build our own data management infrastructure. However, by the time it’s done, the infrastructure won’t account for whatever new data type or instrument has come out in the meantime, and you have to do it again. And you don’t have to look far to see examples of this; really good succinct presentations were given on this as recently as last month at the Bio-IT World Conference 2018.
So, the second option is: you bring in a service provider who creates an infrastructure for you, and it’s good because you can get going quickly and it’s also very good for the service provider because you’re locked in. You’ve got to pay for it, as moving data around is difficult and expensive.
As a result, what people are really after now is a combination of these. This is really what our option gives you. What we’re saying is that by providing you with a set of modules, you can pick and choose, and you can build for yourself an omics data ecosystem.
You can create a very flexible and optimizable data architecture. So, we take on the most basic, common, underlying layer and that module of ours is compact. These are common user paths which are fundamental for all kinds of data management, not even specific to biological data.
And then we develop individual modules for different paths and these modules work independently. In fact, if you’re looking at genomic data and you see there is a better module than Genestack’s, you can grab it. And this means that you can use your own analysis pipeline; you can use open source packages; you can use commercial analytical providers for pipelines like Spotfire. We’re quite agnostic; we provide several modules, building blocks that you can use to build up a multi-layered system.
We’ve got this ability to integrate with other things, but the key is that it’s really easy for us to introduce modules that capture new emerging datatypes. If there’s one constant about the multi-omics world it’s that it’s always changing. Every two or three years, there’s a new advance in instrumentation. We had microarrays, then we went to next generation sequencing, and now we’re in the single cell analysis world. So, it’s moving, and every two or three years we have a new thing. You have to develop new modules each time.
We have a system that can evolve with the industry and provides a flexible data architecture. It means the organization, the pharma company in this case, is in control of which of the bits we offer it uses and which bits it brings in from the rest of the industry. So that I think is an important development, to offer this third way between having a customized vendor solution and building it all in-house.
Misha Kapushesky was speaking with Ruairi J Mackenzie, Science Writer for Technology Networks.