Open access and data provenance are two key themes that pervade modern scientific research, not least in neuroscience, where the grand challenge of one day fully understanding the brain is likely to be accomplished only through mass collaboration. Not just collaboration between labs in a single organisation, or even between several separate organisations, but something on a bigger scale, perhaps even one where hundreds of labs worldwide could combine and compare their data to build up a comprehensive picture of the brain. This is the vision of the team behind Blue Brain Nexus, part of EPFL’s Blue Brain Project led by Professor Henry Markram. They’ve built, and are still building, a data integration platform called ‘Blue Brain Nexus’ that they believe will enable this kind of collaboration: an open-access system for recording, curating, storing, describing and tracking experimental and computational neuroscience data in a format that makes the data easier to compare between researchers.
Following on from an earlier conversation about the role of neuroinformatics at Blue Brain, we spoke to Samuel Kerrien, Section Manager of Neuroinformatics Software Engineering and Data and Knowledge Engineering, about the Blue Brain Nexus and the vital importance of data provenance.
Jack Rudd (JR): Could you please provide us with an overview of Blue Brain Nexus?
Samuel Kerrien (SK): Blue Brain Nexus is the crucial element that connects everything we do. We have just open-sourced this technology and made it publicly available online, via GitHub. Everybody can now use the data integration platform we are building, and we are really proud of that. At the heart of this platform is a Knowledge Graph that acts as a data repository, providing researchers with somewhere to deposit their experimental and computational data. The system also acts as a metadata catalogue, allowing users to describe their data in as much detail as they like. One of the big features of this component is that it is agnostic to the domain. This means it is not just set up for neuroscience. Of course, it will support neuroscience at Blue Brain and potentially other institutes, but we have also designed it so that users can design their own domain. It can be used for anything from astronomy to pharmacological development; it doesn’t really matter. You are free to design your own entities and your own relationships, and from that point on start recording the data that you care about. We feel that this is something that is not being done yet, that it is really novel, and we hope that we are going to help neuroscience lead the way.
Domains are defined through the structure of the entities that you create in the schema. We use the Shapes Constraint Language (SHACL), an open standard defined by the World Wide Web Consortium (W3C), which allows you to describe the entities you want to store in the Knowledge Graph as well as the relations they have with other entities. For example, if I were concerned with recording electrophysiology data, I could create a domain that has a subject, where I can describe the animal I am concerned with, including its species, strain, age and gender. I could then create a relation to, say, a brain slice, so that the brain slice derives from this subject. I could carry on like this: a neuron comes from a brain slice, and maybe an electrophysiology dataset comes from a neuron. In doing so I have now recorded where my dataset came from: from which specific animal, from which slice, from which neuron. Anyone could then interrogate the system to find all the neurons from a given animal or from a given slice, or all the datasets that originate from mice. You can start asking interesting questions. That’s why data provenance is treated as a first-class citizen at Blue Brain. We hope that our data integration platform will give other people in the community the opportunity to take this approach.
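The derivation chain described here (subject, brain slice, neuron, electrophysiology dataset) can be illustrated with a minimal Python sketch. It is not the Blue Brain Nexus API and it is not SHACL: the `Entity` and `ProvenanceGraph` classes, the identifiers and the attribute values are all hypothetical, invented only to show how recording "derived from" links makes the questions above answerable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """One node in a toy provenance graph (hypothetical; not the Nexus data model)."""
    id: str
    kind: str                              # e.g. "Subject", "BrainSlice", "Neuron"
    derived_from: Optional[str] = None     # id of the entity this one came from
    attrs: Dict[str, str] = field(default_factory=dict)


class ProvenanceGraph:
    def __init__(self) -> None:
        self._entities: Dict[str, Entity] = {}

    def add(self, entity: Entity) -> None:
        # Requiring a new id and a known parent keeps every chain finite and complete.
        if entity.id in self._entities:
            raise ValueError(f"duplicate id: {entity.id}")
        if entity.derived_from is not None and entity.derived_from not in self._entities:
            raise KeyError(f"unknown parent: {entity.derived_from}")
        self._entities[entity.id] = entity

    def lineage(self, entity_id: str) -> List[Entity]:
        """Return the entity followed by its ancestors, nearest first."""
        chain = [self._entities[entity_id]]
        while chain[-1].derived_from is not None:
            chain.append(self._entities[chain[-1].derived_from])
        return chain

    def find(self, kind: str, ancestor_id: Optional[str] = None,
             **root_attrs: str) -> List[str]:
        """Ids of entities of `kind`, optionally filtered by an ancestor id
        and/or by attributes of the root of their lineage (here, the animal)."""
        hits = []
        for entity in self._entities.values():
            if entity.kind != kind:
                continue
            chain = self.lineage(entity.id)
            if ancestor_id is not None and all(a.id != ancestor_id for a in chain[1:]):
                continue
            if any(chain[-1].attrs.get(k) != v for k, v in root_attrs.items()):
                continue
            hits.append(entity.id)
        return hits


# Invented example data: two animals, each with one slice.
graph = ProvenanceGraph()
graph.add(Entity("subject-1", "Subject", attrs={"species": "mouse"}))
graph.add(Entity("slice-1", "BrainSlice", derived_from="subject-1"))
graph.add(Entity("neuron-1", "Neuron", derived_from="slice-1"))
graph.add(Entity("neuron-2", "Neuron", derived_from="slice-1"))
graph.add(Entity("ephys-1", "ElectrophysiologyDataset", derived_from="neuron-1"))
graph.add(Entity("subject-2", "Subject", attrs={"species": "rat"}))
graph.add(Entity("slice-2", "BrainSlice", derived_from="subject-2"))
graph.add(Entity("neuron-3", "Neuron", derived_from="slice-2"))
graph.add(Entity("ephys-2", "ElectrophysiologyDataset", derived_from="neuron-3"))

# All the neurons from a given animal.
print(graph.find("Neuron", ancestor_id="subject-1"))            # ['neuron-1', 'neuron-2']
# All the datasets that originate from mice.
print(graph.find("ElectrophysiologyDataset", species="mouse"))  # ['ephys-1']
```

In a real deployment the entity shapes and their allowed relations would be declared in a schema rather than hard-coded; the sketch only shows why storing each link once is enough to answer lineage questions in both directions.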
We are currently designing a number of these domains to facilitate the integration of specific datasets. For example, the repositioning of specific data in space is something of importance now. We also have a focus on cell electrophysiology recordings and neuron morphology reconstructions which we are using to build our simulations. Building these domains is part of our plan to move all the data we have at our disposal into Blue Brain Nexus as soon as possible.
The integration of existing datasets into the platform is very important so as we are maturing these domains we are starting to push data into Blue Brain Nexus to make it available to our scientists. This data integration platform represents an important milestone for us and we are already hard at work promoting the technology and letting people know that it is available. Trying to facilitate the onboarding of new people is important. We are really proud of what we have achieved.
JR: You’ve mentioned provenance a few times now. Why is this such an important concept?
SK: Provenance is how we track the origin of data, and also how that data is used and derived into other datasets. You may take an electrophysiology dataset and start training a mathematical model to behave like the real neurons. You could then record where this model came from. This in turn allows you to assess the quality of the data by looking at who generated it and which protocol they used to generate it. Then you are able to build trust in the data you use. If enough information has been captured along this provenance trail, it might even allow you to reproduce specific experiments, which as we know in science is not always an easy thing to do. But, I want to stress, if enough information is captured it becomes possible. Another reason for putting provenance first is to allow the attribution of data and algorithms; in collaborations in particular, it is crucial to acknowledge all contributions, something which scientists value highly. Finally, the Knowledge Graph in Blue Brain Nexus is also a semantic search engine, providing the ability to ask complex scientific questions across entities and their relations. These searches can be incredibly far-reaching, as the system is built to deal with very large amounts of data and caters for high usage. Properly recording provenance makes all of these features possible.
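As a rough illustration of a provenance trail, the Python sketch below walks from a derived model back to its source dataset, collects everyone who should be credited, and flags steps where no protocol was recorded. It is a sketch under stated assumptions, not how Blue Brain Nexus stores provenance: the `Record` class, its field names, the intermediate "features" step and every identifier are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """A dataset or model plus how it was made (all field names are illustrative)."""
    id: str
    generated_by: str                    # person or team credited for this step
    protocol: Optional[str] = None       # None means the method was not captured
    derived_from: Optional[str] = None   # id of the input record, if any


def trail(records: Dict[str, Record], record_id: str) -> List[Record]:
    """Walk derived_from links from a record back to its original source.
    Assumes the links contain no cycles."""
    steps = []
    current: Optional[str] = record_id
    while current is not None:
        record = records[current]
        steps.append(record)
        current = record.derived_from
    return steps


def contributors(records: Dict[str, Record], record_id: str) -> List[str]:
    """Everyone to credit, origin first, without duplicates."""
    seen: List[str] = []
    for record in reversed(trail(records, record_id)):
        if record.generated_by not in seen:
            seen.append(record.generated_by)
    return seen


def missing_protocols(records: Dict[str, Record], record_id: str) -> List[str]:
    """Ids of steps with no recorded protocol; an empty list means every step is documented."""
    return [r.id for r in trail(records, record_id) if r.protocol is None]


# Invented example: a model fitted to features extracted from a recording.
records = {
    r.id: r
    for r in [
        Record("ephys-1", generated_by="experimental-lab",
               protocol="recording-protocol-v1"),
        Record("features-1", generated_by="analysis-team",
               protocol=None, derived_from="ephys-1"),
        Record("neuron-model-1", generated_by="modelling-team",
               protocol="fitting-protocol-v1", derived_from="features-1"),
    ]
}

credit = contributors(records, "neuron-model-1")
gaps = missing_protocols(records, "neuron-model-1")
print(credit)  # ['experimental-lab', 'analysis-team', 'modelling-team']
print(gaps)    # ['features-1']
```

The non-empty `gaps` list is the point of the example: attribution still works, but the step with no recorded protocol is exactly where an attempt to reproduce the result would stall.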
JR: You have mentioned that there are lots of different types of data and data sources. What challenges does this present and how are you working to overcome them?
SK: Neuroscience brings together many different fields of science and, as such, the data generated in the field is very heterogeneous. That in itself is a massive challenge. Catering for various data types is definitely not easy. Now, when we talk about volume of data, I would say there are different ways to look at it. One approach is to look at the sheer volume of a single dataset. Atlasing or rebuilding the brain is one good example to study. It gives us some of the largest datasets we have today: for example, imaging a single mouse brain can generate 7 to 8 terabytes for a single image stack. To be clear, I’m talking about taking slices of a brain, imaging them in high resolution and then processing them to create an image stack. All this data then has to be processed later on, which is another challenge in itself. Processing this kind of large dataset is not trivial. Thankfully, technology is moving on, and now there are plenty of high-performance computing frameworks, like Apache Spark, that provide an efficient way to carry out this work. That’s one way to look at large sizes: a single dataset being really huge.
Datasets can also come in very large numbers, with many, many files. Electrophysiological recording is one of these cases. An electrophysiology experiment on a single neuron can easily generate a thousand traces in a short period of time. The Laboratory of Neural Microcircuitry at the École Polytechnique Fédérale de Lausanne (EPFL), headed by Henry Markram, has been generating this kind of dataset for over ten years, resulting in several million datasets that need to be carefully integrated so that they can be analysed together. That’s another way to look at data volume and the related challenges: the total number of datasets that you have to take care of and integrate into a system is just as important.
So far, I have only mentioned data that is produced in the lab. It is also important to remember that at Blue Brain there is the computational data we generate, i.e. the results of processing that data. For instance, our scientists have recently been generating neuron morphologies by retracing neuron pathways on the slices of a brain. To create more variety in the types of neurons we are handling in the simulations, there is a process that involves recreating specific types of neurons in large numbers with subtle variations to create more organic simulations. To carry this out you can be talking about hundreds of thousands of neurons being generated computationally and, again, this is all data that needs to be registered. Across all the simulations being run at Blue Brain we are generating terabytes or potentially petabytes of data, depending on the exact details and the number of neurons you want to add to your simulation.
JR: With all this data to hand, have your team been able to get involved in any publications?
SK: Alongside focusing on the various activities I described earlier, we have published papers on text mining and annotation. Now, with the open-sourcing of Blue Brain Nexus, we are preparing further publications of our work, which is really exciting. This is a huge project that we have been working on for quite a few years now, and we have focused heavily on building the platform since early this year. Through this, we have accumulated a lot of knowledge and experience, so it is important that we transfer this to the community through publications.
JR: What’s next for you and your team now that Blue Brain Nexus has been released?
SK: Alongside publishing, it is important to state that releasing Blue Brain Nexus is far from the end of the game for us. A lot of work remains to be done to strengthen the platform and to deploy it into production at Blue Brain. We have already built a lot of domains that will allow the fine-grained integration of all the data Blue Brain is currently handling into a system that is better organised. Mass integration is crucial in ensuring scientists get direct access to all of the data from across the project. Currently, some data may simply be shared on a memory stick or saved locally, which prevents it from being used by other scientists in other parts of the project. I guess not everyone is aware of how much work is going into integrating all the data; pushing it into Blue Brain Nexus is going to be a massive task. It is a task that we had already started as we were building the data integration platform, but there is so much more that needs to be done. I think this will make up the bulk of our work for the years to come. But, in the end, we should be able to bring everything together and provide unified access to all this data to everyone at Blue Brain.
Samuel Kerrien was speaking to Jack Rudd, Senior Editor for Technology Networks.