qPortal Offers a Data-Driven Approach to Biomedical Research
Article Feb 14, 2018 | by Ruairi J Mackenzie, Science Writer for Technology Networks
We recently caught up with the University of Tübingen's Christopher Mohr to discuss the data management platform qPortal, which was unveiled in PLOS ONE last month.
Ruairi Mackenzie (RM): Could you introduce qPortal’s function, and your motivations behind designing it?
Christopher Mohr (CM): qPortal is a web-based platform with an integrated workflow system. It provides portlets, which are sub-pages in our platform, for convenient creation of large-scale experimental designs and data management. Users can access the platform through their normal web browser, design projects and experiments, access, manage and view the corresponding data and meta information, and execute bioinformatics analysis pipelines via easy-to-use sub-pages. The main motivation for the design of qPortal was to have a scalable system for the entire life cycle in modern data-driven biomedical research, which begins with the experimental design and ends with the analysis of the measured data.
With qPortal we provide the technical means to cope with the ever-growing complexity and volume of biomedical data.
Additionally, we created a platform that empowers users to do all-digital project management. This is in line with the concept of the Quantitative Biology Center (QBiC) in Tübingen, founded in 2012 as a central platform to coordinate data-driven research from the design of studies to the management of big (biomedical) data. We anticipate that web platforms, such as qPortal, will be indispensable cornerstones for state of the art research infrastructures.
RM: Why is a data-driven management platform important in modern biomedical research?
CM: Modern biomedical research aims at drawing biological conclusions from large, highly complex biological datasets. Therefore, it has become common practice to make extensive use of high-throughput technologies that produce great amounts of heterogeneous data. In addition to the ever-improving accuracy, methods are getting faster and cheaper, resulting in a steadily increasing need for scalable data management and easily accessible means of analysis. With the continuously growing number and throughput of omics technologies, the need for full automation in data management and analysis is obvious. Additionally, modern research projects are commonly coordinated within larger consortia, implying distributed data generation and many stakeholders. The web-based nature of our approach facilitates the implementation of qPortal as a central platform in such projects. Easy data sharing and remote communication in the context of scientific projects are receiving more and more attention and portals, such as qPortal, can provide relevant solutions.
Additionally, the workflow-based analysis module provides end users with intuitive interfaces to compute resources, allowing for easy execution of bioinformatics pipelines. This hides the complexity of distributed computing infrastructures from the end user and thereby enables data analysis for scientists without prior scripting or command line experience.
Furthermore, our data-driven management platform addresses the topic of reproducibility of computational analyses. In fact, numerous studies in recent years strongly support this focus. Annotation of analysis pipelines and keeping track of the used parameter settings is crucial in order to solve this issue. The comprehensive data-driven approach of qPortal takes the notion of reproducibility one step further. Our approach puts strong emphasis on properly annotated data (data-driven). While qPortal implements thorough logging of processing, parameters, and pipelines, qPortal also facilitates data annotation, which is equally important for fully reproducible research and adherence to FAIR principles. Starting with extensive metadata collection before the experiment has several advantages. Users can easily trace back mistakes in the study design or sample handling with higher confidence, well-annotated experimental data can be reused in future studies, and estimation of statistical power before experiments are performed saves both time and money.
Currently, we are working on further improvements to the analysis pipeline through the extension of our workflow system. We are adding interfaces to other workflow systems like Snakemake and Nextflow and container solutions like Docker and Singularity.
RM: What features does qPortal have that make it distinct from other portals such as Galaxy?
CM: The focus of qPortal and the underlying system is on specifying the experimental steps and annotation early on and leveraging this information throughout the whole project life cycle to facilitate analysis. One main aspect of qPortal is project management. This includes the functionality to register and store projects with customized experimental designs and to maintain them. This property is essential to define the data-driven approach. The annotation of data is the primary focus throughout the entire life cycle of experiments and projects. These features are provided by qPortal’s web applications Project Wizard and Project Browser. Experiment and sample instances representing the experimental design are registered and used to attach incoming data. In contrast, the clearly workflow-driven approach of Galaxy is not built upon this concept but uses implicit project management with a focus on file-based analysis and visualization.
Data annotation is possible in both systems; however, the emphasis is different. While qPortal collects metadata for every step of the experiment, there is a clear focus on the annotation of workflow runs in Galaxy. The latter property is essential for reproducibility as well as benchmarking of different workflows and parameters. The qPortal approach of starting extensive metadata collection before the experiment is performed has numerous advantages. Firstly, time and money can be saved, because the study design allows for estimation of statistical power before experiments are performed. Secondly, mistakes in study design or sample handling can be traced back more easily and with higher confidence.
Data import is essential when working with data-heavy biomedical applications. Data import to qPortal is possible, for example, using the openBIS Datamover software. Automatic registration, including ID mapping and file format recognition, is done by ETL scripts and is built upon the experimental design and connected barcode creation. These scripts can be file-type-specific or lab-specific, in the latter case often containing additional metadata annotation to be registered. For small, unstructured data, a project-specific upload through the browser is available. Galaxy offers functionality for direct data upload through the web browser with varying rules and governance as defined by the Galaxy instance provider. Upload of larger files is supported via encrypted FTP. Additionally, data transfer of input data directly from provided URLs is possible. The Galaxy upload provides auto detection of many commonly used file types. More extensive operations on the input data, comparable to our ETL scripts, could be implemented by the user in the form of workflow nodes.
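As a rough illustration of the barcode-based registration described above, the following Python sketch maps an incoming file to a previously registered sample and guesses its file type. It is not qPortal's or openBIS's actual ETL code: the barcode pattern, the `FILE_TYPES` table, and the `register_file` function are all assumptions made for this example.

```python
import re
from pathlib import PurePath
from typing import Dict

# Hypothetical barcode layout (5 letters, 3 digits, 2 alphanumerics).
# The real barcode scheme is not described in the interview.
BARCODE_PATTERN = re.compile(r"[A-Z]{5}\d{3}[A-Z0-9]{2}")

# Hypothetical mapping from file extension to a data type label.
FILE_TYPES = {".fastq": "FASTQ", ".mzml": "MZML", ".vcf": "VCF"}


def register_file(filename: str, registered_samples: Dict[str, str]) -> Dict[str, str]:
    """Attach an incoming file to a sample from the experimental design.

    `registered_samples` maps barcodes created during project design
    to a sample description.
    """
    match = BARCODE_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no barcode found in {filename!r}")
    barcode = match.group()
    if barcode not in registered_samples:
        raise LookupError(f"barcode {barcode} is not part of any registered design")
    suffix = PurePath(filename).suffix.lower()
    return {
        "barcode": barcode,
        "sample": registered_samples[barcode],
        "file_type": FILE_TYPES.get(suffix, "UNKNOWN"),
    }


# Example: one sample registered up front, one file arriving later.
samples = {"QTEST001AB": "liver biopsy, patient 1"}
record = register_file("QTEST001AB_run1.fastq", samples)
```

The point of the sketch is the order of operations: the experimental design exists first, so incoming data can be rejected or attached automatically based on the barcode it carries.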
Another crucial difference is that the qPortal workflow system can make use of the registered experimental design information and other annotations, for example to color graphical analysis results according to different study variables.
RM: You suggest qPortal can be a platform for biomedical research as a whole – what flexibility does the platform have to adapt to different niches of research in this field?
CM: Niches in biomedical research might differ by the experiments conducted and the types of data generated. With respect to these differences, qPortal is highly flexible. The underlying data model can be easily adapted and extended through openBIS (developed at ETH Zurich) to account for new types of experiments, measured data types, or metadata which should be tracked through the project life cycle. In some cases, users may require additional means of data analysis for the new data types. In this case, new workflows have to be developed if they do not already exist. However, the integration of these new workflows in qPortal is also easily possible.
RM: What level of expertise will users of qPortal need to have in programming and scripting?
CM: Users do not need to have any expertise in programming and scripting for their qPortal experience. All that is needed to access a running qPortal instance is a working internet connection and a web browser.
RM: How does qPortal handle data security, which is paramount in modern biomedical research?
CM: Data security is of fundamental importance, especially with respect to clinical data. In general, biomedical data is bound to strictly regulated terms regarding data security, access and confidentiality. Therefore, we implemented a two-step security process. Firstly, we require users to register in order to be able to log in to qPortal. Secondly, qPortal utilizes the rule-based permission scheme of the open Biology Information System (openBIS). After logging in to qPortal, users will only be able to access their own projects or the ones of collaboration partners, including the corresponding data. The permissions are controlled on a so-called “space” level. “Spaces” might include several projects and users can be added to “spaces” with different roles. If a user holds at least the minimum required role in a “space”, the user will be able to see the data.
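The space-level permission check described above can be pictured with a minimal Python sketch. This is an illustration only and does not use the openBIS API: the `Role` names, the `Space` class, and the `visible_projects` helper are hypothetical and were chosen for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Set


class Role(IntEnum):
    # Hypothetical roles, ordered from least to most privileged.
    OBSERVER = 1
    USER = 2
    ADMIN = 3


@dataclass
class Space:
    """A space groups several projects and assigns each member a role."""
    name: str
    projects: Set[str] = field(default_factory=set)
    members: Dict[str, Role] = field(default_factory=dict)


def can_view(space: Space, user: str, minimum: Role = Role.OBSERVER) -> bool:
    """True if the user holds at least the minimum required role in the space."""
    role = space.members.get(user)
    return role is not None and role >= minimum


def visible_projects(spaces: List[Space], user: str,
                     minimum: Role = Role.OBSERVER) -> List[str]:
    """Collect the projects of every space the user is allowed to see."""
    return sorted(
        project
        for space in spaces
        if can_view(space, user, minimum)
        for project in space.projects
    )


# Example: two spaces with different memberships.
spaces = [
    Space("ONCOLOGY", {"P1", "P2"}, {"alice": Role.ADMIN, "bob": Role.OBSERVER}),
    Space("IMMUNOLOGY", {"P3"}, {"alice": Role.OBSERVER}),
]
```

In this sketch, a user who is not a member of a space never sees its projects, which mirrors the statement that users only access their own projects or those of collaboration partners.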