Individual studies of microbial genomes, and of the biochemical and metabolic pathways that underlie microbial life, expand our understanding of this invisible world. But a stand-alone investigation, without context or data from other studies to help elucidate gene function and piece information together, can be difficult, if not impossible, to interpret. It is vital that scientists can collate existing knowledge in the context of their current work to draw conclusions and fill gaps. Being able not only to access the information revealed by other studies but also to interpret, collate and interrogate it all together is therefore very important. However, it would be incredibly challenging and time-consuming for any one individual, or even a research group, to do this every time they want to investigate something new.
BioCyc, a not-for-profit web portal, contains genome and metabolic pathway information for more than 14,500 microbes. User data has been combined with curated information from over 87,000 publications, providing users with astounding breadth and depth of information for microbial genome, biochemical, metabolic and regulatory pathway analyses. The database also provides users with extensive genome informatics, comparative genomics and omics data analysis tools, and enables the construction of quantitative metabolic models. A resource like this can be invaluable for researchers, saving time and computational work.
We spoke to Dr Peter Karp, Director of the Bioinformatics Research Group at SRI International and leader of the BioCyc project, about the evolution of BioCyc and its role in research.
Karen Steward (KS): Can you tell us how the formation of BioCyc came about and how it has evolved?
Peter Karp (PK): BioCyc started in the early 1990s with EcoCyc, our curated database for Escherichia coli (E. coli) K-12. At that time EcoCyc combined a partial database on the E. coli genome (which had not been fully sequenced) called EcoGene with curated E. coli metabolic pathway information. EcoCyc evolved to encompass the full E. coli genome as well as the E. coli regulatory network. We then decided to do for other bacteria what we had done for E. coli, so over the years we have added EcoCyc-like databases for many other sequenced bacteria. Those databases integrate the genome with many other types of information, including subcellular locations, protein features, and gene essentiality data. BioCyc has also evolved to include a broad collection of software tools, from traditional tools such as BLAST, to metabolic route search, omics data analysis, comparative genomics, and metabolic modeling.
KS: Why do you think that it is important to have a resource like the BioCyc database available to researchers?
PK: Researchers appreciate the broad array of information we have integrated within BioCyc, and the high quality of that information. One of our users called EcoCyc an oasis of high-quality information. Researchers shouldn't have to do a literature review every time they need to learn about the function of a particular gene or metabolic pathway; that synthesis should be ready and waiting for them. They come to BioCyc as an encyclopedic reference source (that, after all, is what the "cyc" stands for) on the genes and metabolism of different organisms.
KS: Can you highlight for us what you feel are some of the most valuable tools offered by BioCyc? How are they impacting research?
PK: One tool predicts the metabolic pathways of each BioCyc organism from its annotated genome -- we apply this tool to every BioCyc organism. As well as providing an encyclopedic reference on the metabolic pathways of each organism (applications include metabolic engineering), having the complete metabolic network for each organism has provided a foundation for us to develop a whole suite of omics-data analysis tools. These tools work for gene expression, proteomics, and metabolomics data. They include visualizing omics data on individual pathways and on a zoomable diagram of the full metabolic network; computing a perturbation score that captures the activity level of each pathway; and visualizing omics data on a tool called the Omics Dashboard. The Dashboard provides a high-level visual summary of the activity of every cellular subsystem and enables users to drill down into detailed views of the activity of individual subsystems. This tool has been extremely popular with our users and is activated thousands of times per month on our site. You can check out a demonstration here.
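To make the idea of a per-pathway activity score concrete, here is a minimal sketch in Python. It is not BioCyc's actual perturbation score, whose formula is not given in this interview. It assumes a simple stand-in: the mean absolute log2 fold change across the genes assigned to each pathway. The function name, the gene-to-pathway assignments and the expression values are all hypothetical and chosen only for illustration.

```python
from math import log2


def pathway_scores(pathways, control, treated):
    """Return {pathway: mean |log2(treated / control)|} per pathway.

    Assumption: this averaging rule is an illustrative stand-in, not the
    score BioCyc computes. Expression values must be positive.
    """
    scores = {}
    for name, genes in pathways.items():
        # Only use genes measured in both conditions.
        changes = [
            abs(log2(treated[g] / control[g]))
            for g in genes
            if g in control and g in treated
        ]
        if changes:
            scores[name] = sum(changes) / len(changes)
    return scores


# Hypothetical pathway membership and expression values (arbitrary units).
pathways = {
    "glycolysis": ["pgi", "pfkA", "fbaA"],
    "TCA cycle": ["gltA", "acnB"],
}
control = {"pgi": 100.0, "pfkA": 50.0, "fbaA": 80.0, "gltA": 40.0, "acnB": 60.0}
treated = {"pgi": 400.0, "pfkA": 50.0, "fbaA": 20.0, "gltA": 40.0, "acnB": 60.0}

scores = pathway_scores(pathways, control, treated)

# List pathways from most to least perturbed.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Taking the absolute value means that strong up-regulation and strong down-regulation both count as perturbation, so a pathway whose genes move in opposite directions is still flagged. A signed average would instead cancel those changes out.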
KS: How important is manual curation in the running and maintenance of the BioCyc database collection? Can you foresee a time when manual curation will no longer be necessary?
PK: I think manual curation is absolutely essential to providing the high-quality data that scientists value. Although I understand that many people are concerned about the costs of curation, unfortunately many do not appreciate that machine understanding of written text has been an unsolved problem in artificial intelligence for 60 years, and is likely to remain so for some time to come, despite all the recent hype about AI. With all the care that scientists take in seeking just the right wording in their publications, it saddens me to think that we may rely on machines to read these publications using algorithms that will distort or corrupt the meaning. Furthermore, I think it's easy to overlook the large number of errors and inconsistencies (such as in terminology) in the primary literature. I'd like people to appreciate curation as a necessary part of the scientific process that converts fragments of knowledge scattered across thousands of publications into a unified and accurate database.
Imagine a scientist who sees 50 genes with significantly altered expression in an RNA-seq experiment. Imagine further that 20 of those genes have no known function. Those 20 biological clues will therefore lead nowhere. But new gene functions are experimentally elucidated all the time: perhaps 10 of those genes have functions reported in the experimental literature. For example, the first Mycobacterium tuberculosis genome was published in 1998, so there are now 20 years of newly published gene functions for this organism. BioCyc curators integrate these new gene functions (and pathways) into our databases so that experimentalists can go to a single, up-to-date information source that is also integrated with our extensive bioinformatics tools.
KS: Are there further developments that BioCyc are hoping to make that will enhance the collection and functionality in the future?
PK: One of our main directions for the past five years has been creating quantitative metabolic models from our pathway databases using the flux-balance analysis methodology. Anyone can create such models using our downloadable Pathway Tools software. One of my passions for the future is to apply these modeling tools to the human microbiome to help us understand the mechanisms behind the interactions of organisms in the microbiome.
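For readers unfamiliar with flux-balance analysis, the standard formulation is a linear program. It chooses reaction fluxes v that keep every internal metabolite at steady state (S v = 0, where S is the stoichiometric matrix), respect lower and upper bounds on each flux, and maximize an objective such as biomass production. The Python sketch below illustrates this on a hypothetical four-reaction network. It does not show how Pathway Tools implements modeling. The network, the bounds and the function names are assumptions made for illustration, and the brute-force vertex search stands in for the linear programming solver a genome-scale model would need.

```python
from fractions import Fraction
from itertools import combinations, product


def solve_square(A, b):
    """Solve A x = b by Gauss-Jordan elimination; return None if A is singular."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def toy_fba(S, bounds, objective):
    """Maximize objective . v subject to S v = 0 and lower <= v <= upper.

    Enumerates candidate vertices: every choice of len(S) "basic" fluxes,
    with each remaining flux pinned to its lower or upper bound.
    Assumes the rows of S are linearly independent and all bounds are finite.
    """
    m, n = len(S), len(S[0])
    best_value, best_flux = None, None
    for basic in combinations(range(n), m):
        nonbasic = [j for j in range(n) if j not in basic]
        for fixed in product(*[bounds[j] for j in nonbasic]):
            # Move the pinned fluxes to the right-hand side of S v = 0.
            rhs = [-sum(S[i][j] * f for j, f in zip(nonbasic, fixed)) for i in range(m)]
            A = [[S[i][j] for j in basic] for i in range(m)]
            x = solve_square(A, rhs)
            if x is None:
                continue
            v = [Fraction(0)] * n
            for j, f in zip(nonbasic, fixed):
                v[j] = Fraction(f)
            for j, val in zip(basic, x):
                v[j] = val
            # Discard candidates that violate any flux bound.
            if any(not (bounds[j][0] <= v[j] <= bounds[j][1]) for j in range(n)):
                continue
            value = sum(c * f for c, f in zip(objective, v))
            if best_value is None or value > best_value:
                best_value, best_flux = value, v
    return best_value, best_flux


# Hypothetical network: uptake -> A, A -> B (convert), A -> out (overflow), B -> biomass.
reactions = ["uptake", "convert", "overflow", "biomass"]
S = [
    [1, -1, -1, 0],  # metabolite A
    [0, 1, 0, -1],   # metabolite B
]
bounds = [(0, 10), (0, 6), (0, 10), (0, 10)]  # conversion is capped at 6
objective = [0, 0, 0, 1]                      # maximize the biomass flux

best_value, best_flux = toy_fba(S, bounds, objective)
print("max biomass flux:", best_value)
print(dict(zip(reactions, best_flux)))
```

In this toy network the biomass flux cannot exceed the capped conversion step, so the maximum is 6 however much substrate is taken up. The optimal flux distribution is not unique here, because any surplus uptake can leave through the overflow reaction. That non-uniqueness is also common in real models.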
Dr Peter Karp was speaking to Dr Karen Steward, Science Writer for Technology Networks.