Georgia Tech Creates Self-Training Gene Prediction Program

Researchers at the Georgia Institute of Technology have developed a computer program capable of training itself to predict genes in the genomic DNA sequences of eukaryotic organisms such as animals, plants and fungi.

The software program, GeneMark.hmm-ES, may help researchers save a year or more in a genome sequencing and interpretation project.

The program is an addition to the family of GeneMark gene prediction programs developed at Georgia Tech and is freely available to academic researchers.

Currently, there are 600 ongoing genome sequencing projects for eukaryotes, organisms whose cells carry nuclei.

Decoding the DNA sequences that come out of even a single genome project is an enormous task.

Still, unraveling the genetic code of living creatures allows scientists to understand the details of the cellular machinery. This knowledge helps generate ideas for a variety of future research directions.

Understanding the specific features of individual genomes may lead to the development of personalized medicine, while comparing the genomes from related species can help scientists trace their evolution.

"The genomic sequence is a foundation and blueprint of molecular cellular networks and processes which dynamics need to be reconstructed to understand how the cell works," said Mark Borodovsky, Regents’ professor in the School of Biology and the Department of Biomedical Engineering, and director of the Center for Bioinformatics and Computational Genomics at Georgia Tech.

"These networks are specific for each organism, so once you know the list of the genes, you start to assemble all the parts into a picture."

A self-training version of the gene-finding program for prokaryotic genomes was created by Borodovsky’s group in 2001.

Now Borodovsky and his team at Georgia Tech have taken a leap forward and built a program that can train itself to make accurate gene predictions in the numerous newly sequenced genomes of eukaryotes.

The program uses established general principles of genetic code organization, adjusted to the general compositional features of a particular genome, to help identify at least a few regions of the anonymous genome that contain protein-coding sequences.

Once the initial predictions are available, the coding and non-coding sequences are separated.

This clustering allows scientists to apply machine-learning techniques to refine the parameters of the recognition algorithm to the specific patterns found in the newly identified protein-coding sequences.

The prediction and training steps are then repeated, each time detecting a larger set of true coding sequences that are used to further improve the model employed in statistical pattern recognition.

The last run, when the prediction step no longer changes the result, produces the desired final set of predicted genes.
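
The predict-and-retrain loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not GeneMark.hmm-ES itself: it scores fixed-length windows with codon-frequency log-odds rather than a hidden Markov model, and the function names (`initial_labels`, `train`, `predict`, `self_train`) and the stop-codon starting heuristic are hypothetical choices made only for this example.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
STOPS = {"TAA", "TAG", "TGA"}


def codons(window):
    """Split a DNA window into consecutive in-frame triplets."""
    return [window[i:i + 3] for i in range(0, len(window) - 2, 3)]


def initial_labels(windows):
    """Generic starting guess (an assumption for this sketch):
    a window with no in-frame stop codon is provisionally 'coding'."""
    return [not any(c in STOPS for c in codons(w)) for w in windows]


def train(windows, labels, want):
    """Estimate log codon frequencies from the windows whose label == want,
    with add-one smoothing so that every codon has a nonzero probability."""
    counts = Counter({c: 1 for c in CODONS})
    for window, label in zip(windows, labels):
        if label == want:
            counts.update(codons(window))
    total = sum(counts.values())
    return {c: math.log(counts[c] / total) for c in CODONS}


def predict(windows, coding, noncoding):
    """Label a window 'coding' when its summed log-odds score is positive."""
    return [
        sum(coding[c] - noncoding[c] for c in codons(w)) > 0
        for w in windows
    ]


def self_train(windows, max_iter=20):
    """Alternate training and prediction until the labels stop changing."""
    labels = initial_labels(windows)
    for iteration in range(1, max_iter + 1):
        coding = train(windows, labels, True)
        noncoding = train(windows, labels, False)
        new_labels = predict(windows, coding, noncoding)
        if new_labels == labels:  # nothing new was found: final run
            return labels, iteration
        labels = new_labels
    return labels, max_iter
```

Calling `self_train` on a list of equal-length DNA strings returns the final coding/non-coding labels and the number of iterations used. The stopping rule mirrors the article's description: the run in which prediction changes nothing is the last one.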

Because the self-training method uses established general principles of eukaryotic gene organization to reconstruct species-specific nucleotide sequence patterns, it speeds up the work: scientists don’t have to wait for an outside expert to develop a sequence set large enough to use for training.

That can shave a year or more off a sequencing project. With the self-training method, the program does the work itself.

Details on the program can be found in Nucleic Acids Research, volume 33, number 20, pages 6494-6506.