Scientists at the Baylor College of Medicine Human Genome Sequencing Center (HGSC), USA, will use high-throughput sequencing systems from Applied Biosystems for a significant part of their contribution to the first pilot phases of the 1000 Genomes Project, sponsored by the National Human Genome Research Institute (NHGRI), the Wellcome Trust and the Beijing Genomics Institute.
This project is a worldwide research effort that will involve the sequencing of 1,000 genomes from people from around the world to create the most detailed and medically useful picture to date of human genetic variation. The HGSC will acquire six SOLiD™ Systems in order to complete the work.
The data generated as part of the 1000 Genomes Project are expected to reveal clues about how variant DNA sequences contribute to conditions such as cancer, diabetes and heart disease. The HGSC is using the SOLiD System to expand its contribution to the first phases of the project and to help researchers determine the best methods for sequencing these 1,000 human genomes.
The first phases of the project began earlier this month. When the pilot phase of this project is complete, the HGSC will have used the SOLiD System to generate approximately 200 billion bases of sequence data over a span of four months. The sequence data will consist of significant sequence coverage of 24 human genome samples and much deeper coverage of a single human genome sample. This amount of data is equivalent to the entire contents of GenBank, the largest public repository of DNA sequence data.
In the analysis of human genetic variation, the depth of coverage refers to the number of times each of the approximately 3 billion base pairs of DNA from one genome is read by a genetic analysis system. Deeper coverage of a genome increases the confidence researchers have in characterizing the bases that exist at each position within a genome.
As a result, researchers are better able to recognize the occurrence of variants in the genome. According to the HGSC, one goal of the pilot phase of the 1000 Genomes Project will be to determine the depth of sequence coverage, from different data types, that is needed to fully understand how sequences of DNA in the genome vary between individuals.
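The link between average depth and how much of a genome is read often enough to be trusted can be sketched with a simple model. The example below is an illustration only: it assumes reads land uniformly at random, so that the number of reads covering any one position follows a Poisson distribution. That assumption, the function name and the threshold of four reads are not taken from the HGSC.

```python
import math


def fraction_covered_at_least(mean_depth: float, min_reads: int) -> float:
    """Fraction of genome positions read at least `min_reads` times.

    Assumes per-position depth is Poisson-distributed around `mean_depth`,
    an idealization that ignores mapping and sequence-composition biases.
    """
    # Probability that a position is read fewer than `min_reads` times.
    below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - below


# Compare how much of a genome reaches four or more reads at several depths.
for depth in (2, 4, 10, 20):
    print(depth, round(fraction_covered_at_least(depth, 4), 3))
```

Under this model, raising the average depth sharply increases the share of positions seen several times, which is why deeper coverage increases confidence in the base called at each position.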
One reason the HGSC chose the SOLiD System for this project is its extremely high throughput. The SOLiD System has now demonstrated that it can produce more than 10 gigabases per run, which is more than 3x coverage of a human genome. This throughput establishes it as the highest-throughput genetic analysis system available today. An ultra-high-throughput genetic analysis system will help scientists at the HGSC complete this project in an efficient and cost-effective manner.
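The coverage figure follows from dividing the bases sequenced by the size of the genome. The short calculation below uses the round numbers quoted in this article (about 3 billion base pairs per genome, 10 gigabases per run, about 200 billion bases for the pilot phase). It is a rough sketch that treats every sequenced base as usable, which real data would not satisfy; the constant and function names are illustrative.

```python
GENOME_SIZE_BP = 3_000_000_000        # approximate human genome size
BASES_PER_RUN = 10_000_000_000        # "more than 10 gigabases per run"
PILOT_TOTAL_BASES = 200_000_000_000   # approximate pilot-phase output


def mean_depth(total_bases: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Average number of times each base is read (total bases / genome size)."""
    return total_bases / genome_size


# One run: 10 Gb over a 3 Gb genome is a little over 3x coverage.
print(round(mean_depth(BASES_PER_RUN), 1))

# Whole pilot phase, expressed as genome-equivalents of sequence.
print(round(mean_depth(PILOT_TOTAL_BASES), 1))
```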
“There is clearly a role for very high density data from platforms that generate read lengths in the 25-50 base range,” said Dr Richard Gibbs, director of the HGSC at Baylor College of Medicine. “We believe that the SOLiD System will dominate in this arena. The production and pooling of data from multiple sources and platforms on the same samples in the 1000 Genomes Project will help researchers ultimately determine the genetic analysis platform of choice.”
Although most human genetic information is the same in all people, researchers are generally more interested in studying the small percentage of genetic material that varies among individuals. Researchers characterize genetic variation either as single-base changes, known as single nucleotide polymorphisms (SNPs), or as larger stretches of sequence variation, known as structural variants. To characterize SNPs in the genome for the 1000 Genomes Project, researchers must be able to distinguish real genetic variants from sequencing errors, which requires a highly accurate genetic analysis system. A combination of depth of coverage and a highly accurate genetic analysis system helps researchers identify the genetic differences that exist between individuals.
A more accurate genetic analysis system needs a lower depth of coverage to confidently characterize variants in genome samples. HGSC’s decision to use the SOLiD System for the 1000 Genomes Project was based in part on a comparative study of microbial and mammalian genomes sequenced by the SOLiD System and competing short-read platforms.
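The trade-off between accuracy and depth can be illustrated with a simple binomial model. The sketch below is not the method used in the comparative study; it assumes sequencing errors are independent between reads, counts any erroneous read as false support for a variant, and uses hypothetical error rates and a hypothetical false-call target chosen only for illustration.

```python
from math import comb


def prob_false_support(depth: int, error_rate: float, min_support: int) -> float:
    """Chance that at least `min_support` of `depth` reads are wrong at one site.

    Assumes independent errors at a fixed per-base `error_rate`.
    """
    return sum(
        comb(depth, k) * error_rate ** k * (1 - error_rate) ** (depth - k)
        for k in range(min_support, depth + 1)
    )


def min_support_needed(depth: int, error_rate: float, max_false_rate: float = 1e-6):
    """Smallest number of agreeing reads that keeps false calls below the target."""
    for support in range(1, depth + 1):
        if prob_false_support(depth, error_rate, support) < max_false_rate:
            return support
    return None  # target cannot be met at this depth


# Hypothetical error rates: a more accurate system needs fewer agreeing reads.
for rate in (0.01, 0.001):
    print(rate, min_support_needed(depth=10, error_rate=rate))
```

In this model, the lower error rate reaches the same false-call target with fewer supporting reads, and a variant that needs fewer supporting reads can be called from shallower coverage.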
“Accuracy is vital, not just for the 1000 Genomes Project, but for all other applications, too,” said Donna Muzny, director of operations at the HGSC. “The internal error-checking strategy for the SOLiD System makes it superior for the read lengths that are produced.”
In deciphering the human genome, researchers strive both to identify genetic variants accurately and to locate them in the genome. Mate pair analysis, the ability of a genetic analysis system to analyze pairs of sequences separated by a known distance (the insert size), allows researchers to determine the precise location of structural variants in the genome. Structural variants include gene copy number variations, single base duplications, inversions, translocations, insertions and deletions.
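The idea behind mate pair analysis can be sketched as a comparison between the known insert size and the distance at which the two reads of a pair map to a reference genome. The example below is a simplified illustration, not the SOLiD System's analysis software: the `MatePair` fields, the tolerance and the labels are all hypothetical, and real pipelines require many pairs to agree before reporting a variant.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    left_pos: int                       # mapped position of the first read
    right_pos: int                      # mapped position of the second read
    same_chromosome: bool = True        # both reads map to the same chromosome
    expected_orientation: bool = True   # reads point the way the library predicts


def classify_pair(pair: MatePair, insert_size: int, tolerance: int) -> str:
    """Label one mate pair by comparing its mapped span to the insert size."""
    if not pair.same_chromosome:
        return "possible translocation"
    if not pair.expected_orientation:
        return "possible inversion"
    span = abs(pair.right_pos - pair.left_pos)
    if span > insert_size + tolerance:
        # Reads map farther apart than expected: the sample lacks sequence
        # that the reference has between them.
        return "possible deletion"
    if span < insert_size - tolerance:
        # Reads map closer than expected: the sample carries extra sequence.
        return "possible insertion"
    return "concordant"


print(classify_pair(MatePair(1_000, 4_000), insert_size=3_000, tolerance=300))
print(classify_pair(MatePair(1_000, 9_000), insert_size=3_000, tolerance=300))
```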
Once researchers determine the depth of coverage necessary to confidently characterize SNPs and structural variants in human genome samples, they will be able to make better use of genomic information. For instance, researchers will be able to more effectively use this information to understand how these variations are related to an individual’s susceptibility to disease and response to treatment for disease, which is the promise of personalized medicine.