NGS Workflows: A Simpler Life Through Software
Having too much data sounds like a nice problem to have. But Next Generation Sequencing (NGS) data’s ubiquity poses challenges as well as opportunities for geneticists. Quick, accurate and reliable workflows are top of the average geneticist’s wishlist, and Oxford Gene Technology (OGT) think that their SureSeq™ Interpret software is the answer. We caught up with OGT’s Dave Cook to discuss the challenges of NGS data, and how software can make things simpler.
Ruairi Mackenzie (RM): What makes NGS data cumbersome?
Dave Cook (DC): Predominantly the volume of data generated and this increases with every new or updated sequencer. All of these data need to be organized, assembled and analyzed. To put this in context, a common read length would be 150 base pairs. When this is compared to the genome of 3,000,000,000 bases it represents 1/20,000,000 of the human sequence. Consequently, mapping and assembling the individual reads is computationally laborious. Additionally, this translates into large amounts of data, volumes ranging from gigabytes to terabytes depending on the sequencing hardware. These data need to be stored and accessible, which for many can be a serious problem and is only going to become more of an issue.
RM: How can NGS software make workflows simpler for geneticists?
DC: A geneticist wants to get from the data to a result as quickly and easily as possible, and a workflow can enable this. In SureSeq Interpret this starts with the raw data file upload, where samples are automatically paired and loaded into a database so that all analyses can be tracked. Once the samples are available, a user can select a workflow to follow for an analysis. A normal workflow will include:
- Alignment of the raw data to the reference genome
- QC analysis of the samples at both an individual and a batch level
- Variant detection for SNVs, CNVs, translocations and other structural variations
- Annotation of variants with supporting information
- Saving results in a database
- Presentation of the results in reports
Taken as a whole, a workflow allows a user to load raw data files and then wait for results, without any need to interact with the analysis pipeline until the workflow completes. This gives users more time to focus on their results and reduces the need for bioinformatics resource. Additionally, an automated workflow helps to ease the burden of working with NGS data discussed above.
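The idea of a workflow that runs unattended from upload to report can be sketched as an ordered list of steps applied one after another. This is a minimal illustration only: the step names mirror the list above, but every function, name and data structure here is hypothetical and does not represent SureSeq Interpret's actual implementation. Each step is a placeholder that simply records that it ran.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Tuple[str, Callable[[State], State]]


def make_placeholder(name: str) -> Callable[[State], State]:
    """Return a stand-in step that only records that it ran (hypothetical)."""
    def step(state: State) -> State:
        completed = list(state["completed"])
        completed.append(name)
        return {**state, "completed": completed}
    return step


# Step names follow the workflow described in the interview.
STEP_NAMES = [
    "alignment",
    "qc",
    "variant_detection",
    "annotation",
    "save_results",
    "report",
]
WORKFLOW: List[Step] = [(name, make_placeholder(name)) for name in STEP_NAMES]


def run_workflow(sample_id: str, steps: List[Step]) -> State:
    """Run every step in order with no user interaction in between."""
    state: State = {"sample": sample_id, "completed": []}
    for _name, step in steps:
        state = step(state)
    return state


result = run_workflow("sample_001", WORKFLOW)
print(result["completed"])
```

Because the steps are held in a plain list, the user only supplies the input and collects the final state, which is the hands-off behaviour the answer describes.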
RM: NGS programs are constantly evolving alongside NGS workflows; how flexible is SureSeq?
DC: Very. SureSeq comprises two parts: the analysis pipeline and the user interface. Both are modular, meaning that new components can be slotted into either without disrupting the overall package. The analysis pipeline is packaged within a container that holds everything needed to process the NGS data files. Using a container ensures the pipeline will work consistently irrespective of the hardware infrastructure. Furthermore, any update to the pipeline generates a new container that can simply replace the existing one. Likewise, the user interface has been developed with a plug-in framework, which provides a means to implement individual customizations as users demand. For example, if a particular report format is required, a template plug-in can be created and supplied to the user. The SureSeq Interpret user interface provides a means to load such plug-ins, and once loaded the additional functionality is available.
RM: Is SureSeq vendor-agnostic, or specifically designed for OGT Gene Panels?
DC: SureSeq is currently configured to run data from OGT Gene Panels. FASTQ files generated with non-OGT panels can be analyzed with SureSeq Interpret, but it is not possible to upload the BED file for a non-OGT panel.
RM: How can we move towards standardized, high-quality NGS data, which will ultimately be required to realize the clinical potential of NGS?
DC: I don’t think that there is a simple answer. However, there are two parts to this: first the laboratory processes, and then the computational analysis. Converting a sample to a sequence is a highly technical, multi-step process, and the incorporation of errors is an inherent danger. Any such errors are carried through to the analysis stage, making it harder to detect true variants. High-quality NGS relies on ensuring that both parts are as accurate as possible, to minimize the incorporation of errors and maximize variant detection.
Dave Cook was speaking to Ruairi J Mackenzie, Science Writer for Technology Networks.