Unlocking the Secrets of Complex Genetic Data
News Mar 19, 2012
As a result, it has become increasingly difficult to identify which genes are being expressed, and to what level, especially when working with tens of thousands of data points being generated by hundreds of different patients. Despite this challenge, it is essential for scientists to capture and analyse this type of data effectively, since this information is vital if these researchers are to apply their findings to real-world conditions.
In order to address this issue, a team of scientists at Cincinnati Children's Hospital Medical Center in the United States is currently using next-generation data analysis software on studies that aim to identify the signals and pathways that are unique to tumour cells.
"We primarily work with leukaemia cells, and often use comparative studies to determine how leukaemia cells differ from normal blood cells," says James Mulloy, Ph.D., Associate Professor at Cincinnati Children's Hospital Medical Center. "However, once we've identified the signals that are unique, we need to perform tests to determine whether the tumour cells are dependent on these signals, or addicted to these signals. If the latter is true, we can then begin to search for compounds that might target these addictive signals, in the hope that we can identify new therapies for cancer."
"We've used various data analysis programs in the past, and it's fair to say that we have found the interface and complexity of the programs to be cumbersome to master and somewhat frustrating," Dr Mulloy adds. "Most of these programs took a great deal of time to learn and weren't very intuitive. As a result, we often needed to collaborate with trained bioinformatics specialists in order to analyse our data, which can be a time-consuming endeavour."
New technological advances in this area, however, are making it much easier for scientists to compare the vast quantity of data generated by gene expression studies, to test different hypotheses, and to explore alternative scenarios within seconds.
Making sense of complex data
The overall performance of modern data analysis software has been optimised significantly over the past three years. With key actions and plots now displayed within a fraction of a second, researchers can increasingly perform the research they want and find the results they need much more quickly. Dr Mulloy has recently been using Qlucore Omics Explorer to conduct bioinformatic analyses of microarray data.
"Our goal is to identify important signals involved in leukaemia, so our studies typically are set up to compare normal hematopoietic cells with leukaemia samples," Dr Mulloy explains. "An example of an analysis essentially would be to narrow down a list of genes that are either up or down regulated in the leukaemia cells, as compared to the control cells."
Dr Mulloy begins this process by reducing high-dimensional data to lower dimensions so that it can be plotted in 3D. Principal Component Analysis (PCA) is often used for this purpose: it applies a linear transformation that converts a set of possibly correlated variables into a set of uncorrelated variables, called principal components, ordered by how much of the variance in the data each one captures.
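To make the idea concrete, the following is a minimal sketch of PCA in plain Python. It is not how Qlucore Omics Explorer implements the method, which the article does not describe. To avoid third-party libraries it swaps in power iteration with deflation for the eigen-decomposition, and the function names and toy data are purely illustrative.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of the columns (variables)."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def top_component(cov, iters=500, seed=0):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    for _ in range(iters):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    cov_v = [sum(c * x for c, x in zip(row, v)) for row in cov]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    return v, eigenvalue


def pca(rows, k=3):
    """Project rows (samples x variables) onto the first k principal components."""
    centered = center(rows)
    cov = covariance(centered)
    size = len(cov)
    components = []
    for _ in range(k):
        v, lam = top_component(cov)
        components.append(v)
        # Deflate: remove the variance explained by v before finding the next axis.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(size)]
               for i in range(size)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, components


# Toy data: 6 samples x 3 variables. A real study has thousands of variables.
samples = [[2.5, 2.4, 0.5], [0.5, 0.7, 1.9], [2.2, 2.9, 0.4],
           [1.9, 2.2, 0.8], [3.1, 3.0, 0.1], [2.3, 2.7, 0.6]]
scores, components = pca(samples, k=2)
for row in scores:
    print([round(value, 3) for value in row])
```

Each row of `scores` gives one sample's coordinates on the new axes, which is what a PCA plot displays. For real expression data a numerical library would normally be used instead, since the covariance matrix built here grows with the square of the number of variables.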
One of the more recent breakthroughs in this area, however, has been the introduction of dynamic PCA, an innovative way of combining PCA with immediate user interaction. This approach allows Dr Mulloy to manipulate different PCA plots interactively and in real time, directly on his computer screen. His team is free to explore all possible versions of the presented view, and can therefore visualise, and in turn analyse, even very large datasets easily.
By using a heat map alongside dynamic PCA, the team has yet another method for visualising its data, since a heat map takes the values of a variable arranged in a two-dimensional grid and represents them as different colours. Because every value is encoded as a colour rather than as a position on an axis, a heat map can show an entire data matrix at once, a view that is difficult to achieve with simple charts and graphs.
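The colour encoding behind a heat map can be sketched in a few lines. The example below is illustrative only: the per-gene standardisation, the blue-white-red palette and the clipping limit of two standard deviations are assumptions chosen for the sketch, not settings taken from the software described in the article.

```python
import statistics


def zscores(values):
    """Standardise one gene's expression values to mean 0 and unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:  # a flat gene carries no contrast
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def to_rgb(z, limit=2.0):
    """Map a z-score to a blue-white-red colour, clipping at +/- limit."""
    t = max(-1.0, min(1.0, z / limit))
    if t >= 0:  # white fading to red for above-average expression
        fade = round(255 * (1 - t))
        return (255, fade, fade)
    fade = round(255 * (1 + t))  # white fading to blue for below-average
    return (fade, fade, 255)


def heat_map(matrix):
    """Turn a genes x samples matrix into a grid of RGB colours."""
    return [[to_rgb(z) for z in zscores(row)] for row in matrix]


# Hypothetical expression values: 2 genes x 4 samples.
grid = heat_map([[1.0, 1.2, 4.8, 5.0],
                 [7.0, 6.5, 2.1, 2.0]])
for row in grid:
    print(row)
```

Standardising each gene separately means the colours show where a gene is high or low relative to its own average, so genes with very different absolute levels can share one colour scale.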
A step-by-step approach
Dr Mulloy and his team often begin their analyses by looking at how like-treated samples cluster, since similar samples will typically group together in the PCA plot, and then adjusting the variance filter. The samples are then assigned to groups and the p-value threshold is adjusted.
Once the team has matched normal and leukaemia samples, the elimination factor can be used to identify more promising targets. With this set of genes in hand, the team will often view the heat map to visualise the results.
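The two filtering steps described above can be sketched as follows. The article does not say which statistical test the software applies, so this example swaps in an exact permutation test on the difference in group means. The gene names, expression values and thresholds are all hypothetical.

```python
import statistics
from itertools import combinations


def variance_filter(expr, min_var):
    """Keep genes whose variance across all samples is at least min_var."""
    return {gene: values for gene, values in expr.items()
            if statistics.variance(values) >= min_var}


def permutation_p(values, group_a_idx):
    """Two-sided exact permutation p-value for a difference in group means."""
    n, k = len(values), len(group_a_idx)
    total = sum(values)

    def mean_gap(idx):
        a = sum(values[i] for i in idx)
        return abs(a / k - (total - a) / (n - k))

    observed = mean_gap(group_a_idx)
    splits = list(combinations(range(n), k))
    # Count relabellings at least as extreme as the observed grouping.
    extreme = sum(1 for idx in splits if mean_gap(idx) >= observed - 1e-12)
    return extreme / len(splits)


def select_genes(expr, group_a_idx, min_var, max_p):
    """Apply the variance filter, then keep genes that pass the p-value cut-off."""
    kept = variance_filter(expr, min_var)
    return sorted(gene for gene, values in kept.items()
                  if permutation_p(values, group_a_idx) <= max_p)


# Hypothetical data: samples 0-2 are controls, samples 3-5 are leukaemia.
expression = {
    "GENE_A": [1.0, 1.1, 0.9, 5.0, 5.2, 4.8],     # differs between groups
    "GENE_B": [2.0, 2.01, 1.99, 2.0, 2.01, 1.99],  # almost no variance
    "GENE_C": [1.0, 5.0, 1.2, 4.9, 0.8, 5.1],     # varies, but not by group
}
print(select_genes(expression, (0, 1, 2), min_var=0.5, max_p=0.15))
```

With only three samples per group there are just 20 possible relabellings, so the smallest p-value this test can return is 2/20 = 0.1. Real studies with more samples allow much stricter thresholds.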
At this stage, gene lists are generated, so that Dr Mulloy can run them against resources such as the MSigDB database in order to identify possible gene signatures that exist in his leukaemia samples. If desired, the team is also able to pull out a single gene and examine its expression level across all of the samples.
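One common way to compare a gene list with curated gene sets is an over-representation test based on the hypergeometric distribution. The sketch below illustrates that general technique. It does not reproduce the MSigDB tools, and the gene identifiers, set names and universe size are placeholders.

```python
from math import comb


def enrichment_p(overlap, list_size, set_size, universe):
    """Probability of seeing at least `overlap` shared genes by chance."""
    upper = min(list_size, set_size)
    tail = sum(comb(set_size, i) * comb(universe - set_size, list_size - i)
               for i in range(overlap, upper + 1))
    return tail / comb(universe, list_size)


def rank_gene_sets(gene_list, gene_sets, universe):
    """Score every gene set against the list, most significant first.

    Assumes every gene in the list and in the sets belongs to a common
    universe of `universe` measured genes.
    """
    genes = set(gene_list)
    results = []
    for name, members in gene_sets.items():
        members = set(members)
        shared = genes & members
        p = enrichment_p(len(shared), len(genes), len(members), universe)
        results.append((name, sorted(shared), p))
    return sorted(results, key=lambda item: item[2])


# Placeholder identifiers, not real genes or real signatures.
my_list = ["G1", "G2", "G3"]
signatures = {
    "SET_X": ["G1", "G2", "G3"],
    "SET_Y": ["G7", "G8", "G9"],
}
for name, shared, p in rank_gene_sets(my_list, signatures, universe=10):
    print(name, shared, round(p, 4))
```

When many gene sets are tested at once, the resulting p-values would normally be corrected for multiple testing before any signature is reported.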
"Once the groups are established, we frequently use the scatter function to examine different variables within a data set in greater detail. The ability to colour code within this feature is very useful, as is the ability to import a particular gene list of interest and examine our data set against this list."
According to Dr Mulloy, having access to such powerful software helps to encourage a sense of creativity in his research, as it allows the team to test a number of different hypotheses in rapid succession. For example, because array data is published quite frequently in this area of study, the Qlucore software can be used to download these data sets and study them quickly, in order to find concepts that are relevant to the team's own research.
"The exceptional speed that this kind of software can deliver is very important for us, since the fast analysis of the data highly contributes to the identification of subpopulations in a sample collection or a list of variables," says Dr Mulloy. "For example, it now takes very little time for us to make figures and generate gene lists. Without a doubt, these rapid results - and the way in which the data is visualised - prompted us to perform analyses that we would have never performed otherwise."
Data visualisation helps to streamline analysis
Data visualisation works by projecting high-dimensional data down to lower dimensions, which can then be plotted in 3D on a computer screen, rotated manually or automatically, and examined by eye. With instant feedback on all of these actions, scientists studying microarray data can now analyse their findings in real time, directly on their computer screen, in an easy-to-interpret graphical form.
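Rotating a 3D plot amounts to applying a rotation to every point and then redrawing its screen position. The sketch below shows one simple way to do this, assuming a rotation about the vertical axis and an orthographic projection. Both choices, and the sample coordinates, are illustrative.

```python
import math


def rotate_y(point, degrees):
    """Rotate a 3D point about the vertical (y) axis."""
    x, y, z = point
    angle = math.radians(degrees)
    return (x * math.cos(angle) + z * math.sin(angle),
            y,
            -x * math.sin(angle) + z * math.cos(angle))


def to_screen(point):
    """Orthographic projection: drop the depth and keep (x, y)."""
    return point[0], point[1]


# Hypothetical sample coordinates on the first three principal components.
points = [(1.0, 0.5, 0.0), (-0.8, 0.2, 0.6), (0.1, -0.9, -0.4)]
for step in range(0, 91, 45):
    frame = [to_screen(rotate_y(p, step)) for p in points]
    print(step, [(round(x, 2), round(y, 2)) for x, y in frame])
```

Redrawing the projected points for a sequence of angles produces the rotating view, and groups that overlap from one angle may separate clearly from another.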
When used during gene expression research, the ability to visualise data in 3D represents a very powerful tool for scientists, since the human brain is very good at detecting structures and patterns. In practice, this means that Dr Mulloy and his team are able to make decisions based on information that they can identify and understand easily.
For example, 3D presentation makes it much easier for Dr Mulloy and his team to see the separation of groups based on gene expression. This provides more meaning to the data and allows for easy visualisation, which in turn provides an additional way of thinking about the data. As a result, this approach can often lead to more useful connections when several samples are being analysed simultaneously.
"The ability to visualise data actually makes the software quite fun to use, which means that more of our group is likely to use it over time," according to Dr Mulloy. "The ease of use and speed at which it operates not only allows us to answer some of our key scientific questions more effectively, but it also enables us to identify potential therapeutic targets to examine further."
"Right now, we have only used this kind of software for gene expression array data, but we also have data from methylation arrays and also miRNA arrays, and will be moving on to this type of data in the future, as we expect the software will work just as well for these data sets as it does for the gene expression arrays."