Using ‘R’ for Statistics to Answer Biological Questions
News Feb 11, 2015
The Genome Analysis Centre (TGAC) has hosted a four-day workshop, “Statistics for Ecology, Genetics and Genomics Using R”, that advanced the participants’ understanding of the role of statistical modelling in the analysis of biological data.
The advent of high-throughput technologies has brought a huge surge in the quantity and complexity of data available for analysis in biological fields such as ecology, genetics and genomics. This data has the potential to unlock a better understanding of the world; however, insights and progress often require intricate data analysis and interpretation.
R is a programming language used in statistical computing for data analysis and the development of statistical software. The workshop, which focused on giving trainees practical skills in R, explored an array of statistical modelling techniques, methods and concepts. The course increased the attendees’ knowledge in this area and demonstrated the benefits of applying such methods in their research.
For example, mixed and generalized linear models were used to analyze phenotypic variation, that is, differences in characteristics such as eye colour. As the course progressed, the group studied a variety of more complex approaches, including polygenic models and genome-wide association studies, as well as mixture models and hidden Markov models for investigating underlying structure.
Dr Vicky Schneider, Head of the 361° Division at TGAC and co-organizer of the workshop, said: “More than ever, equipping biologists and ecologists with the ability to handle and analyze data through powerful open-source statistical package in R is fundamental to their ability to face the challenges associated with high-throughput data. I am thrilled to be joined by Dr Tom Van Dooren with whom we first organized an R course back in Leiden more than 12 years ago. Back then, we could not foresee how popular and widely adopted by the community R would become.”
Dr Tom Van Dooren, Senior Research Fellow, Institute of Ecology and Environmental Sciences Paris, co-organizer and main tutor, added: “What I’m trying to get across to the participants is very simple: to not just accept what the software package is doing for you but try to explore some alternatives and, if that’s possible, to go beyond the standard pipeline. Very often you have to use a tool for your bioinformatics data that is embedded in a pipeline, but there are alternative methods where you can modify the pipeline and do something new.”
“Whenever you have to deal with a dataset, whatever the dataset, you will have some kind of statistics involved. You need to ensure that what you are doing from a statistical point of view is relevant and, as you start with a biological question but end with a statistical response, you have to make sure that there is a connection between the two,” said course instructor Tristan Mary-Huard, Researcher at INRA and AgroParisTech. “For this you have to dig a little bit into the model. The good news is that most biologists attending the training are already used to doing this - we are just helping them to push themselves forward and advance their usual analysis.”
Instructor Marie Laure Martin-Magniette, Researcher at INRA and AgroParisTech, added: “The key point of this course is to explain the important statistical models for biological data. All software gives an answer but some answers are wrong because you haven’t put sufficient input into the software - we’re trying to explain the models and the kind of interpretations you can do to resolve this.”
TGAC is strategically funded by BBSRC and operates a National Capability to promote the application of genomics and bioinformatics to advance bioscience research and innovation.