Many recent advances in research have aimed to maximize the amount of data we can produce. With the cost of data handling and storage plummeting, why wouldn’t you? But as anyone who has spent hours pipetting with an uncalibrated pipette or watched all 29 series of The Simpsons will tell you, quality is more important than quantity. This realization has hit home in many companies that are now buried under piles of poorly stored data that occupy many terabytes of storage and can’t be synchronized with other data silos. In analytical chemistry, that data is more complex and more valuable than everyday spreadsheets, and tools that match that complexity will be needed to get it back into shape.
“If you don’t have correct data then it’s pretty much unusable by anybody downstream, including yourself, for anything that you originally intended it for,” says Andrew Anderson, Vice President of Innovation and Informatics Strategy at Toronto-based analytical software supplier ACD/Labs. Anderson suggests that this need for correctness is now being recognized at the beginning of the data life cycle – and the end: “Organizations like the Food and Drug Administration require pharmaceutical companies and drug manufacturers to have safe, efficacious and quality drugs and the data that they supply for characterizing those drugs has to meet guidelines according to data integrity. There’s both the pragmatic impetus right from the get-go and at the end. What is the expectation if you’re going to bring a product to market that is supposed to benefit people? If it’s not what it’s supposed to be, there could be really serious consequences.”
Anderson’s view is that data integrity is important at all stages of the research pipeline, from design to drug. This perspective has become vital as advances in technology enable data to be recorded from more sources in larger volumes: “One of the trends in industrial innovation is utilizing what we would call the secondary or tertiary value that you would get from data. Historically, if you look at how analytical data is leveraged within industry, it’s question and answer, input and output. What people have recognized is that by having data you can infer trends, you can apply and use data for training sets, or things like predictive analytics, machine learning and the like. If I’m using analytical data to release a substance to be used in a pharmacy setting or in a commercial setting, that released data is used to give a green light to say, yes, you can release the batch for its intended use. If you store that data right on every batch that’s ever been released, you can look at trends, and infer operational optimization decision making – do I see any trends in how quality at one site differs from another, for example?”
With these potential benefits available, it’s surprising that analytical chemistry has been slower than other fields to embrace big data techniques, with available datasets and algorithms often not up to the task of analyzing complex chemical data. Andrew’s colleague, and ACD/Labs’ Director of Strategic Partnerships, Graham McGibbon, says that the complexity and volume of data are the biggest obstacles to simply adopting automation techniques: “You have optical spectra across ranges of wavelengths, you have experiments performed not just for the certain sampling frequency but across all frequencies. It takes time to run them—a chromatography run could take half an hour. If you’re acquiring data for that entire half hour and you have a mass spectrometer attached, there could be thousands or millions of data points. Furthermore, you have multiple dimensions of information where you can probe how atoms are attached to each other. People want to know which peaks represent which atoms or features, and that complexity is really a key thing about chemistry data. I think it’s much more complicated or complex than for some other data that people would choose to store in other fields.”
Andrew notes that labs or companies conducting large-scale chemical analyses could end up with a mind-boggling amount of data: “If we want to do big data analysis, we’re generating a terabyte of data a day, and you’re going to get to a petabyte fairly fast over time. Being able to do the types of analyses we’d like to do is hard if you’re not reducing the data volume somehow.”
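As a rough illustration of the volumes in that quote, the sketch below estimates how long a constant rate of one terabyte a day takes to reach a petabyte. It assumes decimal units (1 PB = 1,000 TB) and a steady rate; the function name is hypothetical, and only the 1 TB/day figure comes from the quote.

```python
# Back-of-the-envelope estimate. Only the 1 TB/day rate comes from the quote;
# the function name and the decimal-unit convention are assumptions.
TB_PER_PB = 1000  # decimal (SI) units: 1 PB = 1000 TB


def days_to_reach(target_tb: float, rate_tb_per_day: float) -> float:
    """Days needed to accumulate target_tb at a constant daily rate."""
    if rate_tb_per_day <= 0:
        raise ValueError("rate must be positive")
    return target_tb / rate_tb_per_day


days = days_to_reach(TB_PER_PB, rate_tb_per_day=1.0)
years = days / 365
print(f"{days:.0f} days, about {years:.1f} years")
```

At that rate the petabyte mark arrives in under three years, and any increase in daily output brings it sooner.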
Such a deluge of data certainly sounds like a good reason to avoid altering a company-wide data system, but Andrew firmly believes that even if adopting big data techniques isn’t an easy road to walk, the alternative is far worse: “I’m personally familiar with a situation in food and beverage companies dealing with pesticides that had to respond to a pesticide becoming regulated. They spent 18 months doing the hazard assessment on their commercial products and raw material supply chains. If you have the big data system it’s a query, a simple query as opposed to what they had to do because the big data systems aren’t in play. They had to gather samples, re-analyze and go from there. If you’re going to consider adopting big data techniques in analytical chemistry, consider the value proposition – that’s how they justified a data center investment. If you build this and you house it, and you architect it the right way, it will pay off, you can avoid those 18 months’ worth of cost to solve a problem.”
Whilst the need to advance big data techniques might seem clear, the way in which companies choose to adopt those techniques is less so. Who exactly has to promote more data-centric strategies within a company? “It’s not like any individual department gets saddled with an innovation strategy like this, there has to be all stakeholders at the table,” says Andrew. “You must have a concerted plan to migrate from the current strategic set of capabilities to a new set of capabilities. So, I wouldn't put any single department under the gun, so to speak, to have a responsibility to build something like this, it has to be an inter-department function.”
Modern informatics solutions clearly have the potential to improve how entire industries treat their data and bring outdated practices to an end. What is also clear is that implementing these solutions requires an intensive but worthwhile effort. Andrew sums up the task ahead for companies that want to improve how they handle and analyze data: “If somebody can mine data, then I think that’s great but it’s an uncertain additional value compared to what they were doing in the first place. I think that’s really important to recognize – the nature of where data is assembled and what trade-offs there are in terms of getting complete and accurate data and making it useful.”