Data Mining Techniques: From Preprocessing to Prediction
Article Jul 30, 2018 | by Alex Harston
What is Data Mining?
If you work in science, chances are you spend upwards of 50% of your time analyzing data in one form or another. Data analysis is such a large and complex field, however, that it's easy to get lost when deciding which techniques to apply to which data. This is where data mining comes in - put broadly, data mining is the use of statistical techniques to discover patterns or associations in the datasets you have.
Everyone's data is different - it's always highly contextual and can vary on an experiment-to-experiment basis. There's no way we could give specific technical advice as to exactly what you might need for your data - the field's just too broad! What we're going to do here instead is provide high-level tips on the critical steps you'll need to get the most out of your data analysis pipeline.
You’ll likely spend a large percentage of your time formatting and cleaning data for further analysis. This is most often termed 'data wrangling' (or 'data engineering' if you want to sound fancy). Despite being laborious, this is perhaps the most necessary step in any data analysis pipeline.
Making sure your data is good quality is evidently a hard enough job in itself - a 2016 paper showed that roughly 1 in 5 genetics papers with supplementary Excel gene lists contained errors caused by Microsoft Excel auto-converting gene names to dates. It's often all too easy to overlook even the simplest of sanity checks - a friend working with a medical database recently came across an official table proudly stating that a 5-year-old girl was 180cm tall. Even just a cursory glance at your raw data before starting analysis can save you a whole lot of trouble later on.
Data preprocessing generally involves the following steps:
• Smoothing of noisy data - biological recordings can be incredibly noisy, and so filtering your data is often needed (EEG or neural recordings are good examples of noisy data).
• Aggregating your data - your data will likely be collected by different recording devices simultaneously, potentially at different temporal or spatial resolutions, and will therefore need aggregating into the same tables or matrices, potentially with appropriate subsampling.
• Imputing missing values - Taking the time to perform proper error handling for missing values or NaNs (“Not-a-Number”) in your analysis scripts can save you hours of debugging further down your analysis pipeline.
• Removing erroneous data points (6-foot-tall children don't make for particularly reliable datasets, shockingly).
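The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommended pipeline: the function names, the plausible height range (80-140cm) and the sample values are all assumptions made up for the example, and in practice you would more likely reach for pandas or NumPy.

```python
import math
from statistics import median

def remove_implausible(values, low, high):
    """Mark values outside [low, high] as NaN so they are treated as missing."""
    return [v if (not math.isnan(v) and low <= v <= high) else math.nan for v in values]

def impute_median(values):
    """Fill NaNs with the median of the observed (non-NaN) values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in values]

def moving_average(values, window=3):
    """Smooth a noisy series with a centered moving average (window shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def downsample_mean(values, factor):
    """Lower the temporal resolution by averaging consecutive blocks of `factor` samples."""
    usable = len(values) - len(values) % factor  # drop an incomplete final block
    return [sum(values[i:i + factor]) / factor for i in range(0, usable, factor)]

# Hypothetical heights (cm) of 5-year-olds: 180.0 is a recording error, NaN is missing.
heights_cm = [108.0, 111.0, 180.0, math.nan, 109.0, 112.0]
cleaned = impute_median(remove_implausible(heights_cm, low=80.0, high=140.0))
```

Median imputation is only one of several options; whether it is appropriate depends entirely on why your values are missing in the first place.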
Understanding Your Data
Once you've done the required data cleaning chores, taking steps to explore the data you're working with is essential for identifying the most appropriate analyses to perform. One can break this approach down into the broad categories of description and prediction.
One big pitfall in data analysis is simply failing to look at your data. Real-world experiments often yield complex, high-dimensional results, however, and when your tabular dataset has 7 dimensions, simply looking at raw values is not as straightforward as it seems.
Dimensionality reduction techniques are useful here - they allow you to take high-dimensional, complex data and transform them into lower-dimensional spaces (2D or 3D), making them more visually intuitive. Dimensionality reduction techniques like PCA, t-SNE or Autoencoders are common ways to begin exploring your data.
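To make the idea concrete, here is a rough sketch of the core of PCA: center the data, compute the covariance matrix, find the direction of greatest variance, and project onto it. It uses plain Python and power iteration purely for illustration (real projects would normally use a library implementation), and the toy 3D dataset and function names are assumptions invented for this example.

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply a vector by the covariance matrix
    and normalise it; it converges to the direction of greatest variance."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical 3D measurements that really only vary along one direction.
data = [[1.0, 2.0, 2.0], [2.0, 4.0, 4.0], [3.0, 6.0, 6.0], [4.0, 8.0, 8.0]]
centered = center(data)
pc = first_principal_component(covariance_matrix(centered))

# Project each 3D point onto the first component to get a 1D representation.
scores = [sum(a * b for a, b in zip(row, pc)) for row in centered]
```

This only extracts the first component; a full PCA would also return the remaining components and the variance each one explains.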
Understanding how dense or sparse your data are, whether your data are normally distributed, and how your data covary are all questions to address during exploratory analysis in order to build better predictive models.
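A few of these exploratory questions can be answered with very little code. The sketch below computes the fraction of zero entries (a crude sparsity measure), the skewness (values far from 0 hint that a distribution is not symmetric, and so not normal), and the Pearson correlation between two variables. The function names and sample values are illustrative assumptions, and these quick checks are no substitute for plotting your data or running proper statistical tests.

```python
import math

def fraction_zero(values):
    """Crude sparsity measure: share of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)

def skewness(values):
    """Skewness as the third central moment over the second to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements, made up for this example.
spike_counts = [0, 0, 3, 0, 1, 0, 0, 2]
sparsity = fraction_zero(spike_counts)
```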
K-means is the go-to technique for clustering data, with multiple variants of the algorithm for different applications. It's an unsupervised learning technique, commonly used when you do not have predefined classes, and want to understand how, or if, your data is grouped.
K-means is popular because it can be run in just a few simple steps:
• You select “k” groups and the centers of these groups are randomly initialized (it’s normally worth checking your data in a 2D plot first to see if you can identify any obvious clustering by eye).
• For each data point, the distance to the center of each group is calculated, and the point is assigned to the group whose center is closest.
• Once all data points have been grouped, the center of each group is recalculated (by taking the mean vector of all the group’s points).
• Repeat the previous two steps until the group centers don’t change any more, thereby giving you your finalized groups. Because the result depends on the random initialization, it’s important to run K-means a fair few times for consistency.
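The steps above translate almost line-for-line into code. What follows is a bare-bones sketch of the basic algorithm in plain Python, assuming points are given as tuples of numbers and that the initial centers are picked from the data points themselves (one common choice among several). The function name, the `seed` parameter and the sample points are assumptions for illustration.

```python
import math
import random

def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: initialise the k centers at randomly chosen data points.
    centers = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Step 2: assign every point to the group with the nearest center.
        groups = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centers[i]))
            groups[nearest].append(point)
        # Step 3: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(coords) / len(group) for coords in zip(*group)) if group else centers[i]
            for i, group in enumerate(groups)
        ]
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

# Hypothetical 2D points forming two obvious clumps.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(points, k=2)
```

Calling `kmeans` with several different `seed` values and comparing the resulting groups is a simple way to check how stable the clustering is.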
Often, however, you might already have predefined classes, and want to see which of them your experimental data fits into. K-nearest neighbors (KNN) is the most common algorithm used here - it's a supervised learning technique, where given a data point, the algorithm will output a class membership for that point. KNN can also be used for identifying outliers in data.
(Note: “K” in KNN is not the same as “K” in K-means - here “K” refers to the number of neighboring data points you use to classify your new data point, not groups).
In KNN, the distance from each new (unlabelled) data point to every labelled training point is calculated and ranked in ascending order. The ‘k’ smallest distances are taken, and the most frequent class among those neighbors is used to define the class of that data point. This is repeated until all new data points have been labelled.
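As a rough sketch of that procedure, assuming Euclidean distance and a simple majority vote (ties go to the class whose neighbor is nearest), a KNN classifier fits in a handful of lines. The function name and the labelled example points are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(query, training_points, training_labels, k=3):
    """Return the most common label among the k training points nearest to `query`."""
    # Rank all labelled points by their distance to the query, ascending.
    ranked = sorted(
        zip(training_points, training_labels),
        key=lambda pair: math.dist(query, pair[0]),
    )
    # Keep the labels of the k closest points and take a majority vote.
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labelled data: two classes, "a" and "b".
training_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
training_labels = ["a", "a", "a", "b", "b", "b"]
predicted = knn_classify((0.5, 0.5), training_points, training_labels, k=3)
```

Note that this computes the distance to every training point for every query, which is fine for small datasets but becomes slow for large ones.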
The Best Languages for Your Data
Visualization of data is becoming ever more important, as are the tools you need to produce your plots.
• In much of science and engineering, MATLAB still dominates for data visualization, due to the program's legacy, integrated nature, and sizeable community.
• Python usage is skyrocketing, both in academia and industry. Due to the language's simple and clean syntax, open-source nature, compatibility with many other existing languages, machine learning frameworks (TensorFlow/PyTorch/scikit-learn) and scientific computing libraries (NumPy, pandas), Python has firmly established itself in the past few years as the language of choice for data science and analysis. Python's main plotting libraries include matplotlib and seaborn for 2D plots, and bokeh for interactive browser-based visualization. These libraries can have a steep learning curve, but are powerful and offer a lot of flexibility.
• Statistics-focused languages like R and its plotting libraries, such as 'ggplot2', are also becoming more widespread, due to their ease-of-use and good-looking plot designs.
• It’s worth noting that both Python and R have the benefit of allowing code to be written in Jupyter Notebooks, a flexible and extensible format allowing for text and figures to be embedded alongside the code used to produce them, for ultimate reproducibility. (Trust me - this is an absolute godsend when trying to piece together what on earth you actually did to generate that figure six months ago.)
So, you have your data mostly figured out. Now what? Can you use the insights you've gathered from your exploratory analyses to do something useful, and make predictions?
Regression models are among the simpler yet more powerful analysis methods for understanding relationships in your data, and generating predictions from them. One of the most common types is linear regression. Linear regression models, as the name implies, seek to define a linear relationship (think: "y=mx+c" from school) between a set of independent variables (predictors) and a dependent variable (target). This is normally done using the Least Squares Method, which fits a 'line of best fit' by minimizing the sum of the squared vertical distances of each point from the line. The quality of the fit is commonly reported as the "R-squared" value (also known as the 'coefficient of determination'), which is the proportion of the variance in the target that the model explains. Generally, linear regression models are used for forecasting and modelling time series with continuous variables.
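For a single predictor, the least squares line and its R-squared value can be computed directly from the closed-form formulas. The sketch below does exactly that in plain Python; the function names and the sample measurements are assumptions for illustration, and for more than one predictor you would normally use a library such as scikit-learn or statsmodels.

```python
def fit_line(xs, ys):
    """Least squares fit of y = m*x + c for one predictor; returns (m, c)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

def r_squared(xs, ys, m, c):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical, slightly noisy measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c = fit_line(xs, ys)
fit_quality = r_squared(xs, ys, m, c)
```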
Logistic regression, on the other hand, is used when your dependent variable is binary (True/False, Yes/No), for example in classification problems. Importantly, logistic regression can be used even if the dependent and independent variables in your models do not have a linear relationship, because it models the log-odds of the outcome, rather than the outcome itself, as a linear function of the predictors.
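As an illustration, here is a minimal single-predictor logistic regression fitted by plain gradient descent on the log-loss. This is only one of several ways to fit such a model, and the learning rate, number of epochs, the choice to center the predictor before fitting, and the pass/fail example data are all assumptions picked for the sketch rather than recommended defaults.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1) = sigmoid(w*x + b) by gradient descent; returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    centered = [x - mean_x for x in xs]  # centering helps gradient descent behave
    w, b = 0.0, 0.0
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(centered, ys)]
        grad_w = sum(e * x for e, x in zip(errors, centered)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    # Fold the centering back into the intercept so the model applies to raw x.
    return w, b - w * mean_x

def predict_probability(x, w, b):
    """Predicted probability that the outcome is 1 for predictor value x."""
    return sigmoid(w * x + b)

# Hypothetical data: hours studied vs. whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(hours, passed)
```

A predicted probability above 0.5 would typically be read as a "Yes", although the right threshold depends on the cost of each kind of mistake.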
(N.B. If your data doesn't fit a line very well, you could also try polynomial regression, which fits a curve to your data. Just watch out for overfitting - you don't want the model to fit the curve too closely to your data points!)
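A polynomial fit is still a least squares problem, just with extra columns for the powers of x. The sketch below builds the normal equations and solves them with a small Gaussian elimination routine; it is written in plain Python purely to show the mechanics, the function names and data are illustrative assumptions, and a library routine such as NumPy's `polyfit` would be the usual choice in practice.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution, from the last unknown to the first.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least squares polynomial fit; returns coefficients [c0, c1, ...] for
    c0 + c1*x + c2*x**2 + ..., via the normal equations."""
    size = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    aty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(ata, aty)

# Hypothetical data following a curve rather than a line.
xs = [-2, -1, 0, 1, 2]
ys = [4, 1, 0, 1, 4]
coefficients = fit_polynomial(xs, ys, degree=2)
```

Raising `degree` will always make the curve hug these particular points more closely, which is exactly the overfitting trap: a closer fit to the data you have says nothing about how well the curve predicts new data.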
When building models, it's also important to remember to cross-validate - that is, to evaluate your model on data it wasn't trained on. In its simplest form (a hold-out split), this means separating your data into training and testing sets, training your model on one portion, and then comparing its predictions on the 'testing' data against the actual values. In k-fold cross-validation, this is repeated over several different splits and the scores are averaged. Either way, you are measuring the predictive power of your model on unseen data, which helps you detect overfitting.
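Splitting the data is the only mechanical part of this, and it can be sketched as follows. The example shows both a single shuffled train/test split and the index bookkeeping for k-fold cross-validation; the function names, the 25% test fraction and the fixed random seed are assumptions chosen for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds, so that
    every sample is used for testing exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Hypothetical dataset of 8 samples, represented here by their row numbers.
rows = list(range(8))
train, test = train_test_split(rows)
folds = list(k_fold_indices(n_samples=10, k=5))
```

For each fold you would fit the model on the training indices, score it on the test indices, and average the scores across folds. If your data has an inherent order (a time series, say), shuffle or split it with that in mind so that the test set really does represent unseen data.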
Whilst covering only a fraction of the techniques you could apply to your data, this short guide has hopefully given you some things to think about with respect to your data pipeline. Remember that your data is unique - take the time to dig into it and you'll reap the rewards!