3 Tips for Data Processing and Analysis

How To Guide

Published: April 19, 2018

Lee Baker

So you’ve collected your data and have a nice, clean dataset. Before you start your analyses, you need to get to know your data. Inspect each variable and decide which data type it belongs to. The data type tells you which mathematical operations you can and can’t do with that variable. This step can make or break your well-planned study.

Here we’re going to take a look at the different data types, learn to understand them and do calculations with them. By the time you get to the end, you’ll be able to put your new skills into practice on your own data.

1. Know Your Data

There are 4 distinct data types:

Ratio
Interval
Ordinal
Nominal

Let’s start from the top and work down the list.

Ratio Data

These data are called Ratio because you can divide one value by another and get a meaningful answer. Distance and weight measurements are both Ratio data.

Here are some examples:

20 metres is twice the distance of 10 metres (i.e. 20/10 = 2)
50 kg is ten times heavier than 5 kg (i.e. 50/5 = 10)
150 K represents half as much thermal energy as 300 K (i.e. 150/300 = ½)

With Ratio data, you can do pretty much any mathematical operation and the result will be valid. You can:

Divide or multiply
Add or subtract
Compare (greater than, equal to or less than)

For example, Body Mass Index (BMI) is calculated as the ratio of the weight to the square of the height. The weight and height are both Ratio data, as is the resultant BMI. The crucial point is that for values to be divisible, there needs to be a meaningful zero point to the data. A tape measure can’t make negative measurements, and neither can a jug or a set of weighing scales, so anything measured by these has an absolute zero and cannot take negative values.

Interval Data

With Interval data, you cannot multiply or divide, but you can add and subtract.

Here are some examples:

4pm is 2 hours after 2pm (i.e. 4-2 = 2)
50°C is 30 degrees hotter than 20°C (i.e. 50-20 = 30)
My test score of 80% was 20 percentage points higher than your score of 60% (i.e. 80-60 = 20)

We can’t multiply or divide any of these examples, because there is no meaningful zero, so we can’t say things like ‘4pm is twice as late as 2pm’. Clocks don’t have a zero-point.

With Interval data, you can do the following mathematical operations:

Add or subtract
Compare (greater than, equal to or less than)
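
To see why division misleads on an Interval scale, here is a minimal Python sketch using the temperatures from the example above. The variable names are made up for illustration, and the kelvin conversion is added only to show how the ratio changes once the scale has an absolute zero.

```python
# Interval data: differences are meaningful, ratios are not.
temp_a_c = 50.0  # degrees Celsius
temp_b_c = 20.0  # degrees Celsius

# Subtraction is valid on an Interval scale.
difference_c = temp_a_c - temp_b_c
print(difference_c)  # 30.0

# Division returns a number, but it depends on where the scale puts its zero,
# so it does not mean "2.5 times hotter".
naive_ratio = temp_a_c / temp_b_c
print(naive_ratio)  # 2.5

# Kelvin is a Ratio scale (absolute zero), and the ratio there is very different.
KELVIN_OFFSET = 273.15
kelvin_ratio = (temp_a_c + KELVIN_OFFSET) / (temp_b_c + KELVIN_OFFSET)
print(round(kelvin_ratio, 3))
```

The difference stays at 30 degrees whichever of the two scales you use, while the ratio changes with the choice of zero. That is the sign that only addition and subtraction are safe here.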

Ordinal Data

With Ordinal data, the data are in categories that have a natural order, but the difference between each category cannot be quantified. Examples of Ordinal data are:

Rankings (e.g. 1st, 12th, 52nd)
Agreement (e.g. Agree, Neutral, Disagree)
Socioeconomic status (e.g. Lower, Middle, Upper)

What you can do with Ordinal data is:

Compare (greater than, equal to or less than)

We can say that a Scotch Bonnet pepper is hotter than a Cayenne pepper (rated at 100,000 and 10,000 on the Scoville scale, respectively), but we can’t subtract their Scoville measurements because differences between degrees of ‘hotness’ are not meaningful. That would be like saying that you need to eat 90,000 Cayenne peppers to get the same effect as eating zero Scotch Bonnets, and that would just be silly.
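
As a rough sketch of working with Ordinal data in Python, you can store the order explicitly and then compare positions in that order. The list name `AGREEMENT_ORDER`, the helper `rank` and the sample responses are all hypothetical.

```python
# Hypothetical ordered scale, listed from lowest to highest.
AGREEMENT_ORDER = ["Disagree", "Neutral", "Agree"]


def rank(response: str) -> int:
    """Return the position of a response on the ordered scale."""
    return AGREEMENT_ORDER.index(response)


responses = ["Agree", "Disagree", "Neutral", "Agree"]

# Comparing and sorting use only the order, so both are valid operations.
agree_is_higher = rank("Agree") > rank("Neutral")
sorted_responses = sorted(responses, key=rank)

print(agree_is_higher)   # True
print(sorted_responses)  # ['Disagree', 'Neutral', 'Agree', 'Agree']
```

The positions 0, 1 and 2 are only labels for the order. Subtracting them would suggest that the gap between Disagree and Neutral equals the gap between Neutral and Agree, which Ordinal data cannot tell you.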

Nominal Data

With Nominal data, all you can do is name the categories. Each Nominal category is different, but you can’t define mathematically why they are different and there is no order in the categories. Examples include:

Gender (e.g. Male, Female, Other)
Genotype (e.g. BB, Bb, bB, bb)
Hair color (e.g. Black, Brown, Blonde, Red, Other)

It is really important to identify for every variable in your dataset which data type it belongs to. Once you’ve done that you will then know what calculations are possible with each variable, which is where we’re going next.
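
One way to keep track of this is to record the data type of each variable and look up what it allows. The following is only a sketch: the variable names, the operation labels and the `is_allowed` helper are assumptions made up for this example, and the lookup table simply restates the four lists above.

```python
# Operations permitted for each data type, as described in the text.
ALLOWED_OPERATIONS = {
    "ratio": {"compare", "add_subtract", "multiply_divide"},
    "interval": {"compare", "add_subtract"},
    "ordinal": {"compare"},
    "nominal": set(),  # categories can only be named
}

# Hypothetical dataset: each variable is tagged with its data type.
variable_types = {
    "weight_kg": "ratio",
    "temperature_c": "interval",
    "agreement": "ordinal",
    "hair_color": "nominal",
}


def is_allowed(variable: str, operation: str) -> bool:
    """Check whether an operation is valid for a variable's data type."""
    return operation in ALLOWED_OPERATIONS[variable_types[variable]]


print(is_allowed("weight_kg", "multiply_divide"))      # True
print(is_allowed("temperature_c", "multiply_divide"))  # False
print(is_allowed("agreement", "add_subtract"))         # False
```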

2. Make Your Data Calculations

Some of the data you need for your analyses are collected (like Height, Weight, Gender), but others need to be calculated (like Age, BMI, Time to Event). There are 5 basic types of calculation that you’re likely to face in your data:

Create new variables by multiplication and division
Create new variables by addition and subtraction
Summarize continuous data in integer categories
Convert integer data to text categories
Convert text data to integer categories

Create new variables by multiplication and division

At times we need to multiply or divide variables to create new variables. Examples include BMI, which is weight divided by the square of the height. All variables must be of Ratio type, and the outcome will also be Ratio.
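
The BMI calculation described above might look like this in Python. It is a minimal sketch: the function name `bmi`, the units (kilograms and metres) and the sample values are assumptions for illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight divided by the square of the height."""
    if height_m <= 0:
        # Height is Ratio data with an absolute zero, so it must be positive here.
        raise ValueError("height must be greater than zero")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
print(round(bmi(70.0, 1.75), 1))  # 22.9
```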

Create new variables by addition and subtraction

Some data needs to be added or subtracted to create new variables. Calculating survival age by subtracting the date of birth from the date of death (both Interval data) will give you an outcome that is of Ratio type. This is because the date of birth defines a true zero point, so you can convert your data from Interval to Ratio!
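
Here is a short sketch of that subtraction using Python's standard `datetime` module. The dates are invented, and dividing by 365.25 is a common approximation for converting days to years rather than an exact rule.

```python
from datetime import date

# Hypothetical dates (Interval data: no meaningful zero).
date_of_birth = date(1920, 5, 1)
date_of_death = date(2000, 5, 1)

# Subtracting two dates gives a duration (Ratio data: zero means "no time lived").
survival = date_of_death - date_of_birth
survival_days = survival.days
survival_years = survival_days / 365.25  # approximate conversion to years

print(survival_days)   # 29220
print(survival_years)  # 80.0
```

Because the result has a true zero, it is now reasonable to say that someone who survived 80 years lived twice as long as someone who survived 40.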

Summarize continuous data in integer categories

Sometimes continuous data (Ratio or Interval) contains bias, noise, or estimated figures. Ask a fisherman the weight of his biggest catch – you’re not always going to get a truthful answer! When your continuous data isn’t so accurate, it’s useful to summarize your data in categories. For example, you might summarize age into decade categories, so 2, 3 and 4 represent people in their twenties, thirties and forties. In doing this, you will remove some or all of the bias and noise, but you will also lose some of the detail in your information.
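
The decade example above can be sketched with integer division. The ages here are invented for illustration.

```python
# Hypothetical ages in years (continuous Ratio data).
ages = [23, 37, 41, 29, 35]

# Integer division by 10 keeps only the decade: 23 -> 2, 37 -> 3, 41 -> 4.
decade_categories = [age // 10 for age in ages]

print(decade_categories)  # [2, 3, 4, 2, 3]
```

Note the trade-off mentioned in the text: ages 23 and 29 are now indistinguishable, which hides small errors but also discards detail.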

Convert integer data to text categories

There may be times when the counts in some of your categories are too small for meaningful analysis. It might be more useful to summarize your integer categories into broader categories, such as age categories of juvenile, pre-menopausal, post-menopausal, or whatever is appropriate for your study. Representing these categories with text labels might be more useful and informative than integers.
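
As a sketch of this step, a dictionary can map integer categories to broader text labels. The labels and cut-offs below are purely hypothetical and are not taken from any study, so choose groupings that suit your own data.

```python
from collections import Counter

# Hypothetical decade categories (2 = twenties, 3 = thirties, and so on).
decade_categories = [2, 3, 4, 2, 6, 5]

# Illustrative mapping from integer categories to broader text categories.
DECADE_TO_GROUP = {
    2: "younger adult",
    3: "younger adult",
    4: "middle-aged",
    5: "middle-aged",
    6: "older adult",
}

age_groups = [DECADE_TO_GROUP[decade] for decade in decade_categories]
group_counts = Counter(age_groups)

print(age_groups)
print(group_counts["younger adult"])  # 3
```

Merging categories this way raises the count in each group, which is the point of the exercise when the original categories are too sparse.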

Convert text data to integer categories

Now that you have your categories suitably named in Excel and you’re ready to analyze your data, you suddenly realize that your favorite stats program doesn’t support text categories! Oops, you’ve now got to convert them back from text to integers, from [small, medium, large] to [1, 2, 3].
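
A minimal sketch of that conversion in Python, assuming the three categories from the example. The names `SIZE_CODES` and `sizes` are made up for illustration.

```python
# Explicit mapping, so the integers follow the natural order of the categories
# rather than, say, alphabetical order.
SIZE_CODES = {"small": 1, "medium": 2, "large": 3}

# Hypothetical text data.
sizes = ["medium", "small", "large", "small"]

size_codes = [SIZE_CODES[size] for size in sizes]

print(size_codes)  # [2, 1, 3, 1]
```

Writing the mapping out by hand also means that an unexpected label, such as a misspelling, raises a `KeyError` instead of silently getting its own code.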

3. Check That Your Data Is Sensible

Real life follows rules, and so must your data. If you have stored your data in Excel, there may be errors in your data that Excel cannot detect, such as when a patient’s age is negative or over 300.

One way to check whether your data are sensible is to compute descriptive statistics on each variable, and you should do this for both continuous (Ratio and Interval) and categorical (Ordinal and Nominal) data types.
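
The sketch below shows one possible version of these checks using only Python's standard library. The data are invented, and the plausible age range of 0 to 120 is an assumption that you should replace with limits that make sense for your study.

```python
from collections import Counter
from statistics import mean

# Hypothetical data containing the kinds of errors a spreadsheet won't flag.
ages = [34, 51, -2, 47, 310, 29]
genders = ["Male", "Female", "female", "Male", "Femle", "Female"]

# Continuous data: minimum, maximum and mean quickly expose impossible values.
age_summary = {"min": min(ages), "max": max(ages), "mean": mean(ages)}

# Assumed plausible range for a patient's age in years.
MIN_AGE, MAX_AGE = 0, 120
implausible_ages = [age for age in ages if not MIN_AGE <= age <= MAX_AGE]

# Categorical data: counting each label exposes typos and inconsistent case.
gender_counts = Counter(genders)

print(age_summary["min"], age_summary["max"])
print(implausible_ages)  # [-2, 310]
print(gender_counts)
```

In the counts, "female" and "Femle" each appear once as separate categories, which is the cue to go back and correct the entries.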

Summary

Well, I hope that you’re now starting to realize how important it is to know how to identify the data type of every variable in your dataset, and to understand what you can and can’t do with these data types. If you get it right, your analytical choices will all be simple, and everything will drop into place. On the other hand, bypassing this step will have serious consequences for your analyses. The last thing you want to do is take your results to your boss, only for them to tell you that it’s all wrong and you need to start again.

Meet the Author

Lee Baker

Lee Baker is an award-winning software creator with decades of experience in science, statistics and artificial intelligence. He authors articles that teach the fundamentals of data analysis and statistics.