Organizing Your Data: 7 Tips and Tricks
When collecting their data, most people enter them in a single Excel worksheet, then clean them and process them right there in the same worksheet. This really isn’t a very good idea. If you select a single column of data and do a find & replace operation, will Excel do what you ask and replace items only in that column, or will it operate across the entire worksheet? Are you sure? Really, really sure? You need to be, because if Excel does something different to what you expect you’re likely to enter errors across your entire dataset. Worse still, some of Excel’s operations don’t have an “undo”, so correcting the errors that you’ve just introduced can quickly lead you down a rabbit hole. No, the way to handle data is to be organized and use multiple worksheets to introduce a chronological flow to your data processing and cleaning, and this is what we’re going to learn here.
Collect and Store Your Data
When you collect your data, you should store them in a single Excel worksheet. These data are a copy of the paper-based forms that you used in data collection (if you collected data on paper). When they have been entered this sheet should be named “Raw Data” and should then remain untouched. Of course, these data will likely contain errors, typos and other issues that will need your attention, but this sheet is your starting point. From here you’re going to use other worksheets to get organized and build a flow into your processes.
Make a Copy – Clean Your Data
From here you make a copy of “Raw Data”, and name it “In Progress”, where you’re going to clean your data. So now you have two worksheets that are identical, and you can safely do find/replace operations without the risk of entering errors, can’t you? Au contraire! You NEVER do data cleaning operations in the worksheet where your data are stored. Instead, you extract the data you want to clean (i.e. a single column), clean and process it in a separate worksheet, then once you’re sure that these data are perfectly clean, you copy it back into your In Progress worksheet. This procedure is called Extract-Transform-Load, or ETL for short, and ensures that you never introduce new errors into your data. When you’re happy that your data are perfectly clean, you can rename your “In Progress” worksheet to “Clean Data”.
Make a Copy – Do Your Calculations
Now you have two worksheets, containing your data at different stages of preparation – one with dirty data and the other with clean data. As you move to the next phase, you create a copy of the “Clean Data” worksheet and name it “Calculated Data”. Some data are collected, like date of birth, while others need to be calculated, such as age – typically the difference between dates. Here is where you create, calculate and add in these new data. Do you see why you clean the data before doing calculations on them? If you do it the other way around, you’re making calculations on data that contain errors. Once you’ve cleaned the errors, you’re going to have to do the calculations all over again.
Understand Your Data – Descriptive Statistics
So now you have three worksheets, and I hope by now you’re starting to understand how you’re introducing a chronological flow to your data cleaning processes. When you discover an error – and you will – you can go back through your worksheets to see when, where and how the errors were introduced, and this gives you an opportunity to improve your data collection, cleaning and processing practices so that you’ll have fewer errors in future. Talking about errors, computing descriptive statistics on each column of data is where you start to get a real understanding of your data and find errors that you didn’t know existed. The descriptive stats you’ll need to compute are different for categorical data and numerical data, but these are the most useful measures:
For Numerical Columns:
- all the entries
- all the positive entries
- all the negative entries
- all the entries that are zero
- all the empty cells
- the minimum value
- the maximum value
For Categorical Columns:
- all the entries
- all the entries in each category
- all the empty cells
Checking through these results will highlight the most obvious errors in your data. For example, are there ages that are negative or zero? There shouldn’t be any. Do the number of entries tally correctly? Are there any missing data points? Are the minimum and maximum values sensible? If you’re using Excel, learn how to use the formulae COUNT, MIN, MAX, and AVERAGE for your numerical entries. For categorical entries, COUNTIF can tell you how many entries of each category you have in your variable. For empty cells, COUNTBLANK is a very useful formula to use. Whenever you discover an error, trace it back through the worksheets until you reach the source of the error. I guarantee you’ll feel a flush of satisfaction when you realize that your new-found data organizational skills have just saved you!
Keep a “Codes” Sheet
Some statistics programs don’t accept text data, so sometimes it can be useful to store categorical data as integers [1, 2, 3] rather than [Small, Medium, Large]. So if you’ve coded your categories as integers, will you remember what they signify? You shouldn’t need to remember, and this is why we keep a codes sheet. You need to remember that you may not be the only person that uses these data. You want those that come after you to say nice things about you, not moan about how disorganized you were, so keeping a note of the codes not only helps you, but it also helps others. And don’t forget to keep a note of the measurement units in this sheet too!
Keep a “Notes” Sheet
During your project you will make lots of decisions, and at some point you will need to describe and explain them to someone else. Keeping a notes sheet allows you to document everything about your study. There’s nothing wrong with writing notes in your lab book, but if someone else needs access to your data you’re unlikely to hand this over to someone else or make a photocopy of it, so I recommend that you keep notes in a separate worksheet – that way your notes will remain with your data. If, at some point later, you need to hand your dataset over to a statistician for analysis, you’ll be glad you kept a notes sheet – it will save you hours of explanations!
If your data were extracted from a departmental database, it is good practice to report all errors back to the original source. You’re not just being good to others, though – you’re being good to yourself. It may be that at some point in the future you need to extract the same or similar data again. Only this time you won’t have to waste time cleaning errors you’ve already cleaned before.
Identifying errors and cleaning data might not be your idea of a good time (trust me, it’s not mine either), but if you get yourself organized and learn a few simple, repeatable processes, your time spent cleaning and preparing your data for analysis can be shorter – and less painful – than you expect. Learning – even before you’ve begun collecting data – how to set up your Excel workbook to utilize multiple worksheets and introduce “flow” into your data will help you reach the story of your data much quicker, and that usually goes down well with the boss!