

Supervised vs Unsupervised Learning

A baby looking through books.
Credit: iStock


What is supervised learning? Combined with big data, this machine learning technique has the power to change the world. In this article, we’ll explore the topic of supervised learning, but will first touch on some recent machine learning history.

In 2012, Alex Krizhevsky, a researcher at the University of Toronto, kicked off the third golden age of artificial intelligence. By a large margin, he beat the state of the art in automatic labeling of ImageNet,1 a database of over a million images from 1000 different categories ranging from canoes to cats and frogs to hotdogs. If you’ve ever wondered why artificial intelligence has been all over the news in the last decade, the hype began with this breakthrough.

Alex’s novel approach was to parallelize the computation of his neural networks, allowing them to be wider and deeper than ever before.2 But how did he train his network? That’s all down to supervised learning.

What is supervised learning?

All you need for supervised learning are some data samples and their labels. We want to train a model to guess the correct label of each data sample.


Guesses of image class

(Alex’s network guesses the top five most likely classes for some images in ImageNet)2


This method of learning is intrinsic to all of us. If you’ve ever practiced a language, revised for a math exam or done a pub quiz, then you’ve learned under supervision. Imagine holding up an apple to a baby and asking, “What’s this?” The baby points to the apple and declares, “Banana!!” The baby was close, but there was some error in its prediction. “App-le”, you say. The baby updates its language model, and next time you show the apple it says, “Appum!!” Eventually, that baby will learn to say “Apple”.

This loop of guess, correction and update is the essence of supervised learning. In short, we are testing our model, the baby, with questions and supervising it with true answers. Or in statistical speak, we fit models to minimize the error between their predictions and the ground truth.
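
To make "fitting a model to minimize the error" concrete, here is a minimal sketch that fits a straight line by gradient descent on the mean squared error. The toy data, the learning rate, the number of steps and the function name `fit_line` are all assumptions made for illustration; real systems use far larger models, but the loop of predict, measure the error and update is the same idea.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every sample: guess minus ground truth.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up samples and labels, generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(f"learned w={w:.3f}, b={b:.3f}")
```

With these made-up numbers the learned slope and intercept should approach the values used to generate the labels.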

You were posing the baby a classification problem as it needed a categorical response. Other types of questions are concerned with estimating quantities, which we call regression problems. These include guessing the price of a house, someone’s age or the weight of your suitcase.
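
The difference between the two problem types can be sketched with one simple model, k-nearest neighbors, used in two ways: a majority vote for a categorical answer and an average for a quantity. The fruit weights, house sizes and prices below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def nearest_neighbors(query, samples, k):
    """Return the k labeled samples whose feature is closest to the query."""
    return sorted(samples, key=lambda s: abs(s[0] - query))[:k]


def classify(query, samples, k=3):
    """Classification: answer with the most common label among the neighbors."""
    labels = [label for _, label in nearest_neighbors(query, samples, k)]
    return Counter(labels).most_common(1)[0][0]


def regress(query, samples, k=2):
    """Regression: answer with the average value among the neighbors."""
    values = [value for _, value in nearest_neighbors(query, samples, k)]
    return sum(values) / len(values)


# Made-up data: (weight in grams, fruit name).
fruit = [(150, "apple"), (160, "apple"), (170, "apple"),
         (110, "banana"), (120, "banana"), (125, "banana")]

# Made-up data: (size in square meters, price in thousands).
houses = [(50, 150.0), (60, 180.0), (70, 210.0), (100, 300.0)]

fruit_guess = classify(155, fruit)   # a category
price_guess = regress(62, houses)    # a quantity
print(fruit_guess, price_guess)
```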

Let us continue our machine learning story. The year is now 2015 and Kaiming He, a researcher at Microsoft, builds a supervised neural network that, for the first time, surpasses human-level performance in classifying ImageNet.3 Since then, the focus has been shifting towards unsupervised learning and what we can achieve without labels.

What is unsupervised learning?

Put simply, unsupervised learning is just supervised learning but without the labels. But then how can we learn anything without a set of "true answers"?

Unsupervised learning tackles this seemingly impossible task of learning useful information without any sample-specific prior knowledge. Recall our supervised learning baby. When it was first born, it had never seen any objects before and didn’t know a single word. How did it go from knowing nothing about the world to knowing something? A popular term for this kind of problem in computer science is bootstrapping, so named because the task is akin to lifting yourself up by your bootstraps.

Unsupervised grouping of images into living and non-living things

(Unsupervised clustering on ImageNet.1 Do you agree with the red and green groupings into living and non-living things, or would you have done it differently? Perhaps by color or time of day?)


This is usually achieved by making generic assumptions about the dataset as a whole. Popular ones are:

  • Clustering – assuming the data naturally falls into a finite number of distinct groups. We might expect ImageNet’s 1000 classes to divide into 1000 groups. Algorithms that help decide what data should go in what group include centroid-based methods such as k-means, distribution-based methods such as Gaussian mixture models, and graph-based approaches such as spectral clustering.
  • Dimensionality reduction – assuming the data can be compressed while preserving most of its information. Everyday examples are lossy compression formats such as JPEG and MP3. Principal component analysis and autoencoders are also commonly used.
  • Anomaly detection – expecting that anomalous samples lie outside the distribution of normal ones. We show our model only normative samples, and anomalous ones are then flagged by their distance from the normative population. In practice, a common simplification is to assume the normative population follows a Gaussian distribution and to define anomalies as samples lying more than some number of standard deviations from the mean.
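
As a rough illustration of the clustering assumption, the sketch below implements a bare-bones k-means: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The 2-D points are made up, and starting from the first k points is a simplifying assumption; practical implementations use more careful initialization.

```python
def k_means(points, k, iterations=20):
    """Group points into k clusters; returns (centroids, labels)."""
    # Naive initialization (an assumption for this sketch): the first k points.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for index, point in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(point, c))
                         for c in centroids]
            labels[index] = distances.index(min(distances))
            clusters[labels[index]].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, labels


# Made-up, unlabeled 2-D samples that form two loose groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0),
          (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]

centroids, labels = k_means(points, k=2)
print(labels)
```

Note that no labels are used anywhere: the cluster indices only say which samples belong together, not what the groups mean.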

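The Gaussian rule of thumb for anomaly detection described above can be sketched in a few lines: estimate the mean and standard deviation from normative samples only, then flag anything too many standard deviations away. The temperature readings and the threshold of three standard deviations are assumptions chosen for illustration.

```python
import statistics


def fit_gaussian(normal_samples):
    """Estimate the mean and standard deviation of the normative population."""
    return statistics.mean(normal_samples), statistics.stdev(normal_samples)


def is_anomaly(value, mean, std, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    return abs(value - mean) / std > threshold


# Made-up normative samples, e.g. body temperatures in degrees Celsius.
normal = [36.5, 36.7, 36.6, 36.8, 36.4, 36.6, 36.7, 36.5]
mean, std = fit_gaussian(normal)

print(is_anomaly(39.0, mean, std))  # far from the normative population
print(is_anomaly(36.9, mean, std))  # within the normal spread
```
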
Self-supervised learning

Self-supervised methods represent a fascinating subset of unsupervised learning. In the context of end-to-end deep learning, we still require some form of supervisory signal for training. This means we need to design learning objectives that are a function of the data samples alone. Researchers have been creative here. For language models, this might mean filling in the blank word in a sentence, such as:


Will machines take over the ____?

and for models trained on images, solving jigsaw puzzles:


Reconstruction of jigsaw image

(Given (b), the model must rearrange the jigsaw pieces to reconstruct (a))4


Justifiably, you may question the usefulness of an AI that solves jigsaw puzzles. But performing a generic task like this requires learning important information about the data. To rearrange the tiger, you have to first learn what one looks like.
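
To show how a fill-in-the-blank objective is "a function of the data samples alone", here is a small sketch that turns one unlabeled sentence into several training pairs by hiding one word at a time. The placeholder token `____`, the example sentence and the function name are assumptions for illustration; real language models work on sub-word tokens and vast text collections.

```python
MASK = "____"  # hypothetical placeholder for the hidden word


def masked_examples(sentence):
    """Build (masked sentence, hidden word) pairs from raw text, with no human labels."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [MASK] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


pairs = masked_examples("Will machines take over the world")
for question, answer in pairs:
    print(f"{question!r} -> {answer!r}")
```

Each pair can then be used exactly like a supervised sample and label, even though nobody annotated anything.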

Semi-supervised learning: the best of both worlds

A combination of unsupervised and supervised learning, this scenario asks what we can learn when only a subset of the dataset is labeled. Typically, this involves learning a powerful representation of the data through unsupervised pre-training, followed by supervised calibration and testing on the smaller labeled set. By first learning from the cheap and abundant unlabeled set, we can achieve better results than if we only performed supervised training using the labeled subset.
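
The sketch below illustrates the idea with "cluster-then-label", a much simpler stand-in for deep unsupervised pre-training: the structure is learned from the unlabeled samples, and a handful of labels is only used to name the groups. All numbers and function names are invented for illustration.

```python
from collections import Counter


def two_means_1d(values, iterations=20):
    """Unsupervised step: split 1-D values into two clusters (k-means, k=2)."""
    centroids = [min(values), max(values)]  # simple initialization (assumption)
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            closer_to_first = abs(v - centroids[0]) <= abs(v - centroids[1])
            groups[0 if closer_to_first else 1].append(v)
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


def nearest(value, centroids):
    """Index of the centroid closest to the value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def name_clusters(centroids, labeled):
    """Supervised step: each cluster takes the majority label of its labeled samples."""
    votes = {i: Counter() for i in range(len(centroids))}
    for value, label in labeled:
        votes[nearest(value, centroids)][label] += 1
    # Clusters that received no labeled sample are left unnamed.
    return {i: c.most_common(1)[0][0] for i, c in votes.items() if c}


# Made-up data: many unlabeled fruit weights, only two labeled ones.
unlabeled = [110, 112, 118, 121, 125, 149, 152, 158, 163, 170]
labeled = [(115, "banana"), (160, "apple")]

centroids = two_means_1d(unlabeled)
names = name_clusters(centroids, labeled)


def predict(value):
    """Classify a new sample via its nearest (named) cluster."""
    return names[nearest(value, centroids)]


print(predict(120), predict(155))
```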

When to use supervised vs unsupervised learning

For supervised learning we need labels. But annotating your data isn’t always that easy.


Images with true labels and class guesses

(Labeling issues. Is the first picture really a grille? Is the third picture a dog, or some cherries? How would you label these images?)


Some issues you might encounter are:

  1. Big data: Assigning a label to every sample in your dataset can be time-consuming and expensive, especially if labeling requires an expert, as in medical imaging.
  2. Multiple classes per sample: Your dataset may require several labels per sample if a sample belongs to or exhibits multiple classes. Was the third picture above a dog, some cherries, or both?
  3. Dense labeling: Each dimension of your multivariate data might need a label, which can get very expensive. For example, if we are training a network to draw contours around apples, we typically need every pixel to be labeled as belonging either to an apple or to the background.


If your data comes pre-packaged with labels, supervised learning is a great place to start. It can allow you to compare performance of different models and provide intuition on how difficult the prediction task is. However, keep in mind that the accuracy of your labels can be in jeopardy from:

  1. Labeling errors, appearing as systematic bias or variance. In other words, different annotators may not assign the same labels to the same samples. The degree to which they do is called inter-rater agreement, and it can be alarmingly low. Indeed, labels from the same person aren’t guaranteed to be consistent; one study found that judges give lighter sentences after they’ve eaten lunch.
  2. Categorical representations of continuous variables, where several different levels of a variable are binned to the same discrete value, thus destroying the nuance in the variable.
  3. Disregarding class relationships. Independent categorical variables ignore class overlap. For example, we know that cats are conceptually closer to dogs than either is to skyscrapers. Yet simple categorical labels will not encode this fact.
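
One way to put a number on inter-rater agreement is Cohen's kappa, a standard statistic that is not named in the article and is used here as an assumption about how you might measure it. It compares the observed agreement between two annotators with the agreement expected by chance. The two label lists below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement.

    Returns 1.0 for perfect agreement and about 0.0 for chance-level agreement.
    Undefined (division by zero) if both annotators always use one single label.
    """
    n = len(labels_a)
    # Fraction of samples on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up labels from two annotators for the same eight images.
annotator_1 = ["dog", "dog", "cherry", "dog", "cherry", "cherry", "dog", "cherry"]
annotator_2 = ["dog", "cherry", "cherry", "dog", "cherry", "dog", "dog", "cherry"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on six of eight images, but half of that agreement would be expected by chance alone, so the corrected score is considerably lower than the raw 75%.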


Babies do a large amount of learning on their own. At the 2016 NeurIPS conference, Yann LeCun, one of the three godfathers of artificial intelligence, said:

“If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning”

Unsupervised deep learning methods have seen significant progress in the last few years, with their performance fast approaching that of their supervised counterparts on the ImageNet challenge. Once you know the pros and cons of both styles of learning, choosing between unsupervised and supervised learning, or a mix of the two, is down to you and your dataset.