Calculating a Least Squares Regression Line: Equation, Example, Explanation

If you want a simple explanation of how to calculate and draw a line of best fit through your data, read on!


Published: August 21, 2020 | Last Updated: December 18, 2023 | by Andrew Lee, Medical Statistician, Cystic Fibrosis Trust

A graph showing a positive correlation between the x and y axis. Two question marks sit in the top left and bottom right of the graph.


Being able to draw conclusions about data trends is one of the most important steps in both business and science. It’s the bread and butter of the market analyst who realizes Tesla’s stock bombs every time Elon Musk appears on a comedy podcast, as well as the scientist calculating exactly how much rocket fuel is needed to propel a car into space.

Contents

How to find a least squares regression line

Least squares regression line example

Least squares regression equations

How do you calculate a least squares regression line by hand?

Drawing a least squares regression line by hand

What are the disadvantages of least-squares regression?

How to find a least squares regression line

Often the questions we ask require us to make accurate predictions on how one factor affects an outcome. If a teacher is asked to work out how time spent writing an essay affects essay grades, it’s easy to look at a graph of time spent writing essays and essay grades and say “Hey, people who spend more time on their essays are getting better grades.” What is much harder (and realistically, pretty much impossible) to do by eye is to predict what score someone will get in an essay based on how long they spent on it. Sure, there are other factors at play like how good the student is at that particular class, but we’re going to ignore confounding factors like this for now and work through a simple example.

Our teacher already knows there is a positive relationship between how much time was spent on an essay and the grade the essay gets, but we’re going to need some data to demonstrate this properly.

Least squares regression line example

Suppose we wanted to estimate a score for someone who had spent exactly 2.3 hours on an essay. I’m sure most of us have experience in drawing lines of best fit, where we line up a ruler, think “this seems about right”, and draw some lines from the X to the Y axis. In a room full of people, you’ll notice that no two lines of best fit turn out exactly the same. What we need to answer this question is the best best fit line.

Through the magic of least squares regression, and with a few simple equations, we can calculate a predictive model that lets us estimate grades far more accurately than by sight alone. Regression analyses are an extremely powerful analytical tool used within economics and science. There are a number of popular statistical programs that can construct complicated regression models for a variety of needs. A simpler model such as this requires nothing more than some data, and maybe a calculator. It’s worth noting at this point that this method is intended for continuous data.

Least squares regression equations

The premise of a regression model is to examine the impact of one or more independent variables (in this case time spent writing an essay) on a dependent variable of interest (in this case essay grades). Linear regression analyses such as these are based on a simple equation:

Y = a + bX

Y – Essay Grade

a – Intercept

b – Coefficient

X – Time spent on Essay

There are a couple of key takeaways from the above equation. First of all, the intercept (a) is the essay grade we expect to get when the time spent on essays is zero. You can imagine jotting down a few key bullet points while spending only a minute on an essay and still getting a few points here and there. Every essay will have at least this score according to our model. On top of that, every hour we spend on our essays (X) leads to an increase of b in the grade the essay gets. We can work out b through the following, slightly scary equation:

b = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)²

But we’re getting ahead of ourselves. To calculate b, and make sense of that creepy equation, we’re going to need to know the values for our data:

How do you calculate a least squares regression line by hand?

When calculating least squares regressions by hand, the first step is to find the means of the dependent and independent variables. We do this because of an interesting quirk within linear regression lines - the line will always cross the point where the two means intersect. We can think of this as an anchor point, as we know that the regression line in our test score data will always cross (4.72, 64.45).

The second step is to calculate the difference between each value and the mean value for both the dependent and the independent variable. In this case this means we subtract 64.45 from each test score and 4.72 from each time data point. Additionally, we want to find the product of these two differences for each data point, along with the square of each time difference.

A data table with dependent and independent variable calculations.

You should notice that as some scores are lower than the mean score, we end up with negative values. By squaring the time differences, we end up with a measure of deviation from the mean that is never negative, regardless of whether the values are more or less than the mean.
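As a rough sketch, these first two steps might look like this in Python. The original data table is not reproduced here, so the `hours` and `grades` lists below are made-up values for illustration only; they will not reproduce the means of 4.72 and 64.45 quoted above.

```python
# Hypothetical data for illustration only (not the article's data set).
hours = [1.0, 2.0, 4.0, 5.0, 8.0]
grades = [40.0, 45.0, 60.0, 65.0, 90.0]

# Step 1: means of the independent (x) and dependent (y) variables.
mean_x = sum(hours) / len(hours)
mean_y = sum(grades) / len(grades)

# Step 2: deviations from the means, their products, and the squared x deviations.
rows = []
for x, y in zip(hours, grades):
    dx = x - mean_x
    dy = y - mean_y
    rows.append((x, y, dx, dy, dx * dy, dx ** 2))

print(f"mean hours = {mean_x:.2f}, mean grade = {mean_y:.2f}")
print("       x         y    x-mean    y-mean   product   squared")
for row in rows:
    print("  ".join(f"{value:8.2f}" for value in row))
```

Each printed row corresponds to one row of a hand-written working table: the raw values, the two deviations, their product, and the squared time deviation.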

Let's remind ourselves of the equation we need to calculate b.

b = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)²

The symbol sigma (∑) tells us we need to add all the relevant values together.

If we do this for the table above, we get the following results:

∑(x − x̄)(y − ȳ) = 611.36

And

∑(x − x̄)² = 94.18

Slotting these two sums into a calculator allows us to calculate b, which is step one of two to unlock the predictive power of our shiny new model:

b = 611.36 / 94.18 = 6.49
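The same division can be checked in a couple of lines of Python. This is only a sketch using the two sums quoted above; the variable names are our own.

```python
# Sums quoted in the text above.
sum_xy_dev = 611.36  # sum of (x - mean_x) * (y - mean_y)
sum_xx_dev = 94.18   # sum of (x - mean_x) ** 2

# Slope (coefficient b): the expected change in grade per extra hour.
b = sum_xy_dev / sum_xx_dev
print(f"b = {b:.2f}")  # prints b = 6.49
```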

The final step is to calculate the intercept, which we can do using the initial regression equation with the values of test score and time spent set as their respective means, along with our newly calculated coefficient.

64.45 = a + 6.49 * 4.72

We can then solve this for a:

64.45 = a + 30.63

a = 64.45 – 30.63

a = 33.82

Now we have all the information needed for our equation and are free to slot in values as we see fit. If we wanted to know the predicted grade of someone who spends 2.35 hours on their essay, all we need to do is swap that in for X.

y = 33.82 + 6.49 * X

y = 33.82 + (6.49 * 2.35)

y = 49.07
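The intercept and prediction steps can also be sketched in Python, starting from the means and the rounded slope quoted in the text. The function name `predict_grade` is our own choice for illustration.

```python
# Values quoted in the text above.
mean_hours = 4.72
mean_grade = 64.45
b = 6.49  # slope, rounded to two decimals as in the hand calculation

# The line passes through the point of means, so mean_grade = a + b * mean_hours.
a = mean_grade - b * mean_hours


def predict_grade(hours):
    """Predicted essay grade for a given number of hours, y = a + b * x."""
    return a + b * hours


print(f"a = {a:.2f}")
print(f"predicted grade for 2.35 hours = {predict_grade(2.35):.2f}")
```

Keeping the intercept as an unrounded value inside the program and rounding only when printing avoids small rounding drift between steps.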

Drawing a least squares regression line by hand

If we wanted to draw a line of best fit, we could calculate the estimated grade for a series of time values and then connect them with a ruler. As we mentioned before, this line should cross the means of both the time spent on the essay and the mean grade received.
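Here is a minimal sketch of that idea in Python: fit the line, compute the estimated grade at a handful of time values, and confirm that the line passes through the point of means. The data and the chosen `time_values` are hypothetical, since the original data table is not reproduced here.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    b = sum_xy / sum_xx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data for illustration only (not the article's data set).
hours = [1.0, 2.0, 4.0, 5.0, 8.0]
grades = [40.0, 45.0, 60.0, 65.0, 90.0]

a, b = fit_line(hours, grades)

# Estimated grades at a series of time values; joining these points draws the line.
time_values = [0, 2, 4, 6, 8]
line_points = [(t, a + b * t) for t in time_values]
for t, grade in line_points:
    print(f"{t} hours -> {grade:.1f}")

# The fitted line should also pass through the point of means.
mean_hours = sum(hours) / len(hours)
mean_grade = sum(grades) / len(grades)
on_line = abs((a + b * mean_hours) - mean_grade) < 1e-9
print(f"passes through the means: {on_line}")
```

Because the line is straight, two of these points would be enough to draw it; plotting a few more is a cheap way to catch an arithmetic slip.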

A least-squares regression line is shown to link grade with hours spent on essay.

And there we have it! A perfect* predictive model that will make our teachers’ lives a lot easier.

What are the disadvantages of least-squares regression?

*As some of you will have noticed, a model such as this has its limitations. For example, if a student had spent 20 hours on an essay, their predicted score would be roughly 160, which doesn’t really make sense on a typical 0-100 scale. It’s always important to understand the real-world limitations of a model and ensure that it’s not being used to answer questions that it’s not suited for.
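One simple guard, sketched below, is to flag any prediction that falls outside the grading scale. The helper `predict_on_scale` and its `lowest` and `highest` parameters are hypothetical names introduced for this example; the slope and means are the values quoted in the text.

```python
def predict_on_scale(hours, a, b, lowest=0.0, highest=100.0):
    """Return the raw prediction a + b * hours and whether it fits the grading scale."""
    grade = a + b * hours
    return grade, lowest <= grade <= highest


# Slope and means as quoted in the text; the intercept follows from them.
b = 6.49
a = 64.45 - b * 4.72

for hours in (2.35, 20):
    grade, plausible = predict_on_scale(hours, a, b)
    note = "ok" if plausible else "outside the 0-100 scale, do not trust"
    print(f"{hours} hours -> {grade:.1f} ({note})")
```

A check like this does not fix the model. It only makes it obvious when the model is being asked about a value it was never fitted to handle.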

Outliers, such as a student who spends 20 hours on an essay, can also have a disproportionate effect on the fitted line. It's important to inspect your data and validate your model against it to make sure a straight-line fit is the right approach to take.
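To get a feel for how much a single unusual point can matter, the sketch below fits a line to a small made-up data set, then fits it again after adding one outlier. All of the numbers are hypothetical and chosen only to make the effect visible.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    b = sum_xy / sum_xx
    return mean_y - b * mean_x, b


# Hypothetical data for illustration only (not the article's data set).
hours = [1.0, 2.0, 4.0, 5.0, 8.0]
grades = [40.0, 45.0, 60.0, 65.0, 90.0]

a_clean, b_clean = fit_line(hours, grades)

# One hypothetical outlier: a student who spent 20 hours but scored only 50.
a_out, b_out = fit_line(hours + [20.0], grades + [50.0])

print(f"without outlier: a = {a_clean:.2f}, b = {b_clean:.2f}")
print(f"with outlier:    a = {a_out:.2f}, b = {b_out:.2f}")
```

In this made-up case a single extra point flattens the slope dramatically, which is why it is worth plotting the data before trusting the fitted line.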


What is the equation for calculating a least squares regression line and how is it derived?
The equation for a least squares regression line is typically expressed as y = a + bx, where 'b' is the slope of the line (calculated as the covariance of x and y divided by the variance of x), and 'a' is the y-intercept (calculated as the mean of y minus 'b' times the mean of x). This equation is derived by minimizing the sum of the squares of the vertical deviations from each data point to the line (hence, "least squares").

How does the method of least squares help in creating the best-fitting line for a set of data points?
The method of least squares creates the best-fitting line by minimizing the sum of the squares of the residuals, which are the differences between the actual and predicted values. Under the standard assumptions of linear regression, this gives the best linear unbiased estimates of the intercept and slope, revealing the underlying trend or relationship between the variables.
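The minimizing property can be checked numerically. In the sketch below, which uses made-up data and our own helper names, the fitted line's sum of squared residuals is compared with that of a few lines whose intercept or slope has been nudged.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    b = sum_xy / sum_xx
    return mean_y - b * mean_x, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances between the points and y = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only (not the article's data set).
hours = [1.0, 2.0, 4.0, 5.0, 8.0]
grades = [40.0, 45.0, 60.0, 65.0, 90.0]

a, b = fit_line(hours, grades)
best = sum_squared_residuals(hours, grades, a, b)

# Nudge the intercept and slope: every alternative line should fit worse.
nudges = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5), (0.0, -0.5), (2.0, -0.3)]
for da, db in nudges:
    other = sum_squared_residuals(hours, grades, a + da, b + db)
    print(f"a{da:+.1f}, b{db:+.1f}: {other:.2f} (fitted line: {best:.2f})")
```

Trying a handful of alternatives is not a proof, but it shows what "least squares" means in practice: no other straight line gives a smaller total.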

Can you provide a step-by-step example of calculating a least squares regression line?
To calculate a least squares regression line, follow these steps:
1. Calculate the means of x and y.
2. Calculate the deviations of each x and y from their means.
3. Multiply each x deviation by the corresponding y deviation and sum the products (this sum is proportional to the covariance of x and y).
4. Square each x deviation, then sum the squares (this sum is proportional to the variance of x).
5. Calculate the slope (b) as the sum from step 3 divided by the sum from step 4.
6. Calculate the y-intercept (a) as the mean of y minus b times the mean of x.

How can the least squares regression line be used to make predictions?
The least squares regression line can be used to make predictions by substituting the value of the independent variable (x) into the equation y = a + bx. The resulting 'y' value is the predicted dependent variable based on the regression line. This can be useful in forecasting trends, setting expectations, or understanding the relationship between variables.

What are some potential issues or limitations when using least squares regression and how can they be addressed?
While least squares regression is widely used, it has some limitations. It assumes linearity, constant variance and independence of observations, which may not always hold true. It is also sensitive to outliers. These issues can be addressed by examining residuals, conducting various statistical tests and considering robust or non-linear regression methods when appropriate. In the presence of multiple predictors, one might also need to consider potential multicollinearity.
