# Least Squares Regression Line – Definition and Procedure


Suppose that we have two variables X and Y, where X is the independent variable and Y is the dependent variable. For example, we may take X to be the number of hours spent studying per week and Y to be the marks obtained in the exam. If we assume that Y is related to X in a linear fashion, we may wish to find the equation that models this relationship. This is useful because the model will help us predict the value of Y for a given X.

For example, if the relationship between Y and X is given as Y = 10X + 2, then we can predict the value of Y for any X. Suppose X = 6, that is, the student spends six hours per week studying; then we predict that Y = 10×6 + 2 = 62, that is, the student will get 62 marks in the exam. Of course, a model will never be perfectly accurate and will always have some error. The linear model obtained so that these 'errors' are minimised is called the least squares regression line.

This is done by making the sum of the squares of the errors as small as possible ("least squares"). Note that we square the errors so that positive and negative errors do not cancel out; each error is taken into account and minimised.
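As a quick illustration of why we square, here is a small Python sketch with made-up residuals: their plain sum cancels to zero, while the sum of their squares does not.

```python
# Hypothetical residuals (errors), chosen for illustration only.
errors = [2.0, -2.0, 1.0, -1.0]

# A plain sum lets positive and negative errors cancel each other out.
plain_sum = sum(errors)

# Squaring first makes every error count, regardless of sign.
squared_sum = sum(e ** 2 for e in errors)

print(plain_sum)    # 0.0
print(squared_sum)  # 10.0
```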

A linear relationship is written as Y = mX + c, so we need to obtain the coefficients m and c. The least squares formulas for these coefficients are:

m = [N∑xy − (∑x)(∑y)] / [N∑x² − (∑x)²] and

c = (∑y − m∑x) / N

Let us see how to apply these formulas by means of an example. Suppose we are given three pairs of values for x and y, so N = 3.

We then calculate the required quantities from the data: ∑x = 15, ∑y = 20, ∑xy = 117 and ∑x² = 89. Substituting all of this into the formulas above, we get:

m = (3×117 − 15×20) / (3×89 − 15²) = (351 − 300) / (267 − 225) = 51/42 ≈ 1.2143, and hence

c = (20 − 1.2143×15) / 3 ≈ 0.5952.

Hence the model is Y = 1.2143X + 0.5952.

Substituting any value for X, we obtain a 'prediction' for Y. For example, if X = 2, then the predicted value is Y = 1.2143×2 + 0.5952 = 3.0238.
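The whole worked example can be checked with a few lines of Python, using the summary quantities stated above (N = 3, ∑x = 15, ∑y = 20, ∑xy = 117, ∑x² = 89):

```python
# Summary quantities from the worked example above.
N = 3
sum_x, sum_y = 15, 20
sum_xy, sum_x2 = 117, 89

# Least squares formulas for the slope m and intercept c.
m = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / N

print(round(m, 4))          # 1.2143
print(round(c, 4))          # 0.5952
print(round(m * 2 + c, 4))  # prediction at X = 2 -> 3.0238
```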

Assumptions behind the least squares regression method:

The method given above is valid under the following assumptions:

1. The scatter plot of X and Y indicates that a line can be fitted, i.e. there is a linear relationship between the two variables.
2. The errors are normally distributed, with the variance of the errors being the same for every value of X. This is called homoscedasticity.
3. Each successive observation is independent of the others.

Multiple Regression:

It may be the case that Y depends on more than one variable. In such a case the linear model is called a multiple linear model. Once again, the coefficients of the model can be obtained by the least squares method, by minimising the sum of the squared errors.
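As a minimal sketch of this idea (not from the article itself), the following Python fits a model Y = b0 + b1·X1 + b2·X2 by forming the normal equations (XᵀX)b = Xᵀy and solving them with plain Gaussian elimination. The data are invented so that y = 1 + 2·x1 + 3·x2 holds exactly, so least squares should recover those coefficients.

```python
def solve(A, rhs):
    """Solve the small linear system A x = rhs by Gaussian elimination
    with partial pivoting (A given as a list of row lists)."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit(rows, y):
    """Least squares coefficients for y = b0 + b1*x1 + ... via the
    normal equations (X^T X) b = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column of 1s
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(p)]
    return solve(XtX, Xty)

# Made-up data: y = 1 + 2*x1 + 3*x2 exactly.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
b0, b1, b2 = fit(rows, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))  # 1.0 2.0 3.0
```

In practice one would use a library routine rather than hand-rolled elimination, but the structure above mirrors the one-variable case: set the derivative of the squared error to zero and solve the resulting linear system.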
