Linear Regression Implementation with Python

Submitted by 社会主义新天地 on 2020-10-12 04:00:51

Linear Regression for beginners with simple Python code

Linear regression is one of the most well-known and well-understood algorithms in machine learning and statistics. Before jumping into linear regression, let's recap the formula of a linear equation. We all know the general form of a linear equation: y = mx + c


Here x and y are two variables: x is the independent variable and y depends on x; m is the slope and c is the intercept.
This is the basic formula, and we will use it in linear regression in different forms. If we break down the name "Linear Regression", we find two words: "Linear" and "Regression". We all know about lines and linear equations, so now let's talk about regression.
What is Regression?
In fact, regression analysis is a predictive modeling technique that investigates the relationship between a dependent and an independent variable.
Uses of regression:
There are three major uses of regression analysis. They are
• Determining the strength of predictors
• Forecasting an effect
• Trend forecasting
So far we have gained some knowledge about linear equations and regression.
Now let's jump into linear regression. Start from the simple linear equation y = mx + c.
For linear regression, this simple equation is transformed into the following form:
y = β0 + β1x + ε
Here we have changed nothing but the variable names; we have simply switched to Greek letters, and added a new element, ε, at the end of the equation. Breaking the formula down: β0 is the same as c, the intercept, and β1 is the slope, the same as m. The new element ε is the error term, which I'll discuss later. For now, assume ε = 0, i.e. the error is 0, and let's see what the model looks like using the linear regression equation. Let β0 = 0, x1 = 2, x2 = 3, y1 = 6, y2 = 9. Both points satisfy y = 3x, so β1 = 3 and the line passes exactly through them.
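This little example can be checked with a few lines of Python (a minimal sketch; the values x1 = 2, y1 = 6, x2 = 3, y2 = 9 and β0 = 0 are the assumptions from above):

```python
# With b0 = 0 assumed, the slope through the two assumed points
# (2, 6) and (3, 9) is (9 - 6) / (3 - 2) = 3, so the line y = 3x
# passes exactly through both of them.
x1, y1 = 2, 6
x2, y2 = 3, 9
b1 = (y2 - y1) / (x2 - x1)  # slope
b0 = y1 - b1 * x1           # intercept, 0 as assumed
print(b1, b0)  # 3.0 0.0
```

With a perfect fit like this, the error term ε is exactly 0 for both points.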
But in the real world the scenario is not that simple: if we use more points, more (x, y) values, the line will not pass through all of them.


There will be some distance between our regression line and each data point, and this distance is called the error ε in linear regression.
Now I hope you get what ε is.
Our main goal is to reduce this error as much as possible.
Let's go to the equation of linear regression again:
y = β0 + β1x + ε
Suppose we have the dataset of X and Y given below:




X Y
1 3
2 4
3 2
4 4
5 5

If we plot the data in a scatter plot, the five points roughly follow an upward trend.
Now our task is to draw a regression line through the scatter plot such that the error is minimal.
Calculating mathematically, the slope of the line is m = β1 = Σ((x − x̄)(y − ȳ)) / Σ(x − x̄)².
So to find the slope, we need columns for (x − x̄), (x − x̄)², (y − ȳ), and (x − x̄)(y − ȳ) in our data table.




X   Y   (x − x̄)   (y − ȳ)   (x − x̄)²   (x − x̄)(y − ȳ)
1   3   −2        −0.6      4          1.2
2   4   −1        0.4       1          −0.4
3   2   0         −1.6      0          0
4   4   1         0.4       1          0.4
5   5   2         1.4       4          2.8
Here the mean of x is x̄ = 3 and Σ(x − x̄)² = 10.
The mean of y is ȳ = 3.6 and Σ(x − x̄)(y − ȳ) = 4.
So, β1 = 4/10 = 0.4
Now, β0 = ȳ − β1·x̄ = 3.6 − 0.4 × 3 = 3.6 − 1.2 = 2.4
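The hand calculation above can be reproduced in plain Python (a sketch using only the five (X, Y) pairs from the table):

```python
# Least-squares slope and intercept for the small table above
X = [1, 2, 3, 4, 5]
Y = [3, 4, 2, 4, 5]
mean_x = sum(X) / len(X)  # 3.0
mean_y = sum(Y) / len(Y)  # 3.6
numer = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # Σ(x-x̄)(y-ȳ) = 4
denom = sum((x - mean_x) ** 2 for x in X)                       # Σ(x-x̄)² = 10
b1 = numer / denom          # slope, 0.4
b0 = mean_y - b1 * mean_x   # intercept, 2.4
```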
Now let's draw the regression line y = 2.4 + 0.4x with these values.
Finally, we have drawn the regression line on our scatter plot.
Now the question is: how accurate is our model?
How do we measure that?
Let's try to find out.
There is a method called the R squared method. We will use it to determine how close the data are to our regression line.
R squared method:
The R squared value is a statistical measure of how close the data are to the fitted regression line.
It is also known as the coefficient of determination (or, with several predictors, the coefficient of multiple determination).
The equation is
R² = Σ(Yp − ȳ)² / Σ(y − ȳ)²
Here,
Yp = predicted value
y = actual value
ȳ = mean of the actual values
Here Σ(Yp − ȳ)² = 1.6 and Σ(y − ȳ)² = 5.2.
So,
R² = Σ(Yp − ȳ)² / Σ(y − ȳ)² = 1.6/5.2 ≈ 0.31
The closer R² is to 1, the better the model fits; here 0.31 means our toy line explains only about a third of the variation in y.
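The same R² calculation in Python (a sketch reusing the β0 = 2.4, β1 = 0.4 computed above):

```python
X = [1, 2, 3, 4, 5]
Y = [3, 4, 2, 4, 5]
b0, b1 = 2.4, 0.4         # intercept and slope from the worked example
mean_y = sum(Y) / len(Y)  # 3.6
# explained (regression) sum of squares and total sum of squares
ss_reg = sum((b0 + b1 * x - mean_y) ** 2 for x in X)  # 1.6
ss_tot = sum((y - mean_y) ** 2 for y in Y)            # 5.2
r2 = ss_reg / ss_tot  # ≈ 0.3077
```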
We are almost done; now it's time to implement the algorithm in Python.
For this we will need a dataset. I have used the "head brain" dataset here to implement linear regression.
You can download the dataset from this link: headbrain.csv download
One more thing: I've used a Jupyter notebook for the coding, but I don't know how to upload that file to this blog,
so you can see my code at this GitHub link: original code
# importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('headbrain.csv')
print(data.shape)
data.head()



X = data['Head Size(cm^3)'].values
Y = data['Brain Weight(grams)'].values
mean_x = np.mean(X)
mean_y = np.mean(Y)

m = len(X)

numer = 0
denom = 0

for i in range(m):
    numer += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
b1 = numer / denom           # slope
b0 = mean_y - b1 * mean_x    # intercept, or c in the y = mx + c equation

print(b1,b0)

0.26342933948939945
325.57342104944223


# plotting values
max_x = np.max(X) + 100
min_x = np.min(X) - 100

x = np.linspace(min_x, max_x, 1000)
y = b0 + b1 * x

plt.plot(x, y, color='red', label='Regression line')
plt.scatter(X, Y, color='blue', label='Scatter Plot')

plt.xlabel('Head size in cm3')
plt.ylabel('Brain Weight in gram')
plt.legend()
plt.show()


# checking goodness of fit (R squared)
ss_t = 0  # total sum of squares, Σ(y - ȳ)²
ss_r = 0  # regression sum of squares, Σ(y_pred - ȳ)²
for i in range(m):
    y_pred = b0 + b1 * X[i]
    ss_t += (Y[i] - mean_y) ** 2
    ss_r += (y_pred - mean_y) ** 2
r2 = ss_r / ss_t
print(r2)

0.6393117199570001

As we can see, the value is fairly close to 1, so our regression model fits reasonably well.
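As a quick sanity check (a sketch on the small toy table from earlier, assuming NumPy is available as in the code above), np.polyfit performs the same least-squares fit and should recover the same slope and intercept we computed by hand:

```python
import numpy as np

# Degree-1 polyfit is an ordinary least-squares line fit,
# so it should match the hand-computed b1 = 0.4 and b0 = 2.4.
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([3, 4, 2, 4, 5], dtype=float)
b1, b0 = np.polyfit(X, Y, 1)  # returns [slope, intercept]
```

The same check could be run on the head-brain data to confirm the loop-based implementation above.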
That's it for today. Actually, this is my first blog post and I haven't written before, so I'm worried about the quality of the writing. If you have any questions, feel free to ask; I'd also appreciate your advice and suggestions.
Thanks!
