Linear Regression: Machine Learning Techniques Implemented Using statsmodels and scikit-learn

Ashish Patil
15 min read · Nov 10, 2020

Today we are going to talk about linear regression, one of the most well known and well understood algorithms in machine learning. But before that, let's understand what regression analysis is and why we use it.

What is Regression Analysis?

Regression in statistics is the process of predicting a label (or dependent variable) based on the features (independent variables) at hand. Regression is used for time series modelling, for finding causal relationships between variables, and for forecasting. For example, the relationship between a company's stock price and various factors such as customer reputation and annual performance can be studied using regression.

Regression analysis is an important tool for analysing and modelling data. Here, we fit a curve/line to the data points in such a manner that the overall distance of the actual data points from the fitted curve/line is minimised.

The use of Regression

Regression analyses the relationship between two or more features. Let’s take an example:

Let’s suppose we want to make an application which predicts the chances of admission of a student to a foreign university. In that case, the benefits of using regression analysis are as follows:

  • It shows the significant relationships between the label (dependent variable) and the features (independent variables).
  • It shows the extent of the impact of multiple independent variables on the dependent variable.
  • It can also measure these effects even if the variables are on a different scale.

These features enable the data scientists to find the best set of independent variables for predictions.

Linear Regression

Linear Regression is one of the most fundamental and widely known Machine Learning Algorithms which people start with. Building blocks of a Linear Regression Model are:

  • Discrete/continuous independent variables
  • A best-fit regression line
  • A continuous dependent variable

That is, a Linear Regression model predicts the dependent variable using a regression line based on the independent variables. The equation of Linear Regression is:

Y = a + b·X + e

where a is the intercept, b is the slope of the line, and e is the error term. The equation above is used to predict the value of the target variable based on the given predictor variable(s).

Simple Linear Regression

Simple Linear regression is a method for predicting a quantitative response using a single feature (“input variable”).

The mathematical equation is:

y = β0 + β1x

What do the terms represent?

  • y is the response or the target variable
  • x is the feature
  • β1 is the coefficient of x
  • β0 is the intercept

β0 and β1 are the model coefficients. To create a model, we must “learn” the values of these coefficients. And once we have the values of these coefficients, we can use the model to predict the response (for example, Sales in the advertising dataset used later)!
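As a quick, hedged sketch of what “learning” these coefficients looks like in code, here is a minimal scikit-learn example on made-up numbers (the x and y values below are illustrative, not the advertising data used later):

# Minimal sketch: "learning" beta_0 and beta_1 with scikit-learn on toy numbers
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # single feature, shape (n, 1)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # response

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_[0])              # learned beta_0 and beta_1
print(model.predict([[6.0]]))                        # predict the response for a new x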

Multiple Linear Regression

Now, we’ll include multiple features and create a model to see the relationship between those features and the label column. This is called Multiple Linear Regression.

Each x represents a different feature, and each feature has its own coefficient. In this case:

y = β0 + β1x1 + β2x2 + … + βnxn

Assumptions

Let’s see the underlying assumptions (a quick diagnostic sketch follows this list):

  • The regression model is linear in terms of the coefficients and the error term.
  • The mean of the residuals is zero.
  • The error terms are not correlated with each other, i.e. given one error value, we cannot predict the next one.
  • The independent variables (x) are uncorrelated with the residual term, also termed exogeneity. In plain terms, the error term should not be predictable from the values of the independent variables.
  • The error terms have a constant variance, i.e. homoscedasticity.
  • No multicollinearity, i.e. the independent variables should not be correlated with or affect one another. If there is multicollinearity, the precision of the OLS estimates decreases.
  • The error terms are normally distributed.
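Several of these assumptions can be checked numerically once a model has been fitted. Below is a rough sketch using standard statsmodels diagnostics on a small made-up dataset (the data and the interpretation hints in the comments are illustrative assumptions, not part of the original example):

# Rough checks of some OLS assumptions with statsmodels diagnostics (toy data)
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))            # constant + 2 features
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=100)

results = sm.OLS(y, X).fit()
resid = results.resid

print("mean of residuals:", resid.mean())                        # should be ~0
print("Durbin-Watson:", durbin_watson(resid))                    # ~2 suggests no autocorrelation
print("Jarque-Bera p-value:", jarque_bera(resid)[1])             # normality of the errors
print("Breusch-Pagan p-value:", het_breuschpagan(resid, X)[1])   # homoscedasticity
print("VIFs:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])  # multicollinearity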

Estimating Model Coefficients

For coefficient estimation we mainly use three methods, depending on need:

  1. Ordinary Least Squares
  2. Generalised Least Squares
  3. Weighted Least Squares

Here we estimate using the least-squares criterion, i.e., we find the best-fit line that minimizes the sum of squared residuals (the “sum of squared errors”).
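All three estimators are available in statsmodels. The sketch below compares them on made-up data; the weights passed to WLS and the identity error covariance passed to GLS are illustrative assumptions only:

# OLS, GLS and WLS side by side on toy data (weights/covariance are illustrative)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
X = sm.add_constant(x)
y = 3.0 + 0.7 * x + rng.normal(scale=1.0, size=50)

ols = sm.OLS(y, X).fit()                             # equal weight for every observation
wls = sm.WLS(y, X, weights=1.0 / (1.0 + x)).fit()    # down-weights observations assumed noisier
gls = sm.GLS(y, X, sigma=np.eye(50)).fit()           # identity error covariance: reduces to OLS here

print(ols.params, wls.params, gls.params)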

Understanding the mathematics behind linear regression

We are going to focus on simple linear regression, which contains only one input variable, but the same logic and analysis extend to multi-variable linear regression. You are probably familiar with its form: y = β0 + β1x. However, you might feel confused about how the linear regression model is learnt. This section discusses some of the most commonly used techniques for training a linear regression model.

Solving linear regression using Ordinary Least Squares — general formula

A simple linear regression function can be written as:

y_i = β0 + β1·x_i + ε_i

We can obtain n equations for the n examples:

y_1 = β0 + β1·x_1 + ε_1
y_2 = β0 + β1·x_2 + ε_2
…
y_n = β0 + β1·x_n + ε_n

If we add the n equations together, we get:

Σy_i = n·β0 + β1·Σx_i + Σε_i

Because for linear regression the sum of the residuals is zero, we get:

ȳ = β0 + β1·x̄

We use the Ordinary Least Squares method, which aims to minimize the sum of the squared residuals. We define C to be the sum of the squared residuals:

C = Σ(y_i − β0 − β1·x_i)²

This is a quadratic minimization problem. To minimize C, we take the partial derivatives with respect to β0 and β1, set the results to 0, and we get:

Σ(y_i − β0 − β1·x_i) = 0      (1)
Σ x_i·(y_i − β0 − β1·x_i) = 0      (2)

Solving equations (1) and (2), we then get:

β1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
β0 = ȳ − β1·x̄
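These closed-form estimates are straightforward to compute directly. Here is a small sketch on made-up data, cross-checked against numpy's own least-squares fit (np.polyfit), which minimizes the same criterion:

# Closed-form simple-regression estimates from the formulas above (toy data)
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.8, size=100)

beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

# np.polyfit minimizes the same sum of squared residuals, so it should agree
slope, intercept = np.polyfit(x, y, 1)
print(beta_0, beta_1)
print(intercept, slope)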

Solving linear regression using Ordinary Least Squares — matrix formulation

It is, however, more computationally efficient to use matrices to define the linear regression model and to perform the subsequent analyses.

The n simple linear regression equations can be written out as:

y_i = β0 + β1·x_i + ε_i,  i = 1, …, n

In matrix formulation, the linear regression model can be rewritten as:

Y = Xβ + ε

where Y is the n×1 vector of responses, X is the n×2 design matrix whose first column is all ones and whose second column holds x_1, …, x_n, β = (β0, β1)ᵀ, and ε is the n×1 vector of errors.

In the above section, equations (1) and (2) state that:

Σ(y_i − β0 − β1·x_i) = 0  and  Σ x_i·(y_i − β0 − β1·x_i) = 0

We know:

Y − Xβ is the n×1 vector of residuals, and Xᵀ has a first row of ones and a second row of x_1, …, x_n.

We can see that equations (1) and (2) are equivalent to the following matrix formula:

Xᵀ(Y − Xβ) = 0

Thus, we get:

β̂ = (XᵀX)⁻¹XᵀY

Thus, solving the linear regression problem with least squares normally requires that all of the data be available and that your computer have enough memory to hold the data and perform the matrix operations.
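As a hedged illustration, the matrix solution β̂ = (XᵀX)⁻¹XᵀY can be computed in a few lines of numpy on made-up data (solving the normal equations rather than forming the inverse explicitly):

# Normal-equation solution beta_hat = (X'X)^(-1) X'y on toy data
import numpy as np

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])   # constant + 3 features
true_beta = np.array([1.5, 2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.2, size=200)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations directly
print(beta_hat)                                # should be close to true_beta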

Solving linear regression using Gradient Descent

You may, however, have a very large dataset: it may contain a lot of examples (rows) or a lot of features (columns). Either way, the matrix representing the dataset will be large and may not fit into memory, so it is better to use another method to train the linear regression model. We’ll talk about gradient descent in this section.

We first define a cost function which measures how well the regression line fits the data. The cost function E(β) is defined as:

E(β) = ½ · Σ(y_i − β0 − β1·x_i)²

The goal of the linear regression model is to minimize the cost function. Gradient descent search will determine a weight vector B that minimizes E by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps. At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface. This process continues until the global minimum error is reached.

So, in what direction should we change the weight vector B to move towards minimizing the cost function? If we change a small amount Δβ0 in the β0 direction and a small amount Δβ1 in the β1 direction, then E changes as follows:

ΔE ≈ (∂E/∂β0)·Δβ0 + (∂E/∂β1)·Δβ1

We rewrite this as:

ΔE ≈ ∇E · Δβ

where

∇E = (∂E/∂β0, ∂E/∂β1)  and  Δβ = (Δβ0, Δβ1)

If we choose:

Δβ = −η·∇E,  with η a small positive number (the learning rate)

then we get:

ΔE ≈ −η·‖∇E‖² ≤ 0

This makes sure that the error E always decreases, which is the property we want. Hence, the update rule for β0 and β1 is:

β0 ← β0 − η·∂E/∂β0
β1 ← β1 − η·∂E/∂β1

We have:

∂E/∂β0 = −Σ(y_i − β0 − β1·x_i)
∂E/∂β1 = −Σ x_i·(y_i − β0 − β1·x_i)

We get:

β0 ← β0 + η·Σ(y_i − β0 − β1·x_i)
β1 ← β1 + η·Σ x_i·(y_i − β0 − β1·x_i)

The learning rate η controls how quickly we want the weights to move towards the minimum. The weights are updated until a minimum sum squared error is achieved or no further improvement is possible.
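A minimal sketch of this update rule for simple linear regression on made-up data follows; the learning rate, iteration count and the 1/n scaling of the gradients are assumptions of the sketch, not prescriptions:

# Gradient-descent sketch for simple linear regression on toy data
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=200)
y = 4.0 + 1.5 * x + rng.normal(scale=0.5, size=200)

beta_0, beta_1 = 0.0, 0.0      # arbitrary initial weights
eta = 0.01                     # learning rate (assumed)

for _ in range(5000):
    error = (beta_0 + beta_1 * x) - y
    # gradients of the cost, scaled by 1/n here for numerical stability
    grad_0 = error.mean()
    grad_1 = (error * x).mean()
    beta_0 -= eta * grad_0
    beta_1 -= eta * grad_1

print(beta_0, beta_1)          # should approach the true values 4.0 and 1.5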

Let’s Perform Linear Regression

Using Statsmodels library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# First import our dataset
data=pd.read_csv('Advertising.csv')
data.head()

What are the features?

  • TV: Advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
  • Radio: Advertising dollars spent on Radio
  • Newspaper: Advertising dollars spent on Newspaper

What is the response?

  • Sales: sales of a single product in a given market (in thousands of widgets)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200 non-null int64
1 TV 200 non-null float64
2 radio 200 non-null float64
3 newspaper 200 non-null float64
4 sales 200 non-null float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB

Our dataset contains five columns: Unnamed: 0, TV, radio, newspaper and sales. The variable Unnamed: 0 is just a serial number, which is not useful for predicting sales, so we remove it. There are no missing values in our dataset.

data=data.drop(columns=['Unnamed: 0'],axis=1)
data.head()

Visualize the relationship between the dependent and independent variables

# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
sns.scatterplot(x='TV', y='sales',data=data, ax=axs[0])
sns.scatterplot(x='radio', y='sales',data=data, ax=axs[1])
sns.scatterplot(x='newspaper', y='sales',data=data, ax=axs[2]);

In the scatterplots above we see a roughly linear relationship between the dependent variable and the independent variables. The relationships of TV and radio ads with sales look the most linear.

We perform a statistical analysis on the Advertising data to predict future sales.

Hypothesis

  • Null hypothesis: No relationship exists between the dependent variable and the independent variables.
  • Alternative hypothesis: A relationship exists between the dependent variable and the independent variables.
# create X and y variable
X=data.drop(['sales'],axis=1)
y=data.sales
X=sm.add_constant(X)

lm=sm.OLS(y,X).fit()
print(lm.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Mon, 09 Nov 2020   Prob (F-statistic):           1.58e-96
Time:                        23:17:21   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.9389      0.312      9.422      0.000       2.324       3.554
TV             0.0458      0.001     32.809      0.000       0.043       0.049
radio          0.1885      0.009     21.893      0.000       0.172       0.206
newspaper     -0.0010      0.006     -0.177      0.860      -0.013       0.011
==============================================================================
Omnibus:                       60.414   Durbin-Watson:                   2.084
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              151.241
Skew:                          -1.327   Prob(JB):                     1.44e-33
Kurtosis:                       6.332   Cond. No.                         454.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Conclusions:

  1. From the results above we see that the model's Adj. R-squared is 0.896, which is very good. But there is a problem with the p-value of the newspaper variable.
  2. The p-values for TV and radio are less than the significance level of 0.05, but the p-value for newspaper is greater than 0.05, so we fail to reject the null hypothesis for newspaper. Hence we build a new model without the newspaper variable.
X=data.drop(['sales','newspaper'],axis=1)
y=data.sales
X=sm.add_constant(X)
lm1=sm.OLS(y,X).fit()
print(lm1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     859.6
Date:                Mon, 09 Nov 2020   Prob (F-statistic):           4.83e-98
Time:                        23:17:22   Log-Likelihood:                -386.20
No. Observations:                 200   AIC:                             778.4
Df Residuals:                     197   BIC:                             788.3
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.9211      0.294      9.919      0.000       2.340       3.502
TV             0.0458      0.001     32.909      0.000       0.043       0.048
radio          0.1880      0.008     23.382      0.000       0.172       0.204
==============================================================================
Omnibus:                       60.022   Durbin-Watson:                   2.081
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              148.679
Skew:                          -1.323   Prob(JB):                     5.19e-33
Kurtosis:                       6.292   Cond. No.                         425.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(lm1, 'TV', fig=fig)

Y and Fitted vs X

  • This plot shows the relationship between the fitted values and the independent variable, along with a confidence interval.

Residuals vs TV

  • The residuals-vs-TV plot gives an idea of whether heteroscedasticity is present in the data. If the residuals form a linear, funnel or other systematic shape, heteroscedasticity is present. In our case the residuals look homoscedastic.

Partial Regression Plot

  • The partial regression plot shows the relationship between sales and TV ads after removing the linear effect of radio. The slope of the partial regression plot is the same as the coefficient of TV in the multiple regression model.

CCPR Plot (component and component-plus-residual plot)

  • The CCPR plot provides a way to judge the effect of one regressor on the response variable while taking into account the effects of the other independent variables. The partial residuals plot is defined as Residuals + BᵢXᵢ versus Xᵢ. The component adds BᵢXᵢ versus Xᵢ to show where the fitted line would lie. The same quantity can be computed by hand, as sketched below.
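As a short hedged sketch, the component-plus-residual values for TV can be computed directly from the lm1 results object and data frame defined above:

# Component-plus-residual values for TV, computed from the lm1 model fitted above
partial_resid_tv = lm1.resid + lm1.params['TV'] * data['TV']   # Residuals + B_i * X_i
plt.scatter(data['TV'], partial_resid_tv)
plt.xlabel('TV')
plt.ylabel('Residuals + B_TV * TV');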
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(lm1, 'radio', fig=fig)

Conclusions:

  1. From the results above we see that the model's Adj. R-squared is 0.896, the same as before.
  2. Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are both lower than in the previous model, so the current model is the better fit for our data.

Interpreting the model

How do we interpret the coefficients for spend on TV ads and radio ads?

  • A “unit” increase in spend on TV ads is associated with a 0.0458 “unit” increase in sales.
  • In other words, an additional $1,000 spent on TV ads translates to roughly 45.8 additional units sold (sales are measured in thousands of widgets).
  • A “unit” increase in spend on radio ads is associated with a 0.1880 “unit” increase in sales.
  • In other words, an additional $1,000 spent on radio ads translates to roughly 188 additional units sold. A small worked example follows.
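Plugging a hypothetical budget into the fitted equation from lm1 makes this concrete (the budget figures below are made up for illustration):

# Worked example: predicted sales for an assumed budget of $100k on TV, $25k on radio
tv_spend, radio_spend = 100, 25                    # in thousands of dollars (made-up values)
predicted_sales = 2.9211 + 0.0458 * tv_spend + 0.1880 * radio_spend
print(predicted_sales)                             # about 12.2, i.e. roughly 12,200 widgets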

Using Scikit-learn Library

Here we predict the chance of getting admission based on various independent variables such as GRE score, TOEFL score, University Rating, SOP, LOR, CGPA and Research.

#Let's start with importing necessary libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge,Lasso,RidgeCV, LassoCV, ElasticNet, ElasticNetCV, LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
sns.set()
data=pd.read_csv('Admission_Prediction.csv')
data.head()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Serial No. 500 non-null int64
1 GRE Score 485 non-null float64
2 TOEFL Score 490 non-null float64
3 University Rating 485 non-null float64
4 SOP 500 non-null float64
5 LOR 500 non-null float64
6 CGPA 500 non-null float64
7 Research 500 non-null int64
8 Chance of Admit 500 non-null float64
dtypes: float64(7), int64(2)
memory usage: 35.3 KB
sns.heatmap(data.isnull(),cbar=False,yticklabels=False);

The heatmap shows that missing values are present in GRE Score, TOEFL Score and University Rating.

data.isnull().sum()

Serial No.            0
GRE Score            15
TOEFL Score          10
University Rating    15
SOP                   0
LOR                   0
CGPA                  0
Research              0
Chance of Admit       0
dtype: int64
data['University Rating'] = data['University Rating'].fillna(data['University Rating'].mode()[0])
data['TOEFL Score'] = data['TOEFL Score'].fillna(data['TOEFL Score'].mean())
data['GRE Score'] = data['GRE Score'].fillna(data['GRE Score'].mean())
data.describe()

Looking at the data, there are no missing values any more. Also, the first column is just a serial number, so we don't need it. Let's drop it from the data.

data= data.drop(columns = ['Serial No.'])
data.head()

Let’s visualize the data and analyze the relationship between independent and dependent variables:

# let's see how data is distributed for every column
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1

for column in data:
    if plotnumber<=16 :
        ax = plt.subplot(4,4,plotnumber)
        sns.distplot(data[column])
        plt.xlabel(column,fontsize=20)
        #plt.ylabel('Salary',fontsize=20)
    plotnumber+=1
plt.tight_layout()

The data distributions look decent enough and there doesn't seem to be any skewness. Great, let's go ahead!

Let’s observe the relationship between independent variables and dependent variable.

y = data['Chance of Admit']
X = data.drop(columns = ['Chance of Admit'])

plt.figure(figsize=(20,30), facecolor='white')
plotnumber = 1

for column in X:
    if plotnumber<=15 :
        ax = plt.subplot(5,3,plotnumber)
        plt.scatter(X[column],y)
        plt.xlabel(column,fontsize=20)
        plt.ylabel('Chance of Admit',fontsize=20)
    plotnumber+=1
plt.tight_layout()

Great, the relationships between the dependent and independent variables look fairly linear. Thus, our linearity assumption is satisfied.

Let’s move ahead and check for multicollinearity.

First we scale the data and then proceed further.

scaler =StandardScaler()

X_scaled = scaler.fit_transform(X)
from statsmodels.stats.outliers_influence import variance_inflation_factor
variables = X_scaled

# we create a new data frame which will include all the VIFs
# note that each variable has its own variance inflation factor as this measure is variable specific (not model specific)
# we do not include categorical variables in the multicollinearity check as they do not provide as much information as numerical ones
vif = pd.DataFrame()

# here we make use of the variance_inflation_factor, which will basically output the respective VIFs
vif["VIF"] = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
# Finally, I like to include names so it is easier to explore the result
vif["Features"] = X.columns
vif

As a rule of thumb, a VIF value greater than 5 indicates severe multicollinearity. We don't have any VIF greater than 5, so we are good to go.

Great. Let’s go ahead and use linear regression and see how well it fits our data. But first, let’s split our data into train and test sets.

x_train,x_test,y_train,y_test = train_test_split(X_scaled,y,test_size = 0.25,random_state=355)

regression = LinearRegression()
regression.fit(x_train,y_train)

LinearRegression()

regression.score(x_train,y_train)

0.8415250484247909

# Let's create a function to calculate adjusted R-squared
def adj_r2(x,y):
    r2 = regression.score(x,y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2

adj_r2(x_train,y_train)

0.8385023654247188

Our r2 score is 84.15% and adj r2 is 83.85% for our training set.

Let's check how well the model fits the test data, and whether it is overfitting.

regression.score(x_test,y_test)

0.7534898831471066

adj_r2(x_test,y_test)

0.7387414146174464

So our model's r2 score is lower on the test data. This could be a sign of overfitting.

Let’s see if our model is overfitting our training data.

sns.heatmap(data.corr(), vmin=-1, vmax=1, annot=True,yticklabels=False);

The correlations between our independent variables and the dependent variable are high, so we use Lasso regularization, which can shrink to zero the coefficients of variables that cause overfitting.

# Lasso Regularization
# LassoCV will return the best alpha and coefficients after performing 10-fold cross validation
lasscv = LassoCV(alphas = None,cv =10, max_iter = 100000, normalize = True)
lasscv.fit(x_train, y_train)

LassoCV(cv=10, max_iter=100000, normalize=True)

# best alpha parameter
alpha = lasscv.alpha_
alpha

3.0341655445178153e-05

# now that we have the best parameter, let's use Lasso regression and see how well our data fits
lasso_reg = Lasso(alpha)
lasso_reg.fit(x_train, y_train)

Lasso(alpha=3.0341655445178153e-05)

lasso_reg.score(x_test, y_test)

0.7534654960492284

Our r2 score for the test data (75.34%) comes out the same as before regularization, so it is fair to say our OLS model did not overfit the data. Let's save this model to the local file system and use it for prediction and deployment.

# saving the model to the local file system
filename = 'finalized_model.pickle'
pickle.dump(regression, open(filename, 'wb'))
filename1='scaler_model.pickle'
pickle.dump(scaler,open(filename1,'wb'))
# prediction using the saved model
loaded_model = pickle.load(open(filename, 'rb'))
a=loaded_model.predict(scaler.transform([[300,110,5,5,5,10,1]]))
a

You can use this saved model for web deployment and make predictions without running a single line of code: just fill in the values and press enter. To learn how to deploy the model, follow me.

Download dataset — https://github.com/Rudra-stat/LinearRegression
