Multiple Linear Regression using Python
In the previous article, we studied Logistic Regression. I believe that a concept is easier to understand when we can relate it to ourselves or our lives, so I will try to explain everything by relating it to humans.
What is Regression? Types of Regression
When we should use Multiple Linear Regression?
Multiple Linear Regression is an extended version of simple Linear Regression. The most important difference is the number of features it can handle: Multiple Linear Regression can handle more than one feature. So, we should use Multiple Linear Regression when the dataset has more than one feature and the target depends approximately linearly on those features.
How do we calculate Multiple Linear Regression?
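The form of the linear regression equation does not change; only the number of coefficients increases. With one feature the model is y = m*x + b, and with n features it becomes y = b + m1*x1 + m2*x2 + ... + mn*xn, where each feature has its own coefficient.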
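To make the idea of one coefficient per feature concrete, here is a minimal NumPy sketch. The function name `predict` and the coefficient values are made up for illustration and do not come from any dataset in this article.

```python
import numpy as np

def predict(X, coefficients, intercept):
    """Return intercept + sum_j coefficients[j] * X[:, j] for every row of X."""
    X = np.asarray(X, dtype=float)
    return X @ np.asarray(coefficients, dtype=float) + intercept

# Hypothetical model with two features: y = 1.0 + 2.0*x1 + 3.0*x2
X = [[1.0, 1.0],
     [2.0, 0.0]]
y_hat = predict(X, coefficients=[2.0, 3.0], intercept=1.0)
print(y_hat)  # [6. 5.]
```

Each row of X is one observation, and the matrix product applies every coefficient to its own feature before the intercept is added.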
Advantages/Features of Multiple Linear Regression
- The chances of getting a better fit increase, as the generated model depends on more than one feature.
- The residuals of a fitted Multiple Linear Regression model can help flag outliers and anomalies, since unusual points tend to lie far from the fitted values.
Disadvantages/Shortcomings of Multiple Linear Regression
- Overfitting is a common problem here: if we use every available feature to build the model, it can start "memorizing" the training values instead of learning a general pattern.
- Accuracy decreases as the relationship in the dataset becomes less linear.
Multiple Linear Regression
Multiple linear regression (MLR), or multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of MLR is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.
In essence, multiple regression is the extension of ordinary least-squares (OLS) regression that involves more than one explanatory variable.
Simple linear regression is a method that allows an analyst or statistician to make predictions about one variable based on the information that is known about another variable. Linear regression can only be used when one has two continuous variables—an independent variable and a dependent variable. The independent variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression model extends to several explanatory variables.
The multiple regression model is based on the following assumptions:
- Linearity: There is a linear relationship between the dependent variable and the independent variables.
- Correlation: The independent variables are not too highly correlated with each other.
- Independence: The observations yi are selected independently and randomly from the population.
- Normal Distribution: Residuals should be normally distributed with a mean of 0 and variance σ².
When interpreting the results of multiple regression, beta coefficients are valid while holding all other variables constant ("all else equal"). The output from a multiple regression can be displayed horizontally as an equation, or vertically in table form.
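The "all else equal" reading of a coefficient can be checked numerically. The sketch below uses a hypothetical fitted model (the intercept and coefficients are invented for illustration) and shows that raising one feature by one unit, while the other stays fixed, changes the prediction by exactly that feature's coefficient.

```python
import numpy as np

# Hypothetical fitted model: y = 10 + 4*x1 - 2*x2
intercept = 10.0
coefficients = np.array([4.0, -2.0])

def predict(x):
    """Prediction for a single observation x = [x1, x2]."""
    return intercept + coefficients @ np.asarray(x, dtype=float)

base = predict([3.0, 5.0])      # 10 + 12 - 10 = 12
bumped = predict([4.0, 5.0])    # x1 raised by one unit, x2 held constant
effect_x1 = bumped - base
print(effect_x1)  # 4.0
```

The same check with x2 raised by one unit would give -2.0, the coefficient of x2.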
Multiple Linear Regression Example
The examples below use small tabular datasets: stock market data defined directly in the code, a housing file (home.txt), and the Boston housing data (boston.csv). Feel free to use any dataset; there are some very good datasets available on Kaggle and with Google Colab.
Before we start with this, it is highly recommended you read the following tutorials
- Python Pandas
- Python Numpy
- Python Scikit Learn
- Python MatPlotLib
- Python Seaborn
- Python Tensorflow
1. Using SkLearn
- from pandas import DataFrame
- from sklearn import linear_model
- import statsmodels.api as sm
In the above code, we import the required python libraries.
- Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
- 'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
- 'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
- 'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
- 'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
- }
In the above code, we are defining our data.
- df = DataFrame(Stock_Market,columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])
-
- X = df[['Interest_Rate','Unemployment_Rate']]
- Y = df['Stock_Index_Price']
In the above code, we are pre-processing the data.
- regr = linear_model.LinearRegression()
- regr.fit(X, Y)
In the above code, we are generating the model
- print('Intercept: \n', regr.intercept_)
- print('Coefficients: \n', regr.coef_)
In the above code, we are printing the parameters of the generated model
The output that I am getting is:
Intercept: 1798.4039776258546
Coefficients: [ 345.54008701 -250.14657137]
-
- New_Interest_Rate = 2.75
- New_Unemployment_Rate = 5.3
- print ('Predicted Stock Index Price: \n', regr.predict([[New_Interest_Rate ,New_Unemployment_Rate]]))
In the above code, we are predicting the stock price corresponding to the given feature values.
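As a quick sanity check, the prediction can be reproduced by hand from the intercept and coefficients printed above. This is only a sketch of the arithmetic; the variable names `new_point` and `manual_prediction` are introduced here for illustration.

```python
import numpy as np

# Parameters copied from the output printed above
intercept = 1798.4039776258546
coefficients = np.array([345.54008701, -250.14657137])

# Feature order matches X: Interest_Rate, Unemployment_Rate
new_point = np.array([2.75, 5.3])

manual_prediction = intercept + coefficients @ new_point
print(round(manual_prediction, 2))  # 1422.86
```

The result agrees with the predicted stock index price that the full script prints.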
MLR_SkLearn.py
- from pandas import DataFrame
- from sklearn import linear_model
- import statsmodels.api as sm
-
- Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
- 'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
- 'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
- 'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
- 'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
- }
-
- df = DataFrame(Stock_Market,columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])
-
- X = df[['Interest_Rate','Unemployment_Rate']]
- Y = df['Stock_Index_Price']
-
-
- regr = linear_model.LinearRegression()
- regr.fit(X, Y)
-
- print('Intercept: \n', regr.intercept_)
- print('Coefficients: \n', regr.coef_)
-
-
- New_Interest_Rate = 2.75
- New_Unemployment_Rate = 5.3
- print ('Predicted Stock Index Price: \n', regr.predict([[New_Interest_Rate ,New_Unemployment_Rate]]))
Output
Intercept: 1798.4039776258546
Coefficients: [ 345.54008701 -250.14657137]
Predicted Stock Index Price: [1422.86238865]
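To print a detailed statistical summary, we can fit the same model with statsmodels:
- X = sm.add_constant(X)  # statsmodels does not add an intercept term by default
- model = sm.OLS(Y, X).fit()
- print_model = model.summary()
- print(print_model)
The summary table reports the fitted coefficients together with statistics such as R-squared and p-values.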
2. Using NumPy
- import numpy as np
- import pandas as pd
- import matplotlib.pyplot as plt
- import seaborn as sns
In the above code, we are importing the necessary libraries.
- my_data = pd.read_csv('home.txt',names=["size","bedroom","price"])
In the above code, we are importing the data. You can download the "home.txt" file from the article.
-
- my_data = (my_data - my_data.mean())/my_data.std()
-
-
- X = my_data.iloc[:,0:2]
- ones = np.ones([X.shape[0],1])
- X = np.concatenate((ones,X),axis=1)
-
- y = my_data.iloc[:,2:3].values
- theta = np.zeros([1,3])
In the above code, we are preprocessing the data.
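To see what the preprocessing produces, the sketch below runs the same steps on a tiny made-up table. The values are hypothetical stand-ins for home.txt, chosen so that the standardized columns are easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for home.txt (values are made up)
toy = pd.DataFrame({"size": [1000.0, 1500.0, 2000.0],
                    "bedroom": [2.0, 3.0, 4.0],
                    "price": [200000.0, 300000.0, 400000.0]})

# z-score each column: subtract the mean, divide by the standard deviation
scaled = (toy - toy.mean()) / toy.std()

X = scaled.iloc[:, 0:2].values
X = np.concatenate((np.ones([X.shape[0], 1]), X), axis=1)  # prepend the bias column
y = scaled.iloc[:, 2:3].values
theta = np.zeros([1, 3])  # one parameter for the bias and one per feature

print(X.shape, y.shape, theta.shape)  # (3, 3) (3, 1) (1, 3)
```

The column of ones lets the first entry of theta act as the intercept, so the whole model is a single matrix product.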
Let us visualize the data using a heatmap.
- sns.heatmap(X)
-
- def computeCost(X,y,theta):
-     # squared error between predictions and targets
-     tobesummed = np.power(((X @ theta.T)-y),2)
-     return np.sum(tobesummed)/(2 * len(X))
-
- def gradientDescent(X,y,theta,iters,alpha):
-     cost = np.zeros(iters)
-     for i in range(iters):
-         # batch update: move theta against the average gradient
-         theta = theta - (alpha/len(X)) * np.sum(X * (X @ theta.T - y), axis=0)
-         cost[i] = computeCost(X, y, theta)
-
-     return theta,cost
In the above code, we are defining the methods for finding the cost and for gradient descent
-
- alpha = 0.01
- iters = 1000
In the above code, we are setting the value for the hyperparameters.
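The learning rate strongly affects how far gradient descent gets in a fixed number of iterations. The sketch below illustrates this on synthetic, noiseless data; the data, the helper name `final_cost`, and the candidate learning rates are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Bias column plus two synthetic features
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
# Noiseless synthetic target built from known parameters
y = (X @ np.array([0.0, 1.5, -0.5])).reshape(-1, 1)

def final_cost(alpha, iters):
    """Run batch gradient descent from zeros and return the final cost."""
    theta = np.zeros((1, 3))
    for _ in range(iters):
        theta = theta - (alpha / len(X)) * np.sum(X * (X @ theta.T - y), axis=0)
    return np.sum((X @ theta.T - y) ** 2) / (2 * len(X))

for alpha in (0.001, 0.01, 0.1):
    print(alpha, final_cost(alpha, iters=1000))
```

On this data the smaller learning rates leave a higher cost after 1000 iterations; a rate that is too large can instead make the updates diverge, so the value usually needs some experimentation.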
- g,cost = gradientDescent(X,y,theta,iters,alpha)
- print(g)
-
- finalCost = computeCost(X,y,g)
- print(finalCost)
In the above code, we are calling the methods for fitting the model
The output that I am getting is
[[-1.10868761e-16 8.78503652e-01 -4.69166570e-02]]
0.13070336960771892
- fig, ax = plt.subplots()
- ax.plot(np.arange(iters), cost, 'r')
- ax.set_xlabel('Iterations')
- ax.set_ylabel('Cost')
- ax.set_title('Error vs. Training Epoch')
In the above code, we are generating the graph of Error vs Training Epochs
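One way to sanity-check a gradient descent implementation is to compare it with a closed-form least-squares solution. The sketch below does this on synthetic data using NumPy's `np.linalg.lstsq`; the data and the hyperparameter values are assumptions for illustration, not the home.txt setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
features = rng.normal(size=(m, 2))
# Synthetic target: known weights plus a little noise
y = 0.5 + features @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=m)

X = np.column_stack([np.ones(m), features])  # bias column first
y = y.reshape(-1, 1)

# Batch gradient descent, same update rule as in the article
theta = np.zeros((1, 3))
alpha, iters = 0.1, 2000
for _ in range(iters):
    gradient = np.sum(X * (X @ theta.T - y), axis=0) / m
    theta = theta - alpha * gradient

# Closed-form least-squares solution for comparison
theta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta.ravel())
print(theta_closed.ravel())
```

If the two parameter vectors agree closely, the gradient descent loop has converged to the least-squares fit.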
MLR_NumPy.py
- import numpy as np
- import pandas as pd
- import matplotlib.pyplot as plt
- import seaborn as sns
-
- my_data = pd.read_csv('home.txt',names=["size","bedroom","price"])
-
-
- my_data = (my_data - my_data.mean())/my_data.std()
-
-
- X = my_data.iloc[:,0:2]
- ones = np.ones([X.shape[0],1])
- X = np.concatenate((ones,X),axis=1)
-
- y = my_data.iloc[:,2:3].values
- theta = np.zeros([1,3])
-
- sns.heatmap(X)
-
-
- def computeCost(X,y,theta):
-     # squared error between predictions and targets
-     tobesummed = np.power(((X @ theta.T)-y),2)
-     return np.sum(tobesummed)/(2 * len(X))
-
- def gradientDescent(X,y,theta,iters,alpha):
-     cost = np.zeros(iters)
-     for i in range(iters):
-         # batch update: move theta against the average gradient
-         theta = theta - (alpha/len(X)) * np.sum(X * (X @ theta.T - y), axis=0)
-         cost[i] = computeCost(X, y, theta)
-
-     return theta,cost
-
-
- alpha = 0.01
- iters = 1000
-
- g,cost = gradientDescent(X,y,theta,iters,alpha)
- print(g)
-
- finalCost = computeCost(X,y,g)
- print(finalCost)
-
- fig, ax = plt.subplots()
- ax.plot(np.arange(iters), cost, 'r')
- ax.set_xlabel('Iterations')
- ax.set_ylabel('Cost')
- ax.set_title('Error vs. Training Epoch')
3. Using TensorFlow
- import matplotlib.pyplot as plt
- import tensorflow as tf
- import tensorflow.contrib.learn as skflow
- from sklearn.utils import shuffle
- import numpy as np
- import pandas as pd
- import seaborn as sns
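In the above code, we are importing the required libraries. Note that this example uses the TensorFlow 1.x API (placeholders, sessions, and tensorflow.contrib), so it needs a TensorFlow 1.x installation to run as written.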
- df = pd.read_csv("boston.csv", header=0)
- print (df.describe())
In the above code, we are importing the dataset. You can download the dataset from Kaggle.
Let us visualize the data.
- f, ax1 = plt.subplots()
-
- y = df['MEDV']
-
- for i in range (1,8):
-     number = 420 + i
-     ax1.locator_params(nbins=3)
-     ax1 = plt.subplot(number)
-     plt.title(list(df)[i])
-     ax1.scatter(df[df.columns[i]],y)
- plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
-
- plt.show()
Let us visualize each dataset column separately.
- X = tf.placeholder("float", name="X")
- Y = tf.placeholder("float", name = "Y")
In the above code, we are defining the placeholders through which the input features (X) and the target values (Y) are fed into the graph.
- with tf.name_scope("Model"):
-
-     w = tf.Variable(tf.random_normal([2], stddev=0.01), name="b0")
-     b = tf.Variable(tf.random_normal([2], stddev=0.01), name="b1")
-
-     def model(X, w, b):
-         return tf.multiply(X, w) + b
-
-     y_model = model(X, w, b)
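In the above code, we are defining the model and its trainable variables, w and b. Note that tf.multiply is element-wise, so each of the two input features gets its own weight and bias.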
- with tf.name_scope("CostFunction"):
-     cost = tf.reduce_mean(tf.pow(Y-y_model, 2))
-
- train_op = tf.train.AdamOptimizer(0.001).minimize(cost)
In the above code, we are defining the cost function and the cost optimizer function.
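The cost used here is the mean squared error. As a framework-independent illustration, the sketch below computes the same quantity with NumPy; the function name `mse_cost` and the sample numbers are made up for this example.

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean of squared differences, mirroring reduce_mean(pow(Y - y_model, 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Errors are 1 and -2, so the cost is (1 + 4) / 2
print(mse_cost([3.0, 5.0], [2.0, 7.0]))  # 2.5
```

The optimizer's job is to adjust the model parameters so that this number becomes as small as possible.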
- sess = tf.Session()
- init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()
- tf.train.write_graph(sess.graph, '/home/bonnin/linear2','graph.pbtxt')
- cost_op = tf.summary.scalar("loss", cost)
- merged = tf.summary.merge_all()
- sess.run(init)
- writer = tf.summary.FileWriter('/home/bonnin/linear2', sess.graph)
In the above code, we create the graph file, which can be used to visualize the model on TensorBoard. Replace the directory path with one that exists on your machine.
- xvalues = df[[df.columns[2], df.columns[4]]].values.astype(float)
- yvalues = df[df.columns[12]].values.astype(float)
- b0temp=b.eval(session=sess)
- b1temp=w.eval(session=sess)
In the above code, we are selecting two feature columns and the target column from the dataset, and reading the initial values of the variables from the session.
- for a in range (1,50):
-     cost1=0.0
-     for i, j in zip(xvalues, yvalues):
-         sess.run(train_op, feed_dict={X: i, Y: j})
-         # accumulate the average cost against the target value j
-         cost1+=sess.run(cost, feed_dict={X: i, Y: j})/506.00
-     xvalues, yvalues = shuffle (xvalues, yvalues)
-     print ("Cost over iterations",cost1)
-     b0temp=b.eval(session=sess)
-     b1temp=w.eval(session=sess)
In the above code, we are training the model, one sample at a time, and reshuffling the data after every pass.
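The shape of this training loop (one update per sample, then a reshuffle after each pass) can be sketched without TensorFlow. The version below is an illustration under stated assumptions, not the article's exact model: it uses synthetic data, a dot product so that both features contribute to a single prediction, and plain stochastic gradient descent in place of the Adam optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(200, 2))
y = 1.0 + x @ np.array([2.0, -3.0])      # noiseless synthetic target

w = np.zeros(2)
b = 0.0
learning_rate = 0.05

for epoch in range(200):
    order = rng.permutation(len(x))      # reshuffle the samples every epoch
    epoch_cost = 0.0
    for i in order:
        error = (x[i] @ w + b) - y[i]
        # gradient of error**2 with respect to w and b
        w -= learning_rate * 2.0 * error * x[i]
        b -= learning_rate * 2.0 * error
        epoch_cost += error ** 2 / len(x)

print(w, b, epoch_cost)
```

Because the synthetic target is exactly linear, the learned w and b should end up close to the values used to generate it.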
- print("the final equation comes out to be", b0temp,"+",b1temp,"*X","\n Cost :",cost1)
In the above code, we are printing the model and the final cost.
The output that I am getting is
the final equation comes out to be [4.7545404 7.7991614] + [1.0045488 7.807921 ] *X
MLR_TensorFlow.py
- import matplotlib.pyplot as plt
- import tensorflow as tf
- import tensorflow.contrib.learn as skflow
- from sklearn.utils import shuffle
- import numpy as np
- import pandas as pd
-
- df = pd.read_csv("boston.csv", header=0)
- print (df.describe())
-
- f, ax1 = plt.subplots()
- import seaborn as sns
- sns.heatmap(df)
-
- y = df['MEDV']
-
- for i in range (1,8):
-     number = 420 + i
-     ax1.locator_params(nbins=3)
-     ax1 = plt.subplot(number)
-     plt.title(list(df)[i])
-     ax1.scatter(df[df.columns[i]],y)
- plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
-
- plt.show()
-
- X = tf.placeholder("float", name="X")
- Y = tf.placeholder("float", name = "Y")
-
- with tf.name_scope("Model"):
-
-     w = tf.Variable(tf.random_normal([2], stddev=0.01), name="b0")
-     b = tf.Variable(tf.random_normal([2], stddev=0.01), name="b1")
-
-     def model(X, w, b):
-         return tf.multiply(X, w) + b
-
-     y_model = model(X, w, b)
-
- with tf.name_scope("CostFunction"):
-     cost = tf.reduce_mean(tf.pow(Y-y_model, 2))
-
- train_op = tf.train.AdamOptimizer(0.001).minimize(cost)
-
-
- sess = tf.Session()
- init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()
- tf.train.write_graph(sess.graph, '/home/bonnin/linear2','graph.pbtxt')
- cost_op = tf.summary.scalar("loss", cost)
- merged = tf.summary.merge_all()
- sess.run(init)
- writer = tf.summary.FileWriter('/home/bonnin/linear2', sess.graph)
-
- xvalues = df[[df.columns[2], df.columns[4]]].values.astype(float)
- yvalues = df[df.columns[12]].values.astype(float)
- b0temp=b.eval(session=sess)
- b1temp=w.eval(session=sess)
-
- for a in range (1,50):
-     cost1=0.0
-     for i, j in zip(xvalues, yvalues):
-         sess.run(train_op, feed_dict={X: i, Y: j})
-         # accumulate the average cost against the target value j
-         cost1+=sess.run(cost, feed_dict={X: i, Y: j})/506.00
-     xvalues, yvalues = shuffle (xvalues, yvalues)
-     print ("Cost over iterations",cost1)
-     b0temp=b.eval(session=sess)
-     b1temp=w.eval(session=sess)
-
- print("the final equation comes out to be", b0temp,"+",b1temp,"*X")
Conclusion
In this article, we studied what regression is, the types of regression, when we should use multiple linear regression, how to calculate it, its advantages and disadvantages, and multiple linear regression examples using sklearn, NumPy, and TensorFlow. I hope you were able to understand everything. If you have any doubts, please leave your query in the comments.
In the next article, we will learn about the Decision Tree.
Congratulations!!! You have climbed your next step in becoming a successful ML Engineer.