Introduction
In the previous chapter, we studied machine learning and some key terms associated with it.
From this chapter onwards, we will start learning and implementing machine learning algorithms, beginning in this chapter with Linear Regression.
Note: if you can correlate anything with yourself or your life, there are greater chances of understanding the concept. So try to understand everything by relating it to humans.
What is Regression and Why is it called so?
Regression is an ML algorithm that can be trained to predict real numbered outputs, like temperature, stock price, etc. Regression is based on a hypothesis that can be linear, quadratic, polynomial, non-linear, etc. The hypothesis is a function of some hidden parameters and the input values. In the training phase, the hidden parameters are optimized with respect to the input values presented during training. The process that does this optimization is the gradient descent algorithm. If you are using neural networks, you also need the back-propagation algorithm to compute the gradient at each layer. Once the hypothesis parameters have been trained (i.e., they give the least error during training), the same hypothesis with the trained parameters is used with new input values to predict outcomes, which will again be real values.
Types of Regression
1. Linear regression is used for predictive analysis. It is a linear approach for modeling the relationship between a scalar response (the criterion) and one or more predictors or explanatory variables. Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. With linear regression, there is a danger of overfitting. The formula for linear regression is Y' = bX + A.
2. Logistic regression is used when the dependent variable is dichotomous. Logistic regression estimates the parameters of a logistic model and is a form of binomial regression. It is used to deal with data that has two possible outcomes and to model the relationship between those outcomes and the predictors. The equation for logistic regression is l = logit(p) = ln(p / (1 - p)) = β0 + β1*x1 + β2*x2 + ... + βn*xn.
3. Polynomial regression is used for curvilinear data. Polynomial regression is fit with the method of least squares. The goal of regression analysis is to model the expected value of a dependent variable y with respect to the independent variable x. The equation for polynomial regression is y = β0 + β1*x + β2*x^2 + ... + βn*x^n + ε.
4. Stepwise regression is used for building predictive regression models. It is carried out automatically: at each step, a variable is added to or removed from the set of explanatory variables based on some criterion. The approaches for stepwise regression are forward selection, backward elimination, and bidirectional elimination. There is no single closed-form formula for stepwise regression; at each step an ordinary regression model is fit on the currently selected variables.
5. Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. Ridge regression minimizes the sum of squared errors plus an L2 penalty on the coefficients: min Σ(yi - ŷi)^2 + λΣβj^2.
6. Lasso regression is a regression analysis method that performs both variable selection and regularization. Lasso regression uses soft thresholding and selects only a subset of the provided covariates for use in the final model. Lasso regression minimizes the sum of squared errors plus an L1 penalty on the coefficients: min Σ(yi - ŷi)^2 + λΣ|βj|.
7. ElasticNet regression is a regularized regression method that linearly combines the penalties of the lasso and ridge methods. It has been applied to support vector machines, metric learning, and portfolio optimization. The penalty function is given by λ1Σ|βj| + λ2Σβj^2. (A short sketch comparing ridge, lasso, and ElasticNet with ordinary least squares appears under the Regularization key term later in this chapter.)
8. Support Vector regression is a regression method in which we identify a hyperplane with maximum margin such that the maximum number of data points lie within that margin. SVR is very similar to the SVM classification algorithm. Instead of minimizing the error rate as in simple linear regression, we try to fit the error within a certain threshold. Our objective in SVR is to consider only the points that are within the margin, and our best-fit line is the hyperplane that contains the maximum number of points. (A short scikit-learn sketch of SVR, decision tree, and random forest regression appears after this list.)
9. Decision Tree regression is a regression method where the ID3 algorithm can be used to identify the splitting node by reducing the standard deviation (in classification, information gain is used).
A decision tree is built by partitioning the data into subsets containing instances with similar values (homogenous). Standard deviation is used to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.
10. Random Forest regression is an ensemble approach where we take into account the predictions of several decision regression trees.
- Select K random data points from the training set.
- Identify n where n is the number of decision tree regressors to be created. Repeat steps 1 and 2 to create several regression trees.
- The average of each branch is assigned to the leaf node in each decision tree.
- To predict the output for a new data point, the average of the predictions of all the decision trees is taken into consideration.
Random Forest prevents overfitting (which is common in decision trees) by creating random subsets of the features and building smaller trees using these subsets.
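The implementations differ library by library, but the following minimal scikit-learn sketch (my own addition, not part of the original text; the toy data and hyperparameters are illustrative assumptions) shows how the support vector, decision tree, and random forest regressors described above can be fit and compared:
- # Illustrative sketch only: toy data and arbitrary hyperparameters.
- import numpy as np
- from sklearn.svm import SVR
- from sklearn.tree import DecisionTreeRegressor
- from sklearn.ensemble import RandomForestRegressor
-
- rng = np.random.RandomState(0)
- X = np.linspace(0, 10, 80).reshape(-1, 1)           # single feature
- y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)      # noisy target
-
- models = [
-     SVR(kernel='rbf', C=1.0, epsilon=0.1),          # fits errors within an epsilon margin
-     DecisionTreeRegressor(max_depth=3),             # splits chosen to reduce variance
-     RandomForestRegressor(n_estimators=100),        # averages many decision trees
- ]
- for model in models:
-     model.fit(X, y)
-     print(type(model).__name__, model.predict([[2.5]]))
All three expose the same fit/predict interface, so they can be swapped into the later examples with minimal changes.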
Key Terms
1. Estimator
A formula or algorithm for generating estimates of parameters, given relevant data.
2. Bias
An estimate is unbiased if its expectation equals the value of the parameter being estimated; otherwise, it is biased.
3. Efficiency
An estimator A is more efficient than an estimator B if A has a smaller sampling variance — that is, if the particular values generated by A are more tightly clustered around their expectation.
4. Consistency
An estimator is consistent if the estimates it produces converge on the true parameter value as the sample size increases without limit. Consider an estimator that produces estimates θ^ of some parameter θ, and let ε denote a small positive number. If the estimator is consistent, we can make the probability that |θ^ - θ| < ε as close to 1.0 as we like by drawing a sufficiently large sample. Note that a biased estimator may nonetheless be consistent if the bias tends to zero in the limit. Conversely, an unbiased estimator may be inconsistent if its sampling variance fails to shrink appropriately as the sample size increases.
5. Standard error of the Regression (SER)
An estimate of the standard deviation of the error term in a regression model.
6. R-squared
A standardized measure of the goodness of fit for a regression model.
7. Standard error of regression coefficient
An estimate of the standard deviation of the sampling distribution for the coefficient in question.
8. P-value
The probability, supposing the null hypothesis to be true, of drawing sample data that are as adverse to the null as the data actually drawn, or more so. When a small p-value is found, there are two possibilities: either we happened to draw a low-probability, unrepresentative sample, or the null hypothesis is in fact false.
9. Significance level
For a hypothesis test, this is the threshold below which the p-value must fall for us to reject the null hypothesis. If we choose a significance level of 1%, we're saying that we'll reject the null if and only if the p-value for the test is less than 0.01. The significance level is also the probability of making a type 1 error (that is, rejecting a true null hypothesis).
10. T-test
The t-test (or z-test, which is the same thing asymptotically) is a common test for the null hypothesis that a particular regression parameter, βi, has some specific value (commonly zero, but generically βH0).
11. F-test
A common procedure for jointly testing a set of linear restrictions on a regression model.
12. Multicollinearity
A situation where there is a high degree of correlation among the independent variables in a regression model, or, more generally, where some of the Xs are close to being linear combinations of other Xs. Symptoms include large standard errors and the inability to produce precise parameter estimates. This is not a serious problem if one is primarily interested in forecasting; it is a problem if one is trying to estimate causal influences.
13. Omitted variable bias
Bias in the estimation of regression parameters that arises when a relevant independent variable is omitted from a model and the omitted variable is correlated with one or more of the included variables.
14. Log variables
A common transformation that permits the estimation of a nonlinear model using OLS is to substitute the natural log of a variable for the level of that variable. This can be done for the dependent variable and/or one or more independent variables. A key point to remember about logs is that for small changes, the change in the log of a variable is a good approximation to the proportional change in the variable itself. For example, if log(y) changes by 0.04, y changes by about 4%.
15. Quadratic terms
Another common transformation. When both xi and xi^2 are included as regressors, it is important to remember that the estimated effect of xi on y is given by the derivative of the regression equation with respect to xi. If the coefficient on xi is β and the coefficient on xi^2 is γ, the derivative is β + 2γxi.
16. Interaction terms
Pairwise products of the "original" independent variables. The inclusion of interaction terms in a regression allows for the possibility that the degree to which xi affects y depends on the value of some other variable xj. In other words, xj modulates the effect of xi on y. For example, the effect of experience (xi) on wages might depend on the gender (xj) of the worker.
17. Independent Variable
An independent variable is a variable that is changed or controlled in a scientific experiment to test the effects on the dependent variable.
18. Dependent Variable
A dependent variable is a variable being tested and measured in a scientific experiment. The dependent variable is 'dependent' on the independent variable. As the experimenter changes the independent variable, the effect on the dependent variable is observed and recorded.
19. Regularization
There are extensions of the training of the linear model called regularization methods. These seek both to minimize the sum of the squared error of the model on the training data (as ordinary least squares does) and to reduce the complexity of the model (for example, the number or absolute size of the coefficients in the model).
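As a concrete illustration of regularization (and of the ridge, lasso, and ElasticNet types listed earlier), here is a minimal sketch of my own, using synthetic data and illustrative penalty strengths, that compares how the penalized models shrink coefficients relative to ordinary least squares:
- # Illustrative sketch only: synthetic data and arbitrary penalty strengths.
- import numpy as np
- from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
-
- rng = np.random.RandomState(0)
- X = rng.randn(100, 5)                                # 5 features, only 2 truly matter
- y = 3*X[:, 0] + 0.5*X[:, 1] + rng.randn(100)
-
- for model in (LinearRegression(),
-               Ridge(alpha=1.0),
-               Lasso(alpha=0.1),
-               ElasticNet(alpha=0.1, l1_ratio=0.5)):
-     model.fit(X, y)
-     print(type(model).__name__, np.round(model.coef_, 2))
The lasso and ElasticNet typically drive the coefficients of the irrelevant features exactly to zero, while ridge only shrinks them toward zero.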
Gradient Descent
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.
1. Learning Rate
The size of these steps is called the learning rate. With a high learning rate, we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
2. Cost Function
A Loss Function or Cost Function tells us “how good” our model is at making predictions for a given set of parameters. The cost function has its own curve and its own gradients. The slope of this curve tells us how to update our parameters to make the model more accurate.
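As a minimal sketch (my own, assuming the straight-line model y = m*x + b that the next snippet also uses), a mean squared error cost function can be written as:
- def cost_function(m, b, X, Y):
-     # mean of the squared residuals for the line y = m*x + b
-     N = len(X)
-     total_error = 0.0
-     for i in range(N):
-         total_error += (Y[i] - (m*X[i] + b))**2
-     return total_error / N
The update_weights function below moves m and b in the direction that decreases exactly this cost.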
Python Implementation of Gradient Descent
- def update_weights(m, b, X, Y, learning_rate):
-     m_deriv = 0
-     b_deriv = 0
-     N = len(X)
-     for i in range(N):
-         # partial derivative with respect to m: -2x(y - (mx + b))
-         m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))
-         # partial derivative with respect to b: -2(y - (mx + b))
-         b_deriv += -2*(Y[i] - (m*X[i] + b))
-     # subtract because the derivatives point in the direction of steepest ascent
-     m -= (m_deriv / float(N)) * learning_rate
-     b -= (b_deriv / float(N)) * learning_rate
-
-     return m, b
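Here is a hypothetical usage example (not from the original text; the synthetic data and hyperparameters are my own assumptions) showing how update_weights would be applied repeatedly:
- import numpy as np
-
- # synthetic data scattered around the line y = 3x + 2
- X = np.arange(0, 10, 0.5)
- Y = 3*X + 2 + np.random.normal(0, 0.5, len(X))
-
- m, b = 0.0, 0.0
- for _ in range(1000):
-     m, b = update_weights(m, b, X, Y, learning_rate=0.01)
-
- print(m, b)    # should end up close to 3 and 2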
Remembering Variables With DRY MIX
When results are plotted in graphs, the convention is to use the independent variable as the x-axis and the dependent variable as the y-axis. The DRY MIX acronym can help keep the variables straight:
- D is the dependent variable
- R is the responding variable
- Y is the axis on which the dependent or responding variable is graphed (the vertical axis)
- M is the manipulated variable or the one that is changed in an experiment
- I is the independent variable
- X is the axis on which the independent or manipulated variable is graphed (the horizontal axis)
Simple Linear Regression Model
The simple linear regression model is represented like this: y = β0 + β1*x + ε
By mathematical convention, the two factors that are involved in simple linear regression analysis are designated x and y. The equation that describes how y is related to x is known as the regression model. The linear regression model also contains an error term, represented by ε, the Greek letter epsilon. The error term is used to account for the variability in y that cannot be explained by the linear relationship between x and y. There are also parameters that represent the population being studied. These parameters of the model are represented by β0 + β1*x.
The simple linear regression equation is graphed as a straight line.
The simple linear regression equation is represented like this: E(y) = β0 + β1*x
- β0 is the y-intercept of the regression line.
- β1 is the slope.
- E(y) is the mean or expected value of y for a given value of x.
A regression line can show a positive linear relationship, a negative linear relationship, or no relationship.
- If the graphed line in a simple linear regression is flat (not sloped), there is no relationship between the two variables.
- If the regression line slopes upward, with the lower end of the line at the y-intercept of the graph and the upper end extending up into the graph field away from the x-axis, a positive linear relationship exists.
- If the regression line slopes downward, with the upper end of the line at the y-intercept of the graph and the lower end extending down toward the x-axis, a negative linear relationship exists.
y = β0 +β1*x
Important Note
1. Regression analysis is not used to interpret cause-and-effect relationships (a mechanism where one event makes another event happen, i.e., each event depends on the other) between variables. Regression analysis can, however, indicate how variables are related or to what extent variables are associated with each other.
2. Simple linear regression is also known as bivariate regression, since it involves exactly two variables.
Sum of Square Errors
This is the sum of the squared differences between the data points and the regression line: SSE = Σ(yi - ŷi)^2, where ŷi is the value predicted by the line. It can serve as a measure of how well the line fits the data.
Standard Estimate of Errors
The mean error is equal to zero. If σε (sigma_epsilon), the standard deviation of the error term, is small, the errors tend to be close to zero (close to the mean error), and the model fits the data well. Therefore, we can use σε as a measure of the suitability of using a linear model. An estimator of σε is the standard error of estimate, sε = sqrt(SSE / (n - 2)).
Coefficient of Determination
To measure the strength of the linear relationship we use the coefficient of determination, R^2 = 1 - SSE / Σ(yi - ȳ)^2. R^2 takes on any value between zero and one.
- R^2 = 1: Perfect match between the line and the data points.
- R^2 = 0: There is no linear relationship between x and y.
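The following short sketch (my own illustration, with made-up numbers) computes SSE, the standard estimate of errors, and R^2 for a least-squares line fitted with NumPy:
- import numpy as np
-
- # illustrative data and a least-squares fit y_hat = b0 + b1*x
- x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
- y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
- b1, b0 = np.polyfit(x, y, 1)                 # slope, intercept
- y_hat = b0 + b1*x
-
- sse = np.sum((y - y_hat)**2)                 # sum of squared errors
- s_e = np.sqrt(sse / (len(x) - 2))            # standard estimate of errors
- r2 = 1 - sse / np.sum((y - y.mean())**2)     # coefficient of determination
- print(sse, s_e, r2)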
Simple Linear Regression Example
Let's take the example of Housing Price Prediction; the data that I am using is the KC House Prices dataset. Feel free to use any dataset; there are some very good datasets available on Kaggle and through Google Colab.
- import pandas as pd
-
- df = pd.read_csv("kc_house_data.csv")
- df.describe()
Output
From the above output, we can see that we have a lot of columns, but simple linear regression can only work with two, so we select "price" and "sqft_living". Here we will take the first 100 rows for our demonstration.
Before fitting the data let's analyze the data:
- import pandas as pd
- import seaborn as sns
-
- df = pd.read_csv("kc_house_data.csv")
-
- sns.distplot(df['price'][1:100])
Output
- import pandas as pd
- import seaborn as sns
-
- df = pd.read_csv("kc_house_data.csv")
-
- sns.distplot(df['sqft_living'][1:100])
Output
1. Demo using NumPy
In the below code, we will use only NumPy to perform linear regression.
- import pandas as pd
- import numpy as np
- import matplotlib.pyplot as plt
-
- %matplotlib inline
-
- def estimate_coef(x, y):
-     # number of observations
-     n = np.size(x)
-
-     # means of x and y
-     m_x, m_y = np.mean(x), np.mean(y)
-
-     # sum of cross-deviations of y and x, and sum of squared deviations of x
-     SS_xy = np.sum(y*x) - n*m_y*m_x
-     SS_xx = np.sum(x*x) - n*m_x*m_x
-
-     # regression coefficients: intercept b_0 and slope b_1
-     b_1 = SS_xy / SS_xx
-     b_0 = m_y - b_1*m_x
-
-     return (b_0, b_1)
-
- def plot_regression_line(x, y, b):
-     # plot the actual points as a scatter plot
-     plt.scatter(x, y, color = "m", marker = "o", s = 30)
-
-     # predicted response vector
-     y_pred = b[0] + b[1]*x
-
-     # plot the regression line
-     plt.plot(x, y_pred, color = "g")
-
-     # axis labels
-     plt.xlabel('x')
-     plt.ylabel('y')
-
-     plt.show()
-
- def main():
-     # load the dataset and pick the two columns we need
-     df = pd.read_csv("kc_house_data.csv")
-     y = df['price'][1:100]
-     x = df['sqft_living'][1:100]
-
-     # visualize the input data
-     plt.scatter(x, y)
-     plt.xlabel('x')
-     plt.ylabel('y')
-     plt.show()
-
-     # estimate and print the coefficients
-     b = estimate_coef(x, y)
-     print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
-
-     # plot the regression line over the data
-     plot_regression_line(x, y, b)
-
- if __name__ == "__main__":
-     main()
Output
The input data can be visualized as:
After executing the code, we get the following output:
Estimated coefficients: b_0 = 41517.979725295736 b_1 = 229.10249314945074
So the Linear Regression equation becomes:
[price] = 41517.979725295736 + 229.10249314945074*[sqft_living]
i.e. y = b[0] + b[1]*x
Let's see the final plot with the Regression Line
To predict the values we run the following command:
- print("For x =100", "the predicted value would be",(b[1]*100+b[0]))
So the predicted value that we get is 64428.22904024081
Since the machine has to learn, it will make some errors in its predictions, so let's look at the distribution of the errors (residuals) it makes at each given point:
- y_diff = y - (b[0] + b[1]*x)
- sns.distplot(y_diff)
2. Demo using Sklearn
In the below code, we will learn how to code linear regression using sklearn (scikit-learn).
- import pandas as pd
- import numpy as np
- import matplotlib.pyplot as plt
- import seaborn as sns
-
- df = pd.read_csv("kc_house_data.csv")
- y = df['price'][1:100]
- x = df['sqft_living'][1:100]
-
- from sklearn.model_selection import train_test_split
-
- # hold out 20% of the rows as a test set
- x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)
-
- from sklearn.linear_model import LinearRegression
-
- lr = LinearRegression()
-
- # sklearn expects a 2D feature array, so reshape the single column
- x_train = x_train.values.reshape(-1,1)
-
- lr.fit(x_train, y_train)
-
- # intercept b[0] and slope b[1]
- print("b[0] = ", lr.intercept_)
- print("b[1] = ", lr.coef_)
-
- x_test = x_test.values.reshape(-1,1)
-
- # predictions on the test set
- pred = lr.predict(x_test)
-
- # plot the training data and the fitted line
- plt.scatter(x_train, y_train)
- y_pred = lr.intercept_ + lr.coef_*x_train
- plt.plot(x_train, y_pred, color = "g")
-
- plt.xlabel('x')
- plt.ylabel('y')
- plt.show()
Output
After executing the code, we get the following output:
b[0] = 21803.55365770642 b[1] = [232.07541739]
So the Linear Regression equation becomes:
[price] = 21803.55365770642 + 232.07541739*[sqft_living]
i.e. y = b[0] + b[1]*x
Let's now see a graphical representation of the predicted values with respect to the test set.
Since the machine has to learn, it will make some errors in its predictions, so let's look at the distribution of the errors (residuals) on the test set:
- sns.distplot((y_test-pred))
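As an optional check (not in the original walkthrough), we can also quantify the fit on the test set with scikit-learn's built-in metrics, reusing y_test and pred from the code above:
- from sklearn.metrics import mean_squared_error, r2_score
-
- print("MSE =", mean_squared_error(y_test, pred))
- print("R^2 =", r2_score(y_test, pred))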
3. Demo using TensorFlow
In this demo, I will not be using "kc_house_data.csv". To keep it as simple as possible, I use some dummy data, and then I will explain each step. (Note that this example uses the TensorFlow 1.x API; to run it on TensorFlow 2.x, use the tf.compat.v1 module and disable eager execution.)
- import numpy as np
- import tensorflow as tf
- import matplotlib.pyplot as plt
-
- np.random.seed(101)
- tf.set_random_seed(101)
We first set the seed value for both NumPy and TensorFlow. The seed value is used for random number generation.
- x = np.linspace(0, 50, 50)
- y = np.linspace(0, 50, 50)
We now generate dummy data using NumPy's linspace function, which generates 50 evenly spaced points between 0 and 50.
- x += np.random.uniform(-4, 4, 50)
- y += np.random.uniform(-4, 4, 50)
Since the data generated this way is too perfect, we add a little uniformly distributed noise to it.
- n = len(x)
-
- plt.scatter(x, y)
- plt.xlabel('x')
- plt.ylabel('y')
- plt.title("Training Data")
- plt.show()
Here, we plot the dummy training data.
- X = tf.placeholder("float")
- Y = tf.placeholder("float")
- W = tf.Variable(np.random.randn(), name = "W")
- b = tf.Variable(np.random.randn(), name = "b")
Here we set X and Y as placeholders for the actual training data, and W and b as the trainable variables, where:
- W means weight
- b means bias
- X is the independent (input) variable
- Y is the dependent variable
We have to give initial weights and bias to the model. So, we initialize them with some random values.
- learning_rate = 0.01
- training_epochs = 1000
Here, we set the learning rate to 0.01, i.e., each gradient descent update changes the parameters by 0.01 times the gradient, and we will train the model for 1000 epochs.
- # hypothesis: y_pred = X*W + b
- y_pred = tf.add(tf.multiply(X, W), b)
-
- # mean squared error cost function
- cost = tf.reduce_sum(tf.pow(y_pred-Y, 2)) / (2 * n)
-
- # gradient descent optimizer that minimizes the cost
- optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
-
- # global variables initializer
- init = tf.global_variables_initializer()
Next, we have to set up the hypothesis, the cost function, the optimizer (which makes sure the model learns on every epoch), and the global variables initializer.
- Since we are using simple linear regression, the hypothesis is set as:
Independent_Variable*Weight + Bias, i.e. X*b[1] + b[0]
- We choose the cost function to be the mean squared error cost function
- We choose the optimizer to be Gradient Descent, with the goal of minimizing the cost with the given learning rate
- We use TensorFlow's global variables initializer to initialize all the variables so that we can re-use their values later
- with tf.Session() as sess:
-
-     # initialize the variables W and b
-     sess.run(init)
-
-     # iterate through all the epochs
-     for epoch in range(training_epochs):
-
-         # feed each data point into the optimizer
-         for (_x, _y) in zip(x, y):
-             sess.run(optimizer, feed_dict = {X : _x, Y : _y})
-
-         # display the result after every 50 epochs
-         if (epoch + 1) % 50 == 0:
-             # calculate the cost over all data points
-             c = sess.run(cost, feed_dict = {X : x, Y : y})
-             print("Epoch", (epoch + 1), ": cost =", c, "W =", sess.run(W), "b =", sess.run(b))
- tf.Session() starts the TensorFlow session, which we name sess.
- sess.run(init) initializes all the required variables.
- for epoch in range(training_epochs) creates a loop that runs until epoch reaches training_epochs.
- for (_x, _y) in zip(x, y) pairs up the data points and iterates over them one at a time.
- sess.run(optimizer, feed_dict = {X: _x, Y: _y}) runs one optimizer (training) step on each data point.
- if (epoch + 1) % 50 == 0 reports the progress after every 50 epochs.
- c = sess.run(cost, feed_dict = {X: x, Y: y}) computes the current cost over all the data points.
- print("Epoch", (epoch + 1), ": cost =", c, "W =", sess.run(W), "b =", sess.run(b)) prints the cost, weight, and bias after every 50 epochs.
-     # store the final values of the cost, weight, and bias
-     training_cost = sess.run(cost, feed_dict ={X: x, Y: y})
-     weight = sess.run(W)
-     bias = sess.run(b)
The above code stores the final cost, weight, and bias so that they can be used once the session is closed.
- # calculate the predictions using the trained weight and bias
- predictions = weight * x + bias
- print("Training cost =", training_cost, "Weight =", weight, "bias =", bias, '\n')
-
- # plot the original data and the fitted line
- plt.plot(x, y, 'ro', label ='Original data')
- plt.plot(x, predictions, label ='Fitted line')
- plt.title('Linear Regression Result')
- plt.legend()
- plt.show()
The above code demonstrates how we can use the trained weight and bias to predict values and plot the fitted line.
LR_TensorFlow.py
- import numpy as np
- import tensorflow as tf
- import matplotlib.pyplot as plt
-
- np.random.seed(101)
- tf.set_random_seed(101)
-
- # generate dummy data and add some noise
- x = np.linspace(0, 50, 50)
- y = np.linspace(0, 50, 50)
-
- x += np.random.uniform(-4, 4, 50)
- y += np.random.uniform(-4, 4, 50)
-
- n = len(x)
-
- # plot the training data
- plt.scatter(x, y)
- plt.xlabel('x')
- plt.ylabel('y')
- plt.title("Training Data")
- plt.show()
-
- # placeholders for the inputs and trainable variables for the parameters
- X = tf.placeholder("float")
- Y = tf.placeholder("float")
- W = tf.Variable(np.random.randn(), name = "W")
- b = tf.Variable(np.random.randn(), name = "b")
-
- learning_rate = 0.01
- training_epochs = 1000
-
- # hypothesis, cost function, optimizer, and variable initializer
- y_pred = tf.add(tf.multiply(X, W), b)
- cost = tf.reduce_sum(tf.pow(y_pred-Y, 2)) / (2 * n)
- optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
- init = tf.global_variables_initializer()
-
- with tf.Session() as sess:
-
-     sess.run(init)
-
-     for epoch in range(training_epochs):
-
-         # feed each data point into the optimizer
-         for (_x, _y) in zip(x, y):
-             sess.run(optimizer, feed_dict = {X : _x, Y : _y})
-
-         # display the result after every 50 epochs
-         if (epoch + 1) % 50 == 0:
-             c = sess.run(cost, feed_dict = {X : x, Y : y})
-             print("Epoch", (epoch + 1), ": cost =", c, "W =", sess.run(W), "b =", sess.run(b))
-
-     # store the final cost, weight, and bias
-     training_cost = sess.run(cost, feed_dict ={X: x, Y: y})
-     weight = sess.run(W)
-     bias = sess.run(b)
-
- # compute predictions and plot the fitted line
- predictions = weight * x + bias
- print("Training cost =", training_cost, "Weight =", weight, "bias =", bias, '\n')
-
- plt.plot(x, y, 'ro', label ='Original data')
- plt.plot(x, predictions, label ='Fitted line')
- plt.title('Linear Regression Result')
- plt.legend()
- plt.show()
Output
After executing the code, we get the following output:
b[0] = 0.02561676 b[1] = 1.0199214
So the Linear Regression equation becomes:
[y] = 0.02561676 + 1.0199214*[x]
i.e. y = b[0] + b[1]*x
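For readers on TensorFlow 2.x, where the Session and placeholder API used above has been removed, here is a rough equivalent sketch of my own using tf.GradientTape; the learning rate is an assumption chosen for the full-batch update, not a value taken from the original code:
- import numpy as np
- import tensorflow as tf
-
- # the same kind of noisy dummy data as above
- x = (np.linspace(0, 50, 50) + np.random.uniform(-4, 4, 50)).astype(np.float32)
- y = (np.linspace(0, 50, 50) + np.random.uniform(-4, 4, 50)).astype(np.float32)
-
- W = tf.Variable(np.random.randn(), dtype=tf.float32)
- b = tf.Variable(np.random.randn(), dtype=tf.float32)
- learning_rate = 0.001
-
- for epoch in range(1000):
-     with tf.GradientTape() as tape:
-         y_pred = W * x + b                              # hypothesis
-         cost = tf.reduce_mean(tf.square(y_pred - y))    # mean squared error
-     dW, db = tape.gradient(cost, [W, b])
-     W.assign_sub(learning_rate * dW)                    # gradient descent update
-     b.assign_sub(learning_rate * db)
-
- print("Weight =", W.numpy(), "bias =", b.numpy())
The structure is the same as before: a hypothesis, a mean squared error cost, and a gradient descent update; only the session bookkeeping disappears.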
Conclusion
In this chapter, we studied regression and simple linear regression.
In the next chapter, we will study the simple logistic regression, which is another type of regression.