30 Days Of Python 👨‍💻 - Day 28 - ML And Data Science II

This article is part of a 30-day Python challenge series. You can find the links to all the previous posts of this series here.
Today I explored the Scikit-Learn library and created a notebook project to go over some of the basics and try creating a machine learning model. Scikit-Learn is a vast library, and it takes a lot of practice and exploration to get a grasp of it. I followed some tutorials and articles to build a simple classifier model, just to figure out how it works. It looked a bit intimidating to me, but I decided to create a basic workflow in a Jupyter Notebook so that I can use it as a reference when I decide to dive deep into the ML and data science domain.
 
Scikit-Learn is a popular Python library for machine learning. It provides tools to process data, train machine learning models that learn patterns within the data, and make predictions with those models.
 

Why Scikit-learn?

  • Built on top of the NumPy, SciPy and matplotlib libraries
  • Has tons of built-in machine learning models
  • Offers many methods to evaluate machine learning models
  • Easy to understand and well-designed API
Usually, Machine Learning can be a bit overwhelming as it involves complex algorithms and statistics to analyze data. Scikit-learn abstracts this complexity and makes it easy to build models and train them without having to know much about mathematics and statistics.
 
Here is the notebook I created today. The link to the GitHub repository is here.
 

Basics of the scikit-learn library

 
This notebook covers some of the basics of the amazing scikit-learn Python library. Some of the important use cases of the library have been listed in this notebook which can be used as a cheat-sheet for reference.
 
Some of the topics covered are:
  • Getting the data ready
  • Selecting the appropriate algorithm/estimator for the specific problem
  • Fitting the model/algorithm to use it to make predictions on the data
  • Evaluating a model
  • Improving a model
  • Saving and loading a trained model
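
The topics above map onto a handful of scikit-learn calls. Here is a minimal sketch of that flow, assuming a synthetic dataset from make_classification as a stand-in for real data. The dataset, the sample counts and the random_state values are illustrative assumptions and are not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get the data ready (synthetic stand-in: 200 rows, 8 features, 2 classes)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Pick an estimator
model = RandomForestClassifier(n_estimators=50, random_state=0)

# 3. Fit it on the training split, then predict on unseen rows
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# 4. Evaluate: for classifiers, score() returns the mean accuracy
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Most scikit-learn estimators follow the same fit / predict / score pattern, so swapping in a different model usually only changes step 2.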

Getting the data ready

 
The data used for this project will be the heart disease data set available from https://www.kaggle.com/ronitf/heart-disease-uci
    import pandas as pd
    import numpy as np

    heart_disease = pd.read_csv('data/heart.csv')
    heart_disease.head()

       age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
    0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
    1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
    2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
    3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
    4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1

The aim is to predict, based on the above data, whether a patient has heart disease or not. The target column holds the label to predict, and the other columns are called the features.
    # Create features matrix (X)
    X = heart_disease.drop('target', axis=1)

    # Create labels (y)
    y = heart_disease['target']

Choose the appropriate model/estimator for the problem

 
For this problem, we will be using the RandomForestClassifier model from sklearn, which is a classification machine learning model.
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier()
    clf.get_params()  # lists the hyperparameters

    {'bootstrap': True,
     'ccp_alpha': 0.0,
     'class_weight': None,
     'criterion': 'gini',
     'max_depth': None,
     'max_features': 'auto',
     'max_leaf_nodes': None,
     'max_samples': None,
     'min_impurity_decrease': 0.0,
     'min_impurity_split': None,
     'min_samples_leaf': 1,
     'min_samples_split': 2,
     'min_weight_fraction_leaf': 0.0,
     'n_estimators': 100,
     'n_jobs': None,
     'oob_score': False,
     'random_state': None,
     'verbose': 0,
     'warm_start': False}

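
Any of the hyperparameters listed by get_params() can be set when the estimator is created, or changed later with set_params(). A short sketch follows; the variable name clf_small and the values are chosen only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Override two defaults at construction time (values are illustrative)
clf_small = RandomForestClassifier(n_estimators=50, max_depth=5)

# get_params() returns a plain dict, so individual settings can be read back
params = clf_small.get_params()
print(params["n_estimators"], params["max_depth"])

# set_params() changes settings on an existing estimator and returns it
clf_small.set_params(n_estimators=200)
print(clf_small.get_params()["n_estimators"])
```

Changing a hyperparameter does not retrain anything by itself; the model has to be fitted again for the new setting to take effect.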
Fit the model to the training data
 
In this step the data is split into training and testing sets, and the model is fitted to the training set.
    # Split the data, then fit the model to the training portion
    from sklearn.model_selection import train_test_split

    # test_size=0.2 means 20% of the data will be used as testing data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf.fit(X_train, y_train);

    # Make predictions. predict() needs a 2D input with the same feature
    # columns used for training, so a call such as
    # clf.predict(np.array([0, 2, 3, 4])) raises a ValueError.
    y_preds = clf.predict(X_test)
    y_preds

    array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,
           1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
           1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1], dtype=int64)

    y_test.head()

    72     1
    116    1
    107    1
    262    0
    162    1
    Name: target, dtype: int64
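
With test_size=0.2 the test set here ends up with 61 rows, one per entry in y_preds. The sketch below shows the arithmetic behind that number. It assumes the 303-row version of heart.csv (the row count is not shown above) and assumes that, for a fractional test_size, the test share is rounded up and the remainder goes to training. split_sizes is a hypothetical helper written for this post, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: rounds the test share up and gives the rest
    of the rows to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Assumed row count for heart.csv; 20% is held out for testing
n_train, n_test = split_sizes(303, 0.2)
print(n_train, n_test)
```

Because the rows are shuffled before splitting, which rows land in the test set changes on every run unless random_state is passed to train_test_split.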

Evaluate the model

 
In this step the model is evaluated on the training data and the test data.
    clf.score(X_train, y_train)

    1.0

    clf.score(X_test, y_test)

    0.7704918032786885

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    print(classification_report(y_test, y_preds))

                  precision    recall  f1-score   support

               0       0.77      0.71      0.74        28
               1       0.77      0.82      0.79        33

        accuracy                           0.77        61
       macro avg       0.77      0.77      0.77        61
    weighted avg       0.77      0.77      0.77        61

    print(confusion_matrix(y_test, y_preds))

    [[20  8]
     [ 6 27]]

    print(accuracy_score(y_test, y_preds))

    0.7704918032786885
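
The numbers in the classification report can be derived by hand from the confusion matrix, which helps to see what each metric measures. This is a small sketch in plain Python using the matrix printed above. The variable names are my own, and it relies on scikit-learn's convention that rows are the true classes and columns are the predicted classes.

```python
# Confusion matrix from above: rows are true classes, columns are predictions
cm = [[20, 8],
      [6, 27]]

true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]
total = true_neg + false_pos + false_neg + true_pos

# Share of all predictions that were correct
accuracy = (true_pos + true_neg) / total
# Of the patients predicted as class 1, how many really were class 1
precision_1 = true_pos / (true_pos + false_pos)
# Of the patients who really were class 1, how many the model found
recall_1 = true_pos / (true_pos + false_neg)
# Harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The results line up with the row for class 1 in the report and with the value returned by accuracy_score.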
 

Improve the model

 
This step involves improving the model to get more accurate results, here by trying different values of the n_estimators hyperparameter.
    # Try different amounts of n_estimators
    np.random.seed(42)
    for i in range(1, 100, 10):
        print(f'Trying model with {i} estimators')
        clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
        print(f'Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%')
        print('')

    Trying model with 1 estimators
    Model accuracy on test set: 72.13%

    Trying model with 11 estimators
    Model accuracy on test set: 83.61%

    Trying model with 21 estimators
    Model accuracy on test set: 78.69%

    Trying model with 31 estimators
    Model accuracy on test set: 78.69%

    Trying model with 41 estimators
    Model accuracy on test set: 75.41%

    Trying model with 51 estimators
    Model accuracy on test set: 75.41%

    Trying model with 61 estimators
    Model accuracy on test set: 75.41%

    Trying model with 71 estimators
    Model accuracy on test set: 73.77%

    Trying model with 81 estimators
    Model accuracy on test set: 73.77%

    Trying model with 91 estimators
    Model accuracy on test set: 75.41%
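
One way to make this comparison easier to read is to record each score and pick the best setting afterwards. Below is a sketch on synthetic data: make_classification is a stand-in for the heart disease data, and the random_state values are assumptions added so that runs are repeatable. Keep in mind that choosing a setting by its test-set score makes that score an optimistic estimate, so a separate validation split would be safer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for n in range(1, 100, 10):
    # Fixing random_state keeps each forest reproducible between runs
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    scores[n] = model.score(X_test, y_test)

# Pick the setting with the highest recorded accuracy
best_n = max(scores, key=scores.get)
print(f"Best n_estimators: {best_n} ({scores[best_n] * 100:.2f}% accuracy)")
```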

Save the model and load it

 
I will be using Python's pickle module to save the model.
    import pickle

    # Save the model
    with open('random_forest_model_1.pkl', 'wb') as f:
        pickle.dump(clf, f)

    # Load the model
    with open('random_forest_model_1.pkl', 'rb') as f:
        loaded_model = pickle.load(f)

    loaded_model.score(X_test, y_test)

    0.7540983606557377
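
A quick way to check that a saved model survived the round trip is to compare its predictions with those of the original. The sketch below does that on a small model. The synthetic data, the temporary directory and the file name model.pkl are assumptions for illustration. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small model on synthetic stand-in data
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # illustrative file name

    # Save: the with-block closes the file even if dump() fails
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly what the original does
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```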
 
That’s all for today. Since Machine Learning and Data Science is an ocean in itself, I decided to look into it in more detail and after becoming more adept with its tools and concepts, share my experience as blog posts and projects. For the remaining two parts of this challenge, I would like to explore domains such as Automation Testing with Python using Selenium and create another post on a compilation of Python resources.
 
Have a great one!

