In this article, we’ll walk step by step through the entire machine learning process of solving a Kaggle competition challenge and making a submission. We’ll cover joining a Kaggle competition, data exploration, feature engineering, model training, testing, scoring, and evaluation. Then, we’ll submit our result to the Kaggle leaderboard. You can learn more about Kaggle from my previous article, Kaggle Competition, where I detail the Kaggle platform and the different types of competitions you can take part in.
Now, let us learn to perform some machine learning in Kaggle.
Step 1
Visit the Kaggle platform and select any ongoing competition. Here, I’ve chosen the Housing Prices Competition. Kaggle is great for beginners, and you can choose the Getting Started challenges to get familiar with the machine learning process and the basics of Python and R.
Explore the Data
Step 2
Under the Data tab, we can see all the files of the dataset and the file format required for submission. You’ll also find details of all the columns of the data that will be used for machine learning.
You can explore the training data here.
Under the Data Explorer, you can see a sample of the submission CSV file. It has just an Id column and a SalePrice column. Some competitions require you to submit your entire machine learning notebook, while others simply require the output data file from your testing, which is then used to rank your submission.
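For reference, the submission file is shaped like this; the SalePrice values below are made-up placeholders, not real predictions:
Id,SalePrice
1461,169000.0
1462,187725.0
1463,183000.0
...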
Submission
Step 3
Under My Submissions, you can explore your previously submitted notebooks and view their public scores. Up to two submissions can be selected for the final leaderboard evaluation.
Step 4
Here, simply click on Submit Predictions to upload the CSV file. As defined for this competition, the solution file needs to have 1459 prediction rows.
Once you add a submission file that fulfills the required number of predictions, fill in the submission description, and click on Make Submission, your file will be uploaded and scored against the competition.
Notebook
Step 5
Kaggle provides a Notebook environment in which to code the entire machine learning process. Under Code in the menu, click on New Notebook and you’ll be provided with some starter library imports and the input and output connections.
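The default starter cell typically looks something like this: it imports numpy and pandas, then walks the input directory and prints every file attached to the notebook:
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))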
Here, press Shift+Enter in each cell of the notebook to run the code in that cell, or click on the play button above it.
You can simply click on the + Code button to add more cells and continue with the machine learning process.
Here, you can see the new cell has been added.
Similarly, on the left-hand side you can view the Submit button and the Save button to save your ongoing progress on this challenge.
Step 6
Here, as the cell runs, you can see updates on the console. The session has started and the file streams have been added to kaggle/input.
Setting Up
Step 7
Above, we saw an example of the setup. Here, we initiate the process for the Housing Price Prediction challenge. Add this code to link the CSV files for training and testing, import the other libraries, and print Setup Complete to confirm that all of the above steps succeeded.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex1 import *
print("Setup Complete")
Feature Selection
Step 8
Here, we import the pandas library and train_test_split from sklearn. Then we read the training and testing files and select the target for prediction. We select 7 key features to predict SalePrice, including LotArea, YearBuilt, floor sizes, and room counts. It seems logical to select these columns.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Obtain target and predictors
y = X_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8,
                                                      test_size=0.2, random_state=0)
You see, machine learning is about experimenting and finding the features that produce the best output. In the Azure Machine Learning - Auto ML article, we performed automated machine learning using Azure and found the top features for prediction under the best algorithm. Here, since we are hand-coding each step of the machine learning process, we explore the data ourselves and see how these features perform.
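As a quick sanity check on these choices (an optional step, not part of the original exercise), you can rank the candidate features by their linear correlation with the target:
# Optional: correlation of each selected feature with SalePrice.
# High absolute values suggest, but don't guarantee, useful predictors.
correlations = X_full[features].corrwith(y).sort_values(ascending=False)
print(correlations)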
Training Features Data
Step 9
We need to explore the data, so here we print some rows of the different features we’ve selected for training.
X_train.head()
We can see the LotArea, YearBuilt, the first-floor size, and the total numbers of full baths and bedrooms. With numerous features, a lot of factors come into play, and these 7 features are good to go for training.
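Beyond the first few rows, summary statistics can reveal outliers or odd value ranges; this is an optional extra on top of the original notebook:
# Optional: count, mean, min/max, and quartiles for each feature
X_train.describe()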
Training
Step 10
Here, all of the feature data consists of numerical values, which makes this a great place to use a regression algorithm. One of the most widely used regression algorithms for multivariate problems, training each tree on a sub-sample of the dataset, is the Random Forest Regressor.
from sklearn.ensemble import RandomForestRegressor
# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='mae', random_state=0)  # 'mae' is deprecated in newer scikit-learn; 'absolute_error' is the equivalent
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
models = [model_1, model_2, model_3, model_4, model_5]
We simply import the Random Forest Regressor from sklearn and define different models, each with a different number of estimators, minimum sample split, maximum depth, and so on.
Scoring
Step 11
Now, we define a function to calculate the mean absolute error, comparing each model’s predictions against the validation set.
from sklearn.metrics import mean_absolute_error
# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

for i in range(0, len(models)):
    mae = score_model(models[i])
    print("Model %d MAE: %d" % (i+1, mae))
Each of the five models’ mean absolute errors (MAE) is printed here.
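Based on the run logged at the end of this article, the printed output looks like this:
Model 1 MAE: 24015
Model 2 MAE: 23740
Model 3 MAE: 23528
Model 4 MAE: 23996
Model 5 MAE: 23706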
Evaluation
Step 12
Analyzing the MAE values to decide which model would be the best one, we can see that Model 3 has the lowest error. Hence, we check our answer and confirm that Model 3 is the best of all 5 models.
# Fill in the best model
best_model = model_3
# Check your answer
step_1.check()
# Lines below will give you a hint or solution code
step_1.hint()
step_1.solution()
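Instead of hard-coding model_3, an optional variation (my own addition, not part of the original exercise) selects the lowest-MAE model programmatically:
# Optional: pick the model with the lowest validation MAE automatically
scores = [score_model(model) for model in models]
best_model = models[scores.index(min(scores))]
print("Best model: Model %d" % (scores.index(min(scores)) + 1))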
Testing
Step 13
Here, we confirm which model is the most accurate. As we can see, Model 3 is the one with the most accurate predictions, because its mean absolute error is the lowest, surpassing every other model’s predictions.
# Define a model
my_model = model_3 # Your code here
# Check your answer
step_2.check()
# Lines below will give you a hint or solution code
step_2.hint()
step_2.solution()
Output Submission File
Step 14
Now, as we learned above about the submission process, we know we need a comma-separated values file in a specific format, with a specific number of prediction rows, to submit our predictions.
# Fit the model to the training data
my_model.fit(X, y)
# Generate test predictions
preds_test = my_model.predict(X_test)
# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
Here, we fit our model on the full training data, produce the test predictions, and save them in the two-column Id and SalePrice format as submission.csv, without any indexing.
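Before uploading, a quick sanity check (my own addition, not part of the original notebook) can confirm the file meets the competition’s requirement of 1459 prediction rows in the Id and SalePrice columns:
# Verify the submission file shape before uploading
submission = pd.read_csv('submission.csv')
assert list(submission.columns) == ['Id', 'SalePrice']
assert len(submission) == 1459  # row count required by this competition
print(submission.head())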
Submission
Step 15
Finally, we can submit our file to rank on the leaderboard and see how we performed compared to other submissions. This will help us build a better machine learning process in future iterations, with better results. We can explore other algorithms, use different hyperparameters, and even use different features altogether to train our model.
Shown below is the submission file we made ready. Simply click on the Submit button, and we’ll be evaluated against all other submissions.
Similarly, explore the Logs to keep track of updates and errors in the notebook’s code and get a holistic overview of the process.
Time # Log Message
13.3s 1 /opt/conda/lib/python3.7/site-packages/papermill/iorw.py:50: FutureWarning: pyarrow.HadoopFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
13.3s 2 from pyarrow import HadoopFileSystem
14.8s 3 Setup Complete
15.7s 4 Model 1 MAE: 24015
16.2s 5 Model 2 MAE: 23740
16.2s 6 /opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py:399: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.
16.2s 7 FutureWarning,
19.2s 8 Model 3 MAE: 23528
19.9s 9 Model 4 MAE: 23996
20.0s 10 Model 5 MAE: 23706
20.7s 11 /opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py:399: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.
20.7s 12 FutureWarning,
28.4s 13 /opt/conda/lib/python3.7/site-packages/traitlets/traitlets.py:2567: FutureWarning: --Exporter.preprocessors=["remove_papermill_header.RemovePapermillHeader"] for containers is deprecated in traitlets 5.0. You can pass `--Exporter.preprocessors item` ... multiple times to add items to a list.
28.4s 14 FutureWarning,
28.4s 15 [NbConvertApp] Converting notebook __notebook__.ipynb to notebook
28.7s 16 [NbConvertApp] Writing 30179 bytes to __notebook__.ipynb
31.2s 17 /opt/conda/lib/python3.7/site-packages/traitlets/traitlets.py:2567: FutureWarning: --Exporter.preprocessors=["nbconvert.preprocessors.ExtractOutputPreprocessor"] for containers is deprecated in traitlets 5.0. You can pass `--Exporter.preprocessors item` ... multiple times to add items to a list.
31.2s 18 FutureWarning,
31.2s 19 [NbConvertApp] Converting notebook __notebook__.ipynb to html
32.0s 20 [NbConvertApp] Writing 305243 bytes to __results__.html
You can then go back to the Notebook section, perform another machine learning run with the different choices suggested above, check your new output, and submit it to the competition.
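For instance, a small grid search over the Random Forest hyperparameters is one way to experiment further. This is an illustrative sketch, not part of the original notebook, and the parameter values are assumptions rather than tuned recommendations:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Illustrative grid; the values here are assumptions, not tuned recommendations
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [None, 7, 15],
    'min_samples_split': [2, 10, 20],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring='neg_mean_absolute_error', cv=5)
search.fit(X, y)
print(search.best_params_, -search.best_score_)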
Conclusion
Thus, in this article, we went step by step through Kaggle: joining a competition and working through an entire machine learning process, from data exploration and feature selection to training, testing, scoring, and evaluation. We then submitted our result to the competition to rank ourselves and our work on the Kaggle leaderboard.