Introduction
To create a Machine Learning experiment, we will use an automobile dataset and try to predict the price of an automobile based on factors such as its make and technical specifications. Before we get started with the steps involved in creating this experiment in Azure ML Studio, we need to sign up for (or sign in to) the platform. To do so, visit here and sign in using your Outlook account. You may also use any other Microsoft account, work account, or school account.
Steps involved to create an Experiment
Creating a model can be divided into three parts: (a) creation of the model, (b) training of the model, and (c) testing of the model. The steps involved are:
Creation of Model
- Step 1: Get the data
- Step 2: Prepare the data
- Step 3: Define features
Train the Model
- Step 4: Choose and apply a learning algorithm
Score and Test the Model
- Step 5: Predict the new automobile price.
Let’s get started.
Step 1 - Get the Data
The very first step is to get the data. The data can come in different types, formats, and structures. Azure ML Studio comes with many sample datasets that we can use. For this experiment, we are going to use Automobile Price Data (Raw), which is available in the Azure ML workspace. Note that we can also import data from various other sources.
1.1 At the bottom of the Machine Learning Studio window, you’ll find the ‘+New’ button. Click on it to create a new experiment and then, select Blank Experiment.
Figure: The +New button
Figure: Select Blank Experiment
1.2 At the top of the canvas, you can find the default experiment name. Rename it to Automobile Price Prediction.
Figure: Rename the experiment name
1.3 Towards the left, you’ll find a palette of datasets and modules. At the top of this palette, in the search box, type "Automobile" to find the dataset labeled "Automobile Price Data (Raw)". Drag and drop this dataset onto the experiment canvas.
Figure: Search for automobile
Figure: Drag and drop the data on the canvas
To visualize this dataset, click on the output port of the dataset and then select "Visualize".
Figure: Click on Visualize
In this dataset, the data is stored in row and column format. Each row represents one automobile and each column describes a feature associated with it. From the given dataset, our task is to predict the price of an automobile, located in column 26 and titled ‘price’.
Figure: Dataset
You may close the window by clicking the ‘x’ button in the upper right corner.
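If you prefer to follow along in code, the sketch below shows a rough equivalent of this step using pandas. It assumes the dataset has been exported from the Studio (or downloaded) as a CSV file; the file name automobile_price_raw.csv is only a placeholder.

```python
# Minimal sketch, not part of the Studio workflow: load the automobile data
# with pandas. The CSV file name is a placeholder assumption.
import pandas as pd

df = pd.read_csv("automobile_price_raw.csv")

print(df.shape)    # number of rows (automobiles) and columns (features)
print(df.columns)  # column names, including the 'price' target
print(df.head())   # a quick look at the data, similar to the Visualize view
```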
Step 2 - Prepare the data
This step, often called data pre-processing or feature engineering, prepares the data before it can be analyzed. For instance, there are missing values in many columns. The normalized-losses column in particular has a huge proportion of missing values, so we will drop that column from the analysis. First, we’ll remove the normalized-losses column and then the rows that have missing data.
2.1 In the search box, type "Select Columns" to find the "Select Columns in Dataset" module. Drag and drop this module onto the canvas. Using this module, we can select the columns we want to include in or exclude from the model.
Figure: Search Select columns
2.2 Click on the output port of Automobile Price Data (Raw) and connect it to the input port of the "Select Columns in Dataset".
Figure: Connect data and select the column module
2.3 Click on the "Select Columns in Dataset" module and, on the right side, you’ll find the "Properties" pane. Click on "Launch Column Selector".
Figure: Click on Select Columns in Dataset Module
Figure: Click on Launch Column Selector
- In the Select Columns window, click on "With Rules".
- Under Begin With, click "All Columns". This starts with all columns and lets us exclude only the ones we do not want.
- To exclude the normalized-losses column, select Exclude and Column Names from the drop-downs. In the list of columns displayed, select normalized-losses to add it to the text box.
- Click OK to close the column selector.
Figure: Exclude normalized-losses column
Look at the properties pane of "Select Columns in Dataset". It indicates that it allows all columns to pass except normalized-losses.
Figure: Normalized-losses columns now excluded
Tip: Double-click a module to add a comment, which can help you better understand the experiment later.
2.4 Let us now deal with the missing values in the rows. As done earlier, search for "Clean Missing Data" and drag and drop the module onto the canvas.
Figure: Search for clean missing data
- Connect it to "Select Columns in Dataset".
- In Properties pane under Cleaning Mode title, select "Remove Entire Row". This removes all rows which have missing values.
Figure: Connect Clean Missing Data to Select Columns in Dataset
Figure: Select Remove Entire Row
2.5 At the bottom of the window, click Run.
Figure: Click on RUN
After the experiment has finished running, all modules are marked with green checkmarks indicating that they finished successfully. Also, in the top-right corner, you’ll find the status "Finished Running".
Figure: Green marks after a successful run
Ok, let us visualize our dataset now. Click on the left output port of the Clean Missing Data module and select "Visualize". Note that there are no missing values and the normalized-losses column has been dropped. Our data is now clean and ready for analysis.
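For comparison, here is a minimal pandas sketch of the same cleaning step, assuming the DataFrame `df` from the Step 1 sketch and that missing values are already represented as NaN.

```python
# Rough equivalent of Step 2: drop the normalized-losses column, then remove
# every row that still has a missing value.
df_clean = df.drop(columns=["normalized-losses"]).dropna()

print(df_clean.isna().sum().sum())  # 0 -> no missing values remain
```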
Step 3 - Define Features
Features in machine learning are most often the columns in the dataset that help us derive the output. In this dataset, each row represents one automobile and each column is a feature of that automobile.
Some features are good for predicting the target value and some are not. Some features are strongly correlated with each other, so one of them can be dropped. In our case, ‘city-mpg’ and ‘highway-mpg’ are closely related; hence, we can keep one of them and drop the other without affecting the predictive outcome.
To get started, let us use the following set of features.
make, body-style, wheel-base, engine-size, horsepower, peak-rpm, highway-mpg, price
3.1 In the search box, once again, type "Select Columns" and drag & drop the "Select Columns in Dataset" module onto the experiment canvas. Connect the left output port of the Clean Missing Data module to the input port of Select Columns in Dataset.
Figure: Connect the two modules
3.2 From the Properties pane, click on "Launch Column Selector".
- Click on "With Rules" and, under Begin With, click "No Columns".
- Select "Include" and Column Names from the drop-downs, and add the following list of columns to the text box.
make, body-style, wheel-base, engine-size, horsepower, peak-rpm, highway-mpg, price
Figure: Include specific columns
After this module runs, it produces a filtered dataset containing only the features we selected; these are the only features that will be passed to the learning algorithm. Remember, you can always come back and experiment with this module, adding or removing features to get a better result.
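As a code-level sketch, selecting the same subset of features from the cleaned DataFrame of the previous sketch could look like this (column names as listed above).

```python
# Keep only the features we want to pass to the learning algorithm.
features = ["make", "body-style", "wheel-base", "engine-size",
            "horsepower", "peak-rpm", "highway-mpg", "price"]
df_features = df_clean[features]
```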
Step 4 - Choose and apply the algorithm
As our data is ready for analysis, we can now construct a predictive model, which involves training and testing. We will use most of the data (70% to 80%) to train the model, and the rest will be used to test the model and check the accuracy of its predictions.
From our previous discussions, regression is used to predict a number. Since the price of an automobile is a number, we will use a regression algorithm.
We train our model by giving it sample data, i.e., training data that includes the price. The model analyzes the data and finds the relationship between price and the other automobile features. We then test our model with the test data: we give the model sets of features for automobiles whose prices we already know, and see how closely it predicts those known prices.
We are going to split our dataset into a training dataset and a test dataset for training and testing the model.
4.1 From the palette, search for the "Split Data" module and drag it onto the experiment canvas. Then, connect it to the previous "Select Columns in Dataset" module.
Figure: Add Split Data module and connect it to the previous model
4.2 Click the Split Data module and, in the Properties pane, set "Fraction of rows in the first output dataset" to 0.75. This means 75% of the data will be used to train the model and the leftover 25% will be used for testing. You can always come back and change this value.
Figure: Split the dataset into train and test dataset
Random Seed controls how the rows are sampled: keeping the same seed makes the split reproducible, while changing it produces a different random split into training and test data.
4.3 Run the experiment so that the selected features are passed through and the dataset is split into training and test sets. Click on the left output port of the Split Data module and select "Visualize" to see the training dataset, and click on the right output port and select "Visualize" to see the test dataset.
Figure: Train dataset with 145 records i.e. 75% of the original dataset
Figure: Test dataset with 48 records i.e. 25% of the original dataset
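For reference, a minimal scikit-learn sketch of the same 75/25 split, assuming the `df_features` DataFrame from the Step 3 sketch; `random_state` plays the role of the Random Seed.

```python
# Split the data into 75% training and 25% test, mirroring the Split Data module.
from sklearn.model_selection import train_test_split

X = df_features.drop(columns=["price"])  # input features
y = df_features["price"]                 # target value to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0)  # fixed seed -> reproducible split

print(len(X_train), len(X_test))  # roughly 75% / 25% of the rows
```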
4.4 It’s time to select our machine learning algorithm. From the palette on the left side, expand the Machine Learning category and then expand Initialize Model. Here, you can see many machine learning algorithms. Drag and drop the Linear Regression module from the Regression category onto the canvas.
Figure: Look for Linear Regression model in pallet
Alternatively, you could simply search for Linear Regression and drag & drop it onto the experiment canvas from the module palette.
Figure: Type Linear Regression in search box
4.5 Search for the Train Model module and drag and drop it onto the canvas. Connect the left output port of the Split Data module, i.e., the training dataset, to the right input port of Train Model, and connect the output port of the Linear Regression module to the left input port of Train Model.
Figure: Feeding model with algorithm and train dataset
4.6 Click the "Train Model" module. From the Properties pane, click on the "Launch Column Selector".
- Click on "By Name" and then select the "price" column, which is the value that we are going to predict.
- Select the "price" column from the Available Columns section and move it to the Selected Columns.
Figure: Selecting a column to predict
4.7 RUN the experiment.
The model is now trained to predict the price of an automobile when given a set of features.
Figure: Green checkmarks after successful RUN
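Outside the Studio, a comparable training step with scikit-learn might look like the sketch below. Because linear regression needs numeric inputs, the categorical columns make and body-style are one-hot encoded first; this encoding is an assumption made for the sketch, not the Studio's internal implementation.

```python
# Sketch: fit a linear regression on the training data from the Step 4 split.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns, pass the numeric columns through as-is.
preprocess = ColumnTransformer(
    [("categorical", OneHotEncoder(handle_unknown="ignore"), ["make", "body-style"])],
    remainder="passthrough")

model = make_pipeline(preprocess, LinearRegression())
model.fit(X_train, y_train)  # learn the relation between features and price
```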
Step 5 - Predict New Automobile Price
As we trained our model with 75% of the data, the leftover 25% can be used to score how well the model performs.
5.1 Search for Score Model module and drag & drop it to the experiment canvas.
Connect the test data output port (the right output) of the "Split Data" module to the right input port of the Score Model, and the output port of the Train Model to the left input port of the Score Model.
Figure: Connect Score Model with Train Model and Split Data
5.2 RUN the experiment, click on the output port of the Score Model, and select Visualize. The output shows the new price calculated by the model alongside the known price values from the test data.
Figure: Predicted and Known Values Compared
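In the code sketch, scoring corresponds to predicting prices for the held-out test rows and placing them next to the known prices.

```python
# Predict prices for the test set and compare them with the known values.
import pandas as pd

predicted = model.predict(X_test)
comparison = pd.DataFrame({"price": y_test, "scored label": predicted})
print(comparison.head())
```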
5.3 Towards the end, we check the quality of the results. Search for the Evaluate Model module and drag & drop it onto the experiment canvas. Connect the output port of the Score Model to the left input port of the Evaluate Model.
Figure: Connect Score Model and Evaluate Model
5.4 RUN the experiment.
Click on the output port of the Evaluate Model and select Visualize.
Figure: Output of the Evaluate Model
Note
The "Evaluate" model contains two input ports that can be used to compare output from two different models simultaneously. We may use two different algorithms in the experiment and use evaluate the model to check which one gives the better output.
Note
The difference between the predicted value and the actual value is the error.
The following statistics are shown for our model:
- Mean Absolute Error (MAE)
The average of the absolute errors is known as MAE.
- Root Mean Squared Error (RMSE)
It is calculated by taking the square root of the average of squared errors of predictions made on the test dataset.
- Relative Absolute Error
It is the total absolute error relative to the total absolute difference between the actual values and the mean of all actual values.
- Relative Squared Error
It is the total squared error relative to the total squared difference between the actual values and the mean of all actual values.
- Coefficient of Determination
It is a statistical metric that indicates how well the model fits the data. It is also known as the R-squared value.
For all of the error statistics, smaller is better: the smaller the error value, the closer the predicted values are to the actual values. In the case of the Coefficient of Determination, the closer its value is to one (1.0), the better the predictions.
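As a rough sketch of how these statistics could be computed by hand from the scored test set of the earlier sketches (`y_test` and `predicted`):

```python
# Compute the five evaluation statistics for the predictions on the test set.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

errors = y_test - predicted

mae = mean_absolute_error(y_test, predicted)           # Mean Absolute Error
rmse = np.sqrt(mean_squared_error(y_test, predicted))  # Root Mean Squared Error

# Relative errors compare the model's errors with those of always predicting the mean price.
rae = np.abs(errors).sum() / np.abs(y_test - y_test.mean()).sum()
rse = (errors ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()

r2 = r2_score(y_test, predicted)                       # Coefficient of Determination

print(mae, rmse, rae, rse, r2)
```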