In this article, we’ll learn about Automated Machine Learning in Azure Machine Learning. We’ll go through a step-by-step process to create machine learning models using different algorithms and approaches in a single run in Azure Machine Learning. We’ll then pick the best-performing algorithm of all those we test on the dataset and visualize its output. Here, we are trying to make predictions for the Boston Housing Dataset. You can learn about the theoretical aspects of Automated ML from my previous article, Auto ML. This article is a part of the Azure Machine Learning Series.
- Azure Machine Learning - Create Workspace for Machine Learning
- Azure Machine Learning - Create Compute Instance and Compute Cluster
- Azure Machine Learning - Writing Python Script in Notebook
- Azure Machine Learning - Model Training
- Azure Machine Learning - Linear Regression
- Azure Machine Learning - Model Deployment
- Azure Machine Learning - Auto ML
Prerequisite
This article is a follow-up to the previous article, Azure Machine Learning - Create Workspace for Machine Learning.
Let us get into the step-by-step process of using Automated ML in Azure Machine Learning.
Step 1
Visit ml.azure.com and create a Workspace for Machine Learning by following the article, Azure Machine Learning - Create Workspace for Machine Learning.
Once the deployment is complete, we can access the Workspace, and the welcome page of Azure Machine Learning Studio is displayed.
Step 2
Now, on the left-hand panel, click on Automated ML.
Here, select New Automated ML run.
Step 3
The process to create a new automated run will be initiated.
For any machine learning process, data is key, and preparing it is the first step. Click on the Create dataset button.
There are numerous ways we can bring in the data. For Automated ML, we can create a dataset from local files, a datastore, web files, and even from open datasets.
Boston Housing
Step 4
We’ll use a freely available dataset. One of the most widely used datasets for experimentation is the Boston Housing Dataset. Simply search for it. You can get it from Kaggle or the University of Toronto.
If you are using the dataset from the web, select web files and add the web URL.
If you want to use a local dataset, i.e. one you have downloaded to your own system, choose from local files.
Here, you browse the file in the Datastore section and add the file.
Here, my file is HoustingData.csv of size 34.19KB with the green tick mark status confirming that my file has been uploaded to the datastore.
Exploration
Step 5
Here, under the Settings and Preview section, we can explore the dataset, and under Schema we can choose which columns to include in our automated machine learning process.
Once we are all set, click on Create under Confirm details.
We are notified about the success of the dataset creation.
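Before uploading, it can help to run the same kind of check locally that the Settings and Preview and Schema screens offer. The following is a minimal sketch using only the Python standard library. The function name `summarize_columns`, the choice of "NA" as a missing-value token, and the tiny inline sample are all assumptions for illustration and are not taken from the actual file.

```python
import csv
import io

def summarize_columns(csv_file, missing_tokens=("", "NA")):
    """Return (column_names, missing_counts) for a CSV file-like object."""
    reader = csv.DictReader(csv_file)
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    for row in reader:
        for name in columns:
            value = row.get(name)
            # Short rows yield None; blank cells and "NA" count as missing too.
            if value is None or value.strip() in missing_tokens:
                missing[name] += 1
    return columns, missing

# Made-up stand-in for the uploaded file: three columns, two rows.
sample = io.StringIO("RM,LSTAT,MEDV\n6.5,NA,24.0\n6.4,9.1,21.6\n")
columns, missing = summarize_columns(sample)
print(columns)
print(missing)
```

To check a real file, pass `open("your_file.csv", newline="")` instead of the in-memory sample. Columns with many missing values are candidates to leave out in the Schema step.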
Configure for Compute
Step 6
Now, under Configure run, name your experiment and choose the target column. Here, mine is MEDV, which is the median value of owner-occupied homes in $1000’s. As we are predicting the price of Boston houses with our automated machine learning, this is the column that we choose. The column name may differ depending upon the dataset you choose. Either way, it must be the pricing column of the housing dataset.
Setting Compute Cluster
Step 7
For Automated ML, we are going to create and use a compute cluster. A compute instance is not required here, so we click on New and set up a Standard_DS3_v2 compute cluster with 4 cores, 14 GB RAM and 28 GB storage. This will cost us around $0.29 per hour.
Make sure the virtual machine tier is set to dedicated to save yourself from the hassle of slow processing later on.
Under Advanced Settings, name your compute and set the minimum and maximum number of nodes, along with the idle seconds before scale down.
Click on Create.
Once done, we can choose the compute cluster and click on Next.
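The cluster settings above translate into a rough cost estimate. The sketch below is simple arithmetic, not an official pricing calculation. The $0.29 per hour rate comes from this article, while the single node, the 30 minute run and the 120 second idle window are assumptions chosen for illustration. Actual billing may differ.

```python
def estimate_cluster_cost(hourly_rate_usd, nodes, run_minutes, idle_seconds=0):
    """Rough cost: each node is billed for the run plus the idle window before scale down."""
    billed_hours = (run_minutes * 60 + idle_seconds) / 3600
    return hourly_rate_usd * nodes * billed_hours

# Assumed scenario: one node, a 30 minute run, 120 idle seconds before scale down.
cost = estimate_cluster_cost(hourly_rate_usd=0.29, nodes=1, run_minutes=30, idle_seconds=120)
print(f"Estimated cost: ${cost:.3f}")
```

Setting the minimum number of nodes to 0 lets the cluster scale down fully once the idle window passes, which keeps this figure small.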
Machine Learning
Step 8
Here, for the automated machine learning run, we need to select the appropriate kind of machine learning task. Since we are predicting prices of Boston houses, it is a regression problem. Thus, here we select Regression and click on Next.
Under Additional configurations, select Normalized Root Mean Squared Error as the primary metric and tick Explain best model. This will help us visualize our best model later on.
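To make the chosen metric concrete, here is a minimal sketch of normalized root mean squared error. It assumes the RMSE is divided by the range of the actual target values, which is my understanding of how Automated ML normalizes it. The function name and the sample numbers are made up for illustration.

```python
import math

def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max minus min) of the actual values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("the actual values have zero range")
    return math.sqrt(mse) / value_range

# Made-up prices (in $1000's) and predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(round(normalized_rmse(actual, predicted), 4))
```

Because of the normalization, the score does not depend on the unit of the target column, and lower values are better.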
Under Use all supported models, make sure the option is unticked, then search for the regression algorithms you want to use. Here, I've used RandomForest, LightGBM, FastLinearRegressor and XGBoostRegressor.
Under the exit criterion, we can see the minimum training job time should exceed 30 minutes.
Thus, I’ve set my training job time to 30 minutes with metric score threshold of 0.085 and max concurrent iterations to 1.
Now, click on Save.
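The two exit criteria work together: the run ends as soon as either one is met. The sketch below shows that logic under the assumption of a lower-is-better metric such as normalized RMSE. The function `should_stop` is hypothetical and is not part of any Azure SDK. The 30 minute and 0.085 defaults are the values set above, and the scores passed in are made up.

```python
def should_stop(elapsed_minutes, best_score, timeout_minutes=30, score_threshold=0.085):
    """Return the reason to stop, or None to keep training.

    Assumes a lower-is-better metric such as normalized RMSE.
    """
    if best_score is not None and best_score <= score_threshold:
        return "score threshold reached"
    if elapsed_minutes >= timeout_minutes:
        return "training time exhausted"
    return None

print(should_stop(10, 0.120))  # neither criterion met yet
print(should_stop(12, 0.080))  # best score is at or below the threshold
print(should_stop(30, 0.090))  # time is up before the threshold is reached
```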
Featurization Settings
Step 9
Under Featurization, select which columns you want to include or exclude. Here, all the remaining columns are included as features for automated learning.
Validation and Testing
Step 10
Under Validate and test, let us set the validation type to Auto with no test dataset required. This is an optional step that lets us provide our own validation and test datasets. If you wanted to, you could split the initial dataset into training and evaluation sets beforehand and use the test dataset here.
Once done, click on Finish.
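If you do want to hold out your own test dataset, the split can be done before uploading. This is a minimal standard-library sketch. The function name, the 20% test fraction and the fixed seed are assumptions for illustration, and the row indices stand in for real dataset rows.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and return (train_rows, test_rows)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]

rows = list(range(100))  # stand-in for the dataset rows
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Each part can then be written to its own file and registered as a separate dataset, so the test rows are never seen during training.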
Experiment Run
Step 11
Now, the experiment is run and we can see its status in the Automated ML section.
We can explore the cluster node status from Compute, and follow the progress of the different algorithms we chose under Models in Automated ML.
Here, we can see the normalized root mean squared error values, the sampling, the time for which each model was run and the different hyperparameters.
Best Model Summary
Step 12
Finally, as the run stops after the training time, we can obtain the name of the best algorithm. Here, it's SparseNormalizer and RandomForest. We can see the normalized root mean squared error value is 0.08652, which is lower, and therefore better, than that of the other algorithms.
We can also view the hyperparameter details of the data transformation and the training algorithm.
Data transformation:
{
    "class_name": "SparseNormalizer",
    "module": "automl.client.core.common.model_wrappers",
    "param_args": [],
    "param_kwargs": {
        "norm": "max"
    },
    "prepared_kwargs": {},
    "spec_class": "preproc"
}
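The "norm": "max" setting suggests that each row is rescaled by its largest absolute value. The sketch below shows that idea in plain Python. It assumes SparseNormalizer behaves like per-sample max normalization, and it is an illustration rather than the Automated ML implementation.

```python
def max_normalize_rows(rows):
    """Divide every value in a row by the row's largest absolute value."""
    normalized = []
    for row in rows:
        scale = max((abs(value) for value in row), default=0.0)
        # Rows of all zeros are left unchanged to avoid dividing by zero.
        normalized.append([value / scale for value in row] if scale else list(row))
    return normalized

print(max_normalize_rows([[2.0, -4.0, 1.0], [0.0, 0.0, 0.0]]))
# prints [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0]]
```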
Training algorithm:
{
    "class_name": "RandomForestRegressor",
    "module": "sklearn.ensemble",
    "param_args": [],
    "param_kwargs": {
        "bootstrap": false,
        "criterion": "mse",
        "max_features": 0.4,
        "min_samples_leaf": 0.001953125,
        "min_samples_split": 0.005285388593079247,
        "n_estimators": 25
    },
    "prepared_kwargs": {},
    "spec_class": "sklearn"
}
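The fractional values above are easier to read once converted to counts. The sketch below applies the rounding rules as I understand them from the scikit-learn documentation: fractions of the sample count are rounded up, and the feature fraction is truncated. The helper name, the 400 training rows and the 13 feature columns are assumptions for illustration.

```python
import math

def resolve_forest_fractions(n_samples, n_features, min_samples_leaf,
                             min_samples_split, max_features):
    """Convert fractional random forest settings into absolute counts."""
    leaf = math.ceil(min_samples_leaf * n_samples)
    split = max(2, math.ceil(min_samples_split * n_samples))
    split = max(split, 2 * leaf)  # a split must leave room for two valid leaves
    features = max(1, int(max_features * n_features))
    return {"min_samples_leaf": leaf, "min_samples_split": split, "max_features": features}

# Assumed sizes: 400 training rows and 13 feature columns.
resolved = resolve_forest_fractions(
    n_samples=400,
    n_features=13,
    min_samples_leaf=0.001953125,
    min_samples_split=0.005285388593079247,
    max_features=0.4,
)
print(resolved)
# prints {'min_samples_leaf': 1, 'min_samples_split': 3, 'max_features': 5}
```

Under these assumed sizes, each of the 25 trees considers 5 of the 13 features at every split and may grow leaves holding a single sample.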
Visualization
Step 13
We can also view the run metrics and plot graphs for the different error values, predicted values, residuals and explained variance.
Graph: Residual Histogram - Bin Count VS Residuals
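A residual histogram like this one can be reproduced by hand. The sketch below counts residuals per bin using only the standard library. It assumes a residual is the predicted value minus the actual value, so flip the sign if your chart uses the opposite convention. The bin width and the sample numbers are made up.

```python
import math

def residual_histogram(y_true, y_pred, bin_width=1.0):
    """Count residuals (predicted minus actual) per bin, keyed by the bin's lower edge."""
    counts = {}
    for actual, predicted in zip(y_true, y_pred):
        residual = predicted - actual
        bin_start = math.floor(residual / bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))

# Made-up prices (in $1000's) and predictions.
actual = [24.0, 21.6, 34.7, 33.4]
predicted = [25.5, 21.0, 33.0, 33.9]
print(residual_histogram(actual, predicted))
```

A histogram centered near zero with few counts in the outer bins indicates that the model is not systematically over- or under-predicting.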
Also, we can view the Data Transformation graph of the algorithms with the illustration of data preprocessing, scaling techniques and feature engineering which the Azure Automated ML used to generate the best model.
Test Model
Step 14
We can also click on Test Results, select the compute cluster and add in the dataset to test out the output. Remember, this testing dataset should always be different from the one we used to train.
Explanations of Important Features
Step 15
One of the key functionalities provided by Azure Machine Learning is aggregate feature importance explanation and visualization for Automated ML. Here, simply choose the total number of important features you want to visualize and you can see the aggregate feature importance in a chart.
Here, we can see RM, NOX, PTRATIO and DIS, which represent the average number of rooms per dwelling, the nitric oxides concentration, the pupil-teacher ratio by town and the weighted distances to five Boston employment centres respectively. These being the most important features in the dataset looks logical.
Also, you can choose the explanation ID and explore other visualizations based on each feature. Here, I’ve selected RM and Index as the Y and X values respectively.
We can also see the global importance graph for the Top K features in order. Here, I’ve selected 8, and we can see the 8 feature columns which are most responsible for the prediction.
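To build some intuition for what a Top K importance ranking means, here is a small sketch that uses permutation importance: shuffle one feature column at a time and measure how much the error grows. This is a simpler, swapped-in technique and not necessarily how Azure computes its explanations. The toy model, the feature names "a", "b" and "c", and the data are all made up.

```python
import random

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in mean squared error when each feature column is shuffled."""
    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    baseline = mse(rows)
    rng = random.Random(seed)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(permuted) - baseline)
    return importances

def top_k(names, importances, k):
    """Return the k (name, importance) pairs with the highest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy model: depends strongly on "a", weakly on "b" and never on "c".
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1]

rows = [[float(i), float((i * 7) % 10), float((i * 3) % 10)] for i in range(10)]
targets = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, targets, n_features=3)
print(top_k(["a", "b", "c"], scores, k=2))
```

The unused feature "c" gets an importance of exactly zero, because shuffling it cannot change the predictions, so it drops out of the top 2.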
We can also view the model performance and evaluate it through the graph.
Delete Resource
Step 16
Once we are done with the resources and our work is complete, make sure you delete all the resources from the Azure Portal to avoid incurring any further charges.
Click on Delete resource group, retype your resource group name and then click on Delete.
Conclusion
Thus, in this article, we learned about Automated ML. We explored the Automated ML functionality offered by Azure Machine Learning Studio and went through a step-by-step process to implement regression on the Boston Housing Dataset. We used various algorithms such as Random Forest, Light GBM, XGBoost and Fast Linear Regressor. We then got our best algorithm and visualized the most important features using the aggregate feature importance functionality. With this, we’ve learned to train and compare numerous algorithms in a single automated machine learning run in Azure Machine Learning Studio to produce the best result.