Azure Machine Learning Pipelines
This article discusses Azure Machine Learning Pipelines and how they help in building, optimizing, and managing machine learning workflows. Azure Machine Learning enables developers and data scientists to build and manage a wide range of machine learning processes, and Azure Machine Learning Pipelines are a part of that offering.
Closely related is Azure Pipelines, Azure's CI/CD offering, which helps build, test, and deploy continuously to any platform and cloud. It supports Linux, macOS, and Windows for building web, mobile, and desktop applications and deploying them either on the cloud or on-premises. Automated builds and deployments with unattended runs save valuable time for data scientists, engineers, and DevOps teams, among a plethora of other advantages. Languages such as Python, Java, PHP, Ruby, C/C++, and Node.js, as well as iOS and Android apps, can be built in parallel across numerous operating systems. Containerization and container orchestration services such as Kubernetes are supported, with continuous delivery to cloud services within Azure itself as well as to AWS and GCP. Build chaining and multi-phased builds are supported with YAML, release gates, test integration, and reporting.
There are a variety of pipelines supported by Azure, each for different purposes and use-case scenarios. The table below briefly describes the various pipelines available for different roles and use-case scenarios.
| Scenario | Primary Persona | Azure Offering | OSS Offering | Canonical Pipe | Strengths |
| --- | --- | --- | --- | --- | --- |
| Data orchestration (Data prep) | Data Engineer | Azure Data Factory Pipelines | Apache Airflow | Data -> Data | Strongly typed movement, data-centric activities |
| Model orchestration (Machine learning) | Data Scientist | Azure Machine Learning Pipelines | Kubeflow Pipelines | Data -> Model | Distribution, caching, code-first, reuse |
| Code & app orchestration (CI/CD) | App Developer / DevOps | Azure Pipelines | Jenkins | Code + Model -> App/Service | Most open and flexible activity support, approval queues, phases with gating |
Advantages of using Machine Learning Pipelines for the workflow:
Independent
With an Azure Machine Learning Pipeline, the steps can be scheduled to run in sequence or in parallel, independently and without manual intervention. This frees engineers and data scientists to focus on the rest of their tasks while the pipeline takes care of the scheduled work.
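As a sketch of such unattended runs (this assumes azureml-core, an authenticated workspace, and an already published pipeline; the schedule name, pipeline id, and times are illustrative, and the snippet needs live Azure credentials to actually run):

```python
from azureml.core import Workspace
from azureml.pipeline.core import Schedule, ScheduleRecurrence

ws = Workspace.from_config()  # requires a config.json and Azure credentials

# Run the published pipeline every day at 07:00 with no manual intervention.
recurrence = ScheduleRecurrence(frequency="Day", interval=1,
                                hours=[7], minutes=[0])
schedule = Schedule.create(
    ws,
    name="daily-dataprep",                  # illustrative schedule name
    pipeline_id="<published-pipeline-id>",  # id of a previously published pipeline
    experiment_name="MyExperiment",
    recurrence=recurrence,
)
```

Once created, the schedule submits runs on its own; engineers only need to check the experiment's run history.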
Reusable
One of the key advantages of the Machine Learning Pipeline is the ability to reuse a process. With pipeline templates, batch scoring and retraining can be performed without much rework. A pipeline can also be published as a REST endpoint, so external systems can trigger it with simple REST calls.
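For instance (a sketch: the publish step needs azureml-core and a completed pipeline run, so it is shown in comments; the request body follows the ExperimentName / ParameterAssignments shape the published-pipeline endpoint expects, and the names are illustrative):

```python
def build_trigger_payload(experiment_name, parameters=None):
    """Assemble the JSON body for triggering a published pipeline over REST."""
    body = {"ExperimentName": experiment_name}
    if parameters:
        # Optional overrides for the pipeline's declared parameters.
        body["ParameterAssignments"] = parameters
    return body

# With azureml-core and a completed pipeline_run:
#   published = pipeline_run.publish_pipeline(
#       name="Prep-and-Train", description="batch scoring", version="1.0")
#   import requests
#   requests.post(published.endpoint,
#                 headers={"Authorization": "Bearer " + aad_token},
#                 json=build_trigger_payload("MyExperiment",
#                                            {"learning_rate": 0.01}))
```

The payload builder itself is plain Python; only the commented calls require a live workspace and an AAD token.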
Assorted Computation
Azure provides a plethora of options for compute resources and storage locations. With pipelines, these resources and locations can be intermixed and used in a reliably coordinated fashion. Pipelines can run on numerous targets such as Azure Databricks, GPU Data Science VMs, and HDInsight.
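A sketch of mixing compute targets within one pipeline (this assumes azureml-core and two hypothetical compute targets named "cpu-cluster" and "gpu-cluster" already attached to the workspace; it needs live Azure credentials to run):

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Each step can use a different compute target: cheap CPUs for data prep,
# GPUs only for the training step that actually needs them.
cpu_cluster = ws.compute_targets["cpu-cluster"]
gpu_cluster = ws.compute_targets["gpu-cluster"]

prep = PythonScriptStep(name="prep", script_name="prep.py",
                        source_directory="prep_src",
                        compute_target=cpu_cluster)
train = PythonScriptStep(name="train", script_name="train.py",
                         source_directory="train_src",
                         compute_target=gpu_cluster)

pipeline = Pipeline(workspace=ws, steps=[prep, train])
```

Azure coordinates the two targets and moves intermediate data between them.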
Modular
As discussed in the previous article Common Software Engineering Practices For Production Code, Modularity is crucial to churn out high-quality software as rapidly as possible. Pipelines help isolate changes and separate areas of concern to make the system and process modular.
Versioning and Tracking
The pipeline SDK supports the explicit naming and versioning of data sources, inputs, and outputs, making it easier to version and track data and result paths. Scripts and data can also be managed separately for increased productivity.
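A sketch of such explicit versioning (assuming azureml-core and the same workspaceblobstore path used later in this article; the dataset name is illustrative, and the snippet needs live Azure credentials to run):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Register a dataset by name; create_new_version=True bumps the version
# instead of overwriting, so each run can record exactly which data it used.
dataset = Dataset.File.from_files(
    (ws.datastores["workspaceblobstore"], "20newsgroups/20news.pkl"))
dataset = dataset.register(workspace=ws, name="newsgroups",
                           create_new_version=True)
print(dataset.version)  # the newly assigned version number
```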
Partnership
Pipelines enable data scientists to partner with teams across the machine learning design and development process. Thanks to the modularity of the pipeline, this collaboration does not prevent data scientists from concurrently working on their own pipeline steps.
Creating Pipelines
Pipelines can be created in two ways: with an SDK such as the Azure Machine Learning Python SDK, or with the Azure Machine Learning Designer.
With the Azure Machine Learning Designer, the data flow can be designed with a drag-and-drop interface. The inputs and outputs of each step are displayed visually while designing the pipeline. This tool can be accessed from the homepage of the workspace under the Designer selection.
While building a pipeline with an SDK such as the Python SDK, the pipeline is defined as a Python object in the azureml.pipeline.core module. A Pipeline object contains one or more PipelineStep objects. DataTransferStep, PythonScriptStep, and EstimatorStep are concrete subclasses of the abstract PipelineStep class. A reusable sequence of steps can be captured in a ModuleStep, which can be shared among pipelines. The pipeline runs as part of an Experiment.
The following Python snippet shows the calls and objects needed to create and run a pipeline.
from azureml.core import Workspace, Experiment, Datastore, Dataset
from azureml.data import OutputFileDatasetConfig
from azureml.data.datapath import DataPath
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
blob_store = Datastore(ws, "workspaceblobstore")
compute_target = ws.compute_targets["STANDARD_NC6"]  # name of an existing compute target
experiment = Experiment(ws, 'MyExperiment')

# Input dataset and an output location for the prepared data
input_data = Dataset.File.from_files(
    DataPath(blob_store, '20newsgroups/20news.pkl'))
prepped_data_path = OutputFileDatasetConfig(name="output_path")

# Step 1: data preparation
dataprep_step = PythonScriptStep(
    name="prep_data",
    script_name="dataprep.py",
    source_directory="prep_src",
    compute_target=compute_target,
    arguments=["--prepped_data_path", prepped_data_path],
    inputs=[input_data.as_named_input('raw_data').as_mount()])

prepped_data = prepped_data_path.read_delimited_files()

# Step 2: training, consuming the output of the prep step
train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="train_src",
    compute_target=compute_target,
    arguments=["--prepped_data", prepped_data])

steps = [dataprep_step, train_step]
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion()
Tasks a Machine Learning Pipeline Can Focus On
Azure Machine Learning pipelines focus on a multitude of machine learning tasks. An Azure Machine Learning pipeline is the workflow of an entire machine learning task and is independently executable. Within the pipeline, subtasks are encapsulated as a series of steps. Even something as small as a single Python script call can be an Azure Machine Learning pipeline. The pipeline covers the entire machine learning workflow through and through, from data preparation, which includes importing, validation, cleansing, transformation, and staging, to deployment, with versioning, scaling, access control, and provisioning.
The pipeline can also focus on training configurations, which consist of parameterized arguments, file paths, and logging and reporting configurations, as well as repeated validation for efficiency by specifying certain data subsets, compute resources, and progress monitoring.
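As a sketch of such parameterization (assuming azureml-core; the parameter name, script paths, and compute_target are illustrative, and the step only runs inside a real workspace):

```python
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# A pipeline parameter gives an argument a default value that can be
# overridden per submission (or per REST trigger) without editing the
# pipeline definition itself.
learning_rate = PipelineParameter(name="learning_rate", default_value=0.01)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="train_src",
    compute_target=compute_target,  # assumed to exist, as in the earlier snippet
    arguments=["--learning_rate", learning_rate],
)
```

At submission time the parameter can be overridden, for example via the ParameterAssignments section of a REST trigger.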
Since the steps are independent, numerous data scientists can work on the same pipeline at the same time without over-burdening the compute resources. The separation of steps also makes it convenient to use different compute types and sizes for each individual step. Azure coordinates the various compute targets, so intermediate data flows seamlessly to downstream targets. All metrics for the pipeline experiments can also be tracked.
Thus, the pipeline can support the entire machine learning life cycle. Check out Azure Pipelines to see how it can enable the CI/CD process too. With the sophisticated dependency analysis in Azure Machine Learning pipelines, the steps can run on distributed resources and environments with orchestration tools. All dependencies between pipeline steps are orchestrated automatically, which includes spinning Docker images up and down, moving data consistently and automatically between steps, and attaching and detaching compute resources.
Conclusion
In this article, we learned about Azure Machine Learning Pipelines. We looked at the varieties of pipelines supported by Azure, which serve different purposes in specific scenarios depending on their strengths. Then we learned about the advantages of machine learning pipelines and how they enable engineers and data scientists to contribute to their machine learning workflow for seamless collaboration. We were then introduced to the two ways of creating and building pipelines, with details of the process, and finally learned about the tasks a Machine Learning pipeline focuses on in the workflow and how it does so.