Introduction
This article briefly introduces the orchestration tool Apache Airflow. To understand Airflow, it helps to first understand what a workflow is. A workflow is a sequence of tasks that processes a set of data, or a series of steps that need to be executed. Wikipedia defines it as follows: “A workflow consists of an orchestrated and repeatable pattern of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. It can be depicted as a sequence of operations, the work of a person or group, the work of an organization of staff, or one or more simple or complex mechanisms.”
In a nutshell, Apache Airflow is a platform for authoring, scheduling (let go of old cron jobs), and monitoring data and compute workflows. It was developed at Airbnb and is now maintained under the Apache Software Foundation.
Features
Apache Airflow is open source, and it uses standard Python to create workflows. It has a collection of ready-to-use operators that can work with databases such as MySQL and Oracle, as well as with cloud platforms like Azure, AWS, and Google Cloud. Operators are explained in detail in the latter part of the article.
It also has an extremely intuitive UI for monitoring and managing workflows: we can check the status of running tasks and stop or trigger them on demand.
DAGs
DAG stands for Directed Acyclic Graph: a graph with nodes and edges in which every edge is directed and there are no loops. In a nutshell, a DAG is a data pipeline, and a node in a DAG is a task such as “Download a file from S3” or “Query a MySQL database”.
In the DAG above, A, B, and C represent tasks, and each of them is independent. This is a valid DAG because there are no loops.
The DAG above is incorrect because it contains a loop; such a DAG would never complete and would keep running forever.
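To make the dependency wiring concrete, here is a minimal sketch of a three-task DAG, assuming the same older (1.x-style) import paths used in the example later in this article and the placeholder DummyOperator; the >> operator creates the directed edges, and since no edge points back to an earlier task, there is no loop:

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

# Three placeholder tasks wired as A -> B -> C
with DAG('simple_abc_dag',
         start_date=days_ago(1),
         schedule_interval=None) as dag:
    task_a = DummyOperator(task_id='A')
    task_b = DummyOperator(task_id='B')
    task_c = DummyOperator(task_id='C')

    # Directed edges A -> B -> C; nothing points back, so the graph is acyclic
    task_a >> task_b >> task_c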
Let’s understand the DAG with a realistic example. Say an organization receives a CSV file on S3 every day. The CSV file then needs to be aggregated and normalized, say using Pandas, the processed data has to be uploaded to MongoDB, and after processing the source file has to be moved to another S3 location.
Operator
As discussed, the nodes in a DAG are tasks such as Download, Aggregate, and Upload. These tasks are executed as operators: our DAG has 4 nodes, and each node/task will be executed as an operator. Airflow provides various ready-to-use operators that can be leveraged:
For downloading a file from S3 and uploading it to another S3 location, the existing “S3FileTransformOperator” can be used.
For executing a Python function, as step 2 does with its Pandas program, the PythonOperator is available.
Similarly, for interacting with MongoDB or any other database, Airflow has existing operators.
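Putting the pieces together, a rough sketch of the example pipeline might look like the following. This is only an illustration under assumptions: the bucket names, the transform-script path, and the aggregate_and_normalize/load_into_mongodb helpers are hypothetical, and the import paths follow the older 1.x-style layout used in the example below (newer Airflow releases move these operators into provider packages).

from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.s3_file_transform_operator import S3FileTransformOperator
from airflow.utils.dates import days_ago

def aggregate_and_normalize():
    # hypothetical helper: clean and aggregate the downloaded CSV with Pandas
    pass

def load_into_mongodb():
    # hypothetical helper: insert the processed records into MongoDB
    pass

with DAG('daily_csv_pipeline',
         start_date=days_ago(1),
         schedule_interval=timedelta(days=1)) as dag:

    # Step 1: fetch the daily CSV from S3; the operator downloads source_s3_key,
    # runs transform_script on it, and uploads the result to dest_s3_key
    fetch_csv = S3FileTransformOperator(
        task_id='fetch_csv_from_s3',
        source_s3_key='s3://incoming-bucket/daily.csv',      # assumed bucket/key
        dest_s3_key='s3://staging-bucket/daily.csv',         # assumed bucket/key
        transform_script='/usr/local/bin/noop_copy.sh')      # assumed script path

    # Step 2: aggregate and normalize the CSV with Pandas
    process_csv = PythonOperator(task_id='aggregate_normalize',
                                 python_callable=aggregate_and_normalize)

    # Step 3: upload the processed data to MongoDB
    load_mongo = PythonOperator(task_id='load_to_mongodb',
                                python_callable=load_into_mongodb)

    # Step 4: move the source file to an archive location on S3 after processing
    archive_source = S3FileTransformOperator(
        task_id='archive_source_file',
        source_s3_key='s3://incoming-bucket/daily.csv',
        dest_s3_key='s3://archive-bucket/daily.csv',
        transform_script='/usr/local/bin/noop_copy.sh')

    # Directed edges: fetch -> process -> (load to MongoDB, archive the source)
    fetch_csv >> process_csv >> [load_mongo, archive_source]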
Writing a Simple DAG
from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

# Default arguments applied to every task in the DAG unless overridden
default_args = {
    'start_date': days_ago(1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

# The DAG object ties the tasks together and defines the schedule
dag = DAG('hello_world_dag',
          description='Simple Hello World DAG',
          default_args=default_args,
          schedule_interval=timedelta(days=1))

# The Python callable executed by the task (named print_hello so it does not
# shadow Python's built-in print)
def print_hello():
    return 'Hello World'

task = PythonOperator(task_id='print_hello_world',
                      python_callable=print_hello,
                      dag=dag)

task  # a single task, so there are no dependencies to declare
On the Airflow UI, we can see that our DAG has been imported and is ready for execution.
The program is straightforward: the DAG object defines our DAG’s creation properties, such as its name, start date, and retry parameters. Since we are using a simple Python function that returns “Hello World”, a PythonOperator is used.
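To try the task without waiting for the scheduler, the Airflow CLI can also run it once in isolation; depending on your installed version this is roughly "airflow tasks test hello_world_dag print_hello_world 2021-01-01" on Airflow 2.x, or "airflow test hello_world_dag print_hello_world 2021-01-01" on older 1.x releases.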
Summary
In this article, we have covered what Airflow is, its features, and its two major building blocks, the DAG and the operator. This much knowledge is sufficient to start working with Airflow. More information on Airflow will be provided in upcoming articles.