Data Pipeline Essentials: Apache Airflow
Apache Airflow is an open-source workflow management tool that gives users a platform to create, schedule, and monitor workflows.
Airflow was created by the vacation rental site Airbnb to provide a platform that could keep up with an ever-growing number of complex workflows and data pipelines. The code was open sourced and, in 2016, donated to the Apache Software Foundation. Written in Python, Apache Airflow includes libraries for creating complex data pipelines, tools for running and monitoring jobs, a web user interface, and a rich set of command-line utilities.
Airflow has become an extremely important tool for many Data Science and Engineering tasks and continues to grow in popularity because of its strong capabilities, flexibility, and ease of use. In this blog post, we will introduce Airflow, its worldview, and its features, and describe some of the common ways it is used.
Directed Acyclic Graph (DAG)
In a typical development workflow, developers write batch jobs that depend on data from other systems, so each connection between a data source and a job has to be configured by hand. The jobs need to run on a schedule, other jobs depend on their success, and the work has to be farmed out to worker nodes for execution. The point is that building out a workflow system from scratch quickly becomes a complicated monstrosity. Add a fast-paced data organization with an evolving data ecosystem, and you end up with a tangled network of computational tasks. This is the challenge Airbnb faced and overcame by building a system that models these "workflows" as directed acyclic graphs (DAGs) of tasks.
Modeled as DAGs, these workflows become networks of discrete jobs with explicit dependencies, which smooths out the whole process.
These DAGs have the following properties:
- Scheduled - each job runs at a configured schedule interval or on a unique trigger
- Mission critical - if a job fails to run, alerts are set off
- Evolving - data pipelines and processing can change as the organization matures
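As a minimal sketch of the idea, the hypothetical DAG below wires three illustrative tasks (extract, transform, load) into a daily schedule; the dag_id, commands, and schedule are assumptions for illustration only.

```python
# A minimal sketch of a workflow expressed as an Airflow DAG.
# Task names, commands, and the schedule are illustrative, not from the original post.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_nightly_pipeline",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # "Scheduled": runs once per day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Dependencies form the edges of the directed acyclic graph:
    # extract must succeed before transform, which must succeed before load.
    extract >> transform >> load
```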
Workflow management is a common enough concept to start with, but as time passes and systems scale, complexity creeps in. Each new task requires its own configuration, scheduling, and troubleshooting, adding to a growing level of inefficiency. Airflow was created to remedy these frustrations in scaled workflow management scenarios.
The team at Airbnb set out to build a system that could handle workflows from numerous areas within the company. To name a few of the processes Airbnb uses Airflow for:
- Data storing - cleaning, organization, quality assurance, and the publishing of data into a warehouse
- Growth analytics - compute metrics around service engagements and growth accounting
- Experimentation - execute A/B testing frameworks logic
- Email targeting - target and engage users through email campaigns
- Tracking - aggregation of clickstreams and user tracking
- Search - the gathering of search ranking metrics
- Infrastructure maintenance - database scraping and cleaning
These aren't the only processes that can be fueled by Airflow. Almost any workflow can be integrated into Airflow to automate and monitor the execution of jobs.
Manage Your Pipelines
Airflow is built to be compatible with your organization's infrastructure
Python from the Ground Up
Airflow is written in Python with a codebase that is extensible and supported by clear documentation, an ever-growing open-source community, and extensive coverage of unit tests. Pipeline authoring is configured using Python as well; this makes creating workflows easier - and allows you to write workflows like you're writing a program. The result of this is that developers can quickly feel comfortable with Airflow.
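For example, the hedged sketch below uses the TaskFlow API from Airflow 2.x, where plain Python functions become tasks; the DAG name, schedule, and data are illustrative assumptions.

```python
# A sketch using Airflow's TaskFlow API (Airflow 2.x), which lets pipeline
# logic be written as ordinary Python functions. All names and values are
# illustrative.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def metrics_pipeline():            # hypothetical DAG name
    @task
    def fetch_events():
        # Stand-in for a real extraction step.
        return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]

    @task
    def total_clicks(events):
        return sum(e["clicks"] for e in events)

    # The task dependency is inferred from the ordinary function call.
    total_clicks(fetch_events())


metrics_pipeline()
```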
Components
The architecture of Airflow contains the following components.
- CLI (command-line interface) - For access and management of your DAGs
- Job definitions - Defined in source control
- Scheduler - Manages the scheduling of tasks so they execute at their assigned frequency. Queues up DAG runs so their tasks can be performed, and can retry failed tasks if retries are configured (see the sketch after this list)
- Webserver - User interface frontend for Airflow built on top of the Flask Python web framework. Users can view a visualization of DAGs and their respective tasks, take manual control of them (enable/disable, retry), and view logs
- Executor - Responsible for running jobs and controls which worker runs which task. Executors such as the CeleryExecutor or KubernetesExecutor can distribute tasks across multiple workers. By default, Airflow uses the SequentialExecutor, which runs one task at a time on the local machine
- Backend - Stores the configuration and state of all DAGs and tasks; MySQL or PostgreSQL are commonly used. By default, Airflow uses SQLite as the backend.
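As referenced in the scheduler item above, retry behavior is declared on the tasks themselves through default_args. The sketch below is illustrative only; the dag_id, retry count, and command are assumptions.

```python
# A small sketch of how retry behavior can be declared so the scheduler
# re-queues failed tasks; the values here are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                          # scheduler retries a failed task 3 times
    "retry_delay": timedelta(minutes=5),   # wait between attempts
}

with DAG(
    dag_id="retry_example",                # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    BashOperator(task_id="flaky_step", bash_command="echo run")
```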
Monitoring
Airflow provides excellent methods of monitoring. Besides the user interface that allows a user to view the status of tasks, it can send notifications when specific DAGs or tasks fail. Logs can be seen from the web UI. Airflow's monitoring features provide users with a significant understanding of how their workflows are executing.
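As one hedged example of what that alerting can look like, the sketch below attaches an email alert and a failure callback to a hypothetical DAG; the email address, callback body, and dag_id are assumptions.

```python
# A sketch of failure notifications; the callback body and addresses are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_on_failure(context):
    # In a real deployment this might post to Slack or a paging system.
    task_instance = context["task_instance"]
    print(f"Task {task_instance.task_id} failed in DAG {task_instance.dag_id}")


with DAG(
    dag_id="monitored_pipeline",           # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "email": ["oncall@example.com"],   # assumed address
        "email_on_failure": True,          # built-in email alerting
        "on_failure_callback": notify_on_failure,
    },
) as dag:
    BashOperator(task_id="load", bash_command="echo load")
```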
Lineage
Lineage tracks the origins of data, what happens to it, and where it moves over time. This feature becomes extremely helpful when you are executing multiple data tasks that read from and write to storage. Users declare the input and output data sources for each task, and a graph is built that shows the relationships between the various data sources; the lineage metadata can also be pushed to a backend such as Apache Atlas.
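A rough sketch of declaring lineage, assuming the Airflow 2.x lineage API (airflow.lineage.entities) and illustrative file locations, might look like this:

```python
# A sketch of declaring lineage on a task via inlets/outlets; the bucket
# and file paths are made up for illustration.
from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import File
from airflow.operators.bash import BashOperator

raw = File(url="s3://example-bucket/raw/events.csv")          # assumed location
clean = File(url="s3://example-bucket/clean/events.parquet")  # assumed location

with DAG(
    dag_id="lineage_example",              # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="clean_events",
        bash_command="echo cleaning",
        inlets=[raw],      # data this task reads
        outlets=[clean],   # data this task produces
    )
```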
Sensors
Sensors let tasks execute automatically once a trigger condition is met, such as a file landing in storage or a partition appearing in a table. Users specify the type of sensor, how often it checks the condition, and how long it should wait. This feature allows users to build continuous, event-driven systems; for example, continuous integration and continuous deployment can be used to test and deploy your analytic data models.
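For instance, the hedged sketch below uses a FileSensor to hold downstream work until a file appears; the connection id, file path, and timing values are assumptions.

```python
# A sketch of a sensor gating downstream work until a file appears;
# connection id, path, and timings are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="sensor_example",               # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_export = FileSensor(
        task_id="wait_for_export",
        fs_conn_id="fs_default",             # assumed filesystem connection
        filepath="/data/exports/daily.csv",  # assumed path
        poke_interval=60,                    # check once a minute
        timeout=60 * 60,                     # give up after an hour
    )
    process = BashOperator(task_id="process", bash_command="echo process")

    wait_for_export >> process
```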
Extensibility
Airflow is built to fit into any organization's workflow process. Out of the box, Airflow can easily interact with systems including Hive, MySQL, HDFS, Postgres, S3, and many more. Each integration is handled through the concepts of hooks and operators.
Hooks provide a unified interface for interacting with other systems by abstracting away their APIs. They draw on a central pool of connections, keyed by host, port, login, and password, to access those systems.
Operators use hooks to execute specific tasks in each system; they tell the systems to perform the actions that become nodes in the DAGs. Specialized operators, such as transfer operators and sensors, move data between systems and monitor for certain criteria.
Airflow also allows users to extend the platform by writing their own operators and sensors. All the code is written in Python, which makes integration easy for developers.
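As a hedged illustration of that extensibility, the sketch below defines a custom operator that uses PostgresHook (from the separate apache-airflow-providers-postgres package) to count rows in a table; the operator name, table, and connection id are assumptions.

```python
# A sketch of a custom operator built on a hook. The class, table, and
# connection id are illustrative; real implementations will differ.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class RowCountOperator(BaseOperator):
    """Counts rows in a table and logs the result (illustrative only)."""

    def __init__(self, table: str, postgres_conn_id: str = "postgres_default", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        # The hook resolves host, port, login, and password from the
        # centrally stored connection, keeping credentials out of DAG code.
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        count = hook.get_first(f"SELECT COUNT(*) FROM {self.table}")[0]
        self.log.info("Table %s has %s rows", self.table, count)
        return count
```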
A Necessity for Data Operations
Bringing your organization into the world of data science is easier said than done. Along with finding the right engineers, a massively complex process of organization awaits. From data mining to machine learning, data operations are what fill in-between it all. Every organization's data pipelines, workflows, and ETL processes quickly become a complicated and claustrophobic experience. That is why Apache Airflow was created - to manage the vast network(s) of workflows to make it easier to execute and monitor your data pipelines. When constructing your infrastructure to support data operations, Airflow is an essential tool for managing workflows.