Airflow (Unstable)
Airflow Integration guidelines
Apache Airflow
Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows. It allows users to define workflows as directed acyclic graphs (DAGs) of tasks, where each node represents a task and the edges define dependencies between them. Airflow is highly extensible, enabling users to create custom plugins and operators to extend its functionality.
Concur Airflow Plugin Overview
The Concur Airflow plugin integrates Concur's data discovery and lineage capabilities with Apache Airflow. It supports both automatic and manual lineage tracking, giving visibility into data pipelines. Connection setup is described first, followed by the plugin's features in the numbered sections.
To connect Airflow to Concur, open the Airflow UI, go to Admin -> Connections, and click the "+" button to create a new connection. Select "REST Server" from the "Connection Type" dropdown and fill in the remaining fields with the values for your Concur instance.
<UI Screenshots from Mapping Team>
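Connections can also be supplied outside the UI. The sketch below assumes Airflow 2.3 or later, which accepts a JSON-serialized connection in an environment variable named AIRFLOW_CONN_<CONN_ID>. The connection id, the conn_type value, the host, and the token are all placeholders: the internal identifier behind the "REST Server" entry in the dropdown is not stated here, so check it in your own installation before relying on this.

```python
import json
import os

# Hypothetical connection id; use whatever id the plugin is configured to read.
CONN_ID = "concur_rest_default"

# Placeholder values. "http" is only a stand-in for the connection type that
# appears as "REST Server" in the UI dropdown.
connection = {
    "conn_type": "http",
    "host": "http://concur.example.com:8080",
    "password": "<access-token>",
}


def export_connection(conn_id: str, conn: dict) -> str:
    """Store the connection as JSON in Airflow's AIRFLOW_CONN_<ID> variable."""
    env_name = f"AIRFLOW_CONN_{conn_id.upper()}"
    os.environ[env_name] = json.dumps(conn)
    return env_name


env_name = export_connection(CONN_ID, connection)
print(env_name)
```

Setting the variable from Python only affects the current process. In a real deployment the same name and JSON value would normally be set in the environment of the scheduler and workers.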
1. Automatic Column-Level Lineage Extraction
The plugin supports automatic extraction of column-level lineage from various operators. This capability ensures that data transformations and movements are accurately tracked at a granular level. Supported operators include:
SQL Operators:
MySqlOperator
PostgresOperator
SnowflakeOperator
BigQueryInsertJobOperator
And more
File Transform Operators:
S3FileTransformOperator
And more
SQL-based transformations and file manipulations performed by these operators are captured and reflected in the Concur lineage graph.
2. Airflow DAG and Task Metadata
The plugin collects and exports metadata about Airflow DAGs and tasks, including:
Properties: Details about the DAGs and tasks such as descriptions, schedules, and parameters.
Ownership: Information about the owners or maintainers of the DAGs and tasks, facilitating accountability and collaboration.
Tags: Custom tags that can be assigned to DAGs and tasks for better categorization and searchability.
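As an illustration of where this metadata comes from, the sketch below defines a DAG whose description, schedule, owner, and tags are set explicitly. The DAG id, task id, owner, and tag values are hypothetical. It assumes Airflow 2.4 or later, where the scheduling argument is named schedule (older 2.x releases use schedule_interval), and it imports Airflow lazily so the metadata can be inspected without Airflow installed.

```python
from datetime import datetime

# Hypothetical DAG settings covering the three metadata groups above:
# properties (description, schedule), ownership (owner), and tags.
DAG_METADATA = {
    "dag_id": "orders_daily_load",
    "description": "Load daily orders into the warehouse",
    "schedule": "@daily",  # use schedule_interval on Airflow < 2.4
    "start_date": datetime(2024, 1, 1),
    "catchup": False,
    "tags": ["orders", "warehouse"],
    "default_args": {"owner": "data-eng"},  # owner is applied to every task
}


def build_dag():
    """Build the DAG; Airflow is imported here, not at module import time."""
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(**DAG_METADATA) as dag:
        EmptyOperator(task_id="extract_orders")
    return dag
```

In a DAG file, the result of build_dag() would be assigned to a module-level variable so that the scheduler can discover it.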
3. Task Run Information
The plugin also tracks task run information, providing insights into the execution of tasks within a DAG. This includes:
Task Successes: Logs of successful task executions.
Task Failures: Logs of failed task executions, aiding in troubleshooting and optimization of workflows.
4. Manual Lineage Annotations
In addition to automatic lineage extraction, the plugin supports manual lineage annotations. Users can define lineage relationships using inlets and outlets on Airflow operators, allowing for precise control over lineage information. This feature is useful for custom operators or complex workflows where automatic extraction might not be sufficient.
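A minimal sketch of a manual annotation follows. It uses the inlets and outlets arguments that Airflow operators accept, with the File entity from airflow.lineage.entities in Airflow 2.x. Whether the Concur plugin reads this entity type or expects its own dataset class is an assumption to verify against the plugin documentation; the task id, command, and paths are hypothetical.

```python
# Hypothetical input and output locations for a custom transformation step.
LINEAGE = {
    "inlets": ["s3://example-bucket/raw/orders.csv"],
    "outlets": ["s3://example-bucket/clean/orders.parquet"],
}


def build_annotated_task(dag=None):
    """Create a task that declares its inputs and outputs explicitly."""
    # Imported lazily so this module loads even without Airflow installed.
    from airflow.lineage.entities import File
    from airflow.operators.bash import BashOperator

    return BashOperator(
        task_id="transform_orders",
        bash_command="echo 'run custom transform'",
        inlets=[File(url=url) for url in LINEAGE["inlets"]],
        outlets=[File(url=url) for url in LINEAGE["outlets"]],
        dag=dag,
    )
```

Declaring the inputs and outputs on the operator itself keeps the lineage next to the task definition, which helps when the command cannot be parsed automatically.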
Supported Implementations
There are two actively supported implementations of the Concur Airflow plugin, each compatible with a different range of Airflow versions. Consult the official documentation for the version compatibility details and installation instructions of each implementation.
The Concur Airflow plugin significantly enhances the visibility and traceability of data workflows in Apache Airflow. By automatically extracting lineage information, collecting detailed task metadata, and supporting manual annotations, it provides a comprehensive solution for data lineage tracking. This integration empowers data teams to better understand their data pipelines, ensure data quality, and comply with regulatory requirements.