Apache Airflow (beta)

Follow these steps to connect your Apache Airflow to Select Star (via OpenLineage).

Before you start

To connect Apache Airflow to Select Star, you will need...

  • Permission to install and update packages in your Airflow environment

Select Star won't need any permissions for your Airflow directly, but you will need to install a Python package and configure an environment variable in your Airflow environment.

Complete the following steps to connect Apache Airflow to Select Star.

Note that Select Star does not connect to Apache Airflow directly. Instead, we connect via OpenLineage, an open framework for the collection and analysis of data lineage. It tracks metadata about your Apache Airflow datasets, DAGs, and DAG Runs, and sends that metadata to Select Star.
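For context, each OpenLineage event is a small JSON document describing one run of a job and the datasets it read and wrote. A simplified sketch of such an event, shown as a Python dict (all values below are illustrative, not taken from a real run):

# A simplified OpenLineage run event, as sent to the Events Endpoint.
# Field names follow the OpenLineage spec; the values are illustrative.
event = {
    "eventType": "COMPLETE",  # START, RUNNING, COMPLETE, FAIL, or ABORT
    "eventTime": "2024-01-01T00:00:00Z",
    "run": {"runId": "0190e8e5-7b7e-7f1e-9d2b-000000000000"},
    "job": {"namespace": "airflow", "name": "my_dag.my_task"},
    "inputs": [{"namespace": "postgres://db:5432", "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://db:5432", "name": "public.daily_orders"}],
}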

1. Create a new Data Source in Select Star

Go to the Select Star Settings. Click Data in the sidebar, then + Add to create a new Data Source.

Fill in the form with the required information:

  • Display Name - This value is Apache Airflow by default, but you can override it.

  • Source Type - Choose Apache Airflow from the dropdown.

  • Base URL - The URL of your Apache Airflow instance. For example, http://airflow.example.com.

Click Save to proceed.

On the next screen, you will get the API Token and the Events Endpoint. You will need these in the next steps to configure your Apache Airflow environment.

  • API Token - This is a secret key that Select Star will use to authenticate the traffic coming from your Apache Airflow instance.

  • Events Endpoint - This is the Select Star Events URL where your Apache Airflow instance will send OpenLineage events, containing the metadata about your DAGs, DAG Runs, and datasets.

2. Configure Apache Airflow

Install OpenLineage provider

Install the provider package, or add the following line to your requirements file (usually requirements.txt):

apache-airflow-providers-openlineage==1.10.0
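For example, to install the pinned provider directly with pip:

pip install apache-airflow-providers-openlineage==1.10.0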

Transport setup

  1. Self-hosted Apache Airflow

Provide a Transport configuration so that OpenLineage knows where to send the events. Keep the API Token and Events Endpoint from the previous step handy.

  • Within the airflow.cfg file:

[openlineage]
transport = {"type": "http", "url": "https://ingestion.production.selectstar.com", "endpoint": "<EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>", "auth": {"type": "api_key", "api_key": "<API_KEY_PROVIDED_BY_SELECT_STAR>"}}
  • Or with the AIRFLOW__OPENLINEAGE__TRANSPORT environment variable:

AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "https://ingestion.production.selectstar.com", "endpoint": "<EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>", "auth": {"type": "api_key", "api_key": "<API_KEY_PROVIDED_BY_SELECT_STAR>"}}'

Make sure to replace <EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR> and <API_KEY_PROVIDED_BY_SELECT_STAR> with the actual values provided by Select Star in Step 1.
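If events do not arrive, a common culprit is malformed JSON in the transport value. A minimal sanity check, assuming the value is set via the environment variable:

import json
import os

# Parse the transport configuration exactly as the OpenLineage provider will;
# json.loads raises JSONDecodeError if the value is not valid JSON.
transport = json.loads(os.environ["AIRFLOW__OPENLINEAGE__TRANSPORT"])

assert transport["type"] == "http"
print("Transport OK:", transport["url"] + transport["endpoint"])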

  2. Amazon Managed Workflows for Apache Airflow (MWAA)

For Amazon MWAA, the OpenLineage installation does not change; however, the transport is set up using a plugin.

First, create an env_var_plugin.py file and paste in the following code:

from airflow.plugins_manager import AirflowPlugin
import os

# MWAA does not let you edit airflow.cfg directly, so this plugin sets the
# OpenLineage configuration through environment variables at import time.
os.environ["AIRFLOW__OPENLINEAGE__NAMESPACE"] = "airflow"
os.environ["AIRFLOW__OPENLINEAGE__TRANSPORT"] = '''{
  "type": "http",
  "url": "https://ingestion.production.selectstar.com",
  "endpoint": "<EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>",
  "auth": {
    "type": "api_key",
    "api_key": "<API_KEY_PROVIDED_BY_SELECT_STAR>"
  }
}'''
# Clear any config-file path and operator exclusions so the transport above is used.
os.environ["AIRFLOW__OPENLINEAGE__CONFIG_PATH"] = ""
os.environ["AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS"] = ""


class EnvVarPlugin(AirflowPlugin):
    # Registering the plugin is enough; the environment variables above are
    # set as soon as MWAA imports this module.
    name = "env_var_plugin"

Make sure to replace <EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR> and <API_KEY_PROVIDED_BY_SELECT_STAR> with the actual values provided by Select Star in Step 1.

If you already have a plugins.zip file, add env_var_plugin.py to it. Otherwise, you can create one by running:

zip plugins.zip env_var_plugin.py

Update your plugins in MWAA environment by following these steps:

  • Upload plugins.zip to the S3 bucket associated with your MWAA environment.

  • Go to your MWAA environment.

  • Click Edit.

  • Scroll to the DAG code in Amazon S3 section.

  • Under Plugins file, choose your plugins.zip file and set the version to the latest.

NOTE: Do the same for the requirements.txt file, so that the OpenLineage provider gets installed.

The environment will now update itself by downloading and installing the plugin. It may take a while for the changes to take effect.
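If you prefer to script this, the same update can be done with boto3. A sketch under the assumption that your bucket and environment names are the placeholders below:

import boto3

# Placeholders - replace with your own bucket and environment names.
BUCKET = "my-mwaa-bucket"
ENVIRONMENT = "my-mwaa-environment"

s3 = boto3.client("s3")
s3.upload_file("plugins.zip", BUCKET, "plugins.zip")

# MWAA buckets are versioned; pin the environment to the version just uploaded.
version_id = s3.head_object(Bucket=BUCKET, Key="plugins.zip")["VersionId"]

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name=ENVIRONMENT,
    PluginsS3Path="plugins.zip",
    PluginsS3ObjectVersion=version_id,
)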

That’s it! OpenLineage events will now be sent to Select Star whenever your DAGs run.

For more details on using the OpenLineage integration with Apache Airflow, please read the official Airflow documentation.

3. Sync Metadata in Select Star

After you have configured your Apache Airflow environment, make sure to trigger your healthcheck DAGs. This will send OpenLineage events to Select Star, and help you verify that the integration is working correctly.
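If you do not already have a suitable DAG, a minimal healthcheck DAG is enough to produce events. A sketch, assuming Airflow 2.x with the OpenLineage provider installed as above (the dag_id is a hypothetical name):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A trivial DAG whose runs emit OpenLineage START/COMPLETE events,
# which is enough to verify the Select Star integration end to end.
with DAG(
    dag_id="selectstar_healthcheck",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually from the Airflow UI
    catchup=False,
):
    BashOperator(task_id="ping", bash_command="echo OpenLineage healthcheck")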

Afterwards, you can go to the Select Star Settings and click Data in the sidebar. Then click the Sync metadata button on your Apache Airflow Data Source.

Note that Select Star does not connect to Apache Airflow directly. This means that the lineage and your DAG metadata will be available in Select Star only after you run your DAGs and the OpenLineage events are sent to Select Star.

If you want to look at OpenLineage events without sending them anywhere, you can set up the ConsoleTransport; the events will end up in the task logs.

[openlineage]
transport = {"type": "console"}
