Apache Airflow (beta)

Follow these steps to connect your Apache Airflow to Select Star (via OpenLineage).

Before you start

To connect Apache Airflow to Select Star, you will need...

  • Permission to install and update packages in your Airflow environment

Select Star won't need any permissions on your Airflow instance directly, but you will need to install a Python package and configure an environment variable in your Airflow environment.

Complete the following steps to connect Apache Airflow to Select Star.

Note that Select Star does not connect to Apache Airflow directly. Instead, we connect via OpenLineage, an open platform for the collection and analysis of data lineage. OpenLineage tracks metadata about your Apache Airflow DAGs, DAG runs, and datasets, and sends that metadata to Select Star. Airflow DAGs will not appear in the catalog until metadata has been received and ingestion has run.

1. Create a new Data Source in Select Star

Go to the Select Star Settings. Click Data in the sidebar, then + Add to create a new Data Source.

Fill in the form with the required information:

  • Display Name - This value is Apache Airflow by default, but you can override it.

  • Source Type - Choose Apache Airflow from the dropdown.

  • Base URL - The URL of your Apache Airflow instance. For example, http://airflow.example.com.

Click Save to proceed.

On the next screen, you will see the API Token, Events Endpoint, and Events URL. You will need these in the next steps to configure your Apache Airflow environment.

  • API Token - This is a secret key that Select Star will use to authenticate the traffic coming from your Apache Airflow instance.

  • Events Endpoint - This is the Select Star endpoint where your Apache Airflow instance will send OpenLineage events, containing the metadata about your DAGs, DAG Runs, and datasets.

  • Events URL - This is the Select Star Base URL where your Apache Airflow instance will send OpenLineage events.

2. Configure Apache Airflow

Install OpenLineage provider

Install the provider package or add the following line to your requirements file (usually requirements.txt):

apache-airflow-providers-openlineage==1.10.0
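
Alternatively, if you manage packages directly with pip rather than a requirements file, the pinned install could look like this:

pip install apache-airflow-providers-openlineage==1.10.0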

Transport setup

  1. Self-hosted Apache Airflow

Provide a Transport configuration so that OpenLineage knows where to send the events. Keep the API Token, Events Endpoint, and Events URL from the previous step handy.

  • Within the airflow.cfg file

[openlineage]
transport = {"type": "http", "url": "<EVENTS_URL_PROVIDED_BY_SELECT_STAR>", "endpoint": "<EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>", "auth": {"type": "api_key", "api_key": "<API_KEY_PROVIDED_BY_SELECT_STAR>"}}
  • or with AIRFLOW__OPENLINEAGE__TRANSPORT environment variable

AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "<EVENTS_URL_PROVIDED_BY_SELECT_STAR>", "endpoint": "<EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>", "auth": {"type": "api_key", "api_key": "<API_KEY_PROVIDED_BY_SELECT_STAR>"}}'

Make sure to replace <EVENTS_URL_PROVIDED_BY_SELECT_STAR>, <EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>, and <API_KEY_PROVIDED_BY_SELECT_STAR> with the actual values provided by Select Star in Step 1.

  2. Amazon Managed Workflows for Apache Airflow (MWAA)

In the case of Amazon MWAA, installing the OpenLineage provider works the same way; however, setting up the transport is done using a plugin.

First, create an env_var_plugin.py file and paste the following code into it:

from airflow.plugins_manager import AirflowPlugin
import os

# Namespace under which Airflow jobs are reported in OpenLineage events
os.environ["AIRFLOW__OPENLINEAGE__NAMESPACE"] = "airflow"
# HTTP transport pointing at the Select Star events endpoint
os.environ["AIRFLOW__OPENLINEAGE__TRANSPORT"] = '''{
  "type": "http",
  "url": "<EVENTS_URL_PROVIDED_BY_SELECT_STAR>",
  "endpoint": "<EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>",
  "auth": {
    "type": "api_key",
    "api_key": "<API_KEY_PROVIDED_BY_SELECT_STAR>"
  }
}'''
# Clear any config-file path and operator disable list so the settings above take effect
os.environ["AIRFLOW__OPENLINEAGE__CONFIG_PATH"] = ""
os.environ["AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS"] = ""


class EnvVarPlugin(AirflowPlugin):
    # Registering the plugin is enough: MWAA loads it at startup, and the
    # environment variables above are set when this module is imported.
    name = "env_var_plugin"

Make sure to replace <EVENTS_URL_PROVIDED_BY_SELECT_STAR>, <EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>, and <API_KEY_PROVIDED_BY_SELECT_STAR> with the actual values provided by Select Star in Step 1.

If you already have a plugins.zip file, add env_var_plugin.py to it. Otherwise, you can create one by running:

zip plugins.zip env_var_plugin.py

Update your plugins in the MWAA environment by following these steps:

  • Upload plugins.zip to the S3 bucket associated with your MWAA environment.

  • Go to your MWAA environment.

  • Click Edit.

  • Scroll to the DAG code in Amazon S3 section.

  • Under Plugins file, choose your plugins.zip file and set the version to the latest.

NOTE: You should do the same for the requirements.txt file.
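
As an illustration of the upload step, assuming the AWS CLI is configured and your environment uses a bucket named my-mwaa-bucket (a placeholder), the commands could look like:

aws s3 cp plugins.zip s3://my-mwaa-bucket/plugins.zip
aws s3 cp requirements.txt s3://my-mwaa-bucket/requirements.txt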

Now the environment will update itself by downloading and installing the plugin. It may take a while for the changes to take effect.

That’s it! OpenLineage events should be sent to Select Star when DAGs are run.

For more details on using the OpenLineage integration with Apache Airflow, please read the official Airflow documentation.

3. Sync Metadata in Select Star

After you have configured your Apache Airflow environment, make sure to trigger your health-check DAGs. This will send OpenLineage events to Select Star and help you verify that the integration is working correctly.
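
If you don't have a dedicated health-check DAG, a minimal sketch like the one below can be triggered manually to generate events. The DAG ID and task are hypothetical, and it assumes a recent Airflow 2.x with the OpenLineage provider configured as in Step 2:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A trivial DAG: running it causes the OpenLineage provider to emit run events,
# which are sent to Select Star via the transport configured earlier.
with DAG(
    dag_id="select_star_lineage_check",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    BashOperator(task_id="say_hello", bash_command="echo 'hello from airflow'")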

Afterwards, you can go to the Select Star Settings and click Data in the sidebar. Then click the Sync metadata button on your Apache Airflow Data Source.

If you want to examine OpenLineage events without sending them anywhere, you can set up the ConsoleTransport. The events will then appear in the task logs.

[openlineage]
transport = {"type": "console"}
