Apache Airflow (beta)
Follow these steps to connect your Apache Airflow to Select Star (via OpenLineage).
To connect Apache Airflow to Select Star, you will need...
Permission to install and update packages in your Airflow environment
Select Star won't need any permissions for your Airflow directly, but you will need to install a Python package and configure an environment variable in your Airflow environment.
Complete the following steps to connect Apache Airflow to Select Star.
Note that Select Star does not connect to Apache Airflow directly. Instead, we connect via OpenLineage, an open platform for the collection and analysis of data lineage. It tracks metadata about your Apache Airflow datasets, DAGs, and DAG Runs, and sends that metadata to Select Star. Airflow DAGs will not appear in the catalog until metadata is received and ingestion is run.
Go to the Select Star Settings. Click Data in the sidebar, then + Add to create a new Data Source.
Fill in the form with the required information:
Display Name - This value is Apache Airflow by default, but you can override it.
Source Type - Choose Apache Airflow from the dropdown.
Base URL - The URL of your Apache Airflow instance. For example, http://airflow.example.com.
Click Save to proceed.
On the next screen, you will get the API Token, Events Endpoint and the Events URL. You will need these in the next steps to configure your Apache Airflow environment.
API Token - This is a secret key that Select Star will use to authenticate the traffic coming from your Apache Airflow instance.
Events Endpoint - This is the Select Star endpoint where your Apache Airflow instance will send OpenLineage events, containing the metadata about your DAGs, DAG Runs, and datasets.
Events URL - This is the Select Star Base URL where your Apache Airflow instance will send OpenLineage events.
Install the provider package or add the following line to your requirements file (usually requirements.txt):
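On Airflow 2.7 and later, the OpenLineage integration ships as the official apache-airflow-providers-openlineage provider; on Airflow 2.1–2.6, the standalone openlineage-airflow package was used instead. As a sketch, for a recent Airflow version:

```shell
# Install the official OpenLineage provider (Airflow 2.7+)
pip install apache-airflow-providers-openlineage
```

Alternatively, add the same package name as a line in your requirements.txt.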
Self-hosted Apache Airflow
Provide a Transport configuration so that OpenLineage knows where to send the events. Keep the API Token and Events Endpoint from the previous step handy.
Set it within the airflow.cfg file or with the AIRFLOW__OPENLINEAGE__TRANSPORT environment variable.
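For illustration, the transport value is a JSON object following the OpenLineage HTTP transport shape; the snippet below builds and prints it (the placeholder values come from Step 1, and the exact field names should be checked against your provider version):

```python
import json

# Placeholders below come from Step 1 of the Select Star setup;
# replace them with the actual values shown in the UI.
transport = {
    "type": "http",
    "url": "<EVENTS_URL_PROVIDED_BY_SELECT_STAR>",
    "endpoint": "<EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>",
    "auth": {"type": "api_key", "apiKey": "<API_KEY_PROVIDED_BY_SELECT_STAR>"},
}

# This JSON string is what you set as AIRFLOW__OPENLINEAGE__TRANSPORT,
# or as the `transport` key under the [openlineage] section of airflow.cfg.
print(json.dumps(transport))
```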
Make sure to replace <EVENTS_URL_PROVIDED_BY_SELECT_STAR>, <EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>, and <API_KEY_PROVIDED_BY_SELECT_STAR> with the actual values provided by Select Star in Step 1.
Amazon Managed Workflows for Apache Airflow (MWAA)
For Amazon MWAA, the installation of OpenLineage does not change; however, the transport is configured using a plugin.
First, create an env_var_plugin.py file and paste the following code:
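The original snippet is not reproduced here; a sketch of the commonly documented MWAA env-var plugin pattern looks like this (the class name is illustrative, the placeholders come from Step 1, and the file requires Airflow at runtime):

```python
import os

from airflow.plugins_manager import AirflowPlugin

# Placeholders below come from Step 1; replace them with your actual values.
os.environ["AIRFLOW__OPENLINEAGE__TRANSPORT"] = (
    '{"type": "http", '
    '"url": "<EVENTS_URL_PROVIDED_BY_SELECT_STAR>", '
    '"endpoint": "<EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>", '
    '"auth": {"type": "api_key", "apiKey": "<API_KEY_PROVIDED_BY_SELECT_STAR>"}}'
)


class EnvVarPlugin(AirflowPlugin):
    # Airflow loads plugins at startup, so the environment variable above is
    # set before the OpenLineage provider reads its configuration.
    name = "env_var_plugin"
```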
Make sure to replace <EVENTS_URL_PROVIDED_BY_SELECT_STAR>, <EVENTS_ENDPOINT_PROVIDED_BY_SELECT_STAR>, and <API_KEY_PROVIDED_BY_SELECT_STAR> with the actual values provided by Select Star in Step 1.
If you already have a plugins.zip file, add env_var_plugin.py to it. Otherwise, you can create it by calling:
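For example, from the directory containing the plugin file:

```shell
# Create plugins.zip containing the plugin (or add the file to an existing archive)
zip plugins.zip env_var_plugin.py
```

If plugins.zip already exists, the same command adds or updates env_var_plugin.py inside the archive.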
Update your plugins in the MWAA environment by following these steps:
Upload plugins.zip to the S3 bucket associated with your MWAA environment.
Go to your MWAA environment.
Click Edit.
Scroll to the DAG code in Amazon S3 section.
Under Plugins file, choose your plugins.zip file and set the version to the latest.
NOTE: Do the same for the requirements.txt file.
The environment will now update itself by downloading and installing the plugin. It may take a while for the changes to take effect.
That’s it! OpenLineage events will be sent to Select Star whenever your DAGs run.
For more details on the OpenLineage integration with Apache Airflow, please read the official Airflow documentation.
After you have configured your Apache Airflow environment, make sure to trigger your healthcheck DAGs. This will send OpenLineage events to Select Star, and help you verify that the integration is working correctly.
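Assuming a healthcheck DAG with an id of healthcheck (hypothetical; substitute your own DAG id), you can trigger it from the Airflow CLI:

```shell
# Trigger a run of the DAG with id "healthcheck" (hypothetical id)
airflow dags trigger healthcheck
```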
Afterwards, go to the Select Star Settings and click Data in the sidebar. Then click the Sync metadata button on your Apache Airflow Data Source.
Note that Select Star does not connect to Apache Airflow directly. This means the lineage and your DAGs' metadata will be available in Select Star only after your DAGs run, the OpenLineage events are sent to Select Star, and ingestion completes.
If you want to examine OpenLineage events without sending them anywhere, you can set up ConsoleTransport. The events will end up in task logs.
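A sketch of switching to the console transport via the same environment variable (check the transport name against your OpenLineage version):

```shell
# Log OpenLineage events to task logs instead of sending them anywhere
export AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "console"}'
```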