# Databricks on AWS

## **Before you start**

{% hint style="info" %}
Ensure Unity Catalog is enabled for your Databricks instance. For details, see [Getting Started with Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/get-started.html).
{% endhint %}

To connect Databricks to Select Star, you will need:

* A Databricks instance on AWS. For details, see [Databricks' documentation](https://www.databricks.com/product/aws).
* Account admin permissions on the Databricks instance.
* Workspace admin permissions on the Databricks instance.

Complete all of the following steps to see Databricks metadata, lineage, and popularity in Select Star.

1. [Create a service principal (SelectStar) in Databricks](#id-1.-create-a-service-principal-in-databricks)
2. [Generate a Personal Access Token](#id-2.-generate-a-personal-access-token)
3. [Configure System tables lineage (Recommended)](#id-3.-configure-system-tables-lineage-recommended)
4. [Connect Databricks to Select Star](#id-4.-connect-databricks-to-select-star)
5. [Choose Catalogs and Schemas](#id-5.-choose-catalogs-and-schemas)

## **1. Create a Service Principal in Databricks**

#### What is a Service Principal?

A service principal is an identity that you create in Databricks for use with automated tools, jobs, and applications. Service principals give automated tools and scripts API-only access to Databricks resources, which is more secure than using users or groups. They also prevent jobs and automations from failing if a user leaves your organization or a group is modified. For details, see [Manage Service Principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#manage-service-principals).

### **Add a service principal to your Databricks account**

Account admins can add service principals to your Databricks account using the account console or the System for Cross-domain Identity Management (SCIM) Account API.

### **Add service principals to your account using the account console**

To add a service principal to the account using the account console:

1. As an account admin, log in to the [account console](https://accounts.cloud.databricks.com/).
2. Click **User management**.
3. On the **Service principals** tab, click **Add service principal**.
4. Enter a name (**SelectStar**) for the service principal.
5. Click **Add**.

To add a service principal via the REST API, see [Add service principals to your account using the SCIM (Account) API](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-service-principals-to-your-account-using-the-scim-account-api).
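As a rough illustration of the SCIM route, the sketch below builds the account-level request without sending it. The endpoint path and the SCIM schema URN follow the Databricks SCIM (Account) API as commonly documented, but treat them as assumptions and confirm them against the linked page; the function name and the placeholder account ID and token are hypothetical.

```python
import json
import urllib.request

ACCOUNTS_HOST = "https://accounts.cloud.databricks.com"


def build_create_sp_request(account_id: str, admin_token: str,
                            display_name: str = "SelectStar") -> urllib.request.Request:
    """Build (but do not send) a SCIM request that creates a service principal."""
    # Assumed account-level SCIM path; verify against the Databricks docs.
    url = f"{ACCOUNTS_HOST}/api/2.0/accounts/{account_id}/scim/v2/ServicePrincipals"
    body = {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "displayName": display_name,
        "active": True,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/scim+json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace with a real account ID and admin token.
    req = build_create_sp_request("<account-id>", "<admin-token>")
    print(req.get_method(), req.full_url)
    # To actually send the request: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and payload before anything is created in the account.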

{% hint style="info" %}
💡 To use service principals, you must add them to a workspace and generate access tokens for them in the workspace.
{% endhint %}

### **Add a service principal to a workspace**

Account admins can add service principals to [identity-federated workspaces](https://docs.databricks.com/administration-guide/users-groups/index.html#assign-users-to-workspaces) using the following:

* The account console
* The Workspace Assignment API

Workspace admins can manage service principals in their workspace using the following:

* The workspace admin console (if the workspace is enabled for identity federation)
* The workspace-level SCIM (ServicePrincipals) API
* The Workspace Assignment API (if the workspace is enabled for identity federation)

### **Assign a service principal to a workspace using the account console**

To add service principals to a workspace using the account console, the workspace must be enabled for identity federation.

1. As an account admin, log in to the [account console](https://accounts.cloud.databricks.com/).
2. Click **Workspaces**.
3. On the **Permissions** tab, click **Add permissions**.
4. Search for and select the service principal **SelectStar** and assign the permission level (workspace **Admin**), and click **Save**.

To add a service principal to a workspace via the workspace admin console or the REST API, see [Add a service principal to a workspace](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#add-sp-workspace).

These are the minimum permissions required for Select Star to collect basic metadata and query history. Query history is also used to generate [Data Lineage](https://docs.selectstar.com/features/lineage).
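For teams that script the workspace assignment rather than using the account console, a minimal sketch of a Workspace Assignment API call might look like the following. The endpoint path and the `permissions` payload are assumptions based on the public Databricks API reference, and `principal_id` here means the service principal's numeric internal ID rather than its application ID; the function name is hypothetical.

```python
import json
import urllib.request

ACCOUNTS_HOST = "https://accounts.cloud.databricks.com"
ALLOWED_PERMISSIONS = ("USER", "ADMIN")


def build_workspace_assignment_request(account_id: str, workspace_id: str,
                                       principal_id: str, admin_token: str,
                                       permission: str = "ADMIN") -> urllib.request.Request:
    """Build (but do not send) a request assigning a principal to a workspace."""
    if permission not in ALLOWED_PERMISSIONS:
        raise ValueError(f"permission must be one of {ALLOWED_PERMISSIONS}")
    # Assumed Workspace Assignment API path; verify against the Databricks docs.
    url = (
        f"{ACCOUNTS_HOST}/api/2.0/accounts/{account_id}"
        f"/workspaces/{workspace_id}/permissionassignments/principals/{principal_id}"
    )
    body = {"permissions": [permission]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )
```

The default of `ADMIN` mirrors the workspace **Admin** permission level chosen in the console steps above.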

## Grant SQL and Workspace access **for a service principal**

To grant SQL Warehouse access for a service principal using the workspace admin console, the workspace must be enabled for identity federation.

1. As a workspace admin, log in to the Databricks workspace.
2. Click your username in the top bar of the Databricks workspace and select **Admin Console**.

   ![Admin Console](https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-db28fea1eee6156a3739ea2e394a0e522d1a0e84%2FScreen%20Shot%202022-12-27%20at%2012.47.06%20PM.png?alt=media)
3. Click **Settings** and select **Service principals**.
4. On the **Service principals** tab, click the service principal that was created in the previous steps.
5. Select the checkbox for **Databricks SQL access** and **Workspace access**, and click **Update**.

   ![Entitlements for service principal](https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-295f91b29e3c025f5afbc73eebed22b4b02e75e9%2FScreen%20Shot%202022-12-28%20at%201.01.41%20PM.png?alt=media)

## Grant permissions to a catalog for a service principal

1. Log in to a workspace that is linked to the metastore.
2. Click **Data**.
3. Click the **catalog** you want Select Star to access, and select **Permissions**.

   <figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-dfb3d456ca6a25d209406f09ae70569b1ca9b9a0%2Fimage.png?alt=media" alt=""><figcaption><p>Catalog permissions in the Data Explorer UI</p></figcaption></figure>
4. Click **Grant**.
5. Select the **SelectStar** service principal, set **Privilege presets** to **Data Reader**, select the checkboxes for **USE CATALOG**, **USE SCHEMA**, and **SELECT**, and click **Grant**.

![Privileges for service principal or User groups](https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-a8b02fe2e302fa0563f800ca57f126cbc1912913%2FScreen%20Shot%202022-12-28%20at%201.02.29%20PM.png?alt=media)
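The same three privileges can also be expressed as Unity Catalog `GRANT` statements. The helper below is a small sketch that only generates the SQL text; it assumes the service principal is referenced by its application ID in backticks, and the function name and example values are hypothetical.

```python
from typing import List

CATALOG_PRIVILEGES = ("USE CATALOG", "USE SCHEMA", "SELECT")


def catalog_grant_statements(catalog: str, application_id: str) -> List[str]:
    """Return one GRANT statement per privilege listed in the steps above."""
    for name in (catalog, application_id):
        # Reject backticks so the quoted identifiers stay well-formed.
        if "`" in name:
            raise ValueError("identifiers must not contain backticks")
    return [
        f"GRANT {privilege} ON CATALOG `{catalog}` TO `{application_id}`;"
        for privilege in CATALOG_PRIVILEGES
    ]


if __name__ == "__main__":
    # Placeholder catalog name and application ID.
    for statement in catalog_grant_statements("main", "<application-id>"):
        print(statement)
```

Granting at the catalog level lets the privileges apply to the schemas and tables inside it, which matches the catalog-wide grant made in the UI.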

## Grant permission to a workspace for a service principal

This step is required to show notebooks in the catalog and notebook lineage.

1. Log in to a workspace that is linked to the metastore.
2. Click **Workspace** and select the top folder.
3. Click the **Share** button.

   <figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-9c74ba617210a49c023b53414a27d0fd7f2d3982%2Fdatabricks-workspace-share.png?alt=media" alt=""><figcaption><p>Folder permissions in the Workspace explore UI</p></figcaption></figure>
4. Select the service principal, choose the **Can view** permission, and click **Add**.

   <figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-36b86e9e67e5fcf0a2309d0df5e59e64ea00fdf5%2Fdatabricks-workspace-grant.png?alt=media" alt=""><figcaption><p>Permission grant in Workspace share</p></figcaption></figure>

## **2. Generate a Personal Access Token**

To authenticate a service principal to Databricks APIs, an administrator can create a Databricks personal access token on behalf of the service principal.

1. Grant the [Can Use token permission](https://docs.databricks.com/administration-guide/access-control/tokens.html#control-who-can-use-or-create-tokens) to the service principal.
2. Create a Databricks personal access token on behalf of the service principal using the `POST /token-management/on-behalf-of/tokens` operation in the [token management REST API](https://docs.databricks.com/dev-tools/api/latest/token-management.html). An administrator can also list personal access tokens and delete them using the same API.
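The first step above can be scripted as well. This sketch builds a request that grants `CAN_USE` on tokens to the service principal; the `/api/2.0/permissions/authorization/tokens` path and the `access_control_list` payload are assumptions drawn from the Databricks token permissions documentation, and the function name is hypothetical.

```python
import json
import urllib.request


def build_token_permission_request(workspace_host: str, admin_token: str,
                                   application_id: str) -> urllib.request.Request:
    """Build (but do not send) a request granting CAN_USE on tokens."""
    # workspace_host is expected to look like
    # "<deployment name>.cloud.databricks.com" (no scheme).
    url = f"https://{workspace_host}/api/2.0/permissions/authorization/tokens"
    body = {
        "access_control_list": [
            {
                # Assumption: service principals are named by application ID.
                "service_principal_name": application_id,
                "permission_level": "CAN_USE",
            }
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```

`PATCH` is used here on the assumption that it adds to the existing access control list instead of replacing it.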

## Generate a Personal Access Token

<mark style="color:green;">`POST`</mark> `https://<deployment name>.cloud.databricks.com/api/2.0/token-management/on-behalf-of/tokens/`

Use this endpoint to generate a personal access token on behalf of a user or service principal through the Databricks API.

Use the `token_value` returned in the response as the API key.

#### Request Body

| Name                                              | Type   | Description                                                                                                             |
| ------------------------------------------------- | ------ | ----------------------------------------------------------------------------------------------------------------------- |
| application\_id<mark style="color:red;">\*</mark> | String | UUID of the service principal. It can be found at <https://accounts.cloud.databricks.com/users/serviceprincipals/> |
| comment                                           | String |                                                                                                                         |
| lifetime\_seconds                                 | String | Token lifetime in seconds. Use `-1` for a token that never expires                                                      |

{% tabs %}
{% tab title="200: OK " %}

```json

{
    "token_value": "dapia.....",
    "token_info": {
        "token_id": "4305bc67998.........",
        "creation_time": 1671720121149,
        "expiry_time": -1,
        "comment": "Service Principal Token. API Test",
        "created_by_id": 355825636633264,
        "created_by_username": "prat@getselectstar.com",
        "owner_id": 4012126671306509
    }
}

```

{% endtab %}

{% tab title="400: Bad Request Invalid Application ID" %}

```json

{
    "error_code": "INVALID_PARAMETER_VALUE",
    "message": "edxiueirxxxxxxxx does not exist"
}

```

{% endtab %}
{% endtabs %}
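The request and response above can be wired together in a few lines. The sketch below builds the `POST` request from the documented body fields and pulls `token_value` out of a response; the function names are hypothetical, and sending `lifetime_seconds` as a JSON number is an assumption.

```python
import json
import urllib.request


def build_obo_token_request(workspace_host: str, admin_token: str,
                            application_id: str,
                            comment: str = "Select Star integration",
                            lifetime_seconds: int = -1) -> urllib.request.Request:
    """Build (but do not send) the on-behalf-of token request."""
    url = f"https://{workspace_host}/api/2.0/token-management/on-behalf-of/tokens"
    body = {
        "application_id": application_id,
        "comment": comment,
        # -1 follows the table above: the token lives indefinitely.
        "lifetime_seconds": lifetime_seconds,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_token_value(response_text: str) -> str:
    """Return the token_value field from a 200 response body."""
    return json.loads(response_text)["token_value"]


if __name__ == "__main__":
    # Placeholder host, token, and application ID.
    req = build_obo_token_request(
        "<deployment name>.cloud.databricks.com", "<admin-token>", "<application-id>"
    )
    print(req.get_method(), req.full_url)
    # To actually send it:
    # with urllib.request.urlopen(req) as resp:
    #     print(extract_token_value(resp.read().decode("utf-8")))
```

Store the extracted value securely; it is the credential entered as the Access Token when connecting Select Star.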

For detailed, step-by-step instructions for creating access tokens for service principals, see [Service principals for Databricks automation](https://docs.databricks.com/dev-tools/service-principals.html).

## **3. Configure System tables lineage (Recommended)**

{% hint style="info" %}
💡 This section is optional but recommended. System tables lineage provides better performance and scalability by using Databricks system tables instead of individual API calls. If you skip this section, Select Star will use API lineage collection.
{% endhint %}

System tables lineage requires additional permissions beyond the basic setup. These permissions allow Select Star to query Databricks system tables that contain lineage metadata, without accessing your actual data.

### **Grant SQL Warehouse access permissions**

The service principal needs permission to use a specific SQL Warehouse for executing lineage queries.

1. In your Databricks workspace, go to **SQL Warehouses**.
2. Select the SQL Warehouse you want to use for Select Star.
3. Click the **Permissions** button.
4. Click **Add** and search for your **SelectStar** service principal.
5. Grant **Can use** permission and click **Add**.

<figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-77fb1c7c3a7bb52aaf2df7918174fe421c8dd6f6%2Fdatabricks-sql-warehouse-permissions.png?alt=media" alt=""><figcaption><p>Grant CAN USE permission on SQL Warehouse</p></figcaption></figure>

{% hint style="info" %}
💡 Note the **Warehouse ID** from the SQL Warehouse details page - you'll need this when connecting to Select Star.
{% endhint %}

<figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-112f860bd22249dbcd144c18cfa8ab4db37d00dd%2Fdatabricks-warehouse-id.png?alt=media" alt=""><figcaption><p>SQL Warehouse ID location</p></figcaption></figure>

### **Grant system.access schema permissions**

The service principal needs permissions to read lineage data from Databricks system tables.

1. In your Databricks workspace, go to **Catalog**.
2. Select the **system** catalog.
3. Select the **access** schema.
4. Go to the **Permissions** tab.
5. Click **Grant** and search for your **SelectStar** service principal.
6. Select **USE** and **SELECT** permissions.
7. Click **Grant**.

<figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-c850f58d8c26d94bf0491f366b55b81754d77631%2Fdatabricks-system-access-permissions.png?alt=media" alt=""><figcaption><p>Grant permissions on system.access schema</p></figcaption></figure>

### **Ensure SQL access entitlement**

Verify that your service principal has the SQL access entitlement enabled:

1. In your Databricks workspace, click your username and select **Admin Console**.
2. Click **Settings** and select **Service principals**.
3. Click on your **SelectStar** service principal.
4. Ensure **Databricks SQL access** is checked and click **Update** if needed.

<figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-f608a6e48cc6126dd8df4c6ecb1606ef08c48a2c%2Fdatabricks-service-principal-sql-access.png?alt=media" alt=""><figcaption><p>Enable SQL access for service principal</p></figcaption></figure>
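Once the warehouse permission, the `system.access` grants, and the SQL entitlement are in place, one way to check them is to run a small lineage query as the service principal. The sketch below builds a request for the Databricks SQL Statement Execution API; the endpoint path, the `wait_timeout` value, and the `system.access.table_lineage` table and column names are assumptions to verify in your workspace, and the function name is hypothetical.

```python
import json
import urllib.request

# Assumed system table and columns; adjust if your metastore differs.
LINEAGE_CHECK_SQL = (
    "SELECT source_table_full_name, target_table_full_name "
    "FROM system.access.table_lineage LIMIT 5"
)


def build_lineage_check_request(workspace_host: str, sp_token: str,
                                warehouse_id: str) -> urllib.request.Request:
    """Build (but do not send) a statement-execution request."""
    url = f"https://{workspace_host}/api/2.0/sql/statements/"
    body = {
        "warehouse_id": warehouse_id,    # the Warehouse ID noted above
        "statement": LINEAGE_CHECK_SQL,
        "wait_timeout": "30s",           # wait up to 30 seconds for a result
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {sp_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A permission error in the response would typically point to one of the three grants in this section being missing.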

## **4. Connect Databricks to Select Star**

Go to the Select Star **Settings**. Click **Data** in the sidebar, then **+ Add** to create a new Data Source.

<figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-c0cb7b4b8e61636bfeefb6e1e8f13467267e268d%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

Choose **Databricks** in the Source Type dropdown and provide the following information:

<figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-ddddeaf0e06309f1eb8f417a0aa448c34bb80f21%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

**Display Name:** This value is `Databricks` by default, but you can override it if desired.

**Workspace URL:** This is the address of the workspace. It should include `<deployment name>.cloud.databricks.com`. The deployment name can be found at <https://accounts.cloud.databricks.com/workspaces>.

**Access Token:** This is the **Personal access token** from Step 2, which is used to authenticate access to Databricks.

**Lineage Method:** Choose between System tables (recommended) or API lineage collection.

**SQL Warehouse ID:** Required when using System tables lineage. This is the Warehouse ID noted in Step 3. It does not apply to API lineage.
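To make the relationship between these fields concrete, here is a small pre-flight check written as a sketch. The class, field names, and the `system_tables` / `api` values are purely illustrative and are not Select Star's actual schema; the rules simply restate the field descriptions above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatabricksConnection:
    # Hypothetical field names mirroring the form fields described above.
    workspace_url: str
    access_token: str
    lineage_method: str = "system_tables"  # or "api"
    sql_warehouse_id: Optional[str] = None
    display_name: str = "Databricks"


def validate_connection(conn: DatabricksConnection) -> List[str]:
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    if ".cloud.databricks.com" not in conn.workspace_url:
        problems.append(
            "Workspace URL should include <deployment name>.cloud.databricks.com"
        )
    if not conn.access_token:
        problems.append("Access Token is required")
    if conn.lineage_method not in ("system_tables", "api"):
        problems.append("Lineage Method must be 'system_tables' or 'api'")
    if conn.lineage_method == "system_tables" and not conn.sql_warehouse_id:
        problems.append("SQL Warehouse ID is required for System tables lineage")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    conn = DatabricksConnection(
        "dbc-example.cloud.databricks.com", "<token>", "system_tables"
    )
    for problem in validate_connection(conn):
        print(problem)
```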

## **5. Choose Catalogs and Schemas**

After you fill in the information, you'll be asked to select the catalogs you'd like to load into Select Star.

{% hint style="info" %}
💡 Select Star will not read queries or metadata, or generate lineage, for catalogs, schemas, or tables that are not loaded. Please load all data for which you expect to see lineage.
{% endhint %}

You can [change the catalogs and schemas](https://docs.selectstar.com/data-source-management/manage-data-sources#configure-a-data-source) you have loaded if needed.

Select the catalogs and click **Next**.

<figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-0be93212b8d0cceea913058f6a9323d8c648874a%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

For each catalog you selected, you'll be able to select the schemas.

<figure><img src="https://3470314135-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MgAiVthA_yg9UXKuhyY%2Fuploads%2Fgit-blob-9a5aa9364129621daf592eb15a928284538de7e1%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

Your metadata should start loading automatically. Please allow 24-48 hours for popularity and lineage to be fully generated.

When the sync is complete, you'll be able to explore Databricks in Select Star.

See the link below for more information on Databricks in Select Star.

{% content-ref url="../../learning-data/getting-started-databricks" %}
[getting-started-databricks](https://docs.selectstar.com/learning-data/getting-started-databricks)
{% endcontent-ref %}
