Exam DP203 Course Tybul 32 A closer look at Databricks notebooks

Chapter 31

Cluster Access modes:

  1. Single user. Supports all languages, but the cluster is assigned to a single user.
  2. Shared. For multiple users; each user's workload runs isolated from the others.
  3. No isolation shared. Same as 2, but without isolation between users' workloads.

Databricks Runtime versions: you want to choose one marked LTS (Long Term Support).

Chapter 32

The main difference between the Databricks Premium tier and the Standard tier is that the Premium tier lets you apply permissions (access control). In the Standard tier you cannot, I believe, so effectively all users are admins.

If a cluster is left unused for 30 days, it will be removed automatically. To prevent this from happening, "pin" it in the web front end (in the Compute list).

Under the Workspace folder you can create subfolders. There are two folders there already:

  1. Shared
  2. Users. Under here there is a folder for each user (named after their email address), and notebooks they create appear here by default.

Right-click a folder and select Share to give other users access to the items (notebooks) within it (Premium tier only).

To create a new cell in a notebook, press Ctrl-Alt-N, or hover below an existing cell and click +Code. Note you can also click +Text, which creates a new cell starting with the magic command %md, meaning Markdown: a documentation cell in which you can also use HTML tags to mark up the text.
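For example, the source of a text cell looks something like this (a minimal sketch; the wording is just a placeholder):

%md
This is a **documentation** cell written in Markdown, and you can also use <b>HTML</b> tags here.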

dbutils is the main utility command used in notebooks, covering the file system, secrets, widgets and more.
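For example (a sketch; the commands below just browse around):

dbutils.help()                         # list the available utility modules (fs, secrets, widgets, ...)
dbutils.fs.ls("/databricks-datasets")  # browse the built-in sample datasets
dbutils.fs.mounts()                    # list any mount points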

When you create a Databricks workspace, it also adds a managed resource group that is controlled by Databricks, not by you. This contains a storage account, virtual machines (when clusters run), and a virtual network.

The sample datasets in Databricks are stored in the above storage account.

Note the above storage account will give you an "access denied" error if you try to get into it, even as an administrator, because the rights are held by the Databricks service.

Use .option("inferSchema", "true") to infer the schema from the source data. This should cause the data types shown in the column headings to change to the appropriate type for each column.
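A minimal sketch of reading a CSV file this way (the path and header option are assumptions):

df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/mnt/landing/sales.csv"))   # placeholder path
df.printSchema()                      # column types are now inferred rather than all strings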

33 - Connecting to ADLSg2 from Azure Databricks

Unity Catalog

In the portal, go to Entra ID, and then under Manage, select App Registrations

Here you can add an app registration, which creates a Service Principal: effectively a user account for applications.

Then, within this service principal, under Manage click Certificates & secrets. Add a new client secret, giving it a name and leaving the default expiry of 180 days. Once created, it will appear in the list of secrets along with a long auto-generated value. It is important to note down this value, as it will not be shown again.

Then create an Entra group, and add this service principal to it.

Next go into the storage account, click Access Control (IAM), then Role assignments. Click Add and assign the "Storage Blob Data Contributor" role to the group, thus giving this new group (and the app registration within it) access to the storage account.
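With the role assigned, a notebook can authenticate as the service principal using Spark configuration along these lines (the standard documented pattern; the placeholders, secret scope and key names are assumptions):

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")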

SAS tokens to connect

Go into the storage account and click Security - Shared access signature. Set the allowed services to Blob and File (I think), and the allowed resource types to Container and Object. Click to generate the signature and have it ready for use. With this token, the Databricks client can get permission to access the storage account.

Next, from:

https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage#access-azure-data-lake-storage-gen2-or-blob-storage-using-a-sas-token

Scroll down to the SAS Tokens section, and copy the following code:

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))

into the Databricks notebook.

Then replace the dbutils.secrets.get(...) call in the last line with the SAS token generated above, so that it looks like:

spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "token_string_here")

Using Key Vault secrets

In the Key Vault, create a new secret named databricksserviceprincipalsecret and give it a value.

Next you need to create a secret scope for this secret, by going to the following URL, constructed from your standard Databricks workspace URL:

https://adb-3694131035979511.11.azuredatabricks.net#secrets/createScope

(note this does not work with the free Databricks Community Edition)

Then enter the DNS name (URL) of the key vault (from its Properties page):

https://keyvaultmillersql1.vault.azure.net/

and its Resource ID, also from the same page, like:

/subscriptions/df3267b7-1d81-4547-b38a-9f8e2e538ccd/resourceGroups/resourcegroup1/providers/Microsoft.KeyVault/vaults/keyvaultmillersql1

By creating this scope, a new Access Policy is created in the key vault, allowing Databricks to read the required secret.
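Once the scope exists, the secret can be read from a notebook (the scope name here is an assumption; the key is the secret created above):

dbutils.secrets.listScopes()   # confirm the Key Vault-backed scope is visible
sp_secret = dbutils.secrets.get(scope="<scope-name>", key="databricksserviceprincipalsecret")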

Mount points

From the documentation at:

https://docs.databricks.com/en/dbfs/mounts.html

there is a code template to create a mount point, in the section "Mount ADLS Gen2 or Blob Storage with ABFS".

Get the Directory (tenant) ID for your tenant from https://portal.azure.com/#settings/directory
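Filled in with the service principal and tenant details, that template looks roughly like this (a sketch; all placeholder values are assumptions):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs)

dbutils.fs.ls("/mnt/<mount-name>")   # the storage now appears under a DBFS path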

One downside to using mount points is that once mounted they are accessible to all Databricks users.

AAD Credentials Pass-thru

You can create a cluster in Databricks with the option "Enable credential passthrough" turned on. Once this is done, Databricks passes through the Azure account you are logged on to Databricks with, and authenticates with it against the Azure storage back end.

Note this does not work with Data Factory (Data Factory cannot run a Databricks notebook by passing its own identity through to the notebook it wants to run).

But it does work with mount points.

34 - Common data transformations

35 - Writing data to ADLSg2 from Azure Databricks

It appears to be possible to click Edit on a cluster and add connection settings (several lines of Spark configuration) that allow it to connect to Azure Storage automatically.
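A sketch of what those lines might look like in the cluster's Spark config, assuming the service principal approach and a secret scope (the spark.hadoop. prefix and placeholders are assumptions; the {{secrets/...}} syntax is the standard way to reference a secret from cluster config):

spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token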

Databricks tables are stored in Azure Storage, in resources inside the Databricks-managed resource group. You don't have access rights to those resources, so the only way to edit this data is through Databricks.

Databricks table columns can have an identity property with the expression: id bigint generated always as identity

Note the values assigned won't be contiguous or consecutive, but they are unique. For example they can be 1, 3, 6, 4, 5, 9, 11, etc.
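A minimal sketch of such a table, created from a notebook (the table and column names are placeholders):

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orders (
        id BIGINT GENERATED ALWAYS AS IDENTITY,
        order_ref STRING,
        amount DOUBLE
    )
""")
spark.sql("INSERT INTO sales_orders (order_ref, amount) VALUES ('A-001', 10.5)")
spark.sql("SELECT * FROM sales_orders").show()   # id is filled in automatically, unique but not necessarily consecutive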

36 - Automating the process with Azure Databricks Autoloader

If you are running an Auto Loader process using spark.readStream, which continuously monitors a directory for new files and loads them into a table, then uploading a file whose schema differs from the existing schema will cause the job to fail. You can manually start it again, and if you are using .option("cloudFiles.inferColumnTypes", True) it will read the schema of the new file and incorporate it into the destination table.
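A minimal Auto Loader sketch of this pattern (paths, file format and table name are placeholders):

(spark.readStream
    .format("cloudFiles")                                       # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/orders")
    .option("cloudFiles.inferColumnTypes", True)
    .load("/mnt/landing/orders")
    .writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")
    .toTable("orders_bronze"))                                  # continuously loads new files into the table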

If you add the following option: .option("cloudFiles.schemaEvolutionMode", "rescue")

then it will handle the new schema by not adding the new column; instead the new column's data is placed in the _rescued_data column.

Another feature of this is that if a value in a new file has the wrong data type for an existing column, the file will still be loaded, but that column will be set to null and the offending value will be written to the _rescued_data column.

Auto Loader can also be configured in file notification mode, where storage events for new files are written to a queue and the queue messages are read to discover and load the files.

37 - Orchestrating Databricks notebooks

From ChatGPT it looks like you cannot run notebooks in Databricks Community Edition directly from Synapse Analytics. It does say that in Databricks Community Edition you can create a Personal Access Token (PAT) and call it from Synapse Analytics, but I have found this is not the case.

With standard (non-Community) Databricks, there is more than one type of cluster, each with different pricing. The main one is "All-Purpose Compute". The next is "Jobs Compute", which is the same as the first except you can't use it to run notebooks interactively in the Databricks web front end; it can only run scheduled jobs and jobs called from, for example, Synapse Analytics. This costs around half the price. These job clusters are only created when you select "New Job Cluster" in a Synapse Analytics pipeline rather than using an existing cluster.

The URL for an existing Databricks cluster is like:

https://adb-3694131035979511.11.azuredatabricks.net/compute/clusters/1223-231655-tfslm79j?o=3694131035979511

And here, the part that reads 1223-231655-tfslm79j is the cluster ID.

You can also add the Synapse Analytics workspace's managed identity to the Databricks workspace's Access Control (IAM) with the Contributor role. And you can also do this, or similar, within Databricks itself, in the settings under the user menu at the top right.

Within a non-Community Databricks cluster, you can give users permission to start or restart the cluster.

If the cluster was set up in single-user mode, it can't be run by an external managed identity such as Synapse. You don't seem to be able to change it to multi-user afterwards, so it is best to create a new cluster.

You can create Databricks Policies to restrict the ability of external services like Synapse Analytics to create new clusters. This is necessary to prevent accidental creation of expensive cluster resources.

Widgets

These are parameters that can be provided to notebooks, and they are shown along the top of the notebook. They are created from Edit - Add Parameter,

and used with commands like: print(dbutils.widgets.get("param2"))

To provide these parameters when you run the notebook from Synapse Analytics, go to Settings - Base Parameters in the pipeline's Notebook activity.
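Widgets can also be created in code rather than from the Edit menu; a minimal sketch (the widget name, default and label are assumptions):

dbutils.widgets.text("param2", "default_value", "Param 2")   # name, default value, label
print(dbutils.widgets.get("param2"))                         # returns the value passed in (e.g. from Base Parameters) or the default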

Workflows

The Workflows menu item on the left (in full Databricks) is where you configure job scheduling.

38 - Transforming data with dbt

DBT is used to transform data once it has been ingested into your data warehouse.

But it is outside the scope of the DP203 exam.