Exam DP-203 Course (Tybul) - 32: A closer look at Databricks notebooks
Chapter 31
Cluster Access modes:
- Single User. All languages, but for a single user
- Shared. For multiple users. Resources are isolated for each user separately.
- No Isolation Shared. Same as Shared, but without the per-user isolation - resources are shared between users.
Databricks Runtime versions: you want to choose one marked LTS (Long Term Support).
Chapter 32
The main difference between the Databricks Premium and Standard tiers is that the Premium tier lets you apply permissions (access control) to objects. In the Standard tier you cannot, I believe, so effectively all users are admins.
If a cluster is left unused for 30 days, it will be removed automatically. To prevent this from happening, "pin" it to the dashboard in the web front end.
Under the Workspace folder you can create subfolders. There are two folders there already:
- Shared
- Users. Under here there is a folder for each user (named as their email). And notebooks that they create appear here by default.
Right-click a folder and choose Share to give other users access to the items (notebooks) within it (Premium tier only).
To create a new cell in a notebook, press Ctrl+Alt+N, or hover below an existing cell and click +Code. Note you can also click +Text, which creates a new cell starting with the magic command %md (Markdown) - a comment cell where you can also use HTML tags to mark up the text.
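For example, a text cell might contain something like this (the content is just an illustration):

%md
### Notes
This text is rendered as Markdown, and you can also use HTML tags such as <b>bold</b>.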
dbutils is the main utility object used from notebooks; it exposes modules such as dbutils.fs (file system), dbutils.secrets, and dbutils.widgets.
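A few illustrative calls (the scope, key, and widget names here are just placeholders):

dbutils.fs.ls("/databricks-datasets")                  # list the built-in sample datasets
dbutils.secrets.get(scope="my-scope", key="my-key")    # read a secret from a secret scope
dbutils.widgets.get("param1")                          # read a notebook parameter (widget)
dbutils.help()                                         # show the available dbutils modules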
When you create a Databricks workspace, it also adds a managed resource group that is controlled by Databricks, not by you. It contains a storage account, virtual machines (while clusters are running), and a virtual network.
The sample datasets in Databricks are stored in the above storage account.
Note that this storage account will give you an access-denied error if you try to open it, even as an administrator, because the rights are held by the Databricks service.
Use: .option("inferschema", "true") in order to read the schema from the target. This should cause the datatypes shown in the column headings to change to the appropriate one for the data type.
33 - Connecting to ADLSg2 from Azure Databricks
Unity Catalog
In the portal, go to Entra ID, and then under Manage, select App Registrations
Here you can add an app registration, which creates a Service Principal - essentially a user account for applications.
Then within this service principal, under Manage click Certificates & secrets. Add a new client secret, giving it a name, and leave the default expiry of 180 days. Once created, it appears in the list of secrets with a long autogenerated value. Copy this value immediately, as it will not be shown again.
Then create an Entra group, and add this service principal to it.
Next go into the storage account, click Access Control (IAM), then Role Assignments. Click Add, and assign the "Storage Blob Data Contributor" role to the group, thus giving this new group (and the app registration within it) access to the storage account.
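With the service principal and role assignment in place, a notebook can authenticate to the storage account with OAuth settings like the following (a sketch based on the documented pattern; the placeholders and the secret scope/key names are assumptions):

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<service-principal-secret-key>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")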
SAS tokens to connect
Go into the storage account and click Security - Shared access signature. Set the allowed services to Blob and File (I think), and the allowed resource types to Container and Object. Click to generate the SAS token and have it ready for use. With this token, the Databricks client can get permission to access the storage account.
Next, from the documentation, scroll down to the SAS Tokens section and copy the following code:
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS") spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider") spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))
into the Databricks notebook.
Then replace the dbutils.secrets.get(...) call with the SAS token generated above, so that the last line looks like:
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "token_string_here")
Using Key Vault secrets
In Key Vault, create a new secret named databricksserviceprincipalsecret and set its value to the service principal secret created earlier.
Next you need to create a scope for this secret, by going to the following URL constructed from your standard Databricks URL:
https://adb-3694131035979511.11.azuredatabricks.net#secrets/createScope
(note this does not work with the free Databricks Community Edition)
Then enter the URL of the key vault (from its Properties window):
https://keyvaultmillersql1.vault.azure.net/
and its Resource ID, also from the same place, e.g.:
/subscriptions/df3267b7-1d81-4547-b38a-9f8e2e538ccd/resourceGroups/resourcegroup1/providers/Microsoft.KeyVault/vaults/keyvaultmillersql1
By creating this scope, a new Access Policy is created in the key vault, allowing Databricks to read the required secret.
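Once the scope exists, the secret can be read in a notebook without the value ever appearing in code (the scope name below is hypothetical):

sp_secret = dbutils.secrets.get(scope="keyvault-scope", key="databricksserviceprincipalsecret")
# The value is redacted if you try to print it in notebook output.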
Mount points
From the documentation at:
https://docs.databricks.com/en/dbfs/mounts.html
there is a code template to create a mount point, in the section "Mount ADLS Gen2 or Blob Storage with ABFS".
Get the Directory ID for your tenant from https://portal.azure.com/#settings/directory
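The template from that page looks roughly like this (every placeholder, including the secret scope and key, must be replaced with your own values):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<service-principal-secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs
)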
One downside to using mount points is that, once mounted, they are accessible to all Databricks users in the workspace.
AAD credential passthrough
You can create a cluster in Databricks with the option "Enable credential passthrough…" enabled. Once this is done, Databricks passes through the Azure account you are logged in to Databricks with, and uses it to authenticate against the Azure storage back end.
Note this does not work with Data Factory (Data Factory cannot run a Databricks notebook by passing Data Factory's own identity through to the notebook it wants to run).
But it does work with mount points.
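With passthrough enabled, a notebook can read a storage path directly and access is evaluated against the signed-in user's permissions, e.g. (path is a placeholder):

df = spark.read.csv("abfss://<container>@<storage-account>.dfs.core.windows.net/folder/data.csv", header=True)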
34 - Common data transformations
35 - Writing data to ADLSg2 from Azure Databricks
It appears to be possible to click Edit on a cluster and add the connection settings (the several configuration lines above) to its configuration, so that it auto-connects to Azure Storage.
Databricks tables are stored in Azure Storage, in resources inside the managed Databricks resource group. You don't have access rights to those resources, so the only way to edit this data is through Databricks.
Databricks table columns can have an identity property with the expression: id bigint generated always as identity
Note the values assigned won't necessarily be contiguous or in order, but they are unique - for example 1, 3, 6, 4, 5, 9, 11, etc.
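A hedged example of such a table definition (the table and column names are hypothetical):

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        id   BIGINT GENERATED ALWAYS AS IDENTITY,
        name STRING
    )
""")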
36 - Automating the process with Azure Databricks Autoloader
If you are running an Auto Loader process using spark.readStream, which continuously monitors a directory for new files and loads them into a table, then uploading a file whose schema differs from the existing schema will make the job fail. You can manually start it again, and if you are using .option("cloudFiles.inferColumnTypes", True) it will read the schema of the new file and incorporate the new columns into the destination table.
If you add the following option item: .option("cloudFiles.schemaEvolutionMode", "rescue")
then it handles the new schema by not adding the new column; instead, the new data is placed into the _rescued_data column.
Another feature of this: if the data type of a value in a new file does not match the existing column's data type, the file is still loaded, but that column is set to null and the offending value is written to the _rescued_data column.
Auto Loader can also be configured (file notification mode) so that events for new files are written to a queue, and it reads the queue items to discover and load the files.
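A minimal Auto Loader sketch tying these options together (the paths, file format, and table name are assumptions):

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/autoloader/schema")     # where inferred schemas are tracked
    .option("cloudFiles.inferColumnTypes", True)
    .option("cloudFiles.schemaEvolutionMode", "rescue")                # unexpected data goes to _rescued_data
    .load("/mnt/landing/")
    .writeStream
    .option("checkpointLocation", "/mnt/autoloader/checkpoint")
    .toTable("bronze_sales"))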
37 - Orchestrating Databricks notebooks
From ChatGPT it looks like you cannot run notebooks in Databricks Community Edition directly from Synapse Analytics. It does say that in Databricks Community Edition you can create a Personal Access Token (PAT) and call that from Synapse Analytics, but I have found this is not the case.
With standard (non-Community) Databricks there is more than one type of cluster, each with its own pricing. The main one is "All-Purpose Compute". The next is "Jobs Compute", which is the same compute but cannot be used interactively in the Databricks web front end; it can only run scheduled jobs and jobs called externally, for example from Synapse Analytics. It costs around half the price. These job clusters are only created at the point a job runs - for example, in a Synapse Analytics pipeline when you choose "New Job Cluster" rather than an existing cluster.
In the URL for an existing Databricks cluster, the part that looks like 1223-231655-tfslm79j is the Cluster ID.
You can also add the Synapse Analytics managed identity to the Databricks workspace's Access Control (IAM) with the Contributor role. You can also do this, or something similar, inside Databricks via the user settings at the top right.
Within a non-Community Databricks cluster, you can grant users permission to start or restart the cluster.
If the cluster was set up in single-user mode, it cannot be run by an external managed identity such as Synapse. You don't seem to be able to change it to a shared mode afterwards, so it is best to create a new cluster.
You can create Databricks Policies to restrict the ability of external services like Synapse Analytics to create new clusters. This is necessary to prevent accidental creation of expensive cluster resources.
Widgets
These are parameters that can be provided to notebooks; they appear along the top of the notebook. They are created from Edit - Add Parameter and read with commands like: print(dbutils.widgets.get("param2"))
To provide these parameters when you run the notebook from Synapse Analytics, in the pipeline task go to Settings - Base Parameters
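A widget can also be created and read in code, e.g. (the widget name and default value are hypothetical):

dbutils.widgets.text("param2", "default_value")     # create a text widget with a default
print(dbutils.widgets.get("param2"))                # prints the value passed in, or the default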
Workflows
The Workflows item in the left-hand menu of full (non-Community) Databricks is where you configure job scheduling.
38 - Transforming data with dbt
dbt is used to transform data once it has been ingested into your data warehouse.
But it is outside the scope of the DP-203 exam.