
39 - Azure Synapse Analytics - Spark Pools

Databricks tends to have newer versions of Apache Spark than Synapse Analytics.

Synapse notebooks offer C# (.NET for Apache Spark) as a language option, which Databricks notebooks do not, although .NET support has been dropped from the newer Synapse Spark runtimes.

Whereas in Databricks you might use the following command: display(dbutils.fs.ls("abfss://raw@storageaccountmillersql.dfs.core.windows.net/automation"))

in Synapse the dbutils command does not exist (it is a Databricks thing only). Replace it with: mssparkutils.fs.ls
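
For example, the equivalent of the Databricks command above in a Synapse PySpark notebook (reusing the same storage path from the Databricks example):

# mssparkutils is available by default in Synapse notebooks
files = mssparkutils.fs.ls("abfss://raw@storageaccountmillersql.dfs.core.windows.net/automation")
display(files)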

In Synapse you need to use the %%sql cell magic (rather than Databricks' %sql) to turn a cell into a SQL cell.

To run a SQL command in Synapse, outputting the output into a Spark dataframe, run the query like:

df = spark.sql("""
select
  col1
, col2
from MyData
""")

where MyData is the name of the source table.

Note that the triple quotes are necessary to denote a multi-line string in Python.

Tables created in Synapse are by default saved as type Hive, rather than as type Delta as they are in Databricks. Add USING DELTA to the CREATE TABLE command to save a table as Delta in Synapse.
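
As a sketch (the table and column names are placeholders), using the %%sql magic mentioned above:

%%sql
-- saved in Delta format rather than the default Hive table type
CREATE TABLE MyDeltaTable
(
    col1 STRING,
    col2 INT
)
USING DELTA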

Run notebooks as System Managed Identity

If you run a notebook in Synapse, by default it runs under your own Entra account. If you are only set up as Owner of the underlying storage account, that may not give you the required data-plane permissions.

If the system managed identity of the Synapse workspace has the Storage Blob Data Contributor role (or similar) on the Azure storage account, then when Synapse itself runs the notebook (for example as a scheduled pipeline activity), it will have the required rights.

But there is a third option: you can run a notebook interactively yourself while it executes under the system managed identity. This is done in the notebook's Settings window, using the Run as managed identity option, after which the notebook will say Using Managed Identity on the top right-hand side.

If you run a df.write command, you can add .partitionBy("year", "month") to partition the saved files by these columns; the data will go into separate directories, one per partition value.
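
For example (assuming df has year and month columns; the path and format are just illustrative):

# writes folders such as .../output/year=2024/month=1/ under the target path
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("abfss://raw@storageaccountmillersql.dfs.core.windows.net/automation/output"))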

Run SQL command: DESCRIBE TABLE EXTENDED to get details of source file locations for the data underlying the table.
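
For example, against the MyData table used earlier (a %%sql cell; the table name is just the one from the example above):

%%sql
-- the Location field in the output shows the folder holding the table's underlying files
DESCRIBE TABLE EXTENDED MyData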

41 - Transforming data with Data Flows

Data Flows in Data Factory and Synapse Analytics are not the same as dataflows in Fabric.

Data flows are created within Synapse under the Develop tab.

In the Source, you choose one of three source types:

  1. Integration Dataset: a reusable dataset defined in the workspace and backed by a linked service (for example to Azure Storage), which can be shared with other pipelines and activities.
  2. Inline: the source definition lives inside the data flow itself rather than as a reusable dataset; some formats, such as Delta, are only available inline.
  3. Workspace DB: This reads from a database defined in the local workspace (Data - Workspace tab)

To run a data flow using the Debug option, you need to click to enable the Data Flow Debug slider. This prompts you to start an hour's run of the AutoResolveIntegrationRuntime, which is chargeable. I don't know at what rate, but if it is in the order of $1/hour as the Databricks clusters and dedicated SQL pools are, then I don't want to spend this just for the purpose of training.

For example, in a Derived Column transform you can use the data flow expression language, such as a case() expression with like() to categorise rows:

case(
   like(col1, "%toy%"), "Toy",
   like(col2, "%shirt%"), "Shirt",
   like(col3, "%helmet%"), "Helmet",
   "Other"
)

Note that it is not possible to save the data at the sink in Delta format if using 1. Integration Dataset. You have to use 2. Inline to do this, by selecting Delta from the Inline dataset type popup menu there.

For the destination you can create a new directory for the output by entering /folder1/folder2 in the output path.

The Surrogate Key transform is used to create and assign a surrogate key value.

The Window transform is used to provide Window functions on the data, for example rank etc. Though note that there is an explicit Rank transform that is also available.

43 - Azure Synapse Dedicated SQL Pool - Architecture and overview

With Dedicated SQL Pool, the data is always stored across 60 distributions, which are processed by between 1 and 60 compute nodes depending on the performance level.

Ideally you want a 1-to-1 relationship between distributions and compute nodes (although this is the most expensive option), and you want to avoid as far as possible the need to move data between distributions at query time (a process called shuffling).

The Performance Level, which determines the number of compute nodes, runs from DW100c, which has only a single compute node, to DW30000c, which has 60 compute nodes.

You can connect to a dedicated SQL pool from your client machine using SQL Server Management Studio.

Dedicated SQL Pools will keep on running until you pause them. There is no built-in mechanism to pause them automatically.

Hash-distributed tables

Fact tables are typically stored as hash-distributed tables. What this means is that a deterministic hash function runs on one of the key columns of each row, for example customer_id (but not a date column), and the resulting hash value determines which one of the 60 distributions the row will be saved into.
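
A minimal sketch of a hash-distributed fact table (the table and column names are placeholders):

CREATE TABLE dbo.FactSales
(
    Sale_Id      BIGINT NOT NULL,
    Customer_Id  INT NOT NULL,
    Sale_Date    INT NOT NULL,
    Amount       DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = HASH(Customer_Id),   -- the hash column, not a date column
    CLUSTERED COLUMNSTORE INDEX
);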

44 - Loading data to Dedicated SQL Pool

If a file is named like MyData.snappy.parquet, the snappy part means it has been compressed with the Snappy compression algorithm.

I might want to repeat this chapter.

Polybase

Connected to the dedicated SQL pool in a SQL script, run the command:

CREATE DATABASE SCOPED CREDENTIAL

to create a credential that the dedicated SQL pool uses to authenticate to external storage (such as the data lake holding the source files).
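
A hedged sketch of the usual PolyBase sequence that follows (all object names and the storage URL are placeholder assumptions; a database master key must already exist, and here the pool's managed identity is used as the credential):

-- credential the pool uses to reach the storage account
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'Managed Service Identity';

-- external data source pointing at the raw container
CREATE EXTERNAL DATA SOURCE RawStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://raw@storageaccountmillersql.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

-- file format for snappy-compressed parquet files
CREATE EXTERNAL FILE FORMAT SnappyParquet
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- external table over the files, then load into an internal table with CTAS
CREATE EXTERNAL TABLE dbo.MyData_External
(
    col1 VARCHAR(100),
    col2 INT
)
WITH (
    LOCATION = '/automation/',
    DATA_SOURCE = RawStorage,
    FILE_FORMAT = SnappyParquet
);

CREATE TABLE dbo.MyData
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.MyData_External;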

Delta source data (as opposed to Parquet or CSV) can only be read by Data Flow objects using the Inline source type, not the Integration Dataset type. It cannot be read by Copy Data activities.

To add the neilmadf user-assigned managed identity, go to the Azure Portal, search for Managed Identities, and create the neilmadf identity there. Next go into the Data Factory resource, and

45 - Azure Synapse Dedicated SQL Pool - extra features

Workload groups

REQUEST_MIN_RESOURCE_GRANT_PERCENT: setting this to 25 means that each request running in the group is guaranteed at least 25% of the resource in question (e.g. memory); the minimum guaranteed to the group as a whole is set with MIN_PERCENTAGE_RESOURCE.
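
A sketch (the group name and percentages are just illustrative):

CREATE WORKLOAD GROUP wgDataLoads
WITH
(
    MIN_PERCENTAGE_RESOURCE = 25,            -- resources guaranteed to the whole group
    CAP_PERCENTAGE_RESOURCE = 50,            -- upper limit the group can consume
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5   -- minimum grant for each individual request
);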

To enable results caching, run the commands:

USE master

ALTER DATABASE dedicatedpool1 SET RESULT_SET_CACHING ON

Note that although the command says "DATABASE", the dedicated SQL pool is the database being altered; the command has to be run while connected to the master database.

If you add OPTION (LABEL = 'label1') to the end of a query, that query is assigned this label, which means it can be found afterwards in a DMV or other parts of the database service.
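
For example (the query, label text and table name are placeholders):

-- run a labelled query...
SELECT COUNT(*) AS SaleCount
FROM dbo.FactSales
OPTION (LABEL = 'label1');

-- ...then find it again in the requests DMV
SELECT request_id, [label], status, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'label1';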

Partitioning

To remove old data in an old partition, switch out the old partition into a new empty table with the same structure, and then truncate that table.

To add new data in a new partition, take the existing newest partition, and split it to create a newer partition.
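
Hedged sketches of both operations, assuming the partitioned FactSales table shown in the CREATE TABLE example below (the staging table must be empty with a matching structure and partition scheme, and with a clustered columnstore index the partition being split generally needs to be empty):

-- remove old data: switch partition 1 out to a staging table, then truncate it
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old PARTITION 1;
TRUNCATE TABLE dbo.FactSales_Old;

-- make room for new data: split the newest partition at a new boundary value
ALTER TABLE dbo.FactSales SPLIT RANGE (20240401);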

Normally partitions are orthogonal to distributions: each partition is itself spread across the (up to) 60 distributions. This makes sense because we want the latest data (in the latest partition) to be split over all the distributions. Use partitions only when each partition/distribution intersection has at least 1 million rows, otherwise the clustered columnstore indexes won't be as efficient.

CREATE TABLE with a partitioning clause, for example extending the FactSales table from earlier with a partition on Sale_Date:

CREATE TABLE dbo.FactSales
(
    -- column definitions as in the earlier example
)
WITH
(
    DISTRIBUTION = HASH(Customer_Id),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( Sale_Date RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301) )
);

The RANGE RIGHT means that values on the boundary date e.g. 20240201 will go into the partition on the right of the date.

Data Skew: DBCC PDW_SHOWSPACEUSED shows the distribution of data across the distributions.
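
For example (the table name is a placeholder); one row is returned per distribution, so uneven row counts indicate skew:

DBCC PDW_SHOWSPACEUSED("dbo.FactSales");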

TDE - Transparent Data Encryption encrypts at rest the data in the data files, transaction log files, and backup files.

To set it, go into the dedicated SQL pool resource (it is a resource in the resource group), and under Security - Transparent Data Encryption enable it. Note that the dedicated SQL pool needs to be running to make this change.

46 - Dedicated SQL Pool security

Row-level security

This is implemented by:

  1. Creating a SQL function with a string input parameter, which returns 1 when the current username is a particular value and the parameter value is in a set of static values.
  2. Creating a SECURITY POLICY on the table where you want to implement row-level security, which calls this function with the column you want to filter on supplying the function's input parameter on a row-by-row basis.
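
A minimal sketch of that pattern (the schema, function, table, column and user names are all placeholder assumptions; GO is the SSMS batch separator):

-- 1. predicate function: returns a row containing 1 when the row should be visible
CREATE SCHEMA Security;
GO
CREATE FUNCTION Security.fn_region_predicate (@Region AS VARCHAR(20))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result
       WHERE USER_NAME() = 'sales_manager'
          OR (USER_NAME() = 'uk_analyst' AND @Region IN ('UK', 'Ireland'));
GO
-- 2. security policy: applies the function to the Region column of each row
CREATE SECURITY POLICY RegionFilter
ADD FILTER PREDICATE Security.fn_region_predicate(Region)
ON dbo.Sales
WITH (STATE = ON);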

Dynamic Data Masking

This is not at-rest encryption: the data files can still be read by anyone who gains access to them.

There are different types of mask:

  • The default mask swaps strings to XXXX, numbers to 0, and dates to 1900-01-01
  • The email mask swaps fred@blogs.net to fXXX@XXXX.com
  • The random mask substitutes a random value within a specified range (numeric columns only)
  • The custom (partial) mask keeps a chosen number of leading and trailing characters and pads the middle with a custom string

Dynamic Data Masking only applies to ordinary users, not admins.

Note that Dynamic Data Masking does not prevent query predicates from operating correctly on the underlying data. So users could use a binary chop method in a where clause to find a value.
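
A hedged sketch of applying the mask types above to an existing table (the table and column names are placeholder assumptions):

ALTER TABLE dbo.Customers ALTER COLUMN LastName    ADD MASKED WITH (FUNCTION = 'default()');
ALTER TABLE dbo.Customers ALTER COLUMN Email       ADD MASKED WITH (FUNCTION = 'email()');
ALTER TABLE dbo.Customers ALTER COLUMN CreditScore ADD MASKED WITH (FUNCTION = 'random(300, 850)');
-- custom string mask: keep the first 2 and last 2 characters, pad the middle
ALTER TABLE dbo.Customers ALTER COLUMN Phone       ADD MASKED WITH (FUNCTION = 'partial(2, "XXXXXX", 2)');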

47 - Azure Synapse Analytics - Serverless SQL Pool

Serverless: you don't pay for uptime. Instead you only pay for the data processed by your queries, at a rate of around $5/TB, which is quite cheap.

Use case: Ad-hoc data exploration
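
For example, an ad-hoc serverless query over parquet files in the data lake might look like this (the storage URL reuses the earlier example and is an assumption):

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://storageaccountmillersql.dfs.core.windows.net/raw/automation/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];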