Exam DP-203: Hybrid Transactional / Analytical Processing

From MillerSql.com

Hybrid Transactional / Analytical Processing (HTAP) is where Azure Synapse Link replicates transactional data into an analytical data store, so it can be queried without impacting the operational workload.

From CosmosDB: Azure Synapse Link for CosmosDB

For Azure SQL Database or a SQL Server instance: Azure Synapse Link for SQL (the data lands in a dedicated SQL pool, and a Spark pool can connect to that dedicated pool).

Azure Synapse Link for Dataverse. Where Dataverse (the Power Platform database) is the OLTP database. Note that here the data goes into Azure Data Lake Storage Gen2, not a dedicated SQL pool.

CosmosDB

First, enable Azure Synapse Link on the CosmosDB resource in the Portal by clicking a button.

Alternatively this can be done in the Azure CLI by running:

az cosmosdb update --name my-cosmos-db --resource-group my-rg --enable-analytical-storage true

or PowerShell:

Update-AzCosmosDBAccount -Name "my-cosmos-db" -ResourceGroupName "my-rg" -EnableAnalyticalStorage 1

Once enabled it cannot be disabled.

Dynamic Schema Maintenance

As schema changes are made in the upstream OLTP store, they are replicated down to the analytical store. The schema is inferred from the JSON documents.

There are two types of schema representation:

Well-defined: the first instance of each property in the JSON determines its data type; later values that don't match that type are not represented (they appear as null). This is the default for the core (SQL) API.

Full fidelity: each instance of data is stored together with its data type, allowing the type to vary between documents. This is the default for accounts using the API for MongoDB.
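As a rough illustration of the difference (a hypothetical Python sketch, not how Synapse actually implements it): under the well-defined representation the first document fixes each property's type, while under full fidelity every value carries its own type tag.

```python
# Hypothetical sketch of the two schema representations (illustration only).

def well_defined(docs):
    """First occurrence of a property fixes its type; later values of a
    different type are represented as null (None)."""
    fixed_types = {}
    rows = []
    for doc in docs:
        row = {}
        for key, value in doc.items():
            fixed_types.setdefault(key, type(value))
            row[key] = value if isinstance(value, fixed_types[key]) else None
        rows.append(row)
    return rows

def full_fidelity(docs):
    """Every value is stored together with a tag naming its type, so the
    same property may change type between documents."""
    return [{k: {type(v).__name__: v} for k, v in doc.items()} for doc in docs]

docs = [{"price": 10}, {"price": "10.99"}]  # the type changes between documents

print(well_defined(docs))   # [{'price': 10}, {'price': None}]
print(full_fidelity(docs))  # [{'price': {'int': 10}}, {'price': {'str': '10.99'}}]
```

The second document's string price is lost under well-defined (it becomes null) but preserved, with its type, under full fidelity.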

CosmosDB stores its data in containers, and the analytical store is enabled per container, on top of the account-level setting above. This can be configured for a container in the portal.

Or in the Azure CLI:

az cosmosdb sql container create --resource-group my-rg --account-name my-cosmos-db --database-name my-db --name my-container --partition-key-path "/productID" --analytical-storage-ttl -1

or PowerShell:

New-AzCosmosDBSqlContainer -ResourceGroupName "my-rg" -AccountName "my-cosmos-db" -DatabaseName "my-db" -Name "my-container" -PartitionKeyKind "hash" -PartitionKeyPath "/productID" -AnalyticalStorageTtl -1

As with the account-level setting, this configuration cannot be removed without deleting the container.

Linked Service

Next, in Synapse Studio, you need to create a linked service to the CosmosDB account, to be able to pull the data into Synapse. "Connect to external data".

Query from Spark

You can apparently use a Spark pool to query the CosmosDB using the linked service.

df = spark.read\
    .format("cosmos.olap")\
    .option("spark.synapse.linkedService", "my_linked_service")\
    .option("spark.cosmos.container", "my-container")\
    .load()

display(df.limit(10))

It says "The data is loaded from the analytical store in the container, not from the operational store"

You can also write data back to the container's operational store using mydf.write.format("cosmos.oltp"), with the same linkedService and container options, followed by .mode("append").save()

and you can run Spark SQL code against it, by registering the DataFrame as a temporary view with df.createOrReplaceTempView and querying it with spark.sql.

Query from Serverless pool

In addition to using a Spark pool, you can also query an Azure Cosmos DB analytical container by using the built-in serverless SQL pool in Azure Synapse Analytics.

SELECT *
FROM OPENROWSET(
    'CosmosDB',
    'Account=my-cosmos-db;Database=my-db;Key=abcd1234....==',
    [my-container]) AS products_data

If the source JSON contains multi-level data, like:

{
    "productID": 126,
    "productName": "Sprocket",
    "supplier": {
        "supplierName": "Contoso",
        "supplierPhone": "555-123-4567"
    },
    "id": "62588f072-11c3-42b1-a738-...",
    "_rid": "mjMaAL...==",
    ...
}

then you can use a SELECT statement like the following, with the parsing of these levels specified in the WITH clause:

SELECT *
 FROM OPENROWSET(PROVIDER = 'CosmosDB',
                 CONNECTION = 'Account=my-cosmos-db;Database=my-db',
                 OBJECT = 'my-container',
                 SERVER_CREDENTIAL = 'my_credential'
 )
 WITH (
    ProductNo INT '$.productID',
    ProductName VARCHAR(20) '$.productName',
    Supplier VARCHAR(20) '$.supplier.supplierName',
    SupplierPhoneNo VARCHAR(15) '$.supplier.supplierPhone'
 ) AS products_data
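Conceptually, each line in the WITH clause maps a JSON path onto a flat, typed column. A minimal Python sketch of that flattening (illustrative only; it handles simple '$.a.b' paths, not real JSON-path syntax, and the function names are made up):

```python
# Hypothetical sketch: resolve simple '$.a.b' paths against a document,
# mimicking how the WITH clause projects nested JSON into flat columns.
def resolve(doc, path):
    value = doc
    for part in path.lstrip("$.").split("."):
        value = value.get(part) if isinstance(value, dict) else None
        if value is None:
            return None
    return value

def project(doc, columns):
    # columns: mapping of output column name -> JSON path
    return {name: resolve(doc, path) for name, path in columns.items()}

doc = {
    "productID": 126,
    "productName": "Sprocket",
    "supplier": {"supplierName": "Contoso", "supplierPhone": "555-123-4567"},
}

row = project(doc, {
    "ProductNo": "$.productID",
    "ProductName": "$.productName",
    "Supplier": "$.supplier.supplierName",
    "SupplierPhoneNo": "$.supplier.supplierPhone",
})
print(row)  # each nested value lands in its own flat column
```

The serverless pool does this per document, additionally casting each value to the declared SQL type (INT, VARCHAR(20), and so on).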

Provision your Azure Cosmos DB analytical storage and any client applications (for example Microsoft Power BI) in the same region as the serverless SQL pool.

Azure Cosmos DB containers can be replicated to multiple regions. If you have a multi-region container, you can specify a region parameter in the OPENROWSET connection string to ensure queries are sent to a specific regional replica of the container.

Azure Synapse Link for SQL

From Azure SQL Database or SQL Server 2022, but not Azure SQL Managed Instance

To a Dedicated SQL Pool

It uses the change feed feature in the source: changes recorded in the source transaction log are either applied directly to the destination, or first staged as files in an intermediate landing zone.
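As a rough mental model (a hypothetical Python sketch, not the actual Synapse Link implementation), the change feed is an ordered stream of insert/update/delete records that gets applied to the destination table:

```python
# Hypothetical sketch: apply an ordered change feed (as captured from the
# source transaction log) to a destination table held as a dict keyed by PK.
def apply_change_feed(destination, changes):
    for change in changes:  # changes must be applied in log order
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            destination[key] = change["row"]
        elif op == "delete":
            destination.pop(key, None)
    return destination

feed = [
    {"op": "insert", "key": 1, "row": {"name": "Sprocket", "qty": 5}},
    {"op": "update", "key": 1, "row": {"name": "Sprocket", "qty": 7}},
    {"op": "insert", "key": 2, "row": {"name": "Widget", "qty": 3}},
    {"op": "delete", "key": 2},
]
print(apply_change_feed({}, feed))  # {1: {'name': 'Sprocket', 'qty': 7}}
```

Replaying the feed in order makes the destination converge to the source's latest state, which is why ordering matters when changes are staged as intermediate files.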

You must assign a system-assigned managed identity and configure the correct firewall rules. Then, in the source database, grant the Synapse workspace access:

CREATE USER my_synapse_workspace FROM EXTERNAL PROVIDER;
ALTER ROLE [db_owner] ADD MEMBER my_synapse_workspace;

Then create a linked service in Synapse Studio pointing to the source database.

Choose the number of CPU cores to use.

Initial synchronisation is done by copying a Parquet file for each table; thereafter, once synchronised, changes are copied as CSV files.

A self-hosted integration runtime must run on your source machine if the source is an on-premises SQL Server.

Note that you can't use the workspace's default Gen2 storage account as the landing zone - you must create a separate storage account within your subscription.

In the source, run:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'my$ecretPa$$w0rd';

Also run CREATE MASTER KEY in the destination dedicated SQL pool.

In Synapse Studio, create linked services for the source SQL Server/Azure SQL Database, and the Gen2 storage (which will be used as the landing zone).

In Synapse Studio, on the Integrate page, open the + drop-down menu and select Link connection.