Exam DP203 Dedicated SQL Pool
Dedicated SQL Pool.
Reads relational data from data Warehouses.
In Synapse Studio Manage - SQL Pools tab, notice that there is an item that has been set up by the tutorial: sql9cxqfg2
Click the arrow to the left of it to resume it. I think this starts up a virtual machine in the background.
Then once it has started, in Data - Workspace, there is header "SQL Database". And inside this is database: sql9cxqfg2 underneath of which is a folder for tables and other metadata. This is a relational database that exists (by virtue of being under "Workspace") only in Synapse Studio.
Once querying is complete, pause the pool to save money.
Set the performance level for the pool: DW100c to DW30000c
Set the collation of the SQL pool, which determines sort order and string comparison rules.
Dedicated SQL pools in Synapse Analytics don't support foreign key and unique constraints as found in other relational database systems like SQL Server. This means such uniqueness must be enforced during data load.
Indexes. The default index type is clustered columnstore. Though regular clustered indexes are also supported (and is needed for the VARBINARY(MAX) datatype).
Again they use a massively parallel processing (MPP) architecture, as opposed to the symmetric multiprocessing (SMP) architecture used in most OLTP database systems.
To implement this MPP architecture, the data is distributed on the nodes according to one of the following:
- Hash. Here a hash of the data is stored on the nodes. Use with FACT tables
- Round-Robin: Here the data is split across multiple nodes. Use with STAGING tables.
- Replicate. All the data is on all of the nodes. Useful for performance with small datasets. Use with DIMENSION tables.
Table Creation
To create a dimension table (with a replicate distribution), run:
CREATE TABLE dbo.DimCustomer ( CustomerKey INT IDENTITY NOT NULL, CustomerAlternateKey NVARCHAR(15) NULL, CustomerName NVARCHAR(80) NOT NULL, EmailAddress NVARCHAR(50) NULL, Phone NVARCHAR(25) NULL, StreetAddress NVARCHAR(100), City NVARCHAR(20), PostalCode NVARCHAR(10), CountryRegion NVARCHAR(20) ) WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );
Note the key column values are auto-generated with an IDENTITY field.
To create a fact table, with a Hash distribution on the OrderNumber key, run:
CREATE TABLE dbo.FactSales ( OrderDateKey INT NOT NULL, CustomerKey INT NOT NULL, ProductKey INT NOT NULL, StoreKey INT NOT NULL, OrderNumber NVARCHAR(10) NOT NULL, OrderLineItem INT NOT NULL, OrderQuantity SMALLINT NOT NULL, UnitPrice DECIMAL NOT NULL, Discount DECIMAL NOT NULL, Tax DECIMAL NOT NULL, SalesAmount DECIMAL NOT NULL ) WITH ( DISTRIBUTION = HASH(OrderNumber), CLUSTERED COLUMNSTORE INDEX );
And to create a staging table, do the same but with
DISTRIBUTION = ROUND_ROBIN
Create External Table
You can create external tables (tables where the underlying files are in Gen2), data source, and file format objects in the same way as you can create them with the Serverless pool:
CREATE EXTERNAL TABLE ProductSalesTotals WITH ( LOCATION = 'sales/productsales/', DATA_SOURCE = sales_data, FILE_FORMAT = ParquetFormat ) AS SELECT Item AS Product, SUM(Quantity) AS ItemsSold, ROUND(SUM(UnitPrice) - SUM(TaxAmount), 2) AS NetRevenue FROM OPENROWSET( BULK 'sales/csv/*.csv', DATA_SOURCE = 'sales_data', FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE ) AS orders GROUP BY Item;
where WITH states where the table will be saved, and AS is where it reads the data in from.
Copy Into
Load data into a relational table with the Copy Into command. This is the preferred way to copy data in because it is high performance:
COPY INTO dbo.StageProducts (ProductID, ProductName, ProductCategory, Color, Size, ListPrice, Discontinued) FROM 'https://mydatalake.blob.core.windows.net/data/stagedfiles/products/*.parquet' WITH ( FILE_TYPE = 'PARQUET', MAXERRORS = 0, IDENTITY_INSERT = 'OFF' );
Data loading stages
The most commonly used set of steps to load data into a data warehouse, is:
- Load the files into the Gen2
- Load from files into Staging tables
- Load from Staging into Dimension tables
- Load from Staging into Fact tables
- Perform post-load optimization by updating indexes and table distribution statistics.
Sometimes you can skip the staging tables step (2.) by creating an external table based on the Gen2 source files.
SQL
NTILE returns the specified percentile in which the row falls. e.g. NTILE(4) returns between 1 and 4 (or is it 0 and 3?)
APPROX_COUNT_DISTINCT is the same as COUNT_DISTINCT, but has better performance, but is not accurate.
CETAS
Like with the Serverless pool, CETAS can be used. Create Table As <Select>. Bear in mind that you can't use IDENTITY
to generate a unique integer value for the surrogate key when using a CTAS statement, so this example uses the ROW_NUMBER
function.
To load data from Staging into DIM (dimension) tables, you can also do a CETAS to create a new table that will subsequently be named the DIM table, by doing a UNION of the existing DIM table and the Staging table, where rows don't match existing rows so are new.
To load data into datetime dimension tables, do a WHILE loop incrementing a datetime variable, inserting data row by row. Me: This won't be the fastest!
Slowly changing dimension
Type 0: Read-only
Type 1: Update values only (no record of previous values is kept), inserting new values.
Type 2: Insert a new record. Requires a surrogate key for uniqueness (with the Business key remaining unchanged).
With Type 2, there is normally a column to denote that a record is the current, or latest, for the particular business key. Or a pair of columns giving the date range when the row was/is current.
You can use the MERGE statment to handle the inserts and updates (upserts) for type 1 dimensions. e.g.:
MERGE dbo.DimProduct AS tgt USING (SELECT * FROM dbo.StageProducts) AS src ON src.ProductID = tgt.ProductBusinessKey WHEN MATCHED THEN -- Type 1 updates UPDATE SET tgt.ProductName = src.ProductName, tgt.ProductCategory = src.ProductCategory, tgt.Color = src.Color, tgt.Size = src.Size, tgt.ListPrice = src.ListPrice, tgt.Discontinued = src.Discontinued WHEN NOT MATCHED THEN -- New products INSERT VALUES (src.ProductID, src.ProductName, src.ProductCategory, src.Color, src.Size, src.ListPrice, src.Discontinued);