Exam DP203 Dedicated SQL Pool
Dedicated SQL Pool.
Reads relational data from data Warehouses.
In Synapse Studio Manage - SQL Pools tab, notice that there is an item that has been set up by the tutorial: sql9cxqfg2
Click the arrow to the left of it to resume it. I think this starts up a virtual machine in the background.
Then once it has started, in Data - Workspace, there is header "SQL Database". And inside this is database: sql9cxqfg2 underneath of which is a folder for tables and other metadata. This is a relational database that exists (by virtue of being under "Workspace") only in Synapse Studio.
Once querying is complete, pause the pool to save money.
Set the performance level for the pool: DW100c to DW30000c
Set the collation of the SQL pool, which determines sort order and string comparison rules.
Dedicated SQL pools in Synapse Analytics don't support foreign key and unique constraints as found in other relational database systems like SQL Server. This means such uniqueness must be enforced during data load.
Indexes. The default index type is clustered columnstore. Though regular clustered indexes are also supported (and is needed for the VARBINARY(MAX) datatype).
Again they use a massively parallel processing (MPP) architecture, as opposed to the symmetric multiprocessing (SMP) architecture used in most OLTP database systems.
To implement this MPP architecture, the data is distributed on the nodes according to one of the following:
- Hash. Here a hash of the data is stored on the nodes. Use with FACT tables
- Round-Robin: Here the data is split across multiple nodes. Use with STAGING tables.
- Replicate. All the data is on all of the nodes. Useful for performance with small datasets. Use with DIMENSION tables.
Table Creation
To create a dimension table (with a replicate distribution), run:
CREATE TABLE dbo.DimCustomer ( CustomerKey INT IDENTITY NOT NULL, CustomerAlternateKey NVARCHAR(15) NULL, CustomerName NVARCHAR(80) NOT NULL, EmailAddress NVARCHAR(50) NULL, Phone NVARCHAR(25) NULL, StreetAddress NVARCHAR(100), City NVARCHAR(20), PostalCode NVARCHAR(10), CountryRegion NVARCHAR(20) ) WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );
Note the key column values are auto-generated with an IDENTITY field.
To create a fact table, with a Hash distribution on the OrderNumber key, run:
CREATE TABLE dbo.FactSales ( OrderDateKey INT NOT NULL, CustomerKey INT NOT NULL, ProductKey INT NOT NULL, StoreKey INT NOT NULL, OrderNumber NVARCHAR(10) NOT NULL, OrderLineItem INT NOT NULL, OrderQuantity SMALLINT NOT NULL, UnitPrice DECIMAL NOT NULL, Discount DECIMAL NOT NULL, Tax DECIMAL NOT NULL, SalesAmount DECIMAL NOT NULL ) WITH ( DISTRIBUTION = HASH(OrderNumber), CLUSTERED COLUMNSTORE INDEX );
And to create a staging table, do the same but with
DISTRIBUTION = ROUND_ROBIN
Create External Table
You can create external tables (tables where the underlying files are in Gen2) in the same way as you can create them with the Serverless pool, with:
CREATE EXTERNAL TABLE ProductSalesTotals WITH ( LOCATION = 'sales/productsales/', DATA_SOURCE = sales_data, FILE_FORMAT = ParquetFormat ) AS SELECT Item AS Product, SUM(Quantity) AS ItemsSold, ROUND(SUM(UnitPrice) - SUM(TaxAmount), 2) AS NetRevenue FROM OPENROWSET( BULK 'sales/csv/*.csv', DATA_SOURCE = 'sales_data', FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE ) AS orders GROUP BY Item;
where WITH states where the table will be saved, and AS is where it reads the data in from.