Exam DP203 Dedicated SQL Pool

A Dedicated SQL Pool is the provisioned data warehouse engine in Azure Synapse Analytics: it stores relational data in a data warehouse and queries it with T-SQL.

In Synapse Studio, on the Manage - SQL pools tab, notice that there is a pool that has been set up by the tutorial: sql9cxqfg2.

Click the arrow to the left of it to resume it. Resuming starts up the pool's compute resources in the background (conceptually, starting up virtual machines).

Then once it has started, in Data - Workspace, there is a header "SQL Database", and inside this is the database sql9cxqfg2, under which are folders for tables and other metadata. This is a relational database that exists (by virtue of being under "Workspace") only in Synapse Studio.

Once querying is complete, pause the pool to save money: compute is billed only while the pool is running (storage is billed separately).

Set the performance level (service objective) for the pool, ranging from DW100c up to DW30000c.
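
You can also change the performance level of an existing pool with T-SQL, run while connected to the master database. A minimal sketch, assuming the tutorial's pool sql9cxqfg2:

-- Scale the dedicated SQL pool to the DW300c performance level
ALTER DATABASE sql9cxqfg2
MODIFY (SERVICE_OBJECTIVE = 'DW300c');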

Set the collation of the SQL pool, which determines sort order and string comparison rules.
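
The collation is chosen when the pool is created and cannot be changed afterwards. A quick sketch to check the current collation:

-- Returns the pool's collation, e.g. SQL_Latin1_General_CP1_CI_AS
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS pool_collation;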

Dedicated SQL pools in Synapse Analytics don't enforce foreign key and unique constraints the way other relational database systems like SQL Server do. This means such uniqueness must be enforced by the data load process.
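
PRIMARY KEY and UNIQUE constraints can still be declared, but only as NOT ENFORCED, so they document intent (and can help the optimizer) without rejecting duplicates. A sketch, using the dbo.DimCustomer table created below:

-- The constraint is metadata only: duplicate keys are NOT rejected at load time
ALTER TABLE dbo.DimCustomer
ADD CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerKey) NOT ENFORCED;

-- So check uniqueness yourself after loading
SELECT CustomerKey, COUNT(*) AS duplicates
FROM dbo.DimCustomer
GROUP BY CustomerKey
HAVING COUNT(*) > 1;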

Indexes. The default index type is clustered columnstore, though regular clustered (rowstore) indexes are also supported, and are required for columns whose data types columnstore cannot hold, such as VARBINARY(MAX).
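
For example, a table holding VARBINARY(MAX) data cannot use a columnstore index, so a rowstore clustered index is specified instead. A minimal sketch (the table name and columns are hypothetical):

CREATE TABLE dbo.DocumentStore
(
    DocumentKey INT NOT NULL,
    DocumentBody VARBINARY(MAX) NULL   -- not supported by clustered columnstore
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX (DocumentKey)      -- regular (rowstore) clustered index
);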

Dedicated SQL pools use a massively parallel processing (MPP) architecture, as opposed to the symmetric multiprocessing (SMP) architecture used in most OLTP database systems.

To implement this MPP architecture, table data is distributed across the compute nodes according to one of the following strategies:

  1. Hash: a deterministic hash of a chosen distribution column assigns each row to one of the distributions. Use with FACT tables.
  2. Round-Robin: rows are spread evenly across the distributions without regard to their values. Use with STAGING tables.
  3. Replicate: a full copy of the table is stored on every compute node. Useful for performance with small datasets. Use with DIMENSION tables.

Table Creation

To create a dimension table (with a replicate distribution), run:

CREATE TABLE dbo.DimCustomer
(
    CustomerKey INT IDENTITY NOT NULL,
    CustomerAlternateKey NVARCHAR(15) NULL,
    CustomerName NVARCHAR(80) NOT NULL,
    EmailAddress NVARCHAR(50) NULL,
    Phone NVARCHAR(25) NULL,
    StreetAddress NVARCHAR(100),
    City NVARCHAR(20),
    PostalCode NVARCHAR(10),
    CountryRegion NVARCHAR(20)
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);

Note that the surrogate key values are auto-generated by the IDENTITY property (in a dedicated SQL pool these values are unique, but not guaranteed to be sequential).

To create a fact table, with a Hash distribution on the OrderNumber key, run:

CREATE TABLE dbo.FactSales
(
    OrderDateKey INT NOT NULL,
    CustomerKey INT NOT NULL,
    ProductKey INT NOT NULL,
    StoreKey INT NOT NULL,
    OrderNumber NVARCHAR(10) NOT NULL,
    OrderLineItem INT NOT NULL,
    OrderQuantity SMALLINT NOT NULL,
    UnitPrice DECIMAL(19,4) NOT NULL,  -- give DECIMAL an explicit precision/scale; bare DECIMAL defaults to (18,0) and would truncate the cents
    Discount DECIMAL(19,4) NOT NULL,
    Tax DECIMAL(19,4) NOT NULL,
    SalesAmount DECIMAL(19,4) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(OrderNumber),
    CLUSTERED COLUMNSTORE INDEX
);

And to create a staging table, do the same but with

DISTRIBUTION = ROUND_ROBIN
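
For example, a minimal staging-table sketch (the table name and columns are illustrative):

CREATE TABLE dbo.StageSales
(
    OrderNumber NVARCHAR(10) NOT NULL,
    OrderLineItem INT NOT NULL,
    OrderQuantity SMALLINT NOT NULL,
    UnitPrice DECIMAL(19,4) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- rows spread evenly; no distribution column needed
    CLUSTERED COLUMNSTORE INDEX
);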

Create External Table

You can create external tables (tables whose underlying files are in Gen2), data sources, and file format objects in the same way as you can create them with the Serverless pool:

 CREATE EXTERNAL TABLE ProductSalesTotals
     WITH (
         LOCATION = 'sales/productsales/',
         DATA_SOURCE = sales_data,
         FILE_FORMAT = ParquetFormat
     )
 AS
 SELECT Item AS Product,
     SUM(Quantity) AS ItemsSold,
     ROUND(SUM(UnitPrice) - SUM(TaxAmount), 2) AS NetRevenue
 FROM
     OPENROWSET(
         BULK 'sales/csv/*.csv',
         DATA_SOURCE = 'sales_data',
         FORMAT = 'CSV',
         PARSER_VERSION = '2.0',
         HEADER_ROW = TRUE
     ) AS orders
 GROUP BY Item;

where the WITH clause states where the external table's files will be saved (location, data source, and file format), and the AS SELECT query defines the data that is read in and persisted.
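
The CETAS statement above assumes that the data source and file format objects already exist. A sketch of how they might be created, in the serverless style (the storage URL is illustrative):

-- External data source pointing at the Gen2 storage; LOCATION paths above are relative to this
CREATE EXTERNAL DATA SOURCE sales_data
WITH (
    LOCATION = 'https://mydatalake.dfs.core.windows.net/files/'
);

-- File format used when writing the external table's Parquet files
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);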

Copy Into

Load data into a relational table with the COPY INTO statement. This is the preferred way to load data because it is high performance:

COPY INTO dbo.StageProducts
    (ProductID, ProductName, ProductCategory, Color, Size, ListPrice, Discontinued)
FROM 'https://mydatalake.blob.core.windows.net/data/stagedfiles/products/*.parquet'
WITH
(
    FILE_TYPE = 'PARQUET',
    MAXERRORS = 0,
    IDENTITY_INSERT = 'OFF'
);
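
COPY INTO also handles delimited text. A sketch for a CSV load (the URL and option values are illustrative):

COPY INTO dbo.StageProducts
FROM 'https://mydatalake.blob.core.windows.net/data/stagedfiles/products/*.csv'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,              -- skip the header row
    MAXERRORS = 5,             -- tolerate a few bad rows...
    ERRORFILE = '/errorrows'   -- ...and write them here for inspection
);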

Data loading stages

The most common sequence of steps for loading data into a data warehouse is:

  1. Load the source files into Data Lake Storage Gen2
  2. Load from the files into Staging tables
  3. Load from Staging into Dimension tables
  4. Load from Staging into Fact tables
  5. Perform post-load optimization by updating indexes and table distribution statistics (a sketch follows below).

Sometimes you can skip the staging-tables step (2) by creating an external table based on the Gen2 source files.
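
A sketch of step 5, using the dbo.FactSales table from earlier (the statistics name and column choice are illustrative):

-- Rebuild the columnstore index so newly loaded rows are compressed into columnstore segments
ALTER INDEX ALL ON dbo.FactSales REBUILD;

-- Create and refresh distribution statistics on columns used in joins and filters
CREATE STATISTICS stat_FactSales_CustomerKey ON dbo.FactSales (CustomerKey);
UPDATE STATISTICS dbo.FactSales;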

SQL

NTILE(n) distributes the ordered rows into n roughly equal groups (tiles) and returns the number of the tile each row falls in. It is 1-based, so NTILE(4) returns values between 1 and 4 (not 0 and 3).
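
For example, to put each sale into a quartile by sales amount (using the dbo.FactSales table from earlier):

-- Quartile 1 = the top 25% of rows by SalesAmount, quartile 4 = the bottom 25%
SELECT OrderNumber,
       SalesAmount,
       NTILE(4) OVER (ORDER BY SalesAmount DESC) AS Quartile
FROM dbo.FactSales;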

APPROX_COUNT_DISTINCT is the approximate version of COUNT(DISTINCT ...): it is faster and uses far less memory on large datasets, but the result is not exact (it is documented to be within about 2% of the true count with 97% probability).
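
A quick comparison sketch, again on dbo.FactSales:

-- The approximate version trades a small error margin for speed and much lower memory use
SELECT COUNT(DISTINCT CustomerKey)        AS exact_customers,
       APPROX_COUNT_DISTINCT(CustomerKey) AS approx_customers
FROM dbo.FactSales;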