Exam DP203 Serverless SQL Pool
Serverless SQL Pool. Reads from data lake files.
In Synapse Studio, open the Data hub's Linked tab, expand the Azure Data Lake Storage Gen2 node, drill down to a file, then right-click it and select "New SQL script".
This generates the following script:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://datalake9cxqfg2.dfs.core.windows.net/files/product_data/products.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) AS result
In the "Connect to" drop-down list, select "Built-in", and run the script.
"result" is the alias name of the output table.
Give the script a name on the right-hand side, and click Publish. Thereafter it can be found under the "Develop" tab.
To output the results as a chart instead, click the "Chart" button in the results pane and set appropriate values in the "Category" and "Legend" drop-down lists.
Three file formats can be read: CSV, JSON, Parquet.
There is no charge for reserved resources; you're only charged for the data processed by the queries you run.
Not recommended for OLTP.
You can define external objects such as tables and views in a serverless SQL database.
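For example, a view can be defined over an OPENROWSET query in such a database — a sketch, where the path and view name are illustrative assumptions rather than values from these notes:
CREATE VIEW dbo.products_view
AS
SELECT *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/products.csv',
FORMAT = 'csv',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) AS rows;
The view can then be queried like an ordinary table, without repeating the OPENROWSET details.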
The BULK parameter includes the full URL to the location in the data lake containing the data files. These paths can include the * wildcard:
https://mydatalake.blob.core.windows.net/data/files/*: All files in the files folder.
https://mydatalake.blob.core.windows.net/data/files/**: All files in the files folder, and recursively its subfolders.
FIELDTERMINATOR - the character separating fields (default is a comma)
ROWTERMINATOR - the character marking the end of a row
FIELDQUOTE - the character used to enclose quoted string values (default is ")
FIRSTROW = 2 - the first row to read; earlier rows are skipped
HEADER_ROW = TRUE - treats the first row as column names (only available when using parser version 2.0)
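As a sketch, several of these options might be combined when reading a semicolon-delimited file whose first row is a header to be skipped (the path and delimiter here are illustrative assumptions):
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
FORMAT = 'csv',
PARSER_VERSION = '2.0',
FIELDTERMINATOR = ';',
FIELDQUOTE = '"',
FIRSTROW = 2
) AS rows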
WITH clause
Use the "WITH" clause to define the column names and datatypes:
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
FORMAT = 'csv',
PARSER_VERSION = '2.0')
WITH (
product_id INT,
product_name VARCHAR(20) COLLATE Latin1_General_100_BIN2_UTF8,
list_price DECIMAL(5,2)
) AS rows
JSON source files
If JSON files (.json) are used instead of CSV files, use the following parameters in the query above:
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
FORMAT = 'csv',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
Note that FORMAT still says csv, not json.
You can run the JSON_VALUE function on the JSON documents returned by the query, for example:
SELECT JSON_VALUE(doc, '$.product_name') AS product,
JSON_VALUE(doc, '$.list_price') AS price
FROM
OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
FORMAT = 'csv',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
) WITH (doc NVARCHAR(MAX)) as rows
Parquet source files
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.*',
FORMAT = 'parquet') AS rows
Partitioning Parquet files
If the data in the Parquet source files is partitioned into multiple files across multiple directories, specific partition directories can be targeted with the filepath function, where filepath(n) refers to the value matched by the nth wildcard in the BULK path:
SELECT *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/orders/year=*/month=*/*.*',
FORMAT = 'parquet') AS orders
WHERE orders.filepath(1) = '2020'
AND orders.filepath(2) IN ('1','2');
Create Database
CREATE DATABASE SalesDB
COLLATE Latin1_General_100_BIN2_UTF8
External Data Source
Instead of putting the full path into the BULK clause every time you query a source file, you can create an external data source that holds the path:
CREATE EXTERNAL DATA SOURCE files
WITH (
LOCATION = 'https://mydatalake.blob.core.windows.net/data/files/'
)
and then run the main select as:
SELECT *
FROM
OPENROWSET(
BULK 'orders/*.csv',
DATA_SOURCE = 'files',
FORMAT = 'csv',
PARSER_VERSION = '2.0'
) AS orders
Here the "files" data source is referenced, so the BULK parameter only needs the relative path to the file(s).
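Building on this, an external table can be created over the same data source using CETAS (CREATE EXTERNAL TABLE AS SELECT). This is a sketch: the 'curated/orders/' location and format name are illustrative assumptions, and writing the result files requires appropriate permissions on the storage account:
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
CREATE EXTERNAL TABLE dbo.orders_ext
WITH (
LOCATION = 'curated/orders/',
DATA_SOURCE = files,
FILE_FORMAT = ParquetFormat
)
AS
SELECT *
FROM OPENROWSET(
BULK 'orders/*.csv',
DATA_SOURCE = 'files',
FORMAT = 'csv',
PARSER_VERSION = '2.0'
) AS rows;
The CETAS statement both materializes the query results as Parquet files at the given location and registers an external table over them.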