Exam DP203 Serverless SQL Pool
Serverless SQL Pool. Reads from data lake files.
In Synapse Studio, open the Data hub's Linked tab, expand the Azure Data Lake Storage Gen2 node, drill down to a file, then right-click it and select "New SQL script".
This generates the following script:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://datalake9cxqfg2.dfs.core.windows.net/files/product_data/products.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) AS result
In the "Connect to" drop-down list, select "Built-in", and run the script.
"result" is the alias name of the output table.
Give the script a name on the right-hand side, and click Publish. Thereafter it can be found under the "Develop" tab.
To output the results as a chart instead, click the "Chart" button in the results pane and set appropriate values in the "Category" and "Legend" drop-down lists.
Three file formats can be read: CSV, JSON, Parquet.
There is no charge for reserved resources; you're only charged for the data processed by the queries you run.
Not recommended for OLTP.
You can define external objects such as tables and views in a serverless SQL database.
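For example, a view can be defined over an OPENROWSET query in such a database — a sketch, where the path and view name are illustrative assumptions rather than values from these notes:
CREATE VIEW dbo.products_view
AS
SELECT *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/products.csv',
FORMAT = 'csv',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) AS rows;
The view can then be queried like an ordinary table, without repeating the OPENROWSET details.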
The BULK parameter includes the full URL to the location in the data lake containing the data files. These paths can include the * wildcard:
https://mydatalake.blob.core.windows.net/data/files/*: All files in the files folder.
https://mydatalake.blob.core.windows.net/data/files/**: All files in the files folder, and recursively its subfolders.
FIELDTERMINATOR - the character separating fields (default is a comma)
ROWTERMINATOR - the character marking the end of a row
FIELDQUOTE - the character used to enclose quoted string values (default is ")
FIRSTROW = 2 - the first row to read; earlier rows are skipped
HEADER_ROW = TRUE - treats the first row as column names (only available when using parser version 2.0)
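As a sketch, several of these options might be combined when reading a semicolon-delimited file whose first row is a header to be skipped (the path and delimiter here are illustrative assumptions):
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
FORMAT = 'csv',
PARSER_VERSION = '2.0',
FIELDTERMINATOR = ';',
FIELDQUOTE = '"',
FIRSTROW = 2
) AS rows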
WITH clause
Use the "WITH" clause to define the column names and datatypes:
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
FORMAT = 'csv',
PARSER_VERSION = '2.0')
WITH (
product_id INT,
product_name VARCHAR(20) COLLATE Latin1_General_100_BIN2_UTF8,
list_price DECIMAL(5,2)
) AS rows
JSON source files
If JSON files (.json) are used instead of CSV files, use the following parameters in the query above:
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
FORMAT = 'csv',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
Note that FORMAT still says csv, not json.
You can run the JSON_VALUE function on the JSON documents returned by the query, for example:
SELECT JSON_VALUE(doc, '$.product_name') AS product,
JSON_VALUE(doc, '$.list_price') AS price
FROM
OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
FORMAT = 'csv',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
) WITH (doc NVARCHAR(MAX)) as rows
Parquet source files
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.*',
FORMAT = 'parquet') AS rows
Partitioning Parquet files
If the data in the Parquet source files is partitioned into multiple files across multiple directories, specific partition directories can be targeted with the filepath function, where filepath(n) refers to the value matched by the nth wildcard in the BULK path:
SELECT *
FROM OPENROWSET(
BULK 'https://mydatalake.blob.core.windows.net/data/orders/year=*/month=*/*.*',
FORMAT = 'parquet') AS orders
WHERE orders.filepath(1) = '2020'
AND orders.filepath(2) IN ('1','2');
Create Database
CREATE DATABASE SalesDB
COLLATE Latin1_General_100_BIN2_UTF8
External Data Source
Instead of putting the full path into the BULK clause every time you query a source file, you can create an external data source that holds the path:
CREATE EXTERNAL DATA SOURCE files
WITH (
LOCATION = 'https://mydatalake.blob.core.windows.net/data/files/'
)
and then run the main select as:
SELECT *
FROM
OPENROWSET(
BULK 'orders/*.csv',
DATA_SOURCE = 'files',
FORMAT = 'csv',
PARSER_VERSION = '2.0'
) AS orders
Here the "files" data source is referenced, so the BULK parameter only needs the relative path to the file(s).
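Building on this, an external table can be created over the same data source using CETAS (CREATE EXTERNAL TABLE AS SELECT). This is a sketch: the 'curated/orders/' location and format name are illustrative assumptions, and writing the result files requires appropriate permissions on the storage account:
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
CREATE EXTERNAL TABLE dbo.orders_ext
WITH (
LOCATION = 'curated/orders/',
DATA_SOURCE = files,
FILE_FORMAT = ParquetFormat
)
AS
SELECT *
FROM OPENROWSET(
BULK 'orders/*.csv',
DATA_SOURCE = 'files',
FORMAT = 'csv',
PARSER_VERSION = '2.0'
) AS rows;
The CETAS statement both materializes the query results as Parquet files at the given location and registers an external table over them.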