Exam DP-203 Serverless SQL Pool
Serverless SQL Pool. Reads from data lake files.
In Synapse Studio, on the Data - Linked tab, expand the Azure Data Lake Storage Gen2 header, drill down to a file, right-click it, and select "New SQL Script".
This generates the following script:
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://datalake9cxqfg2.dfs.core.windows.net/files/product_data/products.csv', FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE ) AS result
In the "Connect to" drop-down list, select "Built-in", then run the script.
"result" is the alias name of the output table.
Give the script a name on the right-hand side, and click Publish. Thereafter it can be seen under the "Develop" tab.
To display the results as a chart instead, click the "Chart" button in the output pane and set appropriate values in the "Category" and "Legend" drop-down lists.
Three file formats can be read: CSV, JSON, Parquet.
There is no charge for reserved resources; you're only charged for the data processed by the queries you run.
Not recommended for OLTP.
You can define external objects such as tables and views in a serverless SQL database.
The BULK parameter includes the full URL to the location in the data lake containing the data files. These paths can include the * wildcard:
https://mydatalake.blob.core.windows.net/data/files/* : all files in the files folder.
https://mydatalake.blob.core.windows.net/data/files/** : all files in the files folder and, recursively, in its subfolders.
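For example, to count the rows across every file under the files folder and its subfolders (a minimal sketch; the path is hypothetical and assumes the folders contain only CSV files):
SELECT COUNT(*) AS total_rows FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/files/**', FORMAT = 'csv', PARSER_VERSION = '2.0' ) AS rows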
Other OPENROWSET options for delimited text files (see the sketch after this list):
FIELDTERMINATOR - character that separates fields (default ',')
ROWTERMINATOR - character that marks the end of a row
FIELDQUOTE - character used to enclose quoted string values (default '"')
FIRSTROW = 2 - the number of the first row to read; here, one row is skipped
HEADER_ROW = TRUE - uses the first row for column names (only available when using parser version 2.0)
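A sketch combining some of these options, assuming a semicolon-delimited file with a header row at the hypothetical path below:
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/files/other_data.csv', FORMAT = 'csv', PARSER_VERSION = '2.0', FIELDTERMINATOR = ';', FIELDQUOTE = '"', FIRSTROW = 2 ) AS rows
Here FIRSTROW = 2 skips the header row, so columns get default names (C1, C2, ...); alternatively HEADER_ROW = TRUE would use the header for column names.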
WITH clause
Use the "WITH" clause to define the column names and datatypes:
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv', FORMAT = 'csv', PARSER_VERSION = '2.0') WITH ( product_id INT, product_name VARCHAR(20) COLLATE Latin1_General_100_BIN2_UTF8, list_price DECIMAL(5,2) ) AS rows
JSON source files
If JSON files (.json) are used instead of .csv files, use the following parameters in the query above:
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json', FORMAT = 'csv', FIELDTERMINATOR ='0x0b', FIELDQUOTE = '0x0b', ROWTERMINATOR = '0x0b'
Note that FORMAT still says csv, not json.
You can use the JSON_VALUE function to extract values from the JSON returned by the query, like:
SELECT JSON_VALUE(doc, '$.product_name') AS product, JSON_VALUE(doc, '$.list_price') AS price FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json', FORMAT = 'csv', FIELDTERMINATOR ='0x0b', FIELDQUOTE = '0x0b', ROWTERMINATOR = '0x0b' ) WITH (doc NVARCHAR(MAX)) as rows
Parquet source files
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/files/*.*', FORMAT = 'parquet') AS rows
Partitioning Parquet files
If the data in the Parquet source files is partitioned into multiple files across multiple directories, wildcards can be used in the BULK path, and the filepath() function can then filter on specific partition directories: filepath(1) returns the value matched by the first * wildcard (year below), filepath(2) the second (month):
SELECT * FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/orders/year=*/month=*/*.*', FORMAT = 'parquet') AS orders WHERE orders.filepath(1) = '2020' AND orders.filepath(2) IN ('1','2');
Create Database
CREATE DATABASE SalesDB COLLATE Latin1_General_100_BIN2_UTF8
When the database is created, it appears in Synapse Studio as a database under section: Data - Workspace - SQL Database
External Data Source
Instead of putting the full path into the BULK clause every time you reference a source file as a table, you can create an External Data Source with the path:
CREATE EXTERNAL DATA SOURCE files WITH ( LOCATION = 'https://mydatalake.blob.core.windows.net/data/files/', TYPE = HADOOP )
(TYPE = HADOOP identifies a Hadoop data source - but I think this applies to dedicated SQL pools only; serverless SQL pools use native data sources and omit the TYPE parameter).
The data source object created appears in Synapse Studio under Data - Workspace - SQL Database - External Resources - External Data Sources
And then run the main select as:
SELECT * FROM OPENROWSET( BULK 'orders/*.csv', DATA_SOURCE = 'files', FORMAT = 'csv', PARSER_VERSION = '2.0' ) AS orders
where the "files" data source is referenced. The BULK parameter then only needs the relative path to the file(s).
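Views (one of the external objects mentioned earlier) can wrap such a query so it can be reused like a table. A minimal sketch with a hypothetical view name, reusing the "files" data source (like CETAS below, this must be run in a custom database, not the Built-in one):
CREATE VIEW OrdersView AS SELECT * FROM OPENROWSET( BULK 'orders/*.csv', DATA_SOURCE = 'files', FORMAT = 'csv', PARSER_VERSION = '2.0' ) AS orders; GO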
Scoped Credentials
Allows you to define a credential object to use when defining the data source.
CREATE DATABASE SCOPED CREDENTIAL sqlcred WITH IDENTITY='SHARED ACCESS SIGNATURE', SECRET = 'sv=xxx...'; GO CREATE EXTERNAL DATA SOURCE secureFiles WITH ( LOCATION = 'https://mydatalake.blob.core.windows.net/data/secureFiles/', CREDENTIAL = sqlcred ); GO
File Format
Allows you to define a file format object that can be referenced elsewhere too:
CREATE EXTERNAL FILE FORMAT CsvFormat WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS( FIELD_TERMINATOR = ',', STRING_DELIMITER = '"' ) ); GO
The file format object created appears in Synapse Studio under Data - Workspace - SQL Database - External Resources - External File Formats
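The CETAS example in the next section references a Parquet file format named ParquetFormat; a minimal sketch of defining it:
CREATE EXTERNAL FILE FORMAT ParquetFormat WITH ( FORMAT_TYPE = PARQUET ); GO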
Create External Table
Define an external table object that loads its data from the source:
CREATE EXTERNAL TABLE ProductSalesTotals WITH ( LOCATION = 'sales/productsales/', DATA_SOURCE = sales_data, FILE_FORMAT = ParquetFormat ) AS SELECT Item AS Product, SUM(Quantity) AS ItemsSold, ROUND(SUM(UnitPrice) - SUM(TaxAmount), 2) AS NetRevenue FROM OPENROWSET( BULK 'sales/csv/*.csv', DATA_SOURCE = 'sales_data', FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE ) AS orders GROUP BY Item;
Here, the data is read by the SELECT query in the AS clause and saved to the path given in the WITH clause.
This is an example of CETAS - CREATE EXTERNAL TABLE AS SELECT
Note this has to be run on a custom database (having used the CREATE DATABASE command), not the Built-in database.
The table created appears in Synapse Studio under Data - Workspace - SQL Database - External Tables
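Once created, the external table can be queried like a normal table:
SELECT * FROM ProductSalesTotals ORDER BY NetRevenue DESC;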
Stored procedures
You can create stored procedures (e.g. CREATE PROCEDURE sp_GetYearlySales) to drop and re-create the table.
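A minimal sketch of such a procedure, reusing the CETAS example above (the existence check via sys.external_tables is an assumption of this sketch):
CREATE PROCEDURE sp_GetYearlySales
AS
BEGIN
    -- Drop the existing external table if present (this drops the metadata only; the underlying files are not deleted)
    IF EXISTS (SELECT * FROM sys.external_tables WHERE name = 'ProductSalesTotals')
        DROP EXTERNAL TABLE ProductSalesTotals;
    -- Re-create the table with CETAS (the target location must be empty for this to succeed)
    CREATE EXTERNAL TABLE ProductSalesTotals WITH ( LOCATION = 'sales/productsales/', DATA_SOURCE = sales_data, FILE_FORMAT = ParquetFormat ) AS SELECT Item AS Product, SUM(Quantity) AS ItemsSold, ROUND(SUM(UnitPrice) - SUM(TaxAmount), 2) AS NetRevenue FROM OPENROWSET( BULK 'sales/csv/*.csv', DATA_SOURCE = 'sales_data', FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE ) AS orders GROUP BY Item;
END;
GO
Run it with: EXEC sp_GetYearlySales;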
Benefits of this are:
- Reduction in client to server network traffic
- Permissions can be granted on who can run the stored proc, without giving users permissions on the underlying objects.
- Easier maintenance
- Improved performance