Exam DP-203: Serverless SQL Pool
A serverless SQL pool reads data directly from files in the data lake.
See also: Serverless SQL Pool Authentication
In Synapse Studio, on the Data - Linked tab, expand the Azure Data Lake Storage Gen2 node, drill down to a file, right-click it, and select "New SQL Script".
This generates the following script:
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://datalake9cxqfg2.dfs.core.windows.net/files/product_data/products.csv', FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE ) AS result
Click the "Connect to" popup list, and select "Built-in", and run it.
"result" is the alias name of the output table.
Give the script a name on the right-hand side, and click Publish. Thereafter it can be seen under the "Develop" tab.
To output the results as a chart instead, click the "Chart" button in the output pane and set appropriate values in the "Category" and "Legend" popup lists.
Three file formats can be read: CSV, JSON, Parquet.
There is no charge for reserved resources; you are only charged for the data processed by the queries you run.
Not recommended for OLTP.
You can define external objects such as tables and views in a serverless SQL database.
The BULK parameter includes the full URL to the location in the data lake containing the data files. These paths can include the * wildcard:
https://mydatalake.blob.core.windows.net/data/files/* : all files in the files folder.
https://mydatalake.blob.core.windows.net/data/files/** : all files in the files folder and, recursively, in its subfolders.
FIELDTERMINATOR
ROWTERMINATOR
FIELDQUOTE - the character used to quote string values (default: double quote)
FIRSTROW = 2 - the first row to read; earlier rows are skipped (use 2 to skip a header row)
HEADER_ROW = TRUE (only available when using parser version 2.0)
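As a sketch of how the options above combine (the path and the delimiter choices are assumptions, not from a real file):

```sql
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = ';',   -- fields separated by semicolons instead of commas
    ROWTERMINATOR = '\n',    -- each line is a row
    FIELDQUOTE = '"',        -- string values wrapped in double quotes
    FIRSTROW = 2             -- start reading at row 2, i.e. skip the header line
) AS result
```

Note that FIRSTROW = 2 is used here instead of HEADER_ROW = TRUE; with FIRSTROW the header line is simply skipped rather than used for column names.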
WITH clause
Use the "WITH" clause to define the column names and datatypes:
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv', FORMAT = 'csv', PARSER_VERSION = '2.0') WITH ( product_id INT, product_name VARCHAR(20) COLLATE Latin1_General_100_BIN2_UTF8, list_price DECIMAL(5,2) ) AS rows
JSON source files
If the source files are JSON (.json) rather than CSV, use the following parameters instead:
BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json', FORMAT = 'csv', FIELDTERMINATOR ='0x0b', FIELDQUOTE = '0x0b', ROWTERMINATOR = '0x0b'
Note that FORMAT still says csv, not json: the query uses the CSV parser with the field and row terminators set to a character that does not occur in the data (0x0b), so each JSON document is read as a single text value.
You can run the JSON_VALUE function on the JSON output by the query, like:
SELECT JSON_VALUE(doc, '$.product_name') AS product, JSON_VALUE(doc, '$.list_price') AS price FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json', FORMAT = 'csv', FIELDTERMINATOR ='0x0b', FIELDQUOTE = '0x0b', ROWTERMINATOR = '0x0b' ) WITH (doc NVARCHAR(MAX)) as rows
Parquet source files
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/files/*.*', FORMAT = 'parquet') AS rows
Partitioning Parquet files
If the data in the Parquet source files is partitioned into multiple files in multiple directories, wildcards can be used for the partition directories, and the filepath() function (indexed by wildcard position) can filter on their values:
SELECT * FROM OPENROWSET( BULK 'https://mydatalake.blob.core.windows.net/data/orders/year=*/month=*/*.*', FORMAT = 'parquet') AS orders WHERE orders.filepath(1) = '2020' AND orders.filepath(2) IN ('1','2');
Create Database
CREATE DATABASE SalesDB COLLATE Latin1_General_100_BIN2_UTF8
When the database is created, it appears in Synapse Studio as a database under section: Data - Workspace - SQL Database
External Data Source
Instead of putting the full path into the BULK clause every time you reference a source file as a table, you can create an External Data Source with the path:
CREATE EXTERNAL DATA SOURCE files WITH ( LOCATION = 'https://mydatalake.blob.core.windows.net/data/files/', TYPE = HADOOP )
(TYPE = HADOOP is used for a Hadoop data source - but I think this may apply to dedicated pools only, not the serverless pool).
The data source object created appears in Synapse Studio under Data - Workspace - SQL Database - External Resources - External Data Sources
And then run the main select as:
SELECT * FROM OPENROWSET( BULK 'orders/*.csv', DATA_SOURCE = 'files', FORMAT = 'csv', PARSER_VERSION = '2.0' ) AS orders
where the "files" datasource is referenced. And the BULK parameter only then needs to have the relative path to the file(s).
Scoped Credentials
Allows you to define a credentials object for use when the data source is defined.
CREATE DATABASE SCOPED CREDENTIAL sqlcred WITH IDENTITY='SHARED ACCESS SIGNATURE', SECRET = 'sv=xxx...'; GO CREATE EXTERNAL DATA SOURCE secureFiles WITH ( LOCATION = 'https://mydatalake.blob.core.windows.net/data/secureFiles/', CREDENTIAL = sqlcred ); GO
File Format
Allows you to define a file format object that can be referenced elsewhere too:
CREATE EXTERNAL FILE FORMAT CsvFormat WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS( FIELD_TERMINATOR = ',', STRING_DELIMITER = '"' ) ); GO
The file format object created appears in Synapse Studio under Data - Workspace - SQL Database - External Resources - External File Formats
Create External Table
Define an external table object that loads its data from the source:
CREATE EXTERNAL TABLE ProductSalesTotals WITH ( LOCATION = 'sales/productsales/', DATA_SOURCE = sales_data, FILE_FORMAT = ParquetFormat ) AS SELECT Item AS Product, SUM(Quantity) AS ItemsSold, ROUND(SUM(UnitPrice) - SUM(TaxAmount), 2) AS NetRevenue FROM OPENROWSET( BULK 'sales/csv/*.csv', DATA_SOURCE = 'sales_data', FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE ) AS orders GROUP BY Item;
Here, the data is read by the SELECT in the "AS" part and saved to the path given in the "WITH" part.
This is an example of CETAS - CREATE EXTERNAL TABLE AS SELECT
Note this has to be run on a custom database (having used the CREATE DATABASE command), not the Built-in database.
The table created appears in Synapse Studio under Data - Workspace - SQL Database - External Tables
Note that when this External Table is deleted, the underlying data files remain in place and are not deleted.
The "AS SELECT" in CETAS means a new copy of the data is created (stored at the location given in the "WITH" clause); the SELECT defines where the data is copied from.
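For example, dropping the table above removes only the table definition (metadata); the files it wrote are untouched:

```sql
-- Remove the external table definition; the Parquet files written to
-- 'sales/productsales/' remain in the data lake.
DROP EXTERNAL TABLE ProductSalesTotals;
```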
Stored procedures
You can create a stored procedure to drop and recreate the external table: CREATE PROCEDURE sp_GetYearlySales.
Benefits of this are:
- Reduction in client to server network traffic
- Permissions to run the stored procedure can be granted without giving users permissions on the underlying objects.
- Easier maintenance
- Improved performance
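A sketch of what such a procedure might look like, reusing the CETAS pattern from above (the table name, location, and OrderDate column are assumptions for illustration):

```sql
-- Hypothetical sketch: drop and recreate an external table via CETAS.
CREATE PROCEDURE sp_GetYearlySales
AS
BEGIN
    -- Drop the existing table definition if present.
    -- Note: this does not delete the underlying files, so the target
    -- location must be empty (or changed) before CETAS can run again.
    IF EXISTS (SELECT * FROM sys.external_tables WHERE name = 'YearlySalesTotals')
        DROP EXTERNAL TABLE YearlySalesTotals;

    -- Recreate the table, materializing the query results to the lake.
    CREATE EXTERNAL TABLE YearlySalesTotals
    WITH (
        LOCATION = 'sales/yearlysales/',
        DATA_SOURCE = sales_data,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT YEAR(OrderDate) AS CalendarYear,
           SUM(Quantity) AS ItemsSold,
           ROUND(SUM(UnitPrice) - SUM(TaxAmount), 2) AS NetRevenue
    FROM OPENROWSET(
        BULK 'sales/csv/*.csv',
        DATA_SOURCE = 'sales_data',
        FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
    ) AS orders
    GROUP BY YEAR(OrderDate);
END
```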