Exam DP203 Serverless SQL Pool

From MillerSql.com
Revision as of 19:18, 18 November 2024 by NeilM (talk | contribs) (→‎Create External Table)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Serverless SQL Pool. Reads from data lake files.

See also: Serverless SQL Pool Authentication

In Synapse Studio Data - Linked tab, go into the Azure Data Lake Gen2 header, and drill down and right-click a file there. Right-click and select "New SQL Script".

This generates the following script:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalake9cxqfg2.dfs.core.windows.net/files/product_data/products.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
        HEADER_ROW = TRUE
    ) AS result

Click the "Connect to" popup list, and select "Built-in", and run it.

"result" is the alias name of the output table.

Give the script a name on the right hand side, and click Publish. Thenafter it can be seen under the "Develop" tab.

Output the results instead as a chart by clicking the "Chart" button in the output pane, setting appropriate values in the "Category" and "Legend" popup lists.

Three file formats can be read: CSV, JSON, Parquet.

No charge for resources reserved, you're only charged for the data processed by queries you run.

Not recommended for OLTP.

you can define external objects such as tables and views in a serverless SQL database.

The BULK parameter includes the full URL to the location in the data lake containing the data files. These paths can include the * wildcard:

  • https://mydatalake.blob.core.windows.net/data/files/*: All files in the files folder.
  • https://mydatalake.blob.core.windows.net/data/files/**: All files in the files folder, and recursively its subfolders.

FIELDTERMINATOR

ROWTERMINATOR

FIELDQUOTE - string in quotes

FIRSTROW = 2 - how many rows to skip

HEADER_ROW = TRUE (only available when using parser version 2.0)

WITH clause

Use the "WITH" clause to define the column names and datatypes:

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
    FORMAT = 'csv',
    PARSER_VERSION = '2.0')
WITH (
    product_id INT,
    product_name VARCHAR(20) COLLATE Latin1_General_100_BIN2_UTF8,
    list_price DECIMAL(5,2)
) AS rows

JSON source files

If JSON files (.json) are used instead of .CSV files, use the following parameters in the above:

        BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
        FORMAT = 'csv',
        FIELDTERMINATOR ='0x0b',
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0b'

Note that FORMAT still says csv, not json.

You can run the JSON_VALUE function on the JSON output by the query, like:

SELECT JSON_VALUE(doc, '$.product_name') AS product,
           JSON_VALUE(doc, '$.list_price') AS price
FROM
    OPENROWSET(
        BULK 'https://mydatalake.blob.core.windows.net/data/files/*.json',
        FORMAT = 'csv',
        FIELDTERMINATOR ='0x0b',
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0b'
    ) WITH (doc NVARCHAR(MAX)) as rows

Parquet source files

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/*.*',
    FORMAT = 'parquet') AS rows

Partitioning Parquet files

If the data in the Parquet source files is partitioned into multiple files in multiple directories, it can be referenced in the following manner - which specific parameter directories can be referenced:

SELECT *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/orders/year=*/month=*/*.*',
    FORMAT = 'parquet') AS orders
WHERE orders.filepath(1) = '2020'
    AND orders.filepath(2) IN ('1','2');

Create Database

CREATE DATABASE SalesDB
    COLLATE Latin1_General_100_BIN2_UTF8

When the database is created, it appears in Synapse Studio as a database under section: Data - Workspace - SQL Database

External Data Source

Instead of putting the full path into the BULK clause every time you reference a source file as a table, you can create an External Data Source with the path:

CREATE EXTERNAL DATA SOURCE files
WITH (
    LOCATION = 'https://mydatalake.blob.core.windows.net/data/files/',
    TYPE = HADOOP
)

(the TYPE = HADOOP is used if it is a HADOOP data source - but I think this may be for dedicated pools only, not the serverless pool).

The data source object created appears in Synapse Studio under Data - Workspace - SQL Database - External Resources - External Data Sources

And then run the main select as:

SELECT *
FROM
    OPENROWSET(
        BULK 'orders/*.csv',
        DATA_SOURCE = 'files',
        FORMAT = 'csv',
        PARSER_VERSION = '2.0'
    ) AS orders

where the "files" datasource is referenced. And the BULK parameter only then needs to have the relative path to the file(s).

Scoped Credentials

Allows you to define a credentials object for use when the data source is defined.

CREATE DATABASE SCOPED CREDENTIAL sqlcred
WITH
    IDENTITY='SHARED ACCESS SIGNATURE',  
    SECRET = 'sv=xxx...';
GO

CREATE EXTERNAL DATA SOURCE secureFiles
WITH (
    LOCATION = 'https://mydatalake.blob.core.windows.net/data/secureFiles/'
    CREDENTIAL = sqlcred
);
GO

File Format

Allows you to define a file format object that can be referenced elsewhere too:

CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS(
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"'
        )
    );
GO

The file format object created appears in Synapse Studio under Data - Workspace - SQL Database - External Resources - External File Formats

Create External Table

Define an external table object that loads its data from the source:

 CREATE EXTERNAL TABLE ProductSalesTotals
     WITH (
         LOCATION = 'sales/productsales/',
         DATA_SOURCE = sales_data,
         FILE_FORMAT = ParquetFormat
     )
 AS
 SELECT Item AS Product,
     SUM(Quantity) AS ItemsSold,
     ROUND(SUM(UnitPrice) - SUM(TaxAmount), 2) AS NetRevenue
 FROM
     OPENROWSET(
         BULK 'sales/csv/*.csv',
         DATA_SOURCE = 'sales_data',
         FORMAT = 'CSV',
         PARSER_VERSION = '2.0',
         HEADER_ROW = TRUE
     ) AS orders
 GROUP BY Item;

Here, the data is read from the "AS" bit, and saved into the different path in the "WITH" bit.

This is an example of CETAS - CREATE EXTERNAL TABLE AS SELECT

Note this has to be run on a custom database (having used the CREATE DATABASE command), not the Built-in database.

The table created appears in Synapse Studio under Data - Workspace - SQL Database - External Tables

Note that when this External Table is deleted, the underlying data files remain in place and are not deleted.

I think the "AS" in CETAS means that a new copy of the data is being created (and stored in the "WITH" location), necessitating the "AS" clause to tell it where it is copying its data from.

Stored procedures

You can create stored procedures to drop and create the table: CREATE PROCEDURE sp_GetYearlySales.

Benefits of this are:

  1. Reduction in client to server network traffic
  2. Allows permissions to be applied to who can run the stored proc, without giving permission to the users to the underlying objects.
  3. Easier maintenance
  4. Improved performance