Exam DP-203: Synapse Data Pipelines

From MillerSql.com

Data Pipelines in Synapse Studio

Pipelines are created and managed on the Integrate tab of Synapse Studio.

Pipeline activities run on an integration runtime, which provides the compute used to move and transform the data.

For a pipeline to be able to do anything, you have to define Linked Services for the resources it will need to access (storage accounts, databases, and so on).

To create a pipeline, click the + button and select "Pipeline".

When you create a pipeline in Synapse Studio, the Control Flow activities are shown graphically as you add them, much as in SSIS, with green and red connectors indicating which activity runs next on success, failure, or completion.

Once the pipeline has been created, click the curly brackets {} at the top right to display its JSON definition. A pipeline can also be generated from supplied JSON by clicking the + button and selecting "Import from Pipeline Template" (this needs the JSON file packaged as a .zip).

A data flow is built from a Source, one or more Transformations, and a Sink.

Triggers run the pipeline. They can be immediate (Trigger now), scheduled, or fired in response to an event (such as a file arriving in storage).

The Monitor tab of Synapse Studio shows the run progress of a pipeline. When the workspace is connected to Microsoft Purview, this run information is also used to record data lineage.

Create Pipeline

Under Activities - Move and Transform, select Data Flow (another option is Copy Data). Name it "LoadProducts".

On the Settings tab of this data flow activity, to the right of the "Data Flow" popup list, click + New. This opens the data flow designer in a new window.

In this window, click the greyed-out "Add Source" graphic at the top left to add a new Source Stream object. Then, at the bottom, click + New to create a dataset for the source. Search for "Gen2" to find the Azure Data Lake Storage Gen2 type, choose CSV as the format, and in the "Linked Service" popup select the existing linked service to the Gen2 account.

Then click the browse button to choose the source file, which fills in the path to products.csv, and click OK to return to the data flow window.

With the new source stream object selected, give it an output stream name of "ProductsText", verify that "Source Type" is set to "Integration Dataset" (as opposed to "Inline" or "Workspace DB"), and check that the "Dataset" popup list is set to the dataset just created for products.csv.

Next click the Projection tab for the source object. This lists the names of the columns, with the datatypes as popup lists.

Next create a second source stream object for ProductCategories in the same way as above. This will sit alongside the first one.

Next, the two sources can be joined using a Lookup object. To do this, click the + button at the bottom right of the Products source stream and select "Lookup". In its configuration there are popup lists for the "Primary Stream" and "Lookup Stream"; set these to the Products and ProductCategories streams respectively.

Then the "Lookup Conditions" section has two popup menus where you select the join column from each stream.

Click OK, and the Lookup object will appear to the right of the two source stream objects.
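
For intuition, a Lookup behaves much like a left outer join: it keeps every row from the primary stream and appends matching columns from the lookup stream. A rough Spark sketch of the same step is shown below (the storage paths and the ProductCategoryID join column are assumptions, not values from the exercise; spark is the session a Synapse notebook provides automatically).

# Hypothetical paths and column names - a sketch only, not the data flow itself.
products = spark.read.csv(
    "abfss://files@datalake.dfs.core.windows.net/products.csv",
    header=True, inferSchema=True)
categories = spark.read.csv(
    "abfss://files@datalake.dfs.core.windows.net/product_categories.csv",
    header=True, inferSchema=True)

# A Lookup keeps all primary-stream rows and appends matching lookup columns,
# i.e. a left outer join on the shared key.
joined = products.join(categories, on="ProductCategoryID", how="left")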

Next we tell the data flow how each row should be applied at the destination. This is done by clicking the + button at the bottom right of the Lookup object and selecting Alter Row. Alter Row assigns a row policy using options chosen from popup lists such as "Insert if", "Update if", "Delete if" and "Upsert if", each paired with a condition. In other words, it marks each source row with the action the sink should take when writing it - for example, insert the row only when its condition evaluates to true - and the sink must have the corresponding insert/update/upsert methods enabled for the policy to take effect.

Add a Sink: finally, add the sink object by clicking the + button at the bottom right of the Alter Row object and selecting Sink. Set its popup menus to point at the destination dataset, and check that the columns are mapped correctly to the destination.

You can debug the whole thing by switching on the "Data Flow Debug" slider and waiting a few minutes for the debug cluster to start.

Then publish it.

Then in the Add trigger menu, select Trigger now.

Spark pipelines

Pipelines can also run Spark notebooks, via a Notebook activity.

The notebooks need an Apache Spark pool to run on; they cannot run on the built-in (serverless SQL) pool.
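
As a minimal sketch of what such a notebook step might do (the storage path and table name below are hypothetical, not part of any exercise):

# `spark` is the SparkSession a Synapse notebook provides automatically.
# Read raw CSV files from the data lake (hypothetical path).
df = spark.read.csv(
    "abfss://files@datalake.dfs.core.windows.net/raw/products/",
    header=True, inferSchema=True)

# Persist the result as a table so later pipeline steps can query it (hypothetical name).
df.write.mode("overwrite").saveAsTable("products_staging")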

Avoid using collect() unless necessary: when working with large datasets, it is better to let Spark operate on the distributed data across the worker nodes than to bring the whole dataset back to the driver node with the collect() method.
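
For example, to total a column you can let Spark aggregate on the executors and return only the small result (df and the Quantity column here are hypothetical):

from pyspark.sql import functions as F

# Anti-pattern: collect() copies every row to the driver before summing.
# total = sum(row["Quantity"] for row in df.collect())

# Better: aggregate on the executors; only one small result row returns to the driver.
total = df.agg(F.sum("Quantity")).first()[0]

# To eyeball a few rows, limit first instead of collecting everything.
df.limit(10).show()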

You can import Spark notebooks from .ipynb files by clicking + Import.

Spark notebook Parameters

You can create a parameters cell in a Spark notebook by opening the notebook, clicking the three-dots (...) icon at the top right of the cell, and selecting Toggle parameter cell.
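
The parameters cell itself just assigns default values; when the notebook is run from a pipeline, the Notebook activity's base parameters override them. The variable names below are made up for illustration:

# Parameters cell - these defaults are used when the notebook runs on its own,
# and are overridden by base parameters passed in from a pipeline Notebook activity.
input_folder = "raw/products"
load_date = "2024-01-01"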