Exam DP-203 Course (Tybul): 09 - Azure Data Factory
Note this page also contains 10 - 14 as well as 09.
https://www.youtube.com/watch?v=nnAgdqrZZGM
Note that Azure SQL Database is not in the exam.
The "SQL Databases" service in Azure is an implementation of SQL Server in the cloud: within a server instance you can create databases. An example of PaaS (Platform as a Service).
"Azure SQL Database" is the implementation of a single SQL Server database in the cloud - you don't get the rest of the SQL Server instance with it. An example of SaaS (Software as a Service).
In the "Compute and storage" section when you are creating the SQL databases server, click to change the setting to "Basic" to generate a lower spec machine that is cheaper.
When you have created the SQL Server resource, go into Security - Networking and change "Public network access" from "Disabled" to "Selected networks". Then, further down, tick "Allow Azure services and resources to access this server".
In Data Factory Studio, which can be opened from within the Azure Data Factory resource in the portal, you can create a linked service under the Manage blade to connect to a data source. Here you choose between a data store and a compute.
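Under the hood a linked service is just JSON. A minimal sketch of what an Azure SQL Database linked service might look like (the names and the plain-text connection string are placeholders; in practice you would reference a Key Vault secret or use a managed identity):

{
    "name": "ls_sql_example",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<your-server>.database.windows.net,1433;Database=<your-db>;User ID=<user>;Password=<password>;"
        },
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        }
    }
}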
11 - Dynamic Azure Data Factory
Parameters can be added to a dataset and used in its properties, e.g. in folder and file names, so that one dataset definition can be reused.
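A sketch of a parameterised delimited-text dataset, assuming a blob storage linked service named ls_blob_example (all names here are illustrative):

{
    "name": "ds_csv_parameterised",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "ls_blob_example",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "folderName": { "type": "string" },
            "fileName": { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",
                "folderPath": { "value": "@dataset().folderName", "type": "Expression" },
                "fileName": { "value": "@dataset().fileName", "type": "Expression" }
            }
        }
    }
}

A pipeline supplies folderName and fileName when it uses the dataset, typically from its own parameters or expressions.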
12 - Integration Runtimes part 1
Azure maintains a set of virtual machines that run Azure Data Factory activities for all Azure tenants. This shared pool is the compute behind the built-in AutoResolve Integration Runtime (IR) that every factory gets by default.
To avoid having to open your own Azure resources up to that shared pool, you can create an Azure Managed VNET Integration Runtime. This, I think, provisions the runtime compute inside a virtual network dedicated to your factory, and you then give only this virtual network access to your Azure resources.
You create the integration runtime, and then you create a linked service that uses it to connect to the source database. As part of this you must also create a managed private endpoint.
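For reference, the JSON behind an Azure IR with a managed virtual network looks roughly like this (a sketch; the runtime name is a placeholder):

{
    "name": "ir-managed-vnet",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve"
            }
        },
        "managedVirtualNetwork": {
            "type": "ManagedVirtualNetworkReference",
            "referenceName": "default"
        }
    }
}

A linked service then opts into this runtime through its connectVia property, referencing ir-managed-vnet instead of AutoResolveIntegrationRuntime.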
Then go into the Azure SQL server resource; in Networking there is a "Private access" tab (as opposed to "Public access") where the new managed private endpoint created above appears. If you want this endpoint to be able to connect to the Azure SQL server, click "Approve".
Azure Managed VNET integration runtimes cost $1/hour, as opposed to the built-in integration runtimes that cost $0.005/hour.
Note that Azure VNET integration runtimes are a little slower than the built-in one.
13 - Integration Runtimes part 2
Self-hosted integration runtime (SHIR). This is an integration runtime that runs on a machine you manage - here an Azure virtual machine, so you have to create a VM resource in the portal and start it. The software requires a Windows machine, and the VM costs around $1/hour to run.
You RDP onto the virtual machine, download the installer for the self-hosted integration runtime, and install it.
When it is installed, run it.
It will ask for an authentication key: copy this from the self-hosted integration runtime's settings in Data Factory Studio and paste it in to register the machine with the Data Factory. (The connection string for the Azure SQL Database is supplied later, in the linked service that uses this runtime.)
You can also run the self-hosted integration runtime software on an on-premises machine.
The purpose of running the integration runtime on an on-premises machine is, I think, to allow it to access resources on the local network where it runs, which might not be reachable from Azure.
14 - Error handling in Azure Data Factory
If you click on a task within a pipeline, for example a Copy data task, then under its General tab you can set values for Retry, Retry interval (in seconds) and Timeout.
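In the activity's JSON these settings sit in its policy block; a sketch with illustrative values (the timeout format is d.hh:mm:ss):

"policy": {
    "timeout": "0.12:00:00",
    "retry": 2,
    "retryIntervalInSeconds": 30,
    "secureOutput": false,
    "secureInput": false
}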
The pipeline is a success only if all of its leaf tasks are a success. A leaf task is one with nothing downstream of it. A leaf task that actually ran is judged on its own outcome; a leaf task that was skipped is judged by working back up the chain to the first upstream task that was not skipped, and using that task's outcome instead.
If-Else: if you have a task A with on-success going to task B and on-failure to task C, then when A fails the flow goes to C, which succeeds. But B is skipped, so its status is taken from its parent (A); since that parent failed, the pipeline fails.
Try-catch: if you have a task A with only an on-failure line going to task C, then when A fails the flow goes to C, which succeeds. Since C is the only leaf node and it succeeded, the overall pipeline succeeds.
Do-if-skip-else: in the If-Else case, you can add another task D to run after task B, connected using the skip line (as opposed to the green success, red failure or completion line). The skip line is followed when task B is itself skipped (i.e. because task A failed and the flow took the red line to task C). This means task D runs, and if task D succeeds then the overall pipeline in fact succeeds.
Try-catch-succeed: here we want task A to carry on to task D even if it fails, but to run task C when it fails (there is no task B). This is done by adding both an on-success and an on-skip line from task C to task D. If A succeeds, C is skipped, which leads to D running; if A fails, C runs and its on-success line leads to D. Assuming D succeeds, the pipeline succeeds.
AND logic in common error handling: if you have three tasks A, B and C in series and you connect the red (failure) output of each of them to an error-handling task D, then when one of them fails D will in fact not run. The reason is that the red lines (all dependency lines, in fact) use AND logic - every line into a task must be satisfied for it to run. The way to set this up instead is to replace the three red lines with one red line from C to D plus one skip line from C to D. Then if C fails it goes to D, and if A or B fails then C is skipped, which also leads to D; if A, B and C all succeed, the pipeline simply ends. In all situations the pipeline succeeds (as the errors have been handled).
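In the pipeline JSON each arrow becomes an entry in the downstream activity's dependsOn list, and every entry must be satisfied (the AND logic above). A sketch of the error handler D wired to C with the failure plus skip conditions (a Wait activity stands in for whatever the handler really does; names are placeholders):

{
    "name": "D_HandleError",
    "type": "Wait",
    "typeProperties": { "waitTimeInSeconds": 1 },
    "dependsOn": [
        {
            "activity": "C",
            "dependencyConditions": [ "Failed", "Skipped" ]
        }
    ]
}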
The same principles work with tasks that run pipelines.
Alerts
Under Monitor - Alerts and Metrics you can set up alerts for failed pipelines.
Click New Alert Rule, give it severity 2, and choose "Failed pipeline runs metrics" from the popup list.
In the Name and Failure Type popup lists, select All.
Then accept the defaults for the Alert logic (the condition and threshold) and "Evaluate based on" (how often to check).
Then under Configure Email/SMS/Push/Voice notification click + Configure notification. Give the new Action Group a name, and click + next to Notification, and click to add an action - to send an email or SMS.
This may give an error that the subscription is not registered to use Microsoft.Insights. To fix it, run the following PowerShell command (e.g. in Azure Cloud Shell):
Register-AzResourceProvider -ProviderNamespace Microsoft.Insights
(On the older, deprecated AzureRM module the equivalent cmdlet is Register-AzureRmResourceProvider.)
The emails sent are not very user-friendly, because they don't contain much relevant information.
Within Data Factory there is no option for a task to send an email.
So to send an email with better information, you need to create a logic app within the Azure portal.
Logic Apps
Go into Logic Apps within the portal, and click to create new.
Choose a Consumption hosting plan rather than Standard (with Standard you pay circa $145/month for dedicated compute).
Once created, go into Development Tools - Logic App Designer.
First you define the trigger - what makes it run. Search for http; one of the search results should be "When a HTTP request is received".
In the trigger's properties, click "Use sample payload to generate schema" to open a new window, and paste the following JSON sample into it so the app knows what format to expect the incoming HTTP request in:
{ "properties": { "datafactoryname": { "type": "string" }, "message": { "type": "string" } } }
Click Done, and the corresponding schema will appear in the window.
According to the course, datafactoryname and message should be created by this as variables that can be used in subsequent parts of the app. But these don't seem to appear.
Then click Save on the top left of the main window, and then the + button to add an action to undertake when it is triggered.
Choose an action to send an Outlook email (sign in with your own account for testing), and set it to send to another email address. Enter the destination email address and subject. Then click into the message body field and click the lightning-bolt icon to open the dynamic content popup, which lists these variables; click the one for message.
Then go back to the overview page of the logic app and copy its Workflow URL to the clipboard.
Then, back in the Data Factory pipeline, add a new task of type Web. Configure it to call the above URL using method POST, and paste the following JSON into the Body field:
{ "datafactoryname": "df1", "message": "message1" }
I tried running this. It did send the email, but the email body contained the above JSON literally, without the variables being separated out.
Note you can add dynamic values to the body and reference them with expressions such as @{pipeline().RunId} (note the @{...} string-interpolation syntax).
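So a body that sends real values rather than literals might look like this (entered via the pipeline expression editor; the message text is just an example, and it assumes the Logic App schema above):

{
    "datafactoryname": "@{pipeline().DataFactory}",
    "message": "Pipeline @{pipeline().Pipeline} (run @{pipeline().RunId}) reported an error."
}

With this, the message variable in the Logic App contains a readable sentence instead of raw JSON.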
Log Analytics workspaces
Data Factory only keeps its run logs for 45 days. To keep the data for longer you need to send it to a Log Analytics workspace - an Azure service you can create in the portal. Create one.
Once this has been created, go into the Data Factory resource in the portal and, under Monitoring - Diagnostic settings on the left-hand side, click "Add diagnostic setting". Select "All logs", and under Destination details tick "Send to Log Analytics workspace" (and also archive to a storage account if you want).
Then run a pipeline to create an event that will be logged. It will take a few minutes for data to be written.
Go back into the log analytics resource, and under Settings - Usage and estimated costs, click Data Retention to select the number of days for which the data is stored (up to 2 years).
The data in the Log Analytics workspace can be queried with KQL (Kusto Query Language), for example from a workbook.
To use the data in the Log Analytics workspace, you might for example schedule a logic app to run every hour, query the Log Analytics Workspace, and send an email if it sees a problem.
In the workspace, go to: Monitoring - Workbooks, and click Default Template. This opens a workbook with some sample code in it.
Also, if you click on Tables, you can see the 20 or so tables that have been created - for example the table ADFActivityRun.
In the workbook, paste ADFActivityRun, and run it. This will list the contents of this table.
You can also add a WHERE clause on one of the columns:
ADFActivityRun | where OperationName == "Send mail - Queued"