Exam DP203 Course Tybul 15 Questions about new data source

From MillerSql.com


Data Factory does not natively connect to every data source. For example, if you want it to fetch a SharePoint file, you have to craft the URL yourself, or use a Logic App.

Incremental data loads: use CDC (Change Data Capture), or load based on modification-date intervals.

Is it master data? Or transactional?

Personally identifiable data? GDPR?


Note that with pipelines, there are parameters and there are variables. Variables can be changed within the pipeline (for example with a Set Variable activity), whereas parameters are read-only once the pipeline run has started.
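
As an illustration of the distinction, here is a minimal sketch (all names are placeholders) of starting a pipeline run with parameter values using the Python management SDK; the parameter values are fixed for the lifetime of that run, while variables live inside the pipeline:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder subscription, resource group, factory and pipeline names.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Parameters are supplied once, when the run starts, and cannot change afterwards.
run = client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="<pipeline-name>",
    parameters={"sourceFolder": "landing/2024-06-01"},
)
print("started run", run.run_id)
```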

16 - Executing ADF pipelines

GIT Integration

In the Data Factory front end, go to Manage - Git Configuration

This allows you to save your changes to Git. For example, if you have edited a pipeline but it is still incomplete, publishing it will give an error, and if you simply close the browser you will lose the changes. With Git integration you can save the incomplete changes instead.

If you trigger a pipeline that has not been published, it will fail with the error that it cannot be found.

There can be a many-to-many relationship between triggers and pipelines.

Triggers

If you trigger a pipeline that has parameters, you can supply values for those parameters in the trigger. For these you can use trigger system variables such as:

@trigger().scheduledTime or @trigger().startTime

Triggers are configured in the Manage window, and past runs of them are shown in the Monitor window. If you open the properties of a run instance of a trigger, it might show you the system variables associated with that run (I'm not sure about this).

https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables

Scheduled trigger

The main type of trigger is the Scheduled trigger. This allows you to run pipelines every N minutes or hours, or just once at a specified time.

Tumbling window trigger

The next type of trigger is the Tumbling window trigger. This also allows you to run the pipeline every N minutes etc., but in addition lets you introduce a delay period before each run.

In addition, with the tumbling window trigger you can give a start date in the past, and the trigger will execute those past windows now (a backfill). This can result in multiple instances of the trigger executing at the same time. Under the Advanced section, you can set Max concurrency to limit the number that can execute simultaneously (the default is 50).

In addition you can set the number of retries (this is not available with the scheduled trigger).

You can also use the variables @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime
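
To make the backfill behaviour concrete, here is a rough sketch (plain Python, not an ADF API; the 4-hour interval and dates are made up) of how a start date in the past gets sliced into tumbling windows, each of which becomes one trigger run bounded by windowStartTime/windowEndTime, with max concurrency capping how many run at once:

```python
from datetime import datetime, timedelta, timezone

interval = timedelta(hours=4)                        # trigger recurrence
start = datetime(2024, 1, 1, tzinfo=timezone.utc)    # start date in the past => backfill
now = datetime.now(timezone.utc)

windows = []
window_start = start
while window_start + interval <= now:
    # Each window becomes one trigger run; ADF exposes these bounds as
    # @trigger().outputs.windowStartTime / windowEndTime.
    windows.append((window_start, window_start + interval))
    window_start += interval

max_concurrency = 50                                 # the "Max concurrency" setting
print(f"{len(windows)} windows ready, {min(len(windows), max_concurrency)} run immediately")
```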

Storage event trigger

This triggers on a blob being created (or overwritten), or optionally deleted, in a storage account (the storage account is the "event producer"). The event then goes to Event Grid and on to the event handler, which in this case is the Data Factory trigger.

Custom event trigger

This triggers when a custom Event Grid event occurs. You define the trigger's subscription to the event, but you are responsible for making the event occur (publishing it).

17 - Azure data lake security - anonymous access, access keys

Within Azure Storage you can set access levels of files at the container level with the Change Access Level popup list:

  1. Private (no anonymous access)
  2. Blob (anonymous read access for blobs only)
  3. Container (anonymous read access for blobs and containers)

(note you have to enable anonymous blob access at the storage account level, otherwise the menu is greyed out). The value set for each container is shown in one of the columns when viewing the list of containers.
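
For example, once a container's access level is Blob or Container (and anonymous access is enabled on the account), a blob in it can be read with no credential at all. A minimal sketch with placeholder names:

```python
from azure.storage.blob import BlobClient

# Placeholder account/container/blob names; this only works because the
# container's access level allows anonymous read.
blob = BlobClient.from_blob_url(
    "https://<account>.blob.core.windows.net/public-container/sample.csv"
)
data = blob.download_blob().readall()   # no credential supplied
print(len(data), "bytes read anonymously")
```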

Access keys

In the storage account, go to Security + Networking - Access keys to view the two access keys (these can be rotated). These give you admin rights to everything within the storage account.

If you want Data Factory to use an access key to connect to Azure Data Lake Gen2 as a linked service, choose "Account Key" as the authentication type for the linked service.
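
Outside Data Factory the same key authorises everything in the account, which is why rotating or disabling keys matters. A minimal sketch with placeholder names:

```python
from azure.storage.blob import BlobServiceClient

# Placeholder account name and key; either of the two access keys works.
service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential="<access-key-1-or-2>",
)
for container in service.list_containers():   # full, account-wide access
    print(container.name)
```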

It is also possible to disable the use of these access keys entirely by going to Settings - Configuration in the storage account and disabling "Allow storage account key access".

18 - Azure data lake security - Azure Key Vault and managed identity

Azure Key Vault (which you create as a resource in a resource group) can store encryption keys, certificates, and secrets. Secrets are just strings.

There is a read-only replica of the key vault in the paired region (for fault tolerance).

Under the Objects menu, there are three items:

  1. Keys
  2. Secrets
  3. Certificates

Click on Secrets, and click to add a new text secret. Note that you first need to have gone into the Resource Group and, under "Access Control (IAM)", clicked "Add role assignment" and added the "Key Vault Administrator" role to the user (NeilM, I think).

Managed Identities

Next you need to go into the Data Factory resource and, under Settings - Managed Identities, copy the Object (principal) ID of the data factory's system-assigned managed identity.

Access policies

Within the Key Vault resource there are two permission-related menus: one called Access Policies, and the other called Access Control (IAM).

We want to switch from the default permission model, so within the Key Vault resource go to Settings - Access Configuration. Here there are two options:

  1. Azure role-based access control (recommended) - configured using the Access Control (IAM) menu I think, or:
  2. Vault access policy - configured using the Access Policies menu.

Change this from the first to the second option above (Vault Access policy).

Then go into Access Policies, and click to create a new access policy. In the first window select all the permission options and click Next. In the next window, which asks for a principal, paste the Object (principal) ID of the data factory (from earlier). This gives the data factory's system-assigned managed identity access to read the secrets from the key vault.

Data factory linked service to the key vault

Next, go into Data Factory Studio and go to Manage - Linked services. There click New, and add a linked service of type Azure Key Vault. Select the existing key vault from the pop-up menu, and for Authentication Method choose System Assigned Managed Identity. This allows the data factory to access the data in the key vault using its own identity (the system-assigned managed identity), which was given rights to the key vault earlier.

Note you can test this - that it is able to connect to the key vault, and also that it is able to access a secret by name.
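
Outside Data Factory, the same pattern looks like the sketch below: a client authenticates with an Azure identity (a managed identity when running in Azure, or e.g. an Azure CLI login on a developer machine) and reads the secret by name. Vault and secret names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential resolves to a managed identity inside Azure and falls
# back to local logins (Azure CLI, VS Code, etc.) on a developer machine.
client = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",
    credential=DefaultAzureCredential(),
)
secret = client.get_secret("<secret-name>")
print("retrieved secret:", secret.name)   # avoid printing secret.value in real logs
```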

19 - Azure data lake security - Shared Access Signature (SAS)

In a storage account, in addition to the access keys (strings you can use to authenticate access to the whole storage account), you can also use Shared Access Signatures (SAS).

These are tokens, appended to a URL, that provide scoped access to the resource.

You configure one with the allowed services (blob, file, queue, table), the allowed resource types (service, container, object), and the permissions (read, write, list, delete, etc.). Make sure you tick checkboxes in all three groups.

Then click the button to generate the SAS. Below it, a series of strings will be listed that can be used to connect to the resource; one of these is the Blob service SAS URL. Use this string in Data Factory to create a new linked service pointing to the storage account, using authentication type SAS URI.

Note this will fail, with a message that it cannot see a container.

To fix this, in the storage account go to Settings - Endpoints to see the set of endpoints (URLs) used to access the various services within the storage account. Here you will see that the blob endpoint contains blob.core, while the Data Lake Storage endpoint contains dfs.core. Change blob.core to dfs.core in the SAS URL, and this should fix the error.

Me: although the URL generated is different each time depending on which checkboxes are ticked, the previously generated ones still appear to work.
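
The checkboxes above map roughly onto the parameters of the storage SDK's account-SAS helper. A minimal sketch (Blob service only, placeholder names, one-day expiry):

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

# Placeholder account name/key. Resource types and permissions correspond to
# the portal's checkbox groups; this helper covers the Blob service.
sas_token = generate_account_sas(
    account_name="<account>",
    account_key="<access-key>",
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=1),
)

# Append the token to an endpoint; for the ADLS Gen2 linked service use the
# dfs endpoint rather than the blob one, as noted above.
print(f"https://<account>.dfs.core.windows.net/?{sas_token}")
```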

URL from Azure key vault

You can test the key vault integration when setting up the linked service from Data Factory to Azure Storage by getting it to read the SAS URL from a key vault secret rather than copying it in from the Shared access signature window.

Azure Storage access policies

Within Azure Storage, in a container, go to Settings - Access policy. Under Stored access policies, click Add policy to create a new policy. As well as a date range for the policy, you can click the Permissions popup list and assign the permissions the policy will offer, for example Read or Create.

Then go back to a resource, say an image file in Azure Storage for which you want to generate a SAS, click the three dots next to it and choose Generate SAS. In the window that opens, you can choose the access policy created above from the "Stored access policy" pop-up menu.

Once this has been chosen, the "Permissions" popup list below it is greyed out because the permissions are now inherited from the policy.

One reason for using access policies is the following. Let's say you have created eight SAS tokens for different resources, all signed using the access key. If you want to invalidate one of them (say because you think it has been compromised), you would need to rotate the access key, which invalidates all eight and forces you to regenerate the seven remaining tokens. If instead you had created two access policies and assigned each to four of the eight tokens, all you need to do is move the expiry date of the policy behind the compromised token into the past. Then you only need to regenerate the three other tokens covered by that policy, not seven.
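
A minimal sketch of the same idea in the storage SDK (placeholder names): the permissions and expiry live in a stored access policy on the container, and the SAS token only references the policy by name, so expiring the policy kills every token issued against it:

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import (
    AccessPolicy,
    BlobServiceClient,
    ContainerSasPermissions,
    generate_blob_sas,
)

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential="<access-key>",
)
container = service.get_container_client("images")

# Permissions and expiry are stored on the container (max 5 policies per container).
policy = AccessPolicy(
    permission=ContainerSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=30),
)
container.set_container_access_policy(signed_identifiers={"read-only": policy})

# The token carries no permissions of its own, only a reference to the policy.
sas_token = generate_blob_sas(
    account_name="<account>",
    container_name="images",
    blob_name="logo.png",
    account_key="<access-key>",
    policy_id="read-only",
)
```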

Note that these SAS tokens, even though they form part of the URL, are not in fact insecure when sent over the internet. The browser or Azure resource that uses them first contacts the destination using only the host name of the URL, in order to establish a secure HTTPS connection. The secret part of the token, although it is part of the URL, is then only sent over this encrypted connection, so an intermediary cannot read it.

User delegation SAS tokens

You can create "user delegation" SAS tokens, which are a different flavour from the service-level SAS tokens above.

They are not signed with an access key (they use a user delegation key obtained via Microsoft Entra ID instead), they do not support stored access policies, and they can be used only with the Blob service.

To create one, generate the SAS token as usual, but change the Signing Method from "Account key" to "User delegation key". To invalidate it you have to revoke the user delegation key via an API call.

With both types of SAS token, you can scope access from a whole container down to an individual file.
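
A minimal sketch (placeholder names) of issuing a user delegation SAS with the storage SDK; the caller signs in with Entra ID and needs a data-plane role such as Storage Blob Data Contributor to obtain the delegation key:

```python
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.storage.blob import (
    BlobSasPermissions,
    BlobServiceClient,
    generate_blob_sas,
)

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),      # Entra ID identity, no account key
)

now = datetime.now(timezone.utc)
delegation_key = service.get_user_delegation_key(
    key_start_time=now,
    key_expiry_time=now + timedelta(hours=1),
)

# Signed with the user delegation key instead of an account key.
sas_token = generate_blob_sas(
    account_name="<account>",
    container_name="raw",
    blob_name="data/file.csv",
    user_delegation_key=delegation_key,
    permission=BlobSasPermissions(read=True),
    expiry=now + timedelta(hours=1),
)
```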

20 - Azure data lake security - Role-Based Access Control (RBAC)

All of the previous authentication methods allow someone to access Azure resources without the service knowing who the person is. The next way to authenticate applies when we do know their identity.

RBAC - Role Based Access Control

There are the following built-in roles:

  • Owner: Full rights to manage the object, and to assign RBAC permissions.
  • Contributor: Can manage the objects like Owner, but has no rights to assign RBAC permissions.
  • Reader: Can only read the configuration of the objects.

Note that these roles are about managing the objects themselves; they do not grant permission to view or edit the data inside them, i.e. they apply to the control plane, not the data plane.

Note that there are some roles that do give rights to the data plane, like Storage Blob Data Contributor.

Thirdly, note that because the Contributor role can list the storage account's access keys, it can use a key to get full rights to read the data.

Fourthly, note that the Owner role by itself does not grant access to the data, but an Owner is able to assign themselves the rights to allow this.
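
For the data plane, the pattern is that an identity holding a role such as Storage Blob Data Reader or Contributor accesses blobs with its Entra ID credential, with no key or SAS involved. A minimal sketch with placeholder names:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Authorisation comes purely from an RBAC data-plane role (e.g. Storage Blob
# Data Reader) assigned to the signed-in identity; no key or SAS is used.
service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
blob = service.get_blob_client(container="raw", blob="data/file.csv")
print(blob.download_blob().readall()[:100])
```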

To set these for an object, for example a resource group, go to the object in the portal, and click Access Control (IAM). Here you can click Add Role Assignment to add members to a listed existing role (there are hundreds of them), or you can click Add Custom Role to create a new role.

For any existing role, you can click "View" to see what specific permissions are assigned to each of them. Again there are hundreds of these.

Roles are additive: a user's effective permissions are the union of the permissions in all the roles they are a member of.

Service Principal

Above the resource group level is the subscription level, and above this are Management Groups. To configure these, in the portal go to All Services and select Management and Governance on the left-hand side.

Role assignments added at a higher level (for example on a resource group) are shown as Inherited in Access Control (IAM) at the lower levels (for example on a storage account), with the Scope column showing the level (in this case the resource group) where the assignment was made.

In Access Control (IAM), you can click Check Access in order to read the permissions assigned to a user.

The idea is that a role is a bundle of permissions, and you assign that role to an identity (such as a managed identity) at a given scope.

These roles can be assigned down to the Container level but no lower. If you click on Access Control (IAM) for a lower level, it will take you into the config for the parent container level.

21 - Azure data lake security - Access Control Lists (ACL)

ACLs can be used to give permissions to directories and files inside a container.

Note that this applies only to ADLS Gen2 storage (with the hierarchical namespace), not the blob service, because real directories only exist in Gen2 (in blob storage they are only emulated using "/" in blob names).

To set these permissions, go into the object you want to grant permissions on, and click Settings - Manage ACL.

Here you can give permissions to a security principal directly. Or, if it is a folder, you can click on Default Permissions to configure the default permissions that new objects created inside the folder will receive.

Note that when I first logged into Azure Storage Explorer (see below), it could not see the subscription, so could not see any containers etc. To fix this, ChatGPT suggested that I delete the directory C:\Users\<username>\AppData\Roaming\StorageExplorer on my local machine. This fixed the problem, so I thanked ChatGPT and told it that it had worked.

Azure Storage Explorer can be used to manually propagate permissions down the tree from a parent directory. To do this, open Azure Storage Explorer, and right-click a directory and select Manage Access Control Lists. This shows the same permissions on the objects (as are seen in the portal). But instead you can select Propagate Access Control Lists to copy the permissions on the current object down to all objects below it.

ACLs are not inherited, but you can set default ACLs on a directory that determine the ACLs applied to child objects created subsequently.

Also, if the child items already exist, then in Storage Explorer at least you can use Propagate Access Control Lists to propagate the directory's ACLs down the chain.
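
A minimal sketch of the same operations with the Data Lake SDK (placeholder names and object ID; the caller needs rights to change ACLs, e.g. Storage Blob Data Owner): set an ACL plus a default ACL on a directory, then push the ACL down to the children that already exist, similar to Propagate Access Control Lists in Storage Explorer:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",   # dfs endpoint = ADLS Gen2
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw").get_directory_client("landing")

# Access ACL for the directory, plus a "default:" entry that only affects
# objects created in the directory afterwards.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,"
        "user:<object-id>:r-x,default:user:<object-id>:r-x"
)

# Apply the access ACL to everything that already exists under the directory.
directory.update_access_control_recursive(
    acl="user::rwx,group::r-x,other::---,user:<object-id>:r-x"
)
```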

22 - ARM-based CI/CD for Azure Data Factory (part 1)

CI/CD means Continuous Integration / Continuous Delivery

Go to github.com, log in, and create a new repository (repo), setting it to Private rather than Public. On the repo's start page, click the link to create a new file, add some text, and commit it. This is the first commit to the repo and is necessary before the repo can be used. Copy the repo's URL to the clipboard.

Then in Data Factory Studio, go into Manage - Git Configuration, and click the Configure button in the centre. Set Repository type to GitHub, and enter n***9327 as the GitHub repository owner.

Then select the repository just created from the pop-up menu.

Then in Collaboration Branch, select main from the popup list.

Then on the top left of Data Factory Studio, click New Branch, to create a new branch in the repo, which I will create as Branch3

Pull Requests

To copy code updated in this new branch into the main branch, click Create Pull Request at the top left of Data Factory Studio. This opens a window in github.com where you click the buttons to approve and merge it. Verify that the item has been copied into the main branch in both GitHub and Data Factory.

It might be necessary to "approve" the pull request in Data Factory Studio as well.

Note that in the course, the change is made in Azure DevOps rather than github.com.

Hotfix branches

Let's say that the Dev code has advanced to a newer version than what is on Prod, and then a problem is found in Prod for which an urgent fix is required. To handle this, create a "hotfix" branch from the point on the main branch that the most recent Prod release was made from. Fix the code in this hotfix branch and release it to Prod, then make the same fix in the current main branch so it is included when main next goes to Prod.

In Azure DevOps there is the concept of the "environment", for example Prod and Dev. Click on each of these environments to get a list of all releases to that environment.

23 - ARM-based CI/CD for Azure Data Factory (part 2)

26 - Introduction to Azure Logic Apps

He says that Logic Apps are not in scope for the DP-203 exam.

27 - Azure Synapse Analytics: Overview, Pipelines

Connect to the following sample datasource REST API:

https://rebrickable.com/api/v3/docs/?key=

get /api/v3/lego/minifigs/

In Synapse Studio, create a Copy-Data task in a new pipeline. In the Source for this, create a new linked service, of type REST (type this into the search window).

Set the base URL to: https://rebrickable.com

and authentication type to Anonymous (the credentials are not entered at this point). Click Test Connection to test that it is able to connect to the API, and click to create it.

Then set the relative URL of the dataset that reads from the linked service to: /api/v3/lego/minifigs/

Then you need to add a new request header to hold the API key.

Although you can hard-code this key into this header, you can alternatively add it to a Secret in Azure Key Vault, and reference it from there.
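
Outside Synapse, the equivalent request looks like the sketch below (the Authorization header format follows the Rebrickable docs; the key value is a placeholder and could instead be fetched from a Key Vault secret as in section 18):

```python
import requests

API_KEY = "<rebrickable-api-key>"   # placeholder; ideally read from a Key Vault secret

response = requests.get(
    "https://rebrickable.com/api/v3/lego/minifigs/",
    headers={"Authorization": f"key {API_KEY}"},   # the header the copy activity adds
)
response.raise_for_status()
print(response.status_code, len(response.content), "bytes")
```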

FMI when I tried to add the secret to Key Vault, I got a permissions error. This was because, even though I was the owner of the key vault, under the Access Policies menu on the left I only had List access to the secrets. I had to add all the secret permissions there, which fixed the problem.