Azure Data Factory: Delete Files From Azure Data Lake Store (ADLS)

In a previous post over at Kromer Big Data, I posted examples of deleting files from Azure Blob Storage and Table Storage as part of your ETL pipeline using Azure Data Factory (ADF). In those examples, I built a small, quick Logic App that used the Azure Storage APIs to delete data. In this post, I’m going to demonstrate how to remove files from Azure Data Lake Store (ADLS). For this demo, we’ll use ADF’s V2 service.

Deleting files after they’ve been processed is a very common task in ETL / data integration routines. Here’s how to do that for Azure Data Lake Store files in ADF:


  1. Start by creating a new Data Factory from Azure
  2. Click “Author & Monitor” from your factory in order to launch the ADF UI.
  3. Create a new pipeline and add a single Web Activity.
  4. Switch to the “Settings” tab on the properties pane at the bottom of the pipeline builder UI.
  5. The URL in the Web Activity needs to be the URI pointing to the ADLS file you wish to delete, for example:
    • https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE
  6. The URL above (i.e. file names, folder names) can be parameterized. Click the “Add Dynamic Content” link when editing the URL text box.
  7. Set the Web Activity “Method” to “DELETE”.
  8. For authentication, you will need an Azure AD access token (for example, one issued to a service principal that has been granted access to the ADLS account); see the sketch after this list for one way to produce one.
  9. The access token returned will need to be captured and used in the Web Activity header as such:
    • Header = "Authorization"  Expression = "Bearer <ACCESS TOKEN>"
  10. You can now validate and test run your pipeline with the Web Activity. Click the “Debug” button to give it a try.
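
Putting those steps together, here is a minimal pipeline sketch. It is only a sketch, not the exact JSON the ADF UI will generate: it assumes you authenticate with an Azure AD service principal (the <tenant-id>, <app-id> and <app-secret> placeholders), uses a first Web Activity to request a token for the https://datalake.azure.net/ resource, and then passes that token to the DELETE call as the Bearer header:

{
    "name": "DeleteAdlsFilePipeline",
    "properties": {
        "activities": [
            {
                "name": "GetAdlsAccessToken",
                "type": "WebActivity",
                "typeProperties": {
                    "url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
                    "method": "POST",
                    "headers": { "Content-Type": "application/x-www-form-urlencoded" },
                    "body": "grant_type=client_credentials&client_id=<app-id>&client_secret=<app-secret>&resource=https%3A%2F%2Fdatalake.azure.net%2F"
                }
            },
            {
                "name": "DeleteAdlsFile",
                "type": "WebActivity",
                "dependsOn": [
                    { "activity": "GetAdlsAccessToken", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "url": "https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE",
                    "method": "DELETE",
                    "headers": {
                        "Authorization": "@{concat('Bearer ', activity('GetAdlsAccessToken').output.access_token)}"
                    }
                }
            }
        ]
    }
}

If you already have a token from elsewhere, you can drop the first activity and hard-code the header value as shown in step 9.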

 

 


Control Flow Introduced in Azure Data Factory V2

One of the most exciting new features in Azure Data Factory will be very familiar to data developers, data engineers and ETLers who have hand-coded data integration jobs or designed SSIS packages …

“Control Flow”

In general computer science terms, Control Flow is described as “the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated” (Wikipedia: https://en.wikipedia.org/wiki/Control_flow). In the Microsoft Data Integration world, tools like SQL Server Integration Services (SSIS) enable control flow via tasks and containers (https://docs.microsoft.com/en-us/sql/integration-services/control-flow/control-flow). Azure Data Factory’s V1 service was focused on building sequenced pipelines for Big Data Analytics based on time windows. The V2 (preview) version of ADF adds workflow capabilities to pipelines, enabling control flow features such as parameterization, conditional execution (if conditions) and loops.

If-Then

In ADF pipelines, activities are equivalent to SSIS tasks. So an IF statement in ADF Control Flow will need to be written as an activity of type “IfCondition”:

"name": "MyIfCondition",
"type": "IfCondition",
"typeProperties": { "expression":
  { "value": "@bool(pipeline().parameters.routeSelection)",
    "type": "Expression" }

ifTrueActivities and ifFalseActivities are properties of that activity that list the activities to run for each branch (see the combined sketch below):

"ifTrueActivities":
  [ { "name": "CopyFromBlobToBlob1", "type": "Copy" ... ]

"ifFalseActivities":
  [ { "name": "CopyFromBlobToBlob2", "type": "Copy" ... ]

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity

Loops

Do-Until: https://docs.microsoft.com/en-us/azure/data-factory/control-flow-until-activity

"name": "Adfv2QuickStartPipeline",
"properties":
  { "activities":
    [ { "type": "Until", "typeProperties":
        { "expression": { "value": "@equals('false', pipeline().parameters.repeat)",
           "type": "Expression" }, "timeout": "00:01:00", "activities":
              [ { "name": "CopyFromBlobToBlob", "type": "Copy" ...

For-Each: https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity

"name": "<MyForEachPipeline>",
    "properties":
        { "activities":
             [ { "name": "<MyForEachActivity>", "type": "ForEach",
                   "typeProperties": { "isSequential": "true",
                    "items": "@pipeline().parameters.mySinkDatasetFolderPath", "activities":
                       [ { "name": "MyCopyActivity", "type": "Copy", ...

 

Parameters

Parameters are also very common in SSIS packages, where they make execution flexible so that packages can be generalized and re-used. The same capability is now available in ADF via parameters in control flow:

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions

You can build a parameterized pipeline, like this example with parameterized input and output paths:

"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
 {
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
 {
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
 },
"type": "DatasetReference"
 }
 ],
"outputs": [
 {
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
 },
"type": "DatasetReference"

You can then execute that pipeline via the Invoke-AzureRmDataFactoryV2Pipeline PowerShell cmdlet: https://docs.microsoft.com/en-us/powershell/module/azurerm.datafactoryv2/Invoke-AzureRmDataFactoryV2Pipeline?view=azurermps-4.4.1. That cmdlet lets you either pass the values for inputPath and outputPath on the PowerShell command line or define them in a parameter file, e.g. params.json:

{
"inputPath": "markkdhi/tmp",
"outputPath": "markkdhi/user"
}

 

 

Azure Data Factory V2 Quickstart Template

If you’d like to try a quick & easy way to get up & running with your first Azure Data Factory V2 pipeline, check out the video that I just recorded walking you through how to use the Azure Quickstart Template.

To get to the template, you can search the Azure Quickstart template gallery, or here is a direct link to the ADF V2 template.

Deploying this template to your Azure subscription will build a very simple ADF pipeline using the V2 public preview service, which was announced at the end of September at Ignite.

The pipeline executes a copy activity that will copy a file in blob store. In the example video I linked to above, I use it to make a copy of a file in the same blob container but with a new filename.

To learn more about the new features in ADF V2, see our updated public documentation home page: https://docs.microsoft.com/en-us/azure/data-factory/.