Azure Data Factory: Delete Files From Azure Data Lake Store (ADLS)

In a previous post over at Kromer Big Data, I posted examples of deleting files from Azure Blob Storage and Table Storage as part of your ETL pipeline using Azure Data Factory (ADF). In those examples, I built a small, quick Logic App that used the Azure Storage APIs to delete data. In this post, I’m going to demonstrate how to remove files from Azure Data Lake Store (ADLS). For this demo, we’ll use ADF’s V2 service.

Deleting / removing files after they’ve been processed is a very common task in ETL Data Integration routines. Here’s how to do that for Azure Data Lake Store files in ADF:


  1. Start by creating a new Data Factory in Azure.
  2. Click “Author & Monitor” from your factory in order to launch the ADF UI.
  3. Create a new pipeline and add a single Web Activity.
  4. Switch to the “Settings” tab on the properties pane at the bottom of the pipeline builder UI.
  5. The URL in the Web Activity needs to be the URI that points to the ADLS file you wish to delete:
    • https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE
  6. The URL above (i.e. file names, folder names) can be parameterized. Click the “Add Dynamic Content” link when editing the URL text box.
  7. Set the Web Activity “Method” to “DELETE”.
  8. For authentication, you will need to have an access token. You can use this method to produce one:
  9. The access token returned will need to be captured and used in the Web Activity header as such:
    • Header = "Authorization"  Expression = "Bearer <ACCESS TOKEN>"
  10. You can now validate and test run your pipeline with the Web Activity. Click the “Debug” button to give it a try.
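
To sanity-check the URL, method, and header outside of ADF before wiring them into the Web Activity, the same request can be sketched in Python. This is only a sketch under stated assumptions: the function names, `mystore`, and the sample path are hypothetical, and the token request assumes the standard Azure AD client-credentials (service-to-service) flow with the `https://management.core.windows.net/` resource, which may differ from the method used to produce the token for this post.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(tenant_id, client_id, client_secret):
    """Build (without sending) an Azure AD client-credentials token request.

    Assumption: service-to-service login against the Azure AD v1 token
    endpoint; client_secret is the key from the AD app registration.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "resource": "https://management.core.windows.net/",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("utf-8")
    return urllib.request.Request(url, data=form, method="POST")


def build_delete_request(store_name, file_path, access_token):
    """Build (without sending) the WebHDFS DELETE call from steps 5 to 9."""
    # Keep "/" separators, percent-encode anything else in the path.
    path = urllib.parse.quote(file_path.strip("/"))
    url = (f"https://{store_name}.azuredatalakestore.net"
           f"/webhdfs/v1/{path}?op=DELETE")
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, method="DELETE", headers=headers)


def delete_file(store_name, file_path, tenant_id, client_id, client_secret):
    """Fetch a token, then delete the file. Performs real network calls."""
    token_req = build_token_request(tenant_id, client_id, client_secret)
    with urllib.request.urlopen(token_req) as resp:
        token = json.load(resp)["access_token"]
    delete_req = build_delete_request(store_name, file_path, token)
    with urllib.request.urlopen(delete_req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only builds and prints the request; nothing is sent.
    req = build_delete_request("mystore", "mytempdir/myinputfile1.txt",
                               "<ACCESS TOKEN>")
    print(req.get_method(), req.full_url)
```

The two `build_*` functions only construct requests, so they are safe to run and inspect; `delete_file` is the part that actually contacts Azure and needs real tenant, client, and store values.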

 

 


14 thoughts on “Azure Data Factory: Delete Files From Azure Data Lake Store (ADLS)”

  1. After hours of researching how to “delete” content from Azure Data Lake, I think this method is the easiest and simplest.

    Almost everything is clear to me, but I have trouble understanding how to “get” the token and “where”….
    Where should I use the token code to automate service-to-service login: in the web component or in a web browser? And should I always use the same token?

    1. Hi MSSQLDUDE, I am using the following URL to get the access token. Where do I need to put it to get the access token? I am using the Postman application to call it, but it’s throwing an error.

      GET METHOD : https://management.azure.com/subscriptions/0188f03b-0df0-4083-8143-eae90a173ae2/resourceGroups/ati-ioninsights-sandbox/providers/Microsoft.DataLakeAnalytics/accounts/ionisb/dataLakeStoreAccounts/ionisb?api-version=2016-11-01

      ERROR:
      {
      “error”: {
      “code”: “AuthenticationFailedInvalidHeader”,
      “message”: “Authentication failed. The ‘Authorization’ header is provided in an invalid format.”
      }
      }

  2. Do you know if it’s possible to incorporate a web activity for the POST request/some way to renew the access token so you don’t need to re-enter it every time it expires?

  3. Thanks for the answer
    My problem is that I don’t know where to write the code for the service-to-service login. I am new to ADF, and I couldn’t find a tutorial for beginners.

    E.g., after you get the token automatically, how do you capture it in a variable, and how do you use that variable afterwards?

    In summary, I understand the code, but I don’t know where and how to use it.

    Regards
    Thanks

    1. I ran curl from my Windows command prompt to get a token. I then copied and pasted the bearer token into the REST header of the Web Activity in ADF. I have a picture of that in the blog post.

    1. I think this is what I’m looking for… you’ve used it for SSAS integration, but I guess I could use the same approach for Power BI. I want to have a step in my pipeline to kick off the Power BI refresh.

  4. Similar to others, I am struggling with how to execute curl commands from Azure Data Factory. Also, I could get the access token using a direct web call from Postman, but when I try to DELETE a file I get an error message saying:

    {
    “RemoteException”: {
    “exception”: “RuntimeException”,
    “message”: “Operation DELETE not implemented [2018-09-26T09:50:30.1043241-07:00]”,
    “javaClassName”: “java.lang.RuntimeException”
    }
    }

  5. Actually, I figured it out. It’s the Auth-Key that was assigned in the Active Directory app registration. Thanks for the article!
