Azure Data Factory Data Flow: Building Slowly Changing Dimensions

Here is a quick walk-through on how to use Azure Data Factory’s new Data Flow feature (limited preview) to build Slowly Changing Dimension (SCD) ETL patterns.

The Data Flow feature in ADF is currently in limited preview. If you would like to try this out on your Data Factories, please fill out this form to request that your Azure subscription be whitelisted for ADF Data Flows: http://aka.ms/dataflowpreview. Once your subscription has been enabled, you will see “Data Factory V2 (with data flows)” as an option in the Azure Portal when creating Data Factories.

I’m going to start super-simple by building just the path in my data flow for an SCD Type 2 in the instance where the dimension member does not already exist in the target Azure SQL DW. In ADF Data Flow, you can have multiple paths in your data flows (called streams) that enable you to perform different data transformations on different data sources in a single Data Flow.

In this case, I will do a lookup against the DW to see if the incoming product already exists in the DW dimension table. If it does, I’ll branch off and update the row with the new attributes (I’ll blog that path later). For this blog, we’ll complete the path that creates a new surrogate key for the row and cleans up some of the attributes before loading the new row into the target DW dimension table.

[Image: scd1]

The first thing I’ll do is add two sources: one for the Azure SQL OLTP dataset, which has the products that were sold today, and a second that queries the existing DW DimProducts table for the current, unique rows in the products dimension. Because I am demonstrating a Type 2 SCD, the active dimension members can be signified with a NULL End Date or a ‘Y’ in the Status column to indicate which row to use in calculations. To use this query for the second source at the bottom of the diagram, I created a SQL view with the proper query predicate and used that view as the source in my ADF dataset. However, I could have also expressed the predicate directly in the Data Flow with a Filter transform and an expression such as isNull(EndDate) or EndDate > addMonths(currentDate, 999), depending on the method you choose for SCD management in your dimension tables.
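If you go the Filter transform route, a minimal sketch of that predicate (assuming the EndDate and Status columns described above) could be either of these expressions, depending on how you mark the active row:

    isNull(EndDate) || EndDate > addMonths(currentDate(), 999)

or, if you flag the active row with a status column instead:

    Status == 'Y'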

[Image: scd7.png]

ADF Data Flow is smart enough to take your end-to-end flows and optimize execution using pushdown techniques when they are available. So a Filter transform against what appears to be a complete table scan in the design view may not actually execute that way when you attach your Data Flow to a pipeline. It is best to experiment with different techniques and use the Monitoring in ADF to gather timings and partition counts for your Data Flow activity executions.

There are several different ways to proceed from here. In this demo, I used a Left Outer Join so that I get all of the new products and all of the existing products in a single query, then I split the flow with a Conditional Split based on whether the surrogate key (stored in ProductAlternateKey) was already set.
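As a sketch, the condition on the Conditional Split for the “new product” stream can simply test whether the join found a surrogate key on the DW side (in the expression builder the column will carry the DW stream’s name as a prefix, which I’m omitting here):

    isNull(ProductAlternateKey)

Rows that match continue down the path built in this post; everything else falls to the bottom stream for existing members.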

[Image: scd2]

If the incoming product already has this value set after the join, it drops to the bottom stream of the Conditional Split. I have not built out that flow yet, but what I will do there is check whether the stored attributes in the DimProduct table have changed. If I determine that the change requires a new dimension row, I’ll add that row to the DW in the bottom stream.
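Although that bottom stream isn’t built out yet, the attribute comparison could be expressed as a simple condition. As a purely hypothetical sketch, assuming the Select renamed the DW-side columns with a DW_ prefix:

    ProductName != DW_ProductName || Color != DW_Color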

Using the Lookup transform within ADF Data Flow is another very common way to look up the reference values needed in this case. You can use that pattern for SCD Type 2 when you want to match incoming values against the existing DW member values and update the table when an attribute changes.

Real quick, I want to point out the Select transform immediately following my Join. This is an important concept in ADF Data Flow. All of your columns will automatically propagate throughout your streams. As your columns accumulate, you can use the Select transform to “select” the columns that you wish to keep. You will see stream names appended to the column names to indicate the origins and lineage of those columns to help you make the proper determination.

[Image: scd9]

While you can always choose which columns to keep in the Sink transform at the end of your Data Flow, pruning your column lists earlier with a Select keeps your streams tidy. The downside to this approach is that once you’ve dropped a column with a Select, that metadata is gone from the stream and cannot be accessed further downstream.

Now we need to add a surrogate key. Use the Surrogate Key transform to generate a unique key, stored in this case in ProductAlternateKey, that will be used for joins in the DW star schema and by our BI/analytical solutions. This is not a business key and is not part of the OLTP system; it is specific to the data warehouse and is used to join facts to dimensions.

[Image: scd5]

The ADF Data Flow Surrogate Key transform provides an incrementing value, so I’m going to prepend some text to the beginning of that value to demonstrate how to give your own flavor to the key. I decided to do this all in one Derived Column transform that I called “SetAttributes”, shown below:

[Image: scd6]

You can always create separate Derived Column transforms for these formulas, but I just found it easier to build all of my expressions in one transform for this sample.

The other formulas set up attributes that I’ll use to hold stored data for publishing into the SQL DW DimProd table: ‘Y’ for the active status of the dimension member, a StartDate for the row, an EndDate far in the future, empty strings instead of NULLs, and a bit flag that my target schema requires.
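As a rough sketch, the expression pairs in that one “SetAttributes” Derived Column transform might look something like the following. The ‘PROD’ prefix and the Color and IsCurrent columns are illustrative assumptions, not the exact names from the screenshots; substitute whatever your target schema requires.

    ProductAlternateKey : concat('PROD', toString(ProductAlternateKey))
    Status              : 'Y'
    StartDate           : currentDate()
    EndDate             : addMonths(currentDate(), 999)
    Color               : iif(isNull(Color), '', Color)
    IsCurrent           : true()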

Now all you have to do is sink your fields into your Azure SQL DW with a Sink transform and your SCD Type 2 data flow will be complete. In order to execute this flow on a schedule, create a new ADF pipeline and add a Data Flow activity that points to this SCD Data Flow. You’ll be able to use the full range of ADF scheduling and monitoring to operationalize your data warehouse loading in ADF.


New Azure Data Factory End-to-End Lab

Just a few weeks ago, we announced the public preview of the new browser-based UI for Azure Data Factory. See Gaurav’s blog here detailing the release. To help you understand how to build complex data integration projects in the ADF Visual Design interface, we’ve partnered with Pragmatic Works, who have been long-time experts in the Microsoft data integration and ETL space, to create a new set of hands-on labs that you can now use to learn how to build those DI patterns using ADF V2.

The lab is hosted in a GitHub repo, where you will find data files and scripts in the Deployment folder, along with a lab manual folder for each lab and overview presentations with more details. You will also find a series of PowerShell and DB scripts as well as ARM templates that generate the resource groups the labs need, so you can successfully build out an end-to-end scenario with sample data and use it for Power BI reports in the final Lab 9. Here is how the individual labs are divided:

  • Lab 1 – Setting up ADF and Resources. Start here to get all of the ARM resource groups and database backup files loaded properly.
  • Lab 2 – Lift and Shift of SSIS to Azure. Go to this lab if you have existing SSIS packages on-prem that you’d like to migrate directly to the cloud using the ADF SSIS-IR capability.
  • Lab 3 – Rebuilding an existing SSIS job as an ADF pipeline.
  • Lab 4 – Take the new ADF pipeline and enhance it with data from Cloud Sources.
  • Lab 5 – Modernize the DW pipeline by transforming Big Data with HDInsight.
  • Lab 6 – Go to this lab to learn how to create copy workflows in ADF into Azure SQL Data Warehouse.
  • Lab 7 – Build a trigger-based schedule for your new ADF pipeline.
  • Lab 8 – You’ve operationalized your pipeline based on a schedule. Now learn how to monitor and manage that DI process.
  • Lab 9 – Bringing it all Together

Azure Data Factory V2 Quickstart Template

If you’d like to try a quick & easy way to get up & running with your first V2 pipeline of Azure Data Factory, check out the video that I just recorded walking you through how to use the Azure Quickstart Template.

To get to the template, you can search the Azure Quickstart template gallery, or here is a direct link to the ADF V2 template.

Deploying this template to your Azure subscription will build a very simple ADF pipeline using the V2 public preview service, which was announced at the end of September at Ignite.

The pipeline executes a copy activity that will copy a file in blob store. In the example video I linked to above, I use it to make a copy of a file in the same blob container but with a new filename.

To learn more about the new features in ADF V2, see our updated public documentation home page: https://docs.microsoft.com/en-us/azure/data-factory/.