ADF Data Flows: Self-Join

Data Flows in ADF (Preview) allow you to build visual data transformation routines that ADF will compile and execute as optimized scale-out operations on Azure Databricks clusters. In this quick post, I want to talk a bit about a pattern that you’ll use from time-to-time in ADF Data Flows: Self Joins.

In SQL database parlance, a self-join is a query where both the left and right side tables are the same. In ADF Data Flow, you can achieve this through a combination of the Join transformation and the Select transformation. In both instances, the Join operation will require you to alias one of the relationships.

The Join transform will allow you to join 2 Data Flow streams (can be a source or any other transform step in your data flow) and the Select transform will be used for aliasing.

Here is what a very simple example would look like when I am aggregating data in my data flow. In this case, I am creating an aggregation called “states” in Aggregate1 that is grouped by AddressID. AddressID is a unique key field, but I am using in my group-by in the Aggregate so that I can use it in my self-join next. Aggregate transforms in ADF Data Flow will only output the columns used in the aggregation. I.e. only the fields used in group-by and the aggregated fields will be passed on to the next transformation in your data flow.

This is why self-joins are very important with flows that use aggregators. If you wish to include the previous columns in your flow, use a New Branch from the previous step and use the self-join pattern to connect the flow with the original metadata. The new branch is created by clicking the “+” on the Addresses source transform:

sj1

The New Branch duplicates the stream, so both Address streams are identical. To enable a Self-Join with the Join transform at the end, I need to alias the ID column. Do this with the Select transform, called “AddressesOrig” in the diagram (I simply added “_orig” to differentiate):

sj2

As stated above, I want to pass-through the AddressID for my join on the top stream, so I included AddressID in the Group-by in my Aggregator even though it will not result in any actual grouping:

sj3

Now, I can join the original set of column metadata back together with my new aggregation which I called “states”. I’ll do this with a Join transform. Note that I see both the AddressID from the Aggregate and the AddressID_orig from the branch with the Select alias:

sj4

And to show that my self-join is working, you’ll see that my data flow now has all of the data combined with my new “states” aggregation:

sj5

If you’d like to get started building Data Flows in ADF, please submit a request for whitelisting your Azure Subscription to enable ADF Data Flow factories: http://aka.ms/dataflowpreview.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s