Data Flows in ADF (Preview) allow you to build visual data transformation routines that ADF will compile and execute as optimized scale-out operations on Azure Databricks clusters. In this quick post, I want to talk a bit about a pattern that you’ll use from time-to-time in ADF Data Flows: Self Joins.
In SQL database parlance, a self-join is a query where both the left and right side tables are the same. In ADF Data Flow, you can achieve this through a combination of the Join transformation and the Select transformation. In both instances, the Join operation will require you to alias one of the relationships.
Here is what a very simple example would look like when I am aggregating data in my data flow. In this case, I am creating an aggregation called “states” in Aggregate1 that is grouped by AddressID. AddressID is a unique key field, but I am using in my group-by in the Aggregate so that I can use it in my self-join next. Aggregate transforms in ADF Data Flow will only output the columns used in the aggregation. I.e. only the fields used in group-by and the aggregated fields will be passed on to the next transformation in your data flow.
This is why self-joins are very important with flows that use aggregators. If you wish to include the previous columns in your flow, use a New Branch from the previous step and use the self-join pattern to connect the flow with the original metadata. The new branch is created by clicking the “+” on the Addresses source transform:
The New Branch duplicates the stream, so both Address streams are identical. To enable a Self-Join with the Join transform at the end, I need to alias the ID column. Do this with the Select transform, called “AddressesOrig” in the diagram (I simply added “_orig” to differentiate):
As stated above, I want to pass-through the AddressID for my join on the top stream, so I included AddressID in the Group-by in my Aggregator even though it will not result in any actual grouping:
Now, I can join the original set of column metadata back together with my new aggregation which I called “states”. I’ll do this with a Join transform. Note that I see both the AddressID from the Aggregate and the AddressID_orig from the branch with the Select alias:
And to show that my self-join is working, you’ll see that my data flow now has all of the data combined with my new “states” aggregation:
If you’d like to get started building Data Flows in ADF, please submit a request for whitelisting your Azure Subscription to enable ADF Data Flow factories: http://aka.ms/dataflowpreview.