ADF Data Flows: Distinct Rows

Below is a method to use in ADF’s Mapping Data Flows to reduce the data stream in your data flows to only include distinct rows. This sample is available as a pipeline template here.

distinct1

  1. Choose your source. This will work with any type of source.
  2. Add an Aggregate transformation. Here I call it “DistinctRows”.
  3. In the group by setting, you need to choose which or column or combination of columns will make up the key(s) for ADF to determine distinct rows. In this simple demo, I’ll just pick “movie” as my key column:distinct3
  4. The inherent nature of the aggregate transformation is to block all metadata columns not used in the aggregate. But here, we are using the aggregate to filter out non-distinct rows, so we need every column from the original dataset.
  5. To do this, go to the aggregate settings and choose Column Pattern.
  6. Here, you will need to make a choice between including the first set of values from the duplicate rows, or the last. Essentially, choose which row you want to be the source of truth. distinct2
  7. That’s really it. That’s all you need to do to find distinct rows in your data.
  8. The rest of this data flow sample is comparing the distinct row to the original set of data. You can keep the other transformation streams in this sample data flow so that you can compare the original data with the distinct row to make sure it is behaving the way you expect.
  9. I created a copy of the original source data by using New Source and then renamed that stream as “OriginalData” by aliasing it with the Select transformation on the bottom row.
  10. The row counts are just Aggregate transformations.
  11. To create row counts go to Aggregate settings and leave the group by empty.
  12. In the aggregate function, use the function count(1). This will create a running count of every row.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s