When transforming cloud data at scale in ADF, Mapping Data Flows allows Data Engineers to graphically design an execution plan of data transformation routines that will execute against Spark. The inherent nature of complex cloud-based big data is that the data will be messy and will frequently change shape and form.
We have built-in features in Azure Data Factory that make it easy for Data Engineers to build robust ETL projects in ADF that can transform massive amounts of data in a flexible way, without writing code. To build transformations that can account for changes in data shape, you will build your data flows in a way that exploits what we call “late binding”.
In many cases, you will build ETL projects for your business that require well-defined schemas with data sources that do not frequently change. In those cases, you can use traditional early-binding in your data flows. These are the characteristics of early binding:
- A well-defined schema in your dataset
- The data shape for your data sources does not frequently change
- Your transformation expressions use explicit column names from your datasets
- Heavy use of Select transformations to manipulate schemas manually
- Heavy use of manual mapping in Sink and Select transformations
An example set of Aggregate expressions that utilize early binding would be these expressions below, which I use in my baseball data flow to generate offensive statistics based on columns that I expect to be present in my source data:
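The original screenshots aren’t reproduced here, but a representative sketch of early-bound Aggregate expressions might look like the following. The column names (H, AB, BB, HBP, SF) are assumptions based on common baseball statistics and are only illustrative:

```
// Aggregate transformation expressions -- every one references an
// explicitly named source column, which is what makes this early binding
TotalHits    : sum(toInteger(H))
TotalAtBats  : sum(toInteger(AB))
BattingAvg   : toDouble(sum(toInteger(H))) / toDouble(sum(toInteger(AB)))
OnBasePct    : toDouble(sum(toInteger(H)) + sum(toInteger(BB)) + sum(toInteger(HBP)))
               / toDouble(sum(toInteger(AB)) + sum(toInteger(BB)) + sum(toInteger(HBP)) + sum(toInteger(SF)))
```

If any of those columns disappears or is renamed in the source, expressions like these break — which is exactly the fragility that late binding avoids.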
For column mapping, I know exactly which fields I wish to map in my destination and I want to control the logical names, so I’m using manual mapping:
As your data flows become more complex, include flexible schemas, and utilize late binding, you’ll use either auto-mapping to generate output for all late-arriving (also known as “drifted”) columns, or rule-based mapping.
To build data flows that are more resilient to change, you can choose to leverage features for late binding. Keep in mind that when utilizing these late-binding techniques, your logical flow may look a little more complex and will require you to work with your data in a way that is pattern-based.
- Use datasets that are schemaless. Your dataset is used in the Source transformations to define the shape of your data as the data is acquired into a Spark data frame. ADF datasets do not require schemas and you can take advantage of that to work with flexible data sources. Any column that ADF sees in your source that is not defined in your dataset is labeled as “drifted”.
- Build transformations that use patterns instead of named columns. Expressions that require explicitly named columns from a dataset use “early binding”. Instead, use patterns and rules that look for data by position, by naming pattern, and by data type.
- Late binding utilizes ADF’s schema drift features, so you’ll want to make sure you’ve enabled “Allow schema drift” on your sources and sinks. This way, ADF can pass through all newly detected columns.
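Putting those points together: with schema drift enabled, a drifted column won’t appear in the designer’s schema list, but you can still reach it explicitly in an expression with `byName()`. A minimal sketch, where the column name `SourceRegion` is hypothetical:

```
// Derived Column expression referencing a drifted (late-bound) column.
// byName() returns an untyped value at design time, so cast it
// explicitly before using it downstream.
Region : toString(byName('SourceRegion'))
```

Because the cast is explicit, this expression validates even though `SourceRegion` is not defined in any dataset schema.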
Here is an example of a Derived Column where I demonstrate pattern matching. From top-to-bottom:
- I trim every String column, keeping the column name the same
- I add 1.2 to the columns present at ordinal positions 2 and 3
- I look for any column name with the word “Customers” followed by a number, using regular expressions. When that pattern is found, I generate a new column by appending “_customer” to the original column name. “$$” means that I keep the original value.
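The exact expressions from the screenshot aren’t shown here, but the three patterns above could be sketched as Derived Column column patterns, each with a matching condition, a name expression, and a value expression. The `name`, `type`, and `position` meta fields refer to each matched column:

```
// Pattern 1: trim every string column, keeping the column name the same
//   matches : type == 'string'
//   name    : $$
//   value   : trim($$)

// Pattern 2: add 1.2 to the columns at ordinal positions 2 and 3
//   matches : position == 2 || position == 3
//   name    : $$
//   value   : toDouble($$) + 1.2

// Pattern 3: match names like "Customers1", "Customers27", ...
//   matches : regexMatch(name, 'Customers[0-9]+')
//   name    : $$ + '_customer'
//   value   : $$
```

Each pattern fires once per matching column, so a single rule covers however many columns arrive at runtime.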
As your datasets become larger and wider, utilize rules in your Sink mapping like this example below. Here, I’m mapping any stat that contains “H” to the same target name in my destination table. I’m also including “SH” as a new column called “Sacrifices”.
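A sketch of what those Sink mapping rules might look like, expressed as condition/target pairs (the “Sacrifices” rename is from the example above; the condition expressions are illustrative):

```
// Rule-based mapping in the Sink: (matching condition) -> (target column name)
//   instr(name, 'H') > 0  ->  $$             // any stat containing "H" keeps its name
//   name == 'SH'          ->  'Sacrifices'   // SH lands in a new, renamed column
```

Because the first rule matches on a substring of the column name rather than a fixed list, new “H” stats that drift in later are mapped automatically.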
These are examples of late binding that depend upon patterns and rules to define my transformation and mapping intent when I have data sources that are wide and change frequently.
For more details on these areas of ADF, read about: