Reduce Execution Time for Data Flow Activities in ADF Pipelines

In ADF Mapping Data Flows, there are two working modes: Debug mode and Pipeline mode.

Debug mode is active when you turn on the Data Flow debug switch and its light turns green. The Data Preview pane at the bottom of your transformation panel also lights up green. With debug on, you can interactively preview your data in the data flow UI as you build your transformations.

You can also execute pipelines with data flow activities against the same live Azure IR that your debug session uses. To do this, click “Debug” from the pipeline. Your pipeline will execute immediately, and you can view your data flow activity in the bottom panel.

debugpipeline.png

The other mode of working in Data Flows is from an operationalized pipeline, that is, a pipeline that has been scheduled from a trigger and runs against the live ADF service. Typically, you do this as your final testing step and then schedule your pipeline for normal operations. You can also get to this mode interactively from the pipeline screen (see above) by clicking Add Trigger > Trigger Now.

When you use Trigger Now or a scheduled pipeline run, you are no longer using the debug Azure IR compute environment. Instead, you use the Azure IR that is selected in each of your Data Flow activities.

activity-data-flow2.png
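To make the per-activity Azure IR selection concrete, here is a minimal sketch of what a data flow activity definition might look like when built as a Python dictionary. The helper `build_data_flow_activity` and the names `TransformJson`, `JsonDataFlow`, and `DataFlowIRWithTTL` are hypothetical, and the property names are my assumption of how the ADF pipeline JSON is laid out, so check them against the JSON view of your own pipeline before relying on them.

```python
import json


def build_data_flow_activity(activity_name, data_flow_name, ir_name):
    """Build a data flow activity definition as a plain dict.

    Hypothetical helper: the key names are assumed to mirror the
    ADF pipeline JSON and should be verified against your factory.
    """
    return {
        "name": activity_name,
        "type": "ExecuteDataFlow",
        "typeProperties": {
            # Which data flow this activity runs.
            "dataflow": {
                "referenceName": data_flow_name,
                "type": "DataFlowReference",
            },
            # Which Azure IR supplies the compute for a triggered run.
            "integrationRuntime": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


activity = build_data_flow_activity(
    "TransformJson", "JsonDataFlow", "DataFlowIRWithTTL"
)
print(json.dumps(activity, indent=2))
```

Two activities in the same pipeline could call this helper with different `ir_name` values, which is how each one ends up on a different compute configuration.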

Each of your data flow activities can execute from a different configuration of Azure IR. That means that you can apply more or fewer resources to each execution. You define the size of the compute environment from the Data Flow properties in the Azure IR:

azureir2

If you set the TTL, ADF will spin up and maintain a pool of resources that can be reused for that period of time. Every time you request a data flow activity execution against that same Azure IR, ADF can run the cluster compute and job execution on warm VMs, reducing the overall execution time of your data flow. If you do not specify a TTL, ADF will always spin up new compute clusters on every execution.

This model is very economical because you do not have to maintain an always-running Spark cluster to serve your ETL needs. With ADF, the compute is ephemeral and only present when it is needed.

The average start-up time for a just-in-time cluster is 5-7 minutes. If you need to lower that start-up time, use the TTL setting. The first execution still requires that initial load time to provision a pool of resources for that Azure IR, but each subsequent execution that occurs within the TTL window will require less acquisition time.
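The compute size and TTL settings described above can also be expressed in the Azure IR's JSON definition. The sketch below builds such a definition as a Python dictionary. The helper `build_azure_ir` and the IR name are hypothetical, and the property names (`dataFlowProperties`, `computeType`, `coreCount`, `timeToLive`) are my assumption of the ADF schema, with `timeToLive` taken to be in minutes. Verify them against your own factory's JSON before using them.

```python
import json


def build_azure_ir(name, core_count=8, compute_type="General", ttl_minutes=10):
    """Build an Azure IR definition as a plain dict.

    Hypothetical helper: key names are assumed to mirror the ADF
    integration runtime JSON and should be verified before use.
    """
    if ttl_minutes < 0:
        raise ValueError("ttl_minutes must be non-negative")
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        # 0 means no warm pool: a new cluster on every run.
                        "timeToLive": ttl_minutes,
                    },
                }
            },
        },
    }


ir = build_azure_ir("DataFlowIRWithTTL", core_count=8, ttl_minutes=10)
print(json.dumps(ir, indent=2))
```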
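To get a rough feel for the trade-off, here is a small back-of-the-envelope model. It is only a sketch: the cold start of 6 minutes is the midpoint of the 5-7 minute range, while the 1 minute warm start and the 6 minute run time are hypothetical numbers I picked for illustration, not measurements. It also assumes every run after the first begins inside the TTL window.

```python
def total_minutes(run_minutes, cold_start, warm_start, use_ttl):
    """Sum start-up plus run time over a sequence of data flow runs.

    With a TTL, only the first run pays the cold start; later runs
    are assumed to land inside the TTL window and pay the warm start.
    Without a TTL, every run pays the cold start.
    """
    total = 0.0
    for index, run in enumerate(run_minutes):
        is_warm = use_ttl and index > 0
        startup = warm_start if is_warm else cold_start
        total += startup + run
    return total


RUNS = [6.0, 6.0, 6.0]  # hypothetical run times in minutes
COLD_START = 6.0        # midpoint of the 5-7 minute range
WARM_START = 1.0        # hypothetical warm acquisition time

without_ttl = total_minutes(RUNS, COLD_START, WARM_START, use_ttl=False)
with_ttl = total_minutes(RUNS, COLD_START, WARM_START, use_ttl=True)
print(f"Without TTL: {without_ttl} min, with TTL: {with_ttl} min")
```

The model ignores the cost of keeping the pool alive during the TTL window, so treat it as a way to reason about elapsed time only.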

am2

In my example above, I first executed my JSON Data Flow against an Azure IR without TTL, and it took 13 minutes. Then I executed it again, this time using a TTL of 10 minutes, and the data flow took almost half that time. Because I used a 10-minute TTL, the cluster spun up from warm VMs, and the only extra cost was the additional 10 minutes of TTL.

In the monitoring view, I can see the difference in cluster acquisition time between my Azure IR with no TTL and the one with TTL. Just keep in mind that your first activity will spin up the resource pool, which requires that same 5-7 minute build time.

mon2

mon1
