ADF Mapping Data Flows: Optimize for Azure SQL Database

I’m going to use this blog post as a dynamic list of performance optimizations to consider when using Azure Data Factory’s Mapping Data Flow. I am going to focus this only to Azure SQL DB. I will post subsequent articles that list ways to optimize other source, sinks, and data transformation types. As I receive more good practices, feedback, and other performance tunings, I will update this article accordingly.

Here is Azure SQL DW Optimizations for ADF Data Flows.

Optimizations to consider when using ADF Mapping Data Flows with Azure SQL DB

 

NOTE: When you are designing and testing Data Flows from the ADF UI, make sure to turn on the Debug switch so that you can execute your data flows in real-time without waiting for a cluster to warm up.

debugb1

You can match Spark data partitioning to your source database partitioning based on a database table column key in the Source transformation

sourcepart2

  1. Go to “Optimize” and select “Source”. Set either a specific table column or a type in a query.
  2. If you chose “column”, then pick the partition column.
  3. Also, set the maximum number of connections to your Azure SQL DB. You can try a higher setting to gain parallel connections to your database. However, some cases may result in faster performance with a limited number of connections.

Set Batch Size and Query on Source

source4.png

  1. Setting batch size will instruct ADF to store data in sets in memory instead of row-by-row. It is an optional setting and you may run out of resources on the compute nodes if they are not sized properly.
  2. Setting a query can allow you to filter rows right at the source before they even arrive for Data Flow for processing, which can make the initial data acquisition faster.
  3. If you use a query, you can add optional query hints for your Azure SQL DB, i.e. READ UNCOMMITTED

Set Sink Batch Size

sink4

  1.  In order to avoid row-by-row processing of your data floes, set the “Batch size” in the sink settings for Azure SQL DB. This will tell ADF to process database writes in batches based on the size provided.

Set Partitioning Options on your Sink

robin

  1. Even if you don’t have your data partitioned in your destination Azure SQL DB tables, go to the Optimize tab and set partitioning.
  2. Very often, simply telling ADF to use Round Robin partitioning on the Spark execution clusters results in much faster data loading instead of forcing all connections from a single node/partition.

Increase size of your compute engine in Azure Integration Runtime

ir-new

  1. Increase the number of cores, which will increase the number of nodes, and provide you with more processing power to query and write to your Azure SQL DB.
  2. Try “Compute Optimized” and “Memory Optimized” options

Disable indexes on write

  1. Use an ADF pipeline stored procedure activity prior to your Data Flow activity that disables indexes on your target tables that are being written to from your Sink.
  2. After your Data Flow activity, add another stored proc activity that enabled those indexes.

Increase the size of your Azure SQL DB

  1. Schedule a resizing of your source and sink Azure SQL DB before your run your pipeline to increase the throughput and minimize Azure throttling once you reach DTU limits.
  2. After your pipeline execution is complete, you can resize your databases back to their normal run rate.
Advertisements

2 thoughts on “ADF Mapping Data Flows: Optimize for Azure SQL Database

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s