Azure Data Factory Data Flow Preview: Sabermetrics Example

Let’s walk through an end-to-end sample scenario that utilizes the new Azure Data Factory Data Flow feature. Some of the patterns that I’ll demonstrate here are very common in ETL data integration projects, which is the target use case for ADF Data Flow.

Data Flow in ADF is currently in limited preview and available only via whitelisting of your Azure Subscription. You can submit your request for access by filling out this form. All associated online documentation, videos, and getting-started guides are available here.

Bill James is known as the father of baseball statistics, and those of us who love baseball stats consider him a genius for his Sabermetrics methodology of analyzing the game through advanced statistics. So, rather than build a typical SQL Server or SSIS ETL data integration sample around books, bikes, or toys, I thought I’d build a sample data flow using baseball statistics. The raw data that I used here is available for download from a number of different sites; I used the stats CSVs from the Chadwick Baseball Bureau, which cover up through the 2017 season:

bb_excel1.png

I first created one data flow that uses the raw at-bats, walks, hits, HRs, etc. to calculate the basic stats: batting average (BA), on-base percentage (OBP), slugging (SLG), and total bases (TB). I called this Data Flow “Baseball 1 Basic Stats”. I’ll show you how I used Aggregate transforms to group and combine the players by player ID and year:

bb_basic1
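
As a rough reference for what that aggregation computes, here is an approximate T-SQL equivalent. The column names (AB, H, BB, HBP, SF, 2B, 3B, HR) follow the Chadwick/Lahman batting CSV conventions and are my assumptions about the exact source schema; in the Data Flow itself these are Aggregate and Derived Column expressions, not SQL.

```sql
-- Approximate T-SQL equivalent of the "Baseball 1 Basic Stats" aggregation.
-- Combines multiple stints per player per year and derives BA, OBP, SLG, and TB.
SELECT
    playerID,
    yearID,
    SUM(H) * 1.0 / NULLIF(SUM(AB), 0)                                 AS BA,   -- batting average
    (SUM(H) + SUM(BB) + SUM(HBP)) * 1.0
        / NULLIF(SUM(AB) + SUM(BB) + SUM(HBP) + SUM(SF), 0)           AS OBP,  -- on-base percentage
    (SUM(H) + SUM([2B]) + 2 * SUM([3B]) + 3 * SUM(HR)) * 1.0
        / NULLIF(SUM(AB), 0)                                          AS SLG,  -- slugging
    SUM(H) + SUM([2B]) + 2 * SUM([3B]) + 3 * SUM(HR)                  AS TB    -- total bases
FROM Batting
GROUP BY playerID, yearID;
```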

There are two source files: the raw batting stats from the CSV screenshot above and a CSV that contains details about each player. In ADF Data Flow, sources all appear as vertical bars on the left-hand side of the design surface, and sinks (your destination outputs) appear as vertical bars on the right-hand side of your flows. In this case, I join the two sources of data together using a composite key that I created with ADF Data Flow’s expression language, so that I can ensure I am grouping players by name and year. You can see that key in the Join Condition below:

bb_join1.png
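
In T-SQL terms, that composite key amounts to something like the sketch below. This is hypothetical: in the Data Flow the key is built with the expression language in a Derived Column, and the exact column names and separator are my assumptions.

```sql
-- Hypothetical composite key combining player and year, used on both sides of the join.
SELECT
    *,
    CONCAT(playerID, '_', yearID) AS playerYearKey
FROM Batting;
```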

My sink destination is an Azure SQL Database where I generate a new schema on the fly. I do not need to define that target table schema. Data Flow will create the SQL tables for me:

bb_ssms1.png
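
The table that Data Flow generates on the fly ends up looking something like the sketch below. This is purely hypothetical; the actual column names and types are derived from the sink’s incoming schema.

```sql
-- Hypothetical shape of the table Data Flow creates automatically in Azure SQL DB.
CREATE TABLE dbo.BaseballBasicStats (
    playerID    varchar(20),
    yearID      int,
    nameFirst   varchar(50),
    nameLast    varchar(50),
    AB          int,
    H           int,
    BB          int,
    HR          int,
    TB          int,
    BA          float,
    OBP         float,
    SLG         float,
    LastUpdated datetime2
);
```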

To create calculations in ADF Data Flow, you will use transformations like Aggregate and Derived Column; these are very commonly used in ETL patterns. Notice that in the Data Flow diagram at the top of this blog post there is a straight vertical line coming down from the first source. That is the “New Branch” transform, which tells Data Flow to duplicate the data stream. You will do this often in data transformation routines that use aggregates: when you aggregate data, the output schema contains only the fields used in your group-by clause and your aggregate functions. So I create a separate branch just for my aggregations, and I join it back together with the rest of the fields that I wish to sink in the end. You can always see what metadata is coming into and out of each transform by looking at the Inspect tab, which also shows data lineage for how you’ve derived your column values:

bb_inspect1
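
The branch, aggregate, and join-back pattern described above corresponds roughly to the following T-SQL sketch; the table and column names are assumptions, and in the Data Flow this is a New Branch, an Aggregate, and a Join rather than SQL.

```sql
-- Rough analogue of the New Branch + Aggregate + Join-back pattern:
-- the aggregate branch keeps only the group-by keys and aggregated values,
-- so it is joined back to the original stream to recover the other columns.
WITH Agg AS (
    SELECT playerID, yearID,
           SUM(AB) AS AB, SUM(H) AS H, SUM(BB) AS BB, SUM(HR) AS HR
    FROM Batting
    GROUP BY playerID, yearID
)
SELECT b.playerID, b.yearID, b.teamID, b.stint,
       a.AB, a.H, a.BB, a.HR
FROM Batting AS b
JOIN Agg AS a
  ON a.playerID = b.playerID
 AND a.yearID   = b.yearID;
```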

Also, on the Inspect tab, you can view a preview of your data as you are building your transformation. You do this by first clicking the Debug button on the top left of the Data Flow design window. Debug mode will allow you to work in a data prep environment where you can view the data as it is being transformed throughout each step in order to validate your transformation routine. I can set this to sample 50 rows out of the 100k+ rows in my dataset. It also will not require me to write the data into my Azure SQL DB:

bb_debug1.png

You will see two joins in this design. The first join, on the second row, combines the aggregated data back with the original source of data so that we can sink all of our interesting columns. The second join combines the two ADF Blob CSV datasets: the batting data and the player data. Here is what the batting data looks like when the CSV is brought into ADF via an ADF dataset; I’ve adjusted the data types as needed for my calculations:

bb_source1

One last transformation in this Data Flow that I want to point out is the Select transform on the top row. I’ve used it to rename the original source dataset to “OrigSourceBatting” so that I know it is the native raw source data, with no aggregates and no derived columns.

The final step in this Data Flow before the Sink is my “AddTimestamp” Derived Column transform. This is another common pattern in ETL data integration design. I land this as a column in my Sink dataset (Azure SQL DB) so that when the table is written, each row carries a timestamp indicating the last time it was updated.
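
In T-SQL terms, the effect is roughly the sketch below; in the Data Flow itself the Derived Column uses the expression language (something like currentTimestamp(), if I recall the function name correctly) rather than SQL, and the upstream stream name here is hypothetical.

```sql
-- Rough analogue of the "AddTimestamp" Derived Column:
-- tag every row with the time it is written to the sink.
SELECT
    s.*,
    SYSUTCDATETIME() AS LastUpdated
FROM StatsWithPlayers AS s;  -- hypothetical name for the upstream stream
```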

I will switch off Debug mode on my Data Flow so that I no longer incur the cost of the Azure Databricks cluster (the engine that Data Flow uses for data transformation), and I’ll now build a very simple new ADF pipeline with just a single activity: Data Flow. In that activity, I will point to my Azure Databricks job cluster and execute the Data Flow from my pipeline so that I can load all of my data into the Azure SQL DB. While I was in Debug mode, I was only sampling 50 rows to test my expressions, and debug mode does not sink the data into my Sink location.

bb_pipe1

Once that pipeline execution is complete, I can go into the Monitoring view in the ADF UI and check the stats from that Data Flow run. I’ll be able to view data partitioning, data skew, the time spent in each stage of data transformation, and data lineage, along with other row and column statistics:

bb_mon1.png

Now that I have all of my basic stats calculated and landed in a database table, I can begin adding the advanced Sabermetric stats. For this example, let’s start with RunsCreated. This will be a very simple data flow because I’ve already created the aggregate buckets (player + year) with the basic stats. Now we’re just adding more calculations on a row-by-row basis. This means that I can use the same Azure SQL DB table that I created above as my Source and use a Derived Column transform to generate Runs Created:

bb_runscreated1
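
For reference, the basic version of Bill James’s Runs Created formula is RC = (H + BB) * TB / (AB + BB). In the Data Flow this is a Derived Column expression; a rough T-SQL equivalent over the basic-stats table (the table and column names are my assumptions) looks like this:

```sql
-- Basic Runs Created: RC = (H + BB) * TB / (AB + BB), computed row by row
-- from the basic-stats table created by the first Data Flow.
SELECT
    playerID,
    yearID,
    (H + BB) * TB * 1.0 / NULLIF(AB + BB, 0) AS RunsCreated
FROM dbo.BaseballBasicStats;
```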

Switch on Debug mode before entering the above Expression Builder so that you can see the results of the Runs Created formula as you write it. This way, you can ensure that you have the proper parentheses, operands, etc. Since I don’t have to aggregate across multiple rows again, I can use this non-blocking transformation, and all of the fields from the input schema will flow naturally through to my output.

bb_adv1

That’s all it takes for this demo, so now I can execute the advanced-stats data flow from a pipeline with a Data Flow activity, using Trigger Now:

bb_pipe2

I’m now landing my data in Azure SQL DW because I want to take advantage of SQL DW’s scaling and pausing features when I hook it up to my Power BI and reporting tools. Here’s what the final stats look like as I begin building my Sabermetrics dashboard:

bb_ssms2.png

That’s pretty much it for this demo of ADF’s new Data Flow feature (currently in limited preview). Here are a few good practices from this sample that you should follow when using Data Flow for your ETL and data prep projects:

  1. Land your data in Blob store for transformation. Data Flow can read directly from ADLS, Blob, Azure SQL DB, and Azure SQL DW, but I found working with files in Blob store the easiest.
  2. Switch on Debug mode while designing your Data Flow to validate your designs and expressions; don’t guess and then execute from your pipeline. You can always execute Data Flows from ADF pipelines in Pipeline Debug mode, but that requires you to switch back and forth between design screens, wasting time. Note that Debug mode is not yet enabled in Data Flow; it is lighting up very soon.
  3. Use New Branch when creating aggregates so that you can join your data back together after your aggregate. Make sure to attach a key field to your agg so that you have it available for the subsequent JOIN operation. Otherwise, sink your agg data on a separate path from the rest of your columns.
  4. Use a timestamp function in a Derived Column transform so you can tag your rows written with the latest timestamp.
  5. After you have completed testing your Data Flow, use the normal ADF mechanisms to schedule your pipeline with Data Flows and monitor it via the Monitoring UX.
  6. If you need to turn up the knob on parallelism and partitioning on Databricks, you can use the Optimize tab on the transformations. The skewness and kurtosis of your data partitioning are presented to you in the monitoring UI. There is no need for you to dive into Spark, Databricks, or any other cluster operation; you can design and optimize your Data Flow entirely from the ADF UI.

SQL Server on Azure VMs – Updated Tips April 2017

There are a lot of “best practice” and “performance guide” links available on the Internet to provide guidance on migrating SQL Server instances and databases from on-prem to Azure on VMs (aka Azure IaaS). In this post, I wanted to share a few findings from the field that I’ve encountered along the way recently working with customers on projects that migrate from on-prem SQL Server to Azure VMs. This is not a guide to migrating to Azure SQL Database and I do not cover all of the issues, migration steps or performance measures in an exhaustive manner. I will share a number of good links to help in that respect. But Azure as a platform is changing and evolving very rapidly and so it seemed like a good time to catch my breath and share a few tips here.

Tip #1: Take the time to analyze and evaluate current on-prem workloads prior to database lift and shift

Don’t just jump right into SSMS and use “Migrate to Azure”. Benchmark your current database and resource utilization. Measure IOPS, TX/s, CPU, RAM, network … all the classic SQL Server constraints. This will help you determine the proper class of VM to choose for your SQL Server instances. Azure VMs come in classes of resources, and you need to make sure that you have enough capacity. I generally recommend DS_v2 and G-series VMs. You can always scale up to a higher-numbered VM in the same series in Azure, but note that resource capacity limits (RAM, IOPS, disks, network) are grouped together in each series, and they all scale up together when you upgrade your VM.
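
As one way to capture an I/O baseline before the move, you can query SQL Server’s file-level I/O statistics. This is just a sketch of the approach; the numbers are cumulative since the last SQL Server restart, so sample them over a representative workload window.

```sql
-- Baseline cumulative I/O per database file (values reset when SQL Server restarts).
SELECT
    DB_NAME(vfs.database_id)             AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.num_of_bytes_read,
    vfs.num_of_bytes_written,
    vfs.io_stall_read_ms,
    vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```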

Tip #2: Look at Azure Managed Disks for production workloads

Microsoft recently released Managed Disks (https://azure.microsoft.com/en-us/services/managed-disks/), which eliminate the need to manually manage the details of the Azure I/O subsystem in terms of storage accounts, relax the bandwidth and IOPS limitations that storage accounts imposed, and provide a higher level of VM reliability. The absolute best detailed description of Managed Disks is here, by Igor Pagliai.

You can choose managed disks (MDs) whether you build a SQL Server VM from a base Windows image in Azure or from a pre-built SQL Server image. Whether you choose MDs or unmanaged disks (UMDs), make sure to use premium disks for production environments. Azure provides a 99.9% SLA for single-instance VMs, but only if they use premium disks.

Tip #3: Backup to URL

You are running your SQL Server locally on a VM in the Cloud. Make sure that you use the Backup-to-URL feature in SQL Server to store your backups in Azure. You can use standard storage for your backups and consider GRS to keep geo-redundant copies of your backups, relying on Azure’s behind-the-scenes async replication.
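
A minimal sketch of Backup to URL, assuming an access-key credential; the storage account, container, database, and credential names here are placeholders, and SAS-based credentials are also supported in newer versions of SQL Server.

```sql
-- Create a credential holding the storage account name and access key (placeholder values).
CREATE CREDENTIAL AzureBackupCredential
WITH IDENTITY = 'mystorageaccount',
     SECRET   = '<storage-account-access-key>';

-- Back up directly to Azure Blob storage.
BACKUP DATABASE MyDatabase
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/MyDatabase.bak'
WITH CREDENTIAL = 'AzureBackupCredential',
     COMPRESSION,
     STATS = 10;
```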

Tip #4: Understand the Azure storage subsystem

If you are not going to use managed disks (see above) and will stay on unmanaged disks (UMDs), then you will need to manage the storage accounts yourself. Here is a good SQL Server on Azure IaaS VMs perf guide with an Azure storage explainer that I like: http://www.faceofit.com/performance-tuning-best-practices-for-sql-on-azure-vm/.

Here are some tips on storage, disk, and LUN usage within SQL Server VMs in Azure. I indicate each tip’s relevance to managed disks (MD) and unmanaged disks (UMD):

  1. (UMD) Build two additional Storage Accounts (SAs) after provisioning the VM. Keep the original VHDs where they are and add new premium disks for data and log in separate SAs. You do this from the VM’s Disks blade. Make sure to do this one disk at a time so you can keep track of the LUN numbers and be certain which LUN presented to the Windows OS is the LOG LUN vs. the DATA LUN. SQL Server also allows you to map database files directly to blob storage using URLs, as opposed to presenting LUNs to the OS; see here. There are a number of limitations listed in that document, and since mapping raw disks to volumes and drives is very common and storage/database admins are comfortable with that approach, I typically do not recommend the URL mapping at this time.
  2. (UMD/MD) Keep TempDB mapped to local disks, which are SSDs on the VM. It is also a good practice to use the local SSDs for BPE: https://blogs.technet.microsoft.com/dataplatforminsider/2014/09/25/using-ssds-in-azure-vms-to-store-sql-server-tempdb-and-buffer-pool-extensions/.
  3. (UMD/MD) Determine whether the database needs to be optimized for reads or writes. For read workloads, set the disk caching policy to “ReadOnly”. For write-intensive workloads (and for the SQL log file disk and the tempdb disk, if you store tempdb on separate VHDs), set the disk caching policy to “None”. You set this on the Attach New Disk blade.
  4. (UMD/MD) Those are raw volumes, so we need to map them to drive letters in Server Manager in Windows Server on the VM. Create a folder for LOG and a folder for DATA in the new volumes, under the respective drive letter’s root directory.
  5. (UMD/MD) For TempDB, create a data (rows) file for each CPU core exposed to the VM; for example, if the VM has 8 cores, create 8 TempDB data files (see the T-SQL sketch after this list). http://sqlmag.com/storage/sql-server-storage-best-practices
  6. (UMD) Your I/O is now striped across multiple disks and storage accounts and is multi-threaded, maximizing the available resources and ensuring that you do not hit IOPS and TX/sec caps.
  7. (UMD/MD) Azure disks are limited to 1 TB, so if you need files and drives larger than 1 TB, create a virtual volume using Windows Storage Spaces to pool multiple VHDs together.
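
As referenced in item 5 above, here is a hedged T-SQL sketch for adding tempdb data files to match the core count. The file paths and sizes are placeholders; D: is typically the local SSD on Azure VM sizes, and the target folder must already exist.

```sql
-- Add tempdb data files until the count matches the VM's CPU core count.
-- Paths and sizes are placeholders; adjust to your local SSD layout.
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'D:\TempDb\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 512MB);

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev3, FILENAME = 'D:\TempDb\tempdb3.ndf', SIZE = 8GB, FILEGROWTH = 512MB);

-- ...repeat, e.g. through tempdev8 on an 8-core VM.
```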

Tip #5: SQL Server images from the Azure Gallery are easy and quick, but be careful

By far, the quickest and easiest way to start up a new SQL Server in Azure is to pick the SQL Server image that you want to spin up from the Azure gallery. Just be aware that you must either (a) pick bring-your-own-license to use your existing Microsoft SQL Server licenses or (b) accept pay-as-you-go pricing, which is baked into the monthly cost that you see on your bill for the amount of time the VM is running. Also, you cannot upgrade SQL Server in place on a VM created from the gallery; you will need to migrate your databases to the next version of SQL Server on a new image.

Tip #6: High Availability & Disaster Recovery

Best practice in production is to use Availability Groups. There is a pre-built image option available in the Azure gallery for SQL Server that builds out the networking and secondary nodes for an AG for you. It also creates the ILB/ELB that Azure requires to handle routing to the primary node. If you run single-instance VMs for your SQL Server, make sure that you choose Premium Storage to get the 99.9% SLA from Microsoft Azure.

That’s all that I’ll cover for now. These are some of the basics, updated as of April 2017. Since joining the Azure team @ MSFT 2 years ago, I’ve learned to be very careful about regularly updating these sorts of lists and indicating their dates of validity / spoil-by-date!

Best, Mark

Azure Marketplace Solutions

Yes, it’s been a while and I’m just now getting back into the swing of things, keeping MS-SQL Dude running, after my 2 year hiatus in the Open Source world.

To start, I’ll share with you a series of videos that I put together with the Microsoft Azure Partner team around the database, analytics, and big data offerings from other software vendors that are available as solution template images on the Azure Marketplace:

We first started at the fundamental data level with my old friends at DataStax and their Cassandra database product.

We then moved up the stack into batch analytics, streaming, and distributed execution engines on Hadoop with my friends at Hortonworks.

Lastly, we finished up with the end-user data visualization and BI layer from Looker.

All of these data solutions are available from the Azure Marketplace link above.

You can also find several of the presentations from this series at my Slideshare site here.


What 2011 Will Bring in Microsoft SQL Server Land

In this blog, on SearchSQLServer, on the SQL Magazine BI Blog, and on Twitter (@MSSQLDUDE), I am going to continue focusing on the core areas that those of you who follow Microsoft SQL Server should watch in 2011: Oracle integration, best practices, BI visualizations, and SQL 11 (Denali).

But what are the new investment areas at Microsoft, and what is new and exciting outside of the classic SQL Server 2005/2008/2011 topics? Let’s focus on these for 2011, because it is clear this is the direction Microsoft will take us:

1. Cloud Computing (SQL Azure & Windows Azure)

2. Database scale-out and load balancing with virtualization, SQL Azure and Parallel Data Warehouse

3. Silverlight BI hyper-cool visualizations on PCs, Slate and Windows Phone 7

4. How to take things to the next level with SQL 11

5. Cloud BI

What else interests you in the SQL world for 2011?