A common challenge when designing Big Data ETL and analytics solutions in the Cloud is finding effective ways to work with very large data files. Of course, Data Engineers who work primarily on-prem also face challenges processing very large files. The added challenge in hybrid and Cloud scenarios is that you also have to design architectures with bandwidth and utility billing constraints in mind.
Big Data platforms like Spark and Hadoop are a natural fit for large file processing because they leverage distributed file systems: data files are partitioned across worker nodes so that each node processes its data locally, dividing and conquering the workload.
ADF’s Mapping Data Flow feature is built upon Spark in the Cloud, so the fundamental steps in large file processing are also available to you as an ADF user. This means that you can use Data Flows to perform the very common requirement of splitting your large file across partitioned files so that you can process and move the file in pieces.
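Conceptually, the round-robin partitioning that Spark (and, by extension, ADF Data Flows) applies here is simple: row i goes to partition i mod n. A minimal Python sketch of that assignment rule, purely for illustration (this is not ADF or Spark code):

```python
def round_robin_partition(row_index: int, num_partitions: int) -> int:
    """Assign a row to a partition by cycling through partitions in order."""
    return row_index % num_partitions

# With 20 partitions, rows 0..19 land in partitions 0..19, then row 20 wraps
# back around to partition 0, keeping the partitions evenly sized.
assignments = [round_robin_partition(i, 20) for i in range(40)]
```

Because every partition receives a row in turn, the resulting part files differ in size by at most one row, which is why round robin is a good default when you have no natural partition key.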
To accomplish this in ADF Data Flows:
- Create a new Data Flow
- You are going to create a very simple Data Flow just to leverage file partitioning. There will not be any column or row transformations. Just a Source and a Sink that will take a large file and produce smaller part files.
- Add a Source file
- For this demo, I am using the Kaggle public loans data CSV with >800k records. You do not need to set any schema or projections because we are not working with data at the column level here.
- Make sure to turn on the Data Flow Debug switch at the top of the browser UI to warm up a cluster that will execute this data flow later
- Add a Sink folder
- For the Sink dataset, choose the type of output files you would like to produce. I’m starting with my large CSV and producing partitioned CSV files, so I’m using a Delimited Text dataset. Note that in the dataset file path I am typing a new folder name, “output/parts”. When this data flow executes, ADF will create the new “parts” folder inside my existing “output” folder in Blob storage.
- In the Sink, define the partitioning
- This is where you define how the partitioned files should be generated. I’m asking for 20 equal distributions using a simple Round Robin technique. I have also set the output file names using the “pattern” option: “loans[n].csv” will produce part files named loans1.csv, loans2.csv … loans20.csv.
- Notice I’ve also set “Clear the folder”. This tells ADF to wipe the contents of the destination folder clean before loading the new part files.
- Save your data flow and create a new pipeline
- Add an Execute Data Flow activity and select your new file split data flow
- Execute the pipeline using the pipeline debug button
- You must execute data flows from a pipeline in order to generate file output. Debugging from Data Flow does not write any data.
- After execution, you should now see 20 files that resulted from round-robin partitioning of your large source file. You’re done:
- In the output of your pipeline debug run, you’ll see the execution results of the data flow activity. Click on the eyeglasses icon to show the details of your data flow execution, where you’ll see statistics on the distribution of records across your partitioned files: