One of the most commonly used execution environments for Big Data transformations in ADF is Azure Data Lake Analytics (ADLA) using U-SQL scripts to transform data at scale. The new ADF Data Flow feature allows you to build data transformations at scale without needing to know any scripting or coding environments. You won’t need to learn and understand the details of the execution environment or big data cluster maintenance. Instead, you can remain within the ETL data integration environment to build both your workflow pipelines as well as your data transformation code-free inside the ADF visual UI.
For this example, I’ll rebuild the Tweet Analysis U-SQL demo here from Michael Rys that demonstrates some of the analytical capabilities of ADLA and U-SQL. To build this in ADF, I’ll use visual data flow capabilities against the MikeDoesBigDataTweets.csv file in this folder. I added a few additional new Tweets to the CSV to update the contents a bit using my own KromerBigData handle using http://tweetdownload.net/.
Step 1: Define the schema and count the number of Tweets grouped by Author
EXTRACT date string,
COUNT( * ) AS
GROUP BY author;
ORDER BY DESC
In ADF Data Flow, I’ll accomplish this with a Source Transform that defines the schema and then an Aggregate transform for the count. Notice that I put the Aggregate in a separate order than the U-SQL sample. There is no need to recreate scripts or programs in exact order when building a data flow. Also, notice that I grouped both by author & category. I took some liberties to change the scenario slightly. You can see that I am setting my Data Flow to “Debug” mode so that I can see the Data Preview as I build the flow.
Step 2: Categorize Tweets
U-SQL: Extract Mentions
@m = SELECT m.Substring(1) AS m
, "mention" AS category
FROM @m CROSS APPLY EXPLODE(mentions) AS t(m)
WHERE m != "@";
SELECT author, "author" AS category
@res = SELECT author.ToLowerInvariant() AS author
, COUNT( * ) AS tweetcount
GROUP BY author.ToLowerInvariant(), category;
ORDER BY tweetcount DESC
In ADF Data Flow, I’m going to categorize the Tweets by Mentions and Authors using an IIF statement based upon the first character of the Tweet Text. Any aggregation like Count, is done in the Aggregate Transform, which is why the sequencing is different when converting from a scripting environment.
Step 3: Rank Tweets
U-SQL: Rank Tweets with Window Functions
author, category, tweetcount
, PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY tweetcount ASC)
OVER (PARTITION BY category) AS median_tweetcount_perhandle_category
, PERCENT_RANK() OVER
(PARTITION BY category ORDER BY tweetcount ASC) AS relative_rank
, ROW_NUMBER() OVER
(PARTITION BY category ORDER BY tweetcount DESC) AS absolute_rank
WHERE tweetcount >= 50;
ORDER BY absolute_rank, category ASC
In ADF Data Flow, you’ll use a Window Transformation with the Rank function and partition the data using the Over and Range settings in the transformation. You can see in the debug data previews below that the Rank function has partitioned the Tweets both by the Category and the Author as I had set in my Aggregation previously.
Step 4: Serialize output results to Lake
In these U-SQL samples, the scripts are divided up by function, so the results are serialized back in the lake at the end of each step. In ADF Data Flow, to serialize your results, use a Sink. To keep the data in the lake, just use an ADLS or Azure Blob Store dataset to land the results in files. If you’d like to serialize your results in a database, you can use Azure SQL DB or Azure SQL DW datasets.