Azure Data Factory: Build U-SQL Tweet Analysis with ADF Data Flows

One of the most commonly used execution environments for Big Data transformations in ADF is Azure Data Lake Analytics (ADLA), using U-SQL scripts to transform data at scale. The new ADF Data Flow feature lets you build those same transformations at scale without needing to know a scripting or coding environment. You don’t need to learn the details of the execution environment or maintain a big data cluster. Instead, you stay inside the ETL data integration environment and build both your workflow pipelines and your data transformations, code-free, in the ADF visual UI.

For this example, I’ll rebuild Michael Rys’s Tweet Analysis U-SQL demo, which demonstrates some of the analytical capabilities of ADLA and U-SQL. To build it in ADF, I’ll use the visual data flow capabilities against the MikeDoesBigDataTweets.csv file in that folder. I added a few new Tweets from my own KromerBigData handle to the CSV, using http://tweetdownload.net/, to freshen up the contents a bit.

Step 1: Define the schema and count the number of Tweets grouped by Author

U-SQL: Simple Analysis


@t =
    EXTRACT date string,
            time string,
            author string,
            tweet string
    FROM "/Samples/Data/Tweets/MikeDoesBigDataTweets.csv"
    USING Extractors.Csv();

@res =
    SELECT author,
           COUNT( * ) AS tweetcount
    FROM @t
    GROUP BY author;

OUTPUT @res
TO "/Output/TweetAnalysis/MyTwitterAnalysis1.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();

In ADF Data Flow, I’ll accomplish this with a Source transformation that defines the schema, followed by an Aggregate transformation for the count. Notice that I put the Aggregate in a different order than the U-SQL sample: there is no need to recreate scripts or programs in the exact same order when building a data flow. Also notice that I grouped by both author and category; I took some liberties to change the scenario slightly. You can see that I’ve set my Data Flow to “Debug” mode so that I can see the Data Preview as I build the flow.

[Screenshots: Source transformation schema, Aggregate transformation settings, and the Data Preview in Debug mode]
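For reference, the visual designer generates a Data Flow Script behind the scenes. Here is a rough sketch of what this first step could look like in that script, mirroring the simple U-SQL version above; the stream names (TweetsCSV, CountByAuthor) are my own placeholders and the exact options ADF emits may differ.

// Define the CSV schema at the source
source(output(
        date as string,
        time as string,
        author as string,
        tweet as string
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> TweetsCSV

// Count the Tweets per author in an Aggregate transformation
TweetsCSV aggregate(groupBy(author),
    tweetcount = count()) ~> CountByAuthor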

Step 2: Categorize Tweets

U-SQL: Extract Mentions

// Extract the mention tokens (words starting with @) from each tweet
@m =
    SELECT new SQL.ARRAY<string>(
               tweet.Split(' ').Where(x => x.StartsWith("@"))) AS mentions
    FROM @t;

@m =
    SELECT m.Substring(1) AS author,
           "mention" AS category
    FROM @m CROSS APPLY EXPLODE(mentions) AS t(m)
    WHERE m != "@";

@t =
    SELECT author, "author" AS category
    FROM @t
    UNION ALL
    SELECT *
    FROM @m;

@res =
    SELECT author.ToLowerInvariant() AS author,
           category,
           COUNT( * ) AS tweetcount
    FROM @t
    GROUP BY author.ToLowerInvariant(), category;

OUTPUT @res
TO "/Output/TweetAnalysis/MyTwitterAnalysis2.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();

In ADF Data Flow, I’m going to categorize the Tweets as Mentions or Authors using an IIF statement based upon the first character of the Tweet text. Any aggregation, like Count, is done in the Aggregate transformation, which is why the sequencing is different when converting from a scripting environment.

[Screenshot: categorizing Tweets with an IIF expression]
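Continuing the placeholder stream names from Step 1, a rough sketch of the categorization and re-grouping in Data Flow Script terms might look like this; the expression below is my assumption about how the IIF is wired up, so defer to the screenshot above for the exact configuration.

// Derive a category from the first character of the tweet and lower-case the author
TweetsCSV derive(author = lower(author),
    category = iif(left(tweet, 1) == '@', 'mention', 'author')) ~> Categorize

// Re-aggregate the counts, now grouped by author and category
Categorize aggregate(groupBy(author, category),
    tweetcount = count()) ~> CountByAuthorCategory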

Step 3: Rank Tweets

U-SQL: Rank Tweets with Window Functions

@res =
SELECT DISTINCT
author, category, tweetcount
, PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY tweetcount ASC)
OVER (PARTITION BY category) AS median_tweetcount_perhandle_category
, PERCENT_RANK() OVER
(PARTITION BY category ORDER BY tweetcount ASC) AS relative_rank
, ROW_NUMBER() OVER
(PARTITION BY category ORDER BY tweetcount DESC) AS absolute_rank
FROM TweetAuthorsAndMentions
WHERE tweetcount >= 50;
OUTPUT @res
TO "/Output/TweetAnalysis/MyTwitterAnalysis6.csv"
ORDER BY absolute_rank, category ASC
USING Outputters.Csv();

 

In ADF Data Flow, you’ll use a Window transformation with the Rank function, partitioning the data with the Over and Range settings in the transformation. You can see in the debug data previews below that the Rank function has partitioned the Tweets by both Category and Author, as I set in my Aggregate transformation earlier.

[Screenshots: Window transformation settings and the debug data preview of the ranked Tweets]
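In Data Flow Script terms, a rough sketch of the Window transformation could look like the following; again the stream names are my placeholders and the Over/Sort settings shown are an assumption, so the screenshots above reflect the actual configuration.

// Rank Tweets within each category by their tweet count
CountByAuthorCategory window(over(category),
    desc(tweetcount, true),
    absolute_rank = rank()) ~> RankTweets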

Step 4: Serialize output results to Lake

In these U-SQL samples, the scripts are divided up by function, so the results are serialized back to the lake at the end of each step. In ADF Data Flow, you serialize your results with a Sink. To keep the data in the lake, just use an ADLS or Azure Blob Storage dataset to land the results in files. If you’d like to serialize your results to a database instead, you can use Azure SQL DB or Azure SQL DW datasets.

[Screenshot: Sink transformation landing the results in the lake]
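And a final sketch of the Sink in Data Flow Script form, under the same placeholder naming; the dataset behind the sink would be the ADLS or Blob Storage dataset mentioned above.

// Land the ranked results back in the lake via the sink's dataset
RankTweets sink(allowSchemaDrift: true,
    validateSchema: false) ~> TweetAnalysisSink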
