Azure Data Factory Data Flow Preview: Sabermetrics Example

Let’s walk through an end-to-end sample scenario that utilizes the new Azure Data Factory Data Flow feature. Some of the patterns that I’ll demonstrate here are very common in ETL data integration projects, which is the target use case for ADF Data Flow.

Data Flow in ADF is currently in limited preview and available only via whitelisting of your Azure Subscription. You can submit your request for access by filling out this form. All associated online documentation, videos and getting started guides are available here.

Bill James is known as the father of baseball statistics, and those of us who love baseball stats consider him a genius for his Sabermetrics methodology of analyzing the game through advanced statistics. So, rather than use a typical SQL Server or SSIS ETL data integration example built around books, bikes or toys, I thought I’d build a sample data flow using baseball statistics. The raw data that I used here is available for download from a number of different sites; I used the stats CSVs that cover up to the 2017 season from the Chadwick Baseball Bureau:

bb_excel1.png

I first created one data flow that uses the raw at-bats, walks, hits, HRs, etc. to calculate the basic stats: batting average (BA), on-base percentage (OBP), slugging (SLG) and total bases (TB). I called this Data Flow “Baseball 1 Basic Stats”. I’ll show you how I used Aggregate transforms to group and combine the players by player ID and year:

bb_basic1

There are two source files: the raw batting stats from the CSV screenshot above and a CSV that contains details about each player. In ADF Data Flow, sources all appear as vertical bars on the left-hand side of the design surface, and Sinks are your destination outputs, shown as vertical bars on the right-hand side of your flows. In this case, I join the two sources of data together using a composite key that I created with ADF Data Flow’s expression language, so that I can ensure that I am grouping players by name and year. You can see that key in the Join Condition below:

bb_join1.png
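If you want to build a similar key yourself, here is a minimal sketch of that kind of composite-key expression in the Data Flow expression language. I’m assuming Chadwick-style column names playerID and yearID here; your column names may differ:

concat(playerID, '_', toString(yearID))

Creating the key as a single string column (for example, in a Derived Column transform) makes it easy to reference the same value on both sides of the Join.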

My sink destination is an Azure SQL Database where I generate a new schema on the fly. I do not need to define that target table schema. Data Flow will create the SQL tables for me:

bb_ssms1.png

To create calculations in ADF Data Flow, you will use transformations like Aggregate and Derived Column, which are very commonly used in ETL patterns. Notice in the Data Flow diagram at the top of this blog post that there is a straight vertical line coming down from the first source. That is the “New Branch” transform, which tells Data Flow to duplicate the data stream. You will do this often in data transformation routines that use aggregates: when you aggregate data, the output schema contains only the fields used in your group-by clause and your aggregate functions. So I create a separate branch just for my aggregations and then join it back together with the rest of the fields that I wish to sink in the end. You can always see what metadata is coming into and out of each transform by looking at the Inspect tab, which also shows data lineage for how you’ve derived your column values:

bb_inspect1
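To make the aggregate branch concrete, here is a rough sketch of the kind of group-by and aggregate settings I’m describing, followed by basic-stat calculations in a downstream Derived Column. Treat this as illustrative rather than the exact expressions in my flow, and note that I’m assuming the raw doubles and triples columns have been renamed to X2B and X3B (the raw CSV headers 2B and 3B would otherwise need brace-escaping in the expression language):

Aggregate (group by playerID, yearID):
    sumAB = sum(AB)
    sumH  = sum(H)
    sumBB = sum(BB)
    sumHR = sum(HR)
    sum2B = sum(X2B)
    sum3B = sum(X3B)

Derived Column (after the Aggregate):
    TB  = sumH + sum2B + 2*sum3B + 3*sumHR
    BA  = toDouble(sumH) / toDouble(sumAB)
    SLG = toDouble(sumH + sum2B + 2*sum3B + 3*sumHR) / toDouble(sumAB)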

Also on the Inspect tab, you can view a preview of your data as you build your transformation. You do this by first clicking the Debug button at the top left of the Data Flow design window. Debug mode allows you to work in a data prep environment where you can view the data as it is transformed at each step, so you can validate your transformation routine. I set this to sample 50 rows out of the 100k+ rows in my dataset, and it does not require me to write the data into my Azure SQL DB:

bb_debug1.png

You will see two Joins in this design. The first Join, on the second row, combines the aggregated data back with the original source data so that we can sink all of our interesting columns. The second Join combines two different ADF Blob CSV datasets: the batting data with the player data. This is what the batting data looks like when the CSV is brought into ADF via an ADF dataset; I’ve adjusted the data types accordingly for my calculations:

bb_source1

One last transformation in this Data Flow that I want to point out is the Select transform on the top row. I’ve used the Select transform to rename the original source stream to “OrigSourceBatting” so that I know it is the native raw source data with no aggregates and no derived columns.

The final step in this Data Flow before the Sink is my “AddTimestamp” Derived Column transform. This is another common pattern in ETL data integration design. I land this as a column in my Sink dataset (Azure SQL DB) so that when the table is written, each row carries a timestamp indicating the last time it was updated.
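If you want to reproduce that pattern, the Derived Column expression can be as simple as a single call to the built-in timestamp function; the column name here is just my choice for illustration:

LastUpdated = currentTimestamp()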

I will switch off Debug mode on my Data Flow so that I no longer incur the cost of the Azure Databricks cluster (the engine that Data Flow uses for data transformation), and I’ll now build a very simple new ADF pipeline with a single Data Flow activity. In that activity, I will point to my Azure Databricks job cluster and execute the Data Flow from my pipeline so that I can load all of my data into the Azure SQL DB. While I was in Debug mode, I was only sampling 50 rows to test my expressions, and debug mode does not sink the data into my Sink location.

bb_pipe1

Once that pipeline execution is complete, I can go into the Monitoring view on the ADF UI and check to see the stats from that Data Flow run. I’ll be able to view data partitioning, data skew, the time spent in each stage of data transformation, data lineage along with other row & column statistics:

bb_mon1.png

Now that I have all of my basic stats calculated and landed in a database table, I can begin adding the advanced Sabermetric stats. For this example, let’s start with RunsCreated. This will be a very simple data flow because I’ve already created the aggregate buckets (player + year) with the basic stats. Now we’re just adding more calculations on a row-by-row basis. This means that I can use the same Azure SQL DB table that I created above as my Source and use a Derived Column transform to generate Runs Created:

bb_runscreated1
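For reference, the basic Bill James Runs Created formula is (H + BB) * TB / (AB + BB). Here is a sketch of the Derived Column expression, assuming the basic-stats table from the first flow exposes columns named H, BB, TB and AB (adjust to whatever your actual column names are):

RunsCreated = toDouble(H + BB) * toDouble(TB) / toDouble(AB + BB)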

Switch on Debug mode before entering the above Expression Builder so that you can see the results of the Runs Created formula as you write it. This way, you can ensure that you have the proper parentheses, operands, etc. Since I don’t have to aggregate across multiple rows again, I can use this non-blocking transformation, and all fields from the input schema will naturally flow through to my output.

bb_adv1

That’s all it takes for this demo, so now I can execute the advanced-stats data flow from a pipeline with a Data Flow activity, using Trigger Now:

bb_pipe2

I’m now landing my data in Azure SQL DW because I want to take advantage of SQL DW’s scaling and pausing features when I hook it up to my Power BI and reporting tools. Here’s what the final stats look like as I begin building my Sabermetrics dashboard:

bb_ssms2.png

That’s pretty much it for this demo of ADF’s new Data Flow feature (currently in limited preview). Here are a few good practices contained in this sample that you should follow when using Data Flow for your ETL and data prep projects:

  1. Land your data in Blob store for transformation. Data Flow can read directly from ADLS, Blob, Azure SQL DB, and Azure SQL DW. But working with files in Blob store was the easiest.
  2. Switch on Debug mode while designing your Data Flow to validate your designs and expressions. Don’t guess and then execute from your pipeline. You can always execute Data Flows from ADF pipelines in Pipeline Debug mode, but that requires you to switch back and forth between design screens, wasting time. Note that Debug mode is not yet enabled in Data Flow; that is lighting up very soon.
  3. Use New Branch when creating aggregates so that you can join your data back together after your aggregate. Make sure to attach a key field to your agg so that you have it available for the subsequent JOIN operation. Otherwise, sink your agg data on a separate path from the rest of your columns.
  4. Use a timestamp function in a Derived Column transform so you can tag your rows written with the latest timestamp.
  5. After you have completed testing your Data Flow, use the normal ADF mechanisms to schedule your pipeline with data flows and monitor it via the Monitoring UX.
  6. If you need to turn up the knob on parallelism and partitioning in Databricks, you can use the Optimize tab on the transformations. The skewness and kurtosis of the data partitioning are presented to you in the monitoring UI. There is no need for you to dive into Spark, Databricks or any other cluster operation; you can design and optimize your Data Flow entirely from the ADF UI.

Azure Data Factory: Delete Files From Azure Data Lake Store (ADLS)

In a previous post over at Kromer Big Data, I posted examples of deleting files from Azure Blob Storage and Table Storage as part of your ETL pipeline using Azure Data Factory (ADF). In those examples, I built a small, quick Logic App that used the Azure Storage APIs to delete data. In this post, I’m going to demonstrate how to remove files from Azure Data Lake Store (ADLS). For this demo, we’ll use ADF’s V2 service.

Deleting / removing files after they’ve been processed is a very common task in ETL Data Integration routines. Here’s how to do that for Azure Data Lake Store files in ADF:

adfweb

  1. Start by creating a new Data Factory in the Azure portal.
  2. Click “Author & Monitor” from your factory in order to launch the ADF UI.
  3. Create a new pipeline and add a single Web Activity.
  4. Switch to the “Settings” tab on the properties pane at the bottom of the pipeline builder UI.
  5. The URL in the Web Activity needs to be the URI pointing to the ADLS file you wish to delete (see the sample activity JSON after this list):
    • https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE
  6. The URL above (i.e. file names, folder names) can be parameterized. Click the “Add Dynamic Content” link when editing the URL text box.
  7. Set the Web Activity “Method” to “DELETE”.
  8. For authentication, you will need to have an access token. You can use this method to produce one:
  9. The access token returned will need to be captured and used in the Web Activity header as such:
    • Header = "Authorization"  Expression = "Bearer <ACCESS TOKEN>"
  10. You can now validate and test run your pipeline with the Web Activity. Click the “Debug” button to give it a try.
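Putting steps 5 through 9 together, here is a hedged sketch of what the Web Activity JSON might look like; the store name, path and token are placeholders, and you can parameterize the URL via Add Dynamic Content as noted above:

{
    "name": "DeleteADLSFile",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE",
        "method": "DELETE",
        "headers": {
            "Authorization": "Bearer <ACCESS TOKEN>"
        }
    }
}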

 

 

New Azure Data Factory End-to-End Lab

Just a few weeks ago, we announced the public preview of the new browser-based UI for Azure Data Factory. See Gaurav’s blog here detailing the release. To help you understand how to build complex data integration projects in the ADF Visual Design interface, we’ve partnered with Pragmatic Works, who have been long-time experts in the Microsoft data integration and ETL space, to create a new set of hands-on labs that you can now use to learn how to build those DI patterns using ADF V2.

That link points to a GitHub repo for the lab, where you will find data files and scripts in the Deployment folder. There are lab manual folders for each lab, along with overview presentations; see below for more details. You will also find a series of PowerShell and DB scripts as well as ARM templates that will generate the resource groups that the labs need in order for you to successfully build out an end-to-end scenario, with sample data that you can use for Power BI reports in the final Lab 9. Here is how the individual labs are divided:

  • Lab 1 – Setting up ADF and Resources. Start here to get all of the ARM resource groups and database backup files loaded properly.
  • Lab 2 – Lift and Shift of SSIS to Azure. Go to this lab if you have existing SSIS packages on-prem that you’d like to migrate directly to the cloud using the ADF SSIS-IR capability.
  • Lab 3 – Rebuilding an existing SSIS job as an ADF pipeline.
  • Lab 4 – Take the new ADF pipeline and enhance it with data from Cloud Sources.
  • Lab 5 – Modernize the DW pipeline by transforming Big Data with HDInsight.
  • Lab 6 – Go to this lab to learn how to create copy workflows in ADF into Azure SQL Data Warehouse.
  • Lab 7 – Build a trigger-based schedule for your new ADF pipeline.
  • Lab 8 – You’ve operationalized your pipeline based on a schedule. Now learn how to monitor and manage that DI process.
  • Lab 9 – Bringing it all Together

Control Flow Introduced in Azure Data Factory V2

One of the most exciting new features in Azure Data Factory will be very familiar to data developers, data engineers and ETLers who have hand-coded data integration jobs or designed SSIS packages …

“Control Flow”

In general computer science terms, Control Flow is described as “the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated” (Wikipedia https://en.wikipedia.org/wiki/Control_flow). In the Microsoft Data Integration world, tools like SQL Server Integration Services (SSIS) enable control flow via tasks and containers (https://docs.microsoft.com/en-us/sql/integration-services/control-flow/control-flow). Azure Data Factory’s V1 service was focused on building sequenced pipelines for Big Data Analytics based on time windows. The V2 (preview) version of ADF now includes workflow capabilities in pipelines that enable parameterization, conditional execution, loops and if conditions.

If-Then

In ADF pipelines, activities are equivalent to SSIS tasks. So an IF statement in ADF Control Flow will need to be written as an activity of type “IfCondition”:

"name": "MyIfCondition",
"type": "IfCondition",
"typeProperties": { "expression":
  { "value": "@bool(pipeline().parameters.routeSelection)",
    "type": "Expression" }

ifTrueActivities and ifFalseActivities are properties of that activity that define which activities execute for each branch:

"ifTrueActivities":
  [ { "name": "CopyFromBlobToBlob1", "type": "Copy" ... ]

"ifFalseActivities":
  [ { "name": "CopyFromBlobToBlob2", "type": "Copy" ... ]

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity

Loops

Do-Until: https://docs.microsoft.com/en-us/azure/data-factory/control-flow-until-activity

"name": "Adfv2QuickStartPipeline",
"properties":
  { "activities":
    [ { "type": "Until", "typeProperties":
        { "expression": { "value": "@equals('false', pipeline().parameters.repeat)",
           "type": "Expression" }, "timeout": "00:01:00", "activities":
              [ { "name": "CopyFromBlobToBlob", "type": "Copy" ...

For-Each: https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity

"name": "<MyForEachPipeline>",
    "properties":
        { "activities":
             [ { "name": "<MyForEachActivity>", "type": "ForEach",
                   "typeProperties": { "isSequential": "true",
                    "items": "@pipeline().parameters.mySinkDatasetFolderPath", "activities":
                       [ { "name": "MyCopyActivity", "type": "Copy", ...

 

Parameters

Parameters are very common in SSIS packages as well, making execution flexible so that packages can be generalized and re-used. The same capability is now available in ADF via parameters in control flow:

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions

You can build a parameterized pipeline, as in this example with parameterized input and output paths:

"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
 {
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
 {
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
 },
"type": "DatasetReference"
 }
 ],
"outputs": [
 {
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
 },
"type": "DatasetReference"

You can then execute that pipeline via the Invoke-AzureRmDataFactoryV2Pipeline PowerShell cmdlet: https://docs.microsoft.com/en-us/powershell/module/azurerm.datafactoryv2/Invoke-AzureRmDataFactoryV2Pipeline?view=azurermps-4.4.1. That cmdlet lets you either pass the parameters for inputPath and outputPath on the PowerShell command line or define the values in a parameter file, e.g. params.json:

{
"inputPath": "markkdhi/tmp",
"outputPath": "markkdhi/user"
}

 

 

Azure Data Factory V2 Quickstart Template

If you’d like to try a quick & easy way to get up & running with your first V2 pipeline of Azure Data Factory, check out the video that I just recorded walking you through how to use the Azure Quickstart Template.

To get to the template, you can search the Azure Quickstart template gallery, or here is a direct link to the ADF V2 template.

Deploying this template to your Azure subscription will build a very simple ADF pipeline using the V2 public preview service, which was announced at the end of September at Ignite.

The pipeline executes a copy activity that will copy a file in blob store. In the example video I linked to above, I use it to make a copy of a file in the same blob container but with a new filename.
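For context, the heart of that pipeline is a copy activity that reads from and writes to Blob storage. Here is a minimal sketch of what such an activity looks like in ADF V2 JSON; this is not the exact template contents, and the dataset names are placeholders:

{
    "name": "CopyFromBlobToBlob",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "BlobSink" }
    },
    "inputs": [ { "referenceName": "InputBlobDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "OutputBlobDataset", "type": "DatasetReference" } ]
}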

To learn more about the new features in ADF V2, see our updated public documentation home page: https://docs.microsoft.com/en-us/azure/data-factory/.

SQL Server on Azure VMs – Updated Tips April 2017

There are a lot of “best practice” and “performance guide” links available on the Internet to provide guidance on migrating SQL Server instances and databases from on-prem to Azure on VMs (aka Azure IaaS). In this post, I wanted to share a few findings from the field that I’ve encountered along the way recently working with customers on projects that migrate from on-prem SQL Server to Azure VMs. This is not a guide to migrating to Azure SQL Database and I do not cover all of the issues, migration steps or performance measures in an exhaustive manner. I will share a number of good links to help in that respect. But Azure as a platform is changing and evolving very rapidly and so it seemed like a good time to catch my breath and share a few tips here.

Tip #1: Take the time to analyze and evaluate current on-prem workloads prior to database lift and shift

Don’t just jump right into SSMS and use “Migrate to Azure”. Benchmark your current database and resource utilization. Measure IOPS, TX/s, CPU, RAM, network … all classic SQL Server constraints. This will help determine the proper class of VM to choose for your SQL Server instances. Azure VMs come in classes of resources, and you need to make sure that you have enough capacity. I generally recommend DS_v2 and G-series VMs. You can always scale up to a higher-numbered VM in the same series in Azure, but note that resource capacity limits are grouped together (RAM, IOPS, disks, network) in each series and they all scale up together when you upgrade your VM.

Tip #2: Look at Azure Managed Disks for production workloads

Microsoft recently released Managed Disks (https://azure.microsoft.com/en-us/services/managed-disks/) as a way to eliminate the need to manually manage the details of the Azure IO subsystem (storage accounts), relax the bandwidth and IOPS limitations that storage accounts imposed, and provide a higher level of VM reliability. The absolute best detailed description of Managed Disks is here, by Igor Pagliai.

You can choose MDs if you build a SQL Server VM from a base Windows image in Azure or from a pre-built SQL Server image. Whether you choose MDs or UMDs, make sure to use premium disks for production environments. Azure provides a 99.9% SLA with single-instance VMs, but they must be using premium disks.

Tip #3: Backup to URL

You are running your SQL Server locally on a VM in the Cloud. Make sure that you use the Backup-to-URL feature in SQL Server to store your backups in Azure. You can use standard storage for your backups and consider GRS to keep geo-redundant copies of your backups, relying on Azure’s behind-the-scenes async replication.

Tip #4: Understand the Azure storage subsystem

If you are not going to use Managed Disks (see above), i.e. you are using unmanaged disks (UMDs), then you will need to manage the storage accounts yourself. Here is a good SQL Server on Azure IaaS VMs perf guide with an Azure storage explainer that I like: http://www.faceofit.com/performance-tuning-best-practices-for-sql-on-azure-vm/.

Here are some tips below on storage, disk and LUN usage within SQL Server VMs in Azure. I indicate relevance to managed disks (MDs) and unmanaged disks (UMDs):

  1. (UMD) Build 2 additional Storage Accounts (SAs) after provisioning the VM. Keep the original VHDs where they are and add new premium disks for data and log in separate SAs. You do this from the VM’s Disks blade. Make sure to do this one disk at a time so that you can keep track of the LUN #s and be sure whether the disk being presented to the Windows OS is the LOG or the DATA LUN. SQL Server also allows you to map database files directly to blob storage using URLs, as opposed to presenting LUNs to the OS, see here. There are a number of limitations listed in that document, and since mapping raw disks to volumes and drives is very common and storage/database admins are comfortable with that approach, I typically do not recommend the URL approach at this time.
  2. (UMD/MD) Keep TempDB mapped to local disks, which are SSDs on the VM. It is also a good practice to use the local SSDs for BPE: https://blogs.technet.microsoft.com/dataplatforminsider/2014/09/25/using-ssds-in-azure-vms-to-store-sql-server-tempdb-and-buffer-pool-extensions/.
  3. (UMD/MD) We need to determine if the database needs optimization for read or write. For read workloads, set disk caching policy to “ReadOnly”. For write-intensive (and for SQL log files disk & tempdb disk, if you store tempdb on separate VHDs) set disk caching policy to “None”. This is on the Attach New Disk blade.
  4. (UMD/MD) Those are raw volumes, so we need to map them to drive letters in Server Manager in Windows Server on the VM. Create a folder for LOG and a folder for DATA in each of the new volumes under the respective drive letter’s home dir.
  5. (UMD/MD) For TempDB, create a data (rows) file for each CPU core exposed to the VM. For example, if the VM has 8 cores, generate 8 TempDB data files. http://sqlmag.com/storage/sql-server-storage-best-practices
  6. (UMD) Your I/O is now striped across multiple disks and SAs and multi-threaded, maximizing available resources and ensuring that you are not getting capped on IOPS and TX/sec.
  7. (UMD/MD) Azure disks are limited to 1 TB, so if you need files and drives larger than 1 TB, create a virtual volume using Windows Storage Spaces to pool multiple VHDs together.

Tip #5: SQL Server images from the Azure Gallery are easy and quick, but be careful

By far, the quickest and easiest way to start up a new SQL Server in Azure is to pick the SQL Server image that you want to spin up from the Azure gallery. Just be aware that you must either (a) pick bring-your-own-license to use your existing MSFT SQL Server licenses or (b) pick pay-as-you-go pricing, which is baked into the monthly cost that you see on your bill for the amount of time that the VM is running. Also, you cannot upgrade SQL Server on a VM created from the gallery; you will need to migrate your databases to the next version of SQL Server on a new image.

Tip #6: High Availability & Disaster Recovery

Best practice in production is to use Availability Groups. There is a pre-built image option available in the Azure gallery for SQL Server that builds out the networking and secondary nodes for an AG for you. It also creates the ILB/ELB that Azure requires to handle routing to primary nodes. If you run single-instance VMs for your SQL Server, make sure that you choose Premium Storage to get the 99.9% SLA from MSFT Azure.

That’s all that I’ll cover for now. These are some of the basics, updated as of April 2017. Since joining the Azure team @ MSFT 2 years ago, I’ve learned to be very careful about regularly updating these sorts of lists and indicating their dates of validity / spoil-by-date!

Best, Mark

How to Make Use of Azure Cognitive Services

Azure Cognitive Services, available from your Azure portal, allows you to set up accounts that enable the use of Microsoft’s pre-built algorithms and models for common advanced analytical functions such as Text Analytics, recommendation engines, and speech and vision recognition, as well as many more service-oriented APIs. These models have been trained and are ready for you to use and consume in your solutions, such as Power BI business intelligence solutions.

A very common use case that I’m finding in the Cloud Analytics space relates to a portion of the Text Analytics function: a number of solutions can take advantage of the Sentiment Analysis capability that Azure Cognitive Services provides. By sending a text string to the API service, you can leverage the existing Microsoft sentiment engine to determine whether the message that you sent is positive, negative or neutral, based on a score.
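For reference, the Text Analytics v2.0 sentiment call is a simple JSON POST and returns a score between 0 (negative) and 1 (positive). Here is a minimal sketch of the request and response shapes; the document id, text and score are made-up examples:

Request body:
{
    "documents": [
        { "language": "en", "id": "1", "text": "I love baseball stats!" }
    ]
}

Response body:
{
    "documents": [
        { "id": "1", "score": 0.96 }
    ],
    "errors": []
}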

The best way to demonstrate this is with Power BI. I am going to briefly touch on two of the most common social media sources to pull real-time data from and score the messages for sentiment analysis: Twitter & Facebook.

First, the easiest with Power BI (PBI) is Facebook. There is an out-of-the-box connector for Facebook included with PBI, so you can use Get Data to bring data directly into a Power BI model from Facebook very easily:

pbi4

Now, to hook into the Microsoft Sentiment Analysis, follow the method described very well on this PBI community page and from the associated PASS Virtual Chapter presentation on this very topic by Gil from DataChant: http://community.powerbi.com/t5/Community-Blog/Sentiment-Analysis-in-Power-BI/ba-p/55898. His full working sample is awesome and you can play around with it here: https://app.powerbi.com/view?r=eyJrIjoiMjU5NWM3MTMtMTg0NS00YzE2LTk0YTYtMmZkZWM4NjYwMmFjIiwidCI6IjIyNzNjNDFiLWI4ZDAtNDVhZi1iZWU2LWUwODQ5NmFlNjcxOCIsImMiOjN9.

To download and use a similar sample in the PBI Desktop, just download the PBIX from my Github. I recommend going through the detailed PBI Community link from Gil above; it rocks and gives you the background you need to customize this for yourself, with step-by-step instructions on how this was made.

For the demo above that I uploaded to Github, I included Gil’s Trump & Clinton Facebook election sentiment analysis data on Tab 2 of the report view on PBI. Tab 1 has Microsoft BI Facebook sentiment data where I used the Microsoft BI Facebook page as the data source. Each of those datasets is run through the Cognitive Services API for Sentiment Analysis, part of the Text Analytics suite.

To make it work for you, follow Gil’s instructions on getting an API key from Cognitive Services and plug in your API Key in the GetSentimentResults function in the Power Query / Edit query tool. You can use the free limited service for Sentiment Analysis or go to the Azure Portal and create a full-use Cognitive Services account. The code for calling out to the API and saving the sentiment score in the PBI model is in a Power Query function call in the PBIX and looks like this:

(Source as table) as any =>
let
    // Serialize the incoming table to JSON and wrap it in the "documents" envelope
    JsonRecords = Text.FromBinary(Json.FromValue(Source)),
    JsonRequest = "{""documents"": " & JsonRecords & "}",
    JsonContent = Text.ToBinary(JsonRequest, TextEncoding.Ascii),
    // POST the request to the Text Analytics v2.0 sentiment endpoint
    Response =
        Web.Contents("https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment",
        [
            Headers = [#"Ocp-Apim-Subscription-Key" = "your api key here",
                       #"Content-Type" = "application/json", Accept = "application/json"],
            Content = JsonContent
        ]),
    // Parse the JSON response
    JsonResponse = Json.Document(Response, 1252)
in
    JsonResponse

As you can tell from my sample Sentiment Facebook dashboard below, I did not put as much time into the report as Gil did. So take the PBIX and add your own visuals to make it better!

pbi

Now onto the 2nd example, using Twitter data. There is no out-of-the-box connector directly into PBI for live Twitter feeds, so I used the Solution Template for Twitter (Marketing use case) from the Power BI Solution Template site: https://bpsolutiontemplates.com/.

That site provides you with a “wizard” that walks through the set-up of an Azure SQL Database, your Cognitive Services API as I mentioned above (free or the Azure account) and the Twitter handles/keywords that you want to use as a filter. This creates a much more robust solution for grabbing the Twitter data, in that the solution template will spin up Logic Apps, an App Service and API connectors to Twitter and Sentiment Analysis on the back end, instead of calling them directly from Power BI as in the first example with Facebook above.

So, essentially, Power BI will just connect to the Azure SQL DB that you’ve chosen from the template, and that SQL DB will get populated by the Azure App that is built automatically for you. This separates the data from the business logic and from the presentation layer in PBI. The code to call Twitter and Cognitive Services is all in the “middle tier” in the App Service in Azure, which you can open directly from your Azure Portal. Modify the API calls, strings and sentiment bins in that code:

// Sentiment analysis - Cognitive APIs
string sentiment = await MakeSentimentRequest(tweet);
// Rescale the 0..1 sentiment score to -1..1
sentiment = (double.Parse(sentiment) * 2 - 1).ToString(CultureInfo.InvariantCulture);
// Bucket the rescaled score into 0.1-wide bins
string sentimentBin = (Math.Floor(double.Parse(sentiment) * 10) / 10).ToString(CultureInfo.InvariantCulture);
string sentimentPosNeg = String.Empty;
if (double.Parse(sentimentBin) > 0)
{
    sentimentPosNeg = "Positive";
}
else if (double.Parse(sentimentBin) < 0)
{
    sentimentPosNeg = "Negative";
}
else
{
    sentimentPosNeg = "Neutral";
}

The out-of-the-box reports from these Power BI templates are quite good with a social graph visualization and a number of report tabs for different pivots of the Twitter data and sentiment scores.

pbi2

pbi3

Azure Marketplace Solutions

Yes, it’s been a while and I’m just now getting back into the swing of things, keeping MS-SQL Dude running, after my 2 year hiatus in the Open Source world.

To get back into the swing of things, I’ll share with you all a series of videos that I put together with the Microsoft Azure Partner team around the database, analytics and big data offerings from other software vendors that are available as solution template images on the Azure Marketplace:

We first started at the fundamental data level with my old friends at DataStax and their Cassandra database product.

We then moved up the stack into batch analytics, streaming and distributed execution engines on Hadoop with my friends at Hortonworks.

Lastly, we finished up with the end-user data visualization and BI layer from Looker.

All of these data solutions are available from the Azure Marketplace link above.

You can also find several of the presentations from this series at my Slideshare site here.

 

Pentaho and HP Vertica – Big Data Analytics

Anyone going to the HP Vertica conference next week in Boston? (August 12, 2014)

If so, stop by and say Hi at the Pentaho booth in the expo center!

BTW, I uploaded a quick & short video that I threw together with no voiceover, but it shows you how easy and quick it is to build a dashboard in Pentaho 5.1 against a Big Data source like HP Vertica.

See it here.

HP Vertica is a world-class MPP analytical database and is a perfect match for Pentaho’s Analytics Suite. In the video you’ll see me throw together a quick dashboard by stacking Analyzer report components that query a Mondrian cube.

The business model is based on Vertica’s VMart data warehouse sample data, and the semantic model is generated automatically through Pentaho’s auto-modeler. I’m using our web-based Thin Modeler to modify the model a bit before publishing it for ad-hoc interactive analysis.

Hope to see you @ the HP Vertica conference in Boston next week!  Best, Mark