If you are in the Orlando Sanford area tomorrow night (Thursday October 24 @ 6;30 register here) then come join the Orlando PASS gang where I will lead a discussion on Big Data and Big Data Analytics in the real world. I am basing this discussion on the systems that we designed and took to market for Razorfish and will include demos of HDInsight, Hadoop, NoSQL databases, Big Data Analytics with Pentaho and SQL Server. Here is the Slideshare link to the presentation for the meeting.
Let’s say you have SQL Server databases for your data marts and warehouses in your organization, but you are looking for alternatives to SSAS cubes and SharePoint-based dashboards. If you are like me and have built many BI solutions in SSAS OLAP and PPS, then you’ll find a transition to OSS with Pentaho to be very easy and a natural fit for those BI scenarios. We’re going to take this SQL Server 2012 DW and SSAS cube and recreate it all in Pentaho:
I’m going to kick this series off with a very simple BI dashboard using the traditional SQL Server Adventure Works data warehouse data set and put a Pentaho ROLAP cube and dashboard on top of that data. What I think you’ll really be impressed with is how much easier it is than building cubes in Visual Studio and PPS dashboards.
Start by going to Pentaho.com to download an evaluation version of the Pentaho Business Analytics suite and run the installer. We’re going to stay completely in a thin client browser experience for this demo, so no need to open any IDE tools through this entire workflow.
I’m using 4.8.2 of the Pentaho suite for this series and so when I log into the portal, I will use the Analysis feature to point to the SQL Server database and auto-generate the OLAP cube as well as design the visualizations. This is also where I can put my reports and analysis views together for end-user dashboards:
When you click to create new Analysis content, all you have to do to recreate a cube in the Pentaho Mondrian OLAP engine is to point to your DW database source and the engine will auto generate the model for you. So to replace the SSAS cube I have above, just follow the flow of these next 3 steps as I point to SQL Server, select my facts and dimension tables and create the star joins. This is a very simple beginning model which we will use for the rest of this series to build upon with custom calculations and other features. In classic AdventureWorks mode, I will name InternetSales as my fact table and let’s keep things simple and straightforward for this first intro to Pentaho modeling and just join in the Date and Product dimension tables. Just like SSAS, Mondrian will treat facts as measures and dimension tables as dimensional hierarchies and aggregation levels from your data warehouse. Much, much more on this in the coming weeks …
That’s all you have to do. Now you will select the model which will show up in your list as the name that you typed in from the screenshot above which I called simply “AdventureWorks”. Selecting that model will drop you into the interactive Analyzer reporting tool where you will see a field list to make pivot tables and data visualizations which you can see below. Think of that last step as running through the SSAS cube wizard in Visual Studio with data source views and publishing the cube on the SSAS server. This can be done in a much smaller number of steps here in Pentaho.
Remember that this is using ROLAP for the cube engine (Pentaho’s Mondrian), so in the upcoming parts of the series, I’ll talk about optimizing and customizing the logical OLAP models so that these reports can perform well for you. Your final step is to put your reports together as an interactive dashboard. In upcoming parts of the series, we’ll make the dashboard fully interactive as well. For now, just pick a couple of reports that you’ve generated from the AdventureWorks model and drop them onto the Dashboard Designer surface:
That’s pretty much it. We’ve recreated the OLAP cubes and reports and put them into a dashboard all in a browser using Pentaho with the SQL Server database source, replacing SSAS and SharePoint PPS. I know it was quick & brief, so think of this first part as just a teaser. We’ll continue to build this out with more features so that you can have some guidance when trying out Pentaho’s BI Suite.
As always, just reach out to me with any questions or requests. Enjoy! Br, Mark
FYI, if you need to download the SQL Server sample databases above, you can go here on Microsoft’s Codeplex site to get Adventureworks.
Hi Everyone! My apologies for the gap in blogging … It’s been a long couple of months transition over to the Open Source Software world and relocating to Orlando for Pentaho. It’s been an amazing adventure so far and I’ve just started transitioning my work to the hybrid OSS / commercial world for this MSSQLDUDE blog, SQL PASS, local BI Meetups & the SQL Server Pro Magazine BI Blog.
I will post my Adventure Works on Pentaho OLAP blog post this evening to kick it off and if you are looking for the more purist Big Data Analytics and Pentaho coverage, check out my Big Data Analytics blog here.
See ya soon! Best, Mark
I’ve posted several times over the years here @MSSQLDUDE on the benefits of SQL Server Management Data Warehouse (MDW) for baselining your SQL Server performance and sending those measurements to a central server for monitoring. MDW makes use of the SQL Server Data Collector feature and is a feature you get out of the box. Here are some of my writings on MDW here @MSSQLDUDE: http://en.search.wordpress.com/?q=mdw&site=mssqldude.wordpress.com.
There is a debate among SQL Server DBAs about the worthiness of MDW and Data Collectors because they can appear to be a “heavy burden” on your server and provide overhead that you may not wish to occur on your production server. That being said, the way things work in our world and jobs in databases is a give and take with performance, monitoring, scalability and providing functionality to our business users & customers to allow them to do their jobs. Any monitoring and baselining product incur a small performance cost to your servers. There are a number of options out there in the market and many DBAs write their own scripts to store DMV data into tables and files to provide similar functionality.
What I wanted to point you all to is something that I think proves the value of Data Collectors (and MDW) through extending the functionality with custom data collectors.
I’ve been working on a CDC project recently in a large BI/DW scenario and one thing that you notice when working on CDC instead of longer-run SQL Server features like Replication, is that there is not much support in SSMS or other built-in SQL Server tools for CDC.
But what is really nice is this custom data collector for CDC that is included on MSDN from this terrific article onMonitoring the Change Data Capture Process. Scroll down to the Data Collector portion and the code to paste into SSMS is right there for you to create and start your data collector. This is a great way to monitor your CDC log processes.
BTW, you have to make an update to the XML in code from MSDN … Look for the SELECT @parameters code and replace the entire SELECT … XML statement with this:
SELECT @paramters = CONVERT(xml, N'<ns:TSQLQueryCollector xmlns:ns="DataCollectorType"> <Query> <Value>SELECT * FROM UserDB.sys.dm_cdc_log_scan_sessions</Value> <OutputTable>cdc_log_scan_data</OutputTable> </Query> </ns:TSQLQueryCollector>'); Enjoy! Best, Mark
The complicated world of telecommunications analytics continues to be a primary driver behind complex data analytics solutions and I find it mentioned time and time again in Big Data use cases and scenarios.
Those of us who have lived in this world for years will probably agree with me that we’ve been pioneers in “Big Data techniques” ever since we were asked to build CDR (call detail record) data warehouses. My first CDR solution was for customer service and marketing at AT&T in the 1990s. We used Oracle for the DW and hired PhD statisticians to build models for predictive analytics on top of that data.
The marketing department was able to utilize that data to better understand customer patterns of usage and make data-driven decisions about how to package subscriptions and products. The call center team used the analytics from our cubes for market basket and association algorithms that provided reps with the ability to cross-sell to customers, which was also used by sales for up-sell opportunities to corporate accounts.
Then there is also the mass amounts of streaming data coming from network equipment which was used by engineering and the NOC for troubleshooting, predicting outages and tuning the network. Rules for correlation, thresholds and root-cause were needed to make sense of the 1000s of events/sec and not overwhelm systems and dashboards.
Does that sound familiar to today’s “Big Data use cases”? It should. We used to call these techniques CEP (complex event processing) and VLDB (very large databases). Really, at the end of the day, what this meant was that our DBAs, architects and developers needed to think about scale and distributed architectures to account for the scale that we were dealing with.
Today, it is a nice evolution of these techniques to see Hadoop, sharded databases, NoSQL and in-database analytics providing packaged, easier ways to process and manage systems of TB & PB scale.
Essentially, what this means is that these techniques now become available to all IT departments and examples like the churn & customer analytics (the holy grail of telcos is churn management) solutions become better, faster with improved data sampling because of new, emerging Big Data technologies.
I found this story on the Internet by Harish Vadada from Telecom Cloud here. It talks about T-Mobile with databases like Oracle & SQL Server using Big Data technologies such as Hadoop, to improve the delivery of customer & churn analytics to drive both the bottom-line and top-line of their business. Very impressive and spot-on to what I am saying here in this post.
I started off my series on Hadoop on Windows with the new Windows distribution of Hadoop known as Microsoft HDInsight, by talking about installing the local version of Hadoop on Windows. There is also a public cloud version of Hadoop on Azure: http://www.hadooponazure.com.
Here in part 2, I’ll focus on moving data between SQL Server and HDFS using Sqoop.
In this demo, I’m going to move data between a very simple sample SQL Server 2012 database that I’ve created called “sqoop_test” with a single table called “customers”. You’ll see the table is very simple for this demo with just a customer ID and a customer name. What I’m going to do is to show you how the Microsoft & Hortonworks Hadoop distribution for Windows (HDInsights) includes Sqoop for moving data between SQL Server & Hadoop.
You can also move data between HDFS and SQL Server with the Linux distributions of Hadoop and Sqoop by using the Microsoft Sqoop adapter available for download here.
First, I’ll start with moving data from SQL Server to Hadoop. When you run this command, you will “import” data into Hadoop from SQL Server. Presumably, this would provide a way for you to perform distributed processing and analysis of your data via MapReduce once you’ve copied the data to HDFS:
sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password password –table customers -m 1
I have 1 record inserted into my customers table and the import command places that into my Hadoop cluster and I can view the data in a text file, which most things in Hadoop resolve to:
> hadoop fs -cat /user/mark/customers/part-m-00000
> 5,Bob Smith
My SQL Server table has 1 row (see below) so that row was imported into HDFS:
The more common action would likely move data into SQL Server from Hadoop and to do this, I will export from HDFS to SQL Server. I have a database schema for my data in Hadoop that I created with Hive that creates a table called Employees. I’m going to tranform those into Customer records in my SQL Server schema with Sqoop:
> sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password password -m 1 –table customers –export-dir /user/mark/data/employees3
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Those MapReduce jobs extract my data from HDFS and send it to SQL Server so that now when I query my SQL Server Customers table, I have my original Bob Smith record plus these 4 new records that I transferred from Hadoop:
Thank you all for coming out on a rainy, snowy, cold evening to join us for this month’s PSSUG meeting!
Here is a link to the slides that I used tonight during my presentation of Big Data with SQL Server and Hadoop demos: http://www.slideshare.net/kromerm/pssug-nov-2012-big-data-with-sql-server.