Pentaho and HP Vertica – Big Data Analytics

Anyone going to the HP Vertica conference next week in Boston? (August 12, 2014)

If so, stop by and say Hi at the Pentaho booth in the expo center!

BTW, I uploaded a quick & short video that I threw together with no voiceover, but shows you how easy & quick it is to make a Dashboard in Pentaho 5.1 against a Big Data source like HP Vertica.

See it here.

HP Vertica is a world-class MPP Analytical Database and is a perfect match for Pentaho’s Analytics Suite. In the video you’ll see me through together a quick Dashboard by stacking Analyzer report components that are querying a Mondrian cube.

The business model is based on Vertica’s VMart data warehouse sample data and the semantic models is generated automatically through Pentaho’s auto modeler. I’m using our Web-based Thin Modeler to modify the model a bit before publishing it for ad-hoc interactive analysis.

Hope to see you @ the HP Vertica conference in Boston next week!  Best, Mark

OPASS Discussions & Big Data in the Real World

If you are in the Orlando Sanford area tomorrow night (Thursday October 24 @ 6;30 register here) then come join the Orlando PASS gang where I will lead a discussion on Big Data and Big Data Analytics in the real world. I am basing this discussion on the systems that we designed and took to market for Razorfish and will include demos of HDInsight, Hadoop, NoSQL databases, Big Data Analytics with Pentaho and SQL Server. Here is the Slideshare link to the presentation for the meeting.

Big Data with SQL Server, part 2: Sqoop

I started off my series on Hadoop on Windows with the new Windows distribution of Hadoop known as Microsoft HDInsight, by talking about installing the local version of Hadoop on Windows. There is also a public cloud version of Hadoop on Azure:

Here in part 2, I’ll focus on moving data between SQL Server and HDFS using Sqoop.

In this demo, I’m going to move data between a very simple sample SQL Server 2012 database that I’ve created called “sqoop_test” with a single table called “customers”. You’ll see the table is very simple for this demo with just a customer ID and a customer name. What I’m going to do is to show you how the Microsoft & Hortonworks Hadoop distribution for Windows (HDInsights) includes Sqoop for moving data between SQL Server & Hadoop.

You can also move data between HDFS and SQL Server with the Linux distributions of Hadoop and Sqoop by using the Microsoft Sqoop adapter available for download here.

First, I’ll start with moving data from SQL Server to Hadoop. When you run this command, you will “import” data into Hadoop from SQL Server. Presumably, this would provide a way for you to perform distributed processing and analysis of your data via MapReduce once you’ve copied the data to HDFS:

sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password password –table customers -m 1

I have 1 record inserted into my customers table and the import command places that into my Hadoop cluster and I can view the data in a text file, which most things in Hadoop resolve to:

> hadoop fs -cat /user/mark/customers/part-m-00000

> 5,Bob Smith

My SQL Server table has 1 row (see below) so that row was imported into HDFS:

The more common action would likely move data into SQL Server from Hadoop and to do this, I will export from HDFS to SQL Server. I have a database schema for my data in Hadoop that I created with Hive that creates a table called Employees. I’m going to tranform those into Customer records in my SQL Server schema with Sqoop:

> sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password password -m 1 –table customers –export-dir /user/mark/data/employees3

12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.

Those MapReduce jobs extract my data from HDFS and send it to SQL Server so that now when I query my SQL Server Customers table, I have my original Bob Smith record plus these 4 new records that I transferred from Hadoop:

PSSUG November 2012 Presentation: Big Data with SQL Server

Thank you all for coming out on a rainy, snowy, cold evening to join us for this month’s PSSUG meeting!

Here is a link to the slides that I used tonight during my presentation of Big Data with SQL Server and Hadoop demos:

Br, Mark

Configure SQL Server for Big Data

Now that my new focus is 100% on Big Data Analytics, I thought I’d get back into blogging here @ MSSQLDUDE and start providing some insights and lessons-learned around managing “Big Data” in SQL Server.

The context for this example will be a SQL Server 2012 database that is loading flat files with millions of rows of Website cookie data and social media data for specific marketing campaigns. The files are generated from an Hadoop MapReduce job that parses those files and produces new flat files in structured format that SSIS can pick-up and store in SQL Server.

If you are going to work with data of this type in multiple TBs in a SQL Server DW schema, here are my top 10 notes & gotchas for you that are working well for me:

  1. Use SQL Server 2012 Enterprise Edition
  2. Use a covering columnstore index on your fact table
  3. Make sure to disable that index before your data loads and enable it when done. You have to do this, BTW, because the table becomes Read-Only when you enable columnstore
  4. Use the Tabular model in-memory semantic model cubes in SSAS to provide the analytical layer into the DW
  5. Avoid direct queries to the DW, if possible. If the data scientists and analysts require ad-hoc query rights to the event-level data, then use AlwaysOn to produce a read-only secondary to report from.
  6. Use SQL Server general DW guidance for disk configurations, including using multiple data files for TempDB and your DW database. Look for the SQL Server FastTrack whitepaper on I found that disk layout guidance to be very helpful in Big Data scenarios.
  7. Use PAGE compression on your DW tables
  8. Partition your fact table so that you can just switch-in new rows from a staging table, for easy archiving and performance gains
  9. We are getting better results with data loads, building cubes and occassional ad-hoc queries with MAXDOP=1 on the server. This is a setting that I found you need to try out on your own environment because results vary greatly from system to system.
  10. Make sure that you have a dev environment that is representative of prod, not a complete COPY! The data loads and data volumes are just too large to be successful in developing and unit testing. Sample the data and keep your dev environment small.