AdventureWorks DW on Pentaho, part 3: Geo Mapping AdventureWorks

In the previous parts of this series, I showed you how to build quick, auto-modeled OLAP cubes and reports on a SQL Server AdventureWorks data warehouse, as well as how to take your existing SSAS cubes on top of those SQL Server DWs and turn them into Mondrian cubes, which run natively inside the Pentaho Business Analytics suite.

Here in part 3, I am going to take the analytics a step further by using Geo Maps on top of the SQL Server AdventureWorks data warehouse, modeling the ROLAP schema manually instead of with the Pentaho auto-modeler. And since I've finally upgraded to the latest 5.0 version of the Pentaho Suite, you may notice a few UI differences and workflow changes in the process. To follow along, make sure you download the latest & greatest Pentaho 5.0 product here.

Instead of starting the analysis process from the Pentaho User Console or Analyzer, we will first build a model as an XML document using the Pentaho cube design tool, Pentaho Schema Workbench:

[Screenshot: Pentaho Schema Workbench]

You can download the complete sample XML schema that I built, which you can use with Pentaho Mondrian and Schema Workbench to connect to your SQL Server AdventureWorks data warehouse database and try this out on your own Pentaho Business Analytics suite. What I typically do when building a manual Mondrian schema is to start up the Pentaho Schema Workbench (PSW) client and build out the primary parts of the model, such as the measures, dimensions and calculations. Then I save the schema as an XML file and edit it with an XML editor such as Visual Studio, Eclipse or Notepad++.

The reason we are following this workflow for this example is due to the nature of the SQL Server sample data warehouse, Adventure Works (AW). AW makes use of a number of “snowflake” schema joins between dimension tables. The auto-modeler in Analyzer and the PUC in Pentaho supports only “star” schemas, so I write the model using PSW and manual XML to join the Customer dimension to the Geography dimension, and the Customer key back to the customer key in the fact table, which is FactInternetSales in this model. This lets Analyzer generate an MDX query that resolves the geography from the fact table and graphs it on the map using the hierarchy that I've defined below as Country, State, City, Postal Code:

<Dimension type="StandardDimension" visible="true" name="Geo">
<Hierarchy name="default" visible="true" hasAll="true" primaryKey="CustomerKey" primaryKeyTable="c">
<Join leftAlias="c" leftKey="GeographyKey" rightAlias="g" rightKey="GeographyKey">
<Table name="DimCustomer" schema="dbo" alias="c">
</Table>
<Table name="DimGeography" schema="dbo" alias="g">
</Table>
</Join>
<Level name="Country" visible="true" table="g" column="CountryRegionCode" uniqueMembers="false">
</Level>
<Level name="State" visible="true" table="g" column="StateProvinceName" uniqueMembers="false">
</Level>
<Level name="City" visible="true" table="g" column="City" uniqueMembers="false">
</Level>
<Level name="PostalCode" visible="true" table="g" column="PostalCode" uniqueMembers="false">
</Level>
</Hierarchy>
</Dimension>

. . .

<Cube name="SalesGeo" visible="true" cache="true" enabled="true">
<Table name="FactInternetSales" schema="dbo">
</Table>
<DimensionUsage source="Geo" name="Geo" visible="true" foreignKey="CustomerKey">
</DimensionUsage>
<Measure name="TotalSales" column="SalesAmount" datatype="Numeric" aggregator="sum" visible="true">
</Measure>
<Measure name="NumOfCusts" column="CustomerKey" aggregator="distinct-count" visible="true">
</Measure>
</Cube>
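
For reference, the rollup that Mondrian resolves through this snowflake join for the TotalSales and NumOfCusts measures is roughly equivalent to the T-SQL below against AdventureWorksDW. This is a simplified sketch for illustration only; the SQL that Mondrian actually generates will differ:

-- Simplified illustration of the Country/State rollup over the snowflake join
SELECT g.CountryRegionCode,
       g.StateProvinceName,
       SUM(f.SalesAmount)            AS TotalSales,
       COUNT(DISTINCT f.CustomerKey) AS NumOfCusts
FROM dbo.FactInternetSales AS f
JOIN dbo.DimCustomer       AS c ON c.CustomerKey  = f.CustomerKey
JOIN dbo.DimGeography      AS g ON g.GeographyKey = c.GeographyKey
GROUP BY g.CountryRegionCode, g.StateProvinceName
ORDER BY g.CountryRegionCode, g.StateProvinceName;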

Now that we have a functioning schema, publish it to the Pentaho BA Server, either from Schema Workbench under File >> Publish or through the Import mechanism in the Pentaho User Console. This makes the schema available both to the Mondrian ROLAP engine and to your report designers who are using Analyzer for OLAP queries and reports:

[Screenshot: publishing the schema to the Pentaho BA Server]

I'll select the AWGeo model, and now we're in the familiar Pentaho Analyzer experience where I can grab the measures that I built in the model from AdventureWorks: total sales and a distinct count of unique customers:

[Screenshot: Pentaho Analyzer with the TotalSales and NumOfCusts measures]

To put this into a Geo Map, I'll remove the lower levels from the Geography dimension and leave just the top-level Country level as the Geo dimension on the design surface. This way, we'll drill down from the highest level to the finest grain in the map. Select Geo Map from the visualization drop-down menu at the top-right of the Analyzer screen … Now inside the map, I can drill to detail based on my 2 measures, which define the circle size and color:

[Screenshots: Geo Map visualization with drill-down from Country level]

There ya have it … a SQL Server 2012 data warehouse serving data to the Pentaho ROLAP engine (Mondrian) with a defined schema that models the Geography dimension for Adventure Works customers, along with total sales and unique customer counts.

Monitor SQL Server CDC with a Custom Data Collector

I’ve posted several times over the years here @MSSQLDUDE on the benefits of the SQL Server Management Data Warehouse (MDW) for baselining your SQL Server performance and sending those measurements to a central server for monitoring. MDW makes use of the SQL Server Data Collector feature and comes out of the box. Here are some of my writings on MDW here @MSSQLDUDE: http://en.search.wordpress.com/?q=mdw&site=mssqldude.wordpress.com.

There is a debate among SQL Server DBAs about the worthiness of MDW and Data Collectors because they can appear to be a “heavy burden” on your server and add overhead that you may not want on a production server. That being said, our work in databases is a constant give and take between performance, monitoring, scalability and providing functionality that lets our business users & customers do their jobs. Any monitoring and baselining product incurs a small performance cost on your servers. There are a number of options out there in the market, and many DBAs write their own scripts to store DMV data in tables and files to provide similar functionality.

What I wanted to point you all to is something that I think proves the value of Data Collectors (and MDW) by extending the functionality with custom data collectors.

I’ve been working on a CDC project recently in a large BI/DW scenario, and one thing you notice when working with CDC instead of longer-established SQL Server features like Replication is that there is not much support for CDC in SSMS or the other built-in SQL Server tools.

But what is really nice is the custom data collector for CDC that is included on MSDN in this terrific article on Monitoring the Change Data Capture Process. Scroll down to the Data Collector portion and the code to paste into SSMS is right there for you to create and start your data collector. This is a great way to monitor your CDC log processes.

BTW, you have to make an update to the XML in the code from MSDN … Look for the SELECT @parameters code and replace the entire SELECT … XML statement with this:

-- Replace UserDB with the name of your CDC-enabled database
SELECT @parameters = CONVERT(xml, 
    N'<ns:TSQLQueryCollector xmlns:ns="DataCollectorType">
        <Query>
          <Value>SELECT * FROM UserDB.sys.dm_cdc_log_scan_sessions</Value>
          <OutputTable>cdc_log_scan_data</OutputTable>
        </Query>
      </ns:TSQLQueryCollector>');
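
For context, the create-and-start portion boils down to something like the sketch below, using the built-in Generic T-SQL Query collector type in msdb. This is my own simplified version, not the article's exact script; the collection set name, item name and schedule are placeholders, and the same UserDB caveat applies:

-- Rough sketch: create and start a custom collection set for the CDC query above
USE msdb;
GO
DECLARE @collection_set_id  int,
        @collection_item_id int,
        @collector_type_uid uniqueidentifier,
        @parameters         xml;

EXEC dbo.sp_syscollector_create_collection_set
     @name = N'CDC Performance Data',
     @collection_mode = 0,                               -- cached mode
     @days_until_expiration = 7,
     @schedule_name = N'CollectorSchedule_Every_5min',   -- one of the built-in collector schedules
     @collection_set_id = @collection_set_id OUTPUT;

-- The Generic T-SQL Query collector type will run the query and write to the output table
SELECT @collector_type_uid = collector_type_uid
FROM dbo.syscollector_collector_types
WHERE name = N'Generic T-SQL Query Collector Type';

SELECT @parameters = CONVERT(xml,
    N'<ns:TSQLQueryCollector xmlns:ns="DataCollectorType">
        <Query>
          <Value>SELECT * FROM UserDB.sys.dm_cdc_log_scan_sessions</Value>
          <OutputTable>cdc_log_scan_data</OutputTable>
        </Query>
      </ns:TSQLQueryCollector>');

EXEC dbo.sp_syscollector_create_collection_item
     @collection_set_id = @collection_set_id,
     @collector_type_uid = @collector_type_uid,
     @name = N'CDC log scan sessions',
     @frequency = 300,                                   -- collect every 5 minutes
     @parameters = @parameters,
     @collection_item_id = @collection_item_id OUTPUT;

EXEC dbo.sp_syscollector_start_collection_set
     @collection_set_id = @collection_set_id;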

Enjoy! Best, Mark

Configure SQL Server for Big Data

Now that my new focus is 100% on Big Data Analytics, I thought I’d get back into blogging here @ MSSQLDUDE and start providing some insights and lessons learned around managing “Big Data” in SQL Server.

The context for this example is a SQL Server 2012 database that is loading flat files with millions of rows of website cookie data and social media data for specific marketing campaigns. The files are generated from a Hadoop MapReduce job that parses the raw data and produces new flat files in a structured format that SSIS can pick up and store in SQL Server.

If you are going to work with data of this type at multi-TB scale in a SQL Server DW schema, here are my top 10 notes & gotchas that are working well for me:

  1. Use SQL Server 2012 Enterprise Edition
  2. Use a covering columnstore index on your fact table
  3. Make sure to disable that index before your data loads and rebuild it when done. You have to do this, BTW, because the table becomes read-only when you enable columnstore (see the T-SQL sketch after this list)
  4. Use the Tabular in-memory semantic model cubes in SSAS to provide the analytical layer into the DW
  5. Avoid direct queries to the DW, if possible. If the data scientists and analysts require ad-hoc query rights to the event-level data, then use AlwaysOn to produce a read-only secondary to report from.
  6. Use SQL Server general DW guidance for disk configurations, including using multiple data files for TempDB and your DW database. Look for the SQL Server Fast Track whitepaper on Microsoft.com. I found that disk layout guidance to be very helpful in Big Data scenarios.
  7. Use PAGE compression on your DW tables
  8. Partition your fact table so that you can switch in new rows from a staging table, for easy archiving and performance gains (also shown in the sketch below)
  9. We are getting better results with data loads, building cubes and occasional ad-hoc queries with MAXDOP=1 on the server. This is a setting you need to try out in your own environment because results vary greatly from system to system.
  10. Make sure that you have a dev environment that is representative of prod, not a complete COPY! The data loads and data volumes are just too large to develop and unit test against successfully. Sample the data and keep your dev environment small.
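
As promised in items 3 and 8, here is a minimal T-SQL sketch of that load pattern. The table, index and column names (FactWebEvents, FactWebEvents_Stage, NCCI_FactWebEvents, etc.) are hypothetical stand-ins for your own schema:

-- Item 3: take the nonclustered columnstore index offline for a direct load,
-- then rebuild it, which puts the fact table back into its read-only state
ALTER INDEX NCCI_FactWebEvents ON dbo.FactWebEvents DISABLE;
-- ... run the SSIS package / BULK INSERT against dbo.FactWebEvents here ...
ALTER INDEX NCCI_FactWebEvents ON dbo.FactWebEvents REBUILD;

-- Item 8: alternatively, bulk load a staging table, index it to match the
-- fact table (including the columnstore), then switch it into an empty
-- partition; the switch itself is a metadata-only operation
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactWebEvents_Stage
    ON dbo.FactWebEvents_Stage (EventDateKey, CampaignKey, EventCount);

ALTER TABLE dbo.FactWebEvents_Stage
    SWITCH TO dbo.FactWebEvents PARTITION 42;   -- 42 = the empty target partition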

It’s a SQL Server Weekend!

Tomorrow is SQL Saturday Philly in our Microsoft Technology Center in Malvern, PA!!

Here is the SlideShare link to my presentation on Microsoft Cloud BI

This is my blog post from today on SQL Server Pro Magazine BI Blog for Microsoft Cloud BI

Here is the SlideShare link to my presentation on Microsoft DW for SQL Server DBAs

Have a great SQL Server weekend all!!  Br, Mark

2011 End-of-year Microsoft BI Wrap-up

Welcome to the final week of 2011 everyone!

I have a couple of really quick loose ends to tie up here in the blogosphere for Microsoft Business Intelligence. Ever since Microsoft acquired ProClarity in 2006, the company has shown a real commitment throughout the organization to business intelligence. Microsoft’s investments and focus on the democratization of BI and BI for the masses have been a huge success, particularly with self-service for business decision makers and the decision to make BI accessible to Excel and business users.

So, that being said, there are just 3 topics that I want to close up on in my MSSQLDUDE blog this year:

  1. A few friends of mine in the Microsoft SQL Server & BI field here on the east coast of the U.S. have started a YouTube channel where we are recording our monthly lunch ‘n learn sessions, with a focus on BI & DW: http://www.youtube.com/user/MSFTSQLBIEASTEPG
  2. I have written about the new Enterprise Information Management (EIM) capabilities that SQL Server 2012 is exposing and expanding at blogs such as SearchSQLServer, MSSQLDUDE and the SQL Pro BI Blog. I can remember pushing hard for data quality, MDM and expanding those capabilities into ETL through SSIS back in a series of meetings in Redmond in 2007. That is all falling into place now in the SQL Server product and finally coming to market. I wanted to point you all to a nice series of videos that explain how to get started with the DQS product coming in SQL Server 2012 to provide data quality services to your BI solutions here.
  3. Lastly, I wrote a quick blog with an intro to the new Silverlight Power View BI reporting and data exploration tool in SQL Server 2012 here. Unfortunately, the image links do not seem to be working, so I’ve included the thumbnails of the screenshots of Power View and the SharePoint BISM data connection page below. Notice the awesome Silverlight-based data visualizations that you can now generate out-of-the-box with Power View, which is part of SQL Server Reporting Services in SQL Server 2012. You can try this now, today, with the SQL Server 2012 RC0 download.

Happy New Year!!

Microsoft Philly: Learn about SQL Server PDW Massive-Scale DW

On August 10 in our Microsoft Malvern office (45 Liberty Blvd, Malvern, PA) we have 2 sessions being sponsored by Microsoft partner BI Voyage. Both of these sessions are focused on learning about the Microsoft SQL Server Parallel Data Warehouse (PDW) for massive scale-out data warehouses.

You can join us for the 8:30 AM session, breakfast included, which goes until 12 PM. This session will be higher-level, with an overview and the business impact of PDW.

The evening session will begin @ 5:30 PM in conjunction with the Philly SQL Server User Group and will go until 8:30 PM. This will be a more technical deep-dive session.

If you would like more details, either send me questions in the comments on this post or email me (mkromer :at: microsoft.com).

Thanks, Mark

UPDATES: There is NO CHARGE to come to Malvern and join us for these PDW sessions. BI Voyage (http://www.bivoyage.com) has been kind enough to sponsor these sessions for Microsoft.

If you would like to join the evening session, here is the Philly SQL Server User Group EventBrite link to register (no cost).

SQL Server 2008 Support for 15k Partitions

The newest service packs for both SQL Server 2008 (SP2) and SQL Server 2008 R2 (SP1) now provide support for up to 15,000 partitions in SQL Server. You may find this very helpful if you need to partition very granularly in your data warehouse and were perhaps limited by the previous 999-partition limit. More information is available here, and you can even modify existing partitioned tables to accept the new limit. As a side note, I’ve run several data warehouse management and partitioning sessions for customers recently and have yet to find anyone feeling hampered by the 999-partition limit, but there are definitely cases out there with this requirement. Though not related to any updates in these service packs, you should note that the advances to SQL Server partitioning since 2008 include parallelism & partition-level lock escalation; when combined with SQL Server 2008’s ability to optimize star join queries with bitmap filters and good old partition elimination, you get better-performing queries (when using the partition key) in a data warehouse workload. This makes me think … it may be time for my own take on data warehouse lifecycle management using partition switching, data archiving and sliding window methods. All are very important to understand when you become a DBA responsible for data warehouses.
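
While I save the full write-up for that future post, the sliding-window pattern is essentially the sequence sketched below. The partition function/scheme (pfMonthly/psMonthly), table names and boundary dates are all made up for illustration, and the per-database opt-in step for the 15,000-partition limit on 2008 SP2 / 2008 R2 SP1 is from memory, so verify it in Books Online before relying on it:

-- Opt the database into more than 999 partitions on SQL Server 2008 SP2 / 2008 R2 SP1
-- (from memory; confirm the procedure name and parameters in Books Online)
EXEC sp_db_increased_partitions @db_name = N'MyDW', @increased_partitions = 'ON';

-- Sliding window on a hypothetical RANGE RIGHT monthly partition function:
-- 1. Switch the oldest partition out to an empty, identically structured archive table
ALTER TABLE dbo.FactSales
    SWITCH PARTITION 1 TO dbo.FactSales_Archive;

-- 2. Merge away the now-empty oldest boundary
ALTER PARTITION FUNCTION pfMonthly()
    MERGE RANGE ('2008-01-01');

-- 3. Add a new boundary at the leading edge for the next month of data
ALTER PARTITION SCHEME psMonthly NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfMonthly()
    SPLIT RANGE ('2012-01-01');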