Monitor SQL Server CDC with Customer Data Collector

I’ve posted several times over the years here @MSSQLDUDE on the benefits of SQL Server Management Data Warehouse (MDW) for baselining your SQL Server performance and sending those measurements to a central server for monitoring. MDW makes use of the SQL Server Data Collector feature and is a feature you get out of the box. Here are some of my writings on MDW here @MSSQLDUDE: http://en.search.wordpress.com/?q=mdw&site=mssqldude.wordpress.com.

There is a debate among SQL Server DBAs about the worthiness of MDW and Data Collectors because they can appear to be a “heavy burden” on your server and provide overhead that you may not wish to occur on your production server. That being said, the way things work in our world and jobs in databases is a give and take with performance, monitoring, scalability and providing functionality to our business users & customers to allow them to do their jobs. Any monitoring and baselining product incur a small performance cost to your servers. There are a number of options out there in the market and many DBAs write their own scripts to store DMV data into tables and files to provide similar functionality.

What I wanted to point you all to is something that I think proves the value of Data Collectors (and MDW) through extending the functionality with custom data collectors.

I’ve been working on a CDC project recently in a large BI/DW scenario and one thing that you notice when working on CDC instead of longer-run SQL Server features like Replication, is that there is not much support in SSMS or other built-in SQL Server tools for CDC.

But what is really nice is this custom data collector for CDC that is included on MSDN from this terrific article onMonitoring the Change Data Capture Process. Scroll down to the Data Collector portion and the code to paste into SSMS is right there for you to create and start your data collector. This is a great way to monitor your CDC log processes.

BTW, you have to make an update to the XML in code from MSDN … Look for the SELECT @parameters code and replace the entire SELECT … XML statement with this:

SELECT @paramters = CONVERT(xml, 
    N'<ns:TSQLQueryCollector xmlns:ns="DataCollectorType">
        <Query>
          <Value>SELECT * FROM UserDB.sys.dm_cdc_log_scan_sessions</Value>
          <OutputTable>cdc_log_scan_data</OutputTable>
        </Query>
      </ns:TSQLQueryCollector>');

Enjoy! Best, Mark

Big Data and the Telecom World

The complicated world of telecommunications analytics continues to be a primary driver behind complex data analytics solutions and I find it mentioned time and time again in Big Data use cases and scenarios.

Those of us who have lived in this world for years will probably agree with me that we’ve been pioneers in “Big Data techniques” ever since we were asked to build CDR (call detail record) data warehouses. My first CDR solution was for customer service and marketing at AT&T in the 1990s. We used Oracle for the DW and hired PhD statisticians to build models for predictive analytics on top of that data.

The marketing department was able to utilize that data to better understand customer patterns of usage and make data-driven decisions about how to package subscriptions and products. The call center team used the analytics from our cubes for market basket and association algorithms that provided reps with the ability to cross-sell to customers, which was also used by sales for up-sell opportunities to corporate accounts.

Then there is also the mass amounts of streaming data coming from network equipment which was used by engineering and the NOC for troubleshooting, predicting outages and tuning the network. Rules for correlation, thresholds and root-cause were needed to make sense of the 1000s of events/sec and not overwhelm systems and dashboards.

Does that sound familiar to today’s “Big Data use cases”? It should. We used to call these techniques CEP (complex event processing) and VLDB (very large databases). Really, at the end of the day, what this meant was that our DBAs, architects and developers needed to think about scale and distributed architectures to account for the scale that we were dealing with.

Today, it is a nice evolution of these techniques to see Hadoop, sharded databases, NoSQL and in-database analytics providing packaged, easier ways to process and manage systems of TB & PB scale.

Essentially, what this means is that these techniques now become available to all IT departments and examples like the churn & customer analytics (the holy grail of telcos is churn management) solutions become better, faster with improved data sampling because of new, emerging Big Data technologies.

I found this story on the Internet by Harish Vadada from Telecom Cloud here. It talks about T-Mobile with databases like Oracle & SQL Server using Big Data technologies such as Hadoop, to improve the delivery of customer & churn analytics to drive both the bottom-line and top-line of their business. Very impressive and spot-on to what I am saying here in this post.

Cheers! Mark

 

SQL Saturday #200 Philadelphia – Microsoft Malvern – June 1

June 1 is the date for the 2013 SQL Saturday Philly @ the Microsoft Malvern office again this year. Here is the home page for the event. You can register for FREE there and also submit a session.

The event is an entire Saturday filled with FREE training on SQL Server, business intelligence, data warehousing and Big Data. The only cost is an optional $10 for the catered lunch. Or bring your own lunch, your choice!

I will present an updated version of my Big Data material called “Microsoft Big Data in the Real World” and I’ll share the updated material here on MSSQLDUDE, MSDN and KromerBigData once I have that completed and checked in.

Thanks! Best, Mark

New MSDN Blog Home for MSSQLDUDE

Hey MSSQLDUDE fans! I’m migrating over to MSDN blogs now that I’m in a new group @ Microsoft. We focus on developers, which is why I chose MSDN over Technet for the home for the MSSQLDUDE blog. I’ll cross-post for awhile, but then I’ll use this MSDN home for all Microsoft-specific postings. Anything that falls outside that scope and is more personal opinion and observation, I’ll keep @ http://mssqldude.wordpress.com/.

It’s exciting to be back @ MSDN! I haven’t maintained an MSDN blog since DBCF & MEC in 2009! http://blogs.msdn.com/b/makrom/.

For those new to MSSQLDUDE, you can come here for a variety of SQL Server topics, tips & tricks and links. Mostly, I focus on BI, DW, Cloud (Azure) and Big Data with a few posts on DBA stylings. For the most part, I stick with latest & greatest SQL Server technologies. There are plenty of excellent blogs out there on the previous versions of SQL Server that can help nearly every SQL Server DBA or developer.

Thanks! Best, Mark

And now for a quick break from Big Data blogging (sort of) …

… My apologies for long-time readers who have felt inundated with my recent focus on Big Data here @ MSSQLDUDE!

Well, today, I want to send you over to Robert Davis’ always-excellent blog where he has an in-progress series called 31 Days of SQL Server Disaster Recovery.

Very, very important and great stuff in that series as I’ve started reading through it this week to see if there are things that I can pull out of the posts that I haven’t run into or tried before.

As was the case with Robert when he hosted brown bag training sessions for us @ MSFT, I have found some new things and good tidbits in there.

OK, now getting back to reality: even in the SQL Server world, database platforms (SQL Server included) are moving more & more toward schemaless-on-write, distributed processing, in-memory and columnar storage. These are all underpinnings of Big Data platforms, so I’m not going to let you go without bringing-up my new area focus again!

Big Data (probably this year) will become part-and-parcel of any data warehouse or analytics solution. This is because now that the business community, CIOs and vendors are on-board with these techniques, BI, DBA, data architect and DW pros cannot ignore the mass amounts of data that typically hang outside of your data warehouse as “exhaust data”.

So keep learning these new technologies and capabilities and incorporate it into your 2013 strategies and architectures. You won’t be able to set them aside any longer!

Best, Mark

Big Data with SQL Server, part 2: Sqoop

I started off my series on Hadoop on Windows with the new Windows distribution of Hadoop known as Microsoft HDInsight, by talking about installing the local version of Hadoop on Windows. There is also a public cloud version of Hadoop on Azure: http://www.hadooponazure.com.

Here in part 2, I’ll focus on moving data between SQL Server and HDFS using Sqoop.

In this demo, I’m going to move data between a very simple sample SQL Server 2012 database that I’ve created called “sqoop_test” with a single table called “customers”. You’ll see the table is very simple for this demo with just a customer ID and a customer name. What I’m going to do is to show you how the Microsoft & Hortonworks Hadoop distribution for Windows (HDInsights) includes Sqoop for moving data between SQL Server & Hadoop.

You can also move data between HDFS and SQL Server with the Linux distributions of Hadoop and Sqoop by using the Microsoft Sqoop adapter available for download here.

First, I’ll start with moving data from SQL Server to Hadoop. When you run this command, you will “import” data into Hadoop from SQL Server. Presumably, this would provide a way for you to perform distributed processing and analysis of your data via MapReduce once you’ve copied the data to HDFS:

sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password password –table customers -m 1

I have 1 record inserted into my customers table and the import command places that into my Hadoop cluster and I can view the data in a text file, which most things in Hadoop resolve to:

> hadoop fs -cat /user/mark/customers/part-m-00000

> 5,Bob Smith

My SQL Server table has 1 row (see below) so that row was imported into HDFS:

The more common action would likely move data into SQL Server from Hadoop and to do this, I will export from HDFS to SQL Server. I have a database schema for my data in Hadoop that I created with Hive that creates a table called Employees. I’m going to tranform those into Customer records in my SQL Server schema with Sqoop:

> sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password password -m 1 –table customers –export-dir /user/mark/data/employees3

12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.

Those MapReduce jobs extract my data from HDFS and send it to SQL Server so that now when I query my SQL Server Customers table, I have my original Bob Smith record plus these 4 new records that I transferred from Hadoop:

PSSUG November 2012 Presentation: Big Data with SQL Server

Thank you all for coming out on a rainy, snowy, cold evening to join us for this month’s PSSUG meeting!

Here is a link to the slides that I used tonight during my presentation of Big Data with SQL Server and Hadoop demos: http://www.slideshare.net/kromerm/pssug-nov-2012-big-data-with-sql-server.

Br, Mark

Hortonworks on Windows – Microsoft HDInsight & SQL Server – Part 1

I’m going to start a series here on using Microsoft’s Windows distribution of the Hadoop stack, which Microsoft has released in community preview here together with Hortonworks: http://www.microsoft.com/en-us/download/details.aspx?id=35397.

Currently, I am using Cloudera on Ubutnu and Amazon’s Elastic MapReduce for Hadoop & Hive jobs. I’ve been using Sqoop to import & export data between databases (SQL Server, HBase and Aster Data) and ETL jobs for data warehousing the aggregated data (SSIS) while leaving the detail data in persistent HDFS nodes. Our data scientists are analyzing data from all 3 of those sources: SQL Server, Aster Data and Hadoop through cubes, Excel, SQL interfaces and Hive. We are also using analytical tools: PowerPivot, SAS and Tableau.

That being said, and having spent 5 years previously @ Microsoft, I was very much anticipating getting the Windows distribution of Hadoop. I’ve only had 1 week to play around with it so far and I’ve decided to begin documenting my journey here in my blog. I’ll also talk about it so far, along with Aster, Tableau and Hadoop on Linux Nov 7 @ 6 PM in Microsoft’s Malvern office, my old stomping grounds: http://www.pssug.org.

As the group’s director, one of the reasons that I like having a Windows distribution of Hadoop is so that we are not locked into an OS and can leverage the broad skill sets that we have on staff & off shore and so that we don’t tie ourselves to hiring on specific backgrounds when we analyze potential employee experience.

When I began experimenting with the Microsoft Windows Hadoop distribution, I downloaded the preview file and then installed it from the Web Installer, which then created a series of Apache Hadoop services, including the most popular in the Hadoop stack that drives the entire framework: jobtracker, tasktracker, namenode and datanode. There are a number of others that you can read about from any good Hadoop tutorial.

The installer created a user “hadoop” and an IIS app pool and site for the Microsoft dashboard for Hadoop. Compared to what you see from Hortonworks and Cloudera, it is quite sparse at this point. But I don’t really make much use of the management consoles from Hadoop vendors at this point. As we expand our use of Hadoop, I’m sure we’ll use them more. Just as I am sure that Microsoft will expand their dashboards, management, etc. and maybe even integrate with System Center.

You’ll get the usual Hadoop namenode and MapReduce web pages to view system activity and a command-line interface to issue jobs, manage the HDFS file system, etc. I’ve been using the dashboard to issue jobs, run Hive queries and download the Excel Hive drive, which I LOVE. I’ll blog about Hive next, in part 2. In the meantime, enjoy the screenshots of the portal access into Hadoop from the dashboard below:

This is how you submit a MapReduce JAR file (Java) job:

Here is the Hive interface for submitting SQL-like (HiveQL) queries against Hadoop using Hive’s data warehouse metadata schemas:

Configure SQL Server for Big Data

Now that my new focus is 100% on Big Data Analytics, I thought I’d get back into blogging here @ MSSQLDUDE and start providing some insights and lessons-learned around managing “Big Data” in SQL Server.

The context for this example will be a SQL Server 2012 database that is loading flat files with millions of rows of Website cookie data and social media data for specific marketing campaigns. The files are generated from an Hadoop MapReduce job that parses those files and produces new flat files in structured format that SSIS can pick-up and store in SQL Server.

If you are going to work with data of this type in multiple TBs in a SQL Server DW schema, here are my top 10 notes & gotchas for you that are working well for me:

  1. Use SQL Server 2012 Enterprise Edition
  2. Use a covering columnstore index on your fact table
  3. Make sure to disable that index before your data loads and enable it when done. You have to do this, BTW, because the table becomes Read-Only when you enable columnstore
  4. Use the Tabular model in-memory semantic model cubes in SSAS to provide the analytical layer into the DW
  5. Avoid direct queries to the DW, if possible. If the data scientists and analysts require ad-hoc query rights to the event-level data, then use AlwaysOn to produce a read-only secondary to report from.
  6. Use SQL Server general DW guidance for disk configurations, including using multiple data files for TempDB and your DW database. Look for the SQL Server FastTrack whitepaper on Microsoft.com. I found that disk layout guidance to be very helpful in Big Data scenarios.
  7. Use PAGE compression on your DW tables
  8. Partition your fact table so that you can just switch-in new rows from a staging table, for easy archiving and performance gains
  9. We are getting better results with data loads, building cubes and occassional ad-hoc queries with MAXDOP=1 on the server. This is a setting that I found you need to try out on your own environment because results vary greatly from system to system.
  10. Make sure that you have a dev environment that is representative of prod, not a complete COPY! The data loads and data volumes are just too large to be successful in developing and unit testing. Sample the data and keep your dev environment small.

Change to tagline

You may have noticed that I changed my blog’s tagline from “A TSP’s life” to “The life of a data geek” … That’s because I’ve decided to hang-up my Microsoft Data Platform Technology Specialist role and leave Microsoft after 5 great years with MSFT.

I’m starting a new role where I get to focus on 3 things are my sweet spot and near & dear to my heart: business intelligence, product development and big data. I’ll post a wrap-up to the SQL TSP role @ Microsoft shortly. But suffice it to say that I prefer building things and focusing on analytics and customer solutions as opposed to the requirements that come along with the SQL TSP role which was about 75/25 weighted toward traditional DBA work.

That being said, I’ll still be very active in the SQL Server community, just as a customer once again. And this time I won’t be a competitor like I was when I was with Oracle!

Leaving Microsoft will allow me to give a different perspective on things without feeling constrained by product roadmap and internal Microsoft constraints. In fact, I’ll probably start adding back in more Oracle perspectives as well as other BI integration techniques, so please stay tuned!

That’s it for now … Give me another 1-2 weeks to get established in my new role and then I’ll get back into a regular cadence here on the MSSQLDUDE blog. Thanks for reading and for your patience!!

All the best, Mark