PSSUG November 2012 Presentation: Big Data with SQL Server

Thank you all for coming out on a rainy, snowy, cold evening to join us for this month’s PSSUG meeting!

Here is a link to the slides that I used tonight during my presentation of Big Data with SQL Server and Hadoop demos: http://www.slideshare.net/kromerm/pssug-nov-2012-big-data-with-sql-server.

Br, Mark

The History and Future of Data Quality in SQL Server

In 2008, when I first started really using SSIS 2008, I wrote-up the joy I was having using the Data Profiler task in SSIS at SearchSQLServer.com here. It was a good first step into built-in funcationality around data quality and data profiling in SQL Server. Still not on par with some of my favorite tools from past lives like Informatica & ETI. Prior to that toolbox task in SSIS, SQL Server ETL developers would perform data quality and profiling through 3rd party tools or use the data mining and fuzzy lookup tasks to roll your own. In fact, there is a popular MSDN entry online here that walks you through an entire example of roll-your-own, including data cleansing, which the current SSIS 2008 profile task does not do. It was written and reviewed by SQL Server heavy-weights like Donald Farmer and Stacia Misner. Definitely it is worth your time checking it out.

Then came the exciting news of Microsoft in 2008, acquiring Strature for MDM and Zoomix for Data Quality. Sounds like a perfect marriage. But these technologies are finding their way into the SQL Server family piecemeal. MDM became Master Data Services in SQL Server 2008 R2 and Zoomix is Data Quality Services, which will be part of SSIS in SQL 11, althought the current CTP 1 of Denali does not include these capability so you can’t try it out yet. Personally, I would like to see SSIS, MDS & DQS come together and I’m sure that those of us in the SQL Server community will find ways to do that on our own.

In the interim, I tried to find some good info for my readers on what Data Quality Services (DQS) will look like and do for you in SQL 11 and this is a paraphrased quote from a Microsoft program manager on what to expect out of DQS:

Data Quality Service (DQS) creates and maintains a Data Quality Knowledge Base (DQKB) that is used for performing various data quality operations, such as data cleansing and matching and profiling.

DQS will support three major functionalities:

  1. Knowledge Management – creating and maintaing a DQJB, including supporting “knowledge discovery” – an automated computer-assisted acquisition of knowledge from a data source sample. In this secnario you will also define a train a set of rules that will serve as the policy governing the matching process, as well as choose and integrate reference data from different parties into the DQKB domains.
  2. Data Quality Projects – enable correcting, standardizing and matching source data according to domain values, rules and reference data associated with a designated data quality knowledge base.
  3. Administration – monitoring the current and past DQ processes and defining the various parameters governing the overall DQ activities in your server.

All of the above will be available through our DQS client experience. We will also include a new SSIS correction component, that enables correcting the input data according to domain values, rules, and reference data associated with a designated data quality knowledge base.

Bring Oracle Data into SQL Server

This is part III of my continuing series on working with Oracle for SQL Server pros …

In this Oracle posting, I’ll focus on data integration as part of an ETL strategy where you will build a SQL Server data warehouse from Oracle data sources. This is a very common practice and allows you to take advantage of the awesome SQL Server data integration and ETL capabilities that you essentially get for free as they come with the SQL Server product by way of SSIS (SQL Server Integration Services).

What I’ll do is point out to you a few links to dive deeper, some gotchas, best practices and approaches that I have taken in the past that were successful. If you are not careful about your approach to integrating Oracle databases into SQL Server databases, you may find yourself quickly falling victim to degrading performance that fails to meet your most basic SLAs and data management needs.

The most obvious and easiest mechanism that I find to connect to Oracle data source from SSIS is to use the Microsoft OLE DB Provider to Oracle (see pic below). There are also many other providers besides Microsoft for Oracle. But with SQL Server, you get this OLEDB adapter from Microsoft out of the box. I’ve used it many times, it works just fine. There is a good paper on SQL CAT here which you can use to see some of the tests and timings recorded and published by the Microsoft SQL customer advisory team when connecting to Oracle from SQL Server, it’s very helpful for you as you set your service level agreements with your business users. A few things to watch out for include that you must know the Oracle database network name that is being broadcast from the TNSNAMES configuration on your Oracle database sources. If you do not have access to the Oracle servers, make sure that you’ve collected a read-only login to the database, the TNS NAME and the server host name (physical or virtual).

OLEDB, however, is not generally the fastest mechanism to use to access data from Oracle. That being said, not every use case where you utilize SQL Server either as a DW or for an application that needs data from Oracle, requires super-fast access to the Oracle data source. For instance, an ETL job that runs nightly may have a large window with which it can use to pull data from, not requiring lightening-speed data access to the Oracle sources.

However, if you need to access Oracle faster than OLEDB, SQL Server 2008 ships with a set of high-speed adapters for Oracle from Attunity that you can use with SQL Server. To use the Attunity-provided Oracle adapter with SSIS, go to this Microsoft download link here. Once you add the Attunity drivers, you can then select those from your SSIS data source or data target to take advantage of the improved data movement speeds.