The History and Future of Data Quality in SQL Server

In 2008, when I first started really using SSIS 2008, I wrote-up the joy I was having using the Data Profiler task in SSIS at here. It was a good first step into built-in funcationality around data quality and data profiling in SQL Server. Still not on par with some of my favorite tools from past lives like Informatica & ETI. Prior to that toolbox task in SSIS, SQL Server ETL developers would perform data quality and profiling through 3rd party tools or use the data mining and fuzzy lookup tasks to roll your own. In fact, there is a popular MSDN entry online here that walks you through an entire example of roll-your-own, including data cleansing, which the current SSIS 2008 profile task does not do. It was written and reviewed by SQL Server heavy-weights like Donald Farmer and Stacia Misner. Definitely it is worth your time checking it out.

Then came the exciting news of Microsoft in 2008, acquiring Strature for MDM and Zoomix for Data Quality. Sounds like a perfect marriage. But these technologies are finding their way into the SQL Server family piecemeal. MDM became Master Data Services in SQL Server 2008 R2 and Zoomix is Data Quality Services, which will be part of SSIS in SQL 11, althought the current CTP 1 of Denali does not include these capability so you can’t try it out yet. Personally, I would like to see SSIS, MDS & DQS come together and I’m sure that those of us in the SQL Server community will find ways to do that on our own.

In the interim, I tried to find some good info for my readers on what Data Quality Services (DQS) will look like and do for you in SQL 11 and this is a paraphrased quote from a Microsoft program manager on what to expect out of DQS:

Data Quality Service (DQS) creates and maintains a Data Quality Knowledge Base (DQKB) that is used for performing various data quality operations, such as data cleansing and matching and profiling.

DQS will support three major functionalities:

  1. Knowledge Management – creating and maintaing a DQJB, including supporting “knowledge discovery” – an automated computer-assisted acquisition of knowledge from a data source sample. In this secnario you will also define a train a set of rules that will serve as the policy governing the matching process, as well as choose and integrate reference data from different parties into the DQKB domains.
  2. Data Quality Projects – enable correcting, standardizing and matching source data according to domain values, rules and reference data associated with a designated data quality knowledge base.
  3. Administration – monitoring the current and past DQ processes and defining the various parameters governing the overall DQ activities in your server.

All of the above will be available through our DQS client experience. We will also include a new SSIS correction component, that enables correcting the input data according to domain values, rules, and reference data associated with a designated data quality knowledge base.

More Spring Cleaning: SSIS Data Profiler

So I will now continue to clean-up and work through my backlog of items from previous writings by introducing you to another of my favorite data warehouse-related improvements from SQL Server 2005 to SQL Server 2008: Data Profiler. As in my previous post, start with my deeper and lengthier treatment of this topic from SearchSQLServer here.

Now consider this: most data warehouse and BI projects need to start by finding, collecting, analyzing and then finally utilizing data from disparate systems, commonly transactional applications and databases, as the source for analsis. But how often do you find that data to be spot-on, clean, no duplicates, no confusing similarities and ALWAYS 100% just the way that the customer said the data would look?

Not very often.

Before SQL Server 2008 and the data profile task in SSIS, I used to work on projects where we would look to 3rd party tools as part of the data discovery and cleansing part of the process. Or we would write separate SSIS packages to cleanse data.

But now with Data Profiler as a core part of SSIS in SQL Server 2008, you can easily perform data quality checks for matches, patterns, dupes, etc.