Managing SQL Server Large Data Warehouses, Part 1(Compression)

As a few follow-ups from the session that Joe D’Antoni (@jdanton) and I delivered at the SQL Saturday Philly (https://mssqldude.wordpress.com/2012/06/08/its-a-sql-server-weekend/), I wanted to include some of the deeper-dive details and scripts that we didn’t get a chance to dig into during the session because we ran out of time.

I’ll probably fit this into 5 parts in this sequence: Part 1: Compression, Part 2: Table Partitioning, Part 3: Resource Governor, Part 4: Scale-Architectures, Part 5: Monitor and manage the BI stack. Thinking back on this and seeing how it’s going to take me around 5 parts to fit in all of the different DBA areas that I can think to touch on for large data warehouses, it was crazy of me to think that we could fit it into 75 minutes!

And this doesn’t include Joe’s portion focused on database storage best practices for SQL Server data warehouses! I will use the excuse that we threw the session together at the last minute because we had a speaker cancel, however …

But don’t discount the importance of accounting for a storage subsystem with data warehouses and how important it can be to ensure the throughput and IOPS that you need to read data off of the disks quickly. Database access and query patterns in large data warehouses (not just SQL Server) are providing business intelligence reports with large data sets returned typically in sequence and usually with aggregations. This is why Microsoft’s SQL Server Fast Track option focuses so heavily on contiguous, non-fragmented data layouts with minimal indexing. The complexities of ensuring unfragmented data layouts and forcing large data scans is also why companies like Microsoft, Oracle, IBM, etc. are now in the business of selling “appliances” for large data warehouses with their own dedicated storage.

So, Part 1 is going to focus on compression. In the live session, I showed the following query below from the sample orders database in Fast Track 3.0 on SQL Server 2012:

use [FastTrackDBLab2]

SET STATISTICS IO ON
SET STATISTICS TIME ON

DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SELECT

a.l_orderkey as OrderNum,
a.l_shipmode as Mode,
a.l_shipdate as ShipDate,
SUM (a.l_quantity) as Quantity,
SUM (a.l_extendedprice) as Price

FROM dbo.lineitem a

GROUP BY l_orderkey, a.l_shipmode, a.l_shipdate

ORDER BY a.l_shipdate

I cleared out the buffers and cache to make sure we are running from a clean slate each time. I made sure to include aggregations and I included a very large time period with which to run a query against our DW history, i.e. no WHERE clause in this case to make for a long-running query. And then I ordered by date. This is a common query pattern that you will see in DW workloads. And I’ve simplified this by not joining in dimension tables, which will almost always surely be the case in a production environment.

When I run this without any compression on the tables or indexes, I get this timing returned via the SET STATS on commands:

CPU time = 29924 ms,  elapsed time = 132847 ms.

My experience has been that page compression on fact tables in SQL Server data warehouses is going to make your queries faster. This is because the small hit that you get from the CPU compressing and uncompressing data will be greatly outweighed by smaller and less data being transmitted over the wire and the bus with compression. You enable page compression on a SQL Server 2012 table from SSMS from right-click on the table, Storage > Manage Compression.

Here are the results that we showed when we rerun that exact same query, no changes to the code, just with page table compression turned on:

CPU time = 25505 ms,  elapsed time = 53551 ms.

We’ve improved the query results time from 132 seconds down to 53 seconds just by doing that.

Next, we show the new 2012 columnstore index compression the fact table and rerun the query. As opposed to compressing the table storage column indexes are available from the Index object explorer node under the fact table. I created a columnstore index on just the fields from the fact table that were included in the query above. Now the timings that we get are:

CPU time = 9780 ms,  elapsed time = 16872 ms.

So now we’ve gone from 132 seconds for the query to complete with no compression all the way down to 16 seconds, more than an 8X improvement with no code changes to application or schema changes, just using table compression and columnstore indexes.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s