Now that my new focus is 100% on Big Data Analytics, I thought I’d get back into blogging here @ MSSQLDUDE and start providing some insights and lessons-learned around managing “Big Data” in SQL Server.
The context for this example will be a SQL Server 2012 database that is loading flat files with millions of rows of Website cookie data and social media data for specific marketing campaigns. The files are generated from a Hadoop MapReduce job that parses the raw data and produces new flat files in a structured format that SSIS can pick up and store in SQL Server.
If you are going to work with multi-terabyte data of this type in a SQL Server DW schema, here are my top 10 notes & gotchas that are working well for me:
- Use SQL Server 2012 Enterprise Edition
- Create a nonclustered columnstore index covering the columns of your fact table
- Make sure to disable that index before your data loads and rebuild it when the load completes. You have to do this, BTW, because in SQL Server 2012 the table becomes read-only once the columnstore index is in place
- Use in-memory Tabular model databases in SSAS to provide the analytical layer over the DW
- Avoid direct queries to the DW, if possible. If the data scientists and analysts require ad-hoc query rights to the event-level data, then use AlwaysOn Availability Groups to produce a readable secondary to report from.
- Follow general SQL Server DW guidance for disk configurations, including using multiple data files for TempDB and your DW database. Look for the SQL Server Fast Track whitepaper on Microsoft.com; I found that disk layout guidance to be very helpful in Big Data scenarios.
- Use PAGE compression on your DW tables
- Partition your fact table so that you can switch in new rows from a staging table, for easy archiving and performance gains
- We are getting better results with data loads, building cubes and occasional ad-hoc queries with MAXDOP=1 at the server level. This is a setting that you need to try out in your own environment, because results vary greatly from system to system.
- Make sure that you have a dev environment that is representative of prod, not a complete COPY! The data loads and data volumes are simply too large for efficient development and unit testing. Sample the data and keep your dev environment small.
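To illustrate the columnstore disable/rebuild pattern from the list above, here is a minimal T-SQL sketch. The table and index names (dbo.FactWebEvents, IX_FactWebEvents_CS) and the column list are hypothetical; substitute your own fact table schema:

```sql
-- Hypothetical fact table; include the columns your queries touch.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactWebEvents_CS
ON dbo.FactWebEvents (DateKey, CampaignKey, CookieId, EventCount);

-- Before the load: disable the index so the table is writable again.
ALTER INDEX IX_FactWebEvents_CS ON dbo.FactWebEvents DISABLE;

-- ... run the SSIS load here ...

-- After the load: rebuild to bring the columnstore back online
-- (the table returns to read-only for DML at this point).
ALTER INDEX IX_FactWebEvents_CS ON dbo.FactWebEvents REBUILD;
```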
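Turning on PAGE compression for an existing fact table is a one-liner; again, dbo.FactWebEvents is a placeholder name. You can estimate the savings first with sp_estimate_data_compression_savings before committing to the rebuild:

```sql
-- Rebuild the table (all partitions) with PAGE compression.
ALTER TABLE dbo.FactWebEvents
REBUILD WITH (DATA_COMPRESSION = PAGE);
```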
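The partition switch-in technique works roughly like the sketch below. The partition function, scheme, and table names are all assumptions for illustration; the staging table must live on the same filegroup, match the fact table's schema exactly, and carry a check constraint that bounds its rows to the target partition:

```sql
-- Monthly date-range partitioning (boundary values are examples).
CREATE PARTITION FUNCTION pfWebEventsDate (date)
AS RANGE RIGHT FOR VALUES ('20121001', '20121101', '20121201');

CREATE PARTITION SCHEME psWebEventsDate
AS PARTITION pfWebEventsDate ALL TO ([PRIMARY]);

-- Load dbo.Stage_WebEvents, then switch it into the matching
-- partition of the fact table -- a metadata-only operation.
ALTER TABLE dbo.Stage_WebEvents
SWITCH TO dbo.FactWebEvents
PARTITION $PARTITION.pfWebEventsDate('20121101');
```

Because the switch only updates metadata, millions of rows appear in the fact table near-instantly, and the same trick in reverse makes archiving old partitions cheap.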
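Setting MAXDOP=1 server-wide, as mentioned above, is done through sp_configure. As noted, test this in your own environment first; you can also override it per-query with an OPTION (MAXDOP n) hint instead of changing the server default:

```sql
-- 'max degree of parallelism' is an advanced option.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Limit every query to a single scheduler.
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
```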