SQL Azure: Scale-Out and Big Data

Perhaps the NoSQL / Big Data trend in high-performance computing, made popular by Hadoop and MapReduce, will end up being the “killer app,” or at least the “killer capability,” for cloud databases?

I find it to be an interesting thought because the currently available Microsoft cloud database, SQL Azure, is a complete SQL Server-based transactional database, with support for SQL Server tools, T-SQL, ODBC, referential integrity, etc. The current maximum size for a single stand-alone database is 50 GB.

But Microsoft has recently shown a lot of interest in providing support for scaled-out large database workloads, first with SQL Server Parallel Data Warehouse (PDW) and then with the recent announcement of PDW support for Hadoop. Scale-out applications built on traditional SQL Server have been around for some time. The typical mechanisms used to do this are based on partitioning or “sharding” the data to fan out the data and queries across SQL Server instances using MS-DTC, replication or Service Broker.

SQL Azure is coming out with a built-in capability to enable this style of “sharded” partitioning, called Database Federations. Here is a link to a terrific write-up of using these concepts in a Big Data application, written by Roger Jennings for Visual Studio Magazine.

Note that this capability is not yet available for SQL Azure, even as a CTP (beta), at the time that I am writing this blog post. I like the fact that these capabilities are being surfaced as built-in T-SQL commands. There will also be accompanying ADO.NET library changes, with APIs and methods to manipulate the distributed data and to query it appropriately.
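To give a flavor of what that looks like, here is a minimal sketch based on the Federations syntax Microsoft has shown publicly. Since the feature has not shipped yet, the syntax may well change, and the federation, table and distribution key names below are just my own examples.

-- Create a federation keyed on a BIGINT range (the "sharding" key)
CREATE FEDERATION Orders_Federation (cust_id BIGINT RANGE)
GO

-- Route the connection to the federation member holding a given key value
USE FEDERATION Orders_Federation (cust_id = 100) WITH RESET, FILTERING = OFF
GO

-- Federated tables declare which column maps to the distribution key
CREATE TABLE dbo.Orders
(
    OrderId    BIGINT NOT NULL,
    CustomerId BIGINT NOT NULL,
    OrderDate  DATETIME NOT NULL,
    PRIMARY KEY (OrderId, CustomerId)
)
FEDERATED ON (cust_id = CustomerId)
GO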

These are very interesting, exciting ways that SQL Azure can be used. Once I get access to the CTPs, I’ll start building out distributed apps using this capability and blog my results here for you. In the meantime, the article linked above gives you some code samples to start thinking about your Big Data architectures.

Identity Gap with SQL Server Denali Sequence Objects

With my DBA & developer lineage being in both Oracle PL/SQL as well as SQL Server T-SQL, I have been a big fan of Microsoft adding the Sequence object to SQL Server. You can find this new feature in the current public beta of SQL Server v-next, code named “Denali”. Download it here from Microsoft.com.

You can search through my MSSQLDUDE archives to find a few things that I’ve written about Sequences and how they can act as a good replacement for Identity columns. In those other postings, I touch on some of the things to think about and re-architect, since Sequence objects are detached from specific tables and do not have the IDENTITY and SCOPE_IDENTITY issues.
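To make that concrete, here is a minimal sketch of a sequence standing in for an identity column, based on the Denali CTP1 syntax; the object names are examples of my own.

-- Create the sequence as a standalone object, not tied to any table
CREATE SEQUENCE dbo.OrderSeq
    AS INT
    START WITH 1
    INCREMENT BY 1;
GO

-- Bind it to the ID column through a plain DEFAULT constraint
CREATE TABLE dbo.Orders
(
    OrderId   INT NOT NULL
        CONSTRAINT DF_Orders_OrderId DEFAULT (NEXT VALUE FOR dbo.OrderSeq),
    OrderDate DATETIME NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (OrderId)
);
GO

-- The default picks up the next number automatically...
INSERT INTO dbo.Orders (OrderDate) VALUES (GETDATE());

-- ...or, because the sequence is detached from the table, you can grab the key
-- value up front instead of reading it back with SCOPE_IDENTITY() afterwards.
DECLARE @id INT = NEXT VALUE FOR dbo.OrderSeq;
INSERT INTO dbo.Orders (OrderId, OrderDate) VALUES (@id, GETDATE());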

One thing that I tested, which seems to still exist with Sequence objects, at least in CTP1 of Denali, is the “identity gap” that currently occurs with SQL Server Identity columns when you insert into a table with a column that has the identity property and the insert fails. This condition is explained very well, very concisely and (what I like about it) in under 2 minutes at SQL Share!

I tried the same technique with a Sequence object in Denali and got the same thing (screenshot below). The sequence object is incremented by 1 and then the insert fails because of a unique constraint on the table. But because the sequence was incremented first, before the actual insert, I have a gap in my sequence. This is not necessarily a big problem, just something to watch out for, as you most likely are already aware from Identity columns.

This happens even if I move the ID field in the table definition above to the end of the table row definition.
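For reference, here is roughly what my repro looked like, reconstructed with my own example names rather than the exact script in the screenshot:

CREATE SEQUENCE dbo.CustSeq AS INT START WITH 1 INCREMENT BY 1;
GO

CREATE TABLE dbo.Customers
(
    CustId   INT NOT NULL DEFAULT (NEXT VALUE FOR dbo.CustSeq),
    CustName VARCHAR(50) NOT NULL,
    CONSTRAINT UQ_Customers_CustName UNIQUE (CustName)
);
GO

INSERT INTO dbo.Customers (CustName) VALUES ('Contoso');    -- succeeds, CustId = 1
INSERT INTO dbo.Customers (CustName) VALUES ('Contoso');    -- fails on the unique constraint,
                                                            -- but the sequence still advanced to 2
INSERT INTO dbo.Customers (CustName) VALUES ('Fabrikam');   -- succeeds with CustId = 3: the gap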

Think About Replacing Uniqueidentifier with Sequence in Denali

As I’ve finally been finding some time recently to play with CTP 1 of SQL Server Denali, I’ve focused a lot of my energy around the Sequence object. This is because I have an Oracle PL/SQL background and used Sequences quite a bit in the old days there.

When talking to SQL Server DBAs and developers about Sequences, the conversation often starts with “this is pretty cool” and treats it as something interesting on a forward-looking basis. I mean, in reality, who wants to go back into code and data models and make changes to tables, right? So is there a good case for using a Sequence to replace a table-based ID field like an Identity column or uniqueidentifier?

So far, I’m thinking that uniqueidentifier is the one that you’ll want to look at replacing. The 2 most common constructs in T-SQL for creating ID fields in tables are to use either a uniqueidentifier as the column data type or an INT with the identity property. The identity property in current SQL Server versions is probably the closest correlation to the new Denali sequence object. But the sequence object nicely encapsulates the IDs away from being tied to any single table or object. I describe this usage further here.

Certainly you can go back and replace your IDENTITY columns with sequence objects, because the INT data type is a natural fit for sequence numbers. However, removing the identity property is not easy. There is no ALTER statement to do it nice & easy for you. And even though the SSMS GUI supports removing the identity property from a table column specification, the T-SQL that is generated from that action actually creates a COPY of your table and data, drops the original table and re-creates a new table without the identity definition! That’s not fun and not good, especially if you have a very large table.
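If you do decide to take that hit, here is a hedged sketch of what the rebuild boils down to, using my own example table and constraint names: seed a sequence just past the existing maximum ID, copy the data into a new table whose ID column is a plain INT, and swap the tables.

-- START WITH has to be a constant, hence the dynamic SQL to pick up MAX(SaleId)
DECLARE @max INT = (SELECT ISNULL(MAX(SaleId), 0) FROM dbo.Sales);
DECLARE @ddl NVARCHAR(200) =
    N'CREATE SEQUENCE dbo.SalesSeq AS INT START WITH '
    + CAST(@max + 1 AS NVARCHAR(20)) + N' INCREMENT BY 1;';
EXEC (@ddl);
GO

CREATE TABLE dbo.Sales_new
(
    SaleId   INT NOT NULL
        CONSTRAINT DF_SalesNew_SaleId DEFAULT (NEXT VALUE FOR dbo.SalesSeq),
    SaleDate DATETIME NOT NULL,
    CONSTRAINT PK_Sales_new PRIMARY KEY (SaleId)
);
GO

-- No IDENTITY_INSERT needed: the new ID column is a plain INT
INSERT INTO dbo.Sales_new (SaleId, SaleDate)
SELECT SaleId, SaleDate
FROM dbo.Sales;

DROP TABLE dbo.Sales;
EXEC sp_rename 'dbo.Sales_new', 'Sales';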

Replacing the GUID field in a UNIQUEIDENTIFIER column is not easy, either. You will want to remove that field and replace it with an INT fed by a sequence. But this can only be done if you do not have that field as a foreign key in other tables. However, if you can replace it, do so, especially if you are using the GUID only because it has guaranteed uniqueness or because you are using it as a clustered index key.

Sequence will be a better option in those instances. The Sequence object has its own T-SQL to manage the numbers being fed into your INT field, which helps with uniqueness when moving data between instances, and the sequential INT value of a sequence is a MUCH better clustered key than a GUID. The control that you can have over the sequence, separate from the table definition, should help eliminate some of the limitations that may have led you to use the uniqueidentifier.
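Here is one possible approach, sketched with hypothetical table and constraint names and assuming the GUID is not referenced by any foreign keys: add a sequence-backed INT, move the clustered key onto it, and keep the GUID around as a nonclustered unique column if other code still needs it.

CREATE SEQUENCE dbo.DocSeq AS INT START WITH 1 INCREMENT BY 1;
GO
ALTER TABLE dbo.Documents ADD DocKey INT NULL;
GO
-- NEXT VALUE FOR in an UPDATE generates a distinct value for each row
UPDATE dbo.Documents
SET DocKey = NEXT VALUE FOR dbo.DocSeq;
GO
ALTER TABLE dbo.Documents ALTER COLUMN DocKey INT NOT NULL;
ALTER TABLE dbo.Documents ADD CONSTRAINT DF_Documents_DocKey
    DEFAULT (NEXT VALUE FOR dbo.DocSeq) FOR DocKey;
GO
-- Move the clustered index off the GUID and onto the narrow, ever-increasing INT
ALTER TABLE dbo.Documents DROP CONSTRAINT PK_Documents;
ALTER TABLE dbo.Documents ADD CONSTRAINT UQ_Documents_DocGuid UNIQUE NONCLUSTERED (DocGuid);
ALTER TABLE dbo.Documents ADD CONSTRAINT PK_Documents_DocKey PRIMARY KEY CLUSTERED (DocKey);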

Avoiding Cursors

A topic that I have run into a lot over the years, particularly when working with Oracle PL/SQL developers and when converting Oracle applications to T-SQL for SQL Server, is cursors. I find that Oracle developers lean toward using cursors often, from writing Pro*C and stored procedures. When you move to SQL Server programming, though, you’ll find much better performance in most cases by moving to set-based operations and being very careful when utilizing cursors, making sure that they are optimized using best practices.

I was reading a very good series on SQL Server Central this week by R Barry Young and am linking you to it here. If you absolutely, positively need to work with cursors, perhaps to fetch rows individually and perform unique operations row by row, then this article will also help guide you toward which areas of cursor optimization to look for.
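As a small illustration of the difference (my own example, not taken from the articles above), here is the same price adjustment written first with a LOCAL FAST_FORWARD cursor, one of the cheapest cursor options if you truly need row-by-row processing, and then as a single set-based statement:

DECLARE @ProductId INT, @Price MONEY;

DECLARE price_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT ProductId, Price
    FROM dbo.Products
    WHERE Discontinued = 0;

OPEN price_cur;
FETCH NEXT FROM price_cur INTO @ProductId, @Price;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- row-by-row work; imagine something more involved than a simple UPDATE here
    UPDATE dbo.Products SET Price = @Price * 1.10 WHERE ProductId = @ProductId;
    FETCH NEXT FROM price_cur INTO @ProductId, @Price;
END
CLOSE price_cur;
DEALLOCATE price_cur;

-- The set-based equivalent does the same work in one statement and is usually much faster
UPDATE dbo.Products
SET Price = Price * 1.10
WHERE Discontinued = 0;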

Of BLOBs and FILESTREAM

With the FILESTREAM attribute now available to SQL Server T-SQL developers as of SQL Server 2008, you are faced with a question that I hear from Oracle PL/SQL developers as well: when to store a large binary object (BLOB) directly in the database and when to point to the file on the server’s file system. With FILESTREAM in SQL Server 2008, you can use a varbinary(max) column and SQL Server will use the file system to store the data. There are a few caveats and limitations to be aware of, but this does make access of BLOBs very fast from SQL Server. I found the TechNet entry on FILESTREAM to be most useful for those points.
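As a quick orientation, here is a hedged sketch of the pieces involved; the database, filegroup, path and table names are my own examples, and the FILESTREAM feature also has to be enabled at the instance level in SQL Server Configuration Manager first.

-- Allow T-SQL (and Win32 streaming) access to FILESTREAM data
EXEC sp_configure 'filestream access level', 2;
RECONFIGURE;
GO

-- The database needs a filegroup that CONTAINS FILESTREAM, backed by a folder
ALTER DATABASE DocumentDB ADD FILEGROUP DocFSGroup CONTAINS FILESTREAM;
ALTER DATABASE DocumentDB ADD FILE
    (NAME = 'DocFSData', FILENAME = 'C:\Data\DocFSData')
    TO FILEGROUP DocFSGroup;
GO

USE DocumentDB;
GO

-- FILESTREAM tables need a ROWGUIDCOL uniqueidentifier alongside the varbinary(max) column
CREATE TABLE dbo.DocStore
(
    DocGuid UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    DocName NVARCHAR(255) NOT NULL,
    DocData VARBINARY(MAX) FILESTREAM NULL
);
GO

-- Inserts still go through normal T-SQL; SQL Server puts the bytes on the file system
INSERT INTO dbo.DocStore (DocName, DocData)
VALUES (N'readme.txt', CAST('Hello, FILESTREAM' AS VARBINARY(MAX)));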

This concept is similar to Oracle’s BFILE, and a rule of thumb is that files over 1 MB should utilize FILESTREAM or a file pointer instead of storing the object directly in SQL Server. I liked Don Schlichting’s write-up on this at Database Journal, so click on that link and check it out.