What Makes Your Data Warehouse a “Big Data Warehouse”?

I’ve been closely following how the classic database and data warehouse products have been marketed over the past 2 years. Now that Big Data is top of mind for most CIOs in corporations around the globe, traditional data vendors like IBM, Oracle, Teradata and Microsoft are referring to their platforms as “Big Data” or “Big Data Warehouses”.

I guess, in the final analysis, this is really an attempt by data vendors to shift perceptions and steer CIO thinking about Big Data away from Apache Hadoop, Cloudera and Hortonworks and toward their own platforms. Certainly, some of the changes taking place in those traditional data warehouse platforms (MPP, in-memory, columnstore) are important for workloads that are classic “Big Data” use cases: clickstream analysis, big data analytics, log analytics, risk modeling … And most of those vendors will even tack on a version of Hadoop with their databases!

But this is not necessarily breaking new ground or an inflection point in terms of technologies. Teradata pioneered MPP decades ago. Oracle led the way with smart caching and proved (once again) that the infamous bottleneck in databases is I/O. Columnar databases like Vertica proved their worth in this space, which led Microsoft and Oracle to adopt those technologies, while Aster Data led with MapReduce-style distributed UDFs and analytics, and Teradata simply bought the company outright.

In other words, the titans of the data market finally felt enough pressure from their core target audiences, as Hadoop came out of the shadows and out of Silicon Valley to threaten their data warehouse market share, that you will now hear these sorts of messages from traditional data warehouse vendors:

Oracle: http://www.oracle.com/us/technologies/big-data/index.html. Oracle lists different products for dealing with different “Big Data” problems: acquire, organize and analyze. The product page lists the Oracle Big Data Appliance, Exadata and Advanced Analytics as just a few products for those traditional data warehouse problems. Yikes.

Teradata: In the world of traditional DWs, Teradata is the Godfather and pioneered many of the concepts that we are talking about today for Big Data Analytics and Big Data DWs. But Aster Data is still a separate technology and technology group under Teradata, and sometimes Teradata steps on its own messaging by forcing its EDW database products into the same “Big Data” space as Aster Data: http://www.prnewswire.com/news-releases/latest-teradata-database-release-supports-big-data-and-the-convergence-of-advanced-analytics-105674593.html.

But the fact remains that “Hadoop” is still seen as synonymous with “Big Data”, and the traditional DW platforms have been used in many of those same scenarios for decades. Hadoop has been seen as an alternative means to provide Big Data Analytics at a lower cost at scale. Just adding Hadoop to an Oracle Exadata installation, for example, doesn’t solve that problem for customers outside of the original NoSQL and Hadoop community: Yahoo, Google, Amazon, etc.

So what are your criteria for a database or data warehouse to qualify as a “Big Data Warehouse”? Here are a few that I use:

  1. MPP scale-out nodes
  2. Column-oriented compression and data stores
  3. Distributed programming framework (e.g. MapReduce)
  4. In-memory options
  5. Built-in analytics
  6. Parallel, fast data-loading options
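
For readers who haven’t worked with the MapReduce model named in criterion 3, here is a minimal single-process sketch in Python, applied to the clickstream use case mentioned earlier. It is only an illustration under assumptions of my own: the record format ("user_id,page") and the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and are not any vendor’s or Hadoop’s API. A real framework would run the map and reduce steps in parallel across the scale-out nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (page, 1) pair for one clickstream record shaped like 'user_id,page'."""
    _user, page = record.split(",")
    yield page, 1


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence in one process (no parallelism here)."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample data: three page views by two users.
clicks = ["u1,/home", "u2,/home", "u1,/pricing"]
page_counts = run_job(clicks)
```

The point of the split is that each map call and each reduce call depends only on its own input, which is what lets a framework spread the work over many nodes.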

To me, the “pure-play” Big Data Analytics “warehouses” are Vertica (HP), Greenplum (EMC) and Aster (Teradata). But the next generation of platforms, which will include distributed access and programming that improves on today’s MapReduce and Hive, will be Microsoft with PDW & Polybase, Teradata’s appliance with Aster & SQL-H and, if you like Open Source Software, Cloudera’s Impala.