I’m going to start a series here on using Microsoft’s Windows distribution of the Hadoop stack, which Microsoft has released as a community preview together with Hortonworks: http://www.microsoft.com/en-us/download/details.aspx?id=35397.
Currently, I am using Cloudera on Ubuntu and Amazon’s Elastic MapReduce for Hadoop & Hive jobs. I’ve been using Sqoop to import & export data between databases (SQL Server, HBase and Aster Data) and SSIS ETL jobs to warehouse the aggregated data, while leaving the detail data in persistent HDFS nodes. Our data scientists are analyzing data from all 3 of those sources (SQL Server, Aster Data and Hadoop) through cubes, Excel, SQL interfaces and Hive. We are also using analytical tools: PowerPivot, SAS and Tableau.
That being said, and having spent 5 years previously @ Microsoft, I was very much anticipating getting the Windows distribution of Hadoop. I’ve only had 1 week to play around with it so far and I’ve decided to begin documenting my journey here in my blog. I’ll also talk about my experience with it so far, along with Aster, Tableau and Hadoop on Linux, on Nov 7 @ 6 PM in Microsoft’s Malvern office, my old stomping grounds: http://www.pssug.org.
As the group’s director, one of the reasons that I like having a Windows distribution of Hadoop is that we are not locked into an OS. We can leverage the broad skill sets that we have on staff & offshore, and we don’t tie ourselves to specific backgrounds when we evaluate potential employees’ experience.
When I began experimenting with the Microsoft Windows Hadoop distribution, I downloaded the preview file and installed it from the Web Installer, which created a series of Apache Hadoop services, including the core ones that drive the entire framework: jobtracker, tasktracker, namenode and datanode. There are a number of others that you can read about in any good Hadoop tutorial.
The installer created a user “hadoop” and an IIS app pool and site for the Microsoft dashboard for Hadoop. Compared to what you see from Hortonworks and Cloudera, it is quite sparse at this point. But I don’t really make much use of the management consoles from Hadoop vendors right now. As we expand our use of Hadoop, I’m sure we’ll use them more, just as I am sure that Microsoft will expand their dashboards, management, etc. and maybe even integrate with System Center.
You’ll get the usual Hadoop namenode and MapReduce web pages to view system activity and a command-line interface to issue jobs, manage the HDFS file system, etc. I’ve been using the dashboard to issue jobs, run Hive queries and download the Excel Hive driver, which I LOVE. I’ll blog about Hive next, in part 2. In the meantime, enjoy the screenshots of the portal access into Hadoop from the dashboard below:
This is how you submit a MapReduce JAR file (Java) job:
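If you prefer the command line to the dashboard, the same kind of job can be submitted with the standard `hadoop jar` command. Here is a minimal Python sketch that only builds and prints that command rather than running it. The JAR name, the `wordcount` program name and the HDFS paths are placeholders I made up for illustration, not something taken from the preview install, so swap in your own.

```python
import shlex


def hadoop_jar_command(jar_path, program, job_args):
    """Build the argument list for `hadoop jar <jar> <program> <args...>`.

    `program` is the main class (or driver program name) inside the JAR.
    """
    return ["hadoop", "jar", jar_path, program, *job_args]


# All values below are hypothetical placeholders.
cmd = hadoop_jar_command(
    "hadoop-examples.jar",          # assumed JAR file name
    "wordcount",                    # assumed program inside the JAR
    ["/user/hadoop/input", "/user/hadoop/output"],  # assumed HDFS paths
)

# Print the command so it can be pasted into a Hadoop command prompt.
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to hand to `subprocess.run(cmd)` later on a machine where the `hadoop` launcher is on the PATH.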
Here is the Hive interface for submitting SQL-like (HiveQL) queries against Hadoop using Hive’s data warehouse metadata schemas:
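To give a feel for what a HiveQL query looks like, here is a small Python sketch that composes one and wraps it in the Hive command-line form `hive -e "<query>"`. The `weblogs` table and its `page` column are hypothetical, made up purely for illustration, and the script only prints the command rather than executing it.

```python
import shlex


def hive_query_command(query):
    """Build the argument list for running one query with `hive -e`."""
    return ["hive", "-e", query]


# Hypothetical table and column names, for illustration only.
query = (
    "SELECT page, COUNT(*) AS hits "
    "FROM weblogs "
    "GROUP BY page "
    "ORDER BY hits DESC "
    "LIMIT 10"
)

hive_cmd = hive_query_command(query)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(hive_cmd))
```

The same query text could be pasted into the dashboard’s Hive page instead; only the surrounding `hive -e` wrapper is specific to the command line.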