I started off my series on Hadoop on Windows by talking about installing the local version of Microsoft HDInsight, the new Windows distribution of Hadoop. There is also a public cloud version of Hadoop on Azure: http://www.hadooponazure.com.
Here in part 2, I’ll focus on moving data between SQL Server and HDFS using Sqoop.
In this demo, I’m going to move data between Hadoop and a very simple sample SQL Server 2012 database that I’ve created called “sqoop_test”, which has a single table called “customers”. The table is deliberately minimal for this demo, with just a customer ID and a customer name. What I’m going to show you is how the Microsoft & Hortonworks Hadoop distribution for Windows (HDInsight) includes Sqoop for moving data between SQL Server & Hadoop.
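If you want to follow along, here is a minimal sketch of that schema. The exact column names and data types are my assumptions; all the demo really needs is an integer ID and a name column:

CREATE DATABASE sqoop_test;
GO
USE sqoop_test;
GO
-- Assumed column names and types; any integer key and string column will do.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(100)
);
GO
-- The single row that shows up later as "5,Bob Smith" in HDFS.
INSERT INTO customers (customer_id, customer_name) VALUES (5, 'Bob Smith');
GO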
You can also move data between HDFS and SQL Server with the Linux distributions of Hadoop and Sqoop by using the Microsoft Sqoop adapter available for download here.
First, I’ll start with moving data from SQL Server to Hadoop. When you run this command, you will “import” data into Hadoop from SQL Server. Presumably, this would provide a way for you to perform distributed processing and analysis of your data via MapReduce once you’ve copied the data to HDFS:
sqoop import --connect "jdbc:sqlserver://localhost;databaseName=sqoop_test" --username sqoop --password password --table customers -m 1
I have 1 record inserted into my customers table, and the import command places it into my Hadoop cluster, where I can view the data in a text file (which is what most things in Hadoop ultimately resolve to). The -m 1 flag tells Sqoop to use a single mapper, which is fine for a table this small:
> hadoop fs -cat /user/mark/customers/part-m-00000
> 5,Bob Smith
My SQL Server table has 1 row, so that row was imported into HDFS.
The more common direction would likely be moving data into SQL Server from Hadoop, and to do this I will export from HDFS to SQL Server. I have a schema for my data in Hadoop, created with Hive, that defines a table called Employees. I’m going to transform those employee records into Customer records in my SQL Server schema with Sqoop; a sketch of the Hive schema is below, followed by the export command:
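This is only a sketch of what such a Hive schema could look like; the column names, types, delimiter, and the EXTERNAL/LOCATION tie-in to the export directory are all my assumptions. The key point is that the delimited files under the export directory must line up column-for-column with the target customers table:

-- Hypothetical Hive schema over the files Sqoop will export.
CREATE EXTERNAL TABLE employees (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/mark/data/employees3';

With the data laid out like that, Sqoop reads the comma-delimited files from that directory and inserts each line as a row into SQL Server: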
> sqoop export --connect "jdbc:sqlserver://localhost;databaseName=sqoop_test" --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Those MapReduce jobs extract my data from HDFS and send it to SQL Server, so that when I query my SQL Server customers table, I have my original Bob Smith record plus the 4 new records that I transferred from Hadoop.
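To verify the export, a simple query against the customers table (using the assumed column names from the sketch above) shows all 5 rows:

USE sqoop_test;
GO
-- Should now return 5 rows: Bob Smith plus the 4 exported records.
SELECT customer_id, customer_name FROM customers;
GO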