This article is copied from IBM developerswork, they did really good job, thank you !

Hive and HBase

Hive is a data warehouse solution that has a thin SQL-like querying language called HiveQL. This language is used for querying data, and it saves you from writing native MapReduce processing to get your data out. Since you already know SQL, Hive is a good solution since it enables you to take advantage of your SQL knowledge to get data in and out of Apache Hadoop. One limitation of the Hive approach, though, is that it makes use of the append-only nature of HDFS to provide storage. This means that it is phenomenally easy to get the data in, but you cannot update it. Hive is not a database but a data warehouse with convenient SQL querying built on top of it. Despite the convenient interface, particularly on very large datasets, the fact that the query time required to process requests is so large means that jobs are submitted and results accessed when available. This means that the information is not interactively available.

HBase, by comparison, is a key-value (NoSQL) data store that enables you to write, update, and read data randomly, just like any other database. But it's not SQL. HBase enables you to make use of Hadoop in a more traditional real-time fashion than would normally be possible with the Hadoop architecture. Processing and querying data is more complex with HBase, but you can combine the HBase structure with Hive to get an SQL-like interface. HBase can be really practical as part of a solution that adds the data, processes it, summarizes it through MapReduce, and stores the output for use in future processing.

In short, think of Hive as an append-only SQL database and HBase as a more typical read-write NoSQL data store.

Hive is useful for SQL integration if you want to store long-term data to be processed and summarized and loaded back. Hive's major limitation is query speed. When dealing with billions of rows, there is no live querying of the data that would be fast enough for any interactive interface to the data.

For example, with data logging, the quantities of data can be huge, but what you often need is quick, flexible querying on either summarized or extreme data (i.e., faults and failures).

HBase is useful when what you want is to store large volumes of flexible data and query that information, but you might want only smaller datasets to work with. Hence, you might export data that simultaneously:

  1. Needs to be kept "whole," such as sales or financial data
  2. May change over time
  3. Also needs to be queried

HBase can then be combined with traditional SQL or Hive to allow snapshots, ranges, or aggregate data to be queried.

Making use of Hive

The primary reason to use Hive over a typical SQL database infrastructure is simply the size of the data and the length of time required to perform the query. Rather than dumping information into Hadoop, writing your own MapReduce query, and getting the information back out, with Hive you can (normally) write the same SQL statement, but on a much larger dataset.

Hive accomplishes this task by translating the SQL statement into a more typical Hadoop MapReduce job that assembles the data into a tabular format. This is where the limitation comes, in that Hive is not a real-time, or live querying solution. Once you submit the job, it can take a long time to get a response.

A typical use of Hive is to combine the real-time accessibility of data in a local SQL table, export that information into Hive long-term, and reimport the processed version that summarizes the data so it can be used in a live query environment, as seen in Figure 1.

Figure 1. Typical use of Hive

Image shows typical use of Hive

We can use the same SQL statement to get and obtain the data in both situations, a convenience that helps to harmonize and streamline your applications.

Getting data into Hive from SQL

The simplest way to get data in and out is by writing a custom application that will extract the required data from your existing SQL table and insert that data into Hive — that is, perform a select query and use INSERT to place those values directly into Hive.

Alternatively, depending on your application type, you might consider inserting data directly into both your standard SQL and Hive stores. This way, you can check your standard SQL for recent queries and process Hive data on an hourly/daily/weekly schedule as required to produce the statistical data you require longer-term.

Remember, with Hive, in the majority of cases we can:

  • Use the tabular data directly
  • Export without conversion
  • Export without reformatting or restructuring

More typically, you will be dumping entire tables or entire datasets by hour or day into Hive through an intermediary file. One benefit of this approach is that we easily introduce the file by running a local import or by copying that data directly into HDFS.

Let's look at this with an example, using the City of Chicago Traffic Tracker that studies bus data, giving a historical view of the speed of buses in different regions of Chicago at different times. A sample of the data is shown in Listing 1.

Listing 1. Sample of bus data from City of Chicago Traffic Tracker
mysql> select * from chicago limit 10;
| timelogged          | region | buscount | logreads | speed |
| 2013-03-26 09:50:00 |      2 |       40 |      600 | 26.59 |
| 2013-03-26 09:50:00 |      3 |       76 |     1023 | 21.82 |
| 2013-03-26 09:50:00 |      4 |       46 |      594 | 23.18 |
| 2013-03-26 09:50:00 |      5 |       56 |     1016 | 22.33 |
| 2013-03-26 09:50:00 |      6 |       58 |      859 | 21.14 |
| 2013-03-26 09:50:00 |      7 |       59 |      873 | 19.23 |
| 2013-03-26 09:50:00 |      8 |       89 |     1356 | 21.14 |
| 2013-03-26 09:50:00 |      9 |       28 |      404 | 21.82 |
| 2013-03-26 09:50:00 |     10 |      114 |     1544 | 20.45 |
| 2013-03-26 09:50:00 |     11 |       91 |     1310 |  22.5 |
10 rows in set

The sample dataset is just over 2 million records, and it has been loaded from a CSV export. To get that information into Hive, we can export it to a CSV file.

Listing 2. Exporting data to a CSV file
mysql> select * into outfile 'chicago.csv' fields terminated by ',' 
lines terminated by '\n' from chicago;

Then copy the data into HDFS: $ hdfs dfs -copyFromLocal chicago.csv. Now open Hive and create a suitable table.

Listing 3. Opening Hive and creating a suitable table
hive> create table chicago_bus (timelog timestamp, region int, 
buscount int, logreads int, speed float) row format delimited 
fields terminated by ',' lines terminated by "\n";

The first thing to notice with Hive is that compared to most standard SQL environments, we have a somewhat limited set of data types that can be used. Although core types like integers, strings, and floats are there, date types are limited. That said, Hive supports reading complex types, such as hashmaps. The core types supported by Hive:

  • Integers (1-8 bytes)
  • Boolean
  • Float/Double
  • String — any sequence of characters, and therefore good for CHAR, VARCHAR, SET, ENUM, TEXT, and BLOB types if the BLOG is storing text
  • Timestamp — either an EPOCH or YYYY-MM-DD hh:mm:ss.fffffffff-formatted string
  • Binary — for BLOBs that are not TEXT

The second observation is that we are defining the table structure. There are binary structures within Hive, but using CSV natively is convenient for our purposes since we've exported the data to a CSV file. When loading the data, the CSV format specification here will be used to identify the fields in the data.

The above example creates a standard table (the table within Hive's data store within HDFS). You can also create an external table that uses the copied file directly. However, in an SQL-to-Hive environment, we want to make use of one big table into which we can append new data.

Now the data can be loaded into the table: hive> load data inpath 'chicago.csv' into table chicago_bus;.

This code adds the contents of the CSV file to the existing table. In this case, it is empty, but you can see how easy it would be to import additional data.

Processing and querying Hive data

Once the data is loaded, you can execute Hive queries from the Hive shell just as you would in any other SQL environment. For example, Listing 4 shows the same query to get the first 10 rows.

Listing 4. Query to get the first 10 rows
hive> select * from chicago_bus limit 10;                    
2013-03-26 09:50:00	  1	    31  461	  22.5
2013-03-26 09:50:00	  2	    40  600	  26.59
2013-03-26 09:50:00	  3	    76  1023  21.82
2013-03-26 09:50:00	  4	    46  594	  23.18
2013-03-26 09:50:00	  5	    56  1016  22.33
2013-03-26 09:50:00	  6	    58  859   21.14
2013-03-26 09:50:00	  7	    59  873	  19.23
2013-03-26 09:50:00	  8	    89  1356  21.14
2013-03-26 09:50:00	  9	    28  404   21.82
2013-03-26 09:50:00	  10    114 1544  20.45
Time taken: 0.205 seconds

The benefit comes when we perform an aggregate query. For example, let's obtain an average bus speed for each region by day.

Listing 5. Obtaining average bus speed for each region by day
hive> select to_date(timelog),region,avg(speed) from chicago_bus 
group by to_date(timelog),region;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201308151101_0009, Tracking URL = 
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201308151101_0009
Hadoop job information for Stage-1: number of mappers: 1; 
   number of reducers: 1
2013-08-18 11:44:04,943 Stage-1 map = 0%,  reduce = 0%
2013-08-18 11:44:24,041 Stage-1 map = 100%,  reduce = 0%, 
    Cumulative CPU 12.83 sec
2013-08-18 11:44:25,053 Stage-1 map = 100%,  reduce = 0%, 
    Cumulative CPU 12.83 sec
2013-08-18 11:44:26,093 Stage-1 map = 100%,  reduce = 0%, 
    Cumulative CPU 12.83 sec
2013-08-18 11:44:27,102 Stage-1 map = 100%,  reduce = 0%, 
    Cumulative CPU 12.83 sec
2013-08-18 11:44:28,115 Stage-1 map = 100%,  reduce = 0%, 
    Cumulative CPU 12.83 sec
2013-08-18 11:44:29,136 Stage-1 map = 100%,  reduce = 100%, 
    Cumulative CPU 14.94 sec
2013-08-18 11:44:30,147 Stage-1 map = 100%,  reduce = 100%, 
    Cumulative CPU 14.94 sec
2013-08-18 11:44:31,164 Stage-1 map = 100%,  reduce = 100%, 
    Cumulative CPU 14.94 sec
2013-08-18 11:44:32,179 Stage-1 map = 100%,  reduce = 100%, 
    Cumulative CPU 14.94 sec
MapReduce Total cumulative CPU time: 14 seconds 940 msec
Ended Job = job_201308151101_0009
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 14.94 sec   HDFS Read: 
    77778973 HDFS Write: 690111 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 940 msec
2011-03-12  1   25.597916682561237
2011-03-12  2   27.758749961853027
2011-03-12  3   26.450833241144817
2011-03-12  4   24.638333241144817
2011-03-12  5   25.868695715199348
2011-03-12  6   24.921666463216145
2011-03-12  7   24.383749961853027
2011-03-12  8   23.729583422342937
2011-03-12  9   26.06217384338379
2011-03-12  10  24.53833333651225

As you can see from the output, the query is fundamentally the same as with MySQL (we are grouping by an alternative value), but Hive converts this into a MapReduce job, then calculates the summary values.

The reality with data of this style is that the likelihood of requiring the speed in region 1, for example, at 9:50 last Thursday is quite low. But knowing the average speed per day for each region might help predict the timing of traffic or buses in the future. The summary data can be queried and analyzed efficiently with a few thousand rows in an SQL store to allow the data to be sliced and diced accordingly.

To output that information back to a file, a number of options are available, but you can simply export back to a local file using the statement in Listing 6.

Listing 6. Exporting back to a local file
   to_date(timelog),region,avg(speed) from chicago_bus 
   group by to_date

This code creates a directory (chicagoout), into which the output is written as a series of text files. These can be loaded back into MySQL, but by default, the fields are separated by Ctrl+A. The output can be simplified to a CSV file again by creating a table beforehand, which uses the CSV formatting.

Listing 7. Simplifying to a CSV file
hive> CREATE TABLE chicago_region_speed (logdate timestamp, region int, 

Now rerun the job and insert the information into the table.

Listing 8. Rerunning the job
hive> INSERT TABLE chicago_region_speed SELECT to_date(timelog) as 
timelog,region,avg(speed) as speed from chicago_bus group by to_date

You can now find the files that make up the table in your Hive datawarehouse directory, so you can copy them out for loading into your standard SQL store, for example, using LOAD DATA INFILE.

Listing 9. Loading files into your standard SQL store
mysql> load data local infile 'chicago_region_speed.csv' into table 
chicago fields terminated by ',' lines terminated by '\n';

This process sounds clunky, but it can be automated, and because we have files from each stage, it is easy to re-execute or reformat the information if required.

Using views

If you are using Hive in the manner suggested earlier, and regularly processing and summarizing data daily, weekly or monthly, it might be simpler to create a view.

Views within Hive are logical — that is, the output of the view gets re-created each time a query is executed. Although using views is more expensive, for a data exchange environment, views hugely simplify the process by simplifying the query structure and allowing consistent output as the underlying source tables expand with new data. For example, to create a view from our original speed/region summary, use Listing 10.

Listing 10. Creating a view from our original speed/region summary
hive> create view chicago_region_speed_view as select to_date(timelog),
region,avg(speed) from chicago_bus group by to_date(timelog),region;

Now we can perform Hive queries on the view including new selections.

Listing 11. Performing Hive queries on the view
hive> select * from chicago_region_speed_view where speed < 15;
2012-01-15	28	13.449097183015612
2012-01-29	28	13.56520840856764
2012-02-05	28	14.568958282470703
2012-02-19	28	12.512847211625841
2012-03-04	28	12.886666708522373
2012-04-01	28	13.932638910081652

Not looking good for traffic in region 28, is it?

Data life-cycle management

If you decide to use Hive as a live component of your querying mechanism — exporting data from SQL, into Hive, and back out again so it can be used regularly — give careful thought as to how you manage the files to ensure accuracy. I tend to use the following basic sequence:

  1. Insert new data into a table — datalog, for example.
  2. When datalog is full (i.e., an hour, day, week, or month of information is complete), the table is renamed (to datalog_archive, for example), and a new table (same structure) is created.
  3. Data is exported from datalog_archive and appended into the Hive table for the data.

Depending on how the data is used and processed, analysis occurs by accessing the live data or by running the exact same SQL query statement on Hive. If the data is needed quickly, a view or query is executed that imports the corresponding data back into an SQL table in a summarized format. For example, for systems logging data (RAM, disk, and other usages) of large clusters, the data is stored in SQL for a day. This approach allows for live monitoring and makes it possible to spot urgent trends.

Data is written out each day to Hive where the log content is analyzed by a series of views that collect extreme values (for example, disk space less than 5 percent), as well as average disk usage. While reviewing the recent data, it's easy to examine and correlate problems (extreme disk usage and increased CPU time, for example), and execute the same query on the Hive long-term store to get the detailed picture.


Using HBase

Whereas Hive is useful for huge datasets where live queries are not required, HBase allows us to perform live queries on data, but it works differently. The primary difference is that HBase is not a tabular data store, so importing tabular data from an SQL store is more complex.

That said, the flexible internal structure of HBase is also more flexible. Data sources of multiple different data structures can be merged together within HBase. For example, with log data, you can store multiple sensor data into a single table within HBase, a situation that would require multiple tables in an SQL store.

Getting data in HBase from SQL

Unlike Hive, which supports a native tabular layout for the source data, HBase stores key-value pairs. This key-value system complicates the process of exporting data and using it directly because it first needs to be identified and then formatted accordingly to be understood within HBase.

Each item (or item identifier) requires a unique key. The unique ID is important because it is the only way to get individual data back again; the unique ID locates the record within the HBase table. Remember that HBase is about key-value pairs, and the unique ID (or key) is the identifier to the stored record data.

For some data types, such as the log data in our Hive examples, the unique key is meaningless because we are unlikely to want to view just one record. For other types, the data may already have a suitable unique ID within the table you want to use.

This dilemma can be solved by pre-processing our output, for example, and inserting a UUID() into our output.

Listing 12. Pre-processing the output
mysql> select uuid(),timelogged,region,buscount,logreads,speed into 
outfile 'chicago.tsv' fields terminated by '\t' lines terminated by '\n' 
from chicago;

This code creates a new UUID for each row of output. The UUID can be used to identify each record — even though for this type of data, that identification is not individually useful.

A secondary consideration within the export process is that HBase does not support joins. If you want to use HBase to write complex queries on your SQL data, you need to run a query within your SQL store that outputs an already-joined or aggregate record.

Within HBase, tables are organized according to column families, and these can be used to bond multiple groups of individual columns, or you can use the column families as actual columns. The translation is from the table to the document structure, as shown in Figure 2.

Figure 2. Translation from table to the document structure

Image shows translation from table to the document structure

To import the data, you have to first create the table. HBase includes a basic shell for accepting commands. We can open it and create a table and a table group called cf.

Listing 13. Creating a table group called cf
$ hbase shell
13/08/18 15:54:46 WARN conf.Configuration: hadoop.native.lib is 
deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.6-cdh4.3.0, rUnknown, Mon May 27 20:22:05 PDT 2013

hbase(main):001:0> create 'chicago', 'logtime', 'region', 'buscount', 
'readnumber' , 'speed'

Copy the tab-separated file created earlier into HDFS: $ hdfs dfs -copyFromLocal chicago.tsv. Now we can run importtsv, a tool inside the HBase JAR that imports values from a tab-delimited file.

Listing 14. Running importtsv
$ hadoop jar /usr/lib/hbase/hbase.jar importtsv \
speed chicago chicago.tsv

The content needs to be split to make it understandable: hadoop jar /usr/lib/hbase/hbase.jar importtsv.

This code runs the importtsv tool, which is included as part of the HBase JAR: -Dimporttsv.columns=HBASE_ROW_KEY,logtime,region,buscount,readnumber,speed.

The tool defines the columns that will be imported and how they will be identified. The fields are defined as a list; at least one of them must be the identifier (UUID) for each row, specified by HBASE_ROW_KEY, and the others define the field names (within the column family, cf) used for each input column.

  • chicago— The table name. It must have been created before this tool is executed.
  • chicago.tsv— The name of the file in HDFS to be imported.

The output from this command (see Listing 15) is rather immense, but the import process is complicated. The data cannot be directly loaded. Instead, it gets parsed by a MapReduce process that extracts and then inserts the data into an HBase table.

Listing 15. Output from importtsv command
13/08/18 16:08:37 INFO zookeeper.ZooKeeper: Client 
environment:zookeeper.version=3.4.5-cdh4.3.0--1, built on 
    05/28/2013 02:01 GMT
13/08/18 16:08:48 INFO mapred.JobClient: Running job: 
13/08/18 16:08:49 INFO mapred.JobClient:  map 0% reduce 0%
13/08/18 16:09:07 INFO mapred.JobClient:  map 1% reduce 0%
13/08/18 16:09:10 INFO mapred.JobClient:  map 4% reduce 0%
13/08/18 16:09:13 INFO mapred.JobClient:  map 9% reduce 0%
13/08/18 16:09:16 INFO mapred.JobClient:  map 18% reduce 0%
13/08/18 16:09:19 INFO mapred.JobClient:  map 26% reduce 0%
13/08/18 16:09:22 INFO mapred.JobClient:  map 28% reduce 0%
13/08/18 16:09:25 INFO mapred.JobClient:  map 35% reduce 0%
13/08/18 16:09:29 INFO mapred.JobClient:  map 40% reduce 0%
13/08/18 16:09:32 INFO mapred.JobClient:  map 46% reduce 0%
13/08/18 16:09:36 INFO mapred.JobClient:  map 47% reduce 0%
13/08/18 16:09:38 INFO mapred.JobClient:  map 52% reduce 0%
13/08/18 16:09:42 INFO mapred.JobClient:  map 55% reduce 0%
13/08/18 16:09:43 INFO mapred.JobClient:  map 57% reduce 0%
13/08/18 16:09:45 INFO mapred.JobClient:  map 58% reduce 0%
13/08/18 16:09:48 INFO mapred.JobClient:  map 60% reduce 0%
13/08/18 16:09:51 INFO mapred.JobClient:  map 62% reduce 0%
13/08/18 16:09:54 INFO mapred.JobClient:  map 64% reduce 0%
13/08/18 16:09:57 INFO mapred.JobClient:  map 65% reduce 0%
13/08/18 16:10:01 INFO mapred.JobClient:  map 67% reduce 0%
13/08/18 16:10:04 INFO mapred.JobClient:  map 69% reduce 0%
13/08/18 16:10:07 INFO mapred.JobClient:  map 71% reduce 0%
13/08/18 16:10:10 INFO mapred.JobClient:  map 72% reduce 0%
13/08/18 16:10:13 INFO mapred.JobClient:  map 74% reduce 0%
13/08/18 16:10:16 INFO mapred.JobClient:  map 76% reduce 0%
13/08/18 16:10:19 INFO mapred.JobClient:  map 78% reduce 0%
13/08/18 16:10:23 INFO mapred.JobClient:  map 80% reduce 0%
13/08/18 16:10:26 INFO mapred.JobClient:  map 81% reduce 0%
13/08/18 16:10:29 INFO mapred.JobClient:  map 83% reduce 0%
13/08/18 16:10:32 INFO mapred.JobClient:  map 84% reduce 0%
13/08/18 16:10:35 INFO mapred.JobClient:  map 86% reduce 0%
13/08/18 16:10:38 INFO mapred.JobClient:  map 88% reduce 0%
13/08/18 16:10:41 INFO mapred.JobClient:  map 90% reduce 0%
13/08/18 16:10:44 INFO mapred.JobClient:  map 92% reduce 0%
13/08/18 16:10:47 INFO mapred.JobClient:  map 95% reduce 0%
13/08/18 16:10:50 INFO mapred.JobClient:  map 98% reduce 0%
13/08/18 16:10:53 INFO mapred.JobClient:  map 100% reduce 0%
13/08/18 16:11:02 INFO mapred.JobClient: Job complete: 
13/08/18 16:11:02 INFO mapred.JobClient: Counters: 25
13/08/18 16:11:02 INFO mapred.JobClient:   File System Counters
13/08/18 16:11:02 INFO mapred.JobClient:     
    FILE: Number of bytes read=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    FILE: Number of bytes written=416878
13/08/18 16:11:02 INFO mapred.JobClient:     
    FILE: Number of read operations=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    FILE: Number of large read operations=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    FILE: Number of write operations=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    HDFS: Number of bytes read=156425993
13/08/18 16:11:02 INFO mapred.JobClient:     
    HDFS: Number of bytes written=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    HDFS: Number of read operations=4
13/08/18 16:11:02 INFO mapred.JobClient:     
    HDFS: Number of large read operations=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    HDFS: Number of write operations=0
13/08/18 16:11:02 INFO mapred.JobClient:   
    Job Counters 
13/08/18 16:11:02 INFO mapred.JobClient:     
    Launched map tasks=2
13/08/18 16:11:02 INFO mapred.JobClient:     
    Data-local map tasks=2
13/08/18 16:11:02 INFO mapred.JobClient:     
    Total time spent by all maps in occupied slots (ms)=178767
13/08/18 16:11:02 INFO mapred.JobClient:     
    Total time spent by all reduces in occupied slots (ms)=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    Total time spent by all maps waiting after reserving slots (ms)=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    Total time spent by all reduces waiting after reserving slots (ms)=0
13/08/18 16:11:02 INFO mapred.JobClient:   
    Map-Reduce Framework
13/08/18 16:11:02 INFO mapred.JobClient:    
    Map input records=2168224
13/08/18 16:11:02 INFO mapred.JobClient:     
    Map output records=2168224
13/08/18 16:11:02 INFO mapred.JobClient:     
    Input split bytes=248
13/08/18 16:11:02 INFO mapred.JobClient:     
    Spilled Records=0
13/08/18 16:11:02 INFO mapred.JobClient:     
    CPU time spent (ms)=14140
13/08/18 16:11:02 INFO mapred.JobClient:     
    Physical memory (bytes) snapshot=274292736
13/08/18 16:11:02 INFO mapred.JobClient:     
    Virtual memory (bytes) snapshot=1394532352
13/08/18 16:11:02 INFO mapred.JobClient:     
    Total committed heap usage (bytes)=125566976
13/08/18 16:11:02 INFO mapred.JobClient:   
13/08/18 16:11:02 INFO mapred.JobClient:     
    Bad Lines=0

If you get a bad-lines output that shows a high number of errors, particularly if the number equals the number of rows you are importing, the problem is probably the format of the source file or the fact that the number of columns in the source file does not match the number of columns defined in the import specification.

Once the data has been imported, we can use the shell to get one record to check that the import has worked.

Listing 16. Checking to see if the import has worked
hbase(main):003:0> scan 'chicago', { 'LIMIT' > 1}
ROW                                    COLUMN+CELL
 e49e1c1c-0d93-11e3-a5e5-81b71544159b  column=buscount:,
                                       timestamp=1376949811719, value=40
 e49e1c1c-0d93-11e3-a5e5-81b71544159b  column=logtime:, 
                                       value=2013-03-26 09:50:00
 e49e1c1c-0d93-11e3-a5e5-81b71544159b  column=readnumber:, 
                                       timestamp=1376949811719, value=600
 e49e1c1c-0d93-11e3-a5e5-81b71544159b  column=region:, 
                                       timestamp=1376949811719, value=2
 e49e1c1c-0d93-11e3-a5e5-81b71544159b  column=speed:, 
                                       timestamp=1376949811719, value=26.59
1 row(s) in 0.1830 seconds

You can see the basic structure of the data as it exists within the HBase table. The unique ID identifies each record, then individual key-value pairs contain the detail (i.e., the columns from the original SQL table).

Alternative SQL or Hive to HBase

An alternative model to the raw-data export (less common to HBase because of the record structure) is to use HBase to store summary values and parsed/composed queries.

Because the data from HBase is stored in a readily and quickly accessible format (access the key and get the data), it can be used to access chunks of data that have been computed from other jobs, stored into HBase, and used to access the summary data. For example, the summary data we generated using Hive earlier in this example could have been written into HBase to be accessed quickly to provide statistical data on the fly for a website.


Using HBase data from Hive

Now that we have the data in HBase, we can start querying and reporting on the information. The primary advantage of HBase is its powerful querying facilities based on the MapReduce within Hadoop. Since the data is stored internally as simple key-value combinations, it is easy to process through MapReduce.

MapReduce is no solution for someone from the SQL world, but we can take advantage of the flexible nature of Hive's processing model to crunch HBase data using the HQL interface. You may remember earlier I described how Hive supports processing of mapped data types; this is what HBase data is: mapped key-value pairs.

To use HBase data, we need to create a table within Hive that points to the HBase table and maps the key-value pairs in HBase to the column style of Hive.

Listing 17. Creating a table within Hive that points to the HBase table and maps the key-value pairs in HBase to the column style of Hive
hive> create external table hbase_chicago (key string, logtime timestamp, 
region int, buscount int, readnumber int, speed float) STORED BY 
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES 
('hbase.columns.mapping'=':key, logtime:val, region:val, buscount:val, 
readnumber:val, speed:val') TBLPROPERTIES(''='chicago');
  • The first part of this code creates a table definition identical to the one we used natively in Hive, except that we have added the row-key UUID as the first column.
  • The STORED BY block defines the storage format.
  • The SERDEPROPERTIES block is the mapping between the document structure and the columns. The colon separates the key name and corresponding value and how the data should be mapped to the columns, in sequence, from the table definition.
  • The TBLPROPERTIES block defines the name of the HBase table where the data lies.

Once the table has been created, the table can be queried through Hive using native SQL, just as we saw earlier.

Why use this method instead of a native import? The primary reason is the ease with which it can be queried (although no longer live), but also because the underlying HBase data can be updated, rather than just appended to. In an SQL-to-Hadoop architecture this advantage means we can take regular dumps of changing data from SQL and update the content.


Reminder: HBase or Hive

Given the information here, it's worth reminding ourselves of the benefits of the two systems.

Table 1. Benefits of the two systems
Feature Hive HBase
SQL Support Yes No
Tabular Data Structure Yes Semi
Append Only Yes No
Allows Updates No Yes
Live Queries Not really Yes, within limits
Joins Yes No

Which one you use will depend entirely on your use case and the data you have available, and how you want to query it. Hive is great for massive processing of ever-increasing data. HBase is useful for querying data that may change over time and need to be updated.



The primary reason for moving data between SQL stores and Hadoop is usually to take advantage of the massive storage and processing capabilities to process quantities of data larger than you could hope to cope with in SQL alone. How you exchange and process that information from your SQL store into Hadoop is, therefore, important. Large quantities of long-term data that need to be queried more interactively can take advantage of the append-only and SQL nature of Hive. For data that needs to be updated and processed, it might make more sense to use HBase. HBase also makes an ideal output target from Hive because it's so easy to access summary data directly by using the native key-value store.

When processing, you also need to consider how to get the data back in. With Hive, the process is easy because we can run SQL and get a table that can easily be imported back to our SQL store for straightforward or live query processing. In this article, I've covered a wide range of use cases and examples of how data can be exchanged more easily from SQL using tabular interfaces to the Hadoop and non-tabular storage underneath.