What is Apache Hive ? The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. Read Hive Official site

This how to guide will help you to Install Apache Hive on CentOS/RHEL with Hadoop with easy steps.

Requirements: Install JAVA and Hadoop

Apache Hive2.1 requires java 7 or later version. We also need to install hadoop first before installing apache hive on our system. In example below, I'm using java 8

$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)

If you want further detail information about java requirement, see Java version

Steps to Install Hadoop2.7 on CentOS/RHEL 7

Download Hive package

After configuring hadoop successfully on your linux system. lets start hive setup. First download latest hive source code and extract archive using following commands.

$ cd /opt/hadoop
$ wget http://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
$ tar xzf apache-hive-2.1.0-bin.tar.gz
$ mv hive-2.1.0-bin hive
$ chown hadoop -R hive

Setup Environment Variables

Beside Hadoop environments, there are two more environment variables you need to add to ~.bashrc

# su - hadoop
export HIVE_HOME=/opt/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH

Preparation before starting Hive

Create /tmp and /user/hive/warehouse and set them chmod g+w in HDFS.

$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse

Hive metastore schema initialization

On Hive2.1.0, it's required to initialize metastore schema before hive starts, in the case below, I used db type derby

$ schematool -dbType derby -initSchemaTo 2.0.0 -verbose
which: no hbase in (/opt/hadoop/hive/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/local/bin:/usr/local/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:     jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver :     org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:     APP
Starting metastore schema initialization to 2.0.0
Initialization script hive-schema-2.0.0.derby.sql
Connecting to jdbc:derby:;databaseName=metastore_db;create=true
Connected to: Apache Derby (version 10.10.2.0 - (1582446))
Driver: Apache Derby Embedded JDBC Driver (version 10.10.2.0 - (1582446))
Transaction isolation: TRANSACTION_READ_COMMITTED
0: jdbc:derby:> !autocommit on
Autocommit status: true
0: jdbc:derby:> CREATE FUNCTION "APP"."NUCLEUS_ASCII" (C CHAR(1)) RETURNS INTEGER LANGUAGE JAVA PARAMETER STYLE JAVA READS SQL DATA CALLED ON NULL INPUT EXTERNAL NAME 'org.datanucleus.store.rdbms.adapter.DerbySQLFunction.ascii'
No rows affected (0.374 seconds)

Note:

You may have noticed that I initialized the schema to version 2.0.0 instead of 2.1.0, why?

This is caused old known problem of Hive, new release of Hive quite often missing the latest version of db schema support. To avoid that, just use previous release.

Here is the example error output when using version 2.1.0

$ schematool -dbType derby -initSchemaTo 2.1.0 -verbose
which: no hbase in (/opt/hadoop/hive/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/local/bin:/usr/local/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:     jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver :     org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User:     APP
Starting metastore schema initialization to 2.1.0
org.apache.hadoop.hive.metastore.HiveMetaException: Unknown version specified for initialization: 2.1.0
org.apache.hadoop.hive.metastore.HiveMetaException: Unknown version specified for initialization: 2.1.0
    at org.apache.hadoop.hive.metastore.MetaStoreSchemaInfo.generateInitFileName(MetaStoreSchemaInfo.java:128)
    at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:282)
    at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:508)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
*** schemaTool failed ***

Tips:

In old Hive version, you don't have init metastore schema, so if you accidently ran hive before schema initialization, you may see error like below:

Error: FUNCTION 'NUCLEUS_ASCII' already exists. (state=X0Y68,code=30000)

Here is the fix for it

rm -f $HIVE_HOME/scripts/metastore/upgrade/derby/hive-schema-2.1.0.derby.sql 

then re init Hive metastore schema

Start Hive

Lets start using hive using following command.

$ hive
which: no hbase in (/opt/hadoop/hive/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/local/bin:/usr/local/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/opt/hadoop/hadoop-2.7.3/sbin:/opt/hadoop/hadoop-2.7.3/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hadoop/hive/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
$ hive >

Create  test Table

At this stage you have successfully installed hive. Lets create a sample table using following command

hive>  CREATE TABLE test (id int, name string, age int);
OK
Time taken: 1.223 seconds

Show the created tables with below command.

hive> SHOW TABLES;
OK
test
Time taken: 0.431 seconds, Fetched: 1 row(s)

Drop the table using below command.

hive> DROP TABLE test;
OK
Time taken: 1.393 seconds

Now you have Hadoop Hive installed and configured on your Hadoop cluster!