HIVE Query Language (HQL) – HIVE Create Database, Create Table

In this section, we will discuss data definition language parts of HIVE Query Language(HQL), which are used for creating, altering and dropping databases, tables, views, functions, and indexes.

We will also look into SHOW and DESCRIBE commands for listing and describing databases and tables stored in HDFS file system.

Hive Database – HIVE Query

A database in Hive is just a namespace or catalog of tables. If we do not specify database, default database is used.

HIVE CREATE database syntax 

hive> CREATE DATABASE IF NOT EXISTS financial;

HIVE USE Database syntax

This command is used for using a particular database.

hive> use <database name>

hive> use financial;

HIVE SHOW all existing databases

hive> SHOW DATABASES;

default
financial

For each database, HIVE will create a directory and the tables say “EMP” in that database and say “financial” is stored in sub-directories.

The database directory is created under the directory specified in the parameter “hive.metastore.warehouse.dir”. 

Assuming the default value of the parameter is “/usr/hive/warehouse”, when we create the “financial” database, HIVE will create the sub-directory as “/usr/hive/warehouse/financial.db” where “.db” is the extension.

HIVE DESCRIBE database syntax

This will show the directory location of the “financial” database.

hive> DESCRIBE DATABASE financial;

financial
hdfs://usr/hive/warehouse/financial.db

HIVE ALTER Database Syntax

We can associate key-value pairs with a database in the DBPROPERTIES using ALTER DATABASE command.

hive> ALTER DATABSE financial SET DBPROPERTIES (‘edited by’ = ‘kmak’);

HIVE CREATE Table Syntax

CREATE table statement in Hive is similar to what we follow in SQL but hive provides lots of flexibilities in terms of where the data files for the table will be stored, the format used, delimiter used etc.

CREATE TABLE IF NOT EXISTS financial.EMP

(

Ename STRING,

EmpID INT,

Salary FLOAT,

Tax MAP<STRING, FLOAT>,

Subordinates ARRAY<STRING>,

Address STRUCT<city: STRING, state: STRING, zip: INT>

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\001'

COLLECTION ITEMS TERMINATED BY '\002'

MAP KEYS TERMINATED BY '\003'

LINES TERMINATED BY '\n'

STORED AS TEXTFILE;

In the above example, the default location of the table will be “/user/hive/warehouse/financial.db/EMP”.

If we want to explicitly override the location then we need to use the “LOCATION” keyword in the above create table statement.

HIVE SHOW TABLES Syntax

This command lists the tables in the current working database.

hive> SHOW TABLES;

EMP

Table1

Table2

HIVE DESCRIBE Syntax

The DESCRIBE statement displays metadata about a table, such as the column names and their data types.

hive> DESCRIBE financial.emp.salary

salary    float

HIVE DROP Database Syntax

DROP Database statement deletes all the related tables and then delete the database.

hive>DROP DATABASE IF EXISTS financial;

hive> DROP DATABASE IF EXISTS financial CASCADE;

HIVE DROP Table Syntax

hive> DROP TABLE IF EXISTS emp;

HIVE ALTER Table Syntax

hive> ALTER TABLE emp RENAME TO employee;--Rename Table Name

hive> ALTER TABLE emp CHANGE name firstname String;--Rename Column Name

hive> ALTER TABLE emp CHANGE sal sal Double; -- Rename data types

hive> ALTER TABLE emp ADD COLUMNS (dno INT COMMENT 'Department number'); -- Add column name

hive> ALTER TABLE emp REPLACE COLUMNS (eid INT empid Int,  ename STRING name String);--Deletes all the columns from emp and replace it with two columns

HIVE DESCRIBE EXTENDED Syntax

This command is used for describing details about the table. If we replace EXTENDED with FORMATTED then it provides more verbose output.

The output lines in the description that start with LOCATION also shows the full URL path in the HDFS where hive will keep all the data related to a given table.

hive>DESCRIBE EXTENDED financial.EMP;
Ename STRING,
EmpID INT,
Salary FLOAT,
Tax MAP<STRING, FLOAT>,
Subordinates ARRAY<STRING>,
Address STRUCT<city: STRING, state: STRING, zip: INT>

Detailed Table Information&nbsp;&nbsp; Table( tableName: Emp, dbName:financial, owner:kmak…
Location:hdfs://user/hive/warehouse/financial.db/Emp,
Parameters:{creator=kmak, created at=’2016-02-02 11:01:43’,last_modified_user=kmak,last_modified_time=1443553211,…}

HIVE Managed and External Tables

HIVE Managed Tables

By default, HIVE tables are the managed tables. HIVE controls metadata and the lifecycle of the data.

HIVE stores all the data related to a given table in the subdirectory under the directory defined by the parameter “hive.metastore.warehouse.dir” which is “/user/hive/warehouse” by default.

Dropping a managed table deletes the data from the table by deleting the sub-directory that has got created for the respective table. “DESCRIBE EXTENDED” command output will tell whether a table is managed or extended.

The drawback of the managed table is less convenient to use with other tools. For

For Example: We have inserted some data into some table by Pig or some other tool. Now, we wanted to run some HIVE queries on the data inserted by Pig but do not want to give the ownership to HIVE. In such case, we create an external table which has access to the data but not the ownership.

HIVE External Tables

Let us suppose we want to analyze the marketing data getting ingested from different sources.

We ingest south and north zone data using many tools such as Pig, HIVE and so on. Let us assume data file for the SALES table resides in directory /data/marketing.

The following table declaration creates an external table that can read all the data files from /data/marketing distributed directory with data stored as comma-delimited data.

CREATE EXTERNAL TABLE IF NOT EXISTS SALES (
SaleId INT,
ProductId INT,
Quantity INT,
ProdName STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/marketing';

The keyword “EXTERNAL” tells HIVE that this table is external and the data is stored in the directory mentioned in “LOCATION” clause.

Since the table is external, HIVE does not assume it owns the data. Therefore, dropping table deletes only the metadata in HIVE Metastore and the actual data remains intact.

Therefore, if the data is shared between tools, then it is always advisable to create an external table to make ownership explicit.

Copy the schema (not the data) of an existing table

CREATE EXTERNAL TABLE IF NOT EXISTS financial.emp1
LIKE financial.emp
LOCATION ‘/data/marketing’;

HIVE Partition Concepts – Managed Table and External Table Partition

In the below post we are going to understand about HIVE Partition and Concepts of HIVE partitioning in detail.

Partitions are essentially horizontal slices of data that divides the larger set of data into small manageable chunks.

In HIVE, partitioning is supported for both managed and external table. Partitioning can be done based on one or more than one columns to impose multi-dimensional structure on directory storage.

Partition column is a virtual column that does not exist on the file as a column. HIVE queries can take advantage of the partitioned data for better performance.

HIVE Partition – Managed Table Partitioning

Let us discuss the HIVE partition concept for the managed table first.

Our human resource (HR) team might be interested in looking the data country-wise and then state-wise. Therefore, let us first partition the data first county-wise and then state-wise.

CREATE TABLE IF NOT EXISTS financial.EMP
(
Ename STRING,
EmpID INT,
Salary FLOAT,
Tax MAP<STRING, FLOAT>,
Subordinates ARRAY<STRING>,
Address STRUCT<city: STRING, state: STRING, zip: INT>
)
PARTITIONED BY (country STRING, state STRING);

Partitioning tables changes how HIVE structures the data storage. HIVE will create sub-directories to reflect the partition structure. For Example:

...

…/Emp/country=IN/state=KA

…/Emp/country=IN/state=BH

…

…/Emp/country=US/state=NC

…/Emp/country=US/state=AL

…

If the table data and the number of partitions are large then a HIVE query across all partitions it can trigger enormous MapReduce jobs.

In such case, we put the HIVE into “strict” mode by setting up the below parameter.

hive> set hive.mapred.mode=strict;

The above parameter prohibits the HIVE queries on partitioned tables to run without a WHERE clause.

SHOW PARTITION Syntax

hive> SHOW PARTITIONS EMP;

HIVE Partition – External Table Partitioning

Partitioning external table has the added advantage of sharing the data with other tools, while still optimizing the query performance.

CREATE EXTERNAL TABLE IF NOT EXISTS log_tracking

(
ID STRING,
SERVER STRING,
MESSAGE STRING,
)

PARTITIONED BY (YEAR INT,MONTH INT,DAY INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n';

When we created a non-partitioned external table, we need the “LOCATION” as the mandatory clause.

But for the partitioned external table, the clause is not mandatory. We use ALTER table statement to add each partition separately.

ALTER TABLE log_tracking ADD PARTITION(YEAR=2016, MONTH=12, DAY=3)

LOCATION ‘hdfs://server_name/data/log_tracking/2016/12/03’;

SHOW PARTITIONS log_tracking;
…
Year=2016/month=12/day=3
…
…

Similarly, DESCRIBE EXTENDED log_tracking will show the detailed information of the partition keys.

HiveQL Data Manipulation – Load, Insert, Export Data and Create Table

It is important to note that HiveQL data manipulation doesn’t offer any row-level insert, update or delete operation. Therefore, data can be inserted into hive tables using either “bulk” load operations or writing the files into correct directories by other methods.

HiveQL Load Data into Managed Tables

Loading data from input file (Schema on Read)

Hive>LOAD DATA LOCAL INPATH   '/home/hduser/sampledata/users.txt'

OVERWRITE INTO TABLE users;
  • ‘LOCAL’ indicates the source data is on local file system
  • Local data will be copied into the final destination (HDFS file system) by Hive
  • If ‘Local’ is not specified, the file is assumed to be on HDFS
  • Hive does not do any data transformation while loading the data

Loading data into partition requires PARTITION clause

Hive>LOAD DATA LOCAL PATH  '/home/hduser/sampledata/Employees.txt' OVERWRITE INTO TABLES Employees;

PARTITION (country  = 'India', city ='Delhi');

HDFS directory is created according to the partition values

Loading data from HDFS directory

Hive>LOAD DATA INPATH  '/usr/hadoop/data' OVERWRITE INTO TABLES aliens;
  • All the files in the directory are copied into Hive
  • ‘OVERWRITE’ causes table to be purged and filled
  • Leaving out ‘OVERWRITE’ adds data to existing folder (old data will exist under its name and new one under a different name)

HiveQL Insert Data into Hive Tables from Queries

Hive> INSERT OVERWRITE TABLE Employee
Partition (country= ‘IN’,state=’KA’)
SELECT * FROM emp_stage ese
WHERE ese.country=’IN’ AND ese.state=’KA’;

Create table and load them from Hive Queries

Hive> CREATE TABLE Employees

AS SELECT eno,ename,sal,address

FROM emp

WHERE country=’IN’;

Exporting Data out of Hive

If LOCAL keyword is used, Hive will write the data to local directory

Hive>INSERT OVERWRITE LOCAL DIRECTORY   '/home/hadoop/data'

SELECT name, age

FROM aliens

WHERE date_sighted >'2014-09-15'

Hive Query – HiveQL, Hive Select, Join, Order, Sort, Indexes

Hive Query is similar to SQL. Hive queries are written in HiveQL and can be used to query the data without knowledge of Java or MapReduce.

Hive Query – HiveQL

SELECT … FROM Clause

SELECT … FROM clause is almost same as SQL. Please see few examples below.

Hive SELECT all query

Hive> SELECT * FROM aliens LIMIT X;

From above Hive query, it will select X records from aliens. Array elements in a record will be within brackets

Hive SELECT query with array indexing

Hive> SELECT  name, reportee[0] FROM employees;

From the above Hive query, the first element of the array is selected.

Hive SELECT query with STRUCT column.

Hive> SELECT address.city,  name FROM employees;

From above Hive query output will show the struct column in JSON format and the first element of the array is selected.

Hive Manipulating Column values

We can manipulate column values using function calls and arithmetic expressions.

hive> SELECT upper(name), sales_cost, round((sales_cost-cogs)/sales_cost) FROM products;

Hive Column Alias and Nested Queries

Select query with column alias

Hive> SELECT  name, address.city as city  FROM employees;

Hive Column aliases for nested queries

Hive> FROM(

SELECT cust_name, round(ytd_sales * .05) as rewards

FROM customer;

)  C

SELECT C.name, C.rewards

WHERE C.rewards > 20;

Hive Selecting data from partitions

Hive scans only partitions relevant to the query, thus improving performance.

Hive> SELECT  name, age  FROM employees

Where city = 'Delhi';

Assuming partitioned on cities and there are 4 partitions with equal volume of data, query will partition only 1/4th of the data

Hive JOIN Statements

Hive supports ANSI joins

hive> SELECT  * FROM customers JOIN sales

ON customers.id = sales.cust_id

Hive query also supports Inner joins, left outer joins, Right outer, Full outer, left semi-joins

Hive Order By 

  • Order By clause similar to SQL
  • Performs ordering using single reducer
  • Will take a long time for large datasets

Hive Sort By

  • SORT BY Clause
  • Data is ordered in each reducer – each reducer’s output is sorted
  • If multiple reducers – final results may not be as expected

Hive views

Hive views are similar to SQL views which reduce the Hive query complexity by encapsulating the complex queries from end users.

Hive> CREATE VIEW emp_dept AS
SELECT * FROM emp JOIN dept
ON (emp.dno=dept.deptno) WHERE ename =’Andrew’;

A user can drop the view in the same way as the table.

Hive> DROP VIEW IF EXISTS emp_dept;

Hive Indexes

Hive can build indexes on columns to speed up querying operations. The index data is stored in another table.

This is a very useful concept and a good alternative to partitioning when the logical partitions are small and numerous.

Let us consider the below table:

CREATE TABLE IF NOT EXIXTS financial.EMP
(
Ename STRING,
EmpID INT,
Salary FLOAT,
Tax MAP<STRING, FLOAT>,
Subordinates ARRAY<STRING>,
Address STRUCT<city: STRING, state: STRING, zip: INT>
)
PARTITIONED BY (country STRING, state STRING);

If we need to create an index on country partition only, then below is the syntax:

CREATE INDEX emp_index
ON TABLE EMP (country)
AS ‘org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’
WITH DEFERRED REBUILD
IDXPROPERTIES (‘creator’=’kmak’, ‘created_at’=13092016 13:24:56’)
IN TABLE EMP_INDEX
PARTITIONED BY (country, ename)
COMMENT ‘Employee indexed by country and ename’;

The AS clause specifies the index handler, a java inbuilt class that implements indexing.

Hive BITMAP Indexes

CREATE INDEX emp_index
ON TABLE EMP (country)
AS ‘BITMAP’
WITH DEFERRED REBUILD
IDXPROPERTIES (‘creator’=’kmak’, ‘created_at’=13092016 13:24:56’)
IN TABLE EMP_INDEX_TABLE
PARTITIONED BY (country, ename)
COMMENT ‘Employee indexed by country and ename’;

Showing an Index

SHOW FORMATTED INDEX ON emp;

Dropping an Index

DROP INDEX IF EXISTS emp_index ON TABLE emp;

HIVE UDF (User Defined Functions) – HIVE Standard, Aggregate Function

HIVE UDF (User Defined Functions) allow the user to extend HIVE Query Language. Once the UDF is added in the HIVE script, it works like a normal built-in function. To check which all UDFs are loaded in current hive session, we use SHOW command.

To check which all HIVE UDF are loaded in the current HIVE session, we use SHOW command.

hive> SHOW FUNCTIONS;

array
abs
acos
and

The documentation about a function is shown by DESCRIBE FUNCTION <“function name”> command and their detailed documentation is accessed by DESCRIBE FUNCTION EXTENDED <“function name”> command.

hive> DESCRIBE FUNCTION concat;

concat(str1,str, ... ,strN) - returns the concatenation of strings.

Different HIVE UDF used in HIVE Query Language

Standard FUNCTION

These kinds of functions take a row argument or one or more columns from a row and return a single value.

Example: round(),floor(), concat(), reverse(),ucase(),abs() etc.

It is important to note that these functions can also return a complex object, such as an array, map, or struct.

Aggregate FUNCTION

It takes one or more columns from zero to many rows and returns a single value.

Example: sum(), avg(),min(),max() etc.

Table Generating FUNCTION

Table generation functions take zero or more inputs and produce multiple columns or rows as output.

hive> SELECT array(1,2,3,4) FROM dual;

1
2
3
4

Explode() is another table generation function which takes an array of input and iterates through the list and returns each element from the list in a separate row.

The lateral view is used in conjunction with user-defined table generating functions such as explode().

An example of explode() function is given below :

Column NameColumn Type
DeptbranchnoArray<int>
DnameSTRING

An example of data in the above tables with three rows:

DnameDeptbranchno
Zoology[1,2,3]
Mathematics[5,6]

A lateral view with explode() can be used to convert  “Deptbranchno”  into separate rows using the query:

Hive> SELECT Dname, Deptbranchno

FROM dept

LATERAL VIEW explode(Deptbranchno) as DeptbranchID;

The output of above HIVE Query will be:

Dname(string)DeptbranchID(int)
Zoology1
Zoology2
Zoology3
Mathematics5
Mathematics6