HiveQL Data Manipulation – Load, Insert, Export Data and Create Table

It is important to note that HiveQL data manipulation doesn’t offer any row-level insert, update or delete operation. Therefore, data can be inserted into hive tables using either “bulk” load operations or writing the files into correct directories by other methods.

HiveQL Load Data into Managed Tables

Loading data from input file (Schema on Read)

Hive>LOAD DATA LOCAL INPATH   '/home/hduser/sampledata/users.txt'

OVERWRITE INTO TABLE users;
  • ‘LOCAL’ indicates the source data is on local file system
  • Local data will be copied into the final destination (HDFS file system) by Hive
  • If ‘Local’ is not specified, the file is assumed to be on HDFS
  • Hive does not do any data transformation while loading the data

Loading data into partition requires PARTITION clause

Hive>LOAD DATA LOCAL PATH  '/home/hduser/sampledata/Employees.txt' OVERWRITE INTO TABLES Employees;

PARTITION (country  = 'India', city ='Delhi');

HDFS directory is created according to the partition values

Loading data from HDFS directory

Hive>LOAD DATA INPATH  '/usr/hadoop/data' OVERWRITE INTO TABLES aliens;
  • All the files in the directory are copied into Hive
  • ‘OVERWRITE’ causes table to be purged and filled
  • Leaving out ‘OVERWRITE’ adds data to existing folder (old data will exist under its name and new one under a different name)

HiveQL Insert Data into Hive Tables from Queries

Hive> INSERT OVERWRITE TABLE Employee
Partition (country= ‘IN’,state=’KA’)
SELECT * FROM emp_stage ese
WHERE ese.country=’IN’ AND ese.state=’KA’;

Create table and load them from Hive Queries

Hive> CREATE TABLE Employees

AS SELECT eno,ename,sal,address

FROM emp

WHERE country=’IN’;

Exporting Data out of Hive

If LOCAL keyword is used, Hive will write the data to local directory

Hive>INSERT OVERWRITE LOCAL DIRECTORY   '/home/hadoop/data'

SELECT name, age

FROM aliens

WHERE date_sighted >'2014-09-15'

Hive Query – HiveQL, Hive Select, Join, Order, Sort, Indexes

Hive Query is similar to SQL. Hive queries are written in HiveQL and can be used to query the data without knowledge of Java or MapReduce.

Hive Query – HiveQL

SELECT … FROM Clause

SELECT … FROM clause is almost same as SQL. Please see few examples below.

Hive SELECT all query

Hive> SELECT * FROM aliens LIMIT X;

From above Hive query, it will select X records from aliens. Array elements in a record will be within brackets

Hive SELECT query with array indexing

Hive> SELECT  name, reportee[0] FROM employees;

From the above Hive query, the first element of the array is selected.

Hive SELECT query with STRUCT column.

Hive> SELECT address.city,  name FROM employees;

From above Hive query output will show the struct column in JSON format and the first element of the array is selected.

Hive Manipulating Column values

We can manipulate column values using function calls and arithmetic expressions.

hive> SELECT upper(name), sales_cost, round((sales_cost-cogs)/sales_cost) FROM products;

Hive Column Alias and Nested Queries

Select query with column alias

Hive> SELECT  name, address.city as city  FROM employees;

Hive Column aliases for nested queries

Hive> FROM(

SELECT cust_name, round(ytd_sales * .05) as rewards

FROM customer;

)  C

SELECT C.name, C.rewards

WHERE C.rewards > 20;

Hive Selecting data from partitions

Hive scans only partitions relevant to the query, thus improving performance.

Hive> SELECT  name, age  FROM employees

Where city = 'Delhi';

Assuming partitioned on cities and there are 4 partitions with equal volume of data, query will partition only 1/4th of the data

Hive JOIN Statements

Hive supports ANSI joins

hive> SELECT  * FROM customers JOIN sales

ON customers.id = sales.cust_id

Hive query also supports Inner joins, left outer joins, Right outer, Full outer, left semi-joins

Hive Order By 

  • Order By clause similar to SQL
  • Performs ordering using single reducer
  • Will take a long time for large datasets

Hive Sort By

  • SORT BY Clause
  • Data is ordered in each reducer – each reducer’s output is sorted
  • If multiple reducers – final results may not be as expected

Hive views

Hive views are similar to SQL views which reduce the Hive query complexity by encapsulating the complex queries from end users.

Hive> CREATE VIEW emp_dept AS
SELECT * FROM emp JOIN dept
ON (emp.dno=dept.deptno) WHERE ename =’Andrew’;

A user can drop the view in the same way as the table.

Hive> DROP VIEW IF EXISTS emp_dept;

Hive Indexes

Hive can build indexes on columns to speed up querying operations. The index data is stored in another table.

This is a very useful concept and a good alternative to partitioning when the logical partitions are small and numerous.

Let us consider the below table:

CREATE TABLE IF NOT EXIXTS financial.EMP
(
Ename STRING,
EmpID INT,
Salary FLOAT,
Tax MAP<STRING, FLOAT>,
Subordinates ARRAY<STRING>,
Address STRUCT<city: STRING, state: STRING, zip: INT>
)
PARTITIONED BY (country STRING, state STRING);

If we need to create an index on country partition only, then below is the syntax:

CREATE INDEX emp_index
ON TABLE EMP (country)
AS ‘org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’
WITH DEFERRED REBUILD
IDXPROPERTIES (‘creator’=’kmak’, ‘created_at’=13092016 13:24:56’)
IN TABLE EMP_INDEX
PARTITIONED BY (country, ename)
COMMENT ‘Employee indexed by country and ename’;

The AS clause specifies the index handler, a java inbuilt class that implements indexing.

Hive BITMAP Indexes

CREATE INDEX emp_index
ON TABLE EMP (country)
AS ‘BITMAP’
WITH DEFERRED REBUILD
IDXPROPERTIES (‘creator’=’kmak’, ‘created_at’=13092016 13:24:56’)
IN TABLE EMP_INDEX_TABLE
PARTITIONED BY (country, ename)
COMMENT ‘Employee indexed by country and ename’;

Showing an Index

SHOW FORMATTED INDEX ON emp;

Dropping an Index

DROP INDEX IF EXISTS emp_index ON TABLE emp;