HIVE Bucketing – Advantages of HIVE Bucketing

Like partitioning, HIVE bucketing is a way to decompose data into more manageable sets; unlike partitioning, it distributes rows into a fixed number of buckets based on the hash of a column.

Consider a requirement to partition the data first by date and then by user ID. In this case, a very large number of small partitions will be created, which is difficult to manage.

Also, HIVE limits the maximum number of partitions that can be created.
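For example, when partitions are created dynamically, Hive caps how many partitions a single statement may create. As a rough sketch, the relevant limits can be inspected from the Hive shell (running SET with just a property name prints its current value):

hive> SET hive.exec.max.dynamic.partitions;          -- cap on dynamic partitions per statement
hive> SET hive.exec.max.dynamic.partitions.pernode;  -- cap per mapper/reducer node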

Therefore, a command like the one below might fail to execute in the Hive shell:

hive> CREATE TABLE web_log (source_id STRING, source_ip STRING, source_desc STRING)
    > PARTITIONED BY (dte STRING, uid STRING);

Instead, if we bucket on the “uid” column and partition only on the “dte” column, the value of the “uid” column will be hashed into the number of buckets the user has defined.

hive> CREATE TABLE web_log (uid STRING, source_id STRING, source_ip STRING, source_desc STRING)
    > PARTITIONED BY (dte STRING)
    > CLUSTERED BY (uid) INTO 50 BUCKETS;
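To actually populate the bucketed table so that rows land in the right buckets, older Hive versions need bucketing enforcement switched on at insert time. A minimal sketch, assuming a hypothetical staging table web_log_staging that holds the raw rows (the partition value is illustrative):

hive> SET hive.enforce.bucketing = true;
hive> INSERT OVERWRITE TABLE web_log PARTITION (dte='2017-01-01')
    > SELECT uid, source_id, source_ip, source_desc
    > FROM web_log_staging
    > WHERE dte = '2017-01-01';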

HIVE Bucketing Advantages

HIVE Bucketing has several advantages. It improves join performance when the bucketing key and the join key are the same.

HIVE Bucketing also provides more efficient sampling on bucketed tables than on non-bucketed tables.
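For example, the TABLESAMPLE clause can read a single bucket instead of scanning the whole table. The query below is a sketch against the web_log table defined above and touches roughly 1/50th of the data in the partition (the partition value is illustrative):

hive> SELECT * FROM web_log TABLESAMPLE(BUCKET 1 OUT OF 50 ON uid) s
    > WHERE dte = '2017-01-01';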

It also reduces I/O scans during the join process if the join happens on the bucketed keys (columns).

To leverage bucketing in a join operation, we should set hive.optimize.bucketmapjoin=true.

This setting hints to Hive to perform a bucket-level join during the map-side join.
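A sketch of a bucket map join is shown below. It assumes a second, hypothetical table user_profile (with a user_name column) that is also clustered by uid into the same number of buckets as web_log; only then can Hive join the tables bucket by bucket on the map side:

hive> SET hive.optimize.bucketmapjoin = true;
hive> SELECT w.uid, u.user_name, w.source_ip
    > FROM web_log w
    > JOIN user_profile u ON (w.uid = u.uid)
    > WHERE w.dte = '2017-01-01';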

It also reduces the scan cycles to find a particular key because bucketing ensures that the key is present in a certain bucket.

Custom Hive using UDFs – Prerequisites, Step-by-Step Example

We can extend Hive with custom logic using User Defined Functions (UDFs). To demonstrate the process, let us look at the example below.

Prerequisites for Custom Hive

  • Our class has to extend the UDF base class (org.apache.hadoop.hive.ql.exec.UDF).
  • Our class must have at least one evaluate() method. evaluate() is not declared in the UDF base class itself; Hive resolves it by reflection. The evaluate() method should take at least one parameter.
  • Compile the Java file and package the compiled .class file into a JAR file.
  • Add the JAR file to the Hive classpath.
  • Create a temporary function.

Problem Statement

Find the maximum marks obtained out of four subjects by a student.

Step 1:

Create a table STUDENTS_RECORDS with the below sample records.

SID  NAME   CLASS  MATH  PHYSICS  ENGLISH  CSC  TOT_MARKS
1    MAK    10     85    95       86       92   358
2    TANUL  10     85    85       100      95   377
3    JHON   10     92    98       82       90   362
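A minimal sketch of how this table could be created and loaded, assuming a comma-delimited local file (the path and file name are illustrative):

hive> CREATE TABLE STUDENTS_RECORDS (sid INT, name STRING, class INT,
    > math DOUBLE, physics DOUBLE, english DOUBLE, csc DOUBLE, tot_marks DOUBLE)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/training/students_records.csv' INTO TABLE STUDENTS_RECORDS;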

Step 2:

Create a new project, say “hiveudf”, in the Package Explorer of the Eclipse IDE.

Add the required JAR files from the Libraries tab in Eclipse.

Choose “Add External JARs…”. The main JAR file which we need to add is “hadoop-core.jar”.

Also, add the Hive-related JAR files. All the Hive-related JAR files are available in the /hive/lib folder.

Step 3:

Create a class in the “hiveudf” project by right-clicking on the project.

Package Name: “com.hadoop.hive”

Class Name: “GetMaxMarks”.

Step 4:

Write code as below:

package com.hadoop.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

// Custom UDF that returns the maximum of the four subject marks.
public class GetMaxMarks extends UDF {

    // Hive resolves this evaluate() method by reflection at query time.
    public double evaluate(double math, double eng, double physics, double csc) {
        double maxMarks = math;
        if (eng > maxMarks) {
            maxMarks = eng;
        }
        if (physics > maxMarks) {
            maxMarks = physics;
        }
        if (csc > maxMarks) {
            maxMarks = csc;
        }
        return maxMarks;
    }
}

Step 5:

Create the JAR file for the above class: right-click on the project -> Export -> JAR -> Next -> set the JAR file name to “hive-maxmarks.jar”.

Step 6: 

Add the JAR file to the Hive classpath.

hive> ADD JAR /home/training/workspace/hive-maxmarks.jar;
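To confirm the JAR was registered for the current session, you can list the resources Hive has loaded:

hive> LIST JARS;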

Step 7:

In order to apply business logic on top of a Hive column using our UDF, we need to create a temporary function that points to the class packaged in the exported JAR file. A temporary function exists only for the current Hive session.

hive> CREATE TEMPORARY FUNCTION GetMaxMarks AS 'com.hadoop.hive.GetMaxMarks';  -- fully qualified class name

Step 8:

Apply this UDF/function on your table.

hive> SELECT sid, name, GetMaxMarks(math, english, physics, csc) FROM STUDENTS_RECORDS;
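Assuming the table was created and loaded with the sample records above, the query should return one row per student with the highest of the four marks, along these lines:

1    MAK      95.0
2    TANUL    100.0
3    JHON     98.0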