Apache Pig can be installed on a local machine or on a Hadoop cluster. To install Apache Pig, download the package from the Apache Pig release page. You can also download the Pig package from Cloudera or from Apache's Maven repository.
Pig does not need to be installed on the Hadoop cluster itself; it runs on the machine from which you launch Hadoop jobs. It can also be installed on your local desktop or laptop for prototyping, in which case Apache Pig runs in local mode. If your desktop or laptop can access the Hadoop cluster, you can install the Pig package there as well.
The Pig package is written in Java and is therefore portable across operating systems, but the pig launcher is a bash script, so it requires a UNIX/Linux operating system.
Once you have downloaded Pig, place the tarball in a directory of your choice and untar it using the command below:
tar -xvf <pig_downloaded_tar_file>
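As a minimal sketch, assuming the downloaded tarball is pig-0.17.0.tar.gz and you choose /opt as the install directory (both are examples, not requirements), the extraction and a quick sanity check could look like this:

# extract the tarball into /opt (hypothetical location)
tar -xvf pig-0.17.0.tar.gz -C /opt

# point PIG_HOME at the extracted directory and put the pig script on the PATH
export PIG_HOME=/opt/pig-0.17.0
export PATH=$PATH:$PIG_HOME/bin

# verify the installation by printing the Pig version
pig -version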
Apache Pig Execution
You can execute a Pig script in the following three ways (example invocations follow the list):
Interactive mode (local)
pig -x local
Runs in a single JVM, and all files are accessed on the local file system.
Interactive mode (MapReduce/HDFS)
pig -x mapreduce
Runs on the Hadoop cluster; this is the default mode.
Script mode
pig -x local myscript.pig
The script is a text file of Pig Latin statements and can be run in either local or MapReduce mode.
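As a quick sketch, assuming a script named myscript.pig in the current directory (a hypothetical name), the three invocations look like this:

# interactive Grunt shell in local mode: single JVM, local file system
pig -x local

# interactive Grunt shell in MapReduce mode: runs on the Hadoop cluster (the default)
pig -x mapreduce

# script mode: run myscript.pig in local mode; omit -x local to run it on the cluster
pig -x local myscript.pig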
Command-Line Options
Pig provides a wide variety of command-line options. Below are a few of them, with example invocations after the list:
-h or -help
Lists all the available command-line options.
-e or -execute
Executes a single command through Pig. For example, pig -e fs -ls will list the home directory.
-P or -propertyFile
Specifies a property file for Pig to read.
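A brief sketch of these options in use; conf/custom.properties is a hypothetical property file path:

# list all available command-line options
pig -help

# run a single command and exit: list the home directory on HDFS
pig -e fs -ls

# start Pig with settings loaded from a custom property file
pig -P conf/custom.properties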
The table below shows the return codes used by Pig along with their descriptions; the shell sketch after the table shows one way to check them.
Value | Description |
0 | Success |
1 | Retriable failure |
2 | Failure |
3 | Partial failure (used with multi-query) |
4 | Illegal arguments passed to Pig |
5 | IOException thrown (usually thrown by a UDF) |
6 | PigException thrown (usually means a Python UDF raised an exception) |
7 | ParseException thrown (can happen during variable substitution) |
8 | Throwable thrown (an unexpected exception) |
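Because these are ordinary process exit codes, a wrapper shell script can branch on them. A minimal sketch, assuming a script named myscript.pig:

# run the script and capture Pig's return code
pig myscript.pig
ret=$?

# 0 is success, 3 is a partial multi-query failure, anything else is an error
if [ $ret -eq 0 ]; then
    echo "Pig job succeeded"
elif [ $ret -eq 3 ]; then
    echo "Pig job partially failed (multi-query)"
else
    echo "Pig job failed with return code $ret"
fi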
Grunt
Apache Pig's interactive shell is called Grunt. It gives users a shell in which to run Pig Latin statements and interact with HDFS.
Enter the command below on the UNIX machine where the Pig package is installed:
pig -x local
The output will be:
grunt>
To exit the Grunt shell, type quit or press Ctrl-D.
HDFS commands can be run from the Grunt shell using the keyword fs. The dash (-) before the command name is mandatory, just as it is with hadoop fs.
grunt> fs -ls
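A short sketch of a Grunt session that mixes fs commands with Pig Latin: the first two lines create an HDFS directory and copy a local file into it, and the last two load that file as a relation and print it. The paths and data.txt are hypothetical.

grunt> fs -mkdir /user/demo/input
grunt> fs -copyFromLocal data.txt /user/demo/input
grunt> lines = LOAD '/user/demo/input/data.txt' AS (line:chararray);
grunt> DUMP lines;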
Utility Commands for controlling Pig from the Grunt shell
kill jobid
You can find a job's ID in the Hadoop JobTracker UI. The kill command terminates a running Pig job based on that job ID.
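For example, assuming the JobTracker shows a job with the (hypothetical) ID job_201907291205_0001, you would terminate it from the Grunt shell like this:

grunt> kill job_201907291205_0001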
exec
Use the exec command to run a Pig script in batch mode, with no interaction between the script and the Grunt shell.
Examples:
grunt> exec script.pig
grunt> exec -param p1=myparam1 -param p2=myparam2 script.pig
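To show what the -param values feed into, here is a hypothetical script.pig that takes its input and output paths from the parameters p1 and p2 (the column layout is invented for the example):

-- script.pig: load from the path in p1, keep adult records, store to the path in p2
data = LOAD '$p1' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER data BY age >= 18;
STORE adults INTO '$p2';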
run
Issuing a run command in the Grunt shell has essentially the same effect as typing the script's statements manually, so the aliases the script defines stay available in the shell afterwards.
Both run and exec are useful for debugging because you can modify a Pig script in an editor and then rerun it in the Grunt shell without leaving the shell.
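Continuing the hypothetical script.pig above, running it with run leaves its aliases available in the shell, which is handy while debugging:

grunt> run -param p1=/user/demo/input -param p2=/user/demo/output script.pig
grunt> DESCRIBE adults;
grunt> DUMP adults;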