Pig in Hadoop

    Pig Hadoop is a high-level programming platform that is useful for analyzing huge datasets. Pig Hadoop was developed by Yahoo! and is generally used with Hadoop to perform many data administration operations.

    For writing data analysis programs, Pig provides a high-level language called Pig Latin. Pig Latin offers several operators with which programmers can develop their own functions for reading, writing, and processing data.

    For analyzing data through Apache Pig, we need to write scripts using Pig Latin. These scripts are then transformed into MapReduce tasks with the help of the Pig Engine.
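
    As a minimal sketch (assuming a hypothetical tab-delimited file student.txt in HDFS), a complete Pig Latin script can be as short as:

    -- load the data, keep the rows we want, and write the result back to HDFS
    students = LOAD 'student.txt' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
    adults = FILTER students BY age >= 18;
    STORE adults INTO 'output' USING PigStorage('\t');

    The Pig Engine turns these three statements into a MapReduce pipeline; the programmer never writes a mapper or reducer by hand.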

    Why Apache Pig?

    By now, we know that Apache Pig is used with Hadoop, and Hadoop is based on the Java programming language. Now, the question that arises in our minds is ‘Why Pig?’ The need for Apache Pig came up when many programmers weren’t comfortable with Java and struggled to work with Hadoop, especially when MapReduce tasks had to be performed. Apache Pig came into the Hadoop world as a boon for all such programmers.

    • With Pig Latin, programmers can work on MapReduce tasks without having to write complicated Java code.
    • To reduce the length of code, Apache Pig uses a multi-query approach, which reduces development time by almost 16 times.
    • Since Pig Latin is very similar to SQL, it is comparatively easy to learn Apache Pig if we have some knowledge of SQL.

    Features of Pig Hadoop

    There are several features of Apache Pig:

    1. In-built operators: Apache Pig provides a rich set of operators for performing data operations such as sort, join, and filter (see the sketch after this list).
    2. Ease of programming: Since Pig Latin has similarities with SQL, it is very easy to write a Pig script.
    3. Automatic optimization: The tasks in Apache Pig are automatically optimized, which lets programmers concentrate on the semantics of the language.
    4. Handles all kinds of data: Apache Pig can analyze both structured and unstructured data and store the results in HDFS.
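
    As an illustration of the in-built operators, the following sketch (with hypothetical files emp.txt and dept.txt and made-up field names) chains FILTER, JOIN, and ORDER in a few lines:

    emp = LOAD 'emp.txt' USING PigStorage(',') AS (id:int, name:chararray, dept_id:int);
    dept = LOAD 'dept.txt' USING PigStorage(',') AS (dept_id:int, dept_name:chararray);
    filtered = FILTER emp BY id > 100;                   -- in-built filter operator
    joined = JOIN filtered BY dept_id, dept BY dept_id;  -- in-built join operator
    ordered = ORDER joined BY name ASC;                  -- in-built sort operator
    DUMP ordered;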

    Apache Pig Architecture

    The main reason why programmers have started using Hadoop Pig is that it converts their scripts into a series of MapReduce tasks, which makes their job much easier.

    Pig Hadoop framework has four main components:

    1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser. The parser checks the syntax of the script, along with other miscellaneous checks, and outputs a Directed Acyclic Graph (DAG) in which the Pig Latin statements and logical operators are represented as nodes.
    2. Optimizer: The logical plan (the DAG) produced by the parser is passed to the logical optimizer, which carries out the logical optimizations.
    3. Compiler: The compiler receives the output of the optimizer and compiles the optimized logical plan into a series of MapReduce tasks or jobs.
    4. Execution Engine: After the logical plan is converted into MapReduce jobs, these jobs are submitted to Hadoop in a properly sorted order and executed there to yield the desired result (see the EXPLAIN example below).
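
    You can see the plans these components produce by running the EXPLAIN operator in the Grunt shell; a small sketch (hypothetical file and field names):

    grunt> A = LOAD 'data.txt' AS (f1:int, f2:chararray);
    grunt> EXPLAIN A;    -- prints the logical, physical, and MapReduce plans for A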

    Downloading and Installing Pig Hadoop

    Follow the steps below for the Apache Pig installation. These steps work on Linux distributions such as CentOS and Ubuntu, or on Windows using a VM (e.g., Cloudera’s). In this tutorial section on ‘Pig Hadoop’, we are using CentOS.

    Step 1: Download the Pig tar file by running the following command in your terminal:

    wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz

    Step 2: Extract the tar file (you downloaded in the previous step) using the following command:

     tar -xzf pig-0.16.0.tar.gz

    This command extracts the tar file. To check whether the extraction succeeded, run the ls command to list the directory contents; you should see a pig-0.16.0 directory.

    Step 3: Edit the .bashrc file to add the Apache Pig environment variables. This is required so that Pig can be run from any directory instead of having to change into the Pig directory to execute Pig commands; other applications can also locate Apache Pig through the path set in this file. Open .bashrc in a text editor (for example, gedit ~/.bashrc) and add the following lines at the end of the file:

    # Set PIG_HOME

    export PIG_HOME=/home/training/pig-0.16.0

    export PATH=$PATH:/home/training/pig-0.16.0/bin

    export PIG_CLASSPATH=$HADOOP_CONF_DIR

    Save the file (File > Save), close the editor, and then run the following command in your terminal to apply the changes:

    source .bashrc

    Step 4: Check the Pig version. To verify that Pig is successfully installed, run the following command:

    pig -version


    Step 5: Start the Grunt shell (used to run Pig Latin scripts) by running the command: pig


    By default, Pig Hadoop runs MapReduce jobs, which requires access to a Hadoop cluster and an HDFS installation. There is also a local mode, in which all the files are installed and run from the local host and local file system. You can start local mode with the command: pig -x local
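
    A quick sketch of the two invocations:

    pig             # Grunt shell in MapReduce mode (the default), using the Hadoop cluster and HDFS
    pig -x local    # Grunt shell in local mode, using the local file system
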
    I hope you were able to successfully install Apache Pig. In this section on Apache Pig, we learned what Apache Pig is, why we need it, its features and architecture, and finally how to install it.

    Apache Pig:

    It is a high-level platform for creating programs that run on Hadoop; its language is known as Pig Latin. Pig can execute its Hadoop jobs in MapReduce or Tez.

    Data types:

    A particular kind of data defined by the values it can take

    • Simple data types:
      • int: signed 32-bit integer
      • long: signed 64-bit integer
      • float: 32-bit floating point
      • double: 64-bit floating point
      • chararray: character array in UTF-8 format
      • bytearray: byte array (blob)
      • boolean: true or false
    • Complex data types:
      • tuple: an ordered set of fields
      • bag: a collection of tuples
      • map: a set of key-value pairs
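
    A minimal LOAD schema that combines these types might look like this (hypothetical file and field names):

    A = LOAD 'data.txt' AS (name:chararray,
                            age:int,
                            address:tuple(city:chararray, zip:chararray),
                            phones:bag{t:tuple(num:chararray)},
                            props:map[]);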

    Apache Pig Components:

    • Parser: The parser checks the syntax of the scripts.
    • Optimizer: It carries out logical optimizations such as projection and pushdown.
    • Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
    • Execution engine: The MapReduce jobs are executed on Hadoop, and the desired results are obtained.

    Pig execution modes:

    • Grunt mode: an interactive mode that is useful for syntax checking and ad hoc data exploration
    • Script mode: runs a set of instructions from a file
    • Embedded mode: useful for executing Pig programs from a Java program
    • Local mode: in this mode, the entire Pig job runs as a single JVM process
    • MapReduce mode: in this mode, Pig runs the jobs as a series of MapReduce jobs
    • Tez mode: in this mode, Pig jobs run as a series of Tez jobs (see the command sketch below)
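
    These modes are selected on the command line; a brief sketch (myscript.pig is a hypothetical script file):

    pig myscript.pig             # script mode, in the default MapReduce mode
    pig -x local myscript.pig    # the same script run in local mode
    pig -x tez myscript.pig      # Tez mode (available in Pig 0.14 and later)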

    Pig commands equivalent to the SQL functions:

    • SELECT: FOREACH alias GENERATE column_name, column_name;
    • SELECT *: FOREACH alias GENERATE *;
    • DISTINCT: DISTINCT (FOREACH alias GENERATE column_name, column_name);
    • WHERE: FOREACH (FILTER alias BY column_name operator value) GENERATE column_name, column_name;
    • AND/OR: FILTER alias BY (column_name operator value1 AND column_name operator value2) OR column_name operator value3;
    • ORDER BY: ORDER alias BY column_name ASC|DESC, column_name ASC|DESC;
    • TOP/LIMIT: FOREACH (GROUP alias BY column_name) GENERATE LIMIT alias number; or TOP(number, column_index, alias);
    • GROUP BY: FOREACH (GROUP alias BY column_name) GENERATE function(alias.column_name);
    • LIKE: FILTER alias BY REGEX_EXTRACT(column_name, pattern, 1) IS NOT NULL;
    • IN: FILTER alias BY column_name IN (value1, value2, …);
    • JOIN: FOREACH (JOIN alias1 BY column_name, alias2 BY column_name) GENERATE column_name(s);
    • LEFT/RIGHT/FULL OUTER JOIN: FOREACH (JOIN alias1 BY column_name LEFT|RIGHT|FULL, alias2 BY column_name) GENERATE column_name(s);
    • UNION ALL: UNION alias1, alias2;
    • AVG: FOREACH (GROUP alias ALL) GENERATE AVG(alias.column_name);
    • COUNT: FOREACH (GROUP alias ALL) GENERATE COUNT(alias);
    • COUNT DISTINCT: FOREACH (GROUP alias ALL) { unique_column = DISTINCT alias.column_name; GENERATE COUNT(unique_column); };
    • MAX: FOREACH (GROUP alias ALL) GENERATE MAX(alias.column_name);
    • MIN: FOREACH (GROUP alias ALL) GENERATE MIN(alias.column_name);
    • SUM: FOREACH (GROUP alias ALL) GENERATE SUM(alias.column_name);
    • HAVING: FILTER alias BY aggregate_function(column_name) operator value;
    • UCASE/UPPER: FOREACH alias GENERATE UPPER(column_name);
    • LCASE/LOWER: FOREACH alias GENERATE LOWER(column_name);
    • SUBSTRING: FOREACH alias GENERATE SUBSTRING(column_name, start, start+length) AS some_name;
    • LEN: FOREACH alias GENERATE SIZE(column_name);
    • ROUND: FOREACH alias GENERATE ROUND(column_name);
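
    As a worked illustration of this mapping, the SQL query SELECT dept, AVG(salary) FROM emp GROUP BY dept might translate to (hypothetical relation and field names):

    emp = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, dept:chararray, salary:float);
    grouped = GROUP emp BY dept;
    avgs = FOREACH grouped GENERATE group AS dept, AVG(emp.salary) AS avg_salary;
    DUMP avgs;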

    Pig Operators:

    • Loading and storing:
      • LOAD: loads data into a relation
      • DUMP: dumps the data to the console
      • STORE: stores data in a given location
    • Grouping data and joining:
      • GROUP: groups the data in a relation by key
      • COGROUP: groups the data from multiple relations by key
      • CROSS JOIN: joins two or more relations via a cross product
    • Limiting and sorting:
      • LIMIT: limits the number of results
      • ORDER: sorts by categories or fields
    • Data sets:
      • UNION: combines multiple relations
      • SPLIT: splits a relation into multiple relations
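
    A short sketch contrasting GROUP, COGROUP, and CROSS (hypothetical files and field names):

    emp = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, dept:chararray);
    loc = LOAD 'loc.txt' USING PigStorage(',') AS (dept:chararray, city:chararray);
    g = GROUP emp BY dept;                 -- group one relation by key
    cg = COGROUP emp BY dept, loc BY dept; -- group two relations by the same key
    c = CROSS emp, loc;                    -- cross product of the two relations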


    Basic Operators:

    • Arithmetic operators: +, -, *, /, %, and the bincond operator ? :
    • Boolean operators: and, or, not
    • Casting operators: cast from one data type to another
    • Comparison operators: ==, !=, >, <, >=, <=, matches
    • Construction operators: used to construct a tuple (), a bag {}, or a map []
    • Dereference operators: used to dereference tuples (tuple.id or tuple.(id, …)), bags (bag.id or bag.(id, …)), and maps (map#'key')
    • Disambiguate operator (::): used to identify field names after the JOIN, COGROUP, CROSS, or FLATTEN operators
    • Flatten operator: un-nests tuples as well as bags
    • Null operators: is null, is not null
    • Sign operators: + has no effect; - changes the sign of a positive/negative number
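
    A small sketch of the bincond (? :) and flatten operators (hypothetical file and field names):

    A = LOAD 'scores.txt' AS (name:chararray, score:int);
    B = FOREACH A GENERATE name, (score >= 50 ? 'pass' : 'fail');  -- bincond picks one of two values
    G = GROUP A BY name;
    F = FOREACH G GENERATE group, FLATTEN(A.score);                -- flatten un-nests the grouped bag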


    Relational Operators:

    • COGROUP/GROUP: groups the data in one or more relations; COGROUP groups together the tuples that have the same group key
    • CROSS: computes the cross product of two or more relations
    • DEFINE: assigns an alias to a UDF or a streaming command
    • DISTINCT: removes duplicate tuples from a relation
    • FILTER: selects the tuples from a relation that satisfy a specified condition
    • FOREACH: generates a transformation for each tuple of a relation as specified
    • IMPORT: imports macros defined in a separate file
    • JOIN: performs an inner join of two or more relations based on common field values
    • LOAD: loads data from the file system
    • MAPREDUCE: executes native MapReduce jobs from within a Pig script
    • ORDER BY: sorts a relation based on one or more fields
    • SAMPLE: selects a random sample of a relation based on a specified sample size
    • SPLIT: partitions a relation based on specified conditions or expressions
    • STORE: stores or saves the result in the file system
    • STREAM: sends the data to an external script or program
    • UNION: computes the union of two or more relations
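
    For instance, SPLIT and SAMPLE might be used as follows (hypothetical file and field names):

    A = LOAD 'data.txt' AS (f1:int);
    SPLIT A INTO small IF f1 < 10, big IF f1 >= 10;  -- partition by condition
    S = SAMPLE A 0.1;                                -- random ~10% sample of A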

    Diagnostic Operators:

    • DESCRIBE: returns the schema of a relation
    • DUMP: dumps or displays the results on screen
    • EXPLAIN: displays the execution plans
    • ILLUSTRATE: displays the step-by-step execution of a sequence of statements
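
    In the Grunt shell, these look like (assuming a relation A has already been defined):

    grunt> DESCRIBE A;    -- prints the schema of A
    grunt> EXPLAIN A;     -- prints the execution plans
    grunt> ILLUSTRATE A;  -- shows a sample step-by-step run of the statements behind A
    grunt> DUMP A;        -- displays the contents of A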

    Differentiation between Operational vs. Analytical Systems

    • Latency: 1 ms to 100 ms (operational) vs. 1 min to 100 min (analytical)
    • Concurrency: 1,000 to 100,000 vs. 1 to 10
    • Access pattern: writes and reads vs. reads
    • Queries: selective vs. unselective
    • Data scope: operational vs. retrospective
    • End user: customer vs. data scientist
    • Technology: NoSQL database vs. MapReduce, MPP database

    Traditional Enterprise Approach

    In this approach, an enterprise uses a single computer to store and process big data. For storage, users can choose from database vendors such as Oracle, IBM, etc. The user interacts with the application, which handles data storage and analysis.

    Limitation

    This approach works well for applications that require modest storage, processing, and database capabilities, but when it comes to dealing with large amounts of scalable data, it imposes a bottleneck.

    Solution

    Google solved this problem with the MapReduce algorithm. This algorithm divides a task into small parts or units, assigns them to multiple computers, and integrates the intermediate results to produce the desired result. Intellipaat’s Big Data Hadoop training will help you gain a better understanding of the concepts of Big Data solutions in the Open Data Platform!

    Pig built-in functions:

    • EVAL functions: AVG, COUNT, COUNT_STAR, SUM, TOKENIZE, MAX, MIN, SIZE, etc.
    • LOAD or STORE functions: PigStorage(), TextLoader, HBaseStorage, JsonLoader, JsonStorage, etc.
    • Math functions: ABS, COS, SIN, TAN, CEIL, FLOOR, ROUND, RANDOM, etc.
    • String functions: TRIM, RTRIM, SUBSTRING, LOWER, UPPER, etc.
    • DateTime functions: GetDay, GetHour, GetYear, ToUnixTime, ToString, etc.


    Eval functions:

    • AVG(col): computes the average of the numerical values in a single-column bag
    • CONCAT(expression1, expression2): concatenates two expressions of identical type
    • COUNT(DataBag bag): computes the number of elements in a bag, excluding null values
    • COUNT_STAR(DataBag bag): computes the number of elements in a bag, including null values
    • DIFF(DataBag bag1, DataBag bag2): compares two bags; any elements that are in one bag but not the other are returned in a bag
    • IsEmpty(DataBag bag), IsEmpty(Map map): checks whether the bag or map is empty
    • MAX(col): computes the maximum of the numeric values or chararrays in a single-column bag
    • MIN(col): computes the minimum of the numeric values or chararrays in a single-column bag
    • DEFINE pluck PluckTuple(expression1): lets the user specify a string prefix and then filters for the columns whose names begin with that prefix
    • SIZE(expression): computes the number of elements based on any Pig data type
    • SUBTRACT(DataBag bag1, DataBag bag2): returns a bag containing the tuples of bag1 that are not in bag2
    • SUM: computes the sum of the values in a single-column bag
    • TOKENIZE(String expression [, 'field_delimiter']): splits the string and outputs a bag of words (see the word-count sketch below)
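
    Several of these functions come together in the classic word-count sketch below (lines.txt is a hypothetical input file):

    A = LOAD 'lines.txt' AS (line:chararray);
    W = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- TOKENIZE splits each line into a bag of words
    G = GROUP W BY word;
    C = FOREACH G GENERATE group, COUNT(W);                  -- COUNT tallies the words in each group
    DUMP C;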

    Load or Store Functions:

    • PigStorage():

    Syntax: PigStorage(field_delimiter)
    A = LOAD 'Employee' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
    Loads and stores data as structured text files

    • TextLoader():

    Syntax: A = LOAD 'data' USING TextLoader();
    Loads unstructured data in UTF-8 format

    • BinStorage():

    Syntax: A = LOAD 'data' USING BinStorage();
    Loads and stores data in machine readable format

    • Handling compression:

    Pig can load and store compressed data; the compression format is inferred from the file extension, as sketched below
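
    A hedged sketch, assuming gzip- and bzip2-compressed paths (hypothetical file names):

    A = LOAD 'input.gz' USING PigStorage('\t');    -- .gz and .bz2 inputs are decompressed automatically
    STORE A INTO 'out.bz2' USING PigStorage('\t'); -- storing to a .bz2 path compresses the output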

    • JsonLoader, JsonStorage:

    Syntax: A = LOAD 'a.json' USING JsonLoader();
    It loads and stores JSON data

    • PigDump():

    Syntax: STORE X INTO 'output' USING PigDump();
    Stores data in UTF-8 format

    Math functions:

    • ABS:

    Syntax: ABS(expression)
    It returns the absolute value of an expression

    • COS:

    Syntax: COS(expression)
    It returns the trigonometric cosine of an expression.

    • SIN:

    Syntax: SIN(expression)
    It returns the sine of an expression.

    • CEIL:

    Syntax: CEIL(expression)
    It is used to return the value of an expression rounded up to the nearest integer

    • TAN:

    Syntax: TAN(expression)
    It is used to return the trigonometric tangent of an angle.

    • ROUND:

    Syntax: ROUND(expression)
    It returns the value of an expression rounded to an integer (if the result type is float) or long (if the result type is double)

    • RANDOM:

    Syntax: RANDOM()
    It returns a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0

    • Floor:

    Syntax: FLOOR(expression)
    Returns the value of an expression rounded down to the nearest integer.

    • CBRT:

    Syntax: CBRT(expression)
    It returns the cube root of an expression

    • EXP:

    Syntax: EXP(expression)
    Returns Euler’s number e raised to the power of the expression.
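
    A combined usage sketch of these math functions (nums.txt is a hypothetical input file):

    A = LOAD 'nums.txt' AS (x:double);
    B = FOREACH A GENERATE ABS(x), CEIL(x), FLOOR(x), ROUND(x), RANDOM();
    DUMP B;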

    String Functions:

    • INDEXOF:

    Syntax: INDEXOF(string, 'character', startIndex)
    It returns the index of the first occurrence of a character in a string

    • LAST_INDEX_OF:

    Syntax: LAST_INDEX_OF(string, 'character')
    It returns the index of the last occurrence of a character in a string

    • TRIM:

    Syntax: TRIM(expression)
    It returns a copy of the string with leading and trailing whitespace removed

    • SUBSTRING:

    Syntax: SUBSTRING(string, startIndex, stopIndex)
    It will return a substring from a given string

    • UCFIRST:

    Syntax: UCFIRST(expression)
    It will return a string with the first character converted to uppercase

    • LOWER:

    Syntax: LOWER(expression)
    Converts all characters in a string to lowercase

    • UPPER:

    Syntax: UPPER(expression)
    Converts all characters in a string to uppercase
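
    A combined usage sketch of these string functions (names.txt is a hypothetical input file):

    A = LOAD 'names.txt' AS (s:chararray);
    B = FOREACH A GENERATE TRIM(s), UPPER(s), LOWER(s), SUBSTRING(s, 0, 3), INDEXOF(s, 'a', 0);
    DUMP B;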

    Tuple, Bag and Map functions:

    • TOTUPLE(expression [, expression …]): converts one or more expressions to the type tuple
    • TOBAG(expression [, expression …]): converts one or more expressions to individual tuples, which are then placed in a bag
    • TOMAP(key-expression, value-expression [, key-expression, value-expression …]): converts key/value expression pairs to a map
    • TOP(topN, column, relation): returns the top-n tuples from a bag of tuples
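
    A short sketch of TOTUPLE and TOMAP (hypothetical file and field names):

    A = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, dept:chararray, salary:double);
    B = FOREACH A GENERATE TOTUPLE(name, dept) AS info, TOMAP(name, salary) AS pay;
    DUMP B;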


    In the next section of this tutorial, we will learn about Apache Hive.