Apache Hive is an open-source data warehouse system built on top of Hadoop. You can use Hive to analyze and query large datasets stored in Hadoop files, and it can process both structured and semi-structured data.
What is Hive in Hadoop?
Don’t you think writing MapReduce jobs is tedious work? With Hadoop Hive, you can simply submit SQL-like queries, and Hive runs them as MapReduce jobs for you. So, if you are comfortable with SQL, Hive is the right tool, as it lets you work on MapReduce tasks efficiently. Like Pig, Hive has its own language: HiveQL (HQL), which is similar to SQL. Just as Pig Latin does, HQL translates SQL-like queries into MapReduce jobs. The best part is that you don’t need to learn Java to work with Hadoop Hive.
Hadoop Hive runs on our system and converts SQL queries into sets of jobs for execution on a Hadoop cluster. Basically, Hadoop Hive organizes data into tables, providing a method for attaching structure to data stored in HDFS.
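As a quick sketch (the table and column names here are made up for illustration), Hive lets you define a table over files in HDFS and then query it with familiar SQL, which Hive compiles into MapReduce jobs behind the scenes:

```sql
-- Define a table over tab-delimited files stored in HDFS
CREATE TABLE page_views (
  user_id   BIGINT,
  page      STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- This SQL-like query is compiled into one or more MapReduce jobs
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page;
```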
Facebook uses Hive to address its various requirements, such as running thousands of tasks on a cluster for thousands of users across a huge variety of applications. Since Facebook has a huge amount of raw data, about 2 PB, Hadoop Hive is used to store this voluminous data, and it loads around 15 TB of new data daily. Today, many other companies, such as IBM, Amazon, and Yahoo!, are also using and developing Hive.
Why do we need Hadoop Hive?
Let’s now talk about the need for Hive. To understand that, let’s see what Facebook did with its big data.
Basically, Facebook faced a lot of challenges before it finally implemented Apache Hive. One of those challenges was the sheer size of the data being generated daily. Traditional relational databases and SQL engines weren’t able to handle the pressure of such a huge amount of data, so Facebook started looking for better options. It began by using MapReduce to overcome this problem, but MapReduce was very difficult to work with because it required programming expertise in Java. Later on, Facebook realized that Hadoop Hive had the potential to overcome these challenges.
Apache Hive saves developers from writing complex MapReduce tasks. Hadoop Hive is fast, scalable, and extensible, and since HiveQL is similar to SQL, it is easy for SQL developers to write Hive queries as well. Hive reduces the complexity of MapReduce by providing an interface through which users can submit SQL-like queries, so, technically, you don’t need to learn Java to work with Apache Hive.
Let’s now talk about the Hadoop Hive architecture and the major working force behind Apache Hive. The components of Apache Hive are as follows:
- Driver: The driver acts as a controller that receives HiveQL statements. It starts the execution of a statement by creating sessions and monitors the life cycle and progress of the execution. It also stores important metadata generated during the execution of the HiveQL statement.
- Metastore: The metastore stores the metadata of all tables, including partition metadata, which helps the driver track datasets distributed across the cluster. The metadata is stored in an RDBMS.
- Compiler: The compiler performs the compilation of a HiveQL query. It transforms the query into an execution plan that contains tasks.
- Optimizer: The optimizer performs many transformations on the execution plan to produce an optimized DAG. It can aggregate several transformations together, such as converting a pipeline of joins into a single join, and can also split tasks for better performance.
- Executor: After the processes of compilation and optimization are completed, the execution of the task is done by the executor. It is responsible for pipelining the tasks.
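To see the compiler's and optimizer's work for yourself, Hive provides the EXPLAIN statement, which prints the query plan (the DAG of stages) without running the query. A minimal sketch, assuming a hypothetical page_views table exists:

```sql
-- Prints the stage plan (typically a map-reduce stage plus a
-- fetch stage) instead of executing the query
EXPLAIN
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page;
```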
Differences Between Hive and Pig
| Hive | Pig |
|---|---|
| Used for data analysis | Used for data and programs |
| Used for processing structured data | Used for semi-structured data |
| Has HiveQL | Has Pig Latin |
| Used for creating reports | Used for programming |
| Works on the server side | Works on the client side |
| Does not support Avro | Supports Avro |
Features of Apache Hive
Let’s now look at the features of Apache Hive:
- Hive provides easy data summarization, analysis, and query support.
- Hive supports external tables, making it feasible to process data without having to load it into the Hive warehouse.
- Hadoop’s native interfaces are low-level; Hive provides a higher-level, SQL-like abstraction on top of them.
- Hive supports partitioning of data for better query performance.
- There is a rule-based optimizer present in Hive responsible for optimizing logical plans.
- Hadoop can process external data using Hive.
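External tables are a good illustration of the last two points. A sketch, assuming log files already sit under a hypothetical HDFS path /data/raw/server_logs:

```sql
-- Hive reads the files in place; DROP TABLE on an external table
-- removes only the metadata and leaves the files untouched
CREATE EXTERNAL TABLE server_logs (
  ts      STRING,
  level   STRING,
  message STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/raw/server_logs';
```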
Limitations of Apache Hive
Though Hive is a powerful tool, it has some limitations as well.
- Apache Hive doesn’t support real-time queries.
- Online transaction processing (OLTP) is not well supported by Apache Hive.
- Hive queries have relatively high latency because they are executed as batch jobs.
That is all for this Apache Hive tutorial. In this section, you learned about Apache Hive, which sits on top of Hadoop and is used for data analysis. It uses a language called HiveQL that translates SQL-like queries into the relevant MapReduce jobs. In the upcoming section of this Hadoop tutorial, you will learn about Hadoop clusters.
Overview of Hive:
All industries deal with Big Data, i.e., large amounts of data, and Hive is a tool used to analyze it. Apache Hive is a tool in which data is stored for analysis and querying.
Apache Hive: A data warehouse infrastructure based on the Hadoop framework that is well suited for data summarization, analysis, and querying. It uses a SQL-like language called HQL (Hive Query Language).
HQL: The query language of Hive. Queries written in HQL are translated into MapReduce jobs, and custom MapReduce scripts can be plugged in for more sophisticated analysis of the data.
Table: A table in Hive is a logical grouping of data, stored as files in a warehouse directory (or an external location).
Components of Hive:
- Metastore: The metastore is where the schemas of the Hive tables are stored; it holds the information about the tables and partitions in the warehouse.
- SerDe: Serializer/Deserializer, which gives Hive instructions on how to read and write records
- Hive interfaces: the web UI, the Hive command line, and HDInsight (Windows Server)
Hive Function Meta commands:
- SHOW FUNCTIONS: Lists Hive functions and operators
- DESCRIBE FUNCTION [function_name]: Displays a short description of the given function
- DESCRIBE FUNCTION EXTENDED [function_name]: Displays an extended description of the given function
Types of Hive Functions:
- UDF (user-defined function): A function that takes one or more columns from a row as arguments and returns a single value
- UDTF (user-defined tabular function): A function that takes zero or more inputs and produces multiple columns or rows of output
- Macros: A function that is defined in terms of other Hive functions
- User-defined aggregate functions (UDAF): A function that takes multiple rows or columns and returns an aggregation of the data
- User-defined table-generating functions: A function that takes a column from a single record and splits it into multiple rows
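As a sketch of these categories in action (the employees table and its columns are hypothetical), using the built-in UDF upper(), the built-in UDTF explode(), and a macro:

```sql
-- UDF: one value in, one value out
SELECT upper(name) FROM employees;

-- UDTF: one array in, one row out per element
SELECT explode(skills) AS skill FROM employees;

-- Macro: a small function defined in terms of other Hive functions
CREATE TEMPORARY MACRO double_it(x INT) x * 2;
SELECT double_it(salary) FROM employees;
```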
Indexes: Indexes are created to speed up access to columns in the database
- Syntax: CREATE INDEX <index_name> ON TABLE <table_name> (<column_name>) AS '<index_handler>' WITH DEFERRED REBUILD
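A fuller sketch of index creation on a hypothetical sales table (note that the index feature was removed in Hive 3.0, so this applies only to older versions):

```sql
-- Create a compact index, then build it in a separate step
CREATE INDEX idx_category
ON TABLE sales (category)
AS 'COMPACT' WITH DEFERRED REBUILD;

ALTER INDEX idx_category ON sales REBUILD;
```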
Thrift: The Thrift service provides remote access to Hive from other processes
Metastore: A service that stores metadata information, such as table schemas
HCatalog: A metadata and table management system for the Hadoop platform that enables data stored in various formats to be shared across processing tools
Hive SELECT statement syntax using HQL:
SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
- SELECT: The projection operator in HiveQL; it scans the table specified by the FROM clause
- WHERE: Specifies the condition used to filter rows
- GROUP BY: Takes a list of columns and specifies how to aggregate the records
- CLUSTER BY, DISTRIBUTE BY, SORT BY: Specify how rows are distributed among reducers and the sort order within each reducer
- LIMIT: Specifies how many records to retrieve
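Putting these clauses together (the txn_records and page_views tables here are hypothetical):

```sql
-- WHERE filters rows, GROUP BY aggregates, LIMIT caps the output
SELECT category, SUM(amount) AS total
FROM txn_records
WHERE amount > 0
GROUP BY category
LIMIT 10;

-- DISTRIBUTE BY routes rows to reducers; SORT BY orders rows
-- within each reducer (not globally, unlike ORDER BY)
SELECT user_id, page
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id, page;
```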
Partitioner: The partitioner controls the partitioning of the keys of the intermediate map outputs, typically with a hash function; the number of partitions is the same as the number of reduce tasks for the job
Partitioning: Used to distribute load horizontally; it is a way of dividing a table into related parts based on the values of columns such as date, city, or department
Bucketing: A technique for decomposing datasets into more manageable parts
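Both techniques appear in a table's DDL. A sketch with hypothetical names:

```sql
-- Each order_date value becomes an HDFS subdirectory (partitioning);
-- within each partition, rows are hashed on order_id into 32 files (bucketing)
CREATE TABLE orders (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (order_id) INTO 32 BUCKETS;
```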
Hive commands in HQL:
Data Definition Language (DDL): It is used to build or modify tables and objects stored in a database
Some of the DDL commands are as follows:
- To create a database in Hive: CREATE DATABASE <database_name>
- To list the databases created in a Hive warehouse: SHOW DATABASES
- To use a database: USE <database_name>
- To describe the associated database in the metastore: DESCRIBE DATABASE <database_name>
- To alter a database: ALTER DATABASE <database_name> SET DBPROPERTIES (…)
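A short session stringing these DDL commands together (the database name sales_db and its property are made up):

```sql
CREATE DATABASE sales_db;
SHOW DATABASES;
USE sales_db;
DESCRIBE DATABASE sales_db;
ALTER DATABASE sales_db SET DBPROPERTIES ('owner' = 'analytics');
```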
Data Manipulation Language (DML): These statements are used to retrieve, store, modify, delete, insert and update data in a database
- Inserting data into a database: The LOAD statement is used to move data into a particular Hive table:
LOAD DATA [LOCAL] INPATH '<file_path>' INTO TABLE <table_name>
- Drop table: The DROP TABLE statement deletes the data and metadata of a table:
DROP TABLE <table_name>
- Aggregation: Used to count the distinct categories in a table:
SELECT COUNT(DISTINCT category) FROM <table_name>;
- Grouping: The GROUP BY clause groups the result set by one or more columns, aggregating within each group:
SELECT category, SUM(amount) FROM <table_name> GROUP BY category
- To exit the Hive shell: use the command quit
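A short DML session, assuming a hypothetical txn_records table and a local file /tmp/txn.csv:

```sql
-- Move data from the local file system into the table
LOAD DATA LOCAL INPATH '/tmp/txn.csv' INTO TABLE txn_records;

-- Aggregation and grouping
SELECT COUNT(DISTINCT category) FROM txn_records;
SELECT category, SUM(amount) FROM txn_records GROUP BY category;

-- Remove the table's data and metadata
DROP TABLE txn_records;
```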
Hive data types:
- Integral data types: TINYINT, SMALLINT, INT, and BIGINT
- String types:
- STRING
- VARCHAR: length 1 to 65535
- CHAR: length 1 to 255
- Timestamp: It supports the traditional Unix timestamp with optional nanosecond precision
- Union type: It is a collection of heterogeneous data types.
- Syntax: UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
- Complex types:
- Arrays: Syntax- ARRAY<data_type>
- Maps: Syntax- MAP<primitive_type, data_type>
- Structs: Syntax- STRUCT<col_name : data_type [COMMENT col_comment], …>
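A sketch of the complex types in a single (hypothetical) table definition and query:

```sql
CREATE TABLE staff (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  address STRUCT<city:STRING, zip:STRING>
);

SELECT name,
       skills[0],       -- array element by index
       scores['hive'],  -- map lookup by key
       address.city     -- struct field access
FROM staff;
```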
Operations that can be performed on Hive:
| Operation | Hive query |
|---|---|
| To retrieve information | SELECT from_columns FROM table WHERE conditions; |
| To select all values | SELECT * FROM table; |
| To select particular category values | SELECT * FROM table WHERE rec_name = "value"; |
| To select on multiple criteria | SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2"; |
| To select specific columns | SELECT column_name FROM table; |
| To retrieve unique output records | SELECT DISTINCT column_name FROM table; |
| For sorting | SELECT col1, col2 FROM table ORDER BY col2; |
| For sorting in reverse | SELECT col1, col2 FROM table ORDER BY col2 DESC; |
| For counting rows in the table | SELECT COUNT(*) FROM table; |
| For grouping along with counting | SELECT owner, COUNT(*) FROM table GROUP BY owner; |
| For selecting the maximum value | SELECT MAX(col_name) FROM table; |
| For selecting from multiple tables (join) | SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name); |
Metadata functions and query used for operations:
| Function | Hive query or command |
|---|---|
| Selecting a database | USE database; |
| Listing databases | SHOW DATABASES; |
| Listing tables in a database | SHOW TABLES; |
| Describing the format of a table | DESCRIBE [FORMATTED \| EXTENDED] table; |
| Creating a database | CREATE DATABASE db_name; |
| Dropping a database | DROP DATABASE db_name [CASCADE]; |
Command Line statements:
| Function | Command |
|---|---|
| To run a query | hive -e 'select a.col from tab1 a' |
| To run a query in silent mode | hive -S -e 'select a.col from tab1 a' |
| To set Hive configuration variables | hive -e 'select a.col from tab1 a' -hiveconf hive.root.logger=DEBUG,console |
| To use an initialization script | hive -i initialize.sql |
| To run a non-interactive script | hive -f script.sql |
| To run a script inside the shell | source file_name |
| To run an HDFS list command | dfs -ls /user |
| To run a bash command from the shell | !ls |
| To set configuration variables | set mapred.reduce.tasks=32 |
| Tab auto-completion | set hive.<TAB> |
| To display all configuration variables | set |
| To reset all variables | reset |
| To add JAR files to the distributed cache | add jar jar_path |
| To display all the JARs in the distributed cache | list jars |
| To delete JARs from the distributed cache | delete jar jar_name |