Apache Hive

Apache Hive is an open source data warehouse system for querying and analyzing large data sets that are principally stored in Hadoop files. It is commonly deployed as part of the ecosystem of compatible tools built on the Hadoop framework for handling large data sets in a distributed computing environment.

Like Hadoop, Hive has roots in batch processing techniques. It originated in 2007 with developers at Facebook who sought to provide SQL access to Hadoop data for analytics users. Like Hadoop, Hive was developed to address the need to handle petabytes of data accumulating via web activity. Release 1.0 became available in February 2015.

How Apache Hive works

Initially, Hadoop processing relied solely on the MapReduce framework, which required users to understand advanced styles of Java programming in order to query data successfully. The motivation behind Apache Hive was to simplify query development and, in turn, open up Hadoop's unstructured data to a wider group of users in organizations.

Hive has three main functions: data summarization, query and analysis. It supports queries expressed in a language called HiveQL, or HQL, a declarative SQL-like language that, in its first incarnation, automatically translated SQL-style queries into MapReduce jobs executed on the Hadoop platform. In addition, HiveQL supported custom MapReduce scripts to plug into queries.
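As a minimal sketch of what that SQL-like surface looks like, the following HiveQL query runs against a hypothetical page_views table (the table and columns are illustrative, not from the original text); Hive compiles the declarative statement into one or more jobs behind the scenes:

    -- Hypothetical table; Hive translates this declarative query into
    -- MapReduce (or, in later versions, Tez/Spark) jobs automatically.
    SELECT country, COUNT(*) AS views
    FROM page_views
    WHERE view_date = '2015-02-01'
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10;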

When SQL queries are submitted via Hive, they are initially received by a driver component that creates session handles and forwards requests, via Java Database Connectivity/Open Database Connectivity (JDBC/ODBC) interfaces, to a compiler, which subsequently forwards jobs for execution. Hive enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive metastore.
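The metadata the metastore keeps for a table (its schema, serialization format and HDFS location) can be inspected directly from HiveQL; the table name below is hypothetical:

    -- List the tables registered in the metastore, then show one table's
    -- columns, SerDe and storage location (table name is illustrative).
    SHOW TABLES;
    DESCRIBE FORMATTED page_views;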

Diagram – Apache Hive brings SQL capabilities to Hadoop analytics. A driver component creates session handles and links to a compiler, and work is forwarded for execution. Live Long and Process (LLAP) daemons can separately handle I/O, caching and query fragment execution, boosting performance.

How Hive has evolved

Like Hadoop, Hive has evolved to encompass more than just MapReduce. The inclusion of the YARN resource manager in Hadoop 2.0 expanded developers' ability to use Hive, as it did for other Hadoop ecosystem components. Over time, HiveQL has gained support for the Apache Spark SQL engine as well as the Hive engine, and both have added support for distributed process execution via Apache Tez and Spark.
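As a sketch of that flexibility, the execution engine can be chosen per session with a standard Hive setting; 'mr', 'tez' and 'spark' are the values Hive documents, though which engines are actually available depends on the installation:

    -- Choose the engine used to run subsequent queries in this session.
    SET hive.execution.engine=tez;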

Early Hive file support comprised text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and Record Columnar Files (RCFiles), which store a table's columns in the manner of a columnar database. Hive's columnar storage support has since grown to include Optimized Row Columnar (ORC) files and Parquet files.
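A table's file format is declared when it is created; the following sketch uses hypothetical tables to show the syntax for the formats named above:

    -- Hypothetical tables illustrating Hive's storage formats.
    CREATE TABLE logs_text (line STRING)
    STORED AS TEXTFILE;                      -- plain flat file

    CREATE TABLE logs_orc (ts STRING, msg STRING)
    STORED AS ORC;                           -- Optimized Row Columnar

    CREATE TABLE logs_parquet (ts STRING, msg STRING)
    STORED AS PARQUET;                       -- Parquet columnar format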

Hive execution speed and interactivity drew attention almost from the project's inception, because query performance lagged that of more familiar SQL engines. In 2013, to boost performance, Apache Hive committers began work on the Stinger project, which brought Apache Tez and directed acyclic graph processing to the warehouse system.

Also accompanying Stinger were new approaches that improved performance by adding a cost-based optimizer, in-memory hash joins, a vectorized query engine and other enhancements. Query rates reaching 100,000 queries per hour and analytics processing of 100 million rows per second, per node, have been reported for recent versions of Hive.
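The optimizer and vectorization features mentioned above are exposed as session settings; both parameters below are real Hive settings, although their default values vary by release:

    -- Enable the cost-based optimizer and the vectorized query engine.
    SET hive.cbo.enable=true;
    SET hive.vectorized.execution.enabled=true;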

Additions accompanying release 2.3 in 2017 and release 3.0 in 2018 furthered Apache Hive's development. Among the highlights were support for Live Long and Process (LLAP) functionality, which allows prefetching and caching of columnar data, and support for atomicity, consistency, isolation and durability (ACID) operations, including INSERT, UPDATE and DELETE. Work also began on materialized views and automatic query rewriting capabilities familiar to traditional data warehouse users.
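A minimal sketch of the ACID support, with a hypothetical table: ACID tables must be stored as ORC and flagged as transactional, and depending on the Hive version additional configuration (a transaction manager, and in earlier releases bucketing) is also required:

    -- Hypothetical transactional table; ORC storage plus the
    -- 'transactional' property are prerequisites for ACID operations.
    CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    INSERT INTO accounts VALUES (1, 100.00);
    UPDATE accounts SET balance = 150.00 WHERE id = 1;
    DELETE FROM accounts WHERE id = 1;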

Hive supporters and alternatives

Committers to the Apache Hive community project have included individuals from Cloudera, Hortonworks, Facebook, Intel, LinkedIn, Databricks and others. Hive is supported in Hadoop distributions; as with the HBase NoSQL database, it is very commonly implemented as part of distributed Hadoop data processing applications. Hive is available for download from the Apache Software Foundation, as well as from the Hadoop distribution providers Cloudera, MapR and Hortonworks, and as part of AWS Elastic MapReduce. The latter implementation supports analysis of data sets residing in Simple Storage Service (S3) object storage.

Apache Hive was among the very first attempts to bring SQL querying capabilities to the Hadoop ecosystem. Among a host of other SQL-on-Hadoop alternatives that have arisen are BigSQL, Drill, Hadapt, Impala and Presto. Also, Apache Pig has emerged as an alternative language to HiveQL for Hadoop-oriented data warehousing.

Architecture and Working of Hive

The major components of Hive and their interaction with Hadoop are shown in the figure below, and each component is described here:

  • User Interface (UI)
    As the name suggests, the user interface provides an interface between the user and Hive. It enables users to submit queries and other operations to the system. The Hive web UI, the Hive command line, and Hive HDInsight (on Windows Server) are supported.
  • Driver
    The driver receives the user's queries from the interface. It implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
  • Compiler
    The compiler parses the query, performs semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of table and partition metadata looked up from the metastore.
  • Metastore
    The metastore stores all the structural information for the tables and partitions in the warehouse, including column and column-type details, the serializers and deserializers needed to read and write data, and the corresponding HDFS files where the data is stored. Hive uses a relational database server to store this schema or metadata for databases, tables, table attributes, data types and HDFS mappings.
  • Execution Engine
    The execution engine carries out the execution plan created by the compiler. The plan is a DAG of stages; the execution engine manages the dependencies between the various stages of the plan and executes each stage on the appropriate system component (the plan can be inspected with EXPLAIN, as sketched after this list).
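The stage DAG that the compiler produces can be viewed without running the query; the table in this sketch is hypothetical:

    -- Prints the plan as a DAG of stages (map/reduce work, moves,
    -- fetches) without executing the query.
    EXPLAIN
    SELECT country, COUNT(*)
    FROM page_views
    GROUP BY country;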

Diagram – Architecture of Hive, built on top of Hadoop

The diagram above shows, alongside the architecture, the step-by-step job execution flow in Hive with Hadoop.

  • Step-1: Execute Query
    A Hive interface, such as the command line or the web UI, delivers the query to the driver for execution. The UI calls the driver's execute interface, for example over ODBC or JDBC.
  • Step-2: Get Plan
    The driver creates a session handle for the query and passes the query to the compiler to generate an execution plan.
  • Step-3: Get Metadata
    The compiler sends a metadata request to the metastore to obtain the metadata it needs.
  • Step-4: Send Metadata
    The metastore returns the metadata to the compiler.
  • Step-5: Send Plan
    The compiler returns the generated execution plan to the driver.
  • Step-6: Execute Plan
    The driver sends the execution plan to the execution engine, which carries it out:
    • Execute Job
    • Job Done
    • DFS operation (metadata operation)
  • Step-7: Fetch Results
    The user interface (UI) fetches the results from the driver.
  • Step-8: Send Results
    The driver forwards the fetch request to the execution engine. When the execution engine retrieves the results from the data nodes, it returns them to the driver, which passes them on to the user interface (UI).

Difference between Apache Hive and Apache Spark SQL

1. Apache Hive:
Apache Hive is a data warehouse system built on top of Apache Hadoop that enables convenient data summarization, ad hoc queries, and the analysis of massive datasets stored in the various databases and file systems that integrate with Hadoop, including the MapR Data Platform with MapR XD and MapR Database. Hive offers an easy way to apply structure to large quantities of unstructured data and then run batch SQL-like queries on that data.

2. Apache Spark SQL:
Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark's distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables. Unifying these powerful abstractions makes it easy for developers to intermix SQL commands querying external data with complex analytics, all within a single application.

Difference Between Apache Hive and Apache Spark SQL:

S.No. | Apache Hive | Apache Spark SQL
1. | It is an open source data warehouse system, built on top of Apache Hadoop. | It is a structured data processing system that processes information using SQL.
2. | It stores large data sets in Hadoop files for analyzing and querying purposes. | It computes heavy functions using appropriate optimization techniques for processing a task.
3. | It was released in the year 2012. | It first came into the picture in 2014.
4. | Its implementation mainly uses Java. | It can be used from various languages such as R, Python and Scala.
5. | Its latest version (2.3.2) was released in 2017. | Its latest version (2.3.0) was released in 2018.
6. | Mainly an RDBMS is used as its database model. | It can be integrated with any NoSQL database.
7. | It can run on any OS, provided a JVM environment is available. | It supports various operating systems such as Linux and Windows.
8. | Access methods for its processing include JDBC, ODBC and Thrift. | It can be accessed only by ODBC and JDBC.

Apache Hive is a data warehouse and ETL tool that provides a SQL-like interface between the user and the Hadoop Distributed File System (HDFS), with which it integrates. It is built on top of Hadoop as a software project that provides data query and analysis. It facilitates reading, writing and handling large datasets that are stored in distributed storage and queried with Structured Query Language (SQL) syntax. It is not built for online transaction processing (OLTP) workloads; it is frequently used for data warehousing tasks such as data encapsulation, ad hoc queries, and the analysis of huge datasets. It is designed for scalability, extensibility, performance, fault tolerance and loose coupling with its input formats.

Hive was initially developed by Facebook and later adopted by companies such as Amazon and Netflix. It delivers standard SQL functionality for analytics. Without it, traditional SQL queries would have to be written in the MapReduce Java API to run over distributed data. Hive provides portability, as most data warehousing applications already work with SQL-based query languages.

Components of Hive:

  1. HCatalog
    HCatalog is a Hive component that serves as a table and storage management layer for Hadoop. It enables users with different data processing tools, such as Pig and MapReduce, to read and write data on the grid easily.
  2. WebHCat
    WebHCat provides a service that users can call to run Hadoop MapReduce (or YARN), Pig, or Hive jobs, or to perform Hive metadata operations, through an HTTP (REST) interface.

Modes of Hive:

Hive operates in two major modes, described below. The choice between them depends on the number of data nodes in Hadoop.

  1. Local Mode
    Local mode is used when Hadoop is installed in pseudo-distributed mode with only one data node, when the data is small enough to be restricted to a single local machine, and when processing smaller datasets present on the local machine will be faster (a sketch of the relevant session settings follows this list).
  2. MapReduce Mode
    MapReduce mode is used when Hadoop is built with multiple data nodes and the data is divided across the nodes. It operates on huge datasets, executes queries in parallel, and achieves better performance when processing large datasets.
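As a sketch of local-mode execution, Hive can fall back to local runs automatically for small jobs. hive.exec.mode.local.auto is a real Hive setting; the threshold parameters and values shown are illustrative of how the qualification limits are tuned:

    -- Let Hive run small jobs locally instead of submitting to the cluster.
    SET hive.exec.mode.local.auto=true;
    -- Illustrative thresholds a job must stay under to qualify:
    SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB input
    SET hive.exec.mode.local.auto.input.files.max=4;         -- at most 4 files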

Characteristics of Hive:

  1. Databases and tables are built before the data is loaded.
  2. Hive, as a data warehouse, is built to manage and query only structured data residing in tables.
  3. When handling structured data, MapReduce lacks optimization and usability features such as UDFs, whereas the Hive framework provides both.
  4. Programming in Hadoop deals directly with files, so Hive can partition the data with directory structures to improve performance on certain queries (see the sketch after this list).
  5. Hive supports a variety of file formats, such as TEXTFILE, SEQUENCEFILE, ORC and RCFILE.
  6. Hive uses the Derby database for single-user metadata storage and MySQL for multi-user or shared metadata.
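A sketch of characteristic 4, with a hypothetical table: each partition value becomes its own HDFS subdirectory, so queries that filter on the partition column can skip all other directories:

    -- Hypothetical partitioned table; each dt value maps to a
    -- subdirectory such as .../sales/dt=2018-01-01/.
    CREATE TABLE sales (item STRING, amount DOUBLE)
    PARTITIONED BY (dt STRING)
    STORED AS ORC;

    -- Filtering on dt reads only the matching partition directories.
    SELECT SUM(amount) FROM sales WHERE dt = '2018-01-01';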

Features of Hive:

  1. It provides indexes, including bitmap indexes, to accelerate queries; compact and bitmap index types are available as of release 0.10.
  2. Storing metadata in an RDBMS reduces the time needed to perform semantic checks during query execution.
  3. It includes built-in user-defined functions (UDFs) for manipulating strings, dates and other data-mining primitives, and the UDF set can be extended to handle use cases not covered by the predefined functions (a sketch follows this list).
  4. Algorithms such as DEFLATE, BWT and Snappy allow Hive to operate on compressed data stored in the Hadoop ecosystem.
  5. It stores schemas in a database and processes the data in the Hadoop Distributed File System (HDFS).
  6. It is built for online analytical processing (OLAP).
  7. It delivers a querying language commonly known as Hive Query Language (HQL or HiveQL).
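A sketch of feature 3, using standard Hive built-ins on a hypothetical customers table (the table and columns are illustrative):

    -- upper(), concat() and datediff() are built-in Hive functions;
    -- current_date is available in Hive 1.2 and later.
    SELECT upper(last_name)                    AS last_name_uc,
           concat(first_name, ' ', last_name)  AS full_name,
           datediff(current_date, signup_date) AS days_since_signup
    FROM customers;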