Apache HBase

Apache HBase is a popular and highly efficient column-oriented NoSQL database built on top of the Hadoop Distributed File System (HDFS). It allows performing read/write operations on large datasets in real time using key/value data.

It is an open-source, horizontally scalable platform: a distributed, column-oriented database built on top of the Hadoop file system and modeled as a non-relational (NoSQL) system. HBase is a faithful, open-source implementation of Google's Bigtable.

Column-oriented databases store data tables as sections of columns of data rather than as rows of data. HBase is a distributed, persistent, strictly consistent storage system with near-optimal writes in terms of I/O channel saturation and excellent read performance. It makes efficient use of disk space by supporting pluggable compression algorithms that can be chosen based on the nature of the data in a particular set of column families.
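
For example, the compression codec can be chosen per column family when a table is created. A minimal sketch in the HBase shell (the table and family names are illustrative):

# each column family can carry its own compression setting
create 'metrics', {NAME => 'raw', COMPRESSION => 'SNAPPY'}, {NAME => 'summary'}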

HBase handles load shifting and failures elegantly and transparently to the client. Scalability is built in, and clusters can be grown or shrunk while the system remains in production. Changing the cluster does not involve any difficult rebalancing or resharding procedure; it is fully automated.

Apache HBase is an Apache Hadoop project: an open-source, non-relational, distributed Hadoop database that had its genesis in Google's Bigtable. HBase is written in Java. Today it is an integral part of the Apache Software Foundation and the Hadoop ecosystem. It is a high-availability database that runs exclusively on top of HDFS and provides the capabilities of Google Bigtable to the Hadoop framework, storing huge volumes of unstructured data at breakneck speeds in order to derive valuable insights from it.

It has an extremely fault-tolerant way of storing data and is extremely good for storing sparse data. Working with sparse data is like looking for a needle in a haystack. A real-life example would be looking for someone who has spent over $100,000 in a single transaction on Amazon among the tens of millions of transactions that happen in any given week.

Criteria         HBase
Cluster basis    Hadoop
Deployed for     Batch Jobs
API              Thrift or REST

The Architecture of Apache HBase

Apache HBase carries all the features of the original Google Bigtable paper, such as Bloom filters, in-memory operations, and compression. The tables of this database can serve as the input for MapReduce jobs on the Hadoop ecosystem, and they can also serve as the output after the data is processed by MapReduce. The data can be accessed via the Java API, through the REST API, or even through the Thrift and Avro gateways.

HBase is basically a column-oriented key-value data store, and since it works extremely well with the kind of data that Hadoop processes, it is a natural fit for deployment as a top layer on HDFS. It is extremely fast for both read and write operations and does not lose this extremely important quality even when the datasets are humongous. Therefore, it is widely used by corporations for its high throughput and low input/output latency. It cannot work as a replacement for an SQL database, but it is perfectly possible to have an SQL layer on top of HBase to integrate it with the various business intelligence and analytics tools.

HBase does not support SQL scripting itself; instead, client code is written in Java, much like what we do for a MapReduce application.

Why should you use the HBase technology?

HBase is one of the core components of the Hadoop ecosystem, along with the other two: HDFS and MapReduce. As part of the Hortonworks Data Platform, the Apache Hadoop ecosystem is available as a highly secure, enterprise-ready big data framework. HBase is regularly deployed by some of the biggest companies, such as Facebook for its messaging system. Some of the salient features that make HBase one of the most sought-after data storage systems are as follows:

  • It has a completely distributed architecture and can work on extremely large-scale data
  • It works well for highly random read and write operations
  • It has high security and easy management of data
  • It provides an unprecedentedly high write throughput
  • Scaling to meet additional requirements is seamless and quick
  • Can be used for both structured and semi-structured data types
  • It is good when you don’t need full RDBMS capabilities
  • It has a perfectly modular and linear scalability feature
  • The data reads and writes are strictly consistent
  • Table sharding can be easily configured and automated
  • Automatic failover support is provided among the various servers
  • MapReduce jobs can be backed by HBase tables
  • Client access is seamless with Java APIs.

What is the scope of Apache HBase?

One of the most important features of HBase is that it can handle datasets numbering in billions of rows and millions of columns. It combines extremely well data sources coming from a wide variety of types, structures, and schemas. The best part is that it integrates natively with Hadoop to provide a seamless fit, and it also works extremely well with YARN. HBase provides very low-latency access over fast-changing and humongous amounts of data.

Why do we need this technology and what is the problem that it is solving?

HBase is a very progressive NoSQL database that is seeing increased use in today's world, which is overwhelmed with big data. It has simple Java programming roots and can be deployed at a very large scale. There are a lot of business scenarios in which we work exclusively with sparse data, that is, looking for a handful of data fields matching certain criteria among data fields numbering in the billions. It is extremely fault-tolerant and resilient and can work on multiple types of data, making it useful for varied business scenarios.

Its column-oriented tables make it very easy to look for the right data among billions of data fields. You can easily shard the data into tables with the right configuration and automation. HBase is perfectly suited for analytical processing of data. Since analytical processing involves huge amounts of data, queries can exceed what is possible on a single server; this is where distributed storage comes into the picture.

There is also a need for handling large amounts of reads and writes, which is just not possible using an RDBMS, so HBase is the perfect candidate for such applications. The read/write capacity of this technology can be scaled to even millions of operations per second, giving it an unprecedented advantage. Facebook uses it extensively for real-time messaging applications, and Pinterest uses it for multiple tasks running up to 5 million operations per second.

Using Apache HBase to store and access data

Configuring HBase and Hive

Follow this step to complete the configuration:

Modify the hive-site.xml configuration file to add the required paths to the JARs that Hive uses to write data into HBase. The full list of JARs to add can be seen by running the command hbase mapredcp on the command line.

<property>
<name>hive.aux.jars.path</name>
<value>
file:///usr/hdp/3.0.1.0-61/hbase/lib/commons-lang3-3.6.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-zookeeper-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-mapreduce-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/jackson-annotations-2.9.5.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-shaded-miscellaneous-2.1.0.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/jackson-databind-2.9.5.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-hadoop-compat-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-metrics-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-client-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-protocol-shaded-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/jackson-core-2.9.5.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/protobuf-java-2.5.0.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-shaded-netty-2.1.0.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/metrics-core-3.2.1.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-server-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-hadoop2-compat-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-metrics-api-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-common-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-protocol-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-shaded-protobuf-2.1.0.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/htrace-core4-4.2.0-incubating.jar,
file:///usr/hdp/3.0.1.0-61/zookeeper/zookeeper-3.4.6.3.0.1.0-61.jar
</value>
</property>

Using HBase Hive integration

Before you begin to use the Hive HBase integration, complete the following steps:

  • Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can also register HBase tables directly in Hive using the HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat classes.
  • As part of the registration process, specify a column mapping. There are two SERDEPROPERTIES that control the HBase column mapping to Hive (a sketch using both follows this list):
    • hbase.columns.mapping
    • hbase.table.default.storage.type
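
A minimal sketch that registers a hypothetical table using both properties; with the "binary" storage type, numeric Hive values are stored in HBase's binary encoding rather than as UTF-8 strings:

-- hbase_binary_table is illustrative; "binary" stores the int values
-- as raw bytes instead of their string representations
CREATE EXTERNAL TABLE hbase_binary_table (key int, value int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf1:val",
  "hbase.table.default.storage.type" = "binary");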

HBase Hive integration example

A change to Hive in HDP 3.0 is that all StorageHandlers must be marked as "external"; there is no such thing as a non-external table created by a StorageHandler. If the corresponding HBase table exists when the Hive table is created, it will mimic the HDP 2.x semantics of an "external" table. If the corresponding HBase table does not exist when the Hive table is created, it will mimic the HDP 2.x semantics of a non-external table (e.g., the HBase table is dropped when the Hive table is dropped).

From the Hive shell, create an HBase table:

CREATE EXTERNAL TABLE hbase_hive_table (key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hbase_hive_table", "hbase.mapred.output.outputtable" = "hbase_hive_table");

The hbase.columns.mapping property is mandatory. The hbase.table.name property is optional. The hbase.mapred.output.outputtable property is also optional; it is needed if you plan to insert data into the table. From the HBase shell, access the hbase_hive_table:

$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Version: 0.20.3, r902334, Mon Jan 25 13:13:08 PST 2010

hbase(main):001:0> list 'hbase_hive_table'

1 row(s) in 0.0530 seconds

hbase(main):002:0> describe hbase_hive_table

Table hbase_hive_table is ENABLED

hbase_hive_table
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.2860 seconds

hbase(main):003:0> scan 'hbase_hive_table'

ROW                          COLUMN+CELL                                                                     

0 row(s) in 0.0060 seconds

Insert the data into the HBase table through Hive:

INSERT OVERWRITE TABLE HBASE_HIVE_TABLE SELECT * FROM pokes WHERE foo=98;

From the HBase shell, verify that the data got loaded:

hbase(main):009:0> scan 'hbase_hive_table'

ROW                        COLUMN+CELL                                                                     

 98                          column=cf1:val, timestamp=1267737987733, value=val_98                           

1 row(s) in 0.0110 seconds

From Hive, query the HBase data to view the data that is inserted in the hbase_hive_table:

hive> select * from HBASE_HIVE_TABLE;

Total MapReduce jobs = 1

Launching Job 1 out of 1

OK

98            val_98

Time taken: 4.582 seconds

Using Hive to access an existing HBase table example

Use the following steps to access the existing HBase table through Hive.

  • You can access the existing HBase table through Hive using the CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE hbase_table_2 (key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "some_existing_table", "hbase.mapred.output.outputtable" = "some_existing_table");

  • You can use different types of column mappings to map the HBase columns to Hive:
    • Multiple Columns and Families

To define four columns, the first being the rowkey: ":key,cf:a,cf:b,cf:c"
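
A minimal sketch of such a mapping (the table and column names are illustrative):

-- the first Hive column maps to the rowkey; the remaining three map to
-- qualifiers a, b, and c in the column family "cf"
CREATE EXTERNAL TABLE hbase_table_multi (key int, a string, b string, c string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:a,cf:b,cf:c");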

  • Hive MAP to HBase Column Family

When the Hive datatype is a Map, a column family with no qualifier may be used. This will use the keys of the Map as the column qualifiers in HBase: "cf:"
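
A minimal sketch (the table name and map column are illustrative):

-- every key of the "tags" map becomes a column qualifier in family "cf"
CREATE EXTERNAL TABLE hbase_table_map (key int, tags map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:");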

  • Hive MAP to HBase Column Prefix

When the Hive datatype is a Map, a prefix for the column qualifier can be provided, which will be prepended to the Map keys: "cf:prefix_.*"

Note: The prefix is removed from the column qualifier as compared to the key in the Hive Map. For example, for the above column mapping, a column of "cf:prefix_a" would result in a key in the Map of "a".
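
A minimal sketch (names are illustrative):

-- a Hive map key of "a" is written to HBase as qualifier "prefix_a"
CREATE EXTERNAL TABLE hbase_table_prefix (key int, tags map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:prefix_.*");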

  • You can also define composite row keys. Composite row keys use multiple Hive columns to generate the HBase row key.
    • Simple Composite Row Keys

A Hive column with a datatype of Struct will automatically concatenate all elements in the struct with the termination character specified in the DDL.
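A minimal sketch, assuming a two-part key whose fields are joined by "~" (the table and field names are illustrative):

-- the struct fields are concatenated with '~' to form the HBase rowkey
CREATE EXTERNAL TABLE hbase_composite (key struct<region:string, id:string>, value string)
ROW FORMAT DELIMITED COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val");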

  • Complex Composite Row Keys and HBaseKeyFactory

Custom logic can be implemented by writing Java code to implement a KeyFactory and provide it to the DDL using the table property key "hbase.composite.key.factory".
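
A minimal sketch, assuming a hypothetical user-supplied class com.example.MyKeyFactory:

-- com.example.MyKeyFactory is hypothetical; supply your own KeyFactory implementation
CREATE EXTERNAL TABLE hbase_custom_key (key struct<p1:string, p2:string>, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.composite.key.factory" = "com.example.MyKeyFactory");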

Understanding Bulk Loading

A common pattern in HBase for obtaining high rates of data throughput on the write path is "bulk loading", which generates HBase files (HFiles) in a specific format instead of shipping edits to HBase RegionServers. The Hive integration can generate HFiles, which is enabled by setting the property hive.hbase.generatehfiles to true, for example, `set hive.hbase.generatehfiles=true`. Additionally, the path to a directory to which to write the HFiles must be provided, for example, `set hfile.family.path=/tmp/hfiles`.
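
Putting the two properties together, a Hive session that generates HFiles for the table created earlier might look like the following sketch (the source table pokes is reused from the example above):

set hive.hbase.generatehfiles=true;
set hfile.family.path=/tmp/hfiles;
-- the INSERT now writes HFiles under /tmp/hfiles instead of sending edits
-- to the RegionServers
INSERT OVERWRITE TABLE hbase_hive_table SELECT * FROM pokes;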

After the Hive query finishes, you must execute the “completebulkload” action in HBase to bring the files “online” in your HBase table. For example, to finish the bulk load for files in “/tmp/hfiles” for the table “hive_data”, you might run on the command-line:

$ hbase completebulkload /tmp/hfiles hive_data

Understanding HBase Snapshots

When an HBase snapshot exists for an HBase table which a Hive table references, you can choose to execute queries over the “offline” snapshot for that table instead of the table itself.

First, set the property to the name of the HBase snapshot in your Hive script: `set hive.hbase.snapshot.name=my_snapshot`. A temporary directory is required to run the query over the snapshot. By default, a directory is chosen inside of “/tmp” in HDFS, but this can be overridden by using the property “hive.hbase.snapshot.restoredir”.
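
A minimal sketch of a Hive script that reads from a snapshot (the snapshot name and restore directory are illustrative):

-- my_snapshot must be an existing HBase snapshot of the underlying table
set hive.hbase.snapshot.name=my_snapshot;
-- optional: override the default temporary restore directory under /tmp
set hive.hbase.snapshot.restoredir=/tmp/snapshot_restore;
SELECT COUNT(*) FROM hbase_hive_table;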