Hadoop Ecosystem

Core Hadoop ecosystem is nothing but the different components that are built on the Hadoop platform directly. However, there are a lot of complex interdependencies between these systems.

HDFS

Starting from the base of the Hadoop ecosystem, there is HDFS or Hadoop Distributed File System. It is a system that allows you to distribute the storage of big data across a cluster of computers. That means, all of your hard drives look like a single giant cluster on your system. That’s not all; it also maintains the redundant copies of data. So, if one of your computers happen to randomly burst into flames or if some technical issues occur, HDFS can actually recover from that by creating a backup from a copy of the data that it had saved automatically, and you won’t even know if anything happened. So, that’s the power of HDFS, i.e., the data storage is in a distributed manner having redundant copies.

YARN

Next in the Hadoop ecosystem is YARN (Yet Another Resource Negotiator). It is the place where the data processing of Hadoop comes into play. YARN is a system that manages the resources on your computing cluster. It is the one that decides who gets to run the tasks, when and what nodes are available for extra work, and which nodes are not available to do so. So, it’s like the heartbeat of Hadoop that keeps your cluster going.

MapReduce

One interesting application that can be built on top of YARN is MapReduce. MapReduce, the next component of the Hadoop ecosystem, is just a programming model that allows you to process your data across an entire cluster. It basically consists of Mappers and Reducers that are different scripts, which you might write, or different functions you might use when writing a MapReduce program. Mappers have the ability to transform your data in parallel across your computing cluster in a very efficient manner; whereas, Reducers are responsible for aggregating your data together. This may sound like a simple model, but MapReduce is very versatile. Mappers and Reducers put together can be used to solve complex problems. We will talk about MapReduce in one of the upcoming sections of this Hadoop tutorial.

Apache Pig

Next up in the Hadoop ecosystem, we have a technology called Apache Pig. It is just a high-level scripting language that sits on top of MapReduce. If you don’t want to write Java or Python MapReduce codes and are more familiar with a scripting language that has somewhat SQL-style syntax, Pig is for you. It is a very high-level programming API that allows you to write simple scripts. You can get complex answers without actually writing Java code in the process. Pig Latin will transform that script into something that will run on MapReduce. So, in simpler terms, instead of writing your code in Java for MapReduce, you can go ahead and write your code in Pig Latin which is similar to SQL. By doing so, you won’t have to perform MapReduce jobs. Rather, just writing a Pig Latin code will perform MapReduce functions.

Hive

Now, in the Hadoop ecosystem, there comes Hive. It also sits on top of MapReduce and solves a similar type of problem like Pig, but it looks more like a SQL. So, Hive is a way of taking SQL queries and making the distributed data sitting on your file system somewhere look like a SQL database. It has a language known as Hive SQL. It is just a database in which you can connect to a shell client and ODBC (Open Database Connectivity) and execute SQL queries on the data that is stored on your Hadoop cluster even though it’s not really a relational database under the hood. If you’re familiar with SQL, Hive might be a very useful API or interface for you to use.

Apache Ambari

Apache Ambari is the next in the Hadoop ecosystem which sits on top of everything and gives you a view of your cluster. It is basically an open-source administration tool responsible for tracking applications and keeping their status.  It lets you visualize what runs on your cluster, what systems you’re using, and how much resources are being used. So, Ambari lets you have a view into the actual state of your cluster in terms of the applications that are running on it. It can be considered as a management tool that will manage the monitors along with the health of several Hadoop clusters.

Mesos

Mesos isn’t really a part of Hadoop, but it’s included in the Hadoop ecosystem as it is an alternative to YARN. It is also a resource negotiator just like YARN. Mesos and YARN solve the same problem in different ways. The main difference between Mesos and YARN is in their scheduler. In Mesos, when a job comes in, a job request is sent to the Mesos master, and what Mesos does is it determines the resources that are available and it makes offers back. These offers can be accepted or rejected. So, Mesos is another way of managing your resources in the cluster.

Apache Spark

Spark is the most interesting technology of this Hadoop ecosystem. It sits on the same level as MapReduce and right above Mesos to run queries on your data. It is mainly a real-time data processing engine developed in order to provide faster and easy-to-use analytics than MapReduce. Spark is extremely fast and is under a lot of active development. It is a very powerful technology as it uses the in-memory processing of data. If you want to efficiently and reliably process your data on the Hadoop cluster, you can use Spark for that. It can handle SQL queries, do Machine Learning across an entire cluster of information, handle streaming data, etc.

Tez

Tez is similar to Spark and is next in the Hadoop ecosystem it uses some of the same techniques as Spark. It tells you what MapReduce does as it produces a more optimal plan for executing your queries. Tez, when used in conjunction with Hive, tends to accelerate Hive’s performance. Hive is placed on top of MapReduce, but you can place it on top of Tez, as Hive through Tez can be a lot faster than Hive through MapReduce. They are both different means of optimizing queries together.

Apache HBase

Next up in the Hadoop ecosystem is HBase. It is set on the side, and it is a way of exposing data on your cluster to the transactional platform.  So, it is called the NoSQL database, i.e., it is a columnar data store that is a very fast database and is meant for large transaction rates. It can expose data stored in your cluster which might be transformed in some way by Spark or MapReduce. It provides a very fast way of exposing those results to other systems.

Apache Storm

Apache Storm is basically a way of processing streaming data. So, if you have streaming data from sensors or weblogs, you can actually process it in real time using Storm. Processing data doesn’t have to be a batch thing anymore; you can update your Machine Learning models or transform data into the database, all in real time, as the data comes in.

Oozie

Next up in the Hadoop ecosystem, there Oozie. Oozie is just a way of scheduling jobs on your cluster. So, if you have a task that needs to be performed on your Hadoop cluster involving different steps and maybe different systems, Oozie is the way for scheduling all these things together into jobs that can be run on some order. So, when you have more complicated operations that require loading data into Hive, integrating that with Pig, and maybe querying it with Spark, and then transforming the results into HBase, Oozie can manage all that for you and make sure that it runs reliably on a consistent basis.

ZooKeeper

ZooKeeper is basically a technology for coordinating everything on your cluster. So, it is a technology that can be used for keeping track of the nodes that are up and the ones that are down.  It is a very reliable way of keeping track of shared states across your cluster that different applications can use. Many of these applications rely on ZooKeeper to maintain reliable and consistent performance across a cluster even when a node randomly goes down. Therefore, ZooKeeper can be used for keeping track of which the master node is, which node is up, or which node is down. Actually, it’s even more extensible than that.

Data Ingestion

The below-listed systems in the Hadoop ecosystem are focused mainly on the problem of data ingestion, i.e., how to get data into your cluster and into HDFS from external sources. Let’s have a look at them.

  • Sqoop: Sqoop is a tool used for transferring data between relational database servers and Hadoop. Sqoop is used to import data from various relational databases like Oracle to Hadoop HDFS, MySQL, etc. and to export from HDFS to relational databases.
  • Flume: Flume is a service for aggregating, collecting, and moving large amounts of log data. Flume has a flexible and simple architecture that is based on streaming data flows. Its architecture is robust and fault-tolerant with reliable and recovery mechanisms. It uses the extensible data model that allows for online analytic applications. Flume is used to move the log data generated by application servers into HDFS at a higher speed.
  • Kafka: Kafka is also an open-source streaming data processing software that solves a similar problem as Flume. It is used for building real-time data pipelines and streaming apps reducing complexity. It is horizontally scalable and fault-tolerant. Kafka aims to provide a unified, low-latency platform to handle real-time data feeds. Asynchronous communication and messages can be established with the help of Kafka. This ensures reliable communication.

In the next section, we will be learning about HDFS in detail.

HDFS in Hadoop

So, what is HDFS? HDFS or Hadoop Distributed File System, which is completely written in Java programming language, is based on the Google File System (GFS). Google had only presented a white paper on this, without providing any particular implementation. It is interesting that around 90 percent of the GFS architecture has been implemented in HDFS.

HDFS was formerly developed as a storage infrastructure for the Apache Nutch web search engine project, and hence it was initially known as the Nutch Distributed File System (NDFS). Later on, the HDFS design was developed essentially for using it as a distributed file system.

HDFS is extremely fault-tolerant and can hold a large number of datasets, along with providing ease of access. The files in HDFS are stored across multiple machines in a systematic order. This is to eliminate all feasible data losses in the case of any crash, and it helps in making applications accessible for parallel processing. This file system is designed for storing a very large amount of files with streaming data access.

To know ‘What is HDFS in Hadoop?’ in detail, let’s first see what a file system is? Well, a file system is one of the fundamental parts of all operating systems. It basically administers the storage in the hard disk.

Why do you need another file system?

Now, you know ‘What is HDFS in Hadoop?’ It is basically a file system. But, the question here is, why do you need another file system?

Have you ever used a file system before?

The answer would be yes!

Let’s say, a person has a book and another has a pile of unordered papers from the same book and both of them need to open Chapter 3 of the book. Who do you think would get to Chapter 3 faster? The one with the book, right? Because, that person can simply go to the index, look for Chapter 3, check out the page number, and go to the page. Meanwhile, the one with the pile of papers has to go through the entire pile and if he is lucky enough, he might find Chapter 3. Just like a well-organized book, a file system helps navigate data that is stored in your storage.

Without a file system, the information stored in your hard disk will be a large body of data in which there would be no way to tell where one piece of information stops and the next begins.

The file system manages how a dataset is saved and retrieved. So, when reading and writing of files is done on your hard disk, the request goes through a distinct file system. The file system has some amount of metadata of your files such as size, filename, created time, owner, modified time, etc.

When you want to write a file to a hard disk, the file system helps in figuring out where in the hard disk the file should be written and how efficiently it can do so. How do you think the file system manages to do that? Since it has all the details about the hard disk, including the empty spaces available in it, it can directly write that particular file there.

Now, we will talk about HDFS, by working with Example.txt which is a 514 MB file.

When you upload a file into HDFS, it will automatically be split into 128 MB fixed-size blocks (In the older versions of Hadoop, the file used to be divided into 64 MB fixed-size blocks). So basically, it takes care of placing the blocks in three different DataNodes by replicating each block three times.

Now that you have understood why you need HDFS, next in this section on ‘What is HDFS?’ let’s see why it is a perfect match for big data.

HDFS is a perfect tool for working with big data. The following list of facts proves it.

  • HDFS uses the MapReduce method for accessing data, which is very fast.
  • HDFS follows the data coherency model, in which the data is synchronized across the server. It is very simple to implement and is highly robust and scalable.
  • HDFS is compatible with any kind of commodity hardware and operating system processors
  • As data is saved in multiple locations, it is safe enough.
  • It is conveniently accessible to use a web browser which makes it highly utilitarian.

Let’s see the architecture of HDFS.

HDFS Architecture

The following image gives the most important components present in the HDFS architecture. It has a Master-Slave architecture and has several components in it.

Let’s start with the basic two nodes in the HDFS architecture, i.e., the DataNode and the NameNode.

DataNode

Nodes wherein the blocks are physically stored are known as DataNodes. Since these nodes hold the actual data of the cluster, they are termed as DataNodes. Every DataNode knows the blocks it is responsible for, but it might sometimes miss some major information.

Although the DataNode knows about the block it is responsible for, it doesn’t care to know about the other blocks and the other DataNodes. This is a problem for you as a user because you don’t know anything about the blocks other than the file name, and you should be able to work only with the file name in the Hadoop cluster.

So the question here is: if the DataNodes do not know which block belongs to which file, then who has the key information? The key information is maintained by a node called the NameNode.

NameNode

A NameNode keeps track of all the files or datasets in HDFS. It knows the list of blocks that are made up of files in HDFS, not only the list of blocks but also the location of them.

Why is a NameNode so important? Imagine that a NameNode is down in your Hadoop cluster. In this scenario, there would be no way you could look up for the files in the cluster because you won’t be able to figure out the list of blocks made up of the files. Also, you won’t be able to figure out the location of the blocks. Apart from the block locations, a NameNode also has the metadata of the files and folders in HDFS, which includes information like, the size, replication factor, created by, created on, last modified by, last modified on, etc.

Due to the significance of the NameNode, it is also called the master node, and the DataNodes are called slave nodes, and hence the master–slave architecture.

NameNode persists all the metadata information about the files and folder and hard disk, except for the block location.

Since NameNodes are in constant communication with each other, when a NameNode starts up, the DataNodes will try to connect with the NameNode and broadcast the list of blocks that each of them is responsible for. The NameNode will hold the block locations in memory and never persist the information in the hard disk. Because, in a busy cluster HDFS is constantly changing with the new data files coming into the cluster, and if the NameNode has to persist every change to the block by writing the information to a hard disk, it would be a bottleneck. Hence, with performance reasons in mind, the NameNode will hold the block locations in memory so that it can give a faster response to the clients. Therefore, it is clear that the NameNode is the most powerful node in the cluster in terms of capacity. A NameNode failure is clearly not an option.

Other than the NameNode and the DataNodes, there is another component called the secondary NameNode. It works simultaneously with a primary NameNode as a helper. Although, the secondary NameNode is not a backup NameNode.


The functions of a secondary NameNode are listed below:

  • The secondary NameNode reads all files, along with the metadata, from the RAM of the NameNode. It also writes the metadata into the file system or to the hard disk.
  • The secondary NameNode is also responsible for combining EditLogs with fsImage present in the NameNode.
  • At regular intervals, the EditLogs are downloaded from the NameNode and are applied to fsImage by the secondary NameNode.
  • The secondary NameNode has periodic checkpoints in HDFS, and hence it is also called the checkpoint node.

Blocks

The data in HDFS is stored in the form of multiple files. These files are divided into one or more segments and are further stored in individual DataNodes. These file segments are known as blocks. The default block size is 128 MB in Apache Hadoop 2.x and 64 MB in Apache Hadoop 1.x, which can be modified as per the requirements from the HDFS configuration.

HDFS blocks are huge compared to disk blocks and they are designed this way for cost reduction.

By making a particular set of blocks large enough the time consumed for transferring data from the disk can be reduced. Therefore, with HDFS, the time consumed to transfer a huge file made up of multiple blocks works at a faster disk transfer rate.

In the next part of this ‘What is HDFS?’ tutorial, let’s look at the benefits of HDFS.

Benefits of HDFS

  • HDFS supports the concept of blocks: When uploading a file into HDFS, the file is divided into fixed-size blocks to support distributed computation. HDFS keeps track of all the blocks in the cluster.
  • HDFS maintains data integrity: Data failures or data corruption are inevitable in any big data environment. So, it maintains data integrity and helps recover from data loss by replicating the blocks and more than the node.
  • HDFS supports scaling: If you like to expand your cluster by adding more nodes, it’s very easy to do with HDFS.
  • No particular hardware required: There is no need for any specialized hardware to run or operate HDFS. It is basically built up to work with commodity computers.

Now, we come to the end of this section on ‘What is HDFS?’ of the Hadoop tutorial. We learned ‘What is HDFS?’, the need for HDFS, and its architecture. In the next section of this tutorial, we shall be learning about HDFS Commands.