Hadoop Introduction

Hadoop is the most important framework for working with Big Data, and its biggest strength is scalability: it scales seamlessly from a single node to thousands of nodes.

Big Data spans many domains, which means the data we manage comes from videos, text, transactional records, sensor information, statistical data, social media conversations, search engine queries, ecommerce data, financial information, weather data, news updates, forum discussions, executive reports, and so on.

Doug Cutting and his team developed an open-source project known as Hadoop, which allows you to handle very large amounts of data. Hadoop runs applications on the basis of MapReduce, where the data is processed in parallel, and it can accomplish complete statistical analysis on large amounts of data.

It is a framework based on Java programming. It is designed to scale from a single server to thousands of machines, each offering local computation and storage, and it supports large data sets in a distributed computing environment.

The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.

History of Hadoop

Apache Hadoop was born to enhance the usage of big data and solve its major issues. The web was generating loads of information on a daily basis, and it was becoming very difficult to manage the data of around one billion pages of content. Google invented a new methodology of processing data popularly known as MapReduce and later published a white paper describing the framework. Doug Cutting and Mike Cafarella, inspired by this white paper, created Hadoop to apply these concepts to an open-source software framework that supported the Nutch search engine project. Considering the original case study, Hadoop was designed with a much simpler storage infrastructure.

Hadoop was created by Doug Cutting, who is also the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open-source web search engine that was itself a part of the Lucene project.

Why Apache Hadoop?

Most database management systems are not up to the mark for operating at such levels of Big Data, either due to sheer technical inefficiency or due to the insurmountable financial challenges posed. When the data is unstructured, the volume is huge, and the results are needed at uncompromisable speeds, the only platform that can effectively stand up to the challenge is Apache Hadoop.


Hadoop owes its runaway success to a processing framework, MapReduce, that is central to its existence. MapReduce lets ordinary programmers contribute their part: large datasets are divided and processed independently in parallel. These coders need not know the nuances of high-performance computing. With MapReduce, they can work efficiently without having to worry about intra-cluster complexities, monitoring of tasks, node failure management, and so on.

How did Big Data help in driving Walmart’s performance?

Walmart, one of the Big Data companies, is currently the biggest retailer in the world with the maximum revenue. With around 2 million employees and 20,000 stores, Walmart is building its own private cloud in order to incorporate 2.5 petabytes of data every hour.

Walmart has been collecting data on products that have the maximum sales in a particular season or for some specific reason. For example, people buy candies and costumes during the Halloween season, so you would see a lot of candies and costumes all around Walmart only during Halloween. Walmart does this based on the Big Data Analytics it performed on previous years' Halloween seasons.

Again, when Hurricane Sandy hit the US in 2012, Walmart had learned, from the data it had collected and analyzed from previous such instances, that people generally buy emergency equipment and strawberry Pop-Tarts when a warning for an approaching hurricane is declared. So, this time too, Walmart quickly filled its racks in the red-alert areas with the emergency equipment people would require during the hurricane. These products sold very quickly, and Walmart gained a lot of profit.

Hadoop Architecture

Apache Hadoop was developed with the goal of having an inexpensive, redundant data store that would enable organizations to leverage Big Data Analytics economically and increase the profitability of the business.
A Hadoop architectural design needs to have several design factors in terms of networking, computing power, and storage. Hadoop provides a reliable, scalable, flexible, and distributed computing Big Data framework.

Hadoop follows a master–slave architecture for storing data and data processing. This master–slave architecture has master nodes and slave nodes. Let’s first look at each terminology before we start with understanding the architecture:

  1. NameNode: NameNode is basically a master node that acts like a monitor and supervises operations performed by DataNodes.
  2. Secondary NameNode: A Secondary NameNode plays a vital role in case of a failure or technical issue in the NameNode.
  3. DataNode: DataNode is the slave node that stores all files and processes.
  4. Mapper: Mapper maps data or files in the DataNodes. It goes to every DataNode and runs a particular set of code or operations in order to get the work done.
  5. Reducer: While a Mapper runs the code, the Reducer is required for collecting the results from each Mapper.
  6. JobTracker: JobTracker is a master node used for getting the location of a file in different DataNodes. It is a very important service in Hadoop; if it goes down, all the running jobs get halted.
  7. TaskTracker: TaskTracker is a reference for the JobTracker present in the DataNodes. It accepts different tasks, such as map, reduce, and shuffle operations, from the JobTracker. It is a key player performing the main MapReduce functions.
  8. Block: Block is a small unit into which files are split. It has a default size of 64 MB in Hadoop 1 (128 MB in Hadoop 2) and can be increased as needed.
  9. Cluster: Cluster is a set of machines such as DataNodes, NameNodes, Secondary NameNodes, etc.

There are two layers in the Hadoop architecture. First, we will see how data is stored in Hadoop, and then we will move on to how it is processed. When talking about the storage of files in Hadoop, HDFS comes into the picture.

Hadoop Distributed File System (HDFS)

HDFS is based on the Google File System (GFS) and provides a distributed file system that is particularly designed to run on commodity hardware. The file system has several similarities with existing distributed file systems. However, HDFS does stand out among them because it is fault-tolerant and is specifically designed for deployment on low-cost hardware.


HDFS is mainly responsible for taking care of the storage part of Hadoop applications. So, if you have a 100 MB file that needs to be stored in the file system, then in HDFS this file will be split into chunks, called blocks. The default size of each block in Hadoop 1 is 64 MB, whereas in Hadoop 2 it is 128 MB. For example, in Hadoop version 1, a 100 MB file will be divided into 64 MB stored in one block and 36 MB in another block. Also, each block is given a unique name, i.e., blk_n (n = any number). Each block is uploaded to one DataNode in the cluster. On each of the machines in the cluster, there is something called a daemon, a piece of software that runs in the background.
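If you want to see this block splitting for yourself, the HDFS fsck tool reports the blocks of a stored file and where its replicas live. This is a minimal sketch, assuming a running cluster; the file path used here is hypothetical:

# Show how HDFS split the file into blocks and where the replicas are stored
# (the path /user/intell/sample_100mb.dat is only an example)
hdfs fsck /user/intell/sample_100mb.dat -files -blocks -locations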

The daemons of HDFS are as follows:


NameNode: It is the master node that maintains or manages all metadata. It points to DataNodes and retrieves data from them. The file system metadata is stored on the NameNode.
Secondary NameNode: It is a master node responsible for keeping checkpoints of the file system metadata present on the NameNode.
DataNode: DataNodes hold the application data that is stored on the servers. The DataNode is the slave node that basically has all the data of the files in the form of blocks.
As we know, HDFS stores the application data and the file system metadata separately on dedicated servers. The file content is replicated by HDFS on various DataNodes based on the replication factor to assure the availability of the data (see the sketch below). The DataNodes and the NameNode communicate with each other using TCP-based protocols.
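As a quick illustration of the replication factor, the commands below list a file's current replication and then change it. This is only a sketch, assuming a running cluster; the path is hypothetical:

# The second column of the listing shows the current replication factor
hdfs dfs -ls /user/intell/sample_100mb.dat

# Change the replication factor to 3 and wait (-w) until re-replication finishes
hdfs dfs -setrep -w 3 /user/intell/sample_100mb.dat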
The following prerequisites are required to be satisfied by HDFS for the Hadoop architecture to perform efficiently:

  • There must be good network speed in order to manage data transfer.
  • Hard drives should have a high throughput.

MapReduce Layer

MapReduce is a patented software framework introduced by Google to support distributed computing on large datasets on clusters of computers.


It is basically a programming model that runs in the Hadoop background, providing simplicity, scalability, recovery, and speed, along with easy solutions for data processing. The MapReduce framework is proficient at processing tremendous amounts of data in parallel on large clusters of computational nodes.


MapReduce is a programming model that allows you to process your data across an entire cluster. It basically consists of Mappers and Reducers that are different scripts you write or different functions you might use when writing a MapReduce program. Mappers have the ability to transform your data in parallel across your computing cluster in a very efficient manner; whereas, Reducers are responsible for aggregating your data together.


Mappers and Reducers put together can be used to solve complex problems.
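As a concrete, minimal example, the Hadoop distribution ships with an examples jar whose WordCount job uses exactly this pattern: Mappers tokenize the input in parallel and Reducers aggregate the counts per word. The input and output paths below are hypothetical, and the jar path assumes the Hadoop 2.7.3 layout used later in this tutorial:

# Run the bundled WordCount MapReduce job on data already stored in HDFS
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar \
    wordcount /user/intell/input /user/intell/output

# Inspect the aggregated word counts written by the Reducers
hdfs dfs -cat /user/intell/output/part-r-00000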

Working of the MapReduce Architecture

The MapReduce job starts when a client submits a file. The file first goes to the JobTracker, along with the map and reduce functions and the locations of the input and output data. When a file is received, the JobTracker sends a request to the NameNode, which holds the locations of the DataNodes storing the blocks. The NameNode sends those locations back to the JobTracker. Next, the JobTracker contacts the TaskTrackers present on those DataNodes and selects them for the job.
Next, the processing of the map phase begins. In this phase, the TaskTracker retrieves all the input data. For each record, which has been parsed by the 'InputFormat', a map function is invoked, producing key–value pairs in a memory buffer.

The memory buffer is then sorted and partitioned for the different reducer nodes, and a combine function may be invoked to pre-aggregate values. When the map task is completed, the JobTracker gets a notification from the TaskTracker. Once all the TaskTrackers notify the JobTracker, the JobTracker notifies the selected TaskTrackers to begin the reduce phase. The TaskTracker's work now is to read the region files and sort the key–value pairs for each key. Lastly, the reduce function is invoked, which collects the combined values into an output file.

How does Hadoop work?

Hadoop runs code across a cluster of computers and performs the following tasks:

  • Data is initially divided into files and directories. Files are then divided into uniformly sized blocks: 128 MB in Hadoop 2 and 64 MB in Hadoop 1.
  • Then, the files are distributed across various cluster nodes for further processing of data.
  • The JobTracker starts its scheduling programs on individual nodes.
  • Once all the nodes are done with scheduling, the output is returned.

Data from HDFS is consumed through MapReduce applications. HDFS also creates multiple replicas of data blocks and distributes them across the nodes of the cluster, which enables reliable and extremely quick computations.


So, in the first step, the file is divided into blocks and is stored in different DataNodes. If a job request is generated, it is directed to the JobTracker.

The JobTracker doesn’t really know the location of the file, so it contacts the NameNode for this.

The NameNode will now find the location and give it to the JobTracker for further processing. Now, since the JobTracker knows the location of the blocks of the requested file, it will contact the TaskTracker present on a particular DataNode for the data file. The TaskTracker will now send the data it has to the JobTracker.

Finally, the JobTracker will collect the data and send it back to the requested source.
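In day-to-day use, this flow is hidden behind simple file system commands. The sketch below, assuming the cluster configured later in this tutorial, stores a local file in HDFS (where it is split into blocks across DataNodes) and reads it back; the paths are hypothetical:

# Create a home directory in HDFS and copy a local file into it
hdfs dfs -mkdir -p /user/intell
hdfs dfs -put localdata.txt /user/intell/

# Read the file back; HDFS reassembles it transparently from its blocks
hdfs dfs -cat /user/intell/localdata.txt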

How does Yahoo! use Hadoop Architecture?

In Yahoo!, there are 36 different Hadoop clusters spread across Apache HBase, Storm, and YARN, i.e., 60,000 servers in total built from hundreds of distinct hardware configurations. Yahoo! runs the largest multi-tenant Hadoop installation in the world, with approximately 850,000 Hadoop jobs run daily.


The cost of storing and processing data using Hadoop is the best way to determine whether Hadoop is the right choice for your company. When compared on the basis of the expense of managing data, Hadoop is much cheaper than any legacy system.

Hadoop Installation

Hadoop is primarily supported on the Linux platform. If you are working on Windows, you can use the Cloudera VMware image that has Hadoop preinstalled, or you can use Oracle VirtualBox or VMware Workstation. In this tutorial, I will be demonstrating the installation process for Hadoop using VMware Workstation 12. You can use any of the above to perform the installation. I will do this by installing CentOS on my VMware.

Prerequisites

  • VirtualBox/VMware/Cloudera: Any of these can be used to install the operating system.
  • Operating System: You can install Hadoop on Linux-based operating systems. Ubuntu and CentOS are very commonly used among them. In this tutorial, we are using CentOS.
  • Java: You need to install the Java 8 package on your system.
  • Hadoop: You require the Hadoop 2.7.3 package.
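Before starting, you may want to check whether a suitable Java is already present and fetch the Hadoop tarball. This is only a sketch; the mirror URL below is an assumption (archive.apache.org keeps older releases) and may need to be adjusted:

# Check whether Java 8 is already installed
java -version

# Download the Hadoop 2.7.3 release tarball used in this tutorial
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz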

Hadoop Installation on Windows

Note: If you are working on Linux, then skip to Step 9.

Step 1: Installing VMware Workstation

  • Download VMware Workstation
  • Once downloaded, open the .exe file and set the location as required
  • Follow the required steps of installation

Step 2: Installing CentOS

  • Install CentOS
  • Save the file in any desired location

Step 3: Setting up CentOS in VMware 12

When you open VMware, the following window pops up:

Click on Create a New Virtual Machine

1. As seen in the screenshot above, browse to the location of the CentOS file you downloaded. Note that it should be a disc image file

2. Click on Next

1. Choose the name of your machine. Here, I have given the name as CentOS 64-bit

2. Then, click Next

1. Specify the disk capacity. Here, I have specified it to be 20 GB

2. Click Next

  • Click on Finish
  • After this, you should be able to see a window as shown below. This screen indicates that the system is booting and getting ready for installation. You will be given 60 seconds to change the option from Install CentOS to others. Simply wait for the 60 seconds if you want the selected option to be Install CentOS

Note: In the image above, you can see three options: I Finished Installing, Change Disc, and Help. You don’t need to touch any of these until your CentOS is successfully installed.

  • At the moment, your system is being checked and is getting ready for installation
  • Once the checking percentage reaches 100%, you will be taken to a screen as shown below:

Step 4: Here, you can choose your language. The default language is English, and that is what I have selected

1. If you want any other language to be selected, specify it
2. Click on Continue

Step 5: Setting up the Installation Processes

  • From Step 4, you will be directed to a window with various options as shown below:
  • First, to select the software type, click on the SOFTWARE SELECTION option
  • Now, you will see the following window:
    1. Select the Server with GUI option to give your server a graphical appeal
    2. Click on Done
  • After clicking on Done, you will be taken to the main menu where you had previously selected SOFTWARE SELECTION
  • Next, you need to click on INSTALLATION DESTINATION
  • On clicking this, you will see the following window:

1. Under Other Storage Options, select I would like to make additional space available
2. Then, select the radio button that says I will configure partitioning
3. Then, click on Done

  • Next, you’ll be taken to another window as shown below:

1. Select the partition scheme here as Standard Partition
2. Now, you need to add three mount points here. For doing that, click on ‘+’

a) Select the Mount Point /boot as shown above
b) Next, select the Desired Capacity as 500 MiB as shown below:

c) Click on Add mount point
d) Again, click on ‘+’ to add another Mount Point

e) This time, select the Mount Point as swap and Desired Capacity as 2 GiB


f) Click on Add Mount Point
g) Now, to add the last Mount Point, click on + again

 h) Add another Mount Point ‘/’ and click on Add Mount Point

  1. Click on Done, and you will see the following window:

This is just to make you aware of all the changes you had made in the partition of your drive

  • Now, click on Accept Changes if you’re sure about the partitions you have made
    • Next, select NETWORK & HOST NAME
  • You’ll be taken to a window as shown below:

1. Set the Ethernet settings as ON
2. Change the HOST name if required
3. Apply the settings
4. Finally, click on Done

  • Next, click on Begin Installation


Step 6: Configuration

  • Once you complete Step 5, you will see the following window where the final installation process will be completed.
  • But before that, you need to set the ROOT PASSWORD and create a user
  • Click on ROOT PASSWORD, which will direct you to the following window:

1. Enter your root password here
2. Confirm the password
3. Click on Done

  • Now, click on USER CREATION, and you will be directed to the following window:


1. Enter your Full name. Here, I have entered Intell
2. Next, enter your User name; here, intell (This generally comes up automatically)
3. You can either make this password-based or make this a user administrator
4. Enter the password
5. Confirm your password
6. Finally, click on Done

  • The installation takes up to 20–30 minutes; when it is done, you will see the Reboot button as shown below
  • In the next screen, you will see the installation process in progress
  • Wait until a window pops up to accept your license info

Step 7: Setting up the License Information
  • Accept the License Information

Step 8: Logging into CentOS

  • You will see the login screen as below:

Enter the user ID and password you had set up in Step 6

Your CentOS installation is now complete!


Now, you need to start working on CentOS, and not on your local operating system. If you have jumped to this step because you are already working on Linux/Ubuntu, then continue with the following steps.

Step 9: Downloading and Installing Java 8

  • Download the Java 8 JDK (here, jdk-8u101 for Linux) and save the file in your home directory
  • Extract the Java tar file using the following command:

tar -xvf jdk-8u101-linux-i586.tar.gz
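To confirm the JDK extracted correctly, you can run the bundled java binary directly. This assumes the archive unpacked into a directory named jdk1.8.0_101 in your home directory, which may differ on your system:

# Print the version of the freshly extracted JDK
~/jdk1.8.0_101/bin/java -version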

Step 10: Downloading and Installing Hadoop

  • Download a stable Hadoop release packaged as a compressed tar file (here, hadoop-2.7.3.tar.gz) and unpack it somewhere on your file system
  • Extract the Hadoop file using the following command on the terminal:

tar -xvf hadoop-2.7.3.tar.gz
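If the extraction succeeded, the new directory should contain the standard Hadoop layout. A quick way to check (directory names assume the stock 2.7.3 tarball):

# bin, etc, include, lib, libexec, sbin, and share should be listed
ls hadoop-2.7.3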

  • You will be directed to the following window:

Step 11: Moving Hadoop to a Location

  • Use the following code to move your file to a particular location, here Hadoop:

mv hadoop-2.7.3 /home/intell/hadoop

Note: The location of the file you want to change may differ. For demonstration purposes, I have used this location, and this will be the same throughout this tutorial. You can change it according to your choice.

  • Here, Home will remain the same.
  • intell is the user name I have used. You can change it according to your user name.
  • Hadoop is the location where I want to save this file. You can change it as well if you want.

mv hadoop-2.7.3 /home/intell/hadoop

Step 12: Editing and Setting up Hadoop

First, you need to set the path in the ~/.bashrc file. This file can be edited from the root user. Before you edit ~/.bashrc, you need to check your Java configuration.

Enter the command:

update-alternatives --config java

You will now see all the Java versions available in the machine. Here, since I have only one version of Java which is the latest one, it is shown below:

You can have multiple versions as well.

  • Next, you need to select the version you want to work on. As you can see, there is a highlighted path in the above screenshot. Copy this path and paste it into a gedit file; it will be used in the upcoming steps
  • Enter the selection number you have chosen. Here, I have chosen the number 1
  • Now, open ~/.bashrc with the vi editor (the screen-oriented text editor in Linux)

Note: You have to become a root user first to be able to edit ~/.bashrc.

  • Enter the command: su
  • You will be prompted for the password. Enter your root password
  • When you get logged into your root user, enter the command: vi ~/.bashrc
  • The above command takes you to the vi editor, and you should be able to see the following screen:
  • To access this, press Insert on your keyboard and then start writing the following lines to set the paths for Java and Hadoop:

#HADOOP VARIABLES START
export JAVA_HOME=(path you copied in the previous step)
export HADOOP_HOME=/home/(your username)/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END


After writing these lines, press Esc on your keyboard and type the command :wq!
This saves the file and exits the vi editor. The path has now been set, as can be seen in the image below:
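To pick up the new variables in the current shell and confirm they work, you can run the commands below. This is just a sanity check; hadoop version will only succeed if JAVA_HOME points at a valid JDK:

# Reload the environment and verify the Hadoop variables
source ~/.bashrc
echo $HADOOP_HOME
hadoop version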

Step 13: Adding Configuration Files

  • Open hadoop-env.sh with the vi editor using the following command:

vi /home/intell/hadoop/etc/hadoop/hadoop-env.sh

  • You will see the following window come up. Here, you need to tell Hadoop which Java path to use
  • Change the JAVA_HOME variable to the path you had copied in the previous step (a sketch of this line follows below)
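The edit itself is a single line. In Hadoop 2.x, hadoop-env.sh ships with a line such as export JAVA_HOME=${JAVA_HOME}; replace it with the absolute path of your JDK. The path below is an assumption based on the extraction step earlier and may differ on your system:

# In hadoop-env.sh, point JAVA_HOME at the extracted JDK
export JAVA_HOME=/home/intell/jdk1.8.0_101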

Step 14:
Now, there are several XML files that need to be edited, and you need to set the property and the path for them.

  • Editing core-site.xml
    • Use the same command as in the previous step and just change the last part to core-site.xml as given below:

vi /home/intell/hadoop/etc/hadoop/core-site.xml

Next, you will see the following window:

  • Enter the following code in between the configuration tags as below:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

(Replace localhost with your host name if you are not configuring the local machine.)
  • Now, exit from this window by entering the command :wq!
  • Editing yarn-site.xml
    • Enter the command:

vi /home/intell/hadoop/etc/hadoop/yarn-site.xml

You will see the following window:

  • Enter the code in between the configuration tags as shown below:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

  • Exit from this window by pressing Esc and then writing the command :wq!
  • Editing mapred-site.xml
    • Copy or rename the file mapred-site.xml.template to mapred-site.xml.
      Note: If you go to the following path, you will see that there is no file named mapred-site.xml:
      Home > intell > hadoop > hadoop-2.7.3 > etc > hadoop
      So, we will copy the contents of mapred-site.xml.template to mapred-site.xml.
    • Use the following command to copy the contents:

cp /home/intell/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template /home/intell/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml

Once the contents have been copied to a new file named mapred-site.xml, you can verify it by going to the following path:
Home > intell > hadoop > hadoop-2.7.3 > etc > hadoop

  • Now, use the following command to add configurations:

vi /home/intell/hadoop/etc/hadoop/mapred-site.xml

  • In the new window, enter the following code in between the configuration tags as below:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

  • Exit using Esc and the command :wq!
  • Editing hdfs-site.xml

    Before editing hdfs-site.xml, two directories have to be created, which will contain the namenode and the datanode.
    • Enter the following code for creating a directory, namenode:

mkdir -p /home/intell/hadoop_store/hdfs/namenode

Note: Here, mkdir creates a new directory, and the -p option also creates any missing parent directories.

  • Similarly, to create the datanode directory, enter the following command:

mkdir -p /home/intell/hadoop_store/hdfs/datanode

  • Now, go to the following path to check both the files:
    Home > intell > hadoop_store > hdfs
    You can find both directories in the specified path, as shown in the images below.
    • Now, to configure hdfs-site.xml, use the following command:

vi /home/intell/hadoop/etc/hadoop/hdfs-site.xml 

  • Enter the following code in between the configuration tags:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/intell/hadoop_store/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/intell/hadoop_store/hdfs/datanode</value>
    </property>
</configuration>

  • Exit using Esc and the command :wq!

All your configurations are done, and the Hadoop installation is now complete!


Step 15: Checking Hadoop

You now need to check whether Hadoop has been installed successfully on your system.

  • Go to the location where you had extracted the Hadoop tar file, right-click on the bin folder, and open it in the terminal
  • Now, run the command ls
    If you see a window as shown below, it means that Hadoop has been successfully installed!
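As an optional further check, assuming the configuration above, you can format the NameNode once, start the daemons, and confirm they are running. Note that start-dfs.sh may prompt for SSH passwords unless passwordless SSH to localhost is set up, and jps lives in the JDK's bin directory:

# One-time formatting of the NameNode metadata directory
hdfs namenode -format

# Start the HDFS and YARN daemons
start-dfs.sh
start-yarn.sh

# jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager
jps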

The Hadoop High-level Architecture

The Hadoop architecture is based on two main components, namely MapReduce and HDFS.

Different Hadoop Architectures based on the Parameters chosen

The Apache Hadoop Module

Hadoop Common: Includes the common utilities that support the other Hadoop modules

HDFS: The Hadoop Distributed File System provides high-throughput access to application data.

Hadoop YARN: This technology is basically used for job scheduling and efficient management of cluster resources.

MapReduce: This is a highly efficient methodology for the parallel processing of huge volumes of data.

Then there are other projects included in the Hadoop module like:

Apache Ambari

Cassandra

HBase

Apache Spark

Hive

Pig

Sqoop

Oozie

ZooKeeper

How does Hadoop Work?

Hadoop helps execute large amounts of processing by letting the user connect many commodity computers together so that they behave as a single functional distributed system. The clustered machines read the dataset in parallel, produce intermediate results, and, after integration, deliver the desired output.

Hadoop runs code across a cluster of computers and performs the following tasks:

  • Data is initially divided into files and directories. Files are divided into uniformly sized blocks of 128 MB (Hadoop 2) or 64 MB (Hadoop 1).
  • Then, the files are distributed across various cluster nodes for further processing of data.
  • The JobTracker starts its scheduling programs on individual nodes.
  • Once all the nodes are done with scheduling, the output is returned.

The Challenges facing Data at Scale and the Scope of Hadoop

Big Data is categorized into:

  • Structured – data stored in rows and columns, as in relational data sets
  • Unstructured – data that cannot be stored in rows and columns, such as videos, images, etc.
  • Semi-structured – data in formats such as XML, readable by both machines and humans

There is a standardized methodology that Big Data follows, highlighting the usage of ETL.

ETL stands for Extract, Transform, and Load.

Extract – fetching the data from multiple sources

Transform – converting the existing data to fit the analytical needs

Load – loading the data into the target systems to derive value from it

Comparison to Existing Database Technologies

Most database management systems are not up to scratch for operating at such lofty levels of Big Data exigencies, either due to sheer technical inefficiency or due to the financial burden involved. When the data is totally unstructured, the volume is humongous, and the results are needed at high speed, the only platform that can effectively stand up to the challenge is Apache Hadoop.

Hadoop majorly owes its success to a processing framework called MapReduce that is central to its existence. The MapReduce technology gives every programmer the opportunity to contribute their part: large data sets are divided and processed independently in parallel. These coders do not need to know high-performance computing and can work efficiently without worrying about intra-cluster complexities, monitoring of tasks, node failure management, and so on.

Hadoop also contributes another platform, namely the Hadoop Distributed File System (HDFS). The main strength of HDFS is its ability to rapidly scale and work without a hitch irrespective of faults in individual nodes. HDFS, in essence, divides large files into smaller blocks, ranging from 64 MB to 128 MB, which are then copied onto a couple of nodes of the cluster. Through this, HDFS ensures that no work stops even when some nodes go out of service. HDFS has APIs that let MapReduce programs read and write data simultaneously at high speed. When there is a need to speed up performance, extra nodes can be added in parallel to the cluster, and the increased demand can be met immediately.

Advantages of Hadoop

  • It gives users the ability to rapidly write and test distributed systems; it then automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.
  • The Hadoop library is developed to detect and handle failures at the application layer.
  • Servers can be added to or removed from the cluster dynamically at any point of time.
  • It is open source and based on Java, and hence compatible with all platforms.

Hadoop Features and Characteristics

Apache Hadoop is the most popular and powerful Big Data tool. It provides the world's most reliable storage layer, HDFS (Hadoop Distributed File System), a batch processing engine, namely MapReduce, and a resource management layer, namely YARN. Apache Hadoop is an open-source project, which means its code can be modified according to business requirements.

  • Distributed Processing – Data storage is maintained in a distributed manner in HDFS across the cluster, and data is processed in parallel on a cluster of nodes.
  • Fault Tolerance – By default, three replicas of each block are stored across the cluster in Hadoop, and this can be changed as required. If any node goes down, the data on that node can be recovered easily from the other nodes. Failures of a particular node or task are recovered automatically by the framework.
  • Reliability – Due to the replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. Even if your machine goes down, your data will still be stored reliably.
  • High Availability – Data is available and accessible even if a hardware failure occurs, thanks to the multiple copies of data. If a machine or a few pieces of hardware crash, the data can be accessed from another path.
  • Scalability – Hadoop is highly scalable, and hardware can easily be added to the nodes. It also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
  • Economical – Hadoop is not very expensive as it runs on a cluster of commodity hardware. We do not require any specialized machine for it. Hadoop provides a huge cost reduction since it is very easy to add more nodes. So if the requirement increases, the number of nodes can be increased without any downtime and without much pre-planning.
  • Easy to Use – There is no need for the client to deal with distributed computing; the framework takes care of all these things. So it is easy to use.
  • Data Locality – Hadoop works on the data locality principle, which states that computation is moved to the data instead of data to the computation. When a client submits an algorithm, the algorithm is moved to the data in the cluster rather than bringing the data to the location where the algorithm is submitted and then processing it.

Hadoop Assumptions

Hadoop is written with huge clusters of computers in mind and is built upon the following assumptions:

  • Hardware may fail due to any external or technical malfunction; commodity hardware is used, so failures are expected and must be handled.
  • Processing will be run in batches, and there exists an emphasis on high throughput as opposed to low latency.
  • Applications that run on HDFS have large data sets. A typical file in HDFS may be gigabytes to terabytes in size.
  • Applications require a write-once-read-many access model.
  • Moving computation is cheaper than moving data.

Hadoop Design Principles

The following are the design principles on which Hadoop works:

  • The system shall manage and heal itself as required.
  • Fault tolerance: work is automatically and transparently routed around failures, and redundant tasks are speculatively executed if certain nodes are detected to be running slower.
  • Performance scales linearly.
  • Capacity changes proportionally with changes in resources (scalability).
  • Compute must be moved to the data.
  • Data locality means lower latency and lower bandwidth usage.
  • It is based on a simple core that is modular and extensible (economical).