MapReduce in Hadoop

Now that you know about HDFS, it is time to talk about MapReduce. So, in this section, we’re going to learn the basic concepts of MapReduce.

We will learn MapReduce in Hadoop using a fun example!

MapReduce is the data processing model in Hadoop. Its programming model is designed to process huge volumes of data in parallel by dividing the work into a set of independent tasks. As we learned in the Hadoop architecture, the user submits the complete job to the master node, which divides it into small tasks and assigns them to the slave nodes.
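To make the idea of "independent tasks" concrete, here is a minimal sketch in plain Python (not Hadoop code). The function names split_into_tasks, run_task, and run_job are made up for illustration, and a thread pool stands in for the slave nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tasks(records, num_tasks):
    """Divide the records into num_tasks roughly equal, independent chunks."""
    return [records[i::num_tasks] for i in range(num_tasks)]

def run_task(chunk):
    """One independent task: here, simply count the records in its chunk."""
    return len(chunk)

def run_job(records, num_workers=3):
    """Hypothetical 'master': split the job, run the tasks in parallel, combine."""
    tasks = split_into_tasks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(run_task, tasks))
    return sum(partial_counts)

total = run_job(list(range(10)))
print(total)
```

Because no task needs anything from another task, the chunks can be processed in any order and on any worker.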

How does MapReduce in Hadoop work?

Let’s explore a scenario.

Imagine that the governor of the state assigns you to manage the upcoming census drive for State A. Your task is to find the population of all cities in State A. You are given four months to do this with all the resources you may require.

So, how would you complete this task? Estimating the population of all cities in a big state is not a simple task for a single person. A practical approach would be to divide the state by cities and assign each city to a separate individual, who would then calculate the population of that city.

Here is an illustration for this scenario considering only three cities, X, Y, and Z:

Person 1, Person 2, and Person 3 will be in charge of the X, Y, and Z cities, respectively.

So, you have broken down State A into cities where each city is allocated to different people, and they are solely responsible for figuring out the population of their respective cities. Now, you need to provide specific instructions to each of them. You will ask them to go to each home and find out how many people live there.

Assume that there are five people in the first home Person 1 visits in City X. He notes down “X 5”. It’s a proper divide-and-conquer approach, right?

The same set of instructions would be carried out by everyone involved. That means that Person 2 will go to City Y and Person 3 will go to City Z, and both will do the same. In the end, they would submit their results to the state’s headquarters, where someone would aggregate the results. With this strategy, you will be able to calculate the population of State A in four months.
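The first-year census can be sketched in a few lines of plain Python. The household sizes below are invented for illustration, and take_census and aggregate_at_headquarters are hypothetical names for the two roles in the story.

```python
from collections import Counter

# Hypothetical survey data: household sizes observed in each city.
households = {
    "X": [5, 3, 4],
    "Y": [2, 6],
    "Z": [1, 4, 4],
}

def take_census(city, household_sizes):
    """One census taker: note down a (city, people) record per household."""
    return [(city, people) for people in household_sizes]

def aggregate_at_headquarters(all_records):
    """Headquarters: add up the records received for each city."""
    totals = Counter()
    for city, people in all_records:
        totals[city] += people
    return dict(totals)

records = []
for city, sizes in households.items():
    records.extend(take_census(city, sizes))

city_population = aggregate_at_headquarters(records)
state_population = sum(city_population.values())
print(city_population, state_population)
```

Each census taker only ever looks at their own city, and the headquarters only ever sees the small (city, people) notes, never the households themselves.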

Next year, you’re asked to do the same job but in two months. What would you do? Wouldn’t you just double the number of people performing the task? So, you would divide City X into two parts, assign one part to Person 1, and have one more person take charge of the other part. Then, the same would be done for Cities Y and Z, and the same set of instructions would be given again.

There would also be two headquarters, HQ1 and HQ2. You would tell the census takers at X1 and X2 to send their results to either HQ1 or HQ2, and give the census takers for Cities Y and Z the same instruction, making sure that all results for a given city go to the same headquarters so that each city’s total is computed in one place. Problem solved!

So, with twice the force, you would be able to achieve the same in two months.
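One detail matters once there are two headquarters: for the per-city totals to be correct, every result for a given city has to reach the same headquarters. A minimal sketch of such a routing rule follows; choose_headquarters is a hypothetical helper, and the checksum-based rule is just one simple way to get a deterministic choice, not necessarily what Hadoop does internally.

```python
import zlib

NUM_HEADQUARTERS = 2

def choose_headquarters(city):
    """Deterministically map a city name to a headquarters index (0 or 1).

    A stable checksum is used instead of Python's built-in hash(), which is
    randomized between runs for strings.
    """
    return zlib.crc32(city.encode("utf-8")) % NUM_HEADQUARTERS

# Hypothetical records from six census takers (two per city).
records = [("X", 5), ("Y", 2), ("X", 3), ("Z", 4), ("Y", 6), ("Z", 1)]

# Each headquarters has its own inbox of (city, people) records.
inbox = [[] for _ in range(NUM_HEADQUARTERS)]
for city, people in records:
    inbox[choose_headquarters(city)].append((city, people))

for index, received in enumerate(inbox, start=1):
    print(f"HQ{index} received {received}")
```

Since the rule depends only on the city name, the census takers at X1 and X2 never need to coordinate: they will always pick the same headquarters.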

Now, you have a good enough model. This model is called MapReduce.

MapReduce in Hadoop is a distributed programming model for processing large datasets. The concept was conceived at Google, and Hadoop adopted it. It can be implemented in any programming language, and Hadoop supports many programming languages for writing MapReduce programs.

You can write a MapReduce program in Scala, Python, C++, or Java. MapReduce is not a programming language; rather, it is a programming model. You write the map and reduce logic, and the MapReduce system in Hadoop manages the data transfer and parallel execution across distributed servers or nodes. Now let’s look at the phases involved in MapReduce.
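As a rough idea of what the map and reduce logic can look like in Python, here is a minimal sketch. The "city,people" input format and the function names are assumptions made for this example; the tab-separated "key<TAB>value" lines follow the convention used by Hadoop Streaming, where the reducer receives its lines already sorted by key.

```python
from itertools import groupby

def mapper(lines):
    """Turn each input line 'city,people' into a 'city<TAB>people' line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        city, people = line.split(",")
        yield f"{city}\t{int(people)}"

def reducer(sorted_lines):
    """Sum the values of consecutive lines that share the same key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for city, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{city}\t{sum(int(value) for _, value in group)}"

raw_input_lines = ["X,5", "Y,2", "X,3", "Z,4"]
mapped = sorted(mapper(raw_input_lines))  # sorting stands in for the framework's sort step
reduced = list(reducer(mapped))
print(reduced)
```

In a real job, the two functions would live in separate scripts that read from standard input and write to standard output; here they take and return lists of lines so the sketch can run on its own.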

Phases in MapReduce

There are three main phases in MapReduce: the map phase, the shuffle phase, and the reduce phase. Let’s understand them in detail with the help of the same scenario.

Map Phase

The phase wherein individuals count the population of their assigned cities is called the map phase. There are some terms related to this phase.

  • Mapper: The individual (census taker) involved in the actual calculation is called a mapper.
  • Input split: The part of the city that each census taker is assigned is known as the input split.
  • Key–Value pair: The output from each mapper is a key–value pair. As you can see in the image, the key is X and the value is, say, 5.
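The three terms above can be sketched as follows. The household records, the make_input_splits helper, and the mapper function are all hypothetical and only mirror the census story; they are not Hadoop APIs.

```python
def make_input_splits(households, num_splits):
    """Cut the full household list into contiguous input splits."""
    size = -(-len(households) // num_splits)  # ceiling division
    return [households[i:i + size] for i in range(0, len(households), size)]

def mapper(input_split):
    """Census taker: emit one (city, people) key-value pair per household."""
    for household in input_split:
        yield (household["city"], len(household["residents"]))

# Hypothetical raw data for City X: one record per household.
households = [
    {"city": "X", "residents": ["Ann", "Bo", "Cy", "Di", "Ed"]},
    {"city": "X", "residents": ["Flo", "Gus", "Hal"]},
    {"city": "X", "residents": ["Ida", "Jo"]},
    {"city": "X", "residents": ["Kim"]},
]

input_splits = make_input_splits(households, num_splits=2)
mapper_outputs = [list(mapper(split)) for split in input_splits]
print(mapper_outputs)
```

Each mapper sees only its own input split, which is what allows the splits to be processed at the same time by different census takers.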

Reduce Phase

By now, the large dataset has been broken down into input splits, and the map tasks have processed them. This is where the reduce phase comes into play. Just as each mapper works on its own input split, the reduce phase processes each key separately.

  • Reducers: The individuals who work in the headquarters are known as the reducers. This is because they reduce or consolidate the outputs from many different mappers.

After the reducer has finally finished the task, a results file is generated, which is stored in HDFS. Then, HDFS replicates these results.
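A reducer and its results file can be sketched like this. The grouped input is made-up data, the names reducer and write_results are hypothetical, and a temporary local file stands in for HDFS (so no replication happens here).

```python
import tempfile
from pathlib import Path

def reducer(city, counts):
    """Headquarters clerk: consolidate all counts received for one city."""
    return (city, sum(counts))

def write_results(results, path):
    """Write one 'key<TAB>value' line per result."""
    with open(path, "w", encoding="utf-8") as handle:
        for city, total in results:
            handle.write(f"{city}\t{total}\n")

# Hypothetical reducer input: every value received for each key.
grouped = {"X": [5, 3, 4], "Y": [2, 6], "Z": [1, 4, 4]}
results = [reducer(city, counts) for city, counts in sorted(grouped.items())]

with tempfile.TemporaryDirectory() as folder:
    output_path = Path(folder) / "census_results.txt"
    write_results(results, output_path)
    written = output_path.read_text(encoding="utf-8")

print(written)
```

Note that reducer is called once per key, which is why different keys can be handled by different reducers.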

Shuffle Phase

The phase in which the values from different mappers are copied or transferred to reducers is known as the shuffle phase.

The shuffle phase comes between the map phase and the reduce phase.
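In its simplest form, the shuffle can be pictured as collecting the values from all mappers under their keys. The sketch below does this in memory with a hypothetical shuffle function; in a real cluster, this step moves data between machines.

```python
from collections import defaultdict

def shuffle(mapper_outputs):
    """Gather the values emitted by all mappers into one list per key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    # Return the keys in sorted order so the result is predictable.
    return {key: grouped[key] for key in sorted(grouped)}

# Hypothetical outputs of three mappers (census takers).
mapper_outputs = [
    [("X", 5), ("X", 3)],
    [("Y", 2), ("X", 4)],
    [("Z", 1), ("Y", 6)],
]

reducer_inputs = shuffle(mapper_outputs)
print(reducer_inputs)
```

After this step, each key comes with the complete list of its values, which is the form a reducer needs.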

Now, let’s see the architectural view of how map and reduce work together:

When input data is given to a mapper, it is processed through user-defined functions written in the mapper. The mapper’s output is the intermediate data, which is the input for a reducer.

The reducer then processes this intermediate data through user-defined functions written in the reducer, and the final output is produced. This output is saved in HDFS, where it is replicated. So far, we have learned how Hadoop MapReduce works. Moving ahead, let’s discuss some features of MapReduce.
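The role of the user-defined functions can be illustrated with a small in-memory driver. Everything here is a sketch under stated assumptions: run_mapreduce is a hypothetical stand-in for the framework, and it runs sequentially on one machine. The point is that the same driver produces different results depending on the map and reduce functions you pass in.

```python
from collections import defaultdict

def run_mapreduce(input_splits, map_fn, reduce_fn):
    """Hypothetical driver: map every record, group by key, reduce each key."""
    intermediate = defaultdict(list)
    for split in input_splits:
        for record in split:
            for key, value in map_fn(record):
                intermediate[key].append(value)
    return {key: reduce_fn(key, values)
            for key, values in sorted(intermediate.items())}

# User-defined map function: one (city, people) pair per household record.
def emit_household(record):
    city, people = record
    yield (city, people)

# Two different user-defined reduce functions.
def total_population(city, counts):
    return sum(counts)

def largest_household(city, counts):
    return max(counts)

# Hypothetical input splits of (city, household size) records.
input_splits = [[("X", 5), ("Y", 2)], [("X", 3), ("Z", 4)], [("Y", 6)]]

populations = run_mapreduce(input_splits, emit_household, total_population)
largest = run_mapreduce(input_splits, emit_household, largest_household)
print(populations, largest)
```

Swapping total_population for largest_household changes the question being answered without touching the driver.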

Features of the MapReduce System

The important features of MapReduce are as follows:

  • Shields developers from the complexity of distributed programming
  • Built-in redundancy and fault tolerance are available
  • The MapReduce programming model is language independent
  • Automatic parallelization and distribution are available
  • Enables the local processing of data
  • Manages all the inter-process communication
  • Manages distributed servers that run various tasks in parallel
  • Manages all communications and data transfers between various parts of the system module
  • Handles redundancy and failures as part of the overall management of the whole process
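To give a feel for the fault-tolerance feature, here is a toy illustration of the general idea: if a task fails on one worker, it is run again on another. This is not Hadoop's scheduler; run_with_retries and the two worker functions are hypothetical, and a raised RuntimeError stands in for a crashed node.

```python
def run_with_retries(task, workers, max_attempts=3):
    """Try the task on successive workers until one of them succeeds."""
    last_error = None
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]
        try:
            return worker(task)
        except RuntimeError as error:
            last_error = error  # remember the failure and try the next worker
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

def failing_worker(task):
    """Simulates a node that crashes while processing the task."""
    raise RuntimeError("worker crashed")

def healthy_worker(task):
    """Simulates a node that completes the task: sum the household sizes."""
    return sum(task)

result = run_with_retries([5, 3, 4], [failing_worker, healthy_worker])
print(result)
```

Re-running a task is safe here because the task only reads its input and returns a value, so running it twice gives the same answer.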