Four MapReduce design patterns – DZone Big Data

In my last article, I wrote about creating a simple word count program in MapReduce, along with a tutorial for running the program on Hadoop. Please go through it if you don’t know how to write a program in MapReduce.
To solve any problem in MapReduce, we have to think in terms of MapReduce. A job does not always need both a map step and a reduce step.
MapReduce Design Patterns
- Input-Map-Reduce-Output
- Input-Map-Output
- Input-Multiple Maps-Reduce-Output
- Input-Map-Combiner-Reduce-Output
Here are some real-world scenarios that show when to use each design pattern.
- Input-Map-Reduce-Output
If we want to do an aggregation, this pattern is used:
| Step | Description |
| --- | --- |
| Scenario | Compute the total or average salary of employees by gender |
| Map (key, value) | Key: gender. Value: salary |
| Reduce | Group by gender and take the total salary for each group |
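The aggregation pattern above can be sketched in plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps, not Hadoop code; the sample records and the names `EMPLOYEES`, `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical employee records: (employee_id, gender, salary)
EMPLOYEES = [
    (1, "F", 50000),
    (2, "M", 40000),
    (3, "F", 70000),
    (4, "M", 60000),
]


def map_phase(record):
    """Emit one (gender, salary) pair per employee record."""
    _employee_id, gender, salary = record
    yield gender, salary


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(gender, salaries):
    """Return the total and average salary for one gender group."""
    total = sum(salaries)
    return gender, total, total / len(salaries)


def run_job(records):
    """Run map, shuffle, and reduce over the records in memory."""
    pairs = [pair for record in records for pair in map_phase(record)]
    groups = shuffle(pairs)
    return [reduce_phase(key, values) for key, values in sorted(groups.items())]


if __name__ == "__main__":
    for gender, total, average in run_job(EMPLOYEES):
        print(gender, total, average)
```

The reduce step sees every salary for one gender at once, which is what makes a total or an average possible.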
- Input-Map-Output
If we only want to change the format of the data, this pattern is used:
| Step | Description |
| --- | --- |
| Scenario | Some employees have a gender entry such as “Female”, “F”, “f”, or 0 |
| Map (key, value) | Key: employee ID. Value: gender. If the gender is Female / F / f / 0, convert it to F; if it is Male / M / m / 1, convert it to M |
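A map-only job like this one needs no reduce step, because each record is converted independently. The following is a minimal Python sketch of the conversion rule; the function names and the `"UNKNOWN"` fallback are assumptions, since the article does not say what to do with values outside the two listed groups.

```python
# Raw values that the scenario treats as female or male (compared in lowercase)
FEMALE_TOKENS = {"female", "f", "0"}
MALE_TOKENS = {"male", "m", "1"}


def normalize_gender(raw):
    """Convert a mixed-format gender entry to "F" or "M"."""
    token = str(raw).strip().lower()
    if token in FEMALE_TOKENS:
        return "F"
    if token in MALE_TOKENS:
        return "M"
    # Assumption: anything else is flagged rather than guessed
    return "UNKNOWN"


def map_phase(record):
    """Emit (employee_id, normalized_gender) for one input record."""
    employee_id, raw_gender = record
    yield employee_id, normalize_gender(raw_gender)


def run_job(records):
    """Map-only job: the mapper output is the final output."""
    return [pair for record in records for pair in map_phase(record)]


if __name__ == "__main__":
    sample = [(1, "Female"), (2, "f"), (3, 0), (4, "Male"), (5, 1)]
    print(run_job(sample))
```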
- Input-Multiple Maps-Reduce-Output
In this design pattern, the input is taken from two files that have different schemas:
| Step | Description |
| --- | --- |
| Scenario | We need to find the total salary by gender, but the data is in two files with different schemas. Input file 1: the gender is given as a prefix to the name, for example Ms. Shital Katkar, Mr. Krishna Katkar. Input file 2: there is a separate column for the gender, but the format is mixed, for example Female / Male, 0 / 1, F / M |
| Map (key, value) | Map 1 (for input 1): separate the prefix from the name and determine the gender from the prefix, then emit the key-value pair (Gender, Salary). Map 2 (for input 2): this program is simpler; resolve the mixed format and emit the key-value pair (Gender, Salary) |
| Reduce | Group by gender and take the total salary for each group |
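The two-mapper idea can be sketched in Python as follows. Each mapper understands one schema, and both emit the same (Gender, Salary) pairs so that a single reducer can total them. The prefix table, the record layouts, the salaries, and the file 2 names are assumptions for illustration; only the two file 1 names come from the scenario.

```python
from collections import defaultdict

# Assumed prefix-to-gender table; extend as the real data requires
PREFIX_TO_GENDER = {"ms.": "F", "mrs.": "F", "mr.": "M"}
GENDER_TOKENS = {"female": "F", "f": "F", "0": "F", "male": "M", "m": "M", "1": "M"}

# Hypothetical file 1 layout: (prefixed name, salary)
FILE1 = [("Ms. Shital Katkar", 50000), ("Mr. Krishna Katkar", 40000)]
# Hypothetical file 2 layout: (name, gender column, salary)
FILE2 = [("Employee A", "Female", 30000), ("Employee B", 1, 20000), ("Employee C", "f", 10000)]


def map_file1(record):
    """Map 1: derive the gender from the name prefix."""
    name, salary = record
    parts = name.split()
    prefix = parts[0].lower() if parts else ""
    yield PREFIX_TO_GENDER.get(prefix, "UNKNOWN"), salary


def map_file2(record):
    """Map 2: resolve the mixed-format gender column."""
    _name, raw_gender, salary = record
    token = str(raw_gender).strip().lower()
    yield GENDER_TOKENS.get(token, "UNKNOWN"), salary


def reduce_phase(pairs):
    """Group by gender and total the salaries."""
    totals = defaultdict(int)
    for gender, salary in pairs:
        totals[gender] += salary
    return dict(totals)


def run_job(file1, file2):
    """Feed both mappers' output into one reducer."""
    pairs = [pair for record in file1 for pair in map_file1(record)]
    pairs += [pair for record in file2 for pair in map_file2(record)]
    return reduce_phase(pairs)


if __name__ == "__main__":
    print(run_job(FILE1, FILE2))
```

The reducer never needs to know which file a pair came from, because both mappers agree on the output schema.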
- Input-Map-Combiner-Reduce-Output
A Combiner, also known as a semi-reducer, is an optional class that works by accepting input from the Map class and then passing the output key-value pairs to the Reducer class. The purpose of the combiner is to reduce the workload of the reducer.
In a MapReduce program, roughly 20% of the work is done in the map stage, also known as the data preparation stage, which runs in parallel.
Roughly 80% of the work is done in the reduce stage, known as the compute stage, which is not parallel in this setup and is therefore slower than the map stage. To cut that time, some of the work of the reduce stage can be done in the combiner stage.
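To make the effect of a combiner concrete, here is a small Python sketch that counts how many key-value pairs reach the reducer with and without local pre-aggregation. The input splits and function names are assumptions for illustration. Summation is a safe operation for a combiner because it is associative and commutative, so partial sums can be merged in any order.

```python
from collections import defaultdict

# Hypothetical input splits, one per mapper: (gender, salary) records
SPLITS = [
    [("M", 10), ("F", 20), ("M", 30)],
    [("F", 5), ("F", 15), ("M", 25)],
]


def map_phase(split):
    """Emit (gender, salary) pairs for one input split."""
    return [(gender, salary) for gender, salary in split]


def combine(pairs):
    """Pre-aggregate one mapper's output into one pair per key."""
    partial = defaultdict(int)
    for gender, salary in pairs:
        partial[gender] += salary
    return sorted(partial.items())


def reduce_phase(pairs):
    """Total all values per key."""
    totals = defaultdict(int)
    for gender, salary in pairs:
        totals[gender] += salary
    return dict(totals)


def run_job(splits, use_combiner):
    """Return the totals and the number of pairs sent to the reducer."""
    shuffled = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_phase(shuffled), len(shuffled)


if __name__ == "__main__":
    print(run_job(SPLITS, use_combiner=False))
    print(run_job(SPLITS, use_combiner=True))
```

Both runs produce the same totals; with the combiner, the reducer simply receives fewer pairs to process.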
Scenario
There are 5 departments, and we have to calculate the total salary by gender. However, there are certain rules for calculating the total. After calculating the total by gender for each department: if the total is more than 200K, add 20K to it; if the total is more than 100K, add 10K to it.
| Input files (one file per department) | Map (parallel) (Key = Gender, Value = Salary) | Combiner (parallel) | Reducer (not parallel) | Output |
| --- | --- | --- | --- | --- |
| Department 1 | Male, Female | Male, Female | Male <…10,90,90,30>, Female <…10,110,10,130,10> | Male, Female |
| Department 2 | Male, Female | Male, Female | | |
| Department 3 | Male, Female | Male, Female | | |
| Department 4 | Male, Female | Male, Female | | |
| Department 5 | Male, Female | Male, Female | | |
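The department scenario can be sketched in Python like this. The salary figures (in thousands) are invented for illustration, and the code assumes the two rules are mutually exclusive tiers (a total above 200K gets 20K, otherwise a total above 100K gets 10K), which is one reading of the rule as stated. Note also that real Hadoop may run a combiner zero or more times per mapper, so this models the flow described in the scenario rather than production behavior.

```python
from collections import defaultdict

# Hypothetical salaries in thousands (K); one input file per department
DEPARTMENTS = {
    "Department 1": [("Male", 120), ("Male", 130), ("Female", 60), ("Female", 60)],
    "Department 2": [("Male", 80), ("Male", 75), ("Female", 90), ("Female", 85)],
}


def bonus(total):
    """Assumed tiers: more than 200K adds 20K, else more than 100K adds 10K."""
    if total > 200:
        return 20
    if total > 100:
        return 10
    return 0


def combine(pairs):
    """Per department: total by gender, then apply the bonus rule."""
    partial = defaultdict(int)
    for gender, salary in pairs:
        partial[gender] += salary
    return [(gender, total + bonus(total)) for gender, total in sorted(partial.items())]


def reduce_phase(pairs):
    """Add up the adjusted department totals for each gender."""
    totals = defaultdict(int)
    for gender, amount in pairs:
        totals[gender] += amount
    return dict(totals)


def run_job(departments):
    """Map and combine each department separately, then reduce once."""
    shuffled = []
    for records in departments.values():
        shuffled.extend(combine(records))
    return reduce_phase(shuffled)


if __name__ == "__main__":
    print(run_job(DEPARTMENTS))
```

Because the per-department totals and adjustments happen before the shuffle, the single reducer only has to add a handful of numbers per gender.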
The patterns described above are basic templates. You can build your own to fit the requirements of your problem.