Invented at Google to scale computation by distributing work across many machines

Abstraction: assume the input is split up into a number of files or chunks in some way

The map function accepts the input key/value pairs, and the MapReduce library spreads the work across different machines; the reduce function then collects and merges the results computed by those machines (workers)

map(k1, v1) => list(k2, v2) → Takes in the input and produces a set of intermediate key/value pairs. The MapReduce library then groups all intermediate values associated with the same intermediate key and passes them to the reduce function

reduce(k2, list(v2)) => list(v2) → Merges together the values that map produced for a given key, forming a possibly smaller set of values
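
To make the two signatures concrete, here is a minimal word-count sketch in Go (word count is the paper's running example); the `KeyValue` type, the string-based signatures, and the tiny in-memory driver are assumptions for illustration, not the paper's actual C++ interface.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is an assumed intermediate pair type for this sketch.
type KeyValue struct {
	Key   string
	Value string
}

// mapF: (k1, v1) => list(k2, v2).
// For word count, the value is a chunk of text and each
// intermediate pair is (word, "1").
func mapF(filename, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// reduceF: (k2, list(v2)) => list(v2).
// For word count, it collapses the list of "1"s into a single count.
func reduceF(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	// A tiny in-memory driver standing in for the MapReduce library:
	// run map, group by intermediate key, then run reduce per key.
	intermediate := mapF("doc1", "the quick fox and the lazy dog")
	groups := map[string][]string{}
	for _, kv := range intermediate {
		groups[kv.Key] = append(groups[kv.Key], kv.Value)
	}
	for k, vs := range groups {
		fmt.Printf("%s: %s\n", k, reduceF(k, vs))
	}
}
```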

Execution Overview

  1. The MapReduce library in the user program first splits the input files into M pieces. It then starts up many copies of the program on a cluster of machines
  2. One copy is selected as the master, and the master picks idle workers and assigns each of them a map task or a reduce task (there are M map tasks and R reduce tasks in total)
  3. A map worker reads the contents of the corresponding input split and parses the key/value pairs out of the input data. It then passes the data to the user-defined map function and buffers the intermediate key/value pairs in memory
  4. Those buffered pairs are written to disk periodically, partitioned into R regions by the partitioning function (see the sketch after this list). The locations of those regions are passed back to the master
  5. Reduce workers read in that data and sort it by the intermediate keys so that all occurrences of the same key are grouped together
  6. The reduce workers then iterate through the sorted result; for each unique key, they call the reduce function and append its output to a final output file for that reduce partition
  7. When all map and reduce tasks are finished, the master wakes up the user program and control returns to the user code
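
A small sketch of how steps 4 and 5 fit together, assuming Go: `partition` follows the paper's default partitioning function hash(key) mod R, while the `kv` type, R = 4, and the in-memory regions are made up for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// kv is an assumed intermediate pair type for this sketch.
type kv struct{ key, value string }

const R = 4 // number of reduce tasks/regions (assumed for this sketch)

// partition implements the paper's default partitioning function,
// hash(key) mod R; it decides which reduce region a pair lands in.
func partition(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % R
}

func main() {
	pairs := []kv{{"fox", "1"}, {"the", "1"}, {"dog", "1"}, {"the", "1"}}

	// Step 4: a map worker buckets its output into R regions.
	regions := make([][]kv, R)
	for _, p := range pairs {
		r := partition(p.key)
		regions[r] = append(regions[r], p)
	}

	// Step 5: each reduce worker sorts its region by intermediate key
	// so that all values for the same key sit next to each other.
	for r, region := range regions {
		sort.Slice(region, func(i, j int) bool { return region[i].key < region[j].key })
		fmt.Println("region", r, region)
	}
}
```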

Fault Tolerance

Worker Failure

The master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any task the failed worker completed or was working on is reset to the idle state so it can be rescheduled. Completed map tasks must be re-executed because their output is stored on the failed machine's local disk and is therefore inaccessible; completed reduce tasks do not need re-execution because their output is stored in the global file system
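
A minimal sketch of the ping-and-timeout idea in Go, assuming the master keeps a last-heard-from timestamp per worker; the struct fields and the timeout value are illustrative, not from the paper.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Assumed bookkeeping: the master records when it last heard from each worker.
type master struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	timeout  time.Duration
}

// heartbeat is called whenever a ping response arrives from a worker.
func (m *master) heartbeat(worker string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastSeen[worker] = time.Now()
}

// checkWorkers reports workers that have been silent too long; their map
// tasks (even completed ones) would be reset to idle and rescheduled,
// since map output lives on the failed machine's local disk.
func (m *master) checkWorkers() []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	var failed []string
	for w, t := range m.lastSeen {
		if time.Since(t) > m.timeout {
			failed = append(failed, w)
		}
	}
	return failed
}

func main() {
	m := &master{lastSeen: map[string]time.Time{}, timeout: 50 * time.Millisecond}
	m.heartbeat("worker-1")
	time.Sleep(100 * time.Millisecond) // worker-1 goes silent
	fmt.Println("failed workers:", m.checkWorkers())
}
```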

Master Failure

The master writes periodic checkpoints of its data structures. If the master task dies, a new copy can be started from the last checkpointed state.
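
A rough sketch of what checkpoint-and-restore could look like in Go; the `masterState` fields and the use of gob encoding are assumptions, since the paper does not specify the checkpoint format.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
)

// masterState is a hypothetical stand-in for the master's data structures:
// per-task status plus the locations of completed map output.
type masterState struct {
	MapTaskStatus    map[int]string   // e.g. "idle", "in-progress", "completed"
	ReduceTaskStatus map[int]string
	IntermediateLocs map[int][]string // map task -> intermediate file locations
}

// checkpoint serializes the state to disk; called periodically.
func checkpoint(path string, s masterState) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(s)
}

// restore loads the last checkpoint so a new master copy can resume.
func restore(path string) (masterState, error) {
	var s masterState
	f, err := os.Open(path)
	if err != nil {
		return s, err
	}
	defer f.Close()
	err = gob.NewDecoder(f).Decode(&s)
	return s, err
}

func main() {
	s := masterState{
		MapTaskStatus:    map[int]string{0: "completed", 1: "in-progress"},
		ReduceTaskStatus: map[int]string{0: "idle"},
		IntermediateLocs: map[int][]string{0: {"/tmp/mr-0-0"}},
	}
	if err := checkpoint("master.ckpt", s); err != nil {
		panic(err)
	}
	recovered, _ := restore("master.ckpt")
	fmt.Println("recovered map task status:", recovered.MapTaskStatus)
}
```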