Invented by Google to scale computation by spreading work across many machines
Abstraction: assume the input is split up into a number of files or chunks in some way.
The map function accepts the input key/value pairs, with the work spread across different machines, and the reduce function collects and merges the results computed by the different machines (workers)
map(k1, v1) => list(k2, v2)
→ Takes in the input and produces a set of intermediate key/value pairs. The MapReduce library then groups all intermediate values associated with the same intermediate key and passes them to the reduce function
reduce(k2, list(v2)) => list(v2)
→ merges together the values for each intermediate key produced by map, forming a possibly smaller set of values
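The canonical example from the paper is word count: map emits ("word", "1") for every word in its split, the library groups the values by key, and reduce adds up the counts. Below is a minimal Go sketch; the KeyValue type, mapF, and reduceF are illustrative names, and the in-line grouping loop stands in for the shuffle that the MapReduce library would normally perform.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// KeyValue is an intermediate key/value pair emitted by the map function.
type KeyValue struct {
	Key   string
	Value string
}

// mapF: (filename, contents) -> list(word, "1")
func mapF(filename, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// reduceF: (word, list("1", "1", ...)) -> total count as a string
func reduceF(key string, values []string) string {
	return fmt.Sprint(len(values))
}

func main() {
	kvs := mapF("doc1", "the quick brown fox jumps over the lazy dog the")

	// The MapReduce library normally performs this grouping (the shuffle):
	// all values with the same intermediate key are collected together.
	groups := map[string][]string{}
	for _, kv := range kvs {
		groups[kv.Key] = append(groups[kv.Key], kv.Value)
	}

	fmt.Println("the ->", reduceF("the", groups["the"])) // the -> 3
}
```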
A worker assigned a map task runs the map function and buffers the intermediate kv pairs in memory (they are periodically written to the worker's local disk, partitioned for the reducers).
A worker assigned a reduce task runs the reduce function and appends its output to a final output file for its reduce partition.
The master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any task that completes or fails is brought back to an idle state so it can be rescheduled. When a worker fails, its map tasks are re-executed even if they had already completed (their results live on the failed machine's local disk, which is now unreachable), but its completed reduce tasks are not (their results are stored in the global file system). A sketch of this failure handling follows.
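A rough Go sketch of the master's failure-detection bookkeeping. All type and field names here are illustrative assumptions, not from the paper, and the ping direction is inverted for simplicity (workers heartbeat the master instead of the master pinging workers): a worker that has been silent longer than the timeout is treated as failed, and its map task is reset to idle so it can be rescheduled.

```go
package main

import (
	"sync"
	"time"
)

type taskState int

const (
	idle taskState = iota
	inProgress
	completed
)

// Master tracks worker liveness and task state (illustrative fields only).
type Master struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time // workerID -> time of last heartbeat
	mapTask  map[string]taskState // workerID -> state of its assigned map task
	// a real master also tracks reduce tasks and intermediate file locations
}

// Ping records a heartbeat from a worker (an RPC in a real system).
func (m *Master) Ping(worker string) {
	m.mu.Lock()
	m.lastSeen[worker] = time.Now()
	m.mu.Unlock()
}

// monitor periodically checks each worker and marks it failed if it has been
// silent for longer than the timeout. Both in-progress and completed map
// tasks on a failed worker go back to idle, because their output lives on
// the worker's local disk; completed reduce tasks would not need this, since
// their output is already in the global file system.
func (m *Master) monitor(timeout time.Duration) {
	for {
		time.Sleep(timeout / 2)
		m.mu.Lock()
		now := time.Now()
		for worker, seen := range m.lastSeen {
			if now.Sub(seen) > timeout {
				if s := m.mapTask[worker]; s == inProgress || s == completed {
					m.mapTask[worker] = idle // eligible for rescheduling
				}
				delete(m.lastSeen, worker)
			}
		}
		m.mu.Unlock()
	}
}

func main() {
	m := &Master{
		lastSeen: map[string]time.Time{},
		mapTask:  map[string]taskState{},
	}
	m.Ping("worker-1")
	m.mu.Lock()
	m.mapTask["worker-1"] = inProgress
	m.mu.Unlock()

	// In a real master this loop runs for the lifetime of the job.
	go m.monitor(10 * time.Second)
	time.Sleep(50 * time.Millisecond)
}
```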
The master writes periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpointed state.
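A minimal sketch of what such checkpointing could look like, assuming the relevant master state is just the task tables and intermediate file locations; the MasterState fields, the gob encoding, and the file path are all illustrative assumptions, not the paper's format. The master periodically serializes its data structures to disk, and a restarted copy decodes the last checkpoint and resumes from it.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
)

// MasterState holds what a new master copy would need to resume
// (illustrative fields, not taken from the paper).
type MasterState struct {
	MapTasks     map[int]string   // map task index -> "idle" | "in-progress" | "completed"
	ReduceTasks  map[int]string   // reduce task index -> state
	Intermediate map[int][]string // map task index -> intermediate file locations
}

// checkpoint serializes the master data structures to disk.
func checkpoint(path string, s *MasterState) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(s)
}

// restore loads the last checkpoint so a new master copy can continue from it.
func restore(path string) (*MasterState, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var s MasterState
	if err := gob.NewDecoder(f).Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	s := &MasterState{
		MapTasks:    map[int]string{0: "completed", 1: "in-progress"},
		ReduceTasks: map[int]string{0: "idle"},
	}
	// In practice this would run periodically, e.g. on a timer.
	if err := checkpoint("/tmp/master.ckpt", s); err != nil {
		panic(err)
	}
	restored, err := restore("/tmp/master.ckpt")
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.MapTasks[0]) // completed
}
```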