cloudera training > MapReduce and HDFS > map-reduce overview
ref: http://www.cloudera.com/sites/default/files/2-MapReduceAndHDFS.pdf
- borrows from functional programming: map, reduce
- provides an interface for map/reduce; we must implement the interface
- map
– the mapper can emit an arbitrary pair, not necessarily the input key/val
– the mapper runs simultaneously on multiple machines; the first to complete is used
– each map runs in its own jvm
– each run in parallel
– input is usualy 64MB – 128MB chunks (results in more streaming)
- reduce
– the number of reduces that run corresponds to the number of output files
– ideally, we want 1 reduce
– run in paralllel
- flow
– data store of k/v pairs > map > barrier (shuffle phase) > reduce > result
- chained map-reduce jobs are common
- all values are processed independently
- bottleneck: now reduce can run until all maps are finished
- combiner
– runs immediately after mapper on map node
– can use reducer function if reducer is commutative and associative
- conclusions
– mapreduce is a useful abstraction
– simplifies large scale comp
– lets the programmer focus on the problem and the library handle the details of distribution




