Erik’s Blog

cloudera training > MapReduce and HDFS > map-reduce overview

ref: http://www.cloudera.com/sites/default/files/2-MapReduceAndHDFS.pdf

- borrows from functional programming: map, reduce
- the framework provides interfaces for map and reduce; we must implement them
- map
– the mapper can emit an arbitrary pair, not necessarily the input key/val
– the same map task may run simultaneously on multiple machines (speculative execution); the output of the first to complete is used
– each map task runs in its own JVM
– map tasks run in parallel
– input is usually split into 64MB – 128MB chunks (large chunks result in more streaming reads)
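
A minimal sketch of the "arbitrary pair" point, in plain Python rather than the Hadoop API. The name word_count_mapper and the (offset, line) input shape are assumptions made for illustration: the input key is a line offset, but the emitted key is a word.

```python
from typing import Iterator, Tuple


def word_count_mapper(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical mapper: takes (line offset, line text), emits (word, 1).

    The output key (a word) has nothing to do with the input key (an offset).
    """
    for word in line.lower().split():
        yield (word, 1)


# One input record can produce many output pairs.
pairs = list(word_count_mapper(0, "the quick the lazy"))
print(pairs)
```

In a real job the framework would call the mapper once per record in each input chunk; here it is called by hand on a single line.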

- reduce
– the number of reduce tasks that run equals the number of output files
– ideally, we want 1 reduce (a single output file)
– reduce tasks run in parallel
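
A rough sketch of why the reducer count matches the output file count: every key is assigned to exactly one reducer, and each reducer writes its own output. This is a simulation in plain Python, not Hadoop code. The functions partition and run_reducers are hypothetical names, and zlib.crc32 is only a stand-in for whatever hash the real partitioner uses.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick the reducer for a key (crc32 is an illustrative stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_reducers(pairs, num_reducers):
    """Return one dict per reducer; each dict stands for one output file."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer sums the values of the keys it owns.
    return [{k: sum(vs) for k, vs in sorted(b.items())} for b in buckets]


pairs = [("the", 1), ("quick", 1), ("the", 1), ("lazy", 1)]
one = run_reducers(pairs, 1)     # a single output holding every key
three = run_reducers(pairs, 3)   # three outputs, some possibly empty
print(len(one), len(three))
```

With 1 reducer the whole result sits in one place; with 3 the same counts are spread over three outputs that would need merging afterwards.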

- flow
– data store of k/v pairs > map > barrier (shuffle phase) > reduce > result
- chained map-reduce jobs are common
- all values are processed independently
- bottleneck: no reduce can run until all maps are finished
- combiner
– runs immediately after the mapper, on the map node
– can reuse the reducer function if the reducer is commutative and associative
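
The flow and the combiner rule can be sketched as a toy single-process simulation. Everything here is an assumption for illustration (run_job, group, map_words, and the sample data are hypothetical), and nothing is actually distributed. It shows the barrier between map and reduce, that a sum combiner cuts the number of shuffled pairs without changing the result, and that an average breaks when reused as a combiner.

```python
from collections import defaultdict


def group(pairs):
    """Group values by key, as the shuffle phase would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits, mapper, reducer, combiner=None):
    """Return (result, number of pairs that crossed the shuffle)."""
    shuffled = []
    for split in splits:  # each split stands for one map task
        out = [pair for record in split for pair in mapper(record)]
        if combiner is not None:  # local pre-aggregation on the "map node"
            out = [(k, combiner(vs)) for k, vs in group(out).items()]
        shuffled.extend(out)
    # Barrier: reduce starts only after every map task has finished.
    result = {k: reducer(vs) for k, vs in group(shuffled).items()}
    return result, len(shuffled)


def map_words(line):
    return [(word, 1) for word in line.split()]


def average(values):
    return sum(values) / len(values)


splits = [["a b a", "a c"], ["b a"]]
plain, n_plain = run_job(splits, map_words, sum)
combined, n_combined = run_job(splits, map_words, sum, combiner=sum)
print(plain == combined, n_plain, n_combined)  # True 7 5

# An average is not associative, so reusing it as a combiner is wrong.
numbers = [[("x", 1), ("x", 2)], [("x", 6)]]
true_avg, _ = run_job(numbers, lambda r: [r], average)
wrong_avg, _ = run_job(numbers, lambda r: [r], average, combiner=average)
print(true_avg["x"], wrong_avg["x"])  # 3.0 3.75
```

The usual workaround for an average is to have the combiner emit (sum, count) pairs and divide only in the reducer.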

- conclusions
– mapreduce is a useful abstraction
– simplifies large-scale computation
– lets the programmer focus on the problem and the library handle the details of distribution

Written by Erik

June 11, 2009 at 9:27 am

Posted in notes
