Erik's blog

Code, notes, recipes, general musings

cloudera training > MapReduce and HDFS > map-reduce overview

leave a comment »

ref: http://www.cloudera.com/sites/default/files/2-MapReduceAndHDFS.pdf

– borrows from functional programming: map, reduce
– provides an interface for map/reduce; we must implement the interface
– map
— the mapper can emit an arbitrary pair, not necessarily the input key/val
— the mapper runs simultaneously on multiple machines; the first to complete is used
— each map runs in its own jvm
— each run in parallel
— input is usualy 64MB – 128MB chunks (results in more streaming)

– reduce
— the number of reduces that run corresponds to the number of output files
— ideally, we want 1 reduce
— run in paralllel

– flow
— data store of k/v pairs > map > barrier (shuffle phase) > reduce > result
– chained map-reduce jobs are common
– all values are processed independently
– bottleneck: now reduce can run until all maps are finished
– combiner
— runs immediately after mapper on map node
— can use reducer function if reducer is commutative and associative

– conclusions
— mapreduce is a useful abstraction
— simplifies large scale comp
— lets the programmer focus on the problem and the library handle the details of distribution

Advertisements

Written by Erik

June 11, 2009 at 9:27 am

Posted in notes

Tagged with , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: