notes from cloudera basic training > hadoop mapreduce deep dive
//we moved quickly through this, so the notes are sparse
- job
– a full program
- task
– by default, hadoop creates the same number of map tasks as there are input blocks
– task attempts
— tasks are attempted at least once
— multiple attempts are performed in parallel w/ speculative execution turned on
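// rough sketch of the task-count rule above, assuming one map task per input block
// the 64 MB block size and the name map_task_count are assumptions made for illustration, not something from the training

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes (hypothetical value)


def map_task_count(file_sizes, block_size=BLOCK_SIZE):
    """Estimate map tasks as one per block, counting each file's blocks separately."""
    # Ceiling division: a partial last block still gets its own task,
    # and an empty file contributes no blocks.
    return sum((size + block_size - 1) // block_size for size in file_sizes)


if __name__ == "__main__":
    mb = 1024 * 1024
    # A 200 MB file needs 4 blocks, a 10 MB file needs 1.
    print(map_task_count([200 * mb, 10 * mb]))  # 5
```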
- tasktracker
– forks jvm process for each task
- job distribution
– mapreduce programs = jar + xml config
– running a job puts jar and xml in hdfs
- data distribution
– data locality decreases when multiple tasks are running (the nodes holding a task's data may already be busy, so the task runs elsewhere)
- mapreduce flow
– client creates JobConf
— identify map and reducer classes
— specify inputs/outputs
— set optional settings
– job is launched via JobClient
— runJob blocks until the job completes
— submitJob is non-blocking
– …
– tasktracker
— periodically queries jobtracker for work
– …
– write for cache coherency (re-use objects in loops(?))
— reusing memory locations => 2x speed-up
— all k/v pairs given by hadoop use this model
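// sketch of the blocking vs non-blocking submission calls in the flow above
// this is a plain-python model, not the hadoop api: submit_job, run_job and _execute are made-up stand-ins, and the "job" is just a sleep

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the cluster: a small pool that runs "jobs" in the background.
_cluster = ThreadPoolExecutor(max_workers=2)


def _execute(job_name, seconds):
    """Pretend to run a job; sleeps, then reports success."""
    time.sleep(seconds)
    return f"{job_name}: SUCCEEDED"


def submit_job(job_name, seconds=0.1):
    """Non-blocking: hand the job off and return a handle right away."""
    return _cluster.submit(_execute, job_name, seconds)


def run_job(job_name, seconds=0.1):
    """Blocking: submit the job, then wait until it completes."""
    return submit_job(job_name, seconds).result()


if __name__ == "__main__":
    handle = submit_job("wordcount")   # returns immediately
    print("still running:", not handle.done())
    print(handle.result())             # waiting on the handle is what run_job does
    print(run_job("grep"))             # does not return until the job is finished
```

// the non-blocking form is useful when the client wants to poll progress or launch several jobs at once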
//is avro comparable to thrift?
- getting data to mapper
– data sets are specified
– input sets contain at least 1 record and are composed of full blocks
- file input format
– most people use SequenceFileInputFormat
– usually we store all our data in hdfs and then ignore what we don’t need, rather than spending time formatting the data when it’s input
– …
- shuffling
– what happens btwn map and reduce
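// sketch of what the shuffle does between map and reduce: group map output by key, route each key to one reducer, sort keys per reducer
// assumptions for illustration: string keys, a crc32-mod-n partitioning rule, and the function name shuffle; hadoop's real partitioner and sort are configurable

```python
import zlib
from collections import defaultdict


def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs by key and assign each key to one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        # Assumed partitioning rule: stable hash of the key modulo the reducer count,
        # so every value for a given key lands on the same reducer.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    # Each reducer receives its keys in sorted order, each with all of its values.
    return [sorted(p.items()) for p in partitions]


if __name__ == "__main__":
    # Map output for a word count: one (word, 1) pair per occurrence.
    pairs = [("hadoop", 1), ("hdfs", 1), ("hadoop", 1), ("task", 1)]
    for reducer_id, grouped in enumerate(shuffle(pairs, num_reducers=2)):
        for word, ones in grouped:
            print(reducer_id, word, sum(ones))  # the reduce step: sum the values
```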
- write the output
– OutputFormat is analogous to InputFormat




