Erik's blog

Code, notes, recipes, general musings

Posts Tagged ‘chef’

hadoop summit 09 > applications track > lightning talks


- hadoop is for performance, not speed
- use activerecord or hibernate for rapid, iterative web dev
- few businesses write map reduce jobs –> use cascading instead
- emi is a ruby shop
- I2P
– feed + pipe script + processing node
– written in a ruby dsl
– can run on a single node or in a cluster
– all data is pushed into S3, which is great cause it’s super cheap
– stack: aws > ec2 + s3 > conductor + processing node + processing center > spring + hadoop > admin + cascading > ruby-based dsl > zookeeper > jms > rest
– deployment via chef
– simple ui (built by engineers, no designer involved)
- cascading supports dsls
- “helping computers learn languages”
- higher accuracy can be achieved using a dependency syntax tree, but this is expensive to produce
- the expectation-maximization (EM) algorithm is a cheaper alternative
- easy to parallelize, but not a natural fit for map-reduce
– map-reduce overhead can become a bottleneck
- 15x speed-up using hadoop on 50 processors
- allowing 5% of data to be dropped results in a 22x speed-up w/ no loss in accuracy
- a more complex algorithm, not more data, resulted in better accuracy
- bayesian estimation w/ bilingual pairs, a more complex algo, with only 8000 sentences results in 62% accuracy (after a week of calculation!)
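
To make the "easy to parallelize, but not a natural fit for map-reduce" point concrete, here is a minimal sketch of EM written as a map step and a reduce step. The notes don't say which model the talk used, so this swaps in IBM Model 1 word alignment (without a NULL word) as a stand-in; the function names and the toy corpus are made up for illustration. The E-step runs independently per sentence pair, which is the part a cluster can spread out, but every iteration needs a full reduce and a fresh copy of the table, which is where the map-reduce overhead comes from.

```python
from collections import defaultdict

def e_step_map(pair, t):
    """Map: emit expected alignment counts for one sentence pair."""
    src, tgt = pair
    emitted = []
    for f in tgt:
        # Normaliser over all source words that could have produced f.
        z = sum(t[(e, f)] for e in src)
        for e in src:
            emitted.append(((e, f), t[(e, f)] / z))
    return emitted

def m_step_reduce(emitted):
    """Reduce: sum counts per (e, f), then renormalise per source word e."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for (e, f), c in emitted:
        counts[(e, f)] += c
        totals[e] += c
    return {(e, f): c / totals[e] for (e, f), c in counts.items()}

def run_em(pairs, iterations=10):
    # Start with equal weight for every co-occurring word pair.
    t = {(e, f): 1.0 for src, tgt in pairs for e in src for f in tgt}
    for _ in range(iterations):
        emitted = []
        for pair in pairs:  # independent per pair: the parallel part
            emitted.extend(e_step_map(pair, t))
        t = m_step_reduce(emitted)  # one full reduce per iteration
    return t

# Hypothetical toy corpus, not data from the talk.
pairs = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
table = run_em(pairs)
best_for_book = max((f for (e, f) in table if e == "book"),
                    key=lambda f: table[("book", f)])
```

On a real cluster each iteration would be a separate job, with the table from the previous reduce shipped out to every mapper.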

Written by Erik

June 10, 2009 at 5:27 pm

