Archive for June 10th, 2009
hadoop summit 09 > applications track > lightning talks
emi
- hadoop is for performance, not speed
- use activerecord or hibernate for rapid, iterative web dev
- few businesses write map reduce jobs –> use cascading instead
- emi is a ruby shop
- I2P
– feed + pipe script + processing node
– written in a ruby dsl
– can run on a single node or in a cluster
– all data is pushed into S3, which is great cause it’s super cheap
– stack: aws > ec2 + s3 > conductor + processing node + processing center > spring + hadoop > admin + cascading > ruby-based dsl > zookeeper > jms > rest
– deployment via chef
– simple ui (built by engineers, no designer involved)
- cascading supports dsls
- “i helpig ciomputers learn languages
- higher accuracy can be achieved using a dependency syntax tree, but this is expensive to produce
- the expectation-maximum algorithm is a cheaper alternative
- easy to parallelize, but not a natural fit for map-reduce
– map-reduce overhead can become a bottleneck
- 15x speed-up using hadoop on 50 processors
- allowing 5% of data to be dropped results in a 22x speed-up w/ no loss in accuracy
- a more complex algorithm, not more data, resulted in better accuracy
- bayesian estimation w/ bilingual pairs, a more complex algo, with 8000 only sentences results in 62% accuracy (after a week of calculation!)
hadoop summit 09 > administration track > chip multi-threading (CMT)
ref: http://developer.yahoo.com/events/hadoopsummit09/
chip multi-threading (CMT)
- a high-throughput architecture targeted at web-serving, oracle, etc
- a “system-on-a-chip”
- 1/2 Terabyte of mem
logical domains
- a hardware-based virtualization supported by Sun hardware
zone
- software-based virtualization feature in solaris
hadoop summit 09 > applications track > Case Studies on EC2
ref: http://developer.yahoo.com/events/hadoopsummit09/
- eHarmony
– matching people is an N^2 process
– run hadoop jobs on EC2 and S3
– results downloaded from S3 and imported into BerkeleyDB
– S3 is a great place to store huge files for a long time because it’s so cheap
– switched from bash to ruby because ruby has better exception handling
– elastic map reduce has replaced 150 lines of ec2 management script
- share this
– simplifies sharing online content: delicious + ping.fm + bit.ly
– they’re a small compan, but they need to keep pace w/ the volume of the large publishers they support
– they’re 100% based on AWS
– aster + lamp stack + cascading running hadoop (to clean logs before pushing data into db) + s3 + sqs
– sharded search mostly used for business intel
– cascading allows efficient hadoop coding, more so than pig
– in the hadoop book, the author of cascading wrote a case study on sharethis
- lookery
– started as an ad network on facebook
– built completely on aws
– use a javascript-based tracker like google analytics to gather data
– data acquisition + data serving + reporting + billing–> all done in hadoop
– they use voldemort, a distributed key/val store instead of memcache
– heavy use of hadoop streaming w/ python
- deepdyve
– a search engine
– having an elastic infrastructure allows for innovation
– using hadoop, they went from 1 wk to 1 hr for indexing
– start spinning up new clusters and discarding old ones
– ec2 + katta + zookeeper + hadoop + lucene –>most of the software they run, they didn’t have to write
– query times are lower, user satisfaction is higher
– problems:
— unstable aws
— session timeout on zookeeper
— slow provisioning for aws
– with aws, they can run load tests to prepare for spikes




