Erik’s Blog


Archive for June 10th, 2009

hadoop summit 09 > applications track > lightning talks


emi
- hadoop is for performance, not speed
- use activerecord or hibernate for rapid, iterative web dev
- few businesses write map reduce jobs –> use cascading instead
- emi is a ruby shop
- I2P
– feed + pipe script + processing node
– written in a ruby dsl
– can run on a single node or in a cluster
– all data is pushed into S3, which is great because it’s super cheap
– stack: aws > ec2 + s3 > conductor + processing node + processing center > spring + hadoop > admin + cascading > ruby-based dsl > zookeeper > jms > rest
– deployment via chef
– simple ui (built by engineers, no designer involved)
- cascading supports dsls
- “i help computers learn languages”
- higher accuracy can be achieved using a dependency syntax tree, but this is expensive to produce
- the expectation-maximization (EM) algorithm is a cheaper alternative
- easy to parallelize, but not a natural fit for map-reduce
– map-reduce overhead can become a bottleneck
- 15x speed-up using hadoop on 50 processors
- allowing 5% of data to be dropped results in a 22x speed-up w/ no loss in accuracy
- a more complex algorithm, not more data, resulted in better accuracy
- bayesian estimation w/ bilingual pairs, a more complex algo, with only 8000 sentences results in 62% accuracy (after a week of calculation!)
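
The notes above say EM is easy to parallelize but pays map-reduce overhead. A minimal sketch of why, under loud assumptions: the model here is a toy mixture of two biased coins with equal priors, swapped in for illustration and not the language model from the talk, and every function name is hypothetical. The E-step runs independently per data shard (the map), the expected counts add up across shards (the reduce), and each EM iteration would be a separate map-reduce job on a cluster.

```python
from functools import reduce


def binom_likelihood(heads, flips, p):
    """Likelihood of `heads` in `flips` tosses, up to a constant factor."""
    return (p ** heads) * ((1 - p) ** (flips - heads))


def e_step(shard, p_a, p_b):
    """Map phase: expected head/tail counts for coins A and B on one shard."""
    counts = [0.0, 0.0, 0.0, 0.0]  # heads_a, tails_a, heads_b, tails_b
    for heads, flips in shard:
        like_a = binom_likelihood(heads, flips, p_a)
        like_b = binom_likelihood(heads, flips, p_b)
        w_a = like_a / (like_a + like_b)  # posterior that coin A was used
        w_b = 1.0 - w_a
        counts[0] += w_a * heads
        counts[1] += w_a * (flips - heads)
        counts[2] += w_b * heads
        counts[3] += w_b * (flips - heads)
    return counts


def combine(left, right):
    """Reduce phase: expected counts from different shards simply add."""
    return [a + b for a, b in zip(left, right)]


def m_step(counts):
    """Re-estimate each coin's head probability from the summed counts."""
    heads_a, tails_a, heads_b, tails_b = counts
    return heads_a / (heads_a + tails_a), heads_b / (heads_b + tails_b)


def run_em(shards, p_a=0.6, p_b=0.5, iterations=20):
    """Run EM; `shards` is a list of lists of (heads, flips) observations."""
    for _ in range(iterations):
        # On a cluster, each pass through this loop is one map-reduce job,
        # so job start-up cost is paid once per iteration.
        per_shard = [e_step(shard, p_a, p_b) for shard in shards]
        p_a, p_b = m_step(reduce(combine, per_shard))
    return p_a, p_b


if __name__ == "__main__":
    data = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
    print(run_em([data[:2], data[2:]]))
```

Because the per-shard counts are just summed, splitting the data differently does not change the estimates (up to floating-point rounding); the cost that remains is the fixed overhead of launching one job per iteration.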

Written by Erik

June 10, 2009 at 5:27 pm

hadoop summit 09 > administration track > chip multi-threading (CMT)


ref: http://developer.yahoo.com/events/hadoopsummit09/

chip multi-threading (CMT)
- a high-throughput architecture targeted at web-serving, oracle, etc
- a “system-on-a-chip”
- 1/2 Terabyte of mem

logical domains
- hardware-based virtualization supported by Sun hardware

zone
- software-based virtualization feature in solaris

Written by Erik

June 10, 2009 at 2:38 pm

Posted in notes


hadoop summit 09 > applications track > Case Studies on EC2


ref: http://developer.yahoo.com/events/hadoopsummit09/

- eHarmony

– matching people is an N^2 process

– run hadoop jobs on EC2 and S3

– results downloaded from S3 and imported into BerkeleyDB

– S3 is a great place to store huge files for a long time because it’s so cheap

– switched from bash to ruby because ruby has better exception handling

– elastic map reduce has replaced 150 lines of ec2 management script
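
The eHarmony point that matching is an N^2 process can be made concrete with a small sketch. The helper names and user counts below are illustrative assumptions, not figures from the talk; the code only counts unordered pairs to show how fast the work grows.

```python
from itertools import combinations


def candidate_pairs(user_ids):
    """Return an iterator over every unordered pair of distinct users."""
    return combinations(user_ids, 2)


def pair_count(n):
    """Number of unordered pairs among n users: n * (n - 1) / 2, i.e. O(n^2)."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    print(list(candidate_pairs(["a", "b", "c", "d"])))
    # Doubling the user count roughly quadruples the pairs to score.
    for n in (1_000, 2_000, 4_000):
        print(n, pair_count(n))
```

Since each pair can be scored independently, this kind of workload splits naturally across many machines, which is presumably why it suits Hadoop on EC2.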

 

- sharethis

– simplifies sharing online content: delicious + ping.fm + bit.ly

– they’re a small company, but they need to keep pace w/ the volume of the large publishers they support

– they’re 100% based on AWS

– aster + lamp stack + cascading running hadoop (to clean logs before pushing data into db) + s3 + sqs

– sharded search mostly used for business intel

– cascading allows efficient hadoop coding, more so than pig

– in the hadoop book, the author of cascading wrote a case study on sharethis

 

- lookery

– started as an ad network on facebook

– built completely on aws

– use a javascript-based tracker like google analytics to gather data

– data acquisition + data serving + reporting + billing–> all done in hadoop

– they use voldemort, a distributed key/val store instead of memcache

– heavy use of hadoop streaming w/ python
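
For readers who haven't seen hadoop streaming w/ python: the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. A minimal sketch, assuming a hypothetical tab-separated tracker log whose first field is a publisher id (this is not Lookery's actual format or code):

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'publisher_id<TAB>1' for each non-empty log line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield f"{fields[0]}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each key.

    Streaming delivers reducer input sorted by key, so consecutive lines
    with the same key can be grouped without keeping every key in memory.
    Expects the 'key<TAB>count' lines produced by map_lines.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def main(argv):
    """Dispatch on 'map' or 'reduce'; do nothing for any other arguments."""
    if len(argv) != 2 or argv[1] not in ("map", "reduce"):
        return
    step = map_lines if argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)


if __name__ == "__main__":
    main(sys.argv)
```

Saved as, say, `counts.py` (a made-up name), it can be tried locally without a cluster, with `sort` standing in for the shuffle: `cat log.tsv | python counts.py map | sort | python counts.py reduce`.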

 

- deepdyve

– a search engine

– having an elastic infrastructure allows for innovation

– using hadoop, they went from 1 wk to 1 hr for indexing

– start spinning up new clusters and discarding old ones

– ec2 + katta + zookeeper + hadoop + lucene –> they didn’t have to write most of the software they run

– query times are lower, user satisfaction is higher

– problems:

— unstable aws

— session timeout on zookeeper

— slow provisioning for aws

– with aws, they can run load tests to prepare for spikes

Written by Erik

June 10, 2009 at 1:51 pm

Posted in notes
