Erik's blog

Code, notes, recipes, general musings

Archive for November 2009

Simpleton Pattern


We don’t want multiple objects of the same class, but we also don’t want to clutter our code with the kind of checks required to implement the Singleton pattern.


Kill program execution if a second attempt to instantiate an object occurs


class Foo {
    private static $instantiated = false;

    function __construct() {
        if (self::$instantiated) {
            die('poof!'); // kill execution on a second instantiation
        }
        self::$instantiated = true;
    }
}

$foo = new Foo(); // all good
$bar = new Foo(); // poof!

Written by Erik

November 25, 2009 at 8:02 pm

Posted in Uncategorized

notes from Cloud Expo 2009: Chuck Neerdaels on “Yahoo! Scalable Storage and Delivery Services”


– 280k image requests/sec
– 300k req/sec

– sports, travel, mail, news

internal expectations
– global

common challenges
– speed of light
– spikes
– cost
— space, power, bandwidth, replication bandwidth
– partitioned network failures
— data center failures
— cap theorem
– consumer/user intuition
— replication is not bcp

mobstor & sherpa
– mobstor
— storage and delivery cloud
— cdns make sense when we have 90% cache hit rate
— features
— global
—- caching
—- protocol termination
—- authentication
—- content routing
— local
—- auto expiration
—- de-dup
—- object placement
—- re-replication
— layering
— dns gslb –> hardware vip –> scalable session mgr –> geo replication –> internal dns w/ loop feedback –> hardware vip –> local replication –> separate metadata replication
– sherpa
— simply put, it’s sharded mysql w/ replication
— stack: dns –> hardware vip –> scalable router & session mgr –> geo repl –> …
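The “sharded mysql” description above fits in a few lines: a router hashes each record key to pick the shard that owns it. A minimal sketch, assuming a hypothetical shard list and hash choice (not Sherpa’s actual scheme):

```python
import hashlib

# Hypothetical shard list; in a real deployment these would be db endpoints.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(key: str) -> str:
    """Deterministically map a record key to the shard that owns it."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the mapping is deterministic, every router sends reads and writes for a given key to the same shard; replication then happens per shard.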

physics & econ for a global cloud
– what’s your target sla?
— distance + speed of light + network degradation = latency
— selective replication
– lessons learned
— intuition is usually wrong: let data drive data
— provide hooks, experimental feedback, and mobility
— n-way global repl is really expensive
— customers don’t understand 95/5 billing
— customers don’t understand cap theorem
— verify all provisioning
— there are a lot of non-hardware issues that can affect hardware performance
— strive for quality, but plan for failure
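The “speed of light” challenge above is easy to quantify: round-trip latency has a hard physical floor before any degradation enters the picture. A back-of-envelope sketch (the distance and fiber factor are rough assumptions):

```python
C_KM_PER_MS = 300.0   # light travels ~300 km per millisecond in vacuum
FIBER_FACTOR = 1.5    # fiber is ~2/3 c, and routes are never straight lines

def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring all other delay."""
    return 2 * distance_km * FIBER_FACTOR / C_KM_PER_MS

# San Francisco to Sydney is roughly 12,000 km great-circle distance
print(round(min_rtt_ms(12000)))  # ~120 ms before any queuing or processing
```

This is why selective replication matters: no amount of engineering gets a trans-Pacific round trip under that floor.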

why care about y! cloud?
– commitment to open source
— component approach
— traffic server
— a handful of ppl for more than a year worked on open sourcing
— it’s a huge benefit to the community
— 400tb/day on 150 commodity boxes!
— would you want to build your company on a web server that no one else uses?
— zookeeper

– why mobstor?
— mobstor solved the problem of brittle urls
– why dora?
— mobstor was built on filers
— dora was developed to replace filers
– commodity box costs
— < $4k / box
— highest density sata: 12 drives x 1tb/drive
– k/v store vs. relational db?
— everyone would love acid transactions, but they also need consistency & geo replication
– uses of k/v stores?
— mail uses it to assoc abuse records w/ ip addresses
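The mail use case in the last bullet is a natural fit for a k/v store: the IP address is the key, and the value accumulates abuse records. A toy sketch, with a dict standing in for the real store and a made-up threshold:

```python
from collections import defaultdict

abuse_records = defaultdict(list)  # stand-in for a real k/v store

def report_abuse(ip: str, reason: str) -> None:
    """Append an abuse record under the offending IP address."""
    abuse_records[ip].append(reason)

def is_abusive(ip: str, threshold: int = 3) -> bool:
    """An IP is flagged once it has accumulated enough reports."""
    return len(abuse_records[ip]) >= threshold
```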

about 150 ppl, 3/4+ full

Written by Erik

November 4, 2009 at 9:50 am

Posted in Uncategorized

notes from Cloud Expo 2009: Christophe Bisciglia on “Working with Big Data and Hadoop”


– machines are reliable
– machines are unique or identifiable
– a data set should fit on one machine

– it’s not a database
— it doesn’t serve data in real-time
— it augments existing DBs
— it does enable deeper analysis that would normally slow a relational DB
– leverages commodity hardware for big data & analytics
– cloudera does for hadoop what redhat does for linux

– fox
— what ppl are watching on set-top boxes
– autodesk
– D.E.Shaw
— analyze financial data
– mailtrust
— use hadoop to process mail logs and generate indexes that support staff can use to make ad-hoc queries

– scientific and experimental data
– storage
— multiple machines are req’d to store the amount of data we’re interested in
— replication protects data from failure
— data is also 3 times as available

– allows for processing data locally
– allows for jobs to fail and be restarted

hadoop’s fault tolerance
– handled at software level

using hadoop
– map-reduce
— natively written in java
— map-reduce can be written in any language
– hive
— provides sql interface
– pig
— high level lang for ad-hoc analysis
— imperative lang
— great for researchers and technical prod. managers
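The map-reduce model that all of the above builds on fits in a few lines of plain Python (no Hadoop involved): a mapper emits (key, value) pairs, a shuffle groups them by key, and a reducer folds each group.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every word in a line -- the classic word count."""
    for word in line.split():
        yield word, 1

def reducer(key, values):
    return key, sum(values)

def map_reduce(lines):
    groups = defaultdict(list)          # the "shuffle" phase, in miniature
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["a b a", "b a"]))     # {'a': 3, 'b': 2}
```

Hive and Pig ultimately compile down to jobs of this shape.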

high performance DB and analytics.  when is it time for hadoop?
– in general
— generation rate exceeds load capacity
— performance/cost considerations
— workloads that impede performance
– db
— 1000s of transactions per second
— many concurrent queries
— read/write
— many tables
— structured data
— high-end machines
— annual fees
– hadoop
— append only update pattern
— arbitrary keys
— unstructured or structured data
— commodity hardware
— free, open source

– traditional: web server –> db –> oracle –> biz analytics
– hadoop: web server –> db –> hadoop –> oracle –> biz analytics

– data storage costs drop every year
– hadoop removes bottlenecks; use the right tool for the job
– makes biz intel apps smarter

– cloudera’s distro for hadoop
– cloudera desktop

Written by Erik

November 3, 2009 at 6:26 pm

Posted in notes


notes from Cloud Expo 2009: Surendra Reddy’s presentation on “Walking Through Cloud Serving at Yahoo!”


open cloud access protocol (opencap)
– definition, deployment, and life cycle mgmt of cloud resources
– allocation, provisioning, and metering of cloud resources
– metadata/registry for cloud resources
– virtual infrastructure
– why ietf?  they are brilliant folks.  we’re not attaching any vendor-specific details
– open source implementation planned
– structure
— resource model
— all infrastructure
— nodes, networks, etc
— resource properties
— modeled as json objects
— standard catalog of attributes, extensible
— resource operations
— operation, control, etc.
— management notification services

– smtp is simple.  protocols must be simple
– traffic server has bindings built in (or vice versa?)
– open cloud consortium
— a national testbed to bring clouds together

Written by Erik

November 3, 2009 at 3:11 pm

Posted in notes

Tagged with , ,

Notes from Cloud Expo 2009: Raghu Ramakrishnan’s talk on the Yahoo! cloud: “key challenges in cloud computing … and the yahoo! approach”


raghu ramakrishnan
– a triumphant preso
– “key challenges in cloud computing … and the y! approach”

this is a watershed time.  we’ve spent lots of time building packaged software; now we’re moving to the cloud

key challenges
– elastic scaling
– availability
— if the cloud goes down, everyone is hosed.  consistency or performance must be traded for availability.
– handling failures
— if things go wrong, what can the developer count on when things come up?
– operational efficiency
— cloud managers are db admins for 1000s of clients
– the right abstractions

yahoo’s cloud
– the cloud is an ecosystem.  it’s bigger than a single component.  all the pieces must work together seamlessly.

data management in the cloud
– how to make sense of the many options
– what are you trying to do?
– oltp vs olap
– oltp
— random access to a few records
— read-heavy vs write-heavy
– olap
— scan access to a large number of records
— by rows vs columns vs unstructured
– storage
— common features
— managed service. rest apis
— replication
— global footprint
— sherpa
— mobstor

y! storage problem
– small records, 100kb or less
– structured records, lots of fields
– extreme data scale

typical applications
– user logins and profiles
— single-record transactions suffice
– events
— alerts, social network activity
— ad clicks
– app-specific data
— postings to message boards
— uploaded photos and tags

vlsd data serving stores
– scale based on partitioning data across machines
– range selections
— requests span machines
– availability
– replication
– durability
— is it required?
– how is data stored on a single machine?

the cap theorem
– consistency vs availability vs partition tolerance
– consistency => serializability

approaches to cap
– use a single version of a db w/ deferred reconciliation
– defer transaction commit
– eventual consistency eg dynamo
– restrict transactions eg sharded mysql
– object timelines, eg sherpa
– ref: julianbrowne.com/article/viewer/brewers-cap-theorem
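One concrete reading of the “eventual consistency eg dynamo” bullet: replicas accept writes independently during a partition and reconcile afterwards. A minimal last-write-wins merge sketch, using plain timestamps as version stand-ins (Dynamo itself uses vector clocks):

```python
def merge(replica_a: dict, replica_b: dict) -> dict:
    """Reconcile two diverged replicas; each value is a (timestamp, data) pair."""
    merged = dict(replica_a)
    for key, (ts, data) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, data)   # the newer write wins per key
    return merged
```

Running the merge in either order yields the same state, which is what lets replicas converge without coordination.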

single slide hadoop primer
– hadoop is write optimized, not ideal for serving

out there in the world
– oltp
— oracle, mysql
— write optimized: cassandra
— main-mem: memcached

ways of using hadoop
– data workloads -> olap -> pig for row ops, zebra for column ops, map reduce for others

hadoop based apps
– we own the terasort benchmark

– parallel db
– geo replication
– structured, flexible schemas
– hashed and ordered tables
– components
— req -> routers -> (record looked up, if necessary) -> lookup cached -> individual machine
– raghu is awesome (“And then!”, sprinting through dense slides)
– write-ahead
– asynch replication
— why? we’re doing geo replication due to the physics involved
— supposing an earthquake hits and ca falls in the ocean, two users can continue to update their profiles
– consistency model
— acid requires synch updates
— eventual consistency works
— is there any middle ground?
— sherpa follows a timeline of changes achieved through a standard per-record primary copy protocol
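The per-record primary copy protocol in the last bullet can be sketched as a toy model (not Sherpa’s actual protocol): the primary mints a monotonically increasing version for each write, and replicas apply updates in version order, so every copy walks the same timeline, possibly lagging.

```python
class PrimaryRecord:
    """The single primary copy of a record; the only place versions are minted."""
    def __init__(self):
        self.version, self.value = 0, None

    def write(self, value):
        self.version += 1
        self.value = value
        return self.version, value     # shipped asynchronously to replicas

class ReplicaRecord:
    """A remote copy; applies updates in order and ignores stale ones."""
    def __init__(self):
        self.version, self.value = 0, None

    def apply(self, version, value):
        if version > self.version:
            self.version, self.value = version, value
```

Stale or duplicate updates arriving out of order are simply dropped, so a replica never moves backwards along the record’s timeline.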

– cloud allows us to operate at scale
– tablet splitting and balancing
– automatic transfer of mastership

comparing systems
– main point: all of this needs to be thought through and handled automatically

– sherpa, oracle, mysql work well for oltp

benchmark tiers
– cluster performance
– replication
– scale out
– availability
– we’d like to do this as a group effort, in keeping w/ our philosophy

the integrated cloud
– big idea: declarative lang for specifying structure of service
– key insight: multi-env
– central mechanism: the integrated cloud
– surendra will talk about this

foundation components
– how to describe app
– desc for resources, entry points, bindings, etc

yst handled 16.4 million uniques for mj death news

acm socc
– acm symposium on cloud computing

Written by Erik

November 3, 2009 at 10:25 am

Posted in notes


Notes from Cloud Expo 2009: Shelton Shugar, “Accelerating Innovation with Cloud Computing”


Shelton Shugar just delivered an excellent keynote address, “Accelerating Innovation with Cloud Computing”, at the 4th Cloud Conference and Expo in Santa Clara.  This is also the 7th annual virtualization summit.

Yahoo is not here to sell you anything; we’re not into consulting or selling software.  At Yahoo! cloud computing is not about saving money.  Our motivation arises from the fact that cloud computing drives innovation.  Cloud computing is the “engine of innovation”.  Yahoo! has hundreds of products and platforms all over the world.  Many of these products were the result of acquisition, so they came onboard w/ their own infrastructure, down to the metal.  Cloud computing at y! is about streamlining the services these products and platforms require.  We store hundreds of petabytes of data all over the world and handle petabytes of internet traffic daily.  We think about scale foremost and features second.

cloud strategy
we are building a private cloud, deployed in data centers all over the world.  focusing on two areas: data processing and serving.  data processing refers to data mining and analysis.  serving refers to app environments for our products, edge capabilities for fast delivery, and a channel for data to flow into storage.  This is a multi-year effort.  “Open source plays a central role”.  We both consume and produce open source.

inside the y! cloud
5 buckets: edge services, cloud serving where we host apps w/in y!, online storage for serving content to consumers, a batch processing data warehouse, and data collection services to clean, de-dup, and filter incoming data.

Serving is based on the Yahoo! Traffic Server.  Over half of all y! traffic flows through YST.

The app serving layer is based on a tiered architecture.  Apps can be cloned.  Traffic can be split natively, which allows for bucket testing.  This frees developers from having to worry about versions of the platform, location of machines, etc.  Capacity can be moved via point and click.

Uses RESTful APIs.  Deployed worldwide.  Global replication is supported natively.  Multiple consistency models are provided.  Mobstor (mass object store) is used to store large objects (1mb-2gb) such as images and video.  Objects are immutable.  Structured content is provided via a product called Sherpa, a key-value store.  Content can be replicated easily.  Sherpa is intended to support enough of the capabilities that relational dbs are typically used for.

Batch processing is oriented around Hadoop.  This has been running for a few years.  It now runs on 10s of thousands of machines, with 80PB of storage.  We use it to optimize our advertising and process weblogs.  1000s of yahoos are trained to run jobs on it.  hdfs allows thousands of computers to be treated as a single machine.  Pig is a higher-level procedural lang that generates map-reduce code.  It’s almost as efficient as well-written map-reduce code; the internal joke is that most people don’t write well-written map-reduce code.  We’re building columnar storage.

An example: the y! homepage
When a user visits the homepage, the user is using y! cloud services.  Content is optimized using a feedback loop to provide relevant stories in the news offered.  Hadoop is used to optimize ad matching and to build the search index.  Edge services are used to cache and load-balance the page content and to normalize the news feeds.

Another example of usage: y! mail.  Hadoop is used to identify and filter spam.  Before hadoop, mail engineers had to spend lots of time maintaining storage and machines to process a huge amount of data.  Hadoop abstracts scale for processing enormous data, handles failures, and manages multiple users.  This allows the scientists to focus on their jobs.  Mail uses cloud storage’s replication services to help detect abuse.

Y! sports’ usage of cloud services: edge services provides a proxy service to route requests for dynamic content.  This allows y! sports to provide the most up-to-date content.  People want scores as fast as possible.  The consumers are happy due to faster access to content.

y! finance.  y! is #1 for finance.  finance uses hadoop to speed advertising optimization by improving resource utilization.

yql is an sql-like language.  it allows developers to query, filter, join etc. data.  yql uses sherpa instead of managing its own storage.

open source @ y!
hadoop.  we contribute our code for hadoop to open source.  external developers benefit and contribute back.  pig is open source.  zookeeper is a system used to coordinate multiple systems.  open cirrus is a consortium designed to facilitate cloud computing research.  it has 9 members.  y!’s contribution is m45, w/ 1000 cores.  we work w/ some of the leading universities in the world.  We’ve built an enormous community around hadoop.  we can hire people straight out of university.  open source attracts the best and the brightest.

About 500 people were in attendance.

The highlight of the talk was his announcement of the newly open-sourced Yahoo! Traffic Server, now an Apache Incubator project.  A recent post on OStatic gives more information about the project.  traffic server can process up to 34k transactions/sec on commodity hardware.  it’s modular.  it’s how we implement our caching, proxy, load balancing, etc.  we push 400tb daily through it.  we use it in online storage to help direct traffic.  we’re hoping to create a vibrant community around traffic server like we did w/ hadoop.

Back in June, we announced the y! distribution of hadoop.  we select the code we need and test it well.  it’s a solid collection of code that’s been proven to work.  shelton announced that we’re now updating our release.

we’re fully committed to cloud computing.  “moving to the cloud requires change”.  if you’re like us, w/ lots of legacy systems, you need to make a big organizational commitment.  it’s more like a marriage than a transaction.  it takes investment to create these services and migrate to them.  it takes time.  ours is a multi-year effort.  cloud computing is worth it for us.  it’s changing our culture.  we’re able to deploy so much faster than before.

Written by Erik

November 3, 2009 at 8:51 am

Posted in notes


Notes from Christian Heilmann’s Developer Evangelism handbook



  • The Developer Evangelist Handbook
  • remove the brand
    • “As a developer evangelist you have to keep your independence.”
    • “Your independence and your integrity is your main weapon. If you lost it you are not effective any longer. People should get excited about what you do because they trust your judgment – not because you work for a certain company.”
    • ways to work w/ the competition: “Remain an independent voice”, “Become a specialist in a certain underlying technology”, “Keep your finger on the pulse”
    • “You can’t be a professional evangelist and bad-mouth the competition at the same time. We all are professionals and work on projects to make the web a better place.”
    • “Acknowledge when the competition is better”
    • “Know about the competition”
  • Work with your own company
    • “Your job as a developer evangelist is to listen to developers, understand their problems and communicate with management to try to sort the issues out.”
    • “There is no “off the record””
    • “if people ask you what is going on don’t say “no comment” as that implies you know something but are not allowed to say it. Simply state that you are not in a position to know yet but that you are investigating.”

Written by Erik

November 2, 2009 at 4:01 pm

Posted in notes

Tagged with , ,