Data Mining 3.4 billion Web pages for $100 of EC2
I'm a heavy user of spot requests for my main application stack. I run my startup's core services on spot requests: processes that I need running 100% of the time (memcached, NFS, Gearman, Varnish, nginx, etc.).
I'm sure you've all heard of the Chaos Monkey that Netflix runs? Well, I didn't even need to write a chaos monkey... All you have to do is run everything on spot requests. Eventually you will lose servers at unpredictable times because someone outbids you[1].
Spot pricing is typically a fraction of the on-demand price. For instance, a cp1.medium (4 cores, 4 GB RAM) costs $0.044 per hour as a spot request, compared to $0.186 per hour on demand.
I bid $1 per hour for my spot requests across two zones in the same region. I group my servers and use ELBs (Elastic Load Balancers) to route requests...
Typically, a spot request might last about a week before it gets killed because the capacity isn't there. That's when the instances in my other zone take 100% of the load temporarily. At that point, since I've lost an entire zone's worth of servers, I have my Auto Scaling group fire up on-demand instances until I can get more spot requests fulfilled. Creating a setup like this took about a week, but the savings are enormous.
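For anyone curious, the spot side of this is only a couple of API calls. A rough sketch with boto3 (not my exact scripts; the AMI, key name, security group, and bootstrap path are placeholders, and the on-demand fallback lives in a separate Auto Scaling group that isn't shown):

    import base64
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Bootstrap handed to cloud-init on boot (placeholder path).
    user_data = base64.b64encode(
        b"#cloud-config\nruncmd:\n  - /opt/bootstrap.sh\n"
    ).decode()

    response = ec2.request_spot_instances(
        SpotPrice="1.00",            # bid cap; you pay the market price, not the bid
        InstanceCount=2,
        Type="persistent",           # re-request automatically after termination
        LaunchSpecification={
            "ImageId": "ami-12345678",                    # placeholder AMI
            "InstanceType": "c1.medium",                  # placeholder type
            "KeyName": "my-key",                          # placeholder key pair
            "SecurityGroups": ["web"],                    # placeholder group
            "Placement": {"AvailabilityZone": "us-east-1a"},
            "UserData": user_data,   # request_spot_instances expects base64 here
        },
    )
    print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])
    # A second request with "us-east-1b" covers the other zone.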
----
How is the data stored on this setup?
RDS (MySQL) handles all data that can be stored in a database.
Ephemeral storage is used to store things that don't need to be persistent (e.g. transactional logs).
Sessions are managed through Redis. If the Redis servers die, session handling falls back to MySQL temporarily. (It's a lot slower, but the MySQL server is on RDS, so it's always running.)
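The fallback is nothing fancy. Roughly (hostnames, credentials, table, and schema here are made up for illustration):

    import redis
    import pymysql

    r = redis.Redis(host="sessions.internal", port=6379)
    db = pymysql.connect(host="mydb.rds.amazonaws.com", user="app",
                         password="placeholder", database="app")

    def get_session(session_id):
        # Try Redis first; if it's unreachable, fall back to the slower MySQL path.
        try:
            data = r.get("session:" + session_id)
            if data is not None:
                return data
        except redis.exceptions.ConnectionError:
            pass  # Redis servers are down
        with db.cursor() as cur:
            cur.execute("SELECT data FROM sessions WHERE id = %s", (session_id,))
            row = cur.fetchone()
            return row[0] if row else None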
Elastic Block Store (EBS) volumes are automatically mounted on a single instance, which is then set up as an NFS server so that other servers can read from a particular mount point (e.g. a user uploads an image and it's stored on the NFS mount; a different server reads the file, generates different dimensions, uploads all of the files to Amazon S3, and then deletes the original file from the NFS mount).
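The image worker is roughly this (the paths, bucket name, and sizes are placeholders, not the real values):

    import os
    import boto3
    from PIL import Image

    s3 = boto3.client("s3")
    BUCKET = "my-image-bucket"                             # placeholder bucket
    SIZES = [(1024, 1024), (300, 300), (100, 100)]

    def process_upload(filename):
        src = os.path.join("/mnt/nfs/uploads", filename)   # shared NFS mount
        for w, h in SIZES:
            img = Image.open(src)
            img.thumbnail((w, h))                          # keeps aspect ratio
            out = "/tmp/%dx%d_%s" % (w, h, filename)
            img.save(out)
            s3.upload_file(out, BUCKET, "images/%dx%d/%s" % (w, h, filename))
            os.remove(out)
        os.remove(src)                                     # done with the original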
The worst part about losing servers is when a memcached server dies, because I can lose weeks' worth of cached data. When that happens, I have to boot up several micro instances that take my "cache warming" list and basically start repopulating memcached again.
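The warming script itself is trivial; it's booting the instances and babysitting them that's annoying. Roughly (the warming-list table, key scheme, and hosts are placeholders):

    import pymysql
    from pymemcache.client.base import Client

    mc = Client(("cache.internal", 11211))                 # placeholder host
    db = pymysql.connect(host="mydb.rds.amazonaws.com", user="app",
                         password="placeholder", database="app")

    def warm_cache():
        # Replay the "cache warming" list so a fresh memcached isn't stone cold.
        with db.cursor() as cur:
            cur.execute("SELECT cache_key, cache_value FROM cache_warming_list")
            for key, value in cur.fetchall():
                mc.set(key, value, expire=86400)           # 24h TTL

    warm_cache()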
The entire system is designed to be redundant... I can kill every server and then run the initialization script to start up the entire stack. (It's basically lots of little cloud-init scripts[2])
Bit of a problem with the headline: they didn't crawl anything because Common Crawl already did that.
AWS Spot Instances are incredible. They make you a liquidity provider, and reward you for it.
Paying list price for any load that isn't mission-critical and needed immediately is insane.
If anyone is interested in spot instance pricing, there's an interesting paper on the subject: http://www.cs.technion.ac.il/~ladypine/spotprice-ieee.pdf
In short, it's not a perfect supply-and-demand market, but it's interesting to see the details they found.
Anyone have any experience/thoughts on using spot instances with Hadoop? Specifically, regular instances with Hadoop installed (not via Elastic MapReduce). The cost savings are potentially huge, but I'd hate to lose my instances 80-90% of the way through a set of long-running (12-48h) jobs. I guess if I had EBS-backed instances I could relaunch and resume, but I'm not sure how well that'd work in practice.
"Master Data Collection Service. A very simple REST service accepts GET and POST requests with key/value pairs. ... we then front end the service with Passenger Fusion and Apache httpd. This service requires great attention, as it’s the likeliest bottleneck in the whole architecture." Seems this can be replaced by DynamoDB.
Thanks for all the commentary. We're planning to present this work at re:Invent with the folks from Common Crawl, and also to release sample code on GitHub. For those who haven't yet tried spot instances, or looked into the Common Crawl data set, we highly recommend them!
What about using Elastic MapReduce with spot instances instead of a custom job queue? Hadoop seems to do this for us and supports the ARC format as an InputFormat.
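Something like this, roughly, with the task group on spot so losing it only slows the job down (the cluster sizes, bid, release label, and job jar are placeholders, not from the article):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.run_job_flow(
        Name="commoncrawl-spot",
        ReleaseLabel="emr-5.36.0",
        ServiceRole="EMR_DefaultRole",
        JobFlowRole="EMR_EC2_DefaultRole",
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m4.large", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m4.large", "InstanceCount": 2},
                # Task nodes hold no HDFS data, so they can disappear safely;
                # that makes them the natural place for spot capacity.
                {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
                 "BidPrice": "1.00", "InstanceType": "m4.large", "InstanceCount": 20},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "process-arc-files",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/my-arc-job.jar",        # placeholder jar
                "Args": ["s3://my-input-prefix/", "s3://my-bucket/output/"],
            },
        }],
    )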
I want to use Common Crawl to periodically fetch crawled data for some sites. How frequently does Common Crawl update its data set? Does it crawl all sites?
Is there an open-source framework related to this work or this crawl run?