Post date: Nov 18, 2013 3:5:58 PM
The scale of what AWS is doing is mind-boggling. A few things that stood out to me:
A pharma company allocated about 17,000 servers to run clinical trial simulations over a week, spending $30K -- instead of $68M (3 orders of magnitude cheaper)
A graphics company doing movie work (Star Trek, etc.) competes with Pixar; they can stand up 1000 servers to render images for an hour, instead of 100 servers for 10 hours, and spin them down when not needed. This completely changes the game in that industry.
Netflix, at peak usage, consumes about 1/3 of all US internet bandwidth -- all hosted in AWS
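The render-farm point comes down to server-hour arithmetic: if billing is per server-hour and the work parallelizes cleanly, 1000 servers for 1 hour costs the same as 100 servers for 10 hours but finishes ten times sooner. A quick sketch, where the hourly rate is a made-up number used only for illustration:

```python
# Hypothetical hourly rate; real pricing depends on instance type and region.
HOURLY_RATE_USD = 0.50

def render_job(servers, hours, rate=HOURLY_RATE_USD):
    """Return (server_hours, cost_usd, wall_clock_hours) for one render job."""
    server_hours = servers * hours
    return server_hours, server_hours * rate, hours

wide = render_job(servers=1000, hours=1)
narrow = render_job(servers=100, hours=10)

print("wide:  ", wide)
print("narrow:", narrow)
print("same cost:", wide[1] == narrow[1])
print("speed-up:", narrow[2] / wide[2])
```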
There are some interesting mortgage opportunities along these lines, like:
Crawl all county websites for all filed documents, and store them all, forever: many servers up front, few to maintain
Pull all bankruptcy (BK) info from PACER in a similar manner
Strategically, here are some of the AWS technologies we dove into, and their application to Quandis:
CloudFormation: templates that drive AWS resource allocation; set up PROD/UAT/DEV templates, with push-button publish
this means when we publish a new release, we keep the old servers in place for a few days in case we want to roll back
we use auto-scaling for both Web and App servers (SLS gets multiple app servers as spot instances in the mornings)
ElastiCache: in-memory caching shared across servers, which solves our dependency issue on config changes between boxes
Microsoft's equivalent to this is AppFabric, which can be installed locally or used in Azure
Couchbase provides a similar solution
QBO will need to abstract caching to allow for configurable providers; OutputCaching in ASP.NET alone is not sufficient for this
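One way to picture the "configurable providers" idea is a small interface with providers chosen by a config value. This is only a sketch in Python: the class names, the registry, and the "memory" key are hypothetical, and QBO's real abstraction (in .NET) would look different. A distributed provider backed by ElastiCache, AppFabric, or Couchbase would plug into the same registry.

```python
import time
from abc import ABC, abstractmethod

class CacheProvider(ABC):
    """Hypothetical provider interface; names are illustrative only."""

    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def set(self, key, value, ttl_seconds=None): ...

    @abstractmethod
    def delete(self, key): ...

class InMemoryCache(CacheProvider):
    """Stand-in for a local, per-box cache."""

    def __init__(self):
        self._items = {}  # key -> (value, expires_at or None)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else time.monotonic() + ttl_seconds
        self._items[key] = (value, expires_at)

    def delete(self, key):
        self._items.pop(key, None)

# Registry keyed by a config value; other providers would be added here.
PROVIDERS = {"memory": InMemoryCache}

def create_cache(provider_name):
    """Instantiate the provider named in configuration."""
    return PROVIDERS[provider_name]()

cache = create_cache("memory")
cache.set("config:Decision", {"version": 7})
print(cache.get("config:Decision"))
```

Callers only ever see `CacheProvider`, so swapping the backing store becomes a configuration change rather than a code change.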
Redshift: reporting via a PostgreSQL-based database geared for data warehousing. We publish tables to it, and can securely provide clients direct access via ODBC/JDBC. Layer a product like Jasper on top of that, and we've just solved most of our ad-hoc reporting issues.
Heartbeat:
push to CloudWatch for alarms
pull from CloudWatch for hardware instrumentation
a lot of our effort is around "what action to take": much of this may migrate to "notify of action already taken"
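The "push to CloudWatch" half of the heartbeat could look roughly like the sketch below. It assumes a boto3-style client exposing `put_metric_data`; the namespace, metric name, and dimension are hypothetical, and a recording stub stands in for the real client so the example runs offline.

```python
def build_heartbeat_metric(service, healthy):
    """Build one CloudWatch datum: 1 = healthy, 0 = unhealthy."""
    return {
        "MetricName": "Heartbeat",  # hypothetical metric name
        "Dimensions": [{"Name": "Service", "Value": service}],
        "Value": 1.0 if healthy else 0.0,
        "Unit": "Count",
    }

def push_heartbeat(cloudwatch, service, healthy, namespace="Quandis/Heartbeat"):
    """Send the datum through a boto3-style put_metric_data call."""
    datum = build_heartbeat_metric(service, healthy)
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=[datum])
    return datum

class RecordingClient:
    """Offline stand-in for a real CloudWatch client."""

    def __init__(self):
        self.calls = []

    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)

client = RecordingClient()
push_heartbeat(client, "QueueService", healthy=True)
print(client.calls[0])
```

An alarm on this metric dropping to 0, or going missing, is what would drive the "notify of action already taken" behavior.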
Queuing: our queuing infrastructure can auto-scale using spot instances to guarantee throughput
imagine a UI with a slider between guaranteed throughput and pricing, feeding the CloudFormation scaling template parameters
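To make the slider idea concrete, here is a minimal sketch of mapping a slider position to scaling parameters. Everything in it is an assumption for illustration: the parameter names, the 1-5 instance range, the on-demand price, and the rule that the spot bid runs from 30% to 100% of on-demand.

```python
def scaling_parameters(slider, min_floor=1, max_ceiling=5, on_demand_price=0.50):
    """Map a slider in [0, 1] to hypothetical scaling-template parameters.

    0.0 favors price (few guaranteed instances, low spot bid);
    1.0 favors throughput (all instances guaranteed, bid at on-demand price).
    """
    if not 0.0 <= slider <= 1.0:
        raise ValueError("slider must be between 0 and 1")
    span = max_ceiling - min_floor
    return {
        "MinSize": min_floor + round(slider * span),
        "MaxSize": max_ceiling,
        # Assumed bid rule: 30% of on-demand at the cheap end, 100% at the other.
        "SpotPrice": round(on_demand_price * (0.3 + 0.7 * slider), 3),
    }

for position in (0.0, 0.5, 1.0):
    print(position, scaling_parameters(position))
```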
Disaster Recovery: pilot light model is ideal with:
PROD dbs replicating to a small instance 'pilot' DB
AMIs standing by to become web/app servers
A 'failover' script will upscale the pilot db(s) to bigger boxes, spin up the web/app servers, and script the website change in Route53
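The failover script is essentially three ordered steps. The sketch below shows only that ordering, with the AWS calls left as caller-supplied functions (all hypothetical names): the DB is upscaled first because the servers depend on it, and DNS moves last so traffic arrives only once the servers exist.

```python
def failover(resize_pilot_db, launch_servers, repoint_dns, dry_run=True):
    """Run the three failover steps in order; return the step names processed.

    With dry_run=True nothing is executed, which gives a safe rehearsal.
    """
    steps = [
        ("upscale pilot DB", resize_pilot_db),
        ("launch web/app servers from AMIs", launch_servers),
        ("repoint website in Route53", repoint_dns),
    ]
    processed = []
    for name, action in steps:
        if not dry_run:
            action()  # an exception here stops the run before later steps
        processed.append(name)
    return processed

calls = []
plan = failover(
    lambda: calls.append("db"),
    lambda: calls.append("servers"),
    lambda: calls.append("dns"),
)
print(plan)
```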
Lots of devil in the details. More quick-hit notes are below.
Server farm profiles: (Foley)
PROD: 2-5 web servers with autoscale, 1-5 app servers autoscaling on spot instances
UAT: 1-2 web servers with autoscale, 1-2 app servers autoscaling on spot instances
DEV: 1 web/app server, no autoscale
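The three profiles above could be captured as data that feeds the templates. A sketch, where the dictionary layout and key names are hypothetical but the numbers come straight from the list:

```python
# (min, max) instance counts per tier, taken from the profiles above.
FARM_PROFILES = {
    "PROD": {"tiers": {"web": (2, 5), "app": (1, 5)}, "autoscale": True, "app_spot": True},
    "UAT": {"tiers": {"web": (1, 2), "app": (1, 2)}, "autoscale": True, "app_spot": True},
    "DEV": {"tiers": {"web/app": (1, 1)}, "autoscale": False, "app_spot": False},
}

def validate(profiles):
    """Check that every range is sane and that fixed farms have min == max."""
    for name, profile in profiles.items():
        for tier, (low, high) in profile["tiers"].items():
            if not 1 <= low <= high:
                raise ValueError(f"{name}/{tier}: bad range {low}-{high}")
            if not profile["autoscale"] and low != high:
                raise ValueError(f"{name}/{tier}: fixed farm needs min == max")
    return True

print(validate(FARM_PROFILES))
```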
Deployment (Errol/Cassidy):
AMI builds: trunk, and last 3 stable QBO 3 builds
A/B deployments with CloudFormation
DB connection,
AMI(s) to use,
Queue Service status started?
Website name for ELB
Keep ‘old’ site available (stopped) for 48 hours?
All sites use standard AMI
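The A/B deployment inputs listed above map naturally onto a small parameter object. This is a sketch only: the field names and sample values are hypothetical, and the 48-hour retention is the open question from the list, not a decision.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class DeploymentParameters:
    """Hypothetical parameter set for one A/B deployment."""
    db_connection: str
    ami_ids: tuple
    start_queue_service: bool
    website_name: str
    keep_old_hours: int = 48

    def old_site_expired(self, retired_at, now):
        """True once the stopped 'old' site has outlived its rollback window."""
        return now - retired_at >= timedelta(hours=self.keep_old_hours)

params = DeploymentParameters(
    db_connection="Server=db.example.internal;Database=qbo",  # placeholder
    ami_ids=("ami-00000000",),                                # placeholder
    start_queue_service=True,
    website_name="uat.example.com",                           # placeholder
)

retired = datetime(2013, 11, 18, 12, 0)
print(params.old_site_expired(retired, retired + timedelta(hours=47)))
print(params.old_site_expired(retired, retired + timedelta(hours=48)))
```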
Performance tuning
Auto setup of indexes (Foley/Eric)
Auto setup of purging (Eric)
Redshift with Jasper for non-real-time reporting
DynamoDB / Kinesis as a logging sink
Reporting (Eric)
Jasper
Redshift
Delta generation, including flattening of Xml blobs, to CSV on S3
IIS logs to Redshift
Rolling logs to Redshift
Centralized error reporting
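For the delta-generation item, flattening Xml blobs into CSV might look like the sketch below. The element names in the sample are invented, nested tags become dotted column names, and repeated sibling tags would overwrite each other in this simple version. Uploading the result to S3 is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

def flatten(element, prefix=""):
    """Flatten nested child elements into {"Parent.Child": text} pairs."""
    row = {}
    for child in element:
        path = f"{prefix}{child.tag}"
        if len(child):  # element has children: recurse with a dotted prefix
            row.update(flatten(child, prefix=path + "."))
        else:
            row[path] = (child.text or "").strip()
    return row

def xml_blobs_to_csv(blobs):
    """Turn a list of XML strings into one CSV with a union of all columns."""
    rows = [flatten(ET.fromstring(blob)) for blob in blobs]
    columns = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented sample blobs, for illustration only.
blobs = [
    "<Loan><Id>1</Id><Borrower><Name>Ann</Name></Borrower></Loan>",
    "<Loan><Id>2</Id><Status>Open</Status></Loan>",
]
csv_text = xml_blobs_to_csv(blobs)
print(csv_text)
```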
Research (Eric):
PostgreSQL / MySQL: SqlPattern / IPattern
Offload history to alternate store (DynamoDB?)
Need rollback built in
Disaster Recovery (Foley):
Dedicated bandwidth
Hosting.com replication or backup shipping
Ship via storage gateway?
Caching (Greg/Eric):
ElastiCache client selection
Refactor all config sections to ensure loading from cache
How to handle loss of an ElastiCache node?
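One possible answer to the node-loss question is to treat the cache as optional: on a cache failure, fall back to the original config source and carry on. The sketch below assumes a hypothetical `CacheUnavailable` error and stub cache classes; a real client would raise its own exception types.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache node cannot be reached."""

def get_config(key, cache, load_from_source):
    """Read through the cache, falling back to the source if the node is gone."""
    try:
        value = cache.get(key)
    except CacheUnavailable:
        return load_from_source(key)  # node lost: serve from source, skip caching
    if value is None:
        value = load_from_source(key)
        try:
            cache.set(key, value)
        except CacheUnavailable:
            pass  # losing the write is acceptable; the next read reloads
    return value

class DownCache:
    """Stub for a cache whose node has disappeared."""

    def get(self, key):
        raise CacheUnavailable(key)

    def set(self, key, value):
        raise CacheUnavailable(key)

class DictCache:
    """Stub for a healthy cache."""

    def __init__(self):
        self.items = {}

    def get(self, key):
        return self.items.get(key)

    def set(self, key, value):
        self.items[key] = value

source = {"Decision": "v7"}
print(get_config("Decision", DownCache(), source.get))
print(get_config("Decision", DictCache(), source.get))
```

The trade-off is extra load on the config source while the node is down, which is worth measuring before relying on this.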
Leveraging SQS/SNS/SWF (Greg/Eric)