Thoughts from the AWS Conference

posted Nov 18, 2013, 7:05 AM by Eric Patrick   [ updated Nov 20, 2013, 12:55 PM ]
The scale of what AWS is doing is mind-boggling. A few things that stood out to me:
  • A pharma company allocated about 17,000 servers to run clinical trial simulations over a week, spending $30K instead of $68M -- three orders of magnitude cheaper
  • A graphics company doing movie work (Star Trek, etc.) competes with Pixar; they can stand up 1000 servers to render images for an hour, instead of 100 servers for 10 hours, and spin them down when not needed.  This completely changes the game in that industry.
  • Netflix, at peak usage, consumes about 1/3 of all US internet bandwidth -- all hosted in AWS
There are some interesting mortgage opportunities along these lines, like:
  • Crawl all county websites for all filed documents, and store them all, forever: many servers up front, few to maintain
  • Pull all bankruptcy (BK) info from PACER in a similar manner
Strategically, here are some of the AWS technologies we dove into, and their application to Quandis:
  • CloudFormation: templates that drive AWS resource allocation: set up PROD/UAT/DEV templates, with push-button publish
    • this means when we publish a new release, we can keep the old servers in place for a few days in case we want to roll back
    • we use auto-scaling for both web and app servers (SLS gets multiple app servers as spot instances in the mornings)
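As a sketch of the template idea: one parameterized definition can emit the PROD/UAT/DEV variants. The resource names and sizes below are illustrative only (shown as a Python dict rather than raw template JSON for brevity):

```python
# Sketch: one parameterized CloudFormation-style template for all
# environments; resource names and sizes are illustrative, not our stack.
def farm_template(env, web_min, web_max):
    """Return a CloudFormation template (as a dict) for one environment's web tier."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "QBO %s web farm" % env,
        "Resources": {
            "WebGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": str(web_min),  # CloudFormation expects strings
                    "MaxSize": str(web_max),
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                },
            }
        },
    }

prod = farm_template("PROD", 2, 5)
dev = farm_template("DEV", 1, 1)
```

The same function, fed different parameters, produces each environment's stack -- which is what makes the push-button publish (and rollback) practical.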

  • ElastiCache: this is in-memory caching, and solves our dependency issue on config changes between boxes
    • Microsoft's equivalent to this is AppFabric, which can be installed locally or used in Azure
    • Couchbase provides a similar solution
    • QBO will need to abstract caching to allow for configurable providers; OutputCaching in ASP.NET is not sufficient for configuring this
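A minimal sketch of what that provider abstraction might look like -- the interface and provider names are hypothetical, and only an in-process provider is shown; an ElastiCache/memcached-backed one (or AppFabric, or Couchbase) would register alongside it:

```python
# Sketch of a configurable cache-provider abstraction; callers code against
# the interface and config picks the backing store. All names hypothetical.
class ICacheProvider:
    def get(self, key): raise NotImplementedError
    def set(self, key, value): raise NotImplementedError

class MemoryCacheProvider(ICacheProvider):
    """In-process fallback; an ElastiCache-backed provider would replace this."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value

# Config would decide which name to instantiate; ElastiCache/Couchbase
# providers would register here too.
PROVIDERS = {"memory": MemoryCacheProvider}

def cache_from_config(name):
    return PROVIDERS[name]()

cache = cache_from_config("memory")
cache.set("config:version", 42)
```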

  • Redshift: reporting via a PostgreSQL-based database geared for data warehousing. We publish tables to it, and can securely give clients direct access via ODBC/JDBC. Layer a product like Jasper on top of that, and we've just solved most of our ad-hoc reporting issues.
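For the publish step, Redshift loads flat files from S3 via COPY; a sketch of building that statement (the table, bucket, and credential placeholders are all made up):

```python
# Sketch: build the Redshift COPY statement for a table we've published to S3.
# Bucket, key, and credential placeholders are illustrative only.
def copy_sql(table, bucket, key):
    return ("COPY {t} FROM 's3://{b}/{k}' "
            "CREDENTIALS 'aws_access_key_id=<id>;aws_secret_access_key=<key>' "
            "CSV GZIP;").format(t=table, b=bucket, k=key)

sql = copy_sql("loan_history", "qbo-reporting", "deltas/loan_history.csv.gz")
```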

  • Heartbeat
    • push to CloudWatch for alarms
    • pull from CloudWatch for hardware instrumentation
    • a lot of our effort is around "what action to take": much of this may migrate to "notify of action already taken"
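A sketch of the "push to CloudWatch" half: each server periodically reports a custom metric that alarms can watch. The namespace, metric, and dimension names are placeholders; with boto, a payload like this would go to put_metric_data:

```python
# Sketch of a heartbeat payload for a CloudWatch custom metric; the
# namespace/metric/dimension names are placeholders, not real ones.
def heartbeat_metric(server_id, queue_depth):
    return {
        "Namespace": "QBO/Heartbeat",
        "MetricName": "QueueDepth",
        "Dimensions": [{"Name": "ServerId", "Value": server_id}],
        "Value": queue_depth,
        "Unit": "Count",
    }

metric = heartbeat_metric("app-01", 12)
```

An alarm on QueueDepth is also where "notify of action already taken" fits: auto-scaling reacts to the metric, and the alarm just tells us it happened.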

  • Queuing: our queuing infrastructure can auto-scale using spot instances to guarantee throughput
    • imagine a UI with a slider between guaranteed throughput and pricing, feeding the CloudFormation scaling template parameters
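A sketch of that slider, mapping a 0-100 throughput preference onto CloudFormation parameters; all of the constants (server range, spot bid range) are invented for illustration:

```python
# Sketch: map a 0-100 slider (0 = cheapest, 100 = max throughput) onto
# CloudFormation scaling parameters. All constants are made up.
def scaling_params(slider):
    assert 0 <= slider <= 100
    min_servers, max_servers = 1, 20
    count = min_servers + (max_servers - min_servers) * slider // 100
    spot_bid = round(0.05 + 0.45 * slider / 100.0, 2)  # $/hour bid ceiling
    return {"AppServerCount": str(count), "SpotPrice": str(spot_bid)}
```

The dict keys would match parameter names in the scaling template, so the slider value feeds straight into a stack update.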

  • Disaster Recovery: the pilot light model is ideal, with:
    • PROD dbs replicating to a small 'pilot' DB instance
    • AMIs standing by to become web/app servers
    • a 'failover' script that upscales the pilot db(s) to bigger boxes, spins up the web/app servers, and scripts the DNS change in Route 53
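The failover script might amount to an ordered plan like the sketch below; each step would call the corresponding AWS API (an RDS instance-class change, EC2 launches from the standby AMIs, a Route 53 record change). Names and instance classes are placeholders:

```python
# Sketch of the pilot-light failover sequence; here it just returns the
# ordered plan. AMI ids, zone, and instance class are placeholders.
def failover_plan(pilot_db, web_ami, app_ami, zone, site):
    return [
        ("resize_db", pilot_db, "db.m1.xlarge"),   # upscale the pilot instance
        ("launch", web_ami),                       # stand up web tier from AMI
        ("launch", app_ami),                       # stand up app tier from AMI
        ("route53_alias", zone, site),             # repoint DNS last
    ]

plan = failover_plan("pilot-db-1", "ami-web", "ami-app", "Z123", "www.example.com")
```

Ordering matters: DNS moves only after the big DB and the web/app tiers are up, so clients never land on a half-built site.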
Lots of devil in the details. More quick-hit notes are below.

Server farm profiles: (Foley)
  • PROD: 2-5 web servers with autoscaling, 1-5 app servers with auto-scaling spot instances
  • UAT: 1-2 web servers with autoscaling, 1-2 app servers with auto-scaling spot instances
  • DEV: 1 web/app server, no autoscale
Deployment (Errol/Cassidy):
  • AMI builds: trunk, and last 3 stable QBO 3 builds
  • A/B deployments with CloudFormation
    • DB connection
    • AMI(s) to use
    • Queue Service status (started?)
    • Website name for ELB
  • Keep ‘old’ site available (stopped) for 48 hours?
  • All sites use standard AMI
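Gathering the per-release stack inputs listed above into one structure would make an A/B deploy "create stack B with the new values, keep stack A stopped for 48 hours". A sketch with placeholder names and values:

```python
# Sketch: the per-release CloudFormation parameters for an A/B deployment.
# Key names and the ELB naming scheme are hypothetical.
def release_params(build, ami_id, db_conn, start_queues):
    return {
        "Build": build,
        "AmiId": ami_id,
        "DbConnection": db_conn,
        "QueueServiceState": "started" if start_queues else "stopped",
        "ElbSiteName": "qbo-%s" % build,   # website name handed to the ELB
    }

params = release_params("3.2", "ami-1234", "Server=proddb;Database=qbo", True)
```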
Performance tuning
  • Auto setup of indexes (Foley/Eric)
  • Auto setup of purging (Eric)
  • Redshift with Jasper for non-real-time reporting
  • DynamoDB / Kinesis as a logging sink
Reporting (Eric)
  • Jasper
  • Redshift
  • Delta generation, including flattening of Xml blobs, to CSV on S3
  • IIS logs to Redshift
  • Rolling logs to Redshift
  • Centralized error reporting
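A sketch of the delta-generation step above: flattening an XML blob's fields into a CSV row that the real job would land on S3 for a Redshift COPY. The element names here are invented:

```python
# Sketch: flatten an XML blob's child elements into a flat CSV row for S3.
# Element and field names are invented for illustration.
import csv
import io
import xml.etree.ElementTree as ET

def flatten_xml(blob, fields):
    """Pull the named child-element texts out of one XML blob, in order."""
    root = ET.fromstring(blob)
    return [root.findtext(f, default="") for f in fields]

def to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

row = flatten_xml("<Loan><Id>7</Id><Status>BK</Status></Loan>", ["Id", "Status"])
```

Missing elements come through as empty strings, so every row has the same column count -- which Redshift's CSV loader requires.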
Research (Eric):
  • PostgreSQL / MySQL: SqlPattern / IPattern
  • Offload history to alternate store (DynamoDB?)
  • Need rollback built in
Disaster Recovery (Foley):
  • Dedicated bandwidth
  • Replication or backup shipping
  • Ship via storage gateway?
Caching (Greg/Eric):
  • ElastiCache client selection
  • Refactor all config sections to ensure loading from cache
  • How to handle loss of an ElastiCache node?
Leveraging SQS/SNS/SWF (Greg/Eric)