Quandis now hosts significant infrastructure in the Amazon Web Services cloud. Key reasons include:
- low cost of creating disaster recovery sites in different regions
- ease and speed of launching new virtual machines
- effectively unlimited storage
- cheap fail-over and load balancing solutions
Terminology
Amazon has data centers in different regions, or parts of the world. Quandis currently has infrastructure in us-east-1 (Virginia) and us-west-1 (California). Each region includes data centers split into availability zones (which should be thought of as different buildings). Best practice for load balanced environments is to place the load balanced instances in different availability zones. Quandis currently leverages:
Networking includes the creation of a virtual private cloud (VPC), which is a block of private IP addresses. Quandis uses 10.0.* (masked as 10.0.0.0/16). Think of a VPC as a firewall. Within a VPC, Quandis has several subnets, and can use security groups and routing tables to control what traffic the VPC "firewall" will allow between subnets and to or from the public internet. For example:
- 10.0.0.0/24 (10.0.0.*): this subnet is used for our UAT instances, and for our Network Address Translation (NAT) instance
  - this subnet routes to the internet via an internet gateway
- 10.0.1.0/24 (10.0.1.*): this subnet is used for our SQL instances
  - this subnet routes to the internet via our NAT instance
- 10.0.2.0/24 (10.0.2.*): this subnet routes our outbound traffic for our production servers in us-east-1a
  - this subnet routes to the internet via our NAT instance
  - traffic like geocoding against Google, or downloading Windows Updates, goes across this subnet
- 10.0.3.0/24 (10.0.3.*): this subnet routes our inbound traffic for our production servers in us-east-1a
  - this subnet routes to the internet via our internet gateway
  - traffic from load balancers to our production web servers goes across this subnet
- 10.0.4.0/24 (10.0.4.*): same as 10.0.2.0/24, except in availability zone us-east-1d
- 10.0.5.0/24 (10.0.5.*): same as 10.0.3.0/24, except in availability zone us-east-1d
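The subnet layout above can be restated as data and sanity-checked. The following is a minimal sketch using only Python's standard `ipaddress` module; the names `VPC_CIDR`, `SUBNETS`, `find_subnet` and `layout_is_consistent` are made up for illustration and are not part of any Quandis tooling.

```python
import ipaddress

# Illustrative restatement of the layout described above (names are assumptions).
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# (CIDR, purpose, default route target)
SUBNETS = [
    ("10.0.0.0/24", "UAT and NAT instance", "internet gateway"),
    ("10.0.1.0/24", "SQL instances", "NAT instance"),
    ("10.0.2.0/24", "production outbound, us-east-1a", "NAT instance"),
    ("10.0.3.0/24", "production inbound, us-east-1a", "internet gateway"),
    ("10.0.4.0/24", "production outbound, us-east-1d", "NAT instance"),
    ("10.0.5.0/24", "production inbound, us-east-1d", "internet gateway"),
]


def find_subnet(address):
    """Return (cidr, purpose, route) for the subnet containing address, or None."""
    ip = ipaddress.ip_address(address)
    for cidr, purpose, route in SUBNETS:
        if ip in ipaddress.ip_network(cidr):
            return cidr, purpose, route
    return None


def layout_is_consistent():
    """True if every subnet sits inside the VPC block and no two subnets overlap."""
    nets = [ipaddress.ip_network(cidr) for cidr, _, _ in SUBNETS]
    inside = all(net.subnet_of(VPC_CIDR) for net in nets)
    disjoint = not any(
        a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:]
    )
    return inside and disjoint


if __name__ == "__main__":
    print(layout_is_consistent())
    print(find_subnet("10.0.3.17"))
```

A lookup like `find_subnet("10.0.3.17")` is a quick way to answer "will this box be able to reach the internet directly, or only through the NAT?" before launching an instance.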
Each virtual machine in AWS is called an instance, and normally is assigned only private IP addresses.
Security groups are sets of firewall rules for instances or network interfaces; their rules can name whole subnets as a source or destination. They control inbound and outbound traffic by port, source and destination. When configuring new subnets, the following security groups are relevant:
- NATSG: add each private subnet (allowing outbound connections without public IPs being assigned to the box)
- VPCWebSG: add all subnets (so they can communicate with each other)
Public IP Addresses

Amazon Web Services limits the number of dedicated public IP addresses (elastic IPs) to five per organization. One can get dynamically assigned public IP addresses, but there is no guarantee that an instance will retain the same dynamically assigned public IP over time. Thus, one cannot reliably use dynamically allocated public IPs for DNS entries.
Amazon provides a solution for this problem, as long as Amazon is hosting our domain. As of this writing, quandis.net is hosted by the Amazon Route 53 domain name service. Route 53 allows us to map a third-level domain to a load balancer, without assigning a public IP address. It's just an 'alias', pointing to a load balancer's name. If the load balancer is assigned new IP addresses (which can happen as new instances are added to the load balancer), no DNS modifications need to be made.
Thus, production sites hosted in AWS will leverage this feature, meaning their host names will need to end with 'quandis.net'.
Unfortunately, one cannot assign a third-level domain directly to an instance; only to a load balancer. Thus, to create a permanently reliable DNS entry for a UAT site, we need to either:
- front the UAT site with a load balancer, and leverage Route 53, or
- assign one of our 5 elastic IP addresses to the site
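To make the alias idea concrete, here is a rough sketch of the change request that maps a host name to a load balancer. It only builds the request dictionary, in the shape the Route 53 `ChangeResourceRecordSets` API expects; the function name `build_alias_change` and every value in the usage example (host name, load balancer DNS name, zone ID) are placeholders, and this page does not record how the real entries were created.

```python
def build_alias_change(record_name, lb_dns_name, lb_hosted_zone_id):
    """Build a Route 53 change batch that aliases record_name to a load balancer.

    lb_hosted_zone_id is the hosted zone ID of the load balancer itself,
    not the hosted zone ID of the quandis.net domain.
    """
    return {
        "Comment": "Alias a host name to a load balancer (illustrative)",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or update it if present
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    # An alias record carries no IP address and no TTL;
                    # Route 53 resolves it to the load balancer's current IPs.
                    "AliasTarget": {
                        "HostedZoneId": lb_hosted_zone_id,
                        "DNSName": lb_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    # All three values below are placeholders, not real Quandis resources.
    batch = build_alias_change(
        "example.quandis.net.",
        "example-lb-1234567890.us-east-1.elb.amazonaws.com.",
        "ZEXAMPLELBZONE",
    )
    print(batch["Changes"][0]["ResourceRecordSet"]["Name"])
```

Assuming the boto3 library is used, a batch like this would be passed as `ChangeBatch` to the Route 53 client's `change_resource_record_sets` call, together with the hosted zone ID of the domain.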
Lessons Learned
Private instances need 2 NICs
Instances that have only a private IP address cannot initiate outbound internet traffic unless they are on a subnet that routes through a NAT. We initially configured instances on the 10.0.0.* subnet, only to find that geocoding (which hits Google) failed. This led us to create instances on the 10.0.2.* subnet (so they could geocode). Ultimately, production web servers should be configured with two NICs: one to a subnet routing to a NAT (for outbound traffic), and one to a subnet routing to an internet gateway (to handle inbound traffic from load balancers).
For our code that initiates outbound web traffic (via HttpWebRequest, such as geocoding), we need to tell Windows to route such traffic through the NAT NIC. (Nic nat patty whack?) This is accomplished by adding permanent routes from the command prompt:
- route -p change 0.0.0.0 mask 0.0.0.0 10.0.2.1 metric 2
  - this routes traffic through the 10.0.2.1 gateway, and thus to the NAT, first
- route -p add 0.0.0.0 mask 0.0.0.0 10.0.3.1 metric 13
  - this allows traffic through the 10.0.3.1 gateway, but at a lower priority than the 10.0.2.1 gateway
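The effect of the two metrics can be illustrated with a small model: among routes to the same destination, the one with the lowest metric wins. This is a simplified sketch, not a description of the full Windows routing algorithm (which also factors in each interface's own metric); the `Route` class and `pick_default_route` function are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    destination: str  # CIDR form of the "0.0.0.0 mask 0.0.0.0" default route
    gateway: str
    metric: int


def pick_default_route(routes):
    """Return the default route with the lowest metric (lower means preferred)."""
    defaults = [r for r in routes if r.destination == "0.0.0.0/0"]
    if not defaults:
        raise ValueError("no default route configured")
    return min(defaults, key=lambda r: r.metric)


# The two persistent routes added by the commands above.
ROUTES = [
    Route("0.0.0.0/0", "10.0.2.1", 2),   # NAT-facing NIC, preferred
    Route("0.0.0.0/0", "10.0.3.1", 13),  # internet-gateway-facing NIC, fallback
]

if __name__ == "__main__":
    print(pick_default_route(ROUTES).gateway)
```

Under this model, outbound requests such as geocoding leave through 10.0.2.1, and 10.0.3.1 is only chosen if the first route is absent.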
Load balancers must use subnets that route to an internet gateway
Load balancers must communicate with instances on subnets that route through an internet gateway (10.0.3.* or 10.0.5.*). When we configured the load balancer to use the 10.0.2.* subnet (which routes through a NAT), no sites responded. This led us to create the 10.0.3.* subnet, and add a NIC to the production instances. With two NICs, they can both respond to inbound load balancer traffic, and route outbound requests to Google for geocoding.
Even after changing the subnets the load balancer was using, we found the site would sporadically be offline. This appears to have been a caching issue:
- When we added the 10.0.2.* subnet, AWS automatically created a NIC on the load balancer bound to a public IP and 10.0.2.* private IP
- When we added the 10.0.3.* subnet, AWS automatically created a NIC on the load balancer bound to a public IP and 10.0.3.* private IP
- At this stage, the load balancer had two public IPs; which one a browser "got" was essentially random
- If a browser "got" the public IP for the 10.0.2.* load balancer NIC, nothing would respond
- Load balancer NICs cannot be explicitly created, but they can be explicitly deleted.
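The rule from this lesson, that every subnet attached to a load balancer must route to an internet gateway, is easy to check before the fact. The sketch below is a hypothetical pre-flight check over hand-maintained data; `ROUTE_TARGETS` and `unusable_lb_subnets` are illustrative names, and nothing here queries AWS.

```python
# Default route target per subnet, as described on this page:
# "igw" = internet gateway, "nat" = NAT instance.
ROUTE_TARGETS = {
    "10.0.0.0/24": "igw",
    "10.0.1.0/24": "nat",
    "10.0.2.0/24": "nat",
    "10.0.3.0/24": "igw",
    "10.0.4.0/24": "nat",
    "10.0.5.0/24": "igw",
}


def unusable_lb_subnets(lb_subnets, route_targets=ROUTE_TARGETS):
    """Return the load balancer subnets that do not route to an internet gateway.

    Unknown subnets are reported too, since their routing cannot be confirmed.
    """
    return [s for s in lb_subnets if route_targets.get(s) != "igw"]


if __name__ == "__main__":
    # The misconfiguration described above: both subnets attached at once.
    print(unusable_lb_subnets(["10.0.2.0/24", "10.0.3.0/24"]))
```

Any subnet this function reports is one whose load balancer NIC should be removed, since browsers that resolve to its public IP would get no response.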