Load-Balancing in a PHI World

This is the first in a series of engineering posts where we take a deep look at the technical underpinnings of Stratasan's analytics platform. We hope you enjoy and perhaps learn something! This does assume a technical background—consider yourself warned :)

We've built Stratasan's analytics platform atop Amazon Web Services and are big fans of the offerings AWS provides. Our application and worker servers are EC2 instances, application data is stored in an RDS Postgres database, Blackbird results and Canvas PDFs are stored in S3 and our considerable collection of healthcare data is stored in and queried from a Redshift cluster.

As developers we appreciate the infrastructure that AWS provides, allowing our team to focus on product development and customer experience and worry less about provisioning servers, applying security updates and other operations-focused needs. AWS particularly excels in securing, auditing and ensuring compliance for their cloud platform. As we prepare our platform to handle Protected Health Information (PHI), we reap the benefits of the resources Amazon has invested in ensuring HIPAA compliance across AWS.

We use Elastic Load Balancers (ELB) to distribute incoming requests to our platform across multiple application servers. A few benefits of this architecture include:

  • Having multiple servers available increases platform reliability in case one machine experiences intermittent issues and cannot serve requests.
  • Because the application servers connect to the load balancer and not the internet, they can live in a secure Virtual Private Cloud, decreasing the chance of nefarious activity against the machines.

A requirement for complying with Amazon's HIPAA guidelines is that secure (HTTPS) traffic must be decrypted not at the load balancer but at the application server. Normal applications can configure the ELB to decrypt this secure traffic, removing that burden from the application servers. In our case though, Amazon requires our instances to decrypt HTTPS, so our ELB is set to simply pass HTTPS traffic through to our servers.

This isn't a huge inconvenience except that when it decrypts the HTTPS traffic, the ELB also adds the X-Forwarded-For HTTP header (exposed to many applications as HTTP_X_FORWARDED_FOR). Not only is this header useful so that load-balanced applications can be privy to the true client IP address, but full HIPAA compliance with AWS also requires that each request be logged with a timestamp and client IP address. When the ELB does not decrypt the traffic and add X-Forwarded-For, the only IP address available in the request is that of the load balancer, which is neither useful nor compliant!

Fortunately, the fine folks behind HAProxy developed the Proxy Protocol, a small plaintext header that a proxy relaying a connection (in this case the ELB) sends ahead of the still-encrypted HTTPS traffic so that the client IP address is preserved. Amazon implemented this protocol for ELB a few years ago. The rest of this post will describe how to enable it in your load balancer and configure it at the application level.
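To make the mechanism concrete: in version 1 of the Proxy Protocol, the header is a single human-readable line that arrives before any other bytes on the connection. The Python sketch below parses such a line. The names parse_proxy_v1 and ProxyInfo and the sample addresses are illustrative assumptions; in our setup nginx does this parsing for us, so treat this purely as an illustration of the wire format.

```python
from typing import NamedTuple, Optional


class ProxyInfo(NamedTuple):
    protocol: str     # "TCP4" or "TCP6"
    client_ip: str    # address of the original client
    dest_ip: str      # address the client connected to (the proxy)
    client_port: int
    dest_port: int


def parse_proxy_v1(line: bytes) -> Optional[ProxyInfo]:
    """Parse one Proxy Protocol v1 header line (terminated by CRLF).

    Returns None for "PROXY UNKNOWN" headers and raises ValueError
    when the line is not a well-formed v1 header.
    """
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = line[:-2].decode("ascii").split(" ")
    if parts[0] != "PROXY":
        raise ValueError("missing PROXY signature")
    if len(parts) >= 2 and parts[1] == "UNKNOWN":
        return None
    if len(parts) != 6 or parts[1] not in ("TCP4", "TCP6"):
        raise ValueError("malformed header")
    proto, src, dst, sport, dport = parts[1:]
    return ProxyInfo(proto, src, dst, int(sport), int(dport))


# Hypothetical header, using documentation-only IP ranges.
info = parse_proxy_v1(b"PROXY TCP4 198.51.100.22 203.0.113.7 35646 443\r\n")
```

Here info.client_ip holds the address that would otherwise be hidden behind the load balancer.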

Prerequisites

First and foremost, your web server needs to support the Proxy Protocol. We use nginx as our web server; support for Proxy Protocol landed in version 1.5.12.

Configuring the load balancer

Proxy Protocol must be enabled through the ELB API, as there is no option to enable it in the AWS console. You'll need to install the AWS CLI tool. (In all of these snippets, replace BALANCERNAME with the name of your load balancer.)

First, create a load balancer policy that enables Proxy Protocol:

$ aws elb create-load-balancer-policy \
--load-balancer-name BALANCERNAME \
--policy-name EnableProxyProtocol \
--policy-type-name ProxyProtocolPolicyType \
--policy-attributes AttributeName=ProxyProtocol,AttributeValue=True

Next, grab the current configuration of the load balancer:

$ aws elb describe-load-balancers --load-balancer-names BALANCERNAME

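This will return a JSON document describing all the settings on the balancer. Take note of the existing policies in LoadBalancerDescriptions[0]['Policies']['OtherPolicies']; the policies already attached to each instance port are also listed under LoadBalancerDescriptions[0]['BackendServerDescriptions']. You must pass the names of any policies you want to keep on the port into the next command, otherwise they will be disabled. Enable the EnableProxyProtocol policy with this command: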

$ aws elb set-load-balancer-policies-for-backend-server \
--load-balancer-name BALANCERNAME \
--instance-port 443 \
--policy-names EnableProxyProtocol [OTHER POLICIES]

Remember, if there were other policies specified in the previous step, you must include them here.
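To avoid dropping a policy by accident, the list of policy names can be built from the describe output rather than typed by hand. The Python sketch below does this for a trimmed, made-up sample document. The helper names, the ExistingPolicy name and the sample JSON are assumptions for illustration. The sketch reads the policies already attached to the port from BackendServerDescriptions, which is where we understand the ELB API reports them, so check it against your own output before relying on it.

```python
import json
import shlex

# Trimmed, invented sample of `aws elb describe-load-balancers` output.
SAMPLE = """
{
  "LoadBalancerDescriptions": [
    {
      "LoadBalancerName": "BALANCERNAME",
      "Policies": {"OtherPolicies": ["ExistingPolicy"]},
      "BackendServerDescriptions": [
        {"InstancePort": 443, "PolicyNames": ["ExistingPolicy"]}
      ]
    }
  ]
}
"""


def policies_for_port(doc: dict, port: int, new_policy: str) -> list:
    """Return the policies already attached to `port`, plus `new_policy`."""
    balancer = doc["LoadBalancerDescriptions"][0]
    names = []
    for backend in balancer.get("BackendServerDescriptions", []):
        if backend["InstancePort"] == port:
            names.extend(backend["PolicyNames"])
    if new_policy not in names:
        names.append(new_policy)
    return names


def build_command(balancer_name: str, port: int, names: list) -> str:
    """Assemble the CLI invocation as a shell-quoted string."""
    args = ["aws", "elb", "set-load-balancer-policies-for-backend-server",
            "--load-balancer-name", balancer_name,
            "--instance-port", str(port),
            "--policy-names", *names]
    return " ".join(shlex.quote(arg) for arg in args)


names = policies_for_port(json.loads(SAMPLE), 443, "EnableProxyProtocol")
print(build_command("BALANCERNAME", 443, names))
```

The script only prints the command, so you can review it before running it against a live load balancer.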

We redirect all HTTP traffic to HTTPS, so we've only enabled Proxy Protocol for HTTPS. If you need it for HTTP as well, run this command again, substituting 80 for 443.

Finally, you can grab the configuration again and make sure that the EnableProxyProtocol policy has been enabled.

$ aws elb describe-load-balancers --load-balancer-names BALANCERNAME

(You’ll want to verify that EnableProxyProtocol appears in the OtherPolicies list and, under BackendServerDescriptions, in the PolicyNames for your instance port.)

Configuring Nginx

(If you're not using Nginx, refer to your web server's documentation for enabling Proxy Protocol.)

First, let's create a log format that will ultimately include the client IP address:

log_format elb_format '$host $proxy_protocol_addr - $remote_user [$time_local] '
                      '"$request" $status $body_bytes_sent '
                      '"$http_referer" "$http_user_agent" "$http_x_forwarded_for" '
                      '$request_time $upstream_response_time $pipe';

This should go in the http block of /etc/nginx/nginx.conf.
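Because every entry in this format starts with the host, the client address and the timestamp, the two fields needed for auditing can be pulled back out of the log with a short script. The Python sketch below is one way to do that. The function name client_and_time and the sample log line are invented for illustration, and the pattern deliberately matches only the leading fields so it does not depend on the rest of the line.

```python
import re
from datetime import datetime

# Matches only the leading fields of elb_format:
#   $host $proxy_protocol_addr - $remote_user [$time_local]
LEADING_FIELDS = re.compile(
    r"^(?P<host>\S+) (?P<client_ip>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\]"
)


def client_and_time(log_line: str):
    """Return (client_ip, timestamp) from an elb_format access log line."""
    match = LEADING_FIELDS.match(log_line)
    if match is None:
        raise ValueError("line does not look like elb_format")
    # nginx writes $time_local like 10/Oct/2015:13:55:36 +0000
    timestamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return match["client_ip"], timestamp


# Invented sample line; addresses come from documentation-only ranges.
sample = ('app.example.com 198.51.100.22 - - [10/Oct/2015:13:55:36 +0000] '
          '"GET / HTTP/1.1" 200 612 "-" "curl/7.43.0" "-" 0.004 0.003 .')
ip, ts = client_and_time(sample)
```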

Next, we need to alter the listen directive in the server block. The exact file is site-dependent but will most likely live under /etc/nginx/sites-enabled.

server {
    ...
    listen 443 default proxy_protocol ssl;
    ...
    access_log /path/to/your/logs/access.log elb_format;
    ...
    proxy_set_header X-Forwarded-For $proxy_protocol_addr;
}

The listen directive tells this virtual host (listening for HTTPS traffic on port 443) to expect the Proxy Protocol header on each connection, parse it and store the client's IP address in the $proxy_protocol_addr variable. The access_log directive uses the elb_format defined above. Finally, our application expects the client IP in the X-Forwarded-For header, which the proxy_set_header directive sets. After restarting Nginx, your application will receive the true client IP address in the X-Forwarded-For header.
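On the application side, reading the address is then straightforward. As a sketch, assume a Python WSGI application (nothing in this post depends on a particular framework), where the X-Forwarded-For header shows up in the request environ as HTTP_X_FORWARDED_FOR. The function name client_ip is hypothetical.

```python
def client_ip(environ: dict) -> str:
    """Return the client address for a WSGI request.

    Prefers the X-Forwarded-For header set by nginx and falls back to
    REMOTE_ADDR (the directly connected peer) when the header is absent.
    """
    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    # With the nginx config above the header carries a single address;
    # taking the first entry also copes with a comma-separated list.
    first = forwarded.split(",")[0].strip()
    return first or environ.get("REMOTE_ADDR", "")
```

Only trust this header when the application is reachable solely through nginx, as it is in the VPC setup described earlier; otherwise a client could supply its own value.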

Other useful reading

Many thanks to Chris Lea's post and this gist.

If you liked this post, we are always interested to talk to smart people about joining our engineering team. See our careers page for more information.
