Implementing Resilience for CloudFront

How to implement resilience for CloudFront when using Arpio to recover your application

CloudFront is a global service, so it is not directly susceptible to a regional outage in AWS. However, CloudFront typically proxies traffic to origins (like S3 buckets or ALBs) in a specific region. If that region is down, requests that must reach the origin fail, so CloudFront is effectively down as well.

To make CloudFront resilient to a regional outage, you want to configure it to point at multiple origins in multiple regions.  This is accomplished through the use of Origin Groups in CloudFront.

For example, if you have CloudFront serving static content from an S3 bucket, you would create a second S3 bucket in another region with the same content in it.  You then configure CloudFront with an origin group that points at both buckets.  If serving content from the first bucket fails, CloudFront will transparently fail over to the second bucket.
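As a rough illustration of the origin group described above, the following Python sketch builds the OriginGroups portion of a CloudFront distribution configuration in the shape that boto3's CloudFront client accepts. The origin IDs, the group ID, and the default set of failover status codes are assumptions chosen for illustration; the two origins themselves would still need to be defined in the distribution's Origins list, and the cache behavior would need to target the group ID.

```python
def build_origin_group(group_id, primary_origin_id, secondary_origin_id,
                       failover_status_codes=(500, 502, 503, 504)):
    """Build the OriginGroups structure for a CloudFront DistributionConfig.

    The first member is the primary origin; CloudFront tries the second
    member when the primary returns one of the failover status codes.
    All IDs here are hypothetical placeholders supplied by the caller.
    """
    codes = list(failover_status_codes)
    return {
        "Quantity": 1,
        "Items": [
            {
                "Id": group_id,
                "FailoverCriteria": {
                    "StatusCodes": {"Quantity": len(codes), "Items": codes}
                },
                "Members": {
                    "Quantity": 2,
                    "Items": [
                        {"OriginId": primary_origin_id},
                        {"OriginId": secondary_origin_id},
                    ],
                },
            }
        ],
    }
```

The returned dictionary could be placed under the "OriginGroups" key of a configuration fetched with get_distribution_config and then sent back with update_distribution. That step is left out here because it requires AWS credentials and a real distribution.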

Using Arpio to Replicate S3 Buckets Behind CloudFront

You can use Arpio to create and manage a replica bucket in another region.  Through this mechanism, you'll simply update content in your primary region bucket, and Arpio will ensure the objects are replicated to the alternate bucket.  You'll then configure the CloudFront Origin Group to point at both buckets.

As part of this setup, you'll want CloudFront to be able to serve data from the replica bucket even when you aren't actively failed over. Arpio allows you to modify the bucket policy on the recovery environment bucket to grant access to CloudFront. To ensure that Arpio does not revert your modification (Arpio usually tries to clean up "drift"), add a "-ArpioRetain" suffix to the Sid of the statement you are adding to the bucket policy.

This bucket policy is an example:

{
    "Version": "2012-10-17",
    "Statement": {
        "Sid": "AllowCloudFront-ArpioRetain",
        "Effect": "Allow",
        "Principal": {
            "Service": "cloudfront.amazonaws.com"
        },
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::<S3 bucket name>/*",
        "Condition": {
            "StringEquals": {
                "AWS:SourceArn": "arn:aws:cloudfront::<AWS account ID>:distribution/<CloudFront distribution ID>"
            }
        }
    }
}
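If you prefer to generate this policy rather than edit it by hand, a minimal Python sketch might look like the following. The function name and its parameters are hypothetical; the only behavior taken from this article is the policy shape and the "-ArpioRetain" suffix on the Sid.

```python
import json

ARPIO_RETAIN_SUFFIX = "-ArpioRetain"


def build_cloudfront_read_policy(bucket_name, account_id, distribution_id,
                                 sid="AllowCloudFront"):
    """Return a bucket policy dict granting one CloudFront distribution read access.

    The Sid always ends in "-ArpioRetain" so that Arpio keeps the statement
    instead of treating it as drift.
    """
    if not sid.endswith(ARPIO_RETAIN_SUFFIX):
        sid += ARPIO_RETAIN_SUFFIX
    return {
        "Version": "2012-10-17",
        "Statement": {
            "Sid": sid,
            "Effect": "Allow",
            "Principal": {"Service": "cloudfront.amazonaws.com"},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
            "Condition": {
                "StringEquals": {
                    "AWS:SourceArn": (
                        f"arn:aws:cloudfront::{account_id}:"
                        f"distribution/{distribution_id}"
                    )
                }
            },
        },
    }


def policy_as_json(policy):
    """Serialize the policy for use as the Policy argument of put_bucket_policy."""
    return json.dumps(policy, indent=4)
```

The JSON string could then be applied to the replica bucket, for example with boto3's put_bucket_policy. If the bucket already has a policy, merge this statement into the existing one rather than replacing it.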

How Do I Configure & Test CloudFront Origin Group Failover?

Here's a nice walkthrough from AWS: https://catalog.us-east-1.prod.workshops.aws/workshops/4557215e-2a5c-4522-a69b-8d058aba088c/en-US/reliability/origin-failover

How Do I Protect CloudFront from an Account Hack?

If a bad actor gets into your AWS account, they may have the ability to delete your CloudFront distributions and the data that lives behind them. Proper disaster recovery is how you mitigate this risk; however, what we've described here is multi-region high availability, not DR. Usually, we leverage an alternate AWS account for recovery so that the attacker can't undermine your recovery capability.

Unfortunately, an alternate domain name (CNAME) can be attached to only one CloudFront distribution at a time, so you can't create a second CloudFront distribution in a different account to serve the same domain. As a result, cross-account disaster recovery techniques do not suffice unless you're willing to serve your application from a different domain during a disaster.

An alternate approach is to move the CloudFront distribution out of your primary account, and instead create it from a highly secure account that minimizes the likelihood of an account compromise.  This account would only hold the CloudFront configuration, and possibly other sensitive resources that need to be protected from account compromise.  Because these resources are rarely modified, this account can be extremely locked down so that almost nobody maintains day-to-day access, thus minimizing threat vectors for an attack.

How Do I Test an Application Failover?

A common architecture we see involving CloudFront is a serverless application using API Gateway to serve an API and CloudFront to serve a static web application. To test this failover, we will focus on recovering the API in the recovery environment while serving the web application from the same multi-region HA CloudFront distribution that is described above.

First, fail over the API Gateway application using Arpio. When the API Gateway is recovered, find the API Gateway Domain Name for the Custom Domain that Arpio recovers for you. This domain name usually looks like: d-abcdefgh9.execute-api.us-east-1.amazonaws.com.

Next, use a command line tool such as host, dig, or nslookup to resolve that domain name to an IP address.  This is the IP address we want to send traffic to for the test.

Finally, create an entry in your /etc/hosts file to alias the hostname for your API to this IP address.
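The resolve-and-alias steps above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, api.example.com stands in for your production API hostname, and the resolver is passed in as a parameter so that the function can be exercised without network access. By default it uses socket.gethostbyname, which returns a single IPv4 address.

```python
import socket


def hosts_entry(api_hostname, recovery_domain, resolve=socket.gethostbyname):
    """Return an /etc/hosts line aliasing api_hostname to the recovery endpoint.

    recovery_domain is the API Gateway Domain Name recovered by Arpio;
    api_hostname is the hostname your web application normally calls.
    """
    ip_address = resolve(recovery_domain)
    # /etc/hosts format: IP address, whitespace, hostname
    return f"{ip_address}\t{api_hostname}"


# Example (requires network access to resolve the recovered domain):
#   print(hosts_entry("api.example.com",
#                     "d-abcdefgh9.execute-api.us-east-1.amazonaws.com"))
# Append the printed line to /etc/hosts, and remove it when the test is done.
```

Because the address is resolved only once, repeat the lookup if the test runs for a long time and requests start failing.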

You can now visit your web application as you normally would. The web content will be served via CloudFront using the multi-region HA architecture described above. The API traffic will be routed to the recovery environment because the production API hostname has been aliased to the recovery environment API.

How Do I Perform a Real Failover?

The process for performing a real failover is the same as a test, except instead of creating a DNS alias in your /etc/hosts file, you update your production DNS entry to point at the new API Gateway Domain Name.
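How you update the production DNS entry depends on your DNS provider. As one hedged example, assuming the record is a CNAME hosted in Amazon Route 53 (an assumption; this article does not prescribe a DNS provider), the change could be described with a change batch like the one this sketch builds. The function name, record name, and TTL are hypothetical.

```python
def build_cname_upsert(record_name, target_domain, ttl=60):
    """Build a Route 53 ChangeBatch that points record_name at target_domain.

    Assumes the production API record is a CNAME in a Route 53 hosted zone.
    UPSERT creates the record if it is missing or overwrites it if present.
    """
    return {
        "Comment": "Fail over API to recovery environment",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_domain}],
                },
            }
        ],
    }
```

The resulting dictionary could be passed as the ChangeBatch argument to boto3's change_resource_record_sets, together with your hosted zone ID. A short TTL on the record keeps the cutover, and any later failback, from being delayed by cached DNS answers.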