About Failback with Arpio
Failback is an Arpio Enterprise feature that lets you copy the most recent data from your recovery environment back to your primary environment after a failover.
Failback is currently available only for the Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), Amazon Aurora, and Amazon DynamoDB resource types. For step-by-step instructions, see our Performing a Failback documentation.
Where failback fits in your Disaster Recovery plan
A Disaster Recovery plan is not just about responding immediately to malicious attacks or unexpected events, although Arpio can help you with that too. A complete Disaster Recovery plan lets you return your business to normal as quickly and safely as possible. Failback restores the applications in your primary environment to an operational state.
What happens during a failback?
- Failback is designed for an actual disaster scenario where your primary environment has been compromised. This feature is available for Premium & Enterprise subscriptions after you Launch Recovery.
- Unlike other Arpio processes, failback necessitates disruptive, potentially destructive actions in the primary environment. After clicking Failback, you will be prompted to install a distinct CloudFormation template in your primary environment that grants additional permissions to replace the data in active primary systems with snapshots copied from the recovery environment.
- Your application will remain in the Recovery state while the failback operation completes. Once you are satisfied that your application has been properly restored, click Conclude Recovery.
- Failback is always snapshot-based.
- For EC2 or RDS with Backing EBS Volumes: Any EBS volume snapshots over a certain size will need to be initialized, which can take hours depending on the volume of data that needs to be transferred.
- For RDS instances restored from snapshots: a background process continues to copy the storage blocks from S3 to the EBS volume after the instance is created.
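For EC2 volumes that you manage yourself, a common way to initialize a volume restored from a snapshot is to read every block once, which forces the data to be pulled down from S3. The sketch below illustrates that idea only; it is not part of Arpio, the function name `read_all_blocks` is hypothetical, and it does not apply to RDS, where you have no access to the underlying volume.

```python
def read_all_blocks(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read every byte of `path` once and return the number of bytes read.

    Assumption: `path` is a block device (or a regular file, for testing)
    that the current user is allowed to read.
    """
    total = 0
    # buffering=0 gives an unbuffered raw file object, so each read()
    # goes straight to the device.
    with open(path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of device
                break
            total += len(chunk)
    return total
```

On an instance you would call it with the device name of the restored volume, for example `read_all_blocks("/dev/xvdf")` (the device name here is an assumption, and reading a raw device normally requires root privileges).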
Failback for RDS and Aurora
- Aurora:
- The failback process is the same whether or not you have real-time replication enabled.
- Arpio deletes the custom endpoints of the primary cluster, disables deletion protection on the cluster, and then deletes all of the primary cluster's related DB compute instances.
- Arpio renames the original primary cluster and creates a new cluster in the primary environment from a snapshot of the recovery cluster.
- The snapshot used is always taken from a multi-AZ standby instance if one is available.
- This snapshot contains the failed-over data plus any new transactions written since the recovery instance became available after your initial failover from primary to recovery.
- RDS:
- Arpio renames the original primary RDS instance, and a new instance is created in the primary environment from a recovery environment DB snapshot.
- RDS instances created from recovery snapshots during an Arpio failback are made available as soon as the necessary AWS infrastructure for the instance is provisioned.
Failback for EC2
- Arpio does not terminate (delete) the primary instances during failback.
- Arpio stops the primary instances, detaches (but doesn't delete) their EBS volumes, attaches the EBS volumes copied back from the recovery environment, and then starts the instances.
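The order of operations described above can be sketched with the standard EC2 API calls. This is an illustration under stated assumptions, not Arpio's implementation: the function name `swap_volume` and all of its parameters are hypothetical, and `ec2` is assumed to be an object with the same methods as a boto3 EC2 client.

```python
def swap_volume(ec2, instance_id, old_volume_id, new_volume_id, device):
    """Replace one attached EBS volume with another, keeping the old volume.

    Assumption: `ec2` behaves like a boto3 EC2 client, and `new_volume_id`
    is already in the same Availability Zone as the instance.
    """
    # 1. Stop (not terminate) the instance and wait until it is fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 2. Detach the original volume. It is kept, never deleted.
    ec2.detach_volume(VolumeId=old_volume_id, InstanceId=instance_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[old_volume_id])

    # 3. Attach the copied-back volume at the same device name.
    ec2.attach_volume(VolumeId=new_volume_id, InstanceId=instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[new_volume_id])

    # 4. Start the instance again and wait until it is running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Because the original volume is only detached, it remains available in the primary account if you need to inspect the data that was there before the failback.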
Failback for DynamoDB
- When you click Failback on an application with DynamoDB, Arpio automates these actions:
- Creates a snapshot of the DynamoDB table in the recovery environment
- Copies it back to the primary environment
- Deletes the primary environment DynamoDB table
- Creates a new DynamoDB table from the copied snapshot
- Reconfigures streams and tags as they were configured on the primary environment table before the failback was started.
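Because the primary table is deleted and recreated, its stream and tag settings have to be captured before the failback and reapplied afterwards. The sketch below shows one way that could look with the DynamoDB API. It is not Arpio's implementation: the function names are hypothetical, and `dynamodb` is assumed to be an object with the same methods as a boto3 DynamoDB client.

```python
def capture_table_settings(dynamodb, table_name):
    """Record the stream specification and tags of an existing table."""
    table = dynamodb.describe_table(TableName=table_name)["Table"]
    # Simplification: only the first page of tags is read. A complete
    # version would follow NextToken until it is absent.
    tags = dynamodb.list_tags_of_resource(
        ResourceArn=table["TableArn"]
    ).get("Tags", [])
    return {"stream": table.get("StreamSpecification"), "tags": tags}


def reapply_table_settings(dynamodb, table_name, settings):
    """Apply previously captured stream and tag settings to a new table."""
    # The recreated table has a new ARN lookup, so describe it again.
    table = dynamodb.describe_table(TableName=table_name)["Table"]
    stream = settings["stream"]
    if stream and stream.get("StreamEnabled"):
        dynamodb.update_table(TableName=table_name, StreamSpecification=stream)
    if settings["tags"]:
        dynamodb.tag_resource(
            ResourceArn=table["TableArn"], Tags=settings["tags"]
        )
```

Capturing the settings as a plain dictionary keeps the two halves independent: the capture runs before the primary table is deleted, and the reapply runs only after the new table exists.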
An example failback process
Situation: In response to an incident, you have performed a failover. Your application is now running in your recovery environment.
When you are ready to return to your primary environment:
- Ensure that your primary environment is ready to take production traffic. This may involve manual steps, like validating that malicious actions have been undone, or verifying that AWS services and resources are operating as designed.
- Perform a test failback by following our guide to Performing a Failback.
- Arpio will ask you to set up some extra access to the primary environment via the installation of a new CloudFormation stack.
- Once Arpio has completed the failback process, you should test that your application works correctly, and that the primary environment has been restored with up-to-date data from the recovery environment.
- Schedule a time to perform a real failback. We suggest scheduling a maintenance window to perform a failback, during which no production traffic is served by your applications, to minimize the chance of data loss.
- Perform the failback by:
- Following the steps to perform a failback, just like in step 2.
- Flipping DNS, or making any other changes required to serve production traffic from your primary environment.
- Test that your application is working correctly in the primary environment.
- Begin the “Conclude Recovery” process to reset the recovery environment back to its original “pilot light” state.
- (optional) To improve the security posture of your Arpio usage, you can delete the CloudFormation stack that was added as part of step 2.
- At this point your application should be running just like it was before the incident, and recovery points should be getting created again.
- (optional) Perform a “Test Recovery” to gain extra confidence that you’re protected, should some other incident occur.