The common misconception is that cloud applications are disaster proof. Several high-profile downtime events have proven otherwise. The causes of downtime range from misconfigurations to noisy neighbors, and planning and frequently testing disaster recovery plans can be complex. In this blog, Bhargava K, Cloud Solutions Engineer at MontyCloud, writes about how a Fortune 200 ISV automated and simplified cross-region disaster recovery with MontyCloud’s DAY2™ Disaster Recovery Bot. So much so that the customer is able to achieve a 1-hour RPO and a 4-hour RTO with a simple Slack command, as well as test its DR process on demand.
– Sabrinath S. Rao
The common misconception is that once you move to the cloud, you do not need a disaster recovery plan. Nothing could be further from the truth. Yes, the underlying cloud services are a utility and are always available, and the major cloud vendors have done a great job over the years of minimizing the blast radius of major events.
Application downtime can happen for a number of reasons. Several high-profile outages in 2020 show that despite their best efforts, cloud providers are not immune. For example, several AWS services went down for hours on November 25, 2020, when something as routine as a relatively small capacity addition to the Amazon Kinesis service affected other foundational services such as Amazon CloudWatch.
While cloud infrastructure downtime has broad impact, it is fortunately rare. Most disasters are caused by factors such as domain controller failures, misconfigurations, under-provisioned core services such as load balancers, noisy neighbors, and scaling errors, to name a few. Organizations need to prepare for such events.
In this blog, I will discuss how one of the largest mobile workforce management applications, part of the portfolio of a Fortune 200 Independent Software Vendor (ISV), uses MontyCloud DAY2™ to achieve a 1-hour Recovery Point Objective (RPO), a 4-hour Recovery Time Objective (RTO), and on-demand recovery testing with just a Slack command, across 600 Amazon Elastic Compute Cloud (EC2) instances, with an autonomous Bot-based approach.
Bot-based autonomous operations simplify disaster recovery
With 600 EC2 instances across two production (primary) regions, it is nearly impossible to set up and monitor replication, and to respond to a disaster event, manually. It requires planning, continuous monitoring, policy enforcement, an efficient design, and frequent testing. AWS provides many tools to help, such as CloudEndure, Amazon Elastic Block Store (EBS) snapshots, Amazon CloudWatch Events, and AWS Health (Personal Health Dashboard).
A well-defined disaster recovery plan begins with provisioning
This global ISV uses the MontyCloud DAY2™ Disaster Recovery Bot (DR Bot) to orchestrate two cross-region disaster recovery plans. The first is between US: Northern California (primary region) and US: Oregon (disaster recovery region), and the second between Europe: Ireland (primary) and Europe: Frankfurt (secondary).
For the disaster recovery plan to work, the customer needs to automatically enforce the following:
- EC2 instances need to be enrolled into the DAY2™ Disaster Recovery Bot as they are created.
- A CloudEndure agent is deployed on all 25 database servers and domain controllers that are enrolled in the disaster recovery plan.
- AMI backups are taken every day during low-load periods.
- All other components, such as Elastic Load Balancers and security groups, are created in advance as part of the pilot light setup.
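The daily AMI backup step above can be sketched with boto3. This is a minimal illustration, not the customer's actual implementation; the "dr-enrolled" tag and the backup naming scheme are assumptions.

```python
# Sketch of a daily AMI backup job for DR-enrolled instances (illustrative;
# the tag name and AMI naming scheme are assumptions, not the customer's setup).
from datetime import datetime, timezone


def backup_name(instance_id: str, now: datetime) -> str:
    """One deterministic AMI name per instance per day."""
    return f"dr-backup-{instance_id}-{now:%Y-%m-%d}"


def backup_enrolled_instances(region: str) -> list:
    """Create a no-reboot AMI for every running, DR-enrolled instance."""
    import boto3  # imported here so the pure helper above has no dependency
    ec2 = boto3.client("ec2", region_name=region)
    now = datetime.now(timezone.utc)
    image_ids = []
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:dr-enrolled", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                image = ec2.create_image(
                    InstanceId=instance["InstanceId"],
                    Name=backup_name(instance["InstanceId"], now),
                    NoReboot=True,  # do not interrupt production traffic
                )
                image_ids.append(image["ImageId"])
    return image_ids
```

A job like this would run on a schedule during the low-load window, once per primary region.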
The IT team ensures that all production EC2 instances deployed by the application teams follow these guidelines by provisioning them through a pre-configured DAY2™ EC2 Blueprint.
The DAY2™ EC2 Blueprint is a no-code, guardrailed, Infrastructure-as-Code provisioning blueprint that IT teams can enable for their application teams through a self-service portal.
Introducing the DAY2™ Disaster Recovery Bot
The DAY2™ Disaster Recovery Bot enables the IT team to continuously monitor for outage events and orchestrate an autonomous disaster recovery across several hundred EC2 instances. The IT team is notified through Slack or Amazon Simple Notification Service (SNS) when a recovery is complete. The DR Bot is built using native AWS services, so the IT team does not have to deploy or maintain any third-party agents. The DR Bot enables IT teams to run autonomous disaster recovery and save on operations costs.
The nuts and bolts of the DAY2™ Disaster Recovery Bot
The DAY2™ Disaster Recovery Bot automates all five elements of the disaster recovery plan.
- CloudEndure is the backbone of the disaster recovery plan. The customer uses CloudEndure for continuous replication of the database servers and domain controllers. The data is stored on replication instances in the secondary region. There is no additional cost for the replication instances.
- The Bot takes EBS snapshots once a day for all application servers and load balancers.
- Communication loss between the primary and secondary regions can create gaps in the data copies. The DR Bot monitors network connectivity and the availability of communication ports, and automates the remediations.
- A CloudWatch Events rule in the primary region filters EC2 operational issues from AWS Health (Personal Health Dashboard) and publishes only availability events to an SNS topic.
- When an operational issue is picked up, the event is published to this SNS topic, which handles the cross-region transfer of events.
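A minimal sketch of such a rule using boto3 follows. The rule and target names are hypothetical, and the exact AWS Health categories the customer treats as "availability events" are an assumption.

```python
import json


def health_event_pattern() -> dict:
    """Pattern matching AWS Health operational issues for EC2 (an assumed
    definition of which Personal Health Dashboard events count as
    availability events)."""
    return {
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
        "detail": {
            "service": ["EC2"],
            "eventTypeCategory": ["issue"],
        },
    }


def install_health_rule(primary_region: str, topic_arn: str) -> None:
    """Create the rule in the primary region and point it at the SNS topic."""
    import boto3  # lazy import keeps the pattern helper dependency-free
    events = boto3.client("events", region_name=primary_region)
    events.put_rule(
        Name="dr-ec2-availability",  # hypothetical rule name
        EventPattern=json.dumps(health_event_pattern()),
        State="ENABLED",
    )
    events.put_targets(
        Rule="dr-ec2-availability",
        Targets=[{"Id": "dr-sns", "Arn": topic_arn}],
    )
```

Because SNS topics can have cross-region subscribers, the topic is what carries the event from the primary to the secondary region.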
Networking – Simplified and automated
- The DR Bot uses Nginx to route traffic to the load balancers and to switch routing to the secondary region in the event of a disaster.
- The routing tables, Virtual Private Clouds (VPCs), and security groups are copied once a day and stored in an Amazon DynamoDB table.
Recovery is then as simple as repopulating the configuration from the DynamoDB table. The DR Bot has workflow rules to automate this process.
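The once-a-day network snapshot could look roughly like the sketch below. The table layout, one item per region per capture, is an assumption for illustration.

```python
# Sketch: copy networking configuration into DynamoDB for later recovery
# (the table schema here is an illustrative assumption).
import json
from datetime import datetime, timezone


def to_item(region: str, captured_at: str, config: dict) -> dict:
    """Shape the captured config as a DynamoDB item (low-level client format)."""
    return {
        "region": {"S": region},
        "captured_at": {"S": captured_at},
        "config": {"S": json.dumps(config, default=str)},
    }


def snapshot_network_config(table_name: str, region: str) -> None:
    """Copy route tables and security groups from EC2 into DynamoDB."""
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    config = {
        "route_tables": ec2.describe_route_tables()["RouteTables"],
        "security_groups": ec2.describe_security_groups()["SecurityGroups"],
    }
    boto3.client("dynamodb").put_item(
        TableName=table_name,
        Item=to_item(region, datetime.now(timezone.utc).isoformat(), config),
    )
```

During recovery, the same item is read back and used to recreate the VPC routing and security groups in the secondary region.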
Recovery is as simple as a Slack notification
- An AWS Lambda function in the secondary region is subscribed to the SNS topic in the primary region and receives the disaster event.
- A Slack notification is published for the disaster recovery admin. The admin can approve the recovery in Slack.
- The AWS Systems Manager (SSM) Executor runs an SSM Automation document to start the recovery.
- First, CloudEndure is triggered to recover the database servers and the domain controllers. The database servers are started, the data volumes are mounted, and consistency checks are performed.
- Once the consistency checks pass, the EC2 instances with the application servers are brought online from the backup AMIs.
- Next, VPCs and security groups are configured from the DynamoDB table.
- The load balancers are brought online and new network paths are configured.
- Finally, after a test run, the new load balancers are published to Nginx for traffic redirection.
- After the servers are created, traffic is switched in Amazon Route 53 and the CloudFront distributions to route it to the secondary region.
- Once the failover is complete, a Slack notification is sent indicating the status of the job.
- Meanwhile, customers are routed to a maintenance page until everything is set up in the secondary region.
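The first two steps, the Lambda function receiving the SNS event and posting the Slack approval request, can be sketched as below. The webhook URL and message format are hypothetical.

```python
# Sketch of the secondary-region Lambda subscribed to the primary region's
# SNS topic (webhook URL and message wording are illustrative assumptions).
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"  # hypothetical


def slack_payload(health_event: dict) -> dict:
    """Build the approval message from the forwarded AWS Health event."""
    detail = health_event.get("detail", {})
    return {
        "text": (
            f"DR alert: {detail.get('service', 'unknown service')} "
            f"{detail.get('eventTypeCode', '')} in "
            f"{health_event.get('region', 'unknown region')}. "
            "Approve to start recovery."
        )
    }


def handler(event, context):
    """Lambda entry point: one SNS record per forwarded disaster event."""
    for record in event["Records"]:
        health_event = json.loads(record["Sns"]["Message"])
        request = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(slack_payload(health_event)).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)  # notify the DR admin in Slack
```

Once the admin approves in Slack, the SSM Automation document takes over and drives the remaining recovery steps.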
Test as frequently as you want – every day or several times a day
Frequent testing ensures that the disaster recovery plan works when an actual event occurs. The industry best practice is to test a failover once a quarter. However, since this customer serves enterprises with stringent requirements, demonstrating disaster recovery is part of its sales cycle; frequently, they have to demonstrate their disaster readiness several times a day. The DAY2™ DR Bot has a pre-configured test mode. At the click of a button, the test process, including the Slack notifications, is automated.
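One way an on-demand test run could be triggered is sketched below, assuming the recovery runbook exposes a DryRun-style parameter; the document name and parameter are illustrative, not the actual DAY2™ interface.

```python
# Sketch: kick off the recovery runbook in test mode (the "DryRun" parameter
# and document name are assumed conventions, not the actual DAY2 interface).


def automation_parameters(dry_run: bool) -> dict:
    """SSM Automation parameters; DryRun is an assumed test-mode switch."""
    return {"DryRun": ["true" if dry_run else "false"]}


def start_recovery(document_name: str, region: str, dry_run: bool = True) -> str:
    """Start the recovery runbook in the secondary region."""
    import boto3
    ssm = boto3.client("ssm", region_name=region)
    response = ssm.start_automation_execution(
        DocumentName=document_name,
        Parameters=automation_parameters(dry_run),
    )
    return response["AutomationExecutionId"]
```

A test run exercises the same code path as a real failover, which is what makes the quarterly (or daily) drills trustworthy.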
Why should I use the DAY2™ Disaster Recovery Bot if I can build it myself?
You should use the DAY2™ Bot so you don’t have to acquire deep knowledge of tens of different AWS services, then integrate and test them frequently. MontyCloud’s AWS-certified engineers built this Bot using AWS Well-Architected principles. All you have to do is select a configuration setting and your EC2 instances are DR-enabled. Apart from the cost of distracting the AWS experts on your staff from more valuable tasks, the DAY2™ Bot can save several months of design, build, and testing time. For an environment of this scale, the automation can further save you at least five high-value personnel. Furthermore, the autonomous nature of the Bot ensures that your disaster recovery plan is timely and reliable.
A disaster recovery plan is critical for applications in the cloud. The DAY2™ Disaster Recovery Bot can help you drive a reliable and timely recovery in the event of your primary cloud region going down. The autonomous nature of the Bot can help you scale quickly, efficiently, and cost-effectively, saving you both time and people costs.