Microsoft Azure provides a trusted path to enterprise-ready innovation with SAP solutions in the cloud. Mission-critical applications such as SAP run reliably on Azure, an enterprise-proven platform that offers hyperscale, agility, and cost savings for running a customer’s SAP landscape.
System availability and disaster recovery are crucial for customers who run mission-critical SAP applications on Azure.
Recovery time objective (RTO) and recovery point objective (RPO) are two key metrics that organizations use to develop a disaster recovery plan that can maintain business continuity after an unexpected event.
Recovery point objective refers to the amount of data at risk, expressed in terms of time, whereas recovery time objective refers to the maximum tolerable time that a system can be down after a disaster occurs.
The diagram below shows RPO and RTO on a timeline in a business-as-usual (BAU) scenario.
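To make the two metrics concrete, here is a minimal sketch that derives the achieved RPO and RTO from three points on such a timeline. The function names, timestamps, and objective values are hypothetical and chosen only for illustration; they are not taken from any Azure or SAP tooling.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, disaster_time: datetime) -> timedelta:
    """Data at risk, expressed as time: last recoverable point up to the disaster."""
    return disaster_time - last_recovery_point


def achieved_rto(disaster_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: from the disaster until the system is available again."""
    return service_restored - disaster_time


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved value does not exceed it."""
    return achieved <= objective


# Hypothetical timeline for a single incident.
last_recovery_point = datetime(2024, 1, 1, 9, 45)
disaster_time = datetime(2024, 1, 1, 10, 0)
service_restored = datetime(2024, 1, 1, 13, 30)

rpo = achieved_rpo(last_recovery_point, disaster_time)  # 15 minutes of data at risk
rto = achieved_rto(disaster_time, service_restored)     # 3 hours 30 minutes of downtime

# Hypothetical business objectives: RPO of 30 minutes, RTO of 2 hours.
print("RPO met:", meets_objective(rpo, timedelta(minutes=30)))
print("RTO met:", meets_objective(rto, timedelta(hours=2)))
```

In this made-up incident the RPO target is met but the RTO target is not, which is the kind of gap a DR drill is meant to surface.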
Design principles for disaster recovery systems
- Selection of DR region based on SAP-certified VMs for SAP HANA – It is important to verify that the required SAP-certified VM types are available in the DR region.
- RPO and RTO values – Businesses need to set clear expectations for RPO and RTO values, which greatly affect the disaster recovery architecture and the tools and automation required to implement it.
- Cost of Implementing DR, Maintenance and DR Drills
- Criticality of systems – It is possible to establish a trade-off between the cost of DR implementation and business requirements. While the most critical systems can use state-of-the-art DR architecture, medium- and less-critical systems may tolerate higher RPO/RTO values.
- On-demand resizing of DR instances – It is preferable to use smaller VMs for DR instances and upsize them during an active DR scenario. It is also possible to reserve the required VM capacity in the DR region so that there is no “waiting” time to upscale the VMs.
- Additional considerations for cloud infrastructure costs and the effort of setting up an environment for non-disruptive DR tests. A non-disruptive DR test is executed without failing over the actual productive systems to the DR systems, thereby avoiding business downtime. This involves additional cost for temporary infrastructure that is set up in a completely isolated vNet during the DR tests.
- Certain components in the SAP system architecture, such as the clustered network file system (NFS), are not recommended for replication with Azure Site Recovery, so additional tools with license costs, such as SUSE Geo-cluster or SIOS DataKeeper, are needed for DR of the NFS layer.
- Azure offers Azure Site Recovery (ASR), which replicates virtual machines to another region. This technology is used for the non-database components or layers of the system, while database-specific methods such as SAP HANA system replication (HSR) are used at the database layer to ensure database consistency.
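The per-layer choice of replication technology described in the last two points can be captured as a simple lookup. The sketch below is only an illustration: the layer labels, component names, and the `plan_replication` helper are hypothetical, and the mapping reflects just the recommendations stated above.

```python
# Replication method per layer, as described in the design principles above.
# The layer keys are hypothetical labels used only for this sketch.
REPLICATION_BY_LAYER = {
    "application": "Azure Site Recovery (ASR)",
    "database": "SAP HANA system replication (HSR)",
    "nfs": "SUSE Geo-cluster or SIOS DataKeeper",
}


def plan_replication(components: dict) -> dict:
    """Map each component name to the replication method for its layer."""
    plan = {}
    for name, layer in components.items():
        if layer not in REPLICATION_BY_LAYER:
            raise ValueError(f"No replication method defined for layer '{layer}'")
        plan[name] = REPLICATION_BY_LAYER[layer]
    return plan


# Hypothetical landscape: component name -> layer.
landscape = {
    "app-server-1": "application",
    "hana-db": "database",
    "nfs-cluster": "nfs",
}

plan = plan_replication(landscape)
for component, method in plan.items():
    print(f"{component}: {method}")
```

Raising an error for an unknown layer is a deliberate choice here, so that a component without a defined DR method is noticed during planning rather than during a failover.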
Disaster recovery architecture for SAP systems running on SAP HANA Database
At a high level, the diagram below depicts the architecture of SAP systems based on SAP HANA and shows which systems remain available in case of local or regional failures.
The diagram below gives the next level of detail on the SAP HANA system components and the corresponding technology used to achieve disaster recovery.
Steps for invoking DR or a DR drill
Microsoft Azure Site Recovery (ASR) helps replicate data to the DR region faster.
- DNS Changes for VMs to use new IP addresses
- Bring up iSCSI – single VM from ASR Replicated data
- Recover Databases and Resize the VMs to required capacity
- Manually provision NFS – Single VM using snapshot backups
- Build Application layer VMs from ASR Replicated data
- Perform cluster changes
- Bring up applications
- Validate Applications
- Release systems
A screenshot of an example DR drill plan.
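The steps listed above can be sketched as an ordered runbook that stops at the first failed step. This is a minimal illustration under stated assumptions: the `Step` class, the `run_drill` function, and the placeholder actions are hypothetical, and a real drill would call the actual Azure, SAP, and cluster tooling behind each step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeds


def run_drill(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order and stop at the first failure."""
    results = []
    for step in steps:
        ok = step.action()
        results.append((step.name, "done" if ok else "failed"))
        if not ok:
            break  # later steps depend on earlier ones, so do not continue
    return results


def placeholder() -> bool:
    """Stand-in for real automation or a manual sign-off (assumption)."""
    return True


# Step names follow the list above; every action is a placeholder.
STEP_NAMES = [
    "DNS changes for VMs to use new IP addresses",
    "Bring up iSCSI VM from ASR replicated data",
    "Recover databases and resize the VMs",
    "Manually provision NFS VM using snapshot backups",
    "Build application layer VMs from ASR replicated data",
    "Perform cluster changes",
    "Bring up applications",
    "Validate applications",
    "Release systems",
]

drill = [Step(name, placeholder) for name in STEP_NAMES]
results = run_drill(drill)
for name, status in results:
    print(f"{status:>6}  {name}")
```

Recording a status per step gives the drill a simple audit trail, which is useful when comparing the time and outcome of successive drills.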
Resiliency/Reliability
Azure keeps your applications up and running and your data available. Azure is the first cloud platform to provide a built-in backup and disaster recovery solution.
Resiliency is not about avoiding failures but responding to failures. The objective is to respond to failure in a way that avoids downtime and data loss. Business continuity and data protection are critical issues for today’s organizations, and business continuity is built on the foundation of resilient systems, applications, and data.
Reliability and resiliency are closely related. Reliability is defined as dependability and performing consistently well. Resiliency is defined as the capacity to recover quickly. Together, these two qualities are key to a trustworthy cloud service. Despite best efforts, disasters happen; they are inevitable but mostly unpredictable, and vary in type and magnitude. There is almost never a single root cause of a major issue. Instead, there are several contributing factors, which is the reason an issue is able to circumvent various layers of mitigations/defenses.