Congratulations! You successfully designed your VMware environment, including the high-speed Fibre Channel/iSCSI SAN that connects your powerful servers to the latest and greatest disk arrays.
You are capable of running hundreds of VMs simultaneously without so much as a blip in vCenter’s Resource Monitors.
The Backup Administrator in me can’t help but ask, “What is your Disaster Recovery plan?” For example, what happens if the cleaning crew accidentally breaks one of those Fibre cables in the middle of the night? Of course, that never happens... right?
Luckily for all of us, there is a plethora of High Availability (HA) technologies available to protect your environment and keep it up and running in the event of a failure.
There are VMware’s own solutions as well as options from 3rd-Party Vendors:
VMware HA – HA will automatically restart VMs on another ESX/ESXi host if the physical host they were running on fails. Because the VMs are rebooted, expect a brief outage while they come back up.
VMware Fault Tolerance (FT) – With FT, two ESX hosts run copies of the same VM in lockstep. Only the primary VM is visible on the network at any time; the shadow copy mirrors its execution in the background. If the host running the primary fails, the shadow copy takes over immediately with no data loss.
Great tech, but it has a stringent set of requirements and is only included in the more expensive licensing options. So, unless your environment truly cannot tolerate any downtime, you may want to consider other options.
VMware Site Recovery Manager (SRM) – As the name implies, SRM is designed to fail over the entire site in the event of a major disaster. It relies on the disk arrays to asynchronously replicate snapshots of the LUNs that contain your VMs to an alternate site. If the Production site fails, the Administrator pushes the failover button, and the VMs are booted at the DR site using the latest replicated snapshot.
Make sure that you schedule your replication appropriately to avoid significant data loss. For example, you may want to consider every 3-4 hours instead of every 8-10 hours. If you are concerned about WAN traffic, consider a WAN optimization solution like Riverbed’s products.
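To put rough numbers on that replication schedule, here is a minimal back-of-the-envelope sketch. It assumes snapshots are taken at a fixed interval and that each one takes some time to copy across the WAN; the function name, the 0.5-hour transfer time, and the intervals are illustrative assumptions, not SRM or array settings.

```python
# Rough worst-case data-loss estimate for snapshot-based replication.
# All names and numbers here are illustrative assumptions, not SRM settings.

def worst_case_data_loss_hours(replication_interval_hours, transfer_hours=0.0):
    """Return how old the newest complete replica can be at failover time.

    Worst case: the site fails just before the next snapshot finishes
    copying, so you fall back to the previous snapshot, which is one full
    interval plus the transfer time old.
    """
    return replication_interval_hours + transfer_hours


if __name__ == "__main__":
    for interval in (3, 4, 8, 10):
        loss = worst_case_data_loss_hours(interval, transfer_hours=0.5)
        print(f"replicate every {interval}h -> up to {loss}h of changes lost")
```

The point is simply that the replication interval, plus however long a snapshot takes to reach the DR site, sets the ceiling on how much work you could lose. Plug in your own numbers before settling on a schedule.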
3rd-Party solutions normally fall into one or more of the following three categories:
1. HA of the entire ESX host (DoubleTake)
2. HA of the applications and data within the individual VMs (CA, DoubleTake, NeverFail, etc.)
3. HA of vCenter itself (CA, DoubleTake, NeverFail [OEM’ed by VMware as “VMware vCenter Server Heartbeat”], etc.)
Unlike the VMware solutions, the 3rd-Party solutions do not require shared storage. Their pricing options may also be more attractive to SMBs.
Whatever solution(s) you may select, choose wisely. Although you or your customers may be cost-conscious in these tough economic times, is it really a good idea to cut corners in your Recovery Management and High Availability solution after spending so much money on your current infrastructure?