Cross-posted from the Atlantis Computing blog
In recent months, customers have asked me a lot of questions about how to configure resilience for virtualized desktops. If you want to jump straight to the conclusion, skip to the end; for the full explanation, please carry on.

The first question that needs to be asked is whether the workload is stateless or persistent. A stateless desktop workload is created ‘on the fly’ for the user by combining the OS, applications and persona at login. A persistent workload is usually one where the user has elevated privileges, can perhaps install applications themselves, and has a ‘one to one’ relationship with their desktop.
Stateless desktops are created and destroyed as the user needs them. This means that the accidental destruction via host failure shouldn’t matter too much as the user can log back on to any other desktop and continue working. In fact, to the user, it should seem that any desktop they log onto looks exactly like ‘their’ desktop. This means that, in effect, the resilience for a stateless desktop is taken care of by the broker. This has large advantages in terms of resources for the desktop virtualization implementation as we no longer need to spend any money protecting any of the desktop VMs by enabling resilience technologies either in the hypervisor or on the physical host.
A persistent desktop is a totally different use case and needs to be treated very differently. We now need to minimize downtime on that particular VM: the user may have installed their own apps, made configuration changes to the OS and applications that are not backed up centrally, and may even have stored files on a local drive. In this case we need to enable resilience technologies for that user. I’m going to concentrate on VMware High Availability as the resilience technology to be used, as it is the one I run into most commonly.
VMware High Availability (HA) pools the hosts in a cluster, monitors them, and, in the event of a host failure, uses shared storage to restart the affected VMs on another host. For this blog post, I’ll ignore VM and application monitoring and concentrate on that standard HA functionality.
HA is extremely easy to configure; it can be enabled on a cluster in just five clicks. It is the details that I’ve found people have trouble with. This is especially true for deployments involving Atlantis ILIO, as the presence of a VM with a large reservation changes the way we think about HA.
The example I’m going to use is an eight-host cluster, each host with 256GB of RAM, running Windows 7 guest VMs. Each Windows 7 virtual desktop has 2GB of RAM, and the Atlantis ILIO VM is allocated a reserved 60GB of RAM (please refer to the relevant Atlantis ILIO sizing guidelines for your environment). In a stateless VDI deployment we could probably push the number of users per host up to about 140, but as we are looking at persistent desktops we’ll keep the number well below that.
The figure below shows, to scale, how our host RAM is assigned, with 2GB given to each desktop and 60GB given to Atlantis ILIO.
“vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected.”
HA admission control can be configured in one of five ways, and we will go through each of the configuration options and examine whether it should be applied to our use case.
HA with admission control disabled
It would seem contradictory to disable admission control when you turn on HA, and you would be right. Essentially you are saying ‘I want my VMs to restart, but I’m not willing to supply the capacity needed’. Having said that, it’s still the most common configuration. To quote Duncan Epping and Frank Denneman:
“Admission Control is more than likely the most misunderstood concept vSphere holds today and because of this it is often disabled.”
I don’t advise disabling admission control in any environment where HA is required as it almost guarantees that you will have increased downtime or decreased performance for your users in the event of a host failure.
Host Failures Cluster Tolerates
The Admission Control Policy that has been around the longest is the “Host Failures Cluster Tolerates” policy. It is also historically the least understood Admission Control Policy due to its complex admission control mechanism. With the Host Failures Cluster Tolerates policy, vSphere HA performs admission control in the following way:
- Calculates the slot size. A slot is a logical representation of memory and CPU resources. By default, it is sized to satisfy the requirements for any powered-on virtual machine in the cluster.
- Determines how many slots each host in the cluster can hold.
- Determines the Current Failover Capacity of the cluster. This is the number of hosts that can fail and still leave enough slots to satisfy all of the powered-on virtual machines.
- Determines whether the Current Failover Capacity is less than the Configured Failover Capacity (provided by the user). If it is, admission control disallows the operation.
Slot size is calculated in the following way:
‘The slot size for CPU is determined by the highest reservation or 256MHz (vSphere 4.x and prior) / 32MHz (vSphere 5) if no reservation is set anywhere. HA will use the highest Memory Overhead in your cluster as the slot size for memory.’
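To make the slot arithmetic concrete, here is a minimal Python sketch of that calculation using the example cluster’s numbers (256GB hosts, a 60GB ILIO reservation); the `admission_allows` helper is purely illustrative, not a VMware API.

```python
# Worked example from above: hosts with 256GB of RAM, 2GB desktops with
# no reservation, and one Atlantis ILIO VM reserving 60GB of RAM.
HOST_MEM_GB = 256

# Step 1: slot size. The memory slot grows to the largest reservation
# in the cluster -- here, the ILIO VM's 60GB.
slot_gb = 60

# Step 2: how many slots each host can hold.
host_slots = HOST_MEM_GB // slot_gb   # 256 // 60 = 4 slots per host

# Steps 3-4: compare Current Failover Capacity against the configured
# value; HA disallows an operation once the cluster can no longer
# tolerate the configured number of host failures.
def admission_allows(current_capacity, configured_capacity):
    return current_capacity >= configured_capacity

print(host_slots)  # 4
```

One 60GB reservation drags the slot size up for every VM in the cluster, which is exactly why the next paragraph’s consolidation numbers look so poor.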
In this case, it means that each host will contain only four slots due to the Atlantis ILIO memory reservation. Such large slot sizes lead to reduced consolidation ratios and shouldn’t be used here.

Host Failures Cluster Tolerates with slot size configured

It is possible to configure the slot size using two advanced settings: das.slotCpuInMHz and das.slotMemInMB.
By using these two advanced settings, it’s possible to change the slot size for HA and get better consolidation ratios. It is also possible to change the slot size in the new web administration console. This is what it would look like if you changed the slot size to 2GB for memory.
As you can see, the Atlantis ILIO VM consumes 30 of the 128 slots per host due to its 60GB RAM reservation. In our eight-host cluster, if we set ‘Host Failures Cluster Tolerates’ to one and reduce the memory slot size to 2GB, we will have 18 or 19 slots reserved per host, as below:
This means that there will not be enough free slots (30) on any single host to start the Atlantis ILIO VM. VMware vSphere will try to use DRS (Distributed Resource Scheduler) to move the smaller desktop VMs out of the way and create enough free slots to power on the ILIO VM. This is called defragmentation. Defragmentation is not guaranteed to work, as it may require multi-hop DRS moves or multiple rounds of DRS, and it still needs to respect affinity, anti-affinity and reservation rules. Defragmentation can greatly increase downtime and reduce the likelihood of a successful power-on of the Atlantis ILIO VM. Both the defragmentation of resources and the fact that HA will restart VMs on any available host mean that the Atlantis ILIO VM and the virtual desktops associated with it could end up on different hosts. Although this is a supported configuration, it is less than ideal from a performance and resource utilization perspective.
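The fragmentation problem above is easy to see numerically. The sketch below (Python, illustrative only; the 19 free slots per host come from the example figures, not from any API) checks whether a single host can accommodate the ILIO VM’s slots.

```python
import math

# Hypothetical custom slot size from the example: 2GB of memory.
slot_gb = 2
host_slots = 256 // slot_gb              # 128 slots per host
ilio_slots = math.ceil(60 / slot_gb)     # ILIO's 60GB reservation -> 30 slots

# With roughly 19 slots of failover capacity reserved per host,
# no single host is guaranteed the 30 free slots ILIO needs, so HA
# may have to rely on DRS defragmentation to restart it.
free_slots_per_host = 19
needs_defrag = free_slots_per_host < ilio_slots
print(needs_defrag)  # True
```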
Percentage of Cluster Resources Reserved
This Admission Control policy is the most common in actual use. The main advantage of the percentage based Admission Control Policy is that it avoids the commonly experienced slot size issue where values are skewed due to a difference in reservations between VMs on the host.
This policy adds up the total CPU and memory resources across the cluster and reserves a percentage of them for HA. If we configure both memory and CPU to be reserved at 13% for our eight-host cluster, it will look as below:
This gives us enough spare capacity across the whole cluster to account for one failed host (12.5% of cluster resources). However, the ILIO VM requires roughly 24% of a single host’s capacity, and the reserved headroom is spread across all hosts, so restarting it may again depend on defragmentation. As stated above, defragmentation can increase downtime and reduce the likelihood of a successful power-on of the ILIO VM, and both the defragmentation of resources and the fact that HA will restart VMs on any available host mean that the Atlantis ILIO VM and the desktops associated with it could end up on different hosts. Although this is a supported configuration, and we have many customers successfully using this design, it is less than ideal.
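The percentage arithmetic can be sketched the same way (Python, illustrative numbers from the example above):

```python
import math

hosts, host_mem_gb = 8, 256

# Reserving capacity for one host failure: 1/8 = 12.5%, which the
# example rounds up to 13%.
reserve_pct = math.ceil(100 / hosts)     # 13

# The ILIO VM's 60GB reservation as a share of one host's memory:
ilio_share = 60 / host_mem_gb            # 0.234... -> roughly 24%

# The reserved 12.5% is spread across all hosts, so no single host is
# guaranteed ~24% free memory for the ILIO restart.
print(reserve_pct, round(ilio_share * 100, 1))  # 13 23.4
```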
Specify Failover Hosts
With the “Specify Failover Hosts” Admission Control Policy, HA will attempt to restart all virtual machines on the designated failover hosts. The designated failover hosts are essentially “hot standby” hosts: DRS will not migrate virtual machines to these hosts when resources are scarce or the cluster is imbalanced.
This option both guarantees that you will be able to restart your VMs and keeps the desktop VMs on the same host as their associated ILIO. The reason many people do not pick this policy is that the dedicated failover hosts sit unused during normal operations. While this is a fair point, any reservation, whether slots, a percentage or a whole host, prevents that proportion of your infrastructure from being used; the fact that the reserved capacity sits in one place rather than being spread across the cluster isn’t a problem, in my opinion.
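To put numbers on that point, here is a quick sketch (Python, illustrative only) of the capacity withheld in the eight-host example:

```python
hosts = 8

# A dedicated failover host idles 1/8 of the cluster's capacity...
failover_host_share = 1 / hosts          # 0.125, i.e. 12.5%

# ...the same fraction the percentage policy withholds for one host
# failure; the reservation is concentrated rather than spread out.
percentage_policy_share = 1 / hosts

print(failover_host_share)  # 0.125
```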
Conclusion

As a virtual machine, Atlantis ILIO can take advantage of the availability and resiliency features of the underlying virtualization platform to ensure the reliability of the virtual desktop infrastructure, while at the same time lowering cost and increasing performance. In the case of VMware vSphere, the best approach to VMware HA with Atlantis ILIO is to use admission control with a specified failover host. Using this approach, the Atlantis ILIO virtual machine will remain on the same physical server as the virtual desktops using its datastore, ensuring the fastest possible recovery from a host failure with optimal performance.