Desktop Virtualization Best Practices with VMware HA Admission Control & Atlantis ILIO

Posted by Jim Moyle on March 5th, 2013

Cross-posted from the Atlantis Computing blog

http://blog.atlantiscomputing.com/2013/02/desktop-virtualization-best-practices-with-vmware-ha-admission-control-atlantis-ilio/

In recent months, customers have asked me a lot of questions about how to configure resilience for virtualized desktops. If you want to jump straight to the conclusion, click here; for the full explanation, please carry on. The first question to ask is whether the workload is stateless or persistent.  A stateless desktop workload is created ‘on the fly’ for the user by combining the OS, applications and persona at login.  A persistent workload is usually one where the user has elevated privileges, can perhaps install applications themselves and has a ‘one to one’ relationship with their desktop.

Stateless desktops are created and destroyed as the user needs them. This means that the accidental destruction via host failure shouldn’t matter too much as the user can log back on to any other desktop and continue working. In fact, to the user, it should seem that any desktop they log onto looks exactly like ‘their’ desktop.  This means that, in effect, the resilience for a stateless desktop is taken care of by the broker.  This has large advantages in terms of resources for the desktop virtualization implementation as we no longer need to spend any money protecting any of the desktop VMs by enabling resilience technologies either in the hypervisor or on the physical host.

A persistent desktop is a totally different use case and needs to be treated very differently.  We now need to minimize downtime for that particular VM, as the user may have installed their own apps or made configuration changes to the OS and applications which are not backed up centrally, and they may even have stored files on a local drive.  In this case we need to enable resilience technologies for that user. I’m going to concentrate on VMware High Availability as the resilience technology, as it is the one I run into most commonly.

VMware High Availability (HA) monitors the hosts in a cluster and, in the event of a host failure, uses shared storage to restart the affected virtual machines on the surviving hosts.  For this blog post, I’ll ignore VM and application monitoring and concentrate on that standard HA functionality.

HA is extremely easy to configure and can be enabled on a cluster in just five clicks; it’s the details that I’ve found people have trouble with.  This is especially true for deployments involving Atlantis ILIO, as the presence of a VM with reservations changes the way we think about HA.

The example I’m going to use is an eight-host cluster with 256GB of RAM per host running Windows 7 guest VMs.  Each Windows 7 virtual desktop has 2GB RAM, and the Atlantis ILIO VM is allocated a reserved 60GB of RAM (please refer to the relevant Atlantis ILIO sizing guidelines for your environment).  In a stateless VDI deployment, we could probably push the number of users per host up to about 140, but as we are looking at persistent desktops we’ll keep the number well below that.
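To make the example figures concrete, here is a rough back-of-the-envelope sketch of the per-host RAM budget. It is a sketch only, using the numbers above; it ignores hypervisor and per-VM memory overhead, and the ~140-user stateless figure mentioned above relies on memory overcommitment that this sketch does not model.

```python
# Rough per-host RAM budget for the example cluster (illustrative only;
# ignores hypervisor and per-VM memory overhead and assumes no overcommit).
HOST_RAM_GB = 256
ILIO_RESERVATION_GB = 60      # reserved for the Atlantis ILIO VM
DESKTOP_RAM_GB = 2            # per Windows 7 virtual desktop

ram_left_for_desktops = HOST_RAM_GB - ILIO_RESERVATION_GB          # 196GB
max_desktops_per_host = ram_left_for_desktops // DESKTOP_RAM_GB    # 98

print(f"RAM left for desktops: {ram_left_for_desktops}GB")
print(f"Upper bound on 2GB desktops per host (no overcommit): {max_desktops_per_host}")
```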

Figure: how our host RAM is assigned, to scale, with 2GB given to each desktop and 60GB given to Atlantis ILIO.

VMware HA introduces the concept of admission control which states that:

“vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected.”

HA admission control can be configured in one of five ways and we will go through each of the configuration options and examine whether it should be applied to our current use case.

HA with admission control disabled

It might seem contradictory to disable admission control when you turn on HA, and you would be right.  Essentially you are saying ‘I want my VMs to restart, but I’m not willing to supply the capacity needed’.  Having said that, it’s still the most common configuration.  To quote Duncan Epping and Frank Denneman:

“Admission Control is more than likely the most misunderstood concept vSphere holds today and because of this it is often disabled.”

I don’t advise disabling admission control in any environment where HA is required as it almost guarantees that you will have increased downtime or decreased performance for your users in the event of a host failure.

Host Failures Cluster Tolerates

The Admission Control Policy that has been around the longest is the “Host Failures Cluster Tolerates” policy. It is also historically the least understood Admission Control Policy due to its complex admission control mechanism.  With the Host Failures Cluster Tolerates policy, vSphere HA performs admission control in the following way:

  • Calculates the slot size.  A slot is a logical representation of memory and CPU resources. By default, it is sized to satisfy the requirements for any powered-on virtual machine  in the cluster.
  • Determines how many slots each host in the cluster can hold.
  • Determines the Current Failover Capacity of the cluster.  This is the number of hosts that can fail and still leave enough slots to satisfy all of the powered-on virtual machines.
  • Determines whether the Current Failover Capacity is less than the Configured Failover Capacity (provided by the user).

If it is, admission control disallows the operation.

Slot size is calculated in the following way:

‘The slot size for CPU is determined by the highest reservation or 256MHz (vSphere 4.x and prior) / 32MHz (vSphere 5) if no reservation is set anywhere. HA will use the highest Memory Overhead in your cluster as the slot size for memory.’

When Atlantis ILIO is involved in the slot size calculation, its memory and CPU reservations set the slot sizes, so the slots become much larger.
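To tie the mechanics to our example, here is a minimal sketch of the slot-based admission check, using the memory slot only and the figures from this post. This is not VMware's actual implementation (which also calculates a CPU slot and adds per-VM memory overhead); it is just enough to show the effect of the ILIO reservation.

```python
# Minimal sketch of the "Host Failures Cluster Tolerates" admission check,
# memory slot only.  VMware's real algorithm also calculates a CPU slot and
# adds per-VM memory overhead; this only shows the effect of reservations.

def slot_size_mb(vms):
    """Memory slot size: the largest memory reservation among powered-on VMs."""
    return max(vm["mem_reservation_mb"] for vm in vms)

def slots_per_host(host_mb, vms):
    return host_mb // slot_size_mb(vms)

def current_failover_capacity(hosts_mb, vms):
    """How many hosts can fail and still leave a slot for every powered-on VM
    (worst case: the hosts with the most slots fail first)."""
    per_host = sorted((h // slot_size_mb(vms) for h in hosts_mb), reverse=True)
    needed = len(vms)
    failed = 0
    while failed < len(per_host) and sum(per_host[failed + 1:]) >= needed:
        failed += 1
    return failed

# Example: eight 256GB hosts, one ILIO VM with a 60GB reservation and a
# handful of 2GB desktops with no reservation at all.
hosts = [256 * 1024] * 8
vms = [{"mem_reservation_mb": 60 * 1024}] + [{"mem_reservation_mb": 0}] * 20

print(slot_size_mb(vms) // 1024)         # 60 -> the ILIO reservation sets the slot
print(slots_per_host(256 * 1024, vms))   # 4 slots per host
print(current_failover_capacity(hosts, vms))
```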

In this case, it means that each host will only contain four slots (256GB divided by the 60GB memory slot) due to the Atlantis ILIO reservation.  Such large slot sizes lead to drastically reduced consolidation ratios, so this policy shouldn’t be used as-is in this case.

Host Failures Cluster Tolerates with slot size configured

It is possible to configure the slot size using two advanced settings:

das.slotMemInMB

das.slotCpuInMHz

By using these two advanced settings, it’s possible to change the slot size for HA and get better consolidation ratios; the slot size can also be set in the new web administration console.  For the rest of this example, assume the memory slot size is changed to 2GB.

With a 2GB slot, the Atlantis ILIO VM consumes 30 slots out of the 128 slots each host provides, due to its 60GB RAM reservation. In our eight-host cluster, if we set the Host Failures Cluster Tolerates setting to one and reduce the slot size to 2GB of RAM, roughly 18 or 19 slots end up reserved on each host, as the quick calculation below shows:
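Here is the arithmetic behind those figures. The assumption that the one-host reserve is spread evenly over the seven surviving hosts is mine, made only to show where the 18 or 19 comes from.

```python
# Where the "30 slots" and "18 or 19 slots per host" figures come from
# (memory only, with das.slotMemInMB = 2048).
slot_mb = 2 * 1024
host_slots = (256 * 1024) // slot_mb          # 128 slots per host
ilio_slots = (60 * 1024) // slot_mb           # 30 slots for the 60GB ILIO reservation

# Tolerating one host failure means keeping one host's worth of slots free;
# spread across the seven surviving hosts that is roughly 18-19 slots each.
reserve_per_surviving_host = host_slots / 7   # ~18.3

print(host_slots, ilio_slots, round(reserve_per_surviving_host, 1))
```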

This means that there will not be enough free slots (30) on any one host to start the Atlantis ILIO VM.  VMware vSphere will try to use DRS (Distributed Resource Scheduler) to move the smaller desktop VMs out of the way to create enough free slots to enable the ILIO VM to be powered on.  This is called defragmentation.  Defragmentation is not guaranteed to work, as it may have to use multi-hop DRS moves or multiple rounds of DRS, and it still needs to respect affinity, anti-affinity and reservation rules. Defragmentation can greatly increase downtime and reduce the likelihood of a successful power-on of the Atlantis ILIO VM. Both the defragmentation of resources and the fact that HA will start the VMs on any available server mean that the Atlantis ILIO and the virtual desktops associated with it could be on different hosts.  Although this is a supported configuration, it is less than ideal from a performance and resource utilization perspective.

Percentage of Cluster Resources Reserved

This Admission Control policy is the most common in actual use.  The main advantage of the percentage based Admission Control Policy is that it avoids the commonly experienced slot size issue where values are skewed due to a difference in reservations between VMs on the host.

This policy adds up the total CPU and memory resources across the cluster and reserves a set percentage for HA. For our eight-host cluster, we would configure both memory and CPU to be reserved at 13%.

This will give us enough spare capacity across the whole cluster to account for one failed host (which represents 12.5% of the cluster resources).  In this case, however, the ILIO VM requires around 24% of a single host’s capacity, so restarting it may again depend on defragmentation. As stated above, defragmentation can increase downtime and reduce the likelihood of a successful power-on of the ILIO VM. Both the defragmentation of resources and the fact that HA will start the VMs on any available server mean that the Atlantis ILIO and the desktops associated with it could be on different hosts.  Although this is a supported configuration, and we have many customers successfully using this design, it is less than ideal.
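For completeness, here is the arithmetic behind the 12.5%, 13% and 24% figures quoted above (memory only, using the example numbers from this post):

```python
# The arithmetic behind the percentage-based example (memory only).
hosts = 8
host_gb = 256
ilio_gb = 60

one_host_share = 100 / hosts                    # 12.5% of the cluster per host,
                                                # rounded up to a 13% reservation
ilio_share_of_host = ilio_gb / host_gb * 100    # ~23.4%: the "24% of a single host"
reserved_gb = hosts * host_gb * 0.13            # ~266GB held back across the cluster

print(one_host_share, round(ilio_share_of_host, 1), round(reserved_gb))
```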

Specify Failover Hosts

With the “Specify Failover Hosts” Admission Control Policy, HA will attempt to restart all virtual machines on the designated fail-over hosts. The designated fail-over hosts are essentially “hot standby” hosts. In other words, DRS will not migrate virtual machines to these hosts when resources are scarce or the cluster is imbalanced.

This option both guarantees that you will be able to restart your VMs and keeps the desktop VMs on the same host as their associated ILIO. The reason many people do not pick this policy is that the dedicated failover hosts are not utilized during normal operations.  While this is a fair point, any reservation, be it slots, percentage or host, is going to prevent that proportion of your infrastructure from being used; the fact that it’s in one place rather than spread out across the cluster isn’t a problem in my opinion.

Conclusion

As a virtual machine, Atlantis ILIO can take advantage of the availability and resiliency functionality of the underlying virtualization platform to ensure the reliability of virtual desktop infrastructure, while at the same time lowering cost and increasing performance. In the case of VMware vSphere, the best approach to VMware HA with Atlantis ILIO is to use Admission Control policies with a specified fail-over host.  Using this approach, the Atlantis ILIO virtual machine will remain on the same physical server as the virtual desktops using its data store, ensuring the fastest possible recovery from a host failure with optimal performance.

Resources:

VMware KB 1031703

VMware vSphere 5.1 Clustering Deepdive

Yellow Bricks HA deepdive

vSphere High Availability Deployment Best Practices

Yellow Bricks – Percentage Based Admission Control gives lower VM restart guarantee?

User Installed Applications – My Take

Posted by Jim Moyle on January 29th, 2010
The conversation about user installed applications has been happening for a while now and much has been said about it by many people such as Andrew Wood, Gareth Kitson, Chris Oldroyd, Daniel Feller, Jeff Pitsch, Ron Oglesby, Brian Madden, Chris Fleck and more.  The purpose of this post is both to oblige a few people who have asked me to put my thoughts down and to clarify exactly what I think.  I’m going to ignore BYOC and client hypervisors for the time being to concentrate on the issues surrounding the applications.
To set out why I think this topic is important: user installation of applications is the key differentiator for VDI over terminal services. As I said in a previous post, Why is VDI changing into Terminal Server?, the difference between Terminal Services and VDI is actually very small without it.
If we want to understand why this change is now possible we should look at why it has been impossible in the past.
Terminal Server:  Any change by one person can adversely affect everyone else running on that box. This is not likely to change and, to my mind, is the biggest single historical drawback to TS-based solutions, with no end in sight.
Fat Desktops:  Support is the key here: if a user broke their PC they usually couldn’t fix it, and it took a ‘man in a van’ to go and resolve the issue.  This is especially problematic where the user has a time-critical job or the site is far away.  Of course remote tools help with this, but desktops don’t have KVM boards for when the OS goes south.  Allowing users free rein meant that support calls would go through the roof, and as the time to resolve was huge, without locking down the desktop companies would spend massive amounts of time, energy and money just keeping the wheels on.
For the past fifteen years, whether enterprise desktops have been fat client or terminal server based, the only choice has been to lock them down.  This means industry inertia seems to be almost unstoppable.

The situation has now changed.  Our user base is changing: we now have the Echo/Y generation, who grew up with computers and learned to type at school along with writing.  They break and maintain their own home PCs, and they regularly download and use the tools they need to get the job done.  As these people move into management, the old monolithic, top-down attitude of only using what the IT department gives them to do their job will be anathema to them, and they will start to demand change.  The people who do a job, day in day out, know what tools they need to be productive much better than the IT department does. If we don’t give them those tools they will resent us for not enabling their work.  We need to empower people to be more productive, not take away their motivation, morale and confidence in the organisation.

If we bring the desktop OS into the datacenter we should be able to bring to bear the tools to enable this kind of user empowerment.

If we are going to allow this, we have to classify the different types of user installed applications.  To borrow a little from Simon Bramfitt, with some of my own (in italics), here’s what we are talking about:
  • The departmental app that works with business data, that is formally acknowledged as being important to that department and has its own budget and support mechanism, but is for whatever reason not packaged by IT. This notion may not sit well with some people, but anyone who has worked in a large enterprise knows they exist and might privately offer plenty of justifications as to why an app might fall into this bucket.
  • The communication app: GoToMeeting, WebEx clients etc. that may need to be installed by the user. They may also need other clients to tie into outside companies’ systems, e.g. a Citrix web client, or a proprietary ActiveX plugin for company XYZ’s web app.
  • The personal productivity app that fulfils a limited business function, legitimately purchased but not formally acknowledged by IT as a supported app. A copy of MindMapper, maybe, that’s needed to map out a new business process. It may only be used by a few people across the enterprise but it fills an important role for them.
  • The personal non-productivity tool like iTunes that is OK to have in a BYOPC environment, but not the sort of thing you want interfering with the corporate computing environment. Although a case could be made for iTunes U and work oriented podcasts etc.
  • The totally unauthorised, no excuse, just downloaded from the internet, malware vector that claimed to be a free ring-tone generator.

As Microsoft found out to its cost, allowing uncontrolled user installed apps is a nightmare. So if a user can install all of the above, how do we allow the right apps, protect ourselves against the wrong ones AND reduce our support costs?

  • Any application that directly manipulates business data must be provided by the enterprise.
  • The desktop OS must be treated as an untrusted device.
  • Approved applications should be delivered by TS or App streaming.
  • The users must have a method for choosing from available enterprise applications.
  • Users’ data and enterprise application settings must be kept separate from user installed application settings.
  • Users must have the ability to roll back their environment to any point in the past, while keeping their data and enterprise application customisations.
  • Users must be able to reset their machines to a virgin state whilst keeping their data and enterprise application settings.
The last two are the keys to reducing support costs, i.e. if users break things you give them the tools to fix it themselves, without needing IT skills.  This is possible at the moment with Atlantis, and AppSense also have something in the works, coming out soon, to enable this.
If users have an appropriate method to choose their own enterprise apps, e.g. Dazzle, they are less likely to need to install their own.  If a large percentage of users are installing a certain app (for instance, if a client sends a department files in tar.gz format and 7-Zip becomes prevalent in the organisation), then the IT department should be able to see this and change it from an unsupported user installed application to a supported, enterprise-provided application; I call this the ‘park paths‘ methodology.  To do this you need a way to catalog exactly what users are installing, along the lines of the sketch below.  As an interesting side effect, this may be what brings Open Source apps into the enterprise for the first time.
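As a concrete illustration of that cataloguing step, here is a minimal, Windows-only sketch that lists installed applications by reading the standard Uninstall registry keys. In a real estate you would collect this centrally (and record versions, publishers and install counts) rather than run it per machine.

```python
# Minimal, Windows-only sketch: list installed applications from the
# standard Uninstall registry keys (32- and 64-bit views, machine and user).
import winreg

UNINSTALL_PATHS = [
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall",
    r"SOFTWARE\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall",
]

def installed_apps():
    apps = set()
    for hive in (winreg.HKEY_LOCAL_MACHINE, winreg.HKEY_CURRENT_USER):
        for path in UNINSTALL_PATHS:
            try:
                with winreg.OpenKey(hive, path) as key:
                    for i in range(winreg.QueryInfoKey(key)[0]):
                        with winreg.OpenKey(key, winreg.EnumKey(key, i)) as sub:
                            try:
                                apps.add(winreg.QueryValueEx(sub, "DisplayName")[0])
                            except OSError:
                                pass   # entries without a DisplayName are skipped
            except OSError:
                continue               # hive/path combination not present
    return sorted(apps)

if __name__ == "__main__":
    for name in installed_apps():
        print(name)
```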

If users can provide themselves with the tools they need in a timely fashion (and let’s face it, this is exactly what IT admins have been doing for years), business agility is increased, support load is decreased and application provision is improved, giving the organisation lower costs and a competitive advantage.

User installed applications are a minefield, but with the right approach I believe that it could be the VDI killer feature.

The VMware PCoIP ‘Killer App’

Posted by Jim Moyle on September 2nd, 2009


With this week’s announcement at VMworld that Teradici’s PC-over-IP (PCoIP) protocol is being included in VMware View, I think there is something people may be missing.

The big disadvantage of the original hardware-to-hardware PCoIP implementation was that each connection to the server required its own Teradici card.  This is obviously not a scalable solution.  As the software-to-software solution is unveiled at VMworld, the attention seems to be on the fact that you can get the performance without stuffing your servers full of Teradici cards.  To my mind the software-to-software approach has a big flaw: you need power on the client. Power on the client means either a full PC on the other end, which defeats the point, or a really expensive thin client.

The real key would be to go from software to hardware: software on the server side communicating with a hardware Teradici chip on the client.  You could avoid all the issues of managing the ‘almost PC’ modern thin clients and go back to the cheap, minimal-management devices I think thin clients should be.

I’m curious as to why more isn’t being made of this, as the client devices are already out there, like this one from Samsung, and if you look at the Teradici video on Brian Madden’s site they say it will work.

As the devices get cheaper, maybe down to about $200, combined with the great performance of PCoIP, I can see this being the ‘killer app’ for VMware in this space.

How games will show who is the remote protocol winner

Posted by Jim Moyle on June 18th, 2009


If remote protocols are almost exclusively used in regard to business applications, why are games important?  The reason is that if I try and think of what would be the hardest thing to do over a remote protocol, it would be to play games with the same quality as you would see them on your desktop.

Of course I’m not talking about web-based Flash games; I mean full-on, high-frame-rate games with lots of 3D and explosions, all in DirectX with HD sound. Actually, let’s add some kind of TeamSpeak in there too.

There are two goals in respect to remoting protocols:

  • Get desktop behaviour no matter the application over the LAN
  • Scale the fidelity of the connection according to the bandwidth and endpoint device

The first case is the one I want to talk about: VDI and TS vendors need to be able to prove that their remote protocol can cope with any type of application, or companies are not going to be convinced that the old bugbears of bad sound and choppy, poorly synced video are over.

If people are out there touting the ‘better than desktop experience’ line I want to see it and as yet the performance just isn’t quite there.

When Microsoft bought Calista back at the beginning of 2008, I had hopes that the features they were working on would have made it into RDP by now, but they just announced that their remote DirectX technology isn’t going to make it into final release.

VMware have the software Teradici stuff in the works and I have no doubt something from Citrix is out there.

The wild card as regards remote protocols go is a company called OnLive who plan to provide games over the cloud remoted to your PC.  I’ve no clue how it works, but I’m anxious to see.

Wouldn’t it be interesting to see someone get up on stage and demo a game over a remote protocol?  I wonder who’s going to be first?  I would say that, in the court of public opinion, even if not quite in the technical detail (Silverlight etc.), they would have ‘won’.

I’ve always had customers ask me: why can’t I just use VoIP over Citrix, when it works for talking to my niece in Oz?  Once we have good-quality bi-directional audio, the second device on the user’s desktop can disappear.  Once we have rich multimedia, users will no longer have to manage without seeing that great presentation from their CEO :).

People are talking about Avistar at the moment in regards to this, but from the brief time I’ve had to look at it I think it requires some kind of broker server in the middle.  So if anyone can enlighten me a bit more about exactly what they do and how they do it, please leave me a comment.

Edit:  It seems I’m not the only one thinking about protocols

Virtualization Display Protocol Wars

Brian Madden on Calista

Where’s my MSI?

Posted by Jim Moyle on June 18th, 2009
When implementing a new VDI or terminal server project, the biggest stumbling block is not usually the solution framework, be that VMware, Microsoft or Citrix.  It’s the applications.
It’s those odd one or two apps that have either been created in-house, are cheap bespoke applications, or are so old that they’ve ceased being developed and are now out of support.
If the application is old and out of support I can’t blame the vendors, it’s the customer who should never have gotten themselves into that situation.  It’s the other two situations that need to be looked at.

Small application vendors need to raise their game: it’s no longer good enough to code an application, check it works on your local copy of XP or Vista and sell it to the customer.  Terminal Services has been around for fifteen years and application virtualisation for five; these are no longer new technologies.  If I phone up a vendor and ask what’s the correct way to install their application on Terminal Services or App-V, I don’t want to hear ‘sorry, that isn’t supported’.

In the past, I’ve had an application vendor hand me a ten-sheet document with installation instructions for their app on TS; it went like this:
  • Create user X
  • Assign Y and Z rights to user X
  • Install weird application service
  • Add user X to application service
  • Find reg key HKLM\Software\Vendor\xxxxxxxxxxxxxxxxxx-xxx-xxxxxxxxxxxxx and create DWORD value zzz (IMPORTANT! see note)
  • Once all these steps are finished, run the application and click the buttons m through p
  • Once done, install the plug-in as normal.

Note: if you cannot find the reg key, DO NOT install the weird application service; create the ODBC connection as shown on page 9.

etc.

In my opinion the customer should have refused to accept this and asked the vendor to finish the application.
The reason that I want vendors to provide MSIs is that they have several advantages over other methods of installation:

  • Database driven instead of script driven
  • The application is installed in an administrative context
  • MSI provides a standard package format
  • Transactional install and rollback
  • Customisation via MST files
  • Many tools available
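To make the unattended-install and MST-customisation points concrete, here is a minimal sketch of driving msiexec from a script. The package, transform and log paths are placeholders, not a real product.

```python
# Minimal sketch: unattended MSI install with a customisation transform (MST).
# The paths and file names below are placeholders, not a real package.
import subprocess

result = subprocess.run(
    [
        "msiexec", "/i", r"C:\packages\app.msi",
        "TRANSFORMS=custom.mst",              # apply the customisation transform
        "/qn",                                # fully silent, no UI
        "/l*v", r"C:\logs\app_install.log",   # verbose install log
    ],
    check=False,
)
# 0 = success, 3010 = success but a reboot is required
print("msiexec exit code:", result.returncode)
```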

The tools part is starting to get really interesting: App-DNA have released their AppTitude product, which will test whether your app is suitable for Citrix, App-V, Windows 7, x64 and more.  If you have an MSI, it only needs to look at the MSI tables; you don’t even have to install the application to get the report.

Acresso, the folks who make AdminStudio, have developed a new feature which allows direct conversion from an MSI to an App-V, Citrix Streaming or VMware ThinApp package.

Both the above technologies can drastically reduce the time taken to implement new application delivery methods.  To best take advantage of both tools you need applications provided in an MSI format.

The main reason I have found applications not being delivered in the correct format is that organisations have not realised it is vital for the IT department to be involved in the decision-making process when purchasing new applications; at the very least, IT needs to set the minimum standards required:

  • The application should be provided in an MSI format
  • The vendor must support multi-user OS deployment
  • The vendor must support application virtualisation/streaming

If you are an application vendor and it’s ‘too much effort’ to support the above minimum standards, I would suggest you are cutting yourself off from a large and growing sector of the market.

If you develop applications in-house or are purchasing a bespoke product, there is no reason why standards should slip: apply the same set of rules to these as you would to an off-the-shelf product.  A bit more development time is going to save you a whole lot of heartache in the future.

Why is VDI changing into Terminal Server?

Posted by Jim Moyle on May 21st, 2009

It is, and I’m about to try and prove it to you.  Not only is VDI changing into Terminal Server, it’s happening through a series of entirely logical and yet very stupid choices.

To work this out we need to start from first principles, way back in 2005-ish.  We had many expensively maintained fat desktops, spare CPU cycles in the data center and a virtualisation layer.  This meant that we could take the fat desktops not already covered by terminal server (terminal server only accounted for around 20% of desktops) and move them into the data center.  These new desktops would allow our users to install apps and personalise their OS, while IT could keep the environment stable.  People were saying things like ‘I can give my users local admin privileges!’.

That was the dream and it all sounded pretty good.  Then people realised that they would have to exchange cheap storage on the endpoint for expensive storage in the data center.  Also it just seemed, well, silly, to have 5,000 copies of explorer.exe sitting on the SAN.  The advantages of data de-dupe were talked about, but the model everyone settled on was a golden OS image: Citrix had Provisioning Server and VMware had linked clones.  Not only did this solve the high SAN demands, it enabled us to update/patch only one golden image, and it worked for everyone! Double win!

So now we have thousands of users on one golden image; trouble is, we need different application sets.  No problem, said the industry, we have application virtualisation, and it’s even a fairly mature technology: ThinApp, Citrix Streaming, App-V and all the rest.  Except not all applications are suitable for streaming: some have licence requirements that rely on MAC addresses, some install drivers or services, etc.

In any large organisation maybe 2% of applications fall into this category; they are generally more than 10 years old, but can’t be dumped.  Out of, say, six hundred apps that’s only twelve that need to be in the golden image, so we increase the number of golden images to twelve and stream the rest of the applications.

So far so good, although with this golden image model we have hit a snag: to allow users to install applications, we need to use block-level deltas to save the personal information.  Over time these block-level deltas can grow to the size of the original installation, ruining our nice SAN space-saving ideas!  Not only that, when you update the base image you can’t reconcile the deltas; you have to throw them away.  That’s no good: you can’t give users a facility and then randomly remove their changes.  OK, let’s lock down the OS; we can use a profile solution to save user personalisation using the file system (although obviously no user installed apps).  For a great explanation of block vs file, see Brian Madden’s post “Atlantis Computing hopes to solve the ‘file-based’ versus ‘block-based’ VDI disk image challenge”.

Lots of vendors already in the Terminal Server space immediately said ‘We have a profile solution!’ and AppSense, RES, RTO, triCerat etc. put out VDI profile solutions.

All of this worked great in the POCs and pilots; trouble is, when it scaled up to thousands of users we found that the power users, who were moving gigs of VMDKs around or working with large media files, meant we had to have REALLY expensive Tier 1 SAN storage.  It became uneconomical to move those users to VDI, so we left them on their fat desktops.

So where does that leave us on our big VDI project?

  • Multiple users on an OS image
  • Application silos
  • Locked down desktops
  • Profile solutions from Appsense, RTO, RES etc.
  • Users limited to Task and knowledge workers
  • Oh yeah, print solutions from Citrix and ThinPrint.
  • Desktops accessed via RDP or ICA

I mean, what does that sound like to you?  To me it sounds EXACTLY like Terminal Server.  What we have done is take a VDI dream and apply terminal server thinking to it; unsurprisingly, it now looks just like terminal server, but with extra licensing costs.

We need to apply some brand new thinking.  There are vendors out there trying to do this, like the aforementioned Atlantis, but before VDI really takes off we need to rethink a lot of things, or Gartner’s prediction of VDI becoming a $65 billion business covering 40% of the world’s professional desktops seems a long way off.

