Become a Citrix Hero - Workload Placement and Sizing


In 2018 I decided to write a guide with my best tips for Citrix: the highest impact with the lowest effort or experience level required.

You can get the eBook for free here... but that's not all. You'll get answers to your questions and an opportunity to get a new Citrix Hero tip every month!

First, I thought I'd give you the FULL TEXT of chapter 3! Here you go:

Okay great, you’ve tuned the OS of the VMs and your access layer. But for around 60% of the companies I assessed last year, another big problem was robbing them of value: not paying attention to the special requirements and tunings for Citrix Virtual Apps within their physical server (hypervisor) setups. While this can often become an almost religious debate, I stand behind these recommendations, having seen improvements in dozens of environments since 2010, when I started drawing a hard line on doing things this way. You can debate with whitepapers all day, but the bottom line is that I’ve seen this work even when it seems to make no sense. Sometimes you just have to swallow your pride and trust, and that’s what I’m asking those of you with doubts to do now. Trust that I, along with over 100 other experts, have this right: hypervisor tuning matters. You might have a fight ahead if you don’t run the physical servers yourself.

But like any hero, a #CitrixHero sometimes has to fight for what is right.

Workload Placement

What I’m seeing emerge once again these days is an old practice which I thought for sure we had done away with: putting all the VMs in a single cluster and letting the hypervisor sort them out. This presents a… host… of problems (the puns will continue until Leading Practices are followed, folks):

1) You are no longer able to predict how much adding new users will cost in terms of hardware.

2) You cannot accurately predict how many Server VDA VMs you will need.

3) You cannot predictably assure performance from one user to the next, because other workloads co-habit the same host. For example, when a SQL server goes into freakout mode on the same CPU as your Server VDAs, you’ll have users complaining even though CPU metrics show no sign of trouble.

4) Different workloads can tolerate different overcommit ratios. With a mixed-workload approach, your user workloads may end up on hosts that are actually overcommitted.

I am sure that even as people read the statement above, tension started to show up. Doubts. But the reality is that user-based workloads behave very differently than the average hypervisor administrator realizes. An RDSH server (XenApp) with 30 users is going to behave very differently than a SQL or Exchange server, so placing them on the same host means you are essentially making the hypervisor pick favorites. But placing the same kind of workloads together on a physical host has a kind of magic, reducing conflicts and, more importantly, making the scale predictable. We’ll talk more about how many VMs and their configuration later; that is a very important aspect of avoiding oversubscription. But to do it properly, we first need to make sure that our Resources will always have the right backing and no conflicts. This can’t be done without isolation.

The solution to this first problem is to isolate your user workloads from the backend servers completely. This is typically done by dedicating Resource clusters to Server VDAs and Desktop VDAs, with all Control components located in the main infrastructure cluster. This often has the benefit of allowing you to use different licensing, a different version, or even a different type of hypervisor for your Resources, which typically do not need things like HA and backups. If your team has the skills to support multiple hypervisor types but you primarily use VMware, you may even want to consider a higher-performing but simplified hypervisor for the Resource hosts, such as XenServer (Citrix Hypervisor) or Nutanix Acropolis. Even Hyper-V. However, I do NOT recommend learning a new hypervisor just for this purpose. If your team only knows one hypervisor well, stay the course and just simplify the Resource cluster.

If you don’t have a lot of hosts available, or are concerned about keeping N+1 for two clusters rather than one, the second method is to use Host DRS (or its equivalent). This places workloads onto their preferred hosts when they boot, so in most cases you would have two Host DRS groups: Control and Resource. In the event of a failure, VMs can still load onto the other group’s hosts temporarily, but only when their own group’s capacity is exceeded. I’m seeing this option a lot in smaller and mid-sized environments, but I never recommend it if you have more than 6 hosts; at that point, you’re usually better off isolating workloads completely. You should also keep in mind that this method may require some ‘babysitting’ during maintenance to be sure workloads end up back on their intended hosts.


Workload Sizing

Now to sizing. This is the other area where I’m seeing massive amounts of fail lately. Someone reads a whitepaper that said to configure their XenApp servers with 4 vCPU and 8 GB RAM… and wonders why they can only get 15 users before things start slowing down. The reality here is that you need to default to scaling UP the users per VM until you reach a performance threshold, and only then scale OUT with more servers. The way to do this is to increase the resources per VM: namely CPU, memory, and storage. So how do you know how much you can use? This is an ‘it depends’ answer if there ever was one. But once again, this is why we isolate workloads: we need to know how many CPU cores and how much RAM we have comfortably available (N+1 or N+2 is typically the threshold here).

So let’s say we have hosts with two 14-core CPUs and 256 GB RAM. To be safe, we say that about 200 GB RAM is our safely available amount, to give the hypervisor some room and some ‘just in case’. We also ALWAYS want our user workloads, be it Desktop or Server OS, to fully reserve their RAM in the hypervisor. This prevents the use of a paging disk (which VMware requires if you don’t use this option, sucking down storage space) and is… you guessed it, another reason to isolate the workloads. We’ll use Server OS for our workload examples.

Some feasible options in terms of memory (just examples- you can use any memory value you want in Windows these days) are:

  • 6 VMs at 32 GB
  • 4 VMs at 48 GB
  • 12 VMs at 16 GB
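Those options are easy to sanity-check against the budget above. Here’s a quick Python sketch; the 200 GB usable figure is the assumption stated earlier:

```python
# Sanity-check the memory layouts against the usable-RAM budget from above.
# Assumption: 256 GB physical, with ~200 GB treated as safely usable.
USABLE_RAM_GB = 200

options = [(6, 32), (4, 48), (12, 16)]  # (vm_count, ram_per_vm_gb)

for vm_count, ram_gb in options:
    total = vm_count * ram_gb
    status = "OK" if total <= USABLE_RAM_GB else "over budget"
    print(f"{vm_count} VMs x {ram_gb} GB = {total} GB -> {status}")
```

All three layouts land at 192 GB, comfortably inside the 200 GB budget, which is exactly why they are the feasible choices.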

Next we turn to CPU. Here’s where it gets interesting. An often-overlooked bit of math that you must do to be successful with Citrix is division. I won’t go into the full explanation of NUMA vs UMA, but in a nutshell you want each workload to stay within a single physical CPU (NUMA node) as much as possible. This prevents the need to cross memory bus lanes, along with a host of other implications that can slow down the hypervisor itself.

We do this by setting our VMs to a vCPU count that matches the CPU’s NUMA values. You should always confirm these, but they are typically whole-number divisors of the physical (not hyperthreaded) core count. I will tell you that I chose the 14-core processor for a very good reason: it only has 4 valid vCPU NUMA values: 1, 2, 7, and 14. While you can use fewer, research has shown that you are better off with fewer VMs that have more vCPUs than with more VMs that have fewer vCPUs, because of the way the hypervisor is essentially forced to arrange workloads on the CPUs.
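If you want to enumerate those values for a given socket, a tiny (hypothetical) helper does it; always confirm against your vendor’s actual NUMA topology rather than trusting the arithmetic alone:

```python
# Hypothetical helper: NUMA-friendly vCPU counts are the whole-number
# divisors of a socket's physical core count. Confirm the real topology
# with your hardware vendor before relying on this.
def numa_friendly_vcpu_counts(cores_per_socket: int) -> list[int]:
    return [n for n in range(1, cores_per_socket + 1)
            if cores_per_socket % n == 0]

print(numa_friendly_vcpu_counts(14))  # -> [1, 2, 7, 14]
```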

So in our case, we know that we are best served by configuring our VMs with 7 vCPUs. Because this will be a much larger VM, we are looking at either 32 or 48 GB RAM. To determine the right sizing, we really need to know our tolerable CPU overcommit ratio. In most cases I have observed, 1.2:1 is about right for Server OS, whereas Desktop OS can often scale to 5:1, or up to 12:1 in some cases I have seen. Again: ‘it depends’. For us, we know that 1.2 x 28 (the total physical CPU cores in the machine) is 33.6. Dividing by 7 gives us 4.8, which we safely round DOWN to 4.

So: your blade should host FOUR VMs with 7 vCPU and 48 GB RAM each.
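Putting the whole walk-through into numbers, here’s a sketch of the math above. The 1.2:1 overcommit ratio and 200 GB of usable RAM are the assumptions stated earlier, not universal constants:

```python
import math

# Sketch of the sizing walk-through above. The 1.2:1 Server OS overcommit
# ratio and 200 GB of usable RAM are the assumptions stated in the text.
physical_cores = 2 * 14   # two 14-core sockets = 28 pCores
overcommit = 1.2
vcpus_per_vm = 7          # NUMA-friendly value chosen above

total_vcpus = physical_cores * overcommit           # ~33.6
vm_count = math.floor(total_vcpus / vcpus_per_vm)   # ~4.8, rounded DOWN -> 4

usable_ram_gb = 200
ram_ceiling_per_vm = usable_ram_gb // vm_count      # 50 GB ceiling -> pick 48 GB

print(vm_count, ram_ceiling_per_vm)  # -> 4 50
```

The 50 GB figure is a ceiling; 48 GB is the nearest feasible option from the memory list above, which is why four 7-vCPU, 48 GB VMs is the answer.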


BUT… how many users per VM?

This is another common Citrix mistake! We shouldn’t really care about users per VM anywhere near as much as we need to determine how many users per PHYSICAL BLADE/HOST.

Think of it this way. If I were to put a pair of virtual machines on your laptop, how many users do you think could safely use it before it became, well, unusable? Whether those VMs are intended for many users or few, the physical hardware can only do so much. The hypervisor doesn’t magically fix this. Any host has limits that must be anticipated and respected. So now that you’ve isolated your workloads: good news! You can figure out those numbers easily by simple division: the number of users per host divided by the number of VMs. Now, keep in mind that mileage may vary here based on the applications and operating system… however, there is one mistake that you should NOT make:

If Per-VM performance is bad because of the amount of users, you need to adjust the number of HOSTS, not the number of VMs. Adding VMs to a physical host will degrade performance for all users because the physical limitations have not changed.

The Rule of 5 and 10

Here’s the thing. Unless you have the tools, time and patience to figure out EXACTLY how many users to have per blade- you need a simple rule to start with, then further optimize from there.

“More what you’d call guidelines than actual rules…”
-Hector Barbossa (Pirates of the Caribbean)

Thanks here go to Citrix Consulting and Nick Rintalan for figuring out the math on this one. As a matter of fact, he and I worked together for a while on a very overly complicated spreadsheet, so I was really pleased when he noticed the way the numbers always seemed to work out. I could go into the underlying arithmetic here, but let’s keep it simple. You can determine how many users can be on a host based on the physical CPU cores. Not the threads (virtual), but physical.

  • Desktop OS workloads: 5 users per pCPU core
  • Server OS workloads: 10 users per pCPU core

So, again using our example blades with two 14-core processors (28 physical cores), we can expect 280 Server OS users or 140 VDI users. Note: this is active users, not VMs. With our four VMs, it means we should expect a maximum of 70 users per VM. While this is possible, it may not always be practical. Load testing is important to determine the exact number.
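The Rule of 5 and 10 is simple enough to express directly. This sketch just encodes the guideline above; it is a starting point, not a substitute for load testing:

```python
# The Rule of 5 and 10, as described above: users scale with PHYSICAL cores,
# not hyperthreads (the rule assumes hyperthreading is enabled).
PER_CORE = {"desktop": 5, "server": 10}

def users_per_host(physical_cores: int, workload: str) -> int:
    return physical_cores * PER_CORE[workload]

cores = 28  # two 14-core sockets
print(users_per_host(cores, "server"))       # -> 280
print(users_per_host(cores, "desktop"))      # -> 140
print(users_per_host(cores, "server") // 4)  # 4 VMs -> 70 users per VM
```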

Now that we know the hardware can handle it, we end up in the domain of the OS itself. Typically, I have seen a properly sized and tuned Server 2012 R2 VM easily handle 120 users running fairly light apps, but when using a published desktop, the numbers dropped to about 50 users per VM. Sometimes you may need to add an additional VM; just keep to the NUMA values, try not to exceed the overcommit ratio, and you’ll usually be fine. Test, Test, Test!

So if we have 3000 users, we know we need 11 host servers (+1 for redundancy = 12). And, as an added #CitrixHero moment, when the VP of IT asks you how much it will cost to add another 500 users… you’ll be able to give them a REAL answer! Scalability is fun!
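The host-count arithmetic can be sketched the same way; the 280 users per host comes from the Rule of 10 above:

```python
import math

# Host-count estimate from the example above: 3000 users at 280 Server OS
# users per host (Rule of 10 on 28 cores), plus one host for redundancy.
users = 3000
users_per_host = 280

hosts_needed = math.ceil(users / users_per_host)  # 10.7... -> 11
with_redundancy = hosts_needed + 1                # N+1 -> 12

print(hosts_needed, with_redundancy)  # -> 11 12
```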

The Citrix article is https://www.citrix.com/blogs/2017/03/20/citrix-scalability-the-rule-of-5-and-10/

So the question you’re asking yourself is does this work in the real world? Yes. Yes it does.


BONUS Recommendation: Hardware Virtualization Settings

I was going to save this for another chapter, just because it has the kind of impact that can’t be denied- but it really is quite simple. So, you get it here, right now.

Several times this year I was surprised to find that the underlying BIOS settings for physical hosts were not set correctly. Depending on the setting, this can be a MASSIVE problem, so while it isn’t common… I thought I’d include this because of its additional impact.

If you find your VMs are slow and running at 100% CPU much of the time, check your BIOS settings. Actually, scratch that: just go check your BIOS settings anyway. You may find that C-States are enabled. This seems like such a great idea… until you put a hypervisor on top of it. Once again… servers ship for compatibility. You have to tune them once you get them! (Sensing a theme here?)


Common Mistakes (not just Citrix, but any Hypervisor):

  • C-States enabled (this allows the CPU to throttle and changes the CPU percentage calculation)
  • Virtualization not enabled (breaks the hypervisor’s ability to function as, well, a hypervisor)
  • Hyperthreading not enabled (yes, the rule of 5 and 10 assumed HT is on)
  • Power settings not set properly (typically should be the “Power” or “no power management” setting)
  • Also, very much related to the NUMA settings above: make sure to check for the correct QPI ‘snoop’ modes, and don’t always trust the ‘auto’ setting to give you Cluster-on-Die. Failing to do this, for example on a 14-core processor, can be problematic (this processor tends to array the silicon in a 6+8 configuration and uses COD to present the proper NUMA values).

Power Management Matters

An article that is definitely worth reviewing is from Jasper Geelen (LoginVSI):

https://www.loginvsi.com/blog/834-influence-of-power-management-on-vdi-performance

In a nutshell, Jasper notes that power management settings differ per vendor, as do the names used, but almost all vendors offer options to either let the OS handle power or apply various other policies. While you can let your hypervisor control these settings, you should know that its control isn’t typically dynamic. My recommendation is to simply lock the BIOS into its highest power state for any VDI or RDSH workloads. In some cases, this simply means turning power management off completely; it depends on your vendor.

“Faulty power management is the most common but easiest to fix VDI mistake. Configuring this properly can save your users a lot of energy (and) user experience will increase”
-Mark Plettenberg (LoginVSI and fellow Citrix CTA)


LoginVSI found that this can make a performance difference of up to 64%.

Go ahead and read that again, I’ll wait.

This means that in a great many cases, more than half of the performance the machines should have simply isn’t there. And this holds up in the real world; in my case, even *better* than the synthetic testing showed.

My favorite case of this last year was one of those times when an assessment paid for itself three times over, simply because we caught this one thing that an engineer had assumed was correct. The client was able to cancel an order for over $45,000 because, once the power management settings were corrected, their problems with VMs reporting 100% CPU went away. They were immediately able to more than double the number of users on each blade and still had better performance than before. Combined with other tuning suggestions, they estimate they will be able to go another three years without additional purchases. It seems simple enough, but one miss or assumption can literally cost that much.

So, save yourself the $20,000 consulting bill and double-check these settings.

Or- you know, don’t… and call me. I’ll take the money.

What's Next?

There are a few more notes and links in the eBook. Remember: you can download the eBook for free here. But that's not all. Very soon I'll be sending a new tip every month, with even more detail and how to implement the needed changes. If all goes well, I'll even have a way you can get access to me to ask your questions, as well as video training!

Wish me luck!

I hope you've enjoyed this tip. PLEASE SHARE IT!

Live Q&A

Now- it's very possible that you've got questions about this chapter. I've got a live Q&A session coming up very soon. So here's what you do: message me with your questions or fill out this form if you prefer!

I'll let you know when I'm going to be doing a webinar with a few surprise guests... but first I need your questions! Don't be shy! Speak up; I'm here to help!

cheers!


DJ Eshelman

CTXPro.com

CitrixCoach.com
