How well is your mainframe outsourcer managing capacity and performance? – Part 6 – The tradeoff between cost and performance
Steven Thomas
Principal ITBI Evangelist at SMT Data - helping clients optimize IT capacity costs and understand how they relate to business activities
This is the sixth in a series of blogs about how to follow up on your mainframe outsourcer. Previously I’ve discussed why it is important to follow up, understanding MIPS and MSU, the pitfalls of MIPS, and the outsourcer’s costs and pricing models. In this article I will focus on the tradeoffs between cost and performance. This includes looking at reserved capacity, capping and Workload Manager (WLM). Reserved capacity helps ensure you have what you need when you need it. Capping helps prevent your costs from running away. Both can have a serious impact on performance, especially if WLM is not set up in a way that reflects your business priorities.
Why do I care?
In the previous blogs I’ve focused on understanding how CPU utilization is computed and billed. But for many outsourced customers, stable operations, good online response time, and reliable batch turnaround are more important than cost savings. No one is going to say thanks for saving money if you haven’t delivered the expected services with the expected quality to your users.
Quality of service, especially performance, is deeply entwined with costs. Some customers believe that having a good Service Level Agreement is enough and they leave it to the outsourcer to worry about the details of reserved capacity, capping and WLM.
But what happens when resources are constrained? What should happen is that WLM policies ensure that the high priority workloads get the resources they need, and the low priority stuff gets delayed and costs stay under control. What can happen though is that the outsourcer simply makes more capacity available. They live up to their SLA and make more money by selling more capacity. The net result is that the customer pays for extra capacity to do low priority work that could just as well have waited.
Reserved Capacity
The objective of virtualization, whether it is a virtual Windows server running under VMware or an LPAR running z/OS on a mainframe, is to share a common pool of resources. This usually means over-allocating the resources, because the chance of all the virtual servers or LPARs needing their entire allocation at the same time is small. Each LPAR is configured with a certain number of logical processors, but the total number of logical processors allocated will normally exceed the number of physical processors by a factor of 2-4 or more. This is referred to as the logical to physical ratio.
So, what happens when everyone wants to use their logical processors at the same time and there aren’t enough physical processors available? This is where Reserved Capacity comes in. Each LPAR is set up with a ‘Weight’ which determines its reserved share of the total available capacity. If the Weights of all the LPARs on the machine add up to 1000 units and your LPAR’s Weight is set to 50 units, then the LPAR has Reserved Capacity corresponding to 5% of the physical capacity. It is important to understand what Weight your outsourcer has given your LPARs because that tells you what you are ‘guaranteed’, which could be a lot less than you normally use during peak periods. Some customers regularly go over their Weight during critical periods. That works fine as long as the other customers on the system are not using up to their Weight at the same time. It does mean, however, that your performance and turnaround time can be suddenly impacted by the behavior of the other customers on the system.
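The Weight arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names, the 10,000 MIPS machine size and the 1,200 MIPS peak are hypothetical and chosen to make the numbers easy to follow. Only the Weight of 50 out of 1000 units comes from the example in the text.

```python
def reserved_share(lpar_weight: float, total_weight: float) -> float:
    """Return the fraction of physical capacity reserved for one LPAR."""
    if total_weight <= 0:
        raise ValueError("total_weight must be positive")
    return lpar_weight / total_weight


def reserved_mips(lpar_weight: float, total_weight: float,
                  physical_mips: float) -> float:
    """Return the reserved capacity of one LPAR, expressed in MIPS."""
    if total_weight <= 0:
        raise ValueError("total_weight must be positive")
    return physical_mips * lpar_weight / total_weight


# The example from the text: a Weight of 50 out of 1000 units.
share = reserved_share(50, 1000)

# Hypothetical figures for illustration only (assumptions, not from the article).
PHYSICAL_MIPS = 10_000     # total capacity of the shared machine
PEAK_USAGE_MIPS = 1_200    # what the LPAR typically uses at peak

guaranteed = reserved_mips(50, 1000, PHYSICAL_MIPS)
# Peak demand that depends on other LPARs leaving capacity unused.
unreserved_peak = max(0.0, PEAK_USAGE_MIPS - guaranteed)

print(f"Reserved share: {share:.0%}")
print(f"Guaranteed capacity: {guaranteed:.0f} MIPS")
print(f"Peak usage not covered by the Weight: {unreserved_peak:.0f} MIPS")
```

With these assumed numbers, more than half of the peak demand is not covered by the Weight, which is exactly the exposure to other customers' behavior described above.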
Capping
The Weight determines the reserved capacity for the LPAR, but if no one else is using the resources, then the LPAR can usually go above its Weight. But there are limits to how much capacity an LPAR can use. The obvious limit is the number of logical processors. If an LPAR is configured with two logical processors, each delivering a maximum of 1000 MIPS, then the LPAR can never go above 2000 MIPS.
But there can be other constraints on the LPAR, including various forms of capping. Capping is a big topic, so I will just cover some of the basics. Capping is generally used as a way of preventing you from using more than a certain capacity, and thereby keeping costs under control. Capping can be done on an LPAR, on a group of LPARs or on a specific workload within an LPAR. Capping can be ‘hard capping’ or ‘soft capping’. Hard capping, as the name implies, means that you can never go over the limit. Soft capping, on the other hand, is based on a 4-hour rolling average (4HRA). You can go over the limit for a period, but when the 4HRA crosses the defined limit, the system will start restricting CPU usage to bring you back in line.
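To make the 4HRA behavior concrete, here is a minimal Python sketch that computes a rolling 4-hour average from interval samples and finds the first interval in which it crosses a defined limit. Several things here are assumptions for illustration: the 5-minute sampling interval, the MSU figures, the function names, and the choice to average over the samples seen so far until the window has filled. It does not model how the system actually restricts CPU once the limit is crossed.

```python
from collections import deque
from typing import Iterable, Iterator, Optional


def rolling_4hra(samples: Iterable[float],
                 interval_minutes: int = 5) -> Iterator[float]:
    """Yield the rolling 4-hour average after each usage sample."""
    window_size = (4 * 60) // interval_minutes  # 48 samples at 5 minutes
    window: deque = deque(maxlen=window_size)
    for sample in samples:
        window.append(sample)  # the oldest sample drops out automatically
        yield sum(window) / len(window)


def first_interval_over_limit(samples: Iterable[float], limit_msu: float,
                              interval_minutes: int = 5) -> Optional[int]:
    """Return the index of the first sample where the 4HRA exceeds the limit."""
    for index, average in enumerate(rolling_4hra(samples, interval_minutes)):
        if average > limit_msu:
            return index
    return None


# Hypothetical load: four quiet hours at 300 MSU, then four busy hours at 600 MSU.
usage = [300.0] * 48 + [600.0] * 48
DEFINED_LIMIT_MSU = 400.0  # assumed soft-cap limit

index = first_interval_over_limit(usage, DEFINED_LIMIT_MSU)
if index is not None:
    busy_intervals = index - 48 + 1
    print(f"4HRA exceeds the limit at sample {index}, "
          f"after {busy_intervals * 5} minutes of heavier load")
```

In this assumed scenario the LPAR runs at 600 MSU, well above the 400 MSU limit, for 85 minutes before the rolling average crosses the limit. That is the sense in which soft capping lets you go over the limit for a period.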
It is important to discuss capping with your outsourcer. If you are paying based on peak utilization, then the outsourcer is probably happy to set up the system without any capping. Unconstrained systems will increase the outsourcer’s revenue and ensure that you get the best possible performance. That may not be in your best interest. By carefully applying capping, you can ensure that low priority workload gets delayed during peak hours, thereby reducing your costs. Note that the outsourcer may also be applying capping to your systems to control their own costs. This can have a negative impact on your performance and turnaround time.
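As a rough illustration of how these limits combine, the following Python sketch derives a floor and a ceiling for a single LPAR from its Weight, its logical processors and an optional cap. The function name and parameters are hypothetical, and expressing the cap in MIPS is a simplification made here so that all figures share one unit. The two logical processors at 1000 MIPS each come from the example above; the machine size and cap value are assumed.

```python
from typing import Optional, Tuple


def capacity_bounds(weight: float, total_weight: float, physical_mips: float,
                    logical_cpus: int, mips_per_cpu: float,
                    cap_mips: Optional[float] = None) -> Tuple[float, float]:
    """Return (floor, ceiling) in MIPS for one LPAR.

    floor   = capacity reserved by the Weight
    ceiling = logical processor limit, lowered by a cap if one is set
    """
    ceiling = float(logical_cpus * mips_per_cpu)
    if cap_mips is not None:
        ceiling = min(ceiling, float(cap_mips))
    # The reserved capacity cannot usefully exceed what the LPAR may consume.
    floor = min(physical_mips * weight / total_weight, ceiling)
    return floor, ceiling


# Two logical processors at 1000 MIPS each, as in the text.
# The 10,000 MIPS machine and the 1,500 MIPS cap are assumptions.
uncapped = capacity_bounds(50, 1000, 10_000, logical_cpus=2, mips_per_cpu=1000)
capped = capacity_bounds(50, 1000, 10_000, logical_cpus=2, mips_per_cpu=1000,
                         cap_mips=1500)

print(f"Uncapped: floor {uncapped[0]:.0f} MIPS, ceiling {uncapped[1]:.0f} MIPS")
print(f"Capped:   floor {capped[0]:.0f} MIPS, ceiling {capped[1]:.0f} MIPS")
```

The gap between floor and ceiling is the range in which your actual capacity depends on what the other LPARs on the machine are doing.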
Workload Manager (WLM)
Weight determines the minimum Reserved Capacity of your LPARs, and capping determines the maximum. If there is no capping, then the maximum capacity is determined by the number of logical CPUs. So, what happens when you don’t have enough capacity, either because the overall system is heavily loaded and your LPARs are constrained to their Reserved Capacity, or because they hit a capping or configuration limit? This is where Workload Manager (WLM) comes in. WLM determines which of your workloads get access to the CPU and which must wait when there isn’t enough CPU capacity to go around.
WLM determines who gets access to the CPU and who gets delayed based on policies. The policies set out goals for each workload (e.g. a response time goal for CICS transactions) and an importance (which workloads get preference when they can’t all meet their goals). Importance 1 gets priority over importance 2, and so on. Online systems, for example, are normally given a higher importance (a lower importance number) than batch. This means that if the LPAR runs low on CPU power, batch will get delayed before the online systems are impacted.
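The effect of importance under constraint can be shown with a deliberately simplified Python sketch. This is a toy model and not how WLM is implemented: real WLM works with goals and continuously adjusts access to resources, whereas this sketch simply hands out a fixed amount of capacity in strict importance order. The workload names, importance levels and MIPS figures are all hypothetical.

```python
from typing import Dict, List, Tuple

# (name, importance, demand in MIPS); a lower importance number is served first.
Workload = Tuple[str, int, float]


def allocate_by_importance(workloads: List[Workload],
                           available_mips: float) -> Dict[str, Dict[str, float]]:
    """Grant capacity in strict importance order (toy model, not real WLM)."""
    remaining = available_mips
    result: Dict[str, Dict[str, float]] = {}
    for name, _importance, demand in sorted(workloads, key=lambda w: w[1]):
        granted = min(demand, remaining)
        result[name] = {"granted": granted, "delayed": demand - granted}
        remaining -= granted
    return result


# Hypothetical workloads competing for 1500 MIPS of constrained capacity.
workloads = [
    ("batch", 3, 600.0),
    ("online", 1, 1200.0),
    ("test", 5, 400.0),
]
allocation = allocate_by_importance(workloads, available_mips=1500.0)

for name, figures in allocation.items():
    print(f"{name:<7} granted {figures['granted']:>6.0f} MIPS, "
          f"delayed {figures['delayed']:>6.0f} MIPS")
```

In this assumed scenario the online workload gets everything it asks for, batch gets half of its demand, and the test workload waits. If the importance levels did not match the business priorities, the same shortage would hit the wrong workload.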
But note that WLM only comes into play when resources are constrained. Many outsourcers avoid worrying about WLM by simply ensuring that the customer never runs into resource limits. This ensures maximum revenue for the outsourcer and maximum performance (and cost) for the customer.
You should review the WLM setup with your outsourcer on an ongoing basis. WLM workloads and policies must reflect your business priorities, and the outsourcer cannot know those without input from you. We have seen many cases where the outsourcer has left the WLM policies untouched for years.
What data and reporting do I need?
It is important to have access to data and reporting that allows you to independently monitor the LPAR configuration, capping, WLM policies, and the effect that these have on your performance, response times and batch turnaround times. The outsourcer may provide this information in fixed reports, but ideally you should have access to the underlying data (SMF record types 70 and 72, written by RMF, as a minimum) and your own reporting tools, or access to a self-service reporting environment provided by the outsourcer.
Summary
Cost and performance are closely related. The cost models discussed in my previous blogs have a big impact on how you look at things like reserved capacity, capping and WLM. If you are paying for peak usage, you probably have an interest in some level of capping to keep costs under control. In this case you need to ensure that the WLM policies accurately reflect your business priorities, so the system behaves well when you hit the cap. If you are not paying for peak usage, but paying for total usage during the month, then the opposite may be the case. You may want as little capping as possible. In either case an appropriate reserved capacity, corresponding to your expected peak load, is important.
If you don’t have in-house skills to discuss these factors with your outsourcer, then find an independent advisor to assist you. At SMT Data we specialize in helping outsourced customers with tools and consulting so they can get transparency and get in control.
by Steven Thomas, Chief Technology Officer, SMT Data