Azure OpenAI Deployment Options and Availability
I want to dive into the various deployment types we have with Azure OpenAI, understand what that means for resiliency and availability, and discuss what you should consider as part of your overall application architecture that leverages Azure OpenAI.
This is an article version of my video on the same topic, available at https://youtu.be/HnUNi1RMMTA. It was written with help from generative AI based on the video transcript, with a little human love applied afterwards.
Generative AI Models Don't Maintain State
One of the most important things to remember upfront is that when you're using generative AI and talking to an Azure OpenAI service, there is no ongoing state. Every interaction with the model, such as GPT-4, is essentially a fresh one; the model doesn't remember anything between calls. When you send a RESTful API request, your prompt likely comprises a system prompt and a user prompt, and what you get back is the inference: the response generated by the model.
We're used to having conversations with these models, but that conversation isn't maintained on the model. When you send a second request, you send the full context again: the system prompt, your original user prompt, the previous assistant response, and your new prompt. Every request starts from scratch.
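To make the stateless behaviour concrete, here's a minimal Python sketch using the openai package's AzureOpenAI client. The endpoint, key, API version, and deployment name are placeholders; the point is simply that the second call resends the entire conversation, including the assistant's previous reply, because nothing is stored server-side between calls.

```python
from openai import AzureOpenAI

# Placeholder values: substitute your own resource endpoint, key,
# API version, and model deployment name.
client = AzureOpenAI(
    azure_endpoint="https://my-aoai-resource.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-06-01",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name one benefit of Azure API Management."},
]

# First request: the model sees only what is in `messages`.
# For Azure OpenAI, `model` is the name of your deployment.
first = client.chat.completions.create(model="gpt-4o", messages=messages)
print(first.choices[0].message.content)

# To "continue" the conversation we append the assistant's reply plus the
# new user turn, then send the whole history again.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Give me one more."})

second = client.chat.completions.create(model="gpt-4o", messages=messages)
print(second.choices[0].message.content)
```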
In the cloud, with large deployments, each separate request—even if it seems like part of an ongoing conversation—could be going to a completely different instance of the model. There's zero state to maintain on the server side.
If you want to see this in action, I did a video a few months ago about AI development for non-developers, where we walked through creating that memory, so you can see exactly how it works. You can see that at https://youtu.be/OHQFObW6PXA.
Azure OpenAI Resources
When you create an Azure OpenAI resource, you deploy it to a specific region, like West US 3. This Azure OpenAI resource lives within that region and has the endpoint your application uses to send API calls for inference. No matter what you do, the endpoint and the Azure OpenAI resource are bound to a specific region. Nothing else can change that fundamental fact.
Within the region, the endpoint and the resource aren't where the model runs. Generative AI models are greedy for GPUs. Training requires massive clusters with thousands of GPUs over long periods, but inferencing requires a smaller set of infrastructure. Depending on the model, it may still require many GPUs.
In the region, you'll have many capacity pools running specific models and versions, like GPT-4. These pools consist of virtual machines with GPUs. Depending on the model size, it may require tens of GPUs, so multiple virtual machines are needed. Some newer models can run on a single GPU, allowing multiple instances on a single VM.
Capacity pools run multiple instances of the model, with the number of VMs varying based on the model. They autoscale, adding and removing instances based on incoming requests and workload. There's also a control plane element for health checks, scaling, and repairs.
There's no ongoing state, so if an instance fails, the request is sent to another without issue. Every region offering Azure OpenAI services has capacity pools for each model and version they support.
There is also Responsible AI infrastructure. When you send a prompt, it goes through responsible AI checks for malicious activity or jailbreak attempts. These checks occur in every region offering Azure OpenAI services.
When deploying a model, you have options like "Standard" and "Global Standard." These options impact where your inferencing request can be sent. If you select "Standard," the request goes to the capacity pool in the resource's region. If you select "Global Standard," it can go to any capacity pool hosting that model version worldwide.
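As a rough illustration of where that choice shows up, the deployment type is just a SKU name on the model deployment. The sketch below uses the azure-mgmt-cognitiveservices Python SDK under the assumption that "Standard", "GlobalStandard", and the data zone SKU names follow the current documentation; resource names, model version, and capacity are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment, DeploymentModel, DeploymentProperties, Sku,
)

# Placeholder identifiers.
subscription_id = "<subscription-id>"
resource_group = "rg-ai"
account_name = "my-aoai-resource"   # the regional Azure OpenAI resource

client = CognitiveServicesManagementClient(DefaultAzureCredential(), subscription_id)

# "GlobalStandard" lets the platform route inference to any capacity pool
# worldwide; "Standard" keeps it in the resource's own region. Confirm the
# exact SKU strings (including the data zone variants) against the docs.
deployment = Deployment(
    sku=Sku(name="GlobalStandard", capacity=100),   # capacity in thousands of tokens/min (assumption)
    properties=DeploymentProperties(
        model=DeploymentModel(format="OpenAI", name="gpt-4o", version="2024-08-06"),
    ),
)

client.deployments.begin_create_or_update(
    resource_group, account_name, "gpt-4o-global", deployment
).result()
```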
Latency and Generative AI
This is crucial because traditional application architecture cares about network latency. With generative AI, the network time to send the API request is minuscule compared to the inferencing time. If the network time is ten milliseconds, the inferencing takes much longer, making the network time negligible in the overall operation.
I care way, way less about the network latency. We have to think a little differently when considering these options. Historically, I might think, "Well, no, I can't use Global because if I do, my request may now have 100 milliseconds of network latency compared to twenty milliseconds." It really is not going to make a big difference when considering generative AI interactions.
Realize what's actually going to happen here, and this is an important point. You have your resource, and between your resource and all those instances is some super-intelligent routing. This routing understands the current utilization of all the different capacity pools—identifying which ones are busy and which have a ton of spare utilization available. Based on your model deployment, when your request comes in, the routing can determine the potential capacity pools to send the request to.
If it's standard regional, it can only send the request to the local capacity pool. The control plane in that capacity pool looks at the individual instances and decides which one is the least busy. However, that's a very small potential pool of places to send your request. If it's global, this intelligent routing might say, "Hey, look, this capacity pool on the other side of the country is doing almost nothing. I'll send it there." The control plane within that pool could send it to the least-used instance, resulting in a really small latency for the actual inferencing because it's not doing much work.
While I might be adding a little bit—thirty milliseconds—of extra network latency, the actual inferencing latency will be way smaller because it's doing way less work. You have to think about it differently. You can't just think about the physical location on Earth because the time it takes for packets to travel over fiber is a very small element of the overall operation time. Instead, I want to think about the wider potential pool where my request can run. The better the chance there is of finding a fairly idle model instance somewhere, the faster it will do the inferencing on my request.
Change the way we think about latency when doing generative AI. Network latency, in all but the smallest edge cases, is insignificant compared to the time of the inferencing. Long story short, I would much rather have a global deployment with a massive pool of potential places to send my request. There's a much higher chance of finding an instance that's pretty idle, processing my API request much quicker than some of the pools local to me.
Consider the different seasonalities around the world. When people are sleeping in one region, there are likely to be far less busy pools compared to the one local to where I'm running. It's all about the potential number of capacity pools I could use for my request, giving me a much better shot.
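As a purely illustrative back-of-the-envelope comparison (the numbers below are invented, not measurements), the extra network hop is dwarfed by the inferencing time, so landing on an idle pool matters far more than staying physically close:

```python
# Hypothetical, illustrative timings only.
scenarios = {
    "Standard (nearby but busy pool)": {"network_ms": 20, "inference_ms": 2400},
    "Global (farther away, mostly idle pool)": {"network_ms": 120, "inference_ms": 1500},
}

for name, t in scenarios.items():
    total = t["network_ms"] + t["inference_ms"]
    share = 100 * t["network_ms"] / total
    print(f"{name}: {total} ms total, network is only {share:.0f}% of that")
```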
Available Capacity and Throughput
Additionally, realize it's not just about latency in terms of time. When you have your Azure OpenAI resource, you have a quota: the potential tokens per minute you can use. But that's not a guarantee of capacity being available. As with most things in Azure, a quota is what you're allowed to create, but there still has to be underlying capacity to actually start or use that resource.
If I'm regional, that one region's capacity has to be available to deliver the throughput the quota allows, and it may not always be able to. Once again, if I pick global and have this massive number of potential pools to leverage, I have a much better shot at utilizing all of the quota, all of the tokens per minute, because there's a much bigger pool of potential places to send that request. The greater your regional flexibility, like everything in AI, the better chance you have of getting your requests fulfilled and having that capacity available.
In general, I'd always lean towards the global option, with its bigger pool available. Overall, my request will get better latency, taking less time to process the tokens because less busy instances are available, and I have a much better chance of hitting the number of tokens I want in terms of throughput because that capacity will be available.
Data Sovereignty and Data Zones
There's a slight wrinkle in what I just said. You might think, "Oh, just go and use Global everywhere." But maybe you have a data residency requirement: you can't have your inferencing request sent around the world, and it must stay in the United States or inside the European Union. For that, one of the things being rolled out right now is the idea of data zones.
A data zone is a set of regions: one set of regions (more than two) makes up the United States data zone, and another set makes up the European Union data zone. That gives a middle option between regional and global: the data zone.
If I pick the data zone, it gives intelligent routing the option. If I pick the US data zone, it can send my request to any capacity pool hosted inside a US region. If I pick the EU data zone, it can send it to any region that's part of the European Union. I now get a wider selection of potential capacity pools while staying within a certain data residency option.
I totally get it; maybe you're not part of the US or the EU, in which case this may not be useful to you at this time. I know Germany, for example, can have more stringent requirements around data sovereignty, and the EU wouldn't be good enough there. So, you're down to creating multiple instances in the German regions or, if you're in other countries like Australia, multiple Australian regions, and doing your own kind of balancing. We'll talk about some of that balancing a little later on.
These are the options: Standard, where requests can only go to the capacity pool in my region, with a finite number of instances and whatever utilization they happen to have; and Global, which can send requests anywhere, giving the biggest chance of capacity being available and the lowest possible latency, because intelligent routing looks for the least-used capacity pool. Where utilization is similar, it tries to keep the request close to you, but it intelligently works out what would give you the best overall performance. Network latency is one tiny element of that; it's far more important to reduce the latency between tokens, so my overall tokens per minute can be pushed higher by sending to the least-used capacity pool.
If this is too open for you and you have data sovereignty requirements, maybe you can start to use this data zone option, which is a nice middle ground. So far, this has all been about performance.
Resiliency of the Service
Realize there's a slight availability aspect here as well. Remember I talked about control plane elements? If an instance has a problem, the control plane repairs it and your request is simply sent to another instance. But there could be a data plane problem affecting a region's entire capacity pool. In that case, if I'm a standard deployment and the whole local capacity pool has an issue, it can't process my request.
If I were data zone or global, intelligent routing would know that pool is unhealthy and send the request to the next least busy one. Even if the failed pool happened to be the one it would otherwise have picked, which it probably wouldn't be, there are simply more choices for handling a regional-level capacity pool problem. You get a 99.9% SLA either way, but in terms of what's actually available to you, the wider choice may help. However, the endpoint is still regional. It doesn't matter if I do data zone or global; the endpoint I'm talking to, the control point for my resource, is bound to region one.
If the region hosting my Azure OpenAI resource has a problem, it doesn't matter what deployment type I've picked. If there's a problem with that element, my request can't go anywhere; if my endpoint is down, I can't do anything.
So, how do I handle what we would normally think of as global resiliency? What is my solution to that? This is where you are responsible, as part of your architecture, to have a solution in place. Don't think that a global deployment or a data zone deployment handles availability for you. It does not. This is about the potential places to send your request for scale and performance.
Building Global Resiliency
I mentioned a small availability benefit in that capacity pool failure scenario, but overall this is not about the availability of your resource. I have to solve that myself. So, what would I do? I have to have deployments in multiple regions, just like we would with any kind of Azure resource. I would create an instance in region two, so I'd have resource two with its own endpoint: this would be Endpoint2, and the first one would be Endpoint1. They're different endpoints, but now I have resources with completely different control planes, completely different resources, and completely different endpoints to talk to.
What does that mean for the application? You have choices. You could absolutely plumb into your application that there's an Endpoint1 and an Endpoint2: try to talk to Endpoint1 first, and if it fails, or you get a 429 because it's run out of capacity, go and talk to Endpoint2. But now I'm putting all the burden on the application that interacts with the generative AI. Do I really want to change my app, or every app, when I may have many using generative AI? Now I need to modify all of them to understand multiple possible endpoints and how to check their health.
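If you did go the application route, a minimal failover sketch might look like the following. The endpoints, key handling, and deployment name are placeholders, and a production version would also honour Retry-After headers and add backoff; it's shown mainly to illustrate the logic every app would have to carry.

```python
from openai import AzureOpenAI, APIConnectionError, APIStatusError

# Placeholder endpoints for two regional Azure OpenAI resources.
ENDPOINTS = [
    "https://my-aoai-westus3.openai.azure.com",   # Endpoint1 (primary)
    "https://my-aoai-eastus2.openai.azure.com",   # Endpoint2 (secondary)
]

def chat(messages, deployment="gpt-4o", api_key="<api-key>"):
    last_error = None
    for endpoint in ENDPOINTS:
        client = AzureOpenAI(azure_endpoint=endpoint, api_key=api_key,
                             api_version="2024-06-01")
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except APIStatusError as e:
            # 429 = throttled / out of capacity, 5xx = service issue: try the next region.
            if e.status_code == 429 or e.status_code >= 500:
                last_error = e
                continue
            raise  # a 400-class error won't be fixed by retrying elsewhere
        except APIConnectionError as e:
            last_error = e  # endpoint unreachable: fail over
            continue
    raise last_error
```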
I could use something like Azure Traffic Manager or Azure Front Door. These could provide a single entry point, albeit without really understanding generative AI. However, most people in this scenario would consider Azure API Management (APIM). I would recommend setting up an instance of APIM using the premium tier, since the premium SKU allows a single instance to provide service from multiple regions, giving you geographical resilience.
Your APIM instance would be configured with the various backend endpoints, allowing it to send requests to the appropriate one while exposing a single endpoint to your application. This setup enables intelligent routing, such as sending all requests to Endpoint1 and only redirecting to Endpoint2 if there's an issue. It can also manage spillover, for example sending requests to a provisioned throughput deployment first and switching to a regular pay-as-you-go deployment when necessary. Semantic caching can be employed to avoid sending the same or similar prompts repeatedly, saving tokens.
The key point is to architect for resilience, with instances in at least two regions and model deployments in each for every model you use. Whether a deployment is data zone, global, or standard affects only the potential pool of capacity for requests. A single endpoint for your apps is ideal, so they remain unaware of the underlying complexity. Implementing this is straightforward, provided the APIM layer itself is geographically resilient, which is what the premium SKU gives you.
Provisioned Throughput
Provisioned throughput units (PTUs) differ from pay-as-you-go models. With pay-as-you-go, you pay based on consumption—tokens sent and received. Input tokens are cheaper than output tokens due to the complexity of inferencing. However, this model offers no guaranteed throughput, only a quota, and latency can vary.
For applications requiring guaranteed throughput and consistent latency, provisioned throughput units are essential. These units represent a fixed cost, offering a certain amount of throughput per time unit. If you purchase PTUs but only use 50% of their capacity, you still pay the full amount. Proper planning is crucial to avoid over-provisioning and wasting resources.
The mapping of PTUs to tokens varies by model and by whether they are input or output tokens. The documentation provides guidance: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput#how-much-throughput-per-ptu-you-get-for-each-model. Obviously, the smaller models are much more efficient, so you get many more tokens for the same unit. It will vary, so the recommendation here is to use the capacity calculator in Azure AI Studio, where you can input the details of your specific scenario: https://oai.azure.com/portal/calculator.
They might allow a little bit of bursting, but not much. If you only bought a limited amount and try to send more, you'll get a 429 response. You could have a PTU instance and a pay-as-you-go instance, and then your API management could redirect to the pay-as-you-go if it gets a 429 response. This allows for bursting. I discuss this in the APIM for Generative AI video which can be found at https://youtu.be/l_8dTUwrqNw.
The benefit here is that you get a fixed cost but primarily it’s about having a guaranteed amount of throughput that you’ll receive, along with consistent latency for interactions. That’s the focus of PTUs. While the fixed cost may be attractive, the goal is to ensure you can process a certain number of tokens for your business and expect a consistent response.
There is a Model Capacities API available that can help you see the number of PTUs you could buy in different regions. Remember, the capacity must exist in the capacity pool for that region. If you choose the option of Global or Data Zone, there’s a larger potential set of capacity pools, increasing your chance of purchasing the number of PTUs you need. Greater regional flexibility increases your chances of purchasing capacity for the hours you need, but there’s no guarantee you’ll be able to buy them.
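A hedged sketch of calling that API with plain REST from Python is below. The provider path Microsoft.CognitiveServices/modelCapacities and the query parameters reflect my understanding of the Model Capacities list operation, but the api-version and the exact response fields are assumptions, so verify them against the REST reference before relying on this.

```python
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"   # placeholder
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (f"https://management.azure.com/subscriptions/{subscription_id}"
       "/providers/Microsoft.CognitiveServices/modelCapacities")
params = {
    "api-version": "2024-10-01",   # assumption: check the currently supported version
    "modelFormat": "OpenAI",
    "modelName": "gpt-4o",
    "modelVersion": "2024-08-06",
}

resp = requests.get(url, params=params, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

for item in resp.json().get("value", []):
    props = item.get("properties", {})
    # Field names are assumptions; inspect the raw JSON for the real schema.
    print(item.get("location"), item.get("name"), props.get("availableCapacity"))
```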
Overall, PTUs are great when you need guaranteed throughput and consistent latency, but be cautious. If you buy PTUs, you pay per hour. If you forget about it or are only testing with a few tokens, you’ll pay the same amount as if you were maximizing it. People sometimes buy PTUs and forget, resulting in bills for tens of thousands of dollars. If you’re experimenting or in a development/testing phase, you probably want to use pay-as-you-go. PTUs should only be considered after serious consideration and calculations of the right number to buy, likely only in a production environment.
PTUs are an on-demand hourly cost, and if they don’t run for the full hour, they prorate it, so you pay for the number of minutes. If you have a long-term plan to use PTUs, you can make reservations for cost savings.
With PTUs, you can save money with reservations, but you must understand your business needs. Overbuying reservations can waste money. Even if running at 70% or 80% utilization, you might still save money compared to hourly costs. You need to monitor your usage over time and then consider reservations once you’re comfortable with your business consistency.
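As a purely hypothetical illustration of that break-even logic (the rates below are invented placeholders, not Azure list prices), a reservation can still win even when you would only keep an hourly deployment running part of the time:

```python
# Invented placeholder rates: NOT real Azure pricing.
hourly_rate_per_ptu = 1.00     # $/PTU/hour on the on-demand model
reserved_rate_per_ptu = 0.65   # effective $/PTU/hour with a reservation
ptus = 100
hours_per_month = 730
utilisation = 0.75             # you'd only run the hourly deployment 75% of the time

# On-demand: the deployment can be deleted when idle, so you pay only for used hours.
on_demand_cost = ptus * hourly_rate_per_ptu * hours_per_month * utilisation
# Reservation: you pay for every hour, used or not.
reserved_cost = ptus * reserved_rate_per_ptu * hours_per_month

print(f"On-demand at 75% of hours: ${on_demand_cost:,.0f}/month")
print(f"Reserved for all hours:    ${reserved_cost:,.0f}/month")
# With these made-up rates the reservation still saves money at 75% utilisation;
# your real break-even depends on actual rates and usage patterns.
```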
You might also create a pay-as-you-go instance to burst out if you exhaust your PTUs, especially if you’re conservative and want to avoid overbuying. Data Zone is coming for both pay-as-you-go and PTU, providing options for data residency requirements.
Azure OpenAI Global Batch
Batch is only Global, designed for inferencing jobs that aren’t time-sensitive. If you have content generation or large-scale data processing without time criticality, Global Batch allows you to submit jobs asynchronously. It uses Microsoft’s global capacity pools to process jobs in spare cycles, returning results within twenty-four hours at a reduced cost, about 50% less.
To leverage Global Batch, data residency can't be a concern for you, as it's currently global only. As with the other global options, the focus is on maximizing performance by utilizing the largest pool available for your model and version, achieving the greatest throughput and lowest latency.
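A minimal sketch of submitting a Global Batch job with the Python SDK is shown below. The file contents, deployment name ("gpt-4o-batch"), and API version are placeholders, and the call pattern assumes Azure OpenAI's batch support mirrors the OpenAI batch API (upload a JSONL file, create the batch, poll, then download the output file).

```python
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-aoai-resource.openai.azure.com",   # placeholder
    api_key="<api-key>",
    api_version="2024-10-21",   # assumption: use a version that supports batch
)

# Each line of requests.jsonl is one request, e.g.:
# {"custom_id": "task-1", "method": "POST", "url": "/chat/completions",
#  "body": {"model": "gpt-4o-batch", "messages": [{"role": "user", "content": "Summarise ..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/chat/completions",
    completion_window="24h",   # results within 24 hours at the reduced batch price
)

# Simple polling loop; a real job would poll far less aggressively.
while job.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
    job = client.batches.retrieve(job.id)

if job.status == "completed" and job.output_file_id:
    print(client.files.content(job.output_file_id).text)
```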
Summary
Remember, your deployment type (Standard vs. Global Standard vs. Data Zone) isn't really changing your resiliency. It's about the potential pool of capacity your request can run on: the greater the pool, the better the performance and the greater the chance you can push your desired throughput.
Your resource still exists in a specific region, so you must architect accordingly for geo-redundancy. You’d want instances of Azure OpenAI in multiple regions, using premium Azure API Management to handle health probing and load balancing.
If you require guaranteed throughput and consistent latency, use PTUs, but ensure you understand your requirements. You can combine PTU with pay-as-you-go, using APIM to handle overflow. More regional flexibility, such as Global or Data Zone, generally results in more flexibility in your PTU purchases.
For most, if data residency isn’t an issue, use Global. If it is, and it’s EU or US, use Data Zone. Always build resiliency into your architecture.
Hope that helps!