Azure OpenAI with Azure API Management

This article is based on my video of the same topic at https://youtu.be/l_8dTUwrqNw with some help from generative AI and then some human love after! I also did an Azure API Management deep dive at https://youtu.be/PXtFq5wmGt0 if interested.

Transparent for Applications

When we think of API management (APIM) in general, it should be transparent to the application and developers. We don't want them to make significant changes to take advantage of what we’re adding. We’re placing API management between the application and the target endpoint to add governance, security, visibility, limits, and load balancing. However, we don’t want to require major changes to the application.

Ordinarily, when an application wants to communicate with an API, it talks to an endpoint. Typically, with large language models, it uses a key, passing it in the header or as part of a request element. We don’t want to change anything for the application; the only change is the specific endpoint and the key used. Instead of, for example, the AI large language model endpoint, we ask the developer to use our APIM endpoint.

One of the nice features of API management is the concept of subscriptions—not Azure subscriptions, but different applications or business units that we want to distinguish between. We might apply different limits, grant access to different features, or charge based on usage. The key used would likely become an APIM subscription key.

Could we switch to using integrated authentication, like a JSON Web Token (JWT)? Absolutely, but that would be a more invasive change to the applications, which we want to avoid. We aim for transparency for consuming applications by placing APIM in the middle of the interaction to add visibility and governance features.

Supported Models

Now, let’s consider the large language models supported at this point. The focus today is on Azure OpenAI’s large language models and the inference API. If you’ve explored this, you might have seen the Azure AI Model Inference API, which acts as an abstraction between the application and the backend large language model. As a developer, you might want to switch which large language model you use without changing your application; the inference API abstracts the backend details, so you can swap between different models without altering your code.

While not officially supported, if a large language model API communicates in the OpenAI format for chat completions on different infrastructure, it will probably work.

Getting Started

I mentioned that communication from your app to APIM will likely use the subscription key to uniquely identify different apps or business units. When APIM authenticates with your large language model, you have choices. Most likely, you’ll want to use managed identity for authentication. Managed identity is preferred because it removes the need for secrets or keys. However, you could still use a key if desired.
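As a rough sketch of what managed identity authentication can look like in policy (the resource URI is the Azure OpenAI scope; the variable name is illustrative), the gateway’s managed identity fetches a token and attaches it as the bearer token, so no key ever needs to live in the policy or the application:

<inbound>
    <base />
    <!-- Acquire an Entra ID token for Azure OpenAI using APIM's managed identity -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com"
        output-token-variable-name="msi-token" ignore-error="false" />
    <!-- Attach the token as the bearer token for the backend call -->
    <set-header name="Authorization" exists-action="override">
        <value>@("Bearer " + (string)context.Variables["msi-token"])</value>
    </set-header>
</inbound>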

To start, add the large language model as an API in APIM. Once added, it’s surfaced in APIM, allowing you to understand its operations and consume it. For the Azure OpenAI Service, there’s a nice onboarding experience that not only adds your large language model as an API exposed through APIM but can also add a few policies for you. The magic of APIM lies in those policies, which perform various tasks.

During the onboarding experience, you’re guided through defining policies to manage token consumption and track token usage, and it can emit metrics to Application Insights, so this special onboarding experience for Azure OpenAI simplifies your life considerably.

Generative AI APIM Features

Now, let’s explore some features available when using generative AI with APIM.

Token Limits

The most obvious one is token limits or quotas. This is crucial because the cost of using a large language model is based on the number of tokens sent and received. You may want to limit this, and while you can set quotas on the target model, you might want to limit it further for different applications.

Token limits are set through a policy, focusing on tokens per minute, which is the metric used today. You can set an overall token limit or use dimensions like the subscription key, source IP address, or something custom. APIM allows writing policy expressions, enabling dynamic quota adjustments based on other factors.
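As an illustration, a minimal token-limit policy in the inbound section might look like the following sketch; the numbers and variable name are placeholders, and the counter key here is the APIM subscription, so each app or business unit gets its own tokens-per-minute budget:

<!-- Limit each APIM subscription (app or business unit) to 5,000 tokens per minute.
     estimate-prompt-tokens lets APIM estimate prompt tokens up front rather than
     waiting for the backend's reported usage. -->
<azure-openai-token-limit counter-key="@(context.Subscription.Id)"
    tokens-per-minute="5000"
    estimate-prompt-tokens="true"
    remaining-tokens-variable-name="remainingTokens" />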

Another common feature is viewing detailed metrics. APIM can emit metrics to your App Insights instance, and the emit-token-metric policy controls which metrics and dimensions get sent there.
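A hedged sketch of that policy is below; the namespace and dimension choices are just examples, and you could equally rely on the built-in default dimensions:

<!-- Send prompt, completion, and total token counts to Application Insights,
     broken down by the dimensions listed here. -->
<azure-openai-emit-token-metric namespace="openai-metrics">
    <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
    <dimension name="Client IP" value="@(context.Request.IpAddress)" />
    <dimension name="API ID" value="@(context.Api.Id)" />
</azure-openai-emit-token-metric>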

Supporting Multiple Instances

The next thing to consider is the back end for the API I've added to APIM, which my client consumes. I may want to add resiliency by having another instance of my large language model. If this is going to stay transparent to the application, the instances need to be identical: the same model deployment, the same model name, and the same version, so nothing in the app has to change depending on where a request lands. The API can then have multiple backends. I can add them to a load-balancing pool and decide how to spread the incoming requests: maybe simple round-robin, maybe weighted, maybe with priorities. That makes the API more available if one instance goes down for some reason.
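The pool itself is defined on APIM backend resources (each member backend gets a priority and, optionally, a weight); the API's policy then just routes to the pool. A minimal sketch, assuming a pool named openai-backend-pool has already been created:

<!-- Route all requests for this API to the load-balanced pool of identical
     Azure OpenAI deployments; the pool's round-robin, weight, or priority
     settings decide which instance actually serves each request. -->
<set-backend-service backend-id="openai-backend-pool" />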

That's great, but I can take it a step further. With large language models, organizations commonly buy PTU (provisioned throughput units): I'm paying for a set amount of performance so I get consistent latency for interactions with the large language model. However, that's a finite capacity. What I'm probably going to do is have another instance that is pay-as-you-go. As an organization, I want to consume all of my PTU first, because I'm paying for it, and then, when its tokens per minute are exhausted, switch over to pay-as-you-go. When I've exhausted my tokens, I'll get a 429 back from the model along with a Retry-After header that essentially says, "Stop bothering me until this many seconds have passed," for example forty-two seconds.

I can automate all of this and keep it transparent for the application by combining two things. For the PTU back end, I can add a circuit breaker. A circuit breaker is designed to say, "Hey, if there's a problem happening, stop people from retrying," because constant retries might actually hamper the ability to recover. It's like saying, "This has failed, back off, and don't try to talk to me until this much time has passed." I can attach a circuit breaker to the PTU backend that says, "If you get a 429 returned, trip the circuit breaker until the Retry-After value has expired."

Then I add both backends into a load-balancing pool, with the PTU backend as priority one and the pay-as-you-go backend as priority two. If a 429 comes back, it trips the circuit breaker and the PTU instance is marked unhealthy. The load balancer sees that it's unhealthy and starts directing requests to the pay-as-you-go backend. After the circuit breaker's retry-after period has expired, the PTU backend becomes healthy again and requests flow back to it first. This keeps everything transparent for the end application while APIM takes care of using up my PTU first; I'm paying for it, so I want to use it up. Once it's exhausted for that minute and I've received the 429, requests go to the pay-as-you-go instead.
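A common way to wire this together in policy, alongside the circuit breaker defined on the PTU backend resource, is a retry around the request forwarding: when the PTU instance returns a 429 and trips its breaker, the same request is retried and the pool sends it to the lower-priority pay-as-you-go backend. A rough sketch with illustrative values:

<backend>
    <!-- If the current backend answers 429 (tokens exhausted), retry once;
         with the PTU backend's circuit breaker now open, the pool routes the
         retry to the priority-two pay-as-you-go backend. -->
    <retry condition="@(context.Response.StatusCode == 429)"
           count="1" interval="1" first-fast-retry="true">
        <forward-request buffer-request-body="true" />
    </retry>
</backend>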

Overview of key capabilities

Technically, you could be more sophisticated than that. You could use policy expressions to limit it based on the incoming request, based on that subscription key, and give it a certain amount to stop a noisy neighbor. That's not something in the box today; you would be doing custom tracking and quite a lot of policy expressions. Could it be done? Absolutely yes, and there are even some examples of that in the Git repo. This is a very common thing you're going to want to do because if I'm using a PTU, there's a strong chance I'm going to run out. I don't want to buy too much and waste money, so I have the PTU, and I would have a pay-as-you-go as well. If this runs out, my clients just transparently switch over.

Could I write the logic in the client? Totally. I could add to the client, "Hey, if you get a 429, switch over to this instead," etc. But this is just a much nicer way of doing it. Of course, I could do other things here: other load balancing, other combinations, other things to make it a great, transparent experience for the clients. That's really the goal of this. Remember, I don't want the clients to be aware of APIM. APIM should be completely transparent; they're just changing the endpoint and the key. But I, as the person responsible for governance and control of large language models in my organization, may need better insight and control, and may want to enhance the large language model capabilities for all the applications without them making changes. This is a great way to do that.

Diagnostic Capabilities

At this point, I can set limits, see the usage, and maximize the investments for my organization. I can also get some additional logging information: maybe I want to see what requests and prompts are actually coming in and what the inference responses are. I'm not sure I'd call it a feature as such, but one of the things I can configure on the APIM resource is diagnostic settings, and there I can log the prompts and the inference responses. The one caveat is that this doesn't work in streaming mode. Remember, streaming mode is when I send the large language model a request and the tokens appear one at a time, showing it's doing something; the model sends the tokens as it predicts them. The logging won't work for that; it only works when the whole response is sent back in one go. If my application isn't a user-interactive experience, I probably don't use streaming, and then the logging works fine. But if I use streaming mode because it's interactive with a user, APIM can't log it, because the response arrives one token at a time.

All of this is about getting information. I can set some limits on usage, but at the end of the day we're still going and talking to the large language model. We're consuming tokens for input and output, we may be querying Azure AI Search or some other vector database for additional information, all of which costs money, and there's a certain latency for the end user. There's one feature that really helps me change that behavior.

Semantic Caching

If we consider the components again, we have the application, APIM, and the large language model. In many scenarios, especially an end-user application or a help desk app, users are probably sending in very similar requests. A request comes in, goes to APIM, and ordinarily continues on to your service. Your orchestrator may then do some lookup in a database to ground the prompt, send it to the large language model, consuming tokens, get the response, and send it back. Users are probably asking the same things over and over again, like "What is the time-off policy for xyz?" or "Hey, I'm stuck on x." Each round trip takes longer because I have to talk to the large language model, which takes inferencing time, and it costs me money; I'm paying for all of those tokens. If I'm getting very similar requests over and over again, why not just return the response we got the first time it was asked? That's exactly what we're going to do now.

To do this, we need a couple of extra components. The first thing we have to add is Azure Cache for Redis in the Enterprise tier, because we need the RediSearch capability. We add that Redis Enterprise instance as an external cache for APIM, so APIM now has somewhere to store vectors, the semantic representations of what's being asked for. It also has to be able to create those vectors, so the other thing we have to add is an embedding model as a back end. We don't even have to expose it as an API through APIM; we just have to make it available for APIM to talk to and say, "Hey, I need the high-dimensional vector that represents the request that's coming in."

To enable semantic caching, you need to configure a policy. This involves setting a similarity score between zero and one, where a lower number indicates a broader match. You also need to determine the cache duration, with two minutes being a common choice.

When a request comes in, it follows a series of steps. First, it is sent to the embedding model to generate a vector representing the request. APIM then checks whether there is a cached result within the similarity score threshold. If not, the request is sent to the large language model, and the response is stored in the cache along with the request's vector before being sent back to the user.

If a subsequent request is semantically similar and within the two-minute cache period, the system retrieves the cached response without querying the large language model again. This saves tokens and improves response time.

Semantic caching in action

The semantic cache can be enabled at different levels. In this example, it is configured for all operations of the OpenAI caching API. The inbound flow uses the Azure OpenAI semantic cache lookup policy, where you set the score threshold (e.g., 0.4) and point at the embeddings backend; the outbound flow uses the semantic cache store policy to set the cache duration to two minutes.
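Put together, the pair of policies might look roughly like this sketch; the threshold, duration, and embeddings backend name mirror the example above but are otherwise placeholders:

<inbound>
    <base />
    <!-- Look for a semantically similar prompt already in the Redis cache.
         score-threshold controls how close a match must be; the embeddings
         backend is the embedding model registered as a backend in APIM. -->
    <azure-openai-semantic-cache-lookup score-threshold="0.4"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned">
        <vary-by>@(context.Subscription.Id)</vary-by>
    </azure-openai-semantic-cache-lookup>
</inbound>
<outbound>
    <base />
    <!-- Cache the model's response for 120 seconds (two minutes). -->
    <azure-openai-semantic-cache-store duration="120" />
</outbound>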

Summary

In summary, integrating Azure OpenAI with APIM allows for efficient token usage and enhanced user experience. By caching responses for semantically similar requests, you maximize investments and reduce costs. Embedding models play a crucial role in converting text to high-dimensional vectors, enabling semantic comparison and efficient caching.

Until the next article, take care!

