Efficiency-Driven Iterative Model Tuning Approach: how to tune AI models efficiently while keeping running costs under control

AI technology is growing at an exponential rate as adoption accelerates around generative AI, large language models (LLMs), and specialized models such as text-to-video, video-to-text, image-to-text and, most recently, models that can understand facial expressions and reactions and provide insights.

Undoubtedly, AI adoption keeps growing, and new applications and fields of use are discovered every day, each aimed at solving a different problem space.

But let's step back to the ten-thousand-foot view: from an IT perspective, AI is just another technology that empowers an existing ecosystem and, being part of that ecosystem, it must serve the high-level business goals already established and run within certain budgetary constraints. The potential of AI is unlimited; resources are not, and AI resources certainly represent a budget investment that needs to be managed properly.

In operational terms, AI efficiency can be calculated by dividing the AI inference quality obtained by the operational computation costs involved. In a typical AI platform, operational computation costs are driven primarily by multiple factors that together contribute to the overall running costs, such as the following (a cost-aggregation sketch follows the list):

  • CPU/GPU running costs for hosting running models
  • Physical Memory of nodes involved in GPU computation and respective costs
  • Number of nodes that are part of the AI compute cluster
  • Attached storage/appliance costs needed to persist vectorized data
  • Vector database running costs
  • Content Pipeline and Metadata store related costs
  • Total number of tokens used - this particularly applies to hyperscaler-hosted models, where users are charged based on the number of tokens consumed.
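
As a rough illustration of how these components add up, the following Python sketch aggregates hypothetical monthly cost figures into a single running-cost number; the field names and the breakdown itself are illustrative assumptions, not a standard cost model.

from dataclasses import dataclass

@dataclass
class RunningCosts:
    """Monthly cost components for an AI platform (all figures illustrative)."""
    gpu_cpu_hosting: float    # CPU/GPU costs for hosting the running models
    node_memory: float        # physical memory on the GPU compute nodes
    cluster_nodes: float      # per-node costs for the AI compute cluster
    vector_storage: float     # attached storage/appliance for vectorized data
    vector_db: float          # vector database running costs
    content_pipeline: float   # content pipeline and metadata store costs
    token_charges: float      # per-token charges on hyperscaler-hosted models

    def total(self) -> float:
        # Overall running costs = sum of all the components listed above
        return (self.gpu_cpu_hosting + self.node_memory + self.cluster_nodes
                + self.vector_storage + self.vector_db
                + self.content_pipeline + self.token_charges)

For example, RunningCosts(1200, 300, 450, 80, 150, 60, 900).total() gives the cost figure that will later appear as the denominator of the efficiency ratio.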

AI inference quality, on the other hand, can be measured using indicators that give us an idea of how well a certain model is performing in terms of performance, consistency, completeness, and accuracy (PCCA). There are numerous such indicators, for example:

  • Average token decoding time: the average time spent by the AI model to decode one token as part of a completion. Models with lower decoding times perform well because their responses can be generated in less time. This indicator is calculated as the total time spent on inference divided by the total number of tokens processed (see the computation sketch after this list).
  • Average time per inference: related to the indicator above, but it tracks the total time the AI model takes to produce a full completion. It gives an idea of how long the model takes to come back with an actual response, measured at the individual inference level.
  • Hallucination rate: the percentage of requests to the model that end up in a hallucinated response. Hallucinations occur when the model responds with content that does not relate to the given query, or that deviates from what has been asked. Hallucinations can be caused by multiple factors, including bad training data, a badly weighted model, or incorrect data provided for RAG purposes. The lower the hallucination rate, the better the model behaves.
  • Inference completeness rate: the percentage of requests to the model that produce a complete response. A complete response is one that provides enough context and information for the given query so that there are no "information holes". Completeness can be affected by underlying model parameters such as the maximum number of tokens and the temperature. The higher the completeness rate, the better the model quality.
  • Inference accuracy rate: the percentage of requests to the model that produce an accurate response. Accurate responses are those that are concise and provide an exact answer to the given query rather than a more generic one. The accuracy of the model depends on underlying training parameters such as the training data set, the test data set used during model evaluation, the number of training cycles (epochs), the learning rate, and the batch size, among others. The higher the accuracy rate, the better the model quality.
  • Inference consistency rate: the percentage of requests to the model that produce a consistent response. Consistent responses are those that take into account past interactions or chat history, so that responses are related and chained together. Chaining of model answers depends on many factors, including how well the chat history is maintained and, mainly, how many tokens are assigned to it. In this regard, typical pre-trained models such as the GPT family expose a maximum number of tokens per interaction, which determines how many tokens can then be used to carry the chat history. In order to achieve maximum consistency, it's important to strike the right balance in the distribution of tokens, as we will discuss later in this article.
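
To make these indicators concrete, here is a minimal Python sketch that aggregates per-inference logs into the PCCA indicators above. The record schema is an assumption for illustration: latency and token counts come from the serving layer, while the boolean quality labels would be produced by reviewers or an automated judge.

from dataclasses import dataclass

@dataclass
class InferenceRecord:
    latency_s: float        # wall-clock time for the full completion, in seconds
    tokens_generated: int   # number of tokens produced in the response
    hallucinated: bool      # quality labels from reviewers or an automated judge
    complete: bool
    accurate: bool
    consistent: bool

def quality_indicators(records: list[InferenceRecord]) -> dict[str, float]:
    """Aggregate per-inference logs into the PCCA indicators described above."""
    n = len(records)
    total_time = sum(r.latency_s for r in records)
    total_tokens = sum(r.tokens_generated for r in records)

    def rate(flag: str) -> float:
        return sum(1 for r in records if getattr(r, flag)) / n

    return {
        "avg_token_decoding_time_s": total_time / total_tokens,
        "avg_time_per_inference_s": total_time / n,
        "hallucination_rate": rate("hallucinated"),
        "completeness_rate": rate("complete"),
        "accuracy_rate": rate("accurate"),
        "consistency_rate": rate("consistent"),
    }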


Determining the right balance for max tokens configuration


As mentioned earlier, inference consistency depends on how well a model can maintain a memory of the past conversation and respond accordingly. In typical pre-trained LLMs such as GPT, the model has an internal hard limit on tokens, which caps the total it will accept across both the query (including the chat history) and the model response, so that

QUERY_TOKENS + MODEL_RESPONSE_TOKENS <= TOKEN_HARD_LIMIT

For example, the closed-source OpenAI gpt-4-32k model derives its name from the fact that it supports up to 32k (32,768) tokens as the hard limit, which means that the sum of tokens used for the query (including the history) and the model response cannot exceed 32,768 tokens.

Other models, such as the original gpt-3.5-turbo, support up to 4,096 tokens overall (and even gpt-4-turbo, despite its much larger context window, caps its completion output at 4,096 tokens), which has a direct impact on the completeness/accuracy/consistency of the model. The hard token limit directly impacts the AI quality of the model.

Of course, the closed-source gpt-4-32k will require a higher resource allocation, whereas smaller-context options such as gpt-3.5-turbo can run on a more moderate allocation without too many implications, but those models will contribute lower AI model quality due to their token constraints.

Given the above inequality, it's clear there must be a balance between the tokens allocated for the response and the tokens allocated for the chat history, which implies a trade-off between the consistency and the completeness of the model's responses. The following rules apply:

  • If we keep increasing the tokens used for the model response, or MODEL_RESPONSE_TOKENS (also known as the "completion" in AI terminology), the model becomes more verbose, but it will hardly remember what has already been discussed during the conversation, or it will remember only a portion of the most recent interactions, leading to increased completeness at the expense of lower consistency.
  • If we keep increasing the tokens used for QUERY_TOKENS, the model will start remembering all past interactions, but its responses may become too short or too concise. In this case, the model gains high consistency at the expense of lower completeness.

This trade-off must be fully understood in order to unlock the full potential of AI models. Since token budgets are limited, the right balance must be found so that the model behaves as expected. There is no rule of thumb for this: it depends on the particular use case and must be fine-tuned appropriately to our needs.
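
To make the trade-off concrete, the sketch below reserves a fixed completion budget and trims the oldest chat turns so that QUERY_TOKENS + MODEL_RESPONSE_TOKENS stays within the hard limit. The token counter is a crude whitespace approximation standing in for a real tokenizer (such as one built on tiktoken), and all names and defaults are illustrative.

from typing import Callable

def approx_count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer
    return max(1, len(text.split()))

def trim_history(history: list[str], new_query: str,
                 hard_limit: int = 4096, response_budget: int = 1024,
                 count_tokens: Callable[[str], int] = approx_count_tokens) -> list[str]:
    """Keep only the most recent turns that fit once the completion budget is reserved."""
    budget = hard_limit - response_budget - count_tokens(new_query)
    kept: list[str] = []
    # Walk the history from most recent to oldest, keeping whatever still fits.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))

Raising response_budget buys completeness at the cost of history (and hence consistency); lowering it does the opposite, which is exactly the dial described in the two rules above.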


The AI Model Magic Quadrant


We have already covered the indicators that contribute to overall model AI quality, as well as the factors that contribute to model running costs. We can now define overall model efficiency as the ratio between the two, so that:

MODEL_EFFICIENCY := MODEL_QUALITY / RUNNING_COSTS

Our goal is to maximize efficiency, which, from the above equation, leads to two possible actions:

  • Increase the model quality, by improving the overall model CAC (consistency-accuracy-completeness) and by reducing the model response times.
  • Decrease the running costs, by being frugal enough in terms of resource usage, or by setting up dynamic AI scaling capabilities so that resources can be scaled in/out on demand.

This leads us to a quadrant where AI models can be categorized along these two variables, resulting in four different AI model categories.


AI Model "Magic Quadrant"

Typically, AI models start out as low-budget models, with small resource allocations intended to determine the potential ROI. Such models are usually less efficient, because they haven't gone through a proper fine-tuning process and/or model evaluation for refinement. This is where AI models typically start being used in AI projects: by bringing in less-mature open-source models, experimenting with the AI capabilities they offer, and integrating them with the business requirements in order to replace old-fashioned business processes with AI-enabled ones.

As models evolve and get properly fine-tuned, trained, and evaluated to ensure low "model drift", they pass the point where model quality becomes higher, producing a quality jump that positions the model in the upper-left quadrant. The model behaves better in terms of CAC while still minimizing the use of compute resources. This quadrant still represents models that are not yet ready for production, because they haven't been properly scaled and haven't gone through performance testing to ensure they can respond to actual user demand without compromising performance.

AI model drift is defined as the overall "bias" of a model compared to a given baseline across various AI quality indicators, including consistency, accuracy, and completeness, among others. The higher the drift, the greater the model's deviation from the baseline quality goal to be achieved.

At this point, the business can decide to invest further in the model, and resource allocation can be expanded by incorporating elastic AI scaling of resources, additional GPU processing, better storage allocation, and improved connectivity, leading to a properly sized AI compute cluster that can scale on demand. At this stage the model undergoes performance load analysis to ensure it can scale to meet demand without compromising performance, keeping the efficiency ratio high enough to stay in the upper quadrant. From here, models can keep evolving, fine-tuning can continue, and investment in compute infrastructure can continue as long as model efficiency stays in the high band.

At the bottom right, we find models that require a high budget investment, perhaps due to higher compute requirements, but that despite this do not show big improvements in the quality coefficient even after proper tuning, which remains low. These models are budget-eaters that can compromise ROI and are therefore less likely to be adopted in the mid/long term. Models that behave poorly should be avoided, except those that can be further improved through proper re-training.
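
For illustration only, the following sketch assigns a model to one of the four quadrants from a quality score and a running-cost figure; the thresholds are arbitrary inputs that would need to be calibrated against your own baseline.

def quadrant(quality: float, running_cost: float,
             quality_threshold: float, cost_threshold: float) -> str:
    """Place a model in one of the four quadrants described above."""
    efficiency = quality / running_cost   # MODEL_EFFICIENCY as defined earlier
    high_quality = quality >= quality_threshold
    high_cost = running_cost >= cost_threshold
    if high_quality and not high_cost:
        label = "high quality / low cost - efficient, a candidate for scaling"
    elif high_quality and high_cost:
        label = "high quality / high cost - scaled, keep efficiency in the high band"
    elif not high_quality and not high_cost:
        label = "low quality / low cost - early-stage or experimental"
    else:
        label = "low quality / high cost - budget-eater"
    return f"{label} (efficiency={efficiency:.3f})"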

Capturing Model Quality indicators via the ICS system

Evaluating a model's C-A-C (consistency, accuracy, completeness) typically relies on the data science team capturing the model's response quality across multiple scenarios and use cases. Sometimes there is no data science team, or if there is one, measuring a model can be difficult due to lack of time or lack of expertise.

An ICS system (average inference confidence score) enables end users to provide feedback on individual model responses as they interact with the models, by assigning a score to each response. These scores are then aggregated into an overall average that better represents the model's overall CAC. ICS can be implemented in the AI applications themselves, or via a proper AI Playground where users can experiment with the models and give feedback.

In other words, the average inference confidence score measures how good a model response is in terms of completeness, accuracy, and consistency, based on feedback from the end users interacting with the individual models; capturing it in an AI Playground lets us measure model quality directly from the people playing with each model.
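
A minimal sketch of how such feedback could be collected, assuming a hypothetical playground that reports one user score per model response:

from collections import defaultdict

class ICSCollector:
    """Collects per-response user scores and computes the average ICS per model."""

    def __init__(self) -> None:
        self._scores: dict[str, list[float]] = defaultdict(list)

    def record(self, model_id: str, score: float) -> None:
        # Store one user rating (e.g. on a 1-5 scale) for a single model response
        self._scores[model_id].append(score)

    def average_ics(self, model_id: str) -> float:
        scores = self._scores[model_id]
        return sum(scores) / len(scores) if scores else float("nan")

In a playground UI, each rating widget would call record(), and average_ics() would feed the model quality dashboards.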

Fine-Tuning the efficiency using an iterative continuous enhancement approach

As discussed above, overall model efficiency decreases as running costs rise as a consequence of increasing the AI compute capability. We should define a model efficiency threshold (E0 in our diagram) below which we cannot afford to drop, because otherwise our model will perform poorly. This threshold in turn determines a level of running costs associated with running the AI model, represented by R0 in our diagram. We shouldn't go past that point, because doing so would be detrimental in terms of model efficiency.



Secondly, we can fine-tune our model quality by improving the overall model PCCA factors. As we fine-tune, overall model efficiency can improve over time. We want our model to reach at least the level of efficiency dictated by E0, given our initial R0 compute resource allocation. How far we can iterate on fine-tuning to improve PCCA depends on how much the model can grow within the physical resources allocated. In other words, the model will improve up to a certain point, after which it starts showing degradation. That point is labeled E1 in the diagram below, and it is our practical limit.



In a real-world scenario, we may want to combine both of the above ways of improving efficiency by implementing a continuous iterative enhancement process. This process must be driven by a volumetric goal that dictates how far we want to grow, in terms of the number of users connected to our AI infrastructure, the number of inferences per second completed, and so on (a sketch of the loop follows the list below):

  • We first allocate a baseline compute infrastructure capacity, namely R0, which gives us the E0 baseline as the bare minimum efficiency we want to achieve.
  • We tune the model's PCCA to maximize it, until we reach the E1 point, within the infrastructure boundaries we have.
  • We increase the resource allocation of the AI compute infrastructure slightly.
  • We tune the model's PCCA again to maximize it, until we reach the new E1 point of efficiency.
  • This process continues until we meet the desired demand in terms of scalability, given the goals we have set.
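
The loop below is a schematic Python sketch of that process. The callables allocate, tune_and_measure, and demand_met are hypothetical stand-ins for real provisioning, fine-tuning, and evaluation steps, and the stopping rule is deliberately simplified.

from typing import Callable, Tuple

def iterative_enhancement(r0: float, e0: float, step: float,
                          allocate: Callable[[float], None],
                          tune_and_measure: Callable[[], float],
                          demand_met: Callable[[], bool],
                          max_rounds: int = 10) -> Tuple[float, float]:
    """Grow resources in small steps, re-tuning PCCA at each allocation."""
    resources = r0      # baseline compute allocation (R0)
    best = 0.0
    for _ in range(max_rounds):
        allocate(resources)
        # Tune until efficiency stops improving within this allocation (the E1 point).
        prev = -1.0
        while True:
            efficiency = tune_and_measure()   # one fine-tuning pass, returns efficiency
            if efficiency <= prev:
                break
            prev = efficiency
        best = max(best, prev)
        if best >= e0 and demand_met():       # E0 baseline met and volumetric goal reached
            return resources, best
        resources += step                     # increase the allocation slightly and repeat
    return resources, best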

This process may sound simple, but it is not, especially because:

  • It requires a certain timeframe, usually ranging from a couple of months to half a year or more, depending on the volumetric goals to achieve.
  • It requires a data science team to stay close to the initiative, because they will be directly involved in improving the model's PCCA characteristics and quality indicators.
  • It can be trial and error, because there is no linear relationship between resources spent and overall efficiency: increasing the resource allocation may degrade efficiency over time, or vice versa.
  • A highly skilled QA performance team must oversee the process, so the team can measure model performance and collaborate with the architecture team to set the best-matching volumetric goals that reflect the problem to solve.

The continuous iterative enhancement approach ensures that we achieve the volumetric goal we have set while maximizing overall model efficiency, ending up with a model that performs strongly in terms of consistency, scalability, accuracy, performance, and completeness. The process can be tedious and time-consuming, but it pays back with the maximum gain for the money spent; in other words, it guarantees a better ROI at the end of the day.


Conclusion and final words

Model Efficiency can become an important KPI when it comes to evaluating how well a model behaves given certain running costs and budgetary constraints. Higher levels of model efficiency represent best-fitting models in terms of performance and overall CAC (consistency-accuracy-completeness). Lower levels of efficiency can be tied to poor models, but also to higher running costs that can be caused by an oversized AI compute infrastructure.

Adhering to the AI Model Magic Quadrant, the models that are highly efficient are the most desirable. To maximize efficiency, we must:

  • Keep running costs as low as possible, according to observed demand, by allocating AI compute infrastructure as needed, avoiding over-sizing the resources, and applying dynamic auto-scaling policies whenever possible so the infrastructure meets the demand.
  • Maximize the model quality indicators, by reducing overall model response times, minimizing the token decoding time, avoiding hallucinations, and improving the C-A-C characteristics of the model.

The C-A-C characteristics of a model are highly impacted by the training and inference hyperparameters and how they are tuned. The token configuration must strike the best balance between model "memory" and model completeness, and the right trade-off must be made depending on the use case.




