Efficiency-driven Iterative Model Tuning Approach: how to tune AI models efficiently while keeping running costs under control
Guillermo Wrba
Author of "Designing and Building Solid Microservice Ecosystems", independent consultant and solutions architect, evangelist of new technologies, distributed computing, and microservices.
AI technology is growing at an exponential rate, as adoption accelerates around generative AI, large language models (LLMs), and AI-centric specialized models such as text-to-video, video-to-text, image-to-text and, most recently, specialized models that can understand facial expressions and reactions and provide insights from them.
Undoubtedly, AI adoption keeps growing, and new applications and fields of use are discovered every day, each aimed at solving a different problem space.
But let's step back a thousand feet to see the big picture: from an IT perspective, AI is just another technology that empowers an existing ecosystem and, being part of that ecosystem, it must serve the high-level business goals already established and run within certain budgetary constraints. The potential of AI is unlimited; resources, however, are not, and AI resources certainly represent a budget investment that needs to be managed properly.
In operational terms, AI efficiency can be calculated by dividing the AI inference quality obtained by the operational computation costs involved. In a typical AI platform, those costs are driven primarily by multiple factors that together make up the overall running bill, such as compute (CPU/GPU) allocation, storage, and network connectivity.
AI inference quality, on the other hand, can be measured using indicators that give us an idea of how well a model is doing in terms of performance, consistency, accuracy, and completeness (PCAC). There are numerous such indicators; one of them, the inference confidence score (ICS), is discussed later in this article.
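As an illustration, here is a minimal sketch of how such indicators could be aggregated into a single quality score. The indicator names, the 0-1 normalization, and the equal weights are assumptions for the example, not a standard:

```python
# Minimal sketch: aggregating normalized PCAC indicator scores (0.0-1.0)
# into a single model-quality figure. Weights are illustrative assumptions.

PCAC_WEIGHTS = {
    "performance":  0.25,  # e.g. latency / throughput score
    "consistency":  0.25,  # coherence with the conversation history
    "accuracy":     0.25,  # correctness against a reference set
    "completeness": 0.25,  # how fully the response covers the query
}

def model_quality(indicators: dict[str, float]) -> float:
    """Weighted average of normalized indicator scores."""
    return sum(PCAC_WEIGHTS[name] * indicators[name] for name in PCAC_WEIGHTS)

# Example: a model scoring well on accuracy but poorly on consistency.
quality = model_quality({
    "performance": 0.8, "consistency": 0.5,
    "accuracy": 0.9, "completeness": 0.7,
})
print(f"aggregate quality: {quality:.2f}")
```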
Determining the right balance for max tokens configuration
As we mentioned earlier, inference consistency depends on how well a model can maintain a memory of the past conversation and respond according to it. Typical pre-trained LLMs such as GPT have an internal hard limit on tokens: the maximum number of tokens the model supports across both the query (including the history) and the model's response, so that
QUERY_TOKENS + MODEL_RESPONSE_TOKENS <= TOKEN_HARD_LIMIT
For example, the closed-source OpenAI gpt-4-32k model derives its name from the fact that it supports up to 32k (32,768) tokens as the hard limit, which means that the sum of the tokens used for the query (including the history) and the tokens used for the model response cannot exceed 32,768.
Other models, such as gpt-3.5-turbo, support as few as 4,096 tokens overall (and even gpt-4-turbo, with its much larger context window, caps the response at 4,096 tokens), which has a direct impact on the completeness, accuracy, and consistency of the model. The hard token limit directly constrains the achievable AI quality of the model.
Of course, gpt-4-32k will require a higher resource allocation, whereas gpt-4-turbo or gpt-3.5-turbo can run with more moderate resources, but the latter will deliver lower AI model quality due to their tighter token constraints.
Given the above inequality, it's clear that there must be a balance between the tokens allocated for the response and the tokens allocated for the chat history, which implies a trade-off between the consistency and the completeness of the model's responses: allocating more tokens to the history improves consistency but leaves less room for a complete response, while allocating more tokens to the response improves completeness but shrinks the model's effective memory.
This trade-off must be fully understood in order to unlock the full potential of AI models. Since models are limited, the right balance must be found so that the model behaves as expected. There is no rule of thumb here: the balance depends on the particular use case and must be fine-tuned to our needs, as the sketch below illustrates.
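As one way to enforce the inequality above, here is a minimal sketch that trims the oldest conversation turns until the query and history fit within the budget. It uses the open-source tiktoken tokenizer; the 4,096 hard limit and the size of the response reservation are assumptions to adapt to your model and use case:

```python
# Sketch: enforcing QUERY_TOKENS + MODEL_RESPONSE_TOKENS <= TOKEN_HARD_LIMIT
# by dropping the oldest history turns first.
import tiktoken

TOKEN_HARD_LIMIT = 4096   # e.g. a gpt-3.5-turbo-class model
RESPONSE_RESERVE = 1024   # tokens kept free for the model's answer

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def fit_prompt(history: list[str], query: str) -> list[str]:
    """Drop the oldest history turns until query + history fit the budget."""
    budget = TOKEN_HARD_LIMIT - RESPONSE_RESERVE - count_tokens(query)
    kept: list[str] = []
    used = 0
    for turn in reversed(history):  # keep the newest turns first
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept)) + [query]
```

Reserving more tokens for the response shifts the balance toward completeness; shrinking the reserve keeps more history and favors consistency.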
The AI Model Magic Quadrant
We have already covered the indicators that contribute to overall model AI quality, as well as the factors that contribute to model running costs. We can now define overall model efficiency as the ratio between the two, so that:
MODEL_EFFICIENCY := MODEL_QUALITY / RUNNING_COSTS
So our goal is to maximize efficiency, which, from the above equation, leaves two levers we can pull: increase model quality, or reduce running costs.
These two variables lead us to a quadrant in which we can categorize AI models, producing four different AI model categories, as the classification sketch below illustrates.
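A minimal sketch of this categorization; the quality and cost thresholds are illustrative assumptions, and in practice they would come from your own baseline models and budget:

```python
# Sketch: placing a model in the quadrant from its aggregate quality score
# and its running cost. Thresholds are illustrative.

QUALITY_THRESHOLD = 0.7    # aggregate PCAC score, 0.0-1.0
COST_THRESHOLD = 5000.0    # e.g. USD per month of compute

def efficiency(quality: float, running_cost: float) -> float:
    """MODEL_EFFICIENCY := MODEL_QUALITY / RUNNING_COSTS."""
    return quality / running_cost

def quadrant(quality: float, running_cost: float) -> str:
    high_quality = quality >= QUALITY_THRESHOLD
    high_cost = running_cost >= COST_THRESHOLD
    if high_quality and not high_cost:
        return "upper-left: efficient, a production candidate"
    if high_quality and high_cost:
        return "upper-right: scaled, keep efficiency monitored"
    if not high_quality and high_cost:
        return "lower-right: budget-eater, retrain or retire"
    return "lower-left: early-stage, experimental model"
```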
Typically, AI models start as low-budget models with small resource allocations, used to determine the potential ROI. Such models are usually less efficient, because they haven't gone through a proper fine-tuning process and/or model evaluation for refinement. This is where AI projects typically begin: bringing in less mature open-source models to experiment with certain AI capabilities, and starting to integrate them into the business requirements in order to replace old-fashioned business processes with AI-enabled ones.
As models evolve and get properly fine-tuned, trained, and evaluated to reduce "model drift", they reach the point where model quality becomes markedly higher, a quality jump that positions the model in the upper-left quadrant. The model behaves better in terms of CAC while at the same time minimizing the use of compute resources. This quadrant still represents models that are not yet ready for production, because they haven't been properly scaled or put through the performance testing needed to ensure they can respond to actual user demand without compromising performance.
AI model drift is defined as the overall "bias" of a model relative to a given baseline across the various AI quality indicators, including consistency, accuracy, and completeness, among others. The higher the drift, the further the model deviates from the baseline quality goal to be achieved.
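A minimal sketch of one way to quantify drift, assuming indicator scores normalized to 0-1 and a plain mean-absolute-deviation formula (both are illustrative choices, not a standard definition):

```python
# Sketch: model drift as the average deviation of the current quality
# indicators from a frozen baseline.

def model_drift(current: dict[str, float], baseline: dict[str, float]) -> float:
    """Mean absolute deviation of each indicator from its baseline value."""
    deltas = [abs(current[k] - baseline[k]) for k in baseline]
    return sum(deltas) / len(deltas)

baseline = {"consistency": 0.80, "accuracy": 0.90, "completeness": 0.85}
current  = {"consistency": 0.70, "accuracy": 0.88, "completeness": 0.75}
print(f"drift: {model_drift(current, baseline):.3f}")  # higher = further from baseline
```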
At this point, the business can decide to invest more in the model, and resource allocation can be expanded by incorporating elastic scaling of AI resources: GPU processing, better storage allocation, and improved connectivity, leading to a properly sized AI compute cluster that scales with demand. At this stage the model undergoes a performance load analysis to ensure it can scale to meet the demand without compromising performance, while keeping the efficiency equation high enough to stay in the upper quadrant. From here, models can keep evolving, fine-tuning can continue, and investment in compute infrastructure can continue for as long as model efficiency stays in the high band.
At the bottom right, we find the models that require a high budget investment, perhaps due to heavy compute requirements, but that nevertheless do not show big improvements in the quality coefficient after proper tuning, which remains low. These models are budget-eaters that can compromise the ROI and are therefore less likely to be adopted in the mid/long term. Models that behave this poorly should be avoided, except where they can be further improved by proper re-training.
Capturing Model Quality indicators via the ICS system
Evaluating a model's C-A-C (consistency, accuracy, completeness) typically relies on the data science team capturing the model's response quality across many different scenarios and use cases. Sometimes there is no data science team, or, if there is, measuring a model can be difficult due to lack of time or lack of expertise.
An ICS (average inference confidence score) system enables end users to provide feedback on individual model responses as they interact with the models, by assigning a score to each response. These scores are then aggregated into an overall average that represents the model's overall CAC: how good a response is in terms of completeness, accuracy, and consistency. ICS can be implemented inside the AI applications themselves, or via a proper AI Playground where users can experiment with the models and give feedback, so that model quality is captured directly from the end users exercising each model.
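A minimal in-memory sketch of such a tracker follows; the 1-5 scale and the class API are assumptions for illustration, and a real playground would persist the scores in a database:

```python
# Sketch of an ICS store: users score each inference, and the system keeps
# a running average per model.
from collections import defaultdict

class ICSTracker:
    def __init__(self) -> None:
        self._totals: dict[str, float] = defaultdict(float)
        self._counts: dict[str, int] = defaultdict(int)

    def record_feedback(self, model_id: str, score: float) -> None:
        """Store one user score (1.0 = poor ... 5.0 = excellent)."""
        self._totals[model_id] += score
        self._counts[model_id] += 1

    def ics(self, model_id: str) -> float:
        """Average inference confidence score for a model."""
        if self._counts[model_id] == 0:
            raise ValueError(f"no feedback recorded for {model_id}")
        return self._totals[model_id] / self._counts[model_id]

tracker = ICSTracker()
tracker.record_feedback("llama-3-8b", 4.0)
tracker.record_feedback("llama-3-8b", 3.5)
print(tracker.ics("llama-3-8b"))  # 3.75
```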
Fine-tuning the efficiency using an iterative continuous enhancement approach
As I discussed above, overall model efficiency decreases as running costs rise as a consequence of increasing AI compute capacity. We should define a model efficiency threshold (E0 in our diagram) below which we will not let the model fall, because below it the model performs too poorly for the investment. This threshold in turn determines a ceiling on running costs, represented by R0 in our diagram; past that point we shouldn't go, because doing so is detrimental in terms of model efficiency.
Secondly, we can fine-tune our model quality by improving the overall PCAC factors. As we fine-tune, overall model efficiency can improve over time; we want the model to reach at least the efficiency level E0, given our initial R0 compute resource allocation. How long we can keep iterating on fine-tuning to improve PCAC depends on how much the model can grow within the physical resources allocated: the model will improve up to a certain point, after which it starts showing degradation. That point, our practical limit, is labeled E1 in the diagram below.
In a real-world scenario, we may want to combine both of the above ways of improving efficiency by implementing a continuous iterative enhancement process, necessarily driven by a volumetry goal that dictates how far we want to grow: the number of users connected to our AI infrastructure, the number of inferences per second completed, and so on. A sketch of such a loop follows.
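In this minimal sketch, fine_tune_once() and evaluate_quality() are hypothetical hooks into your training and evaluation pipeline, and the E0 value is an assumed placeholder for the threshold defined earlier:

```python
# Sketch of the iterative enhancement loop: keep fine-tuning while the
# efficiency (quality / cost) still improves, stopping at the practical
# limit E1 (the first iteration that shows degradation).

E0 = 0.0001          # minimum acceptable efficiency (quality per cost unit)
MAX_ITERATIONS = 20  # budget cap on tuning rounds

def tune_until_plateau(model, running_cost: float,
                       fine_tune_once, evaluate_quality) -> float:
    best_efficiency = evaluate_quality(model) / running_cost
    for _ in range(MAX_ITERATIONS):
        fine_tune_once(model)
        eff = evaluate_quality(model) / running_cost
        if eff <= best_efficiency:  # degradation: practical limit E1 reached
            break
        best_efficiency = eff
    if best_efficiency < E0:
        raise RuntimeError("model cannot reach E0 within the allotted budget")
    return best_efficiency
```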
This process may sound simple, but in practice it is not, especially because of the cost and time that each fine-tune/evaluate iteration consumes.
The continuous iterative enhancement approach ensures that we achieve the volumetry goal we have set while maximizing overall model efficiency, yielding a model that performs strongly in terms of consistency, scalability, accuracy, performance, and completeness. The process can be tedious and time-consuming, but it pays back with the maximum gain for the money spent; in other words, it guarantees a better ROI at the end of the day.
Conclusion and final words
Model efficiency can become an important KPI when evaluating how well a model behaves under given running costs and budgetary constraints. Higher levels of model efficiency represent best-fitting models in terms of performance and overall CAC (consistency-accuracy-completeness). Lower levels of efficiency can point to poor models, but also to the higher running costs of an oversized AI compute infrastructure.
Adhering to the Magic AI Model Quadrant, the highly efficient models are the more desirable ones. To maximize efficiency, we must keep model quality high while keeping running costs contained.
The C-A-C characteristics of a model are heavily impacted by the training and inference hyper-parameters and how they are tuned. The token configuration must strike the best balance between the model's "memory" and its completeness, and the right trade-off depends on the use case.