LLMs and Financial Data - One Model Cannot Rule Them All
Caption - A caricature about fine-tuning language models, generated from the prompt "Generate a caricature for fine-tuning language models. The image should be 1920 x 1080 in size" | Source - DALL·E

TL;DR: Foundational large language models, although incredibly good for general-purpose use, may not be best suited for very specific use cases because of size, consistency and cost considerations. If you want to build a customer FAQ bot on your platform's data, you could probably get by with a small LLM fine-tuned for the task that performs better and leaves a smaller budgetary and computational footprint.

If you are building a solution that is even remotely related to generative AI (gen AI) and are selling to enterprises, you may have heard things along the lines of: "Can't GPT do this already?"; "What if the next version of GPT does this out of the box?"; "We would prefer buying this straight from Google or Microsoft or Amazon since we are already using their infrastructure"; and so on.

And with every update it becomes even clearer that these state-of-the-art (SOTA) large language models (LLMs), and applications built using them such as ChatGPT, aren't meant to do everything well. There is still a lot of generalization and usability to be achieved by these models.

QuantumBlack, AI by McKinsey's latest report, titled "The state of AI in early 2024", claims that 65% of 1,363 respondents said that gen AI was being adopted in at least one business function. If you look at it by function, 34% of usage is happening in marketing and sales, followed by product/service development and IT at 23% and 17% respectively. Risk, strategy and corporate finance, supply chain management, and manufacturing are tailenders, with single-digit percentages of respondents reporting their usage. 67% of respondents expect their organizations to invest more in AI (not just gen AI) over the next three years.

You would notice that functions which permit vagueness are seeing higher adoption. Accordingly, gen AI-led revenue increases have only been reported in marketing and sales, and in supply chain and inventory management, while analytical AI (anything of the non-generative nature) is doing its bit in service operations.

What may be the reason for meagre adoption in some functions and no reported revenue increases? The report goes on to highlight that the biggest risks of generative AI use are inaccuracy and intellectual property infringement, and these are the top two concerns that organizations are working to mitigate.

What does this tell us? Return on invested capital is not yet apparent in all functions where generative AI is applicable, and enterprises give a lot more than two hoots about accuracy and intellectual property infringement.

One key thing that McKinsey has missed in its report, though, is asking respondents about the implementation costs of generative AI. I am sure that must be a consideration for whoever is managing these projects and has to sign off on the cloud compute or consulting costs.

Needless to say, all these concerns cannot be handled by your SOTA models alone.

As Alexander Sukharevsky, senior partner and global coleader of QuantumBlack, AI by McKinsey, points out in the report, "The spine and brain of the enterprise of the future will rely on a well-orchestrated mix of multiple foundational models—both off-the-shelf solutions and tools that have been finely tuned to the enterprise’s specific needs. In fact, with gen AI we are moving from a binary world of “build versus buy” to one that might be better characterized as “buy, build, and partner,” in which the most successful organizations are those that construct ecosystems that blend proprietary, off-the-shelf, and open-source models."

So if you are building and selling in finance and healthcare, this becomes a "Hard Fact" problem rather than a "Hair On Fire" problem - borrowing from Sequoia's PMF Arc framework (Image 1). This means that your clients "carry an assumption of how the world works and the latent demand would have to be unlocked through a novel solution".

Image 1 - Hard Fact problem statements need you to upend the status quo and there are not a lot of competitors | Source -

Sequoia says selling in this category requires an "epiphany" in the client's mindset. But epiphanies may require educating your client about the possibilities of AI while showing that those possibilities can be built and implemented with all their concerns about inaccuracy and IP infringement kept in consideration.

The latter part - the one about building and implementing - is where the business constraints and technological capabilities collide head on and result in you having to take a long hard look at training and fine-tuning of ML models.

For the purposes of simplicity and word count, we will focus only on fine-tuning LLMs in this article - you may invariably have to train and/or fine-tune other machine learning (ML) models too.

When Do I Need To Fine-tune LLMs?

SOTA foundational models are generalists, and to get a specialist LLM you would need to fine-tune a foundational LLM. For instance, even ChatGPT is a fine-tuned version of GPT-4o, with fine-tuning giving the base model the capability to carry out conversations (Image 2).

Image 2 - User-facing applications are generally fine-tuned versions of foundational models | Source -

Although that is the general principle, in practical scenarios one does not start fine-tuning models from the get-go. The first step is to build a working prototype for your use case using whatever frameworks and SOTA models are available in the wild.

If you are looking to build document search capabilities for the insurance industry, you may want to implement a retrieval-augmented generation (RAG) solution using, say, Unstructured.io, LlamaIndex and Qdrant coupled with a SOTA model such as OpenAI's GPT-4, Anthropic's Claude Sonnet, Mistral Large or Cohere's Command R. It is assumed that you will be using the service providers' APIs directly or using the model through a cloud service provider (CSP) such as AWS, Microsoft Azure or Google Cloud. (Disclosure - This post is not sponsored.)
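To make this concrete, here is a minimal sketch of such a prototype, assuming a recent version of LlamaIndex with its Qdrant integration installed; the folder path, collection name and query are illustrative, and the LLM and embedding model default to OpenAI's unless configured otherwise:

```python
# Minimal RAG prototype sketch
# (assumes: pip install llama-index llama-index-vector-stores-qdrant qdrant-client)
import qdrant_client
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Load insurance documents from a local folder (path is illustrative)
documents = SimpleDirectoryReader("./insurance_docs").load_data()

# Store embeddings in Qdrant; an in-memory instance is enough for a prototype
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="policies")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index and ask a question; retrieval happens before generation
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine()
print(query_engine.query("What does the policy say about water damage?"))
```

Swapping the underlying LLM for Claude, Mistral Large or Command R is largely a configuration change rather than a rewrite, which is exactly why this stage is good for prototyping.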

A thing to note here is that paradigms such as RAG and chain-of-thought are independent of fine-tuning, and fine-tuned models can be used with these paradigms.

With this prototype in hand, you would test it against real-world scenarios by asking it questions that you gathered from your potential clients or from your understanding of the domain. You may also expose a minimal interface to some preliminary users to test it out. Since there cannot be universally agreed-upon benchmarks for niche use cases, this is what will amount to a basic accuracy metric.

Although Meta has released CRAG for evaluation of RAG pipelines and there are frameworks such as Prometheus, you may have to prepare a ground truth dataset for your use case that you test your solution's performance against. This ground truth dataset is often called a Gold Dataset or Gold Context Data.

Let's say your solution is 80% accurate on 500 real-world queries. In addition to the accuracy, you will also need to compute the document processing charges, the per-query token cost that you are incurring, the cloud infrastructure cost of keeping the solution in production, and the inference times. All or some of these costs will have to be passed on to your client.
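A sketch of what that measurement loop can look like, with hypothetical query/answer pairs and a deliberately naive scoring function (in practice you would use an LLM judge or semantic similarity, and accumulate token usage from your provider's response metadata):

```python
# Evaluate the prototype against a gold dataset (illustrative sketch)
gold_dataset = [
    {"query": "What is the grace period for premium payment?", "expected": "30 days"},
    # ... ~500 real-world query/answer pairs gathered from clients
]

def is_correct(answer: str, expected: str) -> bool:
    # Naive containment check; good enough for a first accuracy number
    return expected.lower() in answer.lower()

correct = 0
for item in gold_dataset:
    response = query_engine.query(item["query"])  # engine from the earlier sketch
    correct += int(is_correct(str(response), item["expected"]))

print(f"Accuracy: {correct / len(gold_dataset):.1%}")
```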

With all this information in hand, you are ready to decide whether or not to fine-tune an LLM. For the sake of clarity, let's itemize the factors you will be basing your judgement on, assuming the other components, including the frameworks you have used, have not forsaken you. These criteria can in general be used to evaluate any gen AI solution:

1. Real-world performance - Does your model perform on real-world tasks with an accuracy acceptable to the client?

2. Query Cost - Is the cost per query within tolerable range?

3. Infrastructure Cost - Is the monthly cost of the production infrastructure sustainable?

4. Hallucinations - Does it provide consistent and reliable outputs?

5. Inference Times - Does it provide answers in reasonable time?

6. Intellectual Property - Is the data leaving the client's infrastructure non-proprietary, and does the client have no problem with the data moving over the internet?

At Nullpointer, we are building a language interface for financial data, and we addressed the first five points without having to fine-tune an LLM, with just some scope for improvement in accuracy. We are using Anthropic's Claude Haiku model and our accuracy is almost 94%. However, we have had to train an ML model for classification of queries, develop a custom tokenization method and fine-tune an embedding model to be able to provide SOTA-level performance at a fraction of the cost using a CPU-only Azure instance.

But we have still not addressed the final point, intellectual property, and our hypothesis is that financial institutions will be very finicky about their data staying on their own systems. In our present minimum viable product (MVP) some data does go to Anthropic, although the way we store and retrieve the data means we send very few tokens.

So if the answer to one or more of the criteria specified above is no, you may have to consider fine-tuning an LLM. It could be a closed-source one such as GPT-4, an open-weights model such as Llama 3, or an open-source model such as Dolly.

Given that we had one "No" answer and there were still 6 percentage points of accuracy to be gained, we decided to fine-tune an LLM for our use case, the LLM of choice being Mistral 7B Instruct v0.2. You can find other commercially usable LLMs here.

How Do I Fine-tune LLMs?

This is where the hard part begins.

Although startups such as Lamini (https://lamini-ai.github.io/) and Mistral AI (https://docs.mistral.ai/capabilities/finetuning/), among others, have built frameworks that let you install a software development kit (SDK) and save you from writing a whole load of PyTorch and Hugging Face code by making fine-tuning easier, there are still a lot of things to consider. And when I say lots, I literally mean lots. So needless to say, fine-tuning too, like all things machine learning, is an iterative process.

But first things first - what even is fine-tuning? Neural networks, and consequently the transformer, and consequently all LLMs, are made up of matrices. A matrix is the mathematical term for a collection of numbers arranged in two dimensions; collections in three or more dimensions are called tensors. LLMs perform operations, such as the now-famous attention operation, over these matrices (Image 3) to generate text in response to a user query.

Image 3 - LLMs perform operations over many matrices. Fine-tuning edits values of some or all of these matrices | Source -
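For reference, the now-famous scaled dot-product attention is itself just matrix arithmetic over the query, key and value matrices $Q$, $K$ and $V$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors.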

Fine-tuning is changing the values of the numbers stored within these matrices on the basis of new training data. This makes the LLM do your bidding and is a form of transfer learning, as you already had an LLM good at something and are now transferring its capabilities to a new task.

That is the definition. Now the logical next question would be - what is the best way to change the values of these matrices?

When fine-tuning, you can either choose to update all the parameters of a large language model or update just a subset of them. You can also choose to add more parameters that you have fine-tuned and merge them with your base LLM. Choosing to fine-tune a small percentage of parameters compared to the LLM's total parameters is called parameter-efficient fine-tuning (PEFT) and has been proven to work well. Just for reference, GPT-4 is rumoured to have 1.76 trillion parameters.

If you look at Image 4, it provides a Venn diagram of PEFT methods applicable to large language models. You can even mix and match these methods to suit your use case.

Image 4 - Parameter-efficient fine-tuning methods taxonomy. There are lots of ways to fine-tune a language model | Source -

What seems like a word-and-acronym soup is actually a set of methods developed by researchers over the years to efficiently fine-tune language models. Some methods go back to 2019.

The task now is to figure out which method will perform best for your use case. This is a tough thing to do. The easier way out is to take a model and fine-tune it fully, which obviates the need to choose among these methods. But full fine-tuning requires a lot of data and computational resources. The Chinchilla paper by DeepMind estimated that a model should be trained on 20 times as many tokens as it has parameters. And in terms of memory requirements, a general rule of thumb is that training an LLM requires 12-20 times as much memory as its raw weights occupy. So if you are trying to fully fine-tune a 7-billion-parameter model, you may need 140 billion tokens of training data and up to 520 GB of GPU memory. The latter calculation is based on the assumption that the model parameters are stored in the default Float32, which is worth 4 bytes per parameter, so 7 billion parameters roughly translate to 26 GB.
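Spelling out that arithmetic (the 12-20x multiplier is a heuristic covering gradients, optimizer states and activations, not an exact figure):

```python
# Back-of-the-envelope estimates for fully fine-tuning a 7B-parameter model
params = 7e9

# Chinchilla rule of thumb: ~20 training tokens per parameter
tokens_needed = 20 * params              # 140 billion tokens

# Float32 weights: 4 bytes per parameter
weights_gib = params * 4 / 2**30         # ~26 GiB of raw weights

# Training memory: 12-20x the raw weight footprint
low, high = 12 * weights_gib, 20 * weights_gib   # roughly 313 to 522 GiB

print(f"tokens: {tokens_needed:,.0f}, GPU memory: {low:.0f}-{high:.0f} GiB")
```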

"Notably, full fine-tuning necessitates substantial computational resources and labeled data, as the model is trained from scratch for the specific target task. Moreover, as pretrained language models grow in size and with the advent of LLMs containing billions of parameters, full fine-tuning places even greater demands on computational resources. In contrast, PEFT methods aim to alleviate these requirements by selectively updating or modifying specific parts of the PLMs while still achieving performance comparable to full fine-tuning," note the authors of the paper PEFT Methods for Pretrained Language Models: A Critical Review and Assessment .

For those not bestowed with so much data and computational resources, the best option is PEFT. Now it comes down to what kind of PEFT to use.

Image 5 - Comparison of the average 5-shot MMLU test accuracy of LLaMA-7B and LLaMA-13B models fine-tuned with Alpaca | Source -

Researchers have reported metrics (Image 5) comparing full fine-tuning with PEFT methods such as LoRA. The table also provides the trainable parameters (# TPs), the total parameters in the final model (# AP), the percentage of parameters trained (% Params) and the 5-shot MMLU accuracy. MMLU is short for massive multitask language understanding and is used as a general benchmark.

Importantly, you should not fixate on these benchmarks as they may not pertain to your use case. The general wisdom is to take them as a starting point, improve performance on your use case, and iterate to find a fine-tuning method that works.

But Does Fine-tuning Even Work?

As I mentioned before, we are working with financial data, and for our present MVP the data was structured corporate financials, which include the balance sheet, cash flow statement, profit and loss statement and financial ratios.

We started off with Mistral 7B Instruct v0.2 as our base model, and without any fine-tuning the model's accuracy was around 80% on the ground truth dataset that we have built. Against the criteria defined earlier, we were not performing well on accuracy out of the box, so the model obviously had to be fine-tuned.

Like anyone else engaged in this exercise, we also went through multiple iterations and settled on PEFT using adapters coupled with QLoRA. Let me break that down superficially.

LoRA, or Low-Rank Adaptation, essentially introduces low-rank matrices that are added to the model parameters and trained, while the original model parameters remain fixed. These low-rank matrices have far fewer parameters and hence are computationally efficient to train. QLoRA, or Quantized LoRA, does the same thing but first converts the model's parameters to a less memory-intensive format, quantizing the frozen base weights down to 4-bit precision in the original paper. This reduces the memory requirements further.
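Concretely, for a frozen pretrained weight matrix $W$, LoRA learns an additive low-rank update and only ever trains the two small factor matrices:

$$W' = W + BA, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)$$

With, say, $d = k = 4096$ and rank $r = 16$, the update adds about 131 thousand trainable parameters against the frozen matrix's roughly 16.8 million.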

The adapters here are small neural networks, independent of the language model, that are inserted between the layers of the language model you are fine-tuning. The idea is to train these small neural networks on your task-specific data and use them at the time of inference. QLoRA reduces the memory footprint of the overall setup even further by quantizing the frozen base model. I will not go into the technical details of these methods here, as Hugging Face and Anyscale have done a better job of explaining them and how to implement them here and here.
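For the curious, here is a minimal sketch of what a QLoRA setup looks like using the Hugging Face transformers, peft and bitsandbytes libraries; the rank, alpha and target modules are illustrative hyperparameters, not our production configuration:

```python
# QLoRA sketch: 4-bit quantized base model + trainable low-rank adapters
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matrix math in bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

From here the model trains like any other Hugging Face model, for instance with the Trainer or TRL's SFTTrainer, on your task-specific dataset.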

In terms of performance, the fine-tuned model is almost 100% accurate on our ground truth dataset. This performance is better than GPT-4 Turbo (Preview) and Claude Haiku on the same test dataset. But there are still some efficiencies to be gained in terms of inference times, so that's another iterative process that we are engaged in.

And given that this is a 7B-parameter model, the compute infrastructure required to run it in production is not prohibitively expensive. So you can address all six criteria highlighted in the "When Do I Need To Fine-tune LLMs?" section using a small model that you have properly fine-tuned.

In addition to reducing inference times, we are also researching, with the help of Intel, how small the base model can be made. Our belief is that the future of enterprise AI is going to be a lot of purpose-specific fine-tuned LLMs controlled by a router that dictates which model should run when. This kind of architecture ties neatly into agentic architectures that require an LLM-based solution to do multiple things before providing an answer.
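A toy sketch of what such a router could look like; the specialist model names are hypothetical, and a production router would be a trained classifier rather than keyword rules:

```python
# Hypothetical model router: pick a purpose-specific fine-tuned LLM per query
SPECIALISTS = {
    "ratios": "acme/financial-ratios-7b-ft",   # hypothetical fine-tuned models
    "cashflow": "acme/cash-flow-7b-ft",
    "general": "mistralai/Mistral-7B-Instruct-v0.2",
}

def route(query: str) -> str:
    """Return the model best suited to the query (keyword rules for illustration)."""
    q = query.lower()
    if "ratio" in q:
        return SPECIALISTS["ratios"]
    if "cash flow" in q:
        return SPECIALISTS["cashflow"]
    return SPECIALISTS["general"]

print(route("What was the company's current ratio in FY23?"))
```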

So if you develop the capability to effectively and efficiently fine-tune language models, or neural networks in general, it would place you in a very good position to exploit the "buy, build, and partner" business model that may become the norm for enterprises.

Plug: If this piques your interest and you would like to partner with us, see a demo of what we have built, or try out the MVP, you can reach out to us at [email protected].

