Building AI models with open source LLMs
Justo Hidalgo
Chief AI Officer at Adigital. Highly interested in Responsible AI and Behavioral Psychology. PhD in Computer Science. Book author, working on my fourth one!
How Companies Can Leverage Existing Base Models to Build Improved AI Systems
At Adigital, we recently had the privilege of hosting an enlightening Adigital Academy session led by Daniel Vila Suero, the CEO of Argilla. The session was an eye-opener, providing deep insights into the world of LLM model development. Recognizing the value of this knowledge, I felt compelled to share these learnings in a relatively accessible format, especially for those who are not deeply technical but are keen to understand the nuances of building AI models in this new world of Generative AI. I hope I didn't miss anything critical ;)
Daniel Vila, with his expertise, guided us through the intricacies of leveraging existing AI models to create more refined, tailored solutions. This post aims to distill Daniel’s knowledge.
Companies are starting to seek ways to improve their business and competitiveness via new AI capabilities. One effective approach is leveraging existing base models, such as OpenAI's GPT-4 or Zephyr, to develop tailored solutions that meet specific business needs. Here's a detailed guide on how a company can use these models to build their own improved AI systems.
Step 1: Data Collection
The initial and crucial step involves gathering a dataset. A dataset for generative AI is basically a set of questions and answers. This could range from a thousand examples to whatever is available. For customer service applications, companies often already have an extensive collection of user queries, providing a solid foundation. If the aim is to develop an assistant in a field where user questions have not yet been collected, synthetic generation techniques can be employed. Tools like Notus (Argilla’s own LLM, just launched) or GPT-4 can be used to create questions from the relevant texts you want to start from, breaking them into sentences or paragraphs and generating potential queries for each.
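To make the synthetic generation step a bit more concrete, here is a minimal sketch using the OpenAI Python client: it splits a source text into paragraphs and asks an LLM to draft a few candidate questions per paragraph. The model name, prompt wording, and paragraph-based chunking are illustrative assumptions, not the exact setup discussed in the session.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_questions(document: str, questions_per_chunk: int = 3) -> list[str]:
    """Split a source text into paragraphs and ask an LLM to draft questions for each."""
    chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
    questions: list[str] = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4",  # or an open model such as Notus served behind an OpenAI-compatible API
            messages=[
                {"role": "system", "content": "You write questions a user might ask about a text."},
                {"role": "user", "content": f"Write {questions_per_chunk} questions about:\n\n{chunk}"},
            ],
        )
        # one question per line in the reply; strip bullets and blank lines
        questions.extend(
            line.strip("-• ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()
        )
    return questions
```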
Step 2: Initial Validation
Once a dataset is created, it should be validated. Human experts can use tools like Argilla to ensure that the generated questions are appropriate and contextually relevant. For instance, in a regulatory use case, questions might include, "What does section 2 of the regulation state?" or "How are application criticality levels defined?" Experts can rapidly validate hundreds of such questions.
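As a rough illustration of this review step, the sketch below pushes the generated questions to Argilla so experts can label them. It assumes Argilla's 1.x FeedbackDataset API; the instance URL, API key, field names, labels, and workspace are placeholders for the example.

```python
import argilla as rg

# connect to a running Argilla instance (URL and key are placeholders)
rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

dataset = rg.FeedbackDataset(
    fields=[rg.TextField(name="question")],
    questions=[
        rg.LabelQuestion(
            name="valid",
            title="Is this question appropriate and contextually relevant?",
            labels=["yes", "no"],
        )
    ],
)

# a couple of example questions from the regulatory use case above;
# in practice these would come from the synthetic generation step
generated_questions = [
    "What does section 2 of the regulation state?",
    "How are application criticality levels defined?",
]

records = [rg.FeedbackRecord(fields={"question": q}) for q in generated_questions]
dataset.add_records(records)
dataset.push_to_argilla(name="regulation-questions", workspace="experts")
```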
Step 3: Model Selection and Testing
After validation, select one or two available models (like Zephyr or OpenAI) and pass the questions to see if the responses are accurate, incorrect, or irrelevant. Another round of human validation is crucial here, where experts evaluate the responses, providing a preliminary assessment of the model's effectiveness.
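To illustrate the testing loop, here is a small sketch that queries one of the open models mentioned above (Zephyr, via the Hugging Face transformers library) with each validated question, so the answers can then be sent back to Argilla for expert review. The system prompt and generation settings are assumptions for the example.

```python
import torch
from transformers import pipeline

# Zephyr 7B, loaded as a standard text-generation pipeline
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


def answer(question: str) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful assistant for regulatory questions."},
        {"role": "user", "content": question},
    ]
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = pipe(prompt, max_new_tokens=256, do_sample=False)
    # the pipeline returns prompt + completion, so keep only the new text
    return output[0]["generated_text"][len(prompt):].strip()


validated_questions = ["What does section 2 of the regulation state?"]
# candidate answers, ready for another round of human evaluation
candidate_answers = {q: answer(q) for q in validated_questions}
```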
Step 4: Supervised Fine Tuning (SFT)
If the base model doesn't meet expectations, or if there's a preference for open-source models, a larger dataset (around 5,000-10,000 question-answer pairs) is necessary for Supervised Fine Tuning.
Supervised Fine Tuning (SFT) is a technique used in refining AI models, particularly Large Language Models (LLMs). It involves training the AI model on a specific, labeled dataset to adapt it to particular tasks or improve its performance in certain areas. This process is 'supervised' because it uses a dataset where the desired outputs (like correct answers to questions) are already known and labeled. By training the model on this dataset, the model learns to generate more accurate and relevant responses based on the specific requirements of the task. As Daniel explained, SFT is a crucial step in customizing base AI models to suit specific applications, ensuring that they respond more accurately to the kind of inputs they will encounter in their designated use case.
In this example, the process involves adjusting a model like Zephyr to better suit specific needs, without starting from scratch with a base model like Mistral.
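For readers who want to see what this step can look like in code, below is a minimal sketch of supervised fine-tuning with the TRL library's SFTTrainer, roughly the kind of tooling used to go from a base model like Mistral to a chat-tuned model. The dataset file, the pre-formatted "text" field, and all hyperparameters are simplifying assumptions (and the exact argument names depend on your TRL version), so treat this as a sketch rather than a recipe.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# each JSONL row holds a "text" field: the question and its reference answer,
# already rendered with the model's chat template
dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",   # the base model to adapt
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=1024,
    args=TrainingArguments(
        output_dir="./sft-model",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        learning_rate=2e-5,
    ),
)
trainer.train()
```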
Step 5: Post-SFT Evaluation
After Supervised Fine Tuning (SFT), the next crucial phase is evaluating the accuracy of the model's responses. If the model's performance aligns with expectations, no further adjustments are needed: we have our model!!! Cheers.
However, if there's potential for improvement, this is where Direct Preference Optimization (DPO) becomes pivotal.
DPO, a novel technique akin to Reinforcement Learning from Human Feedback (RLHF), marks a significant leap in Large Language Model (LLM) refinement. Unlike RLHF, which had a multifaceted objective and faced challenges integrating Reinforcement Learning (RL) intricacies into Natural Language Processing (NLP), DPO offers a more direct approach. Its primary optimization relies on binary cross-entropy loss, simplifying the LLM refinement process significantly.
Wait, what??????
Sorry, my geeky face showed up unexpectedly. Let me start again.
Direct Preference Optimization (DPO) simplifies the improvement of AI language models: it uses a straightforward measure of how close the model's answers are to the responses humans prefer, which makes the whole process of making the AI smarter much easier to manage.
In practice, DPO involves creating a dataset of preferences with questions and multiple answers. Each answer is then evaluated, not just for correctness, but also for its alignment with direct human preferences. This process not only refines the model's accuracy but also ensures it aligns more closely with human judgment and nuances in responses.
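Continuing the hypothetical code from the SFT step, here is a rough sketch of the DPO stage using TRL's DPOTrainer. The preference dataset layout (prompt / chosen / rejected columns) is the standard one for this trainer, but the file name, beta value, and other hyperparameters are placeholders, and argument names vary across TRL versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# preference data: each row has a prompt, a preferred ("chosen") and a dispreferred ("rejected") answer
preferences = load_dataset("json", data_files="preferences.jsonl", split="train")

model = AutoModelForCausalLM.from_pretrained("./sft-model")      # model produced by the SFT step
ref_model = AutoModelForCausalLM.from_pretrained("./sft-model")  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained("./sft-model")

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,  # how strongly the tuned model must stay close to the reference
    train_dataset=preferences,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="./dpo-model",
        per_device_train_batch_size=2,
        learning_rate=5e-7,
    ),
)
trainer.train()
```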
Expert note 1: Dealing with GPU Costs and Availability
A significant consideration in AI development is the cost and availability of GPUs, especially for smaller ventures. For context, transitioning from a base model like Mistral to an SFT and then a DPO model (like Zephyr) can cost around 500€ with up to 8 hours of training time, significantly less than training a base model from scratch. However, availability of GPUs can be a challenge, with shortages often reported in major cloud services like Google Cloud or Amazon.
Expert note 2: Incremental Improvement and Internal Resources
Start with smaller datasets and scale gradually. Many large companies begin with their in-house AI development teams running these tests and trials. This approach reduces the need for external expertise and the cost of creating a perfect dataset up front. Moreover, the end-user teams, such as those operating customer support chatbots, can play a pivotal role in continually refining the dataset.
Expert note 3: Retrieval Augmented Generation
During the session there were some questions about how this approach compares with another technique called Retrieval Augmented Generation (RAG). RAG is a distinct approach compared to methods like Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) used in refining AI models. While SFT and DPO focus on improving a model's responses based on a fixed dataset and human preferences, RAG introduces an additional step where the AI model dynamically retrieves information from a large database or document set to enhance its answers. Essentially, RAG combines the generation capabilities of models like GPT-4 with a retrieval component, allowing the AI to pull in external information to provide more informed, accurate, and contextually relevant responses. This makes RAG particularly useful for tasks requiring up-to-date or specific information that may not be contained within the model's training data. Please check this previous article of mine where I showed some code to build a simple RAG application.
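For readers who want a feel for the retrieval part without jumping to that article, here is a tiny illustrative sketch: it embeds document chunks with a sentence-transformers model, retrieves the chunks most similar to a question, and prepends them to the prompt. The embedding model, the in-memory similarity search, and the prompt wording are all assumptions chosen to keep the example short.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# replace with real passages from the company's documents
chunks = [
    "Section 2 of the regulation defines the scope of application...",
    "Criticality levels are assigned based on potential user impact...",
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)


def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Return the chunks most similar to the question (cosine similarity on normalized vectors)."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q_vec
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]


def build_prompt(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# build_prompt(...) is then sent to any generative model, e.g. the fine-tuned one from the previous steps
```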
So this is all! Building a custom AI model on top of existing base models involves a systematic approach starting from data collection, through validation and model testing, to fine-tuning and final evaluations. The process is cost-effective, especially when leveraging open-source models, and allows for gradual improvement, harnessing internal resources.
I'd like to extend a heartfelt thank you to Daniel Vila for his contribution to our Adigital Academy. For those interested in delving deeper, particularly into how Argilla built their Large Language Model, Notus, I highly recommend visiting their detailed blog post.
We at Adigital will continue with introductory and detailed talks about artificial intelligence and other emerging trends in 2024. Consider joining us to obtain access to this knowledge!!!
ML & DevRel @ Hugging Face ?? || ?????? Cooking, ?????? Coding, ?? Committing
1 年Interesting article! Tom Aarsen also wrote an interesting tutorial about supervised fine-tuning Mistral AI for chat completion, which might be interesting for some readers. https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/training-llm-mistral-sft.html