Why I Don't Like Fine-Tuning LLMs: A Case for Graph-Based Prompt Engineering

As of January 2025, a common pattern for trying to improve the quality and performance of LLM workflows is to fine-tune a particular model for specific use cases, based on a curated dataset.

The idea is compelling: take a general-purpose model and teach it specific domain knowledge. Want a cybersecurity expert that knows GDPR inside out? Just fine-tune a model with your security documentation and regulatory frameworks. Simple, right? What could go wrong with this? :)

Based on my vCISO and vCTO instincts and experience, I think we are going down the wrong path. While I know this might be controversial, I believe there are fundamental security, operational, financial and quality problems with fine-tuning.

Instead, I propose that we should be using off-the-shelf models and focusing the engineering efforts on prompt engineering using graph-based knowledge structures.

Let me explain why.

My Concerns and Challenges with Fine-Tuning Models

When organisations fine-tune models, they're essentially hoping to create a specialised GenAI model that deeply understands their domain.

But there are fundamental problems with this approach that many aren't considering:

  1. We're Working with a Black Box: As of January 2025, we still don't understand how these models, or the fine-tuning process, actually work. We see models acquiring new behaviours, but we don't really know how the data is retained or processed. You might achieve what you want, but you can't be sure what else you're affecting or what else is being retained. This lack of transparency means we can't guarantee deterministic behaviour or fully understand what's happening inside our models.
  2. Data Evolution is a Nightmare: Let's take GDPR as an example. You need to feed the model not just the regulation itself, but country-specific interpretations, case law, enforcement actions, and regular updates. Each training iteration builds on previous ones, creating a compound effect where behaviours might look correct but are actually the result of multiple attempts at getting it right. It's technical debt waiting to explode. More importantly, when you find mistakes or need to update information, you're building on top of previous training iterations, which means you're never really starting from a clean slate.
  3. Hidden Behaviours: Recent research shows that once a model acquires certain behaviours during training, they can be impossible to remove – even when the model claims it's not exhibiting them. Think of it like a silent backdoor in your system. What's worse, we're discovering new ways these models can be manipulated through various forms of prompt injection and payload delivery. You could have malicious behaviours hiding in seemingly innocent training data, waiting to be triggered by specific inputs or conditions. It's like having a time bomb in your system that you can't see or remove.
  4. The Cost Equation Doesn't Add Up: Organisations often justify fine-tuning as a one-time investment that pays off through reuse. But what happens when you discover unwanted behaviours? Or when you lose trust in the model? You often end up having to "delete" the model and start over. It's like building a house on quicksand – the foundation keeps shifting under you. The assumed cost benefits often don't materialise when you factor in the need for retraining, updates, and potential model replacements.
  5. The Lock-in Problem: When you fine-tune a model, you're essentially locking yourself into a specific version of a specific model. This creates major limitations in your workflow and cost structure: for example, you can't easily leverage newer, and usually more capable, versions of the chosen model. More worryingly, you're prevented from using smaller and usually more efficient models that might do the job just as well. Over time, this lock-in will prevent you from taking advantage of cheaper or more powerful alternatives, and you will be stuck with your initial choice unless you start again from scratch, which has high costs.
  6. Training Inefficiency Spiral: The way fine-tuning works creates a problematic cycle. You're not just adding new information; you're building layers of training one on top of another. This means you can't easily optimise your training process because you're always building on previous iterations; if you need to retrain, you're stuck with the same inefficient path because you haven't had the chance to experiment with different approaches; and the training data itself becomes part of your technical debt.
  7. Unknown Attack Vectors: We're still discovering new ways these models can be compromised. There could be hidden instructions activated by specific patterns, malicious payloads embedded in training data, vulnerabilities we haven't even discovered yet or attack vectors that emerge from the interaction between different layers of training. What will the "LLM's buffer overflows" look like?
  8. The Provenance Problem: When issues arise, it becomes nearly impossible to: trace exactly where problematic behaviours came from, understand which training data influenced which behaviours, determine how different training iterations interact and establish clear provenance for the model outputs
  9. The Happy Path Fallacy: Most fine-tuning approaches assume ideal conditions where all training data is benign, there's no malicious intent and data isn't contradictory. But in the real world, these assumptions rarely hold true. Even seemingly legitimate training data might contain hidden payloads or problematic patterns that only become apparent later.
  10. No Version Control of the Training Workflow: How do you manage the versions and diffs of the multiple models created during the fine-tuning process? Since each training iteration creates, by design, a new version of a 'black box' model, you can't easily diff between versions, and rolling back to previous states is practically impossible. More worryingly, you don't have the ability to map out which training data influenced which behaviours.

The fundamental issue here is that we're trying to modify something we don't fully understand, in ways we can't fully control, with consequences we can't fully predict.

You might achieve your immediate goal, and it might look like it is 'working', but you have no idea what else you might be affecting in the process, or the full range of behaviours and knowledge that are actually in there.

But I have a couple more concerns ...

Additional Challenges with Fine-Tuning

The Personalisation Paradox

One of the biggest problems I see with fine-tuning is how it handles personalisation. When you fine-tune a model, you're trying to create one version that works for everyone. But think about it - in the real world, you need to handle multiple languages, diverse cultures, different job roles, and varying levels of expertise. Unless you're going to create separate fine-tuned models for each combination, it's simply not practical or cost-effective.

Let's say you're building a cybersecurity assistant. A CISO needs different context and language than a security analyst, who needs different information than a board member. Add in multiple languages and cultural contexts, and suddenly you're looking at dozens of potential model variations. Each one would need its own fine-tuning, its own maintenance, its own updates.

The Dangerous Dance with Sensitive Data

Here's where it can get really dangerous. If you are using customer data or sensitive information in the fine-tuning process... Don't!

I cannot stress enough how crazy this is.

When you fine-tune a model with sensitive data, you're essentially baking that data into the model permanently, and you lose control over how and when that information might surface in the future.

Think about this scenario: you fine-tune a model with your company's security incidents to help with threat detection. Now that information lives in the model. What happens when someone figures out the right prompt to make the model reveal those details? It's not a question of if, but when. We're just one clever prompt away from a massive data leak.

Even worse, once sensitive data is in the model, you can't reliably remove it. You might think you've trained it out, but as we've seen with recent research, these models can retain behaviours and information in ways we don't fully understand.

The Prompt Engineering Reality Check

Here's the kicker - even with a fine-tuned model, in order to give genuinely relevant responses to users, you still need sophisticated prompt engineering.

Many GenAI projects start by thinking that fine-tuning will eliminate the need for additional prompt engineering workflows, but when their solution hits the real world, they quickly find out that, in addition to the fine-tuned model's answers, prompt engineering will still be required to:

  • Provide context for specific queries
  • Control the format and style of responses
  • Guide the model's behaviour in different situations
  • Handle edge cases and special requirements
  • Try to prevent data leaks and other unintended behaviours

So you end up with the worst of both worlds - all the risks and complexity of fine-tuning, plus the ongoing need for prompt engineering anyway.

You're essentially adding a layer of opacity and risk without eliminating the need for careful prompt design and engineering.

The Scale and Evolution Problem

This becomes even more problematic when you consider how knowledge and requirements evolve. Let's say you're dealing with regulations like GDPR. You need to handle:

  • Regular updates to regulations
  • Different interpretations per country
  • Industry-specific implementations
  • Company-specific policies
  • Role-specific access levels

With a fine-tuned model, each of these changes potentially requires retraining; otherwise the model will be providing out-of-date or, even worse, misleading or false information.

Back to the False Economy Concern

GenAI projects often justify fine-tuning by saying it will save time and resources in the long run.

But when you factor in:

  • The need for multiple models to handle different use cases
  • The ongoing prompt engineering still required
  • The risk management for sensitive data
  • The maintenance of different versions
  • The cost of retraining when things go wrong

...you start to see that fine-tuning often creates more problems than it solves.

So why not remove the fine-tuning step?

My Proposed Approach: Graph-Based Prompt Engineering

Instead of fine-tuning models, I propose a different architecture that gives us more control and flexibility.

The goal here is to replicate the benefits of fine-tuning but at the prompt level, where we have full control and transparency.

The Core Architecture

At the heart of this approach are three fundamental components that work together to create a more flexible and maintainable system.

First, we treat LLMs as commodities: standard, off-the-shelf components that we can easily switch between based on our needs. This means we're never locked into a specific model or provider. We can run multiple models in parallel for verification, use specialised models for different tasks, and easily adopt new models as they emerge. This flexibility is crucial for maintaining both cost-effectiveness and technical agility.

Second, we build rich knowledge graphs that capture our domain expertise in a structured, navigable format. Taking GDPR as an example, we create a comprehensive graph that represents not just the core regulations, but also their interpretations, country-specific variations, case law, and real-world applications. These relationships are explicit and queryable, giving us clear visibility into how different pieces of information connect and influence each other.
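To make this more concrete, here is a minimal sketch of what a slice of such a graph could look like, using networkx; the node identifiers, attributes and relationship names are illustrative assumptions, not a fixed schema:

```python
import networkx as nx

# A tiny, illustrative slice of a GDPR knowledge graph.
# Nodes carry the knowledge (or a pointer to it); edges carry the relationship type.
gdpr = nx.MultiDiGraph()

gdpr.add_node("gdpr:art17", kind="article",
              title="Art. 17 - Right to erasure ('right to be forgotten')")
gdpr.add_node("edpb:guidance-erasure", kind="guidance",
              title="EU-level guidance on erasure requests")
gdpr.add_node("cnil:erasure-fr", kind="interpretation", country="FR",
              title="CNIL interpretation of erasure obligations")
gdpr.add_node("case:fr-example", kind="enforcement", country="FR",
              title="Example French enforcement action (hypothetical)")

# Explicit, queryable relationships between the pieces of knowledge.
gdpr.add_edge("gdpr:art17", "edpb:guidance-erasure", relation="clarified_by")
gdpr.add_edge("gdpr:art17", "cnil:erasure-fr", relation="interpreted_by")
gdpr.add_edge("cnil:erasure-fr", "case:fr-example", relation="applied_in")
```

The point is that every piece of knowledge is an explicit node, and every connection is an explicit, typed edge that we can inspect, version and query.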

Third, we implement dynamic prompt construction that navigates these knowledge graphs based on user queries. When someone asks a question, our system traverses the graph to build context-aware prompts that provide exactly what's needed - no more, no less. This approach maintains clear provenance of all information and allows us to control access based on user permissions.
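Here is a minimal sketch of that traversal step, assuming a graph shaped like the one above; the breadth-first walk, the country filter and the prompt wording are all illustrative choices, not a prescribed design:

```python
import networkx as nx

def build_context(graph: nx.MultiDiGraph, start_node: str,
                  country: str | None = None, max_depth: int = 2) -> list[str]:
    """Breadth-first walk from a starting node, collecting the titles of
    related knowledge, optionally filtered by country."""
    context, frontier, seen = [], [(start_node, 0)], set()
    while frontier:
        node, depth = frontier.pop(0)
        if node in seen or depth > max_depth:
            continue
        seen.add(node)
        attrs = graph.nodes[node]
        if country is None or attrs.get("country") in (None, country):
            context.append(f"[{attrs.get('kind', 'node')}] {attrs.get('title', node)}")
            frontier.extend((nbr, depth + 1) for nbr in graph.successors(node))
    return context

def build_prompt(graph: nx.MultiDiGraph, start_node: str, question: str,
                 country: str | None = None) -> str:
    """Turn the traversal results into a context-aware prompt."""
    context = "\n".join(build_context(graph, start_node, country))
    return (f"Use only the context below to answer the question.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

A call like build_prompt(gdpr, "gdpr:art17", "What are the data erasure requirements in France?", country="FR") would then produce a prompt that carries only the relevant, traceable context.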

How This Solves the Fine-Tuning Problems

This architecture directly addresses the core issues we face with fine-tuning models.

We gain complete transparency and control over our system - we know exactly what information we're providing to the model, where it came from, and how it's being used. There are no hidden behaviours or unexpected interactions to worry about.

Maintaining and updating the system becomes straightforward. When regulations change or new interpretations emerge, we simply update the information in our knowledge graph. There's no need for complex retraining cycles, and we can easily track and control versions of our knowledge base. If something goes wrong, rolling back changes is simple and predictable.

The security implications are particularly important. Without fine-tuning, we dramatically reduce the risk of hidden training data or unintended model behaviours. We can establish clear boundaries for sensitive information and control exactly what knowledge is accessible in different contexts.

Now it is 'just AppSec' and secure design patterns, which in 2025 we already know how to design, deploy, test and scale (especially if using GenAI to aid in the SDLC).

The Power of Customisation

Where this approach really shines is in its customisation capabilities. Instead of trying to bake everything into the model through fine-tuning, we can customise at multiple levels through our prompt engineering layer.

At the user level, we can adapt responses based on roles, expertise, and individual needs. A board member getting information about GDPR compliance needs a different perspective than a technical implementation team, and our system can provide that without maintaining separate models.

The system naturally handles cultural and linguistic adaptation. We can account for different languages, cultural contexts, and communication styles all through our prompt engineering layer. This "last mile" customisation ensures that information isn't just technically accurate, but also culturally appropriate and effectively communicated.

For organisations, we can layer in company-specific policies, industry interpretations, and relevant business context. This means the same underlying knowledge graph can serve different organisations in ways that feel native to their specific context and needs.
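As a rough sketch of that "last mile" layer (the roles, languages and wording below are invented purely for illustration):

```python
# Illustrative audience profiles; the roles, languages and instructions
# are assumptions, not a fixed schema.
AUDIENCE_PROFILES = {
    ("board_member", "en"): "Answer in plain business English, focus on risk and "
                            "accountability, avoid implementation detail.",
    ("security_analyst", "en"): "Answer technically, include article references "
                                "and concrete remediation steps.",
    ("board_member", "fr"): "Answer in clear French, focused on risk and governance.",
}

def personalise_prompt(base_prompt: str, role: str, language: str) -> str:
    """Wrap the same knowledge-graph context with audience-specific instructions."""
    style = AUDIENCE_PROFILES.get(
        (role, language),
        "Answer clearly and concisely for a general professional audience.")
    return f"{style}\n\n{base_prompt}"
```

The same knowledge-graph context goes in every time; only the framing around it changes per audience, which is exactly the kind of per-user variation that fine-tuning struggles to deliver.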

Btw, this is exactly what I'm doing at The Cyber Boardroom :) (my current startup)

Advanced Features and Real-World Benefits

In practice, this architecture becomes even more powerful when we implement advanced features like multi-model processing. We can run queries through multiple models in parallel, comparing and validating responses to ensure accuracy. The system can control how deep into the knowledge graph to go, adjusting the level of detail based on the user's needs and context.
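A minimal sketch of that multi-model step might look like the following; call_model is a deliberate placeholder for whichever provider SDK or local runtime you actually use, since no specific API is assumed here:

```python
import asyncio

async def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real provider call (hosted API, local model, etc.);
    swap in whichever SDK you actually use."""
    await asyncio.sleep(0)  # stand-in for the network round trip
    return f"[{model_name}] answer to: {prompt[:40]}..."

async def ask_many(prompt: str, models: list[str]) -> dict[str, str]:
    """Run the same prompt through several models in parallel so the answers
    can be compared, cross-checked or voted on."""
    answers = await asyncio.gather(*(call_model(m, prompt) for m in models))
    return dict(zip(models, answers))

# Example: responses = asyncio.run(ask_many(prompt, ["model-a", "model-b", "model-c"]))
```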

The operational benefits are significant. We typically see reduced computational costs and more efficient token usage compared to fine-tuned models. Response times are faster, and resource utilisation is more predictable.

The quality control aspect is particularly valuable. Because we have clear tracking of information sources and can easily validate responses, maintaining consistent output quality becomes much more manageable. When errors occur, they're easy to trace and correct without touching the underlying models.

This approach essentially creates a sophisticated customisation layer that gives us all the benefits of fine-tuning without its drawbacks. Rather than trying to teach models new things through opaque training processes, we're getting better at giving them the right context at the right time. It's a shift from modifying the model to mastering how we communicate with it.

Perhaps most importantly, we're future-proofing our investment - the architecture can easily adopt new models and integrate new knowledge without requiring fundamental changes to the system.

This is not a minor consideration. Any solution developed today needs to be able to leverage the rapid evolution of models like ChatGPT, Claude, Gemini, and LLaMA - not to mention entirely new architectures that haven't been released yet. The reality is that these frontier models are continuously getting better, faster, and cheaper. Projects that rely heavily on fine-tuning will eventually face an uncomfortable truth: newer, out-of-the-box models will likely outperform their carefully fine-tuned versions, yet those projects will be locked into their older, less capable models.

How It Works in Practice

Let's look at a real example. Say someone asks about data erasure requirements in France. Here's what happens:

  1. The system traverses our GDPR knowledge graph to find: core GDPR articles about erasure, EU-level guidance, French regulatory interpretations, and relevant enforcement cases
  2. This information is structured into a prompt that gives the model precise context for the question.
  3. We can then: run the same prompt through multiple models for verification, customise the response based on the user's role and context, and ensure we're only sharing appropriate, authorised information (a minimal end-to-end sketch of this flow follows the list)
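Pulling the earlier sketches together, and assuming the build_prompt, personalise_prompt and ask_many functions from above are in scope, the flow for this example might look roughly like this:

```python
import asyncio

# 1. Traverse the GDPR graph, keeping only EU-wide and French material.
prompt = build_prompt(gdpr, "gdpr:art17",
                      "What are the data erasure requirements in France?",
                      country="FR")

# 2. Adapt the same context to the person asking (role and language are illustrative).
prompt_for_analyst = personalise_prompt(prompt, role="security_analyst", language="en")

# 3. Run it through more than one model and compare the answers.
responses = asyncio.run(ask_many(prompt_for_analyst, ["model-a", "model-b"]))
```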

How Graph-Based Prompt Engineering Solves These Problems

Let's compare this 'Graph-based Solution' with the challenges previously identified with fine-tuning.

But the benefits go beyond just solving these problems. This approach actually enables capabilities that would be impractical with fine-tuning:

  1. True Personalisation at Scale: Rather than creating dozens of fine-tuned models for different scenarios, we can dynamically adjust our knowledge graph traversal and prompt construction based on user context, role, language, and needs.
  2. Real-time Knowledge Updates: When regulations change or new information becomes available, we can update our knowledge graph instantly. No waiting for retraining cycles or worrying about how new information might interact with existing model behaviours.
  3. Multi-Model Orchestration: Since we're not locked into specific fine-tuned models, we can intelligently route queries to different models based on their strengths, or run queries in parallel for validation and comparison.
  4. Granular Access Control: We can implement sophisticated access controls at the knowledge graph level, ensuring users only see information they're authorised to access, without risking unauthorised data exposure through model behaviours.
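To make point 4 concrete, here is a minimal sketch of clearance-based filtering at the graph level, assuming nodes carry a clearance attribute (the attribute name and levels are illustrative):

```python
import networkx as nx

def visible_subgraph(graph: nx.MultiDiGraph, clearance: str) -> nx.MultiDiGraph:
    """Keep only the nodes this clearance level may see, so prompt construction
    can never pull in material the user isn't authorised to access."""
    levels = {"public": 0, "internal": 1, "restricted": 2}  # illustrative levels
    allowed = [n for n, attrs in graph.nodes(data=True)
               if levels.get(attrs.get("clearance", "public"), 0) <= levels.get(clearance, 0)]
    return graph.subgraph(allowed).copy()
```

Prompt construction then runs against visible_subgraph(gdpr, clearance="internal") rather than the full graph, so authorisation is enforced before anything ever reaches the model.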

Here is a cool analogy: instead of trying to teach a model our domain knowledge through the opaque process of fine-tuning, we're building a smart librarian that knows exactly where to find the information we need and how to present it effectively.

We maintain control, transparency, and flexibility while still getting all the benefits of advanced language models.

Wrapping up

For me, the key to making LLMs work effectively isn't 'training' them to be domain experts.

The solution is to build systems that can effectively provide LLMs with the right context at the right time.

By focusing on graph-based knowledge representation and sophisticated prompt engineering, we can create more reliable, maintainable, and secure GenAI systems.

This approach might require more upfront thinking about knowledge structure, but it pays off in flexibility, control, and long-term maintenance.

It's the difference between trying to modify a black box and building a transparent, controllable system that leverages the best of what LLMs have to offer.

The goal isn't to make models smarter through training, but to get better at asking those models the right questions with the right context.

