Why I Don't Like Fine-Tuning LLMs: A Case for Graph-Based Prompt Engineering
Dinis Cruz
Founder @ The Cyber Boardroom, Chief Scientist @ Glasswall, vCISO, vCTO and GenAI expert
As of Jan 2025, a common pattern for improving the quality and performance of LLM workflows is to fine-tune a particular model for specific use cases, based on a curated dataset.
The idea is compelling: take a general-purpose model and teach it specific domain knowledge. Want a cybersecurity expert that knows GDPR inside out? Just fine-tune a model with your security documentation and regulatory frameworks. Simple, right? What could go wrong with this? :)
Based on my vCISO and vCTO instincts and experience, I think we are going down the wrong path. While I know this might be controversial, I believe there are fundamental security, operational, financial and quality problems with fine-tuning.
Instead, I propose that we should be using off-the-shelf models and focusing our engineering efforts on prompt engineering using graph-based knowledge structures.
Let me explain why.
My Concerns and Challenges with Fine-Tuning Models
When organisations fine-tune models, they're essentially hoping to create a specialised GenAI model that deeply understands their domain.
But there are fundamental problems with this approach that many aren't considering.
The fundamental issue here is that we're trying to modify something we don't fully understand, in ways we can't fully control, with consequences we can't fully predict.
You might achieve your immediate goal, and it might look like it is 'working', but you have no idea what else you are affecting in the process, or what the full range of behaviours and knowledge inside the model actually is.
But I have a couple more concerns ...
Additional Challenges with Fine-Tuning
The Personalisation Paradox
One of the biggest problems I see with fine-tuning is how it handles personalisation. When you fine-tune a model, you're trying to create one version that works for everyone. But think about it - in the real world, you need to handle multiple languages, diverse cultures, different job roles, and varying levels of expertise. Unless you are going to create separate fine-tuned models for each combination, it's simply not practical or cost-effective.
Let's say you're building a cybersecurity assistant. A CISO needs different context and language than a security analyst, who needs different information than a board member. Add in multiple languages and cultural contexts, and suddenly you're looking at dozens of potential model variations. Each one would need its own fine-tuning, its own maintenance, its own updates.
The Dangerous Dance with Sensitive Data
Here's where it can get really dangerous. If you are using customer data or sensitive information in the fine-tuning process... Don't!
I cannot stress enough how crazy this is.
When you fine-tune a model with sensitive data, you're essentially baking that data into the model permanently, where you lose control over how and when that information might surface sometime in the future.
Think about this scenario: you fine-tune a model with your company's security incidents to help with threat detection. Now that information lives in the model. What happens when someone figures out the right prompt to make the model reveal those details? It's not a question of if, but when. We're just one clever prompt away from a massive data leak.
Even worse, once sensitive data is in the model, you can't reliably remove it. You might think you've trained it out, but as we've seen with recent research, these models can retain behaviours and information in ways we don't fully understand.
The Prompt Engineering Reality Check
Here's the kicker - even with a fine-tuned model, in order to give users truly relevant responses, you still need sophisticated prompt engineering.
Many GenAI projects start by assuming that fine-tuning will eliminate the need for additional prompt engineering workflows, but when their solution hits the real world, they quickly find out that, on top of the fine-tuned model's answers, prompt engineering will still be required.
So you end up with the worst of both worlds - all the risks and complexity of fine-tuning, plus the ongoing need for prompt engineering anyway.
You're essentially adding a layer of opacity and risk without eliminating the need for careful prompt design and engineering.
The Scale and Evolution Problem
This becomes even more problematic when you consider how knowledge and requirements evolve. Let's say you're dealing with regulations like GDPR: you need to handle the core regulations, their evolving interpretations, country-specific variations, and new case law.
With a fine-tuned model, each of these changes potentially requires retraining, or the model will be providing out-of-date or, even worse, misleading or false information.
Back to the False Economy Concern
GenAI projects often justify fine-tuning by saying it will save time and resources in the long run.
But when you factor in the cost of repeated retraining cycles, the maintenance of multiple model variations, the security exposure of baked-in data, and the prompt engineering work that is still required anyway...
...you start to see that fine-tuning often creates more problems than it solves.
So why not remove the fine-tuning step?
My Proposed Approach: Graph-Based Prompt Engineering
Instead of fine-tuning models, I propose a different architecture that gives us more control and flexibility.
The goal here is to replicate the benefits of fine-tuning but at the prompt level, where we have full control and transparency.
The Core Architecture
At the heart of this approach are three fundamental components that work together to create a more flexible and maintainable system.
First, we treat LLMs as commodities: standard, off-the-shelf components that we can easily switch between based on our needs. This means we're never locked into a specific model or provider. We can run multiple models in parallel for verification, use specialised models for different tasks, and easily adopt new models as they emerge. This flexibility is crucial for maintaining both cost-effectiveness and technical agility.
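To make this concrete, here's a minimal sketch (in Python) of what such a model-agnostic layer might look like. The adapter names and provider functions are purely illustrative placeholders - they stand in for whatever SDK calls your chosen providers actually expose.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Minimal sketch of an LLM-as-commodity layer. The provider functions below are
# placeholders for whatever SDK your chosen providers actually expose.

@dataclass
class ModelAdapter:
    name: str
    call_model: Callable[[str], str]  # prompt in, completion out

def placeholder_provider_a(prompt: str) -> str:
    return f"[provider-a answer to: {prompt[:40]}...]"

def placeholder_provider_b(prompt: str) -> str:
    return f"[provider-b answer to: {prompt[:40]}...]"

REGISTRY: Dict[str, ModelAdapter] = {
    "primary":  ModelAdapter("provider-a", placeholder_provider_a),
    "verifier": ModelAdapter("provider-b", placeholder_provider_b),
}

def ask(prompt: str, roles=("primary", "verifier")) -> Dict[str, str]:
    """Run the same prompt through several models so their answers can be compared."""
    return {role: REGISTRY[role].call_model(prompt) for role in roles}

print(ask("Summarise GDPR Article 17 for a board member"))
```

Swapping a provider, or adding a third model purely for cross-checking, then becomes a one-line change in the registry rather than a retraining project.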
Second, we build rich knowledge graphs that capture our domain expertise in a structured, navigable format. Taking GDPR as an example, we create a comprehensive graph that represents not just the core regulations, but also their interpretations, country-specific variations, case law, and real-world applications. These relationships are explicit and queryable, giving us clear visibility into how different pieces of information connect and influence each other.
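As a rough illustration, here's how a small slice of such a graph could be represented with networkx. The node identifiers, relationship names and referenced guidance are illustrative, not a complete or authoritative model of GDPR.

```python
import networkx as nx  # pip install networkx

# Illustrative slice of a GDPR knowledge graph: nodes carry text and a source,
# edges carry an explicit, queryable relationship type. Node names are invented.
g = nx.DiGraph()

g.add_node("gdpr/article-17", kind="regulation",
           text="Right to erasure ('right to be forgotten')")
g.add_node("gdpr/article-17/fr", kind="country-variation",
           text="Country-specific guidance on erasure requests in France")
g.add_node("guidance/erasure", kind="interpretation",
           text="Regulatory guidance on handling erasure requests")

g.add_edge("gdpr/article-17", "gdpr/article-17/fr", relation="varies_by_country")
g.add_edge("gdpr/article-17", "guidance/erasure", relation="interpreted_by")

# Because relationships are explicit, we can ask "what qualifies Article 17?"
for _, target, data in g.out_edges("gdpr/article-17", data=True):
    print(f"gdpr/article-17 --{data['relation']}--> {target}")
```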
Third, we implement dynamic prompt construction that navigates these knowledge graphs based on user queries. When someone asks a question, our system traverses the graph to build context-aware prompts that provide exactly what's needed - no more, no less. This approach maintains clear provenance of all information and allows us to control access based on user permissions.
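A minimal sketch of that idea, assuming a toy in-memory graph: the traversal here is only one hop deep and the node content is invented for illustration, but it shows how context, provenance and role-awareness can all be assembled at the prompt layer.

```python
# Minimal sketch of dynamic prompt construction over a toy in-memory graph.
# Node content, identifiers and the single-hop traversal are illustrative.

KNOWLEDGE = {
    "gdpr/article-5": {
        "text": "Article 5 sets out principles for processing personal data, "
                "including data minimisation.",
        "source": "GDPR, Article 5",
        "related": ["guidance/data-minimisation"],
    },
    "guidance/data-minimisation": {
        "text": "Collect only the personal data actually needed for the stated purpose.",
        "source": "Internal interpretation note (illustrative)",
        "related": [],
    },
}

def build_prompt(question: str, start_node: str, role: str) -> str:
    # Walk one hop out from the starting node and keep the provenance of each fact.
    node_ids = [start_node] + KNOWLEDGE[start_node]["related"]
    context = "\n".join(
        f"- {KNOWLEDGE[n]['text']} (source: {KNOWLEDGE[n]['source']})"
        for n in node_ids
    )
    return (
        f"You are answering for a {role}.\n"
        f"Use only the context below and cite its sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("What does data minimisation require of us?",
                   "gdpr/article-5", role="board member"))
```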
How This Solves the Fine-Tuning Problems
This architecture directly addresses the core issues we face with fine-tuning models.
We gain complete transparency and control over our system - we know exactly what information we're providing to the model, where it came from, and how it's being used. There are no hidden behaviours or unexpected interactions to worry about.
Maintaining and updating the system becomes straightforward. When regulations change or new interpretations emerge, we simply update the information in our knowledge graph. There's no need for complex retraining cycles, and we can easily track and control versions of our knowledge base. If something goes wrong, rolling back changes is simple and predictable.
The security implications are particularly important. Without fine-tuning, we dramatically reduce the risk of hidden training data or unintended model behaviours. We can establish clear boundaries for sensitive information and control exactly what knowledge is accessible in different contexts.
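One hedged sketch of what such a boundary could look like in practice: sensitivity labels on graph nodes, filtered against the caller's clearance before anything reaches the model. The labels and node content are, again, invented for illustration.

```python
# Illustrative sketch of enforcing knowledge boundaries at the prompt layer:
# every node carries a sensitivity label, and nodes are filtered against the
# caller's clearance before any text reaches the model. Labels are invented.

NODES = [
    {"id": "policy/retention-summary", "sensitivity": "public",
     "text": "High-level summary of the data retention policy."},
    {"id": "incident/internal-report", "sensitivity": "restricted",
     "text": "Details of an internal security incident."},
]

CLEARANCE_LEVEL = {"public": 0, "internal": 1, "restricted": 2}

def allowed_context(nodes, user_clearance: str):
    limit = CLEARANCE_LEVEL[user_clearance]
    return [n for n in nodes if CLEARANCE_LEVEL[n["sensitivity"]] <= limit]

# A request made with 'internal' clearance never sees the incident details.
print([n["id"] for n in allowed_context(NODES, "internal")])
```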
Now it is 'just AppSec' and secure design patterns, which, in 2025, we already know how to design, deploy, test and scale (especially if using GenAI to aid in the SDLC).
The Power of Customisation
Where this approach really shines is in its customisation capabilities. Instead of trying to bake everything into the model through fine-tuning, we can customise at multiple levels through our prompt engineering layer.
At the user level, we can adapt responses based on roles, expertise, and individual needs. A board member getting information about GDPR compliance needs a different perspective than a technical implementation team, and our system can provide that without maintaining separate models.
The system naturally handles cultural and linguistic adaptation. We can account for different languages, cultural contexts, and communication styles all through our prompt engineering layer. This "last mile" customisation ensures that information isn't just technically accurate, but also culturally appropriate and effectively communicated.
For organisations, we can layer in company-specific policies, industry interpretations, and relevant business context. This means the same underlying knowledge graph can serve different organisations in ways that feel native to their specific context and needs.
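Here's a small, illustrative sketch of how those three levels (user role, language, organisational context) might be layered in the prompt itself; the framing text and parameter names are assumptions, not a prescribed format.

```python
# Illustrative sketch of layering user-, language- and organisation-level
# customisation in the prompt itself. Framing text and parameters are assumptions.

ROLE_FRAMING = {
    "board_member":     "Focus on business risk, liability and oversight duties.",
    "ciso":             "Focus on control objectives, gaps and remediation priorities.",
    "security_analyst": "Focus on concrete technical steps and evidence to collect.",
}

def frame_prompt(question: str, context: str, role: str,
                 language: str = "en", org_policy_note: str = "") -> str:
    framing = ROLE_FRAMING.get(role, "Answer clearly for a general audience.")
    org_layer = f"\nCompany-specific policy context: {org_policy_note}" if org_policy_note else ""
    return (f"Answer in language '{language}'. {framing}{org_layer}\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

print(frame_prompt("Are we meeting our GDPR erasure obligations?",
                   "Article 17 covers the right to erasure.",
                   role="board_member", language="fr",
                   org_policy_note="Erasure requests are handled by the privacy team."))
```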
Btw, this is exactly what I'm doing at The Cyber Boardroom :) (my current startup)
Advanced Features and Real-World Benefits
In practice, this architecture becomes even more powerful when we implement advanced features like multi-model processing. We can run queries through multiple models in parallel, comparing and validating responses to ensure accuracy. The system can control how deep into the knowledge graph to go, adjusting the level of detail based on the user's needs and context.
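For example, graph-depth control could be as simple as a depth-limited walk over the knowledge graph; this sketch uses networkx and invented node names to show the idea.

```python
import networkx as nx
from collections import deque

# Illustrative sketch of controlling how deep into the knowledge graph a query
# goes: depth 1 might suit a board summary, depth 2+ a detailed analyst view.
g = nx.DiGraph()
g.add_edges_from([
    ("gdpr/article-17", "gdpr/article-17/fr"),
    ("gdpr/article-17", "guidance/erasure"),
    ("guidance/erasure", "case-law/example-ruling"),
])

def nodes_within_depth(graph, start, max_depth):
    """Breadth-first walk that stops max_depth hops away from the start node."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in graph.successors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

print(nodes_within_depth(g, "gdpr/article-17", max_depth=1))  # shallow context
print(nodes_within_depth(g, "gdpr/article-17", max_depth=2))  # deeper context
```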
The operational benefits are significant. We typically see reduced computational costs and more efficient token usage compared to fine-tuned models. Response times are faster, and resource utilisation is more predictable.
The quality control aspect is particularly valuable. Because we have clear tracking of information sources and can easily validate responses, maintaining consistent output quality becomes much more manageable. When errors occur, they're easy to trace and correct without touching the underlying models.
This approach essentially creates a sophisticated customisation layer that gives us all the benefits of fine-tuning without its drawbacks. Rather than trying to teach models new things through opaque training processes, we're getting better at giving them the right context at the right time. It's a shift from modifying the model to mastering how we communicate with it.
Perhaps most importantly, we're future-proofing our investment - the architecture can easily adopt new models and integrate new knowledge without requiring fundamental changes to the system.
This is not a minor consideration. Any solution developed today needs to be able to leverage the rapid evolution of models like ChatGPT, Claude, Gemini, and LLaMA - not to mention entirely new architectures that haven't been released yet. The reality is that these frontier models are continuously getting better, faster, and cheaper. Projects that rely heavily on fine-tuning will eventually face an uncomfortable truth: newer, out-of-the-box models will likely outperform their carefully fine-tuned versions, while they remain locked into older, less capable models.
How It Works in Practice
Let's look at a real example. Say someone asks about data erasure requirements in France. The sketch below walks through what happens.
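Below is a hedged sketch of that flow, combining the pieces from the earlier sketches: the relevant graph nodes are selected, a role-aware prompt with provenance is assembled, and the result is sent to whichever off-the-shelf model is currently preferred. All node content and the call_llm placeholder are illustrative.

```python
# Hedged sketch of the end-to-end flow for "data erasure requirements in France".
# All node content and the call_llm placeholder are illustrative.

KNOWLEDGE = {
    "gdpr/article-17":    ("Article 17 grants individuals the right to erasure.",
                           "GDPR, Article 17"),
    "gdpr/article-17/fr": ("Country-specific guidance on erasure requests in France.",
                           "Country guidance (illustrative)"),
}

def call_llm(prompt: str) -> str:
    # Placeholder: swap in whichever off-the-shelf model is currently preferred.
    return f"[model answer based on a prompt of {len(prompt)} characters]"

def answer(question: str, node_ids, role: str) -> str:
    # 1. Pull the relevant nodes (and their sources) from the knowledge graph.
    context = "\n".join(f"- {KNOWLEDGE[n][0]} (source: {KNOWLEDGE[n][1]})"
                        for n in node_ids)
    # 2. Build a role-aware prompt with explicit provenance.
    prompt = (f"Answer for a {role}, citing the sources given.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    # 3. Send it to the model; nothing sensitive was ever baked into the model itself.
    return call_llm(prompt)

print(answer("What are the data erasure requirements in France?",
             ["gdpr/article-17", "gdpr/article-17/fr"], role="CISO"))
```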
How Graph-Based Prompt Engineering Solves These Problems
Let's compare this 'Graph-based Solution' with the challenges previously identified with fine-tuning.
But the benefits go beyond just solving these problems. This approach actually enables capabilities that would be impractical with fine-tuning.
Here is a cool analogy: instead of trying to teach a model our domain knowledge through the opaque process of fine-tuning, we're building a smart librarian that knows exactly where to find the information we need and how to present it effectively.
We maintain control, transparency, and flexibility while still getting all the benefits of advanced language models.
Wrapping up
For me, the key to making LLMs work effectively isn't 'training' them to be domain experts.
The solution is to build systems that can effectively provide LLMs with the right context at the right time.
By focusing on graph-based knowledge representation and sophisticated prompt engineering, we can create more reliable, maintainable, and secure GenAI systems.
This approach might require more upfront thinking about knowledge structure, but it pays off in flexibility, control, and long-term maintenance.
It's the difference between trying to modify a black box and building a transparent, controllable system that leverages the best of what LLMs have to offer.
The goal isn't to make models smarter through training, but to get better at asking those models the right questions with the right context.