Experimenting with MAA GPT
Computers and aircraft are my jam. Which is to say, I'm deeply passionate about both aviation and computing. So I'm very fortunate to find myself in a fairly unusual and genuinely exciting position at the intersection of these two fields. I also believe there's a lot that both areas can learn from each other, so this is written for the two main communities that seem to dominate my LinkedIn feed: people who love computers, and people who love flying (and of course, those who love both!). This article tells the story of a practical project involving the development and evaluation of MAA GPT, a custom GPT integrated with the UK Military Aviation Authority (MAA) regulations.
Bridging the Gap Between Aviation and Computing
The motivation behind this project stems from a genuine interest in harnessing the power of technology to make complex aviation regulation easier to navigate and understand.
However, the current discourse around AI is often clouded by hype and unrealistic expectations. On one hand, AI is hailed as a revolutionary technology capable of transforming every aspect of our lives. On the other, there is scepticism and fear about its limitations and potential risks. In the context of aviation, where reliability is paramount and correctness is critical, it is crucial to have a clear and balanced understanding of what AI can and cannot do.
Through this project, I aim to demystify Large Language Models (LLMs) by telling you about my practical experimentation with MAA GPT. This involves not just creating the tool (that's the easy bit), but testing it to evaluate its performance, and explaining why I think this latter part is critical in moving the conversation beyond hyperbole and towards delivering value. By doing so, I hope to provide a grounded perspective on what these tools can realistically offer today.
Understanding LLMs and RAG
To appreciate the potential and limitations of MAA GPT, it’s important to understand the foundational technologies behind it: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG).
Large Language Models (LLMs)
Large Language Models (LLMs) are a type of artificial intelligence trained on vast amounts of text data to generate human-like responses. That’s a bit of a simplification, and we could talk all day about ‘transformer architectures’, ‘embedding spaces’ and ‘vector databases’, but that’s pretty much what they do, so we’ll skip some of the technical chat for now. Practically speaking, these models have demonstrated remarkable abilities in understanding and generating natural language. They can converse, write essays, summarise information, and even generate creative content. LLMs achieve this by learning patterns, structures, and associations within the data they are trained on. Examples of LLMs include OpenAI's GPT-4o (which powers the latest version of ChatGPT and indeed this experiment), Google's Gemini, Anthropic’s Claude and Meta’s Llama. Each model has unique features and strengths, such as differences in size, architecture, and training methods, which make them suitable for different applications and use cases. The training process involves analysing billions of sentences to understand context, syntax, and semantics (but I said we’d skip the technical chat, so I’ll stop there).
Regardless of the model that is chosen, the performance of LLMs is highly dependent on the data they have been trained on. Most excel at general language tasks but can struggle with highly specialised or nuanced queries that require specific domain knowledge. For example, while an LLM might be able to write a persuasive essay on a broad topic like climate change, it might falter when asked a specific regulatory question in aviation, unless it has been specifically trained on such data. A problem with adding new data to the model is that it is very expensive (in terms of time, money and complexity; no matter what metric you look at, training LLMs is spenny), so retraining a model every day is not a feasible option.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) addresses some of the limitations of LLMs by integrating them with external knowledge bases. RAG combines the generative capabilities of LLMs with the precision of retrieval systems. When a question is asked of an LLM equipped with RAG, the model retrieves relevant information from trusted sources and uses this data to generate a more accurate and contextually relevant response.
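To make that retrieve-then-generate pattern concrete, here is a minimal sketch in Python. This is emphatically not how the ChatGPT GPT builder works under the hood (that's hidden from us); it just illustrates the idea, assuming the openai Python package, an API key in the environment, and a toy in-memory list of document chunks standing in for a real vector database.

```python
# Minimal retrieve-then-generate sketch: embed the question, find the most
# similar chunk(s) of the knowledge base, and hand them to the LLM as context.
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    """Turn text into a vector so similarity can be measured numerically."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

# Toy 'knowledge base': in practice these would be chunks of the MRP documents.
chunks = [
    "RA 4805 covers guarding of Air Systems at foreign or civilian airfields ...",
    "The glossary defines the abbreviation RA as Regulatory Article ...",
]
chunk_vectors = [embed(c) for c in chunks]

def answer(question: str, top_k: int = 1) -> str:
    # 1. Retrieve: rank chunks by cosine similarity to the question.
    q = embed(question)
    scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in chunk_vectors]
    context = "\n".join(chunks[i] for i in np.argsort(scores)[-top_k:])
    # 2. Generate: ask the LLM, supplying the retrieved text as context.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the supplied regulatory extracts."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```

The key design point is that the model never needs retraining: keeping the answers current is a matter of keeping the document chunks current.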
Looking a little bit at our use case of aviation regulations, a simple generative model might struggle with the specificity required for answering detailed regulatory questions. By using RAG, the model can pull in specific information from an updated and authoritative knowledge base, ensuring that the responses are not only fluent but also precise and reliable. This hybrid approach enhances the model's ability to provide useful answers in specialised fields where the stakes for correctness are high.
So, while LLMs like GPT-4o offer impressive capabilities, their effectiveness can be significantly enhanced through the integration of retrieval systems in RAG. This combination leverages the strengths of both technologies, making it particularly useful for applications requiring high correctness and domain-specific knowledge, such as aviation.
Fine-Tuning LLMs
Fine-tuning Large Language Models (LLMs) is another way of combining custom content with the inherent capabilities of LLMs and can enhance their performance for specific tasks. Fine-tuning involves taking a pre-trained LLM and further training part of it on a smaller, domain-specific dataset. This process allows the model to adapt to the nuances and particularities of specialised fields. By exposing the LLM to targeted data, we can, in theory, improve its responses so that they better meet the needs of the domain. Fine-tuning not only addresses the limitations of general LLMs in handling specialised queries but can also enhance the effectiveness of Retrieval-Augmented Generation (RAG) by ensuring that the generative component of the model is better aligned with the specific knowledge base it retrieves information from.
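In practice, most of the effort in fine-tuning goes into curating the dataset rather than the training itself. As a rough illustration, the sketch below assembles question and answer pairs into the chat-style JSONL format that OpenAI's fine-tuning endpoint expects (to the best of my knowledge; check the current documentation before relying on it). The example pairs are made up for illustration.

```python
# Sketch: turn curated Q&A pairs into a JSONL fine-tuning dataset.
import json

examples = [
    {
        "question": "What does the acronym RA stand for in the MAA Regulatory Publications?",
        "answer": "RA stands for Regulatory Article.",
    },
    # ... many more curated question/answer pairs ...
]

with open("maa_finetune.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "You are an assistant for UK Military Aviation regulation."},
                {"role": "user", "content": ex["question"]},
                {"role": "assistant", "content": ex["answer"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```
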
Say again…
One characteristic of LLMs (even when combined with RAG) that is worth noting is that they do not always provide the same answer when asked the same question multiple times. This variability can be useful in some applications (creativity and dealing with ambiguity), but perplexing and problematic in others (where you just need the correct answer). This characteristic can perhaps be partially explained through an analogy with human behaviour. Just as people might give different answers to the same question depending on context, mood, or recent experiences, LLMs generate responses based on probabilities. Each word in the response is chosen based on the likelihood determined by the model, which can vary slightly with each query. This is not to compare the mechanisms by which humans and LLMs come up with answers, but rather to suggest how we might test a system with an LLM at its core. In the same way we test humans by conducting initial and periodic examinations, we can evaluate LLMs through consistent and repeated testing with a variety of questions and compare the answers against a ‘gold standard’. This idea is at the core of the experiment I conducted with MAA GPT, but first I’ll tell you a little bit about how I made it.
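You can see this variability for yourself with a few lines of code. The sketch below asks the same question five times and counts how many distinct answers come back; 'gpt-4o' is used here as a stand-in, since a custom GPT like MAA GPT has no public API of its own.

```python
# Ask the same question several times and compare the answers: a crude way
# to observe the variability described above.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
question = "What does the acronym RA stand for in the MAA Regulatory Publications?"

answers = []
for _ in range(5):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    answers.append(completion.choices[0].message.content.strip())

# Identical prompts rarely produce byte-identical answers, which is exactly
# why repeated testing against a gold standard matters.
for text, count in Counter(answers).items():
    print(f"{count}x: {text[:80]}...")
```
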
Creating MAA GPT
GPTs are custom AI models within the ChatGPT ecosystem that users can design to perform specific tasks. These models leverage the capabilities of the base ChatGPT model but are tailored with specific instructions, data, and even integrations to meet particular needs. Anyone can create a custom GPT; all you need is a ChatGPT Plus account. Previously, users also needed a (paid-for) ChatGPT Plus account to use GPTs, but as of last month anyone can use them with a free ChatGPT account.
To create MAA GPT I started by defining the specific tasks MAA GPT needed to perform, focusing on assisting users with navigating and understanding UK Military Aviation regulations. This involved outlining its responsibilities, such as answering questions about regulatory documents. Next, I customised the behaviour and personality of MAA GPT. I set instructions to ensure it always referenced specific regulatory articles and documents in its responses. This step was essential to make sure, as far as I can, that the information provided is as accurate and relevant as possible.
Following that, I incorporated the custom data into MAA GPT. This involved uploading detailed regulatory documents like MAA01, MAA02, and the various Regulatory Article (RA) series, which the GPT uses to inform its responses. I’ve also recently subscribed to be alerted to any changes in the regulations, so I can make sure MAA GPT is always up to date. If I’m entirely honest, I’m not exactly sure which mechanism GPTs use to incorporate custom knowledge, whether it’s RAG, fine-tuning, a combination of the two or something completely different. But it wouldn’t surprise me if it’s something very similar to what I’ve described.
The MAA GPT Experiment
To evaluate the effectiveness of MAA GPT, I designed an experiment involving 50 curated questions and answers about MAA regulations. These questions were carefully selected to cover a broad spectrum, from straightforward, closed queries to more ambiguous, open-ended ones. The objective was to test the model's ability to handle various types of questions that aviation professionals might encounter. I don’t claim that these 50 questions cover every aspect we might want to test MAA GPT on, but 50 seemed like a reasonable number of questions to start the analysis.
Test conditions and method
To set the scene, I conducted the experiment itself on 14th June at 1900 EST. I had already prepared the questions and gold standard answers on a separate spreadsheet, so there would be no chance of MAA GPT's answers influencing how I wrote the gold standard. It took me about 30 minutes to copy and paste all of the questions one by one into MAA GPT, and the answers back into my master spreadsheet. I’ve since read up a little on Assistants, and next time I do this, I’d definitely like to automate the process through an API interface, but for the time being, copy and paste did the job. Once I had all of MAA GPT's answers, I reviewed them and compared them with the gold standard to see how it did. In summary, pretty good! (Please let me know if you'd like the full breakdown of questions, answers and my assessment and I'd be more than happy to share).
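For anyone wanting to skip the copy-and-paste stage, here is a rough sketch of how the loop could be automated. A caveat: custom GPTs themselves aren't callable via the API, so this assumes the same instructions and knowledge have been recreated behind an API-accessible model (for example via the Assistants API or the RAG sketch earlier); the file names and column headings are made up for illustration.

```python
# Sketch: read questions from a CSV, query the model, and write the answers
# back alongside the gold standard for later grading.
import csv
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_model(question: str) -> str:
    # Placeholder: call whatever API-accessible model serves the MAA knowledge base.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return completion.choices[0].message.content

with open("questions.csv", newline="", encoding="utf-8") as infile, \
     open("answers.csv", "w", newline="", encoding="utf-8") as outfile:
    reader = csv.DictReader(infile)  # expects columns: id, question, gold_answer
    writer = csv.writer(outfile)
    writer.writerow(["id", "question", "gold_answer", "model_answer"])
    for row in reader:
        writer.writerow([row["id"], row["question"], row["gold_answer"], ask_model(row["question"])])
```
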
Proportion of correct answers
Out of the 50 questions, I assessed that 42 (84%) were correct. Now, you might think that’s not great, but when you consider how many questions a human with very little prior knowledge of the UK MAA regulations might get correct with just 30 minutes of flicking through the documents, then perhaps it doesn’t seem that bad.
But we can do slightly better than this. I decided not only to evaluate whether each answer was correct or incorrect, but also to give each answer a grade to say just how correct or incorrect it was.
How correct exactly?
Grades were given to answers ranging from A to E. It is important to note that these grades are relative to the gold standard, and not what some expert in a particular niche might know. The grades were defined as follows:
A - Completely correct, with novel and/or interesting insights.
B - Completely correct.
C - Mainly correct, but with minor errors.
D - Partially correct, but with clear errors.
E - Completely incorrect, misleading or made up.
Of the 50 questions, MAA GPT scored 14 (28%) A’s, 27 (54%) B’s, 3 (6%) C’s, 2 (4%) D’s, and 3 (6%) E’s.
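Turning a column of grades into this kind of breakdown is trivial to automate, which matters once the test suite is run repeatedly. A small sketch, with a placeholder grades list rather than my actual results:

```python
# Tally a list of grades into counts and percentages.
from collections import Counter

grades = ["A", "B", "B", "C", "E"]  # in practice, read from the master spreadsheet
counts = Counter(grades)
for grade in "ABCDE":
    share = 100 * counts[grade] / len(grades)
    print(f"{grade}: {counts[grade]} ({share:.0f}%)")
```
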
Mostly good, but some not so great responses
Surprisingly, the questions which had the most negative impact on the grades overall were the closed questions of the form “What is contained within RA####?”. It appears that MAA GPT consistently struggled to parse the acronym RA (Regulatory Article). Perhaps even more surprising is that the knowledge base included a glossary (MRP03) containing the definition of RA, and the acronym appears in pretty much every individual RA. After the initial experiment had concluded, I went back and re-asked the questions of this form that were initially answered incorrectly (graded either D or E), with the acronym expanded in full: “What is contained within Regulatory Article ####?”. On doing this, 46 of the 50 questions (92%; +8%) were answered correctly, and here’s the breakdown of the grades: 17 (34%; +6%) A’s, 28 (56%; +2%) B’s, 3 (6%) C’s, 1 (2%; -2%) D’s, and 0 (0%; -6%) E’s.
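In my case I simply re-typed the questions by hand, but the fix is the kind of thing that is easy to automate as a preprocessing step before the question ever reaches the model. A quick sketch:

```python
# Expand the 'RA' acronym in questions before they are sent to the model.
import re

def expand_ra(question: str) -> str:
    # 'RA1234' or 'RA 1234' -> 'Regulatory Article (RA) 1234'
    return re.sub(r"\bRA\s?(\d{4})\b", r"Regulatory Article (RA) \1", question)

print(expand_ra("What is contained within RA4805?"))
# -> What is contained within Regulatory Article (RA) 4805?
```
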
Another point worth discussing is that, for some answers which contained errors, a bit of reframing of the question meant the correct answer could indeed be ascertained. For example, I asked MAA GPT the (intentionally ambiguous, and rather contrived) question “When a UK military registered Air System is at a foreign or civilian airfield, what should the local Command or Defence Contractor Flying Organisation consider?”, looking for something along the lines of “Taken from RA4805: Guarding by the host nation or UK Service personnel in order to preserve the Airworthiness of the Air System.”. However, MAA GPT decided to answer with a much more detailed and comprehensive response covering a broad range of considerations. Alas, because I was grading MAA GPT against the gold standard, I gave it a C for this answer. The point I’m making here is that without some reasonable context, it cannot read your mind and answer the question you want it to answer. By rephrasing the original question to “When a UK military registered Air System is at a foreign or civilian airfield, when considering security of an aircraft, what should the local Command or Defence Contractor Flying Organisation consider?”, it gave an answer that I would have awarded an A.
This all goes to say that it matters how you use the tool, and how you prompt it, and this should be considered when writing tests for this and any other custom GPT.
Conclusion to the experiment
I believe that the results of this experiment not only provide valuable insights into the strengths and weaknesses of MAA GPT itself, but also offer a straw-man method for evaluating the performance of a given model in a given context. By analysing the distribution of answers provided by the model and comparing them against a ‘gold standard’ of correct answers, we can identify areas where the model excels and where it falls short, and perhaps most importantly we can judge whether this performance is ‘good enough’ for whatever task we want it to do. I suggest that this testing approach could be crucial for a fair and accurate evaluation of the model's performance in this and other use cases.
Drawing Parallels with Test-Driven Development
The approach taken in this experiment mirrors the principles of Test-Driven Development (TDD), a software engineering practice in which tests are written before the code and run repeatedly as the system evolves, so that any change that breaks expected behaviour is caught immediately.
Similarly, in the context of systems like MAA GPT, continuous evaluation against a fixed set of questions and gold-standard answers can reveal when a change to the model, its instructions, or its knowledge base degrades (or improves) performance.
By adopting a continuous evaluation approach, organisations can maintain high standards of performance and adaptability in their AI applications, or at the very least, be aware when the system isn’t performing as well as they might want it to. This involves creating a robust framework for periodic assessments, where the model is tested against a diverse set of queries that reflect real-world scenarios. Such an approach ensures that the model remains relevant, and effective in its specific domain.
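To push the TDD analogy one step further, the gold-standard questions can literally be expressed as a test suite. The sketch below uses pytest and assumes a questions.csv file with a hypothetical must_mention column (for example the RA number any acceptable answer should cite); in practice, grading nuanced answers may need a human, or a second model, in the loop rather than a simple string check.

```python
# Sketch: treat the gold-standard questions as an automated regression suite.
import csv
import pytest
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def load_cases(path="questions.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))  # columns: id, question, gold_answer, must_mention

def ask_model(question: str) -> str:
    # Placeholder: call whatever API-accessible model serves the MAA knowledge base.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return completion.choices[0].message.content

@pytest.mark.parametrize("case", load_cases(), ids=lambda c: c["id"])
def test_answer_cites_expected_article(case):
    answer = ask_model(case["question"])
    # Crude check: the answer should at least mention the expected reference,
    # e.g. 'RA 4805'. Real grading would be more nuanced than this.
    assert case["must_mention"] in answer
```
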
Establishing a Framework for Continuous LLM Monitoring
To further enhance the understanding of, and develop trust in systems which use LLMs, I recommend that organisations that want to use them to support business activity consider the establishment of a comprehensive framework for independent, continuous monitoring and evaluation of such models. This framework should be designed to trigger evaluations routinely, manually, or automatically; for example whenever new data is uploaded to a database used for Retrieval-Augmented Generation (RAG).
This framework might include the following components:
A curated set of questions with ‘gold standard’ answers.
A means of submitting those questions to the model and recording its responses.
A grading scheme for comparing responses against the gold standard.
Triggers for running the evaluation routinely, on demand, or whenever the knowledge base changes.
A record of results over time so that trends and regressions can be spotted.
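As one illustration of the trigger component, the sketch below re-runs the evaluation suite whenever the documents behind the RAG knowledge base change. The directory name and the run_evaluation helper are placeholders, and a scheduled CI job would do the same thing less crudely than a polling loop.

```python
# Sketch: re-run the gold-standard evaluation when the knowledge base changes.
import hashlib
import pathlib
import time

DOCS = pathlib.Path("knowledge_base")

def knowledge_base_fingerprint() -> str:
    """Hash the regulatory documents so any change is detectable."""
    digest = hashlib.sha256()
    for path in sorted(DOCS.rglob("*.pdf")):
        digest.update(path.read_bytes())
    return digest.hexdigest()

def run_evaluation() -> None:
    print("Knowledge base changed: re-running the gold-standard question suite...")
    # e.g. invoke the pytest suite or the CSV-driven loop shown earlier

last_seen = knowledge_base_fingerprint()
while True:
    current = knowledge_base_fingerprint()
    if current != last_seen:
        run_evaluation()
        last_seen = current
    time.sleep(3600)  # check hourly; a scheduled CI job would work just as well
```
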
By implementing such a framework, organisations can ensure that their AI models remain robust, reliable, and up to date with the latest domain-specific information. A side benefit, which I only considered when writing up this article, is that taking this approach and automating routine testing might allow organisations who use LLMs - but otherwise have little or no sight of, let alone control over, their internal architecture and parameters - to monitor the performance of a model as it changes over time. This may be useful in diagnosing why a suite of tests performs particularly well or badly on a given day or hour (perhaps the underlying model has just got better or worse, rather than it being particularly good or bad in a specific context or use case). Aside from this tangent on keeping an eye on what big tech is doing with these powerful models, I strongly believe that continuous monitoring and evaluation would not only help create a more comprehensive understanding of a model's performance, but also build trust and confidence among users.
Conclusion
Through this project, I’ve aimed to contribute to a more informed and balanced discussion about generative AI in aviation. By testing MAA GPT and sharing my findings, I hope to inspire both aviation and computing professionals to engage in thoughtful experimentation and continuous evaluation of LLMs. This approach not only aims to demystify AI, but I hope also paves the way for more reliable and effective applications in specialised fields.
The future of AI in aviation holds tremendous potential. Here we have only scratched the surface of generative AI, and have not even mentioned supervised and unsupervised learning or reinforcement learning, but there really is some fascinating and truly useful stuff we could be building to support aircrew, engineers, air traffic controllers and anyone else who helps get aircraft into the sky, and - dare I say it - directly integrate into aircraft. By bridging the gap between these two worlds and fostering a culture of continuous learning and evaluation, we can unlock new possibilities and drive innovation in ways that were previously unimaginable. I look forward to the discussions and collaborations that this project will hopefully inspire, and I am excited to see how we can collectively shape the future of AI in aviation.