Experimenting with MAA GPT
MAA GPT - Image generated using DALL-E


Computers and aircraft are my jam. Which is to say, I’m deeply passionate about both aviation and computing. So I’m very fortunate to find myself in a reasonably unique and genuinely exciting position at the intersection of these two fields. I also believe there’s a lot that both areas can learn from each other, so this is written for the two main communities which seem to dominate my LinkedIn feed: people who love computers, and people who love flying (and of course, those who love both!). This article tells the story of a practical project: the development and evaluation of MAA GPT, a custom GPT integrated with the UK Military Aviation Authority (MAA) regulations.


Bridging the Gap Between Aviation and Computing

The motivation behind this project stems from a genuine interest in harnessing the power of technology to solve real-world problems in aviation. Unsurprisingly, the aviation industry - civil and military - is governed by strict regulations and standards to ensure safety and efficiency. Navigating these regulations can be a complex and time-consuming task (as I and many of my colleagues have experienced first-hand). As a professional in both fields, I recognised an opportunity to create an AI tool that could assist aviation professionals by providing answers to regulatory questions.

However, the current discourse around AI is often clouded by hype and unrealistic expectations. On one hand, AI is hailed as a revolutionary technology capable of transforming every aspect of our lives. On the other, there is scepticism and fear about its limitations and potential risks. In the context of aviation, where reliability is paramount and correctness is critical, it is crucial to have a clear and balanced understanding of what AI can and cannot do.

Through this project, I aim to demystify Large Language Models (LLMs) by telling you about my practical experimentation with MAA GPT. This involves not just creating the tool (that’s the easy bit), but testing it to evaluate its performance, and explaining why I think this latter part is critical in moving the conversation beyond hyperbole and towards delivering value. By doing so, I hope to provide a grounded perspective that cuts through the hype and offers real insights into the capabilities of AI in the context of aviation regulation. My intention is to foster a level-headed discussion that encourages both aviation and computing professionals to engage with AI in a thoughtful and informed manner. While I recognise that regulations aren’t the most exciting topic in the world, I also hope to highlight that the world of aviation can and should be a driver of the technological revolution we are in the midst of.


Understanding LLMs and RAG

To appreciate the potential and limitations of MAA GPT, it’s important to understand the foundational technologies behind it: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG).

Large Language Models (LLMs)

Large Language Models (LLMs) are a type of artificial intelligence trained on vast amounts of text data to generate human-like responses. That’s a bit of a simplification, and we could talk all day about ‘transformer architectures’, ‘embedding spaces’ and ‘vector databases’, but that’s pretty much what they do, so we’ll skip some of the technical chat for now. Practically speaking, these models have demonstrated remarkable abilities in understanding and generating natural language. They can converse, write essays, summarise information, and even generate creative content. LLMs achieve this by learning patterns, structures, and associations within the data they are trained on. Examples of LLMs include OpenAI's GPT-4o (which powers the latest version of ChatGPT and indeed this experiment), Google's Gemini, Anthropic’s Claude and Meta’s Llama. Each model has unique features and strengths, such as differences in size, architecture, and training methods, which make them suitable for different applications and use cases. The training process involves analysing billions of sentences to understand context, syntax, and semantics (but I said we’d skip the technical chat, so I’ll stop there).

Regardless of the model that is chosen, the performance of LLMs is highly dependent on the data they have been trained on. Most excel at general language tasks but can struggle with highly specialised or nuanced queries that require specific domain knowledge. For example, while an LLM might be able to write a persuasive essay on a broad topic like climate change, it might falter when asked a specific regulatory question in aviation, unless it has been specifically trained on such data. A problem with adding new data to the model is that it is very expensive (in terms of time, money and complexity; no matter what metric you look at, training LLMs is spenny), so retraining a model every day is not a feasible option.


Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) addresses some of the limitations of LLMs by integrating them with external knowledge bases. RAG combines the generative capabilities of LLMs with the precision of retrieval systems. When a question is asked of an LLM equipped with RAG, the model retrieves relevant information from trusted sources and uses this data to generate a more accurate and contextually relevant response.
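To make that retrieve-then-generate loop concrete, here is a minimal sketch using the OpenAI Python client and a couple of regulation excerpts held in memory. The passage text, model names and prompt wording are my own placeholders for illustration, not a description of how ChatGPT implements custom GPTs behind the scenes.

```python
# Minimal RAG sketch: embed a tiny corpus, retrieve the closest passage,
# and hand it to the model as context. Illustrative only.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A toy knowledge base of regulation excerpts (placeholder text, not the real documents).
passages = [
    "RA 1020 covers the roles and responsibilities of aviation duty holders...",
    "RA 4805 covers handling of UK military Air Systems at non-military airfields...",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

passage_vectors = embed(passages)

def answer(question, k=1):
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every stored passage.
    sims = passage_vectors @ q_vec / (
        np.linalg.norm(passage_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(passages[i] for i in np.argsort(sims)[::-1][:k])
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided regulatory extracts, and cite them."},
            {"role": "user", "content": f"Extracts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content

print(answer("What should be considered when an Air System is parked at a civilian airfield?"))
```

The important point is that the model only generates; the retrieval step decides what trusted material it generates from.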

Looking a little bit at our use case of aviation regulations, a simple generative model might struggle with the specificity required for answering detailed regulatory questions. By using RAG, the model can pull in specific information from an updated and authoritative knowledge base, ensuring that the responses are not only fluent but also precise and reliable. This hybrid approach enhances the model's ability to provide useful answers in specialised fields where the stakes for correctness are high.

So, while LLMs like GPT-4o offer impressive capabilities, their effectiveness can be significantly enhanced through the integration of retrieval systems in RAG. This combination leverages the strengths of both technologies, making it particularly useful for applications requiring high correctness and domain-specific knowledge, such as aviation.

Fine-Tuning LLMs

Fine-tuning LLMs is another way of combining custom content with the inherent capabilities of LLMs, and can enhance their performance for specific tasks. Fine-tuning involves taking a pre-trained LLM and further training part of it on a smaller, domain-specific dataset. This process allows the model to adapt to the nuances and particularities of specialised fields. By exposing the LLM to targeted data, we can, in theory, improve its responses so that they better meet the needs of the domain. Fine-tuning not only addresses the limitations of general LLMs in handling specialised queries, but can also enhance the effectiveness of Retrieval-Augmented Generation (RAG) by ensuring that the generative component of the model is better aligned with the specific knowledge base it retrieves information from.
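For a rough sense of what the fine-tuning route involves (as distinct from RAG), here is a sketch of preparing a small supervised dataset and submitting a fine-tuning job with the OpenAI Python client. The example Q&A pair and the model name are illustrative assumptions, not the data or settings behind MAA GPT.

```python
# Sketch of the fine-tuning route: build a JSONL file of example exchanges,
# upload it, and start a fine-tuning job. Illustrative only.
import json
from openai import OpenAI

client = OpenAI()

# Each training example is a short conversation showing the desired behaviour.
examples = [
    {"messages": [
        {"role": "system", "content": "You answer questions about UK military aviation regulation, citing the relevant Regulatory Article."},
        {"role": "user", "content": "Who is responsible for ... ?"},
        {"role": "assistant", "content": "Per RA 1020, ..."},
    ]},
    # ...in practice, hundreds more curated examples...
]

with open("maa_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

training_file = client.files.create(file=open("maa_finetune.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # an example of a model that supports fine-tuning
)
print(job.id)
```

The cost and effort sit in curating those examples, which is exactly why RAG is often the more practical first step for regulatory content that changes over time.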

Say again…

One characteristic of LLMs (even when combined with RAG) that is worth noting is that they do not always provide the same answer when asked the same question multiple times. This variability can be useful in some applications (creativity and dealing with ambiguity), but perplexing and problematic in others (where you just need the correct answer). This characteristic can perhaps be partially explained through an analogy with human behaviour. Just as people might give different answers to the same question depending on context, mood, or recent experiences, LLMs generate responses based on probabilities. Each word in the response is chosen based on the likelihood determined by the model, which can vary slightly with each query. This is not to compare the mechanisms by which humans and LLMs come up with answers, but rather to suggest how we might test a system with an LLM at its core. In the same way we test humans by conducting initial and periodic examinations, we can evaluate LLMs through consistent and repeated testing with a variety of questions and compare the answers against a ‘gold standard’. This idea is at the core of the experiment I conducted with MAA GPT, but first I’ll tell you a little bit about how I made it.
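One practical consequence is that any evaluation needs to ask the same question more than once. The sketch below (my own illustration, using the OpenAI chat API directly rather than the ChatGPT GPT interface) asks a single question several times and scores each answer against a gold-standard string with a crude similarity ratio, just to make the variability visible. The question, gold answer and model are placeholders.

```python
# Ask the same question repeatedly and compare each answer to a gold standard.
# A crude string-similarity score is used purely to make the variation visible.
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()

QUESTION = "What is contained within Regulatory Article 1020?"
GOLD = "RA 1020 sets out the roles and responsibilities of aviation duty holders..."  # placeholder

def ask(question):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=1.0,  # default-style sampling, so answers will vary between runs
    )
    return resp.choices[0].message.content

for i in range(5):
    answer = ask(QUESTION)
    similarity = SequenceMatcher(None, answer.lower(), GOLD.lower()).ratio()
    print(f"run {i + 1}: similarity to gold standard = {similarity:.2f}")
```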


Creating MAA GPT

GPTs are custom AI models within the ChatGPT ecosystem that users can design to perform specific tasks. These models leverage the capabilities of the base ChatGPT model but are tailored with specific instructions, data, and even integrations to meet particular needs. Anyone can create a custom GPT; all you need is a ChatGPT Plus account. Previously, users also required a (paid-for) ChatGPT Plus account to consume GPTs, but as of last month anyone can use them with a free ChatGPT account.

To create MAA GPT I started by defining the specific tasks MAA GPT needed to perform, focusing on assisting users with navigating and understanding UK Military Aviation regulations. This involved outlining its responsibilities, such as answering questions about regulatory documents. Next, I customised the behaviour and personality of MAA GPT. I set instructions to ensure it always referenced specific regulatory articles and documents in its responses. This step was essential to make sure, as far as I can, that the information provided is as accurate and relevant as possible.

Following that, I incorporated the custom data into MAA GPT. This involved uploading detailed regulatory documents like MAA01, MAA02, and the various Regulatory Article (RA) series, which the GPT uses to inform its responses. I’ve also recently subscribed to be alerted to any changes in the regulations, so I can make sure MAA GPT is always up to date. If I’m entirely honest, I’m not exactly sure which mechanism GPTs use to incorporate custom knowledge, whether it’s RAG, fine-tuning, a combination of the two or something completely different. But it wouldn’t surprise me if it’s something very similar to what I’ve described.


The MAA GPT Experiment

To evaluate the effectiveness of MAA GPT, I designed an experiment involving 50 curated questions and answers about MAA regulations. These questions were carefully selected to cover a broad spectrum, from straightforward, closed queries to more ambiguous, open-ended ones. The objective was to test the model's ability to handle various types of questions that aviation professionals might encounter. I don’t claim here that these 50 questions necessarily cover every aspect we might want to test MAA GPT for, but 50 seemed like a reasonable number of questions with which to start the analysis.

Test conditions and method

To set the scene, I conducted the experiment itself on 14th June at 1900 EST. I had already prepared the questions and gold-standard answers on a separate spreadsheet, so there would be no chance of MAA GPT's answers influencing how I wrote the gold standard. It took me about 30 minutes to copy and paste all of the questions one by one into MAA GPT, and the answers back into my master spreadsheet. I’ve since read up a little on Assistants, and next time I do this, I’d definitely like to automate the process through an API interface, but for the time being, copy and paste did the job. Once I had all of MAA GPT's answers, I reviewed them and compared them with the gold standard to see how it did. In summary, pretty good! (Please let me know if you'd like the full breakdown of questions, answers and my assessment and I'd be more than happy to share.)
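A minimal sketch of what that automation might look like is below: read questions from a CSV, query the model, and save the answers alongside the gold standard for grading. The file names, column headings and system prompt are my own placeholders rather than the actual MAA GPT configuration, which isn't exposed through the API.

```python
# Sketch of automating the 50-question run: read questions from a CSV,
# query the model, and save the answers next to the gold standard for grading.
import csv
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an assistant for UK Military Aviation Authority regulation. "
    "Always cite the specific Regulatory Article or document you draw on."
)

with open("questions.csv", newline="") as f:  # columns: question, gold_answer
    rows = list(csv.DictReader(f))

for row in rows:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["question"]},
        ],
    )
    row["model_answer"] = resp.choices[0].message.content

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "gold_answer", "model_answer"])
    writer.writeheader()
    writer.writerows(rows)
```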


Proportion of correct answers

Out of the 50 questions, I assessed that 42 (84%) were correct. Now, you might think that’s not great, but when you consider how many questions a human with very little prior knowledge of the UK MAA regulation might get correct with just 30 minutes flicking through the documents, then perhaps it doesn’t seem that bad.

Count of correct / incorrect answers MAA GPT gave to 50 questions

But we can do slightly better than this. I decided not only to evaluate answers as correct or incorrect, but also to give each answer a grade to say just how correct or incorrect it was.

How correct exactly?

Grades were given to answers ranging from A to E. It is important to note that these grades are relative to the gold standard, and not to what an expert in a particular niche might know. The grades were defined as follows:

A - Completely correct, with novel and/or interesting insights.

B - Completely correct.

C - Mainly correct, but with minor errors.

D - Partially correct, but with clear errors.

E - Completely incorrect, misleading or made up.

Of the 50 questions, MAA GPT scored 14 (28%) A’s, 27 (54%) B’s, 3 (6%) C’s, 2 (4%) D’s, and 3 (6%) E’s.

Grades awarded to MAA GPT after answering 50 questions
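For anyone wanting to replicate the grading step, the rubric above is easy to encode. The snippet below simply tallies a list of grades (seeded here with the counts from my run) and reports the distribution; the grading itself was done by hand against the gold standard.

```python
# Tally human-assigned grades and report the distribution.
from collections import Counter

GRADE_SCALE = {
    "A": "Completely correct, with novel and/or interesting insights",
    "B": "Completely correct",
    "C": "Mainly correct, but with minor errors",
    "D": "Partially correct, but with clear errors",
    "E": "Completely incorrect, misleading or made up",
}

# Grades from the 50-question run (14 A, 27 B, 3 C, 2 D, 3 E).
grades = ["A"] * 14 + ["B"] * 27 + ["C"] * 3 + ["D"] * 2 + ["E"] * 3

counts = Counter(grades)
total = len(grades)
for grade, description in GRADE_SCALE.items():
    n = counts.get(grade, 0)
    print(f"{grade}: {n} ({n / total:.0%}) - {description}")
```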


Mostly good, but some not so great responses

Surprisingly, the questions which had the most negative impact on the grades overall were the closed questions of the form “What is contained within RA###?”. It appears that MAA GPT consistently struggled to parse the acronym RA (Regulatory Article). Perhaps even more surprising is that the knowledge base included a glossary (MRP03) containing the definition of RA, and the acronym appears in pretty much every individual RA. After the initial experiment had concluded, I went back and expanded RA to its full form, “What is contained within Regulatory Article ####?”, for the questions of this kind that were initially answered incorrectly (graded either D or E). On doing this, 46 of the 50 questions (92%; +8%) were answered correctly, and here’s the breakdown of the grades: 17 (34%; +6%) A’s, 28 (56%; +2%) B’s, 3 (6%) C’s, 1 (2%; -2%) D’s, and 0 (0%; -6%) E’s.

Grades awarded to MAA GPT after answering 50 questions. Questions originally of the form “What is contained within RA###?” were modified to “What is contained within Regulatory Article ####?”
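One cheap workaround for this kind of failure, rather than relying on the model to find the glossary, would be to expand known acronyms in the question before it ever reaches the model. A rough sketch of that pre-processing is below; the acronym list is a tiny illustrative subset, not the full MRP03 glossary, and this is my suggestion rather than something built into MAA GPT.

```python
# Expand known acronyms in a question before sending it to the model.
import re

# A tiny illustrative subset of the glossary, not the full MRP03 list.
ACRONYMS = {
    "RA": "Regulatory Article",
    "MAA": "Military Aviation Authority",
    "DDH": "Delivery Duty Holder",
}

def expand_acronyms(question: str) -> str:
    for acronym, expansion in ACRONYMS.items():
        # Replace whole-word occurrences only, so e.g. "RAF" is left alone.
        question = re.sub(rf"\b{acronym}\b", f"{expansion} ({acronym})", question)
    return question

print(expand_acronyms("What is contained within RA 1020?"))
# -> "What is contained within Regulatory Article (RA) 1020?"
```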

Another point worth discussing is that for some answers which contained errors, a bit of reframing of the question meant the correct answer could indeed be ascertained. For example, I asked MAA GPT the (intentionally ambiguous, and rather contrived) question “When a UK military registered Air System is at a foreign or civilian airfield, what should the local Command or Defence Contractor Flying Organisation consider?”, looking for something along the lines of “Taken from RA4805: Guarding by the host nation or UK Service personnel in order to preserve the Airworthiness of the Air System.”. However, MAA GPT decided to answer with a much more detailed and comprehensive answer covering a broad range of considerations. Alas, because I was grading MAA GPT against the gold standard, I gave it a C for this answer. The point I’m making here is that without some reasonable context, the model cannot read your mind and answer the question you want it to answer. By rephrasing the original question to “When a UK military registered Air System is at a foreign or civilian airfield, when considering security of an aircraft, what should the local Command or Defence Contractor Flying Organisation consider?”, it gave an answer that I would have awarded an A.

This all goes to say that it matters how you use the tool, and how you prompt it, and this should be considered when writing tests for this and any other custom GPT.

Conclusion to the experiment

I believe that the results of this experiment not only provide valuable insights into the strengths and weaknesses of MAA GPT itself, but also offer a straw-man method of evaluating the performance of a given model in a given context. By analysing the distribution of answers provided by the model and comparing them against a ‘gold standard’ of correct answers, we can identify areas where the model excels and where it falls short, and, perhaps most importantly, we can judge whether this performance is ‘good enough’ for whatever task we want it to do. I suggest that this testing approach could be crucial for a fair and accurate evaluation of the model's performance in this and other use cases.


Drawing Parallels with Test-Driven Development

The approach taken in this experiment mirrors the principles of Test-Driven Development (TDD), a well-established software development best practice. TDD involves writing tests before the actual code to ensure that the functionality meets the specified requirements. This methodology not only guides the development process but also helps in identifying and fixing issues early. By establishing clear criteria for success from the outset, TDD ensures that each piece of code performs as intended and integrates seamlessly with the rest of the system.

Similarly, in the context of systems like MAA GPT, continuous evaluation against a defined scope of tasks is essential to ensure their reliability and effectiveness. AI models are not static; they can evolve and change over time due to various factors such as updates in the training data, changes in the underlying algorithms, or shifts in the application context. Regular testing of these models with new data would allow us to detect issues like model drift (where the model's performance degrades over time due to changes in the data distribution) and context drift (where the model's understanding of the context changes).
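In code terms, “tests first” for a system like this might look something like the sketch below: a pytest file that replays the gold-standard questions against the model and fails if an answer drifts too far from the expected one. The similarity threshold, file layout and helper names are assumptions for illustration, and a string-similarity check is a very blunt stand-in for human (or LLM-assisted) grading.

```python
# test_maa_gpt.py - replay gold-standard questions and flag drift.
# Run with: pytest test_maa_gpt.py
import csv
from difflib import SequenceMatcher

import pytest
from openai import OpenAI

client = OpenAI()
THRESHOLD = 0.6  # illustrative; tune against a known-good baseline run

def load_cases(path="questions.csv"):
    # Assumes a CSV with 'question' and 'gold_answer' columns exists alongside the tests.
    with open(path, newline="") as f:
        return [(row["question"], row["gold_answer"]) for row in csv.DictReader(f)]

def ask(question):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

@pytest.mark.parametrize("question,gold", load_cases())
def test_answer_matches_gold_standard(question, gold):
    answer = ask(question)
    similarity = SequenceMatcher(None, answer.lower(), gold.lower()).ratio()
    assert similarity >= THRESHOLD, f"Possible drift on {question!r} (similarity {similarity:.2f})"
```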

By adopting a continuous evaluation approach, organisations can maintain high standards of performance and adaptability in their AI applications, or at the very least, be aware when the system isn’t performing as well as they might want it to. This involves creating a robust framework for periodic assessments, where the model is tested against a diverse set of queries that reflect real-world scenarios. Such an approach ensures that the model remains relevant and effective in its specific domain.


Establishing a Framework for Continuous LLM Monitoring

To further enhance the understanding of, and develop trust in systems which use LLMs, I recommend that organisations that want to use them to support business activity consider the establishment of a comprehensive framework for independent, continuous monitoring and evaluation of such models. This framework should be designed to trigger evaluations routinely, manually, or automatically; for example whenever new data is uploaded to a database used for Retrieval-Augmented Generation (RAG).

This framework might include the following components (a rough sketch of how some of them might fit together in code follows the list):

  1. Model Performance Interface: A user interface such that users can create their own ‘gold standard’ question and answer sets, and see how the model performs (whether manually or automatically evaluated) when their custom GPT/RAG model is tested.
  2. Automated Routine Testing: Schedule regular tests at predefined intervals (e.g., weekly, monthly) to evaluate the model's performance on a standardised set of queries. This helps in identifying any gradual change in performance over time. This approach may be particularly useful when a user has no control whatsoever in the underlying LLM architecture or parameters, but still wants to use it and see how it performs over time.
  3. Manual Testing: Allow experts to manually trigger tests when they suspect the model's performance may be compromised or when significant changes are made to the system. This provides a flexible mechanism for on-demand evaluations.
  4. Triggered Testing on Data Updates: Implement automatic testing whenever new data is added to the system. For instance, when new aviation regulations are incorporated into the RAG database, the framework should automatically evaluate the model's responses to ensure that it accurately reflects the updated information.
  5. Performance Metrics and Reporting: Define clear performance metrics (e.g., accuracy, precision, recall) and generate detailed reports after each evaluation. These reports should highlight areas of improvement and provide actionable insights for model refinement.
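To show how components 2, 4 and 5 might hang together in practice, here is a rough sketch of an evaluation harness that can be run on a schedule (for example from cron or a CI pipeline) or triggered when the knowledge base changes, and which emits a simple dated accuracy report. Everything here - the file names, the hashing trick for detecting data updates, the pass threshold - is an assumption for illustration rather than a description of any existing product.

```python
# Sketch of a continuous-evaluation harness: run the gold-standard question set
# on a schedule or whenever the knowledge base changes, and write a dated report.
import csv
import hashlib
import json
from datetime import datetime, timezone
from difflib import SequenceMatcher
from pathlib import Path

from openai import OpenAI

client = OpenAI()
KNOWLEDGE_DIR = Path("knowledge")        # the documents behind the RAG store
STATE_FILE = Path("last_knowledge_hash.txt")

def knowledge_hash() -> str:
    """Fingerprint the knowledge base so data updates can trigger a run."""
    h = hashlib.sha256()
    for path in sorted(KNOWLEDGE_DIR.glob("*")):
        if path.is_file():
            h.update(path.read_bytes())
    return h.hexdigest()

def run_evaluation(cases_path="questions.csv", threshold=0.6) -> dict:
    results = []
    with open(cases_path, newline="") as f:
        for row in csv.DictReader(f):
            resp = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": row["question"]}],
            )
            answer = resp.choices[0].message.content
            score = SequenceMatcher(None, answer.lower(), row["gold_answer"].lower()).ratio()
            results.append({"question": row["question"], "score": score, "passed": score >= threshold})
    passed = sum(r["passed"] for r in results)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "accuracy": passed / len(results),
        "results": results,
    }

if __name__ == "__main__":
    current = knowledge_hash()
    previous = STATE_FILE.read_text() if STATE_FILE.exists() else ""
    if current != previous:  # triggered testing on data updates; cron handles routine runs
        report = run_evaluation()
        Path(f"report_{report['timestamp'][:10]}.json").write_text(json.dumps(report, indent=2))
        STATE_FILE.write_text(current)
        print(f"Accuracy: {report['accuracy']:.0%}")
```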

By implementing such a framework, organisations can ensure that their AI models remain robust, reliable, and up-to-date with the latest domain-specific information. A side benefit, which I considered only when writing up this article, is that taking this approach and automating routine testing might allow organisations who use LLMs - but otherwise have little or no sight of, let alone control over, their internal architecture and parameters - to monitor the performance of a model as it changes over time. This may be useful in diagnosing why a suite of tests performs particularly well or badly on a given day or hour (perhaps the underlying model has just got better or worse, rather than being particularly good or bad in a particular context or use case). Aside from this tangent on keeping an eye on what big tech is doing with these powerful models, I strongly believe that continuous monitoring and evaluation would not only help create a more comprehensive understanding of a model's performance, but also build trust and confidence among users.


Conclusion

Through this project, I’ve aimed to contribute to a more informed and balanced discussion about generative AI in aviation. By testing MAA GPT and sharing my findings, I hope to inspire both aviation and computing professionals to engage in thoughtful experimentation and continuous evaluation of LLMs. This approach not only aims to demystify AI, but I hope also paves the way for more reliable and effective applications in specialised fields.

The future of AI in aviation holds tremendous potential. Here we have only scratched the surface of generative AI, and not even mentioned supervised and unsupervised learning or reinforcement learning, but there really is some fascinating and truly useful stuff we could be building to support aircrew, engineers, air traffic controllers and anyone else who helps get aircraft in the sky, and - dare I say it - to integrate directly into aircraft. By bridging the gap between these two worlds and fostering a culture of continuous learning and evaluation, we can unlock new possibilities and drive innovation in ways that were previously unimaginable. I look forward to the discussions and collaborations that this project will hopefully inspire, and I am excited to see how we can collectively shape the future of AI in aviation.
