Unmasking the performance of Generative AI: Testing ChatGPT's Aptitude in Maintenance, Reliability & PHM
This article was co-written with Asma Ali
Introduction
The recent release of ChatGPT and other generative large language models (LLMs) has spurred interest in understanding how they perform in (and disrupt) various aspects of our daily lives. Within the maintenance and reliability domain, and particularly within the prognostics and health management (PHM) space, ChatGPT shows a remarkable ability to generate and explain engineering concepts and procedures.
The impressive performance of ChatGPT in the industrial sector ignited conversations with my former colleague, Asma. We had worked together in the technical language processing space, particularly on developing prescriptive recommendations in response to alerts from commercial predictive maintenance software. However, our preliminary observations with ChatGPT were unlike anything we had previously experienced with out-of-the-box natural language processing (NLP) solutions applied to technical data and industrial use cases.
The big questions we wanted to understand specifically included:
To measure the precise extent of these capabilities and limitations in the industrial space, we designed a test and a grading rubric for a formal evaluation. We just submitted our results for peer review for the Prognostics and Health Management (PHM) Society annual conference. Titled "Evaluating the Performance of ChatGPT in the Automation of Maintenance Recommendations for Prognostics and Health Management," our technical paper explores the capabilities and limitations of LLMs in this specific industrial context.
This blog is the first of a series in which we will share a summary of our paper's findings. In this particular post, we provide a high-level overview of our results. We hope this will interest folks in the maintenance and reliability community and generate some interesting discussion and feedback.
Testing ChatGPT
To broaden the scope of our LLM testing and assess the generalizability of our evaluation approach, we conducted tests on two models for comparison: ChatGPT and Google's Bard.
To test the models' ability to make prescriptive recommendations, we developed a three-step testing approach:
Scoring ChatGPT
We devised a methodology to assess the suitability of LLMs for industrial applications. Drawing inspiration from evaluation methods employed in other verticals like the medical field, we tailored these approaches to the industry space. Our evaluation encompasses an examination of response accuracy and a comprehensive analysis of AI explanations, considering factors such as engineering, risk, human aspects, costs, and required adjustments. Below is a summary of our grading rubric:
Performance on the Maintenance and Reliability Knowledge Exam
To assess the knowledge and proficiency of LLMs, we conducted a maintenance and reliability knowledge exam. We adapted a 76-question multiple-choice exam from the widely recognized resource "Maintenance and Reliability Best Practices" by Ramesh Gulati, commonly used for professional certification exams (such as the Certified Maintenance and Reliability Professional, or CMRP exam). The correct answers and explanations from the book served as the answer key for grading the LLMs' performance.
A summary of the scores for both AI models is shown below:
The overall scores are 72% (ChatGPT) and 64% (Bard). In his book, Gulati states that a score of 68 or more means "You have excellent M&R knowledge, but you should always continue to learn and enhance your knowledge" (and a score of 45-67 means "good"). From a testing perspective, the LLMs are doing pretty well.
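To make the comparison against the book's bands explicit, here is a minimal sketch. It assumes the thresholds quoted above apply on the same percentage scale as the reported scores, and it treats any score from 45 up to (but not including) 68 as "good". The function name `gulati_band` is ours, purely for illustration.

```python
def gulati_band(score_pct: float) -> str:
    """Map a percentage score to the bands quoted in this post.

    Assumption: the book's thresholds (68 and 45) are on the same
    percentage scale as the reported exam scores.
    """
    if score_pct >= 68:
        return "excellent"
    if score_pct >= 45:
        return "good"
    # The post does not quote any band below 45.
    return "below the bands quoted in this post"


# Overall scores reported above for the M&R knowledge exam.
scores = {"ChatGPT": 72, "Bard": 64}
bands = {model: gulati_band(score) for model, score in scores.items()}

for model, band in bands.items():
    print(f"{model}: {scores[model]}% -> {band}")
```

Under these assumptions, ChatGPT lands in the "excellent" band and Bard in the "good" band.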
Digging deeper, we can see that both models do very well at grasping the central concept of the questions, even when they did not answer them correctly. Additionally, the models exhibit specificity and concordance. However, we observed that both models tend to provide extraneous information, with Bard including unrelated concepts far more often (51% of the time) than ChatGPT (13% of the time).
We also encountered instances where the models generated incorrect information or hallucinated concepts. ChatGPT exhibited hallucinations in 5% of its responses, while Bard showed them in 4% of its responses. Although these percentages are relatively low, the presence of hallucinations cannot be overlooked. In our next blog post, we will focus specifically on the hallucinations discovered during our testing. A later blog post will also delve into the performance breakdown across different areas, or "pillars," of maintenance and reliability.
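The percentages quoted in this section are rates over graded responses. As a rough sketch of how per-response rubric flags could be rolled up into such rates, consider the following. The class name, field names, and toy grades are all hypothetical; they loosely mirror criteria named in this post and are not the paper's rubric or data.

```python
from dataclasses import dataclass


@dataclass
class GradedResponse:
    # Hypothetical fields, loosely mirroring criteria named in this post.
    correct: bool
    central_concept: bool
    extraneous: bool
    hallucinated: bool


def rate(responses, flag):
    """Percentage of responses for which the given boolean flag is set."""
    if not responses:
        raise ValueError("no graded responses")
    hits = sum(1 for r in responses if getattr(r, flag))
    return 100.0 * hits / len(responses)


# Made-up toy grades for four answers; NOT data from the paper.
toy = [
    GradedResponse(True, True, False, False),
    GradedResponse(False, True, True, False),
    GradedResponse(True, True, False, False),
    GradedResponse(False, False, True, True),
]

flags = ("correct", "central_concept", "extraneous", "hallucinated")
summary = {flag: rate(toy, flag) for flag in flags}
print(summary)
```

Keeping each criterion as a separate flag means a response can grasp the central concept while still being incorrect, which is the pattern described above.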
PHM Knowledge Exam
The PHM knowledge exam comprised 63 questions adapted from GE Vernova's commercial PHM solution (APM SmartSignal) and covered industrial verticals, equipment, processes, and domain-specific knowledge in depth.
The scores from both AI models used are shown below:
ChatGPT scored 67% overall, while Bard scored 56%. If 60% is a passing score for an M&D analyst, then ChatGPT just passed and Bard got pretty close.
As with the maintenance and reliability knowledge exam, the LLMs were stronger at grasping central concepts and being specific than at consistently providing correct answers.
However, when it came to highly specialized PHM knowledge and domain-specific nuances, both models struggled. We will report more in-depth details on this in the next blogs.
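The pass/fail reading above is a simple threshold check, sketched below. The 60% pass mark is the hypothetical one posed in this post ("if 60% is a passing score"), not an official standard, and the variable names are ours.

```python
# Hypothetical pass mark, taken from the "if 60%" framing in this post.
PASS_MARK = 60

# Overall PHM knowledge exam scores reported above (percent).
phm_scores = {"ChatGPT": 67, "Bard": 56}

# For each model: (passed?, margin in percentage points relative to the pass mark).
results = {
    model: (score >= PASS_MARK, score - PASS_MARK)
    for model, score in phm_scores.items()
}

for model, (passed, margin) in results.items():
    status = "pass" if passed else "fail"
    print(f"{model}: {status} ({margin:+d} points vs. {PASS_MARK}%)")
```

With this assumed threshold, ChatGPT clears the mark by 7 points and Bard falls short by 4.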
Conclusions
Both LLMs achieved passing scores in industrial knowledge (which is impressive), and their performance on the maintenance and reliability knowledge test suggests that they may even be capable of passing the CMRP certification exam.
However, our testing rubric went beyond evaluating correctness alone. While both LLMs showed promising abilities in grasping central concepts and providing specific information, they were also prone to including extraneous details, occasional inaccuracies, and hallucinations. Our findings indicate that while AI models can provide valuable insights and context, they should be regarded as decision-support tools rather than standalone PHM experts.
In the next blog post, we delve deeper into the results concerning how the AI models hallucinated maintenance, reliability, and PHM concepts, which is a concerning aspect that needs to be addressed for industrial use.
In the third blog post, Asma discusses our findings for use of ChatGPT as a maintenance recommender in response to PHM Alerts:
Link to the full technical paper:
Director at GE Vernova
1 yr: Asma Ali and Sarah Lukens, well-written and informative article. Good job of providing some interesting insights into the potential benefits and limitations of using LLMs in the M&R space. What are the implications of these findings for the future of M&R and the challenges that need to be addressed before LLMs can be used in this way?
Principal Director at Accenture Strategy |Gen AI Solutions | E Mobility | Industrial.Ai
1 yr: Interesting article.
System Safety Engineering and Management of Complex Systems; Risk Management Advisor...Complex System Risks
1 yr: No risks here? After all the AI/ML, EV, advanced automation, metadata, agile processes, advanced digital complexity, and advanced technology, deep tech, 5G assumptions, safety instrumented system fixation, we need to: Maintain control over all automation; Keep the human in the loop; Actually understand system assurance: human, hardware, software, firmware, logic and the human and environmental integrations and apply system safety, software safety, cyber safety, cyber security, system reliability, logistics, availability, human factors, human reliability, quality, survivability, etc.; Design systems to accommodate humans; Systems will fail, inadvertently operate, increase system risk with complexity; Humans will fail; Design systems to fail safe; Design systems to enable human monitoring; Design systems to enable early detection, isolation, correction and recovery…
Data Scientist for industrial solutions
1 yr: We just dropped the second post: https://www.dhirubhai.net/pulse/hallucination-chatgpt-uncovering-limitations-ai-language-sarah-lukens/