Root Cause Analysis Use Case with the new O1 Reasoning Model

This blog has 3 sections:

  • The problem statement
  • The solution approach
  • Conclusions and References

Problem Statement

O1 is the new family of models recently launched by OpenAI and Microsoft. There has been a lot of hype around the model's advanced reasoning and analytical abilities. The model is supposed to have PhD-level reasoning, with chain-of-thought (CoT) ability built deep into the model itself.

Although many of us (including myself) have been dabbling with the model in the playground, one question I have found difficult to answer is: what is an appropriate business use case where we can demonstrate the power of O1? In other words, a use case that performs far better with O1 than with other models such as GPT-4o.

In this blog, I will discuss one such use case. This work is a joint collaboration with my friend and colleague Nayan Paul (Chief Azure Architect, Generative AI CoE).

The use case we chose was root cause analysis for manufacturing plant operational failures. Typically, a manufacturing plant has multiple factory lines. Each line has multiple machines (or systems), and each system has multiple subsystems. These subsystems typically send telemetry data to operations logs. When a machine failure happens on a factory line, finding the exact system, identifying the subsystem responsible for the issue, and then performing a root cause analysis requires significant reasoning and logical problem-solving skills. Typically, an engineer spends 3–4 hours on average per incident to identify and document the root cause and to produce the report and remediation plans.

In this example we will focus on showcasing how O1 can help with the root cause analysis process.

Below is a quick view of the use case.

Case Study and Use case business value

Solution

We will start with a quick overview of the new O1 model. There were 3 primary reasons why we thought O1 would be a better option for this use case:

  • This use case requires comprehensive reasoning over multiple layers of data. O1 promises to perform self-aware reasoning to get the root cause analysis right.
  • Secondly, the use case also needs logical reasoning to iteratively check the sequence of events and find its root cause.
  • Thirdly, O1 now supports a larger context window (128K), which can help with extended memory.

With this promised value from O1, let's also be realistic and scope the use case so that we are not overpromising on its capabilities.

Let us start with an introduction to our solution and walk you through the end-to-end steps. We built a UI-centric solution to make it easy to understand and follow along.

The first screen starts with the chatbot. The chatbot is agentic, which makes it easy to interact with. We wanted to design the system as if an “engineer” would walk up to it and start interacting with the agentic bot to generate the root cause analysis report.

Introduction screen

An agentic framework with O1 makes the story compelling: we can add guardrails and create 3 separate agents that help the operations engineer navigate and generate the root cause (causal) analysis report. In this case, the user starts by saying “hello” to the system. The system invokes the “orchestrator” agent, which asks the first round of questions to understand the problem statement: the factory line and the name of the system that broke down.

Agentic Demo Starting the conversation

The top part of this screen has some “cards” that give the user guidance on the kind of data we have in the demo. Below you can see samples of the datasets. For reference, we have 2 datasets. The first is the machine details dataset, which shows that we have 1 factory line with 3 systems (A, B, C) and up to 5 subsystems for each system.

Above is sample metadata for the complex system

The other dataset is the operations/telemetry dataset, which has data at 5-minute intervals from each subsystem of a given system.

Alarms and Event data showing machine operations

Back to the demo: once the system asks for the factory line and the name of the system that failed, we can answer in a very “human” way, which means we can give a partial answer or the full answer in one go. The guardrails of the agentic system keep working with the operations engineer to gather the information (otherwise the underlying agent will not change its state and invoke the next agent).

Agents confirming factory line information.
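The guardrail behaviour described above (accept a partial answer, keep asking until both the line and the system are known) can be sketched as a small slot-filling helper. This is only an illustration under our own assumptions: in the demo the language model does the extraction, whereas here two hypothetical regular expressions stand in for it, and the names `IncidentSlots` and `update_slots` are ours.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncidentSlots:
    """Information the orchestrator must collect before handing off."""
    line: Optional[str] = None
    system: Optional[str] = None

    def missing(self) -> List[str]:
        # Names of the slots that still have no value.
        return [name for name in ("line", "system") if getattr(self, name) is None]


# Hypothetical patterns standing in for LLM-based extraction.
LINE_RE = re.compile(r"\bline\s*(?:name\s*)?(?:is\s*)?(\d+)\b", re.IGNORECASE)
SYSTEM_RE = re.compile(r"\bsystem\s*(?:name\s*)?(?:is\s*)?([A-Za-z])\b", re.IGNORECASE)


def update_slots(slots: IncidentSlots, message: str) -> IncidentSlots:
    """Fill whichever slots the user's message mentions; keep the rest."""
    line_match = LINE_RE.search(message)
    if line_match:
        slots.line = line_match.group(1)
    system_match = SYSTEM_RE.search(message)
    if system_match:
        slots.system = system_match.group(1).upper()
    return slots


slots = IncidentSlots()
update_slots(slots, "The failure was on line 1")
print(slots.missing())   # the agent keeps asking for the system
update_slots(slots, "system B")
print(slots.missing())   # nothing missing, the next agent can be invoked
```

The agent only changes state once `missing()` is empty, which mirrors the "will not change its state" guardrail in the text.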

In this case, we provide the line name as 1. The system then asks us again for the system name, and we answer ‘X’ (which is not a system available in the dataset). The agent calls the next agent, which in turn calls the SQL tool to get data from the first dataset. Because the query returns no data, the agent shows the empty result and asks us to confirm.

Agents' ability to understand unknown information provided
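To make the empty-result case concrete, here is a minimal sketch of what such a SQL tool could look like, using an in-memory SQLite database. The table name `machine_details` and its columns are assumptions made for illustration, not the demo's actual schema, and only the System B subsystems named in the report later in this post are loaded.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a hypothetical machine-details table with System B only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE machine_details ("
        "line TEXT, system TEXT, subsystem_id INTEGER, subsystem_name TEXT)"
    )
    rows = [
        ("1", "B", 1, "Boiler"),
        ("1", "B", 2, "Heat Exchanger"),
        ("1", "B", 3, "Pump"),
        ("1", "B", 4, "Valve"),
        ("1", "B", 5, "Compressor"),
    ]
    conn.executemany("INSERT INTO machine_details VALUES (?, ?, ?, ?)", rows)
    return conn


def fetch_subsystems(conn: sqlite3.Connection, line: str, system: str) -> List[Tuple[int, str]]:
    """Return (subsystem_id, subsystem_name) rows; empty if the system is unknown."""
    cursor = conn.execute(
        "SELECT subsystem_id, subsystem_name FROM machine_details "
        "WHERE line = ? AND system = ? ORDER BY subsystem_id",
        (line, system),  # parameters are bound, never string-formatted
    )
    return cursor.fetchall()


conn = build_demo_db()
print(fetch_subsystems(conn, "1", "X"))  # unknown system: no rows
print(fetch_subsystems(conn, "1", "B"))  # the five subsystems of System B
```

An empty list is a normal result rather than an error, which is what lets the agent show it to the engineer and ask how to proceed.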

Human confirmation is an important step in any agentic framework, so we need to provide the acknowledgment. In this case, the agent asks us whether we want to proceed. Because the dataset was empty, we say “no”, and the current agent hands control back to the first agent (the agent-to-agent flow is cyclic, so control can return to an earlier agent). This time, the agent again asks us for the factory line name and the name of the system that failed.

Agent's ability to understand no and start over
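The hand-offs described above can be summarised as a tiny state machine. This is a sketch of the control flow only, under the assumption that there are exactly three agents as mentioned earlier; the names `Agent` and `next_agent` are hypothetical and do not come from any agent framework.

```python
from enum import Enum, auto
from typing import Optional


class Agent(Enum):
    ORCHESTRATOR = auto()      # gathers factory line and system name
    SQL = auto()               # fetches data and asks for confirmation
    CAUSAL_ANALYSIS = auto()   # produces the report (O1 in the demo)


def next_agent(current: Agent, *, slots_complete: bool = False,
               confirmed: Optional[bool] = None) -> Agent:
    """Decide which agent gets control after the current turn."""
    if current is Agent.ORCHESTRATOR:
        # Guardrail: stay put until both pieces of information are known.
        return Agent.SQL if slots_complete else Agent.ORCHESTRATOR
    if current is Agent.SQL:
        if confirmed is None:
            return Agent.SQL              # still waiting for the human
        # "yes" moves forward, "no" loops back to the start.
        return Agent.CAUSAL_ANALYSIS if confirmed else Agent.ORCHESTRATOR
    return Agent.CAUSAL_ANALYSIS


# The walkthrough so far: the lookup for system 'X' is rejected by the user.
state = next_agent(Agent.ORCHESTRATOR, slots_complete=True)
state = next_agent(state, confirmed=False)
print(state)
```

Keeping the human confirmation as an explicit input to the transition makes it hard for the flow to reach the analysis agent by accident.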

Once we provide the details (this time we give the system name as B), the next agent (the SQL agent) again fetches the data from the datasets and shows it before asking for confirmation.

Agent fetching tabular data

This time, the data fetched looks OK: this is exactly the system for which the operations engineer wants to identify the root cause of failure.

Confirmation screen to validate data obtained are the same.

Once the current agent receives the confirmation, it calls the next agent, which is the causal analysis agent (powered by O1). The agent asks us what we want to do. Based on our answer, which is to generate an ISO-standard causal analysis report, it calls the underlying tools and invokes the O1 model.

Agents confirming the data obtained is correct

As mentioned, the O1 model has built-in reasoning and chain of thought. In our experience, the response sometimes takes a while (about 30 seconds) to generate.

Agents now doing root cause analysis using ISO 2009:2015 format

We now start seeing some answers. The agent actually has 3 tools. It calls the O1 and GPT-4o models simultaneously, and the side-by-side comparison gives us a good understanding of the differences between the models.

Comparing output between o1 and 4o
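One simple way to get a side-by-side comparison is to send the same prompt to both models in parallel and collect the answers by model name. The sketch below shows that pattern with Python's standard thread pool. The two model functions are stubs we made up so the example runs offline; in a real setup each would wrap a client call to the corresponding deployment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def compare_models(prompt: str, models: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send one prompt to every model concurrently and return answers by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {name: pool.submit(call, prompt) for name, call in models.items()}
        # result() blocks until that model has answered (or re-raises its error).
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the real model calls.
def fake_o1(prompt: str) -> str:
    return f"[o1 stub] analysis of: {prompt}"


def fake_gpt4o(prompt: str) -> str:
    return f"[gpt-4o stub] analysis of: {prompt}"


answers = compare_models(
    "Find the root cause of the latest failure in System B",
    {"o1": fake_o1, "gpt-4o": fake_gpt4o},
)
for name, text in answers.items():
    print(f"{name}: {text}")
```

Because the slower model dominates the wait, running the calls concurrently means the comparison takes roughly as long as the reasoning model alone.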
Below is the complete report:

Root Cause Analysis Report for Latest Failure in System B (Valve, Subsystem 4)

Date: October 2, 2023

---

## Executive Summary

On October 1, 2023, at 18:40, System B experienced a failure in Subsystem 4 (Valve) due to excessive vibration, as indicated by Error Code E103. This report analyzes the operational data leading up to the failure to identify the root cause. The analysis reveals a progressive increase in pressure and vibration levels in both Subsystem 1 (Boiler) and Subsystem 4 (Valve), coupled with a consistent decrease in system output. The root cause is identified as a systemic pressure escalation originating from the Boiler, likely due to scaling or blockages, which overburdened the Valve, leading to mechanical stress and eventual failure.

---

## Background

System B operates on Line 1 and comprises five subsystems:

1. Boiler (Subsystem 1)
2. Heat Exchanger (Subsystem 2)
3. Pump (Subsystem 3)
4. Valve (Subsystem 4)
5. Compressor (Subsystem 5)

On multiple occasions throughout October 1, failures were recorded in Subsystems 1 and 4. The latest failure occurred in Subsystem 4 at 18:40, with the cause cited as “Vibration Excess” (Error Code E103). Understanding the underlying issues necessitates analyzing the operational parameters of these subsystems over time.

---

## Data Analysis

### 1. Failure Events Overview

- Subsystem 1 (Boiler) Failures:
  - Experienced periodic failures due to “Pressure Spike” (Error Code E102) at approximately hourly intervals.
  - Failure Times: 08:25, 09:40, 10:55, 12:10, 13:25, 14:40, 15:55, 17:10, 18:25.

- Subsystem 4 (Valve) Failures:
  - Experienced failures due to “Vibration Excess” (Error Code E103) shortly after Boiler failures.
  - Failure Times: 08:40, 09:55, 11:10, 12:25, 13:40, 14:55, 16:10, 17:25, 18:40.

### 2. Trend Analysis of Operational Parameters

#### Subsystem 4 (Valve) Parameters Leading to Failure:

| Time | Temperature (°C) | Pressure (bar) | Vibration (mm/s) | Output (%) | Energy Consumption (kWh) |
|---|---|---|---|---|---|
| 08:40 | 77.0 | 5.5 | 1.7 | 97 | 54 |
| 09:55 | 77.1 | 5.6 | 1.8 | 96 | 54 |
| 11:10 | 77.2 | 5.7 | 1.9 | 95 | 54 |
| 12:25 | 77.3 | 5.8 | 2.1 | 94 | 54 |
| 13:40 | 77.4 | 5.9 | 2.2 | 93 | 54 |
| 14:55 | 77.5 | 6.0 | 2.3 | 92 | 54 |
| 16:10 | 77.6 | 6.1 | 2.4 | 91 | 54 |
| 17:25 | 77.7 | 6.2 | 2.5 | 90 | 54 |
| 18:40 | 77.8 | 6.3 | 2.6 | 89 | 54 |

- Observations:
  - Pressure and Vibration: Both parameters show a steady increase over time.
  - Output: There is a consistent decrease in output efficiency.
  - Energy Consumption: Remains constant despite decreasing output, indicating inefficiency.

#### Subsystem 1 (Boiler) Parameters Leading to Failure:

| Time | Temperature (°C) | Pressure (bar) | Vibration (mm/s) | Output (%) | Energy Consumption (kWh) |
|---|---|---|---|---|---|
| 08:25 | 79.0 | 5.4 | 1.6 | 95 | 53 |
| 09:40 | 79.1 | 5.5 | 1.7 | 94 | 53 |
| 10:55 | 79.2 | 5.6 | 1.8 | 93 | 53 |
| 12:10 | 79.3 | 5.7 | 1.9 | 92 | 53 |
| 13:25 | 79.4 | 5.8 | 2.0 | 91 | 53 |
| 14:40 | 79.5 | 5.9 | 2.1 | 90 | 53 |
| 15:55 | 79.6 | 6.0 | 2.2 | 89 | 53 |
| 17:10 | 79.7 | 6.1 | 2.3 | 88 | 53 |
| 18:25 | 79.8 | 6.2 | 2.4 | 87 | 53 |

- Observations:
  - Pressure and Vibration: Similar upward trends as seen in the Valve.
  - Output: Decreasing output mirrors that of the Valve.
  - Energy Consumption: Steady despite output decline.

### 3. Cross-Subsystem Correlation

- Pressure Increase Correlation:
  - The incremental pressure rise in the Boiler precedes the pressure increase in the Valve.
  - Suggests that pressure buildup in the Boiler is propagating downstream.

- Vibration Increase Correlation:
  - Elevated pressure in the Valve leads to increased mechanical stress, resulting in higher vibration levels.
  - The Valve is sensitive to pressure changes due to its role in controlling flow.

### 4. System Inefficiency

- Constant Energy Consumption with Decreasing Output:
  - Indicates the system requires the same amount of energy to produce less output.
  - Potentially due to internal blockages or mechanical degradation increasing resistance.

- Equipment Wear and Potential Scaling:
  - Continuous operation without adequate maintenance can lead to scaling within the Boiler and Valve.
  - Scaling reduces the internal diameter, causing higher pressure and reduced flow efficiency.

---

## Findings

1. Progressive Pressure Increase:
   - Both the Boiler and Valve show a consistent rise in pressure over time.
   - The pressure in the Boiler rose from 5.4 bar to 6.2 bar.
   - The pressure in the Valve increased from 5.5 bar to 6.3 bar.

2. Incremental Vibration Levels:
   - Vibration levels in the Valve increased from 1.7 mm/s to 2.6 mm/s.
   - Elevated vibration is a mechanical response to increased pressure and flow resistance.

3. Decreasing Output with Constant Energy Consumption:
   - Output efficiency in the Valve decreased from 97% to 89%.
   - Energy consumption remained at 54 kWh throughout.

4. Failure Pattern Timing:
   - Failures in the Valve consistently occur shortly after Boiler failures.
   - Suggests a causative relationship between Boiler issues and Valve failures.

5. Potential Scaling and Blockage:
   - Likely accumulation of deposits within the Boiler and Valve.
   - Scaling leads to narrowed pathways, increasing pressure and causing equipment strain.

---

## Conclusion

The latest failure in System B’s Subsystem 4 (Valve) was caused by excessive vibration resulting from progressively increasing pressure within the system. The root cause is identified as systemic pressure escalation originating from the Boiler (Subsystem 1), likely due to internal scaling or blockages. The increased pressure from the Boiler propagated downstream to the Valve, causing mechanical stress and elevated vibration levels beyond operational thresholds, leading to failure.

---

## Recommendations

1. Immediate Maintenance Actions:
   - Inspect and Clean the Boiler:
     - Perform a thorough inspection for scaling, corrosion, or blockages.
     - Remove any deposits to restore normal pressure levels.
   - Inspect the Valve:
     - Check for mechanical wear, damage, or scaling.
     - Replace or repair components as necessary.

2. Implement Preventative Maintenance Schedule:
   - Establish regular cleaning and inspection routines for all subsystems.
   - Monitor key parameters like pressure and vibration closely.

3. Install Monitoring Sensors:
   - Deploy advanced sensors to track pressure and vibration in real-time.
   - Set up alerts for when parameters exceed safe thresholds.

4. System Optimization:
   - Evaluate the need for system upgrades to handle operational demands.
   - Consider using materials less susceptible to scaling.

5. Staff Training:
   - Train operational staff to recognize early signs of equipment stress.
   - Promote proactive responses to parameter deviations.

6. Review Operational Procedures:
   - Assess whether operational practices contribute to equipment stress.
   - Adjust processes to maintain parameters within optimal ranges.

---

## Appendices

### A. Operational Data Extracts

Subsystem 4 (Valve) Detailed Data Leading to Failure:

| Timestamp | Temperature (°C) | Pressure (bar) | Vibration (mm/s) | Voltage (V) | Current (A) | Output (%) | Energy Consumption (kWh) | Failure | Error Code | Cause |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-10-01 08:40 | 77.0 | 5.5 | 1.7 | 219 | 5.3 | 97 | 54 | Yes | E103 | Vibration Excess |
| … | … | … | … | … | … | … | … | … | … | … |
| 2023-10-01 18:40 | 77.8 | 6.3 | 2.6 | 219 | 5.3 | 89 | 54 | Yes | E103 | Vibration Excess |

Subsystem 1 (Boiler) Detailed Data Leading to Failure:

| Timestamp | Temperature (°C) | Pressure (bar) | Vibration (mm/s) | Voltage (V) | Current (A) | Output (%) | Energy Consumption (kWh) | Failure | Error Code | Cause |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-10-01 08:25 | 79.0 | 5.4 | 1.6 | 218 | 5.4 | 95 | 53 | Yes | E102 | Pressure Spike |
| … | … | … | … | … | … | … | … | … | … | … |
| 2023-10-01 18:25 | 79.8 | 6.2 | 2.4 | 218 | 5.4 | 87 | 53 | Yes | E102 | Pressure Spike |

### B. Graphical Representations

(Note: Include line graphs showing the trends of Pressure and Vibration over time for both Subsystems 1 and 4.)

### C. References

- Operational Data Logs from October 1, 2023
- Equipment Maintenance Manuals
- Industry Standards on Boiler and Valve Operations

---

Prepared by:

Engineering Analysis Team

Reviewed and Approved by:

[Supervisor’s Name]



End of Report generated.
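Since the report is model-generated, it is worth cross-checking its timing claims against the failure times it lists. The short sketch below does that with plain Python; the helper names are ours, and the two lists are copied from the "Failure Events Overview" section of the report.

```python
from datetime import datetime
from typing import List

# Failure times as listed in the generated report.
BOILER_FAILURES = ["08:25", "09:40", "10:55", "12:10", "13:25",
                   "14:40", "15:55", "17:10", "18:25"]
VALVE_FAILURES = ["08:40", "09:55", "11:10", "12:25", "13:40",
                  "14:55", "16:10", "17:25", "18:40"]


def to_minutes(hhmm: str) -> int:
    """Convert an HH:MM string to minutes since midnight."""
    parsed = datetime.strptime(hhmm, "%H:%M")
    return parsed.hour * 60 + parsed.minute


def lags(upstream: List[str], downstream: List[str]) -> List[int]:
    """Minutes between each upstream failure and the paired downstream one."""
    return [to_minutes(d) - to_minutes(u) for u, d in zip(upstream, downstream)]


def intervals(times: List[str]) -> List[int]:
    """Minutes between consecutive failures of the same subsystem."""
    return [to_minutes(b) - to_minutes(a) for a, b in zip(times, times[1:])]


print(sorted(set(lags(BOILER_FAILURES, VALVE_FAILURES))))  # [15]
print(sorted(set(intervals(BOILER_FAILURES))))             # [75]
```

Every Valve failure trails a Boiler failure by exactly 15 minutes, which supports the report's "shortly after" finding. The Boiler failures themselves are 75 minutes apart, so "approximately hourly" in the report is a loose description rather than an exact one.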

Finally, the last tool uses code generation (which O1 also excels at) to slice and dice the data in ways that can be displayed graphically to show the trends, as below.

Report generated by o1 to show trends

All of these charts are generated by code that O1 wrote (mind you, at this point O1 does not have multimodality), which shows not only its reasoning capability but some decent code generation capabilities as well!

showing code generation to draw charts
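For readers who want to reproduce a similar chart, here is a minimal sketch of the kind of plotting code involved. It is not the code O1 produced in the demo; it is our own example, it assumes matplotlib is installed, and it uses the Valve readings from the report above. The function name and output file name are hypothetical.

```python
# Valve (Subsystem 4) readings copied from the report's trend table.
TIMES = ["08:40", "09:55", "11:10", "12:25", "13:40",
         "14:55", "16:10", "17:25", "18:40"]
VALVE_PRESSURE = [5.5, 5.6, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2, 6.3]   # bar
VALVE_VIBRATION = [1.7, 1.8, 1.9, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6]  # mm/s


def plot_valve_trends(path: str = "valve_trends.png") -> str:
    """Draw pressure and vibration on twin y-axes and save the figure."""
    # Imported here so the data above can be used without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax_pressure = plt.subplots(figsize=(8, 4))
    ax_pressure.plot(TIMES, VALVE_PRESSURE, marker="o", color="tab:blue")
    ax_pressure.set_xlabel("Time of failure event")
    ax_pressure.set_ylabel("Pressure (bar)", color="tab:blue")

    # Second y-axis sharing the same x-axis, for the vibration series.
    ax_vibration = ax_pressure.twinx()
    ax_vibration.plot(TIMES, VALVE_VIBRATION, marker="s", color="tab:red")
    ax_vibration.set_ylabel("Vibration (mm/s)", color="tab:red")

    ax_pressure.set_title("System B, Subsystem 4 (Valve): pressure and vibration")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path


if __name__ == "__main__":
    print(plot_valve_trends())
```

Twin axes keep both upward trends visible on one chart even though pressure and vibration use different units and scales.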

This concludes our initial review and use-case-centric analysis of the O1 model. We took the model for a spin within a well-defined use case, and the results were very impressive!

Please leave us your comments and suggestions.
