Make sure you learn the right lessons
BP Texas City – A Contrarian View
“The greatest enemy of the truth is not the lie, but the myth.” – John F. Kennedy
On March 23, 2005, an explosion occurred at the BP refinery in Texas City, Texas. Fifteen people were killed and more than 170 injured. Investigations of the accident were conducted by both external (the US Chemical Safety Board) and internal (a panel led by former Secretary of State James Baker) bodies to (1) determine the root cause(s) of the explosion and (2) make recommendations to prevent such events from happening again.
We have the answers. Yes, but…
My first job out of college was for GPU Nuclear after the accident at their Three Mile Island Nuclear Power Plant. We used to chuckle at various published accounts of the accident, amused at the minor items that were portrayed as major and the omission of major items altogether. The problem is that once published, someone’s opinion becomes “fact.” I see a potentially similar phenomenon with the “lessons learned” from the BP Texas City explosion. Official explanations go unchallenged, and conjecture is treated as objective truth.
I have just finished reading “Failure to Learn: The BP Texas City Refinery Disaster” by Andrew Hopkins. I recommend the book; it is very well written and relies heavily on citations from the official analyses of the accident. However, this reliance on the “official” investigations made me think of the tagline from the movie Absence of Malice: “Everything they said was true, it just wasn’t right.” In repeating what the official investigations “found” to be the causal elements of the accident, the statements in the book are inherently true, even if those findings were not the contributors to the accident that the investigators said they were. The creation of myth has now begun.
My goal here is to provide a contrarian view of the accident analysis, looking at the lessons learned from my perspective of 35+ years in the refining business. While many of the findings may be correct, I want to point out how they may also be wrong. This is not to deflect blame or responsibility from BP or anyone else, but to ensure that the real problems are solved (not just the media-friendly ones) and to point out the difficulties in implementing some seemingly easy suggestions. This paper may raise issues at some plants that should prompt a closer examination of what lessons have been learned from the accident. My hope is the same as everyone else’s: learn from the accident to prevent it from happening again. For full disclosure, I have done work at the Texas City refinery both when it was owned by Amoco and when it was owned by BP; it is now owned by Marathon.
Before going into the analysis of the accident, I highly recommend James Reason’s “Managing the Risks of Organizational Accidents,” in which he makes two particularly salient points. First, a major catastrophe is not a good indicator of a poorly run company. The odds of a catastrophe are so small, and the alignment of the causal elements so unlikely, that a “safe” organization can have a catastrophe just as easily as an “unsafe” one. With that in mind, do we know that BP is poorly run relative to the rest of the industry? They may be average or above. The accident alone cannot be construed as an indictment of BP’s operating or safety practices; even the best possible company is not immune to the random alignment of low-probability events, as the sketch below illustrates. So no one should conclude from the accident that the issue is unique to BP. While some in the industry readily acknowledge that BP is similar to the other companies, I have also heard people jumping on the bandwagon to say, “I could see this coming.” Second, safety and operation are inherently in a state of tension. Too much focus on operations, and safety suffers and accidents become more likely; an excessive focus on safety, and the operation will likely go out of business. If safety were allowed to dominate until society became risk-free, we would not be allowed to drive cars, as people die in car accidents.
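As a back-of-the-envelope illustration of Reason’s first point, consider a minimal sketch in which every number is invented for the sake of argument: even a fleet of well-run sites, each with tiny per-year catastrophe odds, is quite likely to suffer a catastrophe somewhere over a decade.

```python
# A minimal sketch, not real industry data: the rates below are hypothetical.
P_SAFE, P_UNSAFE = 1e-3, 5e-3    # assumed catastrophe odds per site-year
SITES, YEARS = 50, 10            # a hypothetical fleet observed for a decade

def p_at_least_one(p_per_site_year, exposures):
    """Probability of at least one catastrophe across independent site-years."""
    return 1 - (1 - p_per_site_year) ** exposures

print(f"well-run fleet:   {p_at_least_one(P_SAFE, SITES * YEARS):.0%}")    # ~39%
print(f"poorly run fleet: {p_at_least_one(P_UNSAFE, SITES * YEARS):.0%}")  # ~92%
```

Under these assumed rates, the poorly run fleet is far more exposed, yet a catastrophe at a well-run site is hardly a surprise over a decade. A single event is weak evidence about how well the company is run.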
Faulty Level Indication. Yes, but…
The simple explanation of the accident was that a tower was overfilled. As such, much has been made of filling the tower above the range of the level instrument, so that the operators had no idea from the level gauge how much was in the vessel. I believe there has been an over-fixation on the level indication in the tower. A similar phenomenon occurred after the Three Mile Island (TMI) accident. Much was made of the location of the Reactor Coolant Drain Tank (RCDT) level on a back panel. Had it been more prominent, the operators MAY have diagnosed the problem earlier. However, there were other indications in the control room that could have told the TMI operators what was occurring. The RCDT level is not a critical indication, and the TMI operators’ aberrant mental model of the process was likely a far bigger contributor to the accident. Nevertheless, there was a push to place the RCDT level front and center on the consoles; if some people had had their way, a spotlight would have been shining on it. Yes, the BP operators had no indication of the actual level, but that alone was not the problem, certainly not the major problem. (16)
There is an implication that the level in the column should be a safe operating limit not to be exceeded. I disagree. Exceeding that level by itself is not unsafe. Loss of containment, which should have been seen in the blowdown drum, was what created the unsafe condition. The column can be filled to the top, and that alone is not a safety issue. If a safe operating limit is any variable for which I can hypothesize a safety issue in combination with some other variable, then every variable becomes a safe operating limit. And if everything is a safe operating limit, nothing is; there is no differentiation. I would argue that the level in the blowdown drum should be such a limit. (13)
All instruments have a probability of failure. The design of the system should be robust enough that a single failure will not cause a catastrophe. Suppose the level gauge had failed and shown a false indication of 80%. This is not significantly different from intentionally overfilling. How should the system be designed so that there is no uncontrolled release of material? Likely with an alarm on where that material would go, namely the level in the blowdown drum.
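To make the single-failure argument concrete, here is a minimal sketch (with hypothetical alarm setpoints and made-up readings, not an actual DCS configuration) of why the protective alarm belongs on an independent instrument: a tower gauge frozen at a believable value defeats any alarm derived from it, but not an alarm on the blowdown drum.

```python
def check_alarms(tower_level_pct, drum_level_pct):
    """Two independent layers of protection. A frozen tower gauge defeats
    the first alarm but not the second, because the drum level comes from
    a separate instrument on a separate vessel."""
    alarms = []
    if tower_level_pct >= 90.0:                 # hypothetical setpoint
        alarms.append("Tower level high")
    if drum_level_pct >= 50.0:                  # hypothetical setpoint
        alarms.append("Blowdown drum level high: material is leaving the tower")
    return alarms

# A gauge frozen at a believable 80% looks the same as a true 80%;
# only the independent drum alarm reveals the overfill in progress.
print(check_alarms(tower_level_pct=80.0, drum_level_pct=65.0))
```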
There is a conclusion that, since 14 of the previous 18 startups required going out of range, there should have been an investigation into the pattern of operating above the level indication. I can’t imagine that having any value. It likely would have gone like this: Investigator, “You went off scale high on level.” Operations, “Yes, we need to do that to maintain pump suction.” Investigator, “Oh, okay.” Loss of pump suction and potential cavitation are much more likely safety issues than going beyond the 9 ft level on a 160 ft tower. (60)
So what is my take on the level indicator? The startup should not have been allowed without working level indication on the blowdown drum. That was the last line of defense to prevent a release. No one should have bought off on a plan to start up a column with nowhere to put the product. If going above the level indication was needed, then feed flow and bottoms flow should have been trended: if the trend lines do not stay parallel, the level is changing. A manual calculation of level from the flows should have been done hourly, as sketched below.
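The manual calculation is just a mass balance. Here is a minimal sketch; the feed rate, column diameter, and elapsed times are all hypothetical, chosen only to show how quickly an unexported inventory turns into feet of level.

```python
import math

def inferred_level_change_ft(feed_bpd, bottoms_bpd, hours, area_ft2):
    """Estimate the change in liquid level (ft) from a flow imbalance.

    feed_bpd, bottoms_bpd : measured flows, barrels per day
    hours                 : elapsed time since the gauge went off-scale
    area_ft2              : column cross-sectional area, square feet
    """
    BBL_TO_FT3 = 5.615                        # one barrel = 5.615 cubic feet
    net_bbl = (feed_bpd - bottoms_bpd) * hours / 24.0
    return net_bbl * BBL_TO_FT3 / area_ft2

# Hypothetical numbers: 20,000 bpd in, nothing out, a 12 ft diameter column.
area = math.pi * (12.0 / 2) ** 2              # ~113 ft^2
for hr in (1, 2, 3):
    rise = inferred_level_change_ft(20_000, 0, hr, area)
    print(f"after {hr} h: level up roughly {rise:.0f} ft")
```

Hourly numbers like these make it obvious that “material in, nothing out” cannot be sustained on a tower of any height.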
Poor procedures, too tired, undermanned. Yes, but…
Continuing to fill the column defies explanation. Putting material in with nothing coming out is unsustainable. What was the plan? The official explanations provided are dubious at best. Yes, they might have contributed, but they might not; there is no real evidence to argue either way. The official investigations cited four factors that MAY have contributed to the failure to detect the overfilling.
First, the startup spanning multiple shifts was cited as a contributor due to the lack of proper handover. In my experience, whenever an operator comes on shift, the first thing they do after shift relief (and coffee) is to survey their unit, paging through key displays to assess the status of that for which they are responsible. No one blindly believes what the person before them said. They know these guys. Trust, but verify. Shift relief should have increased the probability of catching the problem, not decreased it. (19)
Second, much has been made of the failure to follow procedures. I would argue that most refinery procedures are unfollowable. One emergency procedure that I was walking through with an operator had no actions until page seven; the first six pages contained information required by the “standard” formatting of procedures. Often procedure creation is a cover-your-ass exercise rather than the creation of a useful tool. One refinery that did have useful procedures for its operators used to be considered among the worst in the industry with regard to safety. (11) Should refineries have better procedures? Of course. But it doesn’t require a procedure to know you can’t run a column with material going in and nothing coming out.
Third, the official investigation cited fatigue as a potential reason for failing to notice the overfilled column. Fatigue!? Where is the evidence for this? This strikes me as a conclusion reached before the investigation, one that was going to be attached to something. Yes, the operators had put in long hours. They may have been stressed, but the leap to fatigue is unsubstantiated. The fact that the new crew also missed the problem argues against fatigue: was the new crew as fatigued coming onto shift as the outgoing crew was after working 12 hours? (21)
Fourth, inadequate control room staffing was cited as a causal factor. Again, where is the evidence? The fact that something was missed does not mean that there were insufficient personnel present. How many operators should have been present? I guess enough so that the accident didn’t happen. However, this is classic hindsight justification. Nothing indicates that the operators had too many tasks to do in order to perform the column startup safely. Operators on units routinely start up and shut down equipment while continuing to monitor far more instrumentation than the Texas City operators had to watch. (22)
All four of these are POTENTIAL examples of the type of fallacious reasoning that gets the fancy Latin name “post hoc, ergo propter hoc.” In general, this is the error of saying A caused B simply because A occurred before B. In the Texas City case, any actual or perceived negative characteristic of the plant can be said to have had a causal influence on the accident. But did it? More evidence is needed than simply that one event preceded another.
Poor Safety Culture. Yes, but…
In almost all accident investigations there is a discussion of the safety culture. It is not clear what criteria are used to determine that. My experience is that it is easy to see, but not easy to define. In fact, some of the best safety cultures would not recognize that about themselves. They do things safely not because safety has some preeminence or there is a special program; they do things safely because it is the right way to do it. In light of the accident, poor safety culture is all but a given. However, what exactly does that mean? Can I identify a poor safety culture a priori? (141)
The fact that management had delayed the startup due to the lack of rundown is telling. Even if the instruction was not relayed, it highlights the unsustainable nature of what was attempted. It is emphasized at every location I have been to that if you can’t do it safely, don’t do it. Commencing the startup when management knew it shouldn’t be done strikes me as a Nuremberg defense: “I was just doing what the night orders told me to do.” At some point operators need to refuse to continue an activity they feel is unsafe. This is likely to be politically unpopular, but at some point individual responsibility must be addressed. If management is to be held responsible, so, too, must all those involved. It will likely be argued that the culture was such that the operators didn’t feel they could stop the startup. I find that such tacit prohibitions come more from the operators themselves than from management. One manager I knew instructed the operators on his unit to “go on circulation” if they ever went off-spec with a product. The operators could not imagine that the manager was serious, and it took six months of his repeating it before the operators would make the change. (19)
I wonder about a group of people who continue to work at a place where they are “deeply worried about the possibility of disaster,” as reported in the survey they were given. Granted, I have seen managers who are clueless about safety. At one location, the safety manager and unit manager could not imagine a chlorine leak reaching the plant fence; later, they had to improvise a fix in their garages for a leak that required the plant and surrounding community to evacuate. However, I also know of operators who idolized an operator who had waded into knee-deep gasoline to shut down a pump. That same operator later died in an avoidable accident. If people believe safety is a problem, why don’t they do something about it? (71)
I have been to some bad refineries. Most of the management personnel knew things could and should be better. Not everything can be done immediately. Where do you start?
Profits over safety. Yes, but…
Big oil and its profits are a routine target of the news media and public ire. BP is a for-profit company, providing a return on investment to the millions of individuals and institutions who have invested in it. But do profits trump safety?
It was suggested that BP should have closed or sold the refinery if it was not able to provide adequate capital investment. I can imagine the local headlines if BP had announced the closure of one of the largest refineries in the country. The loss of jobs would be decried as inhumane, and the natural increase in gasoline prices with such a closure would be evidence of BP’s being ruled by the profit motive. Nor can I imagine that selling the facility to another oil company would have had any immediate benefit: whoever bought it would not likely be as well capitalized as BP and would probably have little cash left over after such a purchase. This sounds easy, but isn’t. (75)
Upper management’s desire to cut costs was seen as problematic in leading to cuts in safety-related expenditures. This logic is flawed in that it assumes that, had cost reduction not been an issue, lessons from previous events would have been learned and changes would have been made. Maybe they would have; maybe not. The proposed solution is a safety team that would ensure cost-cutting is not done at the expense of safe operation. I have a hard time imagining that such a team would be of much value. My experience with safety departments is that safety is not just the top priority, it is the only priority. If as a culture we wanted to improve highway safety, cars would not be allowed to go above 35 MPH. We accept a certain risk, and deaths occur that could be prevented, because life has competing demands. The nuclear industry had entire departments dedicated to “quality.” Often these departments became bureaucratic beasts that prevented anything from getting done. As is human nature, those who actually wanted to accomplish something found ways to work around the quality group. I see the same thing arising with a safety team. (81)
BP was criticized for delaying routine maintenance. I have certainly seen cases where I disagreed with maintenance deferrals, but it is inherently a judgment call. However, the official statement that “improving reliability in this way decreases real reliability of a unit” seems unfounded. The implication is that deferring maintenance leads to bigger downtime later, yet all the data I have seen show refinery reliability continually increasing. The implication that deferrals result in longer shutdowns later on has not been borne out to date. (79)
Safety needs to be the top priority. Yes, but…
It has been said that everyone thinks they have a sense of humor; it’s just that everyone doesn’t. Likewise, everyone wants a safe work environment. Unfortunately, creating that is often muddled by different views of what it means to be safe.
I agree with Andrew Hopkins that current “measures” of safety performance have much room for improvement. The author points out the manipulation of such measures. I was amused when one company wanted a 15% increase in near-miss reports for the next year. Of course what they really wanted was to get people to write more reports, not that they wanted more near-misses. Extreme emphasis on safety metrics can be counterproductive, as no one wants to be the person to break the record or streak. Items can go unreported, and therefore unlearned, in order that the streak not be broken. (86)
BP Texas City was supposedly not listening to their HSE department. In all honesty, sometimes you shouldn’t listen to them. I find that many HSE departments have a Chicken Little attitude—Everything is a crisis. As stated before, if everything is a crisis, nothing is. Of course, if one day the sky did fall, Chicken Little would be allowed to say, “See! I’ve been telling you.” Perspective is a valuable trait that is not easy to quantify or impart. (108)
I consider much of what I do to be related to process safety. However, I am not sure what skills a process safety specialist should have. Suggesting that regulators should “examine the position and powers of a company safety specialist” would require that there be some standard for that position. To my knowledge, no such standard exists. (152) (157)
A different organization is needed. Yes, but…
Determination of the “right” organizational structure is almost as elusive as defining safety culture. Management consultants are constantly coming up with new organizations, usually based upon some company that is doing particularly well with a novel structure. However, the constant change in what is touted as best should be an indication that there is no “right” organization.
The BP organization was criticized for having support groups who were not responsible for ensuring that standards were followed. Forcing facilities to meet or follow certain standards is another two-edged sword. In the post-Three Mile Island world, numerous new standards arose. Surprisingly, we as the operators of Three Mile Island pushed back against these requirements far more than the other nuclear utility companies. Why? Because we had followed all the standards up to the time of the accident, and they did not absolve us of liability. The company then took the attitude of doing what was right, not just what was required. I saw other companies that viewed the new human factors standards as something they simply had to meet in order to operate; gaining benefit was less important than “checking a box.” (93)
It was argued that the organization was too flat. Since all organizational structures have inherent strengths and weaknesses, was a comment on the height of the organization inevitable? Had a taller organization been used, would there have been criticism of decision makers too far removed from the day-to-day or too many layers of management to get things done quickly? No matter what organization is used, a valid case can always be made for the alternate due to the inherent strengths and weaknesses of each organizational structure. (76)
“Senior managers have the greatest influence on safety,” say the official reports. Do they? It sounds reasonable, but is it true? Within the BP refineries in the US, at least two (Cherry Point and Carson) were seen to have good safety cultures. But don’t they ultimately report to the same person as Texas City? If senior managers have the greatest influence, shouldn’t all the refineries under the same manager have similar safety cultures and attitudes? Obviously managers have an influence, but is it the greatest? And if not them, then who or what is? That may be a critical question that deserves to be answered. (87) (143)
The span of control for the supervisors was seen as being too great for them to supervise effectively. While this can certainly occur, I am not sure how the conclusion was drawn. I wonder, though, how much of the problem stems from tasks unrelated to supervision that have been placed upon the supervisors. Like much of corporate America, oil companies have a plethora of meetings, many of them low in value, and choosing attendance at low-value meetings over supervising employees is a matter of priorities. One refinery manager whom I respect held a series of get-togethers with all the hourly workers, and one of his questions was, “How often do you meet with your immediate supervisor?” If any criticism should be leveled on the issue of supervisor span of control, failure to prioritize might be the most salient. (117)
I would concur that senior leadership not going to sites is a problem. Managers at all levels need to get out into the field, to the detailed operation of the units. Spending a little time in a control room in the afternoon and evening is a marvelous means to gather information on how things really run. Yes, there will be some exaggeration but a lot of truth as well. One of the best operation managers that I have encountered would routinely walk through the central control room just to make sure that he was seen, and operators knew he was available. (116)
A comment that I hear often, and that was echoed at Texas City, is that management does not discipline fellow workers enough. This could easily be dismissed as a function of the plant being unionized. Here is an area where I do see the impact of upper management: setting clear expectations and requirements for people throughout the organization. Formal reprimands are then not needed, just reminders of what is expected: “We wear our PPE.” “We follow procedures.” (126)
Criticism of the senior BP VP for accepting his subordinate’s fortress mentality is likely accurate, but it is also a slippery slope. How can someone know that which they haven’t been told? There is a certain need to rely on those reporting to you to provide relevant information. The solution, as alluded to earlier, is to meet with the individuals who report to that supervisor, people several levels below you. (130)
Better training is needed. Yes, but…
Training is usually the fallback for overcoming bad designs: “We’ll train them to be careful.” Because knowledge is infinite, training can always be cited as a causal factor. (“If only they knew it better.”)
“Computer-based training was widely seen as ineffective.” This implies that the other training was effective, again without evidence. It may have been ineffective, but to argue that the old training was better does not necessarily follow. Personal preferences for training have been shown to have little relation to improvements in job performance. Whether they liked or didn’t like the training is largely irrelevant to whether it imparted the necessary skills and knowledge. (79)
The myth that high-fidelity simulators are the solution to training, using the airline analogy, is heard frequently in the industry. The analogy errs by ignoring the low-fidelity simulation that is also used in aerospace, and by forgetting that today’s high-fidelity simulations are the product of decades of training evolution. The oil industry lacks the understanding and infrastructure for immediate use of simulators, and the wide variety of plant designs makes simulator use more costly than in aviation, where one simulator can be developed for many users. I was involved in evaluating a high-fidelity simulator that an oil company had purchased for $500,000. Despite the cost, it could do little to enhance the critical skills of the operators on the unit for which it was purchased. It was never used and was eventually cannibalized for spare parts. The industry is littered with unused simulators, or ones whose use ceased once the unit had been started up and was operational. A hand grenade is just a fancy rock if you don’t know to pull the pin. (80)
The false airline analogy was extended further in the criticism of holding meetings in the control room. Planes depart O’Hare airport once per minute; the average console operator gets six alarms per hour. While there are similarities, refinery operations and air traffic control are not identical matches. (54)
Plants don’t publicize mistakes. Yes, but…
Finally, there was the criticism of failing to learn previous lessons; in particular, the unit had had eight dangerous occurrences over the previous ten years. The investigators found this data difficult to find and felt that it hadn’t been acted upon.
The difficulty in finding the data that there had been eight dangerous occurrences on the unit in the preceding ten years is of no surprise. Fear of litigation results in making much data, particularly if it deals with safety, hard to find. As a consultant, I often know more about common events at two locations in an oil company than do employees at each facility. Sharing lessons is great, unless your sharing and honesty are going to be used against you.
How was it decided that eight occurrences in the previous ten years was a problem? What do most units have? At what number would this NOT have been cited as a causal factor? Five? Two? This is a data point without context. Who knows, maybe eight in ten years is a good number. (61)
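To see how much the missing context matters, here is a minimal sketch; the base rates are invented for illustration, since the point is precisely that no one quoted a real one. Depending on the assumed industry rate, eight events in ten years ranges from a genuine outlier to entirely ordinary.

```python
# A minimal sketch with invented base rates: how surprising is "eight
# dangerous occurrences in ten years" for an average unit?
from scipy.stats import poisson

observed, years = 8, 10
for base_rate in (0.2, 0.5, 0.7, 1.0):        # hypothetical events per unit-year
    expected = base_rate * years
    p = poisson.sf(observed - 1, expected)     # P(X >= 8) if the unit is average
    print(f"base rate {base_rate}/yr: P(>=8 in 10 yr) = {p:.3f}")
```

At an assumed 0.2 events per unit-year, eight in ten would be a real anomaly (p ≈ 0.001); at 0.7, it is unremarkable (p ≈ 0.4). Without the base rate, the number carries no verdict.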
There was some criticism that various audits used euphemisms, such as “improvement opportunities,” to identify true safety issues. As consultants, we run into a similar problem, where we are told to “suggest consideration” rather than recommend. All of these semantic gymnastics are the result of fear of litigation. Companies, and in particular their attorneys, want nothing on record indicating that the company was in any way negligent for failing to take action. So again the desire to make things safer is thwarted by the consequences of admitting that improvements could and should be made. (114)
So what should be learned…
The biggest lesson from Texas City is a reminder that we work in a high-risk industry where people die if we don’t do our jobs correctly. This can happen to any company when all of the low-probability events line up at the same time. All of the lessons from Andrew Hopkins’s book should be learned, but with the proper perspective. The goal for radiation exposure in the nuclear industry is “as low as reasonably achievable” (ALARA). Risk in the process industries should be treated the same way, despite all the slogans. Zero risk is impossible, so risk needs to be as low as reasonably achievable, using whatever tools, methods, and practices are at the refineries’ disposal.