Just restore service, we can work out why later ......

I wrote this in 2003. I have taken the text, removed the Sun specifics (well, most of them), updated the examples and cut out the more obscure content. It's still a pretty heavy read, but then problem solving and the diagnostic support that underpins it are not trivial matters. Buckle up, this is not an accessible read. If you get to the end, I hope it gives you something useful, and you have both my thanks and admiration!

Diagnosis is defined as the task "of locating faulty parts of a system when a failure is detected" and should be a core competency for engineers. Failure to perform effective, reliable and timely diagnosis to root cause costs both money and customers. Readers will find examples come to mind without a struggle, many of which involve some shouting from non-technical people or people who have forgotten they used to be technical!

For complex systems (anything which has a computer of any sort involved), a significant number of product failures will occur irrespective of the standard of pre-release quality procedures. Timely and accurate diagnosis will always be required. Therefore the quality of diagnosis support makes a direct and concrete contribution to:

- Value delivered

- Customer satisfaction (availability is underpinned by diagnosis on first fix)

- Product support costs

- Product development lifecycle

- Product quality

A language for assessing the ability to perform effective diagnosis of a product's failures is fundamental to establishing the gaps which exist in an individual product and in the infrastructure in which products function. The gaps may then be addressed at a strategic level for new products, and at a tactical level for existing products where lack of diagnosis support has a significant impact on the end customer.

The history of bridge construction has seen some designs fall down and some stay up. Once some experience with designing a variety of bridges was gained, a study of the engineering principles of bridge design could be made; thus theory and best practice are established and continue to evolve over time. The computer systems industry has been building computer systems for 60-odd years, but has yet to formalise the engineering principles and best practice which provide the foundation for the evolution of diagnosis support.

The Kepner Tregoe Clear Thinking process touches on diagnostic support in a number of places, including Incident Mapping, Think Beyond the Fix, Potential Problem Analysis and puzzling out why you have a gaping NMD in your problem specification.

This is the one real bugbear I have with the WHAT in ITIL Incident Management. It does not consider diagnosis, or that on some rare occasions diagnosis must take priority over restoring service (normally when the phrase "Not again" is in common use).

The Thinking in Diagnosis

Diagnosis is a people-centric process, and therefore effective diagnostic support must support the thought processes of the practitioner. Two primary thought processes involved in diagnosis are Inference and Conjecture.

Inference is a logic which allows the observation of a defined set of events to be attributed to a well defined cause.

For example, from an operating system error message "sd_xbuf_init: illegal chain type", the inference is that the IO chain type is invalid. The string occurs in only one function in the source and, from source code inspection, there can be only one reason for the system to panic with that string. While it may not be clear why the variable which indicates the IO chain type has an invalid value, the next step in the diagnostic process becomes clear. That said, if source code access (or reverse engineering of any sort) is required to understand a debug message, the debug message is unclear and incomplete in the extreme.

Inference is an engineering discipline based on logic and previous human experience.

Conjecture is a belief that a set of observed events was the result of a cause. The validity of the conjecture has yet to be verified. Conjecture is an Art based on imagination and personal experience.

Conjecture is a foundation of the scientific hypothesis and is used for the exploration of the unknown and when the next step in the diagnostic process is not clear. Conjecture is a central element of creative thinking and a vital part of the diagnostic process, but the more that is known about the problem and its environment, the more targeted, and therefore effective, conjecture becomes.

The aim of improving product diagnosability should be to move along the continuum from conjecture towards inference where the observer can match with certainty a set of events to a cause. Improving the ability of an engineer to perform diagnosis on a product means improving the tools and information available for that engineer to perform inference on the relevant behaviours of a product and its encompassing system.

It is important to be clear if inference or conjecture is being used at any point in the diagnosis process.

Dimensions of Diagnosis

It is proposed that diagnosis has six dimensions which influence the quality and appropriateness of Inference and Conjecture and therefore the difficulty in establishing root cause for a particular defect or set of defects which are manifested.

- Convergence

- Distance

- State

- Signature

- Coupling

- Visibility

Convergence

Convergence describes the number of contributory variables which are required to be aligned for a defect to be exposed. In KT Incident Mapping terms we would call these contributing circumstances.

An analogy is a lunar eclipse: such an event is not a daily occurrence. Three independent objects (Sun, Earth and Moon) must be aligned relative to each other, and the effect is only visible to viewers in a limited region. Four variables exist in relation to an occurrence of a lunar eclipse:

- Position of the Sun

- Position of the Earth

- Position of the Moon

- Position of the viewer

Though some types of events can be predicted, we only know an event has occurred if we can observe that event directly or indirectly via measurement of effects.

The convergence of variables which trigger the manifestation of a defect contributes to the difficulty in establishing root cause in two ways:

- Complexity of understanding the contributory factors to the defect

- Difficulty in reproducing the problem

Larger numbers of contributory variables will tend to result in defects being exposed less frequently, making the definition of a reproducible test case more challenging.

Resolving obscure, complex or multiple defects may require multiple fixes. This increases the complexity of both testing and diagnosis.

While the complexity of current environments is huge, the author observes that in a standard installation the number of contributory variables which must be in unique positions for a defect to be exposed tends to be in single digits. (Still true in 2024? I'm not so sure.)
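The effect of convergence on reproducibility can be sketched with some back-of-envelope arithmetic. This is only an illustration: it assumes the contributory variables are independent, and the probabilities are invented rather than measured.

```python
from math import prod

def exposure_probability(alignment_probs):
    """Chance that every contributory variable is aligned in one attempt.

    Assumes the variables are independent, which is an illustrative
    simplification and not a claim about real systems.
    """
    return prod(alignment_probs)

def expected_attempts_to_reproduce(alignment_probs):
    """Mean number of independent attempts before the defect is exposed."""
    return 1 / exposure_probability(alignment_probs)

# Invented figures: each variable is in the triggering position half the time.
two_variables = [0.5, 0.5]
four_variables = [0.5, 0.5, 0.5, 0.5]

print(expected_attempts_to_reproduce(two_variables))   # 4.0
print(expected_attempts_to_reproduce(four_variables))  # 16.0
```

Even with generous odds for each variable, doubling the number of variables that must align multiplies the expected effort to reproduce the problem.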

Distance

Distance describes the number of state changes between a defect occurring and being detected. The greater the number of changes in state, the less evidence exists when the defect's effects are detected.

Distance can be described in terms of time (wall clock or cycles) or number of state changes.

The shorter the distance between the defect occurring and the result of the defect being detected, the easier a problem is to diagnose, because the observed remaining state is a more accurate representation of the state at the time the defect manifested itself. A debug kernel setting, for example, may reduce the distance between an incorrect variable value being set and it being noticed via an assertion.

Consider a mis-directed write to a disk where the wrong data ends up being written to disk. The corrupt data may not be discovered for many days if that data is not used in a way that can detect incorrect data.
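A toy sketch of how checking closer to the point of failure shortens Distance. The `ChecksummedStore` class, its `fault_target` fault-injection parameter and the `CorruptionDetected` exception are all hypothetical names made up for this illustration; real storage stacks work very differently.

```python
import zlib

class CorruptionDetected(Exception):
    """Raised when stored bytes do not match the recorded checksum."""

class ChecksummedStore:
    """Toy block store that records a CRC32 for every block written."""

    def __init__(self):
        self.blocks = {}     # block number -> bytes that actually landed there
        self.checksums = {}  # block number -> CRC32 of the bytes intended

    def write(self, block_no, data, verify=False, fault_target=None):
        # fault_target simulates a mis-directed write (fault injection only).
        landed_at = block_no if fault_target is None else fault_target
        self.blocks[landed_at] = data
        self.checksums[block_no] = zlib.crc32(data)
        if verify:
            self.check(block_no)  # detect at distance zero

    def check(self, block_no):
        stored = self.blocks.get(block_no, b"")
        if zlib.crc32(stored) != self.checksums[block_no]:
            raise CorruptionDetected(f"block {block_no}: checksum mismatch")

store = ChecksummedStore()
store.write(7, b"payload", fault_target=8)  # silently goes to the wrong block
# ... any number of state changes may happen before anyone calls check(7) ...
```

With `verify=True` the same faulty write raises immediately, while the evidence is still fresh. Without it, the defect is only noticed whenever block 7 is next checked, which may be days later.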

State

State is the level of completeness of the system state available for post mortem diagnosis at the point when the defect's presence was detected.

In software, state has by tradition materialised as log files, a kernel crash dump or an application core dump. For performance issues, some system state may be abstracted as the output of performance tools.

Core/crash dumps are the most advanced descriptions of state available today. The worst case availability of system state would be "messages" only.

A number of additional factors must be considered when describing state, given that a system is all the components that a user requires to complete a business task. Elements of the system where a description of state may assist in diagnosis include

  • The user(s)
  • The administrator
  • The network
  • Application(s)
  • Middleware
  • Job submission mechanism
  • System daemons
  • Hardware elements

Establishing the state of human components of a system is an open challenge, but is often required to establish root cause. Some issues require state to be available over a period of time.

Diagnosis of performance and corruption issues is made far more realistic if information is available on system state over a period of time. This additional dimension adds considerable complexity.
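One common way to keep state over a period of time without unbounded storage is a fixed-size history of snapshots. The sketch below is a minimal illustration; the `StateHistory` class and the sample fields are hypothetical, not taken from any particular product.

```python
import time
from collections import deque

class StateHistory:
    """Keeps the most recent `capacity` snapshots of system state."""

    def __init__(self, capacity=60):
        # A deque with maxlen silently discards the oldest sample when full.
        self._samples = deque(maxlen=capacity)

    def record(self, state, timestamp=None):
        when = time.time() if timestamp is None else timestamp
        self._samples.append((when, dict(state)))  # copy to freeze the snapshot

    def window(self, start, end):
        """Return the snapshots taken between start and end inclusive."""
        return [(t, s) for t, s in self._samples if start <= t <= end]

history = StateHistory(capacity=3)
for tick in range(5):
    history.record({"queue_depth": tick * 10}, timestamp=tick)

print(history.window(0, 10))  # only the three most recent samples remain
```

The trade-off is visible in the capacity: a larger history gives the analyst more of the run-up to the failure, at the cost of memory and collection overhead.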

Signature

Defect Signature describes the uniqueness with which a defect is reported.

For example, the operating system failure message "PCI errors" can be raised for many different hardware related issues, across a number of components.

panic : bad trap 0x31

has a Defect Signature which gives a very accurate description of the state which the defect caused, i.e. the kernel has detected an attempt to access a memory location for which the memory management unit cannot find a translation. However, the message gives no indication as to the location of the defect.

Both examples above are described as loose Defect Signatures.

In contrast, the message

unix: sorry, variable 'eri_adv_autoneg_cap' is not defined in the 'eri'

can only have a very limited set of root causes, i.e. "eri_adv_autoneg_cap" is mis-typed in a configuration file.

The last example is described as a tight Defect Signature.

Defect Signature is a major contributor to the timely allocation of the appropriate resource to work on establishing root cause or corrective action, because it enables an accurate problem statement.
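As a rough sketch of what producing a tight signature looks like in code, the checker below names the file, line, module and variable, and suggests a near match. The module name `demo`, its variables and the `check_setting` function are hypothetical; this is not how any real kernel parses its configuration.

```python
import difflib

# Hypothetical module and tunables, invented for the example.
KNOWN_VARIABLES = {"demo": ["link_speed", "autoneg_cap", "mtu"]}

def check_setting(source, line_no, module, name):
    """Return None if the setting is valid, else a tightly scoped message."""
    known = KNOWN_VARIABLES.get(module)
    if known is None:
        return f"{source}:{line_no}: module '{module}' is not known"
    if name in known:
        return None
    # Offer the closest known name, if one is similar enough.
    close = difflib.get_close_matches(name, known, n=1)
    hint = f"; did you mean '{close[0]}'?" if close else ""
    return (f"{source}:{line_no}: variable '{name}' is not defined "
            f"in module '{module}'{hint}")

print(check_setting("demo.conf", 12, "demo", "autoneg_cpa"))
```

Contrast this with a checker that reports only "configuration error": the same defect, but a signature that could match dozens of causes.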

Coupling

Coupling describes a system's ability to localise the dispersal of the effects of a defect before it is recognised or handled.

Any software engineering textbook will describe the merits of loose coupling and tight cohesion (stable interfaces) between software system elements. For diagnosis, Coupling describes the ability of a defect to jump the boundary of system components.

A Java applet running in a browser is confined in both the operations and memory it can access by the Java RunTime that controls its execution. Thus the opportunity for a defect in the applet to produce a confused signature which indicates a root cause in the JVM is limited.

A 3rd party device driver (oww, like Crowdstrike maybe) exhibits poor coupling in this sense (tight coupling, in software engineering terms), as the driver has access to any part of the kernel address space.

Coupling refers to module boundaries (spatial), whereas Distance considers state changes (temporal).

Visibility

Visibility describes the maturity of tool support to understand Convergence, Distance, Coupling and State. The availability of appropriate tool support to interpret live or post-mortem state is a key facility.

Visibility also describes the infrastructure available to the analyst, such as source code, static checking tools (tools such as Lint consider source code rather than executing or post-mortem state) and architecture or design documentation.

A post-mortem kernel crash dump is one example: from it, information about a failure can be extracted regarding Convergence, Distance and State.

Live tracing tools allow the observation of interaction at the well defined boundary between the kernel and application, which contributes to the understanding of Coupling.

Appropriate tool support gives the analyst information at the level they are thinking about the problem with support to drill down or abstract as required.

Measuring Support For Diagnosis

This section presents two proposed approaches, based on the dimensions presented in the previous section, to deriving a measure of diagnosis support.

Assessing Diagnosis by Dimension

Experience with decision analysis suggests that the relative assessment of each dimension is practical, whereas assigning absolute numbers is not. The table below shows an example where two products are assessed with relative merits across the six dimensions for a specific issue. The products should be as similar as practical. Comparing a Network Interface Card to an Active Directory server would serve little purpose.

              MS Word   LaTeX
Convergence      6        10
Distance         2        10
State           10         2
Signature        4        10
Coupling        10         1
Visibility       2        10

Assessment of Word and LaTeX in document format error diagnosis. 0 is the lowest rank (poor) and 10 is the highest rank (good).

An associated description for each dimension will provide the relative shortfalls. For example, the author considered LaTeX as having less convergence (less configurable and less complex having no GUI) than MS Word as problems are easier to reproduce. A discussion describing the facilities of each product to justify its rank is omitted here for brevity, but should accompany the classification.

This approach can be used to provide a Diagnosis gap analysis, where a comparison is made between a best in class for diagnosability and those with poorer facilities. The aim is to bring products with poorer support up to the standard of the best of breed.

The values are subjective, based on the author's experience of the products, and therefore multiple opinions must be sought. This approach has a limitation in that it can only be used to reduce the disparity between products, rather than improve the state of current practice.
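A gap analysis over such relative scores can be sketched in a few lines. The scores are the ones from the Word and LaTeX assessment above; the `gap_analysis` function is just one illustrative way to rank the shortfalls.

```python
# Relative scores from the example assessment (0 = poor, 10 = good).
ms_word = {"Convergence": 6, "Distance": 2, "State": 10,
           "Signature": 4, "Coupling": 10, "Visibility": 2}
latex = {"Convergence": 10, "Distance": 10, "State": 2,
         "Signature": 10, "Coupling": 1, "Visibility": 10}

def gap_analysis(product, reference):
    """List the dimensions where `product` trails `reference`, largest gap first."""
    gaps = {dim: reference[dim] - score
            for dim, score in product.items()
            if reference[dim] > score}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

print(gap_analysis(ms_word, latex))
# [('Distance', 8), ('Visibility', 8), ('Signature', 6), ('Convergence', 4)]
print(gap_analysis(latex, ms_word))
# [('Coupling', 9), ('State', 8)]
```

Because the scores are relative and subjective, the ranking is a prompt for discussion about where to invest, not a measurement.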

Dimensions and KT Problem Analysis

The author observes that the difficulty of answering a process question for a particular issue reflects the maturity of diagnosis support for a product or system.

Convergence : Differences. The greater the number of differences, the greater the number of variables the object has.

Distance : When in the lifecycle. Difficulty with this question may indicate that the location of failure is imprecise or difficult to observe.

State : What defect, Where on the object. A more detailed system state yields a more precise problem statement.

Signature : Possible causes. Many possible causes may indicate a set of problem symptoms which share similar observable features

Coupling : What object. The ability to identify a clear object as the location of the problem is tied to the location of defects.

Visibility : Where on the object. Ability to integrate state.

I offer the observation that the difficulty an engineer has in answering a particular KT Problem Analysis question will be manifested in the quality of the answer to that question. From the quality of the answers to particular questions, the quality of diagnosis support as experienced by the engineer can be derived.

A Diagnosis Maturity Model

I have long been a big fan of the Software Engineering Institute (SEI) Software Capability Maturity Model (CMM), which sets out five levels of maturity in the software engineering process, with each level being more advanced in the quality of the software production process and, by implication, the end product. For readers unfamiliar with the SEI CMM, the five levels in summary are

  1. Initial : Organisation lacks sound practices and during a crisis abandons planned procedure
  2. Repeatable : Planning and tracking of software is stable and repeatable
  3. Defined : Process for development and maintenance is documented across the organisation and software quality is tracked.
  4. Managed : Quantitative quality goals are set and the outcomes are predictable because the process is measured and operates within measurable limits.
  5. Optimising : Focus on pro-active prevention of defects.

I suggest a four level diagnosis capability model against which aspects of the diagnostic capability of a product can be classified.

  1. Random : results of a defect are detected by chance and/or diligence
  2. External : results of a defect can be detected by external monitoring
  3. Self-detecting : detects and reports the results of a defect along with its own state or inputs, but requires expert interpretation or external tool support
  4. Self-determining : detects and reports the results of a defect along with state or inputs, the first-order cause and its locality. Integrated tool support allows non-expert users to establish root cause.

Some types of defect, such as flaws in the flow of logic of a program (add A to B, but instead A is multiplied by B), are generally not detectable by a program or its environment itself. The support provided to observe and understand this class of defect should be considered.

This taxonomy deals only with the detection of failure and determining the root cause of the failure. It does not deal with issues such as recovery from failure.
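The four levels can be written down as an ordered type so that assessments of different failure modes can be compared. This is a minimal sketch; the three yes/no questions used by `classify` are my own simplification of the level descriptions above, not a formal test.

```python
from enum import IntEnum

class DiagnosisMaturity(IntEnum):
    RANDOM = 1            # found by chance and/or diligence
    EXTERNAL = 2          # found by external monitoring
    SELF_DETECTING = 3    # product reports the defect, expert needed to isolate
    SELF_DETERMINING = 4  # product reports the defect and its locality

def classify(externally_monitored, self_detects, self_isolates):
    """Map three simplified yes/no observations onto a maturity level."""
    if self_detects and self_isolates:
        return DiagnosisMaturity.SELF_DETERMINING
    if self_detects:
        return DiagnosisMaturity.SELF_DETECTING
    if externally_monitored:
        return DiagnosisMaturity.EXTERNAL
    return DiagnosisMaturity.RANDOM

# A parity-protected bus that flags an error but cannot say which device caused it.
print(classify(externally_monitored=False, self_detects=True, self_isolates=False).name)
# SELF_DETECTING
```

Using an ordered enumeration means a product's weakest failure mode can be found with `min()` over its classified failure modes.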

Random

The defect is detected by an unconnected part of the system (including the user) by chance.

An example is a bit flipped in a word in a register that is detected by a mis-aligned address access of an unconnected operation.

A user based example might be a segmentation violation of a program where the user entered a number rather than a letter which is first detected by an unrelated part of the program.

External

A tool or test needs to be applied to detect and isolate such issues. If the issue is not being actively looked for, then its detection is random.

An example is a bit flipped in a word in a register which may be detected by a process which writes patterns to memory and reads back a pattern such as Oracle Database with checksumming enabled.

A user based example might be the review (by hand or automated) of an application's audit log or debug output of user inputs examined after a failure.

Self-detecting

The defect is automatically detected, but is not isolated.

An example is a bit flipped on a parity-protected PCI bus, where parity is checked by the PCI North Bridge. The North Bridge detects the presence of the parity error but can't determine which component or type of component was active when the bit was flipped.

In the user based example, the program detects that a number is an inappropriate value and exits with a standard usage message.

The tool support provided with the environment requires expert direction to establish root cause.

Self-determining

The defect can be detected and isolated to an individual component, action or subsystem from error messages alone.

A memory DIMM has a persistent error and the operating system reports on the system console that a particular DIMM requires replacement.

In the user based example, an incorrect value has been entered by the user in field X. The value should be a letter from the set [a-k], and not a number as entered. The user is assisted in undertaking diagnosis where the problem lies in their actions.

The tool support provided with the environment can be used by a generalist to establish root cause.

Static Analysis falls into this category, though in an ideal world should have been carried out prior to product release.
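The user based example for this level can be sketched as a validator that reports which field is wrong, what was received and what was expected. The function name and message wording are illustrative assumptions only.

```python
import re

def validate_field(field_name, value):
    """Return None if value is a single letter a-k, else a self-explanatory message."""
    if re.fullmatch(r"[a-k]", value):
        return None
    kind = "a number" if value.isdigit() else "an unexpected value"
    return (f"field {field_name}: got {value!r} ({kind}); "
            f"expected a single letter from the set [a-k]")

print(validate_field("X", "7"))
# field X: got '7' (a number); expected a single letter from the set [a-k]
```

A Self-detecting program would stop at "invalid input" and a usage message. The extra detail here is what lets a generalist, or the user themselves, establish the cause.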

Summary

Diagnosis is not someone else's problem. If you are trying to restore a service again and again because root cause has not been established, or your KPIs around time to resolve or recover are poor, maybe the quality of diagnostic support is poor and it's time to step back and consider what gaps exist and where it could be improved. Even better, make diagnostic support part of the tender requirements.



A very good article, more than applicable in understanding Human Trauma in early life. Thank you for letting me see, again, that this knowledge can be used as a diagnostic framework where the machine is infinitely more complex. The Human OS needs a better diagnostic tool than is now available; a diagnosis is not the same as understanding the problem. The understanding of the five Skandhas gives fuel to the whole diagnostic world. Carl Jung was also busy unraveling the four states of consciousness, or the liberation from it. Weeks of fun ahead, in understanding the diagnostic world. Thanks for your inspiration.

Berrie Schuurhuis

Software Engineer at AVTware

Thanks Clive! "Diagnosis is not someone elses problem". I cannot agree more, thanks for the insights and suggestions for improved thinking about troubleshooting.
