David Schachter’s Blog: Create an error handling framework

When you write a program, you’re actually writing three programs, intertwingled. One of them, you know about. What are the other two and how can you make them work better for you?

(A nicely-formatted version is available at https://davidschachter.com/Blog.)

You’re a good programmer. Your program is well-structured, with clear names for variables, constants, classes, and functions; good use of white space for readability; proper decomposition into functions, methods, classes, and inheritance; and so on.

What about the “temporary” debug logic? Is that something you thought about or something you hacked in on the fly during debugging?

And what about the error-handling logic? Did you think carefully about how to handle errors, when to recover from them, how to recover from them, how to report them, and how to be consistent?

In https://www.dhirubhai.net/pulse/david-schachters-blog-build-debugging-framework-your-code-schachter/, I wrote about the debugging aspect of programming. I explained how debug logic is like a second strand intertwingled with the code that you write to solve the actual problem.

Error handling is a third strand and the code for it is intertwingled with the rest, making three “programs” woven together.

Error handling can be messy. Getting the flow of control correct for the “normal” cases is challenging enough; doing the right thing when things go wrong is much harder. It’s hard to reason about that logic, hard to test and debug it, and hard to maintain it over time as the environment changes.

The only way I know to get this third strand correct is to think about it before I write anything else and to create a semi-formal structure for error handling. I use Aspect Oriented Programming (implemented with decorators in Python) to separate some of the error handling code from the “main” code, to un-intertwingle the strands a bit.
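
A minimal sketch of such a decorator, assuming a logError() helper like the one sketched below (catchAndLog is an illustrative name, not a library function):

import functools

def catchAndLog(func):
  # Log any exception from the wrapped function in a standard format,
  # returning None instead of letting the exception propagate.
  @functools.wraps(func)
  def wrapper(*args, **kwargs):
    try:
      return func(*args, **kwargs)
    except Exception as e:
      logError("In %s(), exception %s occurred." % (func.__name__, e))
      return None
  return wrapper

Decorating a function with @catchAndLog then removes the try/except boilerplate from its body, pulling that strand of the braid out where it can be read (and fixed) in one place.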

I create some reusable functions for error handling and logging. Occasionally (rarely), I also write functions for recovering from errors: such code tends to be one-off and not reusable. Splitting error-recovery code into a separate function may be useful for code structuring and for reasoning about behavior, but it usually provides no advantage for reusability.
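
logError() itself can be tiny. A minimal sketch, assuming timestamps and stderr are acceptable choices:

import sys
import time

def logError(message):
  # Timestamp each message and flush immediately, so the log
  # survives even if the process dies on the very next statement.
  sys.stderr.write("%s ERROR: %s\n" %
                   (time.strftime("%Y-%m-%d %H:%M:%S"), message))
  sys.stderr.flush()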

Typically, my programs have a high-level flow that looks like this excellent graphic:

main --> initialize(), process(), terminate()

In a long-running program, process() will be an infinite loop or a loop that runs for many iterations and ends after some randomized number of passes. In the latter case, a master process exists to re-spawn dead processes. When process() is a loop, it will set up the environment before each pass, call doWork() to do whatever needs to be done (e.g. read a bunch of records, process them, and write results), and finally clean up the environment at the end of each pass. These two functions tend to look like this:

def process(resourcesFromInitialize):
  resourcesForThisPass = setupEnvironment(resourcesFromInitialize)
  doWork(resourcesForThisPass)
  cleanupEnvironment(resourcesFromInitialize, resourcesForThisPass)

def doWork(resources):
  results = [processRecord(record) for record in getData(resources)]
  for result in results:
    writeResult(resources, result)
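
When process() runs as a long-lived loop, as described above, it might look like this sketch instead (the pass-count bounds are illustrative, not from the original):

import random

def process(resourcesFromInitialize):
  # Looping variant: run a randomized number of passes, then exit so the
  # master process re-spawns a fresh worker. Randomizing the limit keeps
  # a fleet of workers from all restarting at the same moment.
  for _ in range(random.randint(5000, 10000)): # illustrative bounds
    resourcesForThisPass = setupEnvironment(resourcesFromInitialize)
    doWork(resourcesForThisPass)
    cleanupEnvironment(resourcesFromInitialize, resourcesForThisPass)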

To handle errors within this high-level structure, I use try/except blocks. I put them around each low-level function that might return an error, such as file and network operations. Of course, I might miss some such functions, or some might be added during maintenance. Or new error conditions might arise when there’s a new release of the operating system or of some third-party library.

To deal with these potential gaps, I put mid-level try/except blocks (or some other mechanism for catching errors) around large block functions, typically the ones called from process(). Here’s an example:

def initialize():
  resources = None
  try:
    resources = … some stuff …

  except Exception as e:
    logError("In initialize(), exception %s occurred." % e)

  return resources
  

def process(resourcesFromInitialize): # A simple version, with no loop.
  try:
    resourcesForThisPass = setupEnvironment(resourcesFromInitialize)

  except Exception as e:
    logError("In setupEnvironment(), exception %s occurred." % e)
    return False # EARLY RETURN

  status = False

  try:
    status = doWork(resourcesForThisPass)

  except Exception as e:
    logError("In doWork(), exception %s occurred." % e)

  finally:
    try:
      cleanupEnvironment(resourcesFromInitialize, resourcesForThisPass)

    except Exception as e:
      logError("In cleanupEnvironment(), exception %s occurred." % e)
      status = False # Signal error

  return status
 

def doWork(resources):
  status = False

  try:
    results = [processRecord(record) for record in getData(resources)]

  except Exception as e:
    logError("In doWork(), record processing failed with exception %s." % e)

  else:
    try:
      for result in results:
        writeResult(resources, result)

    except Exception as e:
      logError("In doWork(), result writing failed with exception %s." % e)

    else:
      status = True

  return status # used to set the operating system return status, e.g. $?

Compare these two examples to see how the error handling is intertwingled with the simpler code that accomplishes the main purpose. We use a standardized pattern to keep this mess as clean as we can, but still, it gets messy.

Another approach would be to wrap each function in an error-handling function. This approach would require some standardization of the mechanism that functions use to signal errors (special return values, or whatever). But signaling errors is what exceptions are for, so this approach is a non-starter for me.
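
For illustration only, that rejected wrapper might look like this, assuming a hypothetical convention in which every function returns an (ok, value) pair:

def callChecked(func, *args):
  # Hypothetical convention: every function returns (ok, value).
  ok, value = func(*args)
  if not ok:
    logError("%s() signaled an error." % func.__name__)
  return ok, value

Every caller and every callee would have to honor that convention by hand, which is exactly the bookkeeping that exceptions already do for us.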

Errors in error-handling code are particularly hard to test for and debug. I do what I can: a combination of eyeballing the code and forcing faults to occur so I can test. Eyeballing the code works only when the code is simple and standardized so I can reason about its behavior. Forcing faults to occur usually requires temporary modifications of the code. It’s a chore and I don’t test every possible fault because life is short and I want to ship something to my customers and get their feedback. And perhaps I would still miss something.
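
One lightweight way to force a fault without permanently modifying the code is an environment-variable switch; a sketch (the FAULT_POINT convention is invented for this example):

import os

def maybeFault(pointName):
  # Deliberately raise when the FAULT_POINT environment variable names
  # this call site, so an except branch can be exercised on demand.
  if os.environ.get("FAULT_POINT") == pointName:
    raise RuntimeError("Injected fault at %s" % pointName)

A maybeFault("setupEnvironment") call at the top of setupEnvironment() then lets me test that except branch by running FAULT_POINT=setupEnvironment python worker.py, with no code edits to back out afterward.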

So I have two or three more levels of error handling to snag any remaining unhandled errors. First, I put in a top-level handler, like this:

import sys

if __name__ == "__main__":
  exitStatus = ERR_UNKNOWN_FAILURE # literal 6

  try:
    exitStatus = runProgram()

  except Exception as e:
    logError("runProgram() failed with exception %s." % e)

  sys.exit(exitStatus)

Then runProgram() wraps error handling around the calls to initialize(), process() and terminate():

def runProgram():
  try:
    resources = initialize()

  except Exception as e:
    logError("initialize() failed with exception %s." % e)
    return ERR_INITIALIZE_FAILED # EARLY RETURN literal 3

  status = False
  try:
    status = process(resources)

  except Exception as e:
    logError("process() failed with exception %s." % e)

  finally:
    try:
      terminate(resources)

    except Exception as e:
      if status == True: # Don't clutter the log if process() failed.
        logError("terminate() failed with exception %s." % e)
        status = ERR_TERMINATE_FAILED # literal 5

  if status == ERR_TERMINATE_FAILED:
    return ERR_TERMINATE_FAILED # literal 5
  return NORMAL_EXIT if status == True else ERR_PROCESS_FAILED # 0 or 4
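
For reference, the status constants used in these examples carry the literal values noted in the comments, collected here in one place:

NORMAL_EXIT           = 0
ERR_INITIALIZE_FAILED = 3
ERR_PROCESS_FAILED    = 4
ERR_TERMINATE_FAILED  = 5
ERR_UNKNOWN_FAILURE   = 6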

The final level of error handling is in the master process that re-spawns workers. When a child process dies because it has reached its (randomized) iteration limit, it leaves a “suicide note” for the master process to find. (Ghoulish terminology but an effective style!)

The master process is sleeping inside an infinite loop, waiting for an indication that a child process has ended. When it wakes up, it harvests the return status of the child to free up the corresponding slot in the Linux process table. Then it looks for the suicide note. If there is a note, the master process logs a “normal” shutdown and erases the note.

If there’s no note, then the death of the child process needs to be investigated. In this case, the master process logs the failure and sends me an email with the return status of the dead child and a copy of the logs. Either way, it spawns a replacement for the child so the system continues to function reliably even if resource leaks or unhandled errors occasionally occur.
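
A sketch of that master loop on Linux, where NUM_WORKERS, NOTE_PATH, and spawnWorker() are illustrative; os.wait() does the harvesting:

import os

NUM_WORKERS = 4                        # illustrative pool size
NOTE_PATH = "/var/run/worker.%d.note"  # illustrative suicide-note location

def masterLoop(spawnWorker):
  # spawnWorker() is an assumed helper that forks a child and returns its pid.
  pids = {spawnWorker() for _ in range(NUM_WORKERS)}
  while True:
    pid, rawStatus = os.wait() # sleep until a child ends; reap its table slot
    notePath = NOTE_PATH % pid
    if os.path.exists(notePath):
      logError("Worker %d shut down normally (suicide note found)." % pid)
      os.remove(notePath)
    else:
      logError("Worker %d died unexpectedly, raw status %d." % (pid, rawStatus))
      # In real code: email me the raw status and a copy of the logs.
    pids.discard(pid)
    pids.add(spawnWorker()) # keep the pool at full strength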

The examples above are somewhat simplified. In real code, the error messages sent to the log would include a printout of the parameters and local variables. With that additional state information, debugging is much easier. Also, in real code, I print a formatted stack trace.
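
A sketch of that richer logging, built on the standard traceback module and the illustrative logError() from earlier:

import traceback

def logException(message, localVars):
  # Log the message, a formatted stack trace, and the caller's locals().
  # Must be called from inside an except block, so that
  # traceback.format_exc() can see the active exception.
  logError(message)
  logError(traceback.format_exc())
  for name, value in sorted(localVars.items()):
    logError("  %s = %r" % (name, value))

An except block would then call logException("In doWork(), exception %s occurred." % e, locals()) instead of logError() alone.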

Is this the best possible error handling framework? Probably not. Should I be punched in the face? Well, perhaps. But it’s better than having no error-handling framework at all, because it traps errors in a way that lets me debug problems and improve the code. The alternative is uncontrolled shutdowns.

How do you think about error handling? What do you do to make it tractable and maintainable?

Tags: software performance, software architecture
