David Schachter’s Blog: Create an error handling framework

When you write a program, you’re actually writing three programs, intertwingled. One of them, you know about. What are the other two and how can you make them work better for you?

(A nicely-formatted version is available at https://davidschachter.com/Blog.)

You’re a good programmer. Your program is well-structured, with clear names for variables, constants, classes, and functions; good use of white space for readability; proper decomposition into functions, methods, classes, and inheritance; and so on.

What about the “temporary” debug logic? Is that something you thought about or something you hacked in on the fly during debugging?

And what about the error-handling logic? Did you think carefully about how to handle errors, when to recover from them, how to recover from them, how to report them, and how to be consistent?

In https://www.dhirubhai.net/pulse/david-schachters-blog-build-debugging-framework-your-code-schachter/, I wrote about the debugging aspect of programming. I explained how debug logic is like a second strand intertwingled with the code that you write to solve the actual problem.

Error handling is a third strand and the code for it is intertwingled with the rest, making three “programs” woven together.

Error handling can be messy. Getting the flow of control correct for the “normal” cases is challenging enough; doing the right thing when things go wrong is much harder. It’s hard to reason about that logic, hard to test and debug it, and hard to maintain it over time as the environment changes.

The only way I know to get this third strand correct is to think about it before I write anything else and to create a semi-formal structure for error handling. I use Aspect Oriented Programming (implemented with decorators in Python) to separate some of the error handling code from the “main” code, to un-intertwingle the strands a bit.
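
A minimal sketch of such a decorator, assuming a logError() helper like the one sketched below (catchAndLog is an illustrative name, not a library function):

import functools

def catchAndLog(func):
  # Log any exception from the wrapped function in a standard format,
  # returning None instead of letting the exception propagate.
  @functools.wraps(func)
  def wrapper(*args, **kwargs):
    try:
      return func(*args, **kwargs)
    except Exception as e:
      logError("In %s(), exception %s occurred." % (func.__name__, e))
      return None
  return wrapper

Decorating a function with @catchAndLog then removes the try/except boilerplate from its body, pulling that strand of the braid out where it can be read (and fixed) in one place.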

I create some reusable functions for error handling and logging. Occasionally (rarely), I also write functions for recovering from errors: such code tends to be one-off and not reusable. Splitting error-recovery code into a separate function may be useful for code structuring and for reasoning about behavior, but it usually provides no advantage for reusability.
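
logError() itself can be tiny. A minimal sketch, assuming timestamps and stderr are acceptable choices:

import sys
import time

def logError(message):
  # Timestamp each message and flush immediately, so the log
  # survives even if the process dies on the very next statement.
  sys.stderr.write("%s ERROR: %s\n" %
                   (time.strftime("%Y-%m-%d %H:%M:%S"), message))
  sys.stderr.flush()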

Typically, my programs have a high-level flow that looks like this excellent graphic:

main --> initialize(), process(), terminate()

In a long-running program, process() will be an infinite loop or a loop that runs for many iterations and ends after some randomized number of passes. In the latter case, a master process exists to re-spawn dead processes. When process() is a loop, it will set up the environment before each pass, call doWork() to do whatever needs to be done (e.g. read a bunch of records, process them, and write results), and finally clean up the environment at the end of each pass. These two functions tend to look like this:

def process(resourcesFromInitialize):
  resourcesForThisPass = setupEnvironment(resourcesFromInitialize)
  doWork(resourcesForThisPass)
  cleanupEnvironment(resourcesFromInitialize, resourcesForThisPass)

def doWork(resources):
  results = [processRecord(record) for record in getData(resources)]
  for result in results:
    writeResult(resources, result)
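
When process() runs as a long-lived loop, as described above, it might look like this sketch instead (the pass-count bounds are illustrative, not from the original):

import random

def process(resourcesFromInitialize):
  # Looping variant: run a randomized number of passes, then exit so the
  # master process re-spawns a fresh worker. Randomizing the limit keeps
  # a fleet of workers from all restarting at the same moment.
  for _ in range(random.randint(5000, 10000)): # illustrative bounds
    resourcesForThisPass = setupEnvironment(resourcesFromInitialize)
    doWork(resourcesForThisPass)
    cleanupEnvironment(resourcesFromInitialize, resourcesForThisPass)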

To handle errors within this high-level structure, I use try/except blocks. I put them around each low-level function that might return an error, such as file and network operations. Of course, I might miss some such functions, or some might be added during maintenance. Or new error conditions might arise when there’s a new release of the operating system or of some third-party library.

To deal with these potential gaps, I put mid-level try/except blocks (or some other mechanism for catching errors) around large block functions, typically the ones called from process(). Here’s an example:

def initialize():
  resources = None
  try:
    resources = … some stuff …

  except Exception as e:
    logError("In initialize(), exception %s occurred." % e)

  return resources
  

def process(resourcesFromInitialize): # A simple version, with no loop.
  try:
    resourcesForThisPass = setupEnvironment(resourcesFromInitialize)

  except Exception as e:
    logError("In setupEnvironment(), exception %s occurred." % e)
    return False # EARLY RETURN

  status = False

  try:
    status = doWork(resourcesForThisPass)

  except Exception as e:
    logError("In doWork(), exception %s occurred." % e)

  finally:
    try:
      cleanupEnvironment(resourcesFromInitialize, resourcesForThisPass)

    except Exception as e:
      logError("In cleanupEnvironment(), exception %s occurred." % e)
      status = False # Signal error

  return status
 

def doWork(resources):
  status = False

  try:
    results = [processRecord(record) for record in getData(resources)]

  except Exception as e:
    logError("In doWork(), record processing failed with exception %s." % e)

  else:
    try:
      for result in results:
        writeResult(resources, result)

    except Exception as e:
      logError("In doWork(), result writing failed with exception %s." % e)

    else:
      status = True

  return status # used to set the operating system return status, e.g. $?

Compare these two examples to see how the error handling is intertwingled with the simpler code that accomplishes the main purpose. We use a standardized pattern to keep this mess as clean as we can, but still, it gets messy.

Another approach would be to wrap each function in an error-handling function. This approach would require some standardization of the mechanism that functions use to signal errors (special return values, or whatever). But signaling errors is what exceptions are for, so this approach is a non-starter for me.
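
For illustration only, that rejected wrapper might look like this, assuming a hypothetical convention in which every function returns an (ok, value) pair:

def callChecked(func, *args):
  # Hypothetical convention: every function returns (ok, value).
  ok, value = func(*args)
  if not ok:
    logError("%s() signaled an error." % func.__name__)
  return ok, value

Every caller and every callee would have to honor that convention by hand, which is exactly the bookkeeping that exceptions already do for us.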

Errors in error-handling code are particularly hard to test for and debug. I do what I can: a combination of eyeballing the code and forcing faults to occur so I can test. Eyeballing the code works only when the code is simple and standardized so I can reason about its behavior. Forcing faults to occur usually requires temporary modifications of the code. It’s a chore and I don’t test every possible fault because life is short and I want to ship something to my customers and get their feedback. And perhaps I would still miss something.
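
One lightweight way to force a fault without permanently modifying the code is an environment-variable switch; a sketch (the FAULT_POINT convention is invented for this example):

import os

def maybeFault(pointName):
  # Deliberately raise when the FAULT_POINT environment variable names
  # this call site, so an except branch can be exercised on demand.
  if os.environ.get("FAULT_POINT") == pointName:
    raise RuntimeError("Injected fault at %s" % pointName)

A maybeFault("setupEnvironment") call at the top of setupEnvironment() then lets me test that except branch by running FAULT_POINT=setupEnvironment python worker.py, with no code edits to back out afterward.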

So I have two or three more levels of error handling to snag any remaining unhandled errors. First, I put in a top-level handler, like this:

import sys

if __name__ == "__main__":
  exitStatus = ERR_UNKNOWN_FAILURE # literal 6

  try:
    exitStatus = runProgram()

  except Exception as e:
    logError("runProgram() failed with exception %s." % e)

  sys.exit(exitStatus)

Then runProgram() wraps error handling around the calls to initialize(), process() and terminate():

def runProgram():
  try:
    resources = initialize()

  except Exception as e:
    logError("initialize() failed with exception %s." % e)
    return ERR_INITIALIZE_FAILED # EARLY RETURN literal 3

  status = False
  try:
    status = process(resources)

  except Exception as e:
    logError("process() failed with exception %s." % e)

  finally:
    try:
      terminate(resources)

    except Exception as e:
      if status == True: # Don't clutter the log if process() failed.
        logError("terminate() failed with exception %s." % e)
        status = ERR_TERMINATE_FAILED # literal 5

  if status == ERR_TERMINATE_FAILED:
    return ERR_TERMINATE_FAILED # literal 5
  return NORMAL_EXIT if status == True else ERR_PROCESS_FAILED # 0 or 4
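
For reference, the status constants used in these examples carry the literal values noted in the comments, collected here in one place:

NORMAL_EXIT           = 0
ERR_INITIALIZE_FAILED = 3
ERR_PROCESS_FAILED    = 4
ERR_TERMINATE_FAILED  = 5
ERR_UNKNOWN_FAILURE   = 6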

The final level of error handling is in the master process that re-spawns workers. When a child process dies because it has reached its (randomized) iteration limit, it leaves a “suicide note” for the master process to find. (Ghoulish terminology but an effective style!)

The master process is sleeping inside an infinite loop, waiting for an indication that a child process has ended. When it wakes up, it harvests the return status of the child to free up the corresponding slot in the Linux process table. Then it looks for the suicide note. If there is a note, the master process logs a “normal” shutdown and erases the note.

If there’s no note, then the death of the child process needs to be investigated. In this case, the master process logs the failure and sends me an email with the return status of the dead child and a copy of the logs. Either way, it spawns a replacement for the child so the system continues to function reliably even if resource leaks or unhandled errors occasionally occur.
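
A sketch of that master loop on Linux, where NUM_WORKERS, NOTE_PATH, and spawnWorker() are illustrative; os.wait() does the harvesting:

import os

NUM_WORKERS = 4                        # illustrative pool size
NOTE_PATH = "/var/run/worker.%d.note"  # illustrative suicide-note location

def masterLoop(spawnWorker):
  # spawnWorker() is an assumed helper that forks a child and returns its pid.
  pids = {spawnWorker() for _ in range(NUM_WORKERS)}
  while True:
    pid, rawStatus = os.wait() # sleep until a child ends; reap its table slot
    notePath = NOTE_PATH % pid
    if os.path.exists(notePath):
      logError("Worker %d shut down normally (suicide note found)." % pid)
      os.remove(notePath)
    else:
      logError("Worker %d died unexpectedly, raw status %d." % (pid, rawStatus))
      # In real code: email me the raw status and a copy of the logs.
    pids.discard(pid)
    pids.add(spawnWorker()) # keep the pool at full strength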

The examples above are somewhat simplified. In real code, the error messages sent to the log would include a printout of the parameters and local variables. With that additional state information, debugging is much easier. Also, in real code, I print a formatted stack trace.
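
A sketch of that richer logging, built on the standard traceback module and the illustrative logError() from earlier:

import traceback

def logException(message, localVars):
  # Log the message, a formatted stack trace, and the caller's locals().
  # Must be called from inside an except block, so that
  # traceback.format_exc() can see the active exception.
  logError(message)
  logError(traceback.format_exc())
  for name, value in sorted(localVars.items()):
    logError("  %s = %r" % (name, value))

An except block would then call logException("In doWork(), exception %s occurred." % e, locals()) instead of logError() alone.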

Is this the best possible error handling framework? Probably not. Should I be punched in the face? Well, perhaps. But it’s better than having no error-handling framework at all, because it traps errors in a way that lets me debug problems and improve the code. The alternative is uncontrolled shutdowns.

How do you think about error handling? What do you do to make it tractable and maintainable?

Tags: software performance, software architecture
