Thursday 7 July 2011

Recovering From Unknown Exceptions

In my last post I talked about using the Execute Around Method pattern to implement a common Big Outer Try Block to handle exceptions that propagate up as far as the service handler. But what do you do with them? Can you do anything with them? After all, if they’re unhandled you clearly never expected them, so you’re left trying to recover from something you never anticipated...

Recoverability

The point of an exception handler is to allow you to recover from an exceptional condition. Now, there is much debate about what “exceptional” means, but in this post I’m talking about scenarios for which no recovery has been attempted. That could be due to a lack of knowledge about how to recover or, more likely, because the scenario never came up during testing.

There are really only two ways to behave after catching an unhandled exception - continue or terminate yourself. The former is often classified as the “best effort” approach, while the latter is described as “fail fast”. These are often seen as two opposing views, but in reality both are valid approaches, so long as you can loosely classify the error in a way that ensures the stability of the service isn’t compromised by an unanticipated failure.

Systemic vs Domain Exceptions

What we are concerned with at the service handler entry point is deciding whether the exception we caught indicates that the service has become unstable, and will therefore cause problems with subsequent requests, or whether there will be no residual effects and we can carry on as normal (perhaps after logging a message to notify someone, etc.). For example, a divide-by-zero exception is not likely to indicate that the process is stuffed, whereas an access violation probably is[*].

We can split these errors into two broad categories - Systemic and Domain. The former implies that there is something technically wrong with the process itself, while the latter implies that there was something wrong with the request. We can then create two exception hierarchies rooted at them (ignoring the ultimate base such as System.Exception in .Net) - ServerException and DomainException. Our Big Outer Try Block will then look something like this:-

try
{
  . . .
}
catch (ServerException e)
{
  // Process borked - shutdown service
}
catch (DomainException e)
{
  // Dodgy request - return error
}
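
One possible shape for the two root types is sketched below - the constructor set shown is simply what the later snippets in this post need, so treat it as a starting point rather than a prescription:-

// Root for exceptions that imply the process itself is sick
public class ServerException : Exception
{
  public ServerException(string message)
    : base(message)
  { }

  public ServerException(Exception cause)
    : base(cause.Message, cause)
  { }

  public ServerException(string message, Exception cause)
    : base(message, cause)
  { }
}

// Root for exceptions that imply only the request was bad
public class DomainException : Exception
{
  public DomainException(string message)
    : base(message)
  { }

  public DomainException(Exception cause)
    : base(cause.Message, cause)
  { }
}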

3rd Party Exceptions

Of course .Net and the rest of the world won’t know about our exception hierarchy, so we’ll need to categorise any known 3rd party exceptions into the Systemic or Domain groups and then extend the list of catch handlers appropriately. Taking a leaf out of my last post, you can also encapsulate the exception handlers into a separate method that will catch and translate, so that you reduce the amount of cutting-and-pasting:-

public void TranslateCommonErrors(Action method)
{
  try
  {
    method();
  }
  catch (OutOfMemoryException e)
  {
    throw new ServerException(e);
  }
  . . .
  catch (ArgumentOutOfRangeException e)
  {
    throw new DomainException(e);
  }
  . . .
}
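
For example, the service handler entry point might then boil down to something like this (HandleRequest, ProcessRequest and Request are all stand-ins for whatever your service actually does):-

public void HandleRequest(Request request)
{
  // Anything that escapes the real work gets classified for us
  TranslateCommonErrors(() =>
  {
    ProcessRequest(request);
  });
}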

We can use nested exceptions to ensure the caller has a chance to react appropriately. One of the first extension methods I (and no doubt many others) wrote was to flatten an exception hierarchy so it could be logged in all its glory or passed across the wire.
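
A bare-bones sketch of such an extension method might look like this:-

using System;
using System.Text;

public static class ExceptionExtensions
{
  // Render the exception and each of its nested inner
  // exceptions as a single line of text
  public static string Flatten(this Exception error)
  {
    var buffer = new StringBuilder();

    for (var e = error; e != null; e = e.InnerException)
      buffer.AppendFormat("{0}: {1}\n", e.GetType().Name, e.Message);

    return buffer.ToString();
  }
}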

Remote Services

If we’re saying that a ServerException signifies an unstable service process, then we need to translate it when it crosses a remote boundary. Exception types must be serializable (at least with .Net & WCF) to cross the wire, and because we will be catching everything, including 3rd party exceptions which may not be serializable, we need to marshal the exception chain ourselves.

So, for remote service calls I like to have a mirror pair of exceptions - RemoteServerException and RemoteDomainException. These act as both the container for the marshalled exception chain and more importantly a signal to the client that the service is unstable. This gives the client a chance to perform recovery such as retrying the same request via a different server:-

while (!serverList.Empty())
{
  try
  {
    m_connection.DoTrickyStuff(parameters);

    // Success - we're done
    break;
  }
  catch (RemoteServerException e)
  {
    // Out of servers!
    if (serverList.Empty())
      throw new ServerException("No servers left", e);

    // Server borked - try another one
    m_connection.Close();
    m_connection.Open(serverList.NextServer());
  }
  catch (RemoteDomainException e)
  {
    // Request stuffed - never gonna work...
    throw new DomainException(e);
  }
}
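
As for the mirror pair themselves, one way to sketch them is as genuinely serializable types that capture the flattened chain as plain text at the service boundary - this reuses the hypothetical Flatten() method from earlier:-

using System;
using System.Runtime.Serialization;

[Serializable]
public class RemoteServerException : Exception
{
  // Marshal the chain as text so that a non-serializable
  // 3rd party exception can't scupper the reply as well
  public RemoteServerException(Exception error)
    : base(error.Flatten())
  { }

  protected RemoteServerException(SerializationInfo info, StreamingContext context)
    : base(info, context)
  { }
}

// RemoteDomainException follows exactly the same pattern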

Another approach could be to immediately translate the RemoteServerException into a DomainException and throw that instead, because the instability of the remote service does not imply any instability within the client. However, as I mentioned last time, you need to be careful here because a technical error can grow from being a local problem on one server to a bigger one affecting the entire system when all the load balancing and retry logic starts kicking in for many failed requests.
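
In code that alternative is just a different translation in the client’s first catch handler:-

catch (RemoteServerException e)
{
  // Their instability is not our instability
  throw new DomainException(e);
}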

If there is one thing the recent Amazon outage shows it’s that even with lots of smart people and decent testing you still can’t predict how a complex system is going to behave under duress.

The Lazy Approach to Exception Discovery

The eagle-eyed will have noticed that I’ve managed to avoid writing one particular catch block - the ‘catch all’ handler (catch(...) in C++ and catch(System.Exception) in C#). This is the ultimate handler, and it’s going to handle The Unknown Unknown, so what do you do? Experience has probably taught you what you would consider a systemic error (e.g. Out of Memory), so you’ll likely have those bases covered already, leaving you to conclude that you can apply the “best effort” rule to everything else. And that is the approach I’ve taken.
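
In terms of the earlier TranslateCommonErrors() method, that decision is just one final handler after all the specific ones - best effort shown here; a fail fast policy would throw a ServerException instead:-

catch (Exception e)
{
  // The Unknown Unknown - assume it only affected this
  // request and live to fight another day
  throw new DomainException(e);
}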

In theory (given adequate documentation[#]), for every line of code you write you should be able to determine the set of exceptions it can raise and come up with a plan for recovering from each of them - or decide not to recover. Either way you can make an informed choice about what you’re going to do. But quite frankly, who actually has the time to do this (except maybe programmers writing software where lives are at stake)? What is more likely is that you’ll know the general pitfalls of what you’re invoking, such as when dealing with file-systems, networks or databases, and you’ll explicitly test those scenarios up-front but leave out the crystal-ball gazing. I think the effort is better spent ensuring the integration test environment runs continuously, and is suitably enabled for injecting plausible faults, than trying to second-guess what is going to go bang under any given scenario.

This lazy approach also serves a secondary purpose and that is to see if your diagnostic tooling is up to the job when the time comes to use it. Have you just logged some simple message like “it failed” or do you have a message with the entire chain of exceptions, and do they tell you enough? Did you remember to log the stack trace or create a memory dump? Did the event register on your “system console” so you get a heads-up before it goes viral?

When a new scenario does come up that you realise you could have recovered from more gracefully, that’s the time you’ll be glad you spent the effort mocking the underlying file-system, database and networking APIs so that you can write an automated test for it.
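
For example, with the file-system hidden behind an interface you can force the newly discovered failure in a test - here IFileSystem, RequestHandler and the use of Moq and NUnit are all assumptions about your codebase:-

[Test]
public void HandlerTurnsFullDiskIntoDomainError()
{
  // Inject the fault we've just witnessed in the wild
  var fileSystem = new Mock<IFileSystem>();
  fileSystem.Setup(fs => fs.WriteAllText(It.IsAny<string>(), It.IsAny<string>()))
            .Throws(new IOException("disk full"));

  var handler = new RequestHandler(fileSystem.Object);

  // The handler should translate it, not let it escape raw
  Assert.Throws<DomainException>(() => handler.HandleRequest(new Request()));
}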


[*] Although I’d argue that an attempt to dereference a NULL pointer is probably recoverable, because NULL is the most common default initialisation state. In my experience access violations involving NULL pointers have nearly always been down to a logic error, not an indication of a wayward service that has started doing random stuff and eventually ended up with a NULL pointer. On the contrary, an access violation at an arbitrary memory location nearly always signals that something pretty bad has already happened.

[#] Exception specifications don’t appear to have alleviated that.
