Tuesday, 13 November 2012

When Does a Transient Failure Stop Being Transient?

The definition of “transient”, when talking about errors, varies depending on the type of system you work on. Although I’ve never worked in the embedded arena I can imagine that it is measured on a very different scale there than in distributed systems. I work in the latter field, where the notion of “transient” varies to some degree depending on the day of the week. There is also an element of laziness around dealing with transient failures that means you might be tempted to just punt to the support team on any unexpected failure rather than design recovery into the heart of the system.

The reason I said that the definition of transient varies depending on the day of the week is because, like many organisations, my current client performs their infrastructure maintenance during the weekend. There are also other ad-hoc long outages that occur then such as DR testing. So, whereas during the week there might be the occasional blip that lasts for seconds, maybe minutes tops, at the weekend the outage could last longer than a day. Is that still transient at that point? I’m inclined to suggest it is, at least from our perspective, because the failure will correct itself without direct intervention from our team. To me permanent failures occur when the system cannot recover automatically.

For many systems the weekend glitches probably don’t matter as there are no users around anyway, but the systems I work on generally chug through other work at the weekend that would not be possible to squeeze in every day due to lack of resources. This means that the following kinds of failures are all expected during the weekend and the architecture just tries to allow progress to be made whenever possible:

  • The database cluster is taken offline or runs performance sapping maintenance tasks
  • The network shares appear and disappear like Cheshire Cats
  • Application servers are patched and bounced in random orders

Ensuring that these kinds of failures only remain transient is not rocket science; by-and-large you just need to remember not to hold onto resources longer than necessary. So, for example, don’t create a database connection and then cache it, as you’ll have to deal with the need to reconnect all over the place. The Pooling pattern from POSA 3 is your friend here as it allows you to delegate the creation and caching of sensitive resources to another object. At a basic level you will then be able to treat each request failure independently without it affecting subsequent requests. There is a corollary to this, which is Eager Initialisation at start-up, which you might use to detect configuration issues.
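To make the “don’t hold on to it” point concrete, here is a minimal sketch in Python. It is not a real pool, just the acquire-late, release-early shape that a pool would sit behind. The names borrowed_connection and handle_request are made up for illustration, and sqlite3 merely stands in for whatever database you actually use.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def borrowed_connection(connect):
    """Acquire a connection for one request and always give it back.

    `connect` is a hypothetical factory; in a real system it would be
    the pool's "acquire" operation rather than a raw connect call.
    """
    conn = connect()
    try:
        yield conn
    finally:
        conn.close()


def handle_request(connect, sql):
    # Nothing is cached between requests, so a failure in one request
    # cannot leave a dead connection behind for the next one.
    with borrowed_connection(connect) as conn:
        return conn.execute(sql).fetchall()
```

If the factory fails for one request the exception surfaces for that request alone; the next request simply asks for a fresh connection.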

The next level up is to enable some form of retry mechanism. If you’re only expecting a database cluster failover or minor network glitch you can ride over it by waiting a short period and then retrying. What you need to be careful of is that you don’t busy wait or retry for too long (i.e. indefinitely) so that the failure cascades into the system performing no work at all. If a critical resource goes down permanently then there is little you can do, but if not all requests rely on the same resources then it’s possible for progress to be made. In some cases you might be able to handle the retry locally, such as by using Execute Around Method.
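A bounded local retry, written in the Execute Around Method style, might look something like the sketch below. TransientError, execute_with_retry and the default values are all my own assumptions for illustration; in practice you would have to map whatever your driver or framework throws onto “worth retrying” yourself.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def execute_with_retry(operation, max_attempts=3, initial_delay=0.5,
                       backoff=2.0, sleep=time.sleep):
    """Run `operation`, retrying a limited number of times.

    Sleeps between attempts (no busy waiting) and gives up after
    `max_attempts` so the failure cannot stall the caller forever.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide
            sleep(delay)
            delay *= backoff  # wait longer before each new attempt
```

The sleep parameter is only there so that a test can substitute a fake and avoid real delays.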

Handling transient failures locally reduces the burden on the caller, but at the expense of increased complexity within the service implementation. You might also need to pass configuration parameters down the chain to control the level of back-off and retry which makes the invocation messy. Hopefully though you’ll be able to delegate all that to main() when you bootstrap your service implementations. The alternative is to let the error propagate right up to the top so that the outermost code gets to take a view and act. The caller always has the ability to keep an eye on the bigger picture, i.e. number and rate of overall failures, whereas local code can only track its own failures. As always Raymond Chen has some sage advice on the matter of localised retries.

Eventually you will need to give up and move on in the hope that someone else will get some work done. At this point we’re talking about rescheduling the work for some time later. If you’re already using a queue to manage your workload then it might be as simple as pushing it to the back again and giving someone else a go. The blockage will clear in due course and progress will be made again. Alternatively you might suspend the work entirely and then resubmit suspended jobs every so often. Just make sure that you track the number of retries to ensure that you don’t have jobs in the system bouncing around that have long outlived their usefulness. In his chapter of Beautiful Architecture Michael Nygard talks about “fast retries” and “slow retries”. I’ve just categorised the same idea as “retries” and “reschedules” because the latter involves deactivating the job which feels like a more significant change in the job’s lifecycle to me.
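As a rough sketch of the “push it to the back of the queue” approach, assuming an in-memory queue and a hypothetical Job type (a real system would persist both, and would wait between passes rather than spin on a job that keeps failing):

```python
from collections import deque
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical marker for failures that are worth rescheduling."""


@dataclass
class Job:
    name: str
    attempts: int = 0


def drain(queue, run, max_attempts=5):
    """Work through `queue`, rescheduling jobs that fail transiently.

    A failed job goes to the back so that other work gets a turn.
    Jobs that reach `max_attempts` are returned instead of being
    left to bounce around forever.
    """
    abandoned = []
    while queue:
        job = queue.popleft()
        try:
            run(job)
        except TransientError:
            job.attempts += 1
            if job.attempts >= max_attempts:
                abandoned.append(job)
            else:
                queue.append(job)  # back of the queue; try again later
    return abandoned
```

What you do with the abandoned jobs (suspend them, alert someone, or drop them) is a separate decision; the point is only that the attempt count travels with the job.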

Testing this kind of non-functional requirement[1] at the system level is difficult. At the unit test level you can generally simulate certain conditions, but even then throwing the exact type of exception is tricky because it’s usually an implementation detail of some 3rd party library or framework. At the system level you might not be able to pull the plug on an app server because it’s hosted and managed independently in some far off data centre. Shutting an app server down gracefully allows clean-up code to run and so you need to resort to TerminateProcess() or the moral equivalent to ensure a process goes without being given the chance to react. I’m sure everyone has heard of The Chaos Monkey by now but that’s the kind of idea I still aspire to.
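For what it’s worth, killing a process without giving it a chance to react is easy to script when the process is one you launched yourself. In Python, Popen.kill() calls TerminateProcess() on Windows and sends SIGKILL on POSIX. The sketch below starts a child that registers a clean-up handler, kills it, and reports whether the handler ran; the child script and the marker file name are invented purely for the demonstration.

```python
import os
import subprocess
import sys
import tempfile

# Child process: registers clean-up, signals readiness, then "works".
CHILD = """
import atexit, sys, time
marker = sys.argv[1]
atexit.register(lambda: open(marker, "w").close())
print("ready", flush=True)
time.sleep(60)
"""


def kill_without_cleanup():
    """Return True if the child's clean-up handler ran despite the kill."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "cleanup-ran")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdout=subprocess.PIPE, text=True)
        child.stdout.readline()  # wait until the handler is registered
        child.kill()             # TerminateProcess() / SIGKILL
        child.wait()
        child.stdout.close()
        return os.path.exists(marker)
```

The handler should not run, which is exactly the situation the rest of the system has to be able to recover from.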

I suggested earlier that a lazier approach is to just punt to a support team the moment things start going south. But is that a cost-effective option? For starters you’ve got to pay for the support staff to be on call. Then you’ve got to build the tools the support staff will need to fix problems, which have to be designed, written, tested, documented, etc. Wouldn’t it make more financial sense to put all that effort into building a more reliable system in the first place? After all the bedrock of a maintainable system is a reliable one - without it you’ll spend your time fire-fighting instead.

OK, so the system I’m currently working on is far from perfect and has its share of fragile parts but when an unexpected failure does occur we try hard to get agreement on how we can handle it automatically in future so that the support team remains free to get on and deal with the kinds of issues that humans do best.


[1] I’m loath to use the term “non-functional” because robustness and scalability imply a functioning system and being able to function must therefore be a functional requirement. Tom Gilb doesn’t settle for a wishy-washy requirement like “robust” - he wants it quantified - and why not? It may be the only way the business gets to truly understand how much effort is required to produce reliable software.
