Friday 7 September 2012

Assume Failure by Default

Maybe I’m just an old pessimist but when I see code like this below, it makes me think “now there’s an optimistic programmer”:-

bool doSomething()
{
    bool ok = true;
    . . .
    return ok;
}

Personally I’d start with ok = false. Actually I wouldn’t even write that because I’ve always favoured early exit (in C++) and despise the heavily nested code that the alternative style can lead to. But does it matter? Well, yes, because developers generally don’t test their error handling with anything like the effort they put into making sure the code works under normal conditions. And I’m not just talking about unit testing here either: what about system testing, where machines run out of memory and disk space, and networks disappear and reappear like the Cheshire Cat?
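For what it’s worth, here’s a minimal sketch of the early exit style I mean - the step functions (loadConfig, openOutput) are hypothetical stand-ins, each returning false on failure:-

```cpp
// Hypothetical steps; each returns false on failure.
bool loadConfig() { return true; }
bool openOutput() { return true; }

// Early exit: bail out the moment a step fails, so the function
// cannot accidentally fall through to reporting success.
bool doSomething()
{
    if (!loadConfig())
        return false; // fail immediately, right where the problem is

    if (!openOutput())
        return false;

    return true; // only reachable by passing every check
}
```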

So, let me give a more common example, which touches on one of my pet peeves: process exit codes. How many programmers check that their executable actually returns a non-zero exit code when it fails? Putting aside the whole “void main()” issue for now, on the basis that you’ll be using a language that forces you to return an exit code in the first place, this is not IMHO the best way to begin:-

int main(int argc, char** argv)
{
    . . .
    return 0;
}

For starters let’s stop with the magic numbers (which also happen to look like FALSE for extra mind-bending confusion[+]) and make it obvious:-

int main(int argc, char** argv)
{
    . . .
    return EXIT_SUCCESS;
}

Now that obviously looks dodgy - success, always? In reality main() is just a stub that invokes factories to get the services needed by your application logic, and so you might end up with something more like this instead:-

int main(int argc, char** argv)
{
    int result = EXIT_FAILURE;

    try
    {
        // Create services
        . . .
        // Invoke application logic
        . . .
        result = EXIT_SUCCESS;
    }
    catch(/* something */)
    {
    }
    catch(/* something different */)
    {
    }

    return result;
}

If you prefer the early exit approach then you might elide the result variable and just return EXIT_<WHATEVER> at the relevant spot. The key point is that the default paths through the code end in failure being reported, unless you happen to strike gold and make it through to the end without hitting any problems. By assuming failure you have to work harder to screw things up and create a false positive; and when code fails by accident, a false negative, I’d much rather it failed immediately, right where the problem is, than later when the context has long gone.
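For completeness, here’s roughly what that early exit shape might look like. I’ve pulled the body of main() into a callable guardedMain() so it can be exercised directly, and the service/logic stubs are purely illustrative:-

```cpp
#include <cstdlib>
#include <exception>

// Illustrative stubs; in real code these would build the services and
// run the application logic, throwing on failure.
void createServices() { }
void invokeApplicationLogic() { }

// Stand-in for the body of main(). Failure is the default; there is
// exactly one route to EXIT_SUCCESS.
int guardedMain()
{
    try
    {
        createServices();
        invokeApplicationLogic();
    }
    catch (const std::exception&)
    {
        return EXIT_FAILURE; // known error: report it where it's caught
    }
    catch (...)
    {
        return EXIT_FAILURE; // unknown error: still a failure
    }

    return EXIT_SUCCESS; // the only path that reports success
}
```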

At the risk of labouring the point, I’ve recently had to work with some components where the code is written in the “optimistic” style - essentially the methods are written this way round[#]:-

{
    bool ok = true;

    try
    {
        if (doSomething())
            if (doSomethingElse())
                . . .
            else
                ok = false;
        else
            ok = false;
    }
    catch(/* stuff */)
    {
        ok = false;
    }

    return ok;
}

I’ve lost count of the number of times that I’ve debugged this code only to find that somewhere earlier a return code was ignored or an exception handler swallowed an error. But hey, this is nothing new; this problem has been around far longer than I’ve been programming. Exceptions are pretty much ubiquitous these days and so it does feel less common to see this kind of problem within application logic.

For me the acid test[*] is still “if this code traps an out-of-memory error what do I want it to do?”. And I’ve yet to answer that question with “just ignore it”.
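To make that concrete, a minimal sketch in C++ terms - the function name and codes here are mine, not from any real codebase:-

```cpp
#include <cstdlib>
#include <new>

// Illustrative only: trap an out-of-memory error and report failure
// rather than ignoring it.
int allocateBuffer(std::size_t size)
{
    try
    {
        char* buffer = new char[size];
        delete[] buffer;
        return EXIT_SUCCESS;
    }
    catch (const std::bad_alloc&)
    {
        return EXIT_FAILURE; // "just ignore it" is never the answer
    }
}
```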

 

[+] Have you ever seen C code that does an equality string comparison like this: if (!strcmp(lhs, rhs))?
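To illustrate (the helper name is mine):-

```cpp
#include <cstring>

// strcmp() returns 0 for equal strings, so negating it gives the
// equality test - which reads, at a glance, like an inequality test.
bool stringsEqual(const char* lhs, const char* rhs)
{
    return !std::strcmp(lhs, rhs); // true when the strings match
}
```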

[#] It’s actually C-style procedural code encoded as C#. The author is not one of the development team and is the first to admit it’s not exactly a beacon of quality. But it’s in production and generating value every day (along with a side order of technical debt).

[*] By this I mean when someone writes “catch (Exception e)” I ask them the acid test question, because that’s exactly what they’re going to catch eventually. Presumably it’ll be forwarded until we hit a component boundary (process, COM, DLL, etc) at which point you’re going to have to transform every error into something the caller understands (ERRORLEVEL, HRESULT, etc).
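By way of illustration, such a boundary might look something like this - the entry point name and result codes are made up:-

```cpp
#include <exception>

// Made-up result codes for a made-up component boundary; the point is
// that every exception gets transformed into something the caller
// (ERRORLEVEL, HRESULT, etc) can understand.
enum ResultCode { RC_OK = 0, RC_FAILED = 1 };

extern "C" int componentEntryPoint()
{
    try
    {
        // ... real work would go here ...
        return RC_OK;
    }
    catch (const std::exception&)
    {
        return RC_FAILED; // known error mapped to an error code
    }
    catch (...)
    {
        return RC_FAILED; // unknown error is never reported as success
    }
}
```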

5 comments:

  1. Nice point. If you like to read (much) more, Matthew Wilson dwelled extensively on this in his column "Quality Matters" in ACCU's Overload Journal 99 of October 2010. See "Exceptions for Practically-Unrecoverable Conditions", section "A production-quality main()", http://accu.org/index.php/journals/1706 .

  2. I started out assuming your post was going to be rubbish. If I'd finished reading after the first sentence I would have come away with that impression.

    But on reading to the end I find it is actually quite good :-)

    But seriously: As usual I agree with pretty much everything you said. And yet I still struggle with this in practice.
    And when I say I struggle, I don't mean I have any personal issues with it. But when I write code that fails fast and hard if something unrecoverable goes wrong I often discover that "the business" has a different idea of what "going wrong" actually means.

    Numbers not quite right? Well, at least we have some.
    Didn't run through? Well how much *did* we get?
    Config not quite right? That's a shame - but it's never been quite right - yet it's always given us something.

    etc...

    Some of these have reasonably straightforward approaches - once identified. Others less so. Some are outright dangerous and irresponsible - and yet what can you do?

    It often ends up with the situation where the legacy system - a tangled mess that everyone knows is given the wrong numbers - has been (incorrectly) patched so many times over the years that it just keeps going - but your new shiny version that does everything right falls over because logic is not one of the business's strong points - and it looks bad.

    It takes a lot of iteration to make things look as robust as a system that sweeps things under the rug :-(

    No. I'm not bitter.

    Carry on.

  3. @martin if you go on to follow the link above about "one of my pet peeves" it will take you to a footnote that cites that very article by Matthew Wilson. I'm sure Matthew will happily accept another plug though :-).

    Replies
    1. That's however an extra level of indirection to ACCU ;-)

  4. @phil when I started reading your comment I thought you were going to be condescending but you actually turned out nice :-).

    We've had a few places at my current site where we've failed fast and the best thing is that it's forced The Business to consider what recovery to take. And the benefit has been a system that now largely runs itself. We still get issues from upstream systems but when it does occur I push for an automated future resolution, not a request for more support staff.
