Monday, 30 January 2017

Dumbing Down Code

A conversation with a colleague, which was originally sparked off by “C# BAD PRACTICES: Learn how to make a good code by bad example” (an article about writing clear code), reminded me about a recent change I had just made. When I reflected back I began to consider whether I had just “dumbed down” the code instead of keeping my original possibly simpler version. I also couldn’t decide whether the reason I had changed it was because I was using an old idiom which might now be unfamiliar or because I just didn’t think it was worth putting the burden on the reader.

Going Loopy

The change I made was to add a simple retry loop to some integration test clean-up code that would occasionally fail on a laggy VM. The loop just needed to retry a small piece of code a few times with a brief pause in between each iteration.

My gut instinct was to write the loop like this:

int attempt = 5;

while (attempt-- > 0)
{
  . . .
}

Usually when I need to write a “for” loop these days, I don’t. By far the most common use case is to iterate over a sequence and so I’d use LINQ in C# or, say, a pipeline (or foreach) in PowerShell. If I did need to manually index an array (which is rare) I’d probably go straight for a classic for loop.

After I wrote and tested the code I took a step back just to review what I’d written and realised I felt uncomfortable with the while loop. Yes this was only test code but I don’t treat it as a second class citizen, it still gets the same TLC as production code. So what did I not like?

  • While loops are much rarer than for loops these days.
  • The use of the pre and post-decrement operators in a conditional expression is also uncommon.
  • The loop counts down instead of up.

Of these the middle one – the use of the post-decrement operator in a conditional expression – was the most concerning. Used by themselves (in pre or post form) just to mutate a variable, such as in the final clause of a for loop [1], these operators seem trivial to comprehend, but once you include one in a conditional statement the complexity goes up.

Used purely to adjust a value there is no difference between --attempt and attempt--, but in a conditional expression it matters whether the value is decremented before or after the comparison is made. If you get it wrong the loop executes one less time than you expect (or vice-versa) [2].
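
To make that concrete, here is a minimal sketch (the variable names are throwaway ones, purely for illustration):

int attempts = 5;

while (attempts-- > 0) // tests 5, 4, 3, 2, 1 against 0, so the body runs 5 times
{
  // loop body
}

int retries = 5;

while (--retries > 0) // tests 4, 3, 2, 1 against 0, so the body runs only 4 times
{
  // loop body
}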

Which leads me to my real concern – how easy is it to reason about how many times the loop executes? Depending on the placement of the decrement operator it might be one less than the number “attempt” is initialised with at the beginning.

Of course you can also place the “while” comparison at the end of the block which means it would execute at least once, so subconsciously this might also cause you to question the number of iterations, at least momentarily.

Ultimately I know I’ve written the loop so that the magic number “5” which is used to initialise the attempt counter represents the maximum number of iterations, but will a reader trust me to have done the same? I think they’ll err on the side of caution and work it out for themselves.

Alternatives

The solution I went with in the end was this:

const int maxAttempts = 5;

for (int attempt = 0; attempt != maxAttempts; ++attempt)
{
  . . .
}

Now clearly this is more verbose, but not massively more verbose than my original “while” loop. However the question is whether (to quote Sir Tony Hoare [3]) it obviously contains fewer bugs than the original. Being au fait with both forms it’s hard for me to decide so I’m trying to guess that the reader would prefer the latter.

Given that we don’t care what the absolute value of “attempt” is, only that we execute the loop N times, I did consider some other approaches, at least momentarily. Both of these examples use methods from the Enumerable class.

The first generates a sequence of 5 numbers starting from the “arbitrary” number 1:

const int maxAttempts = 5;

foreach (var _ in Enumerable.Range(1, maxAttempts))
{
  . . .
}

The second also generates a sequence of 5 numbers but this time by repeating the “arbitrary” number 0:

const int maxAttempts = 5;

foreach (var _ in Enumerable.Repeat(0, maxAttempts))
{
  . . .
}

In both cases I have dealt with the superfluous naming of the loop variable by simply calling it “_”.

I discounted both these ideas purely on the grounds that they’re overkill. It just feels wrong to be bringing so much machinery into play just to execute a loop a fixed number of times. Maybe my brain is addled from too much assembly language programming in my early years, but an aversion to seemingly unnecessary waste is still a hard habit to shake.

As an aside there are plenty of C# extension methods out there which people have written to try and reduce this further so you only need write, say, “5.Times()” or “0.To(5)” but they still feel like syntactic sugar just for the sake of it.
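
For what it’s worth, such an extension method need only be a few lines. Here’s a rough sketch (the Times name and shape are purely illustrative rather than taken from any particular library):

using System;

public static class Int32Extensions
{
  // Invokes the given action 'count' times.
  public static void Times(this int count, Action action)
  {
    for (int i = 0; i != count; ++i)
      action();
  }
}

// usage
5.Times(() => { /* . . . */ });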

Past Attempts

This is not the first time I’ve questioned whether it’s possible to write code that’s perhaps considered too clever. Way back in 2012 I wrote “Can Code Be Too Simple?” which looked at some C++ code I had encountered 10 years ago and which first got me thinking seriously about the subject.

What separates the audience back then from the one now is the experience level of the programmers who will likely tackle this codebase. A couple of years later in “Will Your Successor Be a Superstar Programmer?” I questioned whether you write code for people of your own calibre or the (inevitable) application support team who have the unenviable task of having to be experts at two disciplines – support and software development. As organisations move towards development teams owning their services this issue is diminishing.

My previous musings were also driven by my perception of the code other people wrote, whereas this time I’m reflecting solely on my own actions. In particular I’m now beginning to wonder whether my approach is in fact patronising rather than helpful. Have I gone too far this time and should I give my successors far more credit? Or doesn’t it matter as long as we just don’t write “really weird” code?

 

[1] In C++ iterators can be implemented as simple pointers or complex objects (e.g. STL container iterators in debug builds) and so you tend to be aware of the difference because of the performance impacts it can have.

[2] I was originally going to write while (attempt-- != 0), again because in C++ you normally iterate from “begin” to “!= end”, but many devs seem to be overly defensive and favour using the > and < comparison operators instead.

[3] “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.” – Sir Tony Hoare.

Thursday, 26 January 2017

Journey Code

In the move from a Waterfall approach to a more Agile way of working we need to learn to be more comfortable with our software being in a less well defined state. It’s not any less correct per se, it’s just that we may choose to prioritise our work differently now so that we tackle the higher value features first.

Destination Unknown

A consequence of this approach is that conversations between, say, traditional architects (who are talking about the system as they expect it to end up) and those developing it (who know it’s only somewhere between started and one possible future), need interpretation. Our job as programmers is to break these big features down into much smaller units that can be (ideally) independently scoped and prioritised. Being agile is about adapting to changing requirements and that’s impossible if every feature needs to be designed, delivered and tested in its entirety.

In an attempt to bridge these two perspectives I’ve somewhat turned into a broken record by frequently saying “but that’s the destination, we only need to start the journey” [1]. This is very much a statement to remind our (often old and habitual) selves that we don’t build software like that anymore.

Build the Right Thing

The knock-on effect is that any given feature is likely only partially implemented, especially in the early days when we are still trying to explore what it is we’re actually trying to build in the first place. For example, data validation is a feature we want in the product before going live, but we can probably make do with only touching on the subject lightly whilst we explore what data even needs validating in the first place.

This partially implemented feature has started to go by the name “journey code”. While you can argue that all code is essentially malleable as the system grows from its birth to its eventual demise, what we are really trying to convey here is code which cannot really be described as having been “designed”. As such it does not indicate in any meaningful way the thoughts and intended direction of the original author – they literally did the simplest possible thing to make the acceptance test pass and that’s all.

Destination in Sight

When we finally come to play out the more detailed aspects of the feature that becomes the time at which we intend to replace what we did along the journey with where we think the destination really is. Journey code does not even have to have been written in response to the feature itself, it may just be some infrastructure code required to bootstrap exploring a different one, such as logging or persistence.

What it does mean is that collaboration is even more essential when the story is eventually picked up for fleshing out to ensure that we are not wasting time trying to fit in with some design that never existed in the first place. Unless you are someone who happily ignores whatever code anyone else ever writes anyway, and they do exist, you will no doubt spend at least some time trying to understand what’s gone before you. Hence being able to just say “journey code” has become a nice shorthand for “it wasn’t designed so feel free to chuck it away and do the right thing now we know what really needs doing”.

 

[1] I covered a different side-effect of this waterfall/agile impedance mismatch in “Confusion Over Waste”.

Tuesday, 24 January 2017

Weak and Strong Idempotency

In my recent post “PUT vs POST and Idempotency” I touched on the different degrees of idempotency. In retrospect I think it’s better to describe our idempotency guarantee as weak or strong. Either way this post looks at why you might choose one over the other.

Weak Idempotency

At one end of the spectrum we can choose to classify our unique operations with a single scalar value such as a large number, UUID or complex string value (with other embedded values like the date & time). In this scenario we assume that a simple value represents the identity of the operation in much the same way that a reference (or pointer) is the unique address of a value in languages like C# and Java (or C++).

In SQL terms your write with an idempotency check might look something like this:

insert into Operations
select ...
where not exists
(
  select 1
  from Operations
  where OperationId = @operationId
)

For a document oriented database like MongoDB where you might store everything in a single object to achieve atomic updates you could do an equivalent find-and-modify-where style query. If the data model requires multiple writes you might still be able to order them carefully to achieve the desired effect (See “Observable State versus Persisted State”).

Strong Idempotency

In contrast, at the other end, we define an idempotent operation based not only on some unique operation identifier but also the arguments to the operation itself, i.e. the request body in HTTP. In this instance we are not willing to accept that a replay can be distinguished solely by its handle but also wish to ensure the actual details match too, just like a “deep” value equality comparison.

In SQL terms we potentially now have two steps, one to attempt the write and, if a conflict occurs, a subsequent read to validate it:

insert into Operations
select ...
where not exists
(
  select 1
  from Operations
  where OperationId = @operationId
)

select *
from Operations
where OperationId = @operationId

If replays are rare (and I’d expect them to be) you should find this adds only a little extra overhead but also allows you to report broken requests properly.

Identity and Equality

Although it might seem like these weak and strong guarantees are directly analogous to the concept of identity and equality in normal programming there is a subtle difference.

Whilst in the weak case we are happy for the unique operation identifier to act as our sole definition of equality, in the strong guarantee both the identifier and the input values [1] take part in the comparison. In a programming language we tend to choose either the reference based comparison or the value based approach but not both [2].
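
As a crude illustration (the Operation type and the values here are entirely made up), the difference boils down to how much of the request takes part in the comparison:

public class Operation
{
  public string Id { get; set; }
  public string Body { get; set; }
}

var stored = new Operation { Id = "42", Body = "{ \"amount\": 100 }" };
var replay = new Operation { Id = "42", Body = "{ \"amount\": 999 }" };

// Weak: the identifier alone decides, so this looks like a straight replay.
bool weakMatch = stored.Id == replay.Id;                                  // true

// Strong: the identifier and the body must both match, so the mismatch is detected.
bool strongMatch = stored.Id == replay.Id && stored.Body == replay.Body; // false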

Failure Modes

In theory it should be adequate for us to take the simpler approach, and as the provider of a service it’s almost certainly slightly cheaper for us to do that. However the downside is that it makes it harder to diagnose missed updates caused by the service erroneously treating them as replays. But that’s just an edge case right?

During development when your clients are coding against your API for the first time they may make a simple mistake and not correctly generate unique requests. This is not quite as uncommon as you may think.

Another potential source of mistaken uniqueness can come from transforming requests, e.g. from a message queue. If you’re integrating with a legacy system what you may think is unique might in fact turn out to be an artefact of the small data set you sampled and it could turn out only to be mostly unique [3].

Answering these kinds of support queries (“what happened to my request?”) can easily chip away at a team’s time. The more effort you put into making it harder for consumers to incorrectly use your API the less time you’ll need to support them.

Debug Switch

Although the need to verify a replay should be pretty infrequent you could easily wrap the validation logic in a feature switch so that development and test environments provide better diagnostics whilst production has less work to do. But you should measure before considering dropping such a low-cost feature.
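
A bare-bones sketch of what such a switch might look like is shown below; the method, names and message are invented for illustration rather than lifted from a real codebase:

using System;

public static class ReplayChecks
{
  // When the switch is off (e.g. production) the verification short-circuits;
  // when it is on (e.g. dev & test) a replay with a different body is reported loudly.
  public static void VerifyReplay(bool verifyReplays, string operationId,
                                  string storedBody, string incomingBody)
  {
    if (!verifyReplays)
      return;

    if (!string.Equals(storedBody, incomingBody, StringComparison.Ordinal))
      throw new InvalidOperationException(
        string.Format("Operation {0} was replayed with a different body", operationId));
  }
}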

Monitoring

What will help both the service and its consumers is if you can afford to capture when replays do occur so that the monitoring can alert if there appears to be a sudden rise in them. For example a client may go into a failure state that keeps replaying the same message over and over again. Alternatively the rise might suggest some configuration problem where upstream traffic is being duplicated. From a functional point of view the entire system could be behaving correctly, but wasting resources in the process, which could lead to a more catastrophic failure state.

Although a client should never need to behave differently from a functional perspective when a replay occurs, the flip side to the service tracking the event is for the client to report it as well. They could do this if, say, the return code changes in the replay case. In HTTP for example there is 200 OK and 201 Created which could be used to tell the client what happened.
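
A minimal sketch of that idea (the helper is purely illustrative and the flag would come from the outcome of the conditional insert shown earlier):

using System.Net;

public static class ReplayResponse
{
  // A first write reports 201 Created; a detected replay reports 200 OK,
  // giving the client something it can log or count.
  public static HttpStatusCode StatusFor(bool wasReplay)
  {
    return wasReplay ? HttpStatusCode.OK : HttpStatusCode.Created;
  }
}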

Middle Ground

As I described in my previous post there can also be points between the two extremes. One of the problems with modern document oriented databases is that they have limits on the size of a single document. Depending on the anticipated depth of history this may exceed the practical document limit and therefore require some compromises to be made.

If no false replays can be tolerated then perhaps the model needs splitting across documents, which removes the ability to do simple atomic updates. If the history grows slowly and data retention is negotiable then maybe the fine details of much older operations can be discarded to make room for the new.

Brave New World

The new era of databases continues to bring interesting challenges and trade offs due to their differing goals from a traditional big iron RDBMS. Pushing back on some requirements (e.g. “Don’t Be Afraid to Throw Away Data”) is essential if we are to balance their constraints with correctness.

 

[1] You may store server derived properties as well, like a “created at” timestamp. These are of course ignored in any comparison.

[2] Of course breaking an invariant like the relationship between Equals() and GetHashCode() is a popular mistake which can make objects appear to go missing in maps & sets.

[3] I know of one organisation where many people mistakenly believed their customer identifiers were unique, but we discovered they were only truly unique if you took their legacy back-office sources into account too.

Tuesday, 17 January 2017

The Cost of Long-Lived Feature Branches

Many moons ago I was working at a large financial organisation on one of their back office systems. The ever increasing growth of the business meant that our system, whilst mostly distributed, was beginning to creak under the strain. I had already spent a month tracking down some out-of-memory problems in the monolithic orchestration service [1] and a corporate programme to reduce hardware meant we needed to move to a 3rd party compute platform to save costs by sharing hardware.

Branch Per Project

The system was developed by a team (both on-shore and off-shore) numbering around 50 and the branching strategy was based around the many ongoing projects, each of which typically lasted many months. Any BAU work got done on the tip of whatever the last release branch was.

While this allowed the team to hack around to their heart’s content without bumping into any other projects it also meant merge problems were highly likely when the time came. Most of the file merges were trivial (i.e. automatic) but there were more than a few awkward manual ones. Luckily this was also in a time before “relentless refactoring” too so the changes tended to be more surgical in nature.

Breaking up the Monolith

Naturally any project that involved taking one large essential service apart by splitting it into smaller, more distributable components was viewed as being very risky. Even though I’d managed to help get the UAT environment into a state where parallel running meant regressions were usually picked up, the project was still held at arm’s length like the others. The system had a history of delays in delivering, which was unsurprising given the size of the projects, and so naturally we would be tarred with the same brush.

After spending a bit of time working out architecturally what pieces we needed and how we were going to break them out we then set about splitting the monolith up. The service really had two clearly distinct roles and some common infrastructure code which could be shared. Some of the orchestration logic that monitored outside systems could also be split out and instead of communicating in-process could just as easily spawn other processes to do the heavy lifting.

Decomposition Approach

The use of an enterprise-grade version control system which allows you to keep your changes isolated means you have the luxury of being able to take the engine to pieces, rebuild it differently and then deliver the new version. This was the essence of the project, so why not do that? As long as at the end you don’t have any pieces left over this probably appears to be the most efficient way to do it, and therefore was the method chosen by some of the team.

An alternative approach, and the one I was more familiar with, was to extract components from the monolith and push them “down” the architecture so they turn into library components. This forces you to create abstractions and decouple the internals. You can then wire them back into both the old and new processes to help verify early on that you’ve not broken anything while you also test out your new ideas. This probably appears less efficient as you will be fixing up code you know you’ll eventually delete when the project is finally delivered.

Of course the former approach is somewhat predicated on things never changing during the life of the project…

Man the Pumps!

Like all good stories we never got to the end of our project as a financial crisis hit which, through various business reorganisations, meant we had to drop what we were doing and make immediate plans to remediate the current version of the system. My protestations that the project we were currently doing would be the answer if we could just finish it, or even just pull in parts of it, were met with rejection.

So we just had to drop months of work (i.e. leave it on the branch in stasis) and look for some lower hanging fruit to solve the impending performance problems that would be caused by a potential three times increase in volumes.

When the knee jerk reaction began to subside the project remained shelved for the foreseeable future as a whole host of other requirements came flooding in. There was an increase in data volume but there were now other uncertainties around the existence of the entire system itself which meant it never got resurrected in the subsequent year, or the year after so I’m informed. This of course was still on our current custom platform and therefore no cost savings could be realised from the project work either.

Epilogue

This project was a real eye opener for me around how software is delivered on large legacy systems. Having come from a background where we delivered our systems in small increments as much out of a lack of tooling (a VCS product with no real branching support) as a need to work out what the users really wanted, it felt criminal to just waste all that effort.

In particular I had tried hard to encourage my teammates to keep the code changes in as shippable state as possible. This wasn’t out of any particular foresight I might have had about the impending economic downturn but just out of the discomfort that comes from trying to change too much in one go.

Ultimately if we had made each small refactoring on the branch being delivered next (ideally the trunk [2]) then when the project was frozen we would already have been reaping the benefits from the work already done. Then, with most of the work safely delivered to production, the decision to finish it off becomes easier as the risk has largely been mitigated by that point. Even if a short hiatus was required for other concerns, picking the final work up later is still far easier or could itself even be broken down into smaller deliverable chunks.

That said, it’s easy for us developers to criticise the actions of project managers when they don’t go our way. Given the system’s history and delivery record the decision was perfectly understandable and I know I wouldn’t want to be the one making them under those conditions of high uncertainty both inside and outside the business.

Looking back it seems somewhat ridiculous to think that a team would split up and go off in different directions on different branches for many months with little real coordination and just hope that when the time comes we’d merge the changes [3], fix the niggles and release it. But that’s exactly how some larger teams did (and probably even still do) work.

 

[1] The basis of this work was written up in “Utilising More Than 4GB of Memory in a 32-bit Windows Process”.

[2] See “Branching Strategies”.

[3] One developer on one project spent most of their time just trying to resolve the bugs that kept showing up due to semantic merge problems. They had no automated tests either to reduce the feedback loop and a build & deployment to the test environment took in the order of hours.

Monday, 12 December 2016

Surprising Defaults – HttpClient ExpectContinue

One of the things you quickly discover when moving from building services on-premise to “the cloud” is quite how many more bits of wire and kit suddenly sit between you and your consumer. Performance-wise this already elongated network path can then be further compounded when the framework you’re using invokes unintuitive behaviour by default [1].

The Symptoms

The system was a new REST API built in C# on the .Net framework (4.6) and hosted in the cloud with AWS. This AWS endpoint was then further fronted by Akamai for various reasons. The initial consumer was an on-premise adaptor (also written in C#) which itself had to go through an enterprise grade web proxy to reach the outside world.

Naturally monitoring was added in fairly early on so that we could start to get a feel for how much added latency moving to the cloud would bring. Our first order approximation to instrumentation allowed us to tell how long the HTTP requests took to handle along with a breakdown of the major functions, e.g. database queries and 3rd party requests. Outside the service we had some remote monitoring too that could tell us the performance from a more customer-like position.

When we integrated with the 3rd party service some poor performance stats caused us to look closer into our metrics. The vast majority of big delays were outside our control, but it also raised some other questions as the numbers didn’t quite add up. We had expected the following simple formula to account for virtually all the time:

HTTP Request Time ~= 3rd Party Time + Database Time

However we were seeing a 300 ms discrepancy in many (but not all) cases. It was not our immediate concern as there were bigger fish to fry, but some extra instrumentation was added to the OWIN pipeline and we did a couple of quick local profile runs to look out for anything obviously out of place. The finger seemed to point to time lost somewhere in the Nancy part of the pipeline, but that didn’t entirely make sense at the time so it was mentally filed away and we moved on.

Serendipity Strikes

Whilst talking to the 3rd party about our performance woes with their service they came back to us and asked if we could stop sending them an “Expect: 100-Continue” header in our HTTP requests.

This wasn’t something anyone in the team was aware of and as far as we could see from the various RFCs and blog posts it was something “naturally occurring” on the internet. We also didn’t know if it was us adding it or one of the many proxies in between us and them.

We discovered how to turn it off, and did, but it made little difference to the performance problems we had with them, which were in the order of seconds, not milliseconds. Feeling uncomfortable about blindly switching settings off without really understanding them we reverted the change.

The mention of this header also cropped up when we started investigating some errors we were getting from Akamai that seemed to be more related to a disparity in idle connection timeouts.

Eventually, as we learned more about this mysterious header someone in the team put two-and-two together and realised this was possibly where our missing time was going too.

The Cause

Our REST API uses PUT requests to add resources and it appears that the default behaviour of the .Net HttpClient class is to enable the sending of this “Expect: 100-Continue” header for those types of requests. Its purpose is to tell the server that the headers have been sent but that it will delay sending the body until it receives a 100-Continue style response. At that point the client sends the body, the server can then process the entire request and the response is handled by the client as per normal.

Yes, that’s right, it splits the request up so that it takes two round trips instead of one!

Now you can probably begin to understand why our request handling time appeared elongated and why it also appeared to be consumed somewhere within the Nancy framework. The request processing is started and handled by the OWIN middleware as that only depends on the headers; it then enters Nancy, which finds a handler, and so requests the body in the background (asynchronously). When it finally arrives the whole request is then passed to our Nancy handler just as if it had been sent all as a single chunk.

The Cure

When you google this problem with relation to .Net you’ll see that there are a couple of options here. We were slightly nervous about choosing the nuclear option (setting it globally on the ServicePointManager) and instead added an extra line into our HttpClient factory so that it was localised:

var client = new HttpClient(...);
...
client.DefaultRequestHeaders.ExpectContinue = false;
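
For completeness, the “nuclear option” we shied away from looks something like this: a single global setting (on System.Net.ServicePointManager) applied once at start-up that affects every request the process makes.

// Disables the header process-wide rather than per client.
ServicePointManager.Expect100Continue = false;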

We re-deployed our services, checked our logs to ensure the header was no longer being sent, and then checked the various metrics to see if the time was now all accounted for, and it was.

Epilogue

In hindsight this all seems fairly obvious, at least, once you know what this header is supposed to do, and yet none of the people in my team (who are all pretty smart) joined up the dots right away. When something like this goes astray I like to try and make sense of why we didn’t pick it up as quickly as perhaps we should have.

In the beginning there were so many new things for the team to grasp. The difference in behaviour between our remote monitoring and on-premise adaptor was assumed to be one of infrastructure, especially when we had already battled the on-premise web proxy a few times [2]. We saw so many other headers in our requests that we never added, so why would we assume this one was any different (given none of us had run across it before)?

Given the popularity and maturity of the Nancy framework we surmised that no one would use it if it suffered from the kind of performance problems we were seeing, so once again we were confused as to how the time could appear to be lost inside it. Although we were all aware of what the async/await construct does, none of us had really spent any serious time trying to track down performance anomalies in code that used it so liberally, and so once again we had difficulties understanding perhaps what the tool was really telling us.

Ultimately though the default behaviour just seems so utterly wrong that none of us could imagine the out-of-the-box settings would cause the HttpClient to behave this way. By choosing this default we are in essence optimising PUT requests for the scenario where the body does not need sending, which we all felt is definitely the exception not the norm. Aside from large file uploads or massive write contention we were struggling to come up with a plausible use case.

I don’t know what forces caused this decision to be made as I clearly wasn’t there and I can’t find any obvious sources that might explain it either. The internet and HTTP have evolved so much over the years that it’s possible this behaviour provides the best compatibility with web servers out-of-the-box. My own HTTP experience only covers the last few years along with a few more around the turn of the millennium, but my colleagues easily cover the decades I’m missing so I don’t feel we’re overlooking anything obvious.

Hopefully some kind soul will use the comments section to link to the rationale so we can all get a little closure on the issue.

 

[1] Violating The Principle of Least Astonishment for configuration settings was something I covered more generally before in “Sensible Defaults”.

[2] See “The Curse of NTLM Based HTTP Proxies”.

Thursday, 17 November 2016

Overly Prescriptive Tests

In my recent post “Tautologies in Tests” I adapted one of Einstein’s apocryphal sayings and suggested that tests should be “as precise as possible, but not too precise”. But what did I mean by that? How can you be too precise, in fact isn’t that the point?

Mocking

One way is to be overly specific when tracking the interactions with mocks. It’s very easy when using a mocking framework to go overboard with your expectations, just because you can. My personal preference (detailed before in “Mock To Test the Outcome, Not the Implementation”) is to keep the details of any interactions loose, but be specific about the outcomes. In other words what matters most is (usually) the observable behaviour, not necessarily how it’s achieved.

For example, rather than set-up detailed instructions on a mock that cover all the expected parameters and call counts I’ll mostly use simple hand-crafted mocks [1] where the method maps to a delegate where I’ll capture only the salient details. Then in the assertions at the end I verify whatever I need to in the same style as the rest of the test. Usually though the canned response is test case specific and so rarely needs any actual logic.
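
By way of illustration, a hand-crafted mock in this style might look something like the sketch below (the INotifier interface and the test details are made up purely for the example):

using System;

public interface INotifier
{
  void Send(string recipient, string message);
}

// A hand-rolled fake: the method simply forwards to a delegate the test can replace.
public class FakeNotifier : INotifier
{
  public Action<string, string> OnSend = (recipient, message) => { };

  public void Send(string recipient, string message)
  {
    OnSend(recipient, message);
  }
}

// In the test we capture only the salient detail and assert on it at the end,
// in the same style as the rest of the test.
var notifier = new FakeNotifier();
string sentTo = null;
notifier.OnSend = (recipient, message) => sentTo = recipient;

// . . . invoke the code under test . . .

Assert.That(sentTo, Is.EqualTo("support@example.com"));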

In essence what I’m creating is what some people prefer to call stubs, as they reserve the term “mocks” for meatier test fakes that record interactions for you. I’d argue that using the more complex form of mock is largely unnecessary and will hurt in the long run. To date (anecdotally speaking) I’ve wasted too much time “fixing” broken tests that overused mocks by specifying every little detail and were never written to give the implementation room to manoeuvre, e.g. during refactoring. In fact an automated refactoring tool is mandatory on code like this because the methods are referenced in so many tests it would take forever to fix them up manually.

I often feel that some of the interactions with dependencies I’ve seen in the past have felt analogous to testing private methods. Another of my previous posts that was inspired by mocking hell is “Don’t Pass Factories, Pass Workers”. Naturally there is a fine line here and maybe I’ve just not seen enough of it done well to appreciate how this particular tool can be used effectively.

White-Box Testing 

The other form of overly specific test I’ve seen comes from what I believe is relying too much on a white-box testing approach so that the tests express the output exactly.

The problem with example based tests is that they are often taken literally, which I guess is kind of the point, but as software engineers we should try and see past the rigid examples and verify the underlying behaviour instead, which is what we’re really after.

For example, consider a pool of numbers [2] up to some predefined limit, say, 10. A naïve approach to the problem might test the pool by asserting a very specific sequence, i.e. the starting one:

[Test]
public void returns_sequence_up_to_limit()
{
  var pool = new NumberPool(10);
  var expected = new[] { 1, 2, 3, ... , 9, 10 };

  foreach (var number in expected)
    Assert.That(pool.Acquire(), Is.EqualTo(number));
}

From a white-box testing approach we can look inside the NumberPool and probably see that it’s initially generating numbers using the ++ operator. The implementation might eagerly generate that sequence in the constructor, add them to the end of a queue, and then divvy out the front of the queue.
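
The kind of implementation that white-box view suggests might look something like this (a sketch rather than the actual class; re-acquisition is elided):

using System.Collections.Generic;

public class NumberPool
{
  private readonly Queue<int> _numbers = new Queue<int>();

  public NumberPool(int limit)
  {
    // Eagerly generate the sequence 1..limit up front...
    for (int number = 1; number <= limit; ++number)
      _numbers.Enqueue(number);
  }

  public int Acquire()
  {
    // ...and divvy out the front of the queue on each acquisition.
    return _numbers.Dequeue();
  }
}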

From a “programmer’s test” point of view (aka unit test) it does indeed verify that, if my expectation is that the implementation should return the exact sequence 1..10, then it will. But how useful is that for the maintainer of this code? I’d argue that we’ve over-specified the way this unit should be allowed to behave.

Verify Behaviours

And that, I think, lies at that heart of the problem. For tests to be truly effective they should not describe exactly what they do, but should describe how they need to behave. Going back to our example above the NumberPool class does not need to return the exact sequence 1..10, it needs to satisfy some looser constraints, such as not returning a duplicate value (until re-acquired), and limiting the range of numbers to between 1 and 10.

[Test]
public void sequence_will_be_unique()
{
  var pool = new NumberPool(10);
  var sequence = new List<int>();

  foreach (var i in Enumerable.Range(1, 10))
    sequence.Add(pool.Acquire());

  Assert.That(sequence.Distinct().Count(),
              Is.EqualTo(10)); 
}

[Test]
public void sequence_only_contains_one_to_limit()
{
  var pool = new NumberPool(10);
  var sequence = new List<int>();

  foreach (var i in Enumerable.Range(1, 10))
    sequence.Add(pool.Acquire());

  Assert.That(sequence.Where(n => (n < 1) || (n > 10)),
              Is.Empty);
}

With these two tests we are free to change the implementation to generate a random sequence in the constructor instead if we wanted, and they would still pass, because it conforms to the looser, albeit still well defined, behaviour. (It may have unpredictable performance characteristics but that is a different matter.)

Once again we are beginning to enter the realm of property based testing which forces us to think harder about what behaviours our code exhibits rather than what it should do in one single scenario.

This does not mean there is no place for tests that take a specific set of inputs and validate the result against a known set of outputs. On the contrary they are an excellent starting point for thinking about what the real test should do. They are also important in scenarios where you need some smoke tests that “kick the tyres” or you are naturally handling a very specific scenario.

Indicative Inputs

Sometimes we don’t intend to make our test look specific but it just turns out that way to the future reader. For example in our NumberPool tests above what is the significance of the number “10”? Hopefully in this example it is fairly obvious that it is an arbitrary value as the test names only talk about “a limit”. But what about a test for code that handles, say, an HTTP error?

[Test]
public void client_throws_when_service_unavailable()
{
  using (FakeServer.Returns(InternalServerError))
  {
    var client = new RestClient(. . .);

    Assert.That(client.SendRequest(. . .),
                Throws.InstanceOf<RequestException>());
  }
}

In this test we have a mock (nay stub) HTTP server that will return a non-2XX style result code. Now, what is the significance of the InternalServerError result code returned by the stub? Is it a specific result code we’re handling here, or an indicative one in the 5XX range? The test name uses the term “service unavailable” which maps to the more specific HTTP code 503, so is this in fact a bug in the code or test?

Unless the original author is around to ask (and even remembers) we don’t know. We can surmise what they probably meant by inspecting the production code and seeing how it processes the result code (e.g. a direct comparison or a range based one). From there we might choose to see how we can avoid the ambiguity by refactoring the test. In the case where InternalServerError is merely indicative we can use a suitably named constant instead, e.g.

[Test]
public void throws_when_service_returns_5xx_code()
{
  const int CodeIn5xxRange = InternalServerError;

  using (FakeServer.Returns(CodeIn5xxRange))
  {
    var client = new RestClient(. . .);

    Assert.That(client.SendRequest(. . .),
                Throws.InstanceOf<RequestException>());
  }
}

A clue that there is a disconnect is when the language used in the test name isn’t correctly reflected in the test body itself. So if the name isn’t specific then nor should the test be, but also vice-versa, if the name is specific then expect the test to be. A corollary to this is that if your test name is vague don’t be surprised when the test itself turns out equally vague.

Effective Tests

For a suite of tests to be truly effective you need them to remain quietly in the background until you change the code in a way that raises your awareness around some behaviour you didn’t anticipate. The fact that you didn’t anticipate it means that you’ll be relying heavily on the test rather than the code you just changed to make sense of the original intended behaviour.

When it comes under the spotlight (fails) a test needs to convince you that it was well thought out and worthy of your consideration. To be effective a guard dog has to learn the difference between friend and foe and when we write tests we need to learn how to leave enough room for safe manoeuvring without forgetting to bark loudly when we exceed our remit.

 

[1] When you keep your interfaces simple and focused this is pretty easy given how much a modern IDE can generate for you when using a statically typed language.

[2] This example comes from a real one where the numbers were identifiers used to distinguish compute engines in a grid.

Tuesday, 15 November 2016

In The Toolbox – Season Two

As I pointed out in my blog post that collates Season One of my In The Toolbox C Vu column I never intended to write more than a couple of introductory articles before handing it over for others to share their experiences. Yet now, three years later, I’m still plugging away at it and Season Three is already in the making with a couple of episodes already under my belt.

Just as before I also strongly advise you to become a member of the ACCU so you can get this, plus loads of much better content, which may or may not be published online by their respective authors. As I write this post it’s still only a measly £45 per year and is one of the last remaining printed journals about programming.

Anyway, here are links and summaries for episodes 7 through 12.

7: Feature Tracking

We have so many ideas for our products but only so many hours in the day to develop them. Sometimes all it needs is a simple text file in the repo, whilst bigger projects seem to demand an enterprise-grade solution like JIRA.

8: Taming the Inbox

Email is still the predominant means of [a]synchronous communication for many organisations and the barrage of messages needs to be triaged if we stand any hope of separating the wheat from the chaff.

9: The Developer’s Sandbox

As programmers we need a safe environment in which to write and test our code, free from the distractions going on around us. When we run our tests we should not be at the mercy of other developers running theirs at the same time; first and foremost we start in isolation, if we can.

10: Dictionary & Thesaurus

One of the hardest problems in computer science is naming and yet two of the oldest tools used to solve this problem often lay dormant on the programmer’s bookshelf.

11: Finding Text

It’s a simple question: how do you find a piece of text? And yet there is a dizzying array of choices available that depend heavily on what’s accessible at the time and where and how that elusive text is stored.

12: Whiteboards

In the move to go digital the humble whiteboard has been pushed aside, which is disappointing as it’s still probably the best design tool available. It also has many uses besides drawing pictures of boxes, drums and cylinders.