Wednesday 12 December 2018

“Hello World” Stories

I’ve always tried really hard to fight against “technical stories”. These are supposedly user stories but are really framed as a solution to a problem – in effect just technical tasks. In “Turning Technical Tasks Into User Stories” I looked at how it’s often possible to elevate these from an obvious solution back up to the problem which needs to be solved. At that point you may discover there are other, hopefully cheaper, solutions which were missed in the original analysis, either because things have changed or because different people are doing the thinking.

On the flip-side there are occasionally times where, after having looked at a few related stories, it’s apparent that they all require the same underlying mechanism to work [2]. One common solution is to bulk up the first story with the technical work and let the rest flow through as normal. This way you have no technical work on your backlog per-se as it’s all hidden inside the stories.

Transparency

What I don’t like about this approach is that one story arbitrarily gets hit with a load of extra work, which, if you’re using historical data to stick a finger in the air for estimation of similar work later, skews the average somewhat. It also means that from a visibility perspective one story takes longer while the mechanism is being built.

One way I’ve found to address this has been to pull out the bare bones of the technical work into a “Hello, World!” story [1]. This story is framed around building the skeleton of the mechanism that will be used to drive the implementation of the subsequent features. The aim is to keep the scope minimal enough that we avoid speculating while still delivering something which stands on its own two feet and remains clearly visible on the board.

Value Proposition

While the value to the end-user is in the eventual feature, the value in the mechanism is proving to the development team that the basic approach seems sound. With the skeleton built, the idiosyncrasies around each individual feature can then be dealt with appropriately at the right time and accounted for in the usual way.

To be clear this is not about doing a spike or building a prototype, although that may have happened earlier to gain the knowledge needed to undertake this piece of work. No, here we’re talking about building the bare bones of a real mechanism along with the most basic feature possible.

The reason I’ve called these “Hello World” Stories is probably self-evident: it alludes to the classic program many have chosen as their first – writing “Hello, World!” to the console. In this context the name is intended to conjure up simplicity and remind us that what we’re doing is delivering the minimum required to make the platform viable. We probably won’t literally write “Hello, World!” to the console, but it may be a log message instead that we can then observe and monitor, or a message on a queue that we can see discarded. Essentially we do whatever we can to make its effects observable without wasting any real effort or leaving it partially complete.
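As a purely illustrative sketch – none of these names come from a real system – the entire deliverable might be little more than a handler wired up end-to-end whose only observable behaviour is a trace message:

using System;

// Hypothetical walking skeleton: the mechanism is plumbed in end-to-end but the
// only "feature" is an observable trace that something flowed through it.
public class HelloWorldHandler
{
    public void Handle(string message)
    {
        // No business logic yet, just enough to prove the plumbing works and to
        // give us something to observe and monitor.
        Console.WriteLine($"{DateTime.UtcNow:O} mechanism received: {message}");
    }
}

public static class Program
{
    public static void Main()
    {
        new HelloWorldHandler().Handle("Hello, World!");
    }
}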

Based on the classic INVEST acronym we should strive to make every unit of work: Independent, Negotiable, Valuable, Estimable, Small and Testable. By splitting it out from one of the arbitrary features it becomes more independent, negotiable, estimable and small which can be useful should short-term priorities change. And by extending the scope from a pure mechanism just a little bit further to the most trivial feature possible we make it more testable from a technical perspective, even if not from a product viewpoint. Most importantly, however, is it valuable in its own right? I think sometimes splitting the mechanism out gives value by making the I,N,E,S and T more tangible. In particular breaking work down into smaller deliverable units is often the most valuable practice even if occasionally the end-user has nothing initially to show for it.

Ultimately, I guess, I can’t ever remember anyone complaining they had broken their work down into pieces that were so small they were too visible.

 

[1] I’m sure there is an argument about this not being a “story” per-se but just a “task”. However I prefer to call it a story because our “Hello, World!” realization should have a grounding in the real world, even if it is more abstract than what the end-user will eventually receive.

[2] There is an assumption here that we’ve already decided we cannot or do not want to solve the dependent features in different ways, probably because it would be far more costly (in the long run) than briefly delaying them by building a common pillar.

Friday 7 December 2018

Overthinking is not Overengineering

As the pendulum swings ever closer towards being leaner and focusing on simplicity I grow more concerned about how this is beginning to affect software architecture. By breaking our work down into ever smaller chunks and then focusing on delivering the next most valuable thing, how much of what is further down the pipeline is being factored into the design decisions we make today?

Wasteful Thinking

Part of the thinking behind being leaner is an attempt to reduce the waste caused by speculative requirements, something which has led many a project in the past into a state of “analysis paralysis” where they can’t decide what to build because the goalposts keep moving. By focusing on delivering something simpler much sooner we begin to receive some return on our investment earlier and also shape the future based on practical feedback from today, rather than trying to guess what we need.

When we’re building those simpler features that sit nicely upon our existing foundations we have much less need to worry about the cost of rework from getting it wrong as it’s unlikely to be expensive. But as we move from independent features to those which are based around, say, a new “concept” or “pillar” we should spend a little more time looking further down the backlog to see how any design choices we make might play out later.

Thinking to Excess

The term “overthinking” implies that we are doing more thinking than is actually necessary; trying to fit everyone’s requirements in and getting bogged down in analysis is definitely an undesirable outcome of spending too much time thinking about a problem. As a consequence we are starting to think less and less up-front about the problems we solve, to try and ensure that we only solve the problem we actually have and not the problems we think we’ll have in the future. Solving problems that we are only speculating about can lead to overengineering if they never manage to materialise, or could have been solved more simply when the facts were eventually known.

But how much thinking is “overthinking”? If I have a feature to develop and only spend as much effort thinking as I need to solve that problem then, by definition, any more thinking than that is “overthinking it”. But not thinking about the wider picture is exactly what leads to the kinds of architecture and design problems that begin to hamper us later in the product’s lifetime – and “later” might not be measured in years but in days or weeks if we are looking to build a set of related features that all sit on top of a new concept or pillar.

The Horizon

Hence, it feels to me that some amount of overthinking is necessary to ensure that we don’t prematurely pessimise our solution and paint ourselves into a corner. We should factor work further down the backlog into our thoughts to help us see the bigger picture and work out how we can shape our decisions today to ensure it biases our thinking towards our anticipated future rather than an arbitrary one.

Acting on our impulses prematurely can lead to overengineering if we implement what’s in our thoughts without having a fairly solid backlog to draw on, and overengineering is wasteful. In contrast a small amount of overthinking – thought experiments – is relatively cheap and can go towards helping to maintain the integrity of the system’s architecture.

One has to be careful quoting old adages like “a stitch in time saves nine” or “an ounce of prevention is worth a pound of cure” because they can send the wrong message and lead us back to where we were before – stuck in The Analysis Phase [1]. That said, I want us to avoid “throwing the baby out with the bathwater” and forgetting exactly how much thinking is required to achieve sustained delivery in the longer term.

 

[1] The one phrase I always want to mean this is “think globally, act locally” because it sounds like it promotes big picture thinking while only implementing what we need today, but that’s probably stretching it too far.

Monday 12 November 2018

Feeling Isolated

By and large I think I’ve been fairly lucky with my time as a contract programmer. Virtually all the teams I’ve worked in and systems I’ve worked on have been pretty decent. None of them are going to change the world but they’ve been enjoyable, which is probably why I’ve ended up working on them for a decent length of time [1].

I can only say “virtually all” because one contract sadly fell way short of the mark. Although I was technically part of a team it only really felt that way from a managerial perspective, even though we shared a codebase. I felt somewhat isolated both physically and mentally. Aside from the morning stand-up I could easily have gone the rest of the day without speaking to my teammates if I had chosen to do so.

Physical Isolation

I started the contract on a separate floor from the rest of my team with a couple of other recent joiners [2]. We were the only people on that floor with the air conditioning on full blast so we had to wear our coats in the afternoon to stay warm. None of the rest of my team had an office pass that could access the floor either, should they want to talk face-to-face while getting us up to speed.

Even when they moved us onto the same floor a month later we were still on the opposite side of the room. In the next desk shuffle I got to swap colleagues although they were working on an entirely separate area of the system with a totally different bunch of people so we had little need to collaborate per-se, only to make small talk. Also the two desks next to me only seemed to be used for a game of Tower of Hanoi by the office movers given how the occupants came and went.

Even my “customer” – at least the one I knew about, because they were paying for the project – was situated in a different country and spoke a different language. Although their English was way better than any knowledge I have of a second language, I quickly discovered why most communication was via email or IM instead of vocally.

Project Isolation

Being an enterprise scale organisation the work was all about projects, and who was sponsoring how many “resources”. Nowhere was this more apparent than the Scrum Board with its project-oriented swim-lanes. Each swim-lane had the names of the team members assigned to that project, and as the stand-up proceeded it walked down the board a project at a time with each member of the sub-team providing an update.

It was fairly apparent right from the moment I started, just by reading the body language of the team members, that there was often little real interest in what the rest of the team was doing. Those that did show an interest tended to be the ones whose work cut across projects to some degree because they nursed the build system, deployments and monitoring. A couple of team members never attended our stand-up because they already attended a different one that encompassed their project.

To be fair some of the apathy at the stand-up was almost certainly down to its excessive length. And with little reason for attending except to provide a status update for the managers it’s no surprise those mostly on the periphery zoned out. Sometimes the only common goal of the team seemed to be to not break the system.

Code Isolation

During my short stint I effectively had one feature to work on. There were a couple of other minor tweaks to begin with but ultimately my project was one feature (nay, user story) and it took 5 months to deliver. That one feature involved making a change in an area of the codebase that nobody else knew except one of the tech leads who I soon discovered was leaving. In fact, taking away his days off after the announcement of his departure, I effectively had 3 days for any handover.

Not only were there no docs to work from, there were no tests either. The only real knowledge about how the service was expected to behave had left firmly inside the head of the author. That pretty much just left doing a spot of software archaeology with the VCS in the hope that the commit messages might contain some extra clues. Many features had been tracked in a feature tracking tool but there were not enough licenses to go round so I had to hassle a teammate to look things up. Even then it often wasn’t worth it as there were no useful details; it felt like the ticket was just there to “tick a box”.

The code relied heavily on the caller “doing the right thing” so any understanding only made sense if you already knew what the caller was supposed to do, and that relied heavily on knowledge of the problem domain and the organisation’s other systems. (At the interview I made it perfectly clear that I still knew little about the problem domain, despite the many years I have worked in it [3].)

Methodology Isolation

Ever since I had my epiphany [4] around testing all those years ago I have become a firm believer in TDD and automated testing as the preferred approach to the sustainable delivery of quality software. Being told early in the project that “you won’t have time to write tests”, despite being asked at the interview what my approach was, did not bode well.

It soon became apparent that the previous approach had been to rush something out and rely on manual, end-to-end testing and the customer doing things “right”. Validation was almost entirely left to the underlying maths library and so bizarre errors manifested and needed investigating by the developers due to a lack of basic error handling and reporting [5].

With no way of knowing if I had broken anything, because I didn’t know for sure what anything was supposed to do, my only recourse was to write new code with tests and then refactor later when someone (potentially me) could be sure that it was safe to do so. For existing code that I had to change or understand I would write a barrage of tests first to try and ensure I didn’t accidentally break anything. In some cases it was hard to know what was “by design” and what was “by accident”.
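By way of illustration only – the class and scenario here are hypothetical, not from that codebase – such a characterisation test simply pins whatever the existing code currently does so that any accidental change in behaviour shows up immediately:

using NUnit.Framework;

[TestFixture]
public class PriceFormatterCharacterisationTests
{
    // These cases record the current output of the (hypothetical) legacy
    // PriceFormatter, whether it is "by design" or "by accident", so that a
    // later refactoring cannot silently change it.
    [TestCase(0.0, "0.00")]
    [TestCase(1234.5, "1,234.50")]
    public void existing_formatting_behaviour_is_preserved(double value, string expected)
    {
        Assert.That(PriceFormatter.Format(value), Is.EqualTo(expected));
    }
}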

Clearly not everyone took this approach, as you can see in “It Compiles, Ship It!”. My pessimism paid off though once the edge cases and little extras started appearing as I could turn around a fix or improvement (safely) in minutes due to my suite of automated unit and regression tests.

Environment Isolation

Sadly, despite my ability to push through changes quickly into the integration test environment, it still took weeks for them to actually appear in the production environment. When my first task, a handful of lines of boilerplate code, took 6 weeks to make it into production I assumed continuous delivery was not something they cared about.

On the contrary, for one aspect of the business, releases were very frequent. It was just that I was on the other side and due to some (IMHO) poor architecture and deployment decisions my part of the distributed system was tightly-coupled to another (major) system’s release cycle.

While it might seem great having my own integration test environment to play with, I ran into issues no one else knew about and I had no idea who was really using it and for what. Once again that information pretty much departed with the author.

Parting Thoughts

On reflection I have to look at my own behaviour first and ask myself whether I was at least partly responsible for feeling left out. Once we moved onto the same floor it was definitely easier to wander over and ask people questions, which I did. However when the response is “well I worked all this out by myself originally” and “that’s more than anyone ever gave me” I think it’s not entirely unfair to assume that knowledge sharing isn’t high on some people’s agenda.

I believe I was as welcoming as I normally am and was happy to help out where possible, given the limited knowledge I had acquired. I guess that culturally there was such a large drive for autonomy that the idea of simply chatting about stuff to see what improvements in the system or process would be beneficial just wasn’t on the cards. A couple of times what should have been a constructive comment or question definitely came out of me more as a snide remark, which is never a good sign. I’ve been trying hard to be more aware of any sarcasm, which unfortunately comes all too easily to me, so as not to add any unnecessary negativity, but I know I failed a few times.

Ultimately I think it says a lot about an organisation that rejects your approach because “they are not a start-up” when your application of that approach has only ever been in large enterprises and none of them has ever had an issue with it before. On the contrary they have often been grateful for the insights and improvements that I’ve brought.

Maybe if I was a lot younger I’d not have known any better and stuck it out a bit more but these days I know it’s just not worth the effort. I feel comfortable that I left the place in a better state than I joined it by documenting various things and writing tests for the code I wrote. After a slightly rocky start my customer seemed pretty pleased with everything I delivered, which I guess is largely what matters most.

As ever, my main regret is leaving behind some people that I wish I could have gotten to know better. Maybe I will, in another life, one where the benefits of collaboration are more positively encouraged.

 

[1] Mostly my tenure has been measured in years, not months.

[2] Only one of which was left when I called it a day – the other two barely lasted a month or so.

[3] See “Problem Domain Expert or Technical Expert or Even Both” for more on this recurring theme.

[4] See “My [Unit] Testing Epiphany” and my more recent ACCU / Agile on the Beach talk “A Test of Strength” for what led to my enlightenment.

[5] Poor error messages are a popular topic of mine, see “Terse Exception Messages”. Also “The Perils of DateTime.Parse()” covers one specific example.

Thursday 8 November 2018

Proxy Weirdness – Socket Closed on 404

While investigating the issue that led to the discovery of the strange default behaviour of the .Net HttpClient class which I wrote up in “Surprising Defaults – HttpClient ExpectContinue” we also unearthed some other weirdness in a web proxy that sat between our on-premise adapter and our cloud hosted service.

Web proxies are something I’ve had cause to complain about before (see “The Curse of NTLM Based HTTP Proxies”) as they seem to interfere in unobvious ways and the people you need to consult with to resolve them are almost always out of reach [1]. In this particular instance nobody we spoke to in the company’s networks team knew anything about it, and it is often hard to establish whether the culprit is your on-premise proxy or something broken in one of the other intermediaries that sit between you and the endpoint.

The Symptoms

Whilst trying to track down where the “Expect: 100-Continue” header was coming from, as we didn’t initially believe it was from our code, we ran a WireShark trace to see if we could capture the traffic from, and to, our box. What was weird in the short trace that we captured was that the socket looked like it kept closing after every request. Effectively we sent a PUT request, the response would come back, and immediately afterwards the socket would be closed (RST).

Naturally we put this on the yak stack. Sometime later when checking the number of connections to the TIBCO server I used the Sysinternals’ TCPView tool to see what the service was doing and again I noticed that sockets were being opened and closed repeatedly. As we had 8 threads concurrently processing the message queue it was easy to see 8 sockets open and close again as in TCPView they go green on creation and red briefly on termination.

At least, that appeared to be true for the HTTP requests which went out to the cloud, but not for the HTTP requests that went sideways to the internal authentication service. However they also had an endpoint hosted in the cloud which our cloud service used and we didn’t see that behaviour with them (i.e. cloud-to-cloud), or when we re-configured our on-premise service to use it either (i.e. on-premise-to-cloud). This suggested it was somehow related to our service, but how?

The HttpClient we were using for both sets of requests was the same [2] and so we were pretty sure that it wasn’t our fault, this time, although as the old saying goes “once bitten, twice shy”.

Naturally when it comes to working with HTTP one of the main diagnostic tools you reach for is CURL and so we replayed our requests via that to see if we could reproduce it with a different (i.e. non-.Net based) technology.
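For instance, a replay looked something like this (the URL and payload file here are made up rather than the real endpoint):

> curl -v -X PUT -H "Content-Type: application/json" --data @request.json https://service.example.com/customers/12345

The -v switch meant we could eyeball the raw request and response headers directly rather than relying on what the .Net stack chose to surface.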

Phased Switchover

While the service we were writing was new, it was intended to replace an existing one and so part of the rollout plan was to phase it in slowly. This meant that all reads and writes would go to both versions of the service but only the one where any particular customer’s data resided would succeed. The consumers of the service would therefore get a 404 from us if the data hadn’t been migrated, which in the early stages of development applied to virtually every request.

A few experiments later to compare the behaviour for requests of migrated data versus unmigrated data and we had an answer. For some reason a proxy between our on-premise adapter and our web hosted service endpoint was injecting a “Connection: Close” header when a PUT or DELETE [3] request returned a 404. The HttpClient naturally honoured the response and duly closed the underlying socket.

However it did not have this behaviour for a GET or HEAD request that returned a 404 (I can’t remember about POST). Hence the reason we didn’t see this behaviour with the authentication service was because we only sent GETs, and anyway, they returned a 200 with a JSON error body instead of a 404 for invalid tokens [4].

Epilogue

I wish I could say that we tracked down the source of the behaviour and provide some closure but I can’t. The need for the on-premise adapter flip-flopped between being essential and merely a performance test aid, and then back again. The issue remained as a product backlog item so we wouldn’t forget it, but nothing more happened while I was there.

We informed the network team that we were opening and closing sockets like crazy, which these days with TLS is somewhat more expensive and therefore would generate extra load, but had to leave that with them along with an offer of help if they wanted to investigate it further, as much for our own sanity.

It’s problems like these which cause teams to deviate from established conventions because ultimately one is within their control while the other is outside it and the path of least resistance is nearly always seen as a winner from the business perspective.

 

[1] I’m sure they’re not hidden on purpose but unless you have a P1 incident it’s hard to get their attention as they’re too busy dealing with actual fires to worry about a bit of smoke elsewhere.

[2] The HttpClient should be treated as a Singleton and not disposed per request, which is a common mistake.

[3] See “PUT vs POST and Idempotency” for more about that particular choice.

[4] The effects of this style of API response on monitoring and how you need to refactor to make the true outcome visible are covered in my recent Overload article “Monitoring: Turning Noise into Signal”.

Tuesday 30 October 2018

Always Reply Within Your SLA – Succeed or Abort

Way back in 2012 I wrote the blog post “Service Providers Are Interested In Your Timeouts Too” about how you can help service teams understand your intentions so that they can handle requests more efficiently. That was written at a time when I had been working for many years on internal systems where there were no real SLAs per-se, often just a “best efforts” approach with manual intervention required to “unblock” the system when the failures start occurring [1]. In contrast I have always strived to create self-healing systems as much as possible so that only truly remarkable events require any kind of human remediation.

In more recent years I’ve spent far more time working on web services where there is a much stronger notion of an SLA and therefore a much higher probability that if you fail to meet your SLA then the client will attempt to perform some kind of recovery rather than hang around and wait for the reply [2]. Hence what I wrote about wasting resources on dead requests in that earlier blog post has started to become more significant.

Deadlines

A consequence of this ideology is that I’ve started to become far more interested in the approach of always responding within the SLA even if that means aborting mid-request. Often an SLA is seen as an aspiration rather than any kind of hard deadline, something which we hope to achieve more often than not, where “more often” usually involves quoting some (arbitrary) number of “nines”. For those requests that fall outside this magical number all bets are off and you might get an answer in a useful timeframe or you might not. This kind of uncertainty has always bothered me as a client consumer.

Hence, I’ve started moving towards building services that always provide a reply within the SLA whether or not the request has been satisfied. Instead of tying up valuable resources in the hope that when the answer finally arrives the client still has a vested interest in it, I’d prefer to just abandon the request and let the client know the SLA would be violated if it had continued servicing it. In essence the request times-out server-side, where the time-out is the SLA.

What this means for the client is that they have a definitive reply (network issues notwithstanding) to their request within the time limit allowed. More importantly if they want to allow more time to handle the request than the SLA allows for then they need to tell the service that they’re willing to wait. Essentially this creates a priority system and allows the service to decide what to do with requests that are happy to hang around for a bit longer.

Mechanics

Implementation-wise what this mostly boils down to is ensuring that every non-trivial piece of work (think: database query, network call, disk read, etc.) is made with a bounded call time, i.e. one where a timeout can be provided so that the caller always regains control in a timely fashion. Similarly we don’t start any work that we suspect we can’t finish in time either. This generally manifests as aborting on the first timeout because that first call is usually given the entire remaining SLA and therefore you’re never going to recover in time.

Internally the maximum timeout starts with the SLA and as each background query is sent it is timed and the timeout gets progressively shorter [3]. As the load increases and internal queries take longer the chances of a request aborting rises but at least the load on the upstream systems doesn’t keep rising too. Ultimately it’s just a classic negative feedback loop.
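The following is only a rough sketch of that idea rather than code from a real system: treat the SLA as a budget which is spent as time passes, handing whatever remains to each downstream call as its timeout.

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public class SlaBudget
{
    private readonly TimeSpan _sla;
    private readonly Stopwatch _clock = Stopwatch.StartNew();

    public SlaBudget(TimeSpan sla) => _sla = sla;

    // Whatever is left of the SLA; bail out immediately if it has already been
    // spent so that we never start work we cannot finish in time.
    public TimeSpan Remaining =>
        _sla > _clock.Elapsed
            ? _sla - _clock.Elapsed
            : throw new TimeoutException("SLA exceeded, aborting request");

    // Run one downstream call (database query, outbound HTTP request, etc.)
    // with the remaining budget as its timeout so the caller always regains
    // control before the SLA expires.
    public async Task<T> RunAsync<T>(Func<CancellationToken, Task<T>> operation)
    {
        using (var cts = new CancellationTokenSource(Remaining))
        {
            return await operation(cts.Token);
        }
    }
}

Each non-trivial piece of work is then wrapped in RunAsync so that a reply – even if it is only “aborted, SLA exceeded” – can always be sent within the time limit.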

Limitations

Unfortunately what makes implementing this somewhat less than ideal is that we still don’t really have cancellable requests in many frameworks and you’re never entirely sure what happens when the timeout triggers. If the underlying operation is abandoned, but has to complete anyway because it can’t be cancelled, you may not be much better off. The modern async-enhanced programming world is great for avoiding tying up threads in the happy path, but once you start considering the failure modes it’s much harder to reason about and, more importantly, control what’s going to happen. Despite the fact that under the covers the world of I/O has practically always been asynchronous, the higher layers still assume a synchronous model with syntactic sugar only helping to reinforce that perspective.

So far I don’t have nearly enough production-level data points to know if it’s an idea that is truly worth the effort to implement or not. Being able to reject work outright because you’ve already missed the SLA isn’t too onerous but does mean you need to tap into the processing pipeline early before the request is queued in the background to know when the internal clock has started ticking. What’s harder to determine is whether you really get any benefit out of the additional complexity needed to track your request’s progress and if aborting upstream requests creates a more or equally unstable service due to the way the timeouts leave their underlying requests dangling.

I still think it’s an approach worth pursuing but I wouldn’t be surprised to find The Morning Paper covering something from decades ago that shows it’s just a fool’s errand :o).

 

[1] See “Support-Friendly Tooling” for some other examples about how this can play out if reliability out-of-the-box is “assumed”.

[2] In one instance that would mean abandoning the request and potentially taking on some small financial risk on behalf of the customer.

[3] Naturally for parallel / scatter-gather I/O it’s the time of the longest concurrent request.

Tuesday 2 October 2018

Technical Debt – Conscious Competence

Once upon a time the term Technical Debt seemed to have a very clear meaning but over the last few years that has been diluted to generally mean any crap code or process which is holding back delivery. I’m sure any scholars of Wittgenstein will be at pains to point out that “meaning is use” and therefore if everyone uses it this way who am I to argue?

For me the canonical source of information on the technical debt metaphor comes from the wiki of the person who coined the phrase in the first place – Ward Cunningham. The entry on Technical Debt there suggests to me that the choice to enter into debt is a wholly conscious one, not the unconscious acts of a less professional bunch of programmers.

By way of example I thought I’d take the opportunity to write up one of those occasions where I’ve been involved with taking debt on (in the original spirit of the term) and how we dealt with it, to show where the distinction lies.

The Bug

Soon after going live with v1.0 of a new calculation system in a large financial organisation we discovered that a number of key counterparties were missing from the daily report. The report generator was a late addition and there were various other issues around its development and testing which muddied the waters somewhat, but suffice to say that this wasn’t delivered as cleanly as the core system was. (You might consider the more recent meaning of the term to apply here.)

More importantly what transpired was that due to various mergers in the company’s history a few counterparties had the same “unique” code in different back-end systems. This wasn’t just news to my team (we were all recent hires) but also to quite a few people in the business too. Due to only dealing with a limited set of “books” the codes were always unique to them in their context, but our new system cut right across them all.

The Root Cause

The generation of calculations was ultimately based around a Cartesian product of two sets of counterparties, however given that most of those pairings were pointless there was an optimisation which used another source of data to reduce the number by more than an order of magnitude.

This optimisation should have been fairly simple but due to a need to initially use some existing manually managed counterparty data to ease the cutover (so regression testing should then reconcile exactly) it was somewhat more complicated than first envisaged.

Our system was designed to use the correct source of data eventually, but do a reverse lookup for the time being. It might sound simple but the lookup actually involved multiple lookups using combinations of keys that had to make assumptions about which legacy back-end system might hold the related data. The right person who could explain how we could do what we needed to do correctly also seemed elusive; there were many people with “heuristics”, but nobody who knew for sure.

In total there were ~100 counterparties out of a total of ~15,000 permutations that suffered from this problem. Unfortunately a handful of those 100 had a significant effect on the “bottom line” and therefore the usefulness of the system as a whole was in doubt at that point.

Entering Into Debt

Naturally once we unearthed this clanger we had to decide how to tackle it. After getting our heads around what this all meant and roughly where in the code the missing logic probably needed to go we had to make a decision – do we try and fix the underlying issue right away, or put a workaround in place (assuming that’s even possible) to mitigate the problem, at least temporarily?

We were all very aware of going down the dark road of putting a tactical fix in place because we’d all seen where that can lead. We had made a concerted effort over the 12 months required to build the system to refactor relentlessly [1] and squash any bugs as soon as possible. This felt like a backwards step.

On the positive side, by adopting a Design for Testability approach in most parts of the code we had extra switches on our processes [2] that allowed us to make per-counterparty requests, usually for diagnostic purposes. Hence the workaround took the form of sticking the list of missing counterparties in a simple text file, then using a command prompt FOR loop [3] to read the file and invoke the tool in “single counterparty mode”. Yes it was a little slow due to the constant restarting of the process, but it was easy to surgically insert into the workflow with the minimum of testing or risk.
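In practice the workaround amounted to little more than this, where the tool and file names are invented for illustration:

> FOR /F %c IN (missing-counterparties.txt) DO CalcEngine.exe --counterparty %c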

Paying Back the Debt

With the hole plugged for now, and an easy mechanism in place for adding any other missing counterparties – update the text file – we were in a position to sort out the root problem without feeling under pressure to get the system working correctly 100%, ASAP.

As you can probably imagine the real solution wasn’t easy, not least because it was one of a few areas of SQL code that didn’t have any unit tests and was a tangled web of tables and views which had grown organically in an attempt to graft the old and the new worlds together [4].

What Did it Cost?

If we assume that the gung-ho approach would have been to just jump in and start fixing the real code, then what did we lose by not doing that? It’s possible that the final fix was simple and a little more investigation may have led to that solution instead.

In contrast, the risk is that we end up in one of those “have we fixed it or not” scenarios where we spend an indeterminate amount of time being “real close” to getting towards “done”. The old adage about the last 10% also taking 90% of the time springs immediately to mind. Instead we were almost positive we had a simple workaround that could be deployed and get the system running correctly enough in an estimable amount of time. I believe there is a lot of value in having that degree of confidence.

What I think was critical was being able to remove the pressure on finding the right solution as this gave us time to really consider what needed to be done. Any fix done under pressure is not going to be given the attention to detail that it probably deserves. You then run the risk of making the system worse and having an even deeper hole to dig yourself out of.

The customer does not care about strategic versus tactical decisions per-se, they just want the thing to work. We cared about the solution because we knew it would be a burden in the short term as everyone had to remember about the bit “grafted on the side”. The general trust the team had built up by keeping quality at the forefront meant that the business would be more willing to trust us to reconcile the problem appropriately when the time came.

Use Language With Care

I really hope the term Technical Debt doesn’t continue to get watered down even further as it’s a powerful concept which is incredibly useful in the right hands. We already have far too many words for “alternate implementation” that are pejoratives carrying an air of unprofessionalism about them. I would like this one to remain in the hands of the professionals so they can continue to have “grown up” conversations with their customers about when it’s appropriate to consider taking shortcuts for a short term business need, without them rolling their eyes, yet again.

 

[1] See “Relentless Refactoring” for more thoughts around this (unfortunately) contentious topic.

[2] “From Test Harness To Support Tool”, “Building Systems as Toolkits” and “In The Toolbox - Home-Grown Tools” all look at the non-functional side of tooling.

[3] A batch file just wouldn’t be complete without a for loop, see “Every Solution Starts With ‘FOR /F’”.

[4] This one feature seemed destined to plague us forever, see “So Many Wrongs, But No Rights” for another tale of woe.

Friday 28 September 2018

Beware the Easy Fix

Whenever you get a bug report be sure you can reproduce the problem before you start and check you’ve fixed the bug when you make your change.

This advice might seem blindingly obvious and you’re probably wondering who on earth would try and fix a bug without reproducing the problem first and then without testing the fix works afterwards [1]. I wondered that too, but I was recently involved in a bug report that seemed so cut-and-dried I thought I might have to reconsider my own obsessive desire to stick rigidly to the process. I was of course mistaken…

The Bug

A bug showed up in a new message queue processing service that meant when the message queue broker was down for longer than a minute or so the consumer lost its connection and never reconnected in the background. In turn this meant the queue would slowly back-up – the process was still alive and kicking, it just wasn’t servicing the queue.

This bug report came by way of a production incident and an experienced colleague had triaged the problem so the ticket came into the team with some useful details attached. In the ticket the final log message from the service before it went dark told us that the dispatcher thread had shut down due to the failure to reconnect. The ticket also pointed us to the bit of code where the dispatcher thread was configured.

Looking at the service code along with a quick read of the third party library documentation made it seem pretty obvious that the recovery options configured for the dispatcher were insufficient. It was set-up with only 3 short retries and a circuit breaker for good measure. As a result of the incident some monitoring had been added to the queue so there was no reason why we couldn’t just enable infinite connection retries [2] and effectively disable the circuit breaker. Fixing the dispatcher code was a doddle as the message consumer library is well designed and has good documentation.

It almost seemed too easy…

The Shortest Path

The problem with bugs in infrastructure code like this is that they almost certainly don’t have any automated test coverage because writing them is really hard [3]. In fact testing this kind of issue can be arduous even when done manually, as you need to control the middleware, which might be outside your control or just something which sits in the background ticking away and therefore is almost invisible unless you wrote the original code or have had to fix it before. Throw in the fact that the bug wasn’t a showstopper and it’s easy to see how you could apply Sir Tony Hoare’s principle about code being “so simple that there are obviously no bugs” and just push the change out based on the ability to compile the code and the fact that it doesn’t make matters any worse (you can show you’ve not broken the ability to connect to the queue).

Of course when the problem shows up in production again you’ll know that you never really fixed the problem and you’ll have to go around the loop once more, but do what you should have done first time around as the second outage will no doubt have made a few more people annoyed.

Another Bug

Unsurprisingly the simple code change suggested by the ticket actually had no effect at all when we came to test it, and this sudden realisation that we didn’t really understand what was going on was the impetus needed to take a step back and start again from the beginning.

Whilst performing a quick disconnection test (by bouncing the middleware) we noticed that the queue was behaving weirdly and not backing up like it said in the bug report. Another rabbit hole later [4] and we discover that the queue was not set-up to be durable, which in itself turned out to be another bug.

Eventually we find a way to reproduce the problem and in the process we learn a bit more about how the middleware and message consumer library both work. However we still don’t understand why the new dispatcher configuration does not appear to be working. Luckily the library is open source and so we can debug the issue ourselves and see what is going on under the hood.

The Real Fix

Who would have guessed that internally the message consumer library had another retry and circuit breaker policy that was used to control the (re)connection attempts to the message queue broker. Unlike the dispatcher thread error recovery policy, which was configured explicitly, the message queue connection policy was controlled by a couple of defaulted arguments on the connection configuration object constructor [5].

Sadly we couldn’t be explicit and use the “wait and retry forever” policy that was available on the dispatcher, so instead we had to settle for configuring the number of connection attempts to int.MaxValue.

Problem Solved

Naturally it was far simpler to test the fix because we eventually put the effort into working out how to reproduce the problem in the first place. This can be quite significant from a status reporting perspective because it means you are less likely to be over optimistic about your progress. If you’re struggling to reproduce the problem then you’re going to struggle to prove that you’ve fixed it. If you mistakenly believe that the fix is simple and you then feel under pressure to get the testing done at the end it’s harder to convince yourself to do what needs to be done rather than settle for only potentially being right.

 

[1] This is somewhat disingenuous as there are times where this is not possible, but that’s unusual in the world of mainstream software development.

[2] Without the alert on the queue size we would need to find another way to signal when processing has dropped off. For example the circuit breaker should have triggered some other alert as connection failures are to be expected, but only for a limited time before escalation needs to occur.

[3] See “Automated Integration Testing with TIBCO” for an example of how I’ve done this in the past with a TIBCO message queue.

[4] Yes, the middleware was RabbitMQ but no pun was intended, for once.

[5] I’m not suggesting the library, which is provided for free out of kindness, is at fault. On the contrary the documentation was excellent, as was the support we received on Gitter. I need to help fix this, somehow.

Monday 3 September 2018

Surgical Support Needs Surgical Tools

In the world of IT support there is the universal solution to every problem – turn it off and on again. You would hope that this kind of drastic action is only drawn upon when all other options have been explored or that the problem is known a priori to require such a response and is the least disruptive course of action. Sadly what was often just a joke in the past has become everyday life for many systems as rebooting or restarting is woven into the daily support routine.

In my professional career I have worked on a number of systems where there are daily scheduled jobs to reboot machines or restart services. I’m not talking about the modern, proactive kind like the Chaos Monkey which is used to probe for weaknesses, or when you force cluster failovers to check everything’s healthy; I’m talking about jobs where the restart is required to ensure the correct functioning of the system – disabling them would cripple it.

Sledgehammers

The need for the restart is usually to overcome some kind of corrupt or immutable internal state, or to compensate for a resource leak, such as memory, which degrades the service to an unacceptable level. An example of the former I’ve seen is to flush a poisoned cache or pick up the change in date, while for the latter it might be unclosed database connections or file handles. The notion of “recycling” processes to overcome residual effects has become so prominent that services like IIS provide explicit support for it [1].

Depending on where the service sits in the stack the restart could be somewhat disruptive, if it’s located on the edge, or largely benign if it’s purely internal, say, a background message queue listener. For example I once worked on a compute farm where one of the front-end services was restarted every night and that caused all clients to drop their connection resulting in a support email being sent due to the “unhandled exception”. Needless to say everyone just ignored all the emails as they only added to the background noise making genuine failures harder to spot.

These kinds of draconian measures to try and attain some system stability actually make matters worse, as the restarts then begin to hide genuine stability issues which eventually start happening during business hours as well and therefore cause grief for customers as unplanned downtime starts occurring. The impetus for one of my first ACCU articles “Utilising More Than 4GB of Memory in a 32-bit Windows Process” came from just such an issue where a service suddenly started failing with out-of-memory errors even after a restart if the load was awkwardly skewed. It took almost four weeks to diagnose and properly fix the issue during which there were no acceptable workarounds – just constant manual intervention from the support team.

I also lost quite a few hours on the system I mentioned earlier debugging a problem in the caching mechanism which was masked by a restart and only surfaced because the restart failed to occur. No one had remembered about this failure mode because everyone was so used to the restart hiding it. Having additional complexity in the code for a feature that will never be used in practice is essentially waste.

Cracking Nuts

Although it’s not true in all cases (the memory problem just described being a good example) the restart option may be avoidable if the process exposed additional behaviours that allowed for a more surgical approach to recovery to take place. Do you really need to restart the entire process just to flush some internal cache, or maybe just a small part of it? Similarly if you need to bump the business date via an external stimulus can that not be done through a “discoverable” API instead of being hidden as part of a service restart [2]?

In some of my previous posts and articles, e.g “From Test Harness To Support Tool”, “Home-Grown Tools” and more recently in “Libraries, Console Apps, and GUIs”, I have described how useful I have found writing simple custom tools to be for development and deployment but also, more importantly, for support. What I think was missing from those posts that I have tried to capture in this one, most notably through its title, is to focus on resolving system problems with the minimal amount of disruption. Assuming that you cannot actually fix the underlying design issue without a serious amount of time and effort can you at least alleviate the problem in the shorter term by adding simple endpoints and tools that can be used to make surgical-like changes inside the critical parts of the system?

For example, imagine that you’re working on a grid computing system where tasks are dished out to different processes and the results are aggregated. Ideally you would assume that failures are going to happen and that some kind of timeout and retry mechanism would be in place so that when a process dies the system recovers automatically [3]. However, if for some reason that automatic mechanism does not exist how are you going to recover? Given that someone (or something) somewhere is waiting for their results how are you going to “unblock the system” and allow them to make progress, without also disturbing all your other users who are unaffected?

You can either try and re-submit the failed task and allow the entire job to complete or kill the entire job and get the client to re-submit its job. As we’ve seen one way to achieve the latter would be to restart parts of the system thereby flushing the job from it. When this is a once in a million event it might make sense [4] but once the failures start racking up throwing away every in-flight request just to fix the odd broken one becomes more and more disruptive. Instead you need a way to identify the failed task (perhaps using the logs) and then instruct the system to recover such as by killing just that job or by asking it to resubmit it to another node.

Hence, ideally you’d just like to hit one admin endpoint and say something like this:

> admin-cli kill-job --server prod --job-id 12345

If that’s not easily achievable and there is distributed state to clear up you might need a few separate little tools instead that can target each part of system, for example:

> admin-cli remove-node -s broker-prod --node NODE99
> admin-cli remove-results -s results-prod --job 12345
> admin-cli remove-tasks -s taskq-prod --job 12345
> admin-cli reset-client -s ui-prod --client CLT42
> admin-cli add-node -s broker-prod --node NODE99

This probably seems like a lot of work to write all these little tools but what I’ve found in practice is that usually most of the tricky logic in the services already exists – you just need to find a way to invoke it externally with the correct arguments.

These days it’s far easier to graft a simple administration endpoint onto an existing service. There are plenty of HTTP libraries available that will allow you to expose a very basic API which you could even poke with CURL. If you’re already using something more meaty like WCF then adding another interface should be trivial too.
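As a minimal sketch of what I mean – the port and paths are arbitrary – even the built-in HttpListener in .Net is enough to expose a couple of admin actions you can poke with CURL:

using System;
using System.Net;

public static class AdminEndpoint
{
    public static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8085/admin/");
        listener.Start();

        while (true)
        {
            var context = listener.GetContext();

            // e.g. POST /admin/flush-cache or POST /admin/kill-job?id=12345
            Console.WriteLine($"admin request: {context.Request.HttpMethod} {context.Request.Url.PathAndQuery}");

            // Invoke the existing internal logic here (flush the cache, kill the
            // job, etc.) instead of restarting the whole process.
            context.Response.StatusCode = (int)HttpStatusCode.OK;
            context.Response.Close();
        }
    }
}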

Modern systems are becoming more and more decoupled through the use of message queues which also adds a natural extension point as they typically already do all the heavy lifting and you just need to add new message handlers for the extra behaviours. One of the earliest distributed systems I worked on used pub/sub on a system-wide message bus both for functional and administrative use. Instead of using lots of different tools we had a single admin command line tool that the run playbook generally used (even for some classic sysadmin stuff like service restarts) as it made the whole support experience simpler.

Once you have these basic tools it then becomes easy to compose and automate them. For example on a similar system I started by adding a simple support stored procedure to find failed tasks in a batch processing system. This was soon augmented by another one to resubmit a failed task, which was then automated by a simple script. Eventually it got “productionised” and became a formal part of the system providing the “slow retry” path [3] for automatic recovery from less transient outages.

Design for Supportability

One of the design concepts I’ve personally found really helpful is Design for Testability; something which came out of the hardware world and pushes us to design our systems in a way that makes them much easier to test. A knock-on effect of this is that you can reduce your end-to-end testing burden and deliver quicker.

A by-product of Design for Testability is that it causes you to design your system in a way that allows internal behaviours to be observed and controlled in isolation. These same behaviours are often the same ones that supporting a system in a more fine-grained manner will also require. Hence by thinking about how you test your system components you are almost certainly thinking about how they would need to be supported too.

Ultimately of course those same thoughts should also be making you think about how the system will likely fail and therefore what needs to be put in place beforehand to help it recover automatically. In the unfortunate event that you can’t recover automatically in the short term you should still have some tools handy that should facilitate a swift and far less disruptive manual recovery.

 

[1] Note that this is different from a process restarting itself because it has detected that it might have become unstable, perhaps through the use of the Circuit Breaker pattern.

[2] Aside from the benefits of comprehension this makes the system far more testable as it means you can control the date and therefore write deterministic tests.

[3] See “When Does a Transient Failure Stop Being Transient” for a tangent about fast and slow retries.

[4] Designing any distributed system that does not tolerate network failures is asking for trouble in my book but the enterprise has a habit of believing the network is “reliable enough”.

Thursday 9 August 2018

Test Language: Behaviours, Not Examples

Naming is hard, as we know from the old adage about the two hardest problems in Computer Science, and naming in tests is no different. I’ve documented my own journey around how I structure tests in two previous posts: “Unit Testing Evolution Part II – Naming Conventions” and “Other Test Naming Conventions”. I’ve also covered some similar ground before quite recently in “Overly Prescriptive Tests” but that was more about the content of the tests themselves, whereas here I’m trying to focus more on the language aspects.

Describing the Example

Something which I’ve observed, both from reviewing Fizz Buzz submissions with tests [1] and from real tests, is that there is often that missing leap from writing a test which describes a single example to generalising the language to describe the effective behaviour [2]. For example, imagine you’re writing a unit test for a calculator, if you literally encode your example as your test name you might write:

[Test]
public void two_plus_two_is_equal_to_four()

Given that you could accidentally implement it with multiplication and still make the test pass you might add another scenario to be sure you don’t fall into that trap:

[Test]
public void three_plus_seven_is_equal_to_ten()

The problem with these test names is that they only tell you about the specific scenario covered by the test, not about the bigger picture. One potential refactoring might be to parameterise the test thereby forcing you to generalise the name:

[TestCase(2, 2, 4)]
[TestCase(3, 7, 10)]
public void adding_two_numbers_together_returns_their_sum(. . .)

One way this often shows up in FizzBuzz tests is with examples for the various rules, e.g.

[Test]
public void three_returns_the_word_fizz()

[Test]
public void five_returns_the_word_buzz()

The rules of a basic calculator are already known to pretty much everyone but here, unless you know the rules of the game Fizz Buzz, you would not be able to derive them from these examples alone, and one very important role of tests is to document, nay specify, the behaviour of our code.

Describing the Behaviour

Hence to encode the rules you need to think more generally:

a_number_divisible_by_three_returns_the_word_fizz

There are a couple of issues here, namely that technically any number is divisible by three (just not wholly), and also that it won’t be true once we start bringing in the more advanced rules. It’s not easy trying to be precise and yet also somewhat vague at the same time, but we can try:

a_number_wholly_divisible_by_three_generally_returns_the_word_fizz

Once we bring in the “divisible by three and divisible by five” rule it becomes much harder to be precise in our test names as we’d have to include the overriding rules too which likely makes them harder to read and comprehend:

a_number_wholly_divisible_by_three_but_not_also_wholly_divisible_by_five_returns_the_word_fizz

You might just get away with it this time but it’s not really scalable, and test names, much like code comments, often have a habit of getting out of sync with reality. Even when they break due to new functionality it’s easy to end up fixing the test and forgetting to check whether the “documentation” aspect still reflects the new behaviour.

Hence I personally prefer to use words in test names that suggest “broad strokes” when necessary and guide the reader (top to bottom) from the more general scenarios to the more specific. This, in my mind, is similar to putting the happy path cases before the various error handling ones.
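
As a rough illustration of that ordering, a Fizz Buzz fixture might read from top to bottom something like this – FizzBuzz.Convert is a hypothetical method name used purely for the sketch, with the general rules carrying the “broad strokes” wording and the more specific overriding rule coming last:

[TestFixture]
public class FizzBuzzTests
{
  [Test]
  public void a_number_not_wholly_divisible_by_three_or_five_returns_the_number_itself()
  {
    Assert.That(FizzBuzz.Convert(7), Is.EqualTo("7"));
  }

  [Test]
  public void a_number_wholly_divisible_by_three_generally_returns_the_word_fizz()
  {
    Assert.That(FizzBuzz.Convert(9), Is.EqualTo("Fizz"));
  }

  [Test]
  public void a_number_wholly_divisible_by_five_generally_returns_the_word_buzz()
  {
    Assert.That(FizzBuzz.Convert(10), Is.EqualTo("Buzz"));
  }

  [Test]
  public void a_number_wholly_divisible_by_both_three_and_five_returns_the_word_fizzbuzz()
  {
    Assert.That(FizzBuzz.Convert(15), Is.EqualTo("FizzBuzz"));
  }
}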

Validating Collections

These examples might be a little too trivial but the impetus for this post came from similar scenarios where the test language talked about the outcome of the example itself rather than the behaviour of the logic in general. The knock-on effect of doing this, apart from making the intent of the example harder to comprehend in the future, was that it also became brittle: the specific scenario outcome was encoded in the test, so any change in logic orthogonal to it could break the test unnecessarily. (As mentioned earlier, “Overly Prescriptive Tests” looks at brittle tests from a different angle.)

A common place where this shows up is when asserting behaviours around collections. For example imagine writing tests for querying the seats available in a cinema where there are seats in different price bands. When testing the “seat query” method for an exhausted price band you might be inclined to write:

[TestFixture]
public class when_querying_for_seats_and_none_left_in_band
{
  [Test]
  public void then_the_result_is_empty()
  {
    auditorium.Add("Posh Seats", new Seats[0]);

    var seats = auditorium.FindAvailableSeats();

    Assert.That(seats, Is.Empty);
  }
}

The example, being minimal in nature, means that technically in this scenario the result will be empty. However that is an artefact of the way the example is expressed and the way the test has been written. If I were to change the test set-up and add the following line, the test would break:

auditorium.Add("Cheap Seats", new Seats[100]);

While the outcome of the example above might be “empty”, that is not the general behaviour of the logic under test and our test language should be changed to describe that:

[Test]
public void then_no_seats_in_that_band_are_returned()

Now we’re not making a statement about what else might or might not be in that result, only what our expectations are for seats in the band in question. Once we have fixed the test language we can address how we validate that in the example. Instead of looking at what is in the collection we should be looking at what isn’t there as the test name tells us to expect that something should be absent, and the assert should reflect that language:

Assert.That(seats.Where(s => s.Band == "Posh Seats"), Is.Empty);
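
Putting the revised name and assertion back together, and including the orthogonal “Cheap Seats” set-up line from above to show that it no longer matters, the test might end up looking something like this:

[TestFixture]
public class when_querying_for_seats_and_none_left_in_band
{
  [Test]
  public void then_no_seats_in_that_band_are_returned()
  {
    auditorium.Add("Posh Seats", new Seats[0]);
    auditorium.Add("Cheap Seats", new Seats[100]); // orthogonal data - should not affect the outcome

    var seats = auditorium.FindAvailableSeats();

    Assert.That(seats.Where(s => s.Band == "Posh Seats"), Is.Empty);
  }
}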

Now I should only be able to break this test by changing the data or logic specific to the example; orthogonal behaviours should not break it by accident. (See “Manual Mutation Testing” for more on how you can test the quality of your tests.)

Invest in Tests

If you’ve ever worked on a codebase with brittle tests you’ll know how frustrating it can be when your feature mushrooms because you broke a bunch of badly written tests. If we’re lucky we see the failed assertion and if it’s not obvious then we can look back at the test name to see if the scenario rings any bells. If we’re unlucky we have to immediately reach for the debugger and likely add “refactor tests” to the yak stack.

If you “pay it forward” by taking the time to write good tests up front you’ll find it easier to sustain delivery in the future.

 

[1] A company I once worked for used Fizz Buzz in their candidate early screening process. Despite being overkill in practice (as was pointed out to candidates) a suite of tests was requested as part of the submission to help get a feel for what style they used. IMHO the tests said much more about candidates than the production code.

[2] Yes, “property based testing” takes this entire concept a step further so that it exercises the behaviour with multiple examples generated differently each time. That’s the destination, this post is about one possible journey.

Tuesday 24 July 2018

TODO or TODO Not – Redux

One of my most “successful” posts was one I wrote way back in 2011 about my dislike for TODO style comments in code – “TODO or TODO Not - There Is No Later”. The premise of that post, which includes a number of examples from real codebases I’ve worked on, is that they are fundamentally pointless because they are almost certainly too low in value to get done. If they were valuable enough they either should be a proper feature on the backlog or be left to be handled as part of a relevant story, e.g. refactoring.

At the time I wrote that post I was convinced about them not living in the codebase (past the feature’s release) but I suggested that any potentially useful ones should be converted into bona fide user stories so they could be formally considered and prioritised along with the other items. After a company that tries to quantify technical debt by analysing such TODO style comments replied to a tweet referencing my aging blog post, I felt it must be time for an update. (The TL;DR of this post was my reply to them.)

// Learn to Let Go

One of the hardest lessons I have learned over the intervening 7 years is how to let go of stuff. What caused me to cling onto some of those TODOs that I would run across was a fear of forgetting something important. My daily use of a log book (see “Pen & Paper”) to record notes as I go along exemplifies my apparent need to keep track of my current state as my brain is like the proverbial sieve. In an era when change was much harder, because the software development QA process was largely manual, this made sense as it was more common to batch up changes. Couple this with a general disregard for anything non-functional, such as an appreciation for refactoring, and it’s all too easy to see why people bury their personal backlog in the code rather than open it up for discussion with non-developers.

There is a familiar ring here and that’s because I’ve walked this path before, more recently in 2016’s “Developers Can Be Their Own Worst Enemy”. With a modern development process that puts an emphasis on trust and transparency we can let go of the past and should have more confidence in our peers and the management to see the value in our opinions.

It’s also not acceptable to just leave a cryptic comment in the code about why something should be implemented differently or moved elsewhere – the need to change must be backed up with a reason so that we can understand the value in it, and most TODO comments are throwaway rather than well-reasoned arguments.

// Measuring Technical Debt

I’ll be honest, I genuinely cannot see the point of attempting to measure the level of technical debt in a codebase, even if it were possible to do so. For a start you need to decide if you’re taking the original interpretation – a conscious decision to temporarily take a shortcut – or the ever more popular one – crap code. Even if you could do that, what are you ever going to do with the output? I once saw a SonarQube report that said we had £X million of technical debt in a codebase I was working on; how exactly does that inform you?

The reason I find it pointless is because it feels like it’s measuring the wrong thing. Just like the story about the man looking for his keys under the streetlight, we apply a tool to measure the quality of the code when what really matters is measuring our ability to deliver. The reason poor quality code affects us is because it makes future change hard and slows us down. Hence, if it matters there must be a causal link: if the code really is that bad our pace of delivery will slow down and we’ll see that (all other things being equal).

But what is more important in my mind is that even if you did know how much debt you had you would only want to focus on the areas that are subject to change, and you do that already by focusing on the most valuable work in the backlog! And the technique for tackling the debt is refactoring. Hence we already have what we need to address the real problem.

// Trimming the Backlog

Unwinding the stack slightly, let me return to the topic of reflecting unfinished work in the product’s backlog.

That sentence should already be making your spider-senses tingle – how can it be unfinished? If we’ve not met the acceptance criteria we’re not done, so carry on. If the acceptance criteria were wrong or incomplete then that’s a discussion to have with the product owner for clarification. Note that “of good quality” is an acceptance criterion inherent in every story we do as it is virtually non-negotiable [1].

Perhaps a better term would be undiscovered work. Woody Zuill has this saying “it’s in the doing of the work that we discover the work that needs doing”. What we often unearth are things which never showed up in our initial analysis and therefore we now have to factor them into our schedule or decide to drop them. When we’re knee deep in the feature it may seem really important to do it now, but given time the need may slowly dissolve, and in the end possibly disappear altogether.

What you might initially put on the backlog (or write as a TODO in the code) could be quite specific, e.g. “Need to handle poisoned messages”. As you continue you might also add other scenarios such as “Should refresh token before it expires” and “Triage malformed messages separately from those well-formed but still invalid”. These are all valuable behaviours which go towards building a reliable service, but the initial focus might not be on reliability; maybe it’s currently just a tactical fix to enable learning about something else. You don’t want to waste time over-engineering but at the same time you also don’t want to forget what else you have learned doing it.

The problem with this mentality is that the backlog just grows and grows, and we’re back to where we were before with TODOs littered all over the code – it’s just an ever-growing feature list. As time passes, some of those features will either get implemented because they have suddenly become important (it’s no longer tactical) or they will vanish (it remains a mere stop-gap). Leaving the stories on the backlog just makes it harder and harder to prioritise as you keep going over old ground again and again.

Once again we need to let go and rely on the fact that if they are important they will eventually resurface. If you feel really uncomfortable about deleting them entirely then you might consider rolling up related stories back into epic-sized ones as part of the backlog grooming. Applying this to our recent example, you could easily fold them all into a single “Service Reliability” epic that would be easier to handle. You could turn each card’s title into a checklist item.

// Confidence

That said, as I get older I become less tolerant of a never-ending backlog and want to be more aggressive in the backlog grooming. Part of that is down to having more confidence in knowing that issues will be addressed [2] in a more timely fashion, and that prioritisation will take into account both the technical and functional features with more or less equal consideration, because the team is trusted to do The Right Thing.

Being more experienced, there is no doubt more than a grain of truth in the idea that my sphere of influence is probably wider, but that shouldn’t really matter. Every story must be valuable and ideally stand up on its own merit; where my experience comes in is probably in being able to express that value more succinctly.

// TODO: Redux

I haven’t seen, heard or read anything in the intervening years that has been able to convince me of the value of using TODO comments in the code as a more successful technique for managing what needs to be done. If anything my appetite for tracking any work outside the next few sprints / weeks has begun to wane, simply because it is now so easy to change direction at a moment’s notice should the situation deteriorate. This does not mean we should be reckless, far from it, but leaving a TODO in the code as a way of conveying a change in architecture hardly seems optimal either.

As for the notion that a TODO in the code can be equated with an increase in technical debt, I don’t buy that either. I would posit it is more likely to indicate a failure within the development process itself, as the mismatch between behaviour and acceptance criteria indicates either a bug in the code or a bug in the requirements, and neither of these outcomes sounds like something that should be brushed off lightly with a comment in the code.

 

[1] I try not to deal in absolutes as there always seems to be an exception, but either way it must be a conscious decision.

[2] Assuming the organisation is behaving in a moderately agile way and not just paying lip service to a couple of the ceremonies or practices.

Thursday 19 July 2018

Delivery Anti-Pattern: Local Optimisations

The daily stand-up mostly went along as usual. I wasn’t entirely sure why there was a stand-up as it wasn’t so much a team as a bunch of people working on the same codebase but with more-or-less individual goals. Applying the microservices analogy to the team you could say we were a bunch of micro-teams – each person largely acting with autonomy rather than striving to work together to achieve a common goal [1].

Time

But I digress, only slightly. What happened at the stand-up was a brief discussion around the delivery of a feature with the suggestion that it should be hidden behind a feature toggle. The implementer explained that they weren’t going to add a feature toggle because “it was a waste of their time”.

This surprised me. Knowing what I do now about how the team operates it isn’t that surprising, but at the time I was used to working in teams where every member works towards common goals. One of those common goals is to try and ensure that the delivery of features is a continuous flow and is not disrupted by a bad change which then has to be backed out, because rolling back has the potential to create all sorts of disruption, not least delaying the unaffected changes.

You should note that I’m not disagreeing with their choice of whether or not to use a feature toggle – I did not know anywhere near enough about the change or the system at that time to contribute to that decision. No, what disturbed me was the reason why they chose not to take that approach – that their time is more valuable than that of anyone else in the team, or the business for that matter.

In isolation that paints an unpleasant picture of the individual in question, and that simply is not the case. However their choice of words, even if done without real consideration, does appear to reinforce the culture that surrounds them. In essence, with a feeling that the focus is on them and their performance, they are naturally going to behave in a way that favours optimising the delivery of their own features over that of the team at large.

Quality

Another example of favouring a local optimisation over the longer term goal of sustained delivery occurred when I was assigned my first piece of work. This was not so much a story as a couple of epics funded as an entire project (over 4 months solid work in the end). My instinct, after being shown roughly where in the code I needed to dig, was to ask where the existing tests were so that I knew where to add mine. The tech lead’s immediate response was “you won’t have time to write tests”.

My usual response to this statement is a jovially phrased “how will I know if it works then?” which often has the effect of opening a line of dialogue around the testing strategy and where it’s heading. Unfortunately this time around it only succeeded in the tech lead launching into a diatribe about how important delivery was, how much the business trusted us to deliver on time, blah, blah, blah, in fact almost everything that a good test suite enables!

Of course I still went ahead and implemented the entire project TDD-style and easily delivered it on time because I knew the approach was sound and the investment was more than worth it. The subsequent enhancements and repaying of some technical debt also became trivial at that point and meant that anyone, not just me, could safely and quickly make changes to that area of code. It also showed how easy it was to add new tests to cover changes to the older parts of the component when required later.

In the end over 10% of the unit tests for the entire system had been written by me during that project, for a codebase of probably over half a million lines of code. I also added a command line test harness and a regression testing “framework” [2] in that time too, all in an effort to reduce the number of hoops you needed to jump through to diagnose and safely fix any edge cases that showed up later. None of this was rocket science or in the least bit onerous.

Knowledge

I would consider a lack of supporting documentation one further local optimisation too. When only a select few have the knowledge to help support a system you have to continually rely on their help to nurse it through the bad times. This is especially true when the system has enough quirks that the cost of taking the wrong action is quite high (in terms of additional noise). If you need to remember a complex set of conditions and actions you’re going to get it wrong eventually without some form of checklist to work from. Relying on tribal knowledge is a great form of optimisation until core members of the team leave and you unearth the gaping holes in the team’s knowledge.

Better yet, design away the problems entirely, but that’s a different can of worms…

Project Before Product

I believe this was another example of how “projects” are detrimental to the development of a complex system. With the team funded by various projects, and those projects being used as a very clear division on the task board through swim lanes [3], it killed the desire to swarm on anything but a production incident because you felt beholden to your specific stakeholders.

For example, there were a number of conversations about fixing issues with the system that were slowing down delivery through unreliability, and they ended with “but who’s going to pay for that?” Although improvements were made they had to be so small as to not really affect the delivery of the project work. Hence the only real choice was to find easier ways to treat the symptoms rather than cure the disease.

Victims of Circumstance

Whenever I bump into this kind of culture my gut instinct is not to assume they are “incompetent” people; on the contrary, they’re clearly intelligent, so I’ll assume they are shaped by their environment. Of course we all have our differences, that’s what makes diversity so useful, but we have to remember to stop once in a while and reflect on what we’re doing and question whether it’s still the right approach to take. What works for building Fizz Buzz does not work for a real-time, distributed calculation engine. And even if that approach did work once upon a time, the world keeps moving on and so now we might be able to do even better.

 

[1] Pairing was only something you did when you’d already been stuck for some time, and when the mistake was found you went your separate ways again.

[2] I say “framework” because it was really just leveraging a classic technique: a command line tool reading CSV format data which fired requests into a server, the results of which are then diff’d against a known set of results (Golden Master Testing).

[3] The stand-up was originally run in project order, led by the PM. Unsurprisingly those not involved in the other projects were rarely engaged in the meeting unless it was their turn to speak.

Thursday 12 July 2018

Optimistic SQL

One of the benefits of learning other programming languages is the way it teaches you about other paradigms and idioms. This is the premise behind the “Seven Xxx in Seven Weeks” range of books. Although I have the database one on my bookshelf I’ve only ever skimmed it as, at the time I bought it, I suddenly found myself leaving the world of the classic RDBMS behind and working with other types of DB for real, most notably the document-oriented kind.

Although some of these products like MongoDB and Couchbase have come a long way from their early beginnings as highly available key-value stores they often still lack the full-on transaction support of the old stalwarts like SQL Server and PostgreSQL. Coupled with a high-availability service you have to think differently about how you react to concurrency conflicts as explicit locking is almost certainly never the answer [1].

The impetus for this post was going back into the world of SQL databases and being slightly bemused by a stored procedure that appeared to implement an “upsert” (an UPDATE or INSERT depending on whether the row already exists) as I realised it wasn’t how I’d approach it these days.

The Existing SQL Approach

Initially I was somewhat flummoxed as to why it was even written the way it was, as there appeared to be no concurrency issues in play at all (it was a single service doing the writing), but I later discovered that an accident of the implementation meant there were two writers internally competing, and they chose to resolve this in the database rather than remove the root source of concurrency in the service.

The upsert was basically written like this:

  • Try SELECTing the existing row.
  • If it exists, UPDATE it.
  • If it doesn’t exist, INSERT it.

In the service code there were a number of comments describing why the transaction level was being bumped up to “serializable” – it was effectively to deal with the concurrency they had introduced within the code by creating two competing writers. On top of that, the initial SELECT statement in the upsert applied a HOLDLOCK hint, which also effectively makes the transaction serializable because it puts a range lock on the row’s key (even if that key doesn’t exist yet).

The Document DB Approach

The last few years away from the relational world meant that I was used to dealing with these kinds of conflicts at a slightly lower level. Also, dealing with document updates in the service rather than writing them as SQL meant that updates were done in a server-side loop rather than pushing the concurrency issue down into the database, hence it would look more like this:

  • Try selecting the document.
  • If it exists, update it and try writing it back.
  • If it doesn’t exist, try creating it.
  • If any write fails start over from the beginning.

Due to the lack of transactions and locking, write conflicts are commonly detected by using a version number attribute that gets used in the update predicate [2]. (A write failure, via a “document not found” error, means the predicate failed to match the specific document and version and therefore a conflict has occurred.)
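
A minimal sketch of that loop in service code might look something like the following; the IDocumentStore, Document and ConflictException types are hypothetical stand-ins for whatever the client library actually provides, with the version read alongside the document forming the update predicate:

public void Upsert(IDocumentStore store, string key, Action<Document> applyChanges)
{
  while (true)
  {
    try
    {
      // Try selecting the document first.
      var document = store.TryGet(key);

      if (document != null)
      {
        // It exists: update it and try writing it back, using the version
        // we read as the predicate for detecting a concurrent write.
        applyChanges(document);
        store.Update(key, document, document.Version);
      }
      else
      {
        // It doesn't exist: try creating it.
        var created = new Document();
        applyChanges(created);
        store.Create(key, created);
      }

      return; // The write stuck, so we're done.
    }
    catch (ConflictException)
    {
      // Another writer got in first (the predicate no longer matched),
      // so start over from the beginning with the latest state.
    }
  }
}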

Another SQL Approach

So what does all this have to do with upserts in SQL?

Well, what I found interesting was that my gut reaction was to question why the initial SELECT was there, as I would have written it as:

  • Try to UPDATE the row.
  • If no rows were updated, then INSERT it.

This particular order makes an assumption that updates are more prevalent than inserts, and as a rule I’d say that checking @@ROWCOUNT to see if anything was written is far less ugly than adding a TRY…CATCH block in T-SQL and attempting to verify that the insert failed due to a primary key violation.

That all seemed fairly obvious, but I had forgotten that with the document DB approach you tend to expect, and handle, write failures as part of handling concurrency. In this case, if two connections both attempted the insert concurrently it’s theoretically possible that they could both fail the UPDATE step, and then one of the INSERTs would succeed and the other would fail, resulting in a primary key violation. However the code in the service was not written to detect this and retry the operation (as you would with a document DB), which is why the initial SELECT is there – to lock the “unwritten row” up front, which ensures that another transaction is blocked until the row is then inserted or updated. This way no client logic needs to handle the concurrency problem.

However I believe we can still achieve the same effect by adding the same HOLDLOCK hint to our initial UPDATE so that if the row does not exist other writers will be blocked by the range lock until the subsequent INSERT goes through. Hence the initial SELECT is, I believe, redundant.
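
To make that concrete, here is a minimal sketch of the UPDATE-first upsert as it might be invoked from .NET code (using System.Data.SqlClient); the Settings table and its Name / Value columns are invented purely for illustration and this is not the actual stored procedure from the service:

// A sketch of the pattern only, not the real code; Settings is an invented table.
const string UpsertSql = @"
  UPDATE Settings WITH (HOLDLOCK)
     SET Value = @value
   WHERE Name = @name;

  IF @@ROWCOUNT = 0
    INSERT INTO Settings (Name, Value)
    VALUES (@name, @value);";

public void Upsert(SqlConnection connection, string name, string value)
{
  // The explicit transaction keeps the range lock taken by HOLDLOCK held until
  // the INSERT (or UPDATE) commits, so a second writer blocks rather than
  // racing us to a primary key violation.
  using (var transaction = connection.BeginTransaction())
  using (var command = new SqlCommand(UpsertSql, connection, transaction))
  {
    command.Parameters.AddWithValue("@name", name);
    command.Parameters.AddWithValue("@value", value);

    command.ExecuteNonQuery();

    transaction.Commit();
  }
}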

The MERGE Approach

At this point I remembered that way back in the past SQL Server introduced the MERGE operation, which effectively allows you to write an upsert as a single statement by factoring both the hit and miss logic into different branches of the statement. This caused me to go looking to see what the state of the art in upsert techniques was, possibly with performance comparisons to see how much faster this must be (given that SQL Server clearly has the potential to optimise the query plan as it better knows our intent).
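
For comparison, a sketch of the same invented upsert written as a single MERGE statement might look like this (executed in the same way as the previous sketch); the commonly given advice is that the HOLDLOCK hint is still needed if concurrent writers are a possibility:

// Again a sketch of the pattern only, reusing the invented Settings table.
const string MergeSql = @"
  MERGE Settings WITH (HOLDLOCK) AS target
  USING (SELECT @name AS Name, @value AS Value) AS source
     ON target.Name = source.Name
   WHEN MATCHED THEN
    UPDATE SET Value = source.Value
   WHEN NOT MATCHED THEN
    INSERT (Name, Value) VALUES (source.Name, source.Value);";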

I started digging and was somewhat surprised when I came across the page “Performance of the SQL MERGE vs. INSERT/UPDATE”. I was expecting to have my hypothesis validated but discovered that the answer was far from clear cut. Naturally I then googled “SQL Server upsert performance” to see what else had been written on the subject and I discovered this wasn’t an anomaly so much as a misunderstanding about what problem the MERGE statement is really intended to solve.

You should of course never take performance improvements at face value but “measure, measure, measure” yourself. I wasn’t avoiding doing that; I was looking to see if there might be any pitfalls I needed to be wary of when benchmarking the approach.

At this point I haven’t gone any further with this as it’s more of a personal investigation (there is no actual performance issue to solve) but it just goes to show that writing SQL is as much an art as it’s always been.

 

[1] Some document databases, such as Couchbase, do support locking of documents, but there are heavy restrictions so you tend to find another way.

[2] In the particular example I was looking at no version number was needed in the SQL predicate because the data had a total ordering independent of the write order (it was tracking the minimum and maximum of a value over the day).