Monday 22 April 2024

Naming Functions: When Intent and Implementation Differ

Most of the time these days when I get into a conversation about naming it tends to be about tweaking the language, perhaps because I think there is a much better term available, or the author is a non-native speaker and they’ve transliterated the name and it’s ended up being quite jarring, or worse, ambiguous. I’m also on somewhat of a personal crusade to try and vastly reduce the lazy use of “get” in codebases by replacing it with far more descriptive names. (For a more humorous take on the overuse of “get” try my Overload back-pager “Acquire Carter”. Alternatively my C Vu article “In The Toolbox – Dictionary & Thesaurus” tackles the issue a little more traditionally.)

However, more recently I ended up in a conversation where I felt the whole point of “naming to reveal intent” had gone out of the window and the author and other reviewer were far too fixated on reflecting the implementation instead. Normally the intent and implementation align very closely, but in this instance they didn’t because it was the first step towards formalising a new feature, which was implemented temporarily [1] using a different one.

Anyone who’s read Arlo Belshee’s writings on Naming as a Process will, I think, recognise what follows as a case of “Honest Naming” instead of “Naming for Intent”, and that comes at a cost to the reader.

Background

I currently work on a system used to generate a variety of different reports. When some of the reporting code runs it generates log messages which warn about missing data, which means it’s had to substitute something else instead. When the code was initially written it was being run interactively and so the missing data would get addressed quickly. However, now that it’s automated nobody sees these messages anymore, except when the code fails and anything written to stdout gets included in the failure alert.

Naturally, there should really be some better way of surfacing these anomalies, even if the report generation succeeds, and there will be, eventually. However, in the meantime we discovered there is a way of hooking into the report scheduler so that those warnings could at least be surfaced in a completion status email instead. This mechanism involved a bit of abuse though – using the “critical” log level instead of merely logging the missing data as a “warning”, because serious errors during runtime were still highlighted at completion.

First Attempt: Confusion Through Dishonesty

The simplest thing that could possibly work, and the initial change that was presented, was to directly change any log lines like this:

print("WARNING: Missing data for 'XYZ'")

to use an alternative logging function like this:

logCritical("Missing data for 'XYZ'")

My immediate reaction to this was one of “No, now the code is lying to us!”. The code previously made it clear that this condition was just an inconvenience but now it appears to be far more serious.

While the elevation in severity was unfortunate, my real problem was that it was missing the reason why the logging call was being made in the first place. Even in the original code the use of print() was just a means to an end, and ultimately we should make that end more obvious.

While I’m a fan of doing the simplest thing by default, that does not mean doing something overly naïve. The cost of adding an extra level of indirection at this point in time was not onerous and would very likely minimise rework in the future as we evolved the real implementation.

Second Attempt: Too Honest

I thought I had expressed my opinion fairly clearly and there appeared to be agreement on the way forward so I was somewhat surprised when the code came back like this:

enrichEmailWithWarning("Missing data for 'XYZ'")

At this point I pulled out Arlo Belshee’s classic naming infographic to try and show how we were still a way off getting to the point where the code was saying what was really going on at a business level.

What makes this particularly confusing is that the code where this appears runs in a child process and so has no knowledge of the parent scheduler. Hence the function being invoked is talking about some obscure mechanism totally outside itself. The name of the function in this instance is too honest – it says what is going on under the covers but not what it really means. If you run this interactively again, what email is it even talking about?

On the plus side though we have at least encapsulated the underlying mechanism to stop lying about the severity, and we can reimplement this later to use a separate file, database, or whatever without touching all the call sites. Or can we?

Well, the problem with honest naming is that when the implementation changes, then so does the name. This means all the call sites will have to be updated yet again each time we reimplement it. Every time we update the code we run the risk of breaking it [2], and it adds unnecessary noise to the file’s history because the real intent hasn’t changed.

Third Attempt: Revealing Intent

Let’s take a step back and ask ourselves what this line of code is really trying to achieve:

print("WARNING: Missing data for 'XYZ'")

While you might say that it’s printing or logging, that is just the implementation bleeding through. A more general way of putting it might be “recording”. You could argue that “notifying” might be better because the information is included in a status email, but I’d suggest that is being too specific at this point: in the future the information may only get written somewhere, and the retrieval mechanism is totally up for grabs – it could easily end up not involving yet more emails.

Okay, so we’re recording something, but that’s not much better than print() or log() if we are just going to pass an arbitrary text string. We need to think more about what we’re recording to see if that can be included in the name too, or broken down into separate arguments.

The whole reason this conversation started was because the messages about missing data were not being seen and therefore we needed to surface them in some way. Hence what we’re recording is “missing data”, and that record ideally needs to be distinct from the diagnostic trace messages and other more general logging output. The function name / API I therefore proposed was:

recordMissingData("XYZ")

To me this now expresses far better what the original author intended all along. The implementation is free to vary quite widely – from simply printing to stdout, to logging to a file, to writing to a queue / database – and yet the same name will continue to reflect all those possibilities. By elevating the text “missing data” from the log message to the function name we are also being more explicit about the fact that handling missing data is formally recognised as a feature and can be given special treatment, unlike the general logging output.
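To make the indirection concrete, here is a minimal sketch (illustrative Python only, not the production code) of how recordMissingData() could simply delegate to the existing logCritical() hack for now while keeping the mechanism hidden from the call sites:

def recordMissingData(subject):
    # For the moment this piggybacks on the scheduler's completion email by
    # (ab)using the "critical" log level; later it could write to a file,
    # queue, or database without any of the call sites changing.
    logCritical(f"Missing data for '{subject}'")

# The call site now states the intent, not the mechanism.
recordMissingData("XYZ")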

Anyone not involved in this conversation who runs across the code and has a similar requirement stands a better chance of using it correctly too, meaning less rework. When such a function is missing, future readers may try to “second guess” what’s going on and then apply the same mechanism for different, inappropriate reasons and now, as they say, you have two problems.

I can’t say for sure we won’t need to revisit this again as we learn more about the nature of what data is missing and whether with more context we can automatically triage and notify the right people, but for now it feels like the cost / benefit ratio of “talking versus doing” is about right.

 

[1] Describing anything as “temporary” is a fool’s errand – we all know it will live far longer than planned :o).

[2] This is code written in a dynamic language so even a simple typo can go undetected unless it’s tested properly.

 

Friday 9 February 2024

Our Star Baker

Just over 14 years ago I posted the eulogy I wrote for my father on this blog (So Long and Thanks For All the Onions) mostly because I had just started writing and this blog gave me the confidence to write. Sadly, a month ago my mother passed away too and yesterday I got to present my eulogy for her as well. The writing practice from the intervening 14 years undoubtedly made the mechanics easier, afforded me more time for reflection, and allowed me to better translate my thoughts and feelings to the page. So, thank you blog, but far more importantly “thank you Mum” – you too will be with me always…

I was visiting Mum in hospital a few years ago, when she was having one of her knees or hips or something replaced, and a nurse came in and addressed her as “Jennifer”. For a brief moment I wondered who they were talking to – I’m not even sure Mum acknowledged her at first either. Mum hadn’t been a Jennifer for a very long time – she was always Jenny, or Jen to her friends. Of course, to myself and Jo she was “mum”, and to our various offspring she was Grandma, or Grandma Jenny. Millie and Ella came up with The Jenster but it never really gained any traction outside our family, for obvious reasons.

She was only ever Jennifer in an official capacity, such as when she returned to work part-time, initially doing market research for the BBC. One of my earliest memories was of tagging along with Mum to some local parade of shops where she would interview the public about their viewing and listening habits. We got to play on the slopes and the railings while she quizzed the public, which was a lot more fun than it might sound today. From there she switched to clerical work, most notably (for me at least) at Bowater-Scott where one of her colleagues offered to build Jo and I a wooden sledge, which he did! As I only remember visiting that office maybe once or twice, I’ve always assumed Mum must have made quite an impression there.

Eventually she ended up in the Audiology Department at West Hill Hospital in Dartford treating people that had lost their hearing. It was only supposed to be a two-week placement, but she ended up staying for 21 years in the end (after cross-training as a Student Technician a year later). It was fairly apparent even to us kids that not every patient was easy to deal with, but she always put their needs first, to the extent that she would often bring her paperwork home to allow her to prioritize her time with the patients instead. If anyone was ever in any doubt that Mum was a people-person, her career at the hospital would surely stand as testament, backed up by the many cards of thanks she received over the years from the people she helped. Perhaps the one aspect of her work we regret her bringing home was the need to talk so much louder all the time.

The hospital wasn’t the most enjoyable place to hang around during the school holidays, but I soon discovered a tiny little computer shop at the top of West Hill which then made the trips to her workplace an absolute joy. When it came to buying me that first computer, I know Dad wasn’t so convinced, and it is Mum that I must thank for taking that leap of faith. Even forty years later Mum would still joke about whether it was the right thing to have done, as it could still just be a passing fad.

When I wrote the eulogy for my father, I suggested that the genes which contributed most to my career in computer programming probably came from him, but in this more recent time for reflection I am beginning to question if it wasn’t more from my mother’s side instead. For her generation Mum was very good with technology – the proverbial Silver Surfer. Although she might occasionally ask for technical advice, she often sorted out her own problems, along with those of her friends! She always wrote a good email and picked up WhatsApp with similar ease, if occasionally being a little over-zealous with the emojis. We had several different family WhatsApp groups in which she was very active, which helped ensure she remained in constant contact with her grandchildren and could easily find out what they were up to. She took a genuine interest in their lives and they were always keen to share. It wasn’t unusual for Charlotte or me to mention what Mum was up to only to be met with a chorus of “yes, we know!” because they had been conversing directly with her about it.

This need to adapt to the ever-changing world was something which Mum embraced, not only on a technical level but also from a social perspective. Rather than dismiss young people because they haven’t faced the same struggles or because their viewpoint didn’t match hers, she would instead engage with them to try and understand how and why the world was changing the way it was. Her grandchildren helped her move with the times and in effect helped her to remain young at heart. She very much believed the old saying about only being as old as you feel. Her body may have shown some signs of wear and tear as she reached her eighties, but her mind was still razor-sharp, along with her wit.

We probably shouldn’t be surprised that some of her joints needed replacing later in life because she was always such an active person! Her diary always seemed to be full – from dawn until dusk – whether that be out with friends and family, or abroad visiting another new country and making even more new friends, and not just for the duration of the trip, they often became lifelong friends which speaks volumes about the kind of impression she left on everyone she met.

As a family we often joked when we visited somewhere new that grandma had probably already been there. If she had, you knew you could rely on her suggestions to fill your itinerary as they would include a range of beautiful vistas, buildings, galleries, restaurants, etc.. Over the years she visited a whole variety of different places from Alaska to Moscow with Canada, Croatia, China, and India to name a few in between. We were always fascinated to see the photo album she would put together on her return and listen to the stories she told about the people she met.

When our children were much smaller we managed to convince her to come away with us on a couple of more relaxing beach holidays. Much as she enjoyed reading at home she wasn’t the sort of person to curl up on a sun lounger with a book, not when there were places she could explore, and the grandkids also made sure it wasn’t going to be a relaxing holiday for any of us, least of all Grandma. I’m still not sure what possessed Mum and me to go paragliding in Turkey! Running, and then jumping, off a cliff while strapped to a stranger felt courageous enough for me, let alone Mum who was in her mid-sixties by then. At the end she remarked that the scariest part was the jeep ride up the mountain, not coming back down again by parachute!

What all her travelling proved was that she had a great sense of adventure and that was epitomised by her walk along the Inca Trail to Machu Picchu. This multi-day hike was essentially a marathon at high elevation, so a challenge even to the younger trekkers. Beforehand Mum was a little concerned about her age and fitness but she put the training in and need not have worried as she found herself at or near the front the entire time. In fact the porters nicknamed her “the nanny goat” on account of how well she acquitted herself. For once the trip didn’t just conclude with a photo album but ended up becoming a PowerPoint presentation too which she gave twice at Isaac and Millie’s school as Machu Picchu was on their curriculum. That wasn’t the only lesson they got from her there either, as Mum and Dad also put us all to shame at the school’s 1940’s night with a wonderful display of swing dancing.

I don’t think mum was ever a passive bystander in anything she got involved in, she was always there to lend a hand and ultimately would get drawn in to fill whatever role needed her talents. While I’m sure she enjoyed watching us swim I think she preferred it when she could also be an active participant – initially helping-out by decorating the float for the parade, to becoming a committee member, then club secretary, and then officiating at galas in various capacities, even after we’d flown the nest. (Her long-standing service to the Kent County Executive was recognised in 2002 when she received the Edward Maples trophy.) She even got to rekindle her netball skills in a couple of Mother & Daughter swimming club socials and we discovered along the way that she had briefly appeared on TV in her youth playing netball.

This wasn’t the only time she has featured on TV, more recently her left arm made a guest appearance on the BBC during The Proms. Mum was a huge fan of art, both in the literal sense and the wider movement. Although we lived well over an hour away London was middle ground for us both and that gave us the perfect opportunity to meet up and take in a West End show or a trip to the Royal Albert Hall. I was already well versed in musicals long before meeting Charlotte and have Mum to thank for knowing so much of The Sound of Music off-by-heart. While always a favourite in our house too, Mamma Mia became the musical of choice when the kids went to stay at Grandma’s after seeing the show together in London.

Even though she didn’t live on our doorstep that didn’t stop her from attending so many of the concerts and productions that her grandchildren featured in. She was always a big supporter of their talents and watched them whether it was a bit-part in the school Nativity or a paid-concert in Ely Cathedral or Huddersfield Town Hall. (Or for that matter a freezing cold football pitch or rugby pitch, which is probably why she nudged Jo and I towards the warmth of a swimming pool.) For some of you this will be old news as she was keen to share their endeavours with her close friends as any doting grandmother would. She attended so many events in and around Godmanchester over the years that people were always surprised to learn that she actually lived 90 miles away!

During the pandemic this distance made it a little harder to meet up in person, but it didn’t stop her from socialising and even doing activities with the kids. Like everyone else we used Zoom to keep in touch and Ella and Mum created their own virtual Great British Bake Off to ensure those legendary cooking skills were still put to good use. I was never a big fan of baking per-se, but I did enjoy squishing the sausage meat between my fingers when we made sausage rolls for Christmas. Likewise making mince pies was something I enjoyed too, and this Christmas baking tradition was passed-down to my children while I took on the more important role of keeping Mum’s glass of Prosecco topped up. Her puddings definitely were legendary, but for The Oldwood family it’s undoubtedly her coffee flavoured birthday cake that she will be most sorely missed for, baking-wise.

I’m now two-thousand words in and have barely scratched the surface of memories I could talk about. At some point I need to stop and give you the opportunity to share your favourite memories with us, and with the other people here. And share them we must, because that’s how we keep her memory alive. Every time we plan a trip, or open a packet of biscuits, or play a game of Rummy, or use her baking tray, or pour a glass of red wine, or whatever, there will be another opportunity to share our love for the person we once knew as Jenny, or Mum, or Grandma.

Wednesday 4 October 2023

Unpacking Code Ownership

This post was prompted by a document I read which was presented as a development guide. While the rest of it was about style, the section that particularly piqued my interest was one involving code ownership. For those of us who’ve been around the block, the term “code ownership” can bring with it connotations of protectionism. If you’ve never worked with people who are incredibly guarded about the code they write, may I recommend my 2017 blog post Fallibility, which contains two examples of work colleagues who erected a wall around themselves and their code.

While I initially assumed the use of the term was a proxy for accountability, some comments to my suggestion that Relentless Refactoring was an established practice in many teams hinted that there might be more to it than that. What came out of an online meeting of the team was that the term was carrying the weight of two different characteristics.

(I should point out that I’m always wary of this kind of discussion verging into bike-shedding territory. I like to try and ensure that language is only as precise as necessary, so only when I suspect there may be confusion or suboptimal behaviour as a consequence do I feel it’s worth digging deeper. In this instance I think “ownership” was referring to the following attributes and not to gatekeeping or protectionism for selfish reasons, e.g. job security.)

Accountability / Responsibility

When people talk about “owning your mistakes” what they’re referring to is effectively being accountable, nay responsible, for them. While there might be a legal aspect in the cases of Machiavellian behaviour, for the most part what we’re really after is some indication that changes were not made simply because “you felt like it”. Any code change should be justifiable which implies that there is an air of objectivity around your rationale.

For example, reformatting the code simply because you personally prefer a different brace placement is not objective [1]. In contrast, reformatting to meet the pre-agreed in-house style is. Likewise applying any refactoring that brings old code back in line with the team’s preferred idioms is inherently sound. Moreover, neither of these should even require any debate as the guide automatically confers agreement [2].

Where it might get more contentious is when it appears to be superfluous, but as long as you can justify your actions with a sense of objectivity I think the team should err on the side of acceptance [3]. The reason I think this kind of change can end up being rejected by default is when there is nothing in the development process to allow the status quo to be challenged. A healthy development process should include time for retrospection (e.g. a formal retrospective) and this is probably the place for further debate if it cannot quickly be resolved. (You should not build “inventory” in the form of open PRs simply because of unresolved conflict [4].)

One scenario where this can be less objective is when trying to introduce new idioms, i.e. experimental changes that may or may not set a new precedent. I would expect this to solicit at least some up-front discussion or proactive reviewing / pairing to weed out the obvious chaff. Throwing “weird” code into the codebase without consulting your teammates is disrespectful and can lead to unnecessary turf wars.

Being accountable also implies that you are mature enough to deal with the consequences if the decision doesn’t go your way, aka Egoless Programming [5]. That may involve seeing your work rejected or rewritten, either immediately or in the future which can feel like a personal attack, but shouldn’t.

Experience / Expertise

While accountability looks at ownership from the perspective of the person wanting to change the code, the flipside of ownership is about those people best placed to evaluate change. When we look for someone to act as a reviewer we look for those who have the most experience either directly with the code itself, or from working on similar problems. There may also be different people that can provide a technical or business focused viewpoint if there are both elements at play which deserve special attention, for example when touching code where the previous authors have left and you need help validating your assumptions.

In this instance what we’re talking about are Subject Matter Experts. These people are no more “owners” of the code in question than we are but that doesn’t mean they can’t provide useful insights. If anything having people unrelated to the code reviewing it can be more useful because you know they will have no emotional attachment to it. If the change makes sense feature-wise, and does it in a simple, easy to understand way, does anything else really matter?

These days we have modern tooling like version control products which, assuming we put the right level of metadata in, allow us to see the evolution of the codebase along with the rationale even when the authors have long gone. Ownership doesn’t have to be conferred simply because you’re the only one that remembers how and why the code ended up the way it did. This leads into territory around fear of change which is not a sustainable approach to software delivery. In this day-and-age “consulting the elders” should really be a last resort for times when the record of events is lost in the sands of time. Approval should be a function based on knowledge of the subject matter rather than simply years of service [6].

Shepherds, Not Owners

Ultimately what I find slightly distasteful about the term “shared ownership” is that it still conveys a sense of protectionism, especially for those currently “outside the team”.

From a metaphorical point of view what I think I described above is more a sense of shepherding. The desire should be to nurture contributors to understand the culture of the codebase and product to the extent that the conversations can focus on the essential, rather than accidental complexity.

I wonder if “shared mentorship” would work as a substitute?

 

[1] This is a good argument for using a standard code formatting tool as it can make these debates moot. 

[2] If the code is so performance sensitive that it should not be touched without consultation then there should either be some performance tests or, at a minimum, some comments to make that obvious.

[3] The late Pieter Hintjens makes a compelling case in Why Optimistic Merging Works Better.

[4] This is where I favour the optimism of Trust, but Verify as an approach, or pairing / ensemble programming to reach early consensus.

[5] The Psychology of Computer Programming – Gerry Weinberg, 1971.

[6] One needs to be mindful of not falling into the Meritocracy trap though.

Monday 31 October 2022

WMI Performance Anomaly: Querying the Number of CPU Cores

As one of the few devs that both likes and is reasonably well-versed in PowerShell I became the point of contact for a colleague that was bemused by a performance oddity when querying the number of cores on a host. He was introducing Ninja into the build and needed to throttle its expectations around how many actual cores there were because hyperthreading was enabled and our compilation intensive build was being slowed by its bad guesswork [1].

The PowerShell query for the number of cores (rather than logical processors) was pulled straight from the Internet and seemed fairly simple:

(Get-WmiObject Win32_Processor |
  measure -Property NumberOfCores -Sum).Sum

However, when he ran this it took a whopping 4 seconds! Although he was using a Windows VM running on QEMU/KVM, I knew from benchmarking a while back that this setup added very little overhead, i.e. only a percent or two, and even on my work PC I observed similarly tardy performance. Here’s how we measured it:

Measure-Command {
  (Get-WmiObject Win32_Processor |
  measure -Property NumberOfCores -Sum).Sum
} | % TotalSeconds
4.0867539

(As I write my HP laptop running Windows 11 is still showing well over a second to run this command.)

My first instinct was that this was some weird overhead with PowerShell, what with it being .Net based, so I tried the classic native wmic tool under Git Bash to see how that behaved:

$ time WMIC CPU Get //Format:List | grep NumberOfCores  | cut -d '=' -f 2 | awk '{ sum += $1 } END{ print sum }'
4

real    0m4.138s

As you can see there was no real difference so that discounted the .Net theory. For kicks I tried lscpu under the WSL based Ubuntu 20.04 and that returned a far more sane time:

$ time lscpu > /dev/null

real    0m0.064s

I presume that lscpu will do some direct spelunking but even so the added machinery of WMI should not be adding the kind of ridiculous overhead that we were seeing. I even tried my own C++ based WMICmd tool as I knew that was talking directly to WMI with no extra cleverness going on behind the scenes, but I got a similar outcome.

On a whim I decided to try pushing more work onto WMI by passing a custom query instead so that it only needed to return the one value I cared about:

Measure-Command {
  (Get-WmiObject -Query 'select NumberOfCores from Win32_Processor' |
  measure -Property NumberOfCores -Sum).Sum
} | % TotalSeconds
0.0481644

Lo-and-behold that gave a timing in the tens of milliseconds range which was far closer to lscpu and definitely more like what we were expecting.
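For the original Ninja problem that was more than good enough to wrap up into a little helper along these lines (a sketch only – the helper name and the way the core count is fed to Ninja are illustrative, not our actual build script):

function Get-PhysicalCoreCount
{
  # The targeted query avoids the multi-second overhead we saw with the
  # class-based form of Get-WmiObject above.
  (Get-WmiObject -Query 'select NumberOfCores from Win32_Processor' |
    Measure-Object -Property NumberOfCores -Sum).Sum
}

# Cap Ninja's parallelism at the physical core count rather than the
# number of logical processors.
$cores = Get-PhysicalCoreCount
ninja -j $cores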

While my office machine has some “industrial strength” [2] anti-virus software that could easily be to blame, my colleague’s VM didn’t, only the default of MS Defender. So at this point I’m none the wiser about what was going on, although my personal laptop suggests that the native tools of wmic and WMICmd are both returning times more in line with lscpu, so something funky is going on somewhere.

 

[1] Hyper-threaded cores meant Ninja was scheduling too much concurrent work.

[2] Read that as “massively interfering”!

 

Monday 1 November 2021

Chaining IF and && with CMD

An interesting bug cropped up the other day in a dub configuration file which made me realise I wasn’t consciously aware of the precedence of && when used in an IF statement with cmd.exe.

Batch File Idioms

I’ve written a ton of batch files over the years and, with error handling being a manual affair, the usual pattern is to alternate pairs of statement + error check, e.g.

mkdir folder
if %errorlevel% neq 0 exit /b %errorlevel%

It’s not uncommon for people to explicitly leave off the error check in this particular scenario so that (hopefully) the folder will exist whether or not it already does. However it then masks a (not uncommon) failure where the folder can’t be created due to permissions, and so I tend to go for the more verbose option:

if not exist "folder" (
  mkdir folder
  if !errorlevel! neq 0 exit /b !errorlevel!
)

Note the switch from %errorlevel% to !errorlevel!. I tend to use setlocal EnableDelayedExpansion at the beginning of every batch file and use !var! everywhere by convention to avoid forgetting this transformation as it’s an easy mistake to make in batch files.
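Putting those two conventions together, a minimal sketch of the resulting batch file looks something like this:

setlocal EnableDelayedExpansion

rem Delayed expansion means !errorlevel! is evaluated when the line actually
rem runs, not when the enclosing block is parsed.
if not exist "folder" (
  mkdir folder
  if !errorlevel! neq 0 exit /b !errorlevel!
)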

Chaining Statements

In cmd you can chain commands with & (much like ; in bash) with && being used when the previous command succeeds and || for when it fails. This is useful with tools like dub which allow you to define “one liners” that will be executed during a build by “shelling out”. For example you might write this:

mkdir bin\media && copy media\*.* bin\media

This works fine the first time but it’s not idempotent, which might be okay for automated builds where the workspace is always clean, but it’s annoying when running the build repeatedly, locally. Hence you might be inclined to fix this by changing it to:

if not exist "bin\media" mkdir bin\media && copy media\*.* bin\media

Sadly this doesn’t do what the author intended because the && is part of the IF statement “then” block – the copy is only executed if the folder doesn’t exist. Hence this was the aforementioned bug which wasn’t spotted at first as it worked fine for the automated builds but failed locally.

Here is a canonical example:

> if exist "C:\" echo A && echo B
A
B

> if not exist "C:\" echo A && echo B

As you can see, in the second case B is not printed, meaning the && echo B is part of the IF statement’s happy path.

Parenthesis to the Rescue

Naturally the solution to problems involving ordering or precedence is to introduce parentheses to be more explicit.

If you look at how parentheses were used in the second example right back at the beginning, you might be inclined to write this, thinking that the parentheses create a scope somewhat akin to {} in C style languages:

> if not exist "C:\" (echo A) && echo B

But it won’t work as the parentheses are still part of the “then” statement. (They are useful to control evaluation when mixing compound conditional commands that use, say, || and & [1].)

Hence the correct solution is to use parentheses around the entire IF statement:

> (if not exist "C:\" echo A) && echo B
B

Applying this to the original problem, it’s:

(if not exist "bin\media" mkdir bin\media) && copy media\*.* bin\media

 

[1] Single line with multiple commands using Windows batch file

Thursday 30 September 2021

Transient Expand-Archive Failures

[I’m sure there is something else going on here but on the off-chance someone else is also observing this, and is equally lost, at least they’ll know they’re not alone.]

We have a GitLab project pipeline that started out as a monolithic job but over the last 9 months has slowly been parallelized and now runs as over 150 jobs spread out across a cluster of 4 fairly decent [1] machines with 8 to 10 concurrent jobs per host. More recently we’ve started seeing the PowerShell Expand-Archive cmdlet failing randomly up to 5% of the time with the following error:

Remove-Item : Cannot find path {...} because it does not exist.

The line of code highlighted in the error is:

$expandedItems | % { Remove-Item $_ -Force -Recurse }

If you google this message it suggests this probably isn’t the real error but a problem with the cmdlet trying to clean-up after failing to extract the contents of the .zip file. Sadly the reason why the extraction might have failed in the first place is now lost.

Investigation

While investigating this error message I ran across two main hits – one from Stack Overflow and the other on the PowerShell GitHub project – both about hitting the classic long path problem in Windows. In our case the extracted paths, even including the build agent root, are still only 100 characters, so well within the limit, as the archive only has one subfolder and the filenames are short.

Also the archive is built with its companion cmdlet Compress-Archive so I doubt it’s an impedance mismatch in our choice of tools.

My gut reaction to anything spurious like this is that it’s the virus scanner (AV) [2]. Sadly I have no direct control over the virus scanner product choice or its configuration. In this instance the machines have Trend Micro whereas the other build agents I’ve built are VMs and have Windows Defender [3], but their load is also much lower. I managed to get the build folder excluded temporarily but that appears to have had no effect and nothing was logged in the AV to say it had blocked anything. (The “behaviour monitoring” in modern AV products often gets triggered by build tools which is annoying.)

After discounting the obvious and checking that memory exhaustion also wasn’t a factor (the memory load for the jobs is variable and the worst case loading can cause the page-file to be used), I wondered if the problem lay with the GitLab runner cache somehow.

Corrupt Runner Cache?

To avoid downloading the .zip file artefact for every job run we utilise the GitLab runner local cache. This is effectively a .zip file of a packages folder in the project working copy that gets packed up and re-used in the other jobs on the same machine which, given our level of concurrency, means it’s constantly in use. Hence I wondered if our archive was being corrupted when the cache was being unpacked as I’ve seen embedded .zip files cause problems in the past for AV tools (even though it supposedly shouldn’t have been touching the folder). So I added a step to test our archive’s integrity before unpacking it by using 7-Zip as there doesn’t appear to be a companion cmdlet Test-Archive. I immediately saw the integrity test pass but the Expand-Archive step fail a few times so I’m pretty sure the problem is not archive corruption.

Workaround

The workaround which I’ve employed is to use 7-Zip for the unpacking step too and so far we’ve seen no errors at all but I’m left wondering why Expand-Archive was intermittently failing. Taking an extra dependency on a popular tool like 7-Zip is hardly onerous but it bumps the complexity up very slightly and needs to be accounted for in the docs / scripts.
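For reference, the replacement steps boil down to something like this (a sketch only – the archive and destination paths are placeholders, and it assumes 7z.exe is on the PATH):

$archive = 'packages.zip'        # placeholder name
$destination = 'packages'        # placeholder folder

# Integrity check first, as there is no companion Test-Archive cmdlet.
& 7z t $archive
if ($LASTEXITCODE -ne 0) { throw "Archive failed its integrity check: $archive" }

# Unpack with 7-Zip instead of Expand-Archive; -y suppresses any prompts.
& 7z x $archive "-o$destination" -y
if ($LASTEXITCODE -ne 0) { throw "Archive failed to unpack: $archive" }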

In my 2017 post Fallibility I mentioned how I once worked with someone who was more content to accept they’d found an undocumented bug in the Windows CopyFile() function than believe there was a flaw in their code or analysis [4]. Hence I feel something as ubiquitous as Expand-Archive is unlikely to have a decompression bug and that there is some piece of the puzzle here that I’m missing. Maybe the AV is still interfering in some way that isn’t triggered by 7-Zip or the transient memory pressure caused by the heavier jobs is having an impact?

Given the low cost of the workaround (use 7-Zip instead), the time, effort and disruption needed to explore this problem any further is sadly too high. For the time being anecdata is the best I can do.

 

[1] 8 / 16 cores, 64 / 128 GB RAM, and NVMe based disks.

[2] I once did some Windows kernel debugging to help prove an anti-virus product update was the reason our engine processes were not terminating correctly under low memory conditions.

[3] Ideally servers shouldn’t need anti-virus tools at all but the principle of Defence in Depth suggests the minor performance impact is worth it to potentially help slow lateral movement.

[4] TL;DR: I quickly showed it was the latter at fault not the Windows API.

 

Monday 27 September 2021

Lose the Source Luke?

We were writing a new service to distribute financial pricing data around the trading floor as a companion to our new desktop pricing tool. The plugin architecture allowed us to write modular components that could tap into the event streams for various reasons, e.g. provide gateways to 3rd party data streams.

Linking New to Old

One of the first plugins we wrote allowed us to publish pricing data to a much older in-house data service which had been sat running in the server room for some years as part of the contributions system. This meant we could eventually phase that out and switch over to the new platform once we had parity with it.

The plugin was a doddle to write and we quickly had pricing data flowing from the new service out to a test instance of the old service which we intended to leave running in the background for soak testing. As it was an in-house tool there was no installer and my colleague had a copy of the binaries lying around on his machine [1]. Also he was one of the original developers so knew exactly what he was doing to set it up.

A Curious Error Message

Everything seemed to be working fine at first but as the data volumes grew we suddenly noticed that the data feed would eventually hang after a few days. In the beginning we were developing the core of the new service so quickly that it was constantly being upgraded, but now that the pace was slowing down the new service stayed alive for much longer. Given how mature the old service was we assumed the issue was with the new one. Also there was a curious message in the log for the old service about “an invalid transaction ID” before the feed stopped.

While debugging the new plugin code my colleague remembered that the Transaction ID was the message sequence number that goes in every message to allow for ordering and re-transmission when running over UDP. The data type for that was a 16-bit unsigned integer, so it dawned on us that we had probably messed up handling the wrapping of the Transaction ID.
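The general shape of that kind of bug is easy to sketch (illustrative Python only, not the actual protocol code):

MAX_TXN_ID = 0xFFFF  # 16-bit unsigned sequence number

def next_txn_id(current):
    # The sender has to wrap back round after 65535 (whether to 0 or 1 was
    # exactly the protocol detail we needed to check)...
    return (current + 1) & MAX_TXN_ID

# ...and a receiver that naively expects previous + 1, without the masking,
# will reject the wrapped value as "an invalid transaction ID".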

Use the Source Luke

Given how long ago he last worked on the old service he couldn’t quite remember what the protocol was for resetting the Transaction ID so we decided to go and look at the old service source code to see how it handled it. Despite being at the company for a few years myself this all pre-dated me so I left my colleague to do the rummaging.

Not long after my colleague came back over to my desk and asked if I might know where the source code was. Like so many programmers in a small company I was a part-time sysadmin and generally looked after some of the servers we used for development duties, such as the one where our Visual SourceSafe repository lived, which contained all the projects we’d ever worked on since I joined.

The VCS Upgrade

When I first started at the company there were only a couple of programmers not working on the mainframe and they wrote their own version control system. It was very Heath Robinson and used exclusive file locks to side-step the problem of concurrent changes. Having been used to a few VCS tools by then, such as PVCS, Star Versions, and Visual SourceSafe, I suggested that we move to a 3rd party VCS product as we needed more optimistic concurrency controls because more people were going to join the team. Given the MSDN licenses we already had, along with my own experience, Visual SourceSafe (VSS) seemed like a natural choice back then [2].

Around the same time the existing development server was getting a bit long in the tooth so the company forked out for a brand new server, and I set up the new VSS repository on that; all my code went in there along with all the subsequent projects we started. None of the people that joined after me ever touched any of the old codebase or VCS as it was so mature it hadn’t needed changing in some time, and anyway the two original devs were still there to look after it.

The Office Move

A couple of years after I joined, the owners of the lovely building the company had been renting for the last few decades decided they wanted to gut and renovate it as the area in London where we were based was getting a big makeover. Hence we were forced to move to new premises about half a mile away. The new premises were nice and modern and I no longer had the vent from the portable air-conditioning machine from one of the small server rooms pumping out hot air right behind my desk [3].

When moving day came I made sure the new server with all our stuff on it got safely transported to the new office’s server room so that we were ready to go again on Monday morning. As we stood staring around the empty office floor my colleague pointed to the old development server, which had lain dormant in the corner, and asked me (rhetorically) whether we should even bother taking it with us. As far as I was concerned everything I’d ever needed had always been on the new server and so I didn’t know what was left that we’d still need.

My colleague agreed and so we left the server to be chucked in the skip when the bulldozers came.

Dormant, But Not Redundant

It turned out their original home-grown version control system had a few projects in it, including the old data service. Luckily one of the original developers who worked on the contributions side still had an up-to-date copy of that and my colleague found a local copy of the code for one of the other services but had no idea how up-to-date it was. Sadly nobody had even a partial copy of the source to the data service we were interested in but we were going to replace that anyway so in the end the loss was far less significant than we originally feared.

In retrospect I can’t believe we didn’t even just take the hard disk with us. The server was a classic tower so took up a fair bit of room, which was still somewhat at a premium in the new office, whereas the disk could probably have sat in a desk drawer or even been fitted as an extra drive in the new midi-sized development server.

 

[1] +1 for xcopy deployment which made setting up development and test instances a piece of cake.

[2] There are a lot of stories of file corruption issues with VSS but in the 7 years I’d used it with small teams, even over a VPN, we only had one file corruption issue that we quickly restored from a backup.

[3] We were on the opposite side from the windows too so didn’t even get a cool breeze from those either.