Sunday 29 May 2016

The Curse of NTLM Based HTTP Proxies

I first crossed swords with an NTLM authenticating proxy around the turn of the millennium when I was working on an internet-based real-time trading system. We had chosen to use a Java applet for the client which provided the underlying smarts for the surrounding web page. Although we had a native TCP transport for the highest-fidelity connection, when it came to putting the public Internet and, more importantly, enterprise networks between the client and the server, it all went downhill fast.

It soon became apparent that Internet access inside the large organisations we were dealing with was hugely different to that of the much smaller company I was working at. In fact it took another 6 months to get the service live due to all the workarounds we had to put in place to make the service usable by the companies it was targeted at. I ended up writing a number of different HTTP transports to try and get a working out-of-the-box configuration for our end users, as trying to get support from their IT departments was even harder work. On Netscape it used something based on the standard Java URLConnection class, whilst on Internet Explorer it used Microsoft’s J/Direct Java extension to leverage the underlying WinInet API.

This last nasty hack was done specifically to cater for those organisations that put up the most barriers between their users and the Internet, which often turned out to be some kind of proxy which relied on NTLM for authentication. Trying to rig something similar up at our company to develop against wasn’t easy or cheap either. IIRC in the end I managed to get IIS 4 (with MS Proxy Server?) to act as an NTLM proxy so that we had something to work with.

Back then the NTLM protocol was a proprietary Microsoft technology with all the connotations that come with that, i.e. tooling had to be licensed. Hence you didn’t have any off-the-shelf open source offerings to work with and so you had a classic case of vendor lock-in. Essentially the only technologies that could (reliably) work with an NTLM proxy were Microsoft’s own.

In the intervening years both clients (though mostly just the web browser) and servers have required more and more access to the outside world, both for the development of the software itself and for its ultimate presence through the move from on-premises to cloud-based hosting.

Additionally the NTLM protocol was reverse engineered and tools and libraries started to appear (e.g. Cntlm) that allowed you to work (to a degree) within this constraint. However this appears to have sprung from a need in the non-Microsoft community and so support is essentially the bare minimum to get you out of the building (i.e. manually presenting a username and password).
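
By way of illustration, the typical Cntlm set-up amounts to little more than hard-coding the account details and the upstream proxy in its configuration file and pointing your NTLM-unaware tools at the local listener it spawns. The sketch below is from memory and the host, domain and account names are made up, so treat it as a flavour of the approach rather than a definitive reference:

    # cntlm.conf - account and proxy details made up for illustration
    Username    jbloggs
    Domain      CORPDOMAIN
    Password    secret            # better to use the hashes printed by "cntlm -H"
    Proxy       proxy.corp.example.com:8080
    NoProxy     localhost, 127.0.0.*
    Listen      3128

Anything that can only speak plain HTTP then talks to the local listener on port 3128 and Cntlm performs the NTLM handshake with the corporate proxy on its behalf.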

From a team collaboration point of view tools like Google Docs and GitHub wikis have become more common as we move away from format-heavy content, along with Trello for a lightweight approach to managing a backlog. Messaging in the form of Skype, Google Hangouts and Slack also plays a wider role as the number of people outside the corporate network grows, not only due to remote working, but also through bringing strategic partners closer to the work itself.

The process of developing software, even for Microsoft’s own platform, has grown to rely heavily on instant internet access in recent years with the rise of The Package Manager. You can’t develop a non-trivial C# service without access to NuGet, or a JavaScript front-end without Node.js and NPM. Even setting up and configuring your developer workstation or Windows server requires access to Chocolatey unless you prefer haemorrhaging time locating, downloading and manually installing tools.

As the need for more machines to access the Internet grows, so the amount of friction in the process also grows as you bump your head against the world of corporate IT. Due to the complexity of their networks, and the various other tight controls they have in place, you’re never quite sure which barrier you’re finding yourself up against this time.

What makes the NTLM proxy issue particularly galling is that many of the tools don’t make it obvious that this scenario is not supported. Consequently you waste significant amounts of time trying to diagnose a problem with a product that will never work anyway. If you run out of patience you may switch tack long before you discover the footnote or blog post that points out the futility of your efforts.

This was brought home once again only recently when we had developed a nice little tool in Go to help with our deployments. The artefacts were stored in an S3 bucket and there is good support for S3 via the AWS Go SDK. After building and testing the tool we then proceeded to try and work out why it didn’t work on the corporate network. Many rabbit holes were investigated, such as double and triple checking the AWS secrets were set correctly, and being read correctly, etc. before we discovered an NTLM proxy was getting in the way. Although there was a Go library that could provide NTLM support we’d have to find a way to make the S3 library use the NTLM one. Even then it turned out not to work seamlessly with whatever ambient credentials the process was running as, so it pretty much became a non-starter.
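
To give a feel for the plumbing involved, here is a rough sketch (against the v1 AWS SDK for Go) of the seam where a proxy-aware HTTP client has to be injected. The proxy address, region and bucket name are made up, and for an NTLM proxy the plain http.ProxyURL transport shown here would need to be wrapped by (or replaced with) a round-tripper from an NTLM-capable library, which is precisely the part that never worked seamlessly for us:

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "net/url"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    func main() {
        // Hypothetical corporate proxy; an NTLM proxy would also need an
        // NTLM-aware RoundTripper wrapped around this transport.
        proxyURL, err := url.Parse("http://proxy.corp.example.com:8080")
        if err != nil {
            log.Fatal(err)
        }

        httpClient := &http.Client{
            Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
        }

        // The SDK still picks up the ambient AWS credentials as usual; only
        // the HTTP transport is overridden to route via the proxy.
        sess, err := session.NewSession(&aws.Config{
            Region:     aws.String("eu-west-1"),
            HTTPClient: httpClient,
        })
        if err != nil {
            log.Fatal(err)
        }

        // List the deployment artefacts in a (made up) bucket.
        svc := s3.New(sess)
        out, err := svc.ListObjects(&s3.ListObjectsInput{
            Bucket: aws.String("example-deployment-artefacts"),
        })
        if err != nil {
            log.Fatal(err)
        }
        for _, object := range out.Contents {
            fmt.Println(*object.Key)
        }
    }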

We then investigated other options, such as the AWS CLI tools that we could then script, perhaps with PowerShell. More time was wasted before again discovering that NTLM proxies are not supported by them. Finally we resorted to using the AWS Tools for PowerShell which we hoped (by virtue of them being built using Microsoft’s own technology) would do the trick. It didn’t work out of the box, but the Set-AWSProxy cmdlet was the magic we needed and it was easy to find now that we knew what question to ask.

Or so we thought. Once we built and tested the PowerShell-based deployment script we proceeded to invoke it via the Jenkins agent and once again it hung and eventually failed. After all that effort the “service account” under which we were trying to perform the deployment did not have rights to access the internet via (yes, you guessed it) the NTLM proxy.

This need to ensure service accounts are correctly configured even for outbound-only internet access is not a new problem; I’ve faced it a few times before. And yet every time it shows up it’s never the first thing I think of. Anyone who has ever had to configure Jenkins to talk to private Git repos will know that there are many other sources of problems aside from whether or not you can even access the internet.

Using a device like an authenticating proxy has that command-and-control air about it; it ensures that the workers only access what the company wants them to. The alternative approach which is gaining traction (albeit very slowly) is the notion of Trust & Verify. Instead of assuming the worst you grant more freedom by putting monitoring in place to ensure people don’t abuse their privilege. If security is a concern, and it almost certainly is a very serious one, then you can stick a transparent proxy in between to maintain that balance between allowing people to get work done whilst also still protecting the company from the riskier attack vectors.

The role of the organisation should be to make it easy for people to fall into The Pit of Success. Developers (testers, system administrators, etc.) in particular are different because they constantly bump into technical issues that the (probably somewhat larger) non-technical workforce (that the policies are normally targeted at) does not experience on anywhere near the same scale.

This is of course ground that I’ve covered before in my C Vu article Developer Freedom. But there I lambasted the disruption caused by overly-zealous content filtering, whereas this particular problem is more of a silent killer. At least a content filter is pretty clear on the matter when it denies access – you aren’t having it, end of. In contrast the NTLM authenticating proxy first hides in the ether waiting for you to discover its mere existence and then, when you think you’ve got it sussed, you feel the sucker punch as you unearth the footnote in the product documentation that tells you that your particular network configuration is not supported.

In retrospect I’d say that the NTLM proxy is one of the best examples of why having someone from infrastructure in your team is essential to the successful delivery of your product.

Monday 23 May 2016

The Tortoise and the Hare

[This was my second contribution to the 50 Shades of Scrum book project. It was also written back in late 2013.]

One of the great things about being a parent is having time to go back and read some of the old parables again; this time to my children. Recently I had the pleasure of re-visiting that old classic about a race involving a tortoise and a hare. In the original story the tortoise wins, but surely in our modern age we'd never make the same kinds of mistakes as the hare, would we?

In the all-too-familiar gold rush that springs up in an attempt to monetize any new, successful idea a fresh brand of snake oil goes on sale. The particular brand on sale now suggests that this time the tortoise doesn't have to win. Finally we have found a sure-fire way for the hare to conquer — by sprinting continuously to the finish line we will out-pace the tortoise and raise the trophy!

The problem is it's simply not possible in the physical or software development worlds to continue to sprint for long periods of time. Take Usain Bolt, the current 100m and 200m Olympic champion. He can cover the 100m distance in just 9.63 secs and the 200m in 19.32 secs. Assuming a continuous pace of 9.63 secs per 100m the marathon should have been won in ~4063 secs, or 1:07:43. But it wasn't. It was won by Stephen Kiprotich in almost double that — 2:08:01. Usain Bolt couldn't maintain his sprinting pace even if he wanted to over a short to medium distance, let alone a longer one.

The term "sprint" comes loaded with expectations, the most damaging of which is that there is a finish line at the end of the course. Rarely in software development does a project start and end in such a short period of time. The more likely scenario is that it will go on in cycles, where we build, reflect and release small chunks of functionality over a long period. This cyclic nature is more akin to running laps on a track where we pass the eventual finishing line many times (releasing) before the ultimate conclusion is reached (decommissioning).

In my youth my chosen sport was swimming; in particular the longer distances, such as 1500m. At the start of a race the thought of 16 minutes of hard swimming used to fill me with dread. So I chose to break it down into smaller chunks of 400m, which felt far less daunting. As I approached the end of each 400m leg I would pay more attention to where I was with respect to the pace I had set, but I wouldn't suddenly sprint to the end of the leg just to make up a few seconds of lost time. If I did this I’d start the next one partially exhausted which really upsets the rhythm, so instead I’d have to change my stroke or breathing to steadily claw back any time. Of course when you know the end really is in sight a genuine burst of pace is so much easier to endure.

I suspect that one of the reasons management like the term "sprint" is for motivation. In a more traditional development process you may be working on a project for many months, perhaps years before what you produce ever sees the light of day. It’s hard to remain focused with such a long stretch ahead and so breaking the delivery down also aids in keeping morale up.

That said, what does not help matters is when the workers are forced to make a choice about how to meet what is an entirely arbitrary deadline – work longer hours or skimp on quality. And a one, two or three week deadline is almost certainly exactly that — arbitrary. There are just too many daily distractions to expect progress to run smoothly, even in a short time-span. This week alone my laptop died, internet connectivity has been variable, the build machine has been playing up, a permissions change refuses to have the desired effect and a messaging API isn't doing what the documentation suggests it should. And this is on a run-of-the-mill in-house project only using supposedly mature, stable technologies!

Deadlines like the turn of the millennium are real and immovable and sometimes have to be met, but allowing a piece of work to roll over from one sprint to the next when there is no obvious impediment should be perfectly acceptable. The checks and balances should already be in place to ensure that any task is not allowed to mushroom uncontrollably, or that any one individual does not become bogged down or "go dark". The team should self-regulate to ensure just the right balance is struck between doing quality work to minimise waste whilst also ensuring the solution solves the problem without unnecessary complexity.

As for the motivational factors of "sprinting" I posit that what motivates software developers most is seeing their hard work escape the confines of the development environment and flourish out in production. Making it easy for developers to continually succeed by delivering value to users is a far better carrot-and-stick than "because it's Friday and that's when the sprint ends".

In contrast, the term "iteration" comes with far less baggage. In fact it speaks the developer's own language – iteration is one of the fundamental constructs used within virtually any program. It accentuates the cyclic nature of long term software development rather than masking it. The use of feature toggles as an enabling mechanism allows partially finished features to remain safely integrated whilst hiding them from users as the remaining wrinkles are ironed out. Even knowing that your refactorings have gone live without the main feature is often a small personal victory.
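
For the avoidance of doubt, a feature toggle needn’t be anything sophisticated. A minimal sketch in Go (the flag name and environment-variable convention are invented purely for illustration) might look like this, with the half-finished code path integrated and released but invisible to users until the switch is flipped:

    package main

    import (
        "fmt"
        "os"
    )

    // featureEnabled is a minimal toggle backed by an environment variable.
    // A real system might read a config file or a feature-flag service instead.
    func featureEnabled(name string) bool {
        return os.Getenv("FEATURE_"+name) == "on"
    }

    func main() {
        // The (hypothetical) NEW_PRICING feature is merged and deployed,
        // but only shown when the toggle is switched on.
        if featureEnabled("NEW_PRICING") {
            fmt.Println("new pricing page (work in progress)")
        } else {
            fmt.Println("existing pricing page")
        }
    }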

That doesn't meant the term sprint should never be used. I think it can be used when it better reflects the phase of the project, e.g. during a genuine milestone iteration that will lead to a formal rather than internal release. The change in language at this point might aid in conveying the change in mentality required to see the release out the door, if such a distinction is genuinely required. However the idea of an iteration "goal" is already in common use as an alternative approach to providing a point of focus.

If continuous delivery is to be the way forward then we should be planning for the long game and that compels us to favour a sustainable pace where localized variations in scope and priority will allow us to ride the ebbs and flows of the ever changing landscape.

Debbie Does Development

[This was my initial submission to the 50 Shades of Scrum book project, hence its slightly dubious-sounding title. It was originally written way back in late 2013.]

Running a software development project should be simple. Everyone has a well-defined job to do: Architects architect, Analysts analyse, Developers develop, Testers test, Users use, Administrators administer, and Managers manage. If everyone just got on and did their job properly the team would behave like a well-oiled machine and the desired product would pop out at the end — on time and on budget.

What this idyllic picture describes is a simple pipeline where members of one discipline take the deliverables from their predecessor, perform their own duty, throw their contribution over the wall to the next discipline, and then get on with their next task. Sadly the result of working like this is rarely the desired outcome.

Anyone who believes they got into software development so they could hide all day in a cubicle and avoid interacting with other people is sorely mistaken. In contrast, the needs of a modern software development team demand continual interaction between its members. There is simply no escaping the day-to-day, high-bandwidth conversations required to raise doubts, pass knowledge, and reach consensus so that progress can be made efficiently and, ultimately, for value to be delivered.

Specializing in a single skill is useful for achieving the core responsibilities your role entails, but for a team to work most effectively requires members that can cross disciplines and therefore organize themselves into a unit that is able to play to its strengths and cover its weaknesses. My own personal preference is to establish a position somewhat akin to a centre half in football — I'm often happy taking a less glamorous role watching the backs of my fellow team mates whilst the "strikers" take the glory. Enabling my colleagues to establish and sustain that all important state of "flow" whilst I context-switch helps overall productivity.

To enable each team member to perform at their best they must be given the opportunity to ply their trade effectively and that requires working with the highest quality materials to start with. Rather than throwing poor-quality software over a wall to get it off their own plate, the team should take pride in ensuring they have done all that they can to pass on quality work. The net effect is that with less need to keep revisiting the past they have more time to focus on the future. This underlies the notion of "done done" — when a feature is declared complete it comes with no caveats.

The mechanics of this approach can clearly be seen with the technical practices such as test-driven development, code reviews, and continuous integration. These enable the development staff to maintain a higher degree of workmanship that reduces the traditional burden on QA staff caused by trivial bugs and flawed deployments.

Testers will themselves write code to automate certain kinds of tests as they provide a more holistic view of the system which covers different ground to the developers. In turn this grants them more time to spend on the valuable pursuits that their specialised skills demand, like exploratory testing.

This skills overlap also shows through with architects who should contribute to the development effort too. Being able to code garners a level of trust that can often be missing between the developer and architect due to an inability to see how an architecture will be realized. This rift between the "classes" is not helped either when you hear architects suggesting that their role is what developers want to be when they "grow up".

Similar ill feelings can exist between other disciplines as a consequence of a buck-passing mentality or mistakenly perceived job envy. Despite what some programmers might believe, not every tester or system administrator aspires to be a developer one day either.

Irrespective of what their chosen technical role is, the one thing everyone needs to be able to do is communicate. One of the biggest hurdles for a software project is trying to build what the customer really wants and this requires close collaboration between them and the development team. A strict chain of command will severely reduce the bandwidth between the people who know what they want and the people who will design, build, test, deploy and support it. Not only do they need to be on the same page as the customer, but also the same page as their fellow team mates. Rather than strict lines of communication there should be ever changing clusters of conversation as those most involved with each story arrive at a shared understanding. It can be left to a daily stand-up to allow the most salient points of each story to permeate out to the rest of the team.

Pair Programming has taken off as a technique for improving the quality of output from developers, but there is no reason why pairing should be restricted to two team members of the same skill-set. A successful shared understanding comes from the diversity of perspectives that each contributor brings, and that comes about more easily when there is a fluidity to team interactions. For example, pairing a developer with a tester will help the tester improve their coding skills, whilst the developer improves their ability to test. Both will improve their knowledge of the problem. If there are business analysts within the team too this can add even further clarity as "the three amigos" cover more angles.

One only has to look to the recent emergence of the DevOps role to see how the traditional friction between the development and operations teams has started to erode. Rather than have two warring factions we've acknowledged that it’s important to incorporate the operational side of the story much earlier into the backlog to avoid those sorts of late surprises. With all bases covered from the start there is far more chance of reaching the nirvana of continuous delivery.

Debbie doesn't just do development any more than Terry just does testing or Alex only does Architecture. The old-fashioned attitude that we are expected to be autonomous to prove our worth must give way to a new ideal where collaboration is king and the output of the team as a whole takes precedence over individual contributions.