Sunday 29 May 2016

The Curse of NTLM Based HTTP Proxies

I first crossed swords with an NTLM authenticating proxy around the turn of the millennium when I was working on an internet based real-time trading system. We had chosen to use a Java applet for the client which provided the underlying smarts for the surrounding web page. Although we had a native TCP transport for the highest fidelity connection when it came to putting the public Internet and, more importantly, enterprise networks between the client and the server it all went downhill fast.

It soon became apparent that Internet access inside the large organisations we were dealing with was hugely different to that of the much smaller company I was working at. In fact it took another 6 months to get the service live due to all the workarounds we had to put in place to make the service usable by the companies it was targeted at. I ended up writing a number of different HTTP transports to try and get a working out-of-the-box configuration for our end users as trying to get support from their IT departments was even harder work. On Netscape it used something based on the standard Java URLConnection class, whilst on Internet Explorer it used Microsoft’s J/Direct Java extension to leverage the underlying WinInet API.

This last nasty hack was done specifically to cater for those organisations that put up the most barriers between its users and the Internet, which often turned out to be some kind of proxy which relied on NTLM for authentication. Trying to rig something similar up at our company to develop against wasn’t easy or cheap either. IIRC in the end I managed to get IIS 4 (with MS Proxy Server?) to act as an NTLM proxy so that we had something to work with.

Back then the NTLM protocol was a propriety Microsoft technology with all the connotations that comes with that, i.e. tooling had to be licensed. Hence you didn’t have any off-the-shelf open source offerings to work with and so you had a classic case of vendor lock-in. Essentially the only technologies that could (reliably) work with an NTLM proxy were Microsoft’s own.

In the intervening years the need for both clients (though mostly just the web browser) and servers have required more and more access to the outside world, both for the development of the software itself, and it’s ultimate presence through a move from on-premises to cloud based hosting.

Additionally the NTLM protocol was reverse engineered and tools and libraries started to appear (e.g. Cntlm) that allowed you to work (to a degree) within this constraint. However this appears to have sprung from a need in the non-Microsoft community and so support is essentially the bare minimum to get you out of the building (i.e. manually presenting a username and password).

From a team collaboration point of view tools like Google Docs and GitHub wikis have become more common as we move away from format-heavy content, and Trello for a lightweight approach to managing a backlog. Messaging in the form of Skype, Google Hangouts and Slack also play a wider role as the number of people outside the corporate network, not only due to remote working, but also to bring strategic partners closer to the work itself.

The process of developing software, even for Microsoft’s own platform, has grown to rely heavily on instant internet access in recent years with the rise of The Package Manager. You can’t develop a non-trivial C# service without access to NuGet, or a JavaScript front-end without Node.js and NPM. Even setting up and configuring your developer workstation or Windows server requires access to Chocolatey unless you prefer haemorrhaging time locating, downloading and manually installing tools.

As the need for more machines to access the Internet grows, so the amount of friction in the process also grows as you bump your head against the world of corporate IT. Due to the complexity of their networks, and the various other tight controls they have in place, you’re never quite sure which barrier you’re finding yourself up against this time.

What makes the NTLM proxy issue particularly galling is that many of the tools don’t make it obvious that this scenario is not supported. Consequently you waste significant amounts of time trying to diagnose a problem with a product that will never work anyway. If you run out of patience you may switch tact long before you discover the footnote or blog post that points out the futility of your efforts.

This was brought home once again only recently when we had developed a nice little tool in Go to help with our deployments. The artefacts were stored in an S3 bucket and there is good support for S3 via the AWS Go SDK. After building and testing the tool we then proceeded to try and work our why it didn’t work on the corporate network. Many rabbit holes were investigated, such as double and triple checking the AWS secrets were set correctly, and being read correctly, etc. before we discovered an NTLM proxy was getting in the way. Although there was a Go library that could provide NTLM support we’d have to find a way to make the S3 library use the NTLM one. Even then it turned out not to work seamlessly with whatever ambient credentials the process was running as so pretty much became a non-starter.

We then investigated other options, such as the AWS CLI tools that could we then script, perhaps with PowerShell. More time wasted before again discovering that NTLM proxies are not supported by them. Finally we resorted to using the AWS Tools for PowerShell which we hoped (by virtue of them being built using Microsoft’s own technology) would do the trick. It didn’t work out of the box, but the Set-AWSProxy cmdlet was the magic we needed and it was easy find now we knew what question to ask.

Or so we thought. Once we built and tested the PowerShell based deployment script we proceeded to invoke it via the Jenkins agent and once again it hung and eventually failed. After all that effort the “service account” under which we were trying to perform the deployment did not have rights to access the internet via (yes, you guessed it) the NTLM proxy.

This need to ensure service accounts are correctly configured even for outbound only internet access is not a new problem, I’ve faced it a few times before. And yet every time it shows up it’s never the first thing it think of. Anyone who has ever had to configure Jenkins to talk to private Git repos will know that there are many other sources of problem aside from whether or not you can even access the internet.

Using a device like an authenticating proxy has that command-and-control air about it; it ensures that the workers only access what the company wants them to. The alternate approach which is gaining traction (albeit very slowly) is the notion of Trust & Verify. Instead of assuming the worst you grant more freedom by putting monitoring in place to ensure people don’t abuse their privilege. If security is a concern, and it almost certainly is a very serious one, then you can stick a transparent proxy in between to maintain that balance between allowing people to get work done whilst also still protecting the company from the riskier attack vectors.

The role of the organisation should be to make it easy for people to fall into The Pit of Success. Developers (testers, system administrators, etc.) in particular are different because they constantly bump into technical issues that the (probably somewhat larger) non-technical workforce (that that policies are normally targeted at) do not experience on anywhere near the same scale.

This is of course ground that I’ve covered before in my C Vu article Developer Freedom. But there I lambasted the disruption caused by overly-zealous content filtering, whereas this particular problem is more of a silent killer. At least a content filter is pretty clear on the matter when it denies access – you aren’t having it, end of. In contrast the NTLM authenticating proxy first hides in the ether waiting for you to discover its mere existence and then, when you think you’ve got it sussed, you feel the sucker punch as you unearth the footnote in the product documentation that tells you that your particular network configuration is not unsupported.

In retrospect I’d say that the NTLM proxy is one of the best examples of why having someone from infrastructure in your team is essential to the successful delivery of your product.

No comments:

Post a Comment