Tuesday 20 December 2011

Diagnostic & Support User Interfaces

When people talk about user interfaces they are invariably talking about the face of a product or system that the end-users see and/or interact with. But within a complex system there are other user interfaces too that are not seen or interacted with by end-users at all; instead it is fellow developers and, more importantly, support staff that get to see this less salubrious side.

In a small, well-established team of developers there will naturally be a lot of cross-fertilisation, meaning that they will be well versed in significant portions of the system. In contrast, a larger team will have its skills spread across different disciplines, and so from a support perspective each developer's knowledge of the system as a whole starts to approach that of a support team - far less. In my experience support teams do not get the luxury of devoting themselves to learning one system; instead they invariably have to support a whole bunch of in-house systems written by different teams, using different technologies and having different ideas about what is required to make a product or system “supportable”.

So what kinds of interfaces am I referring to? Well, the most traditional are status monitoring pages and custom admin tools; the kind of utilities built specifically for the purpose of administering and supporting the system. Due to the time constraints imposed and the lack of direct business value they are usually not given any real TLC but are just thrown together, presumably under the misguided assumption that they will never really be used by anyone except their authors or their close colleagues. But these scripts and command line tools are not the only faces of the system you’ll see when doing support duties; before that you’ll probably be faced with hunting through the custom format log files, which may or may not be easily consumable. Stretching the analogy somewhat further you could encompass configuration files and the database schema as interfaces of sorts. They’re not graphical in themselves, but careful thought about their structure can ensure that off-the-shelf tools stand a better chance of making interaction a productive experience.

Anyway, here is my list of things to think about. I’ve not done any “production” UI programming for a few years and it’s been even longer since I read About Face (the 1st edition no less) so I’ll no doubt contradict some modern wisdom along the way. Your target audience may not be novice computer users, but they do have the clock against them and they won’t thank you for making an already difficult situation any harder.

Command-Line Tools

Server-side developers aren’t renowned for their user interface programming. After all, that’s why you have UX designers and developers whose major skills lie in the area of UI programming. But that doesn’t mean we have to be sloppy about the interfaces for the scripts and tools we do create. I’m not suggesting for a moment that we spend hours gold-plating our utilities, but I am proposing that you adhere to some of the common conventions that have been established over the years:-

Provide a help switch (-?, -h, --help) so that the user can see what the options are. Documentation, if you can even find it for a utility, can easily get out of step with the tool usage and so you need to be able to see which switch it is that you’re apparently missing, or that is no longer relevant for the task at hand. Even if you’re au fait with the tool you might just want to double-check its usage before going ahead and wiping out last year's data.

Provide a version switch (-v, --version) so the user can check its compatibility, or if it’s become out of date. Many tools are installed on the application servers or alongside client binaries, but sometimes those tools are just “passed around” and then become available on some desktop share without you realising. You can even dump the version out on every invocation as the first line; this can be particularly useful if you’re capturing stdout when running it as a scheduled task.
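
For example, a run of the (hypothetical) CustomerAdmin tool used in the examples below might always announce itself first, so that even the captured output from a scheduled task tells you which build was involved; the version number here is obviously made up:-

C:\> CustomerAdmin delete --customer-id 1234
CustomerAdmin v1.2.3.4
Deleting customer 1234...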

Provide long switch names (--switch-name) by default and use them in documentation so it is obvious what the example command is doing. Short switch names are for advanced users and so should be considered the optional case. Positional parameters may save the user some typing but when you use concepts like integer based surrogate keys it’s not exactly obvious what an example may be suggesting:-

C:\> DC 1234 2001-02-03 1

compared to:-

C:\> CustomerAdmin delete --customer-id 1234
  --date 2001-02-03 --force

Use a consistent switch format. I’ve had to use utilities where one developer has made the switch names case-sensitive and another case-insensitive. Sometimes they use ‘/’ as the switch marker and other times it’s a ‘-’. Often the case-sensitivity issue is down to lazy programming - using == or Equals() instead of the correct method. If you’re on Windows try and support both styles as even Microsoft’s developers aren’t consistent. Better yet download a 3rd party library that lets you use either style as there are plenty to choose from[#].
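
As a sketch of what “the correct method” might look like in C++ (the helper name here is purely illustrative, and a decent 3rd party parser will do this for you anyway):-

#include <cctype>
#include <string>

// Compare a command line argument to a switch name ignoring case, which is
// exactly what a naive == or Equals() comparison gets wrong.
bool SwitchEquals(const std::string& argument, const std::string& name)
{
    if (argument.length() != name.length())
        return false;

    for (std::string::size_type i = 0; i != argument.length(); ++i)
    {
        if (std::tolower(static_cast<unsigned char>(argument[i]))
         != std::tolower(static_cast<unsigned char>(name[i])))
            return false;
    }

    return true;
}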

Return a meaningful result code. The convention (on Windows) is that zero means success and non-zero means failure or partial success. Some tools do provide more detailed result codes, but fundamentally you want to assume that anyone trying to automate your tool on Windows is going to write something along these lines as a first-order approximation:-

CustomerAdmin.exe --do-something-funky
if errorlevel 1 exit /b 1

Sadly, 0 is also the default result code that most frameworks will return for an unhandled exception which really makes the whole process rather messy and means that you need to get a Big Outer Try Block in place in main() as part of your bootstrap code.
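A minimal sketch of that bootstrap code in C++ might look something like this, where RunTool() is a hypothetical stand-in for the tool’s real logic (here a stub that simply fails to show the effect):-

#include <iostream>
#include <stdexcept>

// The tool proper; this hypothetical stub just throws to simulate a failure.
int RunTool(int /*argc*/, char* /*argv*/[])
{
    throw std::runtime_error("something funky went wrong");
}

int main(int argc, char* argv[])
{
    try
    {
        return RunTool(argc, argv);     // 0 on success, non-zero on failure
    }
    catch (const std::exception& e)
    {
        std::cerr << "ERROR: unhandled exception: " << e.what() << std::endl;
    }
    catch (...)
    {
        std::cerr << "ERROR: unhandled non-standard exception" << std::endl;
    }

    return 1;                           // ensure "if errorlevel 1" fires
}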

Support a read-only/test/what-if mode. Most of my support tools have grown out of test harnesses and other development tools and so I’ve naturally added switches that allow you to run them against production in a benign mode so that you can see what would happen. PowerShell has taken this idea forward formally with its common -WhatIf switch. When you’re not 100% sure of what the expected action is or how many items it will act upon this can be a life saver.
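
As a rough sketch of the idea (the type and function names here are purely illustrative), the benign mode simply reports what would have been done instead of doing it:-

#include <iostream>
#include <string>
#include <vector>

struct Customer
{
    int         id;
    std::string name;
};

// The real, destructive operation.
void DeleteCustomer(const Customer& customer)
{
    std::cout << "Deleted customer " << customer.id << std::endl;
}

void PurgeCustomers(const std::vector<Customer>& customers, bool whatIf)
{
    for (std::vector<Customer>::const_iterator it = customers.begin();
         it != customers.end(); ++it)
    {
        if (whatIf)
            std::cout << "What if: would delete customer " << it->id << std::endl;
        else
            DeleteCustomer(*it);
    }
}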

Most of the same comments apply to scripts, but they have the added advantage that they are deployed in source code form and so you have the ability to see the logic first hand. Of course when you’re in the thick of it you need to be able to trust the tool to do what it says it will.

Logging

When faced with a production issue, log files will often become the first port of call after you’ve established which process it is that has failed. But log files[+] often serve two different masters, and the latter one can get forgotten once the development work has finished. During development it is probably the developer who spends the most time poring over the output, and therefore they have a habit of dumping every last bit of detail out. After the feature is complete, though, the main consumer becomes the support team. What they don’t want to see is reams of output in which the proverbial wood is lost for the trees.

Make sure your log messages make sense. This may seem obvious but spelling mistakes, whilst cute in a time of stability, are a distraction when you’re up to your eyes in it. Log messages should form a narrative that leads you up to the moment of interest, and that narrative should be written with the reader in mind - most likely a support engineer - someone who doesn’t know the code intimately. By all means be succinct, but ensure you’re not being ambiguous either.

Use the console output for high-level commentary, warnings and errors only. If extra output is required it should be enabled by an extra --verbose switch (which can easily be added temporarily under automation) or by having a companion log file with more detail. I tend to favour the second approach so that the detail is always there if needed but so that it’s not in your face the majority of the time. Remember we’re talking about in-house stuff here and so you have the luxury of controlling the environment in ways a 3rd party can often only dream of.

Keep to a low number of severity levels. If you have a large number of severity levels developers will not be able to log messages consistently. I recently saw a tweet that suggested you use only the following levels - FYI, WTF & OMG. If you put yourself in the support person’s shoes that’s actually a pretty good scale. Once the issue is triaged and the development team are involved the TRACE or DEBUG level output suddenly becomes of more use, but until then keep it simple.

Write the messages as text. A simple line-based text format may seem very old-fashioned, but there is a plethora of tools out there designed to slice-and-dice text based output any way you want, e.g. grep, awk & sed. Plus you have modern GUI based Tail programs that support highlighting to really make those important messages shine, e.g. BareTail & LogExpert. Layer on top something like LogParser, which supports a SQL-like syntax and many formats out of the box, and you’re in good shape to triage quickly. Also don’t underestimate the importance of NotePad as a tool of last resort.
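
For example, with the kind of one-line-per-message format described below, even the humblest tools can answer useful questions (the log file name here is hypothetical):-

C:\> grep " ERR " CustomerAdmin.log
C:\> grep --count " ERR " CustomerAdmin.log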

Make the log message format simple. If you’re writing a separate log file include at least the date & time, and preferably the process ID (PID). If the tool is multi-threaded then the thread ID (TID) will be invaluable as well, along with the highest precision[*] of time (e.g. milliseconds or better) to spot potential race conditions. I’ve found the following format to be very easy to scan by eye and parse by tool as the left-hand side is of fixed width (the PID & TID are padded appropriately):-

<ISO date & time> <PID> <TID> <severity> <message...>

2001-01-01 01:02:03.456 1234 3456 INF Starting...
2001-01-01 01:02:04.123 1234 5678 ERR Stuff broke...
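
A sketch of writing that prefix on Windows might look like the following; WriteLogLine() is a hypothetical helper and a real logging class would do far more, but the fixed-width left-hand side is the point of interest:-

#include <cstdio>
#include <windows.h>

void WriteLogLine(std::FILE* log, const char* severity, const char* message)
{
    SYSTEMTIME st;
    GetLocalTime(&st);

    // Date & time (with milliseconds), then the padded PID & TID, severity and message.
    std::fprintf(log, "%04u-%02u-%02u %02u:%02u:%02u.%03u %5lu %5lu %s %s\n",
                 st.wYear, st.wMonth, st.wDay,
                 st.wHour, st.wMinute, st.wSecond, st.wMilliseconds,
                 static_cast<unsigned long>(GetCurrentProcessId()),
                 static_cast<unsigned long>(GetCurrentThreadId()),
                 severity, message);
}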

Enclose string values in some form of quotes or braces to make empty strings more visible. As I posted very recently, empty strings are a form of Null Object and so they have a habit of being silently propagated around before finally causing odd behaviour. Personally I like to enclose strings in single quotes because I often have to use those values in SQL queries and so then it’s already quoted for me. As an example I once spent far longer than necessary investigating an issue because the log message said this:-

Valuing trade ...

instead of:-

Valuing trade ''...

Include all relevant context, preferably automatically. Messages that just tell you something bad happened are useless without the context in which the operation is taking place. There is nearly always some higher-level context, such as a request ID or customer name, that you can associate with the PID or TID to help you narrow down the data set. Debugging a complex or lengthy process just to discover the set of inputs, only to find that the data is dodgy, is not a good use of time and costs the business money, especially if the relevant variables are in scope at the point of writing the error message![^]
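
For example (the details here are hypothetical), compare:-

ERR Failed to correlate pairs

with:-

ERR Failed to correlate pairs 'GBP/USD' and 'EUR/USD' for trade '1234'

The second at least tells the support engineer where to start looking before anyone reaches for a debugger.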

Don’t spam the log, write a summary. If you have a process that deals with particularly dirty data, or you have a service that is temperamental, try not to log every single indiscretion as a warning or error. Instead try and batch them up, perhaps by item count or time window, and write a single message that gives a summary of the issue. At a high level you’ll just be interested in seeing if the problem is getting worse, so using a tolerance is even better: once you have an idea of what “normal operation” is you can save the scary messages for when they’re really needed.
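
A bare-bones sketch of the batching idea in C++ (the class name and the 1% tolerance are purely illustrative):-

#include <cstddef>
#include <iostream>

class RejectionSummary
{
public:
    RejectionSummary() : m_total(0), m_rejected(0) {}

    void OnRow(bool rejected)
    {
        ++m_total;
        if (rejected)
            ++m_rejected;
    }

    // Called once per batch or time window instead of warning on every row.
    void Write(std::ostream& log) const
    {
        const double rate = (m_total != 0) ? (100.0 * m_rejected) / m_total : 0.0;
        const char* severity = (rate > 1.0) ? "WRN" : "INF";   // tolerance for "normal operation"

        log << severity << " Rejected " << m_rejected << " of " << m_total
            << " rows (" << rate << "%)" << std::endl;
    }

private:
    std::size_t m_total;
    std::size_t m_rejected;
};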

One of the reviews I do during system testing is to look at the log output. I try and put myself in the position of the support engineer and read the log messages to see if they make sense to me (I often only have a vague appreciation of the process in question). The thing I’m most looking out for is log spam which is usually solved by downgrading messages from, say, INFO to TRACE so that the main output is all about the big picture. It should contain virtually no warnings or errors to avoid support staff becoming desensitised by noise and when they are present they should hopefully be backed by more detailed diagnostics somewhere else.

 

[*] Don’t be fooled into thinking that because a datetime value can be formatted with a milliseconds part that it is actually that accurate. IIRC 16-bit Windows used to have a resolution of 50 ms whilst the NT lineage was better at 10 ms. If you wanted better accuracy than that you needed to resort to the Performance Counters API.

[#] My own free C++ Core library has a reasonably decent parser should you want to tap into the 3rd party option.

[+] I have mixed sentiments about the use of log files. On the one hand I feel they should largely be unnecessary because we should be able to anticipate, test for and handle most types of failure gracefully. But I’m also aware that in-house systems rarely have that level of robustness as a requirement and so we’re often forced into using them as a defensive measure against inadequate test environments and the other in-house systems we have to integrate with.

[^] I clearly remember spending a few hours at the weekend debugging a financial process that was failing with the message “Failed to correlate pairs”. What annoyed me most was that the developer had the two failing currency names in variables in scope but had not bothered to include them in the message! I fixed the error message first thing Monday morning...

Sunday 11 December 2011

New SQL Tools - SS-Unit, SS-Cop & sql2doxygen

At my current client I’ve spent far more time working with SQL (and specifically SQL Server) than ever before; but that is probably quite apparent from the bias of blog posts over the last year or so. With any new venture you soon find yourself drawing on your past experiences, and in this case I’ve been looking for the SQL world equivalents of the tools I’d normally use with C# & C++. But I’ve found them hard to come by. It’s possible I’m looking in the wrong places, but it’s not as if there is a stream of vendor adverts trying to distract me either, and that gives me the feeling it’s not worthy of a prolonged hunt.

There is another reason why I may have not searched overly hard, and that’s because I find it more fun to build something myself - it’s a great learning exercise. Mind you, that’s not the right answer for my client, where we have both money and far more important things to be doing than building tooling - assuming, that is, we can find the right alternatives[*]. What I’m releasing below are my clean-room[+] versions of some of the tools we knocked up to get ourselves going.

Naturally, as these begin to surpass what we lashed up, they become 3rd party candidates themselves to take over the same duties, and my client wins too because it no longer has to consider supporting the code internally. I’ve always made the source code available for my stuff so that others have a chance to tinker or fix bugs, and these are no exception. Of course, given that they are T-SQL based (or PowerShell in the case of sql2doxygen), it would be pretty hard for me to keep it secret anyway!

SS-Unit - SQL Server Unit

This is a pure T-SQL based unit testing framework. You write your unit tests, in T-SQL, in the familiar xUnit style and then execute them by invoking the SS-Unit test runner (a stored procedure). The framework has support for both the test and fixture level SetUp and TearDown helpers which are more prevalent in SQL tests because of the static data dependencies you often need to satisfy. It also distinguishes between being run interactively, in something like SQL Server Management Studio (SSMS), and being run in batch mode with, say, SQLCMD so that it can be integrated into your Continuous Integration system for automated SQL unit testing.

The .zip file package, which includes a trivial example database and the unit tests for the framework itself, is available on my web site here on the SQL section page. There is also an online copy of the manual if you just want to have a nosy.

SS-Cop - SQL Server Schema Cop

Earlier this year I posted about “DbCop”, a StyleCop/FxCop-like tool that a team-mate had put together for use on the project. That generated quite a bit of interest and a few people asked if there was any chance of it being released. As I point out in a footnote[+], that isn’t really an option, so instead here is my take on the idea. Clearly FxCop does some pretty deep analysis and I’m not suggesting this is going to provide anything of that magnitude. But there are a number of checks we can do to ensure style conventions are being adhered to and that common mistakes, like forgetting a primary key, are avoided. I have a few other new ideas too, such as trying to detect when a foreign key might be missing.

The .zip package is available from the same SQL page as SS-Unit, and there is also an online copy of the manual. Unsurprisingly, the unit tests for SS-Cop are written using SS-Unit and so you could consider it a more realistic example of how to use SS-Unit - SS-Cop was written using a Test-Driven Development (TDD) approach.

sql2doxygen - SQL Doxygen Input Filter

Funnily enough the chap who wrote DbCop was the same person who introduced me to Doxygen - a tool for generating API style documentation from source code. What attracted me was the fact that it could generate useful documentation even when no effort had been made by the developer! Better still was that with only a small change in your comment style Doxygen could handle so much more - no horrendous tags or mark up like you get with JavaDoc or the standard C# offering. There are still a bunch of special tags if you need them, but they don’t detract from the comments, which is what you need when you’re actually trying to read them to understand the source code.

Anyway, Doxygen supports many programming languages out of the box, but sadly not SQL. However it does provide you with a mechanism for supporting any language as long as you can transform the source files into something Doxygen understands; these are called input filters and are configured with the INPUT_FILTER setting. This is the command line of a process or batch file to execute for each source file and in the case of sql2doxygen it’s a PowerShell script that turns T-SQL code into C.
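
As an illustration, the relevant Doxyfile fragment might look something like the following (the script name and exact invocation are assumptions on my part - the sql2doxygen manual has the definitive settings):-

FILE_PATTERNS = *.sql
INPUT_FILTER  = "powershell.exe -File sql2doxygen.ps1"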

The sql2doxygen .zip package contains the Doxygen filter script and a manual describing what T-SQL styles the script understands. It is of course available from the same page I’ve mentioned twice already above. Both SS-Unit and SS-Cop are documented with Doxygen using the sql2doxygen filter and so examples of the output are available online here and here.

 

[*] These tools are not deployed into production, and so if we picked, say, an open source project that went stale and became unusable, or bought a tool from a lesser-known vendor that went bust, there would be no direct impact on the business. There is of course an indirect impact, because we use the tools to help us deliver faster and with better quality, and that would be compromised.

[+] Although the stuff we knocked up at my client’s is not the kind of thing I could imagine they would ever try and turn into a product (IT is not their business for starters) the code is clearly their property. I also wouldn’t know where to start trying to get permission to release it. Then there is the whole Fred Brooks approach which is to just throw it away anyway and learn from your mistakes instead. And this is the approach I have decided to take here.