Saturday 27 November 2010

Exception Safe By Design

I’ve recently gone back to my simple database query tool to add a few essential usability features and like any legacy codebase there were a couple of surprises in store. Now, I’m not making excuses for myself, but most of the code is the best part of 10 years old and like any journeyman I’ve learnt a lot since that was written. Also the recent episodes of Matthew Wilson’s Quality Matters column in Overload (one of the ACCU journals) focused on exception handing and so these two examples became even more prominent.

Resource Management

One of the new features I added was to properly support non-query type statements, i.e. INSERT/DELETE (it was originally written to be a query tool but it’s been useful to run ad-hoc SQL as well). This involved me digging into the ODBC facade and eventually I came across this piece of code:-

CSQLCursor* CODBCSource::ExecQuery(...)

  CODBCCursor* pCursor = new CODBCCursor(*this);

  try 
  { 
    ExecQuery(pszQuery, *pCursor); 
  } 
  catch (const CODBCException&) 
  { 
    delete pCursor; 
    throw; 
  } 

  return pCursor;
}

Now clearly you’ll have to put aside your disgust of Hungarian Notation for a moment and focus on the real issue which is the shear verbosity of the code. It’s also wrong. On the one hand I’ve been a good boy and only caught what I can knowingly recover from (a specific CODBCException type exception) but I’ve also missed the point that any other exception type will cause a memory leak. Clearly I should have used catch(…) because the exception type is irrelevant in this case.

Only I shouldn’t, because the try/catch block is entirely unnecessary. Although now I’m firmly in the camp that believes RAII is probably the single most important idiom within C++, I obviously wasn’t then. I try really hard to write code that ensures there are no memory leaks and I use tools like the Debug CRT Heap in Visual C++ and BoundsChecker to point out my mistakes. And yet this piece of code highlights how hard it is to do that, especially in the face of exceptions; if you don’t have 100% code coverage in your testing it’s just too easy to miss something. What is all the more ridiculous though is that the correct solution is less work anyway:-

CSQLCursor* CODBCSource::ExecQuery(...)

  Core::UniquePtr<CODBCCursor> pCursor(new CODBCCur…);

  ExecQuery(pszQuery, *pCursor);

  return pCursor.detach();
}

Assuming I can make no changes to the method signature this ticks the box of being exception safe by design. The Core::UniquePtr smart pointer class is analogous to boost::scoped_ptr which ensures single ownership until it’s given away at the end. But there’s still a bad code smell which is that the method returns a bald pointer so the callee could still leak it. What we should return instead is the smart pointer object so that we explicitly transfer ownership to the callee:-

Core::SharedPtr<CSQLCursor> CODBCSource::ExecQuery(..)

  Core::SharedPtr<CODBCCursor> pCursor(new CODBCCur…);

  ExecQuery(pszQuery, *pCursor);

  return pCursor;
}

One unpleasant taste in the mouth is that because of the change in scope we cannot just return a Core::UniquePtr because that would require temporarily copying the object[*] so instead we use a smart pointer that allows shared ownership – Core::SharedPtr. Funnily enough this transfer of ownership from caller to callee is actually right up std::auto_ptr’s street and for the knowledgeable this would be preferable from a performance perspective. But sadly you end up worrying about the caller and whether they know the pitfalls of auto_ptr and so perhaps you go back to the safer option.

Side Effects

Another feature I wanted to added was to avoid requesting credentials when the connection doesn’t need them and this put me right into the client method responsible for opening a new database connection:-

void CAppCmds::Connect(. . .)

  try 
  { 
    OnDBDisconnect(); 
    . . . 
    App.m_oConnection.Open(strConnection); 
    . . . 
  } 
  catch(CSQLException& e) [#] 
  { 
    App.AlertMsg(TXT("%s"), e.m_strError); 
    OnDBDisconnect(); 
  }
}

This piece of code I found puzzling because I couldn’t obviously see why it called OnDBDisconnect() both at the start of the method and also in the exception handler as the connection will either open successfully or throw, in which case we had already cleaned up earlier. So I looked at what OnDBDisconnect() did to see why it might be necessary:-

void CAppCmds::OnDBDisconnect()

  if (App.m_oConnection.IsOpen()) 
    App.m_oConnection.Close(); 

  App.m_pCurrConn = NULL; 
  . . .
}

Nothing earth shattering there – just run-of-the-mill cleanup code. And so I went back to the Connect() method and looked more closely. And then I spotted it:-

  OnDBDisconnect(); 
  . . . 
  App.m_pCurrConn = App.m_apConConfigs[nConnection]; 
  . . . 
  App.m_oConnection.Open(strConnection);

There are side-effects that occur between the point when the previous connection is closed and when we attempt to open a new one. A quick scoot back through the version control repository[+] to see what changes have been made to this method and I see that it was checked in as a fix for another problem after failing to open a connection. Clearly my mindset was “fix the broken exception handling” instead of “fix the poorly written method”. The first rule of writing exception safe code is to perform all side-effects off to one side and then use exception safe mechanisms like std::swap()[^] to “commit” the changes.

So, for my example code above we need to introduce some temporaries to hold the unstable state, perform the Open and then persist the new state:-

  OnDBDisconnect(); 
  . . . 
  Core::SharedPtr<…> pConfig(App.m_apConConfigs[…]); 
  Core::SharedPtr<…> pConnection(new CODBCConnection); 
  . . . 
  pConnection.Open(strConnection); 
  . . . 
  App.m_oConnection.swap(pConnection); 
  App.m_pCurrConn.swap(pConfig);

Taking a cue from the preceding section about resource management I have used my Core::SharedPtr[$] smart pointer type as the connection will live on after this method returns. The catch handler remains but only to report the error message to the user, no cleanup now needs to be performed.

But we can do even better. If we take a closer look at what happens as a result of using swap() we see that the call to OnDBDisconnect() may actually be superfluous. In fact it may be better not to do it all. The call to swap() effectively means that the new connection will live on as App.m_oConnection and the old one will get destroyed at the end of the scope (the try/catch block). Now if we look at what OnDBDisconnect() does:-

void CAppCmds::OnDBDisconnect()

  App.m_oConnection.Close(); 
  App.m_pCurrConn = NULL;

  UpdateScriptsMenu(); 
  UpdateUI();
}

It closes the existing connection and resets m_pCurrConn (these both now happen automatically near the end of Connect()) and it resets the UI. If we look at the tail end of the Connect() method we see this:-

  UpdateScriptsMenu(); 
  UpdateUI();

What this all means is that if an exception is thrown then the user loses their current connection as well. If we perform all the work off to the side and don’t modify any state (i.e. don’t call OnDBDisconnect()) up front then the user at least gets to keep their current connection as a consolation prize. If the new connection does work then we just save the new state and let the old one disappear at the end of the try/catch block. Is that just being clever or an attempt to write the simplest code possible? A topic for another day…

[*] I need to go back and read up on the new std::unique_ptr type in C++0X as I’m sure this was designed for this kind of scenario – and also the std::vector<std::auto_ptr<T>> gotcha.

[#] Using an exception to signal failure when opening a database connection falls right into the “Using Exceptions For Flow Control” category in Matthew Wilsons Quality Matters column and would almost certainly fall into the real of “Not Exceptional” for Kernighan & Pike too. In this case I’m inclined to agree and will be adding a TryOpen() method at the next opportunity.

[+] Lone developers need Source Configuration Management (SCM/VCS) too. At least, if you want to perform any software archaeology to avoid regressions, that is.

[^] It would be great if std::swap() had a ‘nothrow’ guarantee as that would make it excellent for this idiom. But I’m not sure it is guaranteed by the standard - is it just a convention like not throwing exceptions from destructors? My Google skills found lots of chatter but no quote citing chapter & verse.

[$] Naturally you wouldn’t use the full type name everywhere but use a typedef instead. Given how prevalent the use of the SharedPtr type is (especially with types stored in containers) I tend to add a typedef right after a  new entity class declaration for the smart pointer wrapper, e.g. class Xxx { . . . }; typedef Core::SharedPtr<Xxx> XxxPtr;

Tuesday 23 November 2010

Reacquainting Myself With Sed & Awk

I’ve always felt somewhat uncomfortable using Visual Studio as my main text editor because there seems to be legions of programmers out there that swear by vi and emacs’ text processing prowess. For me (a C++/C# Windows server-side programmer) I find that day-to-day all I need is syntax highlighting, cursor key movement and basic search & replace; plus the basic code navigation aid “Go to definition”. But that’s pretty much it. So I posed a question on accu-general to find out if I’m some kind of programming Neanderthal or whether all these supposed fancy text processing aids are actually used day-to-day. My straw poll suggests not and so I don’t feel quite so pathetic now.

Still, I had one of those moments the other day when I needed to perform a really simple text processing task – I needed to add 6 tab characters to the end of every line in a large (10’s of MB) text file. This kind of thing (modifying[*] a file that large) is somewhat unusual for me. Instinctively I opened the file in my (already running) Visual Studio and tried to work out how I was going to do it. The two options I would normally use are Find & Replace and a macro (which I then run repeatedly by holding the ‘run macro’ key down :-). Given the size of the file I didn’t have all day for the macro to run so it had to be a replace using a regex. The one I came up with replaced the \n character with \t\t\t\t\t\t\n and then I waited for VS to do its thing…

Whilst waiting I thought to myself how much easier that probably would have been with Sed and/or Awk. I could even hear the sniggers of those Unix programmers around me. But I haven’t used either tool in 20 years and I don’t have a copy installed by default because I’m running Windows of course. I remember trying ports of the major Unix tools way back when I started programming professionally and it was painful because of the fundamental difference between who is responsible for handling wildcard expansion (the shell in Unix vs the program in DOS). Plus the memory constraints of DOS meant they weren’t altogether reliable[#]. Time has obviously marched on but I’m still jaded by the experience of the very distant past and I’ve also discovered various alternatives that have made up the shortfall in many respects.

And then I ran into another classic issue – trying to work out which SQL scripts had been saved in ASCII and which in Unicode (SQL Server Management Studio annoyingly uses Unicode by default). This was causing my colleague’s (Cygwin[^] based) grep not to match various files. Once again I remembered the old Unix ‘file’ tool that attempts to detect what format a file is and whist searching for a Windows port of that I came across the UnxUtils project on SourceForge. This looked to be a native port of the most commonly found Unix tools such as awk, sed, grep, egrep etc. It didn’t contain a version of ‘file’ but I got one of those from somewhere else. Anyway, I unzipped the UnxUtils package, added it to my PATH, and then waited for a moment to use them.

That moment came sooner than I expected. I needed to analyse our on disk trade repository and provide some idea of its size and daily growth. My instinct this time was somehow to use Excel as that is pretty much what the financial world uses for all its analysis and reporting. But we were running an old version that still has the 65K row limit and our repository was well over 100K files per day. So I Googled for an Awk based solution which (adapted for Windows) initially became:-

c:\> dir /s *.trade | gawk “{ sum += $3 } END { print sum }”

This was pretty close, but I needed to strip the chod out of the dir listing for the daily comparison. Also, recursively searching the repository whilst I was getting this right was going to be way too slow so I captured the output from dir once and applied a sprinkling of Sed to remove the rubbish:-

c:\> dir /s *.trade > filelist.txt
c:\> sed –n /^[0-9][0-9]/p < filelist.txt | gawk “{ sum += $3 } END { print sum }”

And that was it for the current repository size. The day-on-day growth just involved extra calls to Sort and Diff to trim out the common items - it was a textbook example of pipeline processing!

Of course I got stuck for ages not knowing that Sed has a limited regex capability or that the ‘END’ in the gawk script had to be in uppercase. But it inspired me to put the effort in to learn the tools properly and so I popped onto Amazon and found a second hand copy of Sed & Awk (2nd Edition) by Dougherty & Robbins for under £10 and also picked up the O’Reilly Sed & Awk Pocket Reference for a couple of quid to leave in my drawer at work.

In the last couple of months I’ve read the majority of the book so that I feel more equipped to notice when this pair of tools would be a better choice. One of the most useful features I discovered about Sed (that I don’t believe VS can do easily) is to apply a transformation to repeating sections of a file (which are specified by an ‘address range’, i.e. a pair of regex’s that denote the start and end of the region). I’ve also found it easier to use Awk to generate, say, SQL statements where I have the output from a query that then needs to be turned back into a bunch of stored procedure calls with the previous result set as the parameters. This is doable in VS with regex based search & replacing but it’s probably not the most efficient use of my time.

The only real problem I have is that the Sed regex syntax is a subset of the normal POSIX one, so some common features are not available. It’s also another regex syntax to learn (along with Visual Studio’s own warped variation).

By this point I suspect that I’m preaching to the converted and there is much nodding of heads. There has also probably been the quiet whispering of “the right tool for the job” and all that. So yes, I’m incredibly late to the party, but I’m most pleased with my re-acquaintance. I don’t expect to ever be writing multi-line scripts but even if it makes the one-liners easier it will have been ten quid well spent.

 

[*] I search and analyze much larger log files every day with BareTail, BareGrep and LogParser. It’s just that I don’t need to modify them or generate other files from them.

[#] When it came to searching my entire (400 MB) hard disk it was actually quicker to close Windows 3.1, reboot into Linux (0.91pl15 IIRC) and use find + grep, save the results and then reboot back into 16-bit Windows than it was to do the search directly in Windows (or via a DOS prompt) in the first place.

[^] I want my tools to behave naturally on Windows - I have no desire to simulate a Unix environment.