Thursday 20 June 2013

Don’t Use ORDER BY as a Progress Indicator

I’ve had the misfortune to be involved with investigating a few performance problems with existing SQL queries that just suddenly went AWOL for no apparent reason. In one case it was the same query again. On the first occurrence bad statistics looked to be the blame as an import process had just finished dumping a whole bucket load of data in and so a manual invocation to update the statistics was instigated whenever the import process had run. It shouldn’t have been necessary of course, but that’s just one aspect of the Black Art that is database maintenance.

As an aside I saw similar problems at a different organisation and each time it happened the developer’s immediate reaction was to question why Auto Update Statistics was disabled. It was “the policy”, apparently. The “story” goes that the company lost money because of a borked query that failed due to “bad statistics”. As a consequence no one was allowed to enable Auto Update Statistics without special dispensation. Of course this all happened “years ago” on an earlier version of the RDBMS product. Even more amusing was that a statistics update maintenance job had been added by the DBAs but the timing was such that it ran after most of the data had been deleted by the archiving process!

Anyway, during the investigation a second time I noticed that the query ended with an ORDER BY clause. I remembered from the first time around that removing the sort was another way to avoid the query performing badly. Back then though, what with it being production and following on from a large import, it seemed more fitting to add a task to refresh the statistics to avoid similar problems elsewhere. This time around the sort still seemed to be causing the problem and removing it needed to be more seriously considered.

The main reason I thought it needed serious consideration was because I couldn’t understand why the data was being sorted in the first place. The processing that took place on the 100,000+ results was completely order independent. Luckily the original developer was still around and a quick question revealed that by sorting the results the log output would be in alphabetical order and therefore it would be easier to determine how far along the processing was. A quick grep of the SQL codebase showed that there were a number of queries that had ORDER BY clauses for which I was sure there was absolutely no business need for them. In fact, I think I could count on one hand the cases where the sort was actually a necessity.

As a rule of thumb sorting in the database is a bad idea. You are very likely to have far less database instances than clients and so putting additional load on the database to sort the data is only adding to what is probably already an overloaded beast. Sorting on the client is invariably going to be cheaper and it’s not like you need to write an efficient sort algorithm yourself these days either. I don’t disagree that sorting the data to provide an easy way of seeing progress is altogether a bad idea, it’s just where you do the sorting that matters most. A comment in the client code wouldn’t go a miss either as its not exactly obvious.

This particular piece of advice (not sorting server-side) also caused somewhat of a stir when this was posted: “7 Things Developers Should Know About SQL Server”. The comments (and subsequent updates) are as ever quite revealing about how literally some people take advice like this without thinking.

No comments:

Post a Comment