Tuesday 8 December 2015

Poor Performance of log4net Context Properties

Back in September last year I wrote a post about a simple, low-latency web service we had to put together in a short timeframe (see “The Surgical Team”). During this project we hit a serious performance snag with log4net caused by using its context properties collection in the log message.

The Early Warning Indicator

The web service we were building had to perform a simple lookup on a multi-GB data set in less than 10 ms. We were going to use .Net and ASP.Net and so given its garbage collected environment we sought to verify first of all that garbage collections were not going to be a problem. Hence we got the CI pipeline in place and wrapped up a call to a NOP web API handler in a performance test to measure the basic solution infrastructure.

After a few false starts with the test itself (and the test framework) we found that each call was only taking a couple of ms [1] with an occasionally longer one as we hit a 2nd generation garbage collection (GC) which “stops the world” (at least in .Net 4). Even though the expensive GC meant we were occasionally more than an order of magnitude outside our SLA, the business were happy to go with it given the time constraints [2].

The Klaxon Goes Off

All of a sudden the core part of the build is succeeding but the performance tests are failing. We initially put it down to a blip, perhaps in the infrastructure, and ignore it for now. Then it trips again and again and so we decide to check the last few commits to see what’s happening as this now feels like it might be our code.

One of the developers had recently added log4net into the mix for diagnostic logging and to report SLA violations via the Windows event log, but that had been a few commits before things started going south. The commit that seemed to have pushed us over the edge was the addition of the HTTP request correlation ID to the log message. This was done via the context property bag in log4net, e.g.

%date %-5level %thread %property{correlationId} ...

This didn’t seem quite right at first though as we’d used the same property bag elsewhere, e.g. in the event log message. But of course we quickly realised the event log message only happened when we were already outside the SLA.

We reverted the change and the performance tests were green once more. At this point we didn’t know what it was about using properties that was causing it, or even if it really was that so we looked for another way as the information was important for support and monitoring (see “Causality – Relating Distributed Diagnostic Contexts”).

(This also got noticed and formally reported by someone else a few months later, and is now tracked in the log4net JIRA under LOGNET-421 and LOG4NET-429.)

Pattern Converters to the Rescue

Fortunately the property bag isn’t the only way to insert custom content into a log4net message (without just concatenating it into the message which was our fall-back), it also supports custom fields, called “pattern converters”.

As you can see from the documentation you specify your own field names and use them as placeholders too, e.g.

%date %-5level %thread %correlationId ...

Unlike the built-in property bag you have to do a little bit more work, such as creating some (thread-local) storage for the property value that you can fish out later when you need to invoke log4net to write the message [3].

public class CorrelationIdConverter : PatternConverter
{
  protected override void Convert(TextWriter writer,
                                  object state)
  {
    // The ID is stored in a thread-local property.
    writer.Write(Correlation.Id);
  }
}

You’ll also need to tell log4net about the field name and the class that should be invoked to format the value [4].

<converter>
  <name value="correlationId" />
  <type value="MyLib.CorrelationIdConverter, MyLib"/>
</converter>

We watched the build machine performance test results closely when the change was pushed but, just as we had hoped, the effect wasn’t noticeable.

 

[1] I am in no way suggesting that “a couple of ms” for an empty web API call could be considered decent performance. Quite frankly I was amazed when we discovered how much time and memory the ASP.Net pipeline itself used.

[2] We did suggest ways to mitigate this, such as fronting the service with a load balancer / service that could do a “best of three” as the service was read-only.

[3] Technically speaking it was stored in the ASP.Net HttpContext, if it exists, and if not we fall back to using traditional thread-local storage.

[4] We were actually using programmatic configuration of log4net there, but on another project we had to use the traditional log4net.config file and sadly it starts to add a bit of noise.

No comments:

Post a Comment