CAUBLESTONE INK

.net development and other geeky stuff

Exception woes and the dreaded clr20r3 error

Posted on February 13th, 2009


Well, we got bitten by this obscure bug too. As most of you who have hit it already know, there is not much information on the web that explains what the heck is going on. All we knew was that it killed our app and left us with no real information about why.

So hopefully my notes shed some extra light on the subject. Every case is a little different, but from my research the underlying problems are very similar.

My scenario

Windows service running on Windows Server 2003
Multi-threaded app (roughly 70 threads)
Lots of real-time messaging using Tibco EMS
NHibernate database layer
.NET Framework 2.0

The problem

With no apparent reason or warning, the Windows service would crash. No notifications went out, and our users were dead in the water.

What we tried:

  • First, we tried adding an AppDomain.UnhandledException handler. Bam – NO good.
  • Next, we tried adding the .NET legacy unhandled exception policy setting (the legacyUnhandledExceptionPolicy element) to our app config file. Bam – NO good. Not only that, but our service would not even start properly.
  • Next, we called Microsoft, opened a case, and got some tools for capturing memory dumps in case we could reproduce the failure in our dev environment. This might have worked, but we fixed the problem before we could find out.
  • Lastly, we reviewed a lot of our code looking for leaks, holes, or anything else that could cause a critical thread to fail and bring the service down. Honestly, we got lucky. We happened to find an area that did not look right, fixed the code, and Bam – no more exceptions.
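For reference, the first attempt in the list above looks roughly like the sketch below. The names CrashLogging, Install and Describe are made up for illustration, and writing to Console.Error is just a stand-in for whatever logging you use. Keep in mind that on .NET 2.0 and later this event is only a notification: the handler can record the exception, but it cannot stop the process from shutting down.

```csharp
using System;

public static class CrashLogging
{
    // Hypothetical helper: call once when the service starts.
    public static void Install()
    {
        AppDomain.CurrentDomain.UnhandledException +=
            new UnhandledExceptionEventHandler(OnUnhandledException);
    }

    // Turns the event payload into one loggable line. ExceptionObject is
    // typed as object, so it is cast defensively here.
    public static string Describe(object exceptionObject, bool isTerminating)
    {
        Exception ex = exceptionObject as Exception;
        string detail = ex != null ? ex.ToString() : Convert.ToString(exceptionObject);
        return "Unhandled exception (terminating=" + isTerminating + "): " + detail;
    }

    private static void OnUnhandledException(object sender, UnhandledExceptionEventArgs e)
    {
        // Notification only: the process still goes down after this returns.
        Console.Error.WriteLine(Describe(e.ExceptionObject, e.IsTerminating));
    }
}
```

Calling CrashLogging.Install() at start-up is enough to wire it in. It is still worth having as a last-chance logger, even though it did not save us here.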

What actually caused the problem

In as few lines as possible, here is what caused the problem. The method is a shortened and modified version that just showcases the issue. It is also the method that was handed to the thread to run once the thread was started.

What we had in our code

public void RunMe()
{
    try
    {
       List<SomeObject> data = new List<SomeObject>();

       data.Add(new SomeObject());
       data.Add(new SomeObject());
       data.Add(new SomeObject());

       foreach (SomeObject o in data)
       {
           DoSomethingToObject(o);
           data.Remove(o);   // changes the list while it is being enumerated
       }
    }
    catch (ThreadAbortException ex)   // only thread aborts are caught here
    {
        Log(ex);
    }
}

So basically we had a bit of logic that kept its data in a List collection. We enumerated over it and removed each item from the collection once it was processed. We also had a try..catch block to catch a ThreadAbortException if one occurred.

Why it blew up

Well, looking at the code you’d think that should not blow up. However, if you stop and think about it for a second you will see what happened. If you guessed an InvalidOperationException… here’s your Kewpie doll. :) We were throwing an exception because we were removing from the collection while we were enumerating it. It does not matter if you have a lock or anything else; this is just a no-no. The List enumerator notices that the collection has changed and throws on the next step of the loop. Now if we had used a for loop instead of a foreach and iterated in reverse, that would have been fine. However, the rules around IEnumerable don’t like what we were doing.

So we threw the InvalidOperationException, and since we were on our own thread and our try..catch handler was not catching generic exceptions, it became an unhandled thread exception, which then bubbles up and bubbles up and bubbles up… you get the point. Even though we had try..catch handlers at the service layers, it does not matter: those handlers run on other threads, and on .NET 2.0 an unhandled exception on any thread shuts the whole process down. In our case we never even saw the AppDomain.UnhandledException handler fire, and even when that event does fire it is only a notification and cannot keep the process alive.
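If you want to see this for yourself, here is a small sketch that reproduces both behaviors without any threads involved. EnumerationDemo and its two methods are names I made up for the demo, and plain strings stand in for SomeObject.

```csharp
using System;
using System.Collections.Generic;

public static class EnumerationDemo
{
    // Removes items inside a foreach, the same way the original RunMe did.
    // Returns true if the enumerator threw InvalidOperationException.
    public static bool RemoveInsideForeachThrows()
    {
        List<string> data = new List<string>();
        data.Add("a");
        data.Add("b");
        data.Add("c");

        try
        {
            foreach (string item in data)
            {
                data.Remove(item); // the list is now different from what the enumerator saw
            }
        }
        catch (InvalidOperationException)
        {
            // Thrown when the loop asks the enumerator for the next item.
            return true;
        }
        return false;
    }

    // Same removal with a reverse for loop. Returns how many items are left.
    public static int RemoveInReverseFor()
    {
        List<string> data = new List<string>();
        data.Add("a");
        data.Add("b");
        data.Add("c");

        for (int i = data.Count - 1; i >= 0; i--)
        {
            data.RemoveAt(i); // indexes below i are not affected
        }
        return data.Count;
    }
}
```

The first method reports that the foreach version throws, and the second leaves an empty list with no exception.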
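The catch mismatch is easy to show in isolation. In the sketch below (NarrowCatchDemo, ThreadBody and WhatEscapes are illustrative names, not our production code) the method body is called directly on the current thread, so the escaping exception can be observed safely. On a real worker thread there would be no outer catch, and the process would go down.

```csharp
using System;
using System.Threading;

public static class NarrowCatchDemo
{
    // Stand-in for the original thread method: only thread aborts are caught.
    public static void ThreadBody()
    {
        try
        {
            throw new InvalidOperationException("Collection was modified");
        }
        catch (ThreadAbortException)
        {
            // Never reached for an InvalidOperationException.
        }
    }

    // Calls ThreadBody on the current thread and reports what got past its catch.
    public static string WhatEscapes()
    {
        try
        {
            ThreadBody();
            return "nothing";
        }
        catch (Exception ex)
        {
            return ex.GetType().Name;
        }
    }
}
```

WhatEscapes() returns the name of the exception type that slipped through, which here is InvalidOperationException.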

How we fixed it

Well, obviously we had to fix the foreach loop. However, the biggest thing we did to fix the problem was to actually catch the exceptions and handle them. Once we handled the exception, it still caused our thread to shut down (until we fixed the underlying issue), but our service stayed up and there were no more clr20r3 errors.

From everything I have found, the crux of the clr20r3 error is exception handling. Make sure your thread methods have a generic exception handler that logs the exception to a log file, the event log, a database, or wherever else you need, so you can actually get the answer you need and go fix the underlying problem.
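One way to make that rule hard to forget is to never hand a raw method to a thread. The sketch below is an assumption on my part, not something we shipped: SafeThread, Wrap and Start are hypothetical names, and the log callback is whatever logging you already have.

```csharp
using System;
using System.Threading;

public static class SafeThread
{
    // Hypothetical helper: wraps a thread method so no exception can leave it.
    public static ThreadStart Wrap(ThreadStart body, Action<Exception> log)
    {
        return delegate
        {
            try
            {
                body();
            }
            catch (Exception ex)
            {
                // Last line of defense: record the failure and let the thread end.
                try { log(ex); }
                catch { /* a failing logger must not take the process down either */ }
            }
        };
    }

    // Starts a background thread that runs the wrapped body.
    public static Thread Start(ThreadStart body, Action<Exception> log)
    {
        Thread t = new Thread(Wrap(body, log));
        t.IsBackground = true;
        t.Start();
        return t;
    }
}
```

With something like this, a thread that throws still dies, but the exception ends up in your log instead of in a clr20r3 entry, and the rest of the service keeps running.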

The final solution

In case you want to see the code that fixed the problem, here it is:

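public void RunMe()
{
  try
  {
       List<SomeObject> data = new List<SomeObject>();

       data.Add(new SomeObject());
       data.Add(new SomeObject());
       data.Add(new SomeObject());

       // Walk the list backwards so that removing an item never shifts
       // the indexes we have not visited yet. Note i >= 0: index 0 counts too.
       for (int i = data.Count - 1; i >= 0; i--)
       {
           DoSomethingToObject(data[i]);
           data.RemoveAt(i);
       }
  }
  catch (Exception ex)   // generic handler so nothing escapes the thread
  {
     LogToEventLog(ex);
  }
}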
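One side effect of the reverse loop is that the items are processed last to first. If the order matters to you, a possible alternative (not what we used) is to enumerate a copy of the list and remove from the original. SnapshotLoop and ProcessAndRemove are hypothetical names for this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class SnapshotLoop
{
    // Processes items in their original order and removes each one afterwards.
    // Returns the items in the order they were processed.
    public static List<T> ProcessAndRemove<T>(List<T> data, Action<T> process)
    {
        List<T> processed = new List<T>();

        // ToArray copies the items, so the foreach enumerates the copy
        // and the original list is free to change.
        foreach (T item in data.ToArray())
        {
            process(item);
            data.Remove(item);
            processed.Add(item);
        }
        return processed;
    }
}
```

The copy costs some extra memory, and Remove has to search the list for each item, so for big lists the reverse for loop is the cheaper option.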