What's Your Threading Model?


I've found, time and time again, that many developers building on top of application servers and frameworks do not fully understand the threading model they are working in or its implications. In a larger development shop, this can be difficult to catch until late in the development process. Typically, the design of the component in question is sound (and goes through the review process), but the implementation architecture (the structuring of the code) does not consider these details. The problem is too close to the code, and typical unit testing does not simulate a multi-threaded environment. These problems tend to be discovered either through code review or at the start of system test (functional test also typically exercises only one thread of execution), both of which tend to happen later in the cycle, particularly on projects with tight schedules and resources. And which project doesn't fall into that category?

Once discovered, these problems are fixable, but the solution is typically of lower code quality than if the code had been written with multi-threading in mind in the first place. By lower code quality, I mean that the code tends to exhibit more defects and often lower performance. For example, a developer keeping multi-threading in mind would try to reduce the number of locks acquired, and the time the locks are held, by consolidating updates to state in a single location along a particular path of execution. A developer unaware of the cost of synchronization (or unaware at first that synchronization was needed at all) might spread the updates throughout the code. When the time comes to add locking, locks must be distributed throughout the code, typically resulting in inefficient acquiring and releasing of the same lock. In addition to being more error-prone, this code is also more difficult to maintain. When another developer takes over the code, it can be hard to surmise the concurrency strategy and the effects of a change.
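To make the contrast concrete, here is a minimal sketch in Java. The `OrderStats` class, its fields, and both method names are invented for illustration and do not come from any particular framework. The first method shows what retrofitted locking often looks like; the second applies the same updates in one critical section.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example: OrderStats and its fields are invented for illustration.
class OrderStats {
    private final ReentrantLock lock = new ReentrantLock();
    private long orderCount;
    private long totalCents;
    private long largestCents;

    // Retrofitted style: every update acquires and releases the same lock.
    // Other threads can also observe the state between the three steps.
    void recordScattered(long cents) {
        lock.lock();
        try { orderCount++; } finally { lock.unlock(); }

        lock.lock();
        try { totalCents += cents; } finally { lock.unlock(); }

        lock.lock();
        try { largestCents = Math.max(largestCents, cents); } finally { lock.unlock(); }
    }

    // Planned style: all updates for this path happen in one critical section.
    void recordConsolidated(long cents) {
        lock.lock();
        try {
            orderCount++;
            totalCents += cents;
            largestCents = Math.max(largestCents, cents);
        } finally {
            lock.unlock();
        }
    }

    // Returns {orderCount, totalCents, largestCents} as one consistent view.
    long[] snapshot() {
        lock.lock();
        try {
            return new long[] { orderCount, totalCents, largestCents };
        } finally {
            lock.unlock();
        }
    }
}
```

Both methods leave the same final totals, but the scattered version takes the lock three times per call and lets a concurrent `snapshot()` see a count that has been incremented without the matching total. In real code the three updates would usually sit in different methods, which is what makes the strategy hard to read after the fact.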

Rewriting the code may not be an option. There are many good reasons not to rewrite code even before the code has been released for the first time:

  • Time - Restructuring the code comes at the cost of some other activity, but essentially provides no new function. The performance and quality gain is hard to quantify at this stage: will it really run faster? How much? Will I really have fewer defects? How many?
  • Too much code - Enough said. Inertia has taken over.
  • Already somewhat tested - That code may have already undergone some portion of test. To rewrite the code is to potentially lose the gains made by fixing previous defects. This is only partially true. Usually the developer has gained additional insight into the inner workings of the code by debugging its behavior, and the rewrite can incorporate this knowledge. That holds only if the code is being rewritten by the same developer. From a test organization standpoint, however, a rewrite is a risky approach and invalidates previously passed test cases.

It's not as simple as assigning blame to the developer or arguing about developer skill. Many frameworks take pains to isolate the developer from multi-threading issues, and different APIs have different interaction characteristics. In J2EE, there are a myriad of component and interaction models, each with its own threading characteristics. The EJB programming model is a good example of a component model that appears to be non-threaded to the programmer. The Servlet programming model is the opposite, as there is typically a single Servlet object and its code is called in a reentrant fashion. In JMS, some objects are thread-safe while others, such as session objects, are not.
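As a rough illustration of what "called in a reentrant fashion" means for the programmer, the sketch below uses a plain Java class as a stand-in for a servlet-style component. `GreetingHandler` and its methods are hypothetical names, not part of the Servlet API. It assumes one instance is shared by every request thread.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a servlet-style component: one instance, many threads.
class GreetingHandler {
    // Shared by all calling threads, so it has to be thread-safe.
    private final AtomicLong requestsServed = new AtomicLong();

    // A per-request instance field would be a bug here: concurrent calls
    // would overwrite each other's value.
    // private String currentUser;

    String handle(String user) {
        // Per-request state lives in locals, which belong to the calling thread.
        String greeting = "Hello, " + user;
        long n = requestsServed.incrementAndGet();
        return greeting + " (#" + n + ")";
    }

    long requestsServed() {
        return requestsServed.get();
    }
}
```

The same class dropped into a model that gives each thread its own instance could use ordinary fields freely, which is why code moved between models tends to break quietly.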

There tend to be a handful of general approaches/patterns for isolating the user from the multi-threading issue:

  • Object pools: The framework creates a pool of object instances. The size of this pool may be scaled up or down as needed based on work load, memory, etc. When dispatching some unit of work (say, a request), a free object is chosen from the pool and used for a particular thread of execution. Since there is typically a 1-1 correspondence between the object and the thread, the programmer can treat all code in the object as single threaded.
  • Reentrant code: ... like the Servlet model. A single object is created during initialization of the framework and persists throughout the lifetime of the framework. Each unit of work is dispatched to a thread in the thread pool, and the reentrant code is called within the context of that thread. This requires the programmer to be fully aware of multi-threading. In the case of the Servlet model, the programmer must be aware of what is and what is not thread-safe (the ServletContext is not guaranteed to be thread-safe!).
  • Session based concurrency: Threading semantics are based on the session. If multiple threads can access the same session, then this degrades to the reentrant code model. However, some approaches stipulate that only one thread can access a session (or its equivalent) at a time, in which case the thread can treat access to the session as single-threaded. In effect, each thread has its own session. It may or may not be possible to provide such guarantees. Typically, the framework's means of obtaining a session is itself thread-safe.
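The object pool approach from the list above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: `InstancePool` and `LineFormatter` are hypothetical names, the pool has a fixed size, and a real container would add resizing, validation, and timeouts.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical minimal pool; real containers do considerably more.
class InstancePool<T> {
    private final BlockingQueue<T> free;

    InstancePool(int size, Supplier<T> factory) {
        free = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            free.add(factory.get());
        }
    }

    // Borrow an instance for one unit of work, then hand it back.
    <R> R dispatch(Function<T, R> work) throws InterruptedException {
        T instance = free.take();        // blocks until an instance is free
        try {
            return work.apply(instance); // only this thread uses 'instance' here
        } finally {
            free.add(instance);          // always succeeds: we just removed one
        }
    }
}

// A pooled component can keep plain, unsynchronized fields.
class LineFormatter {
    private final StringBuilder scratch = new StringBuilder(); // not thread-safe

    String format(String name, int qty) {
        scratch.setLength(0);
        return scratch.append(name).append(" x").append(qty).toString();
    }
}
```

A caller would write something like `pool.dispatch(f -> f.format("widget", 3))`. The single-threaded illusion holds only for state inside the pooled instance; anything the instance reaches outside itself (static fields, caches, shared services) is still shared between threads.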

In my experience, the problem with any of the approaches other than the reentrant code model is that they break down eventually. The session-based model requires cooperation between the client of the service and the service itself. At some point, it just might make more sense to share the session between multiple threads. The object pool approach lends itself to some nasty code. In such an environment, I've seen developers use full OOP techniques on the objects that live in the pool, which is ill-suited for the execution environment. Of course, at some point in both these cases, it's likely that the code will have to access some shared resource and the effects of accessing that resource will trickle into the code. Regardless, thorough knowledge of the threading model still tends to improve code quality.

While one could argue that including such detail in the design is an important part of documenting the code, I don't think it's very practical because of the high overhead of maintaining such details. This seems best handled in the day-to-day interaction between the architect/team lead and the developers as they plan out the how of what they'll implement.

An ethical consideration...

You wrote:

Time - Restructuring the code comes at the cost of some other activity, but essentially provides no function. The performance and quality gain is hard to quantify at this stage: will it really run faster? How much? Will I really have fewer defects? How many?

--

First, I agree that, in general, whatever the threading model, it simply pays to have thread-savvy developers working on multi-threaded apps, regardless of the platform and language being used. Period. Then again, I'm a C developer... *cough*... so what do I know? At the end of the day, the best way to prevent kludgy, poor-performing code is to have a team of developers who A) clearly understand the problem domain; B) are expert, or at least competent, with the tools that will be used to implement the solution; and C) know how to implement the solution using best practices. No amount of language abstraction and complexity encapsulation is going to compensate for the lack of any of those.

But more to the point of my comment. The time argument, in my mind, is a bit weak. Granted, you were only giving some examples of considerations a development team must make, but I think there's one big consideration that shouldn't be overlooked. That consideration is the ethical one. First, even if you designed an application to be thread-safe from the get-go, you'll probably see some degree of degradation in performance when compared to its single-threaded counterpart. Things like serialization of access to shared data and priority inversion may slow down execution of the app if there is a lot of lock contention or a lack of priority inheritance. Granted, you're talking more about the performance degradation caused by poorly structured code, hacks, and band-aids that are introduced after the fact, but ultimately, whatever it may be, what's the potential trade-off? Data integrity. And, depending on the application, that had better be reason enough to take the time to restructure the code to make it thread-safe. If you're writing a banking app or a program that launches missiles, or any mission-critical application for that matter, it's important that you get the locking right. Though performance considerations may be important (e.g. real-time guarantees), the decision shouldn't require quantifying the benefit of restructuring code based on how many defects you could potentially eliminate, but really only on the type of defects you'll be eliminating. That should be enough to act in such cases. Even if what you get is a wonderful kludge of spaghetti code, oh well, it's the ethical thing to do, right?

A Complex Decision in Practice

The initial post was trying to focus not so much on the correctness issue (it assumes that the developer does, and must, make the code correct for it to function in the multi-threaded environment), but on reworking code once it's correct. The benefits of the rework are assumed to be easier maintenance, better performance (maybe), and maybe more graceful evolution or better extensibility. Depending on the code path, performance may not be a concern. Fundamentally, it's a problem of inertia increasing as the development cycle progresses. The problem is that it's hard to discover poorly built multi-threaded code early enough in the process to correct it. That piece of code already has much invested in it: it works, maybe it's even been tested by a test team, or it appears in a release. The "it works" phenomenon is a powerful thing, and it's just hard to quantify the benefits of rework.

Probably the best way to deal with this issue is through well-defined componentization. You can swap in better-designed internals behind the component's interface, subject to the constraints of the thinking that went into the interface design. The challenge, though, is justifying why effort should be spent rewriting it versus doing something else that might be of higher value. Quantifying the gain in maintenance expense is extremely difficult.
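A small Java sketch of what that substitution might look like, using invented names (`HitCounter` and its two implementations are hypothetical): callers depend only on the interface, so the retrofitted internals can be replaced later without touching them.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical component interface; callers depend only on this contract.
interface HitCounter {
    void hit();
    long total();
}

// First implementation: correct, with coarse locking added after the fact.
class SynchronizedHitCounter implements HitCounter {
    private long count;

    public synchronized void hit() { count++; }

    public synchronized long total() { return count; }
}

// Later replacement: different internals behind the same interface.
class AdderHitCounter implements HitCounter {
    private final LongAdder count = new LongAdder();

    public void hit() { count.increment(); }

    public long total() { return count.sum(); }
}
```

The interface bounds what can change. If callers had been allowed to synchronize on the counter object themselves, for example, that locking behavior would be part of the contract and the replacement would not be a drop-in.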