Condition Variables are not Events

I just spent several days chasing down a bug that essentially amounted to someone assuming that a condition variable behaves just like an event. I have seen this a number of times before, and it is frustrating to fix the same issue over and over again.

Basically, the broken code looked something like this:

mutex mtx;
condition_variable cv;
{
    unique_lock<mutex> lock(mtx);
    unsigned long wait = 1;
    RunOnOtherThread([&cv, &wait]() {
        /* do some work */
        cv.notify_one();
        wait = 0;
    });

    cv.wait(lock);
    do { this_thread::yield(); } while (wait);
}

RunOnOtherThread is a method that takes a function as a parameter and runs it on another thread. The current thread has some work it would like done, and having the other thread do it avoids some synchronization issues.

Synchronization with Events

The person who wrote this code expected condition variables to work like events. An event synchronizes threads by indicating that a certain state has been reached. Here is the equivalent of the above code using an event:

Event evt;
unsigned long wait = 1;
RunOnOtherThread([&evt, &wait]() {
    /* do some work */
    evt.set();
    wait = 0;
});

evt.wait();
do { this_thread::yield(); } while (wait);

The thread that calls RunOnOtherThread creates an Event object and passes it to the other thread through the lambda's capture. The event starts out in a reset state. Any thread that calls evt.wait() while the Event is reset blocks until the event's state becomes set by some other thread calling evt.set(). When a thread calls evt.set(), the event's state becomes set and stays that way until it is reset (either automatically when a waiter passes through evt.wait(), or manually, depending on the type of event object).

The event-based example above works with a typical implementation of Event objects. The obvious case is if the code in the outer function runs first, it will run until it hits evt.wait(), then block until the lambda completes. If the lambda should be scheduled to run first (or concurrently), that’s fine — the state of evt will be set and the first thread will pass through evt.wait() without ever blocking.

The C++ standard library does not provide an implementation of Events. It does, however, provide mutexes and condition variables.  So, the programmer substituted a condition variable for an event.

The anonymous programmer added a custom spin lock at the end of the first thread to guarantee that the call to evt.set() has completed before the first thread's function returns, destroying evt and freeing its memory.

Mutexes and Condition Variables

A mutex provides mutual exclusion for a region of code. After locking a mutex, a thread is guaranteed that no other thread can hold the same mutex at the same time. If one thread tries to lock a mutex while another thread holds it, the second thread blocks until the first thread unlocks the mutex.

In the first example above, creating a unique_lock object initialized with mtx locks the mutex and holds the lock for as long as the unique_lock object is in scope. Wrapping the lock in an object is more robust than calling explicit lock and unlock functions: if someone comes along later and adds a return in the middle of the function, the mutex still gets unlocked automatically when lock goes out of scope.
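As an illustration (the function and variable names here are hypothetical, not from the original code), an early return cannot leak the lock when a scoped lock object is used:

```cpp
#include <mutex>

std::mutex mtx;
int shared_count = 0;

// Increment a shared counter only if it is below a limit. Every
// return path unlocks mtx automatically when the lock_guard is
// destroyed; there is no explicit unlock call to forget.
bool increment_if_below(int limit) {
    std::lock_guard<std::mutex> lock(mtx);
    if (shared_count >= limit)
        return false;   // mtx is unlocked here...
    ++shared_count;
    return true;        // ...and here as well
}
```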

Mutexes only provide mutual exclusion. They do not provide any guarantees about the order in which operations happen. For that we need condition variables. Condition variables are named for their use in implementing monitors. They are designed to help a thread block until a certain condition is true. The basic operations on a condition variable are wait and notify.

A thread waiting on a condition variable atomically unlocks a mutex and blocks until another thread signals the condition variable by calling notify, or until the thread just happens to unblock (there are a number of implementation-dependent reasons this may happen). Upon unblocking from the condition variable, the thread re-acquires the mutex. The awakened thread is responsible for testing the condition to determine whether it should continue or block again.

Not checking the condition that the thread was waiting for is a common bug. The typical correct design pattern for using a condition variable looks something like:

mutex mtx;
condition_variable cv;
unique_lock<mutex> lock(mtx);

while (!my_condition)
{
    cv.wait(lock);
}

Any arbitrary expression could be substituted for my_condition. The above code segment will lock mtx, wait until my_condition is true, and then continue executing with mtx locked. To wake up a blocked thread, the thread that makes the condition true must call either notify_one or notify_all. If one or more threads are blocked on the condition variable, notify_one wakes up at least one of them, and notify_all wakes up all of them.

Testing my_condition before blocking on the condition variable is critical, because a notification only wakes a thread that is blocked on the condition variable at the moment the notification arrives. So, if the lambda in the first example runs completely before the first thread calls cv.wait(), the first thread will block forever.

It is also worth noting that virtually no implementation of condition variables guarantees that a thread will wake only when signalled; spurious wakeups are allowed. A loop testing the condition is required in case the thread wakes up before some other thread signals that the condition is ready. So, the first thread in the example could wake up at any time while the lambda is still executing. The spin lock will catch it before it exits, but it forces the first thread to busy-wait, taking CPU and memory-access cycles away from other threads.

No Spin

The spin lock at the end of the example code is an inelegant way to keep the first thread's stack frame alive until the lambda completes. Leaving it out could cause the program to crash if the lambda is still inside cv.notify_one() when cv goes out of scope in the first thread. Here is a cleaner implementation:

    mutex mtx;
    condition_variable cv;
    bool otherDone = false;

    RunOnOtherThread([&cv, &mtx, &otherDone]() {
        /* do some work */
        unique_lock<mutex> lock(mtx);
        otherDone = true;
        cv.notify_one();
    });

    {
        unique_lock<mutex> lock(mtx);
        while (!otherDone)
        {
            cv.wait(lock);
        }
    }

In this implementation, the other thread does its work, then sets otherDone to true to indicate it is done. It then calls cv.notify_one to wake the first thread. Locking mtx guarantees that the change to otherDone will be seen by the first thread. If the lambda completes before the while loop tests otherDone, the first thread will complete without blocking. Otherwise, the first thread will block until the lambda calls cv.notify_one() and releases the lock on mtx. There is no need to worry about the first thread exiting before the lambda completes, because the first thread cannot unblock from cv.wait() until it acquires the mutex, which is only released when the lambda exits.

Used properly, condition variables can simplify a lot of typical synchronization problems. A good operating systems textbook should have examples of producer/consumer and the other standard synchronization problems.

Of course, all this assumes that one thread calling another to do something and blocking waiting for the result makes more sense than having the first thread simply do all the work. In this case, the original author was trying to avoid complex synchronization issues that arise when lots of threads try to access the same data.

When high concurrency is not critical to meeting system requirements, the simplicity of doing everything in the same thread can save a lot of headaches. In other cases, more parallelism is required to meet performance goals. It all depends on the application.

Rails 3 Unit Testing on Hudson

I recently set up a Hudson continuous integration server for our main development project at Escent. Part of that project is a Ruby on Rails web app using Rails 3. I had trouble getting the existing documentation to work for me, possibly because it was written for Rails 2. I ended up using a gem written by Nathan Humbert to generate the code coverage information. I have included a brief description of what I did here in case it is useful to anyone else.

Installing Dependencies

Hudson needs both the Hudson Ruby plugin and the Hudson Ruby Metrics plugin. To add the plugins from the Hudson dashboard, click “Manage Hudson”, then “Manage Plugins”. From the Plugin Manager page, click the “Available” tab, and select the check boxes for the plugins you want. You may also want to select the plugin for your SCM software. Subversion is included by default, but we are using Mercurial, so I also checked the “Hudson Mercurial Plugin” box.

Click the “Install” button at the bottom of the page after you have selected all the plugins you want, and return to the Hudson Dashboard.

We will be using the “rails_code_qa” gem to measure code coverage. To make sure it is in your project, add the following to the project’s Gemfile.

group :development, :test do
       gem "rails_code_qa"
end

Configuring the Build

Starting from the Hudson dashboard, click “New Job” to create a new job for your Rails 3 project. Name the job in the “Job name” box, check the “Build a free-style software project” option, and click OK to continue.

The first section is pretty straightforward. You can optionally add a description for your project, configure how many builds to keep, and so on. I set our configuration to discard builds after seven days and accepted all the other default options, then skipped the “Advanced Project Options” section.

In the “Source Code Management” section configure this project to access your SCM.

In the “Build Triggers” section you can set the conditions for starting a build.  I checked “Poll SCM” and set the schedule to:

*/1 * * * *

so that Hudson will check the SCM for updates every minute. Once per minute is probably overkill, but for testing purposes it made it easier to see whether Hudson was working. I’ll probably back it off to three or five minutes at some point to cut back on unnecessary polling.

The “Build” section is where things start to get interesting. Each time Hudson starts a build of your code, it automatically pulls a source tree from the SCM and runs the build commands from the top-level directory of the tree. The “Build” section lets you specify that sequence of commands in a number of different forms.

First, click “Add build step” and select “Execute shell”.  In the “command” box that appears type:

bundle install

to prepare your build environment. Click “Add build step” again, but this time select “Invoke Rake”. I accepted the default Rake version and entered the following in the task box:

db:create
db:migrate
test
rcqa

You could just as well run these rake tasks as additional shell commands.

In the “Post-build Actions” section, check the “Publish Rails stats report” and “Publish Rcov report” boxes. The “Rcov report directory” points to the HTML coverage report from rcov. Unfortunately, rails_code_qa does not have an option to report coverage for all the unit tests together. It splits results into coverage/units and coverage/functionals directories. For now I am using coverage/units.

Modifying rails_code_qa to put all the results into one directory is trivial, but I still have to figure out how to get the modified version included in the project. When I figure out the magic rails/rake is doing to run things, I’ll update my Hudson config.