Condition Variables are not Events

I just spent several days chasing down a bug that essentially amounted to someone assuming that a condition variable behaves just like an event.  I have seen this a number of times before, and it is frustrating fixing the same issues over and over again.

Basically, the broken code looked something like this:

mutex mtx;
condition_variable cv;
{
    unique_lock<mutex> lock(mtx);
    unsigned long wait = 1;
    RunOnOtherThread([&cv, &wait]() {
        /* do some work */
        cv.notify_one();
        wait = 0;
    });

    cv.wait(lock);
    do { this_thread::yield(); } while (wait);
}

RunOnOtherThread is a method that takes a function as a parameter and runs it on another thread. The current thread has some work it would like done, and having the other thread do it avoids some synchronization issues.
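RunOnOtherThread itself is not shown in the original code. For the purpose of these examples, you can imagine it as a thin wrapper that hands the callable off to another thread, something like the following stand-in (an assumption for illustration, not the author's implementation; like the rest of the snippets here, it assumes the usual standard headers such as <thread>, <mutex>, and <condition_variable>, and using namespace std):

template <typename Fn>
void RunOnOtherThread(Fn fn)
{
    // Run the callable on a new thread and let it run independently.
    thread(move(fn)).detach();
}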

Synchronization with Events

The person who wrote this code expects condition variables to work like events. An event synchronizes threads by indicating that a certain state has been reached. Here is the equivalent of the above code using an event:

Event evt;
unsigned long wait = 1;
RunOnOtherThread([&evt, &wait]() {
    /* do some work */
    evt.set();
    wait = 0;
});

evt.wait();
do { this_thread::yield(); } while (wait);

The thread that is calling RunOnOtherThread creates an Event object and passes it to the other thread through the lambda's capture. The event starts out in the reset state. Any thread that calls evt.wait() while the Event is reset blocks until the event's state becomes set by some other thread calling evt.set(). When a thread calls evt.set(), the event's state becomes set and stays that way until it is reset, either automatically when a waiter observes the state in evt.wait() or manually, depending on the type of event object.

The event-based example above works with any typical implementation of Event objects. In the obvious case, the code in the outer function runs first: it proceeds until it hits evt.wait(), then blocks until the lambda completes. If the lambda happens to be scheduled first (or runs concurrently), that is fine too: the state of evt will already be set, and the first thread will pass through evt.wait() without ever blocking.

The C++ standard library does not provide an implementation of Events. It does, however, provide mutexes and condition variables.  So, the programmer substituted a condition variable for an event.

The anonymous programmer added a custom spin loop at the end of the first thread's code to guarantee that the call to evt.set() has completed before evt goes out of scope and its destructor frees the object's memory.

Mutexes and Condition Variables

A mutex provides mutual exclusion for a region of code. After locking a mutex, a thread is guaranteed that no other thread can be executing code protected by the same mutex at the same time. If one thread tries to lock a mutex while another thread has the same mutex locked, the second thread blocks until the first thread unlocks it.

In the first example above, creating a unique_lock object initialized with mtx locks the mutex and holds the lock for as long as the unique_lock object is in scope. Wrapping the mutex in an object is more robust than calling explicit lock and unlock functions, because if someone comes along later and adds a return in the middle of the function, the mutex will still automatically be unlocked when lock goes out of scope.
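To see the difference, compare an explicit lock()/unlock() pair with the RAII version when an early return gets added later (a small illustration, not taken from the original code):

void explicit_locking(mutex& mtx, bool bailOut)
{
    mtx.lock();
    if (bailOut)
        return;        // bug: mtx is never unlocked on this path
    /* do some work */
    mtx.unlock();
}

void raii_locking(mutex& mtx, bool bailOut)
{
    unique_lock<mutex> lock(mtx);
    if (bailOut)
        return;        // fine: lock's destructor unlocks mtx
    /* do some work */
}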

Mutexes only provide mutual exclusion. They do not provide any guarantees about the order in which operations happen. For that we need condition variables. Condition variables are named for their use in implementing monitors. They are designed to help a thread block until a certain condition is true. The basic operations on a condition variable are wait and notify.

A thread waiting on a condition variable atomically unlocks a mutex and blocks until another thread signals the condition variable by calling notify, or until the thread just happens to unblock (there are a number of implementation-dependent reasons this may happen). Upon unblocking from the condition variable, the thread re-acquires the mutex. The unblocked thread is responsible for testing a condition to determine whether it should continue or block again.

Not checking the condition that the thread was waiting for is a common bug. The typical correct design pattern for using a condition variable looks something like:

mutex mtx;
unique_lock<mutex> lock(mtx);
condition_variable cv;

while (!my_condition)
{
    cv.wait(lock);
}

Any arbitrary expression could be substituted for my_condition. The above code segment will lock mtx, wait until my_condition is true, and then continue executing with mtx locked. To wake up a blocked thread, the thread that makes the condition true must call either notify_one or notify_all. If one or more threads are blocked on the condition variable, notify_one wakes up at least one of them, and notify_all wakes up all of them.
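The thread that makes the condition true does the mirror image of this, something like the following (a sketch; my_condition stands in for whatever shared state the waiter is testing):

{
    unique_lock<mutex> lock(mtx);
    my_condition = true;    // change the shared state while holding the mutex
    cv.notify_one();        // wake one waiting thread (notify_all would wake them all)
}                           // the lock is released here, letting the waiter re-acquire mtx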

Testing my_condition before blocking on the condition variable is critical, because a notification only wakes a thread that is actually blocked on the condition variable at the moment the notification arrives; notifications that no one is waiting for are simply lost. So, if the lambda in the first example runs to completion before the first thread calls cv.wait(), the first thread will block forever waiting for a notification that has already been sent.

It is also worth noting that essentially no implementation of condition variables guarantees that a thread will never wake up before being signalled (a spurious wakeup). A loop testing the condition is required in case the thread wakes up before some other thread signals that the condition is ready. So it is possible that the first thread in the example could wake up at any time while the lambda is still executing. The spin loop will catch it before it exits the scope, but it forces the first thread to busy-wait, taking CPU and memory access cycles away from other threads.
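As an aside, condition_variable::wait also has an overload that takes a predicate and performs this test-and-wait loop internally, so the waiting loop shown earlier could equivalently be written as:

cv.wait(lock, [&]() { return my_condition; });    // loops internally until the predicate is true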

No Spin

The spin loop at the end of the example code is an inelegant way to prevent the first thread's locals from being destroyed before the lambda completes. Leaving it out of the original version could cause the program to crash if the lambda is still inside cv.notify_one() when cv goes out of scope in the original thread. Here is a cleaner implementation:

    mutex mtx;
    condition_variable cv;
    bool otherDone = false;

    RunOnOtherThread([&cv, &mtx, &otherDone]() {
        /* do some work */
        unique_lock<mutex> lock(mtx);
        otherDone = true;
        cv.notify_one();
    });

    {
        unique_lock<mutex> lock(mtx);
        while (!otherDone)
        {
            cv.wait(lock);
        }
    }

In this implementation, the other thread will do its work and then set otherDone to true to indicate it is done. It then calls cv.notify_one to wake the first thread. Locking mtx guarantees that the change to otherDone will be seen by the first thread. If the lambda completes before the while loop tests otherDone, the first thread will complete without blocking. Otherwise, the first thread will block until the lambda calls cv.notify_one() and releases the lock on mtx. There is no need to worry about the first thread exiting before the lambda completes, because the first thread cannot unblock from cv.wait() until it acquires the mutex, which is only released when the lambda's unique_lock is destroyed at the end of the lambda.
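In effect, the mutex, the boolean flag, and the condition variable together implement exactly the Event semantics from the earlier example. If the pattern comes up repeatedly, it could be wrapped up once in a small class along these lines (a sketch of one possible manual-reset design, not a facility the standard library provides):

class Event
{
public:
    void set()
    {
        unique_lock<mutex> lock(mtx_);
        signaled_ = true;
        cv_.notify_all();       // wake every waiter; the state stays set
    }

    void wait()
    {
        unique_lock<mutex> lock(mtx_);
        while (!signaled_)      // guard against spurious wakeups
        {
            cv_.wait(lock);
        }
    }

private:
    mutex mtx_;
    condition_variable cv_;
    bool signaled_ = false;
};

Used this way, the waiting and notification logic, including the loop that guards against spurious wakeups, lives in one place instead of being repeated at every call site.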

Used properly, condition variables can simplify a lot of typical synchronization problems. A good operating systems textbook should have examples of producer/consumer and the other standard synchronization problems.
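For instance, the classic producer/consumer hand-off boils down to the same wait-in-a-loop pattern; a minimal unbounded blocking queue might look something like this (a sketch, not production code; assumes <queue> in addition to the other usual headers):

template <typename T>
class BlockingQueue
{
public:
    void push(T value)
    {
        unique_lock<mutex> lock(mtx_);
        items_.push(move(value));
        cv_.notify_one();          // wake a waiting consumer
    }

    T pop()
    {
        unique_lock<mutex> lock(mtx_);
        while (items_.empty())     // guard against spurious wakeups
        {
            cv_.wait(lock);
        }
        T value = move(items_.front());
        items_.pop();
        return value;
    }

private:
    mutex mtx_;
    condition_variable cv_;
    queue<T> items_;
};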

Of course, all this assumes that having one thread ask another to do something, and then block while waiting for the result, makes more sense than having the first thread simply do all the work itself. In this case, the original author was trying to avoid the complex synchronization issues that arise when lots of threads try to access the same data.

When high concurrency is not critical to meeting system requirements, the simplicity of doing everything in the same thread can save a lot of headaches. In other cases, more parallelism is required to meet performance goals. It all depends on the application.
