问题

We have found that we have several spots in our code where concurrent reads of data protected by a mutex are rather common, while writes are rare. Our measurements seem to say that using a simple mutex seriously hinders the performance of the code reading that data. So what we would need is a multiple-read/single-write mutex. I know that this can be built atop of simpler primitives, but before I try myself at this, I\'d rather ask for existing knowledge:

What is an approved way to build a multiple-read/single-write lock out of simpler synchronization primitives?

I do have an idea how to make it, but I\'d rather have answers unbiased by what I (probably wrongly) came up with. (Note: What I expect is an explanation how to do it, probably in pseudo code, not a full-fledged implementation. I can certainly write the code myself.)

Caveats:

This needs to have reasonable performance. (What I have in mind would require two lock/unlock operations per access. Now that might not be good enough, but needing many of them instead seems unreasonable.)
Commonly, reads are more numerous, but writes are more important and performance-sensitive than reads. Readers must not starve writers.
We are stuck on a rather old embedded platform (proprietary variant of VxWorks 5.5), with a rather old compiler (GCC 4.1.2), and boost 1.52 – except for most of boost\'s parts relying on POSIX, because POSIX isn\'t fully implemented on that platform. The locking primitives available basically are several kind of semaphores (binary, counting etc.), on top of which we have already created mutexes, conditions variables, and monitors.
This is IA32, single-core.

回答1:

It seems like you only have mutex and condition_variable as synchronization primitives. therefore, I write a reader-writer lock here, which starves readers. it uses one mutex, two conditional_variable and three integer.

readers - readers in the cv readerQ plus the reading reader
writers - writers in cv writerQ plus the writing writer
active_writers - the writer currently writing. can only be 1 or 0.

It starve readers in this way. If there are several writers want to write, readers will never get the chance to read until all writers finish writing. This is because later readers need to check writers variable. At the same time, the active_writers variable will guarantee that only one writer could write at a time.

class RWLock {
public:
    RWLock()
    : shared()
    , readerQ(), writerQ()
    , active_readers(0), waiting_writers(0), active_writers(0)
    {}

    void ReadLock() {
        std::unique_lock<std::mutex> lk(shared);
        while( waiting_writers != 0 )
            readerQ.wait(lk);
        ++active_readers;
        lk.unlock();
    }

    void ReadUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --active_readers;
        lk.unlock();
        writerQ.notify_one();
    }

    void WriteLock() {
        std::unique_lock<std::mutex> lk(shared);
        ++waiting_writers;
        while( active_readers != 0 || active_writers != 0 )
            writerQ.wait(lk);
        ++active_writers;
        lk.unlock();
    }

    void WriteUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --waiting_writers;
        --active_writers;
        if(waiting_writers > 0)
            writerQ.notify_one();
        else
            readerQ.notify_all();
        lk.unlock();
    }

private:
    std::mutex              shared;
    std::condition_variable readerQ;
    std::condition_variable writerQ;
    int                     active_readers;
    int                     waiting_writers;
    int                     active_writers;
};

回答2:

At first glance I thought I recognized this answer as the same algorithm that Alexander Terekhov introduced. But after studying it I believe that it is flawed. It is possible for two writers to simultaneously wait on m_exclusive_cond. When one of those writers wakes and obtains the exclusive lock, it will set exclusive_waiting_blocked = false on unlock, thus setting the mutex into an inconsistent state. After that, the mutex is likely hosed.

N2406, which first proposed std::shared_mutex contains a partial implementation, which is repeated below with updated syntax.

class shared_mutex
{
    mutex    mut_;
    condition_variable gate1_;
    condition_variable gate2_;
    unsigned state_;

    static const unsigned write_entered_ = 1U << (sizeof(unsigned)*CHAR_BIT - 1);
    static const unsigned n_readers_ = ~write_entered_;

public:

    shared_mutex() : state_(0) {}

// Exclusive ownership

    void lock();
    bool try_lock();
    void unlock();

// Shared ownership

    void lock_shared();
    bool try_lock_shared();
    void unlock_shared();
};

// Exclusive ownership

void
shared_mutex::lock()
{
    unique_lock<mutex> lk(mut_);
    while (state_ & write_entered_)
        gate1_.wait(lk);
    state_ |= write_entered_;
    while (state_ & n_readers_)
        gate2_.wait(lk);
}

bool
shared_mutex::try_lock()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    if (lk.owns_lock() && state_ == 0)
    {
        state_ = write_entered_;
        return true;
    }
    return false;
}

void
shared_mutex::unlock()
{
    {
    lock_guard<mutex> _(mut_);
    state_ = 0;
    }
    gate1_.notify_all();
}

// Shared ownership

void
shared_mutex::lock_shared()
{
    unique_lock<mutex> lk(mut_);
    while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_)
        gate1_.wait(lk);
    unsigned num_readers = (state_ & n_readers_) + 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
}

bool
shared_mutex::try_lock_shared()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    unsigned num_readers = state_ & n_readers_;
    if (lk.owns_lock() && !(state_ & write_entered_) && num_readers != n_readers_)
    {
        ++num_readers;
        state_ &= ~n_readers_;
        state_ |= num_readers;
        return true;
    }
    return false;
}

void
shared_mutex::unlock_shared()
{
    lock_guard<mutex> _(mut_);
    unsigned num_readers = (state_ & n_readers_) - 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
    if (state_ & write_entered_)
    {
        if (num_readers == 0)
            gate2_.notify_one();
    }
    else
    {
        if (num_readers == n_readers_ - 1)
            gate1_.notify_one();
    }
}

The algorithm is derived from an old newsgroup posting of Alexander Terekhov. It starves neither readers nor writers.

There are two "gates", gate1_ and gate2_. Readers and writers have to pass gate1_, and can get blocked in trying to do so. Once a reader gets past gate1_, it has read-locked the mutex. Readers can get past gate1_ as long as there are not a maximum number of readers with ownership, and as long as a writer has not gotten past gate1_.

Only one writer at a time can get past gate1_. And a writer can get past gate1_ even if readers have ownership. But once past gate1_, a writer still does not have ownership. It must first get past gate2_. A writer can not get past gate2_ until all readers with ownership have relinquished it. Recall that new readers can't get past gate1_ while a writer is waiting at gate2_. And neither can a new writer get past gate1_ while a writer is waiting at gate2_.

The characteristic that both readers and writers are blocked at gate1_ with (nearly) identical requirements imposed to get past it, is what makes this algorithm fair to both readers and writers, starving neither.

The mutex "state" is intentionally kept in a single word so as to suggest that the partial use of atomics (as an optimization) for certain state changes is a possibility (i.e. for an uncontended "fast path"). However that optimization is not demonstrated here. One example would be if a writer thread could atomically change state_ from 0 to write_entered then he obtains the lock without having to block or even lock/unlock mut_. And unlock() could be implemented with an atomic store. Etc. These optimizations are not shown herein because they are much harder to implement correctly than this simple description makes it sound.

回答3:

Concurrent reads of data protected by a mutex are rather common, while writes are rare

That sounds like an ideal scenario for User-space RCU:

URCU is similar to its Linux-kernel counterpart, providing a replacement for reader-writer locking, among other uses. This similarity continues with readers not synchronizing directly with RCU updaters, thus making RCU read-side code paths exceedingly fast, while furthermore permitting RCU readers to make useful forward progress even when running concurrently with RCU updaters—and vice versa.

回答4:

There's some good tricks you can do to help.

First, good performance. VxWorks is notable for its very good context switch times. Whatever the locking solution you use it will likely involve semaphores. I wouldn't be afraid of using semaphores (plural) for this, they're pretty well optimsed in VxWorks, and the fast context switch times help mimimise the degradation in performance from assessing many semaphore states, etc.

Also I would forget using POSIX semaphores, which are simply going to be layered on top of VxWork's own semaphores. VxWorks provices native counting, binary and mutex semaphores; using the one that suits makes it all a bit faster. The binary ones can be quite useful sometimes; posted to many times, never exceed the value of 1.

Second, writes being more important than reads. When I've had this kind of requirement in VxWorks and have been using a semaphore(s) to control access, I've used task priority to indicate which task is more important and should get first access to the resource. This works quite well; literally everything in VxWorks is a task (well, thread) like any other, including all the device drivers, etc.

VxWorks also resolves priority inversions (the kind of thing that Linus Torvalds hates). So if you implement your locking with a semaphore(s), you can rely on the OS scheduler to chivvy up lower priority readers if they're blocking a higher priority writer. It can lead to much simpler code, and you're getting the most of the OS too.

So a potential solution is to have a single VxWorks counting semaphore protecting the resource, initialised to a value equal to the number of readers. Each time a reader wants to read, it takes the semaphore (reducing the count by 1. Each time a read is done it posts the semaphore, increasing the count by 1. Each time the writer wants to write it takes the semaphore n (n = number of readers) times, and posts it n times when done. Finally make the writer task of higher priority than any of the readers, and rely on the OS fast context switch time and priority inversion.

Remember that you're programming on a hard-realtime OS, not Linux. Taking / posting a native VxWorks semaphore doesn't involve the same amount of runtime as a similar act on Linux, though even Linux is pretty good these days (I'm using PREEMPT_RT nowadays). The VxWorks scheduler and all the device drivers can be relied upon to behave. You can even make your writer task the highest priority in the whole system if you wish, higher even than all the device drivers!

To help things along, also consider what it is that each of your threads are doing. VxWorks allows you to indicate that a task is/isn't using the FPU. If you're using native VxWorks TaskSpawn routines instead of pthread_create then you get an opportunity to specify this. What it means is that if your thread/task isn't doing any floating point maths, and you've said as such in your call to TaskSpawn, the context switch times will be even faster because the scheduler won't bother to preserve / restore the FPU state.

This stands a reasonable chance of being the best solution on the platform you're developing on. It's playing to the OS's strengths (fast semaphores, fast context switch times) without introducing a load of extra code to recreate an alternate (and possibly more elegant) solution commonly found on other platforms.

Third, stuck with old GCC and old Boost. Basically I can't help there other than low value suggestions about phoning up WindRiver and discussing buying an upgrade. Personally speaking when I've been programming for VxWorks I've used VxWork's native API rather than POSIX. Ok, so the code hasn't be very portable, but it has ended up being fast; POSIX is merely layer on top of the native API anyway and that will always slow things down.

That said, POSIX counting and mutex semaphores are very similar to VxWork's native counting and mutex semaphores. That probably means that the POSIX layering isn't very thick.

General Notes About Programming for VxWorks

Debugging I always sought to use the development tools (Tornado) available for Solaris. This is by far the best multi-threaded debugging environment I've ever come across. Why? It allows you to start up multiple debug sessions, one for each thread/task in the system. You end up with a debug window per thread, and you are individually and independently debugging each one. Step over a blocking operation, that debug window gets blocked. Move mouse focus to another debugging window, step over the operation that will release the block and watch the first window complete its step.

You end up with a lot of debug windows, but it's by far the best way to debug multi-threaded stuff. It made it veeeeery easy to write really quite complex stuff and see problems. You can easily explore the different dynamic interactions in your application because you had simple and all powerful control over what each thread is doing at any time.

Ironically the Windows version of Tornado didn't let you do this; one miserable single debug windows per system, just like any other boring old IDE such as Visual Studio, etc. I've never seen even modern IDEs come anywhere close to being as good as Tornado on Solaris for multi-threaded debugging.

HardDrives If your readers and writers are using files on disk, consider that VxWorks 5.5 is pretty old. Things like NCQ aren't going to be supported. In this case my proposed solution (outlined above) might be better done with a single mutex semaphore to stop multiple readers tripping over each other in their struggle to read different parts of the disk. It depends on what exactly your readers are doing, but if they're reading contiguous data from a file this would avoid thrashing the read/write head to and fro across the disk surface (very slow).

In my case I was using this trick to shape traffic across a network interface; each task was sending a different sort of data, and the task priority reflected the priority of the data on the network. It was very elegant, no message was ever fragmented, but the important messages got the lions share of the available bandwidth.

回答5:

As always the best solution will depend on details. A read-write spin lock may be what you're looking for, but other approaches such as read-copy-update as suggested above might be a solution - though on an old embedded platform the extra memory used might be an issue. With rare writes I often arrange the work using a tasking system such that the writes can only occur when there are no reads from that data structure, but this is algorithm dependent.

回答6:

One algorithm for this based on semaphores and mutexes is described in Concurrent Control with Readers and Writers; P.J. Courtois, F. Heymans, and D.L. Parnas; MBLE Research Laboratory; Brussels, Belgium.

回答7:

This is a simplified answer based on my Boost headers (I would call Boost an approved way). It only requires Condition Variables and Mutexes. I rewrote it using Windows primitives because I find them descriptive and very simple, but view this as Pseudocode.

This is a very simple solution, which does not support things like mutex upgrading, or try_lock() operations. I can add those if you want. I also took out some frills like disabling interrupts that aren't strictly necessary.

Also, it's worth checking out boost\thread\pthread\shared_mutex.hpp (this being based on that). It's human-readable.

class SharedMutex {
  CRITICAL_SECTION m_state_mutex;
  CONDITION_VARIABLE m_shared_cond;
  CONDITION_VARIABLE m_exclusive_cond;

  size_t shared_count;
  bool exclusive;

  // This causes write blocks to prevent further read blocks
  bool exclusive_wait_blocked;

  SharedMutex() : shared_count(0), exclusive(false)
  {
    InitializeConditionVariable (m_shared_cond);
    InitializeConditionVariable (m_exclusive_cond);
    InitializeCriticalSection (m_state_mutex);
  }

  ~SharedMutex() 
  {
    DeleteCriticalSection (&m_state_mutex);
    DeleteConditionVariable (&m_exclusive_cond);
    DeleteConditionVariable (&m_shared_cond);
  }

  // Write lock
  void lock(void)
  {
    EnterCriticalSection (&m_state_mutex);
    while (shared_count > 0 || exclusive)
    {
      exclusive_waiting_blocked = true;
      SleepConditionVariableCS (&m_exclusive_cond, &m_state_mutex, INFINITE)
    }
    // This thread now 'owns' the mutex
    exclusive = true;

    LeaveCriticalSection (&m_state_mutex);
  }

  void unlock(void)
  {
    EnterCriticalSection (&m_state_mutex);
    exclusive = false;
    exclusive_waiting_blocked = false;
    LeaveCriticalSection (&m_state_mutex);
    WakeConditionVariable (&m_exclusive_cond);
    WakeAllConditionVariable (&m_shared_cond);
  }

  // Read lock
  void lock_shared(void)
  {
    EnterCriticalSection (&m_state_mutex);
    while (exclusive || exclusive_waiting_blocked)
    {
      SleepConditionVariableCS (&m_shared_cond, m_state_mutex, INFINITE);
    }
    ++shared_count;
    LeaveCriticalSection (&m_state_mutex);
  }

  void unlock_shared(void)
  {
    EnterCriticalSection (&m_state_mutex);
    --shared_count;

    if (shared_count == 0)
    {
      exclusive_waiting_blocked = false;
      LeaveCriticalSection (&m_state_mutex);
      WakeConditionVariable (&m_exclusive_cond);
      WakeAllConditionVariable (&m_shared_cond);
    }
    else
    {
      LeaveCriticalSection (&m_state_mutex);
    }
  }
};

Behavior

Okay, there is some confusion about the behavior of this algorithm, so here is how it works.

During a Write Lock - Both readers and writers are blocked.

At the end of a Write Lock - Reader threads and one writer thread will race to see which one starts.

During a Read Lock - Writers are blocked. Readers are also blocked if and only if a Writer is blocked.

At the release of the final Read Lock - Reader threads and one writer thread will race to see which one starts.

This could cause readers to starve writers if the processor frequently context switches over to a m_shared_cond thread before an m_exclusive_cond during notification, but I suspect that issue is theoretical and not practical since it's Boost's algorithm.

回答8:

Now that Microsoft has opened up the .NET source code, you can look at their ReaderWRiterLockSlim implementation.

I'm not sure the more basic primitives they use are available to you, some of them are also part of the .NET library and their code is also available.

Microsoft has spent quite a lot of time on improving the performance of their locking mechanisms, so this can be a good starting point.

来源：https://stackoverflow.com/questions/27860685/how-to-make-a-multiple-read-single-write-lock-from-more-basic-synchronization-pr

标签

c++

multithreading

synchronization

readwritelock

How to make a multiple-read/single-write lock from more basic synchronization primitives?

问题