Message ID | 20130122073347.13822.85876.stgit@srivatsabhat.in.ibm.com (mailing list archive)
---|---
State | New, archived
Hello, Srivatsa.

First of all, I'm not sure whether we need to be this step-by-step when
introducing something new. It's not like we're transforming an existing
implementation and it doesn't seem to help understanding the series that
much either.

On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
> lock-ordering related problems (unlike per-cpu locks). However, global

So, unfortunately, this already seems broken, right? The problem here
seems to be that previously, say, read_lock() implied preempt_disable()
but as this series aims to move away from it, it introduces the problem
of locking order between such locks and the new construct.

The only two options are either punishing writers or identifying and
updating all such possible deadlocks. percpu_rwsem does the former,
right? I don't know how feasible the latter would be. Srivatsa, you've
been looking at all the places which would require conversion, how
difficult would doing the latter be?

> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
> +	(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
> +
> +#define reader_nested_percpu(pcpu_rwlock)				\
> +	(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
> +
> +#define writer_active(pcpu_rwlock)					\
> +	(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))

Why are these in the public header file? Are they gonna be used to
inline something?

> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> +				       unsigned int cpu)
> +{
> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
> +}
> +
> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> +				      unsigned int cpu)
> +{
> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
> +}
> +
> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		raise_writer_signal(pcpu_rwlock, cpu);
> +
> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> +}
> +
> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> +
> +	for_each_online_cpu(cpu)
> +		drop_writer_signal(pcpu_rwlock, cpu);
> +
> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> +}

It could be just personal preference but I find the above one line
wrappers more obfuscating than anything else. What's the point of
wrapping writer_signal = true/false into a separate function? These
simple wrappers just add layers that people have to dig through to
figure out what's going on without adding anything of value. I'd much
prefer collapsing these into the percpu_write_[un]lock().

Thanks.
On 01/24/2013 12:25 AM, Tejun Heo wrote:
> Hello, Srivatsa.
>
> First of all, I'm not sure whether we need to be this step-by-step
> when introducing something new. It's not like we're transforming an
> existing implementation and it doesn't seem to help understanding the
> series that much either.
>

Hmm.. I split it up into steps to help explain the reasoning behind the
code sufficiently, rather than spring all of the intricacies at one go
(which would make it very hard to write the changelog/comments also).
The split made it easier for me to document it well in the changelog,
because I could deal with reasonable chunks of code/complexity at a
time. IMHO that helps people reading it for the first time to
understand the logic easily.

> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
>> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
>> lock-ordering related problems (unlike per-cpu locks). However, global
>
> So, unfortunately, this already seems broken, right? The problem here
> seems to be that previously, say, read_lock() implied
> preempt_disable() but as this series aims to move away from it, it
> introduces the problem of locking order between such locks and the new
> construct.
>

Not sure I got your point correctly. Are you referring to Steve's
comment that rwlocks are probably fair now (and hence not really safe
when used like this)? If yes, I haven't actually verified that yet, but
yes, that will make this hard to use, since we need to take care of
locking rules. But suppose rwlocks are unfair (as I had assumed them to
be), then we have absolutely no problems and no lock-ordering to worry
about.

> The only two options are either punishing writers or identifying and
> updating all such possible deadlocks. percpu_rwsem does the former,
> right? I don't know how feasible the latter would be.

I don't think we can avoid looking into all the possible deadlocks, as
long as we use rwlocks inside get/put_online_cpus_atomic() (assuming
rwlocks are fair). Even with Oleg's idea of using synchronize_sched()
at the writer, we still need to take care of locking rules, because the
synchronize_sched() only helps avoid the memory barriers at the reader,
and doesn't help get rid of the rwlocks themselves.

So in short, I don't see how we can punish the writers and thereby
somehow avoid looking into possible deadlocks (if rwlocks are fair).

> Srivatsa, you've been looking at all the places which would require
> conversion, how difficult would doing the latter be?
>

The problem is that some APIs like smp_call_function() will need to use
get/put_online_cpus_atomic(). That is when the locking becomes tricky
in the subsystem which invokes these APIs with other
(subsystem-specific, internal) locks held. So we could potentially use
a convention such as "Make get/put_online_cpus_atomic() your outer-most
calls, within which you nest the other locks" to rule out all ABBA
deadlock possibilities... But we might still hit some hard-to-convert
places..

BTW, Steve, fair rwlocks doesn't mean the following scenario will
result in a deadlock right?

    CPU 0                       CPU 1

    read_lock(&rwlock)

                                write_lock(&rwlock) //spins, because CPU 0
                                //has acquired the lock for read

    read_lock(&rwlock)
    ^^^^^
What happens here? Does CPU 0 start spinning (and hence deadlock) or
will it continue realizing that it already holds the rwlock for read?

If the above ends in a deadlock, then it's next to impossible to
convert all the places safely (because the above mentioned convention
will simply fall apart).
>> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
>> +	(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
>> +
>> +#define reader_nested_percpu(pcpu_rwlock)				\
>> +	(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
>> +
>> +#define writer_active(pcpu_rwlock)					\
>> +	(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
>
> Why are these in the public header file? Are they gonna be used to
> inline something?
>

No, I can put it in the .c file itself. Will do.

>> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>> +				       unsigned int cpu)
>> +{
>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
>> +}
>> +
>> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>> +				      unsigned int cpu)
>> +{
>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
>> +}
>> +
>> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> +	unsigned int cpu;
>> +
>> +	for_each_online_cpu(cpu)
>> +		raise_writer_signal(pcpu_rwlock, cpu);
>> +
>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>> +}
>> +
>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> +	unsigned int cpu;
>> +
>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>> +
>> +	for_each_online_cpu(cpu)
>> +		drop_writer_signal(pcpu_rwlock, cpu);
>> +
>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>> +}
>
> It could be just personal preference but I find the above one line
> wrappers more obfuscating than anything else. What's the point of
> wrapping writer_signal = true/false into a separate function? These
> simple wrappers just add layers that people have to dig through to
> figure out what's going on without adding anything of value. I'd much
> prefer collapsing these into the percpu_write_[un]lock().
>

Sure, I see your point. I'll change that.

Thanks a lot for your feedback Tejun!

Regards,
Srivatsa S. Bhat
Hello, Srivatsa.

On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
> Hmm.. I split it up into steps to help explain the reasoning behind
> the code sufficiently, rather than spring all of the intricacies at
> one go (which would make it very hard to write the changelog/comments
> also). The split made it easier for me to document it well in the
> changelog, because I could deal with reasonable chunks of code/complexity
> at a time. IMHO that helps people reading it for the first time to
> understand the logic easily.

I don't know. It's a judgement call I guess. I personally would much
prefer having ample documentation as comments in the source itself or
as a separate Documentation/ file as that's what most people are gonna
be looking at to figure out what's going on. Maybe just compact it a
bit and add more in-line documentation instead?

>> The only two options are either punishing writers or identifying and
>> updating all such possible deadlocks. percpu_rwsem does the former,
>> right? I don't know how feasible the latter would be.
>
> I don't think we can avoid looking into all the possible deadlocks,
> as long as we use rwlocks inside get/put_online_cpus_atomic() (assuming
> rwlocks are fair). Even with Oleg's idea of using synchronize_sched()
> at the writer, we still need to take care of locking rules, because the
> synchronize_sched() only helps avoid the memory barriers at the reader,
> and doesn't help get rid of the rwlocks themselves.

Well, percpu_rwlock doesn't have to use rwlock for the slow path. It
can implement its own writer starving locking scheme. It's not like
implementing slow path global rwlock logic is difficult.

>     CPU 0                       CPU 1
>
>     read_lock(&rwlock)
>
>                                 write_lock(&rwlock) //spins, because CPU 0
>                                 //has acquired the lock for read
>
>     read_lock(&rwlock)
>     ^^^^^
> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
> it continue realizing that it already holds the rwlock for read?

I don't think rwlock allows nesting write lock inside read lock.
read_lock(); write_lock() will always deadlock.

Thanks.
On 01/24/2013 01:27 AM, Tejun Heo wrote:
> Hello, Srivatsa.
>
> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
>> Hmm.. I split it up into steps to help explain the reasoning behind
>> the code sufficiently, rather than spring all of the intricacies at
>> one go (which would make it very hard to write the changelog/comments
>> also). The split made it easier for me to document it well in the
>> changelog, because I could deal with reasonable chunks of code/complexity
>> at a time. IMHO that helps people reading it for the first time to
>> understand the logic easily.
>
> I don't know. It's a judgement call I guess. I personally would much
> prefer having ample documentation as comments in the source itself or
> as a separate Documentation/ file as that's what most people are gonna
> be looking at to figure out what's going on. Maybe just compact it a
> bit and add more in-line documentation instead?
>

OK, I'll think about this.

>>> The only two options are either punishing writers or identifying and
>>> updating all such possible deadlocks. percpu_rwsem does the former,
>>> right? I don't know how feasible the latter would be.
>>
>> I don't think we can avoid looking into all the possible deadlocks,
>> as long as we use rwlocks inside get/put_online_cpus_atomic() (assuming
>> rwlocks are fair). Even with Oleg's idea of using synchronize_sched()
>> at the writer, we still need to take care of locking rules, because the
>> synchronize_sched() only helps avoid the memory barriers at the reader,
>> and doesn't help get rid of the rwlocks themselves.
>
> Well, percpu_rwlock don't have to use rwlock for the slow path. It
> can implement its own writer starving locking scheme. It's not like
> implementing slow path global rwlock logic is difficult.
>

Great idea! So probably I could use atomic ops or something similar in
the slow path to implement the scheme we need...

>>     CPU 0                       CPU 1
>>
>>     read_lock(&rwlock)
>>
>>                                 write_lock(&rwlock) //spins, because CPU 0
>>                                 //has acquired the lock for read
>>
>>     read_lock(&rwlock)
>>     ^^^^^
>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
>> it continue realizing that it already holds the rwlock for read?
>
> I don't think rwlock allows nesting write lock inside read lock.
> read_lock(); write_lock() will always deadlock.
>

Sure, I understand that :-) My question was, what happens when *two*
CPUs are involved, as in, the read_lock() is invoked only on CPU 0
whereas the write_lock() is invoked on CPU 1.

For example, the same scenario shown above, but with slightly different
timing, will NOT result in a deadlock:

Scenario 2:
    CPU 0                       CPU 1

    read_lock(&rwlock)

    read_lock(&rwlock) //doesn't spin

                                write_lock(&rwlock) //spins, because CPU 0
                                //has acquired the lock for read

So I was wondering whether the "fairness" logic of rwlocks would cause
the second read_lock() to spin (in the first scenario shown above)
because a writer is already waiting (and hence new readers should spin)
and thus cause a deadlock.

Regards,
Srivatsa S. Bhat
On Thu, 24 Jan 2013 10:00:04 +0530, Srivatsa S. Bhat wrote:
> On 01/24/2013 01:27 AM, Tejun Heo wrote:
>> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
>>>     CPU 0                       CPU 1
>>>
>>>     read_lock(&rwlock)
>>>
>>>                                 write_lock(&rwlock) //spins, because CPU 0
>>>                                 //has acquired the lock for read
>>>
>>>     read_lock(&rwlock)
>>>     ^^^^^
>>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
>>> it continue realizing that it already holds the rwlock for read?
>>
>> I don't think rwlock allows nesting write lock inside read lock.
>> read_lock(); write_lock() will always deadlock.
>>
>
> Sure, I understand that :-) My question was, what happens when *two* CPUs
> are involved, as in, the read_lock() is invoked only on CPU 0 whereas the
> write_lock() is invoked on CPU 1.
>
> For example, the same scenario shown above, but with slightly different
> timing, will NOT result in a deadlock:
>
> Scenario 2:
>     CPU 0                       CPU 1
>
>     read_lock(&rwlock)
>
>     read_lock(&rwlock) //doesn't spin
>
>                                 write_lock(&rwlock) //spins, because CPU 0
>                                 //has acquired the lock for read
>
> So I was wondering whether the "fairness" logic of rwlocks would cause
> the second read_lock() to spin (in the first scenario shown above) because
> a writer is already waiting (and hence new readers should spin) and thus
> cause a deadlock.

In my understanding, current x86 rwlock does basically this (of course,
in an atomic fashion):

    #define RW_LOCK_BIAS 0x10000

    rwlock_init(rwlock)
    {
        rwlock->lock = RW_LOCK_BIAS;
    }

    arch_read_lock(rwlock)
    {
    retry:
        if (--rwlock->lock >= 0)
            return;

        rwlock->lock++;
        while (rwlock->lock < 1)
            continue;

        goto retry;
    }

    arch_write_lock(rwlock)
    {
    retry:
        if ((rwlock->lock -= RW_LOCK_BIAS) == 0)
            return;

        rwlock->lock += RW_LOCK_BIAS;
        while (rwlock->lock != RW_LOCK_BIAS)
            continue;

        goto retry;
    }

So I can't find where the 'fairness' logic comes from..

Thanks,
Namhyung
On Tue, Jan 29, 2013 at 08:12:37PM +0900, Namhyung Kim wrote:
> On Thu, 24 Jan 2013 10:00:04 +0530, Srivatsa S. Bhat wrote:
> > On 01/24/2013 01:27 AM, Tejun Heo wrote:
> >> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
> >>>     CPU 0                       CPU 1
> >>>
> >>>     read_lock(&rwlock)
> >>>
> >>>                                 write_lock(&rwlock) //spins, because CPU 0
> >>>                                 //has acquired the lock for read
> >>>
> >>>     read_lock(&rwlock)
> >>>     ^^^^^
> >>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
> >>> it continue realizing that it already holds the rwlock for read?
> >>
> >> I don't think rwlock allows nesting write lock inside read lock.
> >> read_lock(); write_lock() will always deadlock.
> >>
> >
> > Sure, I understand that :-) My question was, what happens when *two* CPUs
> > are involved, as in, the read_lock() is invoked only on CPU 0 whereas the
> > write_lock() is invoked on CPU 1.
> >
> > For example, the same scenario shown above, but with slightly different
> > timing, will NOT result in a deadlock:
> >
> > Scenario 2:
> >     CPU 0                       CPU 1
> >
> >     read_lock(&rwlock)
> >
> >     read_lock(&rwlock) //doesn't spin
> >
> >                                 write_lock(&rwlock) //spins, because CPU 0
> >                                 //has acquired the lock for read
> >
> > So I was wondering whether the "fairness" logic of rwlocks would cause
> > the second read_lock() to spin (in the first scenario shown above) because
> > a writer is already waiting (and hence new readers should spin) and thus
> > cause a deadlock.
>
> In my understanding, current x86 rwlock does basically this (of course,
> in an atomic fashion):
>
>     #define RW_LOCK_BIAS 0x10000
>
>     rwlock_init(rwlock)
>     {
>         rwlock->lock = RW_LOCK_BIAS;
>     }
>
>     arch_read_lock(rwlock)
>     {
>     retry:
>         if (--rwlock->lock >= 0)
>             return;
>
>         rwlock->lock++;
>         while (rwlock->lock < 1)
>             continue;
>
>         goto retry;
>     }
>
>     arch_write_lock(rwlock)
>     {
>     retry:
>         if ((rwlock->lock -= RW_LOCK_BIAS) == 0)
>             return;
>
>         rwlock->lock += RW_LOCK_BIAS;
>         while (rwlock->lock != RW_LOCK_BIAS)
>             continue;
>
>         goto retry;
>     }
>
> So I can't find where the 'fairness' logic comes from..

I looked through several of the rwlock implementations, and in all of
them the writer backs off if it sees readers -- or refrains from
asserting ownership of the lock to begin with.

So it should be OK to use rwlock as shown in the underlying patch.

							Thanx, Paul
On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
> lock-ordering related problems (unlike per-cpu locks). However, global
> rwlocks lead to unnecessary cache-line bouncing even when there are no
> writers present, which can slow down the system needlessly.
>
> Per-cpu counters can help solve the cache-line bouncing problem. So we
> actually use the best of both: per-cpu counters (no-waiting) at the reader
> side in the fast-path, and global rwlocks in the slowpath.
>
> [ Fastpath = no writer is active; Slowpath = a writer is active ]
>
> IOW, the readers just increment/decrement their per-cpu refcounts (disabling
> interrupts during the updates, if necessary) when no writer is active.
> When a writer becomes active, he signals all readers to switch to global
> rwlocks for the duration of his activity. The readers switch over when it
> is safe for them (ie., when they are about to start a fresh, non-nested
> read-side critical section) and start using (holding) the global rwlock for
> read in their subsequent critical sections.
>
> The writer waits for every existing reader to switch, and then acquires the
> global rwlock for write and enters his critical section. Later, the writer
> signals all readers that he is done, and that they can go back to using their
> per-cpu refcounts again.
>
> Note that the lock-safety (despite the per-cpu scheme) comes from the fact
> that the readers can *choose* _when_ to switch to rwlocks upon the writer's
> signal. And the readers don't wait on anybody based on the per-cpu counters.
> The only true synchronization that involves waiting at the reader-side in this
> scheme, is the one arising from the global rwlock, which is safe from circular
> locking dependency issues.
>
> Reader-writer locks and per-cpu counters are recursive, so they can be
> used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
> recursive. Also, this design of switching the synchronization scheme ensures
> that you can safely nest and use these locks in a very flexible manner.
>
> I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
> suggestions and ideas, which inspired and influenced many of the decisions in
> this as well as previous designs. Thanks a lot Michael and Xiao!

Looks pretty close! Some comments interspersed below. Please either
fix the code or my confusion, as the case may be. ;-)

							Thanx, Paul

> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>  include/linux/percpu-rwlock.h |   10 +++
>  lib/percpu-rwlock.c           |  128 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 136 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
> index 8dec8fe..6819bb8 100644
> --- a/include/linux/percpu-rwlock.h
> +++ b/include/linux/percpu-rwlock.h
> @@ -68,4 +68,14 @@ extern void percpu_free_rwlock(struct percpu_rwlock *);
>  	__percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key);	\
>  })
>
> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
> +	(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
> +
> +#define reader_nested_percpu(pcpu_rwlock)				\
> +	(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
> +
> +#define writer_active(pcpu_rwlock)					\
> +	(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
> +
>  #endif
> +
> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
> index 80dad93..992da5c 100644
> --- a/lib/percpu-rwlock.c
> +++ b/lib/percpu-rwlock.c
> @@ -64,21 +64,145 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
>
>  void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> -	read_lock(&pcpu_rwlock->global_rwlock);
> +	preempt_disable();
> +
> +	/* First and foremost, let the writer know that a reader is active */
> +	this_cpu_inc(*pcpu_rwlock->reader_refcnt);
> +
> +	/*
> +	 * If we are already using per-cpu refcounts, it is not safe to switch
> +	 * the synchronization scheme. So continue using the refcounts.
> +	 */
> +	if (reader_nested_percpu(pcpu_rwlock)) {
> +		goto out;
> +	} else {
> +		/*
> +		 * The write to 'reader_refcnt' must be visible before we
> +		 * read 'writer_signal'.
> +		 */
> +		smp_mb(); /* Paired with smp_rmb() in sync_reader() */
> +
> +		if (likely(!writer_active(pcpu_rwlock))) {
> +			goto out;
> +		} else {
> +			/* Writer is active, so switch to global rwlock. */
> +			read_lock(&pcpu_rwlock->global_rwlock);
> +
> +			/*
> +			 * We might have raced with a writer going inactive
> +			 * before we took the read-lock. So re-evaluate whether
> +			 * we still need to hold the rwlock or if we can switch
> +			 * back to per-cpu refcounts. (This also helps avoid
> +			 * heterogeneous nesting of readers).
> +			 */
> +			if (writer_active(pcpu_rwlock))

The above writer_active() can be reordered with the following
this_cpu_dec(), strange though it might seem. But this is OK because
holding the rwlock is conservative. But might be worth a comment.

> +				this_cpu_dec(*pcpu_rwlock->reader_refcnt);
> +			else

In contrast, no reordering can happen here because read_unlock() is
required to keep the critical section underneath the lock.

> +				read_unlock(&pcpu_rwlock->global_rwlock);
> +		}
> +	}
> +
> +out:
> +	/* Prevent reordering of any subsequent reads */
> +	smp_rmb();

This should be smp_mb(). "Readers" really can do writes. Hence the
name lglock -- "local/global" rather than "reader/writer".

>  }
>
>  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
>  {
> -	read_unlock(&pcpu_rwlock->global_rwlock);

We need an smp_mb() here to keep the critical section ordered before the
this_cpu_dec() below. Otherwise, if a writer shows up just after we
exit the fastpath, that writer is not guaranteed to see the effects of
our critical section. Equivalently, the prior read-side critical section
just might see some of the writer's updates, which could be a bit of
a surprise to the reader.

> +	/*
> +	 * We never allow heterogeneous nesting of readers. So it is trivial
> +	 * to find out the kind of reader we are, and undo the operation
> +	 * done by our corresponding percpu_read_lock().
> +	 */
> +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
> +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
> +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */

Given an smp_mb() above, I don't understand the need for this smp_wmb().
Isn't the idea that if the writer sees ->reader_refcnt decremented to
zero, it also needs to see the effects of the corresponding reader's
critical section? Or am I missing something subtle here? In any case,
if this smp_wmb() really is needed, there should be some subsequent
write that the writer might observe. From what I can see, there is no
subsequent write from this reader that the writer cares about.

> +	} else {
> +		read_unlock(&pcpu_rwlock->global_rwlock);
> +	}
> +
> +	preempt_enable();
> +}
> +
> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> +				       unsigned int cpu)
> +{
> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
> +}
> +
> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
> +				      unsigned int cpu)
> +{
> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
> +}
> +
> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		raise_writer_signal(pcpu_rwlock, cpu);
> +
> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> +}
> +
> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());

Why do we drop ourselves twice? More to the point, why is it important to
drop ourselves first?

> +
> +	for_each_online_cpu(cpu)
> +		drop_writer_signal(pcpu_rwlock, cpu);
> +
> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
> +}
> +
> +/*
> + * Wait for the reader to see the writer's signal and switch from percpu
> + * refcounts to global rwlock.
> + *
> + * If the reader is still using percpu refcounts, wait for him to switch.
> + * Else, we can safely go ahead, because either the reader has already
> + * switched over, or the next reader that comes along on that CPU will
> + * notice the writer's signal and will switch over to the rwlock.
> + */
> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> +			       unsigned int cpu)
> +{
> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */

As I understand it, the purpose of this memory barrier is to ensure
that the stores in drop_writer_signal() happen before the reads from
->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
race between a new reader attempting to use the fastpath and this writer
acquiring the lock. Unless I am confused, this must be smp_mb() rather
than smp_rmb().

Also, why not just have a single smp_mb() at the beginning of
sync_all_readers() instead of executing one barrier per CPU?

> +
> +	while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
> +		cpu_relax();
> +}
> +
> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		sync_reader(pcpu_rwlock, cpu);
>  }
>
>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +	/*
> +	 * Tell all readers that a writer is becoming active, so that they
> +	 * start switching over to the global rwlock.
> +	 */
> +	announce_writer_active(pcpu_rwlock);
> +	sync_all_readers(pcpu_rwlock);
>  	write_lock(&pcpu_rwlock->global_rwlock);
>  }
>
>  void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +	/*
> +	 * Inform all readers that we are done, so that they can switch back
> +	 * to their per-cpu refcounts. (We don't need to wait for them to
> +	 * see it).
> +	 */
> +	announce_writer_inactive(pcpu_rwlock);
>  	write_unlock(&pcpu_rwlock->global_rwlock);
>  }
>
On 02/08, Paul E. McKenney wrote:
>
> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
> >
> >  void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
> >  {
> > -	read_unlock(&pcpu_rwlock->global_rwlock);
>
> We need an smp_mb() here to keep the critical section ordered before the
> this_cpu_dec() below. Otherwise, if a writer shows up just after we
> exit the fastpath, that writer is not guaranteed to see the effects of
> our critical section. Equivalently, the prior read-side critical section
> just might see some of the writer's updates, which could be a bit of
> a surprise to the reader.

Agreed, we should not assume that a "reader" doesn't write. And we
should ensure that this "read" section actually completes before
this_cpu_dec().

> > +	/*
> > +	 * We never allow heterogeneous nesting of readers. So it is trivial
> > +	 * to find out the kind of reader we are, and undo the operation
> > +	 * done by our corresponding percpu_read_lock().
> > +	 */
> > +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
> > +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
> > +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
>
> Given an smp_mb() above, I don't understand the need for this smp_wmb().
> Isn't the idea that if the writer sees ->reader_refcnt decremented to
> zero, it also needs to see the effects of the corresponding reader's
> critical section?

I am equally confused ;)

OTOH, we can probably avoid any barrier if reader_nested_percpu() == T.

> > +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> > +{
> > +	unsigned int cpu;
> > +
> > +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>
> Why do we drop ourselves twice? More to the point, why is it important to
> drop ourselves first?

And don't we need mb() _before_ we clear ->writer_signal?

> > +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> > +			       unsigned int cpu)
> > +{
> > +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>
> As I understand it, the purpose of this memory barrier is to ensure
> that the stores in drop_writer_signal() happen before the reads from
> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> race between a new reader attempting to use the fastpath and this writer
> acquiring the lock. Unless I am confused, this must be smp_mb() rather
> than smp_rmb().

And note that before sync_reader() we call announce_writer_active()
which already adds mb() before sync_all_readers/sync_reader, so this
rmb() looks unneeded.

But, at the same time, could you confirm that we do not need another
mb() after sync_all_readers() in percpu_write_lock()? I mean, without
mb(), can't this reader_uses_percpu_refcnt() LOAD leak into the
critical section protected by ->global_rwlock? Then this LOAD can be
re-ordered with other memory operations done by the writer.

Srivatsa, I think that the code would be more understandable if you
kill the helpers like sync_reader/raise_writer_signal. Perhaps even all
"write" helpers, I am not sure. At least, it seems to me that all
barriers should be moved to percpu_write_lock/unlock. But I won't
insist of course, up to you.

And cosmetic nit... How about

	struct xxx {
		unsigned long	reader_refcnt;
		bool		writer_signal;
	};

	struct percpu_rwlock {
		struct xxx __percpu	*xxx;
		rwlock_t		global_rwlock;
	};

? This saves one alloc_percpu() and ensures that
reader_refcnt/writer_signal are always in the same cache-line.

Oleg.
On 02/09/2013 04:17 AM, Paul E. McKenney wrote: > On Tue, Jan 29, 2013 at 08:12:37PM +0900, Namhyung Kim wrote: >> On Thu, 24 Jan 2013 10:00:04 +0530, Srivatsa S. Bhat wrote: >>> On 01/24/2013 01:27 AM, Tejun Heo wrote: >>>> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote: >>>>> CPU 0 CPU 1 >>>>> >>>>> read_lock(&rwlock) >>>>> >>>>> write_lock(&rwlock) //spins, because CPU 0 >>>>> //has acquired the lock for read >>>>> >>>>> read_lock(&rwlock) >>>>> ^^^^^ >>>>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will >>>>> it continue realizing that it already holds the rwlock for read? >>>> >>>> I don't think rwlock allows nesting write lock inside read lock. >>>> read_lock(); write_lock() will always deadlock. >>>> >>> >>> Sure, I understand that :-) My question was, what happens when *two* CPUs >>> are involved, as in, the read_lock() is invoked only on CPU 0 whereas the >>> write_lock() is invoked on CPU 1. >>> >>> For example, the same scenario shown above, but with slightly different >>> timing, will NOT result in a deadlock: >>> >>> Scenario 2: >>> CPU 0 CPU 1 >>> >>> read_lock(&rwlock) >>> >>> >>> read_lock(&rwlock) //doesn't spin >>> >>> write_lock(&rwlock) //spins, because CPU 0 >>> //has acquired the lock for read >>> >>> >>> So I was wondering whether the "fairness" logic of rwlocks would cause >>> the second read_lock() to spin (in the first scenario shown above) because >>> a writer is already waiting (and hence new readers should spin) and thus >>> cause a deadlock. 
>> >> In my understanding, current x86 rwlock does basically this (of course, >> in an atomic fashion): >> >> >> #define RW_LOCK_BIAS 0x10000 >> >> rwlock_init(rwlock) >> { >> rwlock->lock = RW_LOCK_BIAS; >> } >> >> arch_read_lock(rwlock) >> { >> retry: >> if (--rwlock->lock >= 0) >> return; >> >> rwlock->lock++; >> while (rwlock->lock < 1) >> continue; >> >> goto retry; >> } >> >> arch_write_lock(rwlock) >> { >> retry: >> if ((rwlock->lock -= RW_LOCK_BIAS) == 0) >> return; >> >> rwlock->lock += RW_LOCK_BIAS; >> while (rwlock->lock != RW_LOCK_BIAS) >> continue; >> >> goto retry; >> } >> >> >> So I can't find where the 'fairness' logic comes from.. > > I looked through several of the rwlock implementations, and in all of > them the writer backs off if it sees readers -- or refrains from asserting > ownership of the lock to begin with. > > So it should be OK to use rwlock as shown in the underlying patch. > Thanks a lot for confirming that, Paul! So I guess we can use rwlocks as they are, since their behaviour suits our needs perfectly. So I won't tinker with atomic counters for a while, at least not until someone starts making rwlocks fair.. ;-) Regards, Srivatsa S. Bhat
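Namhyung's pseudocode above can be made concrete as a userspace sketch built on C11 atomics (the `bias_*` names are invented for illustration; this is a simulation, not the actual x86 assembly). The behaviour Paul confirms is visible in `bias_write_lock()`: a writer that sees readers immediately restores the bias before spinning, so a waiting writer never blocks incoming readers and there is no fairness logic that could trigger the deadlock in Scenario 1:

```c
#include <assert.h>
#include <stdatomic.h>

#define RW_LOCK_BIAS 0x10000

typedef struct { atomic_int lock; } bias_rwlock_t;

static void bias_rwlock_init(bias_rwlock_t *rw)
{
	atomic_init(&rw->lock, RW_LOCK_BIAS);
}

static void bias_read_lock(bias_rwlock_t *rw)
{
	for (;;) {
		/* --lock >= 0 means no writer currently holds the lock */
		if (atomic_fetch_sub(&rw->lock, 1) - 1 >= 0)
			return;
		atomic_fetch_add(&rw->lock, 1);	/* undo, then wait */
		while (atomic_load(&rw->lock) < 1)
			;
	}
}

static void bias_read_unlock(bias_rwlock_t *rw)
{
	atomic_fetch_add(&rw->lock, 1);
}

static void bias_write_lock(bias_rwlock_t *rw)
{
	for (;;) {
		/* succeeds only when lock == RW_LOCK_BIAS: no readers at all */
		if (atomic_fetch_sub(&rw->lock, RW_LOCK_BIAS) - RW_LOCK_BIAS == 0)
			return;
		/* back off: a waiting writer never blocks new readers */
		atomic_fetch_add(&rw->lock, RW_LOCK_BIAS);
		while (atomic_load(&rw->lock) != RW_LOCK_BIAS)
			;
	}
}

static void bias_write_unlock(bias_rwlock_t *rw)
{
	atomic_fetch_add(&rw->lock, RW_LOCK_BIAS);
}
```

A second reader can always decrement the counter while it is positive, regardless of how long a writer has been waiting, which is exactly the reader-preference property this series relies on.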
On 02/09/2013 04:40 AM, Paul E. McKenney wrote: > On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote: >> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many >> lock-ordering related problems (unlike per-cpu locks). However, global >> rwlocks lead to unnecessary cache-line bouncing even when there are no >> writers present, which can slow down the system needlessly. >> [...] > > Looks pretty close! Some comments interspersed below. Please either > fix the code or my confusion, as the case may be. ;-) > Sure :-) >> --- >> >> include/linux/percpu-rwlock.h | 10 +++ >> lib/percpu-rwlock.c | 128 ++++++++++++++++++++++++++++++++++++++++- >> 2 files changed, 136 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h >> index 8dec8fe..6819bb8 100644 >> --- a/include/linux/percpu-rwlock.h >> +++ b/include/linux/percpu-rwlock.h >> @@ -68,4 +68,14 @@ extern void percpu_free_rwlock(struct percpu_rwlock *); >> __percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key); \ >> }) >> >> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu) \ >> + (ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu))) >> + >> +#define reader_nested_percpu(pcpu_rwlock) \ >> + (__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1) >> + >> +#define writer_active(pcpu_rwlock) \ >> + (__this_cpu_read(*((pcpu_rwlock)->writer_signal))) >> + >> #endif >> + >> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c >> index 80dad93..992da5c 100644 >> --- a/lib/percpu-rwlock.c >> +++ b/lib/percpu-rwlock.c >> @@ -64,21 +64,145 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock) >> >> void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock) >> { >> - read_lock(&pcpu_rwlock->global_rwlock); >> + preempt_disable(); >> + >> + /* First and foremost, let the writer know that a reader is active */ >> + this_cpu_inc(*pcpu_rwlock->reader_refcnt); >> + >> + /* >> + * If we are already using per-cpu refcounts, it 
is not safe to switch >> + * the synchronization scheme. So continue using the refcounts. >> + */ >> + if (reader_nested_percpu(pcpu_rwlock)) { >> + goto out; >> + } else { >> + /* >> + * The write to 'reader_refcnt' must be visible before we >> + * read 'writer_signal'. >> + */ >> + smp_mb(); /* Paired with smp_rmb() in sync_reader() */ >> + >> + if (likely(!writer_active(pcpu_rwlock))) { >> + goto out; >> + } else { >> + /* Writer is active, so switch to global rwlock. */ >> + read_lock(&pcpu_rwlock->global_rwlock); >> + >> + /* >> + * We might have raced with a writer going inactive >> + * before we took the read-lock. So re-evaluate whether >> + * we still need to hold the rwlock or if we can switch >> + * back to per-cpu refcounts. (This also helps avoid >> + * heterogeneous nesting of readers). >> + */ >> + if (writer_active(pcpu_rwlock)) > > The above writer_active() can be reordered with the following this_cpu_dec(), > strange though it might seem. But this is OK because holding the rwlock > is conservative. But might be worth a comment. > Ok.. >> + this_cpu_dec(*pcpu_rwlock->reader_refcnt); >> + else > > In contrast, no reordering can happen here because read_unlock() is > required to keep the critical section underneath the lock. > Ok.. >> + read_unlock(&pcpu_rwlock->global_rwlock); >> + } >> + } >> + >> +out: >> + /* Prevent reordering of any subsequent reads */ >> + smp_rmb(); > > This should be smp_mb(). "Readers" really can do writes. Hence the > name lglock -- "local/global" rather than "reader/writer". > Ok! [ We were trying to avoid full memory barriers in get/put_online_cpus_atomic() in the fastpath, as far as possible... Now it feels like all that discussion was for nothing :-( ] >> } >> >> void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock) >> { >> - read_unlock(&pcpu_rwlock->global_rwlock); > > We need an smp_mb() here to keep the critical section ordered before the > this_cpu_dec() below. 
Otherwise, if a writer shows up just after we > exit the fastpath, that writer is not guaranteed to see the effects of > our critical section. Equivalently, the prior read-side critical section > just might see some of the writer's updates, which could be a bit of > a surprise to the reader. > Ok, will add it. >> + /* >> + * We never allow heterogeneous nesting of readers. So it is trivial >> + * to find out the kind of reader we are, and undo the operation >> + * done by our corresponding percpu_read_lock(). >> + */ >> + if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) { >> + this_cpu_dec(*pcpu_rwlock->reader_refcnt); >> + smp_wmb(); /* Paired with smp_rmb() in sync_reader() */ > > Given an smp_mb() above, I don't understand the need for this smp_wmb(). > Isn't the idea that if the writer sees ->reader_refcnt decremented to > zero, it also needs to see the effects of the corresponding reader's > critical section? > Not sure what you meant, but my idea here was that the writer should see the reader_refcnt falling to zero as soon as possible, to avoid keeping the writer waiting in a tight loop for longer than necessary. I might have been a little over-zealous to use lighter memory barriers though, (given our lengthy discussions in the previous versions to reduce the memory barrier overheads), so the smp_wmb() used above might be wrong. So, are you saying that the smp_mb() you indicated above would be enough to make the writer observe the 1->0 transition of reader_refcnt immediately? > Or am I missing something subtle here? In any case, if this smp_wmb() > really is needed, there should be some subsequent write that the writer > might observe. From what I can see, there is no subsequent write from > this reader that the writer cares about. > I thought the smp_wmb() here and the smp_rmb() at the writer would ensure immediate reflection of the reader state at the writer side... Please correct me if my understanding is incorrect. 
>> + } else { >> + read_unlock(&pcpu_rwlock->global_rwlock); >> + } >> + >> + preempt_enable(); >> +} >> + >> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock, >> + unsigned int cpu) >> +{ >> + per_cpu(*pcpu_rwlock->writer_signal, cpu) = true; >> +} >> + >> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock, >> + unsigned int cpu) >> +{ >> + per_cpu(*pcpu_rwlock->writer_signal, cpu) = false; >> +} >> + >> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock) >> +{ >> + unsigned int cpu; >> + >> + for_each_online_cpu(cpu) >> + raise_writer_signal(pcpu_rwlock, cpu); >> + >> + smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */ >> +} >> + >> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock) >> +{ >> + unsigned int cpu; >> + >> + drop_writer_signal(pcpu_rwlock, smp_processor_id()); > > Why do we drop ourselves twice? More to the point, why is it important to > drop ourselves first? > I don't see where we are dropping ourselves twice. Note that we are no longer in the cpu_online_mask, so the 'for' loop below won't include us. So we need to manually drop ourselves. It doesn't matter whether we drop ourselves first or later. >> + >> + for_each_online_cpu(cpu) >> + drop_writer_signal(pcpu_rwlock, cpu); >> + >> + smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */ >> +} >> + >> +/* >> + * Wait for the reader to see the writer's signal and switch from percpu >> + * refcounts to global rwlock. >> + * >> + * If the reader is still using percpu refcounts, wait for him to switch. >> + * Else, we can safely go ahead, because either the reader has already >> + * switched over, or the next reader that comes along on that CPU will >> + * notice the writer's signal and will switch over to the rwlock. 
>> + */ >> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock, >> + unsigned int cpu) >> +{ >> + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */ > > As I understand it, the purpose of this memory barrier is to ensure > that the stores in drop_writer_signal() happen before the reads from > ->reader_refcnt in reader_uses_percpu_refcnt(), No, that was not what I intended. announce_writer_inactive() already does a full smp_mb() after calling drop_writer_signal(). I put the smp_rmb() here and the smp_wmb() at the reader side (after updates to the ->reader_refcnt) to reflect the state change of ->reader_refcnt immediately at the writer, so that the writer doesn't have to keep spinning unnecessarily still referring to the old (non-zero) value of ->reader_refcnt. Or perhaps I am confused about how to use memory barriers properly.. :-( > thus preventing the > race between a new reader attempting to use the fastpath and this writer > acquiring the lock. Unless I am confused, this must be smp_mb() rather > than smp_rmb(). > > Also, why not just have a single smp_mb() at the beginning of > sync_all_readers() instead of executing one barrier per CPU? Well, since my intention was to help the writer see the update (->reader_refcnt dropping to zero) ASAP, I kept the multiple smp_rmb()s. > >> + >> + while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu)) >> + cpu_relax(); >> +} >> + >> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock) >> +{ >> + unsigned int cpu; >> + >> + for_each_online_cpu(cpu) >> + sync_reader(pcpu_rwlock, cpu); >> } >> >> void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock) >> { >> + /* >> + * Tell all readers that a writer is becoming active, so that they >> + * start switching over to the global rwlock. 
>> + */ >> + announce_writer_active(pcpu_rwlock); >> + sync_all_readers(pcpu_rwlock); >> write_lock(&pcpu_rwlock->global_rwlock); >> } >> >> void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock) >> { >> + /* >> + * Inform all readers that we are done, so that they can switch back >> + * to their per-cpu refcounts. (We don't need to wait for them to >> + * see it). >> + */ >> + announce_writer_inactive(pcpu_rwlock); >> write_unlock(&pcpu_rwlock->global_rwlock); >> } >> >> Thanks a lot for your detailed review and comments! :-) Regards, Srivatsa S. Bhat
On 02/10/2013 11:36 PM, Oleg Nesterov wrote: > On 02/08, Paul E. McKenney wrote: >> >> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote: >>> >>> void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock) >>> { >>> - read_unlock(&pcpu_rwlock->global_rwlock); >> >> We need an smp_mb() here to keep the critical section ordered before the >> this_cpu_dec() below. Otherwise, if a writer shows up just after we >> exit the fastpath, that writer is not guaranteed to see the effects of >> our critical section. Equivalently, the prior read-side critical section >> just might see some of the writer's updates, which could be a bit of >> a surprise to the reader. > > Agreed, we should not assume that a "reader" doesn't write. And we should > ensure that this "read" section actually completes before this_cpu_dec(). > Right, will fix. >>> + /* >>> + * We never allow heterogeneous nesting of readers. So it is trivial >>> + * to find out the kind of reader we are, and undo the operation >>> + * done by our corresponding percpu_read_lock(). >>> + */ >>> + if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) { >>> + this_cpu_dec(*pcpu_rwlock->reader_refcnt); >>> + smp_wmb(); /* Paired with smp_rmb() in sync_reader() */ >> >> Given an smp_mb() above, I don't understand the need for this smp_wmb(). >> Isn't the idea that if the writer sees ->reader_refcnt decremented to >> zero, it also needs to see the effects of the corresponding reader's >> critical section? > > I am equally confused ;) > > OTOH, we can probably avoid any barrier if reader_nested_percpu() == T. > Good point! Will add that optimization, thank you! > >>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock) >>> +{ >>> + unsigned int cpu; >>> + >>> + drop_writer_signal(pcpu_rwlock, smp_processor_id()); >> >> Why do we drop ourselves twice? More to the point, why is it important to >> drop ourselves first? > > And don't we need mb() _before_ we clear ->writer_signal ? > Oh, right! 
Or, how about moving announce_writer_inactive() to _after_ write_unlock()? >>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock, >>> + unsigned int cpu) >>> +{ >>> + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */ >> >> As I understand it, the purpose of this memory barrier is to ensure >> that the stores in drop_writer_signal() happen before the reads from >> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the >> race between a new reader attempting to use the fastpath and this writer >> acquiring the lock. Unless I am confused, this must be smp_mb() rather >> than smp_rmb(). > > And note that before sync_reader() we call announce_writer_active() which > already adds mb() before sync_all_readers/sync_reader, so this rmb() looks > unneeded. > My intention was to help the writer see the ->reader_refcnt drop to zero ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer. Please correct me if my understanding of memory barriers is wrong here.. > But, at the same time, could you confirm that we do not need another mb() > after sync_all_readers() in percpu_write_lock() ? I mean, without mb(), > can't this reader_uses_percpu_refcnt() LOAD leak into the critical section > protected by ->global_rwlock? Then this LOAD can be re-ordered with other > memory operations done by the writer. > Hmm.. it appears that we need a smp_mb() there. > > > Srivatsa, I think that the code would be more understandable if you kill > the helpers like sync_reader/raise_writer_signal. Perhaps even all "write" > helpers, I am not sure. At least, it seems to me that all barriers should > be moved to percpu_write_lock/unlock. But I won't insist of course, up to > you. > Sure, sure. Even Tejun pointed out that those helpers are getting in the way of readability. I'll get rid of them in the next version. > And cosmetic nit... 
How about > > struct xxx { > unsigned long reader_refcnt; > bool writer_signal; > } > > struct percpu_rwlock { > struct xxx __percpu *xxx; > rwlock_t global_rwlock; > }; > > ? > > This saves one alloc_percpu() and ensures that reader_refcnt/writer_signal > are always in the same cache-line. > Ok, that sounds better. Will make that change. Thanks a lot Oleg! Regards, Srivatsa S. Bhat
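For reference, Oleg's proposed layout compiles to something like the following fragment (the struct name `percpu_rwlock_data` is a placeholder for his "struct xxx"; the `__percpu` annotation and `rwlock_t` are omitted so the fragment stands alone):

```c
#include <assert.h>
#include <stdbool.h>

/* One combined per-CPU object: a single alloc_percpu() covers both
 * fields, and reader_refcnt/writer_signal share a cache line. */
struct percpu_rwlock_data {
	unsigned long reader_refcnt;
	bool writer_signal;
};
```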
On Mon, Feb 11, 2013 at 12:40:56AM +0530, Srivatsa S. Bhat wrote: > On 02/09/2013 04:40 AM, Paul E. McKenney wrote: > > On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote: > >> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many > >> lock-ordering related problems (unlike per-cpu locks). However, global > >> rwlocks lead to unnecessary cache-line bouncing even when there are no > >> writers present, which can slow down the system needlessly. > >> > [...] > > > > Looks pretty close! Some comments interspersed below. Please either > > fix the code or my confusion, as the case may be. ;-) > > > > Sure :-) > > >> --- > >> > >> include/linux/percpu-rwlock.h | 10 +++ > >> lib/percpu-rwlock.c | 128 ++++++++++++++++++++++++++++++++++++++++- > >> 2 files changed, 136 insertions(+), 2 deletions(-) > >> > >> diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h > >> index 8dec8fe..6819bb8 100644 > >> --- a/include/linux/percpu-rwlock.h > >> +++ b/include/linux/percpu-rwlock.h > >> @@ -68,4 +68,14 @@ extern void percpu_free_rwlock(struct percpu_rwlock *); > >> __percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key); \ > >> }) > >> > >> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu) \ > >> + (ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu))) > >> + > >> +#define reader_nested_percpu(pcpu_rwlock) \ > >> + (__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1) > >> + > >> +#define writer_active(pcpu_rwlock) \ > >> + (__this_cpu_read(*((pcpu_rwlock)->writer_signal))) > >> + > >> #endif > >> + > >> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c > >> index 80dad93..992da5c 100644 > >> --- a/lib/percpu-rwlock.c > >> +++ b/lib/percpu-rwlock.c > >> @@ -64,21 +64,145 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock) > >> > >> void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock) > >> { > >> - read_lock(&pcpu_rwlock->global_rwlock); > >> + preempt_disable(); > >> + > >> + /* First and 
foremost, let the writer know that a reader is active */ > >> + this_cpu_inc(*pcpu_rwlock->reader_refcnt); > >> + > >> + /* > >> + * If we are already using per-cpu refcounts, it is not safe to switch > >> + * the synchronization scheme. So continue using the refcounts. > >> + */ > >> + if (reader_nested_percpu(pcpu_rwlock)) { > >> + goto out; > >> + } else { > >> + /* > >> + * The write to 'reader_refcnt' must be visible before we > >> + * read 'writer_signal'. > >> + */ > >> + smp_mb(); /* Paired with smp_rmb() in sync_reader() */ > >> + > >> + if (likely(!writer_active(pcpu_rwlock))) { > >> + goto out; > >> + } else { > >> + /* Writer is active, so switch to global rwlock. */ > >> + read_lock(&pcpu_rwlock->global_rwlock); > >> + > >> + /* > >> + * We might have raced with a writer going inactive > >> + * before we took the read-lock. So re-evaluate whether > >> + * we still need to hold the rwlock or if we can switch > >> + * back to per-cpu refcounts. (This also helps avoid > >> + * heterogeneous nesting of readers). > >> + */ > >> + if (writer_active(pcpu_rwlock)) > > > > The above writer_active() can be reordered with the following this_cpu_dec(), > > strange though it might seem. But this is OK because holding the rwlock > > is conservative. But might be worth a comment. > > > > Ok.. > > >> + this_cpu_dec(*pcpu_rwlock->reader_refcnt); > >> + else > > > > In contrast, no reordering can happen here because read_unlock() is > > required to keep the critical section underneath the lock. > > > > Ok.. > > >> + read_unlock(&pcpu_rwlock->global_rwlock); > >> + } > >> + } > >> + > >> +out: > >> + /* Prevent reordering of any subsequent reads */ > >> + smp_rmb(); > > > > This should be smp_mb(). "Readers" really can do writes. Hence the > > name lglock -- "local/global" rather than "reader/writer". > > > > Ok! > > [ We were trying to avoid full memory barriers in get/put_online_cpus_atomic() > in the fastpath, as far as possible... 
Now it feels like all that discussion > was for nothing :-( ] > > >> } > >> > >> void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock) > >> { > >> - read_unlock(&pcpu_rwlock->global_rwlock); > > > > We need an smp_mb() here to keep the critical section ordered before the > > this_cpu_dec() below. Otherwise, if a writer shows up just after we > > exit the fastpath, that writer is not guaranteed to see the effects of > > our critical section. Equivalently, the prior read-side critical section > > just might see some of the writer's updates, which could be a bit of > > a surprise to the reader. > > > > Ok, will add it. > > >> + /* > >> + * We never allow heterogeneous nesting of readers. So it is trivial > >> + * to find out the kind of reader we are, and undo the operation > >> + * done by our corresponding percpu_read_lock(). > >> + */ > >> + if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) { > >> + this_cpu_dec(*pcpu_rwlock->reader_refcnt); > >> + smp_wmb(); /* Paired with smp_rmb() in sync_reader() */ > > > > Given an smp_mb() above, I don't understand the need for this smp_wmb(). > > Isn't the idea that if the writer sees ->reader_refcnt decremented to > > zero, it also needs to see the effects of the corresponding reader's > > critical section? > > > > Not sure what you meant, but my idea here was that the writer should see > the reader_refcnt falling to zero as soon as possible, to avoid keeping the > writer waiting in a tight loop for longer than necessary. > I might have been a little over-zealous to use lighter memory barriers though, > (given our lengthy discussions in the previous versions to reduce the memory > barrier overheads), so the smp_wmb() used above might be wrong. > > So, are you saying that the smp_mb() you indicated above would be enough > to make the writer observe the 1->0 transition of reader_refcnt immediately? > > > Or am I missing something subtle here? 
In any case, if this smp_wmb() > > really is needed, there should be some subsequent write that the writer > > might observe. From what I can see, there is no subsequent write from > > this reader that the writer cares about. > > I thought the smp_wmb() here and the smp_rmb() at the writer would ensure > immediate reflection of the reader state at the writer side... Please correct > me if my understanding is incorrect. Ah, but memory barriers are not so much about making data move faster through the machine, but more about making sure that ordering constraints are met. After all, memory barriers cannot make electrons flow faster through silicon. You should therefore use memory barriers only to constrain ordering, not to try to expedite electrons. > >> + } else { > >> + read_unlock(&pcpu_rwlock->global_rwlock); > >> + } > >> + > >> + preempt_enable(); > >> +} > >> + > >> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock, > >> + unsigned int cpu) > >> +{ > >> + per_cpu(*pcpu_rwlock->writer_signal, cpu) = true; > >> +} > >> + > >> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock, > >> + unsigned int cpu) > >> +{ > >> + per_cpu(*pcpu_rwlock->writer_signal, cpu) = false; > >> +} > >> + > >> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock) > >> +{ > >> + unsigned int cpu; > >> + > >> + for_each_online_cpu(cpu) > >> + raise_writer_signal(pcpu_rwlock, cpu); > >> + > >> + smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */ > >> +} > >> + > >> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock) > >> +{ > >> + unsigned int cpu; > >> + > >> + drop_writer_signal(pcpu_rwlock, smp_processor_id()); > > > > Why do we drop ourselves twice? More to the point, why is it important to > > drop ourselves first? > > I don't see where we are dropping ourselves twice. Note that we are no longer > in the cpu_online_mask, so the 'for' loop below won't include us. 
So we need > to manually drop ourselves. It doesn't matter whether we drop ourselves first > or later. Good point, apologies for my confusion! Still worth a commment, though. > >> + > >> + for_each_online_cpu(cpu) > >> + drop_writer_signal(pcpu_rwlock, cpu); > >> + > >> + smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */ > >> +} > >> + > >> +/* > >> + * Wait for the reader to see the writer's signal and switch from percpu > >> + * refcounts to global rwlock. > >> + * > >> + * If the reader is still using percpu refcounts, wait for him to switch. > >> + * Else, we can safely go ahead, because either the reader has already > >> + * switched over, or the next reader that comes along on that CPU will > >> + * notice the writer's signal and will switch over to the rwlock. > >> + */ > >> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock, > >> + unsigned int cpu) > >> +{ > >> + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */ > > > > As I understand it, the purpose of this memory barrier is to ensure > > that the stores in drop_writer_signal() happen before the reads from > > ->reader_refcnt in reader_uses_percpu_refcnt(), > > No, that was not what I intended. announce_writer_inactive() already does > a full smp_mb() after calling drop_writer_signal(). > > I put the smp_rmb() here and the smp_wmb() at the reader side (after updates > to the ->reader_refcnt) to reflect the state change of ->reader_refcnt > immediately at the writer, so that the writer doesn't have to keep spinning > unnecessarily still referring to the old (non-zero) value of ->reader_refcnt. > Or perhaps I am confused about how to use memory barriers properly.. :-( Sadly, no, memory barriers don't make electrons move faster. So you should only need the one -- the additional memory barriers are just slowing things down. > > thus preventing the > > race between a new reader attempting to use the fastpath and this writer > > acquiring the lock. 
Unless I am confused, this must be smp_mb() rather > > than smp_rmb(). > > > > Also, why not just have a single smp_mb() at the beginning of > > sync_all_readers() instead of executing one barrier per CPU? > > Well, since my intention was to help the writer see the update (->reader_refcnt > dropping to zero) ASAP, I kept the multiple smp_rmb()s. At least you were consistent. ;-) > >> + > >> + while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu)) > >> + cpu_relax(); > >> +} > >> + > >> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock) > >> +{ > >> + unsigned int cpu; > >> + > >> + for_each_online_cpu(cpu) > >> + sync_reader(pcpu_rwlock, cpu); > >> } > >> > >> void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock) > >> { > >> + /* > >> + * Tell all readers that a writer is becoming active, so that they > >> + * start switching over to the global rwlock. > >> + */ > >> + announce_writer_active(pcpu_rwlock); > >> + sync_all_readers(pcpu_rwlock); > >> write_lock(&pcpu_rwlock->global_rwlock); > >> } > >> > >> void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock) > >> { > >> + /* > >> + * Inform all readers that we are done, so that they can switch back > >> + * to their per-cpu refcounts. (We don't need to wait for them to > >> + * see it). > >> + */ > >> + announce_writer_inactive(pcpu_rwlock); > >> write_unlock(&pcpu_rwlock->global_rwlock); > >> } > >> > >> > > Thanks a lot for your detailed review and comments! :-) It will be good to get this in! Thanx, Paul
On 02/11, Srivatsa S. Bhat wrote: > > On 02/10/2013 11:36 PM, Oleg Nesterov wrote: > >>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock) > >>> +{ > >>> + unsigned int cpu; > >>> + > >>> + drop_writer_signal(pcpu_rwlock, smp_processor_id()); > >> > >> Why do we drop ourselves twice? More to the point, why is it important to > >> drop ourselves first? > > > > And don't we need mb() _before_ we clear ->writer_signal ? > > > > Oh, right! Or, how about moving announce_writer_inactive() to _after_ > write_unlock()? Not sure this will help... but, either way it seems we have another problem... percpu_rwlock tries to be "generic". This means we should "ignore" its usage in hotplug, and _write_lock should not race with _write_unlock. IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure that this can't race with another write which wants to set this flag. Perhaps it should be counter as well, and it should be protected by the same ->global_rwlock, but _write_lock() should drop it before sync_all_readers() and then take it again? > >>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock, > >>> + unsigned int cpu) > >>> +{ > >>> + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */ > >> > >> As I understand it, the purpose of this memory barrier is to ensure > >> that the stores in drop_writer_signal() happen before the reads from > >> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the > >> race between a new reader attempting to use the fastpath and this writer > >> acquiring the lock. Unless I am confused, this must be smp_mb() rather > >> than smp_rmb(). > > > > And note that before sync_reader() we call announce_writer_active() which > > already adds mb() before sync_all_readers/sync_reader, so this rmb() looks > > unneeded. > > > > My intention was to help the writer see the ->reader_refcnt drop to zero > ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer. 
Hmm, interesting... Not sure, but can't really comment. However I can answer your next question: > Please correct me if my understanding of memory barriers is wrong here.. Who? Me??? No, we have paulmck for that ;) Oleg.
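Oleg's point above about _write_lock racing with _write_unlock can be illustrated with a minimal sketch: if `writer_signal` becomes a counter instead of a bool, one writer's unlock cannot erase the signal raised by another writer that is concurrently entering the lock. This is entirely hypothetical (it deliberately ignores the per-CPU aspect and the `global_rwlock` protection Oleg proposes around it; all names are invented):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical: writer presence as a counter rather than a bool. In the
 * real code there would be one such counter per CPU. */
static atomic_uint writer_signal;

static void raise_writer_signal(void)
{
	atomic_fetch_add(&writer_signal, 1);
}

static void drop_writer_signal(void)
{
	atomic_fetch_sub(&writer_signal, 1);
}

static bool writer_active(void)
{
	return atomic_load(&writer_signal) != 0;
}
```

With a plain bool, writer A's unlock would clear the flag that writer B just set; with the counter, readers keep seeing `writer_active()` until every writer has dropped its reference.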
On Sun, Feb 10, 2013 at 07:06:07PM +0100, Oleg Nesterov wrote: > On 02/08, Paul E. McKenney wrote: [ . . . ] > > > +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock, > > > + unsigned int cpu) > > > +{ > > > + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */ > > > > As I understand it, the purpose of this memory barrier is to ensure > > that the stores in drop_writer_signal() happen before the reads from > > ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the > > race between a new reader attempting to use the fastpath and this writer > > acquiring the lock. Unless I am confused, this must be smp_mb() rather > > than smp_rmb(). > > And note that before sync_reader() we call announce_writer_active() which > already adds mb() before sync_all_readers/sync_reader, so this rmb() looks > unneeded. > > But, at the same time, could you confirm that we do not need another mb() > after sync_all_readers() in percpu_write_lock() ? I mean, without mb(), > can't this reader_uses_percpu_refcnt() LOAD leak into the critical section > protected by ->global_rwlock? Then this LOAD can be re-ordered with other > memory operations done by the writer. As soon as I get the rest of the way through Thomas's patchset. ;-) Thanx, Paul
On 02/11/2013 01:17 AM, Paul E. McKenney wrote:
> On Mon, Feb 11, 2013 at 12:40:56AM +0530, Srivatsa S. Bhat wrote:
>> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
>>> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
>>>> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
>>>> lock-ordering related problems (unlike per-cpu locks). However, global
>>>> rwlocks lead to unnecessary cache-line bouncing even when there are no
>>>> writers present, which can slow down the system needlessly.
>>>>
>> [...]
>>>> +	/*
>>>> +	 * We never allow heterogeneous nesting of readers. So it is trivial
>>>> +	 * to find out the kind of reader we are, and undo the operation
>>>> +	 * done by our corresponding percpu_read_lock().
>>>> +	 */
>>>> +	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
>>>> +		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
>>>> +		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
>>>
>>> Given an smp_mb() above, I don't understand the need for this smp_wmb().
>>> Isn't the idea that if the writer sees ->reader_refcnt decremented to
>>> zero, it also needs to see the effects of the corresponding reader's
>>> critical section?
>>>
>>
>> Not sure what you meant, but my idea here was that the writer should see
>> the reader_refcnt falling to zero as soon as possible, to avoid keeping the
>> writer waiting in a tight loop for longer than necessary.
>> I might have been a little over-zealous to use lighter memory barriers though,
>> (given our lengthy discussions in the previous versions to reduce the memory
>> barrier overheads), so the smp_wmb() used above might be wrong.
>>
>> So, are you saying that the smp_mb() you indicated above would be enough
>> to make the writer observe the 1->0 transition of reader_refcnt immediately?
>>
>>> Or am I missing something subtle here? In any case, if this smp_wmb()
>>> really is needed, there should be some subsequent write that the writer
>>> might observe. From what I can see, there is no subsequent write from
>>> this reader that the writer cares about.
>>
>> I thought the smp_wmb() here and the smp_rmb() at the writer would ensure
>> immediate reflection of the reader state at the writer side... Please correct
>> me if my understanding is incorrect.
>
> Ah, but memory barriers are not so much about making data move faster
> through the machine, but more about making sure that ordering constraints
> are met. After all, memory barriers cannot make electrons flow faster
> through silicon. You should therefore use memory barriers only to
> constrain ordering, not to try to expedite electrons.
>

I guess I must have been confused after looking at that graph which showed
how much time it takes for other CPUs to notice the change in value of a
variable performed in a given CPU.. and must have gotten the (wrong) idea
that memory barriers also help speed that up! Very sorry about that!

>>>> +	} else {
>>>> +		read_unlock(&pcpu_rwlock->global_rwlock);
>>>> +	}
>>>> +
>>>> +	preempt_enable();
>>>> +}
>>>> +
>>>> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>>>> +				       unsigned int cpu)
>>>> +{
>>>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
>>>> +}
>>>> +
>>>> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>>>> +				      unsigned int cpu)
>>>> +{
>>>> +	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
>>>> +}
>>>> +
>>>> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	for_each_online_cpu(cpu)
>>>> +		raise_writer_signal(pcpu_rwlock, cpu);
>>>> +
>>>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>>>> +}
>>>> +
>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>>
>>> Why do we drop ourselves twice? More to the point, why is it important to
>>> drop ourselves first?
>>
>> I don't see where we are dropping ourselves twice. Note that we are no longer
>> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
>> to manually drop ourselves. It doesn't matter whether we drop ourselves first
>> or later.
>
> Good point, apologies for my confusion! Still worth a comment, though.
>

Sure, will add it.

>>>> +
>>>> +	for_each_online_cpu(cpu)
>>>> +		drop_writer_signal(pcpu_rwlock, cpu);
>>>> +
>>>> +	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>>>> +}
>>>> +
>>>> +/*
>>>> + * Wait for the reader to see the writer's signal and switch from percpu
>>>> + * refcounts to global rwlock.
>>>> + *
>>>> + * If the reader is still using percpu refcounts, wait for him to switch.
>>>> + * Else, we can safely go ahead, because either the reader has already
>>>> + * switched over, or the next reader that comes along on that CPU will
>>>> + * notice the writer's signal and will switch over to the rwlock.
>>>> + */
>>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
>>>> +			       unsigned int cpu)
>>>> +{
>>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>>>
>>> As I understand it, the purpose of this memory barrier is to ensure
>>> that the stores in drop_writer_signal() happen before the reads from
>>> ->reader_refcnt in reader_uses_percpu_refcnt(),
>>
>> No, that was not what I intended. announce_writer_inactive() already does
>> a full smp_mb() after calling drop_writer_signal().
>>
>> I put the smp_rmb() here and the smp_wmb() at the reader side (after updates
>> to the ->reader_refcnt) to reflect the state change of ->reader_refcnt
>> immediately at the writer, so that the writer doesn't have to keep spinning
>> unnecessarily still referring to the old (non-zero) value of ->reader_refcnt.
>> Or perhaps I am confused about how to use memory barriers properly.. :-(
>
> Sadly, no, memory barriers don't make electrons move faster. So you
> should only need the one -- the additional memory barriers are just
> slowing things down.
>

Ok..

>>> thus preventing the
>>> race between a new reader attempting to use the fastpath and this writer
>>> acquiring the lock. Unless I am confused, this must be smp_mb() rather
>>> than smp_rmb().
>>>
>>> Also, why not just have a single smp_mb() at the beginning of
>>> sync_all_readers() instead of executing one barrier per CPU?
>>
>> Well, since my intention was to help the writer see the update
>> (->reader_refcnt dropping to zero) ASAP, I kept the multiple smp_rmb()s.
>
> At least you were consistent.  ;-)
>

Haha, that's an optimistic way of looking at it, but it's no good if I was
consistently _wrong_! ;-)

>>>> +
>>>> +	while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
>>>> +		cpu_relax();
>>>> +}
>>>> +
>>>> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	for_each_online_cpu(cpu)
>>>> +		sync_reader(pcpu_rwlock, cpu);
>>>> }
>>>>
>>>> void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>>>> {
>>>> +	/*
>>>> +	 * Tell all readers that a writer is becoming active, so that they
>>>> +	 * start switching over to the global rwlock.
>>>> +	 */
>>>> +	announce_writer_active(pcpu_rwlock);
>>>> +	sync_all_readers(pcpu_rwlock);
>>>> 	write_lock(&pcpu_rwlock->global_rwlock);
>>>> }
>>>>
>>>> void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
>>>> {
>>>> +	/*
>>>> +	 * Inform all readers that we are done, so that they can switch back
>>>> +	 * to their per-cpu refcounts. (We don't need to wait for them to
>>>> +	 * see it).
>>>> +	 */
>>>> +	announce_writer_inactive(pcpu_rwlock);
>>>> 	write_unlock(&pcpu_rwlock->global_rwlock);
>>>> }
>>>>
>>
>> Thanks a lot for your detailed review and comments! :-)
>
> It will be good to get this in!
>

Thank you :-) I'll try to address the review comments and respin the
patchset soon.

Regards,
Srivatsa S. Bhat
On 02/11/2013 01:20 AM, Oleg Nesterov wrote:
> On 02/11, Srivatsa S. Bhat wrote:
>>
>> On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
>>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>>>> +{
>>>>> +	unsigned int cpu;
>>>>> +
>>>>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>>>
>>>> Why do we drop ourselves twice? More to the point, why is it important to
>>>> drop ourselves first?
>>>
>>> And don't we need mb() _before_ we clear ->writer_signal ?
>>>
>>
>> Oh, right! Or, how about moving announce_writer_inactive() to _after_
>> write_unlock()?
>
> Not sure this will help... but, either way it seems we have another
> problem...
>
> percpu_rwlock tries to be "generic". This means we should "ignore" its
> usage in hotplug, and _write_lock should not race with _write_unlock.
>

Yes, good point!

> IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure
> that this can't race with another write which wants to set this flag.
> Perhaps it should be counter as well, and it should be protected by
> the same ->global_rwlock, but _write_lock() should drop it before
> sync_all_readers() and then take it again?

Hmm, or we could just add an extra mb() like you suggested, and keep it
simple...

>
>>>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
>>>>> +			       unsigned int cpu)
>>>>> +{
>>>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>>>>
>>>> As I understand it, the purpose of this memory barrier is to ensure
>>>> that the stores in drop_writer_signal() happen before the reads from
>>>> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
>>>> race between a new reader attempting to use the fastpath and this writer
>>>> acquiring the lock. Unless I am confused, this must be smp_mb() rather
>>>> than smp_rmb().
>>>
>>> And note that before sync_reader() we call announce_writer_active() which
>>> already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
>>> unneeded.
>>>
>>
>> My intention was to help the writer see the ->reader_refcnt drop to zero
>> ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.
>
> Hmm, interesting... Not sure, but can't really comment. However I can
> answer your next question:
>

Paul told me in another mail that I was expecting too much out of memory
barriers, like increasing the speed of electrons and what not ;-)
[ It would have been cool though, if it had such magical powers :P ]

>> Please correct me if my understanding of memory barriers is wrong here..
>
> Who? Me??? No we have paulmck for that ;)
>

Haha ;-)

Regards,
Srivatsa S. Bhat
On 02/11, Srivatsa S. Bhat wrote:
>
> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
> >> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> >> +{
> >> +	unsigned int cpu;
> >> +
> >> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> >
> > Why do we drop ourselves twice?  More to the point, why is it important to
> > drop ourselves first?
> >
>
> I don't see where we are dropping ourselves twice. Note that we are no longer
> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
> to manually drop ourselves. It doesn't matter whether we drop ourselves first
> or later.

Yes, but this just reflects its usage in cpu-hotplug. cpu goes away under
_write_lock.

Perhaps _write_lock/unlock should use for_each_possible_cpu() instead?

Hmm... I think this makes sense anyway. Otherwise, in theory,
percpu_write_lock(random_non_hotplug_lock) can race with cpu_up?

Oleg.
On 02/11/2013 01:43 AM, Oleg Nesterov wrote:
> On 02/11, Srivatsa S. Bhat wrote:
>>
>> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>>
>>> Why do we drop ourselves twice? More to the point, why is it important to
>>> drop ourselves first?
>>>
>>
>> I don't see where we are dropping ourselves twice. Note that we are no longer
>> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
>> to manually drop ourselves. It doesn't matter whether we drop ourselves first
>> or later.
>
> Yes, but this just reflects its usage in cpu-hotplug. cpu goes away under
> _write_lock.
>

Ah, right. I guess the code still has remnants from the older version in
which this locking scheme wasn't generic and was tied to cpu-hotplug alone..

> Perhaps _write_lock/unlock should use for_each_possible_cpu() instead?
>

Hmm, that wouldn't be too bad.

> Hmm... I think this makes sense anyway. Otherwise, in theory,
> percpu_write_lock(random_non_hotplug_lock) can race with cpu_up?
>

Yeah, makes sense. Will change it to for_each_possible_cpu(). And I had
previously fixed such races with lglocks with a complicated scheme (to avoid
the costly for_each_possible loop), which was finally rewritten to use
for_each_possible_cpu() for the sake of simplicity..

Regards,
Srivatsa S. Bhat
On Mon, Feb 11, 2013 at 01:39:24AM +0530, Srivatsa S. Bhat wrote:
> On 02/11/2013 01:20 AM, Oleg Nesterov wrote:
> > On 02/11, Srivatsa S. Bhat wrote:
> >>
> >> On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
> >>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> >>>>> +{
> >>>>> +	unsigned int cpu;
> >>>>> +
> >>>>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> >>>>
> >>>> Why do we drop ourselves twice? More to the point, why is it important to
> >>>> drop ourselves first?
> >>>
> >>> And don't we need mb() _before_ we clear ->writer_signal ?
> >>>
> >>
> >> Oh, right! Or, how about moving announce_writer_inactive() to _after_
> >> write_unlock()?
> >
> > Not sure this will help... but, either way it seems we have another
> > problem...
> >
> > percpu_rwlock tries to be "generic". This means we should "ignore" its
> > usage in hotplug, and _write_lock should not race with _write_unlock.
> >
>
> Yes, good point!
>
> > IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure
> > that this can't race with another write which wants to set this flag.
> > Perhaps it should be counter as well, and it should be protected by
> > the same ->global_rwlock, but _write_lock() should drop it before
> > sync_all_readers() and then take it again?
>
> Hmm, or we could just add an extra mb() like you suggested, and keep it
> simple...
>
> >
> >>>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> >>>>> +			       unsigned int cpu)
> >>>>> +{
> >>>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> >>>>
> >>>> As I understand it, the purpose of this memory barrier is to ensure
> >>>> that the stores in drop_writer_signal() happen before the reads from
> >>>> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> >>>> race between a new reader attempting to use the fastpath and this writer
> >>>> acquiring the lock. Unless I am confused, this must be smp_mb() rather
> >>>> than smp_rmb().
> >>>
> >>> And note that before sync_reader() we call announce_writer_active() which
> >>> already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
> >>> unneeded.
> >>>
> >>
> >> My intention was to help the writer see the ->reader_refcnt drop to zero
> >> ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.
> >
> > Hmm, interesting... Not sure, but can't really comment. However I can
> > answer your next question:
>
> Paul told me in another mail that I was expecting too much out of memory
> barriers, like increasing the speed of electrons and what not ;-)
> [ It would have been cool though, if it had such magical powers :P ]

"But because you have used the special mb_tachyonic instruction, the speed
of light is 600,000 km/s for the next five clock cycles"...  ;-)

							Thanx, Paul

> >> Please correct me if my understanding of memory barriers is wrong here..
> >
> > Who? Me??? No we have paulmck for that ;)
> >
>
> Haha ;-)
>
> Regards,
> Srivatsa S. Bhat
>
On Sun, Feb 10, 2013 at 11:54:17AM -0800, Paul E. McKenney wrote:
> On Sun, Feb 10, 2013 at 07:06:07PM +0100, Oleg Nesterov wrote:
> > On 02/08, Paul E. McKenney wrote:
>
> [ . . . ]
>
> > > > +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> > > > +			       unsigned int cpu)
> > > > +{
> > > > +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> > >
> > > As I understand it, the purpose of this memory barrier is to ensure
> > > that the stores in drop_writer_signal() happen before the reads from
> > > ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> > > race between a new reader attempting to use the fastpath and this writer
> > > acquiring the lock. Unless I am confused, this must be smp_mb() rather
> > > than smp_rmb().
> >
> > And note that before sync_reader() we call announce_writer_active() which
> > already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
> > unneeded.
> >
> > But, at the same time, could you confirm that we do not need another mb()
> > after sync_all_readers() in percpu_write_lock() ?  I mean, without mb(),
> > can't this reader_uses_percpu_refcnt() LOAD leak into the critical section
> > protected by ->global_rwlock?  Then this LOAD can be re-ordered with other
> > memory operations done by the writer.
>
> As soon as I get the rest of the way through Thomas's patchset.  ;-)

There is a memory barrier associated with write_lock(), but it is only
required to keep the critical section inside the lock -- and is permitted
to allow stuff outside of the lock to be reordered into the critical
section. So I believe we do indeed need an smp_mb() between
sync_all_readers() and write_lock() in percpu_write_lock().

Good eyes, Oleg!

							Thanx, Paul
diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
index 8dec8fe..6819bb8 100644
--- a/include/linux/percpu-rwlock.h
+++ b/include/linux/percpu-rwlock.h
@@ -68,4 +68,14 @@ extern void percpu_free_rwlock(struct percpu_rwlock *);
 	__percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key);	\
 })
 
+#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu)			\
+	(ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
+
+#define reader_nested_percpu(pcpu_rwlock)				\
+	(__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
+
+#define writer_active(pcpu_rwlock)					\
+	(__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
+
 #endif
+
diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
index 80dad93..992da5c 100644
--- a/lib/percpu-rwlock.c
+++ b/lib/percpu-rwlock.c
@@ -64,21 +64,145 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
 
 void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
 {
-	read_lock(&pcpu_rwlock->global_rwlock);
+	preempt_disable();
+
+	/* First and foremost, let the writer know that a reader is active */
+	this_cpu_inc(*pcpu_rwlock->reader_refcnt);
+
+	/*
+	 * If we are already using per-cpu refcounts, it is not safe to switch
+	 * the synchronization scheme. So continue using the refcounts.
+	 */
+	if (reader_nested_percpu(pcpu_rwlock)) {
+		goto out;
+	} else {
+		/*
+		 * The write to 'reader_refcnt' must be visible before we
+		 * read 'writer_signal'.
+		 */
+		smp_mb(); /* Paired with smp_rmb() in sync_reader() */
+
+		if (likely(!writer_active(pcpu_rwlock))) {
+			goto out;
+		} else {
+			/* Writer is active, so switch to global rwlock. */
+			read_lock(&pcpu_rwlock->global_rwlock);
+
+			/*
+			 * We might have raced with a writer going inactive
+			 * before we took the read-lock. So re-evaluate whether
+			 * we still need to hold the rwlock or if we can switch
+			 * back to per-cpu refcounts. (This also helps avoid
+			 * heterogeneous nesting of readers).
+			 */
+			if (writer_active(pcpu_rwlock))
+				this_cpu_dec(*pcpu_rwlock->reader_refcnt);
+			else
+				read_unlock(&pcpu_rwlock->global_rwlock);
+		}
+	}
+
+out:
+	/* Prevent reordering of any subsequent reads */
+	smp_rmb();
 }
 
 void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
 {
-	read_unlock(&pcpu_rwlock->global_rwlock);
+	/*
+	 * We never allow heterogeneous nesting of readers. So it is trivial
+	 * to find out the kind of reader we are, and undo the operation
+	 * done by our corresponding percpu_read_lock().
+	 */
+	if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
+		this_cpu_dec(*pcpu_rwlock->reader_refcnt);
+		smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
+	} else {
+		read_unlock(&pcpu_rwlock->global_rwlock);
+	}
+
+	preempt_enable();
+}
+
+static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
+				       unsigned int cpu)
+{
+	per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
+}
+
+static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
+				      unsigned int cpu)
+{
+	per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
+}
+
+static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		raise_writer_signal(pcpu_rwlock, cpu);
+
+	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
+}
+
+static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
+{
+	unsigned int cpu;
+
+	drop_writer_signal(pcpu_rwlock, smp_processor_id());
+
+	for_each_online_cpu(cpu)
+		drop_writer_signal(pcpu_rwlock, cpu);
+
+	smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
+}
+
+/*
+ * Wait for the reader to see the writer's signal and switch from percpu
+ * refcounts to global rwlock.
+ *
+ * If the reader is still using percpu refcounts, wait for him to switch.
+ * Else, we can safely go ahead, because either the reader has already
+ * switched over, or the next reader that comes along on that CPU will
+ * notice the writer's signal and will switch over to the rwlock.
+ */
+static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
+			       unsigned int cpu)
+{
+	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
+
+	while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
+		cpu_relax();
+}
+
+static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		sync_reader(pcpu_rwlock, cpu);
 }
 
 void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
 {
+	/*
+	 * Tell all readers that a writer is becoming active, so that they
+	 * start switching over to the global rwlock.
+	 */
+	announce_writer_active(pcpu_rwlock);
+	sync_all_readers(pcpu_rwlock);
 	write_lock(&pcpu_rwlock->global_rwlock);
 }
 
 void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
 {
+	/*
+	 * Inform all readers that we are done, so that they can switch back
+	 * to their per-cpu refcounts. (We don't need to wait for them to
+	 * see it).
+	 */
+	announce_writer_inactive(pcpu_rwlock);
 	write_unlock(&pcpu_rwlock->global_rwlock);
 }
Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
lock-ordering related problems (unlike per-cpu locks). However, global
rwlocks lead to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly.

Per-cpu counters can help solve the cache-line bouncing problem. So we
actually use the best of both: per-cpu counters (no-waiting) at the reader
side in the fast-path, and global rwlocks in the slowpath.

[ Fastpath = no writer is active; Slowpath = a writer is active ]

IOW, the readers just increment/decrement their per-cpu refcounts (disabling
interrupts during the updates, if necessary) when no writer is active.
When a writer becomes active, he signals all readers to switch to global
rwlocks for the duration of his activity. The readers switch over when it
is safe for them (ie., when they are about to start a fresh, non-nested
read-side critical section) and start using (holding) the global rwlock for
read in their subsequent critical sections.

The writer waits for every existing reader to switch, and then acquires the
global rwlock for write and enters his critical section. Later, the writer
signals all readers that he is done, and that they can go back to using
their per-cpu refcounts again.

Note that the lock-safety (despite the per-cpu scheme) comes from the fact
that the readers can *choose* _when_ to switch to rwlocks upon the writer's
signal. And the readers don't wait on anybody based on the per-cpu counters.
The only true synchronization that involves waiting at the reader-side in
this scheme, is the one arising from the global rwlock, which is safe from
circular locking dependency issues.

Reader-writer locks and per-cpu counters are recursive, so they can be used
in a nested fashion in the reader-path, which makes per-CPU rwlocks also
recursive. Also, this design of switching the synchronization scheme ensures
that you can safely nest and use these locks in a very flexible manner.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions
in this as well as previous designs. Thanks a lot Michael and Xiao!

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/percpu-rwlock.h |   10 +++
 lib/percpu-rwlock.c           |  128 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 136 insertions(+), 2 deletions(-)