Message ID: 20200519214547.352050-3-a.darwish@linutronix.de (mailing list archive)
State: New, archived
Series: seqlock: Extend seqcount API with associated locks
On Tue, 19 May 2020 23:45:24 +0200 "Ahmed S. Darwish" wrote: > > Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls") > implemented an optimization mechanism to exit the to-be-started LRU > drain operation (name it A) if another drain operation *started and > finished* while (A) was blocked on the LRU draining mutex. > > This was done through a seqcount latch, which is an abuse of its > semantics: > > 1. Seqcount latching should be used for the purpose of switching > between two storage places with sequence protection to allow > interruptible, preemptible writer sections. The optimization > mechanism has absolutely nothing to do with that. > > 2. The used raw_write_seqcount_latch() has two smp write memory > barriers to always insure one consistent storage place out of the > two storage places available. This extra smp_wmb() is redundant for > the optimization use case. > > Beside the API abuse, the semantics of a latch sequence counter was > force fitted into the optimization. What was actually meant is to track > generations of LRU draining operations, where "current lru draining > generation = x" implies that all generations 0 < n <= x are already > *scheduled* for draining. > > Remove the conceptually-inappropriate seqcount latch usage and manually > implement the optimization using a counter and SMP memory barriers. > > Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > --- > mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 47 insertions(+), 10 deletions(-) > > diff --git a/mm/swap.c b/mm/swap.c > index bf9a79fed62d..d6910eeed43d 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) > */ > void lru_add_drain_all(void) > { > - static seqcount_t seqcount = SEQCNT_ZERO(seqcount); > - static DEFINE_MUTEX(lock); > + /* > + * lru_drain_gen - Current generation of pages that could be in vectors > + * > + * (A) Definition: lru_drain_gen = x implies that all generations > + * 0 < n <= x are already scheduled for draining. > + * > + * This is an optimization for the highly-contended use case where a > + * user space workload keeps constantly generating a flow of pages > + * for each CPU. > + */ > + static unsigned int lru_drain_gen; > static struct cpumask has_work; > - int cpu, seq; > + static DEFINE_MUTEX(lock); > + int cpu, this_gen; > > /* > * Make sure nobody triggers this path before mm_percpu_wq is fully > @@ -725,21 +735,48 @@ void lru_add_drain_all(void) > if (WARN_ON(!mm_percpu_wq)) > return; > > - seq = raw_read_seqcount_latch(&seqcount); > + /* > + * (B) Cache the LRU draining generation number > + * > + * smp_rmb() ensures that the counter is loaded before the mutex is > + * taken. It pairs with the smp_wmb() inside the mutex critical section > + * at (D). > + */ > + this_gen = READ_ONCE(lru_drain_gen); > + smp_rmb(); > > mutex_lock(&lock); > > /* > - * Piggyback on drain started and finished while we waited for lock: > - * all pages pended at the time of our enter were drained from vectors. > + * (C) Exit the draining operation if a newer generation, from another > + * lru_add_drain_all(), was already scheduled for draining. Check (A).
> */ > - if (__read_seqcount_retry(&seqcount, seq)) > + if (unlikely(this_gen != lru_drain_gen)) > goto done; > > - raw_write_seqcount_latch(&seqcount); > + /* > + * (D) Increment generation number > + * > + * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical > + * section. > + * > + * This pairing must be done here, before the for_each_online_cpu loop > + * below which drains the page vectors. > + * > + * Let x, y, and z represent some system CPU numbers, where x < y < z. > + * Assume CPU #z is is in the middle of the for_each_online_cpu loop > + * below and has already reached CPU #y's per-cpu data. CPU #x comes > + * along, adds some pages to its per-cpu vectors, then calls > + * lru_add_drain_all(). > + * > + * If the paired smp_wmb() below is done at any later step, e.g. after > + * the loop, CPU #x will just exit at (C) and miss flushing out all of > + * its added pages. > + */ > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); > + smp_wmb(); Writer is inside mutex. > > cpumask_clear(&has_work); > - > for_each_online_cpu(cpu) { > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); > > @@ -766,7 +803,7 @@ void lru_add_drain_all(void) > { > lru_add_drain(); > } > -#endif > +#endif /* CONFIG_SMP */ > > /** > * release_pages - batched put_page() > -- > 2.20.1 The tiny diff below, inspired by this work, tries to trylock the mutex, switching the sequence counter's readers and writer across the mutex boundary. --- a/mm/swap.c +++ b/mm/swap.c @@ -714,10 +714,12 @@ static void lru_add_drain_per_cpu(struct */ void lru_add_drain_all(void) { - static seqcount_t seqcount = SEQCNT_ZERO(seqcount); + static unsigned int lru_drain_gen; static DEFINE_MUTEX(lock); static struct cpumask has_work; - int cpu, seq; + int cpu; + int loop = 0; + unsigned int seq; /* * Make sure nobody triggers this path before mm_percpu_wq is fully @@ -726,18 +728,12 @@ void lru_add_drain_all(void) if (WARN_ON(!mm_percpu_wq)) return; - seq = raw_read_seqcount_latch(&seqcount); + seq = lru_drain_gen++; + smp_mb(); - mutex_lock(&lock); - - /* - * Piggyback on drain started and finished while we waited for lock: - * all pages pended at the time of our enter were drained from vectors. - */ - if (__read_seqcount_retry(&seqcount, seq)) - goto done; - - raw_write_seqcount_latch(&seqcount); + if (!mutex_trylock(&lock)) + return; +more_work: cpumask_clear(&has_work); @@ -759,7 +755,11 @@ void lru_add_drain_all(void) for_each_cpu(cpu, &has_work) flush_work(&per_cpu(lru_add_drain_work, cpu)); -done: + smp_mb(); + if (++seq != lru_drain_gen && loop++ < 128) + goto more_work; + + smp_mb(); mutex_unlock(&lock); } #else --
On 20/05/2020 00.45, Ahmed S. Darwish wrote: > Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls") > implemented an optimization mechanism to exit the to-be-started LRU > drain operation (name it A) if another drain operation *started and > finished* while (A) was blocked on the LRU draining mutex. > > This was done through a seqcount latch, which is an abuse of its > semantics: > > 1. Seqcount latching should be used for the purpose of switching > between two storage places with sequence protection to allow > interruptible, preemptible writer sections. The optimization > mechanism has absolutely nothing to do with that. > > 2. The used raw_write_seqcount_latch() has two smp write memory > barriers to always insure one consistent storage place out of the > two storage places available. This extra smp_wmb() is redundant for > the optimization use case. > > Beside the API abuse, the semantics of a latch sequence counter was > force fitted into the optimization. What was actually meant is to track > generations of LRU draining operations, where "current lru draining > generation = x" implies that all generations 0 < n <= x are already > *scheduled* for draining. > > Remove the conceptually-inappropriate seqcount latch usage and manually > implement the optimization using a counter and SMP memory barriers. Well, I thought it fits perfectly =) Maybe it's worth to add helpers with appropriate semantics? This is pretty common pattern. > > Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > --- > mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 47 insertions(+), 10 deletions(-) > > diff --git a/mm/swap.c b/mm/swap.c > index bf9a79fed62d..d6910eeed43d 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) > */ > void lru_add_drain_all(void) > { > - static seqcount_t seqcount = SEQCNT_ZERO(seqcount); > - static DEFINE_MUTEX(lock); > + /* > + * lru_drain_gen - Current generation of pages that could be in vectors > + * > + * (A) Definition: lru_drain_gen = x implies that all generations > + * 0 < n <= x are already scheduled for draining. > + * > + * This is an optimization for the highly-contended use case where a > + * user space workload keeps constantly generating a flow of pages > + * for each CPU. > + */ > + static unsigned int lru_drain_gen; > static struct cpumask has_work; > - int cpu, seq; > + static DEFINE_MUTEX(lock); > + int cpu, this_gen; > > /* > * Make sure nobody triggers this path before mm_percpu_wq is fully > @@ -725,21 +735,48 @@ void lru_add_drain_all(void) > if (WARN_ON(!mm_percpu_wq)) > return; > > - seq = raw_read_seqcount_latch(&seqcount); > + /* > + * (B) Cache the LRU draining generation number > + * > + * smp_rmb() ensures that the counter is loaded before the mutex is > + * taken. It pairs with the smp_wmb() inside the mutex critical section > + * at (D). > + */ > + this_gen = READ_ONCE(lru_drain_gen); > + smp_rmb(); > > mutex_lock(&lock); > > /* > - * Piggyback on drain started and finished while we waited for lock: > - * all pages pended at the time of our enter were drained from vectors. > + * (C) Exit the draining operation if a newer generation, from another > + * lru_add_drain_all(), was already scheduled for draining. Check (A). 
> */ > - if (__read_seqcount_retry(&seqcount, seq)) > + if (unlikely(this_gen != lru_drain_gen)) > goto done; > > - raw_write_seqcount_latch(&seqcount); > + /* > + * (D) Increment generation number > + * > + * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical > + * section. > + * > + * This pairing must be done here, before the for_each_online_cpu loop > + * below which drains the page vectors. > + * > + * Let x, y, and z represent some system CPU numbers, where x < y < z. > + * Assume CPU #z is is in the middle of the for_each_online_cpu loop > + * below and has already reached CPU #y's per-cpu data. CPU #x comes > + * along, adds some pages to its per-cpu vectors, then calls > + * lru_add_drain_all(). > + * > + * If the paired smp_wmb() below is done at any later step, e.g. after > + * the loop, CPU #x will just exit at (C) and miss flushing out all of > + * its added pages. > + */ > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); > + smp_wmb(); > > cpumask_clear(&has_work); > - > for_each_online_cpu(cpu) { > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); > > @@ -766,7 +803,7 @@ void lru_add_drain_all(void) > { > lru_add_drain(); > } > -#endif > +#endif /* CONFIG_SMP */ > > /** > * release_pages - batched put_page() >
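For illustration of the helpers Konstantin suggests, one possible shape is sketched below. The names, the struct, and the exact barrier strength are invented here purely as a sketch (the barrier question is debated further down the thread); nothing like this exists in-tree at this point:

        /* Hypothetical helpers for the "piggyback on a concurrent flusher" pattern. */
        struct flush_gen {
                unsigned int gen;
        };

        /* Sample the generation before blocking on the flush lock. */
        static inline unsigned int flush_gen_read(struct flush_gen *fg)
        {
                return smp_load_acquire(&fg->gen);
        }

        /*
         * With the flush lock held: did another flusher, which started after our
         * items were queued, already schedule a flush that covers them?
         */
        static inline bool flush_gen_piggybacked(struct flush_gen *fg, unsigned int seen)
        {
                return READ_ONCE(fg->gen) != seen;
        }

        /* With the flush lock held: announce a new flush generation. */
        static inline void flush_gen_advance(struct flush_gen *fg)
        {
                WRITE_ONCE(fg->gen, fg->gen + 1);
                smp_mb();       /* publish the generation before scanning per-CPU state */
        }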
On Wed, May 20, 2020 at 03:22:15PM +0300, Konstantin Khlebnikov wrote: > On 20/05/2020 00.45, Ahmed S. Darwish wrote: > > Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls") > > implemented an optimization mechanism to exit the to-be-started LRU > > drain operation (name it A) if another drain operation *started and > > finished* while (A) was blocked on the LRU draining mutex. That commit is horrible... > Well, I thought it fits perfectly =) > > Maybe it's worth to add helpers with appropriate semantics? > This is pretty common pattern. Where's more sites? > > @@ -725,21 +735,48 @@ void lru_add_drain_all(void) > > if (WARN_ON(!mm_percpu_wq)) > > return; > > - seq = raw_read_seqcount_latch(&seqcount); > > mutex_lock(&lock); > > /* > > - * Piggyback on drain started and finished while we waited for lock: > > - * all pages pended at the time of our enter were drained from vectors. > > */ > > - if (__read_seqcount_retry(&seqcount, seq)) > > goto done; Since there is no ordering in raw_read_seqcount_latch(), and mutex_lock() is an ACQUIRE, there's no guarantee the read actually happens before the mutex is acquired. > > - raw_write_seqcount_latch(&seqcount); > > cpumask_clear(&has_work);
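A sketch of the reordering Peter is pointing at (annotation only, not part of any posted patch). mutex_lock() has ACQUIRE semantics, which only forbids *later* accesses from moving up before it; a plain load placed before it may still be executed after the lock is taken:

        seq = raw_read_seqcount_latch(&seqcount);   /* plain load, no ordering */
        mutex_lock(&lock);                          /* ACQUIRE: holds back later accesses only */

        /*
         * The CPU is therefore allowed to execute this as if it were written:
         *
         *      mutex_lock(&lock);
         *      seq = raw_read_seqcount_latch(&seqcount);
         *
         * in which case the retry check compares against a value sampled after
         * the wait for the mutex, and a drain that completed while we were
         * blocked is no longer detected.
         */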
On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote: > @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) > */ > void lru_add_drain_all(void) > { > + static unsigned int lru_drain_gen; > static struct cpumask has_work; > + static DEFINE_MUTEX(lock); > + int cpu, this_gen; > > /* > * Make sure nobody triggers this path before mm_percpu_wq is fully > @@ -725,21 +735,48 @@ void lru_add_drain_all(void) > if (WARN_ON(!mm_percpu_wq)) > return; > > + this_gen = READ_ONCE(lru_drain_gen); > + smp_rmb(); this_gen = smp_load_acquire(&lru_drain_gen); > > mutex_lock(&lock); > > /* > + * (C) Exit the draining operation if a newer generation, from another > + * lru_add_drain_all(), was already scheduled for draining. Check (A). > */ > + if (unlikely(this_gen != lru_drain_gen)) > goto done; > > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); > + smp_wmb(); You can leave this smp_wmb() out and rely on the smp_mb() implied by queue_work_on()'s test_and_set_bit(). > cpumask_clear(&has_work); > - > for_each_online_cpu(cpu) { > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); > While you're here, do: s/cpumask_set_cpu/__&/ > @@ -766,7 +803,7 @@ void lru_add_drain_all(void) > { > lru_add_drain(); > } > -#endif > +#endif /* CONFIG_SMP */ > > /** > * release_pages - batched put_page()
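Putting Peter's three suggestions together, the fast path would look roughly as below. This is a sketch only, not the posted patch: cpu_needs_drain() is a hypothetical stand-in for the per-CPU pagevec checks that the quoted hunks elide, and whether dropping the explicit barrier is safe is exactly what the rest of the thread goes on to discuss.

        void lru_add_drain_all(void)
        {
                static unsigned int lru_drain_gen;
                static struct cpumask has_work;
                static DEFINE_MUTEX(lock);
                unsigned int this_gen;
                int cpu;

                if (WARN_ON(!mm_percpu_wq))
                        return;

                /* (B) ACQUIRE orders the generation load before taking the mutex. */
                this_gen = smp_load_acquire(&lru_drain_gen);

                mutex_lock(&lock);

                /* (C) A drain scheduled after our pages were added covers us. */
                if (unlikely(this_gen != lru_drain_gen))
                        goto done;

                /*
                 * (D) No explicit smp_wmb(): rely on the smp_mb() implied by
                 * queue_work_on()'s test_and_set_bit() to publish the new
                 * generation before the per-CPU work is queued.
                 */
                WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);

                cpumask_clear(&has_work);
                for_each_online_cpu(cpu) {
                        struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

                        if (cpu_needs_drain(cpu)) {     /* hypothetical stand-in */
                                INIT_WORK(work, lru_add_drain_per_cpu);
                                queue_work_on(cpu, mm_percpu_wq, work);
                                __cpumask_set_cpu(cpu, &has_work);
                        }
                }

                for_each_cpu(cpu, &has_work)
                        flush_work(&per_cpu(lru_add_drain_work, cpu));

        done:
                mutex_unlock(&lock);
        }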
On 2020-05-22 16:57:07 [+0200], Peter Zijlstra wrote: > > @@ -725,21 +735,48 @@ void lru_add_drain_all(void) > > if (WARN_ON(!mm_percpu_wq)) > > return; > > > > > + this_gen = READ_ONCE(lru_drain_gen); > > + smp_rmb(); > > this_gen = smp_load_acquire(&lru_drain_gen); > > > > mutex_lock(&lock); > > > > /* > > + * (C) Exit the draining operation if a newer generation, from another > > + * lru_add_drain_all(), was already scheduled for draining. Check (A). > > */ > > + if (unlikely(this_gen != lru_drain_gen)) > > goto done; > > > > > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); > > + smp_wmb(); > > You can leave this smp_wmb() out and rely on the smp_mb() implied by > queue_work_on()'s test_and_set_bit(). This is to avoid smp_store_release() ? Sebastian
On Fri, May 22, 2020 at 05:17:05PM +0200, Sebastian A. Siewior wrote: > On 2020-05-22 16:57:07 [+0200], Peter Zijlstra wrote: > > > @@ -725,21 +735,48 @@ void lru_add_drain_all(void) > > > if (WARN_ON(!mm_percpu_wq)) > > > return; > > > > > > > > + this_gen = READ_ONCE(lru_drain_gen); > > > + smp_rmb(); > > > > this_gen = smp_load_acquire(&lru_drain_gen); > > > > > > mutex_lock(&lock); > > > > > > /* > > > + * (C) Exit the draining operation if a newer generation, from another > > > + * lru_add_drain_all(), was already scheduled for draining. Check (A). > > > */ > > > + if (unlikely(this_gen != lru_drain_gen)) > > > goto done; > > > > > > > > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); > > > + smp_wmb(); > > > > You can leave this smp_wmb() out and rely on the smp_mb() implied by > > queue_work_on()'s test_and_set_bit(). > > This is to avoid smp_store_release() ? store_release would have the barrier on the other end. If you read the comments (I so helpfully cut out) you'll see it wants to order against later stores, not earlier.
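A side-by-side of the two orderings being contrasted here (annotation only, not from the posted patches). The release orders the store against everything *before* it, while site (D) needs the store published before the drain work that comes *after* it:

        /* What a release would give at (D): store ordered after earlier accesses. */
        smp_store_release(&lru_drain_gen, lru_drain_gen + 1);

        /* What (D) actually wants: store ordered before the later drain loop. */
        WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
        smp_wmb();      /* or the smp_mb() implied by queue_work_on() */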
Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote: > > @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) > > */ > > void lru_add_drain_all(void) > > { > Re-adding cut-out comment for context: /* * lru_drain_gen - Current generation of pages that could be in vectors * * (A) Definition: lru_drain_gen = x implies that all generations * 0 < n <= x are already scheduled for draining. * * This is an optimization for the highly-contended use case where a * user space workload keeps constantly generating a flow of pages * for each CPU. */ > > + static unsigned int lru_drain_gen; > > static struct cpumask has_work; > > + static DEFINE_MUTEX(lock); > > + int cpu, this_gen; > > > > /* > > * Make sure nobody triggers this path before mm_percpu_wq is fully > > @@ -725,21 +735,48 @@ void lru_add_drain_all(void) > > if (WARN_ON(!mm_percpu_wq)) > > return; > > > Re-adding cut-out comment for context: /* * (B) Cache the LRU draining generation number * * smp_rmb() ensures that the counter is loaded before the mutex is * taken. It pairs with the smp_wmb() inside the mutex critical section * at (D). */ > > + this_gen = READ_ONCE(lru_drain_gen); > > + smp_rmb(); > > this_gen = smp_load_acquire(&lru_drain_gen); ACK. will do. > > > > mutex_lock(&lock); > > > > /* > > + * (C) Exit the draining operation if a newer generation, from another > > + * lru_add_drain_all(), was already scheduled for draining. Check (A). > > */ > > + if (unlikely(this_gen != lru_drain_gen)) > > goto done; > > > Re-adding cut-out comment for context: /* * (D) Increment generation number * * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical * section. * * This pairing must be done here, before the for_each_online_cpu loop * below which drains the page vectors. * * Let x, y, and z represent some system CPU numbers, where x < y < z. * Assume CPU #z is is in the middle of the for_each_online_cpu loop * below and has already reached CPU #y's per-cpu data. CPU #x comes * along, adds some pages to its per-cpu vectors, then calls * lru_add_drain_all(). * * If the paired smp_wmb() below is done at any later step, e.g. after * the loop, CPU #x will just exit at (C) and miss flushing out all of * its added pages. */ > > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); > > + smp_wmb(); > > You can leave this smp_wmb() out and rely on the smp_mb() implied by > > queue_work_on()'s test_and_set_bit(). > Won't this be too implicit? Isn't it possible that, over the years, queue_work_on() implementation changes and the test_and_set_bit()/smp_mb() gets removed? If that happens, this commit will get *silently* broken and the local CPU pages won't be drained. > > cpumask_clear(&has_work); > > - > > for_each_online_cpu(cpu) { > > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); > > > > While you're here, do: > > s/cpumask_set_cpu/__&/ > ACK. Thanks, -- Ahmed S. Darwish Linutronix GmbH
On Mon, May 25, 2020 at 05:24:01PM +0200, Ahmed S. Darwish wrote: > Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote: > > > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); > > > + smp_wmb(); > > > > You can leave this smp_wmb() out and rely on the smp_mb() implied by > > queue_work_on()'s test_and_set_bit(). > > > > Won't this be too implicit? > > Isn't it possible that, over the years, queue_work_on() impementation > changes and the test_and_set_bit()/smp_mb() gets removed? > > If that happens, this commit will get *silently* broken and the local > CPU pages won't be drained. Add a comment to queue_work_on() that points here? That way people are aware.
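One possible wording for the note Peter suggests, placed next to queue_work_on() in kernel/workqueue.c (invented here for illustration only; no such comment exists at this point):

        /*
         * Note: lru_add_drain_all() relies on the full barrier implied by
         * test_and_set_bit(WORK_STRUCT_PENDING_BIT) in here to order its
         * generation-counter update against the queued drain work.  Do not
         * weaken that ordering without auditing such callers.
         */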
Hi, This optimization is broken. The main concern here: Is it possible that lru_add_drain_all() _would_ have drained pagevec X, but then aborted because another lru_add_drain_all() is underway and that other task will _not_ drain pagevec X? I claim the answer is yes! My suggested changes are inline below. I attached a litmus test to verify it. On 2020-05-22, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote: >> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) >> */ >> void lru_add_drain_all(void) >> { > >> + static unsigned int lru_drain_gen; >> static struct cpumask has_work; >> + static DEFINE_MUTEX(lock); >> + int cpu, this_gen; >> >> /* >> * Make sure nobody triggers this path before mm_percpu_wq is fully >> @@ -725,21 +735,48 @@ void lru_add_drain_all(void) >> if (WARN_ON(!mm_percpu_wq)) >> return; >> An smp_mb() is needed here. /* * Guarantee the pagevec counter stores visible by * this CPU are visible to other CPUs before loading * the current drain generation. */ smp_mb(); >> + this_gen = READ_ONCE(lru_drain_gen); >> + smp_rmb(); > > this_gen = smp_load_acquire(&lru_drain_gen); >> >> mutex_lock(&lock); >> >> /* >> + * (C) Exit the draining operation if a newer generation, from another >> + * lru_add_drain_all(), was already scheduled for draining. Check (A). >> */ >> + if (unlikely(this_gen != lru_drain_gen)) >> goto done; >> > >> + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); >> + smp_wmb(); Instead of smp_wmb(), this needs to be a full memory barrier. /* * Guarantee the new drain generation is stored before * loading the pagevec counters. */ smp_mb(); > You can leave this smp_wmb() out and rely on the smp_mb() implied by > queue_work_on()'s test_and_set_bit(). > >> cpumask_clear(&has_work); >> - >> for_each_online_cpu(cpu) { >> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); >> > > While you're here, do: > > s/cpumask_set_cpu/__&/ > >> @@ -766,7 +803,7 @@ void lru_add_drain_all(void) >> { >> lru_add_drain(); >> } >> -#endif >> +#endif /* CONFIG_SMP */ >> >> /** >> * release_pages - batched put_page() For the litmus test: 1:rx=0 (P1 did not see the pagevec counter) 2:rx=1 (P2 _would_ have seen the pagevec counter) 2:ry1=0 /\ 2:ry2=1 (P2 aborted due to optimization) Changing the smp_mb() back to smp_wmb() in P1 and removing the smp_mb() in P2 represents this patch. And it shows that sometimes P2 will abort even though it would have drained the pagevec and P1 did not drain the pagevec. This is ugly as hell. And there may be other memory barrier types to make it pretty. But as is, memory barriers are missing.
John Ogness C lru_add_drain_all (* * x is a pagevec counter * y is @lru_drain_gen * z is @lock *) { } P0(int *x) { // mark pagevec for draining WRITE_ONCE(*x, 1); } P1(int *x, int *y, int *z) { int rx; int rz; // mutex_lock(&lock); rz = cmpxchg_acquire(z, 0, 1); if (rz == 0) { // WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); WRITE_ONCE(*y, 1); // guarantee lru_drain_gen store before loading pagevec smp_mb(); // if (pagevec_count(...)) rx = READ_ONCE(*x); // mutex_unlock(&lock); rz = cmpxchg_release(z, 1, 2); } } P2(int *x, int *y, int *z) { int rx; int ry1; int ry2; int rz; // the pagevec counter as visible now to this CPU rx = READ_ONCE(*x); // guarantee pagevec store before loading lru_drain_gen smp_mb(); // this_gen = READ_ONCE(lru_drain_gen); smp_rmb(); ry1 = smp_load_acquire(y); // mutex_lock(&lock) - acquired after P1 rz = cmpxchg_acquire(z, 2, 3); if (rz == 2) { // if (unlikely(this_gen != lru_drain_gen)) ry2 = READ_ONCE(*y); } } locations [x; y; z] exists (1:rx=0 /\ 2:rx=1 /\ 2:ry1=0 /\ 2:ry2=1)
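Assuming the herdtools7 suite is installed, the litmus test above can be checked against the kernel's memory model from a kernel source tree (the saved file name is arbitrary):

        $ cd tools/memory-model
        $ herd7 -conf linux-kernel.cfg /path/to/lru_add_drain_all.litmus

herd7 reports whether the "exists" clause is reachable: "Sometimes" means the aborted-but-needed drain can happen with the given barriers, "Never" means the barriers rule it out.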
diff --git a/mm/swap.c b/mm/swap.c index bf9a79fed62d..d6910eeed43d 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) */ void lru_add_drain_all(void) { - static seqcount_t seqcount = SEQCNT_ZERO(seqcount); - static DEFINE_MUTEX(lock); + /* + * lru_drain_gen - Current generation of pages that could be in vectors + * + * (A) Definition: lru_drain_gen = x implies that all generations + * 0 < n <= x are already scheduled for draining. + * + * This is an optimization for the highly-contended use case where a + * user space workload keeps constantly generating a flow of pages + * for each CPU. + */ + static unsigned int lru_drain_gen; static struct cpumask has_work; - int cpu, seq; + static DEFINE_MUTEX(lock); + int cpu, this_gen; /* * Make sure nobody triggers this path before mm_percpu_wq is fully @@ -725,21 +735,48 @@ void lru_add_drain_all(void) if (WARN_ON(!mm_percpu_wq)) return; - seq = raw_read_seqcount_latch(&seqcount); + /* + * (B) Cache the LRU draining generation number + * + * smp_rmb() ensures that the counter is loaded before the mutex is + * taken. It pairs with the smp_wmb() inside the mutex critical section + * at (D). + */ + this_gen = READ_ONCE(lru_drain_gen); + smp_rmb(); mutex_lock(&lock); /* - * Piggyback on drain started and finished while we waited for lock: - * all pages pended at the time of our enter were drained from vectors. + * (C) Exit the draining operation if a newer generation, from another + * lru_add_drain_all(), was already scheduled for draining. Check (A). */ - if (__read_seqcount_retry(&seqcount, seq)) + if (unlikely(this_gen != lru_drain_gen)) goto done; - raw_write_seqcount_latch(&seqcount); + /* + * (D) Increment generation number + * + * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical + * section. + * + * This pairing must be done here, before the for_each_online_cpu loop + * below which drains the page vectors. + * + * Let x, y, and z represent some system CPU numbers, where x < y < z. + * Assume CPU #z is is in the middle of the for_each_online_cpu loop + * below and has already reached CPU #y's per-cpu data. CPU #x comes + * along, adds some pages to its per-cpu vectors, then calls + * lru_add_drain_all(). + * + * If the paired smp_wmb() below is done at any later step, e.g. after + * the loop, CPU #x will just exit at (C) and miss flushing out all of + * its added pages. + */ + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1); + smp_wmb(); cpumask_clear(&has_work); - for_each_online_cpu(cpu) { struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); @@ -766,7 +803,7 @@ void lru_add_drain_all(void) { lru_add_drain(); } -#endif +#endif /* CONFIG_SMP */ /** * release_pages - batched put_page()
Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls") implemented an optimization mechanism to exit the to-be-started LRU drain operation (name it A) if another drain operation *started and finished* while (A) was blocked on the LRU draining mutex. This was done through a seqcount latch, which is an abuse of its semantics: 1. Seqcount latching should be used for the purpose of switching between two storage places with sequence protection to allow interruptible, preemptible writer sections. The optimization mechanism has absolutely nothing to do with that. 2. The used raw_write_seqcount_latch() has two smp write memory barriers to always ensure one consistent storage place out of the two storage places available. This extra smp_wmb() is redundant for the optimization use case. Besides the API abuse, the semantics of a latch sequence counter were force-fitted into the optimization. What was actually meant is to track generations of LRU draining operations, where "current lru draining generation = x" implies that all generations 0 < n <= x are already *scheduled* for draining. Remove the conceptually-inappropriate seqcount latch usage and manually implement the optimization using a counter and SMP memory barriers. Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> --- mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 47 insertions(+), 10 deletions(-)
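For readers outside the kernel, a minimal userspace analog of the generation piggyback described in the commit message, using C11 atomics and a pthread mutex instead of the kernel primitives. Illustration only; seq_cst is used as a conservative stand-in for the barrier placement debated in this thread, and drain_everything() stands in for the per-CPU drain loop:

        #include <stdatomic.h>
        #include <pthread.h>
        #include <stdio.h>

        static atomic_uint drain_gen;
        static pthread_mutex_t drain_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Stand-in for the expensive per-CPU drain loop. */
        static void drain_everything(void)
        {
                puts("draining");
        }

        static void drain_all(void)
        {
                /* (B) Sample the generation before blocking on the lock. */
                unsigned int this_gen = atomic_load(&drain_gen);

                pthread_mutex_lock(&drain_lock);

                /* (C) A drain scheduled after our items were added covers us. */
                if (this_gen != atomic_load(&drain_gen)) {
                        pthread_mutex_unlock(&drain_lock);
                        return;
                }

                /* (D) Publish the new generation, then do the actual work. */
                atomic_fetch_add(&drain_gen, 1);
                drain_everything();

                pthread_mutex_unlock(&drain_lock);
        }

        int main(void)
        {
                drain_all();
                drain_all();
                return 0;
        }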