| Message ID   | 20190711221205.29889-1-daniel.m.jordan@oracle.com (mailing list archive) |
|--------------|---------------------------------------------------------------------------|
| State        | Changes Requested                                                         |
| Delegated to | Herbert Xu                                                                |
| Series       | padata: use smp_mb in padata_reorder to avoid orphaned padata jobs       |
Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
>
>          CPU0                                 CPU1
>
>  padata_reorder                          padata_do_serial
>    LOAD reorder_objects  // 0
>                                            INC reorder_objects  // 1
>                                            padata_reorder
>                                              TRYLOCK pd->lock  // failed
>    UNLOCK pd->lock

I think this can't happen because CPU1 won't call padata_reorder
at all as it's the wrong CPU so it gets pushed back onto a work
queue which will go back to CPU0.

Steffen, could you please take a look at this as there clearly
is a problem here?

Thanks,
On Fri, Jul 12, 2019 at 06:06:36PM +0800, Herbert Xu wrote:
> Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> >
> >          CPU0                                 CPU1
> >
> >  padata_reorder                          padata_do_serial
> >    LOAD reorder_objects  // 0
> >                                            INC reorder_objects  // 1
> >                                            padata_reorder
> >                                              TRYLOCK pd->lock  // failed
> >    UNLOCK pd->lock
>
> I think this can't happen because CPU1 won't call padata_reorder
> at all as it's the wrong CPU so it gets pushed back onto a work
> queue which will go back to CPU0.
>
> Steffen, could you please take a look at this as there clearly
> is a problem here?

I'm currently travelling, will have a look at it when I'm back.
On Fri, Jul 12, 2019 at 12:10:12PM +0200, Steffen Klassert wrote:
> On Fri, Jul 12, 2019 at 06:06:36PM +0800, Herbert Xu wrote:
> > Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> > >
> > >          CPU0                                 CPU1
> > >
> > >  padata_reorder                          padata_do_serial
> > >    LOAD reorder_objects  // 0
> > >                                            INC reorder_objects  // 1
> > >                                            padata_reorder
> > >                                              TRYLOCK pd->lock  // failed
> > >    UNLOCK pd->lock
> >
> > I think this can't happen because CPU1 won't call padata_reorder
> > at all as it's the wrong CPU so it gets pushed back onto a work
> > queue which will go back to CPU0.

Thanks for looking at this.

When you say the wrong CPU, I think you're referring to the reorder_via_wq
logic in padata_do_serial.  If so, I think my explanation was unclear, because
there were two padata jobs in flight and my tracepoints showed neither of them
punted padata_reorder to a workqueue.

Let me expand on this, hopefully it helps.  pcrypt used CPU 5 for its callback
CPU.  The first job in question, with sequence number 16581, hashed to CPU 21
on my system.  This is a more complete version of what led to the hanging
modprobe command.

  modprobe (CPU2)              kworker/21:1-293 (CPU21)           kworker/5:2-276 (CPU5)
--------------------------     ------------------------           ----------------------
<submit job, seq_nr=16581>
  ...
  padata_do_parallel
    queue_work_on(21, ...)
<sleeps>
                               padata_parallel_worker
                                 pcrypt_aead_dec
                                   padata_do_serial
                                     padata_reorder
                                     |  padata_get_next  // returns job, seq_nr=16581
                                     |  // serialize seq_nr=16581
                                     |  queue_work_on(5, ...)
                                     |  padata_get_next  // returns -EINPROGRESS
                                     |  // padata_reorder doesn't return yet!
                                     |  |
                                     |  |                          padata_serial_worker
                                     |  |                            pcrypt_aead_serial
                                     |  |                              <wake up modprobe>
                                     |  |                          <worker finishes>
<submit job, seq_nr=16582>           |  |
  ...                                |  |
  padata_do_parallel                 |  |
    queue_work_on(22, ...)           |  |
                                     |  |  (LOAD reorder_objects as 0 somewhere
<sleeps>                             |  |   in here, before the INC)
                                     |  |
                                     |  |      kworker/22:1-291 (CPU22)
                                     |  |      ------------------------
                                     |  |      padata_parallel_worker
                                     |  |        pcrypt_aead_dec
                                     |  |          padata_do_serial
                                     |  |            // INC reorder_objects to 1
                                     |  |            padata_reorder
                                     |  |              // trylock fail, CPU21 _should_
                                     |  |              // serialize 16582 but doesn't
                                     |  |      <worker finishes>
                                     |  // deletes pd->timer
                                     // padata_reorder returns
                               <worker finishes>
<keeps on sleeping lazily>

My tracepoints showed CPU22 increased reorder_objects to 1 but CPU21 read it
as 0.

I think adding the full barrier guarantees the following ordering, and the
memory model people can correct me if I'm wrong:

        CPU21                           CPU22
------------------------        --------------------------
UNLOCK pd->lock
smp_mb()
LOAD reorder_objects
                                 INC reorder_objects
                                 spin_unlock(&pqueue->reorder.lock)  // release barrier
                                 TRYLOCK pd->lock

So if CPU22 has incremented reorder_objects but CPU21 reads 0 for it, CPU21
should also have unlocked pd->lock so CPU22 can get it and serialize any
remaining jobs.
On Fri, Jul 12, 2019 at 12:07:37PM -0400, Daniel Jordan wrote:
>
>   modprobe (CPU2)              kworker/21:1-293 (CPU21)           kworker/5:2-276 (CPU5)
> --------------------------     ------------------------           ----------------------
> <submit job, seq_nr=16581>
>   ...
>   padata_do_parallel
>     queue_work_on(21, ...)
> <sleeps>
>                                padata_parallel_worker
>                                  pcrypt_aead_dec
>                                    padata_do_serial
>                                      padata_reorder

This can't happen because if the job started on CPU2 then it must
go back to CPU2 for completion.  IOW padata_do_serial should be
punting this to a work queue for CPU2 rather than calling
padata_reorder on CPU21.

Cheers,
On Sat, Jul 13, 2019 at 01:03:21PM +0800, Herbert Xu wrote:
> On Fri, Jul 12, 2019 at 12:07:37PM -0400, Daniel Jordan wrote:
> >
> >   modprobe (CPU2)              kworker/21:1-293 (CPU21)           kworker/5:2-276 (CPU5)
> > --------------------------     ------------------------           ----------------------
> > <submit job, seq_nr=16581>
> >   ...
> >   padata_do_parallel
> >     queue_work_on(21, ...)
> > <sleeps>
> >                                padata_parallel_worker
> >                                  pcrypt_aead_dec
> >                                    padata_do_serial
> >                                      padata_reorder
>
> This can't happen because if the job started on CPU2 then it must
> go back to CPU2 for completion.  IOW padata_do_serial should be
> punting this to a work queue for CPU2 rather than calling
> padata_reorder on CPU21.

I've been wrong before plenty of times, and there's nothing preventing this
from being one of those times :) , but in this case I believe what I'm showing
is correct.

The padata_do_serial call for a given job ensures padata_reorder runs on the
CPU that the job hashed to in padata_do_parallel, which is not necessarily the
same CPU as the one that padata_do_parallel itself ran on.

In this case, the padata job in question started via padata_do_parallel, where
it hashed to CPU 21:

  padata_do_parallel  // ran on CPU 2
    ...
    target_cpu = padata_cpu_hash(pd);  // target_cpu == 21
    padata->cpu = target_cpu;
    ...
    queue_work_on(21, ...)

The corresponding kworker then started:

  padata_parallel_worker  // bound to CPU 21
    pcrypt_aead_dec
      padata_do_serial
        padata_reorder

Daniel
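For readers following the thread, here is a simplified sketch of the
padata_do_serial() dispatch Daniel is describing. It is paraphrased from the
5.2-era kernel/padata.c rather than copied verbatim, so treat details such as
the reorder_work member and pd->pinst->wq as assumptions about that code:

    /*
     * Sketch (not the actual kernel source): the finished job is added to
     * the reorder queue of the CPU it hashed to in padata_do_parallel()
     * (padata->cpu), and padata_reorder() is run on that CPU -- directly if
     * we already run there, otherwise via a work item bound to it.
     */
    void padata_do_serial(struct padata_priv *padata)
    {
        struct parallel_data *pd = padata->pd;
        struct padata_parallel_queue *pqueue;
        int cpu, reorder_via_wq = 0;

        cpu = get_cpu();
        if (cpu != padata->cpu) {
            /* wrong CPU: punt padata_reorder to a worker on padata->cpu */
            reorder_via_wq = 1;
            cpu = padata->cpu;
        }

        pqueue = per_cpu_ptr(pd->pqueue, cpu);

        spin_lock(&pqueue->reorder.lock);
        atomic_inc(&pd->reorder_objects);
        list_add_tail(&padata->list, &pqueue->reorder.list);
        spin_unlock(&pqueue->reorder.lock);

        put_cpu();

        if (reorder_via_wq)
            queue_work_on(cpu, pd->pinst->wq, &pqueue->reorder_work);
        else
            padata_reorder(pd);
    }

In the trace above the job hashed to CPU 21 and the parallel worker already
runs there, so padata_reorder() is called directly rather than punted.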
On Mon, Jul 15, 2019 at 12:10:46PM -0400, Daniel Jordan wrote:
>
> I've been wrong before plenty of times, and there's nothing preventing this
> from being one of those times :) , but in this case I believe what I'm showing
> is correct.
>
> The padata_do_serial call for a given job ensures padata_reorder runs on the
> CPU that the job hashed to in padata_do_parallel, which is not necessarily the
> same CPU as the one that padata_do_parallel itself ran on.

You're right.  I was taking the comment in the code at face value,
never trust comments :)

While looking at the code in question, I think it is seriously
broken.  For instance, padata_replace does not deal with async
crypto at all.  It would fail miserably if the underlying async
crypto held onto references to the old pd.

So we may have to restrict pcrypt to sync crypto only, which would
obviously mean that it can no longer use aesni.  Or we will have
to spend a lot of time to fix this up properly.

Cheers,
On Tue, Jul 16, 2019 at 06:04:47PM +0800, Herbert Xu wrote:
> On Mon, Jul 15, 2019 at 12:10:46PM -0400, Daniel Jordan wrote:
> >
> > I've been wrong before plenty of times, and there's nothing preventing this
> > from being one of those times :) , but in this case I believe what I'm showing
> > is correct.
> >
> > The padata_do_serial call for a given job ensures padata_reorder runs on the
> > CPU that the job hashed to in padata_do_parallel, which is not necessarily the
> > same CPU as the one that padata_do_parallel itself ran on.
>
> You're right.  I was taking the comment in the code at face value,
> never trust comments :)
>
> While looking at the code in question, I think it is seriously
> broken.  For instance, padata_replace does not deal with async
> crypto at all.  It would fail miserably if the underlying async
> crypto held onto references to the old pd.

Hm, yes, it looks like that.  padata_replace should not call padata_free_pd()
as long as the refcount is not zero.  Currently padata_flush_queues() will BUG
if there are references left.

Maybe we can fix it if we call padata_free_pd() from padata_serial_worker()
when it has sent out the last object.
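A rough sketch of the idea Steffen floats here, purely hypothetical and not a
posted patch: it assumes pd->refcnt counts outstanding objects and reuses the
overall shape of the 5.2-era serial worker, so the field names and the exact
accounting are illustrative only:

    /*
     * Hypothetical sketch: let the serial worker drop the references its
     * batch of objects held on the parallel_data and free the old pd only
     * when the last outstanding object has been serialized, so that
     * padata_replace() never has to flush (or BUG on) a pd that async
     * crypto still has requests in flight against.
     */
    static void padata_serial_worker(struct work_struct *serial_work)
    {
        struct padata_serial_queue *squeue;
        struct parallel_data *pd;
        LIST_HEAD(local_list);
        int cnt = 0;

        local_bh_disable();
        squeue = container_of(serial_work, struct padata_serial_queue, work);
        pd = squeue->pd;

        spin_lock(&squeue->serial.lock);
        list_replace_init(&squeue->serial.list, &local_list);
        spin_unlock(&squeue->serial.lock);

        while (!list_empty(&local_list)) {
            struct padata_priv *padata;

            padata = list_entry(local_list.next, struct padata_priv, list);
            list_del_init(&padata->list);
            padata->serial(padata);
            cnt++;
        }
        local_bh_enable();

        /* hypothetical: last object out frees the (possibly old) pd */
        if (atomic_sub_and_test(cnt, &pd->refcnt))
            padata_free_pd(pd);
    }

That way an old pd would simply linger until async crypto completes its
outstanding requests, instead of being flushed out from under them.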
Hi Daniel,

My two cents (summarizing some findings we discussed privately):

> I think adding the full barrier guarantees the following ordering, and the
> memory model people can correct me if I'm wrong:
>
>         CPU21                           CPU22
> ------------------------        --------------------------
> UNLOCK pd->lock
> smp_mb()
> LOAD reorder_objects
>                                  INC reorder_objects
>                                  spin_unlock(&pqueue->reorder.lock)  // release barrier
>                                  TRYLOCK pd->lock
>
> So if CPU22 has incremented reorder_objects but CPU21 reads 0 for it, CPU21
> should also have unlocked pd->lock so CPU22 can get it and serialize any
> remaining jobs.

This information inspired me to write down the following litmus test:
(AFAICT, you independently wrote a very similar test, which is indeed
quite reassuring! ;D)

C daniel-padata

{ }

P0(atomic_t *reorder_objects, spinlock_t *pd_lock)
{
	int r0;

	spin_lock(pd_lock);
	spin_unlock(pd_lock);
	smp_mb();
	r0 = atomic_read(reorder_objects);
}

P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
{
	int r1;

	spin_lock(reorder_lock);
	atomic_inc(reorder_objects);
	spin_unlock(reorder_lock);
	//smp_mb();
	r1 = spin_trylock(pd_lock);
}

exists (0:r0=0 /\ 1:r1=0)

It seems worth noticing that this test's "exists" clause is satisfiable
according to the (current) memory consistency model.  (Informally, this
can be explained by noticing that the RELEASE from the spin_unlock() in
P1 does not provide any order between the atomic increment and the read
part of the spin_trylock() operation.)  FWIW, uncommenting the smp_mb()
in P1 would suffice to prevent this clause from being satisfiable; I am
not sure, however, whether this approach is feasible or ideal... (sorry,
I'm definitely not too familiar with this code... ;/)

Thanks,
  Andrea
On Tue, Jul 16, 2019 at 02:53:09PM +0200, Andrea Parri wrote:

> C daniel-padata
>
> { }
>
> P0(atomic_t *reorder_objects, spinlock_t *pd_lock)
> {
> 	int r0;
>
> 	spin_lock(pd_lock);
> 	spin_unlock(pd_lock);
> 	smp_mb();
> 	r0 = atomic_read(reorder_objects);
> }
>
> P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
> {
> 	int r1;
>
> 	spin_lock(reorder_lock);
> 	atomic_inc(reorder_objects);
> 	spin_unlock(reorder_lock);
> 	//smp_mb();
> 	r1 = spin_trylock(pd_lock);
> }
>
> exists (0:r0=0 /\ 1:r1=0)
>
> It seems worth noticing that this test's "exists" clause is satisfiable
> according to the (current) memory consistency model.  (Informally, this
> can be explained by noticing that the RELEASE from the spin_unlock() in
> P1 does not provide any order between the atomic increment and the read
> part of the spin_trylock() operation.)  FWIW, uncommenting the smp_mb()
> in P1 would suffice to prevent this clause from being satisfiable; I am
> not sure, however, whether this approach is feasible or ideal... (sorry,
> I'm definitely not too familiar with this code... ;/)

Urgh, that one again.

Yes, you need the smp_mb(); although a whole bunch of architectures can
live without it.  IIRC it is part of the eternal RCsc/RCpc debate.

Paul/RCU have their smp_mb__after_unlock_lock() that is about something
similar, although we've so far confined that to the RCU code, because of
how confusing that all is.
On Tue, Jul 16, 2019 at 02:53:09PM +0200, Andrea Parri wrote:
>
> P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
> {
> 	int r1;
>
> 	spin_lock(reorder_lock);
> 	atomic_inc(reorder_objects);
> 	spin_unlock(reorder_lock);
> 	//smp_mb();
> 	r1 = spin_trylock(pd_lock);
> }

Yes we need a matching mb on the other side.  However, we can
get away with using smp_mb__after_atomic thanks to the atomic_inc
above.

Daniel, can you please respin the patch with the matching smp_mb?

Thanks,
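Herbert's suggestion corresponds to the following variant of Andrea's litmus
test. This variant is written here for illustration and was not posted in the
thread; as far as I can tell the Linux-kernel memory model (checkable with
herd7 against tools/memory-model) treats smp_mb__after_atomic() following the
atomic_inc() as a full barrier, so the "exists" clause should then be
forbidden, but treat that as a sketch rather than a verified result:

    C daniel-padata-mb-after-atomic

    { }

    P0(atomic_t *reorder_objects, spinlock_t *pd_lock)
    {
        int r0;

        spin_lock(pd_lock);
        spin_unlock(pd_lock);
        smp_mb();
        r0 = atomic_read(reorder_objects);
    }

    P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
    {
        int r1;

        spin_lock(reorder_lock);
        atomic_inc(reorder_objects);
        spin_unlock(reorder_lock);
        smp_mb__after_atomic();      /* replaces the commented-out smp_mb() */
        r1 = spin_trylock(pd_lock);
    }

    exists (0:r0=0 /\ 1:r1=0)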
On 7/16/19 11:01 AM, Herbert Xu wrote:
> On Tue, Jul 16, 2019 at 02:53:09PM +0200, Andrea Parri wrote:
>>
>> P1(atomic_t *reorder_objects, spinlock_t *pd_lock, spinlock_t *reorder_lock)
>> {
>> 	int r1;
>>
>> 	spin_lock(reorder_lock);
>> 	atomic_inc(reorder_objects);
>> 	spin_unlock(reorder_lock);
>> 	//smp_mb();
>> 	r1 = spin_trylock(pd_lock);
>> }
>
> Yes we need a matching mb on the other side.  However, we can
> get away with using smp_mb__after_atomic thanks to the atomic_inc
> above.
>
> Daniel, can you please respin the patch with the matching smp_mb?

Sure, Herbert, will do.

Thanks,
Daniel
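The respin would presumably add the matching barrier on the padata_do_serial()
side. A minimal sketch of what that could look like, based only on Herbert's
suggestion above and not on the actual follow-up patch:

    /* inside padata_do_serial(), after queueing the completed object */
    spin_lock(&pqueue->reorder.lock);
    atomic_inc(&pd->reorder_objects);
    list_add_tail(&padata->list, &pqueue->reorder.list);
    spin_unlock(&pqueue->reorder.lock);

    /*
     * Order the atomic_inc of reorder_objects above against the later
     * trylock of pd->lock in padata_reorder.  Pairs with the smp_mb()
     * added in padata_reorder.  smp_mb__after_atomic() suffices because
     * the increment is an atomic RMW operation.
     */
    smp_mb__after_atomic();

On strongly ordered architectures such as x86, smp_mb__after_atomic() compiles
away because the atomic_inc() is already fully ordered, which is presumably why
it is preferable here to a bare smp_mb().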
diff --git a/kernel/padata.c b/kernel/padata.c
index 2d2fddbb7a4c..9cffd4c303cb 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -267,7 +267,12 @@ static void padata_reorder(struct parallel_data *pd)
 	 * The next object that needs serialization might have arrived to
 	 * the reorder queues in the meantime, we will be called again
 	 * from the timer function if no one else cares for it.
+	 *
+	 * Ensure reorder_objects is read after pd->lock is dropped so we see
+	 * an increment from another task in padata_do_serial.  Pairs with
+	 * spin_unlock(&pqueue->reorder.lock) in padata_do_serial.
 	 */
+	smp_mb();
 	if (atomic_read(&pd->reorder_objects)
 			&& !(pinst->flags & PADATA_RESET))
 		mod_timer(&pd->timer, jiffies + HZ);
Testing padata with the tcrypt module on a 5.2 kernel...

    # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
    # modprobe tcrypt mode=211 sec=1

...produces this splat:

    INFO: task modprobe:10075 blocked for more than 120 seconds.
          Not tainted 5.2.0-base+ #16
    modprobe        D    0 10075  10064 0x80004080
    Call Trace:
     ? __schedule+0x4dd/0x610
     ? ring_buffer_unlock_commit+0x23/0x100
     schedule+0x6c/0x90
     schedule_timeout+0x3b/0x320
     ? trace_buffer_unlock_commit_regs+0x4f/0x1f0
     wait_for_common+0x160/0x1a0
     ? wake_up_q+0x80/0x80
     { crypto_wait_req }               # entries in braces added by hand
     { do_one_aead_op }
     { test_aead_jiffies }
     test_aead_speed.constprop.17+0x681/0xf30 [tcrypt]
     do_test+0x4053/0x6a2b [tcrypt]
     ? 0xffffffffa00f4000
     tcrypt_mod_init+0x50/0x1000 [tcrypt]
     ...

The second modprobe command never finishes because in padata_reorder, CPU0's
load of reorder_objects is executed before the unlocking store in
spin_unlock_bh(pd->lock), causing CPU0 to miss CPU1's increment:

         CPU0                                 CPU1

 padata_reorder                          padata_do_serial
   LOAD reorder_objects  // 0
                                           INC reorder_objects  // 1
                                           padata_reorder
                                             TRYLOCK pd->lock  // failed
   UNLOCK pd->lock

CPU0 deletes the timer before returning from padata_reorder and since no other
job is submitted to padata, modprobe waits indefinitely.

Add a full barrier to prevent this scenario.  The hang was happening about
once every three runs, but now the test has finished successfully fifty times
in a row.

Fixes: 16295bec6398 ("padata: Generic parallelization/serialization interface")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---

memory-barriers.txt says that a full barrier pairs with a release barrier, but
I'd appreciate a look from someone more familiar with barriers.  Thanks.

 kernel/padata.c | 5 +++++
 1 file changed, 5 insertions(+)

base-commit: 0ecfebd2b52404ae0c54a878c872bb93363ada36