Message ID | 1623891854-57416-1-git-send-email-linyunsheng@huawei.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Netdev Maintainers |
Series | [net,v2] net: sched: add barrier to ensure correct ordering for lockless qdisc |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | success | CCed 13 of 13 maintainers |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 3530 this patch: 3530 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 24 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 3631 this patch: 3631 |
netdev/header_inline | success | Link |
On Thu, 17 Jun 2021 09:04:14 +0800 Yunsheng Lin wrote:
> The spin_trylock() was assumed to contain the implicit
> barrier needed to ensure the correct ordering between
> STATE_MISSED setting/clearing and STATE_MISSED checking
> in commit a90c57f2cedd ("net: sched: fix packet stuck
> problem for lockless qdisc").
>
> But it turns out that spin_trylock() only has load-acquire
> semantic, for strongly-ordered system(like x86), the compiler
> barrier implicitly contained in spin_trylock() seems enough
> to ensure the correct ordering. But for weakly-orderly system
> (like arm64), the store-release semantic is needed to ensure
> the correct ordering as clear_bit() and test_bit() is store
> operation, see queued_spin_lock().
>
> So add the explicit barrier to ensure the correct ordering
> for the above case.
>
> Fixes: a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>

Acked-by: Jakub Kicinski <kuba@kernel.org>

On Fri, 18 Jun 2021 17:30:47 -0700 Jakub Kicinski wrote:
> On Thu, 17 Jun 2021 09:04:14 +0800 Yunsheng Lin wrote:
> > The spin_trylock() was assumed to contain the implicit
> > barrier needed to ensure the correct ordering between
> > STATE_MISSED setting/clearing and STATE_MISSED checking
> > in commit a90c57f2cedd ("net: sched: fix packet stuck
> > problem for lockless qdisc").
> >
> > But it turns out that spin_trylock() only has load-acquire
> > semantic, for strongly-ordered system(like x86), the compiler
> > barrier implicitly contained in spin_trylock() seems enough
> > to ensure the correct ordering. But for weakly-orderly system
> > (like arm64), the store-release semantic is needed to ensure
> > the correct ordering as clear_bit() and test_bit() is store
> > operation, see queued_spin_lock().
> >
> > So add the explicit barrier to ensure the correct ordering
> > for the above case.
> >
> > Fixes: a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc")
> > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>
> Acked-by: Jakub Kicinski <kuba@kernel.org>

Actually.. do we really need the _before_atomic() barrier?
I'd think we only need to make sure we re-check the lock
after we set the bit, ordering of the first check doesn't
matter.

On Fri, Jun 18, 2021 at 05:38:37PM -0700, Jakub Kicinski wrote:
> On Fri, 18 Jun 2021 17:30:47 -0700 Jakub Kicinski wrote:
> > On Thu, 17 Jun 2021 09:04:14 +0800 Yunsheng Lin wrote:
> > > The spin_trylock() was assumed to contain the implicit
> > > barrier needed to ensure the correct ordering between
> > > STATE_MISSED setting/clearing and STATE_MISSED checking
> > > in commit a90c57f2cedd ("net: sched: fix packet stuck
> > > problem for lockless qdisc").
> > >
> > > But it turns out that spin_trylock() only has load-acquire
> > > semantic, for strongly-ordered system(like x86), the compiler
> > > barrier implicitly contained in spin_trylock() seems enough
> > > to ensure the correct ordering. But for weakly-orderly system
> > > (like arm64), the store-release semantic is needed to ensure
> > > the correct ordering as clear_bit() and test_bit() is store
> > > operation, see queued_spin_lock().
> > >
> > > So add the explicit barrier to ensure the correct ordering
> > > for the above case.
> > >
> > > Fixes: a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc")
> > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >
> > Acked-by: Jakub Kicinski <kuba@kernel.org>
>
> Actually.. do we really need the _before_atomic() barrier?
> I'd think we only need to make sure we re-check the lock
> after we set the bit, ordering of the first check doesn't
> matter.

When debugging pointed to misordering between STATE_MISSED
setting/clearing and STATE_MISSED checking, I first added only
_after_atomic(), and that did not fix the problem. Only when both
_before_atomic() and _after_atomic() were added did the misordering
disappear.

I suppose _before_atomic() matters because the STATE_MISSED
setting and the lock rechecking are only done when the first check
of STATE_MISSED returns false. _before_atomic() makes sure the
first check returns the correct result; if it does not, we may
have a misordering problem too.

      cpu0                        cpu1
                              clear MISSED
                             _after_atomic()
                                dequeue
     enqueue
 first trylock() #false
 MISSED check #*true* ?

As above, even though cpu1 has an _after_atomic() between clearing
STATE_MISSED and dequeuing, we might still need a barrier to
prevent cpu0 from speculatively checking MISSED before cpu1
clears it?

And the implicit load-acquire barrier contained in the first
trylock() does not seem to prevent the above case either.

There is also no load-acquire barrier in pfifo_fast_dequeue(),
which possibly makes the above case more likely to happen.

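The cpu1 column of the diagram above (clear MISSED, barrier, dequeue) can be sketched in userspace C11 to make the intended ordering concrete. This is only a model under stated assumptions, not the kernel's pfifo_fast_dequeue(): `struct model_fifo`, `model_take_one()`, `model_dequeue()` and the one-shot retry are hypothetical names and structure chosen for illustration, and a sequentially consistent fence is used as a conservative stand-in for smp_mb__after_atomic().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_MISSED 1u	/* hypothetical stand-in for __QDISC_STATE_MISSED */

struct model_fifo {
	atomic_uint state;	/* flag word, holds STATE_MISSED */
	atomic_int backlog;	/* number of queued packets (model only) */
};

/* Take one packet if any is queued; only the lock holder calls this. */
static bool model_take_one(struct model_fifo *f)
{
	if (atomic_load_explicit(&f->backlog, memory_order_relaxed) <= 0)
		return false;
	atomic_fetch_sub_explicit(&f->backlog, 1, memory_order_relaxed);
	return true;
}

/* Dequeue side: when the queue looks empty but MISSED is set, clear
 * MISSED, issue a full barrier, then look at the queue once more.
 */
static bool model_dequeue(struct model_fifo *f)
{
	bool need_retry = true;

retry:
	if (model_take_one(f))
		return true;

	if (need_retry &&
	    (atomic_load_explicit(&f->state, memory_order_relaxed) & STATE_MISSED)) {
		atomic_fetch_and_explicit(&f->state, ~STATE_MISSED,
					  memory_order_relaxed);
		/* Assumption: models smp_mb__after_atomic(), so the
		 * second queue check cannot be ordered before the clear.
		 */
		atomic_thread_fence(memory_order_seq_cst);
		need_retry = false;
		goto retry;
	}
	return false;
}

/* Single-threaded walk through the two interesting cases. */
static void model_fifo_selfcheck(void)
{
	struct model_fifo f;

	atomic_init(&f.state, STATE_MISSED);
	atomic_init(&f.backlog, 0);

	assert(!model_dequeue(&f));		/* empty: MISSED is cleared */
	assert(atomic_load(&f.state) == 0u);

	atomic_store(&f.backlog, 1);
	assert(model_dequeue(&f));		/* packet found on first try */
}
```

Calling model_fifo_selfcheck() from a main() exercises the control flow only. Being single-threaded, it cannot reproduce the misordering discussed in this thread; it just shows where the barrier sits relative to the clear and the second queue check.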
On Sat, 19 Jun 2021 10:30:09 +0000 Yunsheng Lin wrote:
> When debugging pointed to the misordering between STATE_MISSED
> setting/clearing and STATE_MISSED checking, only _after_atomic()
> was added first, and it did not fix the misordering problem,
> when both _before_atomic() and _after_atomic() were added, the
> misordering problem disappeared.
>
> I suppose _before_atomic() matters because the STATE_MISSED
> setting and the lock rechecking is only done when first check of
> STATE_MISSED returns false. _before_atomic() is used to make sure
> the first check returns correct result, if it does not return the
> correct result, then we may have misordering problem too.
>
>       cpu0                        cpu1
>                               clear MISSED
>                              _after_atomic()
>                                 dequeue
>      enqueue
>  first trylock() #false
>  MISSED check #*true* ?
>
> As above, even cpu1 has a _after_atomic() between clearing
> STATE_MISSED and dequeuing, we might stiil need a barrier to
> prevent cpu0 doing speculative MISSED checking before cpu1
> clearing MISSED?
>
> And the implicit load-acquire barrier contained in the first
> trylock() does not seems to prevent the above case too.
>
> And there is no load-acquire barrier in pfifo_fast_dequeue()
> too, which possibly make the above case more likely to happen.

Ah, you're right. The test_bit() was not in the patch context,
I forgot it's there... Both barriers are indeed needed.

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 1e62551..5771030 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -163,6 +163,12 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 		if (spin_trylock(&qdisc->seqlock))
 			goto nolock_empty;
 
+		/* Paired with smp_mb__after_atomic() to make sure
+		 * STATE_MISSED checking is synchronized with clearing
+		 * in pfifo_fast_dequeue().
+		 */
+		smp_mb__before_atomic();
+
 		/* If the MISSED flag is set, it means other thread has
 		 * set the MISSED flag before second spin_trylock(), so
 		 * we can return false here to avoid multi cpus doing
@@ -180,6 +186,12 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 		 */
 		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
 
+		/* spin_trylock() only has load-acquire semantic, so use
+		 * smp_mb__after_atomic() to ensure STATE_MISSED is set
+		 * before doing the second spin_trylock().
+		 */
+		smp_mb__after_atomic();
+
 		/* Retry again in case other CPU may not see the new flag
 		 * after it releases the lock at the end of qdisc_run_end().
 		 */

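For readers who want to follow the patched control flow outside the kernel, the lockless path of qdisc_run_begin() can be approximated with C11 atomics. This is a rough sketch under assumptions, not the kernel implementation: `struct model_qdisc`, `model_trylock()`, `model_unlock()` and `model_run_begin()` are hypothetical names, the seqlock is reduced to a single acquire-only flag, and sequentially consistent fences stand in (conservatively) for smp_mb__before_atomic() and smp_mb__after_atomic().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_MISSED 1u	/* hypothetical stand-in for __QDISC_STATE_MISSED */

struct model_qdisc {
	atomic_bool seqlock;	/* true while some CPU owns the qdisc */
	atomic_uint state;	/* flag word, holds STATE_MISSED */
};

/* Acquire-only trylock: later accesses cannot move before a successful
 * acquisition, but earlier stores are not ordered against it.
 */
static bool model_trylock(struct model_qdisc *q)
{
	bool expected = false;

	return atomic_compare_exchange_strong_explicit(&q->seqlock, &expected,
			true, memory_order_acquire, memory_order_relaxed);
}

static void model_unlock(struct model_qdisc *q)
{
	atomic_store_explicit(&q->seqlock, false, memory_order_release);
}

static bool model_run_begin(struct model_qdisc *q)
{
	if (model_trylock(q))
		return true;

	/* Assumption: models smp_mb__before_atomic() from the patch. */
	atomic_thread_fence(memory_order_seq_cst);

	/* Someone already flagged the miss; let the lock holder handle it. */
	if (atomic_load_explicit(&q->state, memory_order_relaxed) & STATE_MISSED)
		return false;

	atomic_fetch_or_explicit(&q->state, STATE_MISSED, memory_order_relaxed);

	/* Assumption: models smp_mb__after_atomic(); the flag must be
	 * set before the lock is probed a second time.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	return model_trylock(q);
}

/* Single-threaded walk through the three outcomes. */
static void model_selfcheck(void)
{
	struct model_qdisc q;

	atomic_init(&q.seqlock, false);
	atomic_init(&q.state, 0u);

	assert(model_run_begin(&q));		/* lock free: acquired */
	assert(!model_run_begin(&q));		/* held: MISSED set, retry fails */
	assert(atomic_load(&q.state) & STATE_MISSED);
	assert(!model_run_begin(&q));		/* MISSED already set: bail out */
	model_unlock(&q);
	assert(model_run_begin(&q));		/* free again: acquired */
}
```

Calling model_selfcheck() from a main() checks the branch structure only. The ordering problem itself was reported on arm64 with two iperf threads and needs real concurrency on a weakly-ordered machine to show up.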
The spin_trylock() was assumed to contain the implicit
barrier needed to ensure the correct ordering between
STATE_MISSED setting/clearing and STATE_MISSED checking
in commit a90c57f2cedd ("net: sched: fix packet stuck
problem for lockless qdisc").

But it turns out that spin_trylock() only has load-acquire
semantics. For strongly-ordered systems (like x86), the compiler
barrier implicitly contained in spin_trylock() seems enough
to ensure the correct ordering. But for weakly-ordered systems
(like arm64), store-release semantics are needed to ensure
the correct ordering, as clear_bit() and test_bit() are store
operations; see queued_spin_lock().

So add explicit barriers to ensure the correct ordering
for the above case.

Fixes: a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V2: add the missing Fixes tag.

The above ordering issue can easily cause out-of-order packets
when testing the lockless qdisc bypass patchset [1] with two
iperf threads and one netdev queue on an arm64 system.

1. https://lkml.org/lkml/2021/6/2/1417
---
 include/net/sch_generic.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)