Message ID | 1616050402-37023-1-git-send-email-linyunsheng@huawei.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [net] net: sched: fix packet stuck problem for lockless qdisc |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | success | CCed 15 of 15 maintainers |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 4510 this patch: 4510 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 49 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 4750 this patch: 4750 |
netdev/header_inline | success | Link |
On 2021/3/18 14:53, Yunsheng Lin wrote:
> Lockless qdisc has the concurrency problem below:
[...]
> So I would really appreciate it if someone who still has the
> reproducer could try this patch and report back.

I have done some performance tests to see if there is value in fixing
the packet stuck problem and supporting lockless qdisc bypass. Here are
some results using pktgen in 'queue_xmit' mode on a dummy device, as
Paolo Abeni had done in [1], using the pfifo_fast qdisc:

threads    vanilla    locked-qdisc    vanilla+this_patch
   1       2.6Mpps      2.9Mpps            2.5Mpps
   2       3.9Mpps      4.8Mpps            3.6Mpps
   4       5.6Mpps      3.0Mpps            4.7Mpps
   8       2.7Mpps      1.6Mpps            2.8Mpps
  16       2.2Mpps      1.3Mpps            2.3Mpps

locked-qdisc: tested by removing "TCQ_F_NOLOCK | TCQ_F_CPUSTATS".

And with the lockless qdisc bypass and the other optimizations added on
top of this patch:

threads    patch_set_1    patch_set_2    patch_set_3
   1         2.5Mpps        3.0Mpps        3.0Mpps
   2         3.6Mpps        4.1Mpps        5.3Mpps
   4         4.7Mpps        4.6Mpps        5.1Mpps
   8         2.8Mpps        2.6Mpps        2.7Mpps
  16         2.3Mpps        2.2Mpps        2.2Mpps

patch_set_1: vanilla + this_patch
patch_set_2: vanilla + this_patch + lockless_qdisc_bypass_patch
patch_set_3: vanilla + this_patch + lockless_qdisc_bypass_patch +
             remove_seq_operation_for_lockless_qdisc_optimization +
             check_rc_before_calling_qdisc_run()_optimization +
             spin_trylock()_retry_optimization

So with all the fixes and optimizations added together, the lockless
qdisc has better performance than vanilla except for the 4-thread case,
which shows about 9% degradation compared to vanilla, but is still
better than the locked qdisc.

> 1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
> 2. https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsheng@huawei.com/
> 3. https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsheng@huawei.com/
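For reference, the "locked-qdisc" column above is described as removing
"TCQ_F_NOLOCK | TCQ_F_CPUSTATS". In mainline, pfifo_fast opts into the
lockless path through its Qdisc_ops, so the comparison presumably amounts
to dropping the static_flags line in the snippet below and falling back to
the qdisc root lock. The snippet is paraphrased from net/sched/sch_generic.c
for illustration only and is not part of this series:

struct Qdisc_ops pfifo_fast_ops __read_mostly = {
        .id             = "pfifo_fast",
        .priv_size      = sizeof(struct pfifo_fast_priv),
        .enqueue        = pfifo_fast_enqueue,
        .dequeue        = pfifo_fast_dequeue,
        .peek           = pfifo_fast_peek,
        /* ... remaining ops unchanged ... */
        .static_flags   = TCQ_F_NOLOCK | TCQ_F_CPUSTATS, /* dropped in the "locked-qdisc" run */
        .owner          = THIS_MODULE,
};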
On Wed, Mar 17, 2021 at 11:52 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Lockless qdisc has the concurrency problem below:
[...]
> @@ -159,8 +160,17 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>         if (qdisc->flags & TCQ_F_NOLOCK) {
> -               if (!spin_trylock(&qdisc->seqlock))
> -                       return false;
> +               if (!spin_trylock(&qdisc->seqlock)) {
> +                       set_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                               &qdisc->state);

Why do we need another bit? I mean why not just call __netif_schedule()?

> +
> +                       /* Retry again in case other CPU may not see the
> +                        * new flags after it releases the lock at the
> +                        * end of qdisc_run_end().
> +                        */
> +                       if (!spin_trylock(&qdisc->seqlock))
> +                               return false;
> +               }
>                 WRITE_ONCE(qdisc->empty, false);
>         } else if (qdisc_is_running(qdisc)) {
>                 return false;
> @@ -176,8 +186,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>         write_seqcount_end(&qdisc->running);
> -       if (qdisc->flags & TCQ_F_NOLOCK)
> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>                 spin_unlock(&qdisc->seqlock);
> +
> +               if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                                     &qdisc->state) &&
> +                            !test_bit(__QDISC_STATE_DEACTIVATED,
> +                                      &qdisc->state)))

Testing two bits one by one is not atomic...

> +                       __netif_schedule(qdisc);
> +               }
>  }
>
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..25d75d8 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -205,6 +205,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>         const struct netdev_queue *txq = q->dev_queue;
>         struct sk_buff *skb = NULL;
>
> +       clear_bit(__QDISC_STATE_NEED_RESCHEDULE, &q->state);
>         *packets = 1;
>         if (unlikely(!skb_queue_empty(&q->gso_skb))) {
>                 spinlock_t *lock = NULL;
> --
> 2.7.4
>
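To make the alternative Cong is pointing at concrete, a sketch of how
qdisc_run_begin() could look if the losing CPU simply rescheduled instead
of recording a dedicated state bit is shown below. This is one reading of
the suggestion, not code from this thread; the placement of
__netif_schedule() is an assumption, and the unchanged tail of the
function is copied from mainline of that era.

/* Hedged sketch of the suggested alternative: on trylock failure,
 * unconditionally ask net_tx_action() to run the qdisc again instead of
 * recording __QDISC_STATE_NEED_RESCHEDULE.
 */
static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
        if (qdisc->flags & TCQ_F_NOLOCK) {
                if (!spin_trylock(&qdisc->seqlock)) {
                        /* The lock owner may finish its run without
                         * seeing the skb we just enqueued, so always
                         * schedule another run.
                         */
                        __netif_schedule(qdisc);
                        return false;
                }
                WRITE_ONCE(qdisc->empty, false);
        } else if (qdisc_is_running(qdisc)) {
                return false;
        }
        /* Variant of write_seqcount_begin() telling lockdep a trylock
         * was attempted.
         */
        raw_write_seqcount_begin(&qdisc->running);
        seqcount_acquire(&qdisc->running.dep_map, 0, 1, _RET_IP_);
        return true;
}

The trade-off, as Yunsheng's reply below shows, is that an unconditional
__netif_schedule() on every trylock failure costs noticeably more under
contention than setting and testing a bit.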
On Fri, Mar 19, 2021 at 2:25 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> I have done some performance tests to see if there is value in fixing
> the packet stuck problem and supporting lockless qdisc bypass. Here are
> some results using pktgen in 'queue_xmit' mode on a dummy device, as
> Paolo Abeni had done in [1], using the pfifo_fast qdisc:
>
> threads    vanilla    locked-qdisc    vanilla+this_patch
>    1       2.6Mpps      2.9Mpps            2.5Mpps
>    2       3.9Mpps      4.8Mpps            3.6Mpps
>    4       5.6Mpps      3.0Mpps            4.7Mpps
>    8       2.7Mpps      1.6Mpps            2.8Mpps
>   16       2.2Mpps      1.3Mpps            2.3Mpps
>
> locked-qdisc: tested by removing "TCQ_F_NOLOCK | TCQ_F_CPUSTATS".

I read this as: this patch somehow introduces a performance regression
for -net, as the lockless bypass patch you submitted is for -net-next.

Thanks.
On 2021/3/20 3:40, Cong Wang wrote:
> On Wed, Mar 17, 2021 at 11:52 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> Lockless qdisc has the concurrency problem below:
[...]
>> +               if (!spin_trylock(&qdisc->seqlock)) {
>> +                       set_bit(__QDISC_STATE_NEED_RESCHEDULE,
>> +                               &qdisc->state);
>
> Why do we need another bit? I mean why not just call __netif_schedule()?

I think that was your proposal in [1]; the only difference is that it
also handles the tx_action case when __netif_schedule() is called here.

So yes, it can also fix the two data races described in this patch, but
with a bigger performance degradation. Quoting the performance data
from [1]:

pktgen threads    vanilla    patched    delta
      nr           kpps       kpps        %
       1           3240       3240         0
       2           3910       2710       -30.5
       4           5140       4920        -4

Performance data with this patch:

threads    vanilla    vanilla+this_patch    delta
   1       2.6Mpps         2.5Mpps           -3%
   2       3.9Mpps         3.6Mpps           -7%
   4       5.6Mpps         4.7Mpps          -16%

So performance is why I do not call __netif_schedule() directly here.

1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#md927651488ce4d226f6279aad6699b4bee4674a3

[...]
>> +               if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
>> +                                     &qdisc->state) &&
>> +                            !test_bit(__QDISC_STATE_DEACTIVATED,
>> +                                      &qdisc->state)))
>
> Testing two bits one by one is not atomic...

For the non-tx_action case it is actually atomic, because the above
two-bit test runs inside rcu protection and qdisc reset does a
synchronize_net() after setting __QDISC_STATE_DEACTIVATED.

For the tx_action case, I think we need explicit rcu protection in
net_tx_action() too, at least for PREEMPT_RCU:

https://stackoverflow.com/questions/21287932/is-it-necessary-invoke-rcu-read-lock-in-softirq-context

>
>> +                       __netif_schedule(qdisc);
>> +               }
[...]
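To make the explicit rcu protection mentioned above concrete, a sketch of
one way it could look in net_tx_action() follows. The loop body is
paraphrased from net/core/dev.c of that era and is not part of this patch;
only the rcu_read_lock()/rcu_read_unlock() pair around qdisc_run() is the
point being illustrated.

        while (head) {
                struct Qdisc *q = head;
                spinlock_t *root_lock = NULL;

                head = head->next_sched;

                if (!(q->flags & TCQ_F_NOLOCK)) {
                        root_lock = qdisc_lock(q);
                        spin_lock(root_lock);
                }
                /* Make sure head->next_sched is read before clearing
                 * __QDISC_STATE_SCHED.
                 */
                smp_mb__before_atomic();
                clear_bit(__QDISC_STATE_SCHED, &q->state);

                /* Keep the NEED_RESCHEDULE/DEACTIVATED tests in
                 * qdisc_run_end() inside one RCU read-side section, so the
                 * synchronize_net() done after setting
                 * __QDISC_STATE_DEACTIVATED cannot complete between them,
                 * even with PREEMPT_RCU.
                 */
                rcu_read_lock();
                qdisc_run(q);
                rcu_read_unlock();

                if (root_lock)
                        spin_unlock(root_lock);
        }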
On 2021/3/20 3:45, Cong Wang wrote:
> On Fri, Mar 19, 2021 at 2:25 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>> I have done some performance tests to see if there is value in fixing
>> the packet stuck problem and supporting lockless qdisc bypass.
[...]
>> locked-qdisc: tested by removing "TCQ_F_NOLOCK | TCQ_F_CPUSTATS".
>
> I read this as: this patch somehow introduces a performance regression
> for -net, as the lockless bypass patch you submitted is for -net-next.

Yes, right now there is a performance regression from fixing this bug.
But if we cannot fix the above data race without any performance
regression, do you prefer to send this patch to -net, or to -net-next
together with the lockless bypass patch?

Any idea how to fix this with less performance regression?

> Thanks.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..4220eab 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
        __QDISC_STATE_SCHED,
        __QDISC_STATE_DEACTIVATED,
+       __QDISC_STATE_NEED_RESCHEDULE,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,17 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
        if (qdisc->flags & TCQ_F_NOLOCK) {
-               if (!spin_trylock(&qdisc->seqlock))
-                       return false;
+               if (!spin_trylock(&qdisc->seqlock)) {
+                       set_bit(__QDISC_STATE_NEED_RESCHEDULE,
+                               &qdisc->state);
+
+                       /* Retry again in case other CPU may not see the
+                        * new flags after it releases the lock at the
+                        * end of qdisc_run_end().
+                        */
+                       if (!spin_trylock(&qdisc->seqlock))
+                               return false;
+               }
                WRITE_ONCE(qdisc->empty, false);
        } else if (qdisc_is_running(qdisc)) {
                return false;
@@ -176,8 +186,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
        write_seqcount_end(&qdisc->running);
-       if (qdisc->flags & TCQ_F_NOLOCK)
+       if (qdisc->flags & TCQ_F_NOLOCK) {
                spin_unlock(&qdisc->seqlock);
+
+               if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
+                                     &qdisc->state) &&
+                            !test_bit(__QDISC_STATE_DEACTIVATED,
+                                      &qdisc->state)))
+                       __netif_schedule(qdisc);
+       }
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..25d75d8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -205,6 +205,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
        const struct netdev_queue *txq = q->dev_queue;
        struct sk_buff *skb = NULL;
 
+       clear_bit(__QDISC_STATE_NEED_RESCHEDULE, &q->state);
        *packets = 1;
        if (unlikely(!skb_queue_empty(&q->gso_skb))) {
                spinlock_t *lock = NULL;
Lockless qdisc has the concurrency problem below:

  cpu0                     cpu1
   .                         .
q->enqueue                   .
   .                         .
qdisc_run_begin()            .
   .                         .
dequeue_skb()                .
   .                         .
sch_direct_xmit()            .
   .                         .
   .                    q->enqueue
   .                 qdisc_run_begin()
   .               return and do nothing
   .                         .
qdisc_run_end()              .

cpu1 enqueues a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calls dequeue_skb() and before
cpu0 calls qdisc_run_end().

Lockless qdisc has another concurrency problem when tx_action
is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .           qdisc_run_begin()        .
          .             dequeue_skb()          .
          .                   .            q->enqueue
          .                   .                .
          .           sch_direct_xmit()        .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
clear __QDISC_STATE_SCHED     .                .
    qdisc_run_begin()         .                .
  return and do nothing       .                .
          .                   .                .
          .           qdisc_run_begin()        .

This patch fixes the above data races by:
1. Setting a flag after spin_trylock() returns false.
2. Retrying spin_trylock() in case the other CPU may not see the
   new flag after it releases the lock.
3. Rescheduling if the flag is set after the lock is released
   at the end of qdisc_run_end().

For the tx_action case, the flag is also set when cpu1 is at the
end of qdisc_run_begin(), so tx_action will be rescheduled
again to dequeue the skb enqueued by cpu2.

Also clear the flag before dequeuing in order to reduce the
overhead of the above process, and avoid doing the heavy
test_and_clear_bit() at the end of qdisc_run_end().

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
For those who have not been following the qdisc scheduling
discussion: there is a packet stuck problem for lockless qdisc,
see [1], and I have done some cleanup and added some enhanced
features too, see [2] [3].

While I was doing the optimization for lockless qdisc, it
occurred to me that these optimizations are useless if there is
still a basic bug in lockless qdisc, even if the bug is not easily
reproducible. So, looking through [1] again, I found the data race
for tx action mentioned by Michael, thought deeply about it, and
came up with this patch trying to fix it.

So I would really appreciate it if someone who still has the
reproducer could try this patch and report back.

1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
2. https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsheng@huawei.com/
3. https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsheng@huawei.com/
---
 include/net/sch_generic.h | 23 ++++++++++++++++++++---
 net/sched/sch_generic.c   |  1 +
 2 files changed, 21 insertions(+), 3 deletions(-)
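For readers who want to experiment with the ordering outside the kernel,
below is a minimal user-space model of the three steps above, written with
C11 atomics. toy_qdisc, toy_run_begin(), toy_run_end() and toy_reschedule()
are made-up names, and the sequentially consistent atomics merely stand in
for the ordering the kernel gets from spin_trylock()/spin_unlock() and the
set_bit()/test_bit() pair; it is a sketch of the idea, not the kernel
implementation.

/* Minimal user-space model of the fix (hypothetical names, C11 atomics).
 * It mirrors the three steps from the changelog:
 *   1. set a flag after the trylock fails,
 *   2. retry the trylock once,
 *   3. after unlocking, reschedule if the flag is set.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_qdisc {
        atomic_flag seqlock;      /* stands in for qdisc->seqlock */
        atomic_bool need_resched; /* stands in for __QDISC_STATE_NEED_RESCHEDULE */
};

static bool toy_run_begin(struct toy_qdisc *q)
{
        if (atomic_flag_test_and_set(&q->seqlock)) {       /* trylock failed */
                atomic_store(&q->need_resched, true);       /* step 1 */
                if (atomic_flag_test_and_set(&q->seqlock))  /* step 2 */
                        return false;
        }
        /* Mirror of the clear_bit() added to dequeue_skb(): the winner is
         * about to drain the queue, so the hint can be dropped here.
         */
        atomic_store(&q->need_resched, false);
        return true;
}

static void toy_run_end(struct toy_qdisc *q, void (*reschedule)(struct toy_qdisc *))
{
        atomic_flag_clear(&q->seqlock);                     /* unlock */
        if (atomic_load(&q->need_resched))                  /* step 3 */
                reschedule(q);
}

/* The reschedule callback plays the role of __netif_schedule() kicking
 * net_tx_action() so the qdisc is run again later.
 */
static void toy_reschedule(struct toy_qdisc *q)
{
        (void)q;
        printf("reschedule: someone enqueued while the qdisc was running\n");
}

int main(void)
{
        struct toy_qdisc q = { .seqlock = ATOMIC_FLAG_INIT };

        if (toy_run_begin(&q)) {
                /* ...dequeue and transmit packets here... */
                toy_run_end(&q, toy_reschedule);
        }
        return 0;
}

The point the model captures is that a CPU losing the trylock either gets
the lock on the retry or leaves the flag behind for the lock holder to
observe after unlocking, so enqueued work is never silently stranded.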