| Message ID | 1622170197-27370-3-git-send-email-linyunsheng@huawei.com (mailing list archive) |
|---|---|
| State | Superseded |
| Delegated to | Netdev Maintainers |
| Series | Some optimization for lockless qdisc |
| Context | Check | Description |
|---|---|---|
| netdev/cover_letter | success | |
| netdev/fixes_present | success | |
| netdev/patch_count | success | |
| netdev/tree_selection | success | Clearly marked for net-next |
| netdev/subject_prefix | success | |
| netdev/cc_maintainers | success | CCed 20 of 20 maintainers |
| netdev/source_inline | success | Was 0 now: 0 |
| netdev/verify_signedoff | success | |
| netdev/module_param | success | Was 0 now: 0 |
| netdev/build_32bit | success | Errors and warnings before: 3466 this patch: 3466 |
| netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
| netdev/verify_fixes | success | |
| netdev/checkpatch | warning | WARNING: line length of 82 exceeds 80 columns |
| netdev/build_allmodconfig_warn | success | Errors and warnings before: 3567 this patch: 3567 |
| netdev/header_inline | success | |
On Fri, 28 May 2021 10:49:56 +0800 Yunsheng Lin wrote:
> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> flags set, but queue discipline bypass does not work for lockless
> qdisc, because the skb is always enqueued to the qdisc even when the
> qdisc is empty, see __dev_xmit_skb().
>
> This patch calls sch_direct_xmit() to transmit the skb directly
> to the driver for an empty lockless qdisc, which avoids the
> enqueue and dequeue operations.
>
> qdisc->empty is not reliable for indicating an empty qdisc, because
> there is a time window between enqueuing and setting qdisc->empty.
> So we use the MISSED state added in commit a90c57f2cedd ("net:
> sched: fix packet stuck problem for lockless qdisc"), which
> indicates there is lock contention, suggesting that it is better
> not to do the qdisc bypass, in order to avoid the packet
> out-of-order problem.
>
> In order to make the MISSED state reliable for indicating an empty
> qdisc, we need to ensure that testing and clearing of the MISSED
> state is within the protection of qdisc->seqlock; only setting the
> MISSED state can be done without the protection of qdisc->seqlock.
> A MISSED state test is added without the protection of
> qdisc->seqlock to avoid doing an unnecessary spin_trylock() for the
> contention case.
>
> The following cases need special handling:
> 1. The MISSED state is cleared before another round of dequeuing in
>    pfifo_fast_dequeue(), but __qdisc_run() might not be able to
>    dequeue all skbs in one round, and then calls __netif_schedule(),
>    which might result in a non-empty qdisc without MISSED set. In
>    order to avoid this, the MISSED state is set for lockless qdisc
>    and __netif_schedule() is called at the end of qdisc_run_end().
>
> 2. The MISSED state also needs to be set for lockless qdisc, instead
>    of calling __netif_schedule() directly, when requeuing a skb, for
>    a similar reason.
>
> 3. For the netdev queue stopped case, the MISSED state needs
>    clearing while the netdev queue is stopped, otherwise there may
>    be unnecessary __netif_schedule() calls. So a new DRAINING state
>    is added to indicate this case, which also indicates a non-empty
>    qdisc.
>
> 4. There is already a netif_xmit_frozen_or_stopped() check in
>    dequeue_skb() and sch_direct_xmit(), both within the protection
>    of qdisc->seqlock, but the same check in __dev_xmit_skb() is
>    without that protection, which might cause the empty indication
>    of a lockless qdisc to be unreliable. So remove the check in
>    __dev_xmit_skb(); the checks under the protection of
>    qdisc->seqlock seem enough to avoid the cpu consumption problem
>    for the netdev queue stopped case.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>

> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3ed6bcc..177f240 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -37,8 +37,15 @@ enum qdisc_state_t {
>  	__QDISC_STATE_SCHED,
>  	__QDISC_STATE_DEACTIVATED,
>  	__QDISC_STATE_MISSED,
> +	__QDISC_STATE_DRAINING,
>  };
>
> +#define QDISC_STATE_MISSED	BIT(__QDISC_STATE_MISSED)
> +#define QDISC_STATE_DRAINING	BIT(__QDISC_STATE_DRAINING)
> +
> +#define QDISC_STATE_NON_EMPTY	(QDISC_STATE_MISSED | \
> +				 QDISC_STATE_DRAINING)
> +
>  struct qdisc_size_table {
>  	struct rcu_head rcu;
>  	struct list_head list;
> @@ -145,6 +152,11 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
>  	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
>  }
>
> +static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
> +{
> +	return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
> +}
> +
>  static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
>  {
>  	return q->flags & TCQ_F_CPUSTATS;
> @@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
>  		spin_unlock(&qdisc->seqlock);
>
>  		if (unlikely(test_bit(__QDISC_STATE_MISSED,
> -				      &qdisc->state))) {
> -			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> +				      &qdisc->state)))

Why remove the clear_bit()? Wasn't that added to avoid infinite
re-schedules?

>  			__netif_schedule(qdisc);
> -		}
>  	} else {
>  		write_seqcount_end(&qdisc->running);
>  	}
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 50531a2..e4cc926 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	qdisc_calculate_pkt_len(skb, q);
>
>  	if (q->flags & TCQ_F_NOLOCK) {
> +		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> +		    qdisc_run_begin(q)) {
> +			/* Retest nolock_qdisc_is_empty() within the protection
> +			 * of q->seqlock to ensure qdisc is indeed empty.
> +			 */
> +			if (unlikely(!nolock_qdisc_is_empty(q))) {

This is just for the DRAINING case right?

MISSED can be set at any moment, testing MISSED seems confusing.

Is it really worth the extra code?

> +				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> +				__qdisc_run(q);
> +				qdisc_run_end(q);
> +
> +				goto no_lock_out;
> +			}
> +
> +			qdisc_bstats_cpu_update(q, skb);
> +			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
> +			    !nolock_qdisc_is_empty(q))
> +				__qdisc_run(q);
> +
> +			qdisc_run_end(q);
> +			return NET_XMIT_SUCCESS;
> +		}
> +
>  		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -		if (likely(!netif_xmit_frozen_or_stopped(txq)))
> -			qdisc_run(q);
> +		qdisc_run(q);
>
> +no_lock_out:
>  		if (unlikely(to_free))
>  			kfree_skb_list(to_free);
>  		return rc;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index fc8b56b..83d7f5f 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -52,6 +52,8 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>  	 */
>  	if (!netif_xmit_frozen_or_stopped(txq))
>  		set_bit(__QDISC_STATE_MISSED, &q->state);
> +	else
> +		set_bit(__QDISC_STATE_DRAINING, &q->state);
>  }
>
>  /* Main transmission queue. */
> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>
>  		skb = next;
>  	}
> -	if (lock)
> +
> +	if (lock) {
>  		spin_unlock(lock);
> -	__netif_schedule(q);
> +		set_bit(__QDISC_STATE_MISSED, &q->state);
> +	} else {
> +		__netif_schedule(q);

Could we reorder qdisc_run_begin() with clear_bit(SCHED)
in net_tx_action() and add SCHED to the NON_EMPTY mask?

> +	}
>  }
>
>  static void try_bulk_dequeue_skb(struct Qdisc *q,
> @@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
>  	while (qdisc_restart(q, &packets)) {
>  		quota -= packets;
>  		if (quota <= 0) {
> -			__netif_schedule(q);
> +			if (q->flags & TCQ_F_NOLOCK)
> +				set_bit(__QDISC_STATE_MISSED, &q->state);
> +			else
> +				__netif_schedule(q);
> +
>  			break;
>  		}
>  	}
> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  	if (likely(skb)) {
>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>  	} else if (need_retry &&
> -		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> +		   READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {

Do we really need to retry based on DRAINING being set?
Or is it just a convenient way of coding things up?

>  		/* Delay clearing the STATE_MISSED here to reduce
>  		 * the overhead of the second spin_trylock() in
>  		 * qdisc_run_begin() and __netif_schedule() calling
>  		 * in qdisc_run_end().
>  		 */
>  		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> +		clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
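The emptiness test at the heart of the patch is a single unlocked load
tested against the combined bit mask. As a rough user-space model in C11
atomics (illustrative names only; atomic_load_explicit() stands in for
the kernel's READ_ONCE(), and the enum bits for the qdisc state flags):

	#include <stdatomic.h>
	#include <stdbool.h>

	/* Bit positions mirroring __QDISC_STATE_MISSED/__QDISC_STATE_DRAINING. */
	enum { STATE_MISSED = 1u << 0, STATE_DRAINING = 1u << 1 };
	#define STATE_NON_EMPTY (STATE_MISSED | STATE_DRAINING)

	struct fake_qdisc {
		_Atomic unsigned long state;
	};

	/* Mirrors the shape of nolock_qdisc_is_empty(): one relaxed load,
	 * no lock held.  Either bit being set means "treat the qdisc as
	 * non-empty" and fall back to the normal enqueue path instead of
	 * bypassing.
	 */
	static bool fake_nolock_qdisc_is_empty(struct fake_qdisc *q)
	{
		return !(atomic_load_explicit(&q->state, memory_order_relaxed) &
			 STATE_NON_EMPTY);
	}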
On 2021/5/29 9:00, Jakub Kicinski wrote:
> On Fri, 28 May 2021 10:49:56 +0800 Yunsheng Lin wrote:
>> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
>> flags set, but queue discipline bypass does not work for lockless
>> qdisc, because the skb is always enqueued to the qdisc even when the
>> qdisc is empty, see __dev_xmit_skb().
[...]
>> @@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
>>  		spin_unlock(&qdisc->seqlock);
>>
>>  		if (unlikely(test_bit(__QDISC_STATE_MISSED,
>> -				      &qdisc->state))) {
>> -			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
>> +				      &qdisc->state)))
>
> Why remove the clear_bit()? Wasn't that added to avoid infinite
> re-schedules?

As we also clear MISSED for the stopped and deactivated cases, the
above clear_bit() is not necessarily needed; it was added because of
the 0.02Mpps improvement for the 16-thread pktgen test case.

As we need to ensure clearing is within the protection of q->seqlock,
and the above clear_bit() is not within that protection, remove it.

>>  			__netif_schedule(qdisc);
>> -		}
>>  	} else {
>>  		write_seqcount_end(&qdisc->running);
>>  	}
[...]
>> +		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>> +		    qdisc_run_begin(q)) {
>> +			/* Retest nolock_qdisc_is_empty() within the protection
>> +			 * of q->seqlock to ensure qdisc is indeed empty.
>> +			 */
>> +			if (unlikely(!nolock_qdisc_is_empty(q))) {
>
> This is just for the DRAINING case right?
>
> MISSED can be set at any moment, testing MISSED seems confusing.

MISSED is only set when there is lock contention, which means it is
better not to do the qdisc bypass, to avoid the out-of-order packet
problem. Another good thing is that we can also do batch dequeuing
and transmitting of packets when there is lock contention.

> Is it really worth the extra code?

Currently DRAINING is only set for the netdev queue stopped case.
We could use only DRAINING to indicate a non-empty qdisc, but then we
would need to set DRAINING everywhere MISSED is set; that is why I
use both DRAINING and MISSED to indicate a non-empty qdisc.

>> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>>
>>  		skb = next;
>>  	}
>> -	if (lock)
>> +
>> +	if (lock) {
>>  		spin_unlock(lock);
>> -	__netif_schedule(q);
>> +		set_bit(__QDISC_STATE_MISSED, &q->state);
>> +	} else {
>> +		__netif_schedule(q);
>
> Could we reorder qdisc_run_begin() with clear_bit(SCHED)
> in net_tx_action() and add SCHED to the NON_EMPTY mask?

Did you mean clearing SCHED after q->seqlock is taken?

The problem is that SCHED is also used to indicate whether a qdisc is
in sd->output_queue or not, and the qdisc_run_begin() called by
net_tx_action() cannot guarantee it will take q->seqlock (we are
using trylock for lockless qdisc).

>> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>  	if (likely(skb)) {
>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>>  	} else if (need_retry &&
>> -		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
>> +		   READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
>
> Do we really need to retry based on DRAINING being set?
> Or is it just a convenient way of coding things up?

Yes, it is just a convenient way of coding things up. Only MISSED
needs retrying.
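The protocol described here — any thread may set MISSED, but only the
q->seqlock holder tests and clears it — can be sketched in user space
with a mutex standing in for q->seqlock (a rough model of the shape of
qdisc_run_begin() for the nolock case, not the kernel code):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>

	static pthread_mutex_t seqlock = PTHREAD_MUTEX_INITIALIZER; /* q->seqlock stand-in */
	static atomic_bool missed;                                  /* MISSED bit stand-in */

	/* Shape of qdisc_run_begin() for the nolock case: a cheap MISSED
	 * test first, then trylock; on contention set MISSED (setting
	 * needs no lock) and retry the trylock once.
	 */
	static bool run_begin(void)
	{
		if (atomic_load(&missed))
			return false;		/* contended: just enqueue */
		if (pthread_mutex_trylock(&seqlock) == 0)
			return true;
		atomic_store(&missed, true);
		return pthread_mutex_trylock(&seqlock) == 0;
	}

	/* Clearing, by contrast, happens only while the lock is held; so
	 * a holder that observes missed == false knows no enqueue has
	 * signalled contention since the last clear.
	 */
	static void clear_missed_locked(void)
	{
		atomic_store(&missed, false);
	}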
On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
>>> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>>  	qdisc_calculate_pkt_len(skb, q);
>>>
>>>  	if (q->flags & TCQ_F_NOLOCK) {
>>> +		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>> +		    qdisc_run_begin(q)) {
>>> +			/* Retest nolock_qdisc_is_empty() within the protection
>>> +			 * of q->seqlock to ensure qdisc is indeed empty.
>>> +			 */
>>> +			if (unlikely(!nolock_qdisc_is_empty(q))) {
>>
>> This is just for the DRAINING case right?
>>
>> MISSED can be set at any moment, testing MISSED seems confusing.
>
> MISSED is only set when there is lock contention, which means it is
> better not to do the qdisc bypass, to avoid the out-of-order packet
> problem,

Avoid as in make less likely? Nothing guarantees other thread is not
interrupted after ->enqueue and before qdisc_run_begin().

TBH I'm not sure what out-of-order situation you're referring to,
there is no ordering guarantee between separate threads trying to
transmit AFAIU.

IOW this check is not required for correctness, right?

> another good thing is that we can also do batch dequeuing
> and transmitting of packets when there is lock contention.

No doubt, but did you see the flag get set significantly often here
to warrant the double-checking?

>> Is it really worth the extra code?
>
> Currently DRAINING is only set for the netdev queue stopped case.
> We could use only DRAINING to indicate a non-empty qdisc, but then we
> would need to set DRAINING everywhere MISSED is set; that is why I
> use both DRAINING and MISSED to indicate a non-empty qdisc.
[...]
>> Could we reorder qdisc_run_begin() with clear_bit(SCHED)
>> in net_tx_action() and add SCHED to the NON_EMPTY mask?
>
> Did you mean clearing SCHED after q->seqlock is taken?
>
> The problem is that SCHED is also used to indicate whether a qdisc is
> in sd->output_queue or not, and the qdisc_run_begin() called by
> net_tx_action() cannot guarantee it will take q->seqlock (we are
> using trylock for lockless qdisc).

Ah, right. We'd need to do some more flag juggling in net_tx_action()
to get it right.

[...]
On 2021/5/29 12:32, Jakub Kicinski wrote:
> On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
>> MISSED is only set when there is lock contention, which means it is
>> better not to do the qdisc bypass, to avoid the out-of-order packet
>> problem,
>
> Avoid as in make less likely? Nothing guarantees other thread is not
> interrupted after ->enqueue and before qdisc_run_begin().
>
> TBH I'm not sure what out-of-order situation you're referring to,
> there is no ordering guarantee between separate threads trying to
> transmit AFAIU.

A thread needs to do the bypass check before doing the enqueuing, so
either MISSED is set, or the trylock fails for the bypass transmitting
(which will set MISSED after the first trylock). So MISSED will always
be set before a thread does an enqueue, and we ensure MISSED is only
cleared under the protection of q->seqlock; after clearing MISSED, we
do another round of dequeuing within the protection of q->seqlock.

So if a thread has taken q->seqlock and MISSED is not set yet, it is
allowed to send the packet directly without going through the qdisc
enqueue and dequeue process.

> IOW this check is not required for correctness, right?

If a thread has taken q->seqlock and MISSED is not set, it means no
other thread has set MISSED after the first trylock and before the
second trylock, which means the enqueuing is not done yet.

So I assume this check is required for correctness, if I understand
your question correctly.

>> another good thing is that we can also do batch dequeuing
>> and transmitting of packets when there is lock contention.
>
> No doubt, but did you see the flag get set significantly often here
> to warrant the double-checking?

No, that is just my guess:)

[...]
On Sat, 29 May 2021 15:03:09 +0800 Yunsheng Lin wrote:
> On 2021/5/29 12:32, Jakub Kicinski wrote:
>> On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
>>> MISSED is only set when there is lock contention, which means it is
>>> better not to do the qdisc bypass, to avoid the out-of-order packet
>>> problem,
> A thread needs to do the bypass check before doing the enqueuing, so
> either MISSED is set, or the trylock fails for the bypass transmitting
> (which will set MISSED after the first trylock). So MISSED will always
> be set before a thread does an enqueue, and we ensure MISSED is only
> cleared under the protection of q->seqlock; after clearing MISSED, we
> do another round of dequeuing within the protection of q->seqlock.

The fact that MISSED is only cleared under q->seqlock does not matter,
because setting it and ->enqueue() are not under any lock. If the thread
gets interrupted between:

	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
	    qdisc_run_begin(q)) {

and ->enqueue() we can't guarantee that something else won't come in,
take q->seqlock and clear MISSED.

  thread1               thread2               thread3
  # holds seqlock
                        qdisc_run_begin(q)
                        set(MISSED)
  pfifo_fast_dequeue
  clear(MISSED)
  # recheck the queue
  qdisc_run_end()
                        ->enqueue()
                                              q->flags & TCQ_F_CAN_BYPASS..
                                              qdisc_run_begin() # true
                                              sch_direct_xmit()
                        qdisc_run_begin()
                        set(MISSED)

Or am I missing something?

Re-checking nolock_qdisc_is_empty() may or may not help.
But it doesn't really matter because there is no ordering
requirement between thread2 and thread3 here.

> So if a thread has taken q->seqlock and MISSED is not set yet, it is
> allowed to send the packet directly without going through the qdisc
> enqueue and dequeue process.
>
>> IOW this check is not required for correctness, right?
>
> If a thread has taken q->seqlock and MISSED is not set, it means no
> other thread has set MISSED after the first trylock and before the
> second trylock, which means the enqueuing is not done yet.
>
> So I assume this check is required for correctness, if I understand
> your question correctly.
>
>>> another good thing is that we can also do batch dequeuing
>>> and transmitting of packets when there is lock contention.
>>
>> No doubt, but did you see the flag get set significantly often here
>> to warrant the double-checking?
>
> No, that is just my guess:)
On 2021/5/30 2:49, Jakub Kicinski wrote:
> The fact that MISSED is only cleared under q->seqlock does not matter,
> because setting it and ->enqueue() are not under any lock. If the thread
> gets interrupted between:
>
> 	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> 	    qdisc_run_begin(q)) {
>
> and ->enqueue() we can't guarantee that something else won't come in,
> take q->seqlock and clear MISSED.
>
>   thread1               thread2               thread3
>   # holds seqlock
>                         qdisc_run_begin(q)
>                         set(MISSED)
>   pfifo_fast_dequeue
>   clear(MISSED)
>   # recheck the queue
>   qdisc_run_end()
>                         ->enqueue()
>                                               q->flags & TCQ_F_CAN_BYPASS..
>                                               qdisc_run_begin() # true
>                                               sch_direct_xmit()
>                         qdisc_run_begin()
>                         set(MISSED)
>
> Or am I missing something?
>
> Re-checking nolock_qdisc_is_empty() may or may not help.
> But it doesn't really matter because there is no ordering
> requirement between thread2 and thread3 here.

I was more focused on explaining that using MISSED is as reliable as
the sch_may_need_requeuing() check in RFCv3 [1] for indicating an
empty qdisc, and forgot to mention the data race described in RFCv3,
which is kind of like the one described above:

"There is a data race as below:

        CPU1                        CPU2
  qdisc_run_begin(q)                  .
          .                      q->enqueue()
  sch_may_need_requeuing()            .
      return true                     .
          .                           .
          .                           .
     q->enqueue()                     .

When the above happens, the skb enqueued by CPU1 is dequeued after the
skb enqueued by CPU2, because sch_may_need_requeuing() returned true.
If there is no qdisc bypass, CPU1 has a better chance to queue its skb
quicker than CPU2.

This patch does not take care of the above data race, because I view
this as similar to the below:

Even if at the same time CPU1 and CPU2 write skbs to two sockets which
are both heading to the same qdisc, there is no guarantee which skb
will hit the qdisc first, because there are a lot of factors like
interrupt/softirq/cache miss/scheduling affecting that."

Does the above make sense? Or any idea to avoid it?

1. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/

[...]
On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
> I was more focused on explaining that using MISSED is as reliable as
> the sch_may_need_requeuing() check in RFCv3 [1] for indicating an
> empty qdisc, and forgot to mention the data race described in RFCv3,
> which is kind of like the one described above:
[...]
> Does the above make sense? Or any idea to avoid it?
>
> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/

We agree on this one.

Could you draw a sequence diagram of different CPUs (like the one
above) for the case where removing the re-checking of
nolock_qdisc_is_empty() under q->seqlock leads to incorrect behavior?

If there is no such case, would you be willing to repeat the benchmark
with and without this test?

Sorry for dragging the review out..
On 2021/5/31 4:21, Jakub Kicinski wrote:
> On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
[...]
> We agree on this one.
>
> Could you draw a sequence diagram of different CPUs (like the one
> above) for the case where removing the re-checking of
> nolock_qdisc_is_empty() under q->seqlock leads to incorrect behavior?

When nolock_qdisc_is_empty() is not re-checked under q->seqlock, we
may have:

        CPU1                                    CPU2
  qdisc_run_begin(q)                              .
          .                                  enqueue skb1
  dequeue skb1 and clear MISSED                   .
          .                     nolock_qdisc_is_empty() returns true
     requeue skb                                  .
     q->enqueue()                                 .
      set MISSED                                  .
          .                                       .
   qdisc_run_end(q)                               .
          .                               qdisc_run_begin(q)
          .                             transmit skb2 directly
          .                           transmit the requeued skb1

The problem here is that skb1 and skb2 are from the same CPU, which
means they are likely from the same flow, so we need to avoid this,
right?

> If there is no such case, would you be willing to repeat the benchmark
> with and without this test?
>
> Sorry for dragging the review out..
On 2021/5/31 8:40, Yunsheng Lin wrote:
> On 2021/5/31 4:21, Jakub Kicinski wrote:
[...]
> When nolock_qdisc_is_empty() is not re-checked under q->seqlock, we
> may have:
>
>         CPU1                                    CPU2
>   qdisc_run_begin(q)                              .
>           .                                  enqueue skb1
>   dequeue skb1 and clear MISSED                   .
>           .                     nolock_qdisc_is_empty() returns true
>      requeue skb                                  .
>      q->enqueue()                                 .
>       set MISSED                                  .
>           .                                       .
>    qdisc_run_end(q)                               .
>           .                               qdisc_run_begin(q)
>           .                             transmit skb2 directly
>           .                           transmit the requeued skb1
>
> The problem here is that skb1 and skb2 are from the same CPU, which
> means they are likely from the same flow, so we need to avoid this,
> right?

        CPU1                                    CPU2
  qdisc_run_begin(q)                              .
          .                                  enqueue skb1
     dequeue skb1                                 .
          .                                       .
  netdev queue stopped and MISSED is cleared      .
          .                     nolock_qdisc_is_empty() returns true
     requeue skb                                  .
          .                                       .
   qdisc_run_end(q)                               .
          .                               qdisc_run_begin(q)
          .                             transmit skb2 directly
          .                           transmit the requeued skb1

The above sequence diagram seems more correct; it is basically about
how to avoid transmitting a packet directly, bypassing the requeued
packet.
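Stated in code, the invariant the fix is after — an skb parked on the
requeue list must keep the state word reading non-empty — might be
modeled like this (a toy user-space sketch with hypothetical names, not
the kernel code):

	#include <stdatomic.h>
	#include <stdbool.h>

	enum { TOY_MISSED = 1u << 0, TOY_DRAINING = 1u << 1 };

	struct toy_qdisc {
		_Atomic(void *) requeued;   /* skb parked after a failed xmit */
		_Atomic unsigned int state; /* TOY_MISSED | TOY_DRAINING */
	};

	/* Park an skb, flagging non-emptiness first so a concurrent
	 * sender can no longer observe "empty" while the skb is parked.
	 * In the patch this corresponds to set_bit(MISSED) in
	 * dev_requeue_skb() plus DRAINING for the stopped-queue path in
	 * qdisc_maybe_clear_missed().
	 */
	static void toy_requeue(struct toy_qdisc *q, void *skb)
	{
		atomic_fetch_or(&q->state, TOY_MISSED);
		atomic_store(&q->requeued, skb);
	}

	/* A sender may bypass only while neither bit is set. */
	static bool toy_can_bypass(struct toy_qdisc *q)
	{
		return !(atomic_load(&q->state) & (TOY_MISSED | TOY_DRAINING));
	}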
On 2021/5/31 9:10, Yunsheng Lin wrote:
> On 2021/5/31 8:40, Yunsheng Lin wrote:
>> On 2021/5/31 4:21, Jakub Kicinski wrote:
[...]
> The above sequence diagram seems more correct; it is basically about
> how to avoid transmitting a packet directly, bypassing the requeued
> packet.
>
>>> If there is no such case, would you be willing to repeat the benchmark
>>> with and without this test?

I did some interesting testing showing how adjusting a small amount of
code has a noticeable performance impact.

1. I used the below patch to remove the nolock_qdisc_is_empty() test
under q->seqlock.

@@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	if (q->flags & TCQ_F_NOLOCK) {
 		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
 		    qdisc_run_begin(q)) {
-			/* Retest nolock_qdisc_is_empty() within the protection
-			 * of q->seqlock to ensure qdisc is indeed empty.
-			 */
-			if (unlikely(!nolock_qdisc_is_empty(q))) {
-				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-				__qdisc_run(q);
-				qdisc_run_end(q);
-
-				goto no_lock_out;
-			}
-
 			qdisc_bstats_cpu_update(q, skb);
 			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
 			    !nolock_qdisc_is_empty(q))
@@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
 		qdisc_run(q);

-no_lock_out:
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
 		return rc;

which has the below performance improvement:

 threads      v1      v1 + above patch    delta
    1      3.21Mpps      3.20Mpps         -0.3%
    2      5.56Mpps      5.94Mpps         +4.9%
    4      5.58Mpps      5.60Mpps         +0.3%
    8      2.76Mpps      2.77Mpps         +0.3%
   16      2.23Mpps      2.23Mpps         +0.0%

v1 = this patchset.

2. After the above testing, it seemed worthwhile to remove the
nolock_qdisc_is_empty() test under q->seqlock, so I used the below
patch to make sure nolock_qdisc_is_empty() always returns false for
the netdev queue stopped case.

--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
 static void qdisc_maybe_clear_missed(struct Qdisc *q,
 				     const struct netdev_queue *txq)
 {
+	set_bit(__QDISC_STATE_DRAINING, &q->state);
+
+	/* Make sure DRAINING is set before clearing MISSED
+	 * to make sure nolock_qdisc_is_empty() always returns
+	 * false, avoiding transmitting a packet directly,
+	 * bypassing the requeued packet.
+	 */
+	smp_mb__after_atomic();
+
 	clear_bit(__QDISC_STATE_MISSED, &q->state);

 	/* Make sure the below netif_xmit_frozen_or_stopped()
@@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
 	 */
 	if (!netif_xmit_frozen_or_stopped(txq))
 		set_bit(__QDISC_STATE_MISSED, &q->state);
-	else
-		set_bit(__QDISC_STATE_DRAINING, &q->state);
 }

which has the below performance data:

 threads      v1      v1 + above two patches    delta
    1      3.21Mpps      3.20Mpps                -0.3%
    2      5.56Mpps      5.94Mpps                +4.9%
    4      5.58Mpps      5.02Mpps                -10%
    8      2.76Mpps      2.77Mpps                +0.3%
   16      2.23Mpps      2.23Mpps                +0.0%

So the adjustment in qdisc_maybe_clear_missed() seems to have caused
about a 10% performance degradation for the 4-thread case.

The cpu topdown perf data suggested that icache misses and bad
speculation are the main factors in that performance difference.

I tried to control the above factors by removing the inline marking of
functions in sch_generic.c and adding likely/unlikely tags for the
netif_xmit_frozen_or_stopped() checks there. After that, we got a
noticeable performance improvement for the 1/2-thread cases (some
performance improvement for the ip forwarding test too), but not for
the 4-thread case.

So it seems we need to ignore the performance degradation for the
4-thread case? Or any idea?
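For reference, the ordering experiment in the second patch maps onto
C11 atomics roughly as follows (a user-space model with illustrative
names; the seq_cst fence stands in for smp_mb__after_atomic() in the
quoted patch):

	#include <stdatomic.h>
	#include <stdbool.h>

	enum { STATE_MISSED = 1u << 0, STATE_DRAINING = 1u << 1 };

	static _Atomic unsigned int state;

	/* Writer, modelling the modified qdisc_maybe_clear_missed():
	 * set DRAINING, full barrier, then clear MISSED, so the state
	 * word never reads with both bits clear while the queue is
	 * stopped and an skb may still be pending.
	 */
	static void maybe_clear_missed(void)
	{
		atomic_fetch_or_explicit(&state, STATE_DRAINING,
					 memory_order_relaxed);
		/* Stand-in for smp_mb__after_atomic(). */
		atomic_thread_fence(memory_order_seq_cst);
		atomic_fetch_and_explicit(&state, ~(unsigned int)STATE_MISSED,
					  memory_order_relaxed);
	}

	/* Reader, modelling nolock_qdisc_is_empty(): a single load
	 * tested against the combined mask.
	 */
	static bool is_empty(void)
	{
		return !(atomic_load_explicit(&state, memory_order_relaxed) &
			 (STATE_MISSED | STATE_DRAINING));
	}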
On Mon, 31 May 2021 20:40:01 +0800 Yunsheng Lin wrote:
> On 2021/5/31 9:10, Yunsheng Lin wrote:
>> On 2021/5/31 8:40, Yunsheng Lin wrote:
[...]
>>         CPU1                                    CPU2
>>   qdisc_run_begin(q)                              .
>>           .                                  enqueue skb1
>>      dequeue skb1                                 .
>>           .                                       .
>>   netdev queue stopped and MISSED is cleared      .
>>           .                     nolock_qdisc_is_empty() returns true
>>      requeue skb                                  .
>>           .                                       .
>>    qdisc_run_end(q)                               .
>>           .                               qdisc_run_begin(q)
>>           .                             transmit skb2 directly
>>           .                           transmit the requeued skb1
>>
>> The above sequence diagram seems more correct; it is basically about
>> how to avoid transmitting a packet directly, bypassing the requeued
>> packet.

I see, thanks! That explains the need. Perhaps we can rephrase the
comment? Maybe:

+			/* Retest nolock_qdisc_is_empty() within the protection
+			 * of q->seqlock to protect from racing with requeuing.
+			 */

> 2. After the above testing, it seemed worthwhile to remove the
> nolock_qdisc_is_empty() test under q->seqlock, so I used the below
> patch to make sure nolock_qdisc_is_empty() always returns false for
> the netdev queue stopped case.
>
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
>  static void qdisc_maybe_clear_missed(struct Qdisc *q,
>  				     const struct netdev_queue *txq)
>  {
> +	set_bit(__QDISC_STATE_DRAINING, &q->state);
> +
> +	/* Make sure DRAINING is set before clearing MISSED
> +	 * to make sure nolock_qdisc_is_empty() always returns
> +	 * false, avoiding transmitting a packet directly,
> +	 * bypassing the requeued packet.
> +	 */
> +	smp_mb__after_atomic();
> +
>  	clear_bit(__QDISC_STATE_MISSED, &q->state);
>
>  	/* Make sure the below netif_xmit_frozen_or_stopped()
> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>  	 */
>  	if (!netif_xmit_frozen_or_stopped(txq))
>  		set_bit(__QDISC_STATE_MISSED, &q->state);
> -	else
> -		set_bit(__QDISC_STATE_DRAINING, &q->state);
>  }

But this would not be enough because we may also clear MISSED
in pfifo_fast_dequeue()?

> which has the below performance data:
>
>  threads      v1      v1 + above two patches    delta
>     1      3.21Mpps      3.20Mpps                -0.3%
>     2      5.56Mpps      5.94Mpps                +4.9%
>     4      5.58Mpps      5.02Mpps                -10%
>     8      2.76Mpps      2.77Mpps                +0.3%
>    16      2.23Mpps      2.23Mpps                +0.0%
>
> So the adjustment in qdisc_maybe_clear_missed() seems to have caused
> about a 10% performance degradation for the 4-thread case.
[...]
> So it seems we need to ignore the performance degradation for the
> 4-thread case? Or any idea?

No ideas, are the threads pinned to CPUs in some particular way?
On 2021/6/1 12:51, Jakub Kicinski wrote:
> On Mon, 31 May 2021 20:40:01 +0800 Yunsheng Lin wrote:
[...]
> I see, thanks! That explains the need. Perhaps we can rephrase the
> comment? Maybe:
>
> +			/* Retest nolock_qdisc_is_empty() within the protection
> +			 * of q->seqlock to protect from racing with requeuing.
> +			 */

Yes, if we still decide to preserve the nolock_qdisc_is_empty()
recheck under q->seqlock.

[...]
> But this would not be enough because we may also clear MISSED
> in pfifo_fast_dequeue()?

For the MISSED clearing in pfifo_fast_dequeue(), it seems it looks
like the data race described in RFC v3 too?

        CPU1                    CPU2                    CPU3
  qdisc_run_begin(q)              .                       .
          .                  MISSED is set                .
  MISSED is cleared               .                       .
     q->dequeue()                 .                       .
          .                  enqueue skb1        check MISSED # true
   qdisc_run_end(q)               .                       .
          .                       .           qdisc_run_begin(q) # true
          .                  MISSED is set      send skb2 directly

> which has the below performance data:
>
>  threads      v1      v1 + above two patches    delta
>     1      3.21Mpps      3.20Mpps                -0.3%
>     2      5.56Mpps      5.94Mpps                +4.9%
>     4      5.58Mpps      5.02Mpps                -10%
>     8      2.76Mpps      2.77Mpps                +0.3%
>    16      2.23Mpps      2.23Mpps                +0.0%
[...]
> No ideas, are the threads pinned to CPUs in some particular way?

pktgen already seems to run a thread for each CPU, so I do not need to
do the pinning myself; for the 4-thread case, it runs on cpus 0~3. It
seems more related to the specific cpu implementation.
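If explicit pinning were desired for such a test, it is only a few
lines of user-space code (a generic sketch using
pthread_setaffinity_np(), unrelated to pktgen's own per-CPU kthread
binding):

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>

	/* Pin the calling thread to one CPU so per-thread throughput
	 * numbers are not skewed by the scheduler migrating threads
	 * between runs.  Returns 0 on success.
	 */
	static int pin_to_cpu(int cpu)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		return pthread_setaffinity_np(pthread_self(),
					      sizeof(set), &set);
	}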
On Tue, 1 Jun 2021 16:18:54 +0800 Yunsheng Lin wrote:
> > I see, thanks! That explains the need. Perhaps we can rephrase the
> > comment? Maybe:
> >
> > +			/* Retest nolock_qdisc_is_empty() within the protection
> > +			 * of q->seqlock to protect from racing with requeuing.
> > +			 */
>
> Yes, if we still decide to preserve the nolock_qdisc_is_empty() rechecking
> under q->seqlock.

Sounds good.

> >> --- a/net/sched/sch_generic.c
> >> +++ b/net/sched/sch_generic.c
> >> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
> >>  static void qdisc_maybe_clear_missed(struct Qdisc *q,
> >>                                       const struct netdev_queue *txq)
> >>  {
> >> +	set_bit(__QDISC_STATE_DRAINING, &q->state);
> >> +
> >> +	/* Make sure DRAINING is set before clearing MISSED
> >> +	 * to make sure nolock_qdisc_is_empty() always returns
> >> +	 * false, avoiding transmitting a packet directly
> >> +	 * bypassing the requeued packet.
> >> +	 */
> >> +	smp_mb__after_atomic();
> >> +
> >>  	clear_bit(__QDISC_STATE_MISSED, &q->state);
> >>
> >>  	/* Make sure the below netif_xmit_frozen_or_stopped()
> >> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
> >>  	 */
> >>  	if (!netif_xmit_frozen_or_stopped(txq))
> >>  		set_bit(__QDISC_STATE_MISSED, &q->state);
> >> -	else
> >> -		set_bit(__QDISC_STATE_DRAINING, &q->state);
> >>  }
> >
> > But this would not be enough because we may also clear MISSED
> > in pfifo_fast_dequeue()?
>
> For the MISSED clearing in pfifo_fast_dequeue(), it looks like the
> data race described in RFC v3 too:
>
>       CPU1                 CPU2                    CPU3
> qdisc_run_begin(q)           .                       .
>         .               MISSED is set                .
>  MISSED is cleared           .                       .
>    q->dequeue()              .                       .
>         .               enqueue skb1          check MISSED # true
>  qdisc_run_end(q)            .                       .
>         .                    .             qdisc_run_begin(q) # true
>         .               MISSED is set        send skb2 directly

Not sure what you mean.
On 2021/6/2 4:48, Jakub Kicinski wrote:
> On Tue, 1 Jun 2021 16:18:54 +0800 Yunsheng Lin wrote:
>>> I see, thanks! That explains the need. Perhaps we can rephrase the
>>> comment? Maybe:
>>>
>>> +			/* Retest nolock_qdisc_is_empty() within the protection
>>> +			 * of q->seqlock to protect from racing with requeuing.
>>> +			 */
>>
>> Yes, if we still decide to preserve the nolock_qdisc_is_empty() rechecking
>> under q->seqlock.
>
> Sounds good.
>
>>>> --- a/net/sched/sch_generic.c
>>>> +++ b/net/sched/sch_generic.c
>>>> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
>>>>  static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>>>                                       const struct netdev_queue *txq)
>>>>  {
>>>> +	set_bit(__QDISC_STATE_DRAINING, &q->state);
>>>> +
>>>> +	/* Make sure DRAINING is set before clearing MISSED
>>>> +	 * to make sure nolock_qdisc_is_empty() always returns
>>>> +	 * false, avoiding transmitting a packet directly
>>>> +	 * bypassing the requeued packet.
>>>> +	 */
>>>> +	smp_mb__after_atomic();
>>>> +
>>>>  	clear_bit(__QDISC_STATE_MISSED, &q->state);
>>>>
>>>>  	/* Make sure the below netif_xmit_frozen_or_stopped()
>>>> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>>>  	 */
>>>>  	if (!netif_xmit_frozen_or_stopped(txq))
>>>>  		set_bit(__QDISC_STATE_MISSED, &q->state);
>>>> -	else
>>>> -		set_bit(__QDISC_STATE_DRAINING, &q->state);
>>>>  }
>>>
>>> But this would not be enough because we may also clear MISSED
>>> in pfifo_fast_dequeue()?
>>
>> For the MISSED clearing in pfifo_fast_dequeue(), it looks like the
>> data race described in RFC v3 too:
>>
>>       CPU1                 CPU2                    CPU3
>> qdisc_run_begin(q)           .                       .
>>         .               MISSED is set                .
>>  MISSED is cleared           .                       .
>>    q->dequeue()              .                       .
>>         .               enqueue skb1          check MISSED # true
>>  qdisc_run_end(q)            .                       .
>>         .                    .             qdisc_run_begin(q) # true
>>         .               MISSED is set        send skb2 directly
>
> Not sure what you mean.

      CPU1                 CPU2                    CPU3
qdisc_run_begin(q)           .                       .
        .               MISSED is set                .
 MISSED is cleared           .                       .
 another dequeuing           .                       .
        .                    .                       .
        .               enqueue skb1    nolock_qdisc_is_empty() # true
qdisc_run_end(q)             .                       .
        .                    .             qdisc_run_begin(q) # true
        .                    .              send skb2 directly
        .               MISSED is set                .

As the qdisc is indeed empty at the point when MISSED is cleared and
another dequeue is retried by CPU1, and as the MISSED setting is not
under q->seqlock, retesting MISSED under q->seqlock does not seem to
make any difference. This seems like the case that does not need
handling, as we agreed previously?

>
> .
>
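[For readers following the diagram above, here is a compact userspace
C11 model, illustrative only: model_qdisc, the one-entry slot standing
in for the real ptr_ring, and the helper names are assumptions rather
than the kernel's API. It sketches the pfifo_fast_dequeue() retry path
in which CPU1 clears MISSED and DRAINING and performs "another
dequeuing".]

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define STATE_MISSED    (1UL << 0)
#define STATE_DRAINING  (1UL << 1)
#define STATE_NON_EMPTY (STATE_MISSED | STATE_DRAINING)

struct model_qdisc {
	_Atomic unsigned long state;
	_Atomic(void *) slot;	/* one-entry stand-in for the ptr_ring */
};

/* Mirrors the shape of pfifo_fast_dequeue() after the patch: on an
 * apparently empty ring with a NON_EMPTY state, clear both flags and
 * look at the ring exactly once more.
 */
static void *model_dequeue(struct model_qdisc *q)
{
	bool need_retry = true;
	void *skb;

retry:
	skb = atomic_exchange_explicit(&q->slot, NULL, memory_order_acquire);
	if (!skb && need_retry &&
	    (atomic_load_explicit(&q->state, memory_order_relaxed) &
	     STATE_NON_EMPTY)) {
		/* clear_bit(MISSED) + clear_bit(DRAINING), folded into
		 * one atomic op for the model
		 */
		atomic_fetch_and_explicit(&q->state, ~STATE_NON_EMPTY,
					  memory_order_relaxed);
		/* smp_mb__after_atomic(): the re-dequeue below happens
		 * after the clears
		 */
		atomic_thread_fence(memory_order_seq_cst);
		need_retry = false;
		goto retry;
	}
	return skb;
}

int main(void)
{
	struct model_qdisc q = { STATE_MISSED, NULL };

	/* Empty ring with MISSED set: the retry clears the flags, finds
	 * the ring still empty and returns NULL, just like CPU1 above.
	 */
	return model_dequeue(&q) == NULL ? 0 : 1;
}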
On Wed, 2 Jun 2021 09:21:01 +0800 Yunsheng Lin wrote:
> >> For the MISSED clearing in pfifo_fast_dequeue(), it looks like the
> >> data race described in RFC v3 too:
> >>
> >>       CPU1                 CPU2                    CPU3
> >> qdisc_run_begin(q)           .                       .
> >>         .               MISSED is set                .
> >>  MISSED is cleared           .                       .
> >>    q->dequeue()              .                       .
> >>         .               enqueue skb1          check MISSED # true
> >>  qdisc_run_end(q)            .                       .
> >>         .                    .             qdisc_run_begin(q) # true
> >>         .               MISSED is set        send skb2 directly
> >
> > Not sure what you mean.
>
>       CPU1                 CPU2                    CPU3
> qdisc_run_begin(q)           .                       .
>         .               MISSED is set                .
>  MISSED is cleared           .                       .
>  another dequeuing           .                       .
>         .                    .                       .
>         .               enqueue skb1    nolock_qdisc_is_empty() # true
> qdisc_run_end(q)             .                       .
>         .                    .             qdisc_run_begin(q) # true
>         .                    .              send skb2 directly
>         .               MISSED is set                .
>
> As the qdisc is indeed empty at the point when MISSED is cleared and
> another dequeue is retried by CPU1, and as the MISSED setting is not
> under q->seqlock, retesting MISSED under q->seqlock does not seem to
> make any difference. This seems like the case that does not need
> handling, as we agreed previously?

Right, this case doesn't need the re-check under the lock, but as
pointed out earlier, the requeuing case does require it.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 3ed6bcc..177f240 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -37,8 +37,15 @@ enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
 	__QDISC_STATE_MISSED,
+	__QDISC_STATE_DRAINING,
 };
 
+#define QDISC_STATE_MISSED	BIT(__QDISC_STATE_MISSED)
+#define QDISC_STATE_DRAINING	BIT(__QDISC_STATE_DRAINING)
+
+#define QDISC_STATE_NON_EMPTY	(QDISC_STATE_MISSED | \
+				 QDISC_STATE_DRAINING)
+
 struct qdisc_size_table {
 	struct rcu_head rcu;
 	struct list_head list;
@@ -145,6 +152,11 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
 	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
 }
 
+static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
+{
+	return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
+}
+
 static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
 {
 	return q->flags & TCQ_F_CPUSTATS;
@@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
 		spin_unlock(&qdisc->seqlock);
 
 		if (unlikely(test_bit(__QDISC_STATE_MISSED,
-				      &qdisc->state))) {
-			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+				      &qdisc->state)))
 			__netif_schedule(qdisc);
-		}
 	} else {
 		write_seqcount_end(&qdisc->running);
 	}
diff --git a/net/core/dev.c b/net/core/dev.c
index 50531a2..e4cc926 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	qdisc_calculate_pkt_len(skb, q);
 
 	if (q->flags & TCQ_F_NOLOCK) {
+		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
+		    qdisc_run_begin(q)) {
+			/* Retest nolock_qdisc_is_empty() within the protection
+			 * of q->seqlock to ensure qdisc is indeed empty.
+			 */
+			if (unlikely(!nolock_qdisc_is_empty(q))) {
+				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+				__qdisc_run(q);
+				qdisc_run_end(q);
+
+				goto no_lock_out;
+			}
+
+			qdisc_bstats_cpu_update(q, skb);
+			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
+			    !nolock_qdisc_is_empty(q))
+				__qdisc_run(q);
+
+			qdisc_run_end(q);
+			return NET_XMIT_SUCCESS;
+		}
+
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		if (likely(!netif_xmit_frozen_or_stopped(txq)))
-			qdisc_run(q);
+		qdisc_run(q);
 
+no_lock_out:
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
 		return rc;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index fc8b56b..83d7f5f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -52,6 +52,8 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
 	 */
 	if (!netif_xmit_frozen_or_stopped(txq))
 		set_bit(__QDISC_STATE_MISSED, &q->state);
+	else
+		set_bit(__QDISC_STATE_DRAINING, &q->state);
 }
 
 /* Main transmission queue.
 */
@@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 		skb = next;
 	}
-	if (lock)
+
+	if (lock) {
 		spin_unlock(lock);
-	__netif_schedule(q);
+		set_bit(__QDISC_STATE_MISSED, &q->state);
+	} else {
+		__netif_schedule(q);
+	}
 }
 
 static void try_bulk_dequeue_skb(struct Qdisc *q,
@@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
 	while (qdisc_restart(q, &packets)) {
 		quota -= packets;
 		if (quota <= 0) {
-			__netif_schedule(q);
+			if (q->flags & TCQ_F_NOLOCK)
+				set_bit(__QDISC_STATE_MISSED, &q->state);
+			else
+				__netif_schedule(q);
+
 			break;
 		}
 	}
@@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
 	} else if (need_retry &&
-		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
+		   READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
 		/* Delay clearing the STATE_MISSED here to reduce
 		 * the overhead of the second spin_trylock() in
 		 * qdisc_run_begin() and __netif_schedule() calling
 		 * in qdisc_run_end().
 		 */
 		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+		clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
 
 		/* Make sure dequeuing happens after clearing
 		 * STATE_MISSED.
@@ -1204,6 +1215,7 @@ static void dev_reset_queue(struct net_device *dev,
 		spin_unlock_bh(qdisc_lock(qdisc));
 	if (nolock) {
 		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+		clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
 		spin_unlock_bh(&qdisc->seqlock);
 	}
 }
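[The diff above leans on the seqlock/MISSED handshake introduced in
commit a90c57f2cedd ("net: sched: fix packet stuck problem for lockless
qdisc"). A simplified userspace C11 model of that handshake, plus the
v1 change to qdisc_run_end() of scheduling while MISSED is still set
without clearing it, is sketched below. A pthread mutex stands in for
q->seqlock and all names are illustrative, not the kernel's.]

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_MISSED (1UL << 0)

struct model_qdisc {
	pthread_mutex_t seqlock;	/* stand-in for q->seqlock */
	_Atomic unsigned long state;
	int rescheduled;		/* counts __netif_schedule() stand-ins */
};

/* Simplified model of qdisc_run_begin() for the nolock case: trylock,
 * and on contention leave MISSED set as a note for the lock owner,
 * then try once more in case the owner released the lock in between.
 */
static bool model_run_begin(struct model_qdisc *q)
{
	if (pthread_mutex_trylock(&q->seqlock) == 0)
		return true;

	if (atomic_load_explicit(&q->state, memory_order_relaxed) &
	    STATE_MISSED)
		return false;	/* someone already left the note */

	atomic_fetch_or_explicit(&q->state, STATE_MISSED,
				 memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* smp_mb__after_atomic() */

	return pthread_mutex_trylock(&q->seqlock) == 0;
}

/* v1 behaviour of qdisc_run_end(): MISSED is no longer cleared here;
 * if it is still set, the qdisc gets rescheduled so the note is never
 * lost.
 */
static void model_run_end(struct model_qdisc *q)
{
	pthread_mutex_unlock(&q->seqlock);

	if (atomic_load_explicit(&q->state, memory_order_relaxed) &
	    STATE_MISSED)
		q->rescheduled++;	/* __netif_schedule(q) in the kernel */
}

int main(void)
{
	struct model_qdisc q = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };

	if (model_run_begin(&q))
		model_run_end(&q);
	return 0;
}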
Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flags set, but queue discipline bypass does not work for lockless
qdisc because the skb is always enqueued to the qdisc even when the
qdisc is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly to
the driver for an empty lockless qdisc, which avoids the enqueuing
and dequeuing operation.

qdisc->empty is not reliable for indicating an empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So instead we use the MISSED state added in commit a90c57f2cedd
("net: sched: fix packet stuck problem for lockless qdisc"), which
indicates there is lock contention, suggesting that it is better not
to do the qdisc bypass in order to avoid the packet reordering
problem.

In order to make the MISSED state reliably indicate an empty qdisc,
we need to ensure that the testing and clearing of the MISSED state
are within the protection of qdisc->seqlock; only setting the MISSED
state can be done without the protection of qdisc->seqlock. A MISSED
state test is added without the protection of qdisc->seqlock to
avoid doing an unnecessary spin_trylock() in the contention case.

The below cases need special handling:
1. When the MISSED state is cleared before another round of dequeuing
   in pfifo_fast_dequeue(), __qdisc_run() might not be able to
   dequeue all skbs in one round and call __netif_schedule(), which
   might result in a non-empty qdisc without MISSED set. In order to
   avoid this, the MISSED state is set for the lockless qdisc and
   __netif_schedule() is called at the end of qdisc_run_end().

2. The MISSED state also needs to be set for the lockless qdisc,
   instead of calling __netif_schedule() directly, when requeuing an
   skb, for a similar reason.

3. For the netdev queue stopped case, the MISSED state needs clearing
   while the netdev queue is stopped, otherwise there may be
   unnecessary __netif_schedule() calls. So a new DRAINING state is
   added to indicate this case, which also indicates a non-empty
   qdisc.

4. There is already netif_xmit_frozen_or_stopped() checking in
   dequeue_skb() and sch_direct_xmit(), both within the protection of
   qdisc->seqlock, but the same checking in __dev_xmit_skb() is
   without that protection, which might cause the empty indication of
   a lockless qdisc to be unreliable. So remove the checking in
   __dev_xmit_skb(); the checking within the protection of
   qdisc->seqlock seems enough to avoid the cpu consumption problem
   for the netdev queue stopped case.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V1: Add nolock_qdisc_is_empty() and do the qdisc empty checking
    without the protection of qdisc->seqlock to avoid doing an
    unnecessary spin_trylock() in the contention case.
RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate a non-empty
    qdisc.
---
 include/net/sch_generic.h | 16 +++++++++++++---
 net/core/dev.c            | 26 ++++++++++++++++++++++++--
 net/sched/sch_generic.c   | 20 ++++++++++++++++----
 3 files changed, 53 insertions(+), 9 deletions(-)
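[To tie the pieces of the description together, here is a self-contained
userspace C11 sketch of the bypass decision: a cheap lockless empty
test before the trylock, then a retest under the lock to avoid racing
with requeuing. All helpers here are hypothetical stand-ins for the
kernel functions named in the comments, not the kernel implementation.]

#include <stdatomic.h>
#include <stdbool.h>

#define STATE_MISSED    (1UL << 0)
#define STATE_DRAINING  (1UL << 1)
#define STATE_NON_EMPTY (STATE_MISSED | STATE_DRAINING)

static _Atomic unsigned long state;
static _Atomic bool seqlock;

/* Mirrors nolock_qdisc_is_empty(): one cheap load, no locking */
static bool model_is_empty(void)
{
	return !(atomic_load_explicit(&state, memory_order_relaxed) &
		 STATE_NON_EMPTY);
}

static bool try_lock(void)		/* stand-in for qdisc_run_begin() */
{
	return !atomic_exchange(&seqlock, true);
}

static void unlock(void)		/* stand-in for qdisc_run_end() */
{
	atomic_store(&seqlock, false);
}

static void direct_xmit(void *skb)	/* stand-in for sch_direct_xmit() */
{
	(void)skb;
}

static void enqueue_and_run(void *skb)	/* q->enqueue() + qdisc_run() */
{
	(void)skb;
}

/* The bypass decision from the description above: test empty without
 * the lock first (avoiding a useless spin_trylock() under contention),
 * then retest under the lock so a requeued packet is never overtaken.
 */
static void model_xmit(void *skb)
{
	if (model_is_empty() && try_lock()) {
		if (!model_is_empty()) {
			/* Raced with requeuing: go through the queue to
			 * keep packet order.
			 */
			enqueue_and_run(skb);
			unlock();
			return;
		}
		direct_xmit(skb);
		unlock();
		return;
	}
	enqueue_and_run(skb);
}

int main(void)
{
	int pkt;

	model_xmit(&pkt);
	return 0;
}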