[net-next,2/3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

Message ID 1622170197-27370-3-git-send-email-linyunsheng@huawei.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series Some optimization for lockless qdisc

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success
netdev/cc_maintainers success CCed 20 of 20 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 3466 this patch: 3466
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 3567 this patch: 3567
netdev/header_inline success

Commit Message

Yunsheng Lin May 28, 2021, 2:49 a.m. UTC
Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flags set, but queue discipline bypass does not work for lockless
qdisc because the skb is always enqueued to the qdisc even when the
qdisc is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for an empty lockless qdisc, which avoids the
enqueue and dequeue operations.

qdisc->empty is not reliable for indicating an empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit a90c57f2cedd ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicates lock contention, suggesting that it is better not to do
the qdisc bypass in order to avoid the packet out-of-order problem.

In order to make the MISSED state reliable for indicating an empty
qdisc, we need to ensure that testing and clearing of the MISSED
state are done within the protection of qdisc->seqlock; only
setting the MISSED state can be done without that protection. A
MISSED state test is also added outside the protection of
qdisc->seqlock to avoid doing an unnecessary spin_trylock() in the
contention case.
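
In code, the resulting fast path in __dev_xmit_skb() looks like
below (abridged from the diff in this patch):

	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
	    qdisc_run_begin(q)) {
		/* Retest nolock_qdisc_is_empty() within the protection
		 * of q->seqlock to ensure the qdisc is indeed empty.
		 */
		if (unlikely(!nolock_qdisc_is_empty(q))) {
			/* raced: fall back to normal enqueue/dequeue */
			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
			__qdisc_run(q);
			qdisc_run_end(q);
			goto no_lock_out;
		}

		/* verifiably empty: hand the skb straight to the driver */
		qdisc_bstats_cpu_update(q, skb);
		if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
		    !nolock_qdisc_is_empty(q))
			__qdisc_run(q);

		qdisc_run_end(q);
		return NET_XMIT_SUCCESS;
	}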

The cases below need special handling:
1. The MISSED state is cleared before another round of dequeuing
   in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
   dequeue all skbs in one round and call __netif_schedule(), which
   might result in a non-empty qdisc without MISSED set. In order
   to avoid this, the MISSED state is set for a lockless qdisc and
   __netif_schedule() is called at the end of qdisc_run_end().

2. The MISSED state also needs to be set for a lockless qdisc,
   instead of calling __netif_schedule() directly, when requeuing
   an skb, for a similar reason.

3. For the netdev queue stopped case, the MISSED state needs to be
   cleared while the netdev queue is stopped, otherwise there may
   be unnecessary __netif_schedule() calls. So a new DRAINING state
   is added to indicate this case, which also indicates a non-empty
   qdisc (see the sketch after this list).

4. There are already netif_xmit_frozen_or_stopped() checks in
   dequeue_skb() and sch_direct_xmit(), both within the protection
   of qdisc->seqlock, but the same check in __dev_xmit_skb() is
   done without that protection, which might make the empty
   indication of a lockless qdisc unreliable. So remove the check
   in __dev_xmit_skb(); the checks under the protection of
   qdisc->seqlock seem enough to avoid the CPU consumption problem
   for the netdev queue stopped case.
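
For case 3, qdisc_maybe_clear_missed() ends up roughly as below (a
sketch based on this patch's hunk; the upstream function also has a
memory barrier between the clear and the recheck):

static void qdisc_maybe_clear_missed(struct Qdisc *q,
				     const struct netdev_queue *txq)
{
	clear_bit(__QDISC_STATE_MISSED, &q->state);

	/* Re-set MISSED only if the queue is still usable; otherwise
	 * mark the qdisc DRAINING so nolock_qdisc_is_empty() keeps
	 * returning false without triggering more __netif_schedule()
	 * calls.
	 */
	if (!netif_xmit_frozen_or_stopped(txq))
		set_bit(__QDISC_STATE_MISSED, &q->state);
	else
		set_bit(__QDISC_STATE_DRAINING, &q->state);
}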

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V1: Add nolock_qdisc_is_empty() and do the qdisc empty checking
    without the protection of qdisc->seqlock to avoid doing an
    unnecessary spin_trylock() in the contention case.
RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate non-empty
        qdisc.
---
 include/net/sch_generic.h | 16 +++++++++++++---
 net/core/dev.c            | 26 ++++++++++++++++++++++++--
 net/sched/sch_generic.c   | 20 ++++++++++++++++----
 3 files changed, 53 insertions(+), 9 deletions(-)

Comments

Jakub Kicinski May 29, 2021, 1 a.m. UTC | #1
On Fri, 28 May 2021 10:49:56 +0800 Yunsheng Lin wrote:
> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
> flags set, but queue discipline bypass does not work for lockless
> qdisc because the skb is always enqueued to the qdisc even when the
> qdisc is empty, see __dev_xmit_skb().
> 
> This patch calls sch_direct_xmit() to transmit the skb directly
> to the driver for an empty lockless qdisc, which avoids the
> enqueue and dequeue operations.
> 
> qdisc->empty is not reliable for indicating an empty qdisc because
> there is a time window between enqueuing and setting qdisc->empty.
> So we use the MISSED state added in commit a90c57f2cedd ("net:
> sched: fix packet stuck problem for lockless qdisc"), which
> indicates lock contention, suggesting that it is better not to do
> the qdisc bypass in order to avoid the packet out-of-order problem.
> 
> In order to make the MISSED state reliable for indicating an empty
> qdisc, we need to ensure that testing and clearing of the MISSED
> state are done within the protection of qdisc->seqlock; only
> setting the MISSED state can be done without that protection. A
> MISSED state test is also added outside the protection of
> qdisc->seqlock to avoid doing an unnecessary spin_trylock() in the
> contention case.
> 
> The cases below need special handling:
> 1. The MISSED state is cleared before another round of dequeuing
>    in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
>    dequeue all skbs in one round and call __netif_schedule(), which
>    might result in a non-empty qdisc without MISSED set. In order
>    to avoid this, the MISSED state is set for a lockless qdisc and
>    __netif_schedule() is called at the end of qdisc_run_end().
> 
> 2. The MISSED state also needs to be set for a lockless qdisc,
>    instead of calling __netif_schedule() directly, when requeuing
>    an skb, for a similar reason.
> 
> 3. For the netdev queue stopped case, the MISSED state needs to be
>    cleared while the netdev queue is stopped, otherwise there may
>    be unnecessary __netif_schedule() calls. So a new DRAINING state
>    is added to indicate this case, which also indicates a non-empty
>    qdisc.
> 
> 4. There are already netif_xmit_frozen_or_stopped() checks in
>    dequeue_skb() and sch_direct_xmit(), both within the protection
>    of qdisc->seqlock, but the same check in __dev_xmit_skb() is
>    done without that protection, which might make the empty
>    indication of a lockless qdisc unreliable. So remove the check
>    in __dev_xmit_skb(); the checks under the protection of
>    qdisc->seqlock seem enough to avoid the CPU consumption problem
>    for the netdev queue stopped case.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>

> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3ed6bcc..177f240 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -37,8 +37,15 @@ enum qdisc_state_t {
>  	__QDISC_STATE_SCHED,
>  	__QDISC_STATE_DEACTIVATED,
>  	__QDISC_STATE_MISSED,
> +	__QDISC_STATE_DRAINING,
>  };
>  
> +#define QDISC_STATE_MISSED	BIT(__QDISC_STATE_MISSED)
> +#define QDISC_STATE_DRAINING	BIT(__QDISC_STATE_DRAINING)
> +
> +#define QDISC_STATE_NON_EMPTY	(QDISC_STATE_MISSED | \
> +					QDISC_STATE_DRAINING)
> +
>  struct qdisc_size_table {
>  	struct rcu_head		rcu;
>  	struct list_head	list;
> @@ -145,6 +152,11 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
>  	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
>  }
>  
> +static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
> +{
> +	return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
> +}
> +
>  static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
>  {
>  	return q->flags & TCQ_F_CPUSTATS;
> @@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
>  		spin_unlock(&qdisc->seqlock);
>  
>  		if (unlikely(test_bit(__QDISC_STATE_MISSED,
> -				      &qdisc->state))) {
> -			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> +				      &qdisc->state)))

Why remove the clear_bit()? Wasn't that added to avoid infinite
re-schedules?

>  			__netif_schedule(qdisc);
> -		}
>  	} else {
>  		write_seqcount_end(&qdisc->running);
>  	}
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 50531a2..e4cc926 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	qdisc_calculate_pkt_len(skb, q);
>  
>  	if (q->flags & TCQ_F_NOLOCK) {
> +		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> +		    qdisc_run_begin(q)) {
> +			/* Retest nolock_qdisc_is_empty() within the protection
> +			 * of q->seqlock to ensure qdisc is indeed empty.
> +			 */
> +			if (unlikely(!nolock_qdisc_is_empty(q))) {

This is just for the DRAINING case right? 

MISSED can be set at any moment, testing MISSED seems confusing.

Is it really worth the extra code?

> +				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> +				__qdisc_run(q);
> +				qdisc_run_end(q);
> +
> +				goto no_lock_out;
> +			}
> +
> +			qdisc_bstats_cpu_update(q, skb);
> +			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
> +			    !nolock_qdisc_is_empty(q))
> +				__qdisc_run(q);
> +
> +			qdisc_run_end(q);
> +			return NET_XMIT_SUCCESS;
> +		}
> +
>  		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -		if (likely(!netif_xmit_frozen_or_stopped(txq)))
> -			qdisc_run(q);
> +		qdisc_run(q);
>  
> +no_lock_out:
>  		if (unlikely(to_free))
>  			kfree_skb_list(to_free);
>  		return rc;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index fc8b56b..83d7f5f 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -52,6 +52,8 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>  	 */
>  	if (!netif_xmit_frozen_or_stopped(txq))
>  		set_bit(__QDISC_STATE_MISSED, &q->state);
> +	else
> +		set_bit(__QDISC_STATE_DRAINING, &q->state);
>  }
>  
>  /* Main transmission queue. */
> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>  
>  		skb = next;
>  	}
> -	if (lock)
> +
> +	if (lock) {
>  		spin_unlock(lock);
> -	__netif_schedule(q);
> +		set_bit(__QDISC_STATE_MISSED, &q->state);
> +	} else {
> +		__netif_schedule(q);

Could we reorder qdisc_run_begin() with clear_bit(SCHED) 
in net_tx_action() and add SCHED to the NON_EMPTY mask?

> +	}
>  }
>  
>  static void try_bulk_dequeue_skb(struct Qdisc *q,
> @@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
>  	while (qdisc_restart(q, &packets)) {
>  		quota -= packets;
>  		if (quota <= 0) {
> -			__netif_schedule(q);
> +			if (q->flags & TCQ_F_NOLOCK)
> +				set_bit(__QDISC_STATE_MISSED, &q->state);
> +			else
> +				__netif_schedule(q);
> +
>  			break;
>  		}
>  	}
> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  	if (likely(skb)) {
>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>  	} else if (need_retry &&
> -		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> +		   READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {

Do we really need to retry based on DRAINING being set?
Or is it just a convenient way of coding things up?

>  		/* Delay clearing the STATE_MISSED here to reduce
>  		 * the overhead of the second spin_trylock() in
>  		 * qdisc_run_begin() and __netif_schedule() calling
>  		 * in qdisc_run_end().
>  		 */
>  		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
> +		clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
Yunsheng Lin May 29, 2021, 1:44 a.m. UTC | #2
On 2021/5/29 9:00, Jakub Kicinski wrote:
> On Fri, 28 May 2021 10:49:56 +0800 Yunsheng Lin wrote:
>> Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
>> flags set, but queue discipline bypass does not work for lockless
>> qdisc because the skb is always enqueued to the qdisc even when the
>> qdisc is empty, see __dev_xmit_skb().
>>
>> This patch calls sch_direct_xmit() to transmit the skb directly
>> to the driver for an empty lockless qdisc, which avoids the
>> enqueue and dequeue operations.
>>
>> qdisc->empty is not reliable for indicating an empty qdisc because
>> there is a time window between enqueuing and setting qdisc->empty.
>> So we use the MISSED state added in commit a90c57f2cedd ("net:
>> sched: fix packet stuck problem for lockless qdisc"), which
>> indicates lock contention, suggesting that it is better not to do
>> the qdisc bypass in order to avoid the packet out-of-order problem.
>>
>> In order to make the MISSED state reliable for indicating an empty
>> qdisc, we need to ensure that testing and clearing of the MISSED
>> state are done within the protection of qdisc->seqlock; only
>> setting the MISSED state can be done without that protection. A
>> MISSED state test is also added outside the protection of
>> qdisc->seqlock to avoid doing an unnecessary spin_trylock() in the
>> contention case.
>>
>> The cases below need special handling:
>> 1. The MISSED state is cleared before another round of dequeuing
>>    in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
>>    dequeue all skbs in one round and call __netif_schedule(), which
>>    might result in a non-empty qdisc without MISSED set. In order
>>    to avoid this, the MISSED state is set for a lockless qdisc and
>>    __netif_schedule() is called at the end of qdisc_run_end().
>>
>> 2. The MISSED state also needs to be set for a lockless qdisc,
>>    instead of calling __netif_schedule() directly, when requeuing
>>    an skb, for a similar reason.
>>
>> 3. For the netdev queue stopped case, the MISSED state needs to be
>>    cleared while the netdev queue is stopped, otherwise there may
>>    be unnecessary __netif_schedule() calls. So a new DRAINING state
>>    is added to indicate this case, which also indicates a non-empty
>>    qdisc.
>>
>> 4. There are already netif_xmit_frozen_or_stopped() checks in
>>    dequeue_skb() and sch_direct_xmit(), both within the protection
>>    of qdisc->seqlock, but the same check in __dev_xmit_skb() is
>>    done without that protection, which might make the empty
>>    indication of a lockless qdisc unreliable. So remove the check
>>    in __dev_xmit_skb(); the checks under the protection of
>>    qdisc->seqlock seem enough to avoid the CPU consumption problem
>>    for the netdev queue stopped case.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> 
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 3ed6bcc..177f240 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -37,8 +37,15 @@ enum qdisc_state_t {
>>  	__QDISC_STATE_SCHED,
>>  	__QDISC_STATE_DEACTIVATED,
>>  	__QDISC_STATE_MISSED,
>> +	__QDISC_STATE_DRAINING,
>>  };
>>  
>> +#define QDISC_STATE_MISSED	BIT(__QDISC_STATE_MISSED)
>> +#define QDISC_STATE_DRAINING	BIT(__QDISC_STATE_DRAINING)
>> +
>> +#define QDISC_STATE_NON_EMPTY	(QDISC_STATE_MISSED | \
>> +					QDISC_STATE_DRAINING)
>> +
>>  struct qdisc_size_table {
>>  	struct rcu_head		rcu;
>>  	struct list_head	list;
>> @@ -145,6 +152,11 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
>>  	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
>>  }
>>  
>> +static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
>> +{
>> +	return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
>> +}
>> +
>>  static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
>>  {
>>  	return q->flags & TCQ_F_CPUSTATS;
>> @@ -206,10 +218,8 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
>>  		spin_unlock(&qdisc->seqlock);
>>  
>>  		if (unlikely(test_bit(__QDISC_STATE_MISSED,
>> -				      &qdisc->state))) {
>> -			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
>> +				      &qdisc->state)))
> 
> Why remove the clear_bit()? Wasn't that added to avoid infinite
> re-schedules?

As we also clear MISSED for the stopped and deactivated cases, the
above clear_bit() is not strictly needed; it was added because of
the 0.02Mpps improvement in the 16-thread pktgen test case.

As we need to ensure clearing is within the protection of
q->seqlock, and the above clear_bit() is not within that
protection, remove the clear_bit() above.
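
So the lockless branch of qdisc_run_end() after this patch looks
like below (a sketch matching the hunk above; the comment is my
reasoning, not in the patch):

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		/* MISSED is deliberately left set here: it is only
		 * cleared under q->seqlock (in pfifo_fast_dequeue()
		 * or for the stopped/deactivated cases), so the
		 * reschedule below cannot repeat forever.
		 */
		if (unlikely(test_bit(__QDISC_STATE_MISSED,
				      &qdisc->state)))
			__netif_schedule(qdisc);
	} else {
		write_seqcount_end(&qdisc->running);
	}
}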

> 
>>  			__netif_schedule(qdisc);
>> -		}
>>  	} else {
>>  		write_seqcount_end(&qdisc->running);
>>  	}
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 50531a2..e4cc926 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>  	qdisc_calculate_pkt_len(skb, q);
>>  
>>  	if (q->flags & TCQ_F_NOLOCK) {
>> +		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>> +		    qdisc_run_begin(q)) {
>> +			/* Retest nolock_qdisc_is_empty() within the protection
>> +			 * of q->seqlock to ensure qdisc is indeed empty.
>> +			 */
>> +			if (unlikely(!nolock_qdisc_is_empty(q))) {
> 
> This is just for the DRAINING case right? 
> 
> MISSED can be set at any moment, testing MISSED seems confusing.

MISSED is only set when there is lock contention, which means it
is better not to do the qdisc bypass, to avoid the out-of-order
packet problem. Another good thing is that we can also do batch
dequeuing and transmitting of packets when there is lock contention.

> 
> Is it really worth the extra code?

Currently DRAINING is only set for the netdev queue stopped case.
We could use only DRAINING to indicate a non-empty qdisc, but then
we would need to set DRAINING everywhere MISSED is set; that is
why I use both DRAINING and MISSED to indicate a non-empty qdisc.

> 
>> +				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> +				__qdisc_run(q);
>> +				qdisc_run_end(q);
>> +
>> +				goto no_lock_out;
>> +			}
>> +
>> +			qdisc_bstats_cpu_update(q, skb);
>> +			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
>> +			    !nolock_qdisc_is_empty(q))
>> +				__qdisc_run(q);
>> +
>> +			qdisc_run_end(q);
>> +			return NET_XMIT_SUCCESS;
>> +		}
>> +
>>  		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> -		if (likely(!netif_xmit_frozen_or_stopped(txq)))
>> -			qdisc_run(q);
>> +		qdisc_run(q);
>>  
>> +no_lock_out:
>>  		if (unlikely(to_free))
>>  			kfree_skb_list(to_free);
>>  		return rc;
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index fc8b56b..83d7f5f 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -52,6 +52,8 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>  	 */
>>  	if (!netif_xmit_frozen_or_stopped(txq))
>>  		set_bit(__QDISC_STATE_MISSED, &q->state);
>> +	else
>> +		set_bit(__QDISC_STATE_DRAINING, &q->state);
>>  }
>>  
>>  /* Main transmission queue. */
>> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>>  
>>  		skb = next;
>>  	}
>> -	if (lock)
>> +
>> +	if (lock) {
>>  		spin_unlock(lock);
>> -	__netif_schedule(q);
>> +		set_bit(__QDISC_STATE_MISSED, &q->state);
>> +	} else {
>> +		__netif_schedule(q);
> 
> Could we reorder qdisc_run_begin() with clear_bit(SCHED) 
> in net_tx_action() and add SCHED to the NON_EMPTY mask?

Did you mean clearing SCHED after the q->seqlock is
taken?

The problem is that SCHED is also used to indicate
whether a qdisc is in sd->output_queue or not, and the
qdisc_run_begin() called by net_tx_action() cannot
guarantee it will take the q->seqlock (we are using
trylock for lockless qdisc).
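
For reference, the lockless branch of qdisc_run_begin() from commit
a90c57f2cedd works roughly like below (a simplified sketch, not the
exact upstream code):

static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
	if (qdisc->flags & TCQ_F_NOLOCK) {
		if (spin_trylock(&qdisc->seqlock))
			return true;

		/* Contended: ask the seqlock holder to do another
		 * round of dequeuing for us after it unlocks.
		 */
		set_bit(__QDISC_STATE_MISSED, &qdisc->state);

		/* Retry in case the holder released the seqlock
		 * before seeing the MISSED bit.
		 */
		return spin_trylock(&qdisc->seqlock);
	}
	/* ... locked-qdisc path elided ... */
}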

> 
>> +	}
>>  }
>>  
>>  static void try_bulk_dequeue_skb(struct Qdisc *q,
>> @@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
>>  	while (qdisc_restart(q, &packets)) {
>>  		quota -= packets;
>>  		if (quota <= 0) {
>> -			__netif_schedule(q);
>> +			if (q->flags & TCQ_F_NOLOCK)
>> +				set_bit(__QDISC_STATE_MISSED, &q->state);
>> +			else
>> +				__netif_schedule(q);
>> +
>>  			break;
>>  		}
>>  	}
>> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>  	if (likely(skb)) {
>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>>  	} else if (need_retry &&
>> -		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
>> +		   READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
> 
> Do we really need to retry based on DRAINING being set?
> Or is it just a convenient way of coding things up?

Yes, it is just a convenient way of coding things up.
Only MISSED needs retrying.

> 
>>  		/* Delay clearing the STATE_MISSED here to reduce
>>  		 * the overhead of the second spin_trylock() in
>>  		 * qdisc_run_begin() and __netif_schedule() calling
>>  		 * in qdisc_run_end().
>>  		 */
>>  		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
>> +		clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
> 
> 
> 
> .
>
Jakub Kicinski May 29, 2021, 4:32 a.m. UTC | #3
On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
> >> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >>  	qdisc_calculate_pkt_len(skb, q);
> >>  
> >>  	if (q->flags & TCQ_F_NOLOCK) {
> >> +		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> >> +		    qdisc_run_begin(q)) {
> >> +			/* Retest nolock_qdisc_is_empty() within the protection
> >> +			 * of q->seqlock to ensure qdisc is indeed empty.
> >> +			 */
> >> +			if (unlikely(!nolock_qdisc_is_empty(q))) {  
> > 
> > This is just for the DRAINING case right? 
> > 
> > MISSED can be set at any moment, testing MISSED seems confusing.  
> 
> MISSED is only set when there is lock contention, which means it
> is better not to do the qdisc bypass, to avoid the out-of-order
> packet problem.

Avoid as in make less likely? Nothing guarantees other thread is not
interrupted after ->enqueue and before qdisc_run_begin().

TBH I'm not sure what out-of-order situation you're referring to,
there is no ordering guarantee between separate threads trying to
transmit AFAIU.

IOW this check is not required for correctness, right?

> Another good thing is that we can also do batch
> dequeuing and transmitting of packets when there is lock contention.

No doubt, but did you see the flag get set significantly often here 
to warrant the double-checking?

> > Is it really worth the extra code?  
> 
> Currently DRAINING is only set for the netdev queue stopped case.
> We could use only DRAINING to indicate a non-empty qdisc, but then
> we would need to set DRAINING everywhere MISSED is set; that is
> why I use both DRAINING and MISSED to indicate a non-empty qdisc.
> 
> >   
> >> +				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> >> +				__qdisc_run(q);
> >> +				qdisc_run_end(q);
> >> +
> >> +				goto no_lock_out;
> >> +			}
> >> +
> >> +			qdisc_bstats_cpu_update(q, skb);
> >> +			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
> >> +			    !nolock_qdisc_is_empty(q))
> >> +				__qdisc_run(q);
> >> +
> >> +			qdisc_run_end(q);
> >> +			return NET_XMIT_SUCCESS;
> >> +		}
> >> +
> >>  		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> >> -		if (likely(!netif_xmit_frozen_or_stopped(txq)))
> >> -			qdisc_run(q);
> >> +		qdisc_run(q);
> >>  
> >> +no_lock_out:
> >>  		if (unlikely(to_free))
> >>  			kfree_skb_list(to_free);
> >>  		return rc;

> >> @@ -164,9 +166,13 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> >>  
> >>  		skb = next;
> >>  	}
> >> -	if (lock)
> >> +
> >> +	if (lock) {
> >>  		spin_unlock(lock);
> >> -	__netif_schedule(q);
> >> +		set_bit(__QDISC_STATE_MISSED, &q->state);
> >> +	} else {
> >> +		__netif_schedule(q);  
> > 
> > Could we reorder qdisc_run_begin() with clear_bit(SCHED) 
> > in net_tx_action() and add SCHED to the NON_EMPTY mask?  
> 
> Did you mean clearing the SCHED after the q->seqlock is
> taken?
> 
> The problem is that the SCHED is also used to indicate
> a qdisc is in sd->output_queue or not, and the
> qdisc_run_begin() called by net_tx_action() can not
> guarantee it will take the q->seqlock(we are using trylock
> for lockless qdisc)

Ah, right. We'd need to do some more flag juggling in net_tx_action()
to get it right.

> >> +	}
> >>  }
> >>  
> >>  static void try_bulk_dequeue_skb(struct Qdisc *q,
> >> @@ -409,7 +415,11 @@ void __qdisc_run(struct Qdisc *q)
> >>  	while (qdisc_restart(q, &packets)) {
> >>  		quota -= packets;
> >>  		if (quota <= 0) {
> >> -			__netif_schedule(q);
> >> +			if (q->flags & TCQ_F_NOLOCK)
> >> +				set_bit(__QDISC_STATE_MISSED, &q->state);
> >> +			else
> >> +				__netif_schedule(q);
> >> +
> >>  			break;
> >>  		}
> >>  	}
> >> @@ -680,13 +690,14 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> >>  	if (likely(skb)) {
> >>  		qdisc_update_stats_at_dequeue(qdisc, skb);
> >>  	} else if (need_retry &&
> >> -		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> >> +		   READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {  
> > 
> > Do we really need to retry based on DRAINING being set?
> > Or is it just a convenient way of coding things up?  
> 
> Yes, it is just a convenient way of coding things up.
> Only MISSED needs retrying.
Yunsheng Lin May 29, 2021, 7:03 a.m. UTC | #4
On 2021/5/29 12:32, Jakub Kicinski wrote:
> On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:
>>>> @@ -3852,10 +3852,32 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>>>  	qdisc_calculate_pkt_len(skb, q);
>>>>  
>>>>  	if (q->flags & TCQ_F_NOLOCK) {
>>>> +		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>>> +		    qdisc_run_begin(q)) {
>>>> +			/* Retest nolock_qdisc_is_empty() within the protection
>>>> +			 * of q->seqlock to ensure qdisc is indeed empty.
>>>> +			 */
>>>> +			if (unlikely(!nolock_qdisc_is_empty(q))) {  
>>>
>>> This is just for the DRAINING case right? 
>>>
>>> MISSED can be set at any moment, testing MISSED seems confusing.  
>>
>> MISSED is only set when there is lock contention, which means it
>> is better not to do the qdisc bypass, to avoid the out-of-order
>> packet problem.
> 
> Avoid as in make less likely? Nothing guarantees other thread is not
> interrupted after ->enqueue and before qdisc_run_begin().
> 
> TBH I'm not sure what out-of-order situation you're referring to,
> there is no ordering guarantee between separate threads trying to
> transmit AFAIU.
A thread needs to do the bypass checking before enqueuing, which
means either MISSED is set or the trylock fails for the bypass
transmitting (which sets MISSED after the first trylock), so MISSED
will always be set before a thread does an enqueue. And we ensure
MISSED is only cleared under the protection of q->seqlock; after
clearing MISSED, we do another round of dequeuing within the
protection of q->seqlock.

So if a thread has taken the q->seqlock and MISSED is not set yet,
it is allowed to send the packet directly without going through the
qdisc enqueue and dequeue process.
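
To restate that ordering as code, a sketch of the __dev_xmit_skb()
path in this patch, with the reasoning above as comments:

	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
	    qdisc_run_begin(q)) {
		/* bypass path: q->seqlock is held, and MISSED and
		 * DRAINING were clear when we took it
		 */
		...
	}

	/* We only get here if the qdisc looked non-empty or
	 * qdisc_run_begin() failed, and a failing first trylock in
	 * qdisc_run_begin() sets MISSED before the second trylock,
	 * so MISSED is always set before this ->enqueue().
	 */
	rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;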


> 
> IOW this check is not required for correctness, right?

If a thread has taken the q->seqlock and MISSED is not set, it means
the other thread has not set MISSED after the first trylock and
before the second trylock, which means the enqueue is not done yet.
So I assume this check is required for correctness, if I understand
your question correctly.

> 
>> Another good thing is that we can also do batch
>> dequeuing and transmitting of packets when there is lock contention.
> 
> No doubt, but did you see the flag get set significantly often here 
> to warrant the double-checking?

No, that is just my guess:)

> 
>>> Is it really worth the extra code?  
>>
>> Currently DRAINING is only set for the netdev queue stopped case.
>> We could use only DRAINING to indicate a non-empty qdisc, but then
>> we would need to set DRAINING everywhere MISSED is set; that is
>> why I use both DRAINING and MISSED to indicate a non-empty qdisc.
Jakub Kicinski May 29, 2021, 6:49 p.m. UTC | #5
On Sat, 29 May 2021 15:03:09 +0800 Yunsheng Lin wrote:
> On 2021/5/29 12:32, Jakub Kicinski wrote:
> > On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:  
> >> MISSED is only set when there is lock contention, which means it
> >> is better not to do the qdisc bypass, to avoid the out-of-order
> >> packet problem.
> > 
> > Avoid as in make less likely? Nothing guarantees other thread is not
> > interrupted after ->enqueue and before qdisc_run_begin().
> > 
> > TBH I'm not sure what out-of-order situation you're referring to,
> > there is no ordering guarantee between separate threads trying to
> > transmit AFAIU.  
> A thread needs to do the bypass checking before enqueuing, which
> means either MISSED is set or the trylock fails for the bypass
> transmitting (which sets MISSED after the first trylock), so MISSED
> will always be set before a thread does an enqueue. And we ensure
> MISSED is only cleared under the protection of q->seqlock; after
> clearing MISSED, we do another round of dequeuing within the
> protection of q->seqlock.

The fact that MISSED is only cleared under q->seqlock does not matter,
because setting it and ->enqueue() are not under any lock. If the thread
gets interrupted between:

	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
	    qdisc_run_begin(q)) {

and ->enqueue() we can't guarantee that something else won't come in,
take q->seqlock and clear MISSED.

thread1                thread2             thread3
# holds seqlock
                       qdisc_run_begin(q)
                       set(MISSED)
pfifo_fast_dequeue
  clear(MISSED)
  # recheck the queue
qdisc_run_end()
                       ->enqueue()
                                            q->flags & TCQ_F_CAN_BYPASS..
                                            qdisc_run_begin() # true
                                            sch_direct_xmit()
                       qdisc_run_begin()
                       set(MISSED)

Or am I missing something?

Re-checking nolock_qdisc_is_empty() may or may not help.
But it doesn't really matter because there is no ordering
requirement between thread2 and thread3 here.

> So if a thread has taken the q->seqlock and MISSED is not set yet,
> it is allowed to send the packet directly without going through the
> qdisc enqueue and dequeue process.
> 
> > IOW this check is not required for correctness, right?  
> 
> If a thread has taken the q->seqlock and MISSED is not set, it means
> the other thread has not set MISSED after the first trylock and
> before the second trylock, which means the enqueue is not done yet.
> So I assume this check is required for correctness, if I understand
> your question correctly.
>
> >> Another good thing is that we can also do batch
> >> dequeuing and transmitting of packets when there is lock contention.
> > 
> > No doubt, but did you see the flag get set significantly often here 
> > to warrant the double-checking?  
> 
> No, that is just my guess:)
Yunsheng Lin May 30, 2021, 1:37 a.m. UTC | #6
On 2021/5/30 2:49, Jakub Kicinski wrote:
> On Sat, 29 May 2021 15:03:09 +0800 Yunsheng Lin wrote:
>> On 2021/5/29 12:32, Jakub Kicinski wrote:
>>> On Sat, 29 May 2021 09:44:57 +0800 Yunsheng Lin wrote:  
>>>> MISSED is only set when there is lock contention, which means it
>>>> is better not to do the qdisc bypass, to avoid the out-of-order
>>>> packet problem.
>>>
>>> Avoid as in make less likely? Nothing guarantees other thread is not
>>> interrupted after ->enqueue and before qdisc_run_begin().
>>>
>>> TBH I'm not sure what out-of-order situation you're referring to,
>>> there is no ordering guarantee between separate threads trying to
>>> transmit AFAIU.  
>> A thread needs to do the bypass checking before enqueuing, which
>> means either MISSED is set or the trylock fails for the bypass
>> transmitting (which sets MISSED after the first trylock), so MISSED
>> will always be set before a thread does an enqueue. And we ensure
>> MISSED is only cleared under the protection of q->seqlock; after
>> clearing MISSED, we do another round of dequeuing within the
>> protection of q->seqlock.
> 
> The fact that MISSED is only cleared under q->seqlock does not matter,
> because setting it and ->enqueue() are not under any lock. If the thread
> gets interrupted between:
> 
> 	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> 	    qdisc_run_begin(q)) {
> 
> and ->enqueue() we can't guarantee that something else won't come in,
> take q->seqlock and clear MISSED.
> 
> thread1                thread2             thread3
> # holds seqlock
>                        qdisc_run_begin(q)
>                        set(MISSED)
> pfifo_fast_dequeue
>   clear(MISSED)
>   # recheck the queue
> qdisc_run_end()
>                        ->enqueue()
>                                             q->flags & TCQ_F_CAN_BYPASS..
>                                             qdisc_run_begin() # true
>                                             sch_direct_xmit()
>                        qdisc_run_begin()
>                        set(MISSED)
> 
> Or am I missing something?
> 
> Re-checking nolock_qdisc_is_empty() may or may not help.
> But it doesn't really matter because there is no ordering
> requirement between thread2 and thread3 here.

I was more focused on explaining that using MISSED is as reliable
as the sch_may_need_requeuing() checking in RFCv3 [1] for indicating
an empty qdisc, and forgot to mention the data race described in
RFCv3, which is kind of like the one described above:

"There is a data race as below:

      CPU1                                   CPU2
qdisc_run_begin(q)                            .
        .                                q->enqueue()
sch_may_need_requeuing()                      .
    return true                               .
        .                                     .
        .                                     .
    q->enqueue()                              .

When the above happens, the skb enqueued by CPU1 is dequeued after
the skb enqueued by CPU2 because sch_may_need_requeuing() returned
true. Without qdisc bypass, CPU1 would have had a better chance to
queue its skb before CPU2's.

This patch does not take care of the above data race, because I
view it as similar to the following:

Even if CPU1 and CPU2 write skbs at the same time to two sockets
which are both heading to the same qdisc, there is no guarantee
which skb will hit the qdisc first, because there are a lot of
factors like interrupts/softirqs/cache misses/scheduling affecting
that."

Does the above make sense? Or any idea to avoid it?

1. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/

> 
>> So if a thread has taken the q->seqlock and MISSED is not set yet,
>> it is allowed to send the packet directly without going through the
>> qdisc enqueue and dequeue process.
>>
>>> IOW this check is not required for correctness, right?  
>>
>> If a thread has taken the q->seqlock and MISSED is not set, it means
>> the other thread has not set MISSED after the first trylock and
>> before the second trylock, which means the enqueue is not done yet.
>> So I assume this check is required for correctness, if I understand
>> your question correctly.
>>
>>>> Another good thing is that we can also do batch
>>>> dequeuing and transmitting of packets when there is lock contention.
>>>
>>> No doubt, but did you see the flag get set significantly often here 
>>> to warrant the double-checking?  
>>
>> No, that is just my guess:)
> 
>
Jakub Kicinski May 30, 2021, 8:21 p.m. UTC | #7
On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
> On 2021/5/30 2:49, Jakub Kicinski wrote:
> > The fact that MISSED is only cleared under q->seqlock does not matter,
> > because setting it and ->enqueue() are not under any lock. If the thread
> > gets interrupted between:
> > 
> > 	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
> > 	    qdisc_run_begin(q)) {
> > 
> > and ->enqueue() we can't guarantee that something else won't come in,
> > take q->seqlock and clear MISSED.
> > 
> > thread1                thread2             thread3
> > # holds seqlock
> >                        qdisc_run_begin(q)
> >                        set(MISSED)
> > pfifo_fast_dequeue
> >   clear(MISSED)
> >   # recheck the queue
> > qdisc_run_end()  
> >                        ->enqueue()  
> >                                             q->flags & TCQ_F_CAN_BYPASS..
> >                                             qdisc_run_begin() # true
> >                                             sch_direct_xmit()
> >                        qdisc_run_begin()
> >                        set(MISSED)
> > 
> > Or am I missing something?
> > 
> > Re-checking nolock_qdisc_is_empty() may or may not help.
> > But it doesn't really matter because there is no ordering
> > requirement between thread2 and thread3 here.  
> 
> I was more focused on explaining that using MISSED is as reliable
> as the sch_may_need_requeuing() checking in RFCv3 [1] for indicating
> an empty qdisc, and forgot to mention the data race described in
> RFCv3, which is kind of like the one described above:
> 
> "There is a data race as below:
> 
>       CPU1                                   CPU2
> qdisc_run_begin(q)                            .
>         .                                q->enqueue()
> sch_may_need_requeuing()                      .
>     return true                               .
>         .                                     .
>         .                                     .
>     q->enqueue()                              .
> 
> When the above happens, the skb enqueued by CPU1 is dequeued after
> the skb enqueued by CPU2 because sch_may_need_requeuing() returned
> true. Without qdisc bypass, CPU1 would have had a better chance to
> queue its skb before CPU2's.
> 
> This patch does not take care of the above data race, because I
> view it as similar to the following:
> 
> Even if CPU1 and CPU2 write skbs at the same time to two sockets
> which are both heading to the same qdisc, there is no guarantee
> which skb will hit the qdisc first, because there are a lot of
> factors like interrupts/softirqs/cache misses/scheduling affecting
> that."
> 
> Does the above make sense? Or any idea to avoid it?
> 
> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/

We agree on this one.

Could you draw a sequence diagram of different CPUs (like the one
above) for the case where removing re-checking nolock_qdisc_is_empty()
under q->seqlock leads to incorrect behavior? 

If there is no such case would you be willing to repeat the benchmark
with and without this test?

Sorry for dragging the review out..
Yunsheng Lin May 31, 2021, 12:40 a.m. UTC | #8
On 2021/5/31 4:21, Jakub Kicinski wrote:
> On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
>> On 2021/5/30 2:49, Jakub Kicinski wrote:
>>> The fact that MISSED is only cleared under q->seqlock does not matter,
>>> because setting it and ->enqueue() are not under any lock. If the thread
>>> gets interrupted between:
>>>
>>> 	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>> 	    qdisc_run_begin(q)) {
>>>
>>> and ->enqueue() we can't guarantee that something else won't come in,
>>> take q->seqlock and clear MISSED.
>>>
>>> thread1                thread2             thread3
>>> # holds seqlock
>>>                        qdisc_run_begin(q)
>>>                        set(MISSED)
>>> pfifo_fast_dequeue
>>>   clear(MISSED)
>>>   # recheck the queue
>>> qdisc_run_end()  
>>>                        ->enqueue()  
>>>                                             q->flags & TCQ_F_CAN_BYPASS..
>>>                                             qdisc_run_begin() # true
>>>                                             sch_direct_xmit()
>>>                        qdisc_run_begin()
>>>                        set(MISSED)
>>>
>>> Or am I missing something?
>>>
>>> Re-checking nolock_qdisc_is_empty() may or may not help.
>>> But it doesn't really matter because there is no ordering
>>> requirement between thread2 and thread3 here.  
>>
>> I was more focused on explaining that using MISSED is as reliable
>> as the sch_may_need_requeuing() checking in RFCv3 [1] for indicating
>> an empty qdisc, and forgot to mention the data race described in
>> RFCv3, which is kind of like the one described above:
>>
>> "There is a data race as below:
>>
>>       CPU1                                   CPU2
>> qdisc_run_begin(q)                            .
>>         .                                q->enqueue()
>> sch_may_need_requeuing()                      .
>>     return true                               .
>>         .                                     .
>>         .                                     .
>>     q->enqueue()                              .
>>
>> When the above happens, the skb enqueued by CPU1 is dequeued after
>> the skb enqueued by CPU2 because sch_may_need_requeuing() returned
>> true. Without qdisc bypass, CPU1 would have had a better chance to
>> queue its skb before CPU2's.
>>
>> This patch does not take care of the above data race, because I
>> view it as similar to the following:
>>
>> Even if CPU1 and CPU2 write skbs at the same time to two sockets
>> which are both heading to the same qdisc, there is no guarantee
>> which skb will hit the qdisc first, because there are a lot of
>> factors like interrupts/softirqs/cache misses/scheduling affecting
>> that."
>>
>> Does the above make sense? Or any idea to avoid it?
>>
>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/
> 
> We agree on this one.
> 
> Could you draw a sequence diagram of different CPUs (like the one
> above) for the case where removing re-checking nolock_qdisc_is_empty()
> under q->seqlock leads to incorrect behavior? 

When nolock_qdisc_is_empty() is not re-checked under q->seqlock, we
may have:


        CPU1                                   CPU2
  qdisc_run_begin(q)                            .
          .                                enqueue skb1
dequeue skb1 and clear MISSED                   .
          .                        nolock_qdisc_is_empty() return true
    requeue skb                                 .
   q->enqueue()                                 .
    set MISSED                                  .
        .                                       .
 qdisc_run_end(q)                               .
        .                              qdisc_run_begin(q)
        .                             transmit skb2 directly
        .                           transmit the requeued skb1

The problem here is that skb1 and skb2 are from the same CPU, which
means they are likely from the same flow, so we need to avoid this,
right?

> 
> If there is no such case would you be willing to repeat the benchmark
> with and without this test?
> 
> Sorry for dragging the review out..
> 
> .
>
Yunsheng Lin May 31, 2021, 1:10 a.m. UTC | #9
On 2021/5/31 8:40, Yunsheng Lin wrote:
> On 2021/5/31 4:21, Jakub Kicinski wrote:
>> On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
>>> On 2021/5/30 2:49, Jakub Kicinski wrote:
>>>> The fact that MISSED is only cleared under q->seqlock does not matter,
>>>> because setting it and ->enqueue() are not under any lock. If the thread
>>>> gets interrupted between:
>>>>
>>>> 	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>>> 	    qdisc_run_begin(q)) {
>>>>
>>>> and ->enqueue() we can't guarantee that something else won't come in,
>>>> take q->seqlock and clear MISSED.
>>>>
>>>> thread1                thread2             thread3
>>>> # holds seqlock
>>>>                        qdisc_run_begin(q)
>>>>                        set(MISSED)
>>>> pfifo_fast_dequeue
>>>>   clear(MISSED)
>>>>   # recheck the queue
>>>> qdisc_run_end()  
>>>>                        ->enqueue()  
>>>>                                             q->flags & TCQ_F_CAN_BYPASS..
>>>>                                             qdisc_run_begin() # true
>>>>                                             sch_direct_xmit()
>>>>                        qdisc_run_begin()
>>>>                        set(MISSED)
>>>>
>>>> Or am I missing something?
>>>>
>>>> Re-checking nolock_qdisc_is_empty() may or may not help.
>>>> But it doesn't really matter because there is no ordering
>>>> requirement between thread2 and thread3 here.  
>>>
>>> I was more focused on explaining that using MISSED is as reliable
>>> as the sch_may_need_requeuing() checking in RFCv3 [1] for indicating
>>> an empty qdisc, and forgot to mention the data race described in
>>> RFCv3, which is kind of like the one described above:
>>>
>>> "There is a data race as below:
>>>
>>>       CPU1                                   CPU2
>>> qdisc_run_begin(q)                            .
>>>         .                                q->enqueue()
>>> sch_may_need_requeuing()                      .
>>>     return true                               .
>>>         .                                     .
>>>         .                                     .
>>>     q->enqueue()                              .
>>>
>>> When the above happens, the skb enqueued by CPU1 is dequeued after
>>> the skb enqueued by CPU2 because sch_may_need_requeuing() returned
>>> true. Without qdisc bypass, CPU1 would have had a better chance to
>>> queue its skb before CPU2's.
>>>
>>> This patch does not take care of the above data race, because I
>>> view it as similar to the following:
>>>
>>> Even if CPU1 and CPU2 write skbs at the same time to two sockets
>>> which are both heading to the same qdisc, there is no guarantee
>>> which skb will hit the qdisc first, because there are a lot of
>>> factors like interrupts/softirqs/cache misses/scheduling affecting
>>> that."
>>>
>>> Does the above make sense? Or any idea to avoid it?
>>>
>>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/
>>
>> We agree on this one.
>>
>> Could you draw a sequence diagram of different CPUs (like the one
>> above) for the case where removing re-checking nolock_qdisc_is_empty()
>> under q->seqlock leads to incorrect behavior? 
> 
> When nolock_qdisc_is_empty() is not re-checked under q->seqlock, we
> may have:
> 
> 
>         CPU1                                   CPU2
>   qdisc_run_begin(q)                            .
>           .                                enqueue skb1
> dequeue skb1 and clear MISSED                   .
>           .                        nolock_qdisc_is_empty() return true
>     requeue skb                                 .
>    q->enqueue()                                 .
>     set MISSED                                  .
>         .                                       .
>  qdisc_run_end(q)                               .
>         .                              qdisc_run_begin(q)
>         .                             transmit skb2 directly
>         .                           transmit the requeued skb1
> 
> The problem here is that skb1 and skb2 are from the same CPU, which
> means they are likely from the same flow, so we need to avoid this,
> right?


         CPU1                                   CPU2
   qdisc_run_begin(q)                            .
           .                                enqueue skb1
     dequeue skb1                                .
           .                                     .
netdevice stopped and MISSED is clear            .
           .                        nolock_qdisc_is_empty() return true
     requeue skb                                 .
           .                                     .
           .                                     .
           .                                     .
  qdisc_run_end(q)                               .
           .                              qdisc_run_begin(q)
           .                             transmit skb2 directly
           .                           transmit the requeued skb1

The above sequence diagram seems more correct; it is basically about
how to avoid transmitting a packet directly, bypassing the requeued
packet.

> 
>>
>> If there is no such case would you be willing to repeat the benchmark
>> with and without this test?
>>
>> Sorry for dragging the review out..
>>
>> .
>>
Yunsheng Lin May 31, 2021, 12:40 p.m. UTC | #10
On 2021/5/31 9:10, Yunsheng Lin wrote:
> On 2021/5/31 8:40, Yunsheng Lin wrote:
>> On 2021/5/31 4:21, Jakub Kicinski wrote:
>>> On Sun, 30 May 2021 09:37:09 +0800 Yunsheng Lin wrote:
>>>> On 2021/5/30 2:49, Jakub Kicinski wrote:
>>>>> The fact that MISSED is only cleared under q->seqlock does not matter,
>>>>> because setting it and ->enqueue() are not under any lock. If the thread
>>>>> gets interrupted between:
>>>>>
>>>>> 	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>>>> 	    qdisc_run_begin(q)) {
>>>>>
>>>>> and ->enqueue() we can't guarantee that something else won't come in,
>>>>> take q->seqlock and clear MISSED.
>>>>>
>>>>> thread1                thread2             thread3
>>>>> # holds seqlock
>>>>>                        qdisc_run_begin(q)
>>>>>                        set(MISSED)
>>>>> pfifo_fast_dequeue
>>>>>   clear(MISSED)
>>>>>   # recheck the queue
>>>>> qdisc_run_end()  
>>>>>                        ->enqueue()  
>>>>>                                             q->flags & TCQ_F_CAN_BYPASS..
>>>>>                                             qdisc_run_begin() # true
>>>>>                                             sch_direct_xmit()
>>>>>                        qdisc_run_begin()
>>>>>                        set(MISSED)
>>>>>
>>>>> Or am I missing something?
>>>>>
>>>>> Re-checking nolock_qdisc_is_empty() may or may not help.
>>>>> But it doesn't really matter because there is no ordering
>>>>> requirement between thread2 and thread3 here.  
>>>>
>>>> I was more focused on explaining that using MISSED is as reliable
>>>> as the sch_may_need_requeuing() checking in RFCv3 [1] for indicating
>>>> an empty qdisc, and forgot to mention the data race described in
>>>> RFCv3, which is kind of like the one described above:
>>>>
>>>> "There is a data race as below:
>>>>
>>>>       CPU1                                   CPU2
>>>> qdisc_run_begin(q)                            .
>>>>         .                                q->enqueue()
>>>> sch_may_need_requeuing()                      .
>>>>     return true                               .
>>>>         .                                     .
>>>>         .                                     .
>>>>     q->enqueue()                              .
>>>>
>>>> When the above happens, the skb enqueued by CPU1 is dequeued after
>>>> the skb enqueued by CPU2 because sch_may_need_requeuing() returned
>>>> true. Without qdisc bypass, CPU1 would have had a better chance to
>>>> queue its skb before CPU2's.
>>>>
>>>> This patch does not take care of the above data race, because I
>>>> view it as similar to the following:
>>>>
>>>> Even if CPU1 and CPU2 write skbs at the same time to two sockets
>>>> which are both heading to the same qdisc, there is no guarantee
>>>> which skb will hit the qdisc first, because there are a lot of
>>>> factors like interrupts/softirqs/cache misses/scheduling affecting
>>>> that."
>>>>
>>>> Does the above make sense? Or any idea to avoid it?
>>>>
>>>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsheng@huawei.com/
>>>
>>> We agree on this one.
>>>
>>> Could you draw a sequence diagram of different CPUs (like the one
>>> above) for the case where removing re-checking nolock_qdisc_is_empty()
>>> under q->seqlock leads to incorrect behavior? 
>>
>> When nolock_qdisc_is_empty() is not re-checked under q->seqlock, we
>> may have:
>>
>>
>>         CPU1                                   CPU2
>>   qdisc_run_begin(q)                            .
>>           .                                enqueue skb1
>> dequeue skb1 and clear MISSED                   .
>>           .                        nolock_qdisc_is_empty() return true
>>     requeue skb                                 .
>>    q->enqueue()                                 .
>>     set MISSED                                  .
>>         .                                       .
>>  qdisc_run_end(q)                               .
>>         .                              qdisc_run_begin(q)
>>         .                             transmit skb2 directly
>>         .                           transmit the requeued skb1
>>
>> The problem here is that skb1 and skb2 are from the same CPU, which
>> means they are likely from the same flow, so we need to avoid this,
>> right?
> 
> 
>          CPU1                                   CPU2
>    qdisc_run_begin(q)                            .
>            .                                enqueue skb1
>      dequeue skb1                                .
>            .                                     .
> netdevice stopped and MISSED is clear            .
>            .                        nolock_qdisc_is_empty() return true
>      requeue skb                                 .
>            .                                     .
>            .                                     .
>            .                                     .
>   qdisc_run_end(q)                               .
>            .                              qdisc_run_begin(q)
>            .                             transmit skb2 directly
>            .                           transmit the requeued skb1
> 
> The above sequence diagram seems more correct; it is basically about
> how to avoid transmitting a packet directly, bypassing the requeued
> packet.
> 
>>
>>>
>>> If there is no such case would you be willing to repeat the benchmark
>>> with and without this test?

I did some interesting testing to show how adjusting a small amount
of code can have a noticeable performance impact.

1. I used the below patch to remove the nolock_qdisc_is_empty() test
   under q->seqlock.

@@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
        if (q->flags & TCQ_F_NOLOCK) {
                if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
                    qdisc_run_begin(q)) {
-                       /* Retest nolock_qdisc_is_empty() within the protection
-                        * of q->seqlock to ensure qdisc is indeed empty.
-                        */
-                       if (unlikely(!nolock_qdisc_is_empty(q))) {
-                               rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-                               __qdisc_run(q);
-                               qdisc_run_end(q);
-
-                               goto no_lock_out;
-                       }
-
                        qdisc_bstats_cpu_update(q, skb);
                        if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
                            !nolock_qdisc_is_empty(q))
@@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
                qdisc_run(q);

-no_lock_out:
                if (unlikely(to_free))
                        kfree_skb_list(to_free);
                return rc;

which has the below performance improvement:

 threads      v1             v1 + above patch          delta
    1       3.21Mpps            3.20Mpps               -0.3%
    2       5.56Mpps            5.94Mpps               +4.9%
    4       5.58Mpps            5.60Mpps               +0.3%
    8       2.76Mpps            2.77Mpps               +0.3%
   16       2.23Mpps            2.23Mpps               +0.0%

v1 = this patchset.


2. After the above testing, it seems worthwhile to remove the
   nolock_qdisc_is_empty() test under q->seqlock, so I used the below
   patch to make sure nolock_qdisc_is_empty() always returns false
   for the netdev queue stopped case.

--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
 static void qdisc_maybe_clear_missed(struct Qdisc *q,
                                     const struct netdev_queue *txq)
 {
+       set_bit(__QDISC_STATE_DRAINING, &q->state);
+
+       /* Make sure DRAINING is set before clearing MISSED
+        * so that nolock_qdisc_is_empty() always returns
+        * false, avoiding transmitting a packet directly
+        * bypassing the requeued packet.
+        */
+       smp_mb__after_atomic();
+
        clear_bit(__QDISC_STATE_MISSED, &q->state);

        /* Make sure the below netif_xmit_frozen_or_stopped()
@@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
         */
        if (!netif_xmit_frozen_or_stopped(txq))
                set_bit(__QDISC_STATE_MISSED, &q->state);
-       else
-               set_bit(__QDISC_STATE_DRAINING, &q->state);
 }
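
For clarity, the ordering this second patch relies on can be sketched
as below; this is a simplified model of the patched
qdisc_maybe_clear_missed() and of nolock_qdisc_is_empty() from this
patchset, not a literal copy (the function names here are made up):

	/* writer side, as in the patched qdisc_maybe_clear_missed() */
	static void mark_draining_then_clear_missed(struct Qdisc *q)
	{
		set_bit(__QDISC_STATE_DRAINING, &q->state);	/* A */

		/* order A before B so that a reader doing a single
		 * READ_ONCE() of q->state never observes both bits
		 * clear while the requeued packet is still pending
		 */
		smp_mb__after_atomic();

		clear_bit(__QDISC_STATE_MISSED, &q->state);	/* B */
	}

	/* reader side, mirroring nolock_qdisc_is_empty() */
	static bool qdisc_looks_empty(const struct Qdisc *q)
	{
		/* if MISSED (B) is seen cleared, DRAINING (A) is
		 * already visible, so the qdisc still looks non-empty
		 */
		return !(READ_ONCE(q->state) & QDISC_STATE_NON_EMPTY);
	}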

With the above two patches applied, the performance data is as below:

 threads      v1          v1 + above two patch          delta
    1       3.21Mpps            3.20Mpps               -0.3%
    2       5.56Mpps            5.94Mpps               +4.9%
    4       5.58Mpps            5.02Mpps                -10%
    8       2.76Mpps            2.77Mpps               +0.3%
   16       2.23Mpps            2.23Mpps               +0.0%

So the adjustment in qdisc_maybe_clear_missed() seems to have
caused about 10% performance degradation for the 4-thread case.

And the CPU top-down perf data suggested that icache misses and
bad speculation are the main factors behind the performance difference.

I tried to control the above factors by removing the inline marking
of functions in sch_generic.c and adding likely/unlikely annotations
to the netif_xmit_frozen_or_stopped() checks in sch_generic.c.

After doing that, we got a noticeable performance improvement for the
1- and 2-thread cases (and some performance improvement for the ip
forwarding test too), but not for the 4-thread case.
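
For illustration, the annotation meant above looks something like the
below; the exact call sites changed are an assumption, but the pattern
matches the netif_xmit_frozen_or_stopped() check in
qdisc_maybe_clear_missed():

	/* treat the frozen/stopped txq as the unlikely case */
	if (likely(!netif_xmit_frozen_or_stopped(txq)))
		set_bit(__QDISC_STATE_MISSED, &q->state);
	else
		set_bit(__QDISC_STATE_DRAINING, &q->state);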

So it seems we need to accept the performance degradation for the
4-thread case? Or any ideas?

>>>
>>> Sorry for dragging the review out..
>>>
>>> .
>>>
Jakub Kicinski June 1, 2021, 4:51 a.m. UTC | #11
On Mon, 31 May 2021 20:40:01 +0800 Yunsheng Lin wrote:
> On 2021/5/31 9:10, Yunsheng Lin wrote:
> > On 2021/5/31 8:40, Yunsheng Lin wrote:  
> >> On 2021/5/31 4:21, Jakub Kicinski wrote:  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> >>
> >> When nolock_qdisc_is_empty() is not re-checking under q->seqlock, we
> >> may have:
> >>
> >>
> >>         CPU1                                   CPU2
> >>   qdisc_run_begin(q)                            .
> >>           .                                enqueue skb1
> >> deuqueue skb1 and clear MISSED                  .
> >>           .                        nolock_qdisc_is_empty() return true
> >>     requeue skb                                 .
> >>    q->enqueue()                                 .
> >>     set MISSED                                  .
> >>         .                                       .
> >>  qdisc_run_end(q)                               .
> >>         .                              qdisc_run_begin(q)
> >>         .                             transmit skb2 directly
> >>         .                           transmit the requeued skb1
> >>
> >> The problem here is that skb1 and skb2  are from the same CPU, which
> >> means they are likely from the same flow, so we need to avoid this,
> >> right?  
> > 
> > 
> >          CPU1                                   CPU2
> >    qdisc_run_begin(q)                            .
> >            .                                enqueue skb1
> >      dequeue skb1                                .
> >            .                                     .
> > netdevice stopped and MISSED is clear            .
> >            .                        nolock_qdisc_is_empty() return true
> >      requeue skb                                 .
> >            .                                     .
> >            .                                     .
> >            .                                     .
> >   qdisc_run_end(q)                               .
> >            .                              qdisc_run_begin(q)
> >            .                             transmit skb2 directly
> >            .                           transmit the requeued skb1
> > 
> > The above sequence diagram seems more correct, it is basically about how to
> > avoid transmitting a packet directly bypassing the requeued packet.

I see, thanks! That explains the need. Perhaps we can rephrase the
comment? Maybe:

+			/* Retest nolock_qdisc_is_empty() within the protection
+			 * of q->seqlock to protect from racing with requeuing.
+			 */

> I did some interesting testing to show how adjusting a small amount
> of code can cause a noticeable performance degradation.
> 
> 1. I used the below patch to remove the nolock_qdisc_is_empty() testing
>    under q->seqlock.
> 
> @@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>         if (q->flags & TCQ_F_NOLOCK) {
>                 if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>                     qdisc_run_begin(q)) {
> -                       /* Retest nolock_qdisc_is_empty() within the protection
> -                        * of q->seqlock to ensure qdisc is indeed empty.
> -                        */
> -                       if (unlikely(!nolock_qdisc_is_empty(q))) {
> -                               rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -                               __qdisc_run(q);
> -                               qdisc_run_end(q);
> -
> -                               goto no_lock_out;
> -                       }
> -
>                         qdisc_bstats_cpu_update(q, skb);
>                         if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
>                             !nolock_qdisc_is_empty(q))
> @@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>                 rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>                 qdisc_run(q);
> 
> -no_lock_out:
>                 if (unlikely(to_free))
>                         kfree_skb_list(to_free);
>                 return rc;
> 
> which has the below performance improvement:
> 
>  threads      v1             v1 + above patch          delta
>     1       3.21Mpps            3.20Mpps               -0.3%
>     2       5.56Mpps            5.94Mpps               +4.9%
>     4       5.58Mpps            5.60Mpps               +0.3%
>     8       2.76Mpps            2.77Mpps               +0.3%
>    16       2.23Mpps            2.23Mpps               +0.0%
> 
> v1 = this patchset.
> 
> 
> 2. After the above testing, it seems worthwhile to remove the
>    nolock_qdisc_is_empty() testing under q->seqlock, so I used the below
>    patch to make sure nolock_qdisc_is_empty() always returns false for
>    the netdev-queue-stopped case.
> 
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
>  static void qdisc_maybe_clear_missed(struct Qdisc *q,
>                                      const struct netdev_queue *txq)
>  {
> +       set_bit(__QDISC_STATE_DRAINING, &q->state);
> +
> +       /* Make sure DRAINING is set before clearing MISSED
> +        * so that nolock_qdisc_is_empty() always returns
> +        * false, avoiding transmitting a packet directly
> +        * bypassing the requeued packet.
> +        */
> +       smp_mb__after_atomic();
> +
>         clear_bit(__QDISC_STATE_MISSED, &q->state);
> 
>         /* Make sure the below netif_xmit_frozen_or_stopped()
> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>          */
>         if (!netif_xmit_frozen_or_stopped(txq))
>                 set_bit(__QDISC_STATE_MISSED, &q->state);
> -       else
> -               set_bit(__QDISC_STATE_DRAINING, &q->state);
>  }

But this would not be enough because we may also clear MISSED
in pfifo_fast_dequeue()?

> which has the below performance data:
> 
>  threads      v1          v1 + above two patch          delta
>     1       3.21Mpps            3.20Mpps               -0.3%
>     2       5.56Mpps            5.94Mpps               +4.9%
>     4       5.58Mpps            5.02Mpps                -10%
>     8       2.76Mpps            2.77Mpps               +0.3%
>    16       2.23Mpps            2.23Mpps               +0.0%
> 
> So the adjustment in qdisc_maybe_clear_missed() seems to have
> caused about 10% performance degradation for the 4-thread case.
> 
> And the CPU top-down perf data suggested that icache misses and
> bad speculation are the main factors behind the performance difference.
> 
> I tried to control the above factors by removing the inline marking
> of functions in sch_generic.c and adding likely/unlikely annotations
> to the netif_xmit_frozen_or_stopped() checks in sch_generic.c.
> 
> After doing that, we got a noticeable performance improvement for the
> 1- and 2-thread cases (and some performance improvement for the ip
> forwarding test too), but not for the 4-thread case.
> 
> So it seems we need to accept the performance degradation for the
> 4-thread case? Or any ideas?

No ideas, are the threads pinned to CPUs in some particular way?
Yunsheng Lin June 1, 2021, 8:18 a.m. UTC | #12
On 2021/6/1 12:51, Jakub Kicinski wrote:
> On Mon, 31 May 2021 20:40:01 +0800 Yunsheng Lin wrote:
>> On 2021/5/31 9:10, Yunsheng Lin wrote:
>>> On 2021/5/31 8:40, Yunsheng Lin wrote:  
>>>> On 2021/5/31 4:21, Jakub Kicinski wrote:  
>>  [...]  
>>>
>>>
>>>          CPU1                                   CPU2
>>>    qdisc_run_begin(q)                            .
>>>            .                                enqueue skb1
>>>      dequeue skb1                                .
>>>            .                                     .
>>> netdevice stopped and MISSED is clear            .
>>>            .                        nolock_qdisc_is_empty() return true
>>>      requeue skb                                 .
>>>            .                                     .
>>>            .                                     .
>>>            .                                     .
>>>   qdisc_run_end(q)                               .
>>>            .                              qdisc_run_begin(q)
>>>            .                             transmit skb2 directly
>>>            .                           transmit the requeued skb1
>>>
>>> The above sequence diagram seems more correct, it is basically about how to
>>> avoid transmitting a packet directly bypassing the requeued packet.
> 
> I see, thanks! That explains the need. Perhaps we can rephrase the
> comment? Maybe:
> 
> +			/* Retest nolock_qdisc_is_empty() within the protection
> +			 * of q->seqlock to protect from racing with requeuing.
> +			 */

Yes if we still decide to preserve the nolock_qdisc_is_empty() rechecking
under q->seqlock.

> 
>> I did some interesting testing to show how adjusting a small amount
>> of code can cause a noticeable performance degradation.
>>
>> 1. I used the below patch to remove the nolock_qdisc_is_empty() testing
>>    under q->seqlock.
>>
>> @@ -3763,17 +3763,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>         if (q->flags & TCQ_F_NOLOCK) {
>>                 if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
>>                     qdisc_run_begin(q)) {
>> -                       /* Retest nolock_qdisc_is_empty() within the protection
>> -                        * of q->seqlock to ensure qdisc is indeed empty.
>> -                        */
>> -                       if (unlikely(!nolock_qdisc_is_empty(q))) {
>> -                               rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> -                               __qdisc_run(q);
>> -                               qdisc_run_end(q);
>> -
>> -                               goto no_lock_out;
>> -                       }
>> -
>>                         qdisc_bstats_cpu_update(q, skb);
>>                         if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
>>                             !nolock_qdisc_is_empty(q))
>> @@ -3786,7 +3775,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>                 rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>>                 qdisc_run(q);
>>
>> -no_lock_out:
>>                 if (unlikely(to_free))
>>                         kfree_skb_list(to_free);
>>                 return rc;
>>
>> which has the below performance improvement:
>>
>>  threads      v1             v1 + above patch          delta
>>     1       3.21Mpps            3.20Mpps               -0.3%
>>     2       5.56Mpps            5.94Mpps               +4.9%
>>     4       5.58Mpps            5.60Mpps               +0.3%
>>     8       2.76Mpps            2.77Mpps               +0.3%
>>    16       2.23Mpps            2.23Mpps               +0.0%
>>
>> v1 = this patchset.
>>
>>
>> 2. After the above testing, it seems worthwhile to remove the
>>    nolock_qdisc_is_empty() testing under q->seqlock, so I used the below
>>    patch to make sure nolock_qdisc_is_empty() always returns false for
>>    the netdev-queue-stopped case.
>>
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
>>  static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>                                      const struct netdev_queue *txq)
>>  {
>> +       set_bit(__QDISC_STATE_DRAINING, &q->state);
>> +
>> +       /* Make sure DRAINING is set before clearing MISSED
>> +        * so that nolock_qdisc_is_empty() always returns
>> +        * false, avoiding transmitting a packet directly
>> +        * bypassing the requeued packet.
>> +        */
>> +       smp_mb__after_atomic();
>> +
>>         clear_bit(__QDISC_STATE_MISSED, &q->state);
>>
>>         /* Make sure the below netif_xmit_frozen_or_stopped()
>> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>          */
>>         if (!netif_xmit_frozen_or_stopped(txq))
>>                 set_bit(__QDISC_STATE_MISSED, &q->state);
>> -       else
>> -               set_bit(__QDISC_STATE_DRAINING, &q->state);
>>  }
> 
> But this would not be enough because we may also clear MISSED
> in pfifo_fast_dequeue()?

For the MISSED clearing in pfifo_fast_dequeue(), it looks like
the data race described in RFC v3 too?

      CPU1                 CPU2               CPU3
qdisc_run_begin(q)          .                  .
        .              MISSED is set           .
  MISSED is cleared         .                  .
    q->dequeue()            .                  .
        .              enqueue skb1     check MISSED # true
qdisc_run_end(q)            .                  .
        .                   .         qdisc_run_begin(q) # true
        .            MISSED is set      send skb2 directly
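
For reference, the dequeue-side clearing being discussed looks like the
below with this patchset applied (condensed from pfifo_fast_dequeue();
the retry plumbing is from the existing code, not shown in the diff at
the end of this thread):

	} else if (need_retry &&
		   READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
		/* delay clearing the bits until a dequeue attempt
		 * came back empty, then retry the dequeue once
		 */
		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
		clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);

		/* make sure dequeuing happens after clearing the bits */
		smp_mb__after_atomic();

		need_retry = false;

		goto retry;
	}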


> 
>> which has the below performance data:
>>
>>  threads      v1          v1 + above two patch          delta
>>     1       3.21Mpps            3.20Mpps               -0.3%
>>     2       5.56Mpps            5.94Mpps               +4.9%
>>     4       5.58Mpps            5.02Mpps                -10%
>>     8       2.76Mpps            2.77Mpps               +0.3%
>>    16       2.23Mpps            2.23Mpps               +0.0%
>>
>> So the adjustment in qdisc_maybe_clear_missed() seems to have
>> caused about 10% performance degradation for the 4-thread case.
>>
>> And the CPU top-down perf data suggested that icache misses and
>> bad speculation are the main factors behind the performance difference.
>>
>> I tried to control the above factors by removing the inline marking
>> of functions in sch_generic.c and adding likely/unlikely annotations
>> to the netif_xmit_frozen_or_stopped() checks in sch_generic.c.
>>
>> After doing that, we got a noticeable performance improvement for the
>> 1- and 2-thread cases (and some performance improvement for the ip
>> forwarding test too), but not for the 4-thread case.
>>
>> So it seems we need to accept the performance degradation for the
>> 4-thread case? Or any ideas?
> 
> No ideas, are the threads pinned to CPUs in some particular way?

pktgen already seems to be running a thread for each CPU, so I do not
need to do the pinning myself; for the 4-thread case, it runs on
CPUs 0~3.

It seems more related to the specific CPU implementation.

> 
> .
>
Jakub Kicinski June 1, 2021, 8:48 p.m. UTC | #13
On Tue, 1 Jun 2021 16:18:54 +0800 Yunsheng Lin wrote:
> > I see, thanks! That explains the need. Perhaps we can rephrase the
> > comment? Maybe:
> > 
> > +			/* Retest nolock_qdisc_is_empty() within the protection
> > +			 * of q->seqlock to protect from racing with requeuing.
> > +			 */  
> 
> Yes if we still decide to preserve the nolock_qdisc_is_empty() rechecking
> under q->seqlock.

Sounds good.

> >> --- a/net/sched/sch_generic.c
> >> +++ b/net/sched/sch_generic.c
> >> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
> >>  static void qdisc_maybe_clear_missed(struct Qdisc *q,
> >>                                      const struct netdev_queue *txq)
> >>  {
> >> +       set_bit(__QDISC_STATE_DRAINING, &q->state);
> >> +
> >> +       /* Make sure DRAINING is set before clearing MISSED
> >> +        * so that nolock_qdisc_is_empty() always returns
> >> +        * false, avoiding transmitting a packet directly
> >> +        * bypassing the requeued packet.
> >> +        */
> >> +       smp_mb__after_atomic();
> >> +
> >>         clear_bit(__QDISC_STATE_MISSED, &q->state);
> >>
> >>         /* Make sure the below netif_xmit_frozen_or_stopped()
> >> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
> >>          */
> >>         if (!netif_xmit_frozen_or_stopped(txq))
> >>                 set_bit(__QDISC_STATE_MISSED, &q->state);
> >> -       else
> >> -               set_bit(__QDISC_STATE_DRAINING, &q->state);
> >>  }  
> > 
> > But this would not be enough because we may also clear MISSED
> > in pfifo_fast_dequeue()?
> 
> For the MISSED clearing in pfifo_fast_dequeue(), it looks like
> the data race described in RFC v3 too?
> 
>       CPU1                 CPU2               CPU3
> qdisc_run_begin(q)          .                  .
>         .              MISSED is set           .
>   MISSED is cleared         .                  .
>     q->dequeue()            .                  .
>         .              enqueue skb1     check MISSED # true
> qdisc_run_end(q)            .                  .
>         .                   .         qdisc_run_begin(q) # true
>         .            MISSED is set      send skb2 directly

Not sure what you mean.
Yunsheng Lin June 2, 2021, 1:21 a.m. UTC | #14
On 2021/6/2 4:48, Jakub Kicinski wrote:
> On Tue, 1 Jun 2021 16:18:54 +0800 Yunsheng Lin wrote:
>>> I see, thanks! That explains the need. Perhaps we can rephrase the
>>> comment? Maybe:
>>>
>>> +			/* Retest nolock_qdisc_is_empty() within the protection
>>> +			 * of q->seqlock to protect from racing with requeuing.
>>> +			 */  
>>
>> Yes if we still decide to preserve the nolock_qdisc_is_empty() rechecking
>> under q->seqlock.
> 
> Sounds good.
> 
>>>> --- a/net/sched/sch_generic.c
>>>> +++ b/net/sched/sch_generic.c
>>>> @@ -38,6 +38,15 @@ EXPORT_SYMBOL(default_qdisc_ops);
>>>>  static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>>>                                      const struct netdev_queue *txq)
>>>>  {
>>>> +       set_bit(__QDISC_STATE_DRAINING, &q->state);
>>>> +
>>>> +       /* Make sure DRAINING is set before clearing MISSED
>>>> +        * so that nolock_qdisc_is_empty() always returns
>>>> +        * false, avoiding transmitting a packet directly
>>>> +        * bypassing the requeued packet.
>>>> +        */
>>>> +       smp_mb__after_atomic();
>>>> +
>>>>         clear_bit(__QDISC_STATE_MISSED, &q->state);
>>>>
>>>>         /* Make sure the below netif_xmit_frozen_or_stopped()
>>>> @@ -52,8 +61,6 @@ static void qdisc_maybe_clear_missed(struct Qdisc *q,
>>>>          */
>>>>         if (!netif_xmit_frozen_or_stopped(txq))
>>>>                 set_bit(__QDISC_STATE_MISSED, &q->state);
>>>> -       else
>>>> -               set_bit(__QDISC_STATE_DRAINING, &q->state);
>>>>  }  
>>>
>>> But this would not be enough because we may also clear MISSED
>>> in pfifo_fast_dequeue()?
>>
>> For the MISSED clearing in pfifo_fast_dequeue(), it looks like
>> the data race described in RFC v3 too?
>>
>>       CPU1                 CPU2               CPU3
>> qdisc_run_begin(q)          .                  .
>>         .              MISSED is set           .
>>   MISSED is cleared         .                  .
>>     q->dequeue()            .                  .
>>         .              enqueue skb1     check MISSED # true
>> qdisc_run_end(q)            .                  .
>>         .                   .         qdisc_run_begin(q) # true
>>         .            MISSED is set      send skb2 directly
> 
> Not sure what you mean.

       CPU1                 CPU2               CPU3
 qdisc_run_begin(q)          .                  .
         .              MISSED is set           .
   MISSED is cleared         .                  .
   another dequeuing         .                  .
         .                   .                  .
         .              enqueue skb1  nolock_qdisc_is_empty() # true
 qdisc_run_end(q)            .                  .
         .                   .         qdisc_run_begin(q) # true
         .                   .          send skb2 directly
         .               MISSED is set          .

As the qdisc is indeed empty at the point when MISSED is cleared and
another dequeue is retried by CPU1, and the MISSED setting is not under
q->seqlock, retesting MISSED under q->seqlock does not seem to make any
difference. So it seems like a case that does not need handling, as we
agreed previously?
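
For reference, the empty check being discussed treats either bit as
"not empty"; from the patch at the end of this thread:

	#define QDISC_STATE_MISSED	BIT(__QDISC_STATE_MISSED)
	#define QDISC_STATE_DRAINING	BIT(__QDISC_STATE_DRAINING)

	#define QDISC_STATE_NON_EMPTY	(QDISC_STATE_MISSED | \
						QDISC_STATE_DRAINING)

	static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
	{
		return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
	}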


> 
> .
>
Jakub Kicinski June 2, 2021, 4:28 p.m. UTC | #15
On Wed, 2 Jun 2021 09:21:01 +0800 Yunsheng Lin wrote:
> >> For the MISSED clearing in pfifo_fast_dequeue(), it looks like
> >> the data race described in RFC v3 too?
> >>
> >>       CPU1                 CPU2               CPU3
> >> qdisc_run_begin(q)          .                  .
> >>         .              MISSED is set           .
> >>   MISSED is cleared         .                  .
> >>     q->dequeue()            .                  .
> >>         .              enqueue skb1     check MISSED # true
> >> qdisc_run_end(q)            .                  .
> >>         .                   .         qdisc_run_begin(q) # true
> >>         .            MISSED is set      send skb2 directly  
> > 
> > Not sure what you mean.  
> 
>        CPU1                 CPU2               CPU3
>  qdisc_run_begin(q)          .                  .
>          .              MISSED is set           .
>    MISSED is cleared         .                  .
>    another dequeuing         .                  .
>          .                   .                  .
>          .              enqueue skb1  nolock_qdisc_is_empty() # true
>  qdisc_run_end(q)            .                  .
>          .                   .         qdisc_run_begin(q) # true
>          .                   .          send skb2 directly
>          .               MISSED is set          .
> 
> As the qdisc is indeed empty at the point when MISSED is cleared and
> another dequeue is retried by CPU1, and the MISSED setting is not under
> q->seqlock, retesting MISSED under q->seqlock does not seem to make any
> difference. So it seems like a case that does not need handling, as we
> agreed previously?

Right, this case doesn't need the re-check under the lock, but as
pointed out above, the requeuing case does require the re-check.
diff mbox series

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 3ed6bcc..177f240 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -37,8 +37,15 @@  enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
 	__QDISC_STATE_MISSED,
+	__QDISC_STATE_DRAINING,
 };
 
+#define QDISC_STATE_MISSED	BIT(__QDISC_STATE_MISSED)
+#define QDISC_STATE_DRAINING	BIT(__QDISC_STATE_DRAINING)
+
+#define QDISC_STATE_NON_EMPTY	(QDISC_STATE_MISSED | \
+					QDISC_STATE_DRAINING)
+
 struct qdisc_size_table {
 	struct rcu_head		rcu;
 	struct list_head	list;
@@ -145,6 +152,11 @@  static inline bool qdisc_is_running(struct Qdisc *qdisc)
 	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
 }
 
+static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
+{
+	return !(READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY);
+}
+
 static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
 {
 	return q->flags & TCQ_F_CPUSTATS;
@@ -206,10 +218,8 @@  static inline void qdisc_run_end(struct Qdisc *qdisc)
 		spin_unlock(&qdisc->seqlock);
 
 		if (unlikely(test_bit(__QDISC_STATE_MISSED,
-				      &qdisc->state))) {
-			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+				      &qdisc->state)))
 			__netif_schedule(qdisc);
-		}
 	} else {
 		write_seqcount_end(&qdisc->running);
 	}
diff --git a/net/core/dev.c b/net/core/dev.c
index 50531a2..e4cc926 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3852,10 +3852,32 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	qdisc_calculate_pkt_len(skb, q);
 
 	if (q->flags & TCQ_F_NOLOCK) {
+		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
+		    qdisc_run_begin(q)) {
+			/* Retest nolock_qdisc_is_empty() within the protection
+			 * of q->seqlock to ensure qdisc is indeed empty.
+			 */
+			if (unlikely(!nolock_qdisc_is_empty(q))) {
+				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+				__qdisc_run(q);
+				qdisc_run_end(q);
+
+				goto no_lock_out;
+			}
+
+			qdisc_bstats_cpu_update(q, skb);
+			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
+			    !nolock_qdisc_is_empty(q))
+				__qdisc_run(q);
+
+			qdisc_run_end(q);
+			return NET_XMIT_SUCCESS;
+		}
+
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		if (likely(!netif_xmit_frozen_or_stopped(txq)))
-			qdisc_run(q);
+		qdisc_run(q);
 
+no_lock_out:
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
 		return rc;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index fc8b56b..83d7f5f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -52,6 +52,8 @@  static void qdisc_maybe_clear_missed(struct Qdisc *q,
 	 */
 	if (!netif_xmit_frozen_or_stopped(txq))
 		set_bit(__QDISC_STATE_MISSED, &q->state);
+	else
+		set_bit(__QDISC_STATE_DRAINING, &q->state);
 }
 
 /* Main transmission queue. */
@@ -164,9 +166,13 @@  static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 
 		skb = next;
 	}
-	if (lock)
+
+	if (lock) {
 		spin_unlock(lock);
-	__netif_schedule(q);
+		set_bit(__QDISC_STATE_MISSED, &q->state);
+	} else {
+		__netif_schedule(q);
+	}
 }
 
 static void try_bulk_dequeue_skb(struct Qdisc *q,
@@ -409,7 +415,11 @@  void __qdisc_run(struct Qdisc *q)
 	while (qdisc_restart(q, &packets)) {
 		quota -= packets;
 		if (quota <= 0) {
-			__netif_schedule(q);
+			if (q->flags & TCQ_F_NOLOCK)
+				set_bit(__QDISC_STATE_MISSED, &q->state);
+			else
+				__netif_schedule(q);
+
 			break;
 		}
 	}
@@ -680,13 +690,14 @@  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
 	} else if (need_retry &&
-		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
+		   READ_ONCE(qdisc->state) & QDISC_STATE_NON_EMPTY) {
 		/* Delay clearing the STATE_MISSED here to reduce
 		 * the overhead of the second spin_trylock() in
 		 * qdisc_run_begin() and __netif_schedule() calling
 		 * in qdisc_run_end().
 		 */
 		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+		clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
 
 		/* Make sure dequeuing happens after clearing
 		 * STATE_MISSED.
@@ -1204,6 +1215,7 @@  static void dev_reset_queue(struct net_device *dev,
 	spin_unlock_bh(qdisc_lock(qdisc));
 	if (nolock) {
 		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+		clear_bit(__QDISC_STATE_DRAINING, &qdisc->state);
 		spin_unlock_bh(&qdisc->seqlock);
 	}
 }