Message ID | 20230612144254.21039-1-ihuguet@redhat.com (mailing list archive)
---|---
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [net] sfc: use budget for TX completions
On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote: Hey Inigo, some trivial things below. > When running workloads heavy unbalanced towards TX (high TX, low RX > traffic), sfc driver can retain the CPU during too long times. Although > in many cases this is not enough to be visible, it can affect > performance and system responsiveness. > > A way to reproduce it is to use a debug kernel and run some parallel > netperf TX tests. In some systems, this will lead to this message being > logged: > kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s! > > The reason is that sfc driver doesn't account any NAPI budget for the TX > completion events work. With high-TX/low-RX traffic, this makes that the > CPU is held for long time for NAPI poll. > > Documentations says "drivers can process completions for any number of Tx > packets but should only process up to budget number of Rx packets". > However, many drivers do limit the amount of TX completions that they > process in a single NAPI poll. > > In the same way, this patch adds a limit for the TX work in sfc. With > the patch applied, the watchdog warning never appears. > > Tested with netperf in different combinations: single process / parallel > processes, TCP / UDP and different sizes of UDP messages. Repeated the > tests before and after the patch, without any noticeable difference in > network or CPU performance. > > Test hardware: > Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (4 cores, 2 threads/core) > Solarflare Communications XtremeScale X2522-25G Network Adapter > > Reported-by: Fei Liu <feliu@redhat.com> > Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> You're missing Fixes: tag. > --- > drivers/net/ethernet/sfc/ef10.c | 25 ++++++++++++++++++------- > drivers/net/ethernet/sfc/ef100_nic.c | 7 ++++++- > drivers/net/ethernet/sfc/ef100_tx.c | 4 ++-- > drivers/net/ethernet/sfc/ef100_tx.h | 2 +- > drivers/net/ethernet/sfc/tx_common.c | 4 +++- > drivers/net/ethernet/sfc/tx_common.h | 2 +- > 6 files changed, 31 insertions(+), 13 deletions(-) > > diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c > index d30459dbfe8f..b63e47af6365 100644 > --- a/drivers/net/ethernet/sfc/ef10.c > +++ b/drivers/net/ethernet/sfc/ef10.c > @@ -2950,7 +2950,7 @@ static u32 efx_ef10_extract_event_ts(efx_qword_t *event) > return tstamp; > } > > -static void > +static int > efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > { > struct efx_nic *efx = channel->efx; > @@ -2958,13 +2958,14 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > unsigned int tx_ev_desc_ptr; > unsigned int tx_ev_q_label; > unsigned int tx_ev_type; > + int work_done; > u64 ts_part; > > if (unlikely(READ_ONCE(efx->reset_pending))) > - return; > + return 0; > > if (unlikely(EFX_QWORD_FIELD(*event, ESF_DZ_TX_DROP_EVENT))) > - return; > + return 0; > > /* Get the transmit queue */ > tx_ev_q_label = EFX_QWORD_FIELD(*event, ESF_DZ_TX_QLABEL); > @@ -2973,8 +2974,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > if (!tx_queue->timestamping) { > /* Transmit completion */ > tx_ev_desc_ptr = EFX_QWORD_FIELD(*event, ESF_DZ_TX_DESCR_INDX); > - efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); > - return; > + return efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); > } > > /* Transmit timestamps are only available for 8XXX series. They result > @@ -3000,6 +3000,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > * fields in the event. 
> */ > tx_ev_type = EFX_QWORD_FIELD(*event, ESF_EZ_TX_SOFT1); > + work_done = 0; > > switch (tx_ev_type) { > case TX_TIMESTAMP_EVENT_TX_EV_COMPLETION: > @@ -3016,6 +3017,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > tx_queue->completed_timestamp_major = ts_part; > > efx_xmit_done_single(tx_queue); > + work_done = 1; > break; > > default: > @@ -3026,6 +3028,8 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > EFX_QWORD_VAL(*event)); > break; > } > + > + return work_done; > } > > static void > @@ -3081,13 +3085,16 @@ static void efx_ef10_handle_driver_generated_event(struct efx_channel *channel, > } > } > > +#define EFX_NAPI_MAX_TX 512 > + > static int efx_ef10_ev_process(struct efx_channel *channel, int quota) > { > struct efx_nic *efx = channel->efx; > efx_qword_t event, *p_event; > unsigned int read_ptr; > - int ev_code; > + int spent_tx = 0; > int spent = 0; > + int ev_code; > > if (quota <= 0) > return spent; > @@ -3126,7 +3133,11 @@ static int efx_ef10_ev_process(struct efx_channel *channel, int quota) > } > break; > case ESE_DZ_EV_CODE_TX_EV: > - efx_ef10_handle_tx_event(channel, &event); > + spent_tx += efx_ef10_handle_tx_event(channel, &event); > + if (spent_tx >= EFX_NAPI_MAX_TX) { > + spent = quota; > + goto out; > + } > break; > case ESE_DZ_EV_CODE_DRIVER_EV: > efx_ef10_handle_driver_event(channel, &event); > diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c > index 4dc643b0d2db..7adde9639c8a 100644 > --- a/drivers/net/ethernet/sfc/ef100_nic.c > +++ b/drivers/net/ethernet/sfc/ef100_nic.c > @@ -253,6 +253,8 @@ static void ef100_ev_read_ack(struct efx_channel *channel) > efx_reg(channel->efx, ER_GZ_EVQ_INT_PRIME)); > } > > +#define EFX_NAPI_MAX_TX 512 couldn't this go to tx_common.h ? you're defining this two times. 
> + > static int ef100_ev_process(struct efx_channel *channel, int quota) > { > struct efx_nic *efx = channel->efx; > @@ -260,6 +262,7 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) > bool evq_phase, old_evq_phase; > unsigned int read_ptr; > efx_qword_t *p_event; > + int spent_tx = 0; > int spent = 0; > bool ev_phase; > int ev_type; > @@ -295,7 +298,9 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) > efx_mcdi_process_event(channel, p_event); > break; > case ESE_GZ_EF100_EV_TX_COMPLETION: > - ef100_ev_tx(channel, p_event); > + spent_tx += ef100_ev_tx(channel, p_event); > + if (spent_tx >= EFX_NAPI_MAX_TX) > + spent = quota; > break; > case ESE_GZ_EF100_EV_DRIVER: > netif_info(efx, drv, efx->net_dev, > diff --git a/drivers/net/ethernet/sfc/ef100_tx.c b/drivers/net/ethernet/sfc/ef100_tx.c > index 29ffaf35559d..849e5555bd12 100644 > --- a/drivers/net/ethernet/sfc/ef100_tx.c > +++ b/drivers/net/ethernet/sfc/ef100_tx.c > @@ -346,7 +346,7 @@ void ef100_tx_write(struct efx_tx_queue *tx_queue) > ef100_tx_push_buffers(tx_queue); > } > > -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > { > unsigned int tx_done = > EFX_QWORD_FIELD(*p_event, ESF_GZ_EV_TXCMPL_NUM_DESC); > @@ -357,7 +357,7 @@ void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > unsigned int tx_index = (tx_queue->read_count + tx_done - 1) & > tx_queue->ptr_mask; > > - efx_xmit_done(tx_queue, tx_index); > + return efx_xmit_done(tx_queue, tx_index); > } > > /* Add a socket buffer to a TX queue > diff --git a/drivers/net/ethernet/sfc/ef100_tx.h b/drivers/net/ethernet/sfc/ef100_tx.h > index e9e11540fcde..d9a0819c5a72 100644 > --- a/drivers/net/ethernet/sfc/ef100_tx.h > +++ b/drivers/net/ethernet/sfc/ef100_tx.h > @@ -20,7 +20,7 @@ void ef100_tx_init(struct efx_tx_queue *tx_queue); > void ef100_tx_write(struct efx_tx_queue *tx_queue); > unsigned int ef100_tx_max_skb_descs(struct efx_nic *efx); > > -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); > +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); > > netdev_tx_t ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb); > int __ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb, > diff --git a/drivers/net/ethernet/sfc/tx_common.c b/drivers/net/ethernet/sfc/tx_common.c > index 67e789b96c43..755aa92bf823 100644 > --- a/drivers/net/ethernet/sfc/tx_common.c > +++ b/drivers/net/ethernet/sfc/tx_common.c > @@ -249,7 +249,7 @@ void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue) > } > } > > -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > { > unsigned int fill_level, pkts_compl = 0, bytes_compl = 0; > unsigned int efv_pkts_compl = 0; > @@ -279,6 +279,8 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > } > > efx_xmit_done_check_empty(tx_queue); > + > + return pkts_compl + efv_pkts_compl; > } > > /* Remove buffers put into a tx_queue for the current packet. 
> diff --git a/drivers/net/ethernet/sfc/tx_common.h b/drivers/net/ethernet/sfc/tx_common.h > index d87aecbc7bf1..1e9f42938aac 100644 > --- a/drivers/net/ethernet/sfc/tx_common.h > +++ b/drivers/net/ethernet/sfc/tx_common.h > @@ -28,7 +28,7 @@ static inline bool efx_tx_buffer_in_use(struct efx_tx_buffer *buffer) > } > > void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue); > -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); > +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); > > void efx_enqueue_unwind(struct efx_tx_queue *tx_queue, > unsigned int insert_count); > -- > 2.40.1 > >
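As background for the "many drivers do limit the amount of TX completions" point in the commit message, the common pattern looks roughly like the sketch below: reclaim TX up to a fixed driver-chosen cap, then process RX against the NAPI budget. This is only an illustrative sketch; the my_* names and the MY_TX_BUDGET value are hypothetical, not sfc or any specific driver's code.

/* Illustrative sketch of the split-budget NAPI poll pattern used by many
 * drivers (hypothetical my_* driver, not sfc): TX completions are bounded
 * by a driver-chosen cap, RX strictly by the NAPI budget.
 */
#define MY_TX_BUDGET 256        /* hypothetical per-poll TX completion cap */

static int my_napi_poll(struct napi_struct *napi, int budget)
{
        struct my_channel *ch = container_of(napi, struct my_channel, napi);
        int rx_done;

        /* Reclaim at most MY_TX_BUDGET TX descriptors; anything left over
         * is picked up on the next poll, so one CPU is never monopolised.
         */
        my_clean_tx_ring(ch, MY_TX_BUDGET);

        /* RX work is bounded by the core-provided budget. */
        rx_done = my_clean_rx_ring(ch, budget);

        /* Under budget: complete NAPI and re-enable the interrupt. */
        if (rx_done < budget && napi_complete_done(napi, rx_done))
                my_enable_irq(ch);

        return rx_done;
}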
On Mon, Jun 12, 2023 at 6:05 PM Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote: > > On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote: > > Hey Inigo, some trivial things below. > > > When running workloads heavy unbalanced towards TX (high TX, low RX > > traffic), sfc driver can retain the CPU during too long times. Although > > in many cases this is not enough to be visible, it can affect > > performance and system responsiveness. > > > > A way to reproduce it is to use a debug kernel and run some parallel > > netperf TX tests. In some systems, this will lead to this message being > > logged: > > kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s! > > > > The reason is that sfc driver doesn't account any NAPI budget for the TX > > completion events work. With high-TX/low-RX traffic, this makes that the > > CPU is held for long time for NAPI poll. > > > > Documentations says "drivers can process completions for any number of Tx > > packets but should only process up to budget number of Rx packets". > > However, many drivers do limit the amount of TX completions that they > > process in a single NAPI poll. > > > > In the same way, this patch adds a limit for the TX work in sfc. With > > the patch applied, the watchdog warning never appears. > > > > Tested with netperf in different combinations: single process / parallel > > processes, TCP / UDP and different sizes of UDP messages. Repeated the > > tests before and after the patch, without any noticeable difference in > > network or CPU performance. > > > > Test hardware: > > Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (4 cores, 2 threads/core) > > Solarflare Communications XtremeScale X2522-25G Network Adapter > > > > Reported-by: Fei Liu <feliu@redhat.com> > > Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> > > You're missing Fixes: tag. Thanks for the reminder! 
> > > --- > > drivers/net/ethernet/sfc/ef10.c | 25 ++++++++++++++++++------- > > drivers/net/ethernet/sfc/ef100_nic.c | 7 ++++++- > > drivers/net/ethernet/sfc/ef100_tx.c | 4 ++-- > > drivers/net/ethernet/sfc/ef100_tx.h | 2 +- > > drivers/net/ethernet/sfc/tx_common.c | 4 +++- > > drivers/net/ethernet/sfc/tx_common.h | 2 +- > > 6 files changed, 31 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c > > index d30459dbfe8f..b63e47af6365 100644 > > --- a/drivers/net/ethernet/sfc/ef10.c > > +++ b/drivers/net/ethernet/sfc/ef10.c > > @@ -2950,7 +2950,7 @@ static u32 efx_ef10_extract_event_ts(efx_qword_t *event) > > return tstamp; > > } > > > > -static void > > +static int > > efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > { > > struct efx_nic *efx = channel->efx; > > @@ -2958,13 +2958,14 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > unsigned int tx_ev_desc_ptr; > > unsigned int tx_ev_q_label; > > unsigned int tx_ev_type; > > + int work_done; > > u64 ts_part; > > > > if (unlikely(READ_ONCE(efx->reset_pending))) > > - return; > > + return 0; > > > > if (unlikely(EFX_QWORD_FIELD(*event, ESF_DZ_TX_DROP_EVENT))) > > - return; > > + return 0; > > > > /* Get the transmit queue */ > > tx_ev_q_label = EFX_QWORD_FIELD(*event, ESF_DZ_TX_QLABEL); > > @@ -2973,8 +2974,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > if (!tx_queue->timestamping) { > > /* Transmit completion */ > > tx_ev_desc_ptr = EFX_QWORD_FIELD(*event, ESF_DZ_TX_DESCR_INDX); > > - efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); > > - return; > > + return efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); > > } > > > > /* Transmit timestamps are only available for 8XXX series. They result > > @@ -3000,6 +3000,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > * fields in the event. 
> > */ > > tx_ev_type = EFX_QWORD_FIELD(*event, ESF_EZ_TX_SOFT1); > > + work_done = 0; > > > > switch (tx_ev_type) { > > case TX_TIMESTAMP_EVENT_TX_EV_COMPLETION: > > @@ -3016,6 +3017,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > tx_queue->completed_timestamp_major = ts_part; > > > > efx_xmit_done_single(tx_queue); > > + work_done = 1; > > break; > > > > default: > > @@ -3026,6 +3028,8 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > EFX_QWORD_VAL(*event)); > > break; > > } > > + > > + return work_done; > > } > > > > static void > > @@ -3081,13 +3085,16 @@ static void efx_ef10_handle_driver_generated_event(struct efx_channel *channel, > > } > > } > > > > +#define EFX_NAPI_MAX_TX 512 > > + > > static int efx_ef10_ev_process(struct efx_channel *channel, int quota) > > { > > struct efx_nic *efx = channel->efx; > > efx_qword_t event, *p_event; > > unsigned int read_ptr; > > - int ev_code; > > + int spent_tx = 0; > > int spent = 0; > > + int ev_code; > > > > if (quota <= 0) > > return spent; > > @@ -3126,7 +3133,11 @@ static int efx_ef10_ev_process(struct efx_channel *channel, int quota) > > } > > break; > > case ESE_DZ_EV_CODE_TX_EV: > > - efx_ef10_handle_tx_event(channel, &event); > > + spent_tx += efx_ef10_handle_tx_event(channel, &event); > > + if (spent_tx >= EFX_NAPI_MAX_TX) { > > + spent = quota; > > + goto out; > > + } > > break; > > case ESE_DZ_EV_CODE_DRIVER_EV: > > efx_ef10_handle_driver_event(channel, &event); > > diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c > > index 4dc643b0d2db..7adde9639c8a 100644 > > --- a/drivers/net/ethernet/sfc/ef100_nic.c > > +++ b/drivers/net/ethernet/sfc/ef100_nic.c > > @@ -253,6 +253,8 @@ static void ef100_ev_read_ack(struct efx_channel *channel) > > efx_reg(channel->efx, ER_GZ_EVQ_INT_PRIME)); > > } > > > > +#define EFX_NAPI_MAX_TX 512 > > couldn't this go to tx_common.h ? you're defining this two times. Not sure about this. I didn't like any place to put it, so I ended up putting it here. As it is an implementation detail, I didn't like the idea of putting it in a header file that contains the driver's API. I would better like to hear the opinion from the sfc maintainers, but I don't mind changing it because I'm neither happy with the chosen location. 
> > > + > > static int ef100_ev_process(struct efx_channel *channel, int quota) > > { > > struct efx_nic *efx = channel->efx; > > @@ -260,6 +262,7 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) > > bool evq_phase, old_evq_phase; > > unsigned int read_ptr; > > efx_qword_t *p_event; > > + int spent_tx = 0; > > int spent = 0; > > bool ev_phase; > > int ev_type; > > @@ -295,7 +298,9 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) > > efx_mcdi_process_event(channel, p_event); > > break; > > case ESE_GZ_EF100_EV_TX_COMPLETION: > > - ef100_ev_tx(channel, p_event); > > + spent_tx += ef100_ev_tx(channel, p_event); > > + if (spent_tx >= EFX_NAPI_MAX_TX) > > + spent = quota; > > break; > > case ESE_GZ_EF100_EV_DRIVER: > > netif_info(efx, drv, efx->net_dev, > > diff --git a/drivers/net/ethernet/sfc/ef100_tx.c b/drivers/net/ethernet/sfc/ef100_tx.c > > index 29ffaf35559d..849e5555bd12 100644 > > --- a/drivers/net/ethernet/sfc/ef100_tx.c > > +++ b/drivers/net/ethernet/sfc/ef100_tx.c > > @@ -346,7 +346,7 @@ void ef100_tx_write(struct efx_tx_queue *tx_queue) > > ef100_tx_push_buffers(tx_queue); > > } > > > > -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > > +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > > { > > unsigned int tx_done = > > EFX_QWORD_FIELD(*p_event, ESF_GZ_EV_TXCMPL_NUM_DESC); > > @@ -357,7 +357,7 @@ void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > > unsigned int tx_index = (tx_queue->read_count + tx_done - 1) & > > tx_queue->ptr_mask; > > > > - efx_xmit_done(tx_queue, tx_index); > > + return efx_xmit_done(tx_queue, tx_index); > > } > > > > /* Add a socket buffer to a TX queue > > diff --git a/drivers/net/ethernet/sfc/ef100_tx.h b/drivers/net/ethernet/sfc/ef100_tx.h > > index e9e11540fcde..d9a0819c5a72 100644 > > --- a/drivers/net/ethernet/sfc/ef100_tx.h > > +++ b/drivers/net/ethernet/sfc/ef100_tx.h > > @@ -20,7 +20,7 @@ void ef100_tx_init(struct efx_tx_queue *tx_queue); > > void ef100_tx_write(struct efx_tx_queue *tx_queue); > > unsigned int ef100_tx_max_skb_descs(struct efx_nic *efx); > > > > -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); > > +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); > > > > netdev_tx_t ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb); > > int __ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb, > > diff --git a/drivers/net/ethernet/sfc/tx_common.c b/drivers/net/ethernet/sfc/tx_common.c > > index 67e789b96c43..755aa92bf823 100644 > > --- a/drivers/net/ethernet/sfc/tx_common.c > > +++ b/drivers/net/ethernet/sfc/tx_common.c > > @@ -249,7 +249,7 @@ void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue) > > } > > } > > > > -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > > +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > > { > > unsigned int fill_level, pkts_compl = 0, bytes_compl = 0; > > unsigned int efv_pkts_compl = 0; > > @@ -279,6 +279,8 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > > } > > > > efx_xmit_done_check_empty(tx_queue); > > + > > + return pkts_compl + efv_pkts_compl; > > } > > > > /* Remove buffers put into a tx_queue for the current packet. 
> > diff --git a/drivers/net/ethernet/sfc/tx_common.h b/drivers/net/ethernet/sfc/tx_common.h > > index d87aecbc7bf1..1e9f42938aac 100644 > > --- a/drivers/net/ethernet/sfc/tx_common.h > > +++ b/drivers/net/ethernet/sfc/tx_common.h > > @@ -28,7 +28,7 @@ static inline bool efx_tx_buffer_in_use(struct efx_tx_buffer *buffer) > > } > > > > void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue); > > -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); > > +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); > > > > void efx_enqueue_unwind(struct efx_tx_queue *tx_queue, > > unsigned int insert_count); > > -- > > 2.40.1 > > > > >
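If the constant were shared as Maciej suggests, one possible placement (a sketch only, not a maintainer decision) would be next to the other TX helpers, so both callers pick up a single definition:

/* Sketch: one shared definition in drivers/net/ethernet/sfc/tx_common.h so
 * that ef10.c and ef100_nic.c stop defining it twice. The placement is only
 * a suggestion; the value is the one proposed in the patch.
 */

/* Maximum number of TX completions to process in a single NAPI poll */
#define EFX_NAPI_MAX_TX 512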
On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote: > When running workloads heavy unbalanced towards TX (high TX, low RX > traffic), sfc driver can retain the CPU during too long times. Although > in many cases this is not enough to be visible, it can affect > performance and system responsiveness. > > A way to reproduce it is to use a debug kernel and run some parallel > netperf TX tests. In some systems, this will lead to this message being > logged: > kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s! > > The reason is that sfc driver doesn't account any NAPI budget for the TX > completion events work. With high-TX/low-RX traffic, this makes that the > CPU is held for long time for NAPI poll. > > Documentations says "drivers can process completions for any number of Tx > packets but should only process up to budget number of Rx packets". > However, many drivers do limit the amount of TX completions that they > process in a single NAPI poll. I think your work and what other drivers do shows that the documentation is no longer correct. I haven't checked when that was written, but maybe it was years ago when link speeds were lower. Clearly for drivers that support higher link speeds this is an issue, so we should update the documentation. Not sure what constitutes a high link speed, with current CPUs for me it's anything >= 50G. > In the same way, this patch adds a limit for the TX work in sfc. With > the patch applied, the watchdog warning never appears. > > Tested with netperf in different combinations: single process / parallel > processes, TCP / UDP and different sizes of UDP messages. Repeated the > tests before and after the patch, without any noticeable difference in > network or CPU performance. > > Test hardware: > Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (4 cores, 2 threads/core) > Solarflare Communications XtremeScale X2522-25G Network Adapter > > Reported-by: Fei Liu <feliu@redhat.com> > Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> > --- > drivers/net/ethernet/sfc/ef10.c | 25 ++++++++++++++++++------- > drivers/net/ethernet/sfc/ef100_nic.c | 7 ++++++- > drivers/net/ethernet/sfc/ef100_tx.c | 4 ++-- > drivers/net/ethernet/sfc/ef100_tx.h | 2 +- > drivers/net/ethernet/sfc/tx_common.c | 4 +++- > drivers/net/ethernet/sfc/tx_common.h | 2 +- > 6 files changed, 31 insertions(+), 13 deletions(-) > > diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c > index d30459dbfe8f..b63e47af6365 100644 > --- a/drivers/net/ethernet/sfc/ef10.c > +++ b/drivers/net/ethernet/sfc/ef10.c > @@ -2950,7 +2950,7 @@ static u32 efx_ef10_extract_event_ts(efx_qword_t *event) > return tstamp; > } > > -static void > +static int > efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > { > struct efx_nic *efx = channel->efx; > @@ -2958,13 +2958,14 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > unsigned int tx_ev_desc_ptr; > unsigned int tx_ev_q_label; > unsigned int tx_ev_type; > + int work_done; > u64 ts_part; > > if (unlikely(READ_ONCE(efx->reset_pending))) > - return; > + return 0; > > if (unlikely(EFX_QWORD_FIELD(*event, ESF_DZ_TX_DROP_EVENT))) > - return; > + return 0; > > /* Get the transmit queue */ > tx_ev_q_label = EFX_QWORD_FIELD(*event, ESF_DZ_TX_QLABEL); > @@ -2973,8 +2974,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > if (!tx_queue->timestamping) { > /* Transmit completion */ > tx_ev_desc_ptr = EFX_QWORD_FIELD(*event, ESF_DZ_TX_DESCR_INDX); > - efx_xmit_done(tx_queue, 
tx_ev_desc_ptr & tx_queue->ptr_mask); > - return; > + return efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); > } > > /* Transmit timestamps are only available for 8XXX series. They result > @@ -3000,6 +3000,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > * fields in the event. > */ > tx_ev_type = EFX_QWORD_FIELD(*event, ESF_EZ_TX_SOFT1); > + work_done = 0; > > switch (tx_ev_type) { > case TX_TIMESTAMP_EVENT_TX_EV_COMPLETION: > @@ -3016,6 +3017,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > tx_queue->completed_timestamp_major = ts_part; > > efx_xmit_done_single(tx_queue); > + work_done = 1; > break; > > default: > @@ -3026,6 +3028,8 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > EFX_QWORD_VAL(*event)); > break; > } > + > + return work_done; > } > > static void > @@ -3081,13 +3085,16 @@ static void efx_ef10_handle_driver_generated_event(struct efx_channel *channel, > } > } > > +#define EFX_NAPI_MAX_TX 512 How did you determine this value? Is it what other driver use? Martin > + > static int efx_ef10_ev_process(struct efx_channel *channel, int quota) > { > struct efx_nic *efx = channel->efx; > efx_qword_t event, *p_event; > unsigned int read_ptr; > - int ev_code; > + int spent_tx = 0; > int spent = 0; > + int ev_code; > > if (quota <= 0) > return spent; > @@ -3126,7 +3133,11 @@ static int efx_ef10_ev_process(struct efx_channel *channel, int quota) > } > break; > case ESE_DZ_EV_CODE_TX_EV: > - efx_ef10_handle_tx_event(channel, &event); > + spent_tx += efx_ef10_handle_tx_event(channel, &event); > + if (spent_tx >= EFX_NAPI_MAX_TX) { > + spent = quota; > + goto out; > + } > break; > case ESE_DZ_EV_CODE_DRIVER_EV: > efx_ef10_handle_driver_event(channel, &event); > diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c > index 4dc643b0d2db..7adde9639c8a 100644 > --- a/drivers/net/ethernet/sfc/ef100_nic.c > +++ b/drivers/net/ethernet/sfc/ef100_nic.c > @@ -253,6 +253,8 @@ static void ef100_ev_read_ack(struct efx_channel *channel) > efx_reg(channel->efx, ER_GZ_EVQ_INT_PRIME)); > } > > +#define EFX_NAPI_MAX_TX 512 > + > static int ef100_ev_process(struct efx_channel *channel, int quota) > { > struct efx_nic *efx = channel->efx; > @@ -260,6 +262,7 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) > bool evq_phase, old_evq_phase; > unsigned int read_ptr; > efx_qword_t *p_event; > + int spent_tx = 0; > int spent = 0; > bool ev_phase; > int ev_type; > @@ -295,7 +298,9 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) > efx_mcdi_process_event(channel, p_event); > break; > case ESE_GZ_EF100_EV_TX_COMPLETION: > - ef100_ev_tx(channel, p_event); > + spent_tx += ef100_ev_tx(channel, p_event); > + if (spent_tx >= EFX_NAPI_MAX_TX) > + spent = quota; > break; > case ESE_GZ_EF100_EV_DRIVER: > netif_info(efx, drv, efx->net_dev, > diff --git a/drivers/net/ethernet/sfc/ef100_tx.c b/drivers/net/ethernet/sfc/ef100_tx.c > index 29ffaf35559d..849e5555bd12 100644 > --- a/drivers/net/ethernet/sfc/ef100_tx.c > +++ b/drivers/net/ethernet/sfc/ef100_tx.c > @@ -346,7 +346,7 @@ void ef100_tx_write(struct efx_tx_queue *tx_queue) > ef100_tx_push_buffers(tx_queue); > } > > -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > { > unsigned int tx_done = > EFX_QWORD_FIELD(*p_event, ESF_GZ_EV_TXCMPL_NUM_DESC); > @@ -357,7 +357,7 @@ void 
ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > unsigned int tx_index = (tx_queue->read_count + tx_done - 1) & > tx_queue->ptr_mask; > > - efx_xmit_done(tx_queue, tx_index); > + return efx_xmit_done(tx_queue, tx_index); > } > > /* Add a socket buffer to a TX queue > diff --git a/drivers/net/ethernet/sfc/ef100_tx.h b/drivers/net/ethernet/sfc/ef100_tx.h > index e9e11540fcde..d9a0819c5a72 100644 > --- a/drivers/net/ethernet/sfc/ef100_tx.h > +++ b/drivers/net/ethernet/sfc/ef100_tx.h > @@ -20,7 +20,7 @@ void ef100_tx_init(struct efx_tx_queue *tx_queue); > void ef100_tx_write(struct efx_tx_queue *tx_queue); > unsigned int ef100_tx_max_skb_descs(struct efx_nic *efx); > > -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); > +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); > > netdev_tx_t ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb); > int __ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb, > diff --git a/drivers/net/ethernet/sfc/tx_common.c b/drivers/net/ethernet/sfc/tx_common.c > index 67e789b96c43..755aa92bf823 100644 > --- a/drivers/net/ethernet/sfc/tx_common.c > +++ b/drivers/net/ethernet/sfc/tx_common.c > @@ -249,7 +249,7 @@ void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue) > } > } > > -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > { > unsigned int fill_level, pkts_compl = 0, bytes_compl = 0; > unsigned int efv_pkts_compl = 0; > @@ -279,6 +279,8 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > } > > efx_xmit_done_check_empty(tx_queue); > + > + return pkts_compl + efv_pkts_compl; > } > > /* Remove buffers put into a tx_queue for the current packet. > diff --git a/drivers/net/ethernet/sfc/tx_common.h b/drivers/net/ethernet/sfc/tx_common.h > index d87aecbc7bf1..1e9f42938aac 100644 > --- a/drivers/net/ethernet/sfc/tx_common.h > +++ b/drivers/net/ethernet/sfc/tx_common.h > @@ -28,7 +28,7 @@ static inline bool efx_tx_buffer_in_use(struct efx_tx_buffer *buffer) > } > > void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue); > -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); > +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); > > void efx_enqueue_unwind(struct efx_tx_queue *tx_queue, > unsigned int insert_count); > -- > 2.40.1
On Tue, Jun 13, 2023 at 04:42:25PM +0200, Íñigo Huguet wrote: > On Mon, Jun 12, 2023 at 6:05 PM Maciej Fijalkowski > <maciej.fijalkowski@intel.com> wrote: > > > > On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote: > > > > Hey Inigo, some trivial things below. > > > > > When running workloads heavy unbalanced towards TX (high TX, low RX > > > traffic), sfc driver can retain the CPU during too long times. Although > > > in many cases this is not enough to be visible, it can affect > > > performance and system responsiveness. > > > > > > A way to reproduce it is to use a debug kernel and run some parallel > > > netperf TX tests. In some systems, this will lead to this message being > > > logged: > > > kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s! > > > > > > The reason is that sfc driver doesn't account any NAPI budget for the TX > > > completion events work. With high-TX/low-RX traffic, this makes that the > > > CPU is held for long time for NAPI poll. > > > > > > Documentations says "drivers can process completions for any number of Tx > > > packets but should only process up to budget number of Rx packets". > > > However, many drivers do limit the amount of TX completions that they > > > process in a single NAPI poll. > > > > > > In the same way, this patch adds a limit for the TX work in sfc. With > > > the patch applied, the watchdog warning never appears. > > > > > > Tested with netperf in different combinations: single process / parallel > > > processes, TCP / UDP and different sizes of UDP messages. Repeated the > > > tests before and after the patch, without any noticeable difference in > > > network or CPU performance. > > > > > > Test hardware: > > > Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (4 cores, 2 threads/core) > > > Solarflare Communications XtremeScale X2522-25G Network Adapter > > > > > > Reported-by: Fei Liu <feliu@redhat.com> > > > Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> > > > > You're missing Fixes: tag. > > Thanks for the reminder! 
> > > > > > --- > > > drivers/net/ethernet/sfc/ef10.c | 25 ++++++++++++++++++------- > > > drivers/net/ethernet/sfc/ef100_nic.c | 7 ++++++- > > > drivers/net/ethernet/sfc/ef100_tx.c | 4 ++-- > > > drivers/net/ethernet/sfc/ef100_tx.h | 2 +- > > > drivers/net/ethernet/sfc/tx_common.c | 4 +++- > > > drivers/net/ethernet/sfc/tx_common.h | 2 +- > > > 6 files changed, 31 insertions(+), 13 deletions(-) > > > > > > diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c > > > index d30459dbfe8f..b63e47af6365 100644 > > > --- a/drivers/net/ethernet/sfc/ef10.c > > > +++ b/drivers/net/ethernet/sfc/ef10.c > > > @@ -2950,7 +2950,7 @@ static u32 efx_ef10_extract_event_ts(efx_qword_t *event) > > > return tstamp; > > > } > > > > > > -static void > > > +static int > > > efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > > { > > > struct efx_nic *efx = channel->efx; > > > @@ -2958,13 +2958,14 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > > unsigned int tx_ev_desc_ptr; > > > unsigned int tx_ev_q_label; > > > unsigned int tx_ev_type; > > > + int work_done; > > > u64 ts_part; > > > > > > if (unlikely(READ_ONCE(efx->reset_pending))) > > > - return; > > > + return 0; > > > > > > if (unlikely(EFX_QWORD_FIELD(*event, ESF_DZ_TX_DROP_EVENT))) > > > - return; > > > + return 0; > > > > > > /* Get the transmit queue */ > > > tx_ev_q_label = EFX_QWORD_FIELD(*event, ESF_DZ_TX_QLABEL); > > > @@ -2973,8 +2974,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > > if (!tx_queue->timestamping) { > > > /* Transmit completion */ > > > tx_ev_desc_ptr = EFX_QWORD_FIELD(*event, ESF_DZ_TX_DESCR_INDX); > > > - efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); > > > - return; > > > + return efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); > > > } > > > > > > /* Transmit timestamps are only available for 8XXX series. They result > > > @@ -3000,6 +3000,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > > * fields in the event. 
> > > */ > > > tx_ev_type = EFX_QWORD_FIELD(*event, ESF_EZ_TX_SOFT1); > > > + work_done = 0; > > > > > > switch (tx_ev_type) { > > > case TX_TIMESTAMP_EVENT_TX_EV_COMPLETION: > > > @@ -3016,6 +3017,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > > tx_queue->completed_timestamp_major = ts_part; > > > > > > efx_xmit_done_single(tx_queue); > > > + work_done = 1; > > > break; > > > > > > default: > > > @@ -3026,6 +3028,8 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) > > > EFX_QWORD_VAL(*event)); > > > break; > > > } > > > + > > > + return work_done; > > > } > > > > > > static void > > > @@ -3081,13 +3085,16 @@ static void efx_ef10_handle_driver_generated_event(struct efx_channel *channel, > > > } > > > } > > > > > > +#define EFX_NAPI_MAX_TX 512 > > > + > > > static int efx_ef10_ev_process(struct efx_channel *channel, int quota) > > > { > > > struct efx_nic *efx = channel->efx; > > > efx_qword_t event, *p_event; > > > unsigned int read_ptr; > > > - int ev_code; > > > + int spent_tx = 0; > > > int spent = 0; > > > + int ev_code; > > > > > > if (quota <= 0) > > > return spent; > > > @@ -3126,7 +3133,11 @@ static int efx_ef10_ev_process(struct efx_channel *channel, int quota) > > > } > > > break; > > > case ESE_DZ_EV_CODE_TX_EV: > > > - efx_ef10_handle_tx_event(channel, &event); > > > + spent_tx += efx_ef10_handle_tx_event(channel, &event); > > > + if (spent_tx >= EFX_NAPI_MAX_TX) { > > > + spent = quota; > > > + goto out; > > > + } > > > break; > > > case ESE_DZ_EV_CODE_DRIVER_EV: > > > efx_ef10_handle_driver_event(channel, &event); > > > diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c > > > index 4dc643b0d2db..7adde9639c8a 100644 > > > --- a/drivers/net/ethernet/sfc/ef100_nic.c > > > +++ b/drivers/net/ethernet/sfc/ef100_nic.c > > > @@ -253,6 +253,8 @@ static void ef100_ev_read_ack(struct efx_channel *channel) > > > efx_reg(channel->efx, ER_GZ_EVQ_INT_PRIME)); > > > } > > > > > > +#define EFX_NAPI_MAX_TX 512 > > > > couldn't this go to tx_common.h ? you're defining this two times. > > Not sure about this. I didn't like any place to put it, so I ended up > putting it here. As it is an implementation detail, I didn't like the > idea of putting it in a header file that contains the driver's API. > > I would better like to hear the opinion from the sfc maintainers, but > I don't mind changing it because I'm neither happy with the chosen > location. I think we should add it in include/linux/netdevice.h, close to NAPI_POLL_WEIGHT. That way all drivers can use it. Do we need to add this TX poll weight to struct napi_struct and extend netif_napi_add_weight()? That way all drivers could use the value from napi_struct in stead of using a hard-coded define. And at some point we can adjust it. 
Martin > > > > > > + > > > static int ef100_ev_process(struct efx_channel *channel, int quota) > > > { > > > struct efx_nic *efx = channel->efx; > > > @@ -260,6 +262,7 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) > > > bool evq_phase, old_evq_phase; > > > unsigned int read_ptr; > > > efx_qword_t *p_event; > > > + int spent_tx = 0; > > > int spent = 0; > > > bool ev_phase; > > > int ev_type; > > > @@ -295,7 +298,9 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) > > > efx_mcdi_process_event(channel, p_event); > > > break; > > > case ESE_GZ_EF100_EV_TX_COMPLETION: > > > - ef100_ev_tx(channel, p_event); > > > + spent_tx += ef100_ev_tx(channel, p_event); > > > + if (spent_tx >= EFX_NAPI_MAX_TX) > > > + spent = quota; > > > break; > > > case ESE_GZ_EF100_EV_DRIVER: > > > netif_info(efx, drv, efx->net_dev, > > > diff --git a/drivers/net/ethernet/sfc/ef100_tx.c b/drivers/net/ethernet/sfc/ef100_tx.c > > > index 29ffaf35559d..849e5555bd12 100644 > > > --- a/drivers/net/ethernet/sfc/ef100_tx.c > > > +++ b/drivers/net/ethernet/sfc/ef100_tx.c > > > @@ -346,7 +346,7 @@ void ef100_tx_write(struct efx_tx_queue *tx_queue) > > > ef100_tx_push_buffers(tx_queue); > > > } > > > > > > -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > > > +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > > > { > > > unsigned int tx_done = > > > EFX_QWORD_FIELD(*p_event, ESF_GZ_EV_TXCMPL_NUM_DESC); > > > @@ -357,7 +357,7 @@ void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) > > > unsigned int tx_index = (tx_queue->read_count + tx_done - 1) & > > > tx_queue->ptr_mask; > > > > > > - efx_xmit_done(tx_queue, tx_index); > > > + return efx_xmit_done(tx_queue, tx_index); > > > } > > > > > > /* Add a socket buffer to a TX queue > > > diff --git a/drivers/net/ethernet/sfc/ef100_tx.h b/drivers/net/ethernet/sfc/ef100_tx.h > > > index e9e11540fcde..d9a0819c5a72 100644 > > > --- a/drivers/net/ethernet/sfc/ef100_tx.h > > > +++ b/drivers/net/ethernet/sfc/ef100_tx.h > > > @@ -20,7 +20,7 @@ void ef100_tx_init(struct efx_tx_queue *tx_queue); > > > void ef100_tx_write(struct efx_tx_queue *tx_queue); > > > unsigned int ef100_tx_max_skb_descs(struct efx_nic *efx); > > > > > > -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); > > > +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); > > > > > > netdev_tx_t ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb); > > > int __ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb, > > > diff --git a/drivers/net/ethernet/sfc/tx_common.c b/drivers/net/ethernet/sfc/tx_common.c > > > index 67e789b96c43..755aa92bf823 100644 > > > --- a/drivers/net/ethernet/sfc/tx_common.c > > > +++ b/drivers/net/ethernet/sfc/tx_common.c > > > @@ -249,7 +249,7 @@ void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue) > > > } > > > } > > > > > > -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > > > +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > > > { > > > unsigned int fill_level, pkts_compl = 0, bytes_compl = 0; > > > unsigned int efv_pkts_compl = 0; > > > @@ -279,6 +279,8 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) > > > } > > > > > > efx_xmit_done_check_empty(tx_queue); > > > + > > > + return pkts_compl + efv_pkts_compl; > > > } > > > > > > /* Remove buffers put into a tx_queue for the current packet. 
> > > diff --git a/drivers/net/ethernet/sfc/tx_common.h b/drivers/net/ethernet/sfc/tx_common.h > > > index d87aecbc7bf1..1e9f42938aac 100644 > > > --- a/drivers/net/ethernet/sfc/tx_common.h > > > +++ b/drivers/net/ethernet/sfc/tx_common.h > > > @@ -28,7 +28,7 @@ static inline bool efx_tx_buffer_in_use(struct efx_tx_buffer *buffer) > > > } > > > > > > void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue); > > > -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); > > > +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); > > > > > > void efx_enqueue_unwind(struct efx_tx_queue *tx_queue, > > > unsigned int insert_count); > > > -- > > > 2.40.1 > > > > > > > > > > > -- > Íñigo Huguet
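A rough sketch of what the napi_struct extension suggested above could look like. This is not an existing kernel API: the tx_weight field, NAPI_POLL_TX_WEIGHT and netif_napi_add_tx_weight() are hypothetical, shown only to make the proposal concrete.

/* Rough illustration only; none of this exists in the kernel today.
 * The idea is a per-NAPI TX completion weight kept next to the RX weight.
 */
#define NAPI_POLL_TX_WEIGHT 512         /* hypothetical default TX weight */

struct napi_struct {
        /* ... existing fields elided ... */
        int weight;                     /* existing RX weight */
        int tx_weight;                  /* proposed per-poll TX budget */
};

/* Hypothetical variant of netif_napi_add_weight() that also takes the TX
 * weight, so a driver's poll loop can use napi->tx_weight instead of a
 * private define such as EFX_NAPI_MAX_TX.
 */
void netif_napi_add_tx_weight(struct net_device *dev, struct napi_struct *napi,
                              int (*poll)(struct napi_struct *, int),
                              int weight, int tx_weight);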
On Wed, Jun 14, 2023 at 10:03 AM Martin Habets <habetsm.xilinx@gmail.com> wrote:
>
> On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote:
> > Documentations says "drivers can process completions for any number of Tx
> > packets but should only process up to budget number of Rx packets".
> > However, many drivers do limit the amount of TX completions that they
> > process in a single NAPI poll.
>
> I think your work and what other drivers do shows that the documentation is
> no longer correct. I haven't checked when that was written, but maybe it
> was years ago when link speeds were lower.
> Clearly for drivers that support higher link speeds this is an issue, so we
> should update the documentation. Not sure what constitutes a high link speed,
> with current CPUs for me it's anything >= 50G.

I reproduced it with a 10G link (with a debug kernel, though).

> > +#define EFX_NAPI_MAX_TX 512
>
> How did you determine this value? Is it what other driver use?

A bit of trial and error. I wanted to find a value high enough not to
decrease performance but low enough to solve the issue.

Other drivers use lower values too, from 128. However, I decided to go
with a higher value in sfc because otherwise it can affect RX too much.
The most common case I saw in other drivers was: first process TX
completions up to the established limit, then process RX completions up
to the NAPI budget. But sfc processes TX and RX events serially,
intermixed. We need to put a limit on TX events, but if it were too low,
very few RX events would be processed under high TX traffic.

>
> > I would better like to hear the opinion from the sfc maintainers, but
> > I don't mind changing it because I'm neither happy with the chosen location.
>
> I think we should add it in include/linux/netdevice.h, close to
> NAPI_POLL_WEIGHT. That way all drivers can use it.
> Do we need to add this TX poll weight to struct napi_struct and
> extend netif_napi_add_weight()?
> That way all drivers could use the value from napi_struct instead of using
> a hard-coded define. And at some point we can adjust it.

That's what I thought too, but then we'd need to determine the exact
meaning of that TX budget (as you see, it doesn't mean exactly the same
for sfc as for other drivers, and there were small differences between
the other drivers too).

We would also need to decide what the default value for the TX budget
is, so it can be used in netif_napi_add. Right now, each driver is using
different values.

If something is done in that direction, it can take some time. May I
suggest including this fix until then?

--
Íñigo Huguet
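To make the contrast above concrete: a minimal sketch of the sfc-style model, where a single event queue carries both RX and TX completion events, so the TX cap has to be enforced inside the same loop that consumes RX events. The my_* names and MY_NAPI_MAX_TX are hypothetical; the real change is in the patch itself.

#define MY_NAPI_MAX_TX 512      /* hypothetical, mirrors EFX_NAPI_MAX_TX */

/* One event queue carries both RX and TX completions, so the TX cap must be
 * enforced inside the loop that also consumes RX events.
 */
static int my_ev_process(struct my_channel *ch, int quota)
{
        int spent = 0, spent_tx = 0;
        struct my_event ev;

        while (spent < quota && my_next_event(ch, &ev)) {
                switch (ev.type) {
                case MY_EV_RX:
                        spent += my_handle_rx(ch, &ev); /* counts against quota */
                        break;
                case MY_EV_TX:
                        spent_tx += my_handle_tx(ch, &ev);
                        if (spent_tx >= MY_NAPI_MAX_TX)
                                spent = quota;          /* force an early return */
                        break;
                }
        }
        return spent;
}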
On Wed, 14 Jun 2023 09:03:05 +0100 Martin Habets wrote: > On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote: > > When running workloads heavy unbalanced towards TX (high TX, low RX > > traffic), sfc driver can retain the CPU during too long times. Although > > in many cases this is not enough to be visible, it can affect > > performance and system responsiveness. > > > > A way to reproduce it is to use a debug kernel and run some parallel > > netperf TX tests. In some systems, this will lead to this message being > > logged: > > kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s! > > > > The reason is that sfc driver doesn't account any NAPI budget for the TX > > completion events work. With high-TX/low-RX traffic, this makes that the > > CPU is held for long time for NAPI poll. > > > > Documentations says "drivers can process completions for any number of Tx > > packets but should only process up to budget number of Rx packets". > > However, many drivers do limit the amount of TX completions that they > > process in a single NAPI poll. > > I think your work and what other drivers do shows that the documentation is > no longer correct. I haven't checked when that was written, but maybe it > was years ago when link speeds were lower. > Clearly for drivers that support higher link speeds this is an issue, so we > should update the documentation. Not sure what constitutes a high link speed, > with current CPUs for me it's anything >= 50G. The documentation is pretty recent. I haven't seen this lockup once in production or testing. Do multiple queues complete on the same CPU for SFC or something weird like that?
On Wed, 14 Jun 2023 12:13:11 +0200 Íñigo Huguet wrote: > On Wed, Jun 14, 2023 at 10:03 AM Martin Habets <habetsm.xilinx@gmail.com> wrote: > > On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote: > > > Documentations says "drivers can process completions for any number of Tx > > > packets but should only process up to budget number of Rx packets". > > > However, many drivers do limit the amount of TX completions that they > > > process in a single NAPI poll. > > > > I think your work and what other drivers do shows that the documentation is > > no longer correct. I haven't checked when that was written, but maybe it > > was years ago when link speeds were lower. > > Clearly for drivers that support higher link speeds this is an issue, so we > > should update the documentation. Not sure what constitutes a high link speed, > > with current CPUs for me it's anything >= 50G. > > I reproduced with a 10G link (with debug kernel, though) Ah. > > > +#define EFX_NAPI_MAX_TX 512 > > > > How did you determine this value? Is it what other driver use? > > A bit of trial and error. I wanted to find a value high enough to not > decrease performance but low enough to solve the issue. > > Other drivers use lower values too, from 128. However, I decided to go > to the high values in sfc because otherwise it can affect too much to > RX. The most common case I saw in other drivers was: First process TX > completions up to the established limit, then process RX completions > up to the NAPI budget. But sfc processes TX and RX events serially, > intermixed. We need to put a limit to TX events, but if it was too > low, very few RX events would be processed with high TX traffic. > > > > I would better like to hear the opinion from the sfc maintainers, but > > > I don't mind changing it because I'm neither happy with the chosen > > > location. > > > > I think we should add it in include/linux/netdevice.h, close to > > NAPI_POLL_WEIGHT. That way all drivers can use it. > > Do we need to add this TX poll weight to struct napi_struct and > > extend netif_napi_add_weight()? > > That way all drivers could use the value from napi_struct instead of using > > a hard-coded define. And at some point we can adjust it. > > That's what I thought too, but then we'd need to determine what's the > exact meaning for that TX budget (as you see, it doesn't mean exactly > the same for sfc than for other drivers, and between the other drivers > there were small differences too). > > We would also need to decide what the default value for the TX budget > is, so it is used in netif_napi_add. Right now, each driver is using > different values. > > If something is done in that direction, it can take some time. May I > suggest including this fix until then? Agreed. Still needs a fixes tag, tho.
On 14/06/2023 18:27, Jakub Kicinski wrote:
> The documentation is pretty recent. I haven't seen this lockup once
> in production or testing. Do multiple queues complete on the same CPU
> for SFC or something weird like that?

I think the key question here is can one CPU be using a TXQ to send
while another CPU is in a NAPI poll on the same channel and thus
trying to clean the EVQ that the TXQ is using. If so the NAPI poll
could last forever; if not then it shouldn't ever have more than 8k
(or whatever the TX ring size is set to) events to process.

And even ignoring affinity of the core TXQs, at the very least XDP
TXQs can serve different CPUs to the one on which their EVQ (and
hence NAPI poll) lives, which means they can keep filling the EVQ
as fast as the NAPI poll empties it, and thus keep ev_process
looping forever.

In principle this can also happen with other kinds of events, e.g.
if the MC goes crazy and generates infinite MCDI-event spam then
NAPI poll will spin on that CPU forever eating the events. So
maybe this limit needs to be broader than just TX events? A hard
cap on the number of events (regardless of type) that can be
consumed in a single efx_ef10_ev_process() invocation, perhaps?

-ed
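A minimal sketch of the broader cap Ed suggests, bounding the total number of events of any type consumed per poll call. Everything here (the my_* helpers and the MY_NAPI_MAX_EVENTS value) is hypothetical, not a proposed patch.

#define MY_NAPI_MAX_EVENTS 1024 /* hypothetical overall per-poll cap */

/* Bound the total number of events of any type handled in one call, so a
 * flood of TX, MCDI or other non-RX events cannot pin the CPU.
 */
static int my_ev_process_capped(struct my_channel *ch, int quota)
{
        int spent = 0, consumed = 0;
        struct my_event ev;

        while (spent < quota && my_next_event(ch, &ev)) {
                spent += my_handle_event(ch, &ev);      /* RX adds to spent */

                if (++consumed >= MY_NAPI_MAX_EVENTS) {
                        spent = quota;  /* yield; NAPI will poll us again */
                        break;
                }
        }
        return spent;
}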
On Thu, 15 Jun 2023 01:36:17 +0100 Edward Cree wrote:
> I think the key question here is can one CPU be using a TXQ to send
> while another CPU is in a NAPI poll on the same channel and thus
> trying to clean the EVQ that the TXQ is using. If so the NAPI poll
> could last forever; if not then it shouldn't ever have more than 8k
> (or whatever the TX ring size is set to) events to process.
> And even ignoring affinity of the core TXQs, at the very least XDP
> TXQs can serve different CPUs to the one on which their EVQ (and
> hence NAPI poll) lives, which means they can keep filling the EVQ
> as fast as the NAPI poll empties it, and thus keep ev_process
> looping forever.
> In principle this can also happen with other kinds of events, e.g.
> if the MC goes crazy and generates infinite MCDI-event spam then
> NAPI poll will spin on that CPU forever eating the events. So
> maybe this limit needs to be broader than just TX events? A hard
> cap on the number of events (regardless of type) that can be
> consumed in a single efx_ef10_ev_process() invocation, perhaps?

It'd be interesting to analyze the processing time for Tx packets,
but with GRO at play the processing time on Rx varies 10x packet to
packet. My intuition is that we should target an even amount of time
spent on Rx and Tx, but Tx is cheaper, hence a higher budget.

The way the wind is blowing I think RT and core folks would prefer us
to switch to checking the time rather than budgets. The time check is
amortized across multiple events, so instead of talking about a cap on
events perhaps we should consider how often we should be checking
whether the NAPI time slice has elapsed?

That said, I feel like if we try to reshuffle NAPI polling logic
without a real workload to test with we may do more damage than good.
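For illustration, a sketch of the amortized time check Jakub describes: read the clock only every few events and yield once the time slice is used up. ktime_get(), ktime_add_ns() and ktime_after() are real kernel helpers; the my_* names, the check interval and the slice length are hypothetical.

#define MY_TIME_CHECK_INTERVAL 32               /* events between clock reads */
#define MY_NAPI_SLICE_NS (2 * NSEC_PER_MSEC)    /* illustrative time slice */

/* Check elapsed time only every MY_TIME_CHECK_INTERVAL events so the cost
 * of reading the clock is amortized across the whole poll.
 */
static int my_ev_process_timed(struct my_channel *ch, int quota)
{
        ktime_t deadline = ktime_add_ns(ktime_get(), MY_NAPI_SLICE_NS);
        int spent = 0, consumed = 0;
        struct my_event ev;

        while (spent < quota && my_next_event(ch, &ev)) {
                spent += my_handle_event(ch, &ev);

                if (++consumed % MY_TIME_CHECK_INTERVAL == 0 &&
                    ktime_after(ktime_get(), deadline)) {
                        spent = quota;  /* time slice elapsed: yield the CPU */
                        break;
                }
        }
        return spent;
}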
On Wed, Jun 14, 2023 at 10:31:31AM -0700, Jakub Kicinski wrote: > On Wed, 14 Jun 2023 12:13:11 +0200 Íñigo Huguet wrote: > > On Wed, Jun 14, 2023 at 10:03 AM Martin Habets <habetsm.xilinx@gmail.com> wrote: > > > On Mon, Jun 12, 2023 at 04:42:54PM +0200, Íñigo Huguet wrote: > > > > Documentations says "drivers can process completions for any number of Tx > > > > packets but should only process up to budget number of Rx packets". > > > > However, many drivers do limit the amount of TX completions that they > > > > process in a single NAPI poll. > > > > > > I think your work and what other drivers do shows that the documentation is > > > no longer correct. I haven't checked when that was written, but maybe it > > > was years ago when link speeds were lower. > > > Clearly for drivers that support higher link speeds this is an issue, so we > > > should update the documentation. Not sure what constitutes a high link speed, > > > with current CPUs for me it's anything >= 50G. > > > > I reproduced with a 10G link (with debug kernel, though) > > Ah. Hmm, we usually don't optimise the driver for debug kernels. The delays introduced can make it impossible for CPUs to keep up with high bandwidths. Having said that I did not expect this to fail at only 10G. > > > > > +#define EFX_NAPI_MAX_TX 512 > > > > > > How did you determine this value? Is it what other driver use? > > > > A bit of trial and error. I wanted to find a value high enough to not > > decrease performance but low enough to solve the issue. > > > > Other drivers use lower values too, from 128. However, I decided to go > > to the high values in sfc because otherwise it can affect too much to > > RX. The most common case I saw in other drivers was: First process TX > > completions up to the established limit, then process RX completions > > up to the NAPI budget. But sfc processes TX and RX events serially, > > intermixed. We need to put a limit to TX events, but if it was too > > low, very few RX events would be processed with high TX traffic. > > > > > > I would better like to hear the opinion from the sfc maintainers, but > > > > I don't mind changing it because I'm neither happy with the chosen > > > > location. > > > > > > I think we should add it in include/linux/netdevice.h, close to > > > NAPI_POLL_WEIGHT. That way all drivers can use it. > > > Do we need to add this TX poll weight to struct napi_struct and > > > extend netif_napi_add_weight()? > > > That way all drivers could use the value from napi_struct instead of using > > > a hard-coded define. And at some point we can adjust it. > > > > That's what I thought too, but then we'd need to determine what's the > > exact meaning for that TX budget (as you see, it doesn't mean exactly > > the same for sfc than for other drivers, and between the other drivers > > there were small differences too). > > > > We would also need to decide what the default value for the TX budget > > is, so it is used in netif_napi_add. Right now, each driver is using > > different values. > > > > If something is done in that direction, it can take some time. May I > > suggest including this fix until then? > > Agreed. Still needs a fixes tag, tho. Fine for me. From your other response it seems trying to generalise this is not the longer term solution. Martin
diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c index d30459dbfe8f..b63e47af6365 100644 --- a/drivers/net/ethernet/sfc/ef10.c +++ b/drivers/net/ethernet/sfc/ef10.c @@ -2950,7 +2950,7 @@ static u32 efx_ef10_extract_event_ts(efx_qword_t *event) return tstamp; } -static void +static int efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) { struct efx_nic *efx = channel->efx; @@ -2958,13 +2958,14 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) unsigned int tx_ev_desc_ptr; unsigned int tx_ev_q_label; unsigned int tx_ev_type; + int work_done; u64 ts_part; if (unlikely(READ_ONCE(efx->reset_pending))) - return; + return 0; if (unlikely(EFX_QWORD_FIELD(*event, ESF_DZ_TX_DROP_EVENT))) - return; + return 0; /* Get the transmit queue */ tx_ev_q_label = EFX_QWORD_FIELD(*event, ESF_DZ_TX_QLABEL); @@ -2973,8 +2974,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) if (!tx_queue->timestamping) { /* Transmit completion */ tx_ev_desc_ptr = EFX_QWORD_FIELD(*event, ESF_DZ_TX_DESCR_INDX); - efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); - return; + return efx_xmit_done(tx_queue, tx_ev_desc_ptr & tx_queue->ptr_mask); } /* Transmit timestamps are only available for 8XXX series. They result @@ -3000,6 +3000,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) * fields in the event. */ tx_ev_type = EFX_QWORD_FIELD(*event, ESF_EZ_TX_SOFT1); + work_done = 0; switch (tx_ev_type) { case TX_TIMESTAMP_EVENT_TX_EV_COMPLETION: @@ -3016,6 +3017,7 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) tx_queue->completed_timestamp_major = ts_part; efx_xmit_done_single(tx_queue); + work_done = 1; break; default: @@ -3026,6 +3028,8 @@ efx_ef10_handle_tx_event(struct efx_channel *channel, efx_qword_t *event) EFX_QWORD_VAL(*event)); break; } + + return work_done; } static void @@ -3081,13 +3085,16 @@ static void efx_ef10_handle_driver_generated_event(struct efx_channel *channel, } } +#define EFX_NAPI_MAX_TX 512 + static int efx_ef10_ev_process(struct efx_channel *channel, int quota) { struct efx_nic *efx = channel->efx; efx_qword_t event, *p_event; unsigned int read_ptr; - int ev_code; + int spent_tx = 0; int spent = 0; + int ev_code; if (quota <= 0) return spent; @@ -3126,7 +3133,11 @@ static int efx_ef10_ev_process(struct efx_channel *channel, int quota) } break; case ESE_DZ_EV_CODE_TX_EV: - efx_ef10_handle_tx_event(channel, &event); + spent_tx += efx_ef10_handle_tx_event(channel, &event); + if (spent_tx >= EFX_NAPI_MAX_TX) { + spent = quota; + goto out; + } break; case ESE_DZ_EV_CODE_DRIVER_EV: efx_ef10_handle_driver_event(channel, &event); diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c index 4dc643b0d2db..7adde9639c8a 100644 --- a/drivers/net/ethernet/sfc/ef100_nic.c +++ b/drivers/net/ethernet/sfc/ef100_nic.c @@ -253,6 +253,8 @@ static void ef100_ev_read_ack(struct efx_channel *channel) efx_reg(channel->efx, ER_GZ_EVQ_INT_PRIME)); } +#define EFX_NAPI_MAX_TX 512 + static int ef100_ev_process(struct efx_channel *channel, int quota) { struct efx_nic *efx = channel->efx; @@ -260,6 +262,7 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) bool evq_phase, old_evq_phase; unsigned int read_ptr; efx_qword_t *p_event; + int spent_tx = 0; int spent = 0; bool ev_phase; int ev_type; @@ -295,7 +298,9 @@ static int ef100_ev_process(struct efx_channel *channel, int quota) efx_mcdi_process_event(channel, 
p_event); break; case ESE_GZ_EF100_EV_TX_COMPLETION: - ef100_ev_tx(channel, p_event); + spent_tx += ef100_ev_tx(channel, p_event); + if (spent_tx >= EFX_NAPI_MAX_TX) + spent = quota; break; case ESE_GZ_EF100_EV_DRIVER: netif_info(efx, drv, efx->net_dev, diff --git a/drivers/net/ethernet/sfc/ef100_tx.c b/drivers/net/ethernet/sfc/ef100_tx.c index 29ffaf35559d..849e5555bd12 100644 --- a/drivers/net/ethernet/sfc/ef100_tx.c +++ b/drivers/net/ethernet/sfc/ef100_tx.c @@ -346,7 +346,7 @@ void ef100_tx_write(struct efx_tx_queue *tx_queue) ef100_tx_push_buffers(tx_queue); } -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) { unsigned int tx_done = EFX_QWORD_FIELD(*p_event, ESF_GZ_EV_TXCMPL_NUM_DESC); @@ -357,7 +357,7 @@ void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) unsigned int tx_index = (tx_queue->read_count + tx_done - 1) & tx_queue->ptr_mask; - efx_xmit_done(tx_queue, tx_index); + return efx_xmit_done(tx_queue, tx_index); } /* Add a socket buffer to a TX queue diff --git a/drivers/net/ethernet/sfc/ef100_tx.h b/drivers/net/ethernet/sfc/ef100_tx.h index e9e11540fcde..d9a0819c5a72 100644 --- a/drivers/net/ethernet/sfc/ef100_tx.h +++ b/drivers/net/ethernet/sfc/ef100_tx.h @@ -20,7 +20,7 @@ void ef100_tx_init(struct efx_tx_queue *tx_queue); void ef100_tx_write(struct efx_tx_queue *tx_queue); unsigned int ef100_tx_max_skb_descs(struct efx_nic *efx); -void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); +int ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); netdev_tx_t ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb); int __ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb, diff --git a/drivers/net/ethernet/sfc/tx_common.c b/drivers/net/ethernet/sfc/tx_common.c index 67e789b96c43..755aa92bf823 100644 --- a/drivers/net/ethernet/sfc/tx_common.c +++ b/drivers/net/ethernet/sfc/tx_common.c @@ -249,7 +249,7 @@ void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue) } } -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) { unsigned int fill_level, pkts_compl = 0, bytes_compl = 0; unsigned int efv_pkts_compl = 0; @@ -279,6 +279,8 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) } efx_xmit_done_check_empty(tx_queue); + + return pkts_compl + efv_pkts_compl; } /* Remove buffers put into a tx_queue for the current packet. diff --git a/drivers/net/ethernet/sfc/tx_common.h b/drivers/net/ethernet/sfc/tx_common.h index d87aecbc7bf1..1e9f42938aac 100644 --- a/drivers/net/ethernet/sfc/tx_common.h +++ b/drivers/net/ethernet/sfc/tx_common.h @@ -28,7 +28,7 @@ static inline bool efx_tx_buffer_in_use(struct efx_tx_buffer *buffer) } void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue); -void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); +int efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index); void efx_enqueue_unwind(struct efx_tx_queue *tx_queue, unsigned int insert_count);
When running workloads heavily unbalanced towards TX (high TX, low RX
traffic), the sfc driver can hold the CPU for too long. Although in many
cases this is not enough to be visible, it can affect performance and
system responsiveness.

A way to reproduce it is to use a debug kernel and run some parallel
netperf TX tests. In some systems, this will lead to this message being
logged:
  kernel:watchdog: BUG: soft lockup - CPU#12 stuck for 22s!

The reason is that the sfc driver doesn't account any NAPI budget for the
TX completion events work. With high-TX/low-RX traffic, this causes the
CPU to be held for a long time in NAPI poll.

The documentation says "drivers can process completions for any number of
Tx packets but should only process up to budget number of Rx packets".
However, many drivers do limit the amount of TX completions that they
process in a single NAPI poll.

In the same way, this patch adds a limit for the TX work in sfc. With
the patch applied, the watchdog warning never appears.

Tested with netperf in different combinations: single process / parallel
processes, TCP / UDP and different sizes of UDP messages. Repeated the
tests before and after the patch, without any noticeable difference in
network or CPU performance.

Test hardware:
Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (4 cores, 2 threads/core)
Solarflare Communications XtremeScale X2522-25G Network Adapter

Reported-by: Fei Liu <feliu@redhat.com>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
---
 drivers/net/ethernet/sfc/ef10.c      | 25 ++++++++++++++++++-------
 drivers/net/ethernet/sfc/ef100_nic.c |  7 ++++++-
 drivers/net/ethernet/sfc/ef100_tx.c  |  4 ++--
 drivers/net/ethernet/sfc/ef100_tx.h  |  2 +-
 drivers/net/ethernet/sfc/tx_common.c |  4 +++-
 drivers/net/ethernet/sfc/tx_common.h |  2 +-
 6 files changed, 31 insertions(+), 13 deletions(-)