[RFC,net-next,0/6] Cleanup IRQ affinity checks in several drivers

Message ID	20240812145633.52911-1-jdamato@fastly.com (mailing list archive)
Headers	show Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com [209.85.214.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 01FF0184525 for <netdev@vger.kernel.org>; Mon, 12 Aug 2024 14:57:03 +0000 (UTC) From: Joe Damato <jdamato@fastly.com> To: netdev@vger.kernel.org Cc: Joe Damato <jdamato@fastly.com>, Daniel Borkmann <daniel@iogearbox.net>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Harshitha Ramamurthy <hramamurthy@google.com>, intel-wired-lan@lists.osuosl.org (moderated list:INTEL ETHERNET DRIVERS), Jakub Kicinski <kuba@kernel.org>, Jeroen de Borst <jeroendb@google.com>, Jiri Pirko <jiri@resnulli.us>, Leon Romanovsky <leon@kernel.org>, linux-kernel@vger.kernel.org (open list), linux-rdma@vger.kernel.org (open list:MELLANOX MLX4 core VPI driver), Lorenzo Bianconi <lorenzo@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Praveen Kaligineedi <pkaligineedi@google.com>, Przemek Kitszel <przemyslaw.kitszel@intel.com>, Saeed Mahameed <saeedm@nvidia.com>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Shailend Chand <shailend@google.com>, Tariq Toukan <tariqt@nvidia.com>, Tony Nguyen <anthony.l.nguyen@intel.com>, Willem de Bruijn <willemb@google.com>, Yishai Hadas <yishaih@nvidia.com>, Ziwei Xiao <ziweixiao@google.com> Subject: [RFC net-next 0/6] Cleanup IRQ affinity checks in several drivers Date: Mon, 12 Aug 2024 14:56:21 +0000 Message-Id: <20240812145633.52911-1-jdamato@fastly.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Cleanup IRQ affinity checks in several drivers \| expand [RFC,net-next,0/6] Cleanup IRQ affinity checks in several drivers [RFC,net-next,1/6] netdevice: Add napi_affinity_no_change [RFC,net-next,2/6] mlx5: Use napi_affinity_no_change [RFC,net-next,3/6] gve: Use napi_affinity_no_change [RFC,net-next,4/6] i40e: Use napi_affinity_no_change [RFC,net-next,5/6] iavf: Use napi_affinity_no_change [RFC,net-next,6/6] mlx4: Use napi_affinity_no_change

Joe Damato Aug. 12, 2024, 2:56 p.m. UTC

Greetings:

Several drivers make a check in their napi poll functions to determine
if the CPU affinity of the IRQ has changed. If it has, the napi poll
function returns a value less than the budget to force polling mode to
be disabled, so that it can be rescheduled on the correct CPU next time
the softirq is raised.

This code is repeated in at least 5 drivers that I found, but there
might be more I missed (please let me know and I'll fix them). IMHO,
it'd be nice to fix this in existing drivers and avoid future drivers
repeating the same pattern.

FWIW, it's possible that patch 4, 5, and 6 could be separated into
"fixes" for the type mismatches and then, separaately, new code, but
that seemed like a lot of noise for the list and maybe unnecessary.

If I should first send fixes for 4, 5, and 6 and then send this cleanup
series after, let me know and I'll do that.

Sending as an RFC because:
  - I wanted to see if this cleanup was desirable overall, and
  - If so, do I need to send fixes for 4-6 first?

Thanks,
Joe

Joe Damato (6):
  netdevice: Add napi_affinity_no_change
  mlx5: Use napi_affinity_no_change
  gve: Use napi_affinity_no_change
  i40e: Use napi_affinity_no_change
  iavf: Use napi_affinity_no_change
  mlx4: Use napi_affinity_no_change

 drivers/net/ethernet/google/gve/gve_main.c        | 14 +-------------
 drivers/net/ethernet/intel/i40e/i40e.h            |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c       |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c       |  4 +---
 drivers/net/ethernet/intel/iavf/iavf.h            |  1 +
 drivers/net/ethernet/intel/iavf/iavf_main.c       |  4 +++-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c       |  4 +---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c        |  6 ++++--
 drivers/net/ethernet/mellanox/mlx4/en_rx.c        |  6 +-----
 drivers/net/ethernet/mellanox/mlx4/eq.c           |  2 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h      |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |  9 +--------
 include/linux/mlx4/device.h                       |  2 +-
 include/linux/netdevice.h                         |  8 ++++++++
 net/core/dev.c                                    | 14 ++++++++++++++
 17 files changed, 42 insertions(+), 41 deletions(-)

Jakub Kicinski Aug. 14, 2024, 12:17 a.m. UTC | #1

On Mon, 12 Aug 2024 14:56:21 +0000 Joe Damato wrote:
> Several drivers make a check in their napi poll functions to determine
> if the CPU affinity of the IRQ has changed. If it has, the napi poll
> function returns a value less than the budget to force polling mode to
> be disabled, so that it can be rescheduled on the correct CPU next time
> the softirq is raised.

Any reason not to use the irq number already stored in napi_struct ?

Joe Damato Aug. 14, 2024, 7:14 a.m. UTC | #2

On Tue, Aug 13, 2024 at 05:17:10PM -0700, Jakub Kicinski wrote:
> On Mon, 12 Aug 2024 14:56:21 +0000 Joe Damato wrote:
> > Several drivers make a check in their napi poll functions to determine
> > if the CPU affinity of the IRQ has changed. If it has, the napi poll
> > function returns a value less than the budget to force polling mode to
> > be disabled, so that it can be rescheduled on the correct CPU next time
> > the softirq is raised.
> 
> Any reason not to use the irq number already stored in napi_struct ?

Thanks for taking a look.

IIUC, that's possible if i40e, iavf, and gve are updated to call
netif_napi_set_irq first, which I could certainly do.

But as Stanislav points out, I would be adding a call to
irq_get_effective_affinity_mask in the hot path where one did not
exist before for 4 of 5 drivers.

In that case, it might make more sense to introduce:

  bool napi_affinity_no_change(const struct cpumask *aff_mask)

instead and the drivers which have a cached mask can pass it in and
gve can be updated later to cache it.

Not sure how crucial avoiding the irq_get_effective_affinity_mask
call is; I would guess maybe some driver owners would object to
adding a new call in the hot path where one didn't exist before.

What do you think?

Joe Damato Aug. 14, 2024, 12:12 p.m. UTC | #3

On Wed, Aug 14, 2024 at 08:14:48AM +0100, Joe Damato wrote:
> On Tue, Aug 13, 2024 at 05:17:10PM -0700, Jakub Kicinski wrote:
> > On Mon, 12 Aug 2024 14:56:21 +0000 Joe Damato wrote:
> > > Several drivers make a check in their napi poll functions to determine
> > > if the CPU affinity of the IRQ has changed. If it has, the napi poll
> > > function returns a value less than the budget to force polling mode to
> > > be disabled, so that it can be rescheduled on the correct CPU next time
> > > the softirq is raised.
> > 
> > Any reason not to use the irq number already stored in napi_struct ?
> 
> Thanks for taking a look.
> 
> IIUC, that's possible if i40e, iavf, and gve are updated to call
> netif_napi_set_irq first, which I could certainly do.
> 
> But as Stanislav points out, I would be adding a call to
> irq_get_effective_affinity_mask in the hot path where one did not
> exist before for 4 of 5 drivers.
> 
> In that case, it might make more sense to introduce:
> 
>   bool napi_affinity_no_change(const struct cpumask *aff_mask)
> 
> instead and the drivers which have a cached mask can pass it in and
> gve can be updated later to cache it.
> 
> Not sure how crucial avoiding the irq_get_effective_affinity_mask
> call is; I would guess maybe some driver owners would object to
> adding a new call in the hot path where one didn't exist before.
> 
> What do you think?

Actually... how about a slightly different approach, which caches
the affinity mask in the core?

  0. Extend napi struct to have a struct cpumask * field

  1. extend netif_napi_set_irq to:
    a. store the IRQ number in the napi struct (as you suggested)
    b. call irq_get_effective_affinity_mask to store the mask in the
       napi struct
    c. set up generic affinity_notify.notify and
       affinity_notify.release callbacks to update the in core mask
       when it changes

  2. add napi_affinity_no_change which now takes a napi_struct

  3. cleanup all 5 drivers:
    a. add calls to netif_napi_set_irq for all 5 (I think no RTNL
       is needed, so I think this would be straight forward?)
    b. remove all affinity_mask caching code in 4 of 5 drivers
    c. update all 5 drivers to call napi_affinity_no_change in poll

Then ... anyone who adds support for netif_napi_set_irq to their
driver in the future gets automatic support in-core for
caching/updating of the mask? And in the future netdev-genl could
dump the mask since its in-core?

I'll mess around with that locally to see how it looks, but let me
know if that sounds like a better overall approach.

- Joe

Jakub Kicinski Aug. 14, 2024, 3:09 p.m. UTC | #4

On Wed, 14 Aug 2024 13:12:08 +0100 Joe Damato wrote:
> Actually... how about a slightly different approach, which caches
> the affinity mask in the core?

I was gonna say :)

>   0. Extend napi struct to have a struct cpumask * field
> 
>   1. extend netif_napi_set_irq to:
>     a. store the IRQ number in the napi struct (as you suggested)
>     b. call irq_get_effective_affinity_mask to store the mask in the
>        napi struct
>     c. set up generic affinity_notify.notify and
>        affinity_notify.release callbacks to update the in core mask
>        when it changes

This part I'm not an export on.

>   2. add napi_affinity_no_change which now takes a napi_struct
> 
>   3. cleanup all 5 drivers:
>     a. add calls to netif_napi_set_irq for all 5 (I think no RTNL
>        is needed, so I think this would be straight forward?)
>     b. remove all affinity_mask caching code in 4 of 5 drivers
>     c. update all 5 drivers to call napi_affinity_no_change in poll
> 
> Then ... anyone who adds support for netif_napi_set_irq to their
> driver in the future gets automatic support in-core for
> caching/updating of the mask? And in the future netdev-genl could
> dump the mask since its in-core?
> 
> I'll mess around with that locally to see how it looks, but let me
> know if that sounds like a better overall approach.

Could we even handle this directly as part of __napi_poll(),
once the driver gives core all of the relevant pieces of information ?

Joe Damato Aug. 14, 2024, 3:19 p.m. UTC | #5

On Wed, Aug 14, 2024 at 08:09:15AM -0700, Jakub Kicinski wrote:
> On Wed, 14 Aug 2024 13:12:08 +0100 Joe Damato wrote:
> > Actually... how about a slightly different approach, which caches
> > the affinity mask in the core?
> 
> I was gonna say :)
> 
> >   0. Extend napi struct to have a struct cpumask * field
> > 
> >   1. extend netif_napi_set_irq to:
> >     a. store the IRQ number in the napi struct (as you suggested)
> >     b. call irq_get_effective_affinity_mask to store the mask in the
> >        napi struct
> >     c. set up generic affinity_notify.notify and
> >        affinity_notify.release callbacks to update the in core mask
> >        when it changes
> 
> This part I'm not an export on.
> 
> >   2. add napi_affinity_no_change which now takes a napi_struct
> > 
> >   3. cleanup all 5 drivers:
> >     a. add calls to netif_napi_set_irq for all 5 (I think no RTNL
> >        is needed, so I think this would be straight forward?)
> >     b. remove all affinity_mask caching code in 4 of 5 drivers
> >     c. update all 5 drivers to call napi_affinity_no_change in poll
> > 
> > Then ... anyone who adds support for netif_napi_set_irq to their
> > driver in the future gets automatic support in-core for
> > caching/updating of the mask? And in the future netdev-genl could
> > dump the mask since its in-core?
> > 
> > I'll mess around with that locally to see how it looks, but let me
> > know if that sounds like a better overall approach.

I ended up going with the approach laid out above; moving the IRQ
affinity mask updating code into the core (which adds that ability
to gve/mlx4/mlx5... it seems mlx4/5 cached but didn't have notifiers
setup to update the cached copy?) and adding calls to
netif_napi_set_irq in i40e/iavf and deleting their custom notifier
code.

It's almost ready for rfcv2; I think this approach is probably
better ?

> Could we even handle this directly as part of __napi_poll(),
> once the driver gives core all of the relevant pieces of information ?

I had been thinking the same thing, too, but it seems like at least
one driver (mlx5) counts the number of affinity changes to export as
a stat, so moving all of this to core would break that.

So, I may avoid attempting that for this series.

I'm still messing around with this but will send an rfcv2 in a bit.

Shay Drori Aug. 14, 2024, 4:03 p.m. UTC | #6

On 14/08/2024 18:19, Joe Damato wrote:
> On Wed, Aug 14, 2024 at 08:09:15AM -0700, Jakub Kicinski wrote:
>> On Wed, 14 Aug 2024 13:12:08 +0100 Joe Damato wrote:
>>> Actually... how about a slightly different approach, which caches
>>> the affinity mask in the core?
>>
>> I was gonna say :)
>>
>>>    0. Extend napi struct to have a struct cpumask * field
>>>
>>>    1. extend netif_napi_set_irq to:
>>>      a. store the IRQ number in the napi struct (as you suggested)
>>>      b. call irq_get_effective_affinity_mask to store the mask in the
>>>         napi struct
>>>      c. set up generic affinity_notify.notify and
>>>         affinity_notify.release callbacks to update the in core mask
>>>         when it changes
>>
>> This part I'm not an export on.

several net drivers (mlx5, mlx4, ice, ena and more) are using a feature
called ARFS (rmap)[1], and this feature is using the affinity notifier
mechanism.
Also, affinity notifier infra is supporting only a single notifier per
IRQ.

Hence, your suggestion (1.c) will break the ARFS feature.

[1] see irq_cpu_rmap_add()

>>
>>>    2. add napi_affinity_no_change which now takes a napi_struct
>>>
>>>    3. cleanup all 5 drivers:
>>>      a. add calls to netif_napi_set_irq for all 5 (I think no RTNL
>>>         is needed, so I think this would be straight forward?)
>>>      b. remove all affinity_mask caching code in 4 of 5 drivers
>>>      c. update all 5 drivers to call napi_affinity_no_change in poll
>>>
>>> Then ... anyone who adds support for netif_napi_set_irq to their
>>> driver in the future gets automatic support in-core for
>>> caching/updating of the mask? And in the future netdev-genl could
>>> dump the mask since its in-core?
>>>
>>> I'll mess around with that locally to see how it looks, but let me
>>> know if that sounds like a better overall approach.
> 
> I ended up going with the approach laid out above; moving the IRQ
> affinity mask updating code into the core (which adds that ability
> to gve/mlx4/mlx5... it seems mlx4/5 cached but didn't have notifiers
> setup to update the cached copy?)


maybe This is probably due to what I wrote above..


> and adding calls to
> netif_napi_set_irq in i40e/iavf and deleting their custom notifier
> code.
> 
> It's almost ready for rfcv2; I think this approach is probably
> better ?
> 
>> Could we even handle this directly as part of __napi_poll(),
>> once the driver gives core all of the relevant pieces of information ?
> 
> I had been thinking the same thing, too, but it seems like at least
> one driver (mlx5) counts the number of affinity changes to export as
> a stat, so moving all of this to core would break that.
> 
> So, I may avoid attempting that for this series.
> 
> I'm still messing around with this but will send an rfcv2 in a bit.
>

Joe Damato Aug. 14, 2024, 6:01 p.m. UTC | #7

On Wed, Aug 14, 2024 at 07:03:35PM +0300, Shay Drori wrote:
> 
> 
> On 14/08/2024 18:19, Joe Damato wrote:
> > On Wed, Aug 14, 2024 at 08:09:15AM -0700, Jakub Kicinski wrote:
> > > On Wed, 14 Aug 2024 13:12:08 +0100 Joe Damato wrote:
> > > > Actually... how about a slightly different approach, which caches
> > > > the affinity mask in the core?
> > > 
> > > I was gonna say :)
> > > 
> > > >    0. Extend napi struct to have a struct cpumask * field
> > > > 
> > > >    1. extend netif_napi_set_irq to:
> > > >      a. store the IRQ number in the napi struct (as you suggested)
> > > >      b. call irq_get_effective_affinity_mask to store the mask in the
> > > >         napi struct
> > > >      c. set up generic affinity_notify.notify and
> > > >         affinity_notify.release callbacks to update the in core mask
> > > >         when it changes
> > > 
> > > This part I'm not an export on.
> 
> several net drivers (mlx5, mlx4, ice, ena and more) are using a feature
> called ARFS (rmap)[1], and this feature is using the affinity notifier
> mechanism.
> Also, affinity notifier infra is supporting only a single notifier per
> IRQ.
> 
> Hence, your suggestion (1.c) will break the ARFS feature.
> 
> [1] see irq_cpu_rmap_add()

Thanks for taking a look and your reply.

I did notice ARFS use by some drivers and figured that might be why
the notifiers were being used in some cases.

I guess the question comes down to whether adding a call to
irq_get_effective_affinity_mask in the hot path is a bad idea.

If it is, then the only option is to have the drivers pass in their
IRQ affinity masks, as Stanislav suggested, to avoid adding that
call to the hot path.

If not, then the IRQ from napi_struct can be used and the affinity
mask can be generated on every napi poll. i40e/gve/iavf would need
calls to netif_napi_set_irq to set the IRQ mapping, which seems to
be straightforward.

In both cases: the IRQ notifier stuff would be left as is so that it
wouldn't break ARFS.

I suspect that the preferred solution would be to avoid adding that
call to the hot path, right?

Jakub Kicinski Aug. 15, 2024, 12:20 a.m. UTC | #8

On Wed, 14 Aug 2024 19:01:40 +0100 Joe Damato wrote:
> If it is, then the only option is to have the drivers pass in their
> IRQ affinity masks, as Stanislav suggested, to avoid adding that
> call to the hot path.
> 
> If not, then the IRQ from napi_struct can be used and the affinity
> mask can be generated on every napi poll. i40e/gve/iavf would need
> calls to netif_napi_set_irq to set the IRQ mapping, which seems to
> be straightforward.

It's a bit sad to have the generic solution blocked.
cpu_rmap_update() is exported. Maybe we can call it from our notifier?
rmap lives in struct net_device

Joe Damato Aug. 15, 2024, 10:22 a.m. UTC | #9

On Wed, Aug 14, 2024 at 05:20:46PM -0700, Jakub Kicinski wrote:
> On Wed, 14 Aug 2024 19:01:40 +0100 Joe Damato wrote:
> > If it is, then the only option is to have the drivers pass in their
> > IRQ affinity masks, as Stanislav suggested, to avoid adding that
> > call to the hot path.
> > 
> > If not, then the IRQ from napi_struct can be used and the affinity
> > mask can be generated on every napi poll. i40e/gve/iavf would need
> > calls to netif_napi_set_irq to set the IRQ mapping, which seems to
> > be straightforward.
> 
> It's a bit sad to have the generic solution blocked.
> cpu_rmap_update() is exported. Maybe we can call it from our notifier?
> rmap lives in struct net_device

I agree on the sadness. I will take a look today.

I guess if we were being really ambitious, we'd try to move ARFS
stuff into the core (as RSS was moved into the core).

Shay Drori Aug. 20, 2024, 6:40 a.m. UTC | #10

On 15/08/2024 13:22, Joe Damato wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Aug 14, 2024 at 05:20:46PM -0700, Jakub Kicinski wrote:
>> On Wed, 14 Aug 2024 19:01:40 +0100 Joe Damato wrote:
>>> If it is, then the only option is to have the drivers pass in their
>>> IRQ affinity masks, as Stanislav suggested, to avoid adding that
>>> call to the hot path.
>>>
>>> If not, then the IRQ from napi_struct can be used and the affinity
>>> mask can be generated on every napi poll. i40e/gve/iavf would need
>>> calls to netif_napi_set_irq to set the IRQ mapping, which seems to
>>> be straightforward.
>>
>> It's a bit sad to have the generic solution blocked.
>> cpu_rmap_update() is exported. Maybe we can call it from our notifier?
>> rmap lives in struct net_device
> 
> I agree on the sadness. I will take a look today.
> 
> I guess if we were being really ambitious, we'd try to move ARFS
> stuff into the core (as RSS was moved into the core).


Sorry for the late reply. Maybe we can modify affinity notifier infra to
support more than a single notifier per IRQ.
@Thomas, do you know why only a single notifier per IRQ is supported?

Joe Damato Aug. 20, 2024, 8:33 a.m. UTC | #11

On Tue, Aug 20, 2024 at 09:40:31AM +0300, Shay Drori wrote:
> 
> 
> On 15/08/2024 13:22, Joe Damato wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Wed, Aug 14, 2024 at 05:20:46PM -0700, Jakub Kicinski wrote:
> > > On Wed, 14 Aug 2024 19:01:40 +0100 Joe Damato wrote:
> > > > If it is, then the only option is to have the drivers pass in their
> > > > IRQ affinity masks, as Stanislav suggested, to avoid adding that
> > > > call to the hot path.
> > > > 
> > > > If not, then the IRQ from napi_struct can be used and the affinity
> > > > mask can be generated on every napi poll. i40e/gve/iavf would need
> > > > calls to netif_napi_set_irq to set the IRQ mapping, which seems to
> > > > be straightforward.
> > > 
> > > It's a bit sad to have the generic solution blocked.
> > > cpu_rmap_update() is exported. Maybe we can call it from our notifier?
> > > rmap lives in struct net_device
> > 
> > I agree on the sadness. I will take a look today.
> > 
> > I guess if we were being really ambitious, we'd try to move ARFS
> > stuff into the core (as RSS was moved into the core).
> 
> 
> Sorry for the late reply. Maybe we can modify affinity notifier infra to
> support more than a single notifier per IRQ.
> @Thomas, do you know why only a single notifier per IRQ is supported?

Sorry for the delayed response as well on my side; I've been in
between lots of different kernel RFCs :)

Jakub: the issue seems to be that the internals in lib/cpu_rmap.c
are needed to call cpu_rmap_update. It's probably possible to expose
them somehow so that a generic IRQ notifier could call
cpu_rmap_update, as you mentioned, but some rewiring is going to be
needed, I think.

I had a couple ideas for rewiring stuff, but I haven't had time to
context switch back on to this work as I've been busy with a few
other things (the IRQ suspension stuff and another mlx5 thing I have
yet to send upstream).

I hope to take another look at it this week, but I welcome any
suggestions from Shay/Thomas in the meantime.

- Joe

[RFC,net-next,0/6] Cleanup IRQ affinity checks in several drivers

Message

Comments