Message ID | 20240702234757.4188344-2-kuba@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | eth: bnxt: use the new RSS API | expand |
On 03/07/2024 00:47, Jakub Kicinski wrote: > RSS contexts may get lost from a device, in various extreme circumstances. > Specifically if the firmware leaks resources and resets, or crashes and > either recovers in partially working state or the crash causes a > different FW version to run - creating the context again may fail. So, I deliberately *didn't* do this, on the grounds that if the user fixed things by updating FW and resetting again, their contexts could get restored. I suppose big users like Meta will have orchestration doing all that work anyway so it doesn't matter. > Drivers should do their absolute best to prevent this from happening. > When it does, however, telling user that a context exists, when it can't > possibly be used any more is counter productive. Add a helper for > drivers to discard contexts. Print an error, in the future netlink > notification will also be sent. Possibility of a netlink notification makes the idea of a broken flag a bit more workable imho. But it's up to you which way to go.
On Wed, 3 Jul 2024 12:08:36 +0100 Edward Cree wrote: > On 03/07/2024 00:47, Jakub Kicinski wrote: > > RSS contexts may get lost from a device, in various extreme circumstances. > > Specifically if the firmware leaks resources and resets, or crashes and > > either recovers in partially working state or the crash causes a > > different FW version to run - creating the context again may fail. > > So, I deliberately *didn't* do this, on the grounds that if the user > fixed things by updating FW and resetting again, their contexts could > get restored. I suppose big users like Meta will have orchestration > doing all that work anyway so it doesn't matter. "We" don't reset FW while workload is running. I'm speculating why bnxt may lose the contexts. From my perspective if contexts get lost the machine should get taken out of production and at least power cycled. > > Drivers should do their absolute best to prevent this from happening. > > When it does, however, telling user that a context exists, when it can't > > possibly be used any more is counter productive. Add a helper for > > drivers to discard contexts. Print an error, in the future netlink > > notification will also be sent. > > Possibility of a netlink notification makes the idea of a broken flag > a bit more workable imho. But it's up to you which way to go. Oh, have we talked about this? Now that you mention the broken flag I recall talking about devlink health reporter.. a while back. I don't have a preference on how we deal with the lost contexts. The more obvious we make it to orchestration that the machine is broken the better. Can you point me to the discussion / describe the broken flag?
On 03/07/2024 14:43, Jakub Kicinski wrote: > On Wed, 3 Jul 2024 12:08:36 +0100 Edward Cree wrote: >> Possibility of a netlink notification makes the idea of a broken flag >> a bit more workable imho. But it's up to you which way to go. > > Oh, have we talked about this? Now that you mention the broken flag > I recall talking about devlink health reporter.. a while back. > > I don't have a preference on how we deal with the lost contexts. > The more obvious we make it to orchestration that the machine is broken > the better. Can you point me to the discussion / describe the broken > flag? We discussed it briefly on v4 back in October [1][2]. Idea was there'd be a flag in struct ethtool_rxfh_context with meaning of "this context is not present in hw owing to reinsertion failure after a device reset", reported in ethtool -x and perhaps also devlink health. Driver could set this flag (which would trigger a netlink notification) but could also clear it if a second reset (or some runtime configuration change) triggers another round of reinsertion which succeeds this time. Fwiw I don't have a strong preference either — like I say, you do what you think is best. -ed [1]: https://lore.kernel.org/all/2ea45188-5554-8067-820d-378cada735ee@gmail.com/T/#ma8fce7df8b65601009551839d9d102c49e79803a [2]: https://lore.kernel.org/all/2ea45188-5554-8067-820d-378cada735ee@gmail.com/T/#m073cb990982d89e35d78edf364389de62256664b
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h index f74bb0cf8ed1..3ce5be0d168a 100644 --- a/include/linux/ethtool.h +++ b/include/linux/ethtool.h @@ -210,6 +210,8 @@ static inline size_t ethtool_rxfh_context_size(u32 indir_size, u32 key_size, return struct_size_t(struct ethtool_rxfh_context, data, flex_len); } +void ethtool_rxfh_context_lost(struct net_device *dev, u32 context_id); + /* declare a link mode bitmap */ #define __ETHTOOL_DECLARE_LINK_MODE_MASK(name) \ DECLARE_BITMAP(name, __ETHTOOL_LINK_MODE_MASK_NBITS) diff --git a/net/ethtool/rss.c b/net/ethtool/rss.c index 71679137eff2..e2e5bab56a6b 100644 --- a/net/ethtool/rss.c +++ b/net/ethtool/rss.c @@ -159,3 +159,17 @@ const struct ethnl_request_ops ethnl_rss_request_ops = { .fill_reply = rss_fill_reply, .cleanup_data = rss_cleanup_data, }; + +void ethtool_rxfh_context_lost(struct net_device *dev, u32 context_id) +{ + struct ethtool_rxfh_context *ctx; + + WARN_ONCE(!rtnl_is_locked() && + !lockdep_is_held_type(&dev->ethtool->rss_lock, -1), + "RSS context lock assertion failed\n"); + + netdev_err(dev, "device error, RSS context %d lost\n", context_id); + ctx = xa_erase(&dev->ethtool->rss_ctx, context_id); + kfree(ctx); +} +EXPORT_SYMBOL(ethtool_rxfh_context_lost);
RSS contexts may get lost from a device, in various extreme circumstances. Specifically if the firmware leaks resources and resets, or crashes and either recovers in partially working state or the crash causes a different FW version to run - creating the context again may fail. Drivers should do their absolute best to prevent this from happening. When it does, however, telling user that a context exists, when it can't possibly be used any more is counter productive. Add a helper for drivers to discard contexts. Print an error, in the future netlink notification will also be sent. Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- include/linux/ethtool.h | 2 ++ net/ethtool/rss.c | 14 ++++++++++++++ 2 files changed, 16 insertions(+)