Message ID | 9e2bcb887b5cf9cbb8c0c4ba126115fe01a01f3f.1681236654.git.ecree.xilinx@gmail.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | ethtool: track custom RSS contexts in the core | expand |
On Tue, Apr 11, 2023 at 07:26:14PM +0100, edward.cree@amd.com wrote: > From: Edward Cree <ecree.xilinx@gmail.com> > > While this is not needed to serialise the ethtool entry points (which > are all under RTNL), drivers may have cause to asynchronously access > dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to > do this safely without needing to take the RTNL. What is actually wrong with taking RTNL? KISS is often best, especially for locks. Andrew
On 11/04/2023 21:40, Andrew Lunn wrote: > On Tue, Apr 11, 2023 at 07:26:14PM +0100, edward.cree@amd.com wrote: >> While this is not needed to serialise the ethtool entry points (which >> are all under RTNL), drivers may have cause to asynchronously access >> dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to >> do this safely without needing to take the RTNL. > > What is actually wrong with taking RTNL? KISS is often best, > especially for locks. The examples I have of driver code that needs to access rss_ctx (in the sfc driver) are deep inside call chains where RTNL may or may not already be held. 1) filter insertion. E.g. ethtool -U is already holding RTNL, but other sources of filters (e.g. aRFS) aren't, and thus taking it if necessary might mean passing a 'bool rtnl_locked' all the way down the chain. 2) device reset handling (we have to restore the RSS contexts in the hardware after a reset). Again resets don't always happen under RTNL, and I don't fully understand the details (EFX_ASSERT_RESET_SERIALISED()). So it makes life much simpler if we just have a finer-grained lock we can just take when we need to. Also, RTNL is a very big hammer that serialises all kinds of operations across all the netdevs in the system, holding it for any length of time can cause annoying user-visible latency (e.g. iirc sshd accepting a new connection has to wait for it) so I prefer to avoid it if possible. If anything we want to be breaking up this BKL[1], not making it bigger. -ed [1]: https://legacy.netdevconf.info/2.2/slides/westphal-rtnlmutex-talk.pdf
On Wed, Apr 12, 2023 at 05:16:11PM +0100, Edward Cree wrote: > On 11/04/2023 21:40, Andrew Lunn wrote: > > On Tue, Apr 11, 2023 at 07:26:14PM +0100, edward.cree@amd.com wrote: > >> While this is not needed to serialise the ethtool entry points (which > >> are all under RTNL), drivers may have cause to asynchronously access > >> dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to > >> do this safely without needing to take the RTNL. > > > > What is actually wrong with taking RTNL? KISS is often best, > > especially for locks. > > The examples I have of driver code that needs to access rss_ctx (in the > sfc driver) are deep inside call chains where RTNL may or may not > already be held. > 1) filter insertion. E.g. ethtool -U is already holding RTNL, but other > sources of filters (e.g. aRFS) aren't, and thus taking it if necessary > might mean passing a 'bool rtnl_locked' all the way down the chain. > 2) device reset handling (we have to restore the RSS contexts in the > hardware after a reset). Again resets don't always happen under RTNL, > and I don't fully understand the details (EFX_ASSERT_RESET_SERIALISED()). > So it makes life much simpler if we just have a finer-grained lock we can > just take when we need to. > Also, RTNL is a very big hammer that serialises all kinds of operations > across all the netdevs in the system, holding it for any length of time > can cause annoying user-visible latency (e.g. iirc sshd accepting a new > connection has to wait for it) so I prefer to avoid it if possible. If > anything we want to be breaking up this BKL[1], not making it bigger. Hi Ed I have to wonder if your locking model is wrong. When i look at the next patch, i see the driver is also using this lock. And i generally find that is wrong. As a rule of thumb, driver writes don't understand locking. Yes, there are some that do, but most don't. As core code developers, i find it good practice to have the locks in the core, and only in the core. Drivers writers should not need to worry about locking. The API into the driver will take the locks needed before entering the driver, and release them on exit. So i don't really agree with 'it makes life much simpler if we just have a finer-grained lock'. It makes life more complex having to help driver writers find the corruption and deadlock bugs in their code because they got the locking wrong. Please try to work on the abstraction so that drivers don't need this lock, just the core. Andrew
On 12/04/2023 18:15, Andrew Lunn wrote: > I have to wonder if your locking model is wrong. When i look at the > next patch, i see the driver is also using this lock. And i generally > find that is wrong. ... > Drivers writers should not need to worry about locking. The API > into the driver will take the locks needed before entering the driver, > and release them on exit. I don't think that's possible without increasing driver complexity in other ways. Essentially, for the driver to take advantage of the core tracking these contexts, and thus not need its own data structures to track them as well (like the efx->rss_context.list we remove in patch #7), it has to be able to access them on driver-initiated (not just core-initiated) events. (The central example of this being "oh, the NIC MCPU just rebooted, we have to reinstall all our filters".) And it needs to be able to exclude writes while it does that, not only for consistency but also because e.g. context deletion will free that memory (though I guess we could finesse that part with RCU?). What I *could* do is add suitable wrapper functions for access to dev->ethtool->rss_ctx (e.g. a core equivalent of efx_find_rss_context_entry() that wraps the idr_find()) and have them assert that the lock is held (like efx_find_rss_context_entry() does); that would at least validate the driver locking somewhat. But having those helper functions perform the locking themselves would mean going to a get/put model for managing the lifetime of the driver's reference (with a separate get_write for exclusive access), at which point it's probably harder for driver writers to understand than "any time you're touching rss_ctx you need to hold the rss_lock". -ed
On Wed, 12 Apr 2023 18:42:16 +0100 Edward Cree wrote: > On 12/04/2023 18:15, Andrew Lunn wrote: > > I have to wonder if your locking model is wrong. When i look at the > > next patch, i see the driver is also using this lock. And i generally > > find that is wrong. > ... > > Drivers writers should not need to worry about locking. The API > > into the driver will take the locks needed before entering the driver, > > and release them on exit. > I don't think that's possible without increasing driver complexity in > other ways. Essentially, for the driver to take advantage of the core > tracking these contexts, and thus not need its own data structures to > track them as well (like the efx->rss_context.list we remove in patch > #7), it has to be able to access them on driver-initiated (not just > core-initiated) events. (The central example of this being "oh, the > NIC MCPU just rebooted, we have to reinstall all our filters".) And > it needs to be able to exclude writes while it does that, not only for > consistency but also because e.g. context deletion will free that > memory (though I guess we could finesse that part with RCU?). > What I *could* do is add suitable wrapper functions for access to > dev->ethtool->rss_ctx (e.g. a core equivalent of > efx_find_rss_context_entry() that wraps the idr_find()) and have them > assert that the lock is held (like efx_find_rss_context_entry() does); > that would at least validate the driver locking somewhat. > But having those helper functions perform the locking themselves would > mean going to a get/put model for managing the lifetime of the > driver's reference (with a separate get_write for exclusive access), > at which point it's probably harder for driver writers to understand > than "any time you're touching rss_ctx you need to hold the rss_lock". IMO the "MCPU has rebooted" case is a control path, taking rtnl seems like the right thing to do. Does that happen often enough or takes so long to recover that we need to be concerned about locking down rtnl? aRFS can't sleep IIUC so the mutex is does not seem like a great match. IOW I had the same reaction as Andrew first time I saw this mutex :(
On 13/04/2023 03:06, Jakub Kicinski wrote: > IMO the "MCPU has rebooted" case is a control path, taking rtnl seems > like the right thing to do. Does that happen often enough or takes so > long to recover that we need to be concerned about locking down rtnl? Normally we *do* hold RTNL across the reset handling path, and all the callers I can find take it. But the existence of a more complicated condition guarding the ASSERT_RTNL() in EFX_ASSERT_RESET_SERIALISED() implies that there is, or at least was, a call site that doesn't; that makes me nervous about assuming it. > aRFS can't sleep IIUC so the mutex is does not seem like a great match. sfc punts aRFS filter insertion into a workitem, because you can't do MCDI without sleeping. And aRFS was just one example; there's lots of other sources of filters in the driver (PTP, sync_rx_mode (device UC/MC addresses), VLAN filtering (NETIF_F_HW_VLAN_CTAG_FILTER)...). Some of those filters can also have EFX_FILTER_FLAG_RX_RSS (though not custom contexts). So while I *think* the filter insert code could carefully go if (spec->flags & EFX_FILTER_FLAG_RX_RSS && spec->rss_context) ASSERT_RTNL(); it's kinda hairy. And if this is a *normal* level of hair for drivers to have in this area, then the "but driver writers don't understand locking!" argument doesn't really favour one solution over the other. After all, the driver will still have to hold *some* lock to access this stuff, whether it's rss_lock or RTNL. (Idk, maybe sfc is just uniquely complex and messy. It wouldn't be the first time.) > IOW I had the same reaction as Andrew first time I saw this mutex :( Well I seem to be outvoted, so I'll have another crack at getting it to work with just RTNL (that's what I went for originally, actually, but one of the ASSERT_RTNL()s failed in test and I couldn't figure out why at the time. Possibly trying to argue the case has helped me to understand more of the details!).
> (Idk, maybe sfc is just uniquely complex and messy. It wouldn't be > the first time.) Hi Ed Have you looked at other drivers? It would be bad to design an API around a messy driver. Maybe this is an opportunity to learn from other drivers and come up with something cleaner for SFC? Andrew
On 13/04/2023 22:58, Andrew Lunn wrote: >> (Idk, maybe sfc is just uniquely complex and messy. It wouldn't be >> the first time.) > > Hi Ed > > Have you looked at other drivers? It would be bad to design an API > around a messy driver. I have; there's really not many that implement custom RSS contexts (just Marvell's mvpp2 and octeontx2, and Mellanox's mlx5). The `rss_ctx_max_id` field is designed for those as they all have fixed- size arrays currently and idk whether that's a purely software limit or whether it reflects the hardware. I couldn't find anything in any of them that looked like "restore stuff after a device reboot"; maybe it's just not something those devices expect to experience normally. I don't know enough about mlx5 hw to really understand their filter code, but the rough equivalent of our efx_mcdi_filter_insert_locked() in that driver appears to be _mlx5_add_flow_rules(), which seems to be doing some kind of hand-over-hand locking. And no sign (whether in comments or in asserts) of whether the function expects callers to hold RTNL. Same goes for their functions operating on TIRs (whatever those are) which are called from all over (aRFS, tc, even kTLS!) in addition to the ethtool RSS/ntuple paths. Anyway I'll cc maintainers of those drivers on v3 so they can chime in on the API design. (Should've done that on v1 really, but I forgot. Mea culpa.) -ed
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h index 724da9234cf1..e8e88d5900d3 100644 --- a/include/linux/ethtool.h +++ b/include/linux/ethtool.h @@ -1026,11 +1026,14 @@ int ethtool_virtdev_set_link_ksettings(struct net_device *dev, * struct ethtool_netdev_state - per-netdevice state for ethtool features * @rss_ctx_max_id: maximum (exclusive) supported RSS context ID * @rss_ctx: IDR storing custom RSS context state + * @rss_lock: Protects entries in @rss_ctx. May be taken from + * within RTNL. * @wol_enabled: Wake-on-LAN is enabled */ struct ethtool_netdev_state { u32 rss_ctx_max_id; struct idr rss_ctx; + struct mutex rss_lock; unsigned wol_enabled:1; }; diff --git a/net/core/dev.c b/net/core/dev.c index 44668386f376..60c844b372e3 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -9987,6 +9987,7 @@ int register_netdevice(struct net_device *dev) idr_init_base(&dev->ethtool->rss_ctx, 1); spin_lock_init(&dev->addr_list_lock); + mutex_init(&dev->ethtool->rss_lock); netdev_set_addr_lockdep_class(dev); ret = dev_get_valid_name(net, dev, dev->name); @@ -10792,6 +10793,7 @@ static void netdev_rss_contexts_free(struct net_device *dev) if (!dev->ethtool_ops->create_rxfh_context && !dev->ethtool_ops->set_rxfh_context) return; + mutex_lock(&dev->ethtool->rss_lock); idr_for_each_entry(&dev->ethtool->rss_ctx, ctx, context) { u32 *indir = ethtool_rxfh_context_indir(ctx); u8 *key = ethtool_rxfh_context_key(ctx); @@ -10806,6 +10808,7 @@ static void netdev_rss_contexts_free(struct net_device *dev) &context, true); kfree(ctx); } + mutex_unlock(&dev->ethtool->rss_lock); } /** @@ -10919,6 +10922,8 @@ void unregister_netdevice_many_notify(struct list_head *head, if (dev->netdev_ops->ndo_uninit) dev->netdev_ops->ndo_uninit(dev); + mutex_destroy(&dev->ethtool->rss_lock); + if (skb) rtmsg_ifinfo_send(skb, dev, GFP_KERNEL, portid, nlh); diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c index abd1cf50e681..8b2e90ba03a1 100644 --- a/net/ethtool/ioctl.c +++ b/net/ethtool/ioctl.c @@ -1257,6 +1257,7 @@ static noinline_for_stack int ethtool_set_rxfh(struct net_device *dev, u8 *rss_config; u32 rss_cfg_offset = offsetof(struct ethtool_rxfh, rss_config[0]); bool create = false, delete = false; + bool locked = false; /* dev->ethtool->rss_lock taken */ if (!ops->get_rxnfc || !ops->set_rxfh) return -EOPNOTSUPP; @@ -1334,6 +1335,10 @@ static noinline_for_stack int ethtool_set_rxfh(struct net_device *dev, } } + if (rxfh.rss_context) { + mutex_lock(&dev->ethtool->rss_lock); + locked = true; + } if (create) { if (delete) { ret = -EINVAL; @@ -1453,6 +1458,8 @@ static noinline_for_stack int ethtool_set_rxfh(struct net_device *dev, } out: + if (locked) + mutex_unlock(&dev->ethtool->rss_lock); kfree(rss_config); return ret; }