Message ID | 20201206235919.393158-6-vladimir.oltean@nxp.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Series | Make .ndo_get_stats64 sleepable |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | |
netdev/fixes_present | success | |
netdev/patch_count | success | |
netdev/tree_selection | success | Clearly marked for net-next |
netdev/subject_prefix | success | |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | fail | Errors and warnings before: 23 this patch: 24 |
netdev/kdoc | success | Errors and warnings before: 2 this patch: 2 |
netdev/verify_fixes | success | |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 48 lines checked |
netdev/build_allmodconfig_warn | fail | Errors and warnings before: 17 this patch: 18 |
netdev/header_inline | success | |
netdev/stable | success | Stable not CCed |
On Mon, Dec 07, 2020 at 01:59:11AM +0200, Vladimir Oltean wrote:
> In the effort of making .ndo_get_stats64 be able to sleep, we need to
> ensure the callers of dev_get_stats do not use atomic context.
>
> The bonding driver uses an RCU read-side critical section to ensure the
> integrity of the list of network interfaces, because the driver iterates
> through all net devices in the netns to find the ones which are its
> configured slaves. We still need some protection against an interface
> registering or deregistering, and the writer-side lock, the netns mutex,
> is fine for that, because it offers sleepable context.
>
> This mutex now serves double duty. It offers code serialization,
> something which the stats_lock already did, so the stats_lock no longer
> serves any purpose; let's remove it.
>
> Cc: Jay Vosburgh <j.vosburgh@gmail.com>
> Cc: Veaceslav Falico <vfalico@gmail.com>
> Cc: Andy Gospodarek <andy@greyhouse.net>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---

There is a very obvious deadlock here, which happens when we have
bond-over-bond and the upper bond calls dev_get_stats on the lower one
(a sketch follows below).

Conceptually, the same can happen in any number of stacking combinations
between bonding, net_failover, [ insert any other driver that takes
net->netdev_lists_lock here ].

There are two approaches that could solve this issue:
- using mutex_lock_nested where we aren't sure that we are the top-level
  device
- ensuring through convention that top-level callers of dev_get_stats
  always take net->netdev_lists_lock, documenting that, and therefore
  making it unnecessary to lock in bonding.

I took neither of the two approaches (I don't really like either one too
much), hence [ one of ] the reasons for the RFC. Comments?
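[ Editor's note: a minimal sketch of the deadlock described above,
assuming a bond-over-bond topology where bond0 is an upper of bond1 in
the same netns, and bond_get_stats() is written as in the diff at the
bottom of this page. Illustrative only, not part of the patch. ]

/* Sketch: bond0's bond_get_stats() recursing into bond1's */
static void bond_get_stats(struct net_device *bond_dev,
			   struct rtnl_link_stats64 *stats)
{
	struct bonding *bond = netdev_priv(bond_dev);
	struct net *net = dev_net(bond_dev);
	struct rtnl_link_stats64 temp;
	struct list_head *iter;
	struct slave *slave;

	mutex_lock(&net->netdev_lists_lock);	/* 1st acquisition: bond0 */

	bond_for_each_slave(bond, slave, iter) {
		/* when slave->dev is bond1, this recurses into bond1's
		 * bond_get_stats(), which tries to take the very same
		 * net->netdev_lists_lock a second time -> deadlock,
		 * because the mutex is not reentrant
		 */
		dev_get_stats(slave->dev, &temp);
	}

	mutex_unlock(&net->netdev_lists_lock);
}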
On Mon, Dec 07, 2020 at 03:00:40AM +0200, Vladimir Oltean wrote:
> There is a very obvious deadlock here, which happens when we have
> bond-over-bond and the upper bond calls dev_get_stats on the lower one.
>
> Conceptually, the same can happen in any number of stacking combinations
> between bonding, net_failover, [ insert any other driver that takes
> net->netdev_lists_lock here ].
>
> There are two approaches that could solve this issue:
> - using mutex_lock_nested where we aren't sure that we are the top-level
>   device
> - ensuring through convention that top-level callers of dev_get_stats
>   always take net->netdev_lists_lock, documenting that, and therefore
>   making it unnecessary to lock in bonding.
>
> I took neither of the two approaches (I don't really like either one too
> much), hence [ one of ] the reasons for the RFC. Comments?

And there are also issues which are more subtle (or maybe they were just
subtle to me at the time I wrote the patch). Like the fact that the
netdev adjacency lists are not protected by net->netdev_lists_lock, but
still by the RTNL mutex and RCU.

I think that in order for the iteration through lower interfaces to
capture a consistent state of the adjacency lists of all interfaces, the
__netdev_adjacent_dev_link_lists and __netdev_adjacent_dev_unlink_lists
functions would need to run under net->netdev_lists_lock, and not just
under some per-netdev lock (see the sketch below). But this raises the
locking domain covered by net->netdev_lists_lock to more than I
initially intended. I'll try to do this and see how it works.
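[ Editor's note: a hedged sketch of the idea above, not part of this
series: cover the adjacency-list updates with net->netdev_lists_lock so
that a reader holding that mutex sees a consistent view across all
interfaces. __netdev_upper_dev_link() is heavily simplified here; in
the real kernel it is the path through which
__netdev_adjacent_dev_link_lists() is reached. ]

static int __netdev_upper_dev_link(struct net_device *dev,
				   struct net_device *upper_dev, bool master)
{
	struct net *net = dev_net(dev);
	int ret;

	ASSERT_RTNL();

	mutex_lock(&net->netdev_lists_lock);	/* new: adjacency lists too */
	ret = __netdev_adjacent_dev_link_lists(dev, upper_dev,
					       &dev->adj_list.upper,
					       &upper_dev->adj_list.lower,
					       NULL, master);
	mutex_unlock(&net->netdev_lists_lock);

	return ret;
}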
On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> - ensuring through convention that top-level callers of dev_get_stats
>   always take net->netdev_lists_lock, documenting that, and therefore
>   making it unnecessary to lock in bonding.

This seems like the better option to me. Makes the locking rules pretty
clear.
On Tue, Dec 08, 2020 at 03:57:44PM -0800, Jakub Kicinski wrote:
> On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> > - ensuring through convention that top-level callers of dev_get_stats
> >   always take net->netdev_lists_lock, documenting that, and therefore
> >   making it unnecessary to lock in bonding.
>
> This seems like the better option to me. Makes the locking rules pretty
> clear.

It is non-obvious to me that top-level callers of dev_get_stats should
hold a lock as specific as the one protecting the lists of network
interfaces. In the vast majority of implementations of dev_get_stats,
that lock would not actually protect anything, so it would become just
one more lock that is used for more than it should be. In my tree I had
actually already switched over to mutex_lock_nested. Nonetheless, I am
still open if you want to make the case that simplicity should prevail
over specificity. But in that case, maybe we should just keep on using
the RTNL mutex.
On Wed, 9 Dec 2020 00:03:56 +0000 Vladimir Oltean wrote:
> On Tue, Dec 08, 2020 at 03:57:44PM -0800, Jakub Kicinski wrote:
> > On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> > > - ensuring through convention that top-level callers of dev_get_stats
> > >   always take net->netdev_lists_lock, documenting that, and therefore
> > >   making it unnecessary to lock in bonding.
> >
> > This seems like the better option to me. Makes the locking rules pretty
> > clear.
>
> It is non-obvious to me that top-level callers of dev_get_stats should
> hold a lock as specific as the one protecting the lists of network
> interfaces. In the vast majority of implementations of dev_get_stats,
> that lock would not actually protect anything, so it would become just
> one more lock that is used for more than it should be. In my tree I had
> actually already switched over to mutex_lock_nested. Nonetheless, I am
> still open if you want to make the case that simplicity should prevail
> over specificity.

What are the locking rules you have in mind then? Caller may hold RTNL
or ifc mutex?

> But in that case, maybe we should just keep on using the RTNL mutex.

That's a wasted opportunity, the RTNL lock is pretty busy.
On Tue, Dec 08, 2020 at 04:17:37PM -0800, Jakub Kicinski wrote:
> On Wed, 9 Dec 2020 00:03:56 +0000 Vladimir Oltean wrote:
> > On Tue, Dec 08, 2020 at 03:57:44PM -0800, Jakub Kicinski wrote:
> > > On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> > > > - ensuring through convention that top-level callers of
> > > >   dev_get_stats always take net->netdev_lists_lock, documenting
> > > >   that, and therefore making it unnecessary to lock in bonding.
> > >
> > > This seems like the better option to me. Makes the locking rules
> > > pretty clear.
> >
> > It is non-obvious to me that top-level callers of dev_get_stats should
> > hold a lock as specific as the one protecting the lists of network
> > interfaces. In the vast majority of implementations of dev_get_stats,
> > that lock would not actually protect anything, so it would become just
> > one more lock that is used for more than it should be. In my tree I
> > had actually already switched over to mutex_lock_nested. Nonetheless,
> > I am still open if you want to make the case that simplicity should
> > prevail over specificity.
>
> What are the locking rules you have in mind then? Caller may hold RTNL
> or ifc mutex?

Well, it's clear that calling mutex_lock_nested would only silence
lockdep; there would still be this non-reentrant mutex being held from a
potentially recursive code path. So it is a non-solution, and even worse
than using plain mutex_lock, because at least that is detectable when it
locks up the system.

net_failover and bonding are the only drivers that create this recursion
requirement in dev_get_stats. Other one-over-many stackable interfaces,
like the bridge, just use dev_get_tstats64. I'm almost thinking that it
would be cleaner to convert these two to dev_get_tstats64 too; that
would simplify things enormously (a sketch of such a conversion follows
below). Even team uses something that is based on software counters,
something reminiscent of dev_get_tstats64, definitely not counters
retrieved from the underlying device.

Of course, the disadvantage of doing that is that virtual interfaces
cannot retrieve statistics recursively from their lower interface. I'm
trying to think how much of a real disadvantage that will be. For
offloaded interfaces the counters will be completely off, that's for
sure. And this is one of the reasons that mandated the DSA rework to
expose MAC-based counters in dev_get_stats in the first place.

But thinking in the larger sense: take an offloading interface that
supports IP forwarding, with 50 VLAN uppers. How could the statistics
counters of those VLAN uppers ever be correct? It's not realistic to
expect the underlying hardware to keep separate statistics for each
upper, which the upper could then go ahead and just query. Sure, in the
cases where a lower can have only one upper at a time (bridge, bonding),
what I said is not applicable, but the general goal of having accurate
counters for offloading interfaces everywhere seems to not be really
achievable.

In case this doesn't work out, I guess I'll have to document that
dev_get_stats is recursive, and that all mutual-exclusion locks need to
be taken upfront, which is the only strategy that works with recursion
that I know of.

> > But in that case, maybe we should just keep on using the RTNL mutex.
>
> That's a wasted opportunity, the RTNL lock is pretty busy.

It is certainly easy to use and easy to enforce, though. It also seems
to protect everything, which is generally the reason why you tend to not
think a lot when using it.
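[ Editor's note: regarding the dev_get_tstats64 conversion floated
above, a rough, untested sketch of what it could look like for bonding,
modeled on existing tstats users such as the bridge. It assumes the
tstats helpers present in net-next at the time of this thread
(netdev_alloc_pcpu_stats, dev_sw_netstats_tx_add, dev_get_tstats64);
error handling, the uninit path and the rest of the ndo table are
elided. ]

static int bond_init(struct net_device *bond_dev)
{
	/* per-cpu rx/tx packet and byte counters, summed up for user
	 * space by dev_get_tstats64(); freed again on uninit
	 */
	bond_dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
	if (!bond_dev->tstats)
		return -ENOMEM;

	return 0;
}

static netdev_tx_t bond_start_xmit(struct sk_buff *skb,
				   struct net_device *bond_dev)
{
	/* count at transmit time; no locks, no recursion into lowers */
	dev_sw_netstats_tx_add(bond_dev, 1, skb->len);

	/* ... slave selection and actual transmission elided ... */
	return NETDEV_TX_OK;
}

/* plus, in bond_netdev_ops:  .ndo_get_stats64 = dev_get_tstats64, */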
To be clear, I am not removing the RTNL mutex from any place where it is
held today. Letting me do that would be like letting a bull into a china
shop. I am only creating a sub-lock of it, which protects a subset of
what the RTNL mutex protects. By sub-lock I mean that code paths which
currently hold the RTNL mutex, like list_netdevice(), will also take
net->netdev_lists_lock - never the other way around, as that would
create lock inversion (a sketch of the ordering follows below).

The plan is to then require new code that iterates through the network
interface lists to use the new locking scheme and not the RTNL mutex,
and in parallel to migrate, on a best-effort basis, code that runs under
ASSERT_RTNL + net->netdev_lists_lock to only take net->netdev_lists_lock
and be RTNL-free.

But is that any better at runtime than just taking the RTNL mutex; will
it make it less contended? Nope, due to the indirect serialization
effect that is created by the fact that net->netdev_lists_lock is a
sub-lock of the RTNL mutex.

Say net/core/net-procfs.c holds net->netdev_lists_lock while dumping the
statistics from a firmware, sleeping while doing so. All is rosy in its
RTNL-mutex-free code path. Then along comes another thread which calls
for_each_netdev, something which today requires the RTNL mutex but could
be reworked to use net->netdev_lists_lock. Let's assume we are at a
stage where the gazillion places that call for_each_netdev() have been
converted to also take net->netdev_lists_lock. But not all of them could
be converted to also _not_ take the RTNL mutex. This means that there
will be code paths which take the RTNL mutex and end up waiting for our
net/core/net-procfs.c code to finish dumping the firmware statistics.
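[ Editor's note: to make the lock ordering concrete, a simplified
sketch of the writer side; the real list_netdevice() differs in its
locking details (it uses dev_base_lock and RCU list helpers), so this
is illustrative only. RTNL is always taken first, net->netdev_lists_lock
second, never the reverse. ]

static void list_netdevice(struct net_device *dev)
{
	struct net *net = dev_net(dev);

	ASSERT_RTNL();				/* RTNL already held */

	mutex_lock(&net->netdev_lists_lock);	/* sub-lock taken second */
	list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
	/* ... name/ifindex hash list insertion elided ... */
	mutex_unlock(&net->netdev_lists_lock);
}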
On Wed, Dec 09, 2020 at 03:14:04AM +0200, Vladimir Oltean wrote:
> net_failover and bonding are the only drivers that create this recursion
> requirement in dev_get_stats. Other one-over-many stackable interfaces,
> like the bridge, just use dev_get_tstats64. I'm almost thinking that it
> would be cleaner to convert these two to dev_get_tstats64 too; that
> would simplify things enormously. Even team uses something that is based
> on software counters, something reminiscent of dev_get_tstats64,
> definitely not counters retrieved from the underlying device.
>
> Of course, the disadvantage of doing that is that virtual interfaces
> cannot retrieve statistics recursively from their lower interface. I'm
> trying to think how much of a real disadvantage that will be. For
> offloaded interfaces the counters will be completely off, that's for
> sure. And this is one of the reasons that mandated the DSA rework to
> expose MAC-based counters in dev_get_stats in the first place.
>
> But thinking in the larger sense: take an offloading interface that
> supports IP forwarding, with 50 VLAN uppers. How could the statistics
> counters of those VLAN uppers ever be correct? It's not realistic to
> expect the underlying hardware to keep separate statistics for each
> upper, which the upper could then go ahead and just query. Sure, in the
> cases where a lower can have only one upper at a time (bridge, bonding),
> what I said is not applicable, but the general goal of having accurate
> counters for offloading interfaces everywhere seems to not be really
> achievable.

Ok, putting some more thought into it after sending the email, maybe it
isn't unreasonable for devices with IP forwarding offload to keep
statistics for each upper. They need to be aware of those uppers for a
plethora of other reasons, after all.

So, here's another idea for eliminating the recursion. Maybe we just add
a new netdevice feature, NETIF_F_FOLDING_STATS, or something like that,
which bonding and net_failover will use. Then dev_get_stats sees that
flag and, instead of calling ->ndo_get_stats64 for the device, it just
iterates through its bottom-most level of lower interfaces and
aggregates the stats by itself.
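[ Editor's note: a rough sketch of that idea. NETIF_F_FOLDING_STATS is
the hypothetical flag proposed above, netdev_stats_fold() is a
hypothetical field-by-field summing helper, and the walk is shown only
one level deep, where the real version would descend to the bottom-most
lowers. ]

/* dev_get_stats() aggregates the lower devices of a
 * NETIF_F_FOLDING_STATS device itself, so ->ndo_get_stats64 of the
 * folding device is never called and cannot recurse or take locks
 */
struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev,
					struct rtnl_link_stats64 *storage)
{
	memset(storage, 0, sizeof(*storage));

	if (dev->features & NETIF_F_FOLDING_STATS) {
		struct rtnl_link_stats64 temp;
		struct net_device *lower;
		struct list_head *iter;

		netdev_for_each_lower_dev(dev, lower, iter)
			netdev_stats_fold(storage, dev_get_stats(lower, &temp));

		return storage;
	}

	/* ... the existing ->ndo_get_stats64 / ->ndo_get_stats path ... */
	return storage;
}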
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e0880a3840d7..f788f9fa1858 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3738,21 +3738,16 @@ static void bond_get_stats(struct net_device *bond_dev,
 			   struct rtnl_link_stats64 *stats)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
+	struct net *net = dev_net(bond_dev);
 	struct rtnl_link_stats64 temp;
 	struct list_head *iter;
 	struct slave *slave;
-	int nest_level = 0;
 
+	mutex_lock(&net->netdev_lists_lock);
 
-	rcu_read_lock();
-#ifdef CONFIG_LOCKDEP
-	nest_level = bond_get_lowest_level_rcu(bond_dev);
-#endif
-
-	spin_lock_nested(&bond->stats_lock, nest_level);
 	memcpy(stats, &bond->bond_stats, sizeof(*stats));
 
-	bond_for_each_slave_rcu(bond, slave, iter) {
+	bond_for_each_slave(bond, slave, iter) {
 		const struct rtnl_link_stats64 *new =
 			dev_get_stats(slave->dev, &temp);
 
@@ -3763,8 +3758,8 @@ static void bond_get_stats(struct net_device *bond_dev,
 	}
 
 	memcpy(&bond->bond_stats, stats, sizeof(*stats));
-	spin_unlock(&bond->stats_lock);
-	rcu_read_unlock();
+
+	mutex_unlock(&net->netdev_lists_lock);
 }
 
 static int bond_do_ioctl(struct net_device *bond_dev, struct ifreq *ifr, int cmd)
@@ -5192,7 +5187,6 @@ static int bond_init(struct net_device *bond_dev)
 	if (!bond->wq)
 		return -ENOMEM;
 
-	spin_lock_init(&bond->stats_lock);
 	netdev_lockdep_set_classes(bond_dev);
 
 	list_add_tail(&bond->bond_list, &bn->dev_list);
diff --git a/include/net/bonding.h b/include/net/bonding.h
index d9d0ff3b0ad3..6fbde9713424 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -224,7 +224,6 @@ struct bonding {
 	 * ALB mode (6) - to sync the use and modifications of its hash table
 	 */
 	spinlock_t mode_lock;
-	spinlock_t stats_lock;
 	u8 send_peer_notif;
 	u8 igmp_retrans;
 #ifdef CONFIG_PROC_FS
In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

The bonding driver uses an RCU read-side critical section to ensure the
integrity of the list of network interfaces, because the driver iterates
through all net devices in the netns to find the ones which are its
configured slaves. We still need some protection against an interface
registering or deregistering, and the writer-side lock, the netns mutex,
is fine for that, because it offers sleepable context.

This mutex now serves double duty. It offers code serialization,
something which the stats_lock already did, so the stats_lock no longer
serves any purpose; let's remove it.

Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/bonding/bond_main.c | 16 +++++-----------
 include/net/bonding.h           |  1 -
 2 files changed, 5 insertions(+), 12 deletions(-)