mbox series

[RFC,net-next,0/4] net-sysfs: remove rtnl_trylock/restart_syscall use

Message ID 20231018154804.420823-1-atenart@kernel.org (mailing list archive)
Headers show
Series net-sysfs: remove rtnl_trylock/restart_syscall use | expand

Message

Antoine Tenart Oct. 18, 2023, 3:47 p.m. UTC
Hi,

This is sent as an RFC because I believe this should be discussed (and
some might want to do additional testing), but the code itself is ready.

Some time ago we tried to improve the rtnl_trylock/restart_syscall
situation[1]. What happens is when there is rtnl contention, userspace
accessing net sysfs attributes will spin and experience delays. This can
happen in different situations, when sysfs attributes are accessed
(networking daemon, configuration, monitoring) while operations under
rtnl are performed (veth creation, driver configuration, etc). A few
improvements can be done in userspace to ease things, like using the
netlink interface instead, or polling less (or more selectively) the
attributes; but in the end the root cause is always there and this keeps
happening from time to time.

That initial effort however wasn't successful, although I think there
was an interest, mostly because we found technical flaws and didn't find
a working solution at the time. Some time later, we gave it a new try
and found something more promising, but the patches fell off my radar. I
recently had another look at this series, made more tests and cleaned it
up.

The technical aspect is described in patch 1 directly in the code
comments, with an additional important comment in patch 3. This was
mostly tested by stress-testing net sysfs attributes (read/write ops)
while adding/removing queues and adding/removing veths, all in parallel.

All comments are welcomed.

Thanks,
Antoine

[1] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/

Antoine Tenart (4):
  net-sysfs: remove rtnl_trylock from device attributes
  net-sysfs: move queue attribute groups outside the default groups
  net-sysfs: prevent uncleared queues from being re-added
  net-sysfs: remove rtnl_trylock from queue attributes

 include/linux/netdevice.h     |   1 +
 include/net/netdev_rx_queue.h |   1 +
 net/core/net-sysfs.c          | 329 ++++++++++++++++++++++++----------
 3 files changed, 237 insertions(+), 94 deletions(-)

Comments

Stephen Hemminger Oct. 18, 2023, 3:57 p.m. UTC | #1
On Wed, 18 Oct 2023 17:47:42 +0200
Antoine Tenart <atenart@kernel.org> wrote:

> Hi,
> 
> This is sent as an RFC because I believe this should be discussed (and
> some might want to do additional testing), but the code itself is ready.
> 
> Some time ago we tried to improve the rtnl_trylock/restart_syscall
> situation[1]. What happens is when there is rtnl contention, userspace
> accessing net sysfs attributes will spin and experience delays. This can
> happen in different situations, when sysfs attributes are accessed
> (networking daemon, configuration, monitoring) while operations under
> rtnl are performed (veth creation, driver configuration, etc). A few
> improvements can be done in userspace to ease things, like using the
> netlink interface instead, or polling less (or more selectively) the
> attributes; but in the end the root cause is always there and this keeps
> happening from time to time.
> 
> That initial effort however wasn't successful, although I think there
> was an interest, mostly because we found technical flaws and didn't find
> a working solution at the time. Some time later, we gave it a new try
> and found something more promising, but the patches fell off my radar. I
> recently had another look at this series, made more tests and cleaned it
> up.
> 
> The technical aspect is described in patch 1 directly in the code
> comments, with an additional important comment in patch 3. This was
> mostly tested by stress-testing net sysfs attributes (read/write ops)
> while adding/removing queues and adding/removing veths, all in parallel.
> 
> All comments are welcomed.

The trylock was introduced to deal with lock inversion.
It is not clear how this more complex solution prevents that.
Antoine Tenart Oct. 18, 2023, 4:34 p.m. UTC | #2
Quoting Stephen Hemminger (2023-10-18 17:57:17)
> The trylock was introduced to deal with lock inversion.
> It is not clear how this more complex solution prevents that.

Anything specifically in the patch 1 comments is not clear that I can
improve?

The dead lock happens between rtnl_lock and the refcounting on the
attribute kn->active, specifically when unregistering net devices
because device_del kernfs_drain will wait for the kn->active refcount to
reach KN_DEACTIVATED_BIAS, under an rtnl section. The current solution
was making one path to bail out (trylock/restart syscall).

The idea here is we can actually bail out of the attribute kn protection
(kn->active), while still letting unregistering net devices to wait for
the current sysfs operations to complete, by using the net device
refcount instead. To simplify, instead of waiting on kn->active in the
net device unregistration step, this waits on the net device refcount
(netdev_wait_allrefs_any), which is done outside an rtnl section. This
way kernfs_drain can complete under its rtnl section even if a call to
rtnl_lock is waiting in a sysfs operation.

Antoine
Stephen Hemminger Oct. 18, 2023, 6:15 p.m. UTC | #3
On Wed, 18 Oct 2023 17:47:42 +0200
Antoine Tenart <atenart@kernel.org> wrote:

> Some time ago we tried to improve the rtnl_trylock/restart_syscall
> situation[1]. What happens is when there is rtnl contention, userspace
> accessing net sysfs attributes will spin and experience delays. This can
> happen in different situations, when sysfs attributes are accessed
> (networking daemon, configuration, monitoring) while operations under
> rtnl are performed (veth creation, driver configuration, etc). A few
> improvements can be done in userspace to ease things, like using the
> netlink interface instead, or polling less (or more selectively) the
> attributes; but in the end the root cause is always there and this keeps
> happening from time to time.

What attribute is not exposed by netlink, and only by sysfs?
There will always be more overhead using sysfs.
That doesn't mean the locking should not be fixed, just that better
to avoid the situation if possible.
Antoine Tenart Oct. 19, 2023, 7:47 a.m. UTC | #4
Quoting Stephen Hemminger (2023-10-18 20:15:47)
> On Wed, 18 Oct 2023 17:47:42 +0200
> Antoine Tenart <atenart@kernel.org> wrote:
> 
> > Some time ago we tried to improve the rtnl_trylock/restart_syscall
> > situation[1]. What happens is when there is rtnl contention, userspace
> > accessing net sysfs attributes will spin and experience delays. This can
> > happen in different situations, when sysfs attributes are accessed
> > (networking daemon, configuration, monitoring) while operations under
> > rtnl are performed (veth creation, driver configuration, etc). A few
> > improvements can be done in userspace to ease things, like using the
> > netlink interface instead, or polling less (or more selectively) the
> > attributes; but in the end the root cause is always there and this keeps
> > happening from time to time.
> 
> What attribute is not exposed by netlink, and only by sysfs?

I think there were a few last time (a while a go) I checked, but it's
not an issue IMHO, if one is missing in netlink we can add it.

> That doesn't mean the locking should not be fixed, just that better
> to avoid the situation if possible.

100% agree on this one. I believe using netlink is the right way.

Having said that, sysfs is still there and (quite some time ago) while
having discussions with different projects, some were keen to switch to
netlink, but some weren't and pushed back because "sysfs is a stable
API" and "if there is a kernel issue it should be fixed in the kernel".
Not blaming anyone really, they'd have to support the two interfaces for
compatibility. My point is, yes, I would encourage everyone to use
netlink too, but we don't control every user and it's not like sysfs
will disappear anytime soon.

Thanks,
Antoine
Stephen Hemminger Oct. 19, 2023, 2:54 p.m. UTC | #5
On Thu, 19 Oct 2023 09:47:27 +0200
Antoine Tenart <atenart@kernel.org> wrote:

> > That doesn't mean the locking should not be fixed, just that better
> > to avoid the situation if possible.  
> 
> 100% agree on this one. I believe using netlink is the right way.
> 
> Having said that, sysfs is still there and (quite some time ago) while
> having discussions with different projects, some were keen to switch to
> netlink, but some weren't and pushed back because "sysfs is a stable
> API" and "if there is a kernel issue it should be fixed in the kernel".
> Not blaming anyone really, they'd have to support the two interfaces for
> compatibility. My point is, yes, I would encourage everyone to use
> netlink too, but we don't control every user and it's not like sysfs
> will disappear anytime soon.

I have seen code doing discovery of new devices via netlink then poking
around in sysfs. But that usage is inherently racy from the application
point of view. By the time device is discovered, it might be removed or
worse renamed before the sysfs operations.