diff mbox series

[net] net: sysfs: Fix deadlock situation in sysfs accesses

Message ID d416a14ec38c7ba463341b83a7a9ec6ccc435246.1734419614.git.christophe.leroy@csgroup.eu (mailing list archive)
State New
Delegated to: Netdev Maintainers
Headers show
Series [net] net: sysfs: Fix deadlock situation in sysfs accesses | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 16 this patch: 16
netdev/build_tools success Errors and warnings before: 0 (+0) this patch: 0 (+0)
netdev/cc_maintainers warning 1 maintainers not CCed: kuniyu@amazon.com
netdev/build_clang success Errors and warnings before: 20346 this patch: 20346
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 3529 this patch: 3529
netdev/checkpatch warning CHECK: extern prototypes should be avoided in .h files
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 6 this patch: 6
netdev/source_inline success Was 0 now: 0
netdev/contest fail net-next-2024-12-17--21-00 (tests: 883)

Commit Message

Christophe Leroy Dec. 17, 2024, 7:18 a.m. UTC
The following problem is encountered on kernel built with
CONFIG_PREEMPT. An snmp daemon running with normal priority is
regularly calling ioctl(SIOCGMIIPHY). Another process running with
SCHED_FIFO policy is regularly reading /sys/class/net/eth0/carrier.

After some random time, the snmp daemon gets preempted while holding
the RTNL mutex then the high priority process is busy looping into
carrier_show which bails out early due to a non-successfull
rtnl_trylock() which implies restart_syscall(). Because the snmp
daemon has a lower priority, it never gets the chances to release
the RTNL mutex and the high-priority task continues to loop forever.

Replace the trylock by lock_interruptible. This will increase the
priority of the task holding the lock so that it can release it and
allow the reader of /sys/class/net/eth0/carrier to actually perform
its read.

The problem can be reproduced with the following two simple apps:

The one below runs with normal SCHED_OTHER priority:

	int main(int argc, char **argv)
	{
		int sk = socket(AF_INET, SOCK_DGRAM, 0);
		char buf[32];
		struct ifreq ifr = {.ifr_name = "eth0"};

		for (;;)
			ioctl(sk, SIOCGMIIPHY, &ifr);

		exit(0);
	}

And the following one is started with chrt -f 80 so it runs with
SCHED_FIFO policy:

	int main(int argc, char **argv)
	{
		int fd = open("/sys/class/net/eth0/carrier", O_RDONLY);
		char buf[32];

		for (;;) {
			read(fd, buf, sizeof(buf));
			lseek(fd, 0, SEEK_SET);
			usleep(5000);
		}

		exit(0);
	}

When running alone, that high priority task takes approx 6% CPU time.

When running together with the first one above, the high priority task
reaches almost 100% of CPU time.

With this fix applied, the high priority task remains at 6% CPU time
while the other one takes the remaining CPU time available.

Fixes: 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 include/linux/rtnetlink.h |  1 +
 net/core/net-sysfs.c      | 70 +++++++++++++++++++--------------------
 net/core/rtnetlink.c      |  6 ++++
 3 files changed, 42 insertions(+), 35 deletions(-)

Comments

Eric Dumazet Dec. 17, 2024, 8:16 a.m. UTC | #1
On Tue, Dec 17, 2024 at 8:18 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> The following problem is encountered on kernel built with
> CONFIG_PREEMPT. An snmp daemon running with normal priority is
> regularly calling ioctl(SIOCGMIIPHY). Another process running with
> SCHED_FIFO policy is regularly reading /sys/class/net/eth0/carrier.
>
> After some random time, the snmp daemon gets preempted while holding
> the RTNL mutex then the high priority process is busy looping into
> carrier_show which bails out early due to a non-successfull
> rtnl_trylock() which implies restart_syscall(). Because the snmp
> daemon has a lower priority, it never gets the chances to release
> the RTNL mutex and the high-priority task continues to loop forever.
>
> Replace the trylock by lock_interruptible. This will increase the
> priority of the task holding the lock so that it can release it and
> allow the reader of /sys/class/net/eth0/carrier to actually perform
> its read.
>
> The problem can be reproduced with the following two simple apps:
>
> The one below runs with normal SCHED_OTHER priority:
>
>         int main(int argc, char **argv)
>         {
>                 int sk = socket(AF_INET, SOCK_DGRAM, 0);
>                 char buf[32];
>                 struct ifreq ifr = {.ifr_name = "eth0"};
>
>                 for (;;)
>                         ioctl(sk, SIOCGMIIPHY, &ifr);
>
>                 exit(0);
>         }
>
> And the following one is started with chrt -f 80 so it runs with
> SCHED_FIFO policy:
>
>         int main(int argc, char **argv)
>         {
>                 int fd = open("/sys/class/net/eth0/carrier", O_RDONLY);
>                 char buf[32];
>
>                 for (;;) {
>                         read(fd, buf, sizeof(buf));
>                         lseek(fd, 0, SEEK_SET);
>                         usleep(5000);
>                 }
>
>                 exit(0);
>         }
>
> When running alone, that high priority task takes approx 6% CPU time.
>
> When running together with the first one above, the high priority task
> reaches almost 100% of CPU time.
>
> With this fix applied, the high priority task remains at 6% CPU time
> while the other one takes the remaining CPU time available.
>
> Fixes: 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---

At a first glance, this might resurface the deadlock issue Eric W. Biederman
was trying to fix in 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in
sysfs methods.")

I was hoping that at some point, some sysfs write methods could be
marked as : "We do not need to hold the sysfs lock"
Christophe Leroy Dec. 17, 2024, 8:59 a.m. UTC | #2
Le 17/12/2024 à 09:16, Eric Dumazet a écrit :
> On Tue, Dec 17, 2024 at 8:18 AM Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>>
>> The following problem is encountered on kernel built with
>> CONFIG_PREEMPT. An snmp daemon running with normal priority is
>> regularly calling ioctl(SIOCGMIIPHY). Another process running with
>> SCHED_FIFO policy is regularly reading /sys/class/net/eth0/carrier.
>>
>> After some random time, the snmp daemon gets preempted while holding
>> the RTNL mutex then the high priority process is busy looping into
>> carrier_show which bails out early due to a non-successfull
>> rtnl_trylock() which implies restart_syscall(). Because the snmp
>> daemon has a lower priority, it never gets the chances to release
>> the RTNL mutex and the high-priority task continues to loop forever.
>>
>> Replace the trylock by lock_interruptible. This will increase the
>> priority of the task holding the lock so that it can release it and
>> allow the reader of /sys/class/net/eth0/carrier to actually perform
>> its read.
>>

...

>>
>> Fixes: 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
> 
> At a first glance, this might resurface the deadlock issue Eric W. Biederman
> was trying to fix in 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in
> sysfs methods.")

Are you talking about the deadlock fixed (incompletely) by 5a5990d3090b 
("net: Avoid race between network down and sysfs"), or the complement 
provided by 336ca57c3b4e ?

My understanding is that mutex_lock() will return EINTR only if a signal 
is pending so there is no need to set signal_pending like it was when 
using mutex_trylock() which does nothing when the mutex is already locked.

And an EINTR return is expected and documented for a read() or a 
write(), I can't see why we want ERESTARTNOINTR instead of ERSTARTSYS. 
Isn't it the responsibility of the user app to call again read or write 
if it has decided to not install the necessary sigaction for an 
automatic restart ?

Do you think I should instead use rtnl_lock_killable() and return 
ERESTARTNOINTR in case of failure ? In that case, is it still possible 
to interrupt a blocked 'cat /sys/class/net/eth0/carrier' which CTRL+C ?

> 
> I was hoping that at some point, some sysfs write methods could be
> marked as : "We do not need to hold the sysfs lock"
Eric Dumazet Dec. 17, 2024, 9:20 a.m. UTC | #3
On Tue, Dec 17, 2024 at 9:59 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> Le 17/12/2024 à 09:16, Eric Dumazet a écrit :
> > On Tue, Dec 17, 2024 at 8:18 AM Christophe Leroy
> > <christophe.leroy@csgroup.eu> wrote:
> >>
> >> The following problem is encountered on kernel built with
> >> CONFIG_PREEMPT. An snmp daemon running with normal priority is
> >> regularly calling ioctl(SIOCGMIIPHY). Another process running with
> >> SCHED_FIFO policy is regularly reading /sys/class/net/eth0/carrier.
> >>
> >> After some random time, the snmp daemon gets preempted while holding
> >> the RTNL mutex then the high priority process is busy looping into
> >> carrier_show which bails out early due to a non-successfull
> >> rtnl_trylock() which implies restart_syscall(). Because the snmp
> >> daemon has a lower priority, it never gets the chances to release
> >> the RTNL mutex and the high-priority task continues to loop forever.
> >>
> >> Replace the trylock by lock_interruptible. This will increase the
> >> priority of the task holding the lock so that it can release it and
> >> allow the reader of /sys/class/net/eth0/carrier to actually perform
> >> its read.
> >>
>
> ...
>
> >>
> >> Fixes: 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
> >> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> >> ---
> >
> > At a first glance, this might resurface the deadlock issue Eric W. Biederman
> > was trying to fix in 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in
> > sysfs methods.")
>
> Are you talking about the deadlock fixed (incompletely) by 5a5990d3090b
> ("net: Avoid race between network down and sysfs"), or the complement
> provided by 336ca57c3b4e ?
>
> My understanding is that mutex_lock() will return EINTR only if a signal
> is pending so there is no need to set signal_pending like it was when
> using mutex_trylock() which does nothing when the mutex is already locked.
>
> And an EINTR return is expected and documented for a read() or a
> write(), I can't see why we want ERESTARTNOINTR instead of ERSTARTSYS.
> Isn't it the responsibility of the user app to call again read or write
> if it has decided to not install the necessary sigaction for an
> automatic restart ?
>
> Do you think I should instead use rtnl_lock_killable() and return
> ERESTARTNOINTR in case of failure ? In that case, is it still possible
> to interrupt a blocked 'cat /sys/class/net/eth0/carrier' which CTRL+C ?

Issue is when no signal is pending, we have a typical deadlock situation :

One process A is :

Holding sysfs lock, then attempts to grab rtnl.

Another one (B) is :

Holding rtnl, then attempts to grab sysfs lock.

Using rtnl_lock_interruptible()  in A will still block A and B, until
a CTRL+C is sent by another thread.
Christophe Leroy Dec. 17, 2024, 9:41 a.m. UTC | #4
Le 17/12/2024 à 10:20, Eric Dumazet a écrit :
> On Tue, Dec 17, 2024 at 9:59 AM Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>>
>>
>>
>> Le 17/12/2024 à 09:16, Eric Dumazet a écrit :
>>> On Tue, Dec 17, 2024 at 8:18 AM Christophe Leroy
>>> <christophe.leroy@csgroup.eu> wrote:
>>>>
>>>> The following problem is encountered on kernel built with
>>>> CONFIG_PREEMPT. An snmp daemon running with normal priority is
>>>> regularly calling ioctl(SIOCGMIIPHY). Another process running with
>>>> SCHED_FIFO policy is regularly reading /sys/class/net/eth0/carrier.
>>>>
>>>> After some random time, the snmp daemon gets preempted while holding
>>>> the RTNL mutex then the high priority process is busy looping into
>>>> carrier_show which bails out early due to a non-successfull
>>>> rtnl_trylock() which implies restart_syscall(). Because the snmp
>>>> daemon has a lower priority, it never gets the chances to release
>>>> the RTNL mutex and the high-priority task continues to loop forever.
>>>>
>>>> Replace the trylock by lock_interruptible. This will increase the
>>>> priority of the task holding the lock so that it can release it and
>>>> allow the reader of /sys/class/net/eth0/carrier to actually perform
>>>> its read.
>>>>
>>
>> ...
>>
>>>>
>>>> Fixes: 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
>>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>>>> ---
>>>
>>> At a first glance, this might resurface the deadlock issue Eric W. Biederman
>>> was trying to fix in 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in
>>> sysfs methods.")
>>
>> Are you talking about the deadlock fixed (incompletely) by 5a5990d3090b
>> ("net: Avoid race between network down and sysfs"), or the complement
>> provided by 336ca57c3b4e ?
>>
>> My understanding is that mutex_lock() will return EINTR only if a signal
>> is pending so there is no need to set signal_pending like it was when
>> using mutex_trylock() which does nothing when the mutex is already locked.
>>
>> And an EINTR return is expected and documented for a read() or a
>> write(), I can't see why we want ERESTARTNOINTR instead of ERSTARTSYS.
>> Isn't it the responsibility of the user app to call again read or write
>> if it has decided to not install the necessary sigaction for an
>> automatic restart ?
>>
>> Do you think I should instead use rtnl_lock_killable() and return
>> ERESTARTNOINTR in case of failure ? In that case, is it still possible
>> to interrupt a blocked 'cat /sys/class/net/eth0/carrier' which CTRL+C ?
> 
> Issue is when no signal is pending, we have a typical deadlock situation :
> 
> One process A is :
> 
> Holding sysfs lock, then attempts to grab rtnl.
> 
> Another one (B) is :
> 
> Holding rtnl, then attempts to grab sysfs lock.

Ok, I see.

But then what can be the solution to avoid busy looping with 
mutex_trylock , not giving any chance to the task holding the rtnl to 
run and unlock it ?

> 
> Using rtnl_lock_interruptible()  in A will still block A and B, until
> a CTRL+C is sent by another thread.
Eric Dumazet Dec. 17, 2024, 9:52 a.m. UTC | #5
On Tue, Dec 17, 2024 at 10:41 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> Le 17/12/2024 à 10:20, Eric Dumazet a écrit :
> > On Tue, Dec 17, 2024 at 9:59 AM Christophe Leroy
> > <christophe.leroy@csgroup.eu> wrote:
> >>
> >>
> >>
> >> Le 17/12/2024 à 09:16, Eric Dumazet a écrit :
> >>> On Tue, Dec 17, 2024 at 8:18 AM Christophe Leroy
> >>> <christophe.leroy@csgroup.eu> wrote:
> >>>>
> >>>> The following problem is encountered on kernel built with
> >>>> CONFIG_PREEMPT. An snmp daemon running with normal priority is
> >>>> regularly calling ioctl(SIOCGMIIPHY). Another process running with
> >>>> SCHED_FIFO policy is regularly reading /sys/class/net/eth0/carrier.
> >>>>
> >>>> After some random time, the snmp daemon gets preempted while holding
> >>>> the RTNL mutex then the high priority process is busy looping into
> >>>> carrier_show which bails out early due to a non-successfull
> >>>> rtnl_trylock() which implies restart_syscall(). Because the snmp
> >>>> daemon has a lower priority, it never gets the chances to release
> >>>> the RTNL mutex and the high-priority task continues to loop forever.
> >>>>
> >>>> Replace the trylock by lock_interruptible. This will increase the
> >>>> priority of the task holding the lock so that it can release it and
> >>>> allow the reader of /sys/class/net/eth0/carrier to actually perform
> >>>> its read.
> >>>>
> >>
> >> ...
> >>
> >>>>
> >>>> Fixes: 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
> >>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> >>>> ---
> >>>
> >>> At a first glance, this might resurface the deadlock issue Eric W. Biederman
> >>> was trying to fix in 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in
> >>> sysfs methods.")
> >>
> >> Are you talking about the deadlock fixed (incompletely) by 5a5990d3090b
> >> ("net: Avoid race between network down and sysfs"), or the complement
> >> provided by 336ca57c3b4e ?
> >>
> >> My understanding is that mutex_lock() will return EINTR only if a signal
> >> is pending so there is no need to set signal_pending like it was when
> >> using mutex_trylock() which does nothing when the mutex is already locked.
> >>
> >> And an EINTR return is expected and documented for a read() or a
> >> write(), I can't see why we want ERESTARTNOINTR instead of ERSTARTSYS.
> >> Isn't it the responsibility of the user app to call again read or write
> >> if it has decided to not install the necessary sigaction for an
> >> automatic restart ?
> >>
> >> Do you think I should instead use rtnl_lock_killable() and return
> >> ERESTARTNOINTR in case of failure ? In that case, is it still possible
> >> to interrupt a blocked 'cat /sys/class/net/eth0/carrier' which CTRL+C ?
> >
> > Issue is when no signal is pending, we have a typical deadlock situation :
> >
> > One process A is :
> >
> > Holding sysfs lock, then attempts to grab rtnl.
> >
> > Another one (B) is :
> >
> > Holding rtnl, then attempts to grab sysfs lock.
>
> Ok, I see.
>
> But then what can be the solution to avoid busy looping with
> mutex_trylock , not giving any chance to the task holding the rtnl to
> run and unlock it ?

One idea would be to add a usleep(500, 1000) if the sysfs read/write handler in
returns -ERESTARTNOINTR;

Totally untested idea :

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 8bbb1ad46335c3b8f50dd35d552f86767e62ead1..276c6d594129a18a7a4c2b1df447b34993398ab4
100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -290,6 +290,8 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct
iov_iter *iter)
                m->read_pos += copied;
        }
        mutex_unlock(&m->lock);
+       if (copied == -ERESTARTNOINTR)
+               usleep_range(500, 1000);
        return copied;
 Enomem:
        err = -ENOMEM;
Christophe Leroy Dec. 17, 2024, 11:56 a.m. UTC | #6
Le 17/12/2024 à 10:52, Eric Dumazet a écrit :
> On Tue, Dec 17, 2024 at 10:41 AM Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>>
>>
>>
>> Le 17/12/2024 à 10:20, Eric Dumazet a écrit :
>>> On Tue, Dec 17, 2024 at 9:59 AM Christophe Leroy
>>> <christophe.leroy@csgroup.eu> wrote:
>>>>
>>>>
>>>>
>>>> Le 17/12/2024 à 09:16, Eric Dumazet a écrit :
>>>>> On Tue, Dec 17, 2024 at 8:18 AM Christophe Leroy
>>>>> <christophe.leroy@csgroup.eu> wrote:
>>>>>>
>>>>>> The following problem is encountered on kernel built with
>>>>>> CONFIG_PREEMPT. An snmp daemon running with normal priority is
>>>>>> regularly calling ioctl(SIOCGMIIPHY). Another process running with
>>>>>> SCHED_FIFO policy is regularly reading /sys/class/net/eth0/carrier.
>>>>>>
>>>>>> After some random time, the snmp daemon gets preempted while holding
>>>>>> the RTNL mutex then the high priority process is busy looping into
>>>>>> carrier_show which bails out early due to a non-successfull
>>>>>> rtnl_trylock() which implies restart_syscall(). Because the snmp
>>>>>> daemon has a lower priority, it never gets the chances to release
>>>>>> the RTNL mutex and the high-priority task continues to loop forever.
>>>>>>
>>>>>> Replace the trylock by lock_interruptible. This will increase the
>>>>>> priority of the task holding the lock so that it can release it and
>>>>>> allow the reader of /sys/class/net/eth0/carrier to actually perform
>>>>>> its read.
>>>>>>
>>>>
>>>> ...
>>>>
>>>>>>
>>>>>> Fixes: 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
>>>>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>>>>>> ---
>>>>>
>>>>> At a first glance, this might resurface the deadlock issue Eric W. Biederman
>>>>> was trying to fix in 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in
>>>>> sysfs methods.")
>>>>
>>>> Are you talking about the deadlock fixed (incompletely) by 5a5990d3090b
>>>> ("net: Avoid race between network down and sysfs"), or the complement
>>>> provided by 336ca57c3b4e ?
>>>>
>>>> My understanding is that mutex_lock() will return EINTR only if a signal
>>>> is pending so there is no need to set signal_pending like it was when
>>>> using mutex_trylock() which does nothing when the mutex is already locked.
>>>>
>>>> And an EINTR return is expected and documented for a read() or a
>>>> write(), I can't see why we want ERESTARTNOINTR instead of ERSTARTSYS.
>>>> Isn't it the responsibility of the user app to call again read or write
>>>> if it has decided to not install the necessary sigaction for an
>>>> automatic restart ?
>>>>
>>>> Do you think I should instead use rtnl_lock_killable() and return
>>>> ERESTARTNOINTR in case of failure ? In that case, is it still possible
>>>> to interrupt a blocked 'cat /sys/class/net/eth0/carrier' which CTRL+C ?
>>>
>>> Issue is when no signal is pending, we have a typical deadlock situation :
>>>
>>> One process A is :
>>>
>>> Holding sysfs lock, then attempts to grab rtnl.
>>>
>>> Another one (B) is :
>>>
>>> Holding rtnl, then attempts to grab sysfs lock.
>>
>> Ok, I see.
>>
>> But then what can be the solution to avoid busy looping with
>> mutex_trylock , not giving any chance to the task holding the rtnl to
>> run and unlock it ?
> 
> One idea would be to add a usleep(500, 1000) if the sysfs read/write handler in
> returns -ERESTARTNOINTR;
> 
> Totally untested idea :
> 
> diff --git a/fs/seq_file.c b/fs/seq_file.c
> index 8bbb1ad46335c3b8f50dd35d552f86767e62ead1..276c6d594129a18a7a4c2b1df447b34993398ab4
> 100644
> --- a/fs/seq_file.c
> +++ b/fs/seq_file.c
> @@ -290,6 +290,8 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct
> iov_iter *iter)
>                  m->read_pos += copied;
>          }
>          mutex_unlock(&m->lock);
> +       if (copied == -ERESTARTNOINTR)
> +               usleep_range(500, 1000);
>          return copied;
>   Enomem:
>          err = -ENOMEM;

Ok, that may solve the issue, but it looks more like a hack than a real 
solution, doesn't it ?
It doesn't guarantee that the task holding the RTNL lock will be given 
the floor to run and free the lock.

The real issue is the nest between sysfs lock and RTNL lock. Can't we 
ensure that they are always held in the same order ?

Christophe
Eric Dumazet Dec. 17, 2024, 12:04 p.m. UTC | #7
On Tue, Dec 17, 2024 at 12:56 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> Le 17/12/2024 à 10:52, Eric Dumazet a écrit :
> > On Tue, Dec 17, 2024 at 10:41 AM Christophe Leroy
> > <christophe.leroy@csgroup.eu> wrote:
> >>
> >>
> >>
> >> Le 17/12/2024 à 10:20, Eric Dumazet a écrit :
> >>> On Tue, Dec 17, 2024 at 9:59 AM Christophe Leroy
> >>> <christophe.leroy@csgroup.eu> wrote:
> >>>>
> >>>>
> >>>>
> >>>> Le 17/12/2024 à 09:16, Eric Dumazet a écrit :
> >>>>> On Tue, Dec 17, 2024 at 8:18 AM Christophe Leroy
> >>>>> <christophe.leroy@csgroup.eu> wrote:
> >>>>>>
> >>>>>> The following problem is encountered on kernel built with
> >>>>>> CONFIG_PREEMPT. An snmp daemon running with normal priority is
> >>>>>> regularly calling ioctl(SIOCGMIIPHY). Another process running with
> >>>>>> SCHED_FIFO policy is regularly reading /sys/class/net/eth0/carrier.
> >>>>>>
> >>>>>> After some random time, the snmp daemon gets preempted while holding
> >>>>>> the RTNL mutex then the high priority process is busy looping into
> >>>>>> carrier_show which bails out early due to a non-successfull
> >>>>>> rtnl_trylock() which implies restart_syscall(). Because the snmp
> >>>>>> daemon has a lower priority, it never gets the chances to release
> >>>>>> the RTNL mutex and the high-priority task continues to loop forever.
> >>>>>>
> >>>>>> Replace the trylock by lock_interruptible. This will increase the
> >>>>>> priority of the task holding the lock so that it can release it and
> >>>>>> allow the reader of /sys/class/net/eth0/carrier to actually perform
> >>>>>> its read.
> >>>>>>
> >>>>
> >>>> ...
> >>>>
> >>>>>>
> >>>>>> Fixes: 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
> >>>>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> >>>>>> ---
> >>>>>
> >>>>> At a first glance, this might resurface the deadlock issue Eric W. Biederman
> >>>>> was trying to fix in 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in
> >>>>> sysfs methods.")
> >>>>
> >>>> Are you talking about the deadlock fixed (incompletely) by 5a5990d3090b
> >>>> ("net: Avoid race between network down and sysfs"), or the complement
> >>>> provided by 336ca57c3b4e ?
> >>>>
> >>>> My understanding is that mutex_lock() will return EINTR only if a signal
> >>>> is pending so there is no need to set signal_pending like it was when
> >>>> using mutex_trylock() which does nothing when the mutex is already locked.
> >>>>
> >>>> And an EINTR return is expected and documented for a read() or a
> >>>> write(), I can't see why we want ERESTARTNOINTR instead of ERSTARTSYS.
> >>>> Isn't it the responsibility of the user app to call again read or write
> >>>> if it has decided to not install the necessary sigaction for an
> >>>> automatic restart ?
> >>>>
> >>>> Do you think I should instead use rtnl_lock_killable() and return
> >>>> ERESTARTNOINTR in case of failure ? In that case, is it still possible
> >>>> to interrupt a blocked 'cat /sys/class/net/eth0/carrier' which CTRL+C ?
> >>>
> >>> Issue is when no signal is pending, we have a typical deadlock situation :
> >>>
> >>> One process A is :
> >>>
> >>> Holding sysfs lock, then attempts to grab rtnl.
> >>>
> >>> Another one (B) is :
> >>>
> >>> Holding rtnl, then attempts to grab sysfs lock.
> >>
> >> Ok, I see.
> >>
> >> But then what can be the solution to avoid busy looping with
> >> mutex_trylock , not giving any chance to the task holding the rtnl to
> >> run and unlock it ?
> >
> > One idea would be to add a usleep(500, 1000) if the sysfs read/write handler in
> > returns -ERESTARTNOINTR;
> >
> > Totally untested idea :
> >
> > diff --git a/fs/seq_file.c b/fs/seq_file.c
> > index 8bbb1ad46335c3b8f50dd35d552f86767e62ead1..276c6d594129a18a7a4c2b1df447b34993398ab4
> > 100644
> > --- a/fs/seq_file.c
> > +++ b/fs/seq_file.c
> > @@ -290,6 +290,8 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct
> > iov_iter *iter)
> >                  m->read_pos += copied;
> >          }
> >          mutex_unlock(&m->lock);
> > +       if (copied == -ERESTARTNOINTR)
> > +               usleep_range(500, 1000);
> >          return copied;
> >   Enomem:
> >          err = -ENOMEM;
>
> Ok, that may solve the issue, but it looks more like a hack than a real
> solution, doesn't it ?
> It doesn't guarantee that the task holding the RTNL lock will be given
> the floor to run and free the lock.

I am sure all for a real solution, what do you suggest ?

>
> The real issue is the nest between sysfs lock and RTNL lock. Can't we
> ensure that they are always held in the same order ?

Problem : adding/removing netdevices may add/remove sysfs files.

Adding/removing netdevices is done under rtnl.
Andrew Lunn Dec. 17, 2024, 3:30 p.m. UTC | #8
On Tue, Dec 17, 2024 at 08:18:25AM +0100, Christophe Leroy wrote:
> The following problem is encountered on kernel built with
> CONFIG_PREEMPT. An snmp daemon running with normal priority is
> regularly calling ioctl(SIOCGMIIPHY).

Why is an SNMP daemon using that IOCTL? What MAC driver is this? Is it
using phylib? For phylib, that IOCTL is supposed to be for debug only,
and is a bit of a foot gun. So i would not recommend it.

    Andrew
Christophe Leroy Dec. 17, 2024, 4:18 p.m. UTC | #9
Le 17/12/2024 à 16:30, Andrew Lunn a écrit :
> On Tue, Dec 17, 2024 at 08:18:25AM +0100, Christophe Leroy wrote:
>> The following problem is encountered on kernel built with
>> CONFIG_PREEMPT. An snmp daemon running with normal priority is
>> regularly calling ioctl(SIOCGMIIPHY).
> 
> Why is an SNMP daemon using that IOCTL? What MAC driver is this? Is it
> using phylib? For phylib, that IOCTL is supposed to be for debug only,
> and is a bit of a foot gun. So i would not recommend it.
> 

That's the well-known Net-SNMP package.

See for instance 
https://github.com/net-snmp/net-snmp/blob/master/agent/mibgroup/if-mib/data_access/interface_linux.c#L954

The MAC is ucc_geth driver, it is phylib for the moment, it is being 
converted to phylink in net-next , see commit 53036aa8d031 ("net: 
freescale: ucc_geth: phylink conversion")

Christophe
Andrew Lunn Dec. 17, 2024, 4:59 p.m. UTC | #10
On Tue, Dec 17, 2024 at 05:18:40PM +0100, Christophe Leroy wrote:
> 
> 
> Le 17/12/2024 à 16:30, Andrew Lunn a écrit :
> > On Tue, Dec 17, 2024 at 08:18:25AM +0100, Christophe Leroy wrote:
> > > The following problem is encountered on kernel built with
> > > CONFIG_PREEMPT. An snmp daemon running with normal priority is
> > > regularly calling ioctl(SIOCGMIIPHY).
> > 
> > Why is an SNMP daemon using that IOCTL? What MAC driver is this? Is it
> > using phylib? For phylib, that IOCTL is supposed to be for debug only,
> > and is a bit of a foot gun. So i would not recommend it.
> > 
> 
> That's the well-known Net-SNMP package.
> 
> See for instance https://github.com/net-snmp/net-snmp/blob/master/agent/mibgroup/if-mib/data_access/interface_linux.c#L954

That is pretty broken:

It assumes the PHY is using C22. Many PHYs now a days are C45.

It assumes the PHY only supports up to 1G, were as many PHYs now a
days are > 1G.

It assumes the PHY is not an automotive PHY which has its registers in
a different place.

Reading the BMSR can change the BMSR, so phylib is going to get
confused and miss linkup/linkdown events.

There is no locking going on, so the PHY might be on a different page,
e.g. to read the temperature sensors, blink the LEDs, etc. The SNMP
daemon has no way to detect this, so it will be applying BMSR, BMCR,
etc meaning to registers which are in fact not BMSR, BMCR, etc.

This code needs throwing away and replacing with netlink sockets,
which is a lot more abstract API, PHY independent, speed independent,
media independent etc. That would also solve your deadlock.

	Andrew
diff mbox series

Patch

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 14b88f551920..f060eecbc382 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -43,6 +43,7 @@  extern void rtnl_lock(void);
 extern void rtnl_unlock(void);
 extern int rtnl_trylock(void);
 extern int rtnl_is_locked(void);
+extern int rtnl_lock_interruptible(void);
 extern int rtnl_lock_killable(void);
 extern bool refcount_dec_and_rtnl_lock(refcount_t *r);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 2d9afc6e2161..0f97ea78c4da 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -95,8 +95,8 @@  static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
 	if (ret)
 		goto err;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	if (dev_isalive(netdev)) {
 		ret = (*set)(netdev, new);
@@ -190,7 +190,7 @@  static ssize_t carrier_store(struct device *dev, struct device_attribute *attr,
 	struct net_device *netdev = to_net_dev(dev);
 
 	/* The check is also done in change_carrier; this helps returning early
-	 * without hitting the trylock/restart in netdev_store.
+	 * without hitting the lock/restart in netdev_store.
 	 */
 	if (!netdev->netdev_ops->ndo_change_carrier)
 		return -EOPNOTSUPP;
@@ -204,8 +204,8 @@  static ssize_t carrier_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	int ret = -EINVAL;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	if (netif_running(netdev)) {
 		/* Synchronize carrier state with link watch,
@@ -228,13 +228,13 @@  static ssize_t speed_show(struct device *dev,
 	int ret = -EINVAL;
 
 	/* The check is also done in __ethtool_get_link_ksettings; this helps
-	 * returning early without hitting the trylock/restart below.
+	 * returning early without hitting the lock/restart below.
 	 */
 	if (!netdev->ethtool_ops->get_link_ksettings)
 		return ret;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	if (netif_running(netdev)) {
 		struct ethtool_link_ksettings cmd;
@@ -254,13 +254,13 @@  static ssize_t duplex_show(struct device *dev,
 	int ret = -EINVAL;
 
 	/* The check is also done in __ethtool_get_link_ksettings; this helps
-	 * returning early without hitting the trylock/restart below.
+	 * returning early without hitting the lock/restart below.
 	 */
 	if (!netdev->ethtool_ops->get_link_ksettings)
 		return ret;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	if (netif_running(netdev)) {
 		struct ethtool_link_ksettings cmd;
@@ -459,8 +459,8 @@  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 	if (len >  0 && buf[len - 1] == '\n')
 		--count;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	if (dev_isalive(netdev)) {
 		ret = dev_set_alias(netdev, buf, count);
@@ -523,13 +523,13 @@  static ssize_t phys_port_id_show(struct device *dev,
 	ssize_t ret = -EINVAL;
 
 	/* The check is also done in dev_get_phys_port_id; this helps returning
-	 * early without hitting the trylock/restart below.
+	 * early without hitting the lock/restart below.
 	 */
 	if (!netdev->netdev_ops->ndo_get_phys_port_id)
 		return -EOPNOTSUPP;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	if (dev_isalive(netdev)) {
 		struct netdev_phys_item_id ppid;
@@ -551,14 +551,14 @@  static ssize_t phys_port_name_show(struct device *dev,
 	ssize_t ret = -EINVAL;
 
 	/* The checks are also done in dev_get_phys_port_name; this helps
-	 * returning early without hitting the trylock/restart below.
+	 * returning early without hitting the lock/restart below.
 	 */
 	if (!netdev->netdev_ops->ndo_get_phys_port_name &&
 	    !netdev->devlink_port)
 		return -EOPNOTSUPP;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	if (dev_isalive(netdev)) {
 		char name[IFNAMSIZ];
@@ -580,15 +580,15 @@  static ssize_t phys_switch_id_show(struct device *dev,
 	ssize_t ret = -EINVAL;
 
 	/* The checks are also done in dev_get_phys_port_name; this helps
-	 * returning early without hitting the trylock/restart below. This works
+	 * returning early without hitting the lock/restart below. This works
 	 * because recurse is false when calling dev_get_port_parent_id.
 	 */
 	if (!netdev->netdev_ops->ndo_get_port_parent_id &&
 	    !netdev->devlink_port)
 		return -EOPNOTSUPP;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	if (dev_isalive(netdev)) {
 		struct netdev_phys_item_id ppid = { };
@@ -1282,8 +1282,8 @@  static ssize_t traffic_class_show(struct netdev_queue *queue,
 	if (!netif_is_multiqueue(dev))
 		return -ENOENT;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	index = get_netdev_queue_index(queue);
 
@@ -1327,7 +1327,7 @@  static ssize_t tx_maxrate_store(struct netdev_queue *queue,
 		return -EPERM;
 
 	/* The check is also done later; this helps returning early without
-	 * hitting the trylock/restart below.
+	 * hitting the lock/restart below.
 	 */
 	if (!dev->netdev_ops->ndo_set_tx_maxrate)
 		return -EOPNOTSUPP;
@@ -1336,8 +1336,8 @@  static ssize_t tx_maxrate_store(struct netdev_queue *queue,
 	if (err < 0)
 		return err;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	err = -EOPNOTSUPP;
 	if (dev->netdev_ops->ndo_set_tx_maxrate)
@@ -1593,8 +1593,8 @@  static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
 
 	index = get_netdev_queue_index(queue);
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	/* If queue belongs to subordinate dev use its map */
 	dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
@@ -1640,9 +1640,9 @@  static ssize_t xps_cpus_store(struct netdev_queue *queue,
 		return err;
 	}
 
-	if (!rtnl_trylock()) {
+	if (rtnl_lock_interruptible()) {
 		free_cpumask_var(mask);
-		return restart_syscall();
+		return -ERESTARTSYS;
 	}
 
 	err = netif_set_xps_queue(dev, mask, index);
@@ -1664,8 +1664,8 @@  static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
 
 	index = get_netdev_queue_index(queue);
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	if (rtnl_lock_interruptible())
+		return -ERESTARTSYS;
 
 	tc = netdev_txq_to_tc(dev, index);
 	rtnl_unlock();
@@ -1699,9 +1699,9 @@  static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
 		return err;
 	}
 
-	if (!rtnl_trylock()) {
+	if (rtnl_lock_interruptible()) {
 		bitmap_free(mask);
-		return restart_syscall();
+		return -ERESTARTSYS;
 	}
 
 	cpus_read_lock();
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ebcfc2debf1a..a52ffc3c8ab8 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -80,6 +80,12 @@  void rtnl_lock(void)
 }
 EXPORT_SYMBOL(rtnl_lock);
 
+int rtnl_lock_interruptible(void)
+{
+	return mutex_lock_interruptible(&rtnl_mutex);
+}
+EXPORT_SYMBOL(rtnl_lock_interruptible);
+
 int rtnl_lock_killable(void)
 {
 	return mutex_lock_killable(&rtnl_mutex);