diff mbox series

[1/1] nfslock01.sh: Don't test on NFS v3 on TCP

Message ID 20230502075921.3614794-1-pvorel@suse.cz (mailing list archive)
State New, archived
Headers show
Series [1/1] nfslock01.sh: Don't test on NFS v3 on TCP | expand

Commit Message

Petr Vorel May 2, 2023, 7:59 a.m. UTC
nfs_flock (run via nfslock01.sh) is known to fail on NFS v3 [1]:

    not unsharing /var makes AF_UNIX socket for host's rpcbind to become
    available inside ltp_ns. Then, at NFS v3 mount time, kernel creates
    an instance of lockd for ltp_ns, and ports for that instance leak to
    host's rpcbind and overwrite ports for lockd already active for root
    namespace. This breaks nfs3 file locking.

Before bd512e733 ("nfs_flock: fail the test if lock/unlock ops fail")
it run indefinitely with "unhandled error -107":
[ 2840.099565] lockd: cannot monitor 10.0.0.2
[ 2840.109353] lockd: cannot monitor 10.0.0.2
[ 2843.286811] xs_tcp_setup_socket: connect returned unhandled error -107
[ 2850.198791] xs_tcp_setup_socket: connect returned unhandled error -107

bd512e733 caused an early abort (therefore only "cannot monitor 10.0.0.2"
appears).

Although there is suggestion, how to fix the problem in kernel [2]:

    > Maybe rpcb_create_local() shall detect that it is not in root
    > netns, and only try AF_INET connection to > localhost in that case.

    That would be simple and might be sensible.  IF changing the AF_UNIX
    path to "/run/rpcbind.sock" isn't sufficient, then testing for the
    root_ns is probably the best second option.

Until it's implemented, it's better to:
* don't test on NFS v3 on both TCP and UDP (remove from the runtest file)
* skip the test with TCONF in case version 3 is passed on command line

NOTE: Tested only on TCP (UDP is disabled in kernel by default via
NFS_DISABLE_UDP_SUPPORT).

[1] https://lore.kernel.org/ltp/YebcNQg0u5cU1QyQ@pevik/
[2] https://lore.kernel.org/ltp/164254401568.24166.883582030601071761@noble.neil.brown.name/

Signed-off-by: Petr Vorel <pvorel@suse.cz>
---
 runtest/net.nfs                              | 4 ----
 testcases/network/nfs/nfslock01/nfslock01.sh | 6 ++++++
 2 files changed, 6 insertions(+), 4 deletions(-)

Comments

Petr Vorel May 2, 2023, 9:22 a.m. UTC | #1
Hi,

> Subject: [LTP] [PATCH 1/1] nfslock01.sh: Don't test on NFS v3 on TCP

Obviously, the subject should not contain "on TCP" (first, I removed only TCP
entries in runtest file, but because I expect it would be problematic also on
UDP). Also tst_brk TCONF does not check TCP/UDP either (only NFS v3).

I'll update that before merge (don't send new version).

Kind regards,
Petr

> nfs_flock (run via nfslock01.sh) is known to fail on NFS v3 [1]:

>     not unsharing /var makes AF_UNIX socket for host's rpcbind to become
>     available inside ltp_ns. Then, at NFS v3 mount time, kernel creates
>     an instance of lockd for ltp_ns, and ports for that instance leak to
>     host's rpcbind and overwrite ports for lockd already active for root
>     namespace. This breaks nfs3 file locking.

> Before bd512e733 ("nfs_flock: fail the test if lock/unlock ops fail")
> it run indefinitely with "unhandled error -107":
> [ 2840.099565] lockd: cannot monitor 10.0.0.2
> [ 2840.109353] lockd: cannot monitor 10.0.0.2
> [ 2843.286811] xs_tcp_setup_socket: connect returned unhandled error -107
> [ 2850.198791] xs_tcp_setup_socket: connect returned unhandled error -107

> bd512e733 caused an early abort (therefore only "cannot monitor 10.0.0.2"
> appears).

> Although there is suggestion, how to fix the problem in kernel [2]:

>     > Maybe rpcb_create_local() shall detect that it is not in root
>     > netns, and only try AF_INET connection to > localhost in that case.

>     That would be simple and might be sensible.  IF changing the AF_UNIX
>     path to "/run/rpcbind.sock" isn't sufficient, then testing for the
>     root_ns is probably the best second option.

> Until it's implemented, it's better to:
> * don't test on NFS v3 on both TCP and UDP (remove from the runtest file)
> * skip the test with TCONF in case version 3 is passed on command line

> NOTE: Tested only on TCP (UDP is disabled in kernel by default via
> NFS_DISABLE_UDP_SUPPORT).

> [1] https://lore.kernel.org/ltp/YebcNQg0u5cU1QyQ@pevik/
> [2] https://lore.kernel.org/ltp/164254401568.24166.883582030601071761@noble.neil.brown.name/

> Signed-off-by: Petr Vorel <pvorel@suse.cz>
> ---
>  runtest/net.nfs                              | 4 ----
>  testcases/network/nfs/nfslock01/nfslock01.sh | 6 ++++++
>  2 files changed, 6 insertions(+), 4 deletions(-)

> diff --git a/runtest/net.nfs b/runtest/net.nfs
> index 72cf4b307..a73956015 100644
> --- a/runtest/net.nfs
> +++ b/runtest/net.nfs
> @@ -83,13 +83,9 @@ nfs4_ipv6_08 nfs08.sh -6 -v 4 -t tcp
>  nfs41_ipv6_08 nfs08.sh -6 -v 4.1 -t tcp
>  nfs42_ipv6_08 nfs08.sh -6 -v 4.2 -t tcp

> -nfslock3_01 nfslock01.sh -v 3 -t udp
> -nfslock3t_01 nfslock01.sh -v 3 -t tcp
>  nfslock4_01 nfslock01.sh -v 4 -t tcp
>  nfslock41_01 nfslock01.sh -v 4.1 -t tcp
>  nfslock42_01 nfslock01.sh -v 4.2 -t tcp
> -nfslock3_ipv6_01 nfslock01.sh -6 -v 3 -t udp
> -nfslock3t_ipv6_01 nfslock01.sh -6 -v 3 -t tcp
>  nfslock4_ipv6_01 nfslock01.sh -6 -v 4 -t tcp
>  nfslock41_ipv6_01 nfslock01.sh -6 -v 4.1 -t tcp
>  nfslock42_ipv6_01 nfslock01.sh -6 -v 4.2 -t tcp
> diff --git a/testcases/network/nfs/nfslock01/nfslock01.sh b/testcases/network/nfs/nfslock01/nfslock01.sh
> index fbcc3c00f..78904281b 100755
> --- a/testcases/network/nfs/nfslock01/nfslock01.sh
> +++ b/testcases/network/nfs/nfslock01/nfslock01.sh
> @@ -15,6 +15,12 @@ TST_TESTFUNC="do_test"

>  do_setup()
>  {
> +	local i
> +
> +	for i in $VERSION; do
> +		[ "$v" = 3 ] && tst_brk TCONF "Test is known to fail on NFSv3"
> +	done
> +
>  	nfs_setup

>  	tst_res TINFO "creating test files"
Jeff Layton May 2, 2023, 12:25 p.m. UTC | #2
On Tue, 2023-05-02 at 09:59 +0200, Petr Vorel wrote:
> nfs_flock (run via nfslock01.sh) is known to fail on NFS v3 [1]:
> 
>     not unsharing /var makes AF_UNIX socket for host's rpcbind to become
>     available inside ltp_ns. Then, at NFS v3 mount time, kernel creates
>     an instance of lockd for ltp_ns, and ports for that instance leak to
>     host's rpcbind and overwrite ports for lockd already active for root
>     namespace. This breaks nfs3 file locking.
> 

Yeccchhh...that is pretty nasty.

rpcbind was obviously written in a time before namespaces were even a
thought to anyone. I wonder if there is something we can do in rpcbind
itself to guard against these sorts of shenanigans? Probably not, I
guess...

Is /var shared between namespaces in this test for some particular
reason?

> Before bd512e733 ("nfs_flock: fail the test if lock/unlock ops fail")
> it run indefinitely with "unhandled error -107":
> [ 2840.099565] lockd: cannot monitor 10.0.0.2
> [ 2840.109353] lockd: cannot monitor 10.0.0.2
> [ 2843.286811] xs_tcp_setup_socket: connect returned unhandled error -107
> [ 2850.198791] xs_tcp_setup_socket: connect returned unhandled error -107
> 
> bd512e733 caused an early abort (therefore only "cannot monitor 10.0.0.2"
> appears).
> 
> Although there is suggestion, how to fix the problem in kernel [2]:
> 
>     > Maybe rpcb_create_local() shall detect that it is not in root
>     > netns, and only try AF_INET connection to > localhost in that case.
> 
>     That would be simple and might be sensible.  IF changing the AF_UNIX
>     path to "/run/rpcbind.sock" isn't sufficient, then testing for the
>     root_ns is probably the best second option.
> 

Was it determined that changing the location of the socket wasn't
sufficient to fix this? FWIW, My Fedora 38 machine seems to listen on
that socket already:

    [Socket]
    ListenStream=/run/rpcbind.sock
Petr Vorel May 2, 2023, 1:41 p.m. UTC | #3
Hi all,

[ Cc Steve, Alexey, Nikita - hope the new address is the correct,
because nikita.yushchenko@virtuozzo.com is dead ]

> On Tue, 2023-05-02 at 09:59 +0200, Petr Vorel wrote:
> > nfs_flock (run via nfslock01.sh) is known to fail on NFS v3 [1]:

> >     not unsharing /var makes AF_UNIX socket for host's rpcbind to become
> >     available inside ltp_ns. Then, at NFS v3 mount time, kernel creates
> >     an instance of lockd for ltp_ns, and ports for that instance leak to
> >     host's rpcbind and overwrite ports for lockd already active for root
> >     namespace. This breaks nfs3 file locking.


> Yeccchhh...that is pretty nasty.

> rpcbind was obviously written in a time before namespaces were even a
> thought to anyone. I wonder if there is something we can do in rpcbind
> itself to guard against these sorts of shenanigans? Probably not, I
> guess...

Maybe Steve or Neil have some idea.

> Is /var shared between namespaces in this test for some particular
> reason?

I hope I got , we talk about /var/run/netns/ltp_ns, which is symlink to
/proc/$pid/ns/net. Or is it really about /run/rpcbind.sock vs
/var/run/rpcbind.sock?

/var/run/netns/<NETNS> file is something created by ip:

#define NETNS_RUN_DIR "/var/run/netns" [1]

NETNS_RUN_DIR?=/var/run/netns [2]

man ip-netns(8) [3]
       By convention a named network namespace is an object at
       /var/run/netns/NAME that can be opened. The file descriptor
       resulting from opening /var/run/netns/NAME refers to the
       specified network namespace. Holding that file descriptor open
       keeps the network namespace alive. The file descriptor can be
       used with the setns(2) system call to change the network
       namespace associated with a task.

LTP used to use ip for creating namespaces. Later Alexey added reused custom
LTP code ns_exec.c and ns_ifmove.c [4] (NOTE: these were recently renamed to
tst_ns_exec.c and tst_ns_ifmove.c and moved) in order to allow to reuse already
defined namespaces. This change did the symlink from /proc/$pid/ns/net
/var/run/netns/ltp_ns, to keep the convention.

> > Before bd512e733 ("nfs_flock: fail the test if lock/unlock ops fail")
> > it run indefinitely with "unhandled error -107":
> > [ 2840.099565] lockd: cannot monitor 10.0.0.2
> > [ 2840.109353] lockd: cannot monitor 10.0.0.2
> > [ 2843.286811] xs_tcp_setup_socket: connect returned unhandled error -107
> > [ 2850.198791] xs_tcp_setup_socket: connect returned unhandled error -107

> > bd512e733 caused an early abort (therefore only "cannot monitor 10.0.0.2"
> > appears).

> > Although there is suggestion, how to fix the problem in kernel [2]:

> >     > Maybe rpcb_create_local() shall detect that it is not in root
> >     > netns, and only try AF_INET connection to > localhost in that case.

> >     That would be simple and might be sensible.  IF changing the AF_UNIX
> >     path to "/run/rpcbind.sock" isn't sufficient, then testing for the
> >     root_ns is probably the best second option.


> Was it determined that changing the location of the socket wasn't
> sufficient to fix this? FWIW, My Fedora 38 machine seems to listen on
> that socket already:

>     [Socket]
>     ListenStream=/run/rpcbind.sock

NOTE both openSUSE Tumbleweed and Debian 11 (bullseye), on which I'm able to
detect the problem (IMHO it would be reproducible on the most of distros) have:
* /var/run is symlink to /run
* use /run/rpcbind.lock (have patch to change /var/run/rpcbind.lock to
  /run/rpcbind.lock [5] [6] @Steve shouldn't be this patch accepted as the default?)

Kind regards,
Petr

[1] https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/include/namespace.h#n12
[2] https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/Makefile#n20
[3] https://man7.org/linux/man-pages/man8/ip-netns.8.html
[4] https://github.com/linux-test-project/ltp/commit/3fb501e04c61fb3d6b6b82011919572a87425cf9
[5] https://build.opensuse.org/package/view_file/network/rpcbind/0001-change-lockingdir-to-run.patch?expand=1
[6] https://salsa.debian.org/debian/rpcbind/-/blob/master/debian/patches/run-migration
NeilBrown May 2, 2023, 9:21 p.m. UTC | #4
On Tue, 02 May 2023, Jeff Layton wrote:
> On Tue, 2023-05-02 at 09:59 +0200, Petr Vorel wrote:
> > 
> > Although there is suggestion, how to fix the problem in kernel [2]:
> > 
> >     > Maybe rpcb_create_local() shall detect that it is not in root
> >     > netns, and only try AF_INET connection to > localhost in that case.
> > 
> >     That would be simple and might be sensible.  IF changing the AF_UNIX
> >     path to "/run/rpcbind.sock" isn't sufficient, then testing for the
> >     root_ns is probably the best second option.
> > 
> 
> Was it determined that changing the location of the socket wasn't
> sufficient to fix this? FWIW, My Fedora 38 machine seems to listen on
> that socket already:
> 
>     [Socket]
>     ListenStream=/run/rpcbind.sock

I think the best solution for this problem is to change the kernel to
first try to send to an abstract socket: "\0/run/rpcbind.sock".  Only if
that fails do we try "/run/rpcbind.sock".

We also change rpcbind to listen on both
   ListenStream=@/run/rpcbind.sock
   ListenStream=/run/rpcbind.sock

Abstract sockets are local to a network namespace, while non-abstract
Unix domain sockets are local to a file and so can only be local to a
mount namespace.  We clearly need rpcbind lookup from the kernel to be
netns-local, so abstract is the obvious choice.

NeilBrown
Petr Vorel May 4, 2023, 8:37 p.m. UTC | #5
Hi all,

FYI we're not going to disable the tests, it's against "do not work around real
bugs in tests" [1]. Thus I set this patch as "rejected".

I'm not sure if anybody has tried the suggested settings, I'll try to test it soon.

Kind regards,
Petr

[1] https://lore.kernel.org/ltp/ZFO2vmWk-tvrZoQQ@yuki/
Nikita Yushchenko May 8, 2023, 2:50 a.m. UTC | #6
>> rpcbind was obviously written in a time before namespaces were even a
>> thought to anyone. I wonder if there is something we can do in rpcbind
>> itself to guard against these sorts of shenanigans? Probably not, I
>> guess...
> 
> Maybe Steve or Neil have some idea.
> 
>> Is /var shared between namespaces in this test for some particular
>> reason?
> 
> I hope I got , we talk about /var/run/netns/ltp_ns, which is symlink to
> /proc/$pid/ns/net. Or is it really about /run/rpcbind.sock vs
> /var/run/rpcbind.sock?

The overall picture is:

- by design, filesystem namespaces and network namespaces are independent, it is pretty ok for two 
processes to share filesystem namespace and be in different network namespaces, or vice versa.

- the code in question, while being in the kernel for ages, is breaking this basic design, by using 
filesystem path to contact a network service,

- thus the fix is: just not do that.

I consider kernel contacting rpcbind via AF_UNIX being a bug in the kernel namespace implementation. So 
this is a rare case when a test suite (LTP) helped to find a non-obvious kernel bug. Just need to fix 
that bug, if not yet.

Rpcbind listens both via AF_INET and via AP_UNIX, and did so for ages.
It is even not possible to disable AF_INET listening without patching code. By stopping contacting it 
via AF_UNIX, it is virtually impossible to break anything.

Nikita
NeilBrown May 9, 2023, 11 p.m. UTC | #7
On Mon, 08 May 2023, Nikita Yushchenko wrote:
> >> rpcbind was obviously written in a time before namespaces were even a
> >> thought to anyone. I wonder if there is something we can do in rpcbind
> >> itself to guard against these sorts of shenanigans? Probably not, I
> >> guess...
> > 
> > Maybe Steve or Neil have some idea.
> > 
> >> Is /var shared between namespaces in this test for some particular
> >> reason?
> > 
> > I hope I got , we talk about /var/run/netns/ltp_ns, which is symlink to
> > /proc/$pid/ns/net. Or is it really about /run/rpcbind.sock vs
> > /var/run/rpcbind.sock?
> 
> The overall picture is:
> 
> - by design, filesystem namespaces and network namespaces are independent, it is pretty ok for two 
> processes to share filesystem namespace and be in different network namespaces, or vice versa.
> 
> - the code in question, while being in the kernel for ages, is breaking this basic design, by using 
> filesystem path to contact a network service,

Not just in the kernel, but also in user-space.  The kernel contacts
rpcbind to find how to talk to statd.  statd talks to rpcbind to tell it
how it where it can be reached.  As you say - it has been this way for
"ages".

So maybe the bug is that something creates a network namespace without
also creating a filesystem namespace.  Certainly the kernel allows this
as it should because the kernel doesn't set policy.
But using the freedom to create a setup that doesn't actually work is
foolish.  If ltp creates a network namespace without creating a matching
filesystem name space, and expects NFSv3 to work - then that is a bug in
ltp.

Now I agree that using path names seems not ideal in this case, and it
would be a valuable enhancement to make it easy to avoid that.  But it
is an enhancement, not a bugfix.

> 
> - thus the fix is: just not do that.

Surely the fix is to do something else, not just to do nothing :-)

> 
> I consider kernel contacting rpcbind via AF_UNIX being a bug in the kernel namespace implementation. So 
> this is a rare case when a test suite (LTP) helped to find a non-obvious kernel bug. Just need to fix 
> that bug, if not yet.

There is good reason to use use AF_UNIX for the kernel to contact
rpcbind.

In fact the kernel has only been using AF_UNIX to contact rpcbind for
about 12 years.

Commit 7402ab19cdd5 ("SUNRPC: Use AF_LOCAL for rpcbind upcalls")
gives some of the reasons for the change.  They are still good reasons.

Fortunately Linux provides "abstract" AF_UNIX names, which provide all
the benefits that we want of AF_UNIX, but doesn't depend on the
filesystem and keeps the bindings private to the network namespace - the
best of both worlds.

To fully implement this we need changes to libtirpc and to the kernel.
Not big changes, but not entirely trivial either.

> 
> Rpcbind listens both via AF_INET and via AP_UNIX, and did so for ages.
> It is even not possible to disable AF_INET listening without patching code. By stopping contacting it 
> via AF_UNIX, it is virtually impossible to break anything.

Check that commit for what can break.

For testing, you can change /usr/lib/rpcbind.sock to listen on
/run/NOT-rpcbind.sock instead. of /run/rpcbind.sock

It must listen on at least one AF_UNIX socket as you noted,
but it doesn't have to listen on one that any tool will talk to.  This
will cause all code (user-space and kernel) to fall-back on IPv6 or IPv4
to contact rpcbind.
So maybe you can work-around the bug in ltp that way.  You could even
just "rm -f /var/run/rpcbind.sock" after starting rpcbind, and before
running the test.

Meanwhile I'll post some patches which enhancements to the kernel and to
libtirpc to use abstract AF_UNIX socket when available.

Thanks,
NeilBrown
Nikita Yushchenko May 10, 2023, 2:44 a.m. UTC | #8
>> The overall picture is:
>>
>> - by design, filesystem namespaces and network namespaces are independent, it is pretty ok for two
>> processes to share filesystem namespace and be in different network namespaces, or vice versa.
>>
>> - the code in question, while being in the kernel for ages, is breaking this basic design, by using
>> filesystem path to contact a network service,
> 
> Not just in the kernel, but also in user-space.  The kernel contacts
> rpcbind to find how to talk to statd.  statd talks to rpcbind to tell it
> how it where it can be reached.

AFAIR the badness happens when in-kernel nfsd registers itself in rpcbind, using AF_LOCAL to contact it. 
When nfsd is started for a child network namespace, it immediately breaks nfsd running in the root 
network namespace, because ports used by the later get overwritten inside rpcbind and become no longer 
available for local or remote clients.

I'd say it is userspace responsibility to make entire setup consistent against the used namespaces, but 
it is kernel responsibility to keep namespaces isolation while executing individual operations. In the 
case of registering in-kernel nfsd being started, using namespace-safe way to do it looks more important 
for me than the reasoning outlined in commit 7402ab19cdd5 ("SUNRPC: Use AF_LOCAL for rpcbind upcalls") 
that you mention.

And, this won't be fixed by trying to an abstract AF_LOCAL socket before using a filepath-bound one, 
since if one is not available, the nfsd running in the root namespace will still get broken by starting 
nfsd in a child namespace.

Maybe, the way used to reach rpcbind to register in-kernel services shall be special cased.

Nikita
diff mbox series

Patch

diff --git a/runtest/net.nfs b/runtest/net.nfs
index 72cf4b307..a73956015 100644
--- a/runtest/net.nfs
+++ b/runtest/net.nfs
@@ -83,13 +83,9 @@  nfs4_ipv6_08 nfs08.sh -6 -v 4 -t tcp
 nfs41_ipv6_08 nfs08.sh -6 -v 4.1 -t tcp
 nfs42_ipv6_08 nfs08.sh -6 -v 4.2 -t tcp
 
-nfslock3_01 nfslock01.sh -v 3 -t udp
-nfslock3t_01 nfslock01.sh -v 3 -t tcp
 nfslock4_01 nfslock01.sh -v 4 -t tcp
 nfslock41_01 nfslock01.sh -v 4.1 -t tcp
 nfslock42_01 nfslock01.sh -v 4.2 -t tcp
-nfslock3_ipv6_01 nfslock01.sh -6 -v 3 -t udp
-nfslock3t_ipv6_01 nfslock01.sh -6 -v 3 -t tcp
 nfslock4_ipv6_01 nfslock01.sh -6 -v 4 -t tcp
 nfslock41_ipv6_01 nfslock01.sh -6 -v 4.1 -t tcp
 nfslock42_ipv6_01 nfslock01.sh -6 -v 4.2 -t tcp
diff --git a/testcases/network/nfs/nfslock01/nfslock01.sh b/testcases/network/nfs/nfslock01/nfslock01.sh
index fbcc3c00f..78904281b 100755
--- a/testcases/network/nfs/nfslock01/nfslock01.sh
+++ b/testcases/network/nfs/nfslock01/nfslock01.sh
@@ -15,6 +15,12 @@  TST_TESTFUNC="do_test"
 
 do_setup()
 {
+	local i
+
+	for i in $VERSION; do
+		[ "$v" = 3 ] && tst_brk TCONF "Test is known to fail on NFSv3"
+	done
+
 	nfs_setup
 
 	tst_res TINFO "creating test files"