Message ID | 20230502075921.3614794-1-pvorel@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/1] nfslock01.sh: Don't test on NFS v3 on TCP | expand |
Hi, > Subject: [LTP] [PATCH 1/1] nfslock01.sh: Don't test on NFS v3 on TCP Obviously, the subject should not contain "on TCP" (first, I removed only TCP entries in runtest file, but because I expect it would be problematic also on UDP). Also tst_brk TCONF does not check TCP/UDP either (only NFS v3). I'll update that before merge (don't send new version). Kind regards, Petr > nfs_flock (run via nfslock01.sh) is known to fail on NFS v3 [1]: > not unsharing /var makes AF_UNIX socket for host's rpcbind to become > available inside ltp_ns. Then, at NFS v3 mount time, kernel creates > an instance of lockd for ltp_ns, and ports for that instance leak to > host's rpcbind and overwrite ports for lockd already active for root > namespace. This breaks nfs3 file locking. > Before bd512e733 ("nfs_flock: fail the test if lock/unlock ops fail") > it run indefinitely with "unhandled error -107": > [ 2840.099565] lockd: cannot monitor 10.0.0.2 > [ 2840.109353] lockd: cannot monitor 10.0.0.2 > [ 2843.286811] xs_tcp_setup_socket: connect returned unhandled error -107 > [ 2850.198791] xs_tcp_setup_socket: connect returned unhandled error -107 > bd512e733 caused an early abort (therefore only "cannot monitor 10.0.0.2" > appears). > Although there is suggestion, how to fix the problem in kernel [2]: > > Maybe rpcb_create_local() shall detect that it is not in root > > netns, and only try AF_INET connection to > localhost in that case. > That would be simple and might be sensible. IF changing the AF_UNIX > path to "/run/rpcbind.sock" isn't sufficient, then testing for the > root_ns is probably the best second option. > Until it's implemented, it's better to: > * don't test on NFS v3 on both TCP and UDP (remove from the runtest file) > * skip the test with TCONF in case version 3 is passed on command line > NOTE: Tested only on TCP (UDP is disabled in kernel by default via > NFS_DISABLE_UDP_SUPPORT). > [1] https://lore.kernel.org/ltp/YebcNQg0u5cU1QyQ@pevik/ > [2] https://lore.kernel.org/ltp/164254401568.24166.883582030601071761@noble.neil.brown.name/ > Signed-off-by: Petr Vorel <pvorel@suse.cz> > --- > runtest/net.nfs | 4 ---- > testcases/network/nfs/nfslock01/nfslock01.sh | 6 ++++++ > 2 files changed, 6 insertions(+), 4 deletions(-) > diff --git a/runtest/net.nfs b/runtest/net.nfs > index 72cf4b307..a73956015 100644 > --- a/runtest/net.nfs > +++ b/runtest/net.nfs > @@ -83,13 +83,9 @@ nfs4_ipv6_08 nfs08.sh -6 -v 4 -t tcp > nfs41_ipv6_08 nfs08.sh -6 -v 4.1 -t tcp > nfs42_ipv6_08 nfs08.sh -6 -v 4.2 -t tcp > -nfslock3_01 nfslock01.sh -v 3 -t udp > -nfslock3t_01 nfslock01.sh -v 3 -t tcp > nfslock4_01 nfslock01.sh -v 4 -t tcp > nfslock41_01 nfslock01.sh -v 4.1 -t tcp > nfslock42_01 nfslock01.sh -v 4.2 -t tcp > -nfslock3_ipv6_01 nfslock01.sh -6 -v 3 -t udp > -nfslock3t_ipv6_01 nfslock01.sh -6 -v 3 -t tcp > nfslock4_ipv6_01 nfslock01.sh -6 -v 4 -t tcp > nfslock41_ipv6_01 nfslock01.sh -6 -v 4.1 -t tcp > nfslock42_ipv6_01 nfslock01.sh -6 -v 4.2 -t tcp > diff --git a/testcases/network/nfs/nfslock01/nfslock01.sh b/testcases/network/nfs/nfslock01/nfslock01.sh > index fbcc3c00f..78904281b 100755 > --- a/testcases/network/nfs/nfslock01/nfslock01.sh > +++ b/testcases/network/nfs/nfslock01/nfslock01.sh > @@ -15,6 +15,12 @@ TST_TESTFUNC="do_test" > do_setup() > { > + local i > + > + for i in $VERSION; do > + [ "$v" = 3 ] && tst_brk TCONF "Test is known to fail on NFSv3" > + done > + > nfs_setup > tst_res TINFO "creating test files"
On Tue, 2023-05-02 at 09:59 +0200, Petr Vorel wrote: > nfs_flock (run via nfslock01.sh) is known to fail on NFS v3 [1]: > > not unsharing /var makes AF_UNIX socket for host's rpcbind to become > available inside ltp_ns. Then, at NFS v3 mount time, kernel creates > an instance of lockd for ltp_ns, and ports for that instance leak to > host's rpcbind and overwrite ports for lockd already active for root > namespace. This breaks nfs3 file locking. > Yeccchhh...that is pretty nasty. rpcbind was obviously written in a time before namespaces were even a thought to anyone. I wonder if there is something we can do in rpcbind itself to guard against these sorts of shenanigans? Probably not, I guess... Is /var shared between namespaces in this test for some particular reason? > Before bd512e733 ("nfs_flock: fail the test if lock/unlock ops fail") > it run indefinitely with "unhandled error -107": > [ 2840.099565] lockd: cannot monitor 10.0.0.2 > [ 2840.109353] lockd: cannot monitor 10.0.0.2 > [ 2843.286811] xs_tcp_setup_socket: connect returned unhandled error -107 > [ 2850.198791] xs_tcp_setup_socket: connect returned unhandled error -107 > > bd512e733 caused an early abort (therefore only "cannot monitor 10.0.0.2" > appears). > > Although there is suggestion, how to fix the problem in kernel [2]: > > > Maybe rpcb_create_local() shall detect that it is not in root > > netns, and only try AF_INET connection to > localhost in that case. > > That would be simple and might be sensible. IF changing the AF_UNIX > path to "/run/rpcbind.sock" isn't sufficient, then testing for the > root_ns is probably the best second option. > Was it determined that changing the location of the socket wasn't sufficient to fix this? FWIW, My Fedora 38 machine seems to listen on that socket already: [Socket] ListenStream=/run/rpcbind.sock
Hi all, [ Cc Steve, Alexey, Nikita - hope the new address is the correct, because nikita.yushchenko@virtuozzo.com is dead ] > On Tue, 2023-05-02 at 09:59 +0200, Petr Vorel wrote: > > nfs_flock (run via nfslock01.sh) is known to fail on NFS v3 [1]: > > not unsharing /var makes AF_UNIX socket for host's rpcbind to become > > available inside ltp_ns. Then, at NFS v3 mount time, kernel creates > > an instance of lockd for ltp_ns, and ports for that instance leak to > > host's rpcbind and overwrite ports for lockd already active for root > > namespace. This breaks nfs3 file locking. > Yeccchhh...that is pretty nasty. > rpcbind was obviously written in a time before namespaces were even a > thought to anyone. I wonder if there is something we can do in rpcbind > itself to guard against these sorts of shenanigans? Probably not, I > guess... Maybe Steve or Neil have some idea. > Is /var shared between namespaces in this test for some particular > reason? I hope I got , we talk about /var/run/netns/ltp_ns, which is symlink to /proc/$pid/ns/net. Or is it really about /run/rpcbind.sock vs /var/run/rpcbind.sock? /var/run/netns/<NETNS> file is something created by ip: #define NETNS_RUN_DIR "/var/run/netns" [1] NETNS_RUN_DIR?=/var/run/netns [2] man ip-netns(8) [3] By convention a named network namespace is an object at /var/run/netns/NAME that can be opened. The file descriptor resulting from opening /var/run/netns/NAME refers to the specified network namespace. Holding that file descriptor open keeps the network namespace alive. The file descriptor can be used with the setns(2) system call to change the network namespace associated with a task. LTP used to use ip for creating namespaces. Later Alexey added reused custom LTP code ns_exec.c and ns_ifmove.c [4] (NOTE: these were recently renamed to tst_ns_exec.c and tst_ns_ifmove.c and moved) in order to allow to reuse already defined namespaces. This change did the symlink from /proc/$pid/ns/net /var/run/netns/ltp_ns, to keep the convention. > > Before bd512e733 ("nfs_flock: fail the test if lock/unlock ops fail") > > it run indefinitely with "unhandled error -107": > > [ 2840.099565] lockd: cannot monitor 10.0.0.2 > > [ 2840.109353] lockd: cannot monitor 10.0.0.2 > > [ 2843.286811] xs_tcp_setup_socket: connect returned unhandled error -107 > > [ 2850.198791] xs_tcp_setup_socket: connect returned unhandled error -107 > > bd512e733 caused an early abort (therefore only "cannot monitor 10.0.0.2" > > appears). > > Although there is suggestion, how to fix the problem in kernel [2]: > > > Maybe rpcb_create_local() shall detect that it is not in root > > > netns, and only try AF_INET connection to > localhost in that case. > > That would be simple and might be sensible. IF changing the AF_UNIX > > path to "/run/rpcbind.sock" isn't sufficient, then testing for the > > root_ns is probably the best second option. > Was it determined that changing the location of the socket wasn't > sufficient to fix this? FWIW, My Fedora 38 machine seems to listen on > that socket already: > [Socket] > ListenStream=/run/rpcbind.sock NOTE both openSUSE Tumbleweed and Debian 11 (bullseye), on which I'm able to detect the problem (IMHO it would be reproducible on the most of distros) have: * /var/run is symlink to /run * use /run/rpcbind.lock (have patch to change /var/run/rpcbind.lock to /run/rpcbind.lock [5] [6] @Steve shouldn't be this patch accepted as the default?) Kind regards, Petr [1] https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/include/namespace.h#n12 [2] https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/Makefile#n20 [3] https://man7.org/linux/man-pages/man8/ip-netns.8.html [4] https://github.com/linux-test-project/ltp/commit/3fb501e04c61fb3d6b6b82011919572a87425cf9 [5] https://build.opensuse.org/package/view_file/network/rpcbind/0001-change-lockingdir-to-run.patch?expand=1 [6] https://salsa.debian.org/debian/rpcbind/-/blob/master/debian/patches/run-migration
On Tue, 02 May 2023, Jeff Layton wrote: > On Tue, 2023-05-02 at 09:59 +0200, Petr Vorel wrote: > > > > Although there is suggestion, how to fix the problem in kernel [2]: > > > > > Maybe rpcb_create_local() shall detect that it is not in root > > > netns, and only try AF_INET connection to > localhost in that case. > > > > That would be simple and might be sensible. IF changing the AF_UNIX > > path to "/run/rpcbind.sock" isn't sufficient, then testing for the > > root_ns is probably the best second option. > > > > Was it determined that changing the location of the socket wasn't > sufficient to fix this? FWIW, My Fedora 38 machine seems to listen on > that socket already: > > [Socket] > ListenStream=/run/rpcbind.sock I think the best solution for this problem is to change the kernel to first try to send to an abstract socket: "\0/run/rpcbind.sock". Only if that fails do we try "/run/rpcbind.sock". We also change rpcbind to listen on both ListenStream=@/run/rpcbind.sock ListenStream=/run/rpcbind.sock Abstract sockets are local to a network namespace, while non-abstract Unix domain sockets are local to a file and so can only be local to a mount namespace. We clearly need rpcbind lookup from the kernel to be netns-local, so abstract is the obvious choice. NeilBrown
Hi all, FYI we're not going to disable the tests, it's against "do not work around real bugs in tests" [1]. Thus I set this patch as "rejected". I'm not sure if anybody has tried the suggested settings, I'll try to test it soon. Kind regards, Petr [1] https://lore.kernel.org/ltp/ZFO2vmWk-tvrZoQQ@yuki/
>> rpcbind was obviously written in a time before namespaces were even a >> thought to anyone. I wonder if there is something we can do in rpcbind >> itself to guard against these sorts of shenanigans? Probably not, I >> guess... > > Maybe Steve or Neil have some idea. > >> Is /var shared between namespaces in this test for some particular >> reason? > > I hope I got , we talk about /var/run/netns/ltp_ns, which is symlink to > /proc/$pid/ns/net. Or is it really about /run/rpcbind.sock vs > /var/run/rpcbind.sock? The overall picture is: - by design, filesystem namespaces and network namespaces are independent, it is pretty ok for two processes to share filesystem namespace and be in different network namespaces, or vice versa. - the code in question, while being in the kernel for ages, is breaking this basic design, by using filesystem path to contact a network service, - thus the fix is: just not do that. I consider kernel contacting rpcbind via AF_UNIX being a bug in the kernel namespace implementation. So this is a rare case when a test suite (LTP) helped to find a non-obvious kernel bug. Just need to fix that bug, if not yet. Rpcbind listens both via AF_INET and via AP_UNIX, and did so for ages. It is even not possible to disable AF_INET listening without patching code. By stopping contacting it via AF_UNIX, it is virtually impossible to break anything. Nikita
On Mon, 08 May 2023, Nikita Yushchenko wrote: > >> rpcbind was obviously written in a time before namespaces were even a > >> thought to anyone. I wonder if there is something we can do in rpcbind > >> itself to guard against these sorts of shenanigans? Probably not, I > >> guess... > > > > Maybe Steve or Neil have some idea. > > > >> Is /var shared between namespaces in this test for some particular > >> reason? > > > > I hope I got , we talk about /var/run/netns/ltp_ns, which is symlink to > > /proc/$pid/ns/net. Or is it really about /run/rpcbind.sock vs > > /var/run/rpcbind.sock? > > The overall picture is: > > - by design, filesystem namespaces and network namespaces are independent, it is pretty ok for two > processes to share filesystem namespace and be in different network namespaces, or vice versa. > > - the code in question, while being in the kernel for ages, is breaking this basic design, by using > filesystem path to contact a network service, Not just in the kernel, but also in user-space. The kernel contacts rpcbind to find how to talk to statd. statd talks to rpcbind to tell it how it where it can be reached. As you say - it has been this way for "ages". So maybe the bug is that something creates a network namespace without also creating a filesystem namespace. Certainly the kernel allows this as it should because the kernel doesn't set policy. But using the freedom to create a setup that doesn't actually work is foolish. If ltp creates a network namespace without creating a matching filesystem name space, and expects NFSv3 to work - then that is a bug in ltp. Now I agree that using path names seems not ideal in this case, and it would be a valuable enhancement to make it easy to avoid that. But it is an enhancement, not a bugfix. > > - thus the fix is: just not do that. Surely the fix is to do something else, not just to do nothing :-) > > I consider kernel contacting rpcbind via AF_UNIX being a bug in the kernel namespace implementation. So > this is a rare case when a test suite (LTP) helped to find a non-obvious kernel bug. Just need to fix > that bug, if not yet. There is good reason to use use AF_UNIX for the kernel to contact rpcbind. In fact the kernel has only been using AF_UNIX to contact rpcbind for about 12 years. Commit 7402ab19cdd5 ("SUNRPC: Use AF_LOCAL for rpcbind upcalls") gives some of the reasons for the change. They are still good reasons. Fortunately Linux provides "abstract" AF_UNIX names, which provide all the benefits that we want of AF_UNIX, but doesn't depend on the filesystem and keeps the bindings private to the network namespace - the best of both worlds. To fully implement this we need changes to libtirpc and to the kernel. Not big changes, but not entirely trivial either. > > Rpcbind listens both via AF_INET and via AP_UNIX, and did so for ages. > It is even not possible to disable AF_INET listening without patching code. By stopping contacting it > via AF_UNIX, it is virtually impossible to break anything. Check that commit for what can break. For testing, you can change /usr/lib/rpcbind.sock to listen on /run/NOT-rpcbind.sock instead. of /run/rpcbind.sock It must listen on at least one AF_UNIX socket as you noted, but it doesn't have to listen on one that any tool will talk to. This will cause all code (user-space and kernel) to fall-back on IPv6 or IPv4 to contact rpcbind. So maybe you can work-around the bug in ltp that way. You could even just "rm -f /var/run/rpcbind.sock" after starting rpcbind, and before running the test. Meanwhile I'll post some patches which enhancements to the kernel and to libtirpc to use abstract AF_UNIX socket when available. Thanks, NeilBrown
>> The overall picture is: >> >> - by design, filesystem namespaces and network namespaces are independent, it is pretty ok for two >> processes to share filesystem namespace and be in different network namespaces, or vice versa. >> >> - the code in question, while being in the kernel for ages, is breaking this basic design, by using >> filesystem path to contact a network service, > > Not just in the kernel, but also in user-space. The kernel contacts > rpcbind to find how to talk to statd. statd talks to rpcbind to tell it > how it where it can be reached. AFAIR the badness happens when in-kernel nfsd registers itself in rpcbind, using AF_LOCAL to contact it. When nfsd is started for a child network namespace, it immediately breaks nfsd running in the root network namespace, because ports used by the later get overwritten inside rpcbind and become no longer available for local or remote clients. I'd say it is userspace responsibility to make entire setup consistent against the used namespaces, but it is kernel responsibility to keep namespaces isolation while executing individual operations. In the case of registering in-kernel nfsd being started, using namespace-safe way to do it looks more important for me than the reasoning outlined in commit 7402ab19cdd5 ("SUNRPC: Use AF_LOCAL for rpcbind upcalls") that you mention. And, this won't be fixed by trying to an abstract AF_LOCAL socket before using a filepath-bound one, since if one is not available, the nfsd running in the root namespace will still get broken by starting nfsd in a child namespace. Maybe, the way used to reach rpcbind to register in-kernel services shall be special cased. Nikita
diff --git a/runtest/net.nfs b/runtest/net.nfs index 72cf4b307..a73956015 100644 --- a/runtest/net.nfs +++ b/runtest/net.nfs @@ -83,13 +83,9 @@ nfs4_ipv6_08 nfs08.sh -6 -v 4 -t tcp nfs41_ipv6_08 nfs08.sh -6 -v 4.1 -t tcp nfs42_ipv6_08 nfs08.sh -6 -v 4.2 -t tcp -nfslock3_01 nfslock01.sh -v 3 -t udp -nfslock3t_01 nfslock01.sh -v 3 -t tcp nfslock4_01 nfslock01.sh -v 4 -t tcp nfslock41_01 nfslock01.sh -v 4.1 -t tcp nfslock42_01 nfslock01.sh -v 4.2 -t tcp -nfslock3_ipv6_01 nfslock01.sh -6 -v 3 -t udp -nfslock3t_ipv6_01 nfslock01.sh -6 -v 3 -t tcp nfslock4_ipv6_01 nfslock01.sh -6 -v 4 -t tcp nfslock41_ipv6_01 nfslock01.sh -6 -v 4.1 -t tcp nfslock42_ipv6_01 nfslock01.sh -6 -v 4.2 -t tcp diff --git a/testcases/network/nfs/nfslock01/nfslock01.sh b/testcases/network/nfs/nfslock01/nfslock01.sh index fbcc3c00f..78904281b 100755 --- a/testcases/network/nfs/nfslock01/nfslock01.sh +++ b/testcases/network/nfs/nfslock01/nfslock01.sh @@ -15,6 +15,12 @@ TST_TESTFUNC="do_test" do_setup() { + local i + + for i in $VERSION; do + [ "$v" = 3 ] && tst_brk TCONF "Test is known to fail on NFSv3" + done + nfs_setup tst_res TINFO "creating test files"