[v4,5/6] add listmount(2) syscall

Message ID	20231025140205.3586473-6-mszeredi@redhat.com (mailing list archive)
State	Changes Requested
Delegated to:	Paul Moore
Headers	show Return-Path: <linux-security-module-owner@vger.kernel.org> From: Miklos Szeredi <mszeredi@redhat.com> To: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, linux-man@vger.kernel.org, linux-security-module@vger.kernel.org, Karel Zak <kzak@redhat.com>, Ian Kent <raven@themaw.net>, David Howells <dhowells@redhat.com>, Linus Torvalds <torvalds@linux-foundation.org>, Al Viro <viro@zeniv.linux.org.uk>, Christian Brauner <christian@brauner.io>, Amir Goldstein <amir73il@gmail.com>, Matthew House <mattlloydhouse@gmail.com>, Florian Weimer <fweimer@redhat.com>, Arnd Bergmann <arnd@arndb.de> Subject: [PATCH v4 5/6] add listmount(2) syscall Date: Wed, 25 Oct 2023 16:02:03 +0200 Message-ID: <20231025140205.3586473-6-mszeredi@redhat.com> In-Reply-To: <20231025140205.3586473-1-mszeredi@redhat.com> References: <20231025140205.3586473-1-mszeredi@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="US-ASCII"; x-default=true Precedence: bulk
Series	querying mount attributes \| expand [v4,0/6] querying mount attributes [v4,1/6] add unique mount ID [v4,2/6] mounts: keep list of mounts in an rbtree [v4,3/6] namespace: extract show_path() helper [v4,4/6] add statmount(2) syscall [v4,5/6] add listmount(2) syscall [v4,6/6] wire up syscalls for statmount/listmount

Miklos Szeredi Oct. 25, 2023, 2:02 p.m. UTC

Add way to query the children of a particular mount.  This is a more
flexible way to iterate the mount tree than having to parse the complete
/proc/self/mountinfo.

Allow listing either

 - immediate child mounts only, or

 - recursively all descendant mounts (depth first).

Lookup the mount by the new 64bit mount ID.  If a mount needs to be queried
based on path, then statx(2) can be used to first query the mount ID
belonging to the path.

Return an array of new (64bit) mount ID's.  Without privileges only mounts
are listed which are reachable from the task's root.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/namespace.c             | 93 ++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h   |  3 ++
 include/uapi/linux/mount.h |  9 ++++
 3 files changed, 105 insertions(+)

Jonathan Corbet Nov. 7, 2023, 9:23 p.m. UTC | #1

Miklos Szeredi <mszeredi@redhat.com> writes:

> Add way to query the children of a particular mount.  This is a more
> flexible way to iterate the mount tree than having to parse the complete
> /proc/self/mountinfo.
>
> Allow listing either
>
>  - immediate child mounts only, or
>
>  - recursively all descendant mounts (depth first).

So I have one probably silly question:

> +SYSCALL_DEFINE4(listmount, const struct __mount_arg __user *, req,
> +		u64 __user *, buf, size_t, bufsize, unsigned int, flags)
> +{

Why use struct __mount_arg (or struct mnt_id_req :) here rather than
just passing in the mount ID directly?  You don't use the request_mask
field anywhere.

Thanks,

jon

Paul Moore Nov. 8, 2023, 2:58 a.m. UTC | #2

On Oct 25, 2023 Miklos Szeredi <mszeredi@redhat.com> wrote:
> 
> Add way to query the children of a particular mount.  This is a more
> flexible way to iterate the mount tree than having to parse the complete
> /proc/self/mountinfo.
> 
> Allow listing either
> 
>  - immediate child mounts only, or
> 
>  - recursively all descendant mounts (depth first).
> 
> Lookup the mount by the new 64bit mount ID.  If a mount needs to be queried
> based on path, then statx(2) can be used to first query the mount ID
> belonging to the path.
> 
> Return an array of new (64bit) mount ID's.  Without privileges only mounts
> are listed which are reachable from the task's root.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> Reviewed-by: Ian Kent <raven@themaw.net>
> ---
>  fs/namespace.c             | 93 ++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h   |  3 ++
>  include/uapi/linux/mount.h |  9 ++++
>  3 files changed, 105 insertions(+)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index a980c250a3a6..0afe2344bba6 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -4958,6 +4958,99 @@ SYSCALL_DEFINE4(statmount, const struct __mount_arg __user *, req,
>  	return ret;
>  }
>  
> +static struct mount *listmnt_first(struct mount *root)
> +{
> +	return list_first_entry_or_null(&root->mnt_mounts, struct mount, mnt_child);
> +}
> +
> +static struct mount *listmnt_next(struct mount *curr, struct mount *root, bool recurse)
> +{
> +	if (recurse)
> +		return next_mnt(curr, root);
> +	if (!list_is_head(curr->mnt_child.next, &root->mnt_mounts))
> +		return list_next_entry(curr, mnt_child);
> +	return NULL;
> +}
> +
> +static long do_listmount(struct vfsmount *mnt, u64 __user *buf, size_t bufsize,
> +			 const struct path *root, unsigned int flags)
> +{
> +	struct mount *r, *m = real_mount(mnt);
> +	struct path rootmnt = {
> +		.mnt = root->mnt,
> +		.dentry = root->mnt->mnt_root
> +	};
> +	long ctr = 0;
> +	bool reachable_only = true;
> +	bool recurse = flags & LISTMOUNT_RECURSIVE;
> +	int err;
> +
> +	err = security_sb_statfs(mnt->mnt_root);
> +	if (err)
> +		return err;
> +
> +	if (flags & LISTMOUNT_UNREACHABLE) {
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EPERM;
> +		reachable_only = false;
> +	}

Similar to my comment in patch 4/6, please move the LSM call after the
capability check.

> +	if (reachable_only && !is_path_reachable(m, mnt->mnt_root, &rootmnt))
> +		return capable(CAP_SYS_ADMIN) ? 0 : -EPERM;
> +
> +	for (r = listmnt_first(m); r; r = listmnt_next(r, m, recurse)) {
> +		if (reachable_only &&
> +		    !is_path_reachable(r, r->mnt.mnt_root, root))
> +			continue;
> +
> +		if (ctr >= bufsize)
> +			return -EOVERFLOW;
> +		if (put_user(r->mnt_id_unique, buf + ctr))
> +			return -EFAULT;
> +		ctr++;
> +		if (ctr < 0)
> +			return -ERANGE;
> +	}
> +	return ctr;
> +}

--
paul-moore.com

Christian Brauner Nov. 8, 2023, 7:53 a.m. UTC | #3

> Why use struct __mount_arg (or struct mnt_id_req :) here rather than
> just passing in the mount ID directly?  You don't use the request_mask

Please see Arnd's detailed summary here:
https://lore.kernel.org/lkml/44631c05-6b8a-42dc-b37e-df6776baa5d4@app.fastmail.com

Jonathan Corbet Nov. 8, 2023, 4:20 p.m. UTC | #4

Christian Brauner <brauner@kernel.org> writes:

>> Why use struct __mount_arg (or struct mnt_id_req :) here rather than
>> just passing in the mount ID directly?  You don't use the request_mask
>
> Please see Arnd's detailed summary here:
> https://lore.kernel.org/lkml/44631c05-6b8a-42dc-b37e-df6776baa5d4@app.fastmail.com

Ah, makes sense, I'd missed that.

Given this, though, it seems like maybe sys_listmount() should enforce
that req->request_mask==0 ?

Thanks,

jon

Christian Brauner Nov. 8, 2023, 4:23 p.m. UTC | #5

On Wed, Nov 08, 2023 at 09:20:28AM -0700, Jonathan Corbet wrote:
> Christian Brauner <brauner@kernel.org> writes:
> 
> >> Why use struct __mount_arg (or struct mnt_id_req :) here rather than
> >> just passing in the mount ID directly?  You don't use the request_mask
> >
> > Please see Arnd's detailed summary here:
> > https://lore.kernel.org/lkml/44631c05-6b8a-42dc-b37e-df6776baa5d4@app.fastmail.com
> 
> Ah, makes sense, I'd missed that.
> 
> Given this, though, it seems like maybe sys_listmount() should enforce
> that req->request_mask==0 ?

Good catch, it does now:

diff --git a/fs/namespace.c b/fs/namespace.c
index ae09321b0f72..21edddd75532 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5030,6 +5030,8 @@ SYSCALL_DEFINE4(listmount, const struct mnt_id_req __user *, req,

        if (copy_from_user(&kreq, req, sizeof(kreq)))
                return -EFAULT;
+       if (kreq.request_mask != 0)
+               return -EINVAL;
        mnt_id = kreq.mnt_id;

        down_read(&namespace_sem);

Guenter Roeck Jan. 10, 2024, 10:23 p.m. UTC | #6

Hi,

On Wed, Oct 25, 2023 at 04:02:03PM +0200, Miklos Szeredi wrote:
> Add way to query the children of a particular mount.  This is a more
> flexible way to iterate the mount tree than having to parse the complete
> /proc/self/mountinfo.
> 
> Allow listing either
> 
>  - immediate child mounts only, or
> 
>  - recursively all descendant mounts (depth first).
> 
> Lookup the mount by the new 64bit mount ID.  If a mount needs to be queried
> based on path, then statx(2) can be used to first query the mount ID
> belonging to the path.
> 
> Return an array of new (64bit) mount ID's.  Without privileges only mounts
> are listed which are reachable from the task's root.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

with this patch in the tree, all sh4 builds fail with ICE.

during RTL pass: final
In file included from fs/namespace.c:11:
fs/namespace.c: In function '__se_sys_listmount':
include/linux/syscalls.h:258:9: internal compiler error: in change_address_1, at emit-rtl.c:2275

I tested with gcc 8.2, 11.3, 11.4, and 12.3. The compiler version
does not make a difference. Has anyone else seen the same problem ?
If so, any idea what to do about it ?

Thanks,
Guenter

Linus Torvalds Jan. 11, 2024, 12:32 a.m. UTC | #7

On Wed, 10 Jan 2024 at 14:23, Guenter Roeck <linux@roeck-us.net> wrote:
>
> with this patch in the tree, all sh4 builds fail with ICE.
>
> during RTL pass: final
> In file included from fs/namespace.c:11:
> fs/namespace.c: In function '__se_sys_listmount':
> include/linux/syscalls.h:258:9: internal compiler error: in change_address_1, at emit-rtl.c:2275

We do have those very ugly SYSCALL_DEFINEx() macros, but I'm not
seeing _anything_ that would be odd about the listmount case.

And the "__se_sys" thing in particular is just a fairly trivial wrapper.

It does use that asmlinkage_protect() thing, and it is unquestionably
horrendously ugly (staring too long at <linux/syscalls.h> has been
known to cause madness and despair), but we do that for *every* single
system call and I don't see why the new listmount entry would be
different.

           Linus

Guenter Roeck Jan. 11, 2024, 5:12 a.m. UTC | #8

On 1/10/24 16:32, Linus Torvalds wrote:
> On Wed, 10 Jan 2024 at 14:23, Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> with this patch in the tree, all sh4 builds fail with ICE.
>>
>> during RTL pass: final
>> In file included from fs/namespace.c:11:
>> fs/namespace.c: In function '__se_sys_listmount':
>> include/linux/syscalls.h:258:9: internal compiler error: in change_address_1, at emit-rtl.c:2275
> 
> We do have those very ugly SYSCALL_DEFINEx() macros, but I'm not
> seeing _anything_ that would be odd about the listmount case.
> 
> And the "__se_sys" thing in particular is just a fairly trivial wrapper.
> 
> It does use that asmlinkage_protect() thing, and it is unquestionably
> horrendously ugly (staring too long at <linux/syscalls.h> has been
> known to cause madness and despair), but we do that for *every* single
> system call and I don't see why the new listmount entry would be
> different.
> 

I don't have much of a clue either, but here is a hint: The problem is
only seen if CONFIG_MMU=n. I tested with all configurations in
arch/sh/configs.

Guenter

Guenter Roeck Jan. 11, 2024, 6:57 p.m. UTC | #9

On 1/10/24 16:32, Linus Torvalds wrote:
> On Wed, 10 Jan 2024 at 14:23, Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> with this patch in the tree, all sh4 builds fail with ICE.
>>
>> during RTL pass: final
>> In file included from fs/namespace.c:11:
>> fs/namespace.c: In function '__se_sys_listmount':
>> include/linux/syscalls.h:258:9: internal compiler error: in change_address_1, at emit-rtl.c:2275
> 
> We do have those very ugly SYSCALL_DEFINEx() macros, but I'm not
> seeing _anything_ that would be odd about the listmount case.
> 
> And the "__se_sys" thing in particular is just a fairly trivial wrapper.
> 
> It does use that asmlinkage_protect() thing, and it is unquestionably
> horrendously ugly (staring too long at <linux/syscalls.h> has been
> known to cause madness and despair), but we do that for *every* single
> system call and I don't see why the new listmount entry would be
> different.
> 

It isn't the syscall. The following hack avoids the problem.

diff --git a/fs/namespace.c b/fs/namespace.c
index ef1fd6829814..28fe2a55bd94 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5070,8 +5070,10 @@ static ssize_t do_listmount(struct mount *first, struct path *orig, u64 mnt_id,
                 ctr = array_index_nospec(ctr, bufsize);
                 if (put_user(r->mnt_id_unique, buf + ctr))
                         return -EFAULT;
+#if 0
                 if (check_add_overflow(ctr, 1, &ctr))
                         return -ERANGE;
+#endif

But it isn't check_add_overflow() either. This "helps" as well.

diff --git a/fs/namespace.c b/fs/namespace.c
index ef1fd6829814..b53cb2f13530 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5068,8 +5068,10 @@ static ssize_t do_listmount(struct mount *first, struct path *orig, u64 mnt_id,
                 if (!is_path_reachable(r, r->mnt.mnt_root, orig))
                         continue;
                 ctr = array_index_nospec(ctr, bufsize);
+#if 0
                 if (put_user(r->mnt_id_unique, buf + ctr))
                         return -EFAULT;
+#endif
                 if (check_add_overflow(ctr, 1, &ctr))
                         return -ERANGE;


Any variance of put_user() with &buf[ctr] or buf + ctr fails
if ctr is a variable and permitted to be != 0. For example,
commenting out the call to check_add_overflow() and starting
the loop with ctr = 1 also triggers the problem, as does replacing
the call to check_add_overflow() with ctr++;. Using a temporary
variable such as in
	u64 __user *pbuf;
	...
	pbuf = buf + ctr;
	if (put_user(r->mnt_id_unique, pbuf))
                         return -EFAULT;

doesn't help either. But this does:

-               if (put_user(r->mnt_id_unique, buf + ctr))
+               if (put_user(r->mnt_id_unique, (u32 *)(buf + ctr)))

and "buf + 17" as well as "&buf[17]" work as well. Essentially,
every combination of "buf + ctr" or "&buf[ctr]" fails if buf
is u64* and ctr is a variable.

The following works. Would this be acceptable ?

diff --git a/fs/namespace.c b/fs/namespace.c
index ef1fd6829814..dc0f844205d9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5068,10 +5068,11 @@ static ssize_t do_listmount(struct mount *first, struct path *orig, u64 mnt_id,
                 if (!is_path_reachable(r, r->mnt.mnt_root, orig))
                         continue;
                 ctr = array_index_nospec(ctr, bufsize);
-               if (put_user(r->mnt_id_unique, buf + ctr))
+               if (put_user(r->mnt_id_unique, buf))
                         return -EFAULT;
                 if (check_add_overflow(ctr, 1, &ctr))
                         return -ERANGE;
+               buf++;
         }
         return ctr;
  }

Guenter

Linus Torvalds Jan. 11, 2024, 8:14 p.m. UTC | #10

On Thu, 11 Jan 2024 at 10:57, Guenter Roeck <linux@roeck-us.net> wrote:
>
> Any variance of put_user() with &buf[ctr] or buf + ctr fails
> if ctr is a variable and permitted to be != 0.

Crazy. But the 64-bit put_user() is a bit special and tends to require
more registers (the 64-bit value is passed in two registers), so that
probably then results in the ICE.

Side note: looking at the SH version of __put_user_u64(), I think it's
buggy and is missing the exception handler for the second 32-bit move.
I dunno, I don't read sh asm, but it looks suspicious.

> The following works. Would this be acceptable ?

It might be very easy to trigger this once again if somebody goes "that's silly"

That said, I also absolutely detest the "error handling" in that
function. It's horrible.

Noticing the user access error in the middle is just sad, and if that
was just handled better and at least the range was checked first, the
overflow error couldn't happen and checking for it is thus pointless.

And looking at it all, it really looks like the whole interface is
broken. The "bufsize" argument isn't the size of the buffer at all.
It's the number of entries.

Extra confusingly, in the *other* system call, bufsize is in fact the
size of the buffer.

And the 'ctr' overflow checking is doubly garbage, because the only
reason *that* can happen is that we didn't check the incoming
arguments properly.

Same goes for the whole array_index_nospec() - it's pointless, because
the user controls what that code checks against anyway, so there's no
point to trying to manage some range checking.

The only range checking there that matters would be the one that
put_user() has to do against the address space size, but that's done
by put_user().

End result: that thing needs a rewrite.

The SH put_user64() needs to be looked at too, but in the meantime,
maybe something like this fixes the problems with listmount?

NOTE! ENTIRELY untested, but that naming and lack of argument sanity
checking really is horrendous. We should have caught this earlier.

                   Linus

Arnd Bergmann Jan. 11, 2024, 11:01 p.m. UTC | #11

On Thu, Jan 11, 2024, at 21:14, Linus Torvalds wrote:

> The SH put_user64() needs to be looked at too, but in the meantime,
> maybe something like this fixes the problems with listmount?

I tried changing it to use the generic memcpy() based uaccess
that m68k-nommu and riscv-nommu use, which also avoids the
build failure. I still run into other unrelated build issues
on arch/sh, so I'm not sure if this is a sufficient fix.

     Arnd

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 7500521b2b98..2cc3a541e231 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -73,6 +73,7 @@ config SUPERH
        select PERF_USE_VMALLOC
        select RTC_LIB
        select SPARSE_IRQ
+       select UACCESS_MEMCPY if !MMU
        select TRACE_IRQFLAGS_SUPPORT
        help
          The SuperH is a RISC processor targeted for use in embedded systems
diff --git a/arch/sh/include/asm/uaccess.h b/arch/sh/include/asm/uaccess.h
index a79609eb14be..b42764d55901 100644
--- a/arch/sh/include/asm/uaccess.h
+++ b/arch/sh/include/asm/uaccess.h
@@ -2,6 +2,7 @@
 #ifndef __ASM_SH_UACCESS_H
 #define __ASM_SH_UACCESS_H
 
+#ifdef CONFIG_MMU
 #include <asm/extable.h>
 #include <asm-generic/access_ok.h>
 
@@ -130,4 +131,8 @@ struct mem_access {
 int handle_unaligned_access(insn_size_t instruction, struct pt_regs *regs,
                            struct mem_access *ma, int, unsigned long address);
 
+#else
+#include <asm-generic/uaccess.h>
+#endif
+
 #endif /* __ASM_SH_UACCESS_H */
diff --git a/arch/sh/include/asm/uaccess_32.h b/arch/sh/include/asm/uaccess_32.h
index 5d7ddc092afd..e053f2fd245c 100644
--- a/arch/sh/include/asm/uaccess_32.h
+++ b/arch/sh/include/asm/uaccess_32.h
@@ -35,7 +35,6 @@ do {                                                          \
        }                                                       \
 } while (0)
 
-#ifdef CONFIG_MMU
 #define __get_user_asm(x, addr, err, insn) \
 ({ \
 __asm__ __volatile__( \
@@ -56,16 +55,6 @@ __asm__ __volatile__( \
        ".previous" \
        :"=&r" (err), "=&r" (x) \
        :"m" (__m(addr)), "i" (-EFAULT), "0" (err)); })
-#else
-#define __get_user_asm(x, addr, err, insn)             \
-do {                                                   \
-       __asm__ __volatile__ (                          \
-               "mov." insn "   %1, %0\n\t"             \
-               : "=&r" (x)                             \
-               : "m" (__m(addr))                       \
-       );                                              \
-} while (0)
-#endif /* CONFIG_MMU */
 
 extern void __get_user_unknown(void);
 
@@ -140,7 +129,6 @@ do {                                                        \
        }                                               \
 } while (0)
 
-#ifdef CONFIG_MMU
 #define __put_user_asm(x, addr, err, insn)                     \
 do {                                                           \
        __asm__ __volatile__ (                                  \
@@ -164,17 +152,6 @@ do {                                                               \
                : "memory"                                      \
        );                                                      \
 } while (0)
-#else
-#define __put_user_asm(x, addr, err, insn)             \
-do {                                                   \
-       __asm__ __volatile__ (                          \
-               "mov." insn "   %0, %1\n\t"             \
-               : /* no outputs */                      \
-               : "r" (x), "m" (__m(addr))              \
-               : "memory"                              \
-       );                                              \
-} while (0)
-#endif /* CONFIG_MMU */
 
 #if defined(CONFIG_CPU_LITTLE_ENDIAN)
 #define __put_user_u64(val,addr,retval) \

Guenter Roeck Jan. 11, 2024, 11:57 p.m. UTC | #12

On 1/11/24 12:14, Linus Torvalds wrote:
> On Thu, 11 Jan 2024 at 10:57, Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> Any variance of put_user() with &buf[ctr] or buf + ctr fails
>> if ctr is a variable and permitted to be != 0.
> 
> Crazy. But the 64-bit put_user() is a bit special and tends to require
> more registers (the 64-bit value is passed in two registers), so that
> probably then results in the ICE.
> 
> Side note: looking at the SH version of __put_user_u64(), I think it's
> buggy and is missing the exception handler for the second 32-bit move.
> I dunno, I don't read sh asm, but it looks suspicious.
> 

I wonder if something may be wrong with the definition and use of __m
for u64 accesses. The code below also fixes the build problem.

But then I really don't know what

struct __large_struct { unsigned long buf[100]; };
#define __m(x) (*(struct __large_struct __user *)(x))

is supposed to be doing in the first place, and I still don't understand
why the problem only shows up with CONFIG_MMU=n.

Guenter

---
diff --git a/arch/sh/include/asm/uaccess_32.h b/arch/sh/include/asm/uaccess_32.h
index 5d7ddc092afd..f0451a37b6ff 100644
--- a/arch/sh/include/asm/uaccess_32.h
+++ b/arch/sh/include/asm/uaccess_32.h
@@ -196,7 +196,7 @@ __asm__ __volatile__( \
         ".long  1b, 3b\n\t" \
         ".previous" \
         : "=r" (retval) \
-       : "r" (val), "m" (__m(addr)), "i" (-EFAULT), "0" (retval) \
+       : "r" (val), "m" (*(u64 *)(addr)), "i" (-EFAULT), "0" (retval) \
          : "memory"); })
  #else
  #define __put_user_u64(val,addr,retval) \
@@ -218,7 +218,7 @@ __asm__ __volatile__( \
         ".long  1b, 3b\n\t" \
         ".previous" \
         : "=r" (retval) \
-       : "r" (val), "m" (__m(addr)), "i" (-EFAULT), "0" (retval) \
+       : "r" (val), "m" (*(u64 *)(addr)), "i" (-EFAULT), "0" (retval) \
          : "memory"); })
  #endif

Linus Torvalds Jan. 12, 2024, 3:40 a.m. UTC | #13

On Thu, 11 Jan 2024 at 15:57, Guenter Roeck <linux@roeck-us.net> wrote:
>
> I wonder if something may be wrong with the definition and use of __m
> for u64 accesses. The code below also fixes the build problem.

Ok, that looks like the right workaround.

> But then I really don't know what
>
> struct __large_struct { unsigned long buf[100]; };
> #define __m(x) (*(struct __large_struct __user *)(x))

This is a historical pattern we've used because the gcc docs weren't
100% clear on what "m" does, and whether it might for example end up
loading the value from memory into a register, spilling it to the
stack, and then using the stack address...

Using the whole "tell the compiler it accesses a big structure" turns
the memory access into "BLKmode" (in gcc terms) and makes sure that
never happens.

NOTE! I'm not sure it ever did happen with gcc, but we have literally
seen clang do that "load from memory, spill to stack, and then use the
stack address for the asm". Crazy, I know. See

  https://lore.kernel.org/all/CAHk-=wgobnShg4c2yyMbk2p=U-wmnOmX_0=b3ZY_479Jjey2xw@mail.gmail.com/

where I point to clang doing basically exactly that with the "rm"
constraint for another case entirely. I consider it a clang bug, but
happily I've never seen the "load only to spill" in a case where the
"stupid code generation" turned into "actively buggy code generation".

If it ever does, we may need to turn the "m" into a "p" and a memory
clobber, which will generate horrendous code. Or we may just need to
tell clang developers that enough is enough, and that they actually
need to take the asm constraints more seriously.

                Linus

Guenter Roeck Jan. 12, 2024, 5:24 a.m. UTC | #14

On Thu, Jan 11, 2024 at 07:40:32PM -0800, Linus Torvalds wrote:
> On Thu, 11 Jan 2024 at 15:57, Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > I wonder if something may be wrong with the definition and use of __m
> > for u64 accesses. The code below also fixes the build problem.
> 
> Ok, that looks like the right workaround.
> 

I think I'll go with the code below. It doesn't touch the MMU code
(which for some reason doesn't result in a build failure), and
it generates

! 5071 "fs/namespace.c" 1
        mov.l   r6,@r1
        mov.l   r7,@(4,r1)

which seems to be correct. It also matches the pattern used
for __put_user_asm(). Does that make sense ?

Thanks,
Guenter

---
diff --git a/arch/sh/include/asm/uaccess_32.h b/arch/sh/include/asm/uaccess_32.h
index 5d7ddc092afd..5ce46b2a95b6 100644
--- a/arch/sh/include/asm/uaccess_32.h
+++ b/arch/sh/include/asm/uaccess_32.h
@@ -176,6 +176,7 @@ do {							\
 } while (0)
 #endif /* CONFIG_MMU */
 
+#ifdef CONFIG_MMU
 #if defined(CONFIG_CPU_LITTLE_ENDIAN)
 #define __put_user_u64(val,addr,retval) \
 ({ \
@@ -221,6 +222,26 @@ __asm__ __volatile__( \
 	: "r" (val), "m" (__m(addr)), "i" (-EFAULT), "0" (retval) \
         : "memory"); })
 #endif
+#else
+#if defined(CONFIG_CPU_LITTLE_ENDIAN)
+#define __put_user_u64(val,addr,retval) \
+({ \
+__asm__ __volatile__( \
+	"mov.l	%R0,%1\n\t" \
+	"mov.l	%S0,%T1\n\t" \
+	: /* no outputs */ \
+	: "r" (val), "m" (*(u64 *)(addr)) \
+	: "memory"); })
+#else
+({ \
+__asm__ __volatile__( \
+	"mov.l	%S0,%1\n\t" \
+	"mov.l	%R0,%T1\n\t" \
+	: /* no outputs */ \
+	: "r" (val), "m" (*(u64 *)(addr)) \
+	: "memory"); })
+#endif
+#endif /* CONFIG_MMU */
 
 extern void __put_user_unknown(void);

Christian Brauner Jan. 12, 2024, 9 a.m. UTC | #15

> NOTE! ENTIRELY untested, but that naming and lack of argument sanity

Thanks for catching that. I've slightly tweaked the attached patch and
put it into vfs.fixes at [1]. There's a fsnotify performance regression
fix for io_uring in there as well. I plan to get both to you Saturday
CET. Let us know if there's any additional issues.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.fixes

John Paul Adrian Glaubitz Jan. 23, 2024, 2:14 p.m. UTC | #16

Hi Guenter,

> with this patch in the tree, all sh4 builds fail with ICE.
> 
> during RTL pass: final
> In file included from fs/namespace.c:11:
> fs/namespace.c: In function '__se_sys_listmount':
> include/linux/syscalls.h:258:9: internal compiler error: in change_address_1, at emit-rtl.c:2275
> 
> I tested with gcc 8.2, 11.3, 11.4, and 12.3. The compiler version
> does not make a difference. Has anyone else seen the same problem ?
> If so, any idea what to do about it ?

I'm not seeing any problems building the SH kernel except some -Werror=missing-prototypes warnings.

I'm using gcc 11.1 from here [1].

Adrian

PS: Please always CC linux-sh and the SH maintainers when reporting issues.

> [1] https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/11.1.0/

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

John Paul Adrian Glaubitz Jan. 23, 2024, 2:14 p.m. UTC | #17

Hi Guenter,

> with this patch in the tree, all sh4 builds fail with ICE.
> 
> during RTL pass: final
> In file included from fs/namespace.c:11:
> fs/namespace.c: In function '__se_sys_listmount':
> include/linux/syscalls.h:258:9: internal compiler error: in change_address_1, at emit-rtl.c:2275
> 
> I tested with gcc 8.2, 11.3, 11.4, and 12.3. The compiler version
> does not make a difference. Has anyone else seen the same problem ?
> If so, any idea what to do about it ?

I'm not seeing any problems building the SH kernel except some -Werror=missing-prototypes warnings.

I'm using gcc 11.1 from here [1].

Adrian

PS: Please always CC linux-sh and the SH maintainers when reporting issues.

> [1] https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/11.1.0/

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Guenter Roeck Jan. 23, 2024, 3:31 p.m. UTC | #18

On 1/23/24 06:14, John Paul Adrian Glaubitz wrote:
> Hi Guenter,
> 
>> with this patch in the tree, all sh4 builds fail with ICE.
>>
>> during RTL pass: final
>> In file included from fs/namespace.c:11:
>> fs/namespace.c: In function '__se_sys_listmount':
>> include/linux/syscalls.h:258:9: internal compiler error: in change_address_1, at emit-rtl.c:2275
>>
>> I tested with gcc 8.2, 11.3, 11.4, and 12.3. The compiler version
>> does not make a difference. Has anyone else seen the same problem ?
>> If so, any idea what to do about it ?
> 
> I'm not seeing any problems building the SH kernel except some -Werror=missing-prototypes warnings.
> 

The problem has been fixed thanks to some guidance from Linus. I did disable
CONFIG_WERROR for sh (and a few other architectures) because it now always
results in pointless build failures on test builds due to commit 0fcb70851fbf
("Makefile.extrawarn: turn on missing-prototypes globally").

> I'm using gcc 11.1 from here [1].
> 
> Adrian
> 
> PS: Please always CC linux-sh and the SH maintainers when reporting issues.
> 

When reporting anything in the linux kernel, it is always a tight rope between
copying too few or too many people or mailing lists. I have been scolded for both.
Replying to the original patch historically results in the fewest complaints,
so I think I'll stick to that.

Guenter

[v4,5/6] add listmount(2) syscall

Commit Message

Comments

Patch