mbox series

[git,pull] d_revalidate pile

Message ID 20250127044721.GD1977892@ZenIV (mailing list archive)
State New
Headers show
Series [git,pull] d_revalidate pile | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-revalidate

Message

Al Viro Jan. 27, 2025, 4:47 a.m. UTC
->d_revalidate() series, along with ->d_iname preliminary work.
One trivial conflict in fs/afs/dir.c - afs_do_lookup_one() has lost
one argument in mainline and switched another from dentry to qstr
in this series.

The following changes since commit 40384c840ea1944d7c5a392e8975ed088ecf0b37:

  Linux 6.13-rc1 (2024-12-01 14:28:56 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-revalidate

for you to fetch changes up to a8ea90bfec66b239dad9a478fc444aa32d3961bc:

  9p: fix ->rename_sem exclusion (2025-01-25 11:51:57 -0500)

----------------------------------------------------------------
Provide stable parent and name to ->d_revalidate() instances

Most of the filesystem methods where we care about dentry name
and parent have their stability guaranteed by the callers;
->d_revalidate() is the major exception.

It's easy enough for callers to supply stable values for
expected name and expected parent of the dentry being
validated.  That kills quite a bit of boilerplate in
->d_revalidate() instances, along with a bunch of races
where they used to access ->d_name without sufficient
precautions.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

----------------------------------------------------------------
Al Viro (20):
      make sure that DNAME_INLINE_LEN is a multiple of word size
      dcache: back inline names with a struct-wrapped array of unsigned long
      make take_dentry_name_snapshot() lockless
      dissolve external_name.u into separate members
      ext4 fast_commit: make use of name_snapshot primitives
      generic_ci_d_compare(): use shortname_storage
      Pass parent directory inode and expected name to ->d_revalidate()
      afs_d_revalidate(): use stable name and parent inode passed by caller
      ceph_d_revalidate(): use stable parent inode passed by caller
      ceph_d_revalidate(): propagate stable name down into request encoding
      fscrypt_d_revalidate(): use stable parent inode passed by caller
      exfat_d_revalidate(): use stable parent inode passed by caller
      vfat_revalidate{,_ci}(): use stable parent inode passed by caller
      fuse_dentry_revalidate(): use stable parent inode and name passed by caller
      gfs2_drevalidate(): use stable parent inode and name passed by caller
      nfs{,4}_lookup_validate(): use stable parent inode passed by caller
      nfs: fix ->d_revalidate() UAF on ->d_name accesses
      ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller
      orangefs_d_revalidate(): use stable parent inode and name passed by caller
      9p: fix ->rename_sem exclusion

 Documentation/filesystems/locking.rst        |  7 +-
 Documentation/filesystems/porting.rst        | 16 +++++
 Documentation/filesystems/vfs.rst            | 24 ++++++-
 fs/9p/v9fs.h                                 |  2 +-
 fs/9p/vfs_dentry.c                           | 26 +++++++-
 fs/afs/dir.c                                 | 40 ++++--------
 fs/ceph/dir.c                                | 25 ++------
 fs/ceph/mds_client.c                         |  9 ++-
 fs/ceph/mds_client.h                         |  2 +
 fs/coda/dir.c                                |  3 +-
 fs/crypto/fname.c                            | 22 ++-----
 fs/dcache.c                                  | 95 ++++++++++++++++------------
 fs/ecryptfs/dentry.c                         | 18 ++++--
 fs/exfat/namei.c                             | 11 +---
 fs/ext4/fast_commit.c                        | 29 ++-------
 fs/ext4/fast_commit.h                        |  3 +-
 fs/fat/namei_vfat.c                          | 19 +++---
 fs/fuse/dir.c                                | 20 +++---
 fs/gfs2/dentry.c                             | 31 ++++-----
 fs/hfs/sysdep.c                              |  3 +-
 fs/jfs/namei.c                               |  3 +-
 fs/kernfs/dir.c                              |  3 +-
 fs/libfs.c                                   | 15 +++--
 fs/namei.c                                   | 18 +++---
 fs/nfs/dir.c                                 | 62 ++++++++----------
 fs/nfs/namespace.c                           |  2 +-
 fs/nfs/nfs3proc.c                            |  5 +-
 fs/nfs/nfs4proc.c                            | 20 +++---
 fs/nfs/proc.c                                |  6 +-
 fs/ocfs2/dcache.c                            | 14 ++--
 fs/orangefs/dcache.c                         | 22 +++----
 fs/overlayfs/super.c                         | 22 ++++++-
 fs/proc/base.c                               |  6 +-
 fs/proc/fd.c                                 |  3 +-
 fs/proc/generic.c                            |  6 +-
 fs/proc/proc_sysctl.c                        |  3 +-
 fs/smb/client/dir.c                          |  3 +-
 fs/tracefs/inode.c                           |  3 +-
 fs/vboxsf/dir.c                              |  3 +-
 include/linux/dcache.h                       | 23 +++++--
 include/linux/fscrypt.h                      |  7 +-
 include/linux/nfs_xdr.h                      |  2 +-
 tools/testing/selftests/bpf/progs/find_vma.c |  2 +-
 43 files changed, 352 insertions(+), 306 deletions(-)

Comments

Sasha Levin Jan. 27, 2025, 5:19 p.m. UTC | #1
On Mon, Jan 27, 2025 at 04:47:21AM +0000, Al Viro wrote:
>->d_revalidate() series, along with ->d_iname preliminary work.
>One trivial conflict in fs/afs/dir.c - afs_do_lookup_one() has lost
>one argument in mainline and switched another from dentry to qstr
>in this series.
>
>The following changes since commit 40384c840ea1944d7c5a392e8975ed088ecf0b37:
>
>  Linux 6.13-rc1 (2024-12-01 14:28:56 -0800)
>
>are available in the Git repository at:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-revalidate
>
>for you to fetch changes up to a8ea90bfec66b239dad9a478fc444aa32d3961bc:
>
>  9p: fix ->rename_sem exclusion (2025-01-25 11:51:57 -0500)
>
>----------------------------------------------------------------
>Provide stable parent and name to ->d_revalidate() instances
>
>Most of the filesystem methods where we care about dentry name
>and parent have their stability guaranteed by the callers;
>->d_revalidate() is the major exception.
>
>It's easy enough for callers to supply stable values for
>expected name and expected parent of the dentry being
>validated.  That kills quite a bit of boilerplate in
>->d_revalidate() instances, along with a bunch of races
>where they used to access ->d_name without sufficient
>precautions.
>
>Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Hey Al,

With this pulled on top of Linus's tree, LKFT is managing to trigger
kfence warnings:

<3>[   62.180289] BUG: KFENCE: out-of-bounds read in d_same_name+0x4c/0xd0
<3>[   62.180289]
<3>[   62.182647] Out-of-bounds read at 0x00000000eedd4b55 (64B right of kfence-#174):
<4>[   62.184178]  d_same_name+0x4c/0xd0
<4>[   62.184717]  d_lookup+0x40/0x78
<4>[   62.185378]  lookup_dcache+0x34/0xb0
<4>[   62.185980]  lookup_one_qstr_excl+0x2c/0xe0
<4>[   62.186523]  do_renameat2+0x324/0x568
<4>[   62.186948]  __arm64_sys_renameat+0x58/0x78
<4>[   62.187484]  invoke_syscall+0x50/0x120
<4>[   62.188220]  el0_svc_common.constprop.0+0x48/0xf0
<4>[   62.189031]  do_el0_svc_compat+0x24/0x48
<4>[   62.189635]  el0_svc_compat+0x34/0xd0
<4>[   62.190018]  el0t_32_sync_handler+0xb0/0x138
<4>[   62.190537]  el0t_32_sync+0x19c/0x1a0
<3>[   62.190946]
<4>[   62.191399] kfence-#174: 0x0000000012d508d5-0x0000000023355f7e, size=64, cache=kmalloc-rcl-64
<4>[   62.191399]
<4>[   62.192260] allocated by task 1 on cpu 0 at 62.177313s (0.014839s ago):
<4>[   62.193504]  __d_alloc+0x15c/0x1d0
<4>[   62.193925]  d_alloc+0x24/0x90
<4>[   62.194204]  lookup_one_qstr_excl+0x68/0xe0
<4>[   62.194741]  filename_create+0xc0/0x1b0
<4>[   62.195129]  do_symlinkat+0x68/0x150
<4>[   62.195657]  __arm64_sys_symlinkat+0x50/0x70
<4>[   62.195954]  invoke_syscall+0x50/0x120
<4>[   62.196461]  el0_svc_common.constprop.0+0x48/0xf0
<4>[   62.197053]  do_el0_svc_compat+0x24/0x48
<4>[   62.197411]  el0_svc_compat+0x34/0xd0
<4>[   62.197849]  el0t_32_sync_handler+0xb0/0x138
<4>[   62.198422]  el0t_32_sync+0x19c/0x1a0
<3>[   62.198857]
<3>[   62.199577] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.13.0 #1
<3>[   62.200435] Hardware name: linux,dummy-virt (DT)

The full log is at: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-8584-gd4639f3659ae/testrun/27028572/suite/log-parser-test/test/kfence-bug-kfence-out-of-bounds-read-in-d_same_name/log

LMK if I should attempt a bisection.
Al Viro Jan. 27, 2025, 5:36 p.m. UTC | #2
On Mon, Jan 27, 2025 at 12:19:54PM -0500, Sasha Levin wrote:

> The full log is at: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-8584-gd4639f3659ae/testrun/27028572/suite/log-parser-test/test/kfence-bug-kfence-out-of-bounds-read-in-d_same_name/log
> 
> LMK if I should attempt a bisection.

Could you try your setup on 58cf9c383c5c "dcache: back inline names
with a struct-wrapped array of unsigned long"?
Linus Torvalds Jan. 27, 2025, 7:12 p.m. UTC | #3
On Mon, 27 Jan 2025 at 09:19, Sasha Levin <sashal@kernel.org> wrote:
>
> With this pulled on top of Linus's tree, LKFT is managing to trigger
> kfence warnings:
>
> <3>[   62.180289] BUG: KFENCE: out-of-bounds read in d_same_name+0x4c/0xd0
> <3>[   62.180289]
> <3>[   62.182647] Out-of-bounds read at 0x00000000eedd4b55 (64B right of kfence-#174):
> <4>[   62.184178]  d_same_name+0x4c/0xd0

Bah. I've said this before, but I really wish LKFT would use debug
builds and run the warnings through scripts/decode_stacktrace.sh.

Getting filenames and line numbers (and inlining information!) for
stack traces can be really really useful.

I think you are using KernelCI builds (at least that was the case last
time), and apparently they are non-debug builds. And that's possibly
due to just resource issues (the debug info does take a lot more disk
space and makes link times much longer too). So it might not be easily
fixable.

But let's see if it might be an option to get this capability. So I'm
adding the kernelci list to see if somebody goes "Oh, that was just an
oversight" and might easily be made to happen. Fingers crossed.

              Linus
Mark Brown Jan. 27, 2025, 8:38 p.m. UTC | #4
On Mon, Jan 27, 2025 at 11:12:11AM -0800, Linus Torvalds wrote:
> On Mon, 27 Jan 2025 at 09:19, Sasha Levin <sashal@kernel.org> wrote:

> > With this pulled on top of Linus's tree, LKFT is managing to trigger
> > kfence warnings:

> > <3>[   62.180289] BUG: KFENCE: out-of-bounds read in d_same_name+0x4c/0xd0
> > <3>[   62.180289]
> > <3>[   62.182647] Out-of-bounds read at 0x00000000eedd4b55 (64B right of kfence-#174):
> > <4>[   62.184178]  d_same_name+0x4c/0xd0

> Bah. I've said this before, but I really wish LKFT would use debug
> builds and run the warnings through scripts/decode_stacktrace.sh.

> Getting filenames and line numbers (and inlining information!) for
> stack traces can be really really useful.

> I think you are using KernelCI builds (at least that was the case last
> time), and apparently they are non-debug builds. And that's possibly
> due to just resource issues (the debug info does take a lot more disk
> space and makes link times much longer too). So it might not be easily
> fixable.

They're not, they're using their own builds done with their tuxsuite
service which is a cloud front end for their tuxmake tool, that does
have the ability to save the vmlinux.  Poking around the LKFT output it
does look like they're doing that for the LKFT builds:

   https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-8584-gd4639f3659ae/testrun/27027254/suite/build/test/gcc-13-tinyconfig/details/
   https://storage.tuxsuite.com/public/linaro/lkft/builds/2sDW1jDhjHPNl1XNezFhsjSlvpI/

so hopefully the information is all there and it's just a question of
people doing the decode when reporting issues from LKFT.

> But let's see if it might be an option to get this capability. So I'm
> adding the kernelci list to see if somebody goes "Oh, that was just an
> oversight" and might easily be made to happen. Fingers crossed.

The issue with KernelCI has been that it's not storing the vmlinux, this
was indeed done due to space issues like you suggest.  With the new
infrastructure that's been rolled out as part of the KernelCI 2.0 revamp
the storage should be a lot more scaleable and so this should hopefully
be a cost issue rather than actual space limits like it used to be so
more tractable.  AFAICT we haven't actually revisited making the
required changes to include the vmlinux in the stored output though, I
filed a ticket:

   https://github.com/kernelci/kernelci-project/issues/509

The builds themselves are generally using standard defconfigs and
derivatives of that so will normally have enough debug info for
decode_stacktrace.sh.  Where they don't we should probably just change
that upstream.
Sasha Levin Jan. 27, 2025, 8:52 p.m. UTC | #5
On Mon, Jan 27, 2025 at 05:36:34PM +0000, Al Viro wrote:
>On Mon, Jan 27, 2025 at 12:19:54PM -0500, Sasha Levin wrote:
>
>> The full log is at: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-8584-gd4639f3659ae/testrun/27028572/suite/log-parser-test/test/kfence-bug-kfence-out-of-bounds-read-in-d_same_name/log
>>
>> LMK if I should attempt a bisection.
>
>Could you try your setup on 58cf9c383c5c "dcache: back inline names
>with a struct-wrapped array of unsigned long"?

It looks like we didn't trigger a warnings on that commit, but I'm not
sure if the issue reproduces easily.

I'll start a bisection and see where it takes me...
Al Viro Jan. 27, 2025, 9:34 p.m. UTC | #6
On Mon, Jan 27, 2025 at 03:52:16PM -0500, Sasha Levin wrote:
> On Mon, Jan 27, 2025 at 05:36:34PM +0000, Al Viro wrote:
> > On Mon, Jan 27, 2025 at 12:19:54PM -0500, Sasha Levin wrote:
> > 
> > > The full log is at: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-8584-gd4639f3659ae/testrun/27028572/suite/log-parser-test/test/kfence-bug-kfence-out-of-bounds-read-in-d_same_name/log
> > > 
> > > LMK if I should attempt a bisection.
> > 
> > Could you try your setup on 58cf9c383c5c "dcache: back inline names
> > with a struct-wrapped array of unsigned long"?
> 
> It looks like we didn't trigger a warnings on that commit, but I'm not
> sure if the issue reproduces easily.
> 
> I'll start a bisection and see where it takes me...

Interesting...  The thing is, that's the only commit that goes anywhere
near ->d_name reassignments.  That access smells like access just one
byte past struct external_name...  wait a minute.

Could that be load_unaligned_zeropad() stepping just over the end of
external name?

If so, then
	a) it's a false positive (and IIRC, it's not the first time
kfence gets confused by that)
	b) your bisection will probably converge to bdd9951f60f9
"dissolve external_name.u into separate members" which is where we'd
ended up with offsetof(struct external_name, name) being 4 modulo 8.

As a quick test, try to flip the order of head and count in
struct external_name and see if that makes the warning go away.
If it does, I'm pretty certain that theory above is correct.
Sasha Levin Jan. 27, 2025, 10:32 p.m. UTC | #7
[ Adding in the LKFT folks ]

On Mon, Jan 27, 2025 at 08:38:50PM +0000, Mark Brown wrote:
>On Mon, Jan 27, 2025 at 11:12:11AM -0800, Linus Torvalds wrote:
>> On Mon, 27 Jan 2025 at 09:19, Sasha Levin <sashal@kernel.org> wrote:
>
>> > With this pulled on top of Linus's tree, LKFT is managing to trigger
>> > kfence warnings:
>
>> > <3>[   62.180289] BUG: KFENCE: out-of-bounds read in d_same_name+0x4c/0xd0
>> > <3>[   62.180289]
>> > <3>[   62.182647] Out-of-bounds read at 0x00000000eedd4b55 (64B right of kfence-#174):
>> > <4>[   62.184178]  d_same_name+0x4c/0xd0
>
>> Bah. I've said this before, but I really wish LKFT would use debug
>> builds and run the warnings through scripts/decode_stacktrace.sh.
>
>> Getting filenames and line numbers (and inlining information!) for
>> stack traces can be really really useful.
>
>> I think you are using KernelCI builds (at least that was the case last
>> time), and apparently they are non-debug builds. And that's possibly
>> due to just resource issues (the debug info does take a lot more disk
>> space and makes link times much longer too). So it might not be easily
>> fixable.
>
>They're not, they're using their own builds done with their tuxsuite
>service which is a cloud front end for their tuxmake tool, that does
>have the ability to save the vmlinux.  Poking around the LKFT output it
>does look like they're doing that for the LKFT builds:
>
>   https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-8584-gd4639f3659ae/testrun/27027254/suite/build/test/gcc-13-tinyconfig/details/
>   https://storage.tuxsuite.com/public/linaro/lkft/builds/2sDW1jDhjHPNl1XNezFhsjSlvpI/
>
>so hopefully the information is all there and it's just a question of
>people doing the decode when reporting issues from LKFT.

My understanding was that becuase CONFIG_DEBUG_INFO_NONE=y is set, we
actually don't have enough info to resolve line numbers.

I've tried running decode_stacktrace.sh on the vmlinux image linked
above, and indeed we can't get line numbers.
Al Viro Jan. 27, 2025, 10:40 p.m. UTC | #8
On Mon, Jan 27, 2025 at 09:34:56PM +0000, Al Viro wrote:

> If so, then
> 	a) it's a false positive (and IIRC, it's not the first time
> kfence gets confused by that)
> 	b) your bisection will probably converge to bdd9951f60f9
> "dissolve external_name.u into separate members" which is where we'd
> ended up with offsetof(struct external_name, name) being 4 modulo 8.
> 
> As a quick test, try to flip the order of head and count in
> struct external_name and see if that makes the warning go away.
> If it does, I'm pretty certain that theory above is correct.

Not quite...   dentry_string_cmp() assumes that ->d_name.name is
word-aligned, so load_unaligned_zeropad() is done only to the
second string (the one we compare against).

Linus, does the following look sane to you as replacement for
bdd9951f60f9?  I'd rather have explicit __aligned(), along
with the comment spelling the constraints out...


diff --git a/fs/dcache.c b/fs/dcache.c
index f387dc97df86..f8d6a2557736 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -295,12 +295,16 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 	return dentry_string_cmp(cs, ct, tcount);
 }
 
+/*
+ * long names are allocated separately from dentry and never modified.
+ * Refcounted, freeing is RCU-delayed.  See take_dentry_name_snapshot()
+ * for the reason why ->count and ->head can't be combined into a union.
+ * dentry_string_cmp() relies upon ->name[] being word-aligned.
+ */
 struct external_name {
-	struct {
-		atomic_t count;		// ->count and ->head can't be combined
-		struct rcu_head head;	// see take_dentry_name_snapshot()
-	} u;
-	unsigned char name[];
+	atomic_t count;
+	struct rcu_head head;
+	unsigned char name[] __aligned(sizeof(unsigned long));
 };
 
 static inline struct external_name *external_name(struct dentry *dentry)
@@ -344,7 +348,7 @@ void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry
 		struct external_name *p;
 		p = container_of(s, struct external_name, name[0]);
 		// get a valid reference
-		if (unlikely(!atomic_inc_not_zero(&p->u.count)))
+		if (unlikely(!atomic_inc_not_zero(&p->count)))
 			goto retry;
 		name->name.name = s;
 	}
@@ -361,8 +365,8 @@ void release_dentry_name_snapshot(struct name_snapshot *name)
 	if (unlikely(name->name.name != name->inline_name.string)) {
 		struct external_name *p;
 		p = container_of(name->name.name, struct external_name, name[0]);
-		if (unlikely(atomic_dec_and_test(&p->u.count)))
-			kfree_rcu(p, u.head);
+		if (unlikely(atomic_dec_and_test(&p->count)))
+			kfree_rcu(p, head);
 	}
 }
 EXPORT_SYMBOL(release_dentry_name_snapshot);
@@ -400,7 +404,7 @@ static void dentry_free(struct dentry *dentry)
 	WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
 	if (unlikely(dname_external(dentry))) {
 		struct external_name *p = external_name(dentry);
-		if (likely(atomic_dec_and_test(&p->u.count))) {
+		if (likely(atomic_dec_and_test(&p->count))) {
 			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
 			return;
 		}
@@ -1681,7 +1685,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 			kmem_cache_free(dentry_cache, dentry); 
 			return NULL;
 		}
-		atomic_set(&p->u.count, 1);
+		atomic_set(&p->count, 1);
 		dname = p->name;
 	} else  {
 		dname = dentry->d_shortname.string;
@@ -2774,15 +2778,15 @@ static void copy_name(struct dentry *dentry, struct dentry *target)
 	if (unlikely(dname_external(dentry)))
 		old_name = external_name(dentry);
 	if (unlikely(dname_external(target))) {
-		atomic_inc(&external_name(target)->u.count);
+		atomic_inc(&external_name(target)->count);
 		dentry->d_name = target->d_name;
 	} else {
 		dentry->d_shortname = target->d_shortname;
 		dentry->d_name.name = dentry->d_shortname.string;
 		dentry->d_name.hash_len = target->d_name.hash_len;
 	}
-	if (old_name && likely(atomic_dec_and_test(&old_name->u.count)))
-		kfree_rcu(old_name, u.head);
+	if (old_name && likely(atomic_dec_and_test(&old_name->count)))
+		kfree_rcu(old_name, head);
 }
 
 /*
Linus Torvalds Jan. 27, 2025, 11:08 p.m. UTC | #9
On Mon, 27 Jan 2025 at 14:41, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Linus, does the following look sane to you as replacement for
> bdd9951f60f9?  I'd rather have explicit __aligned(), along
> with the comment spelling the constraints out...

Ack, looks good to me.

             Linus
Sasha Levin Jan. 27, 2025, 11:26 p.m. UTC | #10
On Mon, Jan 27, 2025 at 10:40:59PM +0000, Al Viro wrote:
>On Mon, Jan 27, 2025 at 09:34:56PM +0000, Al Viro wrote:
>
>> If so, then
>> 	a) it's a false positive (and IIRC, it's not the first time
>> kfence gets confused by that)
>> 	b) your bisection will probably converge to bdd9951f60f9
>> "dissolve external_name.u into separate members" which is where we'd
>> ended up with offsetof(struct external_name, name) being 4 modulo 8.
>>
>> As a quick test, try to flip the order of head and count in
>> struct external_name and see if that makes the warning go away.
>> If it does, I'm pretty certain that theory above is correct.
>
>Not quite...   dentry_string_cmp() assumes that ->d_name.name is
>word-aligned, so load_unaligned_zeropad() is done only to the
>second string (the one we compare against).

Sorry for the silence on my end: this issue doesn't reproduce
consistently, so I need to do more runs for these tests.
Al Viro Jan. 28, 2025, 12:26 a.m. UTC | #11
On Mon, Jan 27, 2025 at 06:26:58PM -0500, Sasha Levin wrote:
> On Mon, Jan 27, 2025 at 10:40:59PM +0000, Al Viro wrote:
> > On Mon, Jan 27, 2025 at 09:34:56PM +0000, Al Viro wrote:
> > 
> > > If so, then
> > > 	a) it's a false positive (and IIRC, it's not the first time
> > > kfence gets confused by that)
> > > 	b) your bisection will probably converge to bdd9951f60f9
> > > "dissolve external_name.u into separate members" which is where we'd
> > > ended up with offsetof(struct external_name, name) being 4 modulo 8.
> > > 
> > > As a quick test, try to flip the order of head and count in
> > > struct external_name and see if that makes the warning go away.
> > > If it does, I'm pretty certain that theory above is correct.
> > 
> > Not quite...   dentry_string_cmp() assumes that ->d_name.name is
> > word-aligned, so load_unaligned_zeropad() is done only to the
> > second string (the one we compare against).
> 
> Sorry for the silence on my end: this issue doesn't reproduce
> consistently, so I need to do more runs for these tests.

Updated version force-pushed; delta is

diff --git a/fs/dcache.c b/fs/dcache.c
index 695406e48937..903142b324e9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -295,10 +295,16 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 	return dentry_string_cmp(cs, ct, tcount);
 }
 
+/*
+ * long names are allocated separately from dentry and never modified.
+ * Refcounted, freeing is RCU-delayed.  See take_dentry_name_snapshot()
+ * for the reason why ->count and ->head can't be combined into a union.
+ * dentry_string_cmp() relies upon ->name[] being word-aligned.
+ */
 struct external_name {
-	struct rcu_head head;	// ->head and ->count can't be combined
-	atomic_t count;		// see take_dentry_name_snapshot()
-	unsigned char name[];
+	atomic_t count;
+	struct rcu_head head;
+	unsigned char name[] __aligned(sizeof(unsigned long));
 };
 
 static inline struct external_name *external_name(struct dentry *dentry)


Could you recheck that one (23e8b451dea4)?  I'll send an update pull request
if nothing wrong shows up.
Al Viro Jan. 28, 2025, 12:31 a.m. UTC | #12
On Tue, Jan 28, 2025 at 12:26:59AM +0000, Al Viro wrote:

> Could you recheck that one (23e8b451dea4)?  I'll send an update pull request

s/update/updated/...
Linus Torvalds Jan. 28, 2025, 12:43 a.m. UTC | #13
On Mon, 27 Jan 2025 at 16:27, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>  struct external_name {
>         atomic_t count;
>         struct rcu_head head;
>         unsigned char name[] __aligned(sizeof(unsigned long));
>  };

Btw, now that the external name looks like this, and has a 32-bit hole
on 64-bit, I wonder if we should just add a length to the external
name.

It would be free on 64-bit, and it would actually mean that we could
atomically load the name and the length for external names and never
worry about any overruns.

The internal name would still race, since the length in the dentry can
change at any time under us, but any code that really needs to avoid
an overrun and doesn't take any locks can still do things like

        if (name == &dentry->d_iname && len >= DNAME_INLINE_LEN)
                .. we know we raced ..

and for things like d_path() or similar at worst it can just limit len
to DNAME_INLINE_LEN-1 or whatever.

I'm thinking mainly dentry_string_cmp(), which currently does things
one byte at a time at least partially for the "I must not overrun a
NUL character because 'len' may be bogus" reason.

Maybe I'm just being silly.

          Linus
Al Viro Jan. 28, 2025, 1:21 a.m. UTC | #14
On Mon, Jan 27, 2025 at 04:43:13PM -0800, Linus Torvalds wrote:

> I'm thinking mainly dentry_string_cmp(), which currently does things
> one byte at a time at least partially for the "I must not overrun a
> NUL character because 'len' may be bogus" reason.

Umm...  On some architectures it does, but mostly that's the ones
where unaligned word loads are costly.  Which target do you have
in mind?
Linus Torvalds Jan. 28, 2025, 1:27 a.m. UTC | #15
On Mon, 27 Jan 2025 at 17:21, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  On some architectures it does, but mostly that's the ones
> where unaligned word loads are costly.  Which target do you have
> in mind?

I was more thinking that we could just make the fallback case be a 'memcmp()'.

It's not like this particular place matters - as you say, that
byte-at-a-time code is only used on architectures that don't enable
the dcache word-at-a-time code (that requires the special "do loads
that can fault" zeropad helper), but we've had some other places where
we'd worry about the string length.

Look at d_path() for another example. That copy_from_kernel_nofault()
in prepend_copy()...

             Linus
Al Viro Jan. 28, 2025, 2:56 a.m. UTC | #16
On Mon, Jan 27, 2025 at 05:27:08PM -0800, Linus Torvalds wrote:
> On Mon, 27 Jan 2025 at 17:21, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...  On some architectures it does, but mostly that's the ones
> > where unaligned word loads are costly.  Which target do you have
> > in mind?
> 
> I was more thinking that we could just make the fallback case be a 'memcmp()'.
> 
> It's not like this particular place matters - as you say, that
> byte-at-a-time code is only used on architectures that don't enable
> the dcache word-at-a-time code (that requires the special "do loads
> that can fault" zeropad helper), but we've had some other places where
> we'd worry about the string length.
> 
> Look at d_path() for another example. That copy_from_kernel_nofault()
> in prepend_copy()...

Hmm...  So something like

/*
 * Returns a pointer to name and a length; length might be
 * inaccurate in case of race with dentry renaming, but
 * it will not exceed the distance from returned pointer
 * to the end of containing object.
 * Caller MUST hold rcu_read_lock().
 * Caller MUST NOT expect the contents of name to remain
 * stable - it can change at any time.
 */
const char *__d_name_rcu(struct dentry *dentry, int *p)
{
	const char *name = smp_load_acquire(&dentry->d_name.name);

	if (unlikely(name != &dentry->d_shortname.string))
		*p = container_of(name, struct external_name, name)->len;
	else if (unlikely((*p = dentry->d_name.name) >= DNAME_INLINE_LEN)
		*p = DNAME_INLINE_LEN - 1;
	return name;
}

with very limited accessibility (basically, dcache.c and d_path.c)

prepend_name() might be able to use that...
Guillaume Tucker Jan. 28, 2025, 9:19 a.m. UTC | #17
On 27/01/2025 9:38 pm, Mark Brown wrote:
>> But let's see if it might be an option to get this capability. So I'm
>> adding the kernelci list to see if somebody goes "Oh, that was just an
>> oversight" and might easily be made to happen. Fingers crossed.
> The issue with KernelCI has been that it's not storing the vmlinux, this
> was indeed done due to space issues like you suggest.  With the new
> infrastructure that's been rolled out as part of the KernelCI 2.0 revamp
> the storage should be a lot more scaleable and so this should hopefully
> be a cost issue rather than actual space limits like it used to be so
> more tractable.  AFAICT we haven't actually revisited making the
> required changes to include the vmlinux in the stored output though, I
> filed a ticket:
> 
>     https://github.com/kernelci/kernelci-project/issues/509
> 
> The builds themselves are generally using standard defconfigs and
> derivatives of that so will normally have enough debug info for
> decode_stacktrace.sh.  Where they don't we should probably just change
> that upstream.

One approach that was suggested a while ago was to do extra debug
builds in automated post-processing jobs whenever a failure is
detected.  This came as an evolution of the automated bisection
which had checks for the good and bad revisions: if a stacktrace
was found while testing the "bad" kernel then it could easily be
decoded since bisections do incremental builds and keep the
vmlinux at hand.

As Sasha mentioned in his email, some particular configs are
required in order to decode the stacktrace (IIRC this is enabled
with arm64_defconfig but not x86).  Debug builds also make larger
binaries and affect runtime behaviour, as we all know.  So one
post-processing check would be to do a special debug build with
the right configs for decoding stacktraces as well as maybe some
sanitizers and extra useful things to add more information.

Builds from bisections or any extra jobs should still be uploaded
to public storage so they would be available for manual
investigation too.  That way, the impact on storage costs and
compute resources would be minimal without any real drawback - it
might take 30min to get the post-processing job to complete but
even that could be optimized and it seems a lot more efficient
than doing debug builds and uploading large vmlinux images all
the time.

Hope this helps!

Cheers,
Guillaume
Mark Brown Jan. 28, 2025, 12:14 p.m. UTC | #18
On Mon, Jan 27, 2025 at 05:32:18PM -0500, Sasha Levin wrote:
> [ Adding in the LKFT folks ]

Oops, sorry - didn't realise they weren't already on the report since it
was on LKFT or I'd have done the same.

> On Mon, Jan 27, 2025 at 08:38:50PM +0000, Mark Brown wrote:

> > have the ability to save the vmlinux.  Poking around the LKFT output it
> > does look like they're doing that for the LKFT builds:

> My understanding was that becuase CONFIG_DEBUG_INFO_NONE=y is set, we
> actually don't have enough info to resolve line numbers.

The arm64 and arm defconfigs which are the main ones I'd end up looking
at both set CONFIG_DEBUG_INFO (_REDUCED in the case of arm64), the trace
you posted was from arm64 so unless it was some config that overrode
things there ought to be info.  x86_64 which I guess you might use more
indeed doesn't have it.

> I've tried running decode_stacktrace.sh on the vmlinux image linked
> above, and indeed we can't get line numbers.

That was a random build I pulled out which turns out to be a tinyconfig
rather than the specific build that was used - if we look at an arm64
defconfig (your trace looked to be from arm64):

    https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-8584-gd4639f3659ae/testrun/27028517/suite/build/test/clang-nightly-defconfig-40bc7ee5/details/
    https://storage.tuxsuite.com/public/linaro/lkft/builds/2sDW1oYDQrsEOOs4L6yoysbu9aS/

I'm able to decode (just feeding a random log message in, no idea what
specific build generated the log message so the line number is almost
certainly wrong):

   $ echo [   62.184178]  d_same_name+0x4c/0xd0 | ./scripts/decode_stacktrace.sh /tmp/vmlinux
   [   62.184178] d_same_name (fs/dcache.c:2127)
Dan Carpenter Jan. 28, 2025, 12:33 p.m. UTC | #19
On Mon, Jan 27, 2025 at 05:32:18PM -0500, Sasha Levin wrote:
> [ Adding in the LKFT folks ]

Ugh...  The website is pretty difficult to navigate.  I've filed a
ticket to hopefully avoid this going forward.  It's a bit late for
the line numbers to be any use but here they are:

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
<3>[   62.179009] ==================================================================
<3>[ 62.180289] BUG: KFENCE: out-of-bounds read in d_same_name (include/asm-generic/rwonce.h:86 fs/dcache.c:243 fs/dcache.c:295 fs/dcache.c:2129)
<3>[   62.180289]
<3>[   62.182647] Out-of-bounds read at 0x00000000eedd4b55 (64B right of kfence-#174):
<4>[ 62.184178] d_same_name (include/asm-generic/rwonce.h:86 fs/dcache.c:243 fs/dcache.c:295 fs/dcache.c:2129)
<4>[ 62.184717] d_lookup (fs/dcache.c:2292)
<4>[ 62.185378] lookup_dcache (fs/namei.c:1654)
<4>[ 62.185980] lookup_one_qstr_excl (fs/namei.c:1678)
<4>[ 62.186523] do_renameat2 (fs/namei.c:5167)
<4>[ 62.186948] __arm64_sys_renameat (fs/namei.c:5264)
<4>[ 62.187484] invoke_syscall (arch/arm64/include/asm/current.h:19 arch/arm64/kernel/syscall.c:54)
<4>[ 62.188220] el0_svc_common.constprop.0 (include/linux/thread_info.h:135 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2))
<4>[ 62.189031] do_el0_svc_compat (arch/arm64/kernel/syscall.c:159)
<4>[ 62.189635] el0_svc_compat (arch/arm64/include/asm/irqflags.h:82 (discriminator 1) arch/arm64/include/asm/irqflags.h:123 (discriminator 1) arch/arm64/include/asm/irqflags.h:136 (discriminator 1) arch/arm64/kernel/entry-common.c:165 (discriminator 1) arch/arm64/kernel/entry-common.c:178 (discriminator 1) arch/arm64/kernel/entry-common.c:888 (discriminator 1))
<4>[ 62.190018] el0t_32_sync_handler (arch/arm64/kernel/entry-common.c:933)
<4>[ 62.190537] el0t_32_sync (arch/arm64/kernel/entry.S:605)
<3>[   62.190946]
<4>[   62.191399] kfence-#174: 0x0000000012d508d5-0x0000000023355f7e, size=64, cache=kmalloc-rcl-64
<4>[   62.191399]
<4>[   62.192260] allocated by task 1 on cpu 0 at 62.177313s (0.014839s ago):
<4>[ 62.193504] __d_alloc (fs/dcache.c:1678)
<4>[ 62.193925] d_alloc (fs/dcache.c:1737)
<4>[ 62.194204] lookup_one_qstr_excl (fs/namei.c:1689)
<4>[ 62.194741] filename_create (fs/namei.c:4083)
<4>[ 62.195129] do_symlinkat (fs/namei.c:4690)
<4>[ 62.195657] __arm64_sys_symlinkat (fs/namei.c:4710)
<4>[ 62.195954] invoke_syscall (arch/arm64/include/asm/current.h:19 arch/arm64/kernel/syscall.c:54)
<4>[ 62.196461] el0_svc_common.constprop.0 (include/linux/thread_info.h:135 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2))
<4>[ 62.197053] do_el0_svc_compat (arch/arm64/kernel/syscall.c:159)
<4>[ 62.197411] el0_svc_compat (arch/arm64/include/asm/irqflags.h:82 (discriminator 1) arch/arm64/include/asm/irqflags.h:123 (discriminator 1) arch/arm64/include/asm/irqflags.h:136 (discriminator 1) arch/arm64/kernel/entry-common.c:165 (discriminator 1) arch/arm64/kernel/entry-common.c:178 (discriminator 1) arch/arm64/kernel/entry-common.c:888 (discriminator 1))
<4>[ 62.197849] el0t_32_sync_handler (arch/arm64/kernel/entry-common.c:933)
<4>[ 62.198422] el0t_32_sync (arch/arm64/kernel/entry.S:605)
<3>[   62.198857]
<3>[   62.199577] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.13.0 #1
<3>[   62.200435] Hardware name: linux,dummy-virt (DT)
<3>[   62.201130] ==================================================================
[?2004hroot@runner-vwmj3eza-project-40964107-concurrent-3:~#

regards,
dan carpenter
Dan Carpenter Jan. 28, 2025, 12:43 p.m. UTC | #20
On Tue, Jan 28, 2025 at 12:14:07PM +0000, 'Mark Brown' via lkft wrote:
> I'm able to decode (just feeding a random log message in, no idea what
> specific build generated the log message so the line number is almost
> certainly wrong):
> 

All the kernels that we're planning to boot have the DEBUG_INFO enabled.
Here is the config and the vmlinux.xz for this one.

https://storage.tuxsuite.com/public/linaro/lkft/builds/2sDW1u8fB268uU3L32J8FqAxYYR/

I don't want to explain how to find this URL because I've filed a ticket
to make it prominent so the instructions will change soon.  And ideally
we would just have the line numbers on the webpage dmesg itself.

regards,
dan carpenter
Sasha Levin Jan. 28, 2025, 7:24 p.m. UTC | #21
On Tue, Jan 28, 2025 at 03:33:14PM +0300, Dan Carpenter wrote:
>On Mon, Jan 27, 2025 at 05:32:18PM -0500, Sasha Levin wrote:
>> [ Adding in the LKFT folks ]
>
>Ugh...  The website is pretty difficult to navigate.  I've filed a
>ticket to hopefully avoid this going forward.  It's a bit late for
>the line numbers to be any use but here they are:

Thanks Dan & Mark! I think I've figured out (and scripted) it for next
time :)