Message ID | 20220701142310.2188015-44-glider@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Add KernelMemorySanitizer infrastructure | expand |
On Fri, Jul 1, 2022 at 7:25 AM Alexander Potapenko <glider@google.com> wrote: > > Under certain circumstances initialization of `unsigned seq` and > `struct inode *inode` passed into step_into() may be skipped. > In particular, if the call to lookup_fast() in walk_component() > returns NULL, and lookup_slow() returns a valid dentry, then the > `seq` and `inode` will remain uninitialized until the call to > step_into() (see [1] for more info). So while I think this needs to be fixed, I think I'd really prefer to make the initialization and/or usage rules stricter or at least clearer. For example, looking around, I think "handle_dotdot()" has the exact same kind of issue, where follow_dotdot[_rcu|() doesn't initialize seq/inode for certain cases, and it's *really* hard to see exactly what the rules are. It turns out that the rules are that seq/inode only get initialized if these routines return a non-NULL and non-error result. Now, that is true for all of these cases - both follow_dotdot*() and lookup_fast(). Possibly others. But the reason follow_dotdot*() doesn't cause the same issue is that the caller actually does the checks that avoid it, and doesn't pass down the uninitialized cases. Now, the other part of the rule is that they only get _used_ for LOOKUP_RCU cases, where they are used to validate the lookup after we've finalized things. Of course, sometimes the "only get used for LOOKUP_RCU" is very very unclear, because even without being an RCU lookup, step_into() will save it into nd->inode/seq. So the values were "used", and initializing them makes them valid, but then *that* copy must not then be used unless RCU was set. Also, sometimes the LOOKUP_RCU check is in the caller, and has actually been cleared, so by the time the actual use comes around, you just have to trust that it was a RCU lookup (ie legitimize_links/root()). So it all seems to work, and this patch then gets rid of one particular odd case, but I think this patch basically hides the compiler warning without really clarifying the code or the rules. Anyway, what I'm building up to here is that I think we should *document* this a bit more. and then make those initializations then be about that documentation. I also get the feeling that "nd->inode/nd->seq" should also be initialized. Right now we have those quite subtle rules about "set vs use", and while a lot of the uses are conditional on LOOKUP_RCU, that makes the code correct, but doesn't solve the "pass uninitialized values as arguments" case. I also think it's very unclear when nd->inode/nd->seq are initialized, and the compiler warning only caught the case where they were *set* (but by arguments that weren't initialized), but didn't necessarily catch the case where they weren't set at all in the first place and then passed around. End result: - I think I'd like path_init() (or set_nameidata) to actually initialize nd->inode and nd->seq unconditionally too. Right now, they get initialized only for that LOOKUP_RCU case. Pretty much exactly the same issue as the one this patch tries to solve, except the compiler didn't notice because it's all indirect through those structure fields and it just didn't track far enough. - I suspect it would be good to initialize them to actual invalid values (rather than NULL/0 - particularly the sequence number) - I look at that follow_dotdot*() caller case, and think "that looks very similar to the lookup_fast() case, but then we have *very* different initialization rules". Al - can you please take a quick look? Linus
On Sat, Jul 02, 2022 at 10:23:16AM -0700, Linus Torvalds wrote: > On Fri, Jul 1, 2022 at 7:25 AM Alexander Potapenko <glider@google.com> wrote: > > > > Under certain circumstances initialization of `unsigned seq` and > > `struct inode *inode` passed into step_into() may be skipped. > > In particular, if the call to lookup_fast() in walk_component() > > returns NULL, and lookup_slow() returns a valid dentry, then the > > `seq` and `inode` will remain uninitialized until the call to > > step_into() (see [1] for more info). > So while I think this needs to be fixed, I think I'd really prefer to > make the initialization and/or usage rules stricter or at least > clearer. Disclaimer: the bits below are nowhere near what I consider a decent explanation; this might serve as the first approximation, but I really need to get some sleep before I get it into coherent shape. 4 hours of sleep today... The rules are * no pathname resolution without successful path_init(). IOW, path_init() failure is an instant fuck off. * path_init() success sets nd->inode. In all cases. * nd->inode must be set - LOOKUP_RCU or not, we simply cannot proceed without it. * in non-RCU mode nd->inode must be equal to nd->path.dentry->d_inode. * in RCU mode nd->inode must be equal to a value observed in nd->path.dentry->d_inode while nd->path.dentry->d_seq had been equal to nd->seq. * step_into() gets a dentry/inode/seq triple. In non-RCU mode inode and seq are ignored; in RCU mode they must satisfy the same relationship we have for nd->path.dentry/nd->inode/nd->seq. > Of course, sometimes the "only get used for LOOKUP_RCU" is very very > unclear, because even without being an RCU lookup, step_into() will > save it into nd->inode/seq. So the values were "used", and > initializing them makes them valid, but then *that* copy must not then > be used unless RCU was set. You are misreading that (and I admit that it badly needs documentation). The whole point of step_into() is to move over to new place. nd->inode *MUST* be set on success, no matter what. > - I look at that follow_dotdot*() caller case, and think "that looks > very similar to the lookup_fast() case, but then we have *very* > different initialization rules". follow_dotdot() might as well lose inodep and seqp arguments - everything would've worked just as well without those. We would've gotten the same complaints about uninitialized values passed to step_into(), though. This if (unlikely(!parent)) error = step_into(nd, WALK_NOFOLLOW, nd->path.dentry, nd->inode, nd->seq); in handle_dots() probably contributes to confusion - it's the "we have stepped on .. in the root, just jump into whatever's mounted on it" case. In non-RCU case it looks like a use of nd->seq in non-RCU mode; however, in that case step_into() will end up ignoring the last two arguments. I'll post something more coherent after I get some sleep. Sorry... ;-/
On Sat, Jul 02, 2022 at 10:23:16AM -0700, Linus Torvalds wrote:
> Al - can you please take a quick look?
FWIW, trying to write a coherent documentation had its usual effect...
The thing is, we don't really need to fetch the inode that early.
All we really care about is that in RCU mode ->d_seq gets sampled
before we fetch ->d_inode *and* we don't treat "it looks negative"
as hard -ENOENT in case of ->d_seq mismatch.
Which can be bloody well left to step_into(). So we don't need
to pass it inode argument at all - just dentry and seq. Makes
a bunch of functions simpler as well...
It does *not* deal with the "uninitialized" seq argument in
!RCU case; I'll handle that in the followup, but that's a separate
story, IMO (and very clearly a false positive).
Cumulative diff follows; splitup is in #work.namei. Comments?
diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..7f4f61ade9e3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1467,7 +1467,7 @@ EXPORT_SYMBOL(follow_down);
* we meet a managed dentry that would need blocking.
*/
static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
- struct inode **inode, unsigned *seqp)
+ unsigned *seqp)
{
struct dentry *dentry = path->dentry;
unsigned int flags = dentry->d_flags;
@@ -1497,13 +1497,6 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
dentry = path->dentry = mounted->mnt.mnt_root;
nd->state |= ND_JUMPED;
*seqp = read_seqcount_begin(&dentry->d_seq);
- *inode = dentry->d_inode;
- /*
- * We don't need to re-check ->d_seq after this
- * ->d_inode read - there will be an RCU delay
- * between mount hash removal and ->mnt_root
- * becoming unpinned.
- */
flags = dentry->d_flags;
continue;
}
@@ -1515,8 +1508,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
}
static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct inode **inode,
- unsigned int *seqp)
+ struct path *path, unsigned int *seqp)
{
bool jumped;
int ret;
@@ -1525,9 +1517,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
unsigned int seq = *seqp;
- if (unlikely(!*inode))
- return -ENOENT;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ if (likely(__follow_mount_rcu(nd, path, seqp)))
return 0;
if (!try_to_unlazy_next(nd, dentry, seq))
return -ECHILD;
@@ -1547,7 +1537,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
if (path->mnt != nd->path.mnt)
mntput(path->mnt);
} else {
- *inode = d_backing_inode(path->dentry);
*seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
return ret;
@@ -1607,9 +1596,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
return dentry;
}
-static struct dentry *lookup_fast(struct nameidata *nd,
- struct inode **inode,
- unsigned *seqp)
+static struct dentry *lookup_fast(struct nameidata *nd, unsigned *seqp)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1628,22 +1615,11 @@ static struct dentry *lookup_fast(struct nameidata *nd,
return NULL;
}
- /*
- * This sequence count validates that the inode matches
- * the dentry name information from lookup.
- */
- *inode = d_backing_inode(dentry);
- if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
- return ERR_PTR(-ECHILD);
-
- /*
+ /*
* This sequence count validates that the parent had no
* changes while we did the lookup of the dentry above.
- *
- * The memory barrier in read_seqcount_begin of child is
- * enough, we can use __read_seqcount_retry here.
*/
- if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
+ if (unlikely(read_seqcount_retry(&parent->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);
*seqp = seq;
@@ -1838,13 +1814,21 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
* for the common case.
*/
static const char *step_into(struct nameidata *nd, int flags,
- struct dentry *dentry, struct inode *inode, unsigned seq)
+ struct dentry *dentry, unsigned seq)
{
struct path path;
- int err = handle_mounts(nd, dentry, &path, &inode, &seq);
+ struct inode *inode;
+ int err = handle_mounts(nd, dentry, &path, &seq);
if (err < 0)
return ERR_PTR(err);
+ inode = path.dentry->d_inode;
+ if (unlikely(!inode)) {
+ if ((nd->flags & LOOKUP_RCU) &&
+ read_seqcount_retry(&path.dentry->d_seq, seq))
+ return ERR_PTR(-ECHILD);
+ return ERR_PTR(-ENOENT);
+ }
if (likely(!d_is_symlink(path.dentry)) ||
((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
(flags & WALK_NOFOLLOW)) {
@@ -1870,9 +1854,7 @@ static const char *step_into(struct nameidata *nd, int flags,
return pick_link(nd, &path, inode, seq, flags);
}
-static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+static struct dentry *follow_dotdot_rcu(struct nameidata *nd, unsigned *seqp)
{
struct dentry *parent, *old;
@@ -1895,7 +1877,6 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
}
old = nd->path.dentry;
parent = old->d_parent;
- *inodep = parent->d_inode;
*seqp = read_seqcount_begin(&parent->d_seq);
if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
return ERR_PTR(-ECHILD);
@@ -1910,9 +1891,7 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd,
return NULL;
}
-static struct dentry *follow_dotdot(struct nameidata *nd,
- struct inode **inodep,
- unsigned *seqp)
+static struct dentry *follow_dotdot(struct nameidata *nd, unsigned *seqp)
{
struct dentry *parent;
@@ -1937,7 +1916,6 @@ static struct dentry *follow_dotdot(struct nameidata *nd,
return ERR_PTR(-ENOENT);
}
*seqp = 0;
- *inodep = parent->d_inode;
return parent;
in_root:
@@ -1952,7 +1930,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
const char *error = NULL;
struct dentry *parent;
- struct inode *inode;
unsigned seq;
if (!nd->root.mnt) {
@@ -1961,17 +1938,17 @@ static const char *handle_dots(struct nameidata *nd, int type)
return error;
}
if (nd->flags & LOOKUP_RCU)
- parent = follow_dotdot_rcu(nd, &inode, &seq);
+ parent = follow_dotdot_rcu(nd, &seq);
else
- parent = follow_dotdot(nd, &inode, &seq);
+ parent = follow_dotdot(nd, &seq);
if (IS_ERR(parent))
return ERR_CAST(parent);
if (unlikely(!parent))
error = step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq);
+ nd->path.dentry, nd->seq);
else
error = step_into(nd, WALK_NOFOLLOW,
- parent, inode, seq);
+ parent, seq);
if (unlikely(error))
return error;
@@ -1995,7 +1972,6 @@ static const char *handle_dots(struct nameidata *nd, int type)
static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
- struct inode *inode;
unsigned seq;
/*
* "." and ".." are special - ".." especially so because it has
@@ -2007,7 +1983,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
put_link(nd);
return handle_dots(nd, nd->last_type);
}
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd, &seq);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (unlikely(!dentry)) {
@@ -2017,7 +1993,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
}
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- return step_into(nd, flags, dentry, inode, seq);
+ return step_into(nd, flags, dentry, seq);
}
/*
@@ -2473,8 +2449,7 @@ static int handle_lookup_down(struct nameidata *nd)
{
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq));
+ return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry, nd->seq));
}
/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3394,7 +3369,6 @@ static const char *open_last_lookups(struct nameidata *nd,
int open_flag = op->open_flag;
bool got_write = false;
unsigned seq;
- struct inode *inode;
struct dentry *dentry;
const char *res;
@@ -3410,7 +3384,7 @@ static const char *open_last_lookups(struct nameidata *nd,
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
- dentry = lookup_fast(nd, &inode, &seq);
+ dentry = lookup_fast(nd, &seq);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (likely(dentry))
@@ -3464,7 +3438,7 @@ static const char *open_last_lookups(struct nameidata *nd,
finish_lookup:
if (nd->depth)
put_link(nd);
- res = step_into(nd, WALK_TRAILING, dentry, inode, seq);
+ res = step_into(nd, WALK_TRAILING, dentry, seq);
if (unlikely(res))
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
return res;
On Mon, Jul 4, 2022 at 4:53 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Sat, Jul 02, 2022 at 10:23:16AM -0700, Linus Torvalds wrote: > > > Al - can you please take a quick look? > > FWIW, trying to write a coherent documentation had its usual effect... > The thing is, we don't really need to fetch the inode that early. > All we really care about is that in RCU mode ->d_seq gets sampled > before we fetch ->d_inode *and* we don't treat "it looks negative" > as hard -ENOENT in case of ->d_seq mismatch. > > Which can be bloody well left to step_into(). So we don't need > to pass it inode argument at all - just dentry and seq. Makes > a bunch of functions simpler as well... > > It does *not* deal with the "uninitialized" seq argument in > !RCU case; I'll handle that in the followup, but that's a separate > story, IMO (and very clearly a false positive). I can confirm that your patch fixes KMSAN reports on inode, yet the following reports still persist: ===================================================== BUG: KMSAN: uninit-value in walk_component+0x5e7/0x6c0 fs/namei.c:1996 walk_component+0x5e7/0x6c0 fs/namei.c:1996 lookup_last fs/namei.c:2445 path_lookupat+0x27d/0x6f0 fs/namei.c:2468 filename_lookup+0x24c/0x800 fs/namei.c:2497 kern_path+0x79/0x3a0 fs/namei.c:2587 init_stat+0x72/0x13f fs/init.c:132 clean_path+0x74/0x24c init/initramfs.c:339 do_name+0x12d/0xc17 init/initramfs.c:371 write_buffer init/initramfs.c:457 unpack_to_rootfs+0x49a/0xd9e init/initramfs.c:510 do_populate_rootfs+0x57/0x40f init/initramfs.c:699 async_run_entry_fn+0x8f/0x400 kernel/async.c:127 process_one_work+0xb27/0x13e0 kernel/workqueue.c:2289 worker_thread+0x1076/0x1d60 kernel/workqueue.c:2436 kthread+0x31b/0x430 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 ??:? Local variable seq created at: walk_component+0x46/0x6c0 fs/namei.c:1981 lookup_last fs/namei.c:2445 path_lookupat+0x27d/0x6f0 fs/namei.c:2468 CPU: 0 PID: 10 Comm: kworker/u9:0 Tainted: G B 5.19.0-rc4-00059-gcf2d25715943-dirty #103 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Workqueue: events_unbound async_run_entry_fn ===================================================== What makes you think they are false positives? Is the scenario I described above: """ In particular, if the call to lookup_fast() in walk_component() returns NULL, and lookup_slow() returns a valid dentry, then the `seq` and `inode` will remain uninitialized until the call to step_into() """ impossible? > Cumulative diff follows; splitup is in #work.namei. Comments? > > diff --git a/fs/namei.c b/fs/namei.c > index 1f28d3f463c3..7f4f61ade9e3 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -1467,7 +1467,7 @@ EXPORT_SYMBOL(follow_down); > * we meet a managed dentry that would need blocking. > */ > static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, > - struct inode **inode, unsigned *seqp) > + unsigned *seqp) > { > struct dentry *dentry = path->dentry; > unsigned int flags = dentry->d_flags; > @@ -1497,13 +1497,6 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, > dentry = path->dentry = mounted->mnt.mnt_root; > nd->state |= ND_JUMPED; > *seqp = read_seqcount_begin(&dentry->d_seq); > - *inode = dentry->d_inode; > - /* > - * We don't need to re-check ->d_seq after this > - * ->d_inode read - there will be an RCU delay > - * between mount hash removal and ->mnt_root > - * becoming unpinned. > - */ > flags = dentry->d_flags; > continue; > } > @@ -1515,8 +1508,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, > } > > static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, > - struct path *path, struct inode **inode, > - unsigned int *seqp) > + struct path *path, unsigned int *seqp) > { > bool jumped; > int ret; > @@ -1525,9 +1517,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, > path->dentry = dentry; > if (nd->flags & LOOKUP_RCU) { > unsigned int seq = *seqp; > - if (unlikely(!*inode)) > - return -ENOENT; > - if (likely(__follow_mount_rcu(nd, path, inode, seqp))) > + if (likely(__follow_mount_rcu(nd, path, seqp))) > return 0; > if (!try_to_unlazy_next(nd, dentry, seq)) > return -ECHILD; > @@ -1547,7 +1537,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, > if (path->mnt != nd->path.mnt) > mntput(path->mnt); > } else { > - *inode = d_backing_inode(path->dentry); > *seqp = 0; /* out of RCU mode, so the value doesn't matter */ > } > return ret; > @@ -1607,9 +1596,7 @@ static struct dentry *__lookup_hash(const struct qstr *name, > return dentry; > } > > -static struct dentry *lookup_fast(struct nameidata *nd, > - struct inode **inode, > - unsigned *seqp) > +static struct dentry *lookup_fast(struct nameidata *nd, unsigned *seqp) > { > struct dentry *dentry, *parent = nd->path.dentry; > int status = 1; > @@ -1628,22 +1615,11 @@ static struct dentry *lookup_fast(struct nameidata *nd, > return NULL; > } > > - /* > - * This sequence count validates that the inode matches > - * the dentry name information from lookup. > - */ > - *inode = d_backing_inode(dentry); > - if (unlikely(read_seqcount_retry(&dentry->d_seq, seq))) > - return ERR_PTR(-ECHILD); > - > - /* > + /* > * This sequence count validates that the parent had no > * changes while we did the lookup of the dentry above. > - * > - * The memory barrier in read_seqcount_begin of child is > - * enough, we can use __read_seqcount_retry here. > */ > - if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq))) > + if (unlikely(read_seqcount_retry(&parent->d_seq, nd->seq))) > return ERR_PTR(-ECHILD); > > *seqp = seq; > @@ -1838,13 +1814,21 @@ static const char *pick_link(struct nameidata *nd, struct path *link, > * for the common case. > */ > static const char *step_into(struct nameidata *nd, int flags, > - struct dentry *dentry, struct inode *inode, unsigned seq) > + struct dentry *dentry, unsigned seq) > { > struct path path; > - int err = handle_mounts(nd, dentry, &path, &inode, &seq); > + struct inode *inode; > + int err = handle_mounts(nd, dentry, &path, &seq); > > if (err < 0) > return ERR_PTR(err); > + inode = path.dentry->d_inode; > + if (unlikely(!inode)) { > + if ((nd->flags & LOOKUP_RCU) && > + read_seqcount_retry(&path.dentry->d_seq, seq)) > + return ERR_PTR(-ECHILD); > + return ERR_PTR(-ENOENT); > + } > if (likely(!d_is_symlink(path.dentry)) || > ((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) || > (flags & WALK_NOFOLLOW)) { > @@ -1870,9 +1854,7 @@ static const char *step_into(struct nameidata *nd, int flags, > return pick_link(nd, &path, inode, seq, flags); > } > > -static struct dentry *follow_dotdot_rcu(struct nameidata *nd, > - struct inode **inodep, > - unsigned *seqp) > +static struct dentry *follow_dotdot_rcu(struct nameidata *nd, unsigned *seqp) > { > struct dentry *parent, *old; > > @@ -1895,7 +1877,6 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd, > } > old = nd->path.dentry; > parent = old->d_parent; > - *inodep = parent->d_inode; > *seqp = read_seqcount_begin(&parent->d_seq); > if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq))) > return ERR_PTR(-ECHILD); > @@ -1910,9 +1891,7 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd, > return NULL; > } > > -static struct dentry *follow_dotdot(struct nameidata *nd, > - struct inode **inodep, > - unsigned *seqp) > +static struct dentry *follow_dotdot(struct nameidata *nd, unsigned *seqp) > { > struct dentry *parent; > > @@ -1937,7 +1916,6 @@ static struct dentry *follow_dotdot(struct nameidata *nd, > return ERR_PTR(-ENOENT); > } > *seqp = 0; > - *inodep = parent->d_inode; > return parent; > > in_root: > @@ -1952,7 +1930,6 @@ static const char *handle_dots(struct nameidata *nd, int type) > if (type == LAST_DOTDOT) { > const char *error = NULL; > struct dentry *parent; > - struct inode *inode; > unsigned seq; > > if (!nd->root.mnt) { > @@ -1961,17 +1938,17 @@ static const char *handle_dots(struct nameidata *nd, int type) > return error; > } > if (nd->flags & LOOKUP_RCU) > - parent = follow_dotdot_rcu(nd, &inode, &seq); > + parent = follow_dotdot_rcu(nd, &seq); > else > - parent = follow_dotdot(nd, &inode, &seq); > + parent = follow_dotdot(nd, &seq); > if (IS_ERR(parent)) > return ERR_CAST(parent); > if (unlikely(!parent)) > error = step_into(nd, WALK_NOFOLLOW, > - nd->path.dentry, nd->inode, nd->seq); > + nd->path.dentry, nd->seq); > else > error = step_into(nd, WALK_NOFOLLOW, > - parent, inode, seq); > + parent, seq); > if (unlikely(error)) > return error; > > @@ -1995,7 +1972,6 @@ static const char *handle_dots(struct nameidata *nd, int type) > static const char *walk_component(struct nameidata *nd, int flags) > { > struct dentry *dentry; > - struct inode *inode; > unsigned seq; > /* > * "." and ".." are special - ".." especially so because it has > @@ -2007,7 +1983,7 @@ static const char *walk_component(struct nameidata *nd, int flags) > put_link(nd); > return handle_dots(nd, nd->last_type); > } > - dentry = lookup_fast(nd, &inode, &seq); > + dentry = lookup_fast(nd, &seq); > if (IS_ERR(dentry)) > return ERR_CAST(dentry); > if (unlikely(!dentry)) { > @@ -2017,7 +1993,7 @@ static const char *walk_component(struct nameidata *nd, int flags) > } > if (!(flags & WALK_MORE) && nd->depth) > put_link(nd); > - return step_into(nd, flags, dentry, inode, seq); > + return step_into(nd, flags, dentry, seq); > } > > /* > @@ -2473,8 +2449,7 @@ static int handle_lookup_down(struct nameidata *nd) > { > if (!(nd->flags & LOOKUP_RCU)) > dget(nd->path.dentry); > - return PTR_ERR(step_into(nd, WALK_NOFOLLOW, > - nd->path.dentry, nd->inode, nd->seq)); > + return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry, nd->seq)); > } > > /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */ > @@ -3394,7 +3369,6 @@ static const char *open_last_lookups(struct nameidata *nd, > int open_flag = op->open_flag; > bool got_write = false; > unsigned seq; > - struct inode *inode; > struct dentry *dentry; > const char *res; > > @@ -3410,7 +3384,7 @@ static const char *open_last_lookups(struct nameidata *nd, > if (nd->last.name[nd->last.len]) > nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > /* we _can_ be in RCU mode here */ > - dentry = lookup_fast(nd, &inode, &seq); > + dentry = lookup_fast(nd, &seq); > if (IS_ERR(dentry)) > return ERR_CAST(dentry); > if (likely(dentry)) > @@ -3464,7 +3438,7 @@ static const char *open_last_lookups(struct nameidata *nd, > finish_lookup: > if (nd->depth) > put_link(nd); > - res = step_into(nd, WALK_TRAILING, dentry, inode, seq); > + res = step_into(nd, WALK_TRAILING, dentry, seq); > if (unlikely(res)) > nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL); > return res;
On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote: > What makes you think they are false positives? Is the scenario I > described above: > > """ > In particular, if the call to lookup_fast() in walk_component() > returns NULL, and lookup_slow() returns a valid dentry, then the > `seq` and `inode` will remain uninitialized until the call to > step_into() > """ > > impossible? Suppose step_into() has been called in non-RCU mode. The first thing it does is int err = handle_mounts(nd, dentry, &path, &seq); if (err < 0) return ERR_PTR(err); And handle_mounts() in non-RCU mode is path->mnt = nd->path.mnt; path->dentry = dentry; if (nd->flags & LOOKUP_RCU) { [unreachable code] } [code not touching seqp] if (unlikely(ret)) { [code not touching seqp] } else { *seqp = 0; /* out of RCU mode, so the value doesn't matter */ } return ret; In other words, the value seq argument of step_into() used to have ends up being never fetched and, in case step_into() gets past that if (err < 0) that value is replaced with zero before any further accesses. So it's a false positive; yes, strictly speaking compiler is allowd to do anything whatsoever if it manages to prove that the value is uninitialized. Realistically, though, especially since unsigned int is not allowed any trapping representations... If you want an test stripped of VFS specifics, consider this: int g(int n, _Bool flag) { if (!flag) n = 0; return n + 1; } int f(int n, _Bool flag) { int x; if (flag) x = n + 2; return g(x, flag); } Do your tools trigger on it?
On Mon, Jul 04, 2022 at 02:44:00PM +0100, Al Viro wrote: > On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote: > > > What makes you think they are false positives? Is the scenario I > > described above: > > > > """ > > In particular, if the call to lookup_fast() in walk_component() > > returns NULL, and lookup_slow() returns a valid dentry, then the > > `seq` and `inode` will remain uninitialized until the call to > > step_into() > > """ > > > > impossible? > > Suppose step_into() has been called in non-RCU mode. The first > thing it does is > int err = handle_mounts(nd, dentry, &path, &seq); > if (err < 0) > return ERR_PTR(err); > > And handle_mounts() in non-RCU mode is > path->mnt = nd->path.mnt; > path->dentry = dentry; > if (nd->flags & LOOKUP_RCU) { > [unreachable code] > } > [code not touching seqp] > if (unlikely(ret)) { > [code not touching seqp] > } else { > *seqp = 0; /* out of RCU mode, so the value doesn't matter */ > } > return ret; Make that [code assigning ret a non-negative value and never using seqp] if (unlikely(ret)) { [code never using seqp or ret] } else { *seqp = 0; /* out of RCU mode, so the value doesn't matter */ } return ret; so if (err < 0) in the caller is equivalent to if (err).
On Mon, Jul 4, 2022 at 3:44 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote: > > > What makes you think they are false positives? Is the scenario I > > described above: > > > > """ > > In particular, if the call to lookup_fast() in walk_component() > > returns NULL, and lookup_slow() returns a valid dentry, then the > > `seq` and `inode` will remain uninitialized until the call to > > step_into() > > """ > > > > impossible? > > Suppose step_into() has been called in non-RCU mode. The first > thing it does is > int err = handle_mounts(nd, dentry, &path, &seq); > if (err < 0) > return ERR_PTR(err); > > And handle_mounts() in non-RCU mode is > path->mnt = nd->path.mnt; > path->dentry = dentry; > if (nd->flags & LOOKUP_RCU) { > [unreachable code] > } > [code not touching seqp] > if (unlikely(ret)) { > [code not touching seqp] > } else { > *seqp = 0; /* out of RCU mode, so the value doesn't matter */ > } > return ret; > > In other words, the value seq argument of step_into() used to have ends up > being never fetched and, in case step_into() gets past that if (err < 0) > that value is replaced with zero before any further accesses. Oh, I see. That is actually what had been discussed here: https://lore.kernel.org/linux-toolchains/20220614144853.3693273-1-glider@google.com/ Indeed, step_into() in its current implementation does not use `seq` (which is noted in the patch description ;)), but the question is whether we want to catch such cases regardless of that. One of the reasons to do so is standard compliance - passing an uninitialized value to a function is UB in C11, as Segher pointed out here: https://lore.kernel.org/linux-toolchains/20220614214039.GA25951@gate.crashing.org/ The compilers may not be smart enough to take advantage of this _yet_, but I wouldn't underestimate their ability to evolve (especially that of Clang). I also believe it's fragile to rely on the callee to ignore certain parameters: it may be doing so today, but if someone changes step_into() tomorrow we may miss it. If I am reading Linus's message here (and the following one from him in the same thread): https://lore.kernel.org/linux-toolchains/CAHk-=whjz3wO8zD+itoerphWem+JZz4uS3myf6u1Wd6epGRgmQ@mail.gmail.com/ correctly, we should be reporting uninitialized values passed to functions, unless those values dissolve after inlining. While this is a bit of a vague criterion, at least for Clang we always know that KMSAN instrumentation is applied after inlining, so the reports we see are due to values that are actually passed between functions. > So it's a false positive; yes, strictly speaking compiler is allowd > to do anything whatsoever if it manages to prove that the value is > uninitialized. Realistically, though, especially since unsigned int > is not allowed any trapping representations... > > If you want an test stripped of VFS specifics, consider this: > > int g(int n, _Bool flag) > { > if (!flag) > n = 0; > return n + 1; > } > > int f(int n, _Bool flag) > { > int x; > > if (flag) > x = n + 2; > return g(x, flag); > } > > Do your tools trigger on it? Currently KMSAN has two modes of operation controlled by CONFIG_KMSAN_CHECK_PARAM_RETVAL. When enabled, that config enforces checks of function parameters at call sites (by applying Clang's -fsanitize-memory-param-retval flag). In that mode the tool would report the attempt to call `g(x, false)` if g() survives inlining. In the case CONFIG_KMSAN_CHECK_PARAM_RETVAL=n, KMSAN won't be reporting the error. Based on the mentioned discussion I decided to make CONFIG_KMSAN_CHECK_PARAM_RETVAL=y the default option. So far it only reported a handful of errors (i.e. enforcing this rule shouldn't be very problematic for the kernel), but it simplifies handling of calls between instrumented and non-instrumented functions that occur e.g. at syscall entry points: knowing that passed-by-value arguments are checked at call sites, we can assume they are initialized in the callees. HTH, Alex
On Mon, Jul 04, 2022 at 02:44:00PM +0100, Al Viro wrote: > On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote: > > > What makes you think they are false positives? Is the scenario I > > described above: > > > > """ > > In particular, if the call to lookup_fast() in walk_component() > > returns NULL, and lookup_slow() returns a valid dentry, then the > > `seq` and `inode` will remain uninitialized until the call to > > step_into() > > """ > > > > impossible? > > Suppose step_into() has been called in non-RCU mode. The first > thing it does is > int err = handle_mounts(nd, dentry, &path, &seq); > if (err < 0) > return ERR_PTR(err); > > And handle_mounts() in non-RCU mode is > path->mnt = nd->path.mnt; > path->dentry = dentry; > if (nd->flags & LOOKUP_RCU) { > [unreachable code] > } > [code not touching seqp] > if (unlikely(ret)) { > [code not touching seqp] > } else { > *seqp = 0; /* out of RCU mode, so the value doesn't matter */ > } > return ret; > > In other words, the value seq argument of step_into() used to have ends up > being never fetched and, in case step_into() gets past that if (err < 0) > that value is replaced with zero before any further accesses. > > So it's a false positive; yes, strictly speaking compiler is allowd > to do anything whatsoever if it manages to prove that the value is > uninitialized. Realistically, though, especially since unsigned int > is not allowed any trapping representations... FWIW, update (and yet untested) branch is in #work.namei. Compared to the previous, we store sampled ->d_seq of the next dentry in nd->next_seq, rather than bothering with local variables. AFAICS, it ends up with better code that way. And both ->seq and ->next_seq are zeroed at the moments when we switch to non-RCU mode (as well as non-RCU path_init()). IMO it looks saner that way. NOTE: it still needs to be tested and probably reordered and massaged; it's not for merge at the moment. Current cumulative diff follows: diff --git a/fs/namei.c b/fs/namei.c index 1f28d3f463c3..b12c6b53579d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -567,7 +567,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags, state; - unsigned seq, m_seq, r_seq; + unsigned seq, next_seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count; @@ -772,6 +772,7 @@ static bool try_to_unlazy(struct nameidata *nd) goto out; if (unlikely(!legitimize_root(nd))) goto out; + nd->seq = nd->next_seq = 0; rcu_read_unlock(); BUG_ON(nd->inode != parent->d_inode); return true; @@ -780,6 +781,7 @@ static bool try_to_unlazy(struct nameidata *nd) nd->path.mnt = NULL; nd->path.dentry = NULL; out: + nd->seq = nd->next_seq = 0; rcu_read_unlock(); return false; } @@ -788,7 +790,6 @@ static bool try_to_unlazy(struct nameidata *nd) * try_to_unlazy_next - try to switch to ref-walk mode. * @nd: nameidata pathwalk data * @dentry: next dentry to step into - * @seq: seq number to check @dentry against * Returns: true on success, false on failure * * Similar to try_to_unlazy(), but here we have the next dentry already @@ -797,7 +798,7 @@ static bool try_to_unlazy(struct nameidata *nd) * Nothing should touch nameidata between try_to_unlazy_next() failure and * terminate_walk(). */ -static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsigned seq) +static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) { BUG_ON(!(nd->flags & LOOKUP_RCU)); @@ -818,7 +819,7 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi */ if (unlikely(!lockref_get_not_dead(&dentry->d_lockref))) goto out; - if (unlikely(read_seqcount_retry(&dentry->d_seq, seq))) + if (unlikely(read_seqcount_retry(&dentry->d_seq, nd->next_seq))) goto out_dput; /* * Sequence counts matched. Now make sure that the root is @@ -826,6 +827,7 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi */ if (unlikely(!legitimize_root(nd))) goto out_dput; + nd->seq = nd->next_seq = 0; rcu_read_unlock(); return true; @@ -834,9 +836,11 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi out1: nd->path.dentry = NULL; out: + nd->seq = nd->next_seq = 0; rcu_read_unlock(); return false; out_dput: + nd->seq = nd->next_seq = 0; rcu_read_unlock(); dput(dentry); return false; @@ -1466,8 +1470,7 @@ EXPORT_SYMBOL(follow_down); * Try to skip to top of mountpoint pile in rcuwalk mode. Fail if * we meet a managed dentry that would need blocking. */ -static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, - struct inode **inode, unsigned *seqp) +static bool __follow_mount_rcu(struct nameidata *nd, struct path *path) { struct dentry *dentry = path->dentry; unsigned int flags = dentry->d_flags; @@ -1496,14 +1499,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, path->mnt = &mounted->mnt; dentry = path->dentry = mounted->mnt.mnt_root; nd->state |= ND_JUMPED; - *seqp = read_seqcount_begin(&dentry->d_seq); - *inode = dentry->d_inode; - /* - * We don't need to re-check ->d_seq after this - * ->d_inode read - there will be an RCU delay - * between mount hash removal and ->mnt_root - * becoming unpinned. - */ + nd->next_seq = read_seqcount_begin(&dentry->d_seq); flags = dentry->d_flags; continue; } @@ -1515,8 +1511,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, } static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, - struct path *path, struct inode **inode, - unsigned int *seqp) + struct path *path) { bool jumped; int ret; @@ -1524,16 +1519,15 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, path->mnt = nd->path.mnt; path->dentry = dentry; if (nd->flags & LOOKUP_RCU) { - unsigned int seq = *seqp; - if (unlikely(!*inode)) - return -ENOENT; - if (likely(__follow_mount_rcu(nd, path, inode, seqp))) + unsigned int seq = nd->next_seq; + if (likely(__follow_mount_rcu(nd, path))) return 0; - if (!try_to_unlazy_next(nd, dentry, seq)) - return -ECHILD; - // *path might've been clobbered by __follow_mount_rcu() + // *path and nd->next_seq might've been just clobbered path->mnt = nd->path.mnt; path->dentry = dentry; + nd->next_seq = seq; + if (!try_to_unlazy_next(nd, dentry)) + return -ECHILD; } ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags); if (jumped) { @@ -1546,9 +1540,6 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry, dput(path->dentry); if (path->mnt != nd->path.mnt) mntput(path->mnt); - } else { - *inode = d_backing_inode(path->dentry); - *seqp = 0; /* out of RCU mode, so the value doesn't matter */ } return ret; } @@ -1607,9 +1598,7 @@ static struct dentry *__lookup_hash(const struct qstr *name, return dentry; } -static struct dentry *lookup_fast(struct nameidata *nd, - struct inode **inode, - unsigned *seqp) +static struct dentry *lookup_fast(struct nameidata *nd) { struct dentry *dentry, *parent = nd->path.dentry; int status = 1; @@ -1620,37 +1609,24 @@ static struct dentry *lookup_fast(struct nameidata *nd, * going to fall back to non-racy lookup. */ if (nd->flags & LOOKUP_RCU) { - unsigned seq; - dentry = __d_lookup_rcu(parent, &nd->last, &seq); + dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq); if (unlikely(!dentry)) { if (!try_to_unlazy(nd)) return ERR_PTR(-ECHILD); return NULL; } - /* - * This sequence count validates that the inode matches - * the dentry name information from lookup. - */ - *inode = d_backing_inode(dentry); - if (unlikely(read_seqcount_retry(&dentry->d_seq, seq))) - return ERR_PTR(-ECHILD); - - /* + /* * This sequence count validates that the parent had no * changes while we did the lookup of the dentry above. - * - * The memory barrier in read_seqcount_begin of child is - * enough, we can use __read_seqcount_retry here. */ - if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq))) + if (unlikely(read_seqcount_retry(&parent->d_seq, nd->seq))) return ERR_PTR(-ECHILD); - *seqp = seq; status = d_revalidate(dentry, nd->flags); if (likely(status > 0)) return dentry; - if (!try_to_unlazy_next(nd, dentry, seq)) + if (!try_to_unlazy_next(nd, dentry)) return ERR_PTR(-ECHILD); if (status == -ECHILD) /* we'd been told to redo it in non-rcu mode */ @@ -1731,7 +1707,7 @@ static inline int may_lookup(struct user_namespace *mnt_userns, return inode_permission(mnt_userns, nd->inode, MAY_EXEC); } -static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq) +static int reserve_stack(struct nameidata *nd, struct path *link) { if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) return -ELOOP; @@ -1746,7 +1722,7 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq) if (nd->flags & LOOKUP_RCU) { // we need to grab link before we do unlazy. And we can't skip // unlazy even if we fail to grab the link - cleanup needs it - bool grabbed_link = legitimize_path(nd, link, seq); + bool grabbed_link = legitimize_path(nd, link, nd->next_seq); if (!try_to_unlazy(nd) || !grabbed_link) return -ECHILD; @@ -1760,11 +1736,11 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq) enum {WALK_TRAILING = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4}; static const char *pick_link(struct nameidata *nd, struct path *link, - struct inode *inode, unsigned seq, int flags) + struct inode *inode, int flags) { struct saved *last; const char *res; - int error = reserve_stack(nd, link, seq); + int error = reserve_stack(nd, link); if (unlikely(error)) { if (!(nd->flags & LOOKUP_RCU)) @@ -1774,7 +1750,7 @@ static const char *pick_link(struct nameidata *nd, struct path *link, last = nd->stack + nd->depth++; last->link = *link; clear_delayed_call(&last->done); - last->seq = seq; + last->seq = nd->next_seq; if (flags & WALK_TRAILING) { error = may_follow_link(nd, inode); @@ -1836,15 +1812,25 @@ static const char *pick_link(struct nameidata *nd, struct path *link, * to do this check without having to look at inode->i_op, * so we keep a cache of "no, this doesn't need follow_link" * for the common case. + * + * NOTE: nd->next_seq must be the validated sampled dentry->d_seq */ static const char *step_into(struct nameidata *nd, int flags, - struct dentry *dentry, struct inode *inode, unsigned seq) + struct dentry *dentry) { struct path path; - int err = handle_mounts(nd, dentry, &path, &inode, &seq); + struct inode *inode; + int err = handle_mounts(nd, dentry, &path); if (err < 0) return ERR_PTR(err); + inode = path.dentry->d_inode; + if (unlikely(!inode)) { + if ((nd->flags & LOOKUP_RCU) && + read_seqcount_retry(&path.dentry->d_seq, nd->next_seq)) + return ERR_PTR(-ECHILD); + return ERR_PTR(-ENOENT); + } if (likely(!d_is_symlink(path.dentry)) || ((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) || (flags & WALK_NOFOLLOW)) { @@ -1856,23 +1842,21 @@ static const char *step_into(struct nameidata *nd, int flags, } nd->path = path; nd->inode = inode; - nd->seq = seq; + nd->seq = nd->next_seq; return NULL; } if (nd->flags & LOOKUP_RCU) { /* make sure that d_is_symlink above matches inode */ - if (read_seqcount_retry(&path.dentry->d_seq, seq)) + if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq)) return ERR_PTR(-ECHILD); } else { if (path.mnt == nd->path.mnt) mntget(path.mnt); } - return pick_link(nd, &path, inode, seq, flags); + return pick_link(nd, &path, inode, flags); } -static struct dentry *follow_dotdot_rcu(struct nameidata *nd, - struct inode **inodep, - unsigned *seqp) +static struct dentry *follow_dotdot_rcu(struct nameidata *nd) { struct dentry *parent, *old; @@ -1895,8 +1879,7 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd, } old = nd->path.dentry; parent = old->d_parent; - *inodep = parent->d_inode; - *seqp = read_seqcount_begin(&parent->d_seq); + nd->next_seq = read_seqcount_begin(&parent->d_seq); if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq))) return ERR_PTR(-ECHILD); if (unlikely(!path_connected(nd->path.mnt, parent))) @@ -1907,12 +1890,11 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd, return ERR_PTR(-ECHILD); if (unlikely(nd->flags & LOOKUP_BENEATH)) return ERR_PTR(-ECHILD); - return NULL; + nd->next_seq = nd->seq; + return nd->path.dentry; } -static struct dentry *follow_dotdot(struct nameidata *nd, - struct inode **inodep, - unsigned *seqp) +static struct dentry *follow_dotdot(struct nameidata *nd) { struct dentry *parent; @@ -1936,15 +1918,12 @@ static struct dentry *follow_dotdot(struct nameidata *nd, dput(parent); return ERR_PTR(-ENOENT); } - *seqp = 0; - *inodep = parent->d_inode; return parent; in_root: if (unlikely(nd->flags & LOOKUP_BENEATH)) return ERR_PTR(-EXDEV); - dget(nd->path.dentry); - return NULL; + return dget(nd->path.dentry); } static const char *handle_dots(struct nameidata *nd, int type) @@ -1952,8 +1931,6 @@ static const char *handle_dots(struct nameidata *nd, int type) if (type == LAST_DOTDOT) { const char *error = NULL; struct dentry *parent; - struct inode *inode; - unsigned seq; if (!nd->root.mnt) { error = ERR_PTR(set_root(nd)); @@ -1961,17 +1938,12 @@ static const char *handle_dots(struct nameidata *nd, int type) return error; } if (nd->flags & LOOKUP_RCU) - parent = follow_dotdot_rcu(nd, &inode, &seq); + parent = follow_dotdot_rcu(nd); else - parent = follow_dotdot(nd, &inode, &seq); + parent = follow_dotdot(nd); if (IS_ERR(parent)) return ERR_CAST(parent); - if (unlikely(!parent)) - error = step_into(nd, WALK_NOFOLLOW, - nd->path.dentry, nd->inode, nd->seq); - else - error = step_into(nd, WALK_NOFOLLOW, - parent, inode, seq); + error = step_into(nd, WALK_NOFOLLOW, parent); if (unlikely(error)) return error; @@ -1995,8 +1967,6 @@ static const char *handle_dots(struct nameidata *nd, int type) static const char *walk_component(struct nameidata *nd, int flags) { struct dentry *dentry; - struct inode *inode; - unsigned seq; /* * "." and ".." are special - ".." especially so because it has * to be able to know about the current root directory and @@ -2007,7 +1977,7 @@ static const char *walk_component(struct nameidata *nd, int flags) put_link(nd); return handle_dots(nd, nd->last_type); } - dentry = lookup_fast(nd, &inode, &seq); + dentry = lookup_fast(nd); if (IS_ERR(dentry)) return ERR_CAST(dentry); if (unlikely(!dentry)) { @@ -2017,7 +1987,7 @@ static const char *walk_component(struct nameidata *nd, int flags) } if (!(flags & WALK_MORE) && nd->depth) put_link(nd); - return step_into(nd, flags, dentry, inode, seq); + return step_into(nd, flags, dentry); } /* @@ -2372,6 +2342,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags) flags &= ~LOOKUP_RCU; if (flags & LOOKUP_RCU) rcu_read_lock(); + else + nd->seq = nd->next_seq = 0; nd->flags = flags; nd->state |= ND_JUMPED; @@ -2473,8 +2445,8 @@ static int handle_lookup_down(struct nameidata *nd) { if (!(nd->flags & LOOKUP_RCU)) dget(nd->path.dentry); - return PTR_ERR(step_into(nd, WALK_NOFOLLOW, - nd->path.dentry, nd->inode, nd->seq)); + nd->next_seq = nd->seq; + return PTR_ERR(step_into(nd, WALK_NOFOLLOW, nd->path.dentry)); } /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */ @@ -3393,8 +3365,6 @@ static const char *open_last_lookups(struct nameidata *nd, struct dentry *dir = nd->path.dentry; int open_flag = op->open_flag; bool got_write = false; - unsigned seq; - struct inode *inode; struct dentry *dentry; const char *res; @@ -3410,7 +3380,7 @@ static const char *open_last_lookups(struct nameidata *nd, if (nd->last.name[nd->last.len]) nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; /* we _can_ be in RCU mode here */ - dentry = lookup_fast(nd, &inode, &seq); + dentry = lookup_fast(nd); if (IS_ERR(dentry)) return ERR_CAST(dentry); if (likely(dentry)) @@ -3464,7 +3434,7 @@ static const char *open_last_lookups(struct nameidata *nd, finish_lookup: if (nd->depth) put_link(nd); - res = step_into(nd, WALK_TRAILING, dentry, inode, seq); + res = step_into(nd, WALK_TRAILING, dentry); if (unlikely(res)) nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL); return res;
On Mon, Jul 04, 2022 at 05:49:13PM +0200, Alexander Potapenko wrote: > This e-mail is confidential. If you received this communication by > mistake, please don't forward it to anyone else, please erase all > copies and attachments, and please let me know that it has gone to the > wrong person. This is not compatible with Linux kernel development, sorry. Now deleted.
On Mon, Jul 4, 2022 at 6:03 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Mon, Jul 04, 2022 at 05:49:13PM +0200, Alexander Potapenko wrote: > > This e-mail is confidential. If you received this communication by > > mistake, please don't forward it to anyone else, please erase all > > copies and attachments, and please let me know that it has gone to the > > wrong person. > > This is not compatible with Linux kernel development, sorry. > > Now deleted. Sorry, I shouldn't have added those to public emails. Apologies for the inconvenience.
On Mon, Jul 4, 2022 at 6:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Mon, Jul 04, 2022 at 02:44:00PM +0100, Al Viro wrote: > > On Mon, Jul 04, 2022 at 10:20:53AM +0200, Alexander Potapenko wrote: > > > > > What makes you think they are false positives? Is the scenario I > > > described above: > > > > > > """ > > > In particular, if the call to lookup_fast() in walk_component() > > > returns NULL, and lookup_slow() returns a valid dentry, then the > > > `seq` and `inode` will remain uninitialized until the call to > > > step_into() > > > """ > > > > > > impossible? > > > > Suppose step_into() has been called in non-RCU mode. The first > > thing it does is > > int err = handle_mounts(nd, dentry, &path, &seq); > > if (err < 0) > > return ERR_PTR(err); > > > > And handle_mounts() in non-RCU mode is > > path->mnt = nd->path.mnt; > > path->dentry = dentry; > > if (nd->flags & LOOKUP_RCU) { > > [unreachable code] > > } > > [code not touching seqp] > > if (unlikely(ret)) { > > [code not touching seqp] > > } else { > > *seqp = 0; /* out of RCU mode, so the value doesn't matter */ > > } > > return ret; > > > > In other words, the value seq argument of step_into() used to have ends up > > being never fetched and, in case step_into() gets past that if (err < 0) > > that value is replaced with zero before any further accesses. > > > > So it's a false positive; yes, strictly speaking compiler is allowd > > to do anything whatsoever if it manages to prove that the value is > > uninitialized. Realistically, though, especially since unsigned int > > is not allowed any trapping representations... > > FWIW, update (and yet untested) branch is in #work.namei. Compared to the > previous, we store sampled ->d_seq of the next dentry in nd->next_seq, > rather than bothering with local variables. AFAICS, it ends up with > better code that way. And both ->seq and ->next_seq are zeroed at the > moments when we switch to non-RCU mode (as well as non-RCU path_init()). > > IMO it looks saner that way. NOTE: it still needs to be tested and probably > reordered and massaged; it's not for merge at the moment. Current cumulative > diff follows: I confirm all KMSAN reports are gone as a result of applying this patch.
On Sun, Jul 3, 2022 at 7:53 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > FWIW, trying to write a coherent documentation had its usual effect... > The thing is, we don't really need to fetch the inode that early. Hmm. I like the patch, but as I was reading through it, I had a question... In particular, I'd like it even more if each step when the sequence number is updated also had a comment about what then protects the previous sequence number up to and over that new sequence point. For example, in __follow_mount_rcu(), when we jump to a new mount point, and that sequence has *seqp = read_seqcount_begin(&dentry->d_seq); to reset the sequence number to the new path we jumped into. But I don't actually see what checks the previous sequence number in that path. We just reset it to the new one. In contrast, in lookup_fast(), we get the new sequence number from __d_lookup_rcu(), and then after getting the new one and before "instantiating" it, we will revalidate the parent sequence number. So lookup_fast() has that "chain of sequence numbers". For __follow_mount_rcu it looks like validating the previous sequence number is left to the caller, which then does try_to_unlazy_next(). So when reading this code, my reaction was that it really would have been much nicer to have that kind of clear "handoff" of one sequence number domain to the next that lookup_fast() has. IOW, I think it would be lovely to clarify the sequence number handoff. I only quickly scanned your second patch for this, it does seem to at least collect it all into try_to_unlazy_next(). So maybe you already looked at exactly this, but it would be good to be quite explicit about the sequence number logic because it's "a bit opaque" to say the least. Linus
On Mon, Jul 04, 2022 at 05:49:13PM +0200, Alexander Potapenko wrote: > One of the reasons to do so is standard compliance - passing an > uninitialized value to a function is UB in C11, as Segher pointed out > here: https://lore.kernel.org/linux-toolchains/20220614214039.GA25951@gate.crashing.org/ > The compilers may not be smart enough to take advantage of this _yet_, > but I wouldn't underestimate their ability to evolve (especially that > of Clang). GCC doesn't currently detect this UB, and doesn't even warn or error for this, although that shouldn't be hard to do: it is all completely local. An error is warranted here, and you won't get UB ever either then. > I also believe it's fragile to rely on the callee to ignore certain > parameters: it may be doing so today, but if someone changes > step_into() tomorrow we may miss it. There isn't any choice usually, this is C, do you want varargs? :-) But yes, you always should only pass "safe" values; callers should do their part, and not assume the callee will do in the future as it does now. Defensive programming is mostly about defending your own sanity! Segher
On Mon, Jul 04, 2022 at 10:36:05AM -0700, Linus Torvalds wrote: > For example, in __follow_mount_rcu(), when we jump to a new mount > point, and that sequence has > > *seqp = read_seqcount_begin(&dentry->d_seq); > > to reset the sequence number to the new path we jumped into. > > But I don't actually see what checks the previous sequence number in > that path. We just reset it to the new one. Theoretically it could be a problem. We have /mnt/foo/bar and /mnt/baz/bar. Something's mounted on /mnt/foo, hiding /mnt/foo/bar. We start a pathwalk for /mnt/baz/bar, someone umounts /mnt/foo and swaps /mnt/foo to /mnt/baz before we get there. We are doomed to get -ECHILD from an attempt to legitimize in the end, no matter what. However, we might get a hard error (-ENOENT, for example) before that, if we pick up the old mount that used to be on top of /mnt/foo (now /mnt/baz) and had been detached before the damn thing had become /mnt/baz and notice that there's no "bar" in its root. It used to be impossible (rename would've failed if the target had been non-empty and had we managed to empty it first, well, there's your point when -ENOENT would've been accurate). With exchange... Yes, it's a possible race. Might need to add if (read_seqretry(&mount_lock, nd->m_seq)) return false; in there. And yes, it's a nice demonstration of how subtle and brittle RCU pathwalk is - nobody noticed this bit of fun back when RENAME_EXCHANGE had been added... It got a lot more readable these days, but... > For __follow_mount_rcu it looks like validating the previous sequence > number is left to the caller, which then does try_to_unlazy_next(). Not really - the caller goes there only if we have __follow_mount_rcu() say "it's too tricky for me, get out of RCU mode and deal with it there". Anyway, I've thrown a mount_lock check in there, running xfstests to see how it goes...
On Mon, Jul 4, 2022 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > Anyway, I've thrown a mount_lock check in there, running xfstests to > see how it goes... So my reaction had been that it would be good to just do something like this: diff --git a/fs/namei.c b/fs/namei.c index 1f28d3f463c3..25c4bcc91142 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1493,11 +1493,18 @@ static bool __follow_mount_rcu(struct n... if (flags & DCACHE_MOUNTED) { struct mount *mounted = __lookup_mnt(path->mnt, dentry); if (mounted) { + struct dentry *old_dentry = dentry; + unsigned old_seq = *seqp; + path->mnt = &mounted->mnt; dentry = path->dentry = mounted->mnt.mnt_root; nd->state |= ND_JUMPED; *seqp = read_seqcount_begin(&dentry->d_seq); *inode = dentry->d_inode; + + if (read_seqcount_retry(&old_dentry->d_seq, old_seq)) + return false; + /* * We don't need to re-check ->d_seq after this * ->d_inode read - there will be an RCU delay but the above is just whitespace-damaged random monkey-scribbling by yours truly. More like a "shouldn't we do something like this" than a serious patch, in other words. IOW, it has *NOT* had a lot of real thought behind it. Purely a "shouldn't we always clearly check the old sequence number after we've picked up the new one?" Linus
On Mon, Jul 04, 2022 at 12:16:24PM -0700, Linus Torvalds wrote: > On Mon, Jul 4, 2022 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Anyway, I've thrown a mount_lock check in there, running xfstests to > > see how it goes... > > So my reaction had been that it would be good to just do something like this: > > diff --git a/fs/namei.c b/fs/namei.c > index 1f28d3f463c3..25c4bcc91142 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -1493,11 +1493,18 @@ static bool __follow_mount_rcu(struct n... > if (flags & DCACHE_MOUNTED) { > struct mount *mounted = __lookup_mnt(path->mnt, dentry); > if (mounted) { > + struct dentry *old_dentry = dentry; > + unsigned old_seq = *seqp; > + > path->mnt = &mounted->mnt; > dentry = path->dentry = mounted->mnt.mnt_root; > nd->state |= ND_JUMPED; > *seqp = read_seqcount_begin(&dentry->d_seq); > *inode = dentry->d_inode; > + > + if (read_seqcount_retry(&old_dentry->d_seq, old_seq)) > + return false; > + > /* > * We don't need to re-check ->d_seq after this > * ->d_inode read - there will be an RCU delay > > but the above is just whitespace-damaged random monkey-scribbling by > yours truly. > > More like a "shouldn't we do something like this" than a serious > patch, in other words. > > IOW, it has *NOT* had a lot of real thought behind it. Purely a > "shouldn't we always clearly check the old sequence number after we've > picked up the new one?" You are checking the wrong thing here. It's really about mount_lock - ->d_seq is *not* bumped when we or attach in some namespace. If there's a mismatch, RCU pathwalk is doomed anyway (it'll fail any form of unlazy) and we might as well bugger off. If it *does* match, we know that both mountpoint and root had been pinned since before the pathwalk, remain pinned as of that check and had a mount connecting them all along. IOW, if we could have arrived to this dentry at any point, we would have gotten that dentry as the next step. We sample into nd->m_seq in path_init() and we want it to stay unchanged all along. If it does, all mountpoints and roots we observe are pinned and their association with each other is stable. It's not dentry -> dentry, it's dentry -> mount -> dentry. The following would've been safe: find mountpoint sample ->d_seq verify whatever had lead us to mountpoint sample mount_lock find mount verify mountpoint's ->d_seq find root of mounted sample its ->d_seq verify mount_lock Correct? Now, note that the last step done against the value we'd sampled in path_init() guarantees that mount hash had not changed through all of that. Which is to say, we can pretend that we'd found mount before ->d_seq of mountpoint might've changed, leaving us with find mountpoint sample ->d_seq verify whatever had lead us to mountpoint find mount find root of mounted sample its ->d_seq verify mount_lock
On Mon, Jul 4, 2022 at 12:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > You are checking the wrong thing here. It's really about mount_lock - > ->d_seq is *not* bumped when we or attach in some namespace. I think we're talking past each other. Yes, we need to check the mount sequence lock too, because we're doing that mount traversal. But I think we *also* need to check the dentry sequence count, because the dentry itself could have been moved to another parent. The two are entirely independent, aren't they? And the dentry sequence point check should go along with the "we're now updating the sequence point from the old dentry to the new". The mount point check should go around the "check dentry mount point", but it's a separate issue from the whole "we are now jumping to a different dentry, we should check that the previous dentry hasn't changed". Linus
On Mon, Jul 04, 2022 at 01:24:48PM -0700, Linus Torvalds wrote: > On Mon, Jul 4, 2022 at 12:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > You are checking the wrong thing here. It's really about mount_lock - > > ->d_seq is *not* bumped when we or attach in some namespace. > > I think we're talking past each other. We might be. > Yes, we need to check the mount sequence lock too, because we're doing > that mount traversal. > > But I think we *also* need to check the dentry sequence count, because > the dentry itself could have been moved to another parent. Why is that a problem? It could have been moved to another parent, but so it could after we'd crossed to the mounted and we wouldn't have noticed (or cared). What the chain of seqcount checks gives us is that with some timings it would be possible to traverse that path, not that it had remained valid through the entire pathwalk. What I'm suggesting is to treat transition from mountpoint to mount as happening instantly, with transition from mount to root sealed by mount_lock check. If that succeeds, there had been possible history in which refwalk would have passed through the same dentry/mount/dentry and arrived to the root dentry when it had the sampled ->d_seq value. Sure, mountpoint might be moved since we'd reached it. And the mount would move with it, so we can pretend that we'd won the race and got into the mount before it had the mountpoint had been moved. Am I missing something fundamental about the things the sequence of sampling and verifications gives us? I'd always thought it's about verifying that resulting history would be possible for a non-RCU pathwalk with the right timings. What am I missing?
On Mon, Jul 4, 2022 at 1:24 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The mount point check should go around the "check dentry mount point", > but it's a separate issue from the whole "we are now jumping to a > different dentry, we should check that the previous dentry hasn't > changed". Maybe it doesn't really matter, because we never actually end up dereferencing the previous dentry (exactly since we're following the mount point on it). It feels like the sequence point checks are basically tied to the "we're looking at the inode that the dentry pointed to", and because the mount-point traversal doesn't need to look at the inode, the sequence point check also isn't done. But it feels wrong to traverse a dentry under RCU - even if we don't then look at the inode itself - without having verified that the dentry is still valid. Yes, the d_seq lock protects against the inode going away (aka "unlink()") and that cannot happen when it's a mount-point. But it _also_ ends up changing for __d_move() when the name of the dentry changes. And I think that name change is relevant even to "look up a mount point", exactly because we used that name to look up the dentry in the first place, so if the name is changing, we shouldn't traverse that mount point. But I may have just confused myself terminally here. Linus
On Mon, Jul 4, 2022 at 1:46 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > Why is that a problem? It could have been moved to another parent, > but so it could after we'd crossed to the mounted and we wouldn't have > noticed (or cared). Yeah, see my other email. I agree that it might be a "we don't actually care" situation, where all we care about that the name was valid at one point (when we picked up that sequence point). So maybe we don't care about closing it. But even if so, I think it might warrant a comment, because I still feel like we're basically "throwing away" our previous sequence point information without ever checking it. Maybe all we ever care about is basically "this sequence point protects the dentry inode pointer for the next lookup", and when it comes to mount points that ends up being immaterial. Linus
On Mon, Jul 04, 2022 at 01:51:16PM -0700, Linus Torvalds wrote: > On Mon, Jul 4, 2022 at 1:46 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Why is that a problem? It could have been moved to another parent, > > but so it could after we'd crossed to the mounted and we wouldn't have > > noticed (or cared). > > Yeah, see my other email. > > I agree that it might be a "we don't actually care" situation, where > all we care about that the name was valid at one point (when we picked > up that sequence point). So maybe we don't care about closing it. > > But even if so, I think it might warrant a comment, because I still > feel like we're basically "throwing away" our previous sequence point > information without ever checking it. > > Maybe all we ever care about is basically "this sequence point > protects the dentry inode pointer for the next lookup", and when it > comes to mount points that ends up being immaterial. There is a problem, actually, but it's in a different place... OK, let me try to write something resembling a formal proof and see what falls out.
On Fri, Jul 1, 2022 at 4:25 PM Alexander Potapenko <glider@google.com> wrote: > > Under certain circumstances initialization of `unsigned seq` and > `struct inode *inode` passed into step_into() may be skipped. > In particular, if the call to lookup_fast() in walk_component() > returns NULL, and lookup_slow() returns a valid dentry, then the > `seq` and `inode` will remain uninitialized until the call to > step_into() (see [1] for more info). > > Right now step_into() does not use these uninitialized values, > yet passing uninitialized values to functions is considered undefined > behavior (see [2]). To fix that, we initialize `seq` and `inode` at > definition. Given that Al Viro has a patch series in flight to address the problem, I am going to drop this patch from KMSAN v5 series.
diff --git a/fs/namei.c b/fs/namei.c index 1f28d3f463c3b..6b39dfd3b41bc 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1995,8 +1995,8 @@ static const char *handle_dots(struct nameidata *nd, int type) static const char *walk_component(struct nameidata *nd, int flags) { struct dentry *dentry; - struct inode *inode; - unsigned seq; + struct inode *inode = NULL; + unsigned seq = 0; /* * "." and ".." are special - ".." especially so because it has * to be able to know about the current root directory and @@ -3393,8 +3393,8 @@ static const char *open_last_lookups(struct nameidata *nd, struct dentry *dir = nd->path.dentry; int open_flag = op->open_flag; bool got_write = false; - unsigned seq; - struct inode *inode; + unsigned seq = 0; + struct inode *inode = NULL; struct dentry *dentry; const char *res;
Under certain circumstances initialization of `unsigned seq` and `struct inode *inode` passed into step_into() may be skipped. In particular, if the call to lookup_fast() in walk_component() returns NULL, and lookup_slow() returns a valid dentry, then the `seq` and `inode` will remain uninitialized until the call to step_into() (see [1] for more info). Right now step_into() does not use these uninitialized values, yet passing uninitialized values to functions is considered undefined behavior (see [2]). To fix that, we initialize `seq` and `inode` at definition. [1] https://github.com/ClangBuiltLinux/linux/issues/1648#issuecomment-1146608063 [2] https://lore.kernel.org/linux-toolchains/CAHk-=whjz3wO8zD+itoerphWem+JZz4uS3myf6u1Wd6epGRgmQ@mail.gmail.com/ Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vitaly Buka <vitalybuka@google.com> Cc: linux-kernel@vger.kernel.org Cc: linux-toolchains@vger.kernel.org Signed-off-by: Alexander Potapenko <glider@google.com> --- Link: https://linux-review.googlesource.com/id/I94d4e8cc1f0ecc7174659e9506ce96aaf2201d0a --- fs/namei.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)