Message ID: 20250128011023.55012-1-slava@dubeyko.com (mailing list archive)
State: New
Series: ceph: is_root_ceph_dentry() cleanup
On Mon, Jan 27, 2025 at 05:10:23PM -0800, Viacheslav Dubeyko wrote:
> From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
>
> This patch introduces CEPH_HIDDEN_DIR_NAME. It
> declares name of the hidden directory .ceph in
> the include/linux/ceph/ceph_fs.h instead of hiding
> it in dir.c file. Also hardcoded length of the name
> is changed on strlen(CEPH_HIDDEN_DIR_NAME).

Hmm...  Speaking of that area

	* how the hell could ceph_lookup() ever be called with dentry
that is *NOT* negative?  VFS certainly won't do that; I'm not sure about
ceph_handle_notrace_create(), but it doesn't look like that's possible
without server being malicious (if it's possible at all).

	* speaking of malicious servers, what happens if it gets
CEPH_MDS_OP_LOOKUP and it returns a normal reply to positive lookup,
but with cpu_to_le32(-ENOENT) shoved into head->result?  AFAICS,
ceph_handle_snapdir() will be called with dentry that is already made
positive; results will not be pretty...
On Tue, 2025-01-28 at 03:07 +0000, Al Viro wrote:
> On Mon, Jan 27, 2025 at 05:10:23PM -0800, Viacheslav Dubeyko wrote:
> > From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> >
> > This patch introduces CEPH_HIDDEN_DIR_NAME. It
> > declares name of the hidden directory .ceph in
> > the include/linux/ceph/ceph_fs.h instead of hiding
> > it in dir.c file. Also hardcoded length of the name
> > is changed on strlen(CEPH_HIDDEN_DIR_NAME).
>
> Hmm...
>
> Speaking of that area
> 	* how the hell could ceph_lookup() ever be called with dentry
> that is *NOT* negative?  VFS certainly won't do that; I'm not sure about
> ceph_handle_notrace_create(), but it doesn't look like that's possible
> without server being malicious (if it's possible at all).
>
> 	* speaking of malicious servers, what happens if
> it gets CEPH_MDS_OP_LOOKUP and it returns a normal reply to positive
> lookup, but with cpu_to_le32(-ENOENT) shoved into head->result?
> AFAICS, ceph_handle_snapdir() will be called with dentry
> that is already made positive; results will not be pretty...

I assume that you imply this code:

	/* can we conclude ENOENT locally? */
	if (d_really_is_negative(dentry)) {
		struct ceph_inode_info *ci = ceph_inode(dir);
		struct ceph_dentry_info *di = ceph_dentry(dentry);

		spin_lock(&ci->i_ceph_lock);
		doutc(cl, " dir %llx.%llx flags are 0x%lx\n",
		      ceph_vinop(dir), ci->i_ceph_flags);
		if (strncmp(dentry->d_name.name,
			    fsc->mount_options->snapdir_name,
			    dentry->d_name.len) &&
		    !is_root_ceph_dentry(dir, dentry) &&
		    ceph_test_mount_opt(fsc, DCACHE) &&
		    __ceph_dir_is_complete(ci) &&
		    __ceph_caps_issued_mask_metric(ci, CEPH_CAP_FILE_SHARED, 1)) {
			__ceph_touch_fmode(ci, mdsc, CEPH_FILE_MODE_RD);
			spin_unlock(&ci->i_ceph_lock);
			doutc(cl, " dir %llx.%llx complete, -ENOENT\n",
			      ceph_vinop(dir));
			d_add(dentry, NULL);
			di->lease_shared_gen = atomic_read(&ci->i_shared_gen);
			return NULL;
		}
		spin_unlock(&ci->i_ceph_lock);
	}

Am I correct? So, how can we rework this code if it's wrong? What is your
vision? Do you mean that it's dead code? How can we check it?

Thanks,
Slava.
On Tue, Jan 28, 2025 at 11:27:05PM +0000, Viacheslav Dubeyko wrote:
> I assume that you imply this code:
>
> 	/* can we conclude ENOENT locally? */
> 	if (d_really_is_negative(dentry)) {
> 		struct ceph_inode_info *ci = ceph_inode(dir);
> 		struct ceph_dentry_info *di = ceph_dentry(dentry);
>
> 		spin_lock(&ci->i_ceph_lock);
> 		doutc(cl, " dir %llx.%llx flags are 0x%lx\n",
> 		      ceph_vinop(dir), ci->i_ceph_flags);
> 		if (strncmp(dentry->d_name.name,
> 			    fsc->mount_options->snapdir_name,
> 			    dentry->d_name.len) &&
> 		    !is_root_ceph_dentry(dir, dentry) &&
> 		    ceph_test_mount_opt(fsc, DCACHE) &&
> 		    __ceph_dir_is_complete(ci) &&
> 		    __ceph_caps_issued_mask_metric(ci, CEPH_CAP_FILE_SHARED, 1)) {
> 			__ceph_touch_fmode(ci, mdsc, CEPH_FILE_MODE_RD);
> 			spin_unlock(&ci->i_ceph_lock);
> 			doutc(cl, " dir %llx.%llx complete, -ENOENT\n",
> 			      ceph_vinop(dir));
> 			d_add(dentry, NULL);
> 			di->lease_shared_gen = atomic_read(&ci->i_shared_gen);
> 			return NULL;
> 		}
> 		spin_unlock(&ci->i_ceph_lock);
> 	}
>
> Am I correct? So, how can we rework this code if it's wrong? What is your
> vision? Do you mean that it's dead code? How can we check it?

I mean that ->lookup() is called *ONLY* for a negative unhashed dentries.
In other words, on a call from VFS that condition will always be true.
That part is easily provable; what is harder to reason about is the
direct call of ceph_lookup() from ceph_handle_notrace_create().

The callers of that thing (ceph_mknod(), ceph_symlink() and ceph_mkdir())
are all guaranteed that dentry will be negative when they are called.
The hard-to-reason-about part is the call of ceph_mdsc_do_request()
directly preceding the calls of ceph_handle_notrace_create().

Can ceph_mdsc_do_request() return 0, with req->r_reply_info.head->is_dentry
being false *AND* a call of splice_dentry() made by ceph_fill_trace()
called by ceph_mdsc_do_request()?

AFAICS, there are 3 calls of splice_dentry(); two of them happen under
explicit check for ->is_dentry and thus are not interesting for our
purposes.  The third one, though, could be hit with ->is_dentry being
false and ->r_op being CEPH_MDS_OP_MKSNAP.  That is not impossible
from ceph_mkdir(), as far as I can tell, and I don't understand the
details well enough to tell whether it can actually happen.

Is it actually possible to hit ceph_handle_notrace_create() with
a positive dentry?
On Wed, 2025-01-29 at 01:12 +0000, Al Viro wrote:
> On Tue, Jan 28, 2025 at 11:27:05PM +0000, Viacheslav Dubeyko wrote:
>
> > I assume that you imply this code:
> >
> > 	/* can we conclude ENOENT locally? */
> > 	if (d_really_is_negative(dentry)) {
> > 		struct ceph_inode_info *ci = ceph_inode(dir);
> > 		struct ceph_dentry_info *di = ceph_dentry(dentry);
> >
> > 		spin_lock(&ci->i_ceph_lock);
> > 		doutc(cl, " dir %llx.%llx flags are 0x%lx\n",
> > 		      ceph_vinop(dir), ci->i_ceph_flags);
> > 		if (strncmp(dentry->d_name.name,
> > 			    fsc->mount_options->snapdir_name,
> > 			    dentry->d_name.len) &&
> > 		    !is_root_ceph_dentry(dir, dentry) &&
> > 		    ceph_test_mount_opt(fsc, DCACHE) &&
> > 		    __ceph_dir_is_complete(ci) &&
> > 		    __ceph_caps_issued_mask_metric(ci, CEPH_CAP_FILE_SHARED, 1)) {
> > 			__ceph_touch_fmode(ci, mdsc, CEPH_FILE_MODE_RD);
> > 			spin_unlock(&ci->i_ceph_lock);
> > 			doutc(cl, " dir %llx.%llx complete, -ENOENT\n",
> > 			      ceph_vinop(dir));
> > 			d_add(dentry, NULL);
> > 			di->lease_shared_gen = atomic_read(&ci->i_shared_gen);
> > 			return NULL;
> > 		}
> > 		spin_unlock(&ci->i_ceph_lock);
> > 	}
> >
> > Am I correct? So, how can we rework this code if it's wrong? What is your
> > vision? Do you mean that it's dead code? How can we check it?
>
> I mean that ->lookup() is called *ONLY* for a negative unhashed dentries.
> In other words, on a call from VFS that condition will always be true.
> That part is easily provable; what is harder to reason about is the
> direct call of ceph_lookup() from ceph_handle_notrace_create().
>
> The callers of that thing (ceph_mknod(), ceph_symlink() and ceph_mkdir())
> are all guaranteed that dentry will be negative when they are called.
> The hard-to-reason-about part is the call of ceph_mdsc_do_request()
> directly preceding the calls of ceph_handle_notrace_create().
>
> Can ceph_mdsc_do_request() return 0, with req->r_reply_info.head->is_dentry
> being false *AND* a call of splice_dentry() made by ceph_fill_trace() called
> by ceph_mdsc_do_request()?
>
> AFAICS, there are 3 calls of splice_dentry(); two of them happen under
> explicit check for ->is_dentry and thus are not interesting for our
> purposes.  The third one, though, could be hit with ->is_dentry being
> false and ->r_op being CEPH_MDS_OP_MKSNAP.  That is not impossible
> from ceph_mkdir(), as far as I can tell, and I don't understand the
> details well enough to tell whether it can actually happen.
>
> Is it actually possible to hit ceph_handle_notrace_create() with
> a positive dentry?

Sorry for delay, I was busy with other issues.

I did run the xfstests and, as far as I can see, I had only negative
dentries as an input of ceph_lookup(). However, I did testing with
presence of only one CephFS kernel client and I didn't use snapshots.
I guess, potentially, if we have several CephFS kernel clients, for
example, on different nodes working with the same folder, then, maybe,
MDS will behave in slightly different way. But it's theory that I need
to check.

In general, the MDS can issue NULL dentries to clients so that they "know" the
direntry does not exist without having some capability (or lease) associated
with it. As far as I can see, if application repeatedly does stat of file, then
the kernel driver isn't repeatedly going out to the MDS to lookup that file. So,
I assume that this is the goal of this check and logic.

But, frankly speaking, I am not completely digested this piece of code
and all possible use-cases yet. :)

Thanks,
Slava.
On Tue, Feb 11, 2025 at 12:08:21AM +0000, Viacheslav Dubeyko wrote:
> In general, the MDS can issue NULL dentries to clients so that they "know" the
> direntry does not exist without having some capability (or lease) associated
> with it. As far as I can see, if application repeatedly does stat of file, then
> the kernel driver isn't repeatedly going out to the MDS to lookup that file. So,
> I assume that this is the goal of this check and logic.

Er...  On repeated stat(2) you will get ->lookup() called the first time around;
afterwards you'll be getting dentry from dcache lookup.  With ->d_revalidate()
each time you find it in dcache, and eviction if ->d_revalidate() says it's stale.
In which case a new dentry will be allocated and fed to ->lookup(), passed to
it in negative-unhashed state...
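The lookup/revalidate cycle described above can be sketched as a toy
user-space model (all names here are hypothetical illustration, not
kernel code): ->lookup() fires only on a dcache miss, repeated stat(2)
calls hit the dcache and only run ->d_revalidate(), and a stale dentry
is evicted so a fresh negative dentry goes back through ->lookup().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the VFS flow: a dcache miss calls ->lookup() with a
 * negative dentry; a dcache hit only calls ->d_revalidate(); a stale
 * dentry is evicted and a new negative dentry is fed to ->lookup(). */
struct toy_dentry {
	bool cached;	/* present in the dcache */
	bool stale;	/* ->d_revalidate() would reject it */
};

static int lookup_calls, revalidate_calls;

static void toy_stat(struct toy_dentry *d)
{
	if (d->cached) {
		revalidate_calls++;		/* ->d_revalidate() */
		if (!d->stale)
			return;			/* dcache hit, no ->lookup() */
		d->cached = false;		/* evicted; allocate anew */
	}
	lookup_calls++;				/* ->lookup(); dentry is negative here */
	d->cached = true;
	d->stale = false;
}
```

Three stats in a row trigger one ->lookup() and two ->d_revalidate()
calls; marking the dentry stale forces one more ->lookup().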
On Tue, 2025-02-11 at 00:15 +0000, Al Viro wrote:
> On Tue, Feb 11, 2025 at 12:08:21AM +0000, Viacheslav Dubeyko wrote:
>
> > In general, the MDS can issue NULL dentries to clients so that they "know" the
> > direntry does not exist without having some capability (or lease) associated
> > with it. As far as I can see, if application repeatedly does stat of file, then
> > the kernel driver isn't repeatedly going out to the MDS to lookup that file. So,
> > I assume that this is the goal of this check and logic.
>
> Er...  On repeated stat(2) you will get ->lookup() called the first time around;
> afterwards you'll be getting dentry from dcache lookup.  With ->d_revalidate()
> each time you find it in dcache, and eviction if ->d_revalidate() says it's stale.
> In which case a new dentry will be allocated and fed to ->lookup(), passed to
> it in negative-unhashed state...

After some considerations, I believe we can follow such simple logic.
Correct me if I will be wrong here. The ceph_lookup() method's responsibility is
to "look up a single dir entry". It sounds for me that if we have positive dentry,
then it doesn't make sense to call the ceph_lookup(). And if ceph_lookup() has been
called for the positive dentry, then something wrong is happening.

As far as I can see, currently, we have a confusing execution flow in
ceph_lookup():

	if (d_really_is_negative(dentry)) {
		<do check locally>
		if (-ENOENT)
			return NULL;
	}

	<send request to MDS server>

But all this logic is not about negative dentry, it's about local check
before sending request to MDS server. So, I think we need to change the logic
in likewise way:

	if (<we can check locally>) {
		<do check locally>
		if (-ENOENT)
			return NULL;
	} else {
		<send request to MDS server>
	}

Am I right here? :) Let me change the logic in this way and to test it.

Thanks,
Slava.
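The "conclude ENOENT locally" guard under discussion is a conjunction of
independent conditions. A rough user-space model of that predicate (the
function name and the boolean parameters are hypothetical; the kernel
version checks the same things via strncmp on d_name, mount options and
caps) makes it easier to see when the MDS round trip can be skipped:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Rough model of the "can we conclude ENOENT locally?" guard in
 * ceph_lookup(): the client may answer -ENOENT without asking the MDS
 * only when every one of these conditions holds. */
static bool can_conclude_enoent(const char *name,
				const char *snapdir_name,
				bool is_root_ceph_dentry,	/* ".ceph" in root */
				bool dcache_mount_opt,		/* DCACHE enabled */
				bool dir_is_complete,		/* full dir in cache */
				bool have_shared_cap)		/* CEPH_CAP_FILE_SHARED */
{
	return strcmp(name, snapdir_name) != 0 &&
	       !is_root_ceph_dentry &&
	       dcache_mount_opt &&
	       dir_is_complete &&
	       have_shared_cap;
}
```

Any single failing condition (snapdir name, the hidden root entry,
missing caps, incomplete directory) forces the request to the MDS.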
On Tue, Feb 11, 2025 at 06:01:21PM +0000, Viacheslav Dubeyko wrote:
> After some considerations, I believe we can follow such simple logic.
> Correct me if I will be wrong here. The ceph_lookup() method's responsibility is
> to "look up a single dir entry". It sounds for me that if we have positive
> dentry, then it doesn't make sense to call the ceph_lookup(). And if
> ceph_lookup() has been called for the positive dentry, then something wrong
> is happening.

VFS never calls it that way; ceph itself might, if ceph_handle_notrace_create()
is called with positive dentry.

> But all this logic is not about negative dentry, it's about local check
> before sending request to MDS server. So, I think we need to change the logic
> in likewise way:
>
> 	if (<we can check locally>) {
> 		<do check locally>
> 		if (-ENOENT)
> 			return NULL;
> 	} else {
> 		<send request to MDS server>
> 	}
>
> Am I right here? :) Let me change the logic in this way and to test it.

First of all, returning NULL does *not* mean "it's negative"; d_add(dentry, NULL)
does that.  What would "we can check locally" be?  AFAICS, the checks in
__ceph_dir_is_complete() and near it are doing just that...

The really unpleasant question is whether ceph_handle_notrace_create() could
end up feeding an already-positive dentry to direct call of ceph_lookup()...
On Tue, 2025-02-11 at 19:01 +0000, Al Viro wrote:
> On Tue, Feb 11, 2025 at 06:01:21PM +0000, Viacheslav Dubeyko wrote:
>
> > After some considerations, I believe we can follow such simple logic.
> > Correct me if I will be wrong here. The ceph_lookup() method's responsibility is
> > to "look up a single dir entry". It sounds for me that if we have positive
> > dentry, then it doesn't make sense to call the ceph_lookup(). And if
> > ceph_lookup() has been called for the positive dentry, then something wrong
> > is happening.
>
> VFS never calls it that way; ceph itself might, if ceph_handle_notrace_create()
> is called with positive dentry.
>
> > But all this logic is not about negative dentry, it's about local check
> > before sending request to MDS server. So, I think we need to change the logic
> > in likewise way:
> >
> > 	if (<we can check locally>) {
> > 		<do check locally>
> > 		if (-ENOENT)
> > 			return NULL;
> > 	} else {
> > 		<send request to MDS server>
> > 	}
> >
> > Am I right here? :) Let me change the logic in this way and to test it.
>
> First of all, returning NULL does *not* mean "it's negative"; d_add(dentry, NULL)
> does that.  What would "we can check locally" be?  AFAICS, the checks in
> __ceph_dir_is_complete() and near it are doing just that...

Currently, we have:

	if (d_really_is_negative(dentry)) {
		<execute logic>
	}

My point is simply to delete the d_really_is_negative() check and execute
this logic without the check.

> The really unpleasant question is whether ceph_handle_notrace_create() could
> end up feeding an already-positive dentry to direct call of ceph_lookup()...

We have ceph_handle_notrace_create() call in several methods:
(1) ceph_mknod()
(2) ceph_symlink()
(3) ceph_mkdir()
(4) ceph_atomic_open()

Every time we create object at first and, then, we call
ceph_handle_notrace_create() if creation was successful. We have such pattern:

	req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS);

	<skipped>

	err = ceph_mdsc_do_request(mdsc, dir, req);
	if (!err && !req->r_reply_info.head->is_dentry)
		err = ceph_handle_notrace_create(dir, dentry);

And ceph_lookup() has such logic:

	if (d_really_is_negative(dentry)) {
		<execute logic>
		if (-ENOENT)
			return NULL;
	}

	req = ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS);

	<skipped>

	err = ceph_mdsc_do_request(mdsc, NULL, req);

So, we have two different type of requests here:
(1) ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS)
(2) ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS)

The first request creates an object on MDS side and second one checks that this
object exists on MDS side by lookup. I assume that first request simply reports
success or failure of object creation. And only second one can extract metadata
from MDS side.

Thanks,
Slava.
On Tue, Feb 11, 2025 at 07:32:16PM +0000, Viacheslav Dubeyko wrote:
> > The really unpleasant question is whether ceph_handle_notrace_create() could
> > end up feeding an already-positive dentry to direct call of ceph_lookup()...
>
> We have ceph_handle_notrace_create() call in several methods:
> (1) ceph_mknod()
> (2) ceph_symlink()
> (3) ceph_mkdir()
> (4) ceph_atomic_open()
>
> Every time we create object at first and, then, we call
> ceph_handle_notrace_create() if creation was successful. We have such pattern:
>
> 	req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS);
>
> 	<skipped>
>
> 	err = ceph_mdsc_do_request(mdsc, dir, req);
> 	if (!err && !req->r_reply_info.head->is_dentry)
> 		err = ceph_handle_notrace_create(dir, dentry);
>
> And ceph_lookup() has such logic:
>
> 	if (d_really_is_negative(dentry)) {
> 		<execute logic>
> 		if (-ENOENT)
> 			return NULL;
> 	}
>
> 	req = ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS);
>
> 	<skipped>
>
> 	err = ceph_mdsc_do_request(mdsc, NULL, req);
>
> So, we have two different type of requests here:
> (1) ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS)
> (2) ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS)
>
> The first request creates an object on MDS side and second one checks that this
> object exists on MDS side by lookup. I assume that first request simply reports
> success or failure of object creation. And only second one can extract metadata
> from MDS side.

If only...  The first request may return that metadata, in which case we'll
get splice_dentry() called by ceph_fill_trace() when reply arrives.

Note that calls of ceph_handle_notrace_create() are conditional.  Intent is
as you've described and normally that's how it works, but that assumes sane
reply.  Which is not a good assumption to make, when it comes to memory
corruption on client...

I _think_ the conditions are sufficient.  There are 4 callers of
ceph_handle_notrace_create().  All of them require is_dentry in reply to be
false; the one in ceph_mkdir() requires both is_dentry and is_target to be
false.

ceph_fill_trace() does not call splice_dentry() when both is_dentry and
is_target are false, so we are only interested in the mknod/symlink/atomic_open
callers.

Past the handling of !is_dentry && !is_target case ceph_fill_trace() has
a large if (is_dentry); not a concern for us.  Next comes if (is_target)
where we deal with ceph_fill_inode(); no calls of splice_dentry() in it.
After that we have

	if (is_dentry && ....) {
		...
		splice_dentry() call possible here
		...
	} else if ((req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
		    req->r_op == CEPH_MDS_OP_MKSNAP) &&
		   test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags) &&
		   !test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags)) {
		...
		splice_dentry() call possible here
		...
	} else if (is_dentry && ...) {
		...
	}

Since we only care about !is_dentry case, that leaves the middle part.
LOOKUPSNAP can come from ceph_lookup() or ceph_d_revalidate(), neither of
which call ceph_handle_notrace_create().  MKSNAP comes only from ceph_mkdir(),
which _does_ call ceph_handle_notrace_create(), but only in case
!is_dentry && !is_target, so that call of splice_dentry() is not going to
happen there.

*IF* the analysis above is correct, we can rely upon the ceph_lookup()
always getting a negative dentry and that if (d_really_is_negative(dentry))
is always taken.  But I could be missing something subtle in there ;-/
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0bf388e07a02..5151c614b5cb 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -782,10 +782,16 @@ struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
 	return dentry;
 }
 
+static inline
+bool is_hidden_ceph_dir(struct dentry *dentry)
+{
+	size_t len = strlen(CEPH_HIDDEN_DIR_NAME);
+	return strncmp(dentry->d_name.name, CEPH_HIDDEN_DIR_NAME, len) == 0;
+}
+
 static bool is_root_ceph_dentry(struct inode *inode, struct dentry *dentry)
 {
-	return ceph_ino(inode) == CEPH_INO_ROOT &&
-	       strncmp(dentry->d_name.name, ".ceph", 5) == 0;
+	return ceph_ino(inode) == CEPH_INO_ROOT && is_hidden_ceph_dir(dentry);
 }
 
 /*
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 2d7d86f0290d..84a1391aab29 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -31,6 +31,8 @@
 #define CEPH_INO_CEPH 2 /* hidden .ceph dir */
 #define CEPH_INO_GLOBAL_SNAPREALM 3 /* global dummy snaprealm */
 
+#define CEPH_HIDDEN_DIR_NAME ".ceph"
+
 /* arbitrary limit on max # of monitors (cluster of 3 is typical) */
 #define CEPH_MAX_MON   31
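One behavior the new helper deliberately inherits from the original
`strncmp(dentry->d_name.name, ".ceph", 5)` is that strncmp() with a fixed
length is a prefix test, not an exact-match test: any name beginning with
".ceph" satisfies it. A user-space replica of the comparison (only the
helper name is taken from the patch; everything else is illustrative)
makes this easy to verify:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CEPH_HIDDEN_DIR_NAME ".ceph"

/* User-space replica of the comparison in the patch's
 * is_hidden_ceph_dir(): strncmp() only examines the first
 * strlen(".ceph") characters, so this is a prefix match. */
static bool is_hidden_ceph_dir(const char *name)
{
	size_t len = strlen(CEPH_HIDDEN_DIR_NAME);

	return strncmp(name, CEPH_HIDDEN_DIR_NAME, len) == 0;
}
```

So ".ceph" matches, but so does ".cephfoo"; whether that prefix-match
behavior is intended is exactly the kind of detail this comparison keeps
unchanged relative to the pre-patch code.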