Message ID: 20250128011023.55012-1-slava@dubeyko.com (mailing list archive)
State: New
Series: ceph: is_root_ceph_dentry() cleanup
On Mon, Jan 27, 2025 at 05:10:23PM -0800, Viacheslav Dubeyko wrote:
> From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
>
> This patch introduces CEPH_HIDDEN_DIR_NAME. It
> declares name of the hidden directory .ceph in
> the include/linux/ceph/ceph_fs.h instead of hiding
> it in dir.c file. Also hardcoded length of the name
> is changed on strlen(CEPH_HIDDEN_DIR_NAME).

Hmm...  Speaking of that area

	* how the hell could ceph_lookup() ever be called with dentry
that is *NOT* negative?  VFS certainly won't do that; I'm not sure about
ceph_handle_notrace_create(), but it doesn't look like that's possible
without server being malicious (if it's possible at all).

	* speaking of malicious servers, what happens if it gets
CEPH_MDS_OP_LOOKUP and it returns a normal reply to positive lookup,
but with cpu_to_le32(-ENOENT) shoved into head->result?  AFAICS,
ceph_handle_snapdir() will be called with dentry that is already made
positive; results will not be pretty...
On Tue, 2025-01-28 at 03:07 +0000, Al Viro wrote:
> On Mon, Jan 27, 2025 at 05:10:23PM -0800, Viacheslav Dubeyko wrote:
> > From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> >
> > This patch introduces CEPH_HIDDEN_DIR_NAME. It
> > declares name of the hidden directory .ceph in
> > the include/linux/ceph/ceph_fs.h instead of hiding
> > it in dir.c file. Also hardcoded length of the name
> > is changed on strlen(CEPH_HIDDEN_DIR_NAME).
>
> Hmm...
>
> Speaking of that area
> 	* how the hell could ceph_lookup() ever be called with dentry
> that is *NOT* negative?  VFS certainly won't do that; I'm not sure about
> ceph_handle_notrace_create(), but it doesn't look like that's possible
> without server being malicious (if it's possible at all).
>
> 	* speaking of malicious servers, what happens if
> it gets CEPH_MDS_OP_LOOKUP and it returns a normal reply to positive
> lookup, but with cpu_to_le32(-ENOENT) shoved into head->result?
> AFAICS, ceph_handle_snapdir() will be called with dentry
> that is already made positive; results will not be pretty...

I assume that you imply this code:

	/* can we conclude ENOENT locally? */
	if (d_really_is_negative(dentry)) {
		struct ceph_inode_info *ci = ceph_inode(dir);
		struct ceph_dentry_info *di = ceph_dentry(dentry);

		spin_lock(&ci->i_ceph_lock);
		doutc(cl, " dir %llx.%llx flags are 0x%lx\n",
		      ceph_vinop(dir), ci->i_ceph_flags);
		if (strncmp(dentry->d_name.name,
			    fsc->mount_options->snapdir_name,
			    dentry->d_name.len) &&
		    !is_root_ceph_dentry(dir, dentry) &&
		    ceph_test_mount_opt(fsc, DCACHE) &&
		    __ceph_dir_is_complete(ci) &&
		    __ceph_caps_issued_mask_metric(ci, CEPH_CAP_FILE_SHARED, 1)) {
			__ceph_touch_fmode(ci, mdsc, CEPH_FILE_MODE_RD);
			spin_unlock(&ci->i_ceph_lock);
			doutc(cl, " dir %llx.%llx complete, -ENOENT\n",
			      ceph_vinop(dir));
			d_add(dentry, NULL);
			di->lease_shared_gen = atomic_read(&ci->i_shared_gen);
			return NULL;
		}
		spin_unlock(&ci->i_ceph_lock);
	}

Am I correct? So, how can we rework this code if it's wrong? What is your
vision? Do you mean that it's dead code? How can we check it?

Thanks,
Slava.
On Tue, Jan 28, 2025 at 11:27:05PM +0000, Viacheslav Dubeyko wrote:
> I assume that you imply this code:
>
> 	/* can we conclude ENOENT locally? */
> 	if (d_really_is_negative(dentry)) {
> 		struct ceph_inode_info *ci = ceph_inode(dir);
> 		struct ceph_dentry_info *di = ceph_dentry(dentry);
>
> 		spin_lock(&ci->i_ceph_lock);
> 		doutc(cl, " dir %llx.%llx flags are 0x%lx\n",
> 		      ceph_vinop(dir), ci->i_ceph_flags);
> 		if (strncmp(dentry->d_name.name,
> 			    fsc->mount_options->snapdir_name,
> 			    dentry->d_name.len) &&
> 		    !is_root_ceph_dentry(dir, dentry) &&
> 		    ceph_test_mount_opt(fsc, DCACHE) &&
> 		    __ceph_dir_is_complete(ci) &&
> 		    __ceph_caps_issued_mask_metric(ci, CEPH_CAP_FILE_SHARED, 1)) {
> 			__ceph_touch_fmode(ci, mdsc, CEPH_FILE_MODE_RD);
> 			spin_unlock(&ci->i_ceph_lock);
> 			doutc(cl, " dir %llx.%llx complete, -ENOENT\n",
> 			      ceph_vinop(dir));
> 			d_add(dentry, NULL);
> 			di->lease_shared_gen = atomic_read(&ci->i_shared_gen);
> 			return NULL;
> 		}
> 		spin_unlock(&ci->i_ceph_lock);
> 	}
>
> Am I correct? So, how can we rework this code if it's wrong? What is your
> vision? Do you mean that it's dead code? How can we check it?

I mean that ->lookup() is called *ONLY* for a negative unhashed dentries.
In other words, on a call from VFS that condition will always be true.
That part is easily provable; what is harder to reason about is the
direct call of ceph_lookup() from ceph_handle_notrace_create().

The callers of that thing (ceph_mknod(), ceph_symlink() and ceph_mkdir())
are all guaranteed that dentry will be negative when they are called.
The hard-to-reason-about part is the call of ceph_mdsc_do_request()
directly preceding the calls of ceph_handle_notrace_create().

Can ceph_mdsc_do_request() return 0, with req->r_reply_info.head->is_dentry
being false *AND* a call of splice_dentry() made by ceph_fill_trace()
called by ceph_mdsc_do_request()?

AFAICS, there are 3 calls of splice_dentry(); two of them happen under
explicit check for ->is_dentry and thus are not interesting for our
purposes.  The third one, though, could be hit with ->is_dentry being
false and ->r_op being CEPH_MDS_OP_MKSNAP.  That is not impossible
from ceph_mkdir(), as far as I can tell, and I don't understand the
details well enough to tell whether it can actually happen.

Is it actually possible to hit ceph_handle_notrace_create() with
a positive dentry?
On Wed, 2025-01-29 at 01:12 +0000, Al Viro wrote:
> On Tue, Jan 28, 2025 at 11:27:05PM +0000, Viacheslav Dubeyko wrote:
>
> > I assume that you imply this code:
> >
> > 	/* can we conclude ENOENT locally? */
> > 	if (d_really_is_negative(dentry)) {
> > 		struct ceph_inode_info *ci = ceph_inode(dir);
> > 		struct ceph_dentry_info *di = ceph_dentry(dentry);
> >
> > 		spin_lock(&ci->i_ceph_lock);
> > 		doutc(cl, " dir %llx.%llx flags are 0x%lx\n",
> > 		      ceph_vinop(dir), ci->i_ceph_flags);
> > 		if (strncmp(dentry->d_name.name,
> > 			    fsc->mount_options->snapdir_name,
> > 			    dentry->d_name.len) &&
> > 		    !is_root_ceph_dentry(dir, dentry) &&
> > 		    ceph_test_mount_opt(fsc, DCACHE) &&
> > 		    __ceph_dir_is_complete(ci) &&
> > 		    __ceph_caps_issued_mask_metric(ci, CEPH_CAP_FILE_SHARED, 1)) {
> > 			__ceph_touch_fmode(ci, mdsc, CEPH_FILE_MODE_RD);
> > 			spin_unlock(&ci->i_ceph_lock);
> > 			doutc(cl, " dir %llx.%llx complete, -ENOENT\n",
> > 			      ceph_vinop(dir));
> > 			d_add(dentry, NULL);
> > 			di->lease_shared_gen = atomic_read(&ci->i_shared_gen);
> > 			return NULL;
> > 		}
> > 		spin_unlock(&ci->i_ceph_lock);
> > 	}
> >
> > Am I correct? So, how can we rework this code if it's wrong? What is your
> > vision? Do you mean that it's dead code? How can we check it?
>
> I mean that ->lookup() is called *ONLY* for a negative unhashed dentries.
> In other words, on a call from VFS that condition will always be true.
> That part is easily provable; what is harder to reason about is the
> direct call of ceph_lookup() from ceph_handle_notrace_create().
>
> The callers of that thing (ceph_mknod(), ceph_symlink() and ceph_mkdir())
> are all guaranteed that dentry will be negative when they are called.
> The hard-to-reason-about part is the call of ceph_mdsc_do_request()
> directly preceding the calls of ceph_handle_notrace_create().
>
> Can ceph_mdsc_do_request() return 0, with req->r_reply_info.head->is_dentry
> being false *AND* a call of splice_dentry() made by ceph_fill_trace() called
> by ceph_mdsc_do_request()?
>
> AFAICS, there are 3 calls of splice_dentry(); two of them happen under
> explicit check for ->is_dentry and thus are not interesting for our
> purposes.  The third one, though, could be hit with ->is_dentry being
> false and ->r_op being CEPH_MDS_OP_MKSNAP.  That is not impossible
> from ceph_mkdir(), as far as I can tell, and I don't understand the
> details well enough to tell whether it can actually happen.
>
> Is it actually possible to hit ceph_handle_notrace_create() with
> a positive dentry?

Sorry for delay, I was busy with other issues.

I did run the xfstests and, as far as I can see, I had only negative
dentries as an input of ceph_lookup(). However, I did testing with
presence of only one CephFS kernel client and I didn't use snapshots.
I guess, potentially, if we have several CephFS kernel clients, for
example, on different nodes working with the same folder, then, maybe,
MDS will behave in slightly different way. But it's theory that I need
to check.

In general, the MDS can issue NULL dentries to clients so that they "know" the
direntry does not exist without having some capability (or lease) associated
with it. As far as I can see, if application repeatedly does stat of file, then
the kernel driver isn't repeatedly going out to the MDS to lookup that file. So,
I assume that this is the goal of this check and logic.

But, frankly speaking, I am not completely digested this piece of code
and all possible use-cases yet. :)

Thanks,
Slava.
On Tue, Feb 11, 2025 at 12:08:21AM +0000, Viacheslav Dubeyko wrote:
> In general, the MDS can issue NULL dentries to clients so that they "know" the
> direntry does not exist without having some capability (or lease) associated
> with it. As far as I can see, if application repeatedly does stat of file, then
> the kernel driver isn't repeatedly going out to the MDS to lookup that file. So,
> I assume that this is the goal of this check and logic.

Er...  On repeated stat(2) you will get ->lookup() called the first time around;
afterwards you'll be getting dentry from dcache lookup.  With ->d_revalidate()
each time you find it in dcache, and eviction if ->d_revalidate() says it's stale.
In which case a new dentry will be allocated and fed to ->lookup(), passed to
it in negative-unhashed state...
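The lookup/revalidate cycle described above can be sketched as a toy
user-space model (all names here are hypothetical illustration, not
kernel code): ->lookup() fires only on a dcache miss, repeated stat(2)
calls hit the dcache and only run ->d_revalidate(), and a stale dentry
is evicted so a fresh negative dentry goes back through ->lookup().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the VFS flow: a dcache miss calls ->lookup() with a
 * negative dentry; a dcache hit only calls ->d_revalidate(); a stale
 * dentry is evicted and a new negative dentry is fed to ->lookup(). */
struct toy_dentry {
	bool cached;	/* present in the dcache */
	bool stale;	/* ->d_revalidate() would reject it */
};

static int lookup_calls, revalidate_calls;

static void toy_stat(struct toy_dentry *d)
{
	if (d->cached) {
		revalidate_calls++;		/* ->d_revalidate() */
		if (!d->stale)
			return;			/* dcache hit, no ->lookup() */
		d->cached = false;		/* evicted; allocate anew */
	}
	lookup_calls++;				/* ->lookup(); dentry is negative here */
	d->cached = true;
	d->stale = false;
}
```

Three stats in a row trigger one ->lookup() and two ->d_revalidate()
calls; marking the dentry stale forces one more ->lookup().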
On Tue, 2025-02-11 at 00:15 +0000, Al Viro wrote:
> On Tue, Feb 11, 2025 at 12:08:21AM +0000, Viacheslav Dubeyko wrote:
>
> > In general, the MDS can issue NULL dentries to clients so that they "know" the
> > direntry does not exist without having some capability (or lease) associated
> > with it. As far as I can see, if application repeatedly does stat of file, then
> > the kernel driver isn't repeatedly going out to the MDS to lookup that file. So,
> > I assume that this is the goal of this check and logic.
>
> Er...  On repeated stat(2) you will get ->lookup() called the first time around;
> afterwards you'll be getting dentry from dcache lookup.  With ->d_revalidate()
> each time you find it in dcache, and eviction if ->d_revalidate() says it's stale.
> In which case a new dentry will be allocated and fed to ->lookup(), passed to
> it in negative-unhashed state...

After some considerations, I believe we can follow such simple logic.
Correct me if I will be wrong here. The ceph_lookup() method's responsibility is
to "look up a single dir entry". It sounds for me that if we have positive dentry,
then it doesn't make sense to call the ceph_lookup(). And if ceph_lookup() has been
called for the positive dentry, then something wrong is happening.

As far as I can see, currently, we have a confusing execution flow in
ceph_lookup():

	if (d_really_is_negative(dentry)) {
		<do check locally>
		if (-ENOENT)
			return NULL;
	}

	<send request to MDS server>

But all this logic is not about negative dentry, it's about local check
before sending request to MDS server. So, I think we need to change the logic
in likewise way:

	if (<we can check locally>) {
		<do check locally>
		if (-ENOENT)
			return NULL;
	} else {
		<send request to MDS server>
	}

Am I right here? :) Let me change the logic in this way and to test it.

Thanks,
Slava.
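The "conclude ENOENT locally" guard under discussion is a conjunction of
independent conditions. A rough user-space model of that predicate (the
function name and the boolean parameters are hypothetical; the kernel
version checks the same things via strncmp on d_name, mount options and
caps) makes it easier to see when the MDS round trip can be skipped:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Rough model of the "can we conclude ENOENT locally?" guard in
 * ceph_lookup(): the client may answer -ENOENT without asking the MDS
 * only when every one of these conditions holds. */
static bool can_conclude_enoent(const char *name,
				const char *snapdir_name,
				bool is_root_ceph_dentry,	/* ".ceph" in root */
				bool dcache_mount_opt,		/* DCACHE enabled */
				bool dir_is_complete,		/* full dir in cache */
				bool have_shared_cap)		/* CEPH_CAP_FILE_SHARED */
{
	return strcmp(name, snapdir_name) != 0 &&
	       !is_root_ceph_dentry &&
	       dcache_mount_opt &&
	       dir_is_complete &&
	       have_shared_cap;
}
```

Any single failing condition (snapdir name, the hidden root entry,
missing caps, incomplete directory) forces the request to the MDS.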
On Tue, Feb 11, 2025 at 06:01:21PM +0000, Viacheslav Dubeyko wrote:
> After some considerations, I believe we can follow such simple logic.
> Correct me if I will be wrong here. The ceph_lookup() method's responsibility is
> to "look up a single dir entry". It sounds for me that if we have positive
> dentry, then it doesn't make sense to call the ceph_lookup(). And if
> ceph_lookup() has been called for the positive dentry, then something wrong
> is happening.

VFS never calls it that way; ceph itself might, if ceph_handle_notrace_create()
is called with positive dentry.

> But all this logic is not about negative dentry, it's about local check
> before sending request to MDS server. So, I think we need to change the logic
> in likewise way:
>
> 	if (<we can check locally>) {
> 		<do check locally>
> 		if (-ENOENT)
> 			return NULL;
> 	} else {
> 		<send request to MDS server>
> 	}
>
> Am I right here? :) Let me change the logic in this way and to test it.

First of all, returning NULL does *not* mean "it's negative"; d_add(dentry, NULL)
does that.  What would "we can check locally" be?  AFAICS, the checks in
__ceph_dir_is_complete() and near it are doing just that...

The really unpleasant question is whether ceph_handle_notrace_create() could
end up feeding an already-positive dentry to direct call of ceph_lookup()...
On Tue, 2025-02-11 at 19:01 +0000, Al Viro wrote:
> On Tue, Feb 11, 2025 at 06:01:21PM +0000, Viacheslav Dubeyko wrote:
>
> > After some considerations, I believe we can follow such simple logic.
> > Correct me if I will be wrong here. The ceph_lookup() method's responsibility is
> > to "look up a single dir entry". It sounds for me that if we have positive
> > dentry, then it doesn't make sense to call the ceph_lookup(). And if
> > ceph_lookup() has been called for the positive dentry, then something wrong
> > is happening.
>
> VFS never calls it that way; ceph itself might, if ceph_handle_notrace_create()
> is called with positive dentry.
>
> > But all this logic is not about negative dentry, it's about local check
> > before sending request to MDS server. So, I think we need to change the logic
> > in likewise way:
> >
> > 	if (<we can check locally>) {
> > 		<do check locally>
> > 		if (-ENOENT)
> > 			return NULL;
> > 	} else {
> > 		<send request to MDS server>
> > 	}
> >
> > Am I right here? :) Let me change the logic in this way and to test it.
>
> First of all, returning NULL does *not* mean "it's negative"; d_add(dentry, NULL)
> does that.  What would "we can check locally" be?  AFAICS, the checks in
> __ceph_dir_is_complete() and near it are doing just that...

Currently, we have:

	if (d_really_is_negative(dentry)) {
		<execute logic>
	}

My point is simply to delete the d_really_is_negative() check and execute
this logic without the check.

> The really unpleasant question is whether ceph_handle_notrace_create() could
> end up feeding an already-positive dentry to direct call of ceph_lookup()...

We have ceph_handle_notrace_create() call in several methods:
(1) ceph_mknod()
(2) ceph_symlink()
(3) ceph_mkdir()
(4) ceph_atomic_open()

Every time we create object at first and, then, we call
ceph_handle_notrace_create() if creation was successful. We have such pattern:

	req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS);

	<skipped>

	err = ceph_mdsc_do_request(mdsc, dir, req);
	if (!err && !req->r_reply_info.head->is_dentry)
		err = ceph_handle_notrace_create(dir, dentry);

And ceph_lookup() has such logic:

	if (d_really_is_negative(dentry)) {
		<execute logic>
		if (-ENOENT)
			return NULL;
	}

	req = ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS);

	<skipped>

	err = ceph_mdsc_do_request(mdsc, NULL, req);

So, we have two different type of requests here:
(1) ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS)
(2) ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS)

The first request creates an object on MDS side and second one checks that this
object exists on MDS side by lookup. I assume that first request simply reports
success or failure of object creation. And only second one can extract metadata
from MDS side.

Thanks,
Slava.
On Tue, Feb 11, 2025 at 07:32:16PM +0000, Viacheslav Dubeyko wrote:
> > The really unpleasant question is whether ceph_handle_notrace_create() could
> > end up feeding an already-positive dentry to direct call of ceph_lookup()...
>
> We have ceph_handle_notrace_create() call in several methods:
> (1) ceph_mknod()
> (2) ceph_symlink()
> (3) ceph_mkdir()
> (4) ceph_atomic_open()
>
> Every time we create object at first and, then, we call
> ceph_handle_notrace_create() if creation was successful. We have such pattern:
>
> 	req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS);
>
> 	<skipped>
>
> 	err = ceph_mdsc_do_request(mdsc, dir, req);
> 	if (!err && !req->r_reply_info.head->is_dentry)
> 		err = ceph_handle_notrace_create(dir, dentry);
>
> And ceph_lookup() has such logic:
>
> 	if (d_really_is_negative(dentry)) {
> 		<execute logic>
> 		if (-ENOENT)
> 			return NULL;
> 	}
>
> 	req = ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS);
>
> 	<skipped>
>
> 	err = ceph_mdsc_do_request(mdsc, NULL, req);
>
> So, we have two different type of requests here:
> (1) ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS)
> (2) ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS)
>
> The first request creates an object on MDS side and second one checks that this
> object exists on MDS side by lookup. I assume that first request simply reports
> success or failure of object creation. And only second one can extract metadata
> from MDS side.

If only...  The first request may return that metadata, in which case we'll
get splice_dentry() called by ceph_fill_trace() when reply arrives.

Note that calls of ceph_handle_notrace_create() are conditional.  Intent is
as you've described and normally that's how it works, but that assumes sane
reply.  Which is not a good assumption to make, when it comes to memory
corruption on client...

I _think_ the conditions are sufficient.  There are 4 callers of
ceph_handle_notrace_create().  All of them require is_dentry in reply to be
false; the one in ceph_mkdir() requires both is_dentry and is_target to be
false.

ceph_fill_trace() does not call splice_dentry() when both is_dentry and
is_target are false, so we are only interested in the mknod/symlink/atomic_open
callers.

Past the handling of !is_dentry && !is_target case ceph_fill_trace() has
a large if (is_dentry); not a concern for us.  Next comes if (is_target)
where we deal with ceph_fill_inode(); no calls of splice_dentry() in it.
After that we have

	if (is_dentry && ....) {
		...
		splice_dentry() call possible here
		...
	} else if ((req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
		    req->r_op == CEPH_MDS_OP_MKSNAP) &&
		   test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags) &&
		   !test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags)) {
		...
		splice_dentry() call possible here
		...
	} else if (is_dentry && ...) {
		...
	}

Since we only care about !is_dentry case, that leaves the middle part.
LOOKUPSNAP can come from ceph_lookup() or ceph_d_revalidate(), neither of
which call ceph_handle_notrace_create().  MKSNAP comes only from ceph_mkdir(),
which _does_ call ceph_handle_notrace_create(), but only in case
!is_dentry && !is_target, so that call of splice_dentry() is not going to
happen there.

*IF* the analysis above is correct, we can rely upon the ceph_lookup()
always getting a negative dentry and that if (d_really_is_negative(dentry))
is always taken.  But I could be missing something subtle in there ;-/
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0bf388e07a02..5151c614b5cb 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -782,10 +782,16 @@ struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
 	return dentry;
 }
 
+static inline
+bool is_hidden_ceph_dir(struct dentry *dentry)
+{
+	size_t len = strlen(CEPH_HIDDEN_DIR_NAME);
+	return strncmp(dentry->d_name.name, CEPH_HIDDEN_DIR_NAME, len) == 0;
+}
+
 static bool is_root_ceph_dentry(struct inode *inode, struct dentry *dentry)
 {
-	return ceph_ino(inode) == CEPH_INO_ROOT &&
-	       strncmp(dentry->d_name.name, ".ceph", 5) == 0;
+	return ceph_ino(inode) == CEPH_INO_ROOT && is_hidden_ceph_dir(dentry);
 }
 
 /*
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 2d7d86f0290d..84a1391aab29 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -31,6 +31,8 @@
 #define CEPH_INO_CEPH 2 /* hidden .ceph dir */
 #define CEPH_INO_GLOBAL_SNAPREALM 3 /* global dummy snaprealm */
 
+#define CEPH_HIDDEN_DIR_NAME ".ceph"
+
 /* arbitrary limit on max # of monitors (cluster of 3 is typical) */
 #define CEPH_MAX_MON   31
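One behavior the new helper deliberately inherits from the original
`strncmp(dentry->d_name.name, ".ceph", 5)` is that strncmp() with a fixed
length is a prefix test, not an exact-match test: any name beginning with
".ceph" satisfies it. A user-space replica of the comparison (only the
helper name is taken from the patch; everything else is illustrative)
makes this easy to verify:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CEPH_HIDDEN_DIR_NAME ".ceph"

/* User-space replica of the comparison in the patch's
 * is_hidden_ceph_dir(): strncmp() only examines the first
 * strlen(".ceph") characters, so this is a prefix match. */
static bool is_hidden_ceph_dir(const char *name)
{
	size_t len = strlen(CEPH_HIDDEN_DIR_NAME);

	return strncmp(name, CEPH_HIDDEN_DIR_NAME, len) == 0;
}
```

So ".ceph" matches, but so does ".cephfoo"; whether that prefix-match
behavior is intended is exactly the kind of detail this comparison keeps
unchanged relative to the pre-patch code.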