Message ID | 20190425174739.27604-1-idryomov@gmail.com (mailing list archive)
---|---
State | New, archived
Series | [GIT,PULL] Ceph fixes for 5.1-rc7
On Thu, Apr 25, 2019 at 10:48 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> dentry name handling fixes from Jeff and a memory leak fix from Zheng.
> Both are old issues, marked for stable.

Hmm. You probably should have talked to Al about the dentry name
issue, because he'd most likely have pointed you towards our helper
function for exactly this thing:

        struct name_snapshot stable;

        take_dentry_name_snapshot(&stable, dentry);
        ... use stable.name ..
        release_dentry_name_snapshot(&stable);

which doesn't need any extra memory allocation outside of some fairly
limited stack allocation for the 'name_snapshot' itself, because it
knows about the dentry name rules, and

 - for inline names, it copies it under the d_lock into the fixed
DNAME_INLINE_LEN-sized buffer

 - for out-of-line names, it knows that the name allocation is stable
and ref-counted, and just increments the refcount and uses the
existing name pointer.

now, maybe you need to always do that name allocation anyway (looking
at the diff it looks like you often do that for other cases), so maybe
the name snapshot capability isn't all that useful for you and the
above wouldn't have helped, but I suspect you might not even have
realized that there was an option like this.

I've pulled this, but maybe Jeff wants to look at whether that
snapshotting model could have helped.

                Linus
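For context: at the time of this thread, the helper Linus describes looked
roughly like the following (a paraphrase of the v5.1 fs/dcache.c code, not
text from the thread; the snapshot struct still carried a bare pointer
rather than a qstr):

    struct name_snapshot {
            const char *name;
            unsigned char inline_name[DNAME_INLINE_LEN];
    };

    void take_dentry_name_snapshot(struct name_snapshot *name,
                                   struct dentry *dentry)
    {
            spin_lock(&dentry->d_lock);
            if (unlikely(dname_external(dentry))) {
                    struct external_name *p = external_name(dentry);
                    /* out-of-line name: pin the refcounted allocation */
                    atomic_inc(&p->u.count);
                    name->name = p->name;
            } else {
                    /* inline name: copy into the on-stack buffer */
                    memcpy(name->inline_name, dentry->d_iname,
                           dentry->d_name.len + 1);
                    name->name = name->inline_name;
            }
            spin_unlock(&dentry->d_lock);
    }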
On Thu, Apr 25, 2019 at 11:02:54AM -0700, Linus Torvalds wrote:

> I've pulled this, but maybe Jeff wants to look at whether that
> snapshotting model could have helped.

I really wonder if 76a495d666e5 (ceph: ensure d_name stability in
ceph_dentry_hash()) makes any sense; OK, you have ->d_lock held over
that, but what does it protect against?  Sure, you'll get something
that was valid while you held ->d_lock, but what good does it do to
the callers?  If they really have to care about races with d_move(),
which value do they want?
On Thu, 2019-04-25 at 11:02 -0700, Linus Torvalds wrote:
> On Thu, Apr 25, 2019 at 10:48 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > dentry name handling fixes from Jeff and a memory leak fix from Zheng.
> > Both are old issues, marked for stable.
>
> Hmm. You probably should have talked to Al about the dentry name
> issue, because he'd most likely have pointed you towards our helper
> function for exactly this thing:
>
>         struct name_snapshot stable;
>
>         take_dentry_name_snapshot(&stable, dentry);
>         ... use stable.name ..
>         release_dentry_name_snapshot(&stable);
>
> which doesn't need any extra memory allocation outside of some fairly
> limited stack allocation for the 'name_snapshot' itself, because it
> knows about the dentry name rules, and
>
>  - for inline names, it copies it under the d_lock into the fixed
> DNAME_INLINE_LEN-sized buffer
>
>  - for out-of-line names, it knows that the name allocation is stable
> and ref-counted, and just increments the refcount and uses the
> existing name pointer.
>
> now, maybe you need to always do that name allocation anyway (looking
> at the diff it looks like you often do that for other cases), so maybe
> the name snapshot capability isn't all that useful for you and the
> above wouldn't have helped, but I suspect you might not even have
> realized that there was an option like this.
>
> I've pulled this, but maybe Jeff wants to look at whether that
> snapshotting model could have helped.
>
> Linus

Thanks for the info! I think it would have.

I took a quick look at the dcache code to see if we had something like
that before I did this, but I guess I didn't look closely enough. Those
routines do look nicer than my hand-rolled version.

I'll look at spinning up a patch to switch that over in the near future.

Thanks,
On Thu, Apr 25, 2019 at 11:21 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I really wonder if 76a495d666e5 (ceph: ensure d_name stability in
> ceph_dentry_hash()) makes any sense; OK, you have ->d_lock held
> over that, but what does it protect against? Sure, you'll get
> something that was valid while you held ->d_lock, but what good
> does it do to the callers? If they really have to care about
> races with d_move(), which value do they want?

I suspect they only care that the data they gather for the network
packet is coherent, and passes internal sanity tests. You get either
old or new, but at least you don't get "not NUL terminated because the
length didn't match the data".

                Linus
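The commit under discussion takes ->d_lock around the hash computation so
that d_name.name and d_name.len are read as a coherent pair; a
reconstructed sketch of the pattern (simplified from what 76a495d666e5
adds to fs/ceph/dir.c, not quoted from the thread):

    unsigned int ceph_dentry_hash(struct inode *dir, struct dentry *dn)
    {
            struct ceph_inode_info *dci = ceph_inode(dir);
            unsigned int hash;

            switch (dci->i_dir_layout.dl_dir_hash) {
            case 0:                    /* for backward compat */
            case CEPH_STR_HASH_LINUX:
                    return dn->d_name.hash;
            default:
                    /* hold ->d_lock so name and len can't be torn
                     * apart by a concurrent d_move() */
                    spin_lock(&dn->d_lock);
                    hash = ceph_str_hash(dci->i_dir_layout.dl_dir_hash,
                                         dn->d_name.name, dn->d_name.len);
                    spin_unlock(&dn->d_lock);
                    return hash;
            }
    }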
On Thu, Apr 25, 2019 at 11:24:53AM -0700, Linus Torvalds wrote:

> I suspect they only care that the data they gather for the network
> packet is coherent, and passes internal sanity tests. You get either
> old or new, but at least you don't get "not NUL terminated because the
> length didn't match the data".

I'm not familiar enough with the protocol to tell if it's actually OK;
if anything, it's a question to ceph folks...
The pull request you sent on Thu, 25 Apr 2019 19:47:39 +0200:
> https://github.com/ceph/ceph-client.git tags/ceph-for-5.1-rc7
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8113a85f872003a9f5c58f9f143054b0d8ec73a5
Thank you!
On Thu, 2019-04-25 at 11:24 -0700, Linus Torvalds wrote:
> On Thu, Apr 25, 2019 at 11:21 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > I really wonder if 76a495d666e5 (ceph: ensure d_name stability in
> > ceph_dentry_hash()) makes any sense; OK, you have ->d_lock held
> > over that, but what does it protect against? Sure, you'll get
> > something that was valid while you held ->d_lock, but what good
> > does it do to the callers? If they really have to care about
> > races with d_move(), which value do they want?
>
> I suspect they only care that the data they gather for the network
> packet is coherent, and passes internal sanity tests. You get either
> old or new, but at least you don't get "not NUL terminated because the
> length didn't match the data".
>
> Linus

Right. My main concern was preventing oopses while iterating over
d_name. The main use for this hash is to help determine to which MDS we
should direct the request. My understanding is that if we get it wrong,
we'll end up retrying the request and picking a different MDS, so it's
not fatal if we do race in this fashion.
On Thu, Apr 25, 2019 at 02:23:59PM -0400, Jeff Layton wrote:

> I took a quick look at the dcache code to see if we had something like
> that before I did this, but I guess I didn't look closely enough. Those
> routines do look nicer than my hand-rolled version.
>
> I'll look at spinning up a patch to switch that over in the near future.

Jeff asked to put the name length in there; looking at the existing users,
it does make sense.  I propose turning struct name_snapshot into

struct name_snapshot {
	struct qstr name;
	unsigned char inline_name[DNAME_INLINE_LEN];
};

with

void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	name->name = dentry->d_name;
	if (unlikely(dname_external(dentry))) {
		struct external_name *p = external_name(dentry);
		atomic_inc(&p->u.count);
	} else {
		memcpy(name->inline_name, dentry->d_iname,
		       dentry->d_name.len + 1);
		name->name.name = name->inline_name;
	}
	spin_unlock(&dentry->d_lock);
}

and callers adjusted, initially simply by turning snapshot.name into
snapshot.name.name.  Next step: overlayfs call site becoming

	take_dentry_name_snapshot(&name, real);
	this = lookup_one_len(name.name.name, connected, name.name.len);

Next: snotify stuff switched to passing struct qstr * - fsnotify_move()
first, then fsnotify().  That one would
	* leave callers passing NULL alone
	* have the callers passing snapshot.name.name pass &snapshot.name
	* fsnotify_dirent() pass the entire &dentry->d_name, not just
dentry->d_name.name (that's dependent upon parent being held exclusive;
devpts plays fast and loose here, relying upon the lack of any kind of
renames, explicit or implicit, in that fs)
	* ditto for remaining call in fsnotify_move() (both parents
are locked in all callers, thankfully)
	* ditto for fsnotify_link()
	* kernfs callers in kernfs_notify_workfn() would grow strlen().
Next: send_to_group() and ->handle_event() instances switched to passing
struct qstr *.
Next: inotify_handle_event() doesn't need to bother with strlen().
Next: audit_update_watch() and audit_compare_dname_path() switched to
struct qstr *.  Callers in __audit_inode_child() pass the entire
&dentry->d_name.  strlen() inside audit_compare_dname_path() goes away.

Objections?
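The release side of Al's proposal would presumably keep today's logic, just
with the extra .name indirection; a sketch under those assumptions (the
qstr-based name_snapshot above):

    void release_dentry_name_snapshot(struct name_snapshot *name)
    {
            if (unlikely(name->name.name != name->inline_name)) {
                    struct external_name *p;

                    /* recover the refcounted container of the name */
                    p = container_of(name->name.name,
                                     struct external_name, name[0]);
                    if (unlikely(atomic_dec_and_test(&p->u.count)))
                            kfree_rcu(p, u.head);
            }
    }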
On Thu, 2019-04-25 at 21:09 +0100, Al Viro wrote:
> On Thu, Apr 25, 2019 at 02:23:59PM -0400, Jeff Layton wrote:
>
> > I took a quick look at the dcache code to see if we had something like
> > that before I did this, but I guess I didn't look closely enough. Those
> > routines do look nicer than my hand-rolled version.
> >
> > I'll look at spinning up a patch to switch that over in the near future.
>
> Jeff asked to put the name length in there; looking at the existing users,
> it does make sense.  I propose turning struct name_snapshot into
>
> struct name_snapshot {
> 	struct qstr name;
> 	unsigned char inline_name[DNAME_INLINE_LEN];
> };
>
> with
>
> void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
> {
> 	spin_lock(&dentry->d_lock);
> 	name->name = dentry->d_name;
> 	if (unlikely(dname_external(dentry))) {
> 		struct external_name *p = external_name(dentry);
> 		atomic_inc(&p->u.count);
> 	} else {
> 		memcpy(name->inline_name, dentry->d_iname,
> 		       dentry->d_name.len + 1);
> 		name->name.name = name->inline_name;
> 	}
> 	spin_unlock(&dentry->d_lock);
> }
>
> and callers adjusted, initially simply by turning snapshot.name into
> snapshot.name.name.  Next step: overlayfs call site becoming
> 	take_dentry_name_snapshot(&name, real);
> 	this = lookup_one_len(name.name.name, connected, name.name.len);
> Next: snotify stuff switched to passing struct qstr * - fsnotify_move()
> first, then fsnotify().  That one would
> 	* leave callers passing NULL alone
> 	* have the callers passing snapshot.name.name pass &snapshot.name
> 	* fsnotify_dirent() pass the entire &dentry->d_name, not just
> dentry->d_name.name (that's dependent upon parent being held exclusive;
> devpts plays fast and loose here, relying upon the lack of any kind of
> renames, explicit or implicit, in that fs)
> 	* ditto for remaining call in fsnotify_move() (both parents
> are locked in all callers, thankfully)
> 	* ditto for fsnotify_link()
> 	* kernfs callers in kernfs_notify_workfn() would grow strlen().
> Next: send_to_group() and ->handle_event() instances switched to passing
> struct qstr *.
> Next: inotify_handle_event() doesn't need to bother with strlen().
> Next: audit_update_watch() and audit_compare_dname_path() switched to
> struct qstr *.  Callers in __audit_inode_child() pass the entire
> &dentry->d_name.  strlen() inside audit_compare_dname_path() goes away.
>
> Objections?

I have some patches that do what you lay out above. They build but I
haven't run them through much testing yet.

It turns out though that using name_snapshot from ceph is a bit more
tricky. In some cases, we have to call ceph_mdsc_build_path to build up
a full path string. We can't easily populate a name_snapshot from there
because struct external_name is only defined in fs/dcache.c.

I could add some routines to do this, but it feels a lot like I'm
abusing internal dcache interfaces. I'll keep thinking about it though.

While we're on the subject though:

struct external_name {
	union {
		atomic_t count;
		struct rcu_head head;
	} u;
	unsigned char name[];
};

Is it really ok to union the count and rcu_head there?

I haven't trawled through all of the code yet, but what prevents someone
from trying to access the count inside an RCU critical section, after
call_rcu has been called on it?

Thanks,
On Fri, Apr 26, 2019 at 9:25 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> Is it really ok to union the count and rcu_head there?

It should be fine, because the rcu_count should only ever be used once
the count has gone to zero and the name cannot be found any more.

And while RCU path walking may find and use the *name* after the
dentry has been killed off (but not free'd yet), all the actual
external_name() accesses should be serialized by the dentry lock, so
there's no access to those fields once the dentry is dead.

At least that's how it's supposed to work. Al would be the final
arbiter on this.

                Linus
On Fri, Apr 26, 2019 at 9:36 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Apr 26, 2019 at 9:25 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > Is it really ok to union the count and rcu_head there?
>
> It should be fine, because the rcu_count should only ever be used once
                                     ^^^^^

that should be 'head' of course.

                Linus "duh" Torvalds
On Fri, Apr 26, 2019 at 12:25:03PM -0400, Jeff Layton wrote:

> It turns out though that using name_snapshot from ceph is a bit more
> tricky. In some cases, we have to call ceph_mdsc_build_path to build up
> a full path string. We can't easily populate a name_snapshot from there
> because struct external_name is only defined in fs/dcache.c.

Explain, please.  For ceph_mdsc_build_path() you don't need name
snapshots at all and existing code is, AFAICS, just fine, except
for pointless pr_err() there.

I _probably_ would take allocation out of the loop (e.g. make it
__getname(), called unconditionally) and turn it into the
d_path.c-style read_seqbegin_or_lock()/need_seqretry()/done_seqretry()
loop, so that the first pass would go under rcu_read_lock(), while
the second (if needed) would just hold rename_lock exclusive (without
bumping the refcount).  But that's a matter of (theoretical) livelock
avoidance, not the locking correctness for ->d_name accesses.

Oh, and
	*base = ceph_ino(d_inode(temp));
	*plen = len;
probably belongs in critical section - _that_ might be a correctness
issue, since temp is not held by anything once you are out of there.

> I could add some routines to do this, but it feels a lot like I'm
> abusing internal dcache interfaces. I'll keep thinking about it though.
>
> While we're on the subject though:
>
> struct external_name {
> 	union {
> 		atomic_t count;
> 		struct rcu_head head;
> 	} u;
> 	unsigned char name[];
> };
>
> Is it really ok to union the count and rcu_head there?
>
> I haven't trawled through all of the code yet, but what prevents someone
> from trying to access the count inside an RCU critical section, after
> call_rcu has been called on it?

The fact that no lockless accesses to ->count are ever done?
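The d_path.c-style retry idiom Al refers to looks schematically like this
(a sketch of the seqlock pattern only, not the actual ceph patch; the
path-building body is elided):

    char *path = __getname();   /* allocate once, outside the loop */
    int seq = 0;

    rcu_read_lock();
retry:
    read_seqbegin_or_lock(&rename_lock, &seq);
    /* ... walk from the dentry towards the root, copying names ... */
    if (need_seqretry(&rename_lock, seq)) {
            seq = 1;    /* second pass takes rename_lock exclusive */
            goto retry;
    }
    done_seqretry(&rename_lock, seq);
    rcu_read_unlock();

The first pass is a lockless walk under rcu_read_lock(); only if a
concurrent rename bumped rename_lock's sequence does the second pass run,
now holding the lock exclusively so it cannot livelock.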
On Fri, Apr 26, 2019 at 09:36:00AM -0700, Linus Torvalds wrote:
> On Fri, Apr 26, 2019 at 9:25 AM Jeff Layton <jlayton@kernel.org> wrote:
> > Is it really ok to union the count and rcu_head there?
>
> It should be fine, because the rcu_count should only ever be used once
> the count has gone to zero and the name cannot be found any more.
>
> And while RCU path walking may find and use the *name* after the
> dentry has been killed off (but not free'd yet), all the actual
> external_name() accesses should be serialized by the dentry lock, so
> there's no access to those fields once the dentry is dead.

It's not quite that; access to external_name contents is fine,
->d_lock or not.  __d_lookup_rcu() does read it under rcu_read_lock
alone.

However:
	* we never free it without an RCU delay after the final
drop of refcount.  RCU delay might happen on dentry->d_rcu (if
it's dentry_free()) or on name->p.rcu (if it's
release_dentry_name_snapshot() or d_move() dropping the final
reference).
	* it's never observed in ->d_name after the refcount
reaches zero.
	* no lockless access ever looks at the refcount.  It
can look at ->name[], but that's it.

What I don't understand is why would anyone want to mess with
name snapshots for dentry_path() lookalikes...
On Fri, Apr 26, 2019 at 10:01 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What I don't understand is why would anyone want to mess with
> name snapshots for dentry_path() lookalikes...

Talking about dentry_path()...  Shouldn't the ceph path walking code
also check the mount_lock sequence to make sure the path is actually
stable?

Maybe it doesn't matter.

                Linus
On Fri, Apr 26, 2019 at 10:08:48AM -0700, Linus Torvalds wrote:
> On Fri, Apr 26, 2019 at 10:01 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What I don't understand is why would anyone want to mess with
> > name snapshots for dentry_path() lookalikes...
>
> Talking about dentry_path()... Shouldn't the ceph path walking code
> also check the mount_lock sequence to make sure the path is actually
> stable?
>
> Maybe it doesn't matter.

They want it relative to fs root - after all, server doesn't know or
care what's mounted where on client...
On Fri, 2019-04-26 at 17:50 +0100, Al Viro wrote:
> On Fri, Apr 26, 2019 at 12:25:03PM -0400, Jeff Layton wrote:
>
> > It turns out though that using name_snapshot from ceph is a bit more
> > tricky. In some cases, we have to call ceph_mdsc_build_path to build up
> > a full path string. We can't easily populate a name_snapshot from there
> > because struct external_name is only defined in fs/dcache.c.
>
> Explain, please.  For ceph_mdsc_build_path() you don't need name
> snapshots at all and existing code is, AFAICS, just fine, except
> for pointless pr_err() there.

Eventually we have to pass back the result of all the
build_dentry_path() shenanigans to create_request_message(), and then
free whatever that result is after using it. Today we can get back a
string+length from ceph_mdsc_build_path or clone_dentry_name, or we
might get direct pointers into the dentry if the situation allows for
it.

Now we want to rip out clone_dentry_name() and start using
take_dentry_name_snapshot(). That returns a name_snapshot that we'll
need to pass back to create_request_message. It will need to deal with
the fact that it could get one of those instead of just a
string+length.

My original thought was to always pass back a name_snapshot, but that
turns out to be difficult because its innards are not public. The other
potential solutions that I've tried make this code look even worse than
it already is.

> I _probably_ would take allocation out of the loop (e.g. make it
> __getname(), called unconditionally) and turn it into the
> d_path.c-style read_seqbegin_or_lock()/need_seqretry()/done_seqretry()
> loop, so that the first pass would go under rcu_read_lock(), while
> the second (if needed) would just hold rename_lock exclusive (without
> bumping the refcount).  But that's a matter of (theoretical) livelock
> avoidance, not the locking correctness for ->d_name accesses.

Yeah, that does sound better. I want to think about this code a bit

> Oh, and
> 	*base = ceph_ino(d_inode(temp));
> 	*plen = len;
> probably belongs in critical section - _that_ might be a correctness
> issue, since temp is not held by anything once you are out of there.

Good catch. I'll fix that up.

> > I could add some routines to do this, but it feels a lot like I'm
> > abusing internal dcache interfaces. I'll keep thinking about it though.
> >
> > While we're on the subject though:
> >
> > struct external_name {
> > 	union {
> > 		atomic_t count;
> > 		struct rcu_head head;
> > 	} u;
> > 	unsigned char name[];
> > };
> >
> > Is it really ok to union the count and rcu_head there?
> >
> > I haven't trawled through all of the code yet, but what prevents someone
> > from trying to access the count inside an RCU critical section, after
> > call_rcu has been called on it?
>
> The fact that no lockless accesses to ->count are ever done?

Thanks,
On Fri, 2019-04-26 at 18:01 +0100, Al Viro wrote:
> On Fri, Apr 26, 2019 at 09:36:00AM -0700, Linus Torvalds wrote:
> > On Fri, Apr 26, 2019 at 9:25 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > Is it really ok to union the count and rcu_head there?
> >
> > It should be fine, because the rcu_count should only ever be used once
> > the count has gone to zero and the name cannot be found any more.
> >
> > And while RCU path walking may find and use the *name* after the
> > dentry has been killed off (but not free'd yet), all the actual
> > external_name() accesses should be serialized by the dentry lock, so
> > there's no access to those fields once the dentry is dead.
>
> It's not quite that; access to external_name contents is fine,
> ->d_lock or not.  __d_lookup_rcu() does read it under rcu_read_lock
> alone.
>
> However:
> 	* we never free it without an RCU delay after the final
> drop of refcount.  RCU delay might happen on dentry->d_rcu (if
> it's dentry_free()) or on name->p.rcu (if it's
> release_dentry_name_snapshot() or d_move() dropping the final
> reference).
> 	* it's never observed in ->d_name after the refcount
> reaches zero.
> 	* no lockless access ever looks at the refcount.  It
> can look at ->name[], but that's it.

Got it, thanks. Why use an atomic_t for the refcount if it's always
accessed under spinlock?

> What I don't understand is why would anyone want to mess with
> name snapshots for dentry_path() lookalikes...

Mostly because the place where the ceph code needs to use and free
these strings is rather far removed from where they are created. It
simplifies that part if we can access and free them all in the same
way.

I was planning to just use name_snapshots universally for that purpose,
but they have a rather specific method of freeing things that is hard
to duplicate if you don't have a dentry to clone.

Probably that means I'm approaching this problem in the wrong way and
need to do it differently.
On Fri, Apr 26, 2019 at 04:49:24PM -0400, Jeff Layton wrote:

> Got it, thanks. Why use an atomic_t for the refcount if it's always
> accessed under spinlock?

_Which_ spinlock?  The whole reason for refcounts is that many dentries
might end up with shared external name.  So ->d_lock on one of them
won't do anything to accesses via another...

> > What I don't understand is why would anyone want to mess with
> > name snapshots for dentry_path() lookalikes...
>
> Mostly because the place where the ceph code needs to use and free these
> strings is rather far removed from where they are created. It simplifies
> that part if we can access and free them all in the same way.
>
> I was planning to just use name_snapshots universally for that purpose,
> but they have a rather specific method of freeing things that is hard to
> duplicate if you don't have a dentry to clone.
>
> Probably that means I'm approaching this problem in the wrong way and
> need to do it differently.

For short names snapshot will put the copy into your variable, normally
on the stack frame.  So it doesn't really help if you want to keep that
stuff around.  Again, dentries with short names have those stored inside
the dentry and for anything long-term you will need to copy that data;
otherwise rename() will happily change those under you.

If you want to deal with pathnames, do __getname() + a loop similar to
that in dentry_path() et.al.  All there is to it.  Snapshots are for
situations when you want a reasonably short-term access to name of
specific dentry you have a reference to and do not want to do any
allocations.
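The sharing Al mentions comes from d_move(): __d_move() can make a dentry
adopt its target's out-of-line name, so one external_name allocation may
be referenced by several dentries (and by snapshots), and no single
dentry's ->d_lock covers all of them. A paraphrased sketch of
fs/dcache.c's copy_name() showing where the atomic count is taken and
dropped:

    static void copy_name(struct dentry *dentry, struct dentry *target)
    {
            struct external_name *old_name = NULL;

            if (unlikely(dname_external(dentry)))
                    old_name = external_name(dentry);
            if (unlikely(dname_external(target))) {
                    /* share target's refcounted out-of-line name */
                    atomic_inc(&external_name(target)->u.count);
                    dentry->d_name = target->d_name;
            } else {
                    memcpy(dentry->d_iname, target->d_name.name,
                           target->d_name.len + 1);
                    dentry->d_name.name = dentry->d_iname;
                    dentry->d_name.hash_len = target->d_name.hash_len;
            }
            if (old_name && likely(atomic_dec_and_test(&old_name->u.count)))
                    kfree_rcu(old_name, u.head);
    }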
On Fri, Apr 26, 2019 at 01:30:53PM -0400, Jeff Layton wrote:

> > I _probably_ would take allocation out of the loop (e.g. make it
> > __getname(), called unconditionally) and turn it into the
> > d_path.c-style read_seqbegin_or_lock()/need_seqretry()/done_seqretry()
> > loop, so that the first pass would go under rcu_read_lock(), while
> > the second (if needed) would just hold rename_lock exclusive (without
> > bumping the refcount).  But that's a matter of (theoretical) livelock
> > avoidance, not the locking correctness for ->d_name accesses.
>
> Yeah, that does sound better. I want to think about this code a bit

FWIW, is there any reason to insist that the pathname is put into the
beginning of the buffer?  I mean, instead of path + pathlen we might
return path + offset, with the pathname going from path + offset to
path + PATH_MAX - 1 inclusive, with path being the thing eventually
freed.

It's easier to build the string backwards, seeing that we are walking
from leaf to root...
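Concretely, building backwards means prepending each component at a cursor
that moves toward the start of the buffer; a hypothetical helper (name and
signature invented for illustration, not code from the thread):

    /* Prepend "/name" at *pos, moving the cursor towards buf[0]. */
    static int prepend_name(char *buf, int *pos, const struct qstr *name)
    {
            if (*pos < name->len + 1)
                    return -ENAMETOOLONG;
            *pos -= name->len + 1;
            buf[*pos] = '/';
            memcpy(buf + *pos + 1, name->name, name->len);
            return 0;
    }

The caller would start with *pos at the end of the buffer (leaving room
for the trailing NUL) and hand back buf + *pos, instead of memmoving the
finished string to the front.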
On Sun, 2019-04-28 at 05:38 +0100, Al Viro wrote:
> On Fri, Apr 26, 2019 at 01:30:53PM -0400, Jeff Layton wrote:
>
> > > I _probably_ would take allocation out of the loop (e.g. make it
> > > __getname(), called unconditionally) and turn it into the
> > > d_path.c-style read_seqbegin_or_lock()/need_seqretry()/done_seqretry()
> > > loop, so that the first pass would go under rcu_read_lock(), while
> > > the second (if needed) would just hold rename_lock exclusive (without
> > > bumping the refcount).  But that's a matter of (theoretical) livelock
> > > avoidance, not the locking correctness for ->d_name accesses.
> >
> > Yeah, that does sound better. I want to think about this code a bit
>
> FWIW, is there any reason to insist that the pathname is put into the
> beginning of the buffer?  I mean, instead of path + pathlen we might
> return path + offset, with the pathname going from path + offset to
> path + PATH_MAX - 1 inclusive, with path being the thing eventually
> freed.
>
> It's easier to build the string backwards, seeing that we are walking
> from leaf to root...

(cc'ing linux-cifs)

I don't see a problem doing what you suggest. An offset + fixed length
buffer would be fine there.

Is there a real benefit to using __getname though? It sucks when we have
to reallocate but I doubt that it happens with any frequency. Most of
these paths will end up being much shorter than PATH_MAX and that slims
down the memory footprint a bit.

Also, FWIW -- this code was originally copied from cifs'
build_path_from_dentry(). Should we aim to put something in common
infrastructure that both can call?

There are some significant logic differences in the two functions though
so we might need some sort of callback function or something to know
when to stop walking.
On Sun, Apr 28, 2019 at 09:27:20AM -0400, Jeff Layton wrote:

> I don't see a problem doing what you suggest. An offset + fixed length
> buffer would be fine there.
>
> Is there a real benefit to using __getname though? It sucks when we have
> to reallocate but I doubt that it happens with any frequency. Most of
> these paths will end up being much shorter than PATH_MAX and that slims
> down the memory footprint a bit.

AFAICS, they are all short-lived; don't forget that slabs have cache,
so in that situation allocations are cheap.

> Also, FWIW -- this code was originally copied from cifs'
> build_path_from_dentry(). Should we aim to put something in common
> infrastructure that both can call?
>
> There are some significant logic differences in the two functions though
> so we might need some sort of callback function or something to know
> when to stop walking.

Not if you want it fast...  Indirect calls are not cheap; the cost of
those callbacks would be considerable.  Besides, you want more than
"where do I stop", right?  It's also "what output do I use for this
dentry", both for you and for cifs (there it's "which separator to use",
in ceph it's "these we want represented as //")...

Can it be called on detached subtree, during e.g. open_by_handle()?
There it can get really fishy; you end up with base being at the
random point on the way towards root.  How does that work, and if
it *does* work, why do we need the whole path in the first place?

BTW, for cifs there's no need to play with ->d_lock as we go.  For
ceph, the only need comes from looking at d_inode(), and I wonder if
it would be better to duplicate that information ("is that a
snapdir/nosnap") into dentry itself - would certainly be cheaper.
OTOH, we are getting short on spare bits in ->d_flags...
On Sun, 2019-04-28 at 15:48 +0100, Al Viro wrote:
> On Sun, Apr 28, 2019 at 09:27:20AM -0400, Jeff Layton wrote:
>
> > I don't see a problem doing what you suggest. An offset + fixed length
> > buffer would be fine there.
> >
> > Is there a real benefit to using __getname though? It sucks when we have
> > to reallocate but I doubt that it happens with any frequency. Most of
> > these paths will end up being much shorter than PATH_MAX and that slims
> > down the memory footprint a bit.
>
> AFAICS, they are all short-lived; don't forget that slabs have cache,
> so in that situation allocations are cheap.

Fair enough. Al also pointed out on IRC that the __getname/__putname
caches are likely to be hot, so using that may be less costly cpu-wise.

> > Also, FWIW -- this code was originally copied from cifs'
> > build_path_from_dentry(). Should we aim to put something in common
> > infrastructure that both can call?
> >
> > There are some significant logic differences in the two functions though
> > so we might need some sort of callback function or something to know
> > when to stop walking.
>
> Not if you want it fast...  Indirect calls are not cheap; the cost of
> those callbacks would be considerable.  Besides, you want more than
> "where do I stop", right?  It's also "what output do I use for this
> dentry", both for you and for cifs (there it's "which separator to use",
> in ceph it's "these we want represented as //")...
>
> Can it be called on detached subtree, during e.g. open_by_handle()?
> There it can get really fishy; you end up with base being at the
> random point on the way towards root.  How does that work, and if
> it *does* work, why do we need the whole path in the first place?

This I'm not sure of. commit 79b33c8874334e (ceph: snapshot nfs
re-export) explains this a bit, but I'm not sure it really covers this
case.

Zheng/Sage, feel free to correct me here:

My understanding is that for snapshots you need the base inode number,
snapid, and the full path from there to the dentry for a ceph MDS call.

There is a filehandle type for a snapshotted inode:

struct ceph_nfs_snapfh {
	u64 ino;
	u64 snapid;
	u64 parent_ino;
	u32 hash;
} __attribute__ ((packed));

So I guess it is possible. You could do name_to_handle_at for an inode
deep down in a snapshotted tree, and then try to open_by_handle_at
after the dcache gets cleaned out for some other reason.

What I'm not clear on is why we need to build paths at all for
snapshots. Why is a parent inode number (inside the snapshot) + a
snapid + dentry name not sufficient?

> BTW, for cifs there's no need to play with ->d_lock as we go.  For
> ceph, the only need comes from looking at d_inode(), and I wonder if
> it would be better to duplicate that information ("is that a
> snapdir/nosnap") into dentry itself - would certainly be cheaper.
> OTOH, we are getting short on spare bits in ->d_flags...

We could stick that in ceph_dentry_info (->d_fsdata). We have a flags
field in there already.
On Sun, Apr 28, 2019 at 11:47:58AM -0400, Jeff Layton wrote:

> We could stick that in ceph_dentry_info (->d_fsdata). We have a flags
> field in there already.

Yes, but...  You have it freed in ->d_release(), AFAICS, and without
any delays.  So lockless accesses will be trouble.
On Sun, 2019-04-28 at 16:52 +0100, Al Viro wrote:
> On Sun, Apr 28, 2019 at 11:47:58AM -0400, Jeff Layton wrote:
>
> > We could stick that in ceph_dentry_info (->d_fsdata). We have a flags
> > field in there already.
>
> Yes, but...  You have it freed in ->d_release(), AFAICS, and without
> any delays.  So lockless accesses will be trouble.

That's easy enough to fix -- we could add an rcu_head to it and
call_rcu in ceph_d_release instead of just freeing it immediately.

I guess if we find that d_fsdata is NULL, we can just treat it like we
currently do a negative dentry?
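What Jeff describes would look something like this (hypothetical sketch;
it assumes ceph_dentry_info grows an rcu_head member, here called 'rcu'):

    static void ceph_dentry_info_free_rcu(struct rcu_head *head)
    {
            struct ceph_dentry_info *di =
                    container_of(head, struct ceph_dentry_info, rcu);

            kmem_cache_free(ceph_dentry_cachep, di);
    }

    static void ceph_d_release(struct dentry *dentry)
    {
            struct ceph_dentry_info *di = dentry->d_fsdata;

            /* ... existing synchronous teardown ... */
            dentry->d_fsdata = NULL;
            /* delay the actual free so lockless readers stay safe */
            call_rcu(&di->rcu, ceph_dentry_info_free_rcu);
    }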
On Sun, Apr 28, 2019 at 04:52:16PM +0100, Al Viro wrote:
> On Sun, Apr 28, 2019 at 11:47:58AM -0400, Jeff Layton wrote:
>
> > We could stick that in ceph_dentry_info (->d_fsdata). We have a flags
> > field in there already.
>
> Yes, but...  You have it freed in ->d_release(), AFAICS, and without
> any delays.  So lockless accesses will be trouble.

You could RCU-delay the actual kmem_cache_free(ceph_dentry_cachep, di)
in there, but I've no idea whether the overhead would be painful - on
massive eviction (e.g. on memory pressure) it might be.

Another variant is to introduce ->d_free(), to be called from
__d_free() and __d_free_external().  That, however, would need another
->d_flags bit for presence of that method, so that we don't get extra
overhead from looking into ->d_op...

Looking through ->d_release() instances, we have
	afs: empty, might as well have not been there
	autofs: does some sync stuff (eviction from ->active_list/->expire_list)
plus kfree_rcu
	ceph: some sync stuff + immediate kmem_cache_free()
	debugfs: kfree(), might or might not be worth RCU-delaying
	ecryptfs: sync stuff (path_put for ->lower) + RCU-delayed part
	fuse: kfree_rcu()
	nfs: kfree()
	overlayfs: a bunch of dput() (obviously sync) + kfree_rcu()
	9p: sync

So it actually might make sense to move the RCU-delayed bits to
separate method.  Some ->d_release() instances would be simply gone,
as for the rest...  I wonder which of the sync parts can be moved over
to ->d_prune().  Not guaranteed to be doable (or a good idea), but...
E.g. for autofs it almost certainly would be the right place for the
sync parts - we are, essentially, telling the filesystem to forget its
private (non-refcounted) references to the victim.
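For illustration, the ->d_free() variant might be wired up roughly like
this (entirely hypothetical: neither the method nor a DCACHE_OP_FREE flag
exists; only the container_of() over d_u.d_rcu matches the existing RCU
callback in fs/dcache.c):

    struct dentry_operations {
            /* ... existing methods ... */
            void (*d_free)(struct dentry *);    /* RCU-delayed teardown */
    };

    static void __d_free(struct rcu_head *head)
    {
            struct dentry *dentry =
                    container_of(head, struct dentry, d_u.d_rcu);

            /* hypothetical flag avoids touching ->d_op for everyone */
            if (unlikely(dentry->d_flags & DCACHE_OP_FREE))
                    dentry->d_op->d_free(dentry);
            kmem_cache_free(dentry_cache, dentry);
    }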