Message ID | 20150424163534.6eb109eb@notabene.brown (mailing list archive) |
---|---|
State | New, archived |
On Fri, Apr 24, 2015 at 04:35:34PM +1000, NeilBrown wrote:
> > 	* path_init() arranges for that to be true in the beginning of the walk.
> >
> > 	* symlinks aside, walk_component() preserves that.
> > 	  + normal name components go through lookup_fast(), where we have
> > __d_lookup_rcu() find a child of current nd->path with matching name
> > and (atomically) pick ->d_seq of that child, which had the name matching
> > our component.  Atomicity is provided by ->d_lock on the child.  Then we
> > proceed to pick ->d_inode (and verify that
>
> I don't see d_lock being taken in __d_lookup_rcu.

D'oh...  No, it isn't - sorry about the braino.

> I think this atomicity is provided by ->d_seq on the child.

Yes.  It's fetch ->d_seq, compare names, then in the caller fetch ->d_inode
and compare ->d_seq.

> So.....
> Where I have:
>
> +	if (nd->flags & LOOKUP_RCU) {
> +		if (!nd->root.mnt)
> +			set_root_rcu(nd);
> +		nd->path = nd->root;
>
> in the case where the symlink starts with '/', I need to set nd->seq
>
> 	nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
>
> but in the case where the symlink doesn't start with '/', I don't change
> nd->path, so nd->seq should still be correct?

It would be correct, if walk_component() hadn't _already_ flipped it to match
the symlink.  By that point it's too late - we'd already lost the old value.

This is the area where it hits the fan:

	if (__read_seqcount_retry(&parent->d_seq, nd->seq))
		return -ECHILD;

OK, parent is still valid...

	nd->seq = seq;

... and we lose its seq number, replacing it with that of a child.

	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) {
		status = d_revalidate(dentry, nd->flags);
		if (unlikely(status <= 0)) {
			if (status != -ECHILD)
				need_reval = 0;
			goto unlazy;
		}
	}
	path->mnt = mnt;
	path->dentry = dentry;

Set path to (nd->path.mnt, dentry of child).

	if (likely(__follow_mount_rcu(nd, path, inode)))

See if we need to (and can) cross the mountpoint here.  That can change the
(vfsmount, dentry) pair *and* nd->seq - making it match whatever path->dentry
ends up being.

		return 0;

If everything's fine, we return to the caller, still in RCU mode.

unlazy:
	if (unlazy_walk(nd, dentry))

... and here we attempt to fall back to non-lazy mode, expecting nd->seq to
match dentry.  Win or lose, that's the last point where nd->seq will be
looked at.

		return -ECHILD;

See the problem?  We have discarded the nd->seq value for the parent and we
can't re-pick it without opening a race.

AFAICS, the simplest solution is to have nd->next_seq and set _that_ in
lookup_fast() and __follow_mount_rcu(), using it in unlazy_walk().  And in
the callers, have it copied into nd->seq after we'd established that it's
not a symlink to be followed.

Handling of absolute symlinks also needs some care - we have no idea whether
nd->root still makes any sense.  It might have become negative by that point.
unlazy_walk() done first would've checked that it's still our process' root,
but if we stay in lazy mode, we obviously can't rely on that check.

One possible solution is to do the same kind of check we do in unlazy_walk():

	if (!(nd->flags & LOOKUP_ROOT)) {
		spin_lock(&fs->lock);
		if (nd->root.mnt != fs->root.mnt ||
		    nd->root.dentry != fs->root.dentry)
			bugger off
		fetch ->d_inode and ->d_seq
		spin_unlock(&fs->lock);
	} else {
		fetch ->d_inode and ->d_seq
	}

Another is to store the ->d_seq of root in nd->seq_root, have set_root_rcu()
set that instead of (or, better, in addition to) returning it to the caller,
have path_init() initialize it explicitly in the LOOKUP_ROOT case, and have
this "reset to root" in following an absolute symlink just do

	set_root_rcu(nd);
	nd->path = nd->root;
	nd->seq = nd->seq_root;
	nd->inode = nd->path.dentry->d_inode;
	check if nd->path.dentry is stale, bugger off if it is

That avoids the spin_lock() on each absolute symlink at the price of an extra
32 bits in struct nameidata.
It looks like doing on-demand reallocation of nd->stack is the right way to
go anyway, so the pressure on nameidata size is going to be weaker and that
might be the right way to go...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Fri, Apr 24, 2015 at 02:42:03PM +0100, Al Viro wrote:

> That avoids this spin_lock() on each absolute symlink at the price of extra
> 32 bits in struct nameidata.  It looks like doing on-demand reallocation
> of nd->stack is the right way to go anyway, so the pressure on nameidata
> size is going to be weaker and that might be the right way to go...

OK, on-demand reallocation is done.  What I have right now is

	* flat MAXSYMLINKS 40, no matter what kind of nesting there might be.
	* purely iterative link_path_walk().
	* no damn nameidata on stack for generic_readlink().
	* stack footprint of the entire thing independent from the nesting
	  depth, and about on par with the "no symlinks at all" case in
	  mainline.
	* some massage towards RCU follow_link done (at the end of the
	  queue), but quite a bit more work remains.

What I've got so far is in vfs.git#link_path_walk; I'm not too happy about
posting a 70-chunk mailbomb, but it really will need review and testing.
It survives xfstests and LTP with no regressions, but it will need serious
profiling, etc., along with RTFS.  I tried to keep it in reasonably small
pieces, but there's a lot of them ;-/

FWIW, I've a bit more reorganization plotted out, but it's not far from
where we need to be for RCU follow_link.  Some notes:

	* I don't believe we want to pass flags to ->follow_link() - it's
	  much simpler to give the damn thing NULL for dentry in the RCU
	  case.  In *all* cases where we might have a chance to get the
	  symlink body without blocking, we can do that by inode alone.
	  We obviously want to pass dentry and inode separately (and in the
	  case of fast symlinks we don't hit the filesystem at all), but
	  that's it - flags isn't needed.
	* terminate_walk() should do bulk put_link().  So should the failure
	  cases of complete_walk().  _Success_ of complete_walk() should be
	  careful about legitimizing links - it *can* be called with one
	  link on the stack, and be followed by access to the link body.
	  Yes, really - do_last() in the O_CREAT case.
	* do_last(), lookup_last() and mountpoint_last() ought to have
	  put_link() done when called on a non-empty stack (thus turning the
	  loops into something like

		while ((err = lookup_last(nd)) > 0) {
			err = trailing_symlink(nd);
			if (err)
				break;
		}

	  ), _after_ the point where they don't need to look at the last
	  component of the name, obviously.
	* I think we should leave terminate_walk() to callers in the failure
	  cases of walk_component() and handle_dots(), as well as
	  get_link().  Makes life simpler in callers, actually.  I'll play
	  with that a bit more.
	* it might make sense to add a second flag to walk_component(), in
	  addition to LOOKUP_FOLLOW, meaning "do put_link() once you are
	  done looking at the name".  In principle, it promises simpler
	  logic with unlazy_walk(), but that's one area I'm still not
	  entirely sure about.  Will need to experiment a bit...
	* nd->seq clobbering will need to be dealt with, as discussed
	  upthread.
	* I _really_ hate your "let's use the LSB of struct page * to tell
	  if we need to kunmap()" approach.  It's too damn ugly to live.
	  _And_ it's trivial to avoid - all we need is to have (non-lazy)
	  page_follow_link_light() and page_symlink() remove __GFP_HIGHMEM
	  from inode->i_mapping before ever asking to allocate pages there.
	  That'll suffice, and it makes sense regardless of the RCU work -
	  that kmap/kunmap with potential for minutes in between (while
	  waiting for a stuck NFS server in the middle of symlink traversal)
	  is simply wrong.
On Mon, 4 May 2015 06:11:29 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> OK, on-demand reallocation is done.  What I have right now is
> [...]
> that kmap/kunmap with potential for minutes in between (while waiting for
> stuck NFS server in the middle of symlink traversal) is simply wrong.

Thanks!  I'll have another look and see about adding what is needed for RCU
symlink support.

NeilBrown
diff --git a/fs/namei.c b/fs/namei.c
index d13b4315447f..ce6387d5317c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -947,6 +947,8 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 			if (!nd->root.mnt)
 				set_root_rcu(nd);
 			nd->path = nd->root;
+			nd->seq = read_seqcount_begin(
+				&nd->path.dentry->d_seq);
 		} else {
 			if (!nd->root.mnt)
 				set_root(nd);
@@ -954,9 +956,9 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 			nd->path = nd->root;
 			path_get(&nd->root);
 		}
+		nd->inode = nd->path.dentry->d_inode;
 		nd->flags |= LOOKUP_JUMPED;
 	}
-	nd->inode = nd->path.dentry->d_inode;
 	error = link_path_walk(s, nd);
 	if (unlikely(error))
 		put_link(nd, link, inode, *p);