Message ID | 45778432fba32dce1fb1f5fd13272c89c95c3f52.1696043833.git.kjlx@templeofstupid.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | virtiofs submounts that are still in use forgotten by shrinker | expand |
On 10/2/23 17:24, Krister Johansen wrote: > The submount code uses the parent nodeid passed into the function in > order to create the root dentry for the new submount. This nodeid does > not get its remote reference count incremented by a lookup option. > > If the parent inode is evicted from its superblock, due to memory > pressure for example, it can result in a forget opertation being sent to > the server. Should this nodeid be forgotten while it is still in use in > a submount, users of the submount get an error from the server on any > subsequent access. In the author's case, this was an EBADF on all > subsequent operations that needed to reference the root. > > Debugging the problem revealed that the dentry shrinker triggered a forget > after killing the dentry with the last reference, despite the root > dentry in another superblock still using the nodeid. > > As a result, a container that was also using this submount failed to > access its filesystem because it had borrowed the reference instead of > taking its own when setting up its superblock for the submount. > > This commit fixes the problem by having the new submount trigger a > lookup for the parent as part of creating a new root dentry for the > virtiofsd submount superblock. This allows each superblock to have its > inodes removed by the shrinker when unreferenced, while keeping the > nodeid reference count accurate and active with the server. > > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> > --- > fs/fuse/dir.c | 10 +++++----- > fs/fuse/fuse_i.h | 6 ++++++ > fs/fuse/inode.c | 43 +++++++++++++++++++++++++++++++++++++------ > 3 files changed, 48 insertions(+), 11 deletions(-) > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index 5e01946d7531..333730c74619 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args, > args->out_args[0].value = outarg; > } > > -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, > - struct dentry *entry, > - struct inode *inode, > - struct fuse_entry_out *outarg, > - bool *lookedup) > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, > + struct dentry *entry, > + struct inode *inode, > + struct fuse_entry_out *outarg, > + bool *lookedup) > { > struct dentry *parent; > struct fuse_forget_link *forget; > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 405252bb51f2..a66fcf50a4cc 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags); > bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment); > void fuse_dax_cancel_work(struct fuse_conn *fc); > > +/* dir.c */ > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry, > + struct inode *inode, > + struct fuse_entry_out *outarg, > + bool *lookedup); > + > /* ioctl.c */ > long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); > long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index 444418e240c8..79a31cb55512 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb, > struct fuse_mount *fm = get_fuse_mount_super(sb); > struct super_block *parent_sb = parent_fi->inode.i_sb; > struct fuse_attr root_attr; > + struct fuse_inode *fi; > struct inode *root; > + struct inode *parent; > + struct dentry *pdent; > + struct fuse_entry_out outarg; > + bool lookedup = false; > + int ret; > > fuse_sb_defaults(sb); > fm->sb = sb; > @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb, > if (parent_sb->s_subtype && !sb->s_subtype) > return -ENOMEM; > > - fuse_fill_attr_from_inode(&root_attr, parent_fi); > - root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0); > /* > - * This inode is just a duplicate, so it is not looked up and > - * its nlookup should not be incremented. fuse_iget() does > - * that, though, so undo it here. > + * It is necessary to lookup the parent_if->nodeid in case the dentry > + * that triggered the automount of the submount is later evicted. > + * If this dentry is evicted without the lookup count getting increased > + * on the submount root, then the server can subsequently forget this > + * nodeid which leads to errors when trying to access the root of the > + * submount. > */ > - get_fuse_inode(root)->nlookup--; > + parent = &parent_fi->inode; > + pdent = d_find_alias(parent); > + if (!pdent) > + return -EINVAL; > + > + ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg, > + &lookedup); > + dput(pdent); > + /* > + * The new root owns this nlookup on success, and it is incremented by > + * fuse_iget(). In the case the lookup succeeded but revalidate fails, > + * ensure that the lookup count is tracked by the parent. > + */ > + if (ret <= 0) { > + if (lookedup) { > + fi = get_fuse_inode(parent); > + spin_lock(&fi->lock); > + fi->nlookup++; > + spin_unlock(&fi->lock); > + } I might be wrong, but doesn't that mean that "get_fuse_inode(root)->nlookup--" needs to be called? Thanks, Bernd
On Fri, Oct 06, 2023 at 07:13:06PM +0200, Bernd Schubert wrote: > > > On 10/2/23 17:24, Krister Johansen wrote: > > The submount code uses the parent nodeid passed into the function in > > order to create the root dentry for the new submount. This nodeid does > > not get its remote reference count incremented by a lookup option. > > > > If the parent inode is evicted from its superblock, due to memory > > pressure for example, it can result in a forget opertation being sent to > > the server. Should this nodeid be forgotten while it is still in use in > > a submount, users of the submount get an error from the server on any > > subsequent access. In the author's case, this was an EBADF on all > > subsequent operations that needed to reference the root. > > > > Debugging the problem revealed that the dentry shrinker triggered a forget > > after killing the dentry with the last reference, despite the root > > dentry in another superblock still using the nodeid. > > > > As a result, a container that was also using this submount failed to > > access its filesystem because it had borrowed the reference instead of > > taking its own when setting up its superblock for the submount. > > > > This commit fixes the problem by having the new submount trigger a > > lookup for the parent as part of creating a new root dentry for the > > virtiofsd submount superblock. This allows each superblock to have its > > inodes removed by the shrinker when unreferenced, while keeping the > > nodeid reference count accurate and active with the server. > > > > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> > > --- > > fs/fuse/dir.c | 10 +++++----- > > fs/fuse/fuse_i.h | 6 ++++++ > > fs/fuse/inode.c | 43 +++++++++++++++++++++++++++++++++++++------ > > 3 files changed, 48 insertions(+), 11 deletions(-) > > > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > > index 5e01946d7531..333730c74619 100644 > > --- a/fs/fuse/dir.c > > +++ b/fs/fuse/dir.c > > @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args, > > args->out_args[0].value = outarg; > > } > > -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, > > - struct dentry *entry, > > - struct inode *inode, > > - struct fuse_entry_out *outarg, > > - bool *lookedup) > > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, > > + struct dentry *entry, > > + struct inode *inode, > > + struct fuse_entry_out *outarg, > > + bool *lookedup) > > { > > struct dentry *parent; > > struct fuse_forget_link *forget; > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > > index 405252bb51f2..a66fcf50a4cc 100644 > > --- a/fs/fuse/fuse_i.h > > +++ b/fs/fuse/fuse_i.h > > @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags); > > bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment); > > void fuse_dax_cancel_work(struct fuse_conn *fc); > > +/* dir.c */ > > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry, > > + struct inode *inode, > > + struct fuse_entry_out *outarg, > > + bool *lookedup); > > + > > /* ioctl.c */ > > long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); > > long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > > index 444418e240c8..79a31cb55512 100644 > > --- a/fs/fuse/inode.c > > +++ b/fs/fuse/inode.c > > @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb, > > struct fuse_mount *fm = get_fuse_mount_super(sb); > > struct super_block *parent_sb = parent_fi->inode.i_sb; > > struct fuse_attr root_attr; > > + struct fuse_inode *fi; > > struct inode *root; > > + struct inode *parent; > > + struct dentry *pdent; > > + struct fuse_entry_out outarg; > > + bool lookedup = false; > > + int ret; > > fuse_sb_defaults(sb); > > fm->sb = sb; > > @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb, > > if (parent_sb->s_subtype && !sb->s_subtype) > > return -ENOMEM; > > - fuse_fill_attr_from_inode(&root_attr, parent_fi); > > - root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0); > > /* > > - * This inode is just a duplicate, so it is not looked up and > > - * its nlookup should not be incremented. fuse_iget() does > > - * that, though, so undo it here. > > + * It is necessary to lookup the parent_if->nodeid in case the dentry > > + * that triggered the automount of the submount is later evicted. > > + * If this dentry is evicted without the lookup count getting increased > > + * on the submount root, then the server can subsequently forget this > > + * nodeid which leads to errors when trying to access the root of the > > + * submount. > > */ > > - get_fuse_inode(root)->nlookup--; > > + parent = &parent_fi->inode; > > + pdent = d_find_alias(parent); > > + if (!pdent) > > + return -EINVAL; > > + > > + ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg, > > + &lookedup); > > + dput(pdent); > > + /* > > + * The new root owns this nlookup on success, and it is incremented by > > + * fuse_iget(). In the case the lookup succeeded but revalidate fails, > > + * ensure that the lookup count is tracked by the parent. > > + */ > > + if (ret <= 0) { > > + if (lookedup) { > > + fi = get_fuse_inode(parent); > > + spin_lock(&fi->lock); > > + fi->nlookup++; > > + spin_unlock(&fi->lock); > > + } > > I might be wrong, but doesn't that mean that > "get_fuse_inode(root)->nlookup--" needs to be called? In the case where ret > 0, the nlookup on get_fuse_inode(root) is set to 1 by fuse_iget(). That ensures that the root is forgotten when later unmounted. The code that handles the forget uses the count of nlookup to tell the server-side how many references to forget. (That's in fuse_evict_inode()). However, if the fuse_dentry_revalidate_lookup() call performs a valid lookup but returns an error, this function will return before it fills out s_root in the superblock or calls fuse_iget(). If the superblock doesn't have a s_root set, then the code in generic_kill_super() won't dput() the root dentry and trigger the forget. The intention of this code was to handle the case where the lookup had succeeded, but the code determined it was still necessary to return an error. In that situation, the reference taken by the lookup has to be accounted somewhere, and the parent seemed like a plausible candidate. However, after writing up this response, I can see that there's still a problem here if d_make_root(root) returns NULL, because we'll also lose track of the nlookup in that case. If you agree that charging this to the parent on error makes sense, I'll re-work the error handling here so that the right thing happens when either fuse_dentry_revalidate_lookup() or d_make_root() encounter an error. Thanks for the feedback. -K
On 10/7/23 02:41, Krister Johansen wrote: > On Fri, Oct 06, 2023 at 07:13:06PM +0200, Bernd Schubert wrote: >> >> >> On 10/2/23 17:24, Krister Johansen wrote: >>> The submount code uses the parent nodeid passed into the function in >>> order to create the root dentry for the new submount. This nodeid does >>> not get its remote reference count incremented by a lookup option. >>> >>> If the parent inode is evicted from its superblock, due to memory >>> pressure for example, it can result in a forget opertation being sent to >>> the server. Should this nodeid be forgotten while it is still in use in >>> a submount, users of the submount get an error from the server on any >>> subsequent access. In the author's case, this was an EBADF on all >>> subsequent operations that needed to reference the root. >>> >>> Debugging the problem revealed that the dentry shrinker triggered a forget >>> after killing the dentry with the last reference, despite the root >>> dentry in another superblock still using the nodeid. >>> >>> As a result, a container that was also using this submount failed to >>> access its filesystem because it had borrowed the reference instead of >>> taking its own when setting up its superblock for the submount. >>> >>> This commit fixes the problem by having the new submount trigger a >>> lookup for the parent as part of creating a new root dentry for the >>> virtiofsd submount superblock. This allows each superblock to have its >>> inodes removed by the shrinker when unreferenced, while keeping the >>> nodeid reference count accurate and active with the server. >>> >>> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> >>> --- >>> fs/fuse/dir.c | 10 +++++----- >>> fs/fuse/fuse_i.h | 6 ++++++ >>> fs/fuse/inode.c | 43 +++++++++++++++++++++++++++++++++++++------ >>> 3 files changed, 48 insertions(+), 11 deletions(-) >>> >>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c >>> index 5e01946d7531..333730c74619 100644 >>> --- a/fs/fuse/dir.c >>> +++ b/fs/fuse/dir.c >>> @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args, >>> args->out_args[0].value = outarg; >>> } >>> -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, >>> - struct dentry *entry, >>> - struct inode *inode, >>> - struct fuse_entry_out *outarg, >>> - bool *lookedup) >>> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, >>> + struct dentry *entry, >>> + struct inode *inode, >>> + struct fuse_entry_out *outarg, >>> + bool *lookedup) >>> { >>> struct dentry *parent; >>> struct fuse_forget_link *forget; >>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h >>> index 405252bb51f2..a66fcf50a4cc 100644 >>> --- a/fs/fuse/fuse_i.h >>> +++ b/fs/fuse/fuse_i.h >>> @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags); >>> bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment); >>> void fuse_dax_cancel_work(struct fuse_conn *fc); >>> +/* dir.c */ >>> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry, >>> + struct inode *inode, >>> + struct fuse_entry_out *outarg, >>> + bool *lookedup); >>> + >>> /* ioctl.c */ >>> long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); >>> long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, >>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c >>> index 444418e240c8..79a31cb55512 100644 >>> --- a/fs/fuse/inode.c >>> +++ b/fs/fuse/inode.c >>> @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb, >>> struct fuse_mount *fm = get_fuse_mount_super(sb); >>> struct super_block *parent_sb = parent_fi->inode.i_sb; >>> struct fuse_attr root_attr; >>> + struct fuse_inode *fi; >>> struct inode *root; >>> + struct inode *parent; >>> + struct dentry *pdent; >>> + struct fuse_entry_out outarg; >>> + bool lookedup = false; >>> + int ret; >>> fuse_sb_defaults(sb); >>> fm->sb = sb; >>> @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb, >>> if (parent_sb->s_subtype && !sb->s_subtype) >>> return -ENOMEM; >>> - fuse_fill_attr_from_inode(&root_attr, parent_fi); >>> - root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0); >>> /* >>> - * This inode is just a duplicate, so it is not looked up and >>> - * its nlookup should not be incremented. fuse_iget() does >>> - * that, though, so undo it here. >>> + * It is necessary to lookup the parent_if->nodeid in case the dentry >>> + * that triggered the automount of the submount is later evicted. >>> + * If this dentry is evicted without the lookup count getting increased >>> + * on the submount root, then the server can subsequently forget this >>> + * nodeid which leads to errors when trying to access the root of the >>> + * submount. >>> */ >>> - get_fuse_inode(root)->nlookup--; >>> + parent = &parent_fi->inode; >>> + pdent = d_find_alias(parent); >>> + if (!pdent) >>> + return -EINVAL; >>> + >>> + ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg, >>> + &lookedup); >>> + dput(pdent); >>> + /* >>> + * The new root owns this nlookup on success, and it is incremented by >>> + * fuse_iget(). In the case the lookup succeeded but revalidate fails, >>> + * ensure that the lookup count is tracked by the parent. >>> + */ >>> + if (ret <= 0) { >>> + if (lookedup) { >>> + fi = get_fuse_inode(parent); >>> + spin_lock(&fi->lock); >>> + fi->nlookup++; >>> + spin_unlock(&fi->lock); >>> + } >> >> I might be wrong, but doesn't that mean that >> "get_fuse_inode(root)->nlookup--" needs to be called? > > In the case where ret > 0, the nlookup on get_fuse_inode(root) is set to > 1 by fuse_iget(). That ensures that the root is forgotten when later > unmounted. The code that handles the forget uses the count of nlookup > to tell the server-side how many references to forget. (That's in > fuse_evict_inode()). > > However, if the fuse_dentry_revalidate_lookup() call performs a valid > lookup but returns an error, this function will return before it fills > out s_root in the superblock or calls fuse_iget(). If the superblock > doesn't have a s_root set, then the code in generic_kill_super() won't > dput() the root dentry and trigger the forget. > > The intention of this code was to handle the case where the lookup had > succeeded, but the code determined it was still necessary to return an > error. In that situation, the reference taken by the lookup has to be > accounted somewhere, and the parent seemed like a plausible candidate. Yeah sorry, I had just missed that fuse_iget() also moved and then thought it would have increased fi->nlookup already. > > However, after writing up this response, I can see that there's still a > problem here if d_make_root(root) returns NULL, because we'll also lose > track of the nlookup in that case. > > If you agree that charging this to the parent on error makes sense, I'll > re-work the error handling here so that the right thing happens when > either fuse_dentry_revalidate_lookup() or d_make_root() encounter an > error. Oh yeah, I also missed that. Although, iput() calls iput_final, which then calls evict and sends the fuse forget - isn't that the right action already? > > Thanks for the feedback. Well, false alarm from my side, sorry again! Bernd
On Mon, Oct 09, 2023 at 02:52:42PM +0200, Bernd Schubert wrote: > > > On 10/7/23 02:41, Krister Johansen wrote: > > On Fri, Oct 06, 2023 at 07:13:06PM +0200, Bernd Schubert wrote: > > > > > > > > > On 10/2/23 17:24, Krister Johansen wrote: > > > > The submount code uses the parent nodeid passed into the function in > > > > order to create the root dentry for the new submount. This nodeid does > > > > not get its remote reference count incremented by a lookup option. > > > > > > > > If the parent inode is evicted from its superblock, due to memory > > > > pressure for example, it can result in a forget opertation being sent to > > > > the server. Should this nodeid be forgotten while it is still in use in > > > > a submount, users of the submount get an error from the server on any > > > > subsequent access. In the author's case, this was an EBADF on all > > > > subsequent operations that needed to reference the root. > > > > > > > > Debugging the problem revealed that the dentry shrinker triggered a forget > > > > after killing the dentry with the last reference, despite the root > > > > dentry in another superblock still using the nodeid. > > > > > > > > As a result, a container that was also using this submount failed to > > > > access its filesystem because it had borrowed the reference instead of > > > > taking its own when setting up its superblock for the submount. > > > > > > > > This commit fixes the problem by having the new submount trigger a > > > > lookup for the parent as part of creating a new root dentry for the > > > > virtiofsd submount superblock. This allows each superblock to have its > > > > inodes removed by the shrinker when unreferenced, while keeping the > > > > nodeid reference count accurate and active with the server. > > > > > > > > Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> > > > > --- > > > > fs/fuse/dir.c | 10 +++++----- > > > > fs/fuse/fuse_i.h | 6 ++++++ > > > > fs/fuse/inode.c | 43 +++++++++++++++++++++++++++++++++++++------ > > > > 3 files changed, 48 insertions(+), 11 deletions(-) > > > > > > > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > > > > index 5e01946d7531..333730c74619 100644 > > > > --- a/fs/fuse/dir.c > > > > +++ b/fs/fuse/dir.c > > > > @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args, > > > > args->out_args[0].value = outarg; > > > > } > > > > -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, > > > > - struct dentry *entry, > > > > - struct inode *inode, > > > > - struct fuse_entry_out *outarg, > > > > - bool *lookedup) > > > > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, > > > > + struct dentry *entry, > > > > + struct inode *inode, > > > > + struct fuse_entry_out *outarg, > > > > + bool *lookedup) > > > > { > > > > struct dentry *parent; > > > > struct fuse_forget_link *forget; > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > > > > index 405252bb51f2..a66fcf50a4cc 100644 > > > > --- a/fs/fuse/fuse_i.h > > > > +++ b/fs/fuse/fuse_i.h > > > > @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags); > > > > bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment); > > > > void fuse_dax_cancel_work(struct fuse_conn *fc); > > > > +/* dir.c */ > > > > +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry, > > > > + struct inode *inode, > > > > + struct fuse_entry_out *outarg, > > > > + bool *lookedup); > > > > + > > > > /* ioctl.c */ > > > > long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); > > > > long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > > > > index 444418e240c8..79a31cb55512 100644 > > > > --- a/fs/fuse/inode.c > > > > +++ b/fs/fuse/inode.c > > > > @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb, > > > > struct fuse_mount *fm = get_fuse_mount_super(sb); > > > > struct super_block *parent_sb = parent_fi->inode.i_sb; > > > > struct fuse_attr root_attr; > > > > + struct fuse_inode *fi; > > > > struct inode *root; > > > > + struct inode *parent; > > > > + struct dentry *pdent; > > > > + struct fuse_entry_out outarg; > > > > + bool lookedup = false; > > > > + int ret; > > > > fuse_sb_defaults(sb); > > > > fm->sb = sb; > > > > @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb, > > > > if (parent_sb->s_subtype && !sb->s_subtype) > > > > return -ENOMEM; > > > > - fuse_fill_attr_from_inode(&root_attr, parent_fi); > > > > - root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0); > > > > /* > > > > - * This inode is just a duplicate, so it is not looked up and > > > > - * its nlookup should not be incremented. fuse_iget() does > > > > - * that, though, so undo it here. > > > > + * It is necessary to lookup the parent_if->nodeid in case the dentry > > > > + * that triggered the automount of the submount is later evicted. > > > > + * If this dentry is evicted without the lookup count getting increased > > > > + * on the submount root, then the server can subsequently forget this > > > > + * nodeid which leads to errors when trying to access the root of the > > > > + * submount. > > > > */ > > > > - get_fuse_inode(root)->nlookup--; > > > > + parent = &parent_fi->inode; > > > > + pdent = d_find_alias(parent); > > > > + if (!pdent) > > > > + return -EINVAL; > > > > + > > > > + ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg, > > > > + &lookedup); > > > > + dput(pdent); > > > > + /* > > > > + * The new root owns this nlookup on success, and it is incremented by > > > > + * fuse_iget(). In the case the lookup succeeded but revalidate fails, > > > > + * ensure that the lookup count is tracked by the parent. > > > > + */ > > > > + if (ret <= 0) { > > > > + if (lookedup) { > > > > + fi = get_fuse_inode(parent); > > > > + spin_lock(&fi->lock); > > > > + fi->nlookup++; > > > > + spin_unlock(&fi->lock); > > > > + } > > > > > > I might be wrong, but doesn't that mean that > > > "get_fuse_inode(root)->nlookup--" needs to be called? > > > > In the case where ret > 0, the nlookup on get_fuse_inode(root) is set to > > 1 by fuse_iget(). That ensures that the root is forgotten when later > > unmounted. The code that handles the forget uses the count of nlookup > > to tell the server-side how many references to forget. (That's in > > fuse_evict_inode()). > > > > However, if the fuse_dentry_revalidate_lookup() call performs a valid > > lookup but returns an error, this function will return before it fills > > out s_root in the superblock or calls fuse_iget(). If the superblock > > doesn't have a s_root set, then the code in generic_kill_super() won't > > dput() the root dentry and trigger the forget. > > > > The intention of this code was to handle the case where the lookup had > > succeeded, but the code determined it was still necessary to return an > > error. In that situation, the reference taken by the lookup has to be > > accounted somewhere, and the parent seemed like a plausible candidate. > > Yeah sorry, I had just missed that fuse_iget() also moved and then thought > it would have increased fi->nlookup already. No worries; I'd much rather get feedback if something doesn't look right, even if it turns out okay in the end. > > However, after writing up this response, I can see that there's still a > > problem here if d_make_root(root) returns NULL, because we'll also lose > > track of the nlookup in that case. > > > > If you agree that charging this to the parent on error makes sense, I'll > > re-work the error handling here so that the right thing happens when > > either fuse_dentry_revalidate_lookup() or d_make_root() encounter an > > error. > > Oh yeah, I also missed that. Although, iput() calls iput_final, which then > calls evict and sends the fuse forget - isn't that the right action already? Thanks, I had forgotten that d_make_root() would call iput() for me if d_alloc_anon() fails. Let me restate this to suggest that I account the nlookup to the parent if fuse_dentry_revalidate_lookup() or fuse_iget() fail instead. Does that sound right? > > Thanks for the feedback. > > Well, false alarm from my side, sorry again! No apology necessary; I appreciate you spending the time to look and ask questions. -K
On 10/9/23 19:15, Krister Johansen wrote: > On Mon, Oct 09, 2023 at 02:52:42PM +0200, Bernd Schubert wrote: >> >> >> On 10/7/23 02:41, Krister Johansen wrote: >>> On Fri, Oct 06, 2023 at 07:13:06PM +0200, Bernd Schubert wrote: >>>> >>>> >>>> On 10/2/23 17:24, Krister Johansen wrote: >>>>> The submount code uses the parent nodeid passed into the function in >>>>> order to create the root dentry for the new submount. This nodeid does >>>>> not get its remote reference count incremented by a lookup option. >>>>> >>>>> If the parent inode is evicted from its superblock, due to memory >>>>> pressure for example, it can result in a forget opertation being sent to >>>>> the server. Should this nodeid be forgotten while it is still in use in >>>>> a submount, users of the submount get an error from the server on any >>>>> subsequent access. In the author's case, this was an EBADF on all >>>>> subsequent operations that needed to reference the root. >>>>> >>>>> Debugging the problem revealed that the dentry shrinker triggered a forget >>>>> after killing the dentry with the last reference, despite the root >>>>> dentry in another superblock still using the nodeid. >>>>> >>>>> As a result, a container that was also using this submount failed to >>>>> access its filesystem because it had borrowed the reference instead of >>>>> taking its own when setting up its superblock for the submount. >>>>> >>>>> This commit fixes the problem by having the new submount trigger a >>>>> lookup for the parent as part of creating a new root dentry for the >>>>> virtiofsd submount superblock. This allows each superblock to have its >>>>> inodes removed by the shrinker when unreferenced, while keeping the >>>>> nodeid reference count accurate and active with the server. >>>>> >>>>> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> >>>>> --- >>>>> fs/fuse/dir.c | 10 +++++----- >>>>> fs/fuse/fuse_i.h | 6 ++++++ >>>>> fs/fuse/inode.c | 43 +++++++++++++++++++++++++++++++++++++------ >>>>> 3 files changed, 48 insertions(+), 11 deletions(-) >>>>> >>>>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c >>>>> index 5e01946d7531..333730c74619 100644 >>>>> --- a/fs/fuse/dir.c >>>>> +++ b/fs/fuse/dir.c >>>>> @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args, >>>>> args->out_args[0].value = outarg; >>>>> } >>>>> -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, >>>>> - struct dentry *entry, >>>>> - struct inode *inode, >>>>> - struct fuse_entry_out *outarg, >>>>> - bool *lookedup) >>>>> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, >>>>> + struct dentry *entry, >>>>> + struct inode *inode, >>>>> + struct fuse_entry_out *outarg, >>>>> + bool *lookedup) >>>>> { >>>>> struct dentry *parent; >>>>> struct fuse_forget_link *forget; >>>>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h >>>>> index 405252bb51f2..a66fcf50a4cc 100644 >>>>> --- a/fs/fuse/fuse_i.h >>>>> +++ b/fs/fuse/fuse_i.h >>>>> @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags); >>>>> bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment); >>>>> void fuse_dax_cancel_work(struct fuse_conn *fc); >>>>> +/* dir.c */ >>>>> +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry, >>>>> + struct inode *inode, >>>>> + struct fuse_entry_out *outarg, >>>>> + bool *lookedup); >>>>> + >>>>> /* ioctl.c */ >>>>> long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); >>>>> long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, >>>>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c >>>>> index 444418e240c8..79a31cb55512 100644 >>>>> --- a/fs/fuse/inode.c >>>>> +++ b/fs/fuse/inode.c >>>>> @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb, >>>>> struct fuse_mount *fm = get_fuse_mount_super(sb); >>>>> struct super_block *parent_sb = parent_fi->inode.i_sb; >>>>> struct fuse_attr root_attr; >>>>> + struct fuse_inode *fi; >>>>> struct inode *root; >>>>> + struct inode *parent; >>>>> + struct dentry *pdent; >>>>> + struct fuse_entry_out outarg; >>>>> + bool lookedup = false; >>>>> + int ret; >>>>> fuse_sb_defaults(sb); >>>>> fm->sb = sb; >>>>> @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb, >>>>> if (parent_sb->s_subtype && !sb->s_subtype) >>>>> return -ENOMEM; >>>>> - fuse_fill_attr_from_inode(&root_attr, parent_fi); >>>>> - root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0); >>>>> /* >>>>> - * This inode is just a duplicate, so it is not looked up and >>>>> - * its nlookup should not be incremented. fuse_iget() does >>>>> - * that, though, so undo it here. >>>>> + * It is necessary to lookup the parent_if->nodeid in case the dentry >>>>> + * that triggered the automount of the submount is later evicted. >>>>> + * If this dentry is evicted without the lookup count getting increased >>>>> + * on the submount root, then the server can subsequently forget this >>>>> + * nodeid which leads to errors when trying to access the root of the >>>>> + * submount. >>>>> */ >>>>> - get_fuse_inode(root)->nlookup--; >>>>> + parent = &parent_fi->inode; >>>>> + pdent = d_find_alias(parent); >>>>> + if (!pdent) >>>>> + return -EINVAL; >>>>> + >>>>> + ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg, >>>>> + &lookedup); >>>>> + dput(pdent); >>>>> + /* >>>>> + * The new root owns this nlookup on success, and it is incremented by >>>>> + * fuse_iget(). In the case the lookup succeeded but revalidate fails, >>>>> + * ensure that the lookup count is tracked by the parent. >>>>> + */ >>>>> + if (ret <= 0) { >>>>> + if (lookedup) { >>>>> + fi = get_fuse_inode(parent); >>>>> + spin_lock(&fi->lock); >>>>> + fi->nlookup++; >>>>> + spin_unlock(&fi->lock); >>>>> + } >>>> >>>> I might be wrong, but doesn't that mean that >>>> "get_fuse_inode(root)->nlookup--" needs to be called? >>> >>> In the case where ret > 0, the nlookup on get_fuse_inode(root) is set to >>> 1 by fuse_iget(). That ensures that the root is forgotten when later >>> unmounted. The code that handles the forget uses the count of nlookup >>> to tell the server-side how many references to forget. (That's in >>> fuse_evict_inode()). >>> >>> However, if the fuse_dentry_revalidate_lookup() call performs a valid >>> lookup but returns an error, this function will return before it fills >>> out s_root in the superblock or calls fuse_iget(). If the superblock >>> doesn't have a s_root set, then the code in generic_kill_super() won't >>> dput() the root dentry and trigger the forget. >>> >>> The intention of this code was to handle the case where the lookup had >>> succeeded, but the code determined it was still necessary to return an >>> error. In that situation, the reference taken by the lookup has to be >>> accounted somewhere, and the parent seemed like a plausible candidate. >> >> Yeah sorry, I had just missed that fuse_iget() also moved and then thought >> it would have increased fi->nlookup already. > > No worries; I'd much rather get feedback if something doesn't look > right, even if it turns out okay in the end. > >>> However, after writing up this response, I can see that there's still a >>> problem here if d_make_root(root) returns NULL, because we'll also lose >>> track of the nlookup in that case. >>> >>> If you agree that charging this to the parent on error makes sense, I'll >>> re-work the error handling here so that the right thing happens when >>> either fuse_dentry_revalidate_lookup() or d_make_root() encounter an >>> error. >> >> Oh yeah, I also missed that. Although, iput() calls iput_final, which then >> calls evict and sends the fuse forget - isn't that the right action already? > > Thanks, I had forgotten that d_make_root() would call iput() for me if > d_alloc_anon() fails. Let me restate this to suggest that I account the > nlookup to the parent if fuse_dentry_revalidate_lookup() or fuse_iget() > fail instead. Does that sound right? Hmm, so server/daemon side uses the lookup count to have an inode reference - are you sure that parent is the right inode for the forget call? And what is the probability for such failures? I.e. is that performance critical? Wouldn't be much simpler and clearer to just avoid and doubt and to send an immediate forget? Thanks, Bernd
On Mon, 2 Oct 2023 at 17:24, Krister Johansen <kjlx@templeofstupid.com> wrote: > > The submount code uses the parent nodeid passed into the function in > order to create the root dentry for the new submount. This nodeid does > not get its remote reference count incremented by a lookup option. > > If the parent inode is evicted from its superblock, due to memory > pressure for example, it can result in a forget opertation being sent to > the server. Should this nodeid be forgotten while it is still in use in > a submount, users of the submount get an error from the server on any > subsequent access. In the author's case, this was an EBADF on all > subsequent operations that needed to reference the root. > > Debugging the problem revealed that the dentry shrinker triggered a forget > after killing the dentry with the last reference, despite the root > dentry in another superblock still using the nodeid. There's some context missing here. There are two dentries: a mount point in the parent mount and the root of the submount. The server indicates that the looked up inode is a submount using FUSE_ATTR_SUBMOUNT. Then AFAICS the following happens: 1) the mountpoint dentry is created with nlookup = 1. The S_AUTOMOUNT flag is set on the mountpoint inode. 2) the lookup code sees S_AUTOMOUNT and triggers the submount operation, which sets up the new super_block and the root dentry with the user supplied nodeid and with nlookup = 0 (because it wasn't actually looked up). How the automount gets torn down is another story. You say that the mount point gets evicted due to memory pressure. But it can't get evicted while the submount is attached. So the submount must first get detached, and then the mount point can be reclaimed. The question is: how does the submount gets detached. Do you have an idea? Thanks, Miklos
On Mon, Oct 09, 2023 at 09:45:08PM +0200, Miklos Szeredi wrote: > On Mon, 2 Oct 2023 at 17:24, Krister Johansen <kjlx@templeofstupid.com> wrote: > > > > The submount code uses the parent nodeid passed into the function in > > order to create the root dentry for the new submount. This nodeid does > > not get its remote reference count incremented by a lookup option. > > > > If the parent inode is evicted from its superblock, due to memory > > pressure for example, it can result in a forget opertation being sent to > > the server. Should this nodeid be forgotten while it is still in use in > > a submount, users of the submount get an error from the server on any > > subsequent access. In the author's case, this was an EBADF on all > > subsequent operations that needed to reference the root. > > > > Debugging the problem revealed that the dentry shrinker triggered a forget > > after killing the dentry with the last reference, despite the root > > dentry in another superblock still using the nodeid. > > There's some context missing here. There are two dentries: a mount > point in the parent mount and the root of the submount. > > The server indicates that the looked up inode is a submount using > FUSE_ATTR_SUBMOUNT. Then AFAICS the following happens: > > 1) the mountpoint dentry is created with nlookup = 1. The > S_AUTOMOUNT flag is set on the mountpoint inode. > > 2) the lookup code sees S_AUTOMOUNT and triggers the submount > operation, which sets up the new super_block and the root dentry with > the user supplied nodeid and with nlookup = 0 (because it wasn't > actually looked up). > > How the automount gets torn down is another story. You say that the > mount point gets evicted due to memory pressure. But it can't get > evicted while the submount is attached. So the submount must first > get detached, and then the mount point can be reclaimed. The > question is: how does the submount gets detached. Do you have an > idea? Apologies for not stating this clearly. The use case is a container running in a VM, and the container's root is provided to the guest via virtiofs. I believe the submount is getting detached as part of the container setup, either via a umount2(MNT_DETACH) of the old root filesystem, or as part of pivot_root() itself. By the time I'm able to inspect the dentry associated with the submount in the initial mount ns (case #1) its d_lockref.count is 0, and /proc/mountinfo doesn't show an active mount for the submount in that mount namespace. If I manually traverse the path to the submount via something like cd and ls from the initial mount namespace, it'll stay referenced until I run a umount for the automounted path. I'm reasonably sure it's the container setup that's causing the detaching. I'm happy to go debug this some more, though, if you're skeptical of the explanation. -K
On Mon, Oct 09, 2023 at 08:43:02PM +0200, Bernd Schubert wrote: > On 10/9/23 19:15, Krister Johansen wrote: > > Thanks, I had forgotten that d_make_root() would call iput() for me if > > d_alloc_anon() fails. Let me restate this to suggest that I account the > > nlookup to the parent if fuse_dentry_revalidate_lookup() or fuse_iget() > > fail instead. Does that sound right? > > Hmm, so server/daemon side uses the lookup count to have an inode reference > - are you sure that parent is the right inode for the forget call? And what > is the probability for such failures? I.e. is that performance critical? > Wouldn't be much simpler and clearer to just avoid and doubt and to send an > immediate forget? Yeah, the server / daemon side need to track the lookup count so that it knows when it can close the fd for the file on the server-side. (At least for virtiofsd, anyway.). The reason I had avoided doing the forget in the submount code is that it needs a forget linkage in order to call fuse_queue_forget(). One of these is allocated by fuse_alloc_inode(). A well formed parent should always have one. However, if the fuse_iget() for the submount root fails, then there's no linkage to borrow from the new inode. The code could always call fuse_alloc_forget() directly, like is done elsewhere. I thought it might be hard to get the memory for this allocation if fuse_iget() also can't allocate enough, but I could move the allocation earlier in the function and just free it if it's not used. I'm not confident that would reduce the amount of code in the function, but if you'd find it clearer, I'm happy to modify it accordingly. -K
On Tue, 10 Oct 2023 at 04:35, Krister Johansen <kjlx@templeofstupid.com> wrote: > If I manually traverse the path to the submount via something like cd > and ls from the initial mount namespace, it'll stay referenced until I > run a umount for the automounted path. I'm reasonably sure it's the > container setup that's causing the detaching. Okay. Can you please try the attached test patch. It's not a proper solution, but I think it's the right direction. Thanks, Miklos
Hi Miklos, On Tue, Oct 10, 2023 at 10:15:38AM +0200, Miklos Szeredi wrote: > On Tue, 10 Oct 2023 at 04:35, Krister Johansen <kjlx@templeofstupid.com> wrote: > > > If I manually traverse the path to the submount via something like cd > > and ls from the initial mount namespace, it'll stay referenced until I > > run a umount for the automounted path. I'm reasonably sure it's the > > container setup that's causing the detaching. > > Okay. Can you please try the attached test patch. It's not a proper > solution, but I think it's the right direction. Thanks, I tested this patch and was unable to reproduce the scenario where an OOM resulted in the submount from the guest in the guest getting an EBADF from virtiofsd. (I did generate OOMs, though). I am curious what you have in mind in order to move this towards a proper fix? I shied away from the approach of stealing a nlookup from mp_fi beacuse it wasn't clear that I could always count on the nlookup in the parent staying positive. E.g. I was afraid I was either going to not have enough nlookups to move to submounts, or trigger a forget from an exiting container that leads to an EBADF from the initial mount namespace. -K > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index 2e4eb7cf26fb..d5f47203dfbc 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -1524,6 +1524,18 @@ static int fuse_get_tree_submount(struct fs_context *fsc) > return err; > } > > + spin_lock(&mp_fi->lock); > + if (mp_fi->nlookup) { > + struct fuse_inode *fi = get_fuse_inode(d_inode(sb->s_root)); > + mp_fi->nlookup--; > + spin_unlock(&mp_fi->lock); > + spin_lock(&fi->lock); > + fi->nlookup++; > + spin_unlock(&fi->lock); > + } else { > + spin_unlock(&mp_fi->lock); > + } > + > down_write(&fc->killsb); > list_add_tail(&fm->fc_entry, &fc->mounts); > up_write(&fc->killsb);
On Wed, 11 Oct 2023 at 03:26, Krister Johansen <kjlx@templeofstupid.com> wrote: > I am curious what you have in mind in order to move this towards a > proper fix? I shied away from the approach of stealing a nlookup from > mp_fi beacuse it wasn't clear that I could always count on the nlookup > in the parent staying positive. E.g. I was afraid I was either going to > not have enough nlookups to move to submounts, or trigger a forget from > an exiting container that leads to an EBADF from the initial mount > namespace. One idea is to transfer the nlookup to a separately refcounted object that is referenced from mp_fi as well as all the submounts. Thanks, Miklos
On Wed, Oct 11, 2023 at 09:07:33AM +0200, Miklos Szeredi wrote: > On Wed, 11 Oct 2023 at 03:26, Krister Johansen <kjlx@templeofstupid.com> wrote: > > > I am curious what you have in mind in order to move this towards a > > proper fix? I shied away from the approach of stealing a nlookup from > > mp_fi beacuse it wasn't clear that I could always count on the nlookup > > in the parent staying positive. E.g. I was afraid I was either going to > > not have enough nlookups to move to submounts, or trigger a forget from > > an exiting container that leads to an EBADF from the initial mount > > namespace. > > One idea is to transfer the nlookup to a separately refcounted object > that is referenced from mp_fi as well as all the submounts. That seems possible. Would the idea be to move all tracking of nlookup to a separate refcounted object for the particular nodeid, or just do this for the first lookup of a submount? Would you like me to put together a v3 that heads this direction? Thanks, -K
On Wed, 11 Oct 2023 at 18:32, Krister Johansen <kjlx@templeofstupid.com> wrote: > > On Wed, Oct 11, 2023 at 09:07:33AM +0200, Miklos Szeredi wrote: > > On Wed, 11 Oct 2023 at 03:26, Krister Johansen <kjlx@templeofstupid.com> wrote: > > > > > I am curious what you have in mind in order to move this towards a > > > proper fix? I shied away from the approach of stealing a nlookup from > > > mp_fi beacuse it wasn't clear that I could always count on the nlookup > > > in the parent staying positive. E.g. I was afraid I was either going to > > > not have enough nlookups to move to submounts, or trigger a forget from > > > an exiting container that leads to an EBADF from the initial mount > > > namespace. > > > > One idea is to transfer the nlookup to a separately refcounted object > > that is referenced from mp_fi as well as all the submounts. > > That seems possible. Would the idea be to move all tracking of nlookup > to a separate refcounted object for the particular nodeid, or just do > this for the first lookup of a submount? Just for submounts. And yes, it should work if the count from the first lookup is transferred to this object (fuse_iget()) and subsequent counts (fuse_dentry_revalidate()) go to the mountpoint inode as usual. This will result in more than one FORGET in most cases, but that's okay. > Would you like me to put together a v3 that heads this direction? That would be great, thanks. Miklos
Hi Miklos, On Wed, Oct 11, 2023 at 08:27:34PM +0200, Miklos Szeredi wrote: > On Wed, 11 Oct 2023 at 18:32, Krister Johansen <kjlx@templeofstupid.com> wrote: > > > > On Wed, Oct 11, 2023 at 09:07:33AM +0200, Miklos Szeredi wrote: > > > On Wed, 11 Oct 2023 at 03:26, Krister Johansen <kjlx@templeofstupid.com> wrote: > > > > > > > I am curious what you have in mind in order to move this towards a > > > > proper fix? I shied away from the approach of stealing a nlookup from > > > > mp_fi beacuse it wasn't clear that I could always count on the nlookup > > > > in the parent staying positive. E.g. I was afraid I was either going to > > > > not have enough nlookups to move to submounts, or trigger a forget from > > > > an exiting container that leads to an EBADF from the initial mount > > > > namespace. > > > > > > One idea is to transfer the nlookup to a separately refcounted object > > > that is referenced from mp_fi as well as all the submounts. > > > > That seems possible. Would the idea be to move all tracking of nlookup > > to a separate refcounted object for the particular nodeid, or just do > > this for the first lookup of a submount? > > Just for submounts. And yes, it should work if the count from the > first lookup is transferred to this object (fuse_iget()) and > subsequent counts (fuse_dentry_revalidate()) go to the mountpoint > inode as usual. This will result in more than one FORGET in most > cases, but that's okay. > > > Would you like me to put together a v3 that heads this direction? > > That would be great, thanks. Thanks for the pointers here. I started over and followed the approach that you suggested. It condensed to a single patch, so I'll send it as a follow-up to this thread. -K
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 5e01946d7531..333730c74619 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -183,11 +183,11 @@ static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args, args->out_args[0].value = outarg; } -static int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, - struct dentry *entry, - struct inode *inode, - struct fuse_entry_out *outarg, - bool *lookedup) +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, + struct dentry *entry, + struct inode *inode, + struct fuse_entry_out *outarg, + bool *lookedup) { struct dentry *parent; struct fuse_forget_link *forget; diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 405252bb51f2..a66fcf50a4cc 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1325,6 +1325,12 @@ void fuse_dax_dontcache(struct inode *inode, unsigned int flags); bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment); void fuse_dax_cancel_work(struct fuse_conn *fc); +/* dir.c */ +int fuse_dentry_revalidate_lookup(struct fuse_mount *fm, struct dentry *entry, + struct inode *inode, + struct fuse_entry_out *outarg, + bool *lookedup); + /* ioctl.c */ long fuse_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 444418e240c8..79a31cb55512 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1464,7 +1464,13 @@ static int fuse_fill_super_submount(struct super_block *sb, struct fuse_mount *fm = get_fuse_mount_super(sb); struct super_block *parent_sb = parent_fi->inode.i_sb; struct fuse_attr root_attr; + struct fuse_inode *fi; struct inode *root; + struct inode *parent; + struct dentry *pdent; + struct fuse_entry_out outarg; + bool lookedup = false; + int ret; fuse_sb_defaults(sb); fm->sb = sb; @@ -1480,14 +1486,39 @@ static int fuse_fill_super_submount(struct super_block *sb, if (parent_sb->s_subtype && !sb->s_subtype) return -ENOMEM; - fuse_fill_attr_from_inode(&root_attr, parent_fi); - root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0); /* - * This inode is just a duplicate, so it is not looked up and - * its nlookup should not be incremented. fuse_iget() does - * that, though, so undo it here. + * It is necessary to lookup the parent_if->nodeid in case the dentry + * that triggered the automount of the submount is later evicted. + * If this dentry is evicted without the lookup count getting increased + * on the submount root, then the server can subsequently forget this + * nodeid which leads to errors when trying to access the root of the + * submount. */ - get_fuse_inode(root)->nlookup--; + parent = &parent_fi->inode; + pdent = d_find_alias(parent); + if (!pdent) + return -EINVAL; + + ret = fuse_dentry_revalidate_lookup(fm, pdent, parent, &outarg, + &lookedup); + dput(pdent); + /* + * The new root owns this nlookup on success, and it is incremented by + * fuse_iget(). In the case the lookup succeeded but revalidate fails, + * ensure that the lookup count is tracked by the parent. + */ + if (ret <= 0) { + if (lookedup) { + fi = get_fuse_inode(parent); + spin_lock(&fi->lock); + fi->nlookup++; + spin_unlock(&fi->lock); + } + return ret ? ret : -EINVAL; + } + + fuse_fill_attr_from_inode(&root_attr, parent_fi); + root = fuse_iget(sb, parent_fi->nodeid, 0, &root_attr, 0, 0); sb->s_d_op = &fuse_dentry_operations; sb->s_root = d_make_root(root); if (!sb->s_root)
The submount code uses the parent nodeid passed into the function in order to create the root dentry for the new submount. This nodeid does not get its remote reference count incremented by a lookup option. If the parent inode is evicted from its superblock, due to memory pressure for example, it can result in a forget opertation being sent to the server. Should this nodeid be forgotten while it is still in use in a submount, users of the submount get an error from the server on any subsequent access. In the author's case, this was an EBADF on all subsequent operations that needed to reference the root. Debugging the problem revealed that the dentry shrinker triggered a forget after killing the dentry with the last reference, despite the root dentry in another superblock still using the nodeid. As a result, a container that was also using this submount failed to access its filesystem because it had borrowed the reference instead of taking its own when setting up its superblock for the submount. This commit fixes the problem by having the new submount trigger a lookup for the parent as part of creating a new root dentry for the virtiofsd submount superblock. This allows each superblock to have its inodes removed by the shrinker when unreferenced, while keeping the nodeid reference count accurate and active with the server. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> --- fs/fuse/dir.c | 10 +++++----- fs/fuse/fuse_i.h | 6 ++++++ fs/fuse/inode.c | 43 +++++++++++++++++++++++++++++++++++++------ 3 files changed, 48 insertions(+), 11 deletions(-)