Message ID | 162322868275.361452.17585267026652222121.stgit@web.messagingengine.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | kernfs: proposed locking and concurrency improvement | expand |
On Wed, 9 Jun 2021 at 10:52, Ian Kent <raven@themaw.net> wrote: > > The inode operations .permission() and .getattr() use the kernfs node > write lock but all that's needed is to keep the rb tree stable while > updating the inode attributes as well as protecting the update itself > against concurrent changes. > > And .permission() is called frequently during path walks and can cause > quite a bit of contention between kernfs node operations and path > walks when the number of concurrent walks is high. > > To change kernfs_iop_getattr() and kernfs_iop_permission() to take > the rw sem read lock instead of the write lock an additional lock is > needed to protect against multiple processes concurrently updating > the inode attributes and link count in kernfs_refresh_inode(). > > The inode i_lock seems like the sensible thing to use to protect these > inode attribute updates so use it in kernfs_refresh_inode(). > > The last hunk in the patch, applied to kernfs_fill_super(), is possibly > not needed but taking the lock was present originally and I prefer to > continue to take it so the rb tree is held stable during the call to > kernfs_refresh_inode() made by kernfs_get_inode(). > > Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote: > The inode operations .permission() and .getattr() use the kernfs node > write lock but all that's needed is to keep the rb tree stable while > updating the inode attributes as well as protecting the update itself > against concurrent changes. Huh? Where does it access the rbtree at all? Confused... > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > index 3b01e9e61f14e..6728ecd81eb37 100644 > --- a/fs/kernfs/inode.c > +++ b/fs/kernfs/inode.c > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode) > { > struct kernfs_iattrs *attrs = kn->iattr; > > + spin_lock(&inode->i_lock); > inode->i_mode = kn->mode; > if (attrs) > /* > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode) > > if (kernfs_type(kn) == KERNFS_DIR) > set_nlink(inode, kn->dir.subdirs + 2); > + spin_unlock(&inode->i_lock); > } Even more so - just what are you serializing here? That code synchronizes inode metadata with those in kernfs_node. Suppose you've got two threads doing ->permission(); the first one gets through kernfs_refresh_inode() and goes into generic_permission(). No locks are held, so kernfs_refresh_inode() from another thread can run in parallel with generic_permission(). If that's not a problem, why two kernfs_refresh_inode() done in parallel would be a problem? Thread 1: permission done refresh, all locks released now Thread 2: change metadata in kernfs_node Thread 2: permission goes into refresh, copying metadata into inode Thread 1: generic_permission() No locks in common between the last two operations, so we generic_permission() might see partially updated metadata. Either we don't give a fuck (in which case I don't understand what purpose does that ->i_lock serve) *or* we need the exclusion to cover a wider area.
On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote: > On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote: > > The inode operations .permission() and .getattr() use the kernfs > > node > > write lock but all that's needed is to keep the rb tree stable > > while > > updating the inode attributes as well as protecting the update > > itself > > against concurrent changes. > > Huh? Where does it access the rbtree at all? Confused... That description's wrong, I'll fix that. > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > > index 3b01e9e61f14e..6728ecd81eb37 100644 > > --- a/fs/kernfs/inode.c > > +++ b/fs/kernfs/inode.c > > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct > > kernfs_node *kn, struct inode *inode) > > { > > struct kernfs_iattrs *attrs = kn->iattr; > > > > + spin_lock(&inode->i_lock); > > inode->i_mode = kn->mode; > > if (attrs) > > /* > > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct > > kernfs_node *kn, struct inode *inode) > > > > if (kernfs_type(kn) == KERNFS_DIR) > > set_nlink(inode, kn->dir.subdirs + 2); > > + spin_unlock(&inode->i_lock); > > } > > Even more so - just what are you serializing here? That code > synchronizes inode > metadata with those in kernfs_node. Suppose you've got two threads > doing > ->permission(); the first one gets through kernfs_refresh_inode() and > goes into > generic_permission(). No locks are held, so kernfs_refresh_inode() > from another > thread can run in parallel with generic_permission(). > > If that's not a problem, why two kernfs_refresh_inode() done in > parallel would > be a problem? > > Thread 1: > permission > done refresh, all locks released now > Thread 2: > change metadata in kernfs_node > Thread 2: > permission > goes into refresh, copying metadata into inode > Thread 1: > generic_permission() > No locks in common between the last two operations, so > we generic_permission() might see partially updated metadata. > Either we don't give a fuck (in which case I don't understand > what purpose does that ->i_lock serve) *or* we need the exclusion > to cover a wider area.
On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote: > On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote: > > The inode operations .permission() and .getattr() use the kernfs > > node > > write lock but all that's needed is to keep the rb tree stable > > while > > updating the inode attributes as well as protecting the update > > itself > > against concurrent changes. > > Huh? Where does it access the rbtree at all? Confused... > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > > index 3b01e9e61f14e..6728ecd81eb37 100644 > > --- a/fs/kernfs/inode.c > > +++ b/fs/kernfs/inode.c > > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct > > kernfs_node *kn, struct inode *inode) > > { > > struct kernfs_iattrs *attrs = kn->iattr; > > > > + spin_lock(&inode->i_lock); > > inode->i_mode = kn->mode; > > if (attrs) > > /* > > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct > > kernfs_node *kn, struct inode *inode) > > > > if (kernfs_type(kn) == KERNFS_DIR) > > set_nlink(inode, kn->dir.subdirs + 2); > > + spin_unlock(&inode->i_lock); > > } > > Even more so - just what are you serializing here? That code > synchronizes inode > metadata with those in kernfs_node. Suppose you've got two threads > doing > ->permission(); the first one gets through kernfs_refresh_inode() and > goes into > generic_permission(). No locks are held, so kernfs_refresh_inode() > from another > thread can run in parallel with generic_permission(). > > If that's not a problem, why two kernfs_refresh_inode() done in > parallel would > be a problem? > > Thread 1: > permission > done refresh, all locks released now > Thread 2: > change metadata in kernfs_node > Thread 2: > permission > goes into refresh, copying metadata into inode > Thread 1: > generic_permission() > No locks in common between the last two operations, so > we generic_permission() might see partially updated metadata. > Either we don't give a fuck (in which case I don't understand > what purpose does that ->i_lock serve) *or* we need the exclusion > to cover a wider area. This didn't occur to me, obviously. It seems to me this can happen with the original code too although using a mutex might reduce the likelihood of it happening. Still ->permission() is meant to be a read-only function so the VFS shouldn't need to care about it. Do you have any suggestions on how to handle this. Perhaps the only way is to ensure the inode is updated only in functions that are expected to do this. Ian
On Mon, 2021-06-14 at 09:32 +0800, Ian Kent wrote: > On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote: > > On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote: > > > The inode operations .permission() and .getattr() use the kernfs > > > node > > > write lock but all that's needed is to keep the rb tree stable > > > while > > > updating the inode attributes as well as protecting the update > > > itself > > > against concurrent changes. > > > > Huh? Where does it access the rbtree at all? Confused... > > > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > > > index 3b01e9e61f14e..6728ecd81eb37 100644 > > > --- a/fs/kernfs/inode.c > > > +++ b/fs/kernfs/inode.c > > > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct > > > kernfs_node *kn, struct inode *inode) > > > { > > > struct kernfs_iattrs *attrs = kn->iattr; > > > > > > + spin_lock(&inode->i_lock); > > > inode->i_mode = kn->mode; > > > if (attrs) > > > /* > > > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct > > > kernfs_node *kn, struct inode *inode) > > > > > > if (kernfs_type(kn) == KERNFS_DIR) > > > set_nlink(inode, kn->dir.subdirs + 2); > > > + spin_unlock(&inode->i_lock); > > > } > > > > Even more so - just what are you serializing here? That code > > synchronizes inode > > metadata with those in kernfs_node. Suppose you've got two threads > > doing > > ->permission(); the first one gets through kernfs_refresh_inode() > > and > > goes into > > generic_permission(). No locks are held, so kernfs_refresh_inode() > > from another > > thread can run in parallel with generic_permission(). > > > > If that's not a problem, why two kernfs_refresh_inode() done in > > parallel would > > be a problem? > > > > Thread 1: > > permission > > done refresh, all locks released now > > Thread 2: > > change metadata in kernfs_node > > Thread 2: > > permission > > goes into refresh, copying metadata into inode > > Thread 1: > > generic_permission() > > No locks in common between the last two operations, so > > we generic_permission() might see partially updated metadata. > > Either we don't give a fuck (in which case I don't understand > > what purpose does that ->i_lock serve) *or* we need the exclusion > > to cover a wider area. > > This didn't occur to me, obviously. > > It seems to me this can happen with the original code too although > using a mutex might reduce the likelihood of it happening. > > Still ->permission() is meant to be a read-only function so the VFS > shouldn't need to care about it. > > Do you have any suggestions on how to handle this. > Perhaps the only way is to ensure the inode is updated only in > functions that are expected to do this. IIRC Greg and Tejun weren't averse to adding a field to the struct kernfs_iattrs, but there were concerns about increasing memory usage. Because of this I think the best way to handle this would be to broaden the scope of the i_lock to cover the generic calls in kernfs_iop_getattr() and kernfs_iop_permission(). The only other call to kernfs_refresh_inode() is at inode initialization and then only for I_NEW inodes so that should be ok. Also both generic_permission() and generic_fillattr() are reading from the inode so not likely to need to take the i_lock any time soon (is this a reasonable assumption Al?). Do you think this is a sensible way to go Al? Ian
On Mon, 2021-06-14 at 14:52 +0800, Ian Kent wrote: > On Mon, 2021-06-14 at 09:32 +0800, Ian Kent wrote: > > On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote: > > > On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote: > > > > The inode operations .permission() and .getattr() use the > > > > kernfs > > > > node > > > > write lock but all that's needed is to keep the rb tree stable > > > > while > > > > updating the inode attributes as well as protecting the update > > > > itself > > > > against concurrent changes. > > > > > > Huh? Where does it access the rbtree at all? Confused... > > > > > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > > > > index 3b01e9e61f14e..6728ecd81eb37 100644 > > > > --- a/fs/kernfs/inode.c > > > > +++ b/fs/kernfs/inode.c > > > > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct > > > > kernfs_node *kn, struct inode *inode) > > > > { > > > > struct kernfs_iattrs *attrs = kn->iattr; > > > > > > > > + spin_lock(&inode->i_lock); > > > > inode->i_mode = kn->mode; > > > > if (attrs) > > > > /* > > > > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct > > > > kernfs_node *kn, struct inode *inode) > > > > > > > > if (kernfs_type(kn) == KERNFS_DIR) > > > > set_nlink(inode, kn->dir.subdirs + 2); > > > > + spin_unlock(&inode->i_lock); > > > > } > > > > > > Even more so - just what are you serializing here? That code > > > synchronizes inode > > > metadata with those in kernfs_node. Suppose you've got two > > > threads > > > doing > > > ->permission(); the first one gets through kernfs_refresh_inode() > > > and > > > goes into > > > generic_permission(). No locks are held, so > > > kernfs_refresh_inode() > > > from another > > > thread can run in parallel with generic_permission(). > > > > > > If that's not a problem, why two kernfs_refresh_inode() done in > > > parallel would > > > be a problem? > > > > > > Thread 1: > > > permission > > > done refresh, all locks released now > > > Thread 2: > > > change metadata in kernfs_node > > > Thread 2: > > > permission > > > goes into refresh, copying metadata into inode > > > Thread 1: > > > generic_permission() > > > No locks in common between the last two operations, so > > > we generic_permission() might see partially updated metadata. > > > Either we don't give a fuck (in which case I don't understand > > > what purpose does that ->i_lock serve) *or* we need the exclusion > > > to cover a wider area. > > > > This didn't occur to me, obviously. > > > > It seems to me this can happen with the original code too although > > using a mutex might reduce the likelihood of it happening. > > > > Still ->permission() is meant to be a read-only function so the VFS > > shouldn't need to care about it. > > > > Do you have any suggestions on how to handle this. > > Perhaps the only way is to ensure the inode is updated only in > > functions that are expected to do this. > > IIRC Greg and Tejun weren't averse to adding a field to the > struct kernfs_iattrs, but there were concerns about increasing > memory usage. > > Because of this I think the best way to handle this would be to > broaden the scope of the i_lock to cover the generic calls in > kernfs_iop_getattr() and kernfs_iop_permission(). The only other > call to kernfs_refresh_inode() is at inode initialization and > then only for I_NEW inodes so that should be ok. Also both > generic_permission() and generic_fillattr() are reading from the > inode so not likely to need to take the i_lock any time soon (is > this a reasonable assumption Al?). > > Do you think this is a sensible way to go Al? Unless of course we don't care about taking a lock here at all, Greg, Tejun? Ian
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c index 3b01e9e61f14e..6728ecd81eb37 100644 --- a/fs/kernfs/inode.c +++ b/fs/kernfs/inode.c @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode) { struct kernfs_iattrs *attrs = kn->iattr; + spin_lock(&inode->i_lock); inode->i_mode = kn->mode; if (attrs) /* @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode) if (kernfs_type(kn) == KERNFS_DIR) set_nlink(inode, kn->dir.subdirs + 2); + spin_unlock(&inode->i_lock); } int kernfs_iop_getattr(struct user_namespace *mnt_userns, @@ -191,9 +193,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns, struct inode *inode = d_inode(path->dentry); struct kernfs_node *kn = inode->i_private; - down_write(&kernfs_rwsem); + down_read(&kernfs_rwsem); kernfs_refresh_inode(kn, inode); - up_write(&kernfs_rwsem); + up_read(&kernfs_rwsem); generic_fillattr(&init_user_ns, inode, stat); return 0; @@ -284,9 +286,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns, kn = inode->i_private; - down_write(&kernfs_rwsem); + down_read(&kernfs_rwsem); kernfs_refresh_inode(kn, inode); - up_write(&kernfs_rwsem); + up_read(&kernfs_rwsem); return generic_permission(&init_user_ns, inode, mask); } diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c index baa4155ba2edf..f2f909d09f522 100644 --- a/fs/kernfs/mount.c +++ b/fs/kernfs/mount.c @@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k sb->s_shrink.seeks = 0; /* get root inode, initialize and unlock it */ - down_write(&kernfs_rwsem); + down_read(&kernfs_rwsem); inode = kernfs_get_inode(sb, info->root->kn); - up_write(&kernfs_rwsem); + up_read(&kernfs_rwsem); if (!inode) { pr_debug("kernfs: could not get root inode\n"); return -ENOMEM;
The inode operations .permission() and .getattr() use the kernfs node write lock but all that's needed is to keep the rb tree stable while updating the inode attributes as well as protecting the update itself against concurrent changes. And .permission() is called frequently during path walks and can cause quite a bit of contention between kernfs node operations and path walks when the number of concurrent walks is high. To change kernfs_iop_getattr() and kernfs_iop_permission() to take the rw sem read lock instead of the write lock an additional lock is needed to protect against multiple processes concurrently updating the inode attributes and link count in kernfs_refresh_inode(). The inode i_lock seems like the sensible thing to use to protect these inode attribute updates so use it in kernfs_refresh_inode(). The last hunk in the patch, applied to kernfs_fill_super(), is possibly not needed but taking the lock was present originally and I prefer to continue to take it so the rb tree is held stable during the call to kernfs_refresh_inode() made by kernfs_get_inode(). Signed-off-by: Ian Kent <raven@themaw.net> --- fs/kernfs/inode.c | 10 ++++++---- fs/kernfs/mount.c | 4 ++-- 2 files changed, 8 insertions(+), 6 deletions(-)