[00/11,RFC] Allow concurrent changes in a directory

Message ID 20241220030830.272429-1-neilb@suse.de

NeilBrown Dec. 20, 2024, 2:54 a.m. UTC
A while ago I posted a patchset with a similar goal as this:

https://lore.kernel.org/all/166147828344.25420.13834885828450967910.stgit@noble.brown/

and received useful feedback.  Here is a new version.

This version is not complete.  It does not change rename and does not
change any filesystem to make use of the new opportunity for
parallelism.  I'll work on those once the base functionality is agreed
on.

With this series, instead of a filesystem setting a flag to indicate
that parallel updates are supported, there is now a new set of inode
operations with a _shared suffix.  If a directory provides a _shared
interface it will be used with a shared lock on the inode, else the
current interface will be used with an exclusive lock.

When a shared lock is taken, we also take an exclusive lock on the
dentry using a couple of flag bits and the "wait_var_event"
infrastructure.

When an exclusive lock is needed (for the old interface) we still take a
shared lock on the directory i_rw_sem but also take a second exclusive
lock on the directory using a couple of flag bits and wait_var_event.
This is meant as a temporary measure until all filesystems are changed
to use the _shared interfaces.

Not all calling code has been converted.  Some callers outside of
fs/namei.c still take an exclusive lock with i_rw_sem.  Some might never
be changed.

As yet this has only been lightly tested, as I haven't added foo_shared
operations to any filesystem yet.

The motivation comes partly from NFS, where a high-latency network link
can cause a noticeable performance hit when multiple files are being
created concurrently in a directory.  Another motivation is lustre,
which can use a modified ext4 as the storage backend.  One of the
current modifications is to allow concurrent updates in a directory,
as lustre uses a flat directory structure to store data.

Thoughts?

Thanks,
NeilBrown


 [PATCH 01/11] VFS: introduce vfs_mkdir_return()
 [PATCH 02/11] VFS: add _shared versions of the various directory
 [PATCH 03/11] VFS: use global wait-queue table for d_alloc_parallel()
 [PATCH 04/11] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
 [PATCH 05/11] VFS: change kern_path_locked() and
 [PATCH 06/11] VFS: introduce done_lookup_and_lock()
 [PATCH 07/11] VFS: introduce lookup_and_lock()
 [PATCH 08/11] VFS: add inode_dir_lock/unlock
 [PATCH 09/11] VFS: re-pack DENTRY_ flags.
 [PATCH 10/11] VFS: take a shared lock for create/remove directory
 [PATCH 11/11] nfsd: use lookup_and_lock_one()

Comments

Andreas Dilger Dec. 20, 2024, 8:55 p.m. UTC | #1
On Dec 19, 2024, at 7:54 PM, NeilBrown <neilb@suse.de> wrote:
> 
> A while ago I posted a patchset with a similar goal as this:
> 
> https://lore.kernel.org/all/166147828344.25420.13834885828450967910.stgit@noble.brown/
> 
> and received useful feedback.  Here is a new version.
> 
> This version is not complete.  It does not change rename and does not
> change any filesystem to make use of the new opportunity for
> parallelism.  I'll work on those once the base functionality is agreed
> on.
> 
> With this series, instead of a filesystem setting a flag to indicate
> that parallel updates are supported, there is now a new set of inode
> operations with a _shared suffix.  If a directory provides a _shared
> interface it will be used with a shared lock on the inode, else the
> current interface will be used with an exclusive lock.

Hi Neil, thanks for the patch.  One minor nit for the next revision
of the cover letter:

> Another motivation is lustre,
> which can use a modified ext4 as the storage backend.  One of the
> current modifications is to allow concurrent updates in a directory,
> as lustre uses a flat directory structure to store data.

This isn't really correct.  Lustre uses a directory tree for the
namespace, but directories might become very large in some cases
with 1M+ cores working in a single directory (hey, I don't write
the applications, I just need to deal with them).  The servers will
only have 500-2000 threads working on a single directory, but the
fine-grained locking on the servers is definitely a big win.

Being able to have parallel locking on the client VFS side would
also be a win, given that large nodes commonly have 192 or 256
cores/threads today.  We know parallel directory locking will be
a win because mounting the filesystem multiple times on a single
client (which the VFS treats as multiple separate filesystems)
and running a multi-threaded benchmark in each mount in parallel
is considerably faster than running the same number of threads in
a single mountpoint.

Cheers, Andreas