
fs: Add a new flag RWF_IOWAIT for preadv2(2)

Message ID 20240804080251.21239-1-laoar.shao@gmail.com (mailing list archive)
State New
Series fs: Add a new flag RWF_IOWAIT for preadv2(2)

Commit Message

Yafang Shao Aug. 4, 2024, 8:02 a.m. UTC
Background
==========

Our big data workloads are deployed on XFS-based disks, and we frequently
encounter hung tasks caused by xfs_ilock. These hung tasks arise because
different applications may access the same files concurrently. For example,
while a datanode task is writing to a file, a filebeat[0] task might be
reading the same file concurrently. If the task writing to the file takes a
long time, the task reading the file will hang due to contention on the XFS
inode lock.

This inode lock contention between writing and reading files occurs only on
XFS, not on other file systems such as EXT4. Dave provided a clear
explanation of why it occurs only on XFS[1]:

  : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
  : Atomic means that all the bytes from a single operation that started
  : out together end up together, without interleaving from other I/O
  : operations. [2]
  : XFS is the only linux filesystem that provides this behaviour.

As we have been running big data on XFS for years, we don't want to switch
to other file systems like EXT4. Therefore, we plan to resolve these issues
within XFS.

Proposal
========

One solution we're currently exploring is leveraging the preadv2(2)
syscall. With the RWF_NOWAIT flag, preadv2(2) does not block on the XFS
inode lock, so the reading task no longer hangs. This can be illustrated
as follows:

  retry:
      if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
          sleep(n);
          goto retry;
      }

Since the tasks reading these files are not critical, a delay in reading
is acceptable. However, RWF_NOWAIT enables not only IOCB_NOWAIT but also
IOCB_NOIO. Therefore, if the data is not in the page cache, the retry
loop above will spin indefinitely until someone else happens to read it
from disk, which is not acceptable.

So we're planning to introduce a new flag, RWF_IOWAIT, for preadv2(2). This
flag will allow reading from the disk if the data is not in the page cache,
but will not allow waiting for the inode lock if it is held by others. With
this new flag, we can resolve our issues effectively.
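
To illustrate, here is a minimal userspace sketch of how a reader might
use the proposed flag. This is only a sketch of the intended semantics:
the helper name and back-off interval are arbitrary, and RWF_IOWAIT is
defined by hand because it exists only in this patch, not in any
released uapi header.

  #define _GNU_SOURCE
  #include <errno.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #ifndef RWF_IOWAIT
  #define RWF_IOWAIT 0x00000080    /* value proposed by this patch */
  #endif

  static ssize_t read_without_lock_wait(int fd, struct iovec *iov, int cnt,
                                        off_t off)
  {
      for (;;) {
          /* May issue disk IO, but never sleeps on the XFS inode lock. */
          ssize_t ret = preadv2(fd, iov, cnt, off, RWF_IOWAIT);

          if (ret >= 0 || errno != EAGAIN)
              return ret;
          /* Lock held by a writer: the read is not critical, back off. */
          usleep(100 * 1000);
      }
  }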

Link: https://github.com/elastic/beats/tree/master/filebeat [0]
Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [1]
Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
---
 include/linux/fs.h      | 6 ++++++
 include/uapi/linux/fs.h | 5 ++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

Comments

Jan Kara Aug. 5, 2024, 1:40 p.m. UTC | #1
On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> Background
> ==========
> 
> Our big data workloads are deployed on XFS-based disks, and we frequently
> encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> different applications may access the same files concurrently. For example,
> while a datanode task is writing to a file, a filebeat[0] task might be
> reading the same file concurrently. If the task writing to the file takes a
> long time, the task reading the file will hang due to contention on the XFS
> inode lock.
> 
> This inode lock contention between writing and reading files only occurs on
> XFS, but not on other file systems such as EXT4. Dave provided a clear
> explanation for why this occurs only on XFS[1]:
> 
>   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
>   : Atomic means that all the bytes from a single operation that started
>   : out together end up together, without interleaving from other I/O
>   : operations. [2]
>   : XFS is the only linux filesystem that provides this behaviour.
> 
> As we have been running big data on XFS for years, we don't want to switch
> to other file systems like EXT4. Therefore, we plan to resolve these issues
> within XFS.
> 
> Proposal
> ========
> 
> One solution we're currently exploring is leveraging the preadv2(2)
> syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> lock hung task. This can be illustrated as follows:
> 
>   retry:
>       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
>           sleep(n)
>           goto retry;
>       }
> 
> Since the tasks reading the same files are not critical tasks, a delay in
> reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> will loop indefinitely until someone else reads it from disk, which is not
> acceptable.
> 
> So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This
> flag will allow reading from the disk if the file is not in the page cache
> but will not allow waiting for the lock if it is held by others. With this
> new flag, we can resolve our issues effectively.
> 
> Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0]
> Link: https://github.com/elastic/beats/tree/master/filebeat [1]
> Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Dave Chinner <david@fromorbit.com>

Thanks for the detailed explanation! I understand your problem but I have to
say I find this flag to be a hack to work around particular XFS behavior, and
the guarantees the new RWF_IOWAIT flag should provide are not very clear to
me. I've CCed Amir, who's been dealing with similar issues with XFS at his
employer and had some patches as far as I remember.

What you could possibly do to read the file contents without blocking on
xfs_iolock is to mmap the file and grab the data from the mapping. It is
still hacky but at least we don't have to pollute the kernel with an IO
flag with unclear semantics.
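
A rough sketch of that mmap-based approach could look like the
following (hypothetical helper, minimal error handling; the caller must
keep the requested range within the file size to avoid SIGBUS, and the
page faults still perform the disk IO):

  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static ssize_t read_via_mmap(int fd, void *buf, size_t len, off_t off)
  {
      long pagesize = sysconf(_SC_PAGESIZE);
      off_t aligned = off & ~((off_t)pagesize - 1);
      size_t delta = off - aligned;
      void *map = mmap(NULL, len + delta, PROT_READ, MAP_PRIVATE, fd, aligned);

      if (map == MAP_FAILED)
          return -1;
      /* Copy from the mapping; pages are faulted in from disk on demand. */
      memcpy(buf, (char *)map + delta, len);
      munmap(map, len + delta);
      return (ssize_t)len;
  }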

								Honza

> ---
>  include/linux/fs.h      | 6 ++++++
>  include/uapi/linux/fs.h | 5 ++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..5df7b5b0927a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3472,6 +3472,12 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
>  			return -EPERM;
>  		ki->ki_flags &= ~IOCB_APPEND;
>  	}
> +	if (flags & RWF_IOWAIT) {
> +		kiocb_flags |= IOCB_NOWAIT;
> +		/* IOCB_NOIO is not allowed for RWF_IOWAIT */
> +		if (kiocb_flags & IOCB_NOIO)
> +			return -EINVAL;
> +	}
>  
>  	ki->ki_flags |= kiocb_flags;
>  	return 0;
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 191a7e88a8ab..17a8c065d636 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -332,9 +332,12 @@ typedef int __bitwise __kernel_rwf_t;
>  /* Atomic Write */
>  #define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000040)
>  
> +/* per-IO, allow waiting for IO, but not waiting for lock */
> +#define RWF_IOWAIT	((__force __kernel_rwf_t)0x00000080)
> +
>  /* mask of flags supported by the kernel */
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> -			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
> +			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC | RWF_IOWAIT)
>  
>  /* Pagemap ioctl */
>  #define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
> -- 
> 2.43.5
>
Christian Brauner Aug. 5, 2024, 2:07 p.m. UTC | #2
> Thanks for the detailed explanation! I understand your problem but I have to
> say I find this flag like a hack to workaround particular XFS behavior and
> the guarantees the new RWF_IOWAIT flag should provide are not very clear to
> me. I've CCed Amir who's been dealing with similar issues with XFS at his

I agree. I tried to make sense of what exactly this flag was supposed
to guarantee and it's not really clear.
Dave Chinner Aug. 6, 2024, 5:47 a.m. UTC | #3
On Sun, Aug 04, 2024 at 04:02:51PM +0800, Yafang Shao wrote:
> Background
> ==========
> 
> Our big data workloads are deployed on XFS-based disks, and we frequently
> encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> different applications may access the same files concurrently. For example,
> while a datanode task is writing to a file, a filebeat[0] task might be
> reading the same file concurrently. If the task writing to the file takes a
> long time, the task reading the file will hang due to contention on the XFS
> inode lock.
>
> This inode lock contention between writing and reading files only occurs on
> XFS, but not on other file systems such as EXT4. Dave provided a clear
> explanation for why this occurs only on XFS[1]:
> 
>   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
>   : Atomic means that all the bytes from a single operation that started
>   : out together end up together, without interleaving from other I/O
>   : operations. [2]
>   : XFS is the only linux filesystem that provides this behaviour.
> 
> As we have been running big data on XFS for years, we don't want to switch
> to other file systems like EXT4. Therefore, we plan to resolve these issues
> within XFS.

I've been looking at range locks again in the past few days because,
once again, the need for range locking to allow exclusive range
based operations to take place whilst concurrent IO is occurring has
arisen. We need to be able to clone, unshare, punch holes, exchange
extents, etc without interrupting ongoing IO to the same file.

This is just another one of the cases where range locking will solve
the problems you are having without giving up the atomic write vs
read behaviour posix asks us to provide...

> Proposal
> ========
> 
> One solution we're currently exploring is leveraging the preadv2(2)
> syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> lock hung task. This can be illustrated as follows:
> 
>   retry:
>       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
>           sleep(n)
>           goto retry;
>       }

Hmmm.

> Since the tasks reading the same files are not critical tasks, a delay in
> reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> will loop indefinitely until someone else reads it from disk, which is not
> acceptable.
> 
> So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This
> flag will allow reading from the disk if the file is not in the page cache
> but will not allow waiting for the lock if it is held by others. With this
> new flag, we can resolve our issues effectively.
> 
> Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0]
> Link: https://github.com/elastic/beats/tree/master/filebeat [1]
> Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Dave Chinner <david@fromorbit.com>
> ---
>  include/linux/fs.h      | 6 ++++++
>  include/uapi/linux/fs.h | 5 ++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..5df7b5b0927a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3472,6 +3472,12 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
>  			return -EPERM;
>  		ki->ki_flags &= ~IOCB_APPEND;
>  	}
> +	if (flags & RWF_IOWAIT) {
> +		kiocb_flags |= IOCB_NOWAIT;
> +		/* IOCB_NOIO is not allowed for RWF_IOWAIT */
> +		if (kiocb_flags & IOCB_NOIO)
> +			return -EINVAL;
> +	}

I'm not sure that this will be considered an acceptable workaround
for what is largely considered by most Linux filesystem developers
an anachronistic filesystem behaviour. I don't really want people to
work around this XFS behaviour, either - what I'd like to see is
more people putting effort into trying to solve the range locking
problem...

-Dave.
Yafang Shao Aug. 6, 2024, 11:44 a.m. UTC | #4
On Tue, Aug 6, 2024 at 1:47 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Aug 04, 2024 at 04:02:51PM +0800, Yafang Shao wrote:
> > Background
> > ==========
> >
> > Our big data workloads are deployed on XFS-based disks, and we frequently
> > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > different applications may access the same files concurrently. For example,
> > while a datanode task is writing to a file, a filebeat[0] task might be
> > reading the same file concurrently. If the task writing to the file takes a
> > long time, the task reading the file will hang due to contention on the XFS
> > inode lock.
> >
> > This inode lock contention between writing and reading files only occurs on
> > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > explanation for why this occurs only on XFS[1]:
> >
> >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> >   : Atomic means that all the bytes from a single operation that started
> >   : out together end up together, without interleaving from other I/O
> >   : operations. [2]
> >   : XFS is the only linux filesystem that provides this behaviour.
> >
> > As we have been running big data on XFS for years, we don't want to switch
> > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > within XFS.
>
> I've been looking at range locks again in the past few days because,
> once again, the need for range locking to allow exclusive range
> based operations to take place whilst concurrent IO is occurring has
> arisen. We need to be able to clone, unshare, punch holes, exchange
> extents, etc without interrupting ongoing IO to the same file.
>
> This is just another one of the cases where range locking will solve
> the problems you are having without giving up the atomic write vs
> read behaviour posix asks us to provide...

We noticed you mentioned that the issue could be resolved with range
locking, but it's unclear when that will be completed. It would be
helpful if you could send patches for it; we can assist with testing.
Additionally, aside from encountering xfs_ilock in the read() syscall,
we have also experienced it in the sendfile() syscall. Currently, our
only solution for sendfile() is to modify the userspace code to avoid
using sendfile() when other threads are writing to the file...

>
> > Proposal
> > ========
> >
> > One solution we're currently exploring is leveraging the preadv2(2)
> > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > lock hung task. This can be illustrated as follows:
> >
> >   retry:
> >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> >           sleep(n)
> >           goto retry;
> >       }
>
> Hmmm.
>
> > Since the tasks reading the same files are not critical tasks, a delay in
> > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > will loop indefinitely until someone else reads it from disk, which is not
> > acceptable.
> >
> > So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This
> > flag will allow reading from the disk if the file is not in the page cache
> > but will not allow waiting for the lock if it is held by others. With this
> > new flag, we can resolve our issues effectively.
> >
> > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0]
> > Link: https://github.com/elastic/beats/tree/master/filebeat [1]
> > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > ---
> >  include/linux/fs.h      | 6 ++++++
> >  include/uapi/linux/fs.h | 5 ++++-
> >  2 files changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index fd34b5755c0b..5df7b5b0927a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3472,6 +3472,12 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
> >                       return -EPERM;
> >               ki->ki_flags &= ~IOCB_APPEND;
> >       }
> > +     if (flags & RWF_IOWAIT) {
> > +             kiocb_flags |= IOCB_NOWAIT;
> > +             /* IOCB_NOIO is not allowed for RWF_IOWAIT */
> > +             if (kiocb_flags & IOCB_NOIO)
> > +                     return -EINVAL;
> > +     }
>
> I'm not sure that this will be considered an acceptible workaround
> for what is largely considered by most Linux filesystem developers
> an anchronistic filesystem behaviour.

Considering that it is required by POSIX, it may not be deemed an
anachronistic filesystem behavior.

> I don't really want people to
> work around this XFS behaviour, either - waht I'd like to see is
> more people putting effort into trying to solve the range locking
> problem...

I'm not an expert on XFS, but I'm willing to assist by testing this
new feature for you.

--
Regards
Yafang
Yafang Shao Aug. 6, 2024, 11:54 a.m. UTC | #5
On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > Background
> > ==========
> >
> > Our big data workloads are deployed on XFS-based disks, and we frequently
> > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > different applications may access the same files concurrently. For example,
> > while a datanode task is writing to a file, a filebeat[0] task might be
> > reading the same file concurrently. If the task writing to the file takes a
> > long time, the task reading the file will hang due to contention on the XFS
> > inode lock.
> >
> > This inode lock contention between writing and reading files only occurs on
> > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > explanation for why this occurs only on XFS[1]:
> >
> >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> >   : Atomic means that all the bytes from a single operation that started
> >   : out together end up together, without interleaving from other I/O
> >   : operations. [2]
> >   : XFS is the only linux filesystem that provides this behaviour.
> >
> > As we have been running big data on XFS for years, we don't want to switch
> > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > within XFS.
> >
> > Proposal
> > ========
> >
> > One solution we're currently exploring is leveraging the preadv2(2)
> > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > lock hung task. This can be illustrated as follows:
> >
> >   retry:
> >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> >           sleep(n)
> >           goto retry;
> >       }
> >
> > Since the tasks reading the same files are not critical tasks, a delay in
> > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > will loop indefinitely until someone else reads it from disk, which is not
> > acceptable.
> >
> > So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This
> > flag will allow reading from the disk if the file is not in the page cache
> > but will not allow waiting for the lock if it is held by others. With this
> > new flag, we can resolve our issues effectively.
> >
> > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0]
> > Link: https://github.com/elastic/beats/tree/master/filebeat [1]
> > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Dave Chinner <david@fromorbit.com>
>
> Thanks for the detailed explanation! I understand your problem but I have to
> say I find this flag like a hack to workaround particular XFS behavior and
> the guarantees the new RWF_IOWAIT flag should provide are not very clear to
> me.

Its guarantee is clear:

  : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
  : Atomic means that all the bytes from a single operation that started
  : out together end up together, without interleaving from other I/O
  : operations.

What this flag does is avoid waiting for this type of lock if it
exists. Maybe we should consider a more descriptive name like
RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
challenging.

Since this behavior is required by POSIX, it shouldn't be viewed as an
XFS-specific behavior. Other filesystems might adopt this rule in the
future as well.

> I've CCed Amir who's been dealing with similar issues with XFS at his
> employer and had some patches as far as I remember.
>
> What you could possibly do to read the file contents without blocking on
> xfs_iolock is to mmap the file and grab the data from the mapping. It is
> still hacky but at least we don't have to pollute the kernel with an IO
> flag with unclear semantics.

The file size to be read is not fixed, which is why we prefer to use
the traditional read API rather than mmap. We have implemented a
hotfix version of this commit on many of our production servers, and
it works well as expected. While I agree that mmap() is another viable
option, we may consider switching to it in the future if this new flag
introduces any issues.
Jan Kara Aug. 6, 2024, 1:24 p.m. UTC | #6
On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@suse.cz> wrote:
> > On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > > Background
> > > ==========
> > >
> > > Our big data workloads are deployed on XFS-based disks, and we frequently
> > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > > different applications may access the same files concurrently. For example,
> > > while a datanode task is writing to a file, a filebeat[0] task might be
> > > reading the same file concurrently. If the task writing to the file takes a
> > > long time, the task reading the file will hang due to contention on the XFS
> > > inode lock.
> > >
> > > This inode lock contention between writing and reading files only occurs on
> > > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > > explanation for why this occurs only on XFS[1]:
> > >
> > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > >   : Atomic means that all the bytes from a single operation that started
> > >   : out together end up together, without interleaving from other I/O
> > >   : operations. [2]
> > >   : XFS is the only linux filesystem that provides this behaviour.
> > >
> > > As we have been running big data on XFS for years, we don't want to switch
> > > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > > within XFS.
> > >
> > > Proposal
> > > ========
> > >
> > > One solution we're currently exploring is leveraging the preadv2(2)
> > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > > lock hung task. This can be illustrated as follows:
> > >
> > >   retry:
> > >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> > >           sleep(n)
> > >           goto retry;
> > >       }
> > >
> > > Since the tasks reading the same files are not critical tasks, a delay in
> > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > > will loop indefinitely until someone else reads it from disk, which is not
> > > acceptable.
> > >
> > > So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This
> > > flag will allow reading from the disk if the file is not in the page cache
> > > but will not allow waiting for the lock if it is held by others. With this
> > > new flag, we can resolve our issues effectively.
> > >
> > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0]
> > > Link: https://github.com/elastic/beats/tree/master/filebeat [1]
> > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> >
> > Thanks for the detailed explanation! I understand your problem but I have to
> > say I find this flag like a hack to workaround particular XFS behavior and
> > the guarantees the new RWF_IOWAIT flag should provide are not very clear to
> > me.
> 
> Its guarantee is clear:
> 
>   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
>   : Atomic means that all the bytes from a single operation that started
>   : out together end up together, without interleaving from other I/O
>   : operations.

Oh, I understand why XFS does locking this way and I'm well aware this is
a requirement in POSIX. However, as you have experienced, it has a
significant performance cost for certain workloads (at least with simple
locking protocol we have now) and history shows users rather want the extra
performance at the cost of being a bit more careful in userspace. So I
don't see any filesystem switching to XFS behavior until we have a
performant range locking primitive.

> What this flag does is avoid waiting for this type of lock if it
> exists. Maybe we should consider a more descriptive name like
> RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> challenging.

Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
exclusion". I'm still not convinced the flag is a great idea but
RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.

								Honza
Yafang Shao Aug. 6, 2024, 2:05 p.m. UTC | #7
On Tue, Aug 6, 2024 at 9:24 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@suse.cz> wrote:
> > > On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > Our big data workloads are deployed on XFS-based disks, and we frequently
> > > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > > > different applications may access the same files concurrently. For example,
> > > > while a datanode task is writing to a file, a filebeat[0] task might be
> > > > reading the same file concurrently. If the task writing to the file takes a
> > > > long time, the task reading the file will hang due to contention on the XFS
> > > > inode lock.
> > > >
> > > > This inode lock contention between writing and reading files only occurs on
> > > > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > > > explanation for why this occurs only on XFS[1]:
> > > >
> > > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > >   : Atomic means that all the bytes from a single operation that started
> > > >   : out together end up together, without interleaving from other I/O
> > > >   : operations. [2]
> > > >   : XFS is the only linux filesystem that provides this behaviour.
> > > >
> > > > As we have been running big data on XFS for years, we don't want to switch
> > > > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > > > within XFS.
> > > >
> > > > Proposal
> > > > ========
> > > >
> > > > One solution we're currently exploring is leveraging the preadv2(2)
> > > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > > > lock hung task. This can be illustrated as follows:
> > > >
> > > >   retry:
> > > >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> > > >           sleep(n)
> > > >           goto retry;
> > > >       }
> > > >
> > > > Since the tasks reading the same files are not critical tasks, a delay in
> > > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > > > will loop indefinitely until someone else reads it from disk, which is not
> > > > acceptable.
> > > >
> > > > So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This
> > > > flag will allow reading from the disk if the file is not in the page cache
> > > > but will not allow waiting for the lock if it is held by others. With this
> > > > new flag, we can resolve our issues effectively.
> > > >
> > > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0]
> > > > Link: https://github.com/elastic/beats/tree/master/filebeat [1]
> > > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > >
> > > Thanks for the detailed explanation! I understand your problem but I have to
> > > say I find this flag like a hack to workaround particular XFS behavior and
> > > the guarantees the new RWF_IOWAIT flag should provide are not very clear to
> > > me.
> >
> > Its guarantee is clear:
> >
> >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> >   : Atomic means that all the bytes from a single operation that started
> >   : out together end up together, without interleaving from other I/O
> >   : operations.
>
> Oh, I understand why XFS does locking this way and I'm well aware this is
> a requirement in POSIX. However, as you have experienced, it has a
> significant performance cost for certain workloads (at least with simple
> locking protocol we have now) and history shows users rather want the extra
> performance at the cost of being a bit more careful in userspace. So I
> don't see any filesystem switching to XFS behavior until we have a
> performant range locking primitive.
>
> > What this flag does is avoid waiting for this type of lock if it
> > exists. Maybe we should consider a more descriptive name like
> > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > challenging.
>
> Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> exclusion". I'm still not convinced the flag is a great idea but
> RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.

That's better. Should we proceed with implementing this new flag? It
provides users with an option to avoid this type of issue.
Christian Brauner Aug. 6, 2024, 2:57 p.m. UTC | #8
On Tue, Aug 06, 2024 at 03:24:32PM GMT, Jan Kara wrote:
> On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@suse.cz> wrote:
> > > On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > Our big data workloads are deployed on XFS-based disks, and we frequently
> > > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > > > different applications may access the same files concurrently. For example,
> > > > while a datanode task is writing to a file, a filebeat[0] task might be
> > > > reading the same file concurrently. If the task writing to the file takes a
> > > > long time, the task reading the file will hang due to contention on the XFS
> > > > inode lock.
> > > >
> > > > This inode lock contention between writing and reading files only occurs on
> > > > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > > > explanation for why this occurs only on XFS[1]:
> > > >
> > > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > >   : Atomic means that all the bytes from a single operation that started
> > > >   : out together end up together, without interleaving from other I/O
> > > >   : operations. [2]
> > > >   : XFS is the only linux filesystem that provides this behaviour.
> > > >
> > > > As we have been running big data on XFS for years, we don't want to switch
> > > > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > > > within XFS.
> > > >
> > > > Proposal
> > > > ========
> > > >
> > > > One solution we're currently exploring is leveraging the preadv2(2)
> > > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > > > lock hung task. This can be illustrated as follows:
> > > >
> > > >   retry:
> > > >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> > > >           sleep(n)
> > > >           goto retry;
> > > >       }
> > > >
> > > > Since the tasks reading the same files are not critical tasks, a delay in
> > > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > > > will loop indefinitely until someone else reads it from disk, which is not
> > > > acceptable.
> > > >
> > > > So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This
> > > > flag will allow reading from the disk if the file is not in the page cache
> > > > but will not allow waiting for the lock if it is held by others. With this
> > > > new flag, we can resolve our issues effectively.
> > > >
> > > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0]
> > > > Link: https://github.com/elastic/beats/tree/master/filebeat [1]
> > > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > >
> > > Thanks for the detailed explanation! I understand your problem but I have to
> > > say I find this flag like a hack to workaround particular XFS behavior and
> > > the guarantees the new RWF_IOWAIT flag should provide are not very clear to
> > > me.
> > 
> > Its guarantee is clear:
> > 
> >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> >   : Atomic means that all the bytes from a single operation that started
> >   : out together end up together, without interleaving from other I/O
> >   : operations.
> 
> Oh, I understand why XFS does locking this way and I'm well aware this is
> a requirement in POSIX. However, as you have experienced, it has a
> significant performance cost for certain workloads (at least with simple
> locking protocol we have now) and history shows users rather want the extra
> performance at the cost of being a bit more careful in userspace. So I
> don't see any filesystem switching to XFS behavior until we have a
> performant range locking primitive.
> 
> > What this flag does is avoid waiting for this type of lock if it
> > exists. Maybe we should consider a more descriptive name like
> > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > challenging.
> 
> Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> exclusion". I'm still not convinced the flag is a great idea but
> RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.

I have to say that I find this extremely hard to swallow because it is so
clearly specific to an individual filesystem. If we're doing this hack I
would like an Ack from at least both Jan and Dave.
Matthew Wilcox Aug. 6, 2024, 3:08 p.m. UTC | #9
On Sun, Aug 04, 2024 at 04:02:51PM +0800, Yafang Shao wrote:
> One solution we're currently exploring is leveraging the preadv2(2)
> syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> lock hung task. This can be illustrated as follows:
> 
>   retry:
>       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
>           sleep(n)
>           goto retry;

But that's not how you're supposed to use RWF_NOWAIT!  You're supposed
to try it _once_ in the thread that can't block, then hand the I/O off
to another thread which _can_ block.  Because that single thread is the
one which does all the blocking I/O, there's no lock contention.
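
A minimal sketch of that hand-off pattern, for illustration only (the
queueing helper is a stand-in for application-specific plumbing to a
single dedicated blocking-IO thread, not an existing API):

  #define _GNU_SOURCE
  #include <errno.h>
  #include <sys/uio.h>

  /* Hypothetical: enqueue the request for the dedicated blocking-IO thread. */
  extern int queue_to_blocking_reader(int fd, struct iovec *iov, int cnt,
                                      off_t off);

  /*
   * Returns bytes read if served without blocking, -1/EINPROGRESS if the
   * request was handed off to the worker, or -1 with another errno on error.
   */
  ssize_t read_fast_path(int fd, struct iovec *iov, int cnt, off_t off)
  {
      /* Exactly one non-blocking attempt from the latency-sensitive thread. */
      ssize_t ret = preadv2(fd, iov, cnt, off, RWF_NOWAIT);

      if (ret >= 0 || errno != EAGAIN)
          return ret;

      /* Would block: hand the IO to the one thread that is allowed to sleep. */
      if (queue_to_blocking_reader(fd, iov, cnt, off))
          return -1;
      errno = EINPROGRESS;    /* result will be delivered by the worker */
      return -1;
  }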

So this is a kernel workaround for bad application design, and should
be rejected.
Dave Chinner Aug. 6, 2024, 9:52 p.m. UTC | #10
On Tue, Aug 06, 2024 at 10:05:50PM +0800, Yafang Shao wrote:
> On Tue, Aug 6, 2024 at 9:24 PM Jan Kara <jack@suse.cz> wrote:
> > On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > > Its guarantee is clear:
> > >
> > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > >   : Atomic means that all the bytes from a single operation that started
> > >   : out together end up together, without interleaving from other I/O
> > >   : operations.
> >
> > Oh, I understand why XFS does locking this way and I'm well aware this is
> > a requirement in POSIX. However, as you have experienced, it has a
> > significant performance cost for certain workloads (at least with simple
> > locking protocol we have now) and history shows users rather want the extra
> > performance at the cost of being a bit more careful in userspace. So I
> > don't see any filesystem switching to XFS behavior until we have a
> > performant range locking primitive.
> >
> > > What this flag does is avoid waiting for this type of lock if it
> > > exists. Maybe we should consider a more descriptive name like
> > > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > > challenging.
> >
> > Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> > exclusion". I'm still not convinced the flag is a great idea but
> > RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.
> 
> That's better. Should we proceed with implementing this new flag? It
> provides users with an option to avoid this type of issue.

No. If we are going to add a flag like that, the fix to XFS isn't to
use IOCB_NOWAIT on reads, it's to use shared locking for buffered
writes just like we do for direct IO.

IOWs, this flag would be needed on -writes-, not reads, and at that
point we may as well just change XFS to do shared buffered writes
for -everyone- so it is consistent with all other Linux filesystems.

Indeed, last time Amir brought this up, I suggested that shared
buffered write locking in XFS was the simplest way forward. Given
that we use large folios now, small IOs get mapped to a single folio
and so will still have the same write vs overlapping write exclusion
behaviour most all the time.

However, since then we've moved to using shared IO locking for
cloning files. A clone does not modify data, so read IO is allowed
during the clone. If we move writes to use shared locking, this
breaks file cloning. We would have to move cloning back to using
exclusive locking, and that's going to cause performance and IO
latency regressions for applications using clones with concurrent IO
(e.g. VM image snapshots in cloud infrastructure).

Hence the only viable solution to all these different competing "we
need exclusive access to a range of the file whilst allowing other
concurrent IO" issues is to move to range locking for IO
exclusion....

-Dave.
Yafang Shao Aug. 7, 2024, 2:29 a.m. UTC | #11
On Tue, Aug 6, 2024 at 11:08 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Aug 04, 2024 at 04:02:51PM +0800, Yafang Shao wrote:
> > One solution we're currently exploring is leveraging the preadv2(2)
> > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > lock hung task. This can be illustrated as follows:
> >
> >   retry:
> >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> >           sleep(n)
> >           goto retry;
>
> But that's not how you're supposed to use RWF_NOWAIT!  You're supposed
> to try it _once_ in the thread that can't block, then hand the I/O off
> to another thread which _can_ block.  Because that single thread is the
> one which does all the blocking I/O, there's no lock contention.
>
> So this is a kernel workaround for bad application design, and should
> be rejected.

They are different applications, but not different threads within a
single application.

A simple example:

  $ ps -eLo pid,comm
  1 systemd
  2 tail -f /var/log/messages.

In this case, tail is reading /var/log/messages while systemd is
writing to it. Are you suggesting we should forbid `tail -f` on Linux
servers? If you had taken the time to understand what filebeat is
doing, you wouldn't have reached this arbitrary conclusion.
Yafang Shao Aug. 7, 2024, 3:01 a.m. UTC | #12
On Wed, Aug 7, 2024 at 5:52 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Aug 06, 2024 at 10:05:50PM +0800, Yafang Shao wrote:
> > On Tue, Aug 6, 2024 at 9:24 PM Jan Kara <jack@suse.cz> wrote:
> > > On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > > > Its guarantee is clear:
> > > >
> > > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > >   : Atomic means that all the bytes from a single operation that started
> > > >   : out together end up together, without interleaving from other I/O
> > > >   : operations.
> > >
> > > Oh, I understand why XFS does locking this way and I'm well aware this is
> > > a requirement in POSIX. However, as you have experienced, it has a
> > > significant performance cost for certain workloads (at least with simple
> > > locking protocol we have now) and history shows users rather want the extra
> > > performance at the cost of being a bit more careful in userspace. So I
> > > don't see any filesystem switching to XFS behavior until we have a
> > > performant range locking primitive.
> > >
> > > > What this flag does is avoid waiting for this type of lock if it
> > > > exists. Maybe we should consider a more descriptive name like
> > > > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > > > challenging.
> > >
> > > Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> > > exclusion". I'm still not convinced the flag is a great idea but
> > > RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.
> >
> > That's better. Should we proceed with implementing this new flag? It
> > provides users with an option to avoid this type of issue.
>
> No. If we are going to add a flag like that, the fix to XFS isn't to
> use IOCB_NOWAIT on reads, it's to use shared locking for buffered
> writes just like we do for direct IO.
>
> IOWs, this flag would be needed on -writes-, not reads, and at that
> point we may as well just change XFS to do shared buffered writes
> for -everyone- so it is consistent with all other Linux filesystems.
>
> Indeed, last time Amir brought this up, I suggested that shared
> buffered write locking in XFS was the simplest way forward. Given
> that we use large folios now, small IOs get mapped to a single folio
> and so will still have the same write vs overlapping write exclusion
> behaviour most all the time.
>
> However, since then we've moved to using shared IO locking for
> cloning files. A clone does not modify data, so read IO is allowed
> during the clone. If we move writes to use shared locking, this
> breaks file cloning. We would have to move cloning back to to using
> exclusive locking, and that's going to cause performance and IO
> latency regressions for applications using clones with concurrent IO
> (e.g. VM image snapshots in cloud infrastruction).
>
> Hence the only viable solution to all these different competing "we
> need exclusive access to a range of the file whilst allowing other
> concurrent IO" issues is to move to range locking for IO
> exclusion....

The initial post you mentioned about range locking dates back to 2019,
five years ago. Now, five years have passed, and nothing has happened.

In 2029, five years later, someone else might encounter this issue
again, and the response will be the same: "let's try range locking."

And then another five years will pass...

So, "range locking == Do nothing." I'm not saying it's your
responsibility to implement range locking, but it seems no one else is
capable of implementing this complex feature except you.

RWF_NOWAIT was initially introduced for AIO in commit b745fafaf70c
("fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT") with a clear
definition that it shouldn't "block while allocating requests while
performing direct I/O."
It was then extended to buffered IO in commit 91f9943e1c7b ("fs:
support RWF_NOWAIT for buffered reads"), where IOCB_NOIO was not set,
meaning it would perform read IO if the data was not in the page cache.
Readahead support was added for this flag in commit 2e85abf053b9 ("mm:
allow read-ahead with IOCB_NOWAIT set"). However, this behavior changed
in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO"),
without a clear use case, simply stating that RWF_NOWAIT has "semantics
of only doing cached reads." If that change broke the original
RWF_NOWAIT semantics, why not introduce a new flag for the new
semantics, in which non-cached reads are allowed?
Dave Chinner Aug. 8, 2024, 2:51 a.m. UTC | #13
On Wed, Aug 07, 2024 at 11:01:36AM +0800, Yafang Shao wrote:
> On Wed, Aug 7, 2024 at 5:52 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Aug 06, 2024 at 10:05:50PM +0800, Yafang Shao wrote:
> > > On Tue, Aug 6, 2024 at 9:24 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > > > > Its guarantee is clear:
> > > > >
> > > > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > > >   : Atomic means that all the bytes from a single operation that started
> > > > >   : out together end up together, without interleaving from other I/O
> > > > >   : operations.
> > > >
> > > > Oh, I understand why XFS does locking this way and I'm well aware this is
> > > > a requirement in POSIX. However, as you have experienced, it has a
> > > > significant performance cost for certain workloads (at least with simple
> > > > locking protocol we have now) and history shows users rather want the extra
> > > > performance at the cost of being a bit more careful in userspace. So I
> > > > don't see any filesystem switching to XFS behavior until we have a
> > > > performant range locking primitive.
> > > >
> > > > > What this flag does is avoid waiting for this type of lock if it
> > > > > exists. Maybe we should consider a more descriptive name like
> > > > > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > > > > challenging.
> > > >
> > > > Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> > > > exclusion". I'm still not convinced the flag is a great idea but
> > > > RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.
> > >
> > > That's better. Should we proceed with implementing this new flag? It
> > > provides users with an option to avoid this type of issue.
> >
> > No. If we are going to add a flag like that, the fix to XFS isn't to
> > use IOCB_NOWAIT on reads, it's to use shared locking for buffered
> > writes just like we do for direct IO.
> >
> > IOWs, this flag would be needed on -writes-, not reads, and at that
> > point we may as well just change XFS to do shared buffered writes
> > for -everyone- so it is consistent with all other Linux filesystems.
> >
> > Indeed, last time Amir brought this up, I suggested that shared
> > buffered write locking in XFS was the simplest way forward. Given
> > that we use large folios now, small IOs get mapped to a single folio
> > and so will still have the same write vs overlapping write exclusion
> > behaviour most all the time.
> >
> > However, since then we've moved to using shared IO locking for
> > cloning files. A clone does not modify data, so read IO is allowed
> > during the clone. If we move writes to use shared locking, this
> > breaks file cloning. We would have to move cloning back to to using
> > exclusive locking, and that's going to cause performance and IO
> > latency regressions for applications using clones with concurrent IO
> > (e.g. VM image snapshots in cloud infrastruction).
> >
> > Hence the only viable solution to all these different competing "we
> > need exclusive access to a range of the file whilst allowing other
> > concurrent IO" issues is to move to range locking for IO
> > exclusion....
> 
> The initial post you mentioned about range locking dates back to 2019,
> five years ago. Now, five years have passed, and nothing has happened.
> 
> In 2029, five years later, someone else might encounter this issue
> again, and the response will be the same: "let's try range locking."
> 
> And then another five years will pass...
> 
> So, "range locking == Do nothing."

How long do you think it would take you to understand the entire
serialisation model for a complex subsystem, understand where it is
deficient and then design a novel, scalable serialisation technique
that addresses that problem with only limited IO performance
regressions?

Some context:

It took me 6 years to work out how to do delayed logging in XFS once
I first learnt of the idea back in 2004. It took me 4 -from scratch-
design and implementation efforts before I found a solution that
didn't have a subtle, fatal architectural issue in it. Once I solved
the deadlock issues in early 2010, it took about 4 months from first
code to being merged in 2.6.35.

It took me 7 years to go from my initial "self describing metadata"
idea (2006) to actually having it implemented and fully merged
in 2013. I described reverse mapping at the same time, and that took
another couple of years to realise.

It took me close to 10 years and 6 or 7 separate attempts to solve
the "XFS needs blocking RMW IO in the inode reclaim shrinker to
prevent OOM" problem. This caused memory allocation latency problems
for many production environments over the course of a decade, and
the problem was around that long because it took me an awful long
time to work out how to pin inode cluster buffers in memory without
deadlocking inode modification or inode cluster writeback.

These sorts of complex problems are *hard to solve* and it often
takes several attempts that fail to learn what -doesn't work- before
a successful architecture, design and implementation is realised.

Often, there is no prior art out there that we can use to help solve
the problem. Sure, for range locking there's heaps of academic
papers out there about scaling concurrency in database key
operations (SIX, MGL, ARIES/KVL, multi-dimension key-range lock
separation, intention locking, speculative lock inheritance,
lightweight intent locks, etc), but none of the underlying
algorithms have any significant relevance to the problem we need to
solve.

There are also papers out there about how to scale concurrent btree
modifications. Latching, lock coupling, optimistic lock coupling,
etc. The problem with these papers is that they often gloss over or
ignore important details such as how they deal with node contention,
how concurrent unlocked traversal and modification to the same node
are safely handled (i.e. NX concurrency algorithms), how a top-down
traversal algorithm that doesn't guarantee temporal path stability
is used for bottom up key update propagation (OLC-based algorithms),
etc...

They also tend to focus on huge static data sets where concurrent
random operations are guaranteed to have unique paths and minimal
contention. Hence the researchers are able to demonstrate how much
better their new algorithm scales than the previous state of the
art.  However, they rarely demonstrate how the algorithm scales
down, and that's something we really care about for range locks. A
couple of the scalable range indexing algorithms I prototyped simply
didn't work for small data sets - they performed far worse than just
using a single tree-wide mutex.

Hence we are in a situation where I can't find an algorithm in
existing computer science literature that will work for our problem
case.  Hence we need to come up with a novel algorithm that solves
the problem ourselves. This is an iterative process where we learn
by failing and then understanding why what we did failed.

Davidlohr Bueso attempted to solve mmap_sem issues with an interval
tree based range lock. From that implementation, I learnt that the
cost of per-range cacheline misses walking the rb-tree under the
tree-wide spin lock was the major performance limitation of that
range lock.

I took that observation, and attempted to adapt that code to a btree
(based on the XFS iext btree). That performed a little better, but
removing the cacheline miss penalty from the search algorithm simply
moved the bottleneck to the spin lock. i.e. I learnt that we can't
use a global scope spin lock for protecting the range lock tree -
the total number of range lock/unlock operations is bound entirely
by how many we can process on a single CPU because they are all done
under a single spinlock.

Further, per-inode IO concurrency is unbound, but a single
inode-wide serialisation primitive will not scale beyond 3-4 CPUs
doing IO at the same time. This taught me that per-IO use of
per-inode exclusive cachelines in the IO path needs to be avoided if
at all possible.

I made another attempt a year or so later after doing a lot of
reading about scalable btree structures. The second attempt expanded
the original btree I used with an OLC based architecture. I called
in an RCU-pathwalk btree because it used RCU and sequence numbers in
a very similar way to the dentry cache RCU pathwalk algorithm. This
provided range locks with a high concurrency range tracking
structure.

In some situations it scaled out to millions of lock/unlock
operations per second (far exceeding perf/scalability requirements),
but in others performance was miserable and 10x slower than
plain exclusive locking. IOWs, another failure.

Again, I learnt a lot about what not to do from that attempt. I've
spent a good deal of time in the past two years reading through
decades of database key range locking optimisation papers in an
effort to understand how I might address the issues I came across.
I have also had time to understand why several implementation issues
existed and how to avoid/mitigate them in the new design I'm
currently working on.

So, yeah, I haven't written any rangelock code in the past couple of
years, but:

$ wc -l Documentation/filesystems/xfs-IO-rangelocks.rst
749 Documentation/filesystems/xfs-IO-rangelocks.rst
$

I've been writing up a design doc as I've been doing this analysis
and research to document the problems and the solutions. I think I'm
close to the point where I can start implementing this new
design.

Clearly, I've been busy doing a lot of nothing on rangelocks, thank
you very much.

> I'm not saying it's your
> responsibility to implement range locking, but it seems no one else is
> capable of implementing this complex feature except you.

*cough*

There's lots of people better qualified than me to solve a problem
like this. It's a computer science problem, and I'm not a computer
scientist. I'm an engineer - I take stuff that scientists have
discovered and documented and build tools, infrastructure and
objects based on that knowledge.

What I'm not good at is coming up with novel new algorithms to solve
mathematical problems. A range lock is the latter, not the former,
and there are plenty of people who would be better suited to this
work than me.  i.e. I'm good at putting knowledge to practical use,
not creating new knowledge.

However, I'm spending time on it because nobody else is going to
solve the problem for me.  CPU and IO concurrency is only going up
and shared/exclusive IO locking behaviour only gets more restrictive
as concurrency requirements go up and we use extent sharing to
avoid data copying more extensively. This problem isn't going
away...

-Dave.
Yafang Shao Aug. 8, 2024, 1:16 p.m. UTC | #14
On Thu, Aug 8, 2024 at 10:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Aug 07, 2024 at 11:01:36AM +0800, Yafang Shao wrote:
> > On Wed, Aug 7, 2024 at 5:52 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, Aug 06, 2024 at 10:05:50PM +0800, Yafang Shao wrote:
> > > > On Tue, Aug 6, 2024 at 9:24 PM Jan Kara <jack@suse.cz> wrote:
> > > > > On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > > > > > Its guarantee is clear:
> > > > > >
> > > > > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > > > >   : Atomic means that all the bytes from a single operation that started
> > > > > >   : out together end up together, without interleaving from other I/O
> > > > > >   : operations.
> > > > >
> > > > > Oh, I understand why XFS does locking this way and I'm well aware this is
> > > > > a requirement in POSIX. However, as you have experienced, it has a
> > > > > significant performance cost for certain workloads (at least with simple
> > > > > locking protocol we have now) and history shows users rather want the extra
> > > > > performance at the cost of being a bit more careful in userspace. So I
> > > > > don't see any filesystem switching to XFS behavior until we have a
> > > > > performant range locking primitive.
> > > > >
> > > > > > What this flag does is avoid waiting for this type of lock if it
> > > > > > exists. Maybe we should consider a more descriptive name like
> > > > > > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > > > > > challenging.
> > > > >
> > > > > Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> > > > > exclusion". I'm still not convinced the flag is a great idea but
> > > > > RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.
> > > >
> > > > That's better. Should we proceed with implementing this new flag? It
> > > > provides users with an option to avoid this type of issue.
> > >
> > > No. If we are going to add a flag like that, the fix to XFS isn't to
> > > use IOCB_NOWAIT on reads, it's to use shared locking for buffered
> > > writes just like we do for direct IO.
> > >
> > > IOWs, this flag would be needed on -writes-, not reads, and at that
> > > point we may as well just change XFS to do shared buffered writes
> > > for -everyone- so it is consistent with all other Linux filesystems.
> > >
> > > Indeed, last time Amir brought this up, I suggested that shared
> > > buffered write locking in XFS was the simplest way forward. Given
> > > that we use large folios now, small IOs get mapped to a single folio
> > > and so will still have the same write vs overlapping write exclusion
> > > behaviour almost all the time.
> > >
> > > However, since then we've moved to using shared IO locking for
> > > cloning files. A clone does not modify data, so read IO is allowed
> > > during the clone. If we move writes to use shared locking, this
> > > breaks file cloning. We would have to move cloning back to using
> > > exclusive locking, and that's going to cause performance and IO
> > > latency regressions for applications using clones with concurrent IO
> > > (e.g. VM image snapshots in cloud infrastructure).
> > >
> > > Hence the only viable solution to all these different competing "we
> > > need exclusive access to a range of the file whilst allowing other
> > > concurrent IO" issues is to move to range locking for IO
> > > exclusion....
> >
> > The initial post you mentioned about range locking dates back to 2019,
> > five years ago. Now, five years have passed, and nothing has happened.
> >
> > In 2029, five years later, someone else might encounter this issue
> > again, and the response will be the same: "let's try range locking."
> >
> > And then another five years will pass...
> >
> > So, "range locking == Do nothing."
>
> How long do you think it would take you to understand the entire
> serialisation model for a complex subsystem, understand where it is
> deficient and then design a novel, scalable serialisation technique
> that addresses that problem with only limited IO performance
> regressions?
>
> Some context:
>
> It took me 6 years to work out how to do delayed logging in XFS once
> I first learnt of the idea back in 2004. It took me 4 -from scratch-
> design and implementation efforts before I found a solution that
> didn't have a subtle, fatal architectural issue in it. Once I solved
> the deadlock issues in early 2010, it took about 4 months from first
> code to being merged in 2.6.35.
>
> It took me 7 years to go from my initial "self describing metadata"
> idea (2006) ideas to actually having it implemented and fully merged
> in 2013. I described reverse mapping at the same time, and that took
> another couple of years to realise.
>
> It took me close to 10 years and 6 or 7 separate attempts to solve
> the "XFS needs blocking RMW IO in the inode reclaim shrinker to
> prevent OOM" problem. This caused memory allocation latency problems
> for many production environments over the course of a decade, and
> the problem was around that long because it took me an awful long
> time to work out how to pin inode cluster buffers in memory without
> deadlocking inode modification or inode cluster writeback.
>
> These sorts of complex problems are *hard to solve* and it often
> takes several attempts that fail to learn what -doesn't work- before
> a successful architecture, design and implementation is realised.

Thank you for all your hard work and contributions to XFS. The entire
XFS community has greatly benefited from your efforts.

>
> Often, there is no prior art out there that we can use to help solve
> the problem. Sure, for range locking there's heaps of academic
> papers out there about scaling concurrency in database key
> operations (SIX, MGL, ARIES/KVL, multi-dimension key-range lock
> separation, intention locking, speculative lock inheritance,
> lightweight intent locks, etc), but none of the underlying
> algorithms have any significant relevance to the problem we need to
> solve.
>
> There are also papers out there about how to scale concurrent btree
> modifications. Latching, lock coupling, optimistic lock coupling,
> etc. The problem with these papers is that they often gloss over or
> ignore important details such as how they deal with node contention,
> how concurrent unlocked traversal and modification to the same node
> are safely handled (i.e. NX concurrency algorithms), how a top-down
> traversal algorithm that doesn't guarantee temporal path stability
> is used for bottom up key update propagation (OLC-based algorithms),
> etc...
>
> They also tend to focus on huge static data sets where concurrent
> random operations are guaranteed to have unique paths and minimal
> contention. Hence the researchers are able to demonstrate how much
> better their new algorithm scales than the previous state of the
> art.  However, they rarely demonstrate how the algorithm scales
> down, and that's something we really care about for range locks. A
> couple of the scalable range indexing algorithms I prototyped simply
> didn't work for small data sets - they performed far worse than just
> using a single tree-wide mutex.
>
> Hence we are in a situation where I can't find an algorithm in
> existing computer science literature that will work for our problem
> case.  Hence we need to come up with a novel algorithm that solves
> the problem ourselves. This is an iterative process where we learn
> by failing and then understanding why what we did failed.
>
> Davidlohr Bueso attempted to solve mmap_sem issues with an interval
> tree based range lock. From that implementation, I learnt that the
> cost of per-range cacheline misses walking the rb-tree under the
> tree-wide spin lock was the major performance limitation of that
> range lock.
>
> I took that observation, and attempted to adapt that code to a btree
> (based on the XFS iext btree). That performed a little better, but
> removing the cacheline miss penalty from the search algorithm simply
> moved the bottleneck to the spin lock. i.e. I learnt that we can't
> use a global scope spin lock for protecting the range lock tree -
> the total number of range lock/unlock operations is bound entirely
> by how many we can process on a single CPU because they are all done
> under a single spinlock.
>
> Further, per-inode IO concurrency is unbound, but a single
> inode-wide serialisation primitive will not scale beyond 3-4 CPUs
> doing IO at the same time. This taught me IO path usage of per-IO,
> per-inode exclusive cachelines needs to be avoided if at all
> possible.
>
> I made another attempt a year or so later after doing a lot of
> reading about scalable btree structures. The second attempt expanded
> the original btree I used with an OLC-based architecture. I called
> it an RCU-pathwalk btree because it used RCU and sequence numbers in
> a very similar way to the dentry cache RCU pathwalk algorithm. This
> provided range locks with a high concurrency range tracking
> structure.
>
> In some situations it scaled out to millions of lock/unlock
> operations per second (far exceeding perf/scalability requirements),
> but in others performance was miserable and 10x slower than
> plain exclusive locking. IOWs, another failure.
>
> Again, I learnt a lot about what not to do from that attempt. I've
> spent a good deal of time in the past two years reading through
> decades of database key range locking optimisation papers in an
> effort to understand how I might address the issues I came across.
> I have also had time to understand why several implementation issues
> existed and how to avoid/mitigate them in the new design I'm
> currently working on.

Thank you for the detailed explanation of all the effort that has
gone into this. If the range locking work is eventually accepted
upstream, this background will help others understand why it is
designed the way it is. I appreciate the challenges you’re facing,
even though I may not fully grasp all the technical details.
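
As a side note for readers trying to follow the range locking
discussion: a minimal userspace sketch of the "shared range structure
behind one global lock" pattern described above might look like the
toy below. It is plain C with pthreads, purely illustrative, and
nothing like the actual XFS prototypes; the names (held_ranges,
tree_lock) are made up. The point is only that every
range_lock()/range_unlock() has to pass through the single tree_lock,
so total throughput is capped by one CPU no matter how fast the
overlap lookup itself is:

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdlib.h>

  /* One held range [start, end), kept in an unsorted singly linked list. */
  struct range {
      unsigned long start, end;
      struct range *next;
  };

  static struct range *held_ranges;   /* stand-in for the rb-tree/btree */
  static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t tree_wait = PTHREAD_COND_INITIALIZER;

  static bool overlaps(unsigned long start, unsigned long end)
  {
      for (struct range *r = held_ranges; r; r = r->next)
          if (start < r->end && r->start < end)
              return true;
      return false;
  }

  /* Every lock takes the one global lock, waits out overlaps, then inserts. */
  void range_lock(unsigned long start, unsigned long end)
  {
      struct range *r = malloc(sizeof(*r));

      if (!r)
          abort();                 /* error handling omitted for brevity */
      pthread_mutex_lock(&tree_lock);
      while (overlaps(start, end))
          pthread_cond_wait(&tree_wait, &tree_lock);
      r->start = start;
      r->end = end;
      r->next = held_ranges;
      held_ranges = r;
      pthread_mutex_unlock(&tree_lock);
  }

  /* Every unlock takes the same global lock again and wakes all waiters. */
  void range_unlock(unsigned long start, unsigned long end)
  {
      pthread_mutex_lock(&tree_lock);
      for (struct range **p = &held_ranges; *p; p = &(*p)->next) {
          if ((*p)->start == start && (*p)->end == end) {
              struct range *r = *p;
              *p = r->next;
              free(r);
              break;
          }
      }
      pthread_cond_broadcast(&tree_wait);
      pthread_mutex_unlock(&tree_lock);
  }

Swapping the list for an rb-tree or btree only makes overlaps()
cheaper; it does not remove the single serialisation point, which is
the failure mode described above.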

>
> So, yeah, I haven't written any rangelock code in the past couple of
> years, but:
>
> $ wc -l Documentation/filesystems/xfs-IO-rangelocks.rst
> 749 Documentation/filesystems/xfs-IO-rangelocks.rst
> $

I couldn’t find any information about it online. It would be really
helpful if you could share your current progress and the roadmap
somewhere. This could help others better understand XFS range locking
and potentially contribute to the effort.

>
> I've been writing up a design doc as I've been doing this analysis
> and research to document the problems and the solutions. I think I'm
> close to the point where I can start implementing this new
> design.

Great news.

>
> Clearly, I've been busy doing a lot of nothing on rangelocks, thank
> you very much.

I’m not suggesting that you haven’t made progress on this complex
feature. What I mean is that XFS users keep expressing their
frustration to us when we can’t offer them any solution to this
issue, not even a less-than-perfect one. Now that we have a workable,
though imperfect, solution, why not try it if the user’s issue is urgent?
That way, when users encounter the same problem in the future, we can
offer them a choice: use the imperfect solution if it’s urgent, or
wait for range locking if it’s not. As you’re aware, most XFS users
don’t have the expertise needed to contribute to the development of
XFS range locking.

>
> > I'm not saying it's your
> > responsibility to implement range locking, but it seems no one else is
> > capable of implementing this complex feature except you.
>
> *cough*
>
> There's lots of people better qualified than me to solve a problem
> like this. It's a computer science problem, and I'm not a computer
> scientist. I'm an engineer - I take stuff that scientists have
> discovered and documented and build tools, infrastructure and
> objects based on that knowledge.
>
> What I'm not good at is coming up with novel new algorithms to solve
> mathematical problems. A range lock is the latter, not the former,
> and there are plenty of people who would be better suited to this
> work than me.  i.e. I'm good at putting knowledge to practical use,
> not creating new knowledge.

Most XFS users, like myself, are mainly skilled in utilizing the
features you’ve developed ;)

>
> However, I'm spending time on it because nobody else is going to
> solve the problem for me.  CPU and IO concurrency is only going up
> and shared/exclusive IO locking behaviour only gets more restrictive
> as concurrency requirements go up and we use extent sharing to
> avoid data copying more extensively. This problem isn't going
> away...

I understand the challenges you're facing. Thank you once again for
all your hard work.
diff mbox series

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..5df7b5b0927a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3472,6 +3472,12 @@  static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
 			return -EPERM;
 		ki->ki_flags &= ~IOCB_APPEND;
 	}
+	if (flags & RWF_IOWAIT) {
+		kiocb_flags |= IOCB_NOWAIT;
+		/* IOCB_NOIO is not allowed for RWF_IOWAIT */
+		if (kiocb_flags & IOCB_NOIO)
+			return -EINVAL;
+	}
 
 	ki->ki_flags |= kiocb_flags;
 	return 0;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 191a7e88a8ab..17a8c065d636 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -332,9 +332,12 @@  typedef int __bitwise __kernel_rwf_t;
 /* Atomic Write */
 #define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000040)
 
+/* per-IO, allow waiting for IO, but not waiting for lock */
+#define RWF_IOWAIT	((__force __kernel_rwf_t)0x00000080)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
+			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC | RWF_IOWAIT)
 
 /* Pagemap ioctl */
 #define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
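
As a usage note (not part of the patch): a hedged sketch of how a
non-critical reader could consume the proposed flag from userspace,
assuming a kernel with this patch applied and a glibc new enough to
expose preadv2(2). RWF_IOWAIT is not in any released uapi header, so
its value is copied from the hunk above; the helper name and back-off
interval are invented for illustration:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #ifndef RWF_IOWAIT
  #define RWF_IOWAIT 0x00000080    /* value proposed by this patch */
  #endif

  /*
   * Read up to len bytes at offset. Blocking for disk I/O is fine, but
   * do not block on the inode lock: back off and retry instead.
   */
  static ssize_t read_without_lock_wait(int fd, void *buf, size_t len, off_t off)
  {
      struct iovec iov = { .iov_base = buf, .iov_len = len };

      for (;;) {
          ssize_t ret = preadv2(fd, &iov, 1, off, RWF_IOWAIT);

          if (ret >= 0)
              return ret;
          if (errno == EAGAIN) {        /* lock held, or data not cached yet */
              usleep(10 * 1000);        /* non-critical reader: back off */
              continue;
          }
          if (errno == EOPNOTSUPP)      /* kernel without this patch */
              return pread(fd, buf, len, off);
          return -1;
      }
  }

Because IOCB_NOIO is not set, the kernel may start the disk read
itself, so the loop will not spin forever on an uncached file; the
only thing it refuses to do is sleep on the inode lock.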