diff mbox series

xfs: Remove i_rwsem lock in buffered read

Message ID 20241226061602.2222985-1-chizhiling@163.com (mailing list archive)
State New
Headers show
Series xfs: Remove i_rwsem lock in buffered read | expand

Commit Message

Chi Zhiling Dec. 26, 2024, 6:16 a.m. UTC
From: Chi Zhiling <chizhiling@kylinos.cn>

Using an rwsem to protect file data ensures that we can always obtain a
completed modification. But due to the lock, we need to wait for the
write process to release the rwsem before we can read it, even if we are
reading a different region of the file. This could take a lot of time
when many processes need to write and read this file.

On the other hand, The ext4 filesystem and others do not hold the lock
during buffered reading, which make the ext4 have better performance in
that case. Therefore, I think it will be fine if we remove the lock in
xfs, as most applications can handle this situation.

Without this lock, we achieve a great improvement when multi-threaded
reading and writing. Use the following command to test:
fio -ioengine=libaio -filename=testfile -bs=4k -rw=randrw -numjobs=16 -name="randrw"

Before this patch:
   READ: bw=351MiB/s (368MB/s), 21.8MiB/s-22.0MiB/s (22.9MB/s-23.1MB/s), io=8185MiB (8582MB), run=23206-23347msec
  WRITE: bw=351MiB/s (368MB/s), 21.9MiB/s-22.1MiB/s (23.0MB/s-23.2MB/s), io=8199MiB (8597MB), run=23206-23347msec

After this patch:
   READ: bw=1961MiB/s (2056MB/s), 122MiB/s-125MiB/s (128MB/s-131MB/s), io=8185MiB (8582MB), run=4097-4174msec
  WRITE: bw=1964MiB/s (2060MB/s), 123MiB/s-125MiB/s (129MB/s-131MB/s), io=8199MiB (8597MB), run=4097-4174msec

Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
---
 fs/xfs/xfs_file.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

Comments

Dave Chinner Dec. 26, 2024, 9:50 p.m. UTC | #1
On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> From: Chi Zhiling <chizhiling@kylinos.cn>
> 
> Using an rwsem to protect file data ensures that we can always obtain a
> completed modification. But due to the lock, we need to wait for the
> write process to release the rwsem before we can read it, even if we are
> reading a different region of the file. This could take a lot of time
> when many processes need to write and read this file.
> 
> On the other hand, The ext4 filesystem and others do not hold the lock
> during buffered reading, which make the ext4 have better performance in
> that case. Therefore, I think it will be fine if we remove the lock in
> xfs, as most applications can handle this situation.

Nope.

This means that XFS loses high level serialisation of incoming IO
against operations like truncate, fallocate, pnfs operations, etc.

We've been through this multiple times before; the solution lies in
doing the work to make buffered writes use shared locking, not
removing shared locking from buffered reads.

A couple of old discussions from the list:

https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
https://lore.kernel.org/linux-xfs/20190404165737.30889-1-amir73il@gmail.com/

There are likely others - you can search for them yourself to get
more background information.

Fundamentally, though, removing locking from the read side is not
the answer to this buffered write IO exclusion problem....

-Dave.
Chi Zhiling Dec. 28, 2024, 7:37 a.m. UTC | #2
On 2024/12/27 05:50, Dave Chinner wrote:
> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
>> From: Chi Zhiling <chizhiling@kylinos.cn>
>>
>> Using an rwsem to protect file data ensures that we can always obtain a
>> completed modification. But due to the lock, we need to wait for the
>> write process to release the rwsem before we can read it, even if we are
>> reading a different region of the file. This could take a lot of time
>> when many processes need to write and read this file.
>>
>> On the other hand, The ext4 filesystem and others do not hold the lock
>> during buffered reading, which make the ext4 have better performance in
>> that case. Therefore, I think it will be fine if we remove the lock in
>> xfs, as most applications can handle this situation.
> 
> Nope.
> 
> This means that XFS loses high level serialisation of incoming IO
> against operations like truncate, fallocate, pnfs operations, etc.
> 
> We've been through this multiple times before; the solution lies in
> doing the work to make buffered writes use shared locking, not
> removing shared locking from buffered reads.

You mean using shared locking for buffered reads and writes, right?

I think it's a great idea. In theory, write operations can be performed
simultaneously if they write to different ranges.

So we should track all the ranges we are reading or writing,
and check whether the new read or write operations can be performed
concurrently with the current operations.

Do we have any plans to use shared locking for buffered writes?


> 
> A couple of old discussions from the list:
> 
> https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> https://lore.kernel.org/linux-xfs/20190404165737.30889-1-amir73il@gmail.com/
> 
> There are likely others - you can search for them yourself to get
> more background information.

Sorry, I didn't find those discussions earlier.

> 
> Fundamentally, though, removing locking from the read side is not
> the answer to this buffered write IO exclusion problem....
> 
> -Dave.

Best regards,
Chi Zhiling
Dave Chinner Dec. 28, 2024, 10:17 p.m. UTC | #3
On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> 
> 
> On 2024/12/27 05:50, Dave Chinner wrote:
> > On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > > From: Chi Zhiling <chizhiling@kylinos.cn>
> > > 
> > > Using an rwsem to protect file data ensures that we can always obtain a
> > > completed modification. But due to the lock, we need to wait for the
> > > write process to release the rwsem before we can read it, even if we are
> > > reading a different region of the file. This could take a lot of time
> > > when many processes need to write and read this file.
> > > 
> > > On the other hand, The ext4 filesystem and others do not hold the lock
> > > during buffered reading, which make the ext4 have better performance in
> > > that case. Therefore, I think it will be fine if we remove the lock in
> > > xfs, as most applications can handle this situation.
> > 
> > Nope.
> > 
> > This means that XFS loses high level serialisation of incoming IO
> > against operations like truncate, fallocate, pnfs operations, etc.
> > 
> > We've been through this multiple times before; the solution lies in
> > doing the work to make buffered writes use shared locking, not
> > removing shared locking from buffered reads.
> 
> You mean using shared locking for buffered reads and writes, right?
> 
> I think it's a great idea. In theory, write operations can be performed
> simultaneously if they write to different ranges.

Even if they overlap, the folio locks will prevent concurrent writes
to the same range.

Now that we have atomic write support as native functionality (i.e.
RWF_ATOMIC), we really should not have to care that much about
normal buffered IO being atomic. i.e. if the application wants
atomic writes, it can now specify that it wants atomic writes and so
we can relax the constraints we have on existing IO...

> So we should track all the ranges we are reading or writing,
> and check whether the new read or write operations can be performed
> concurrently with the current operations.

That is all discussed in detail in the discussions I linked.

> Do we have any plans to use shared locking for buffered writes?

All we are waiting for is someone to put the resources into making
the changes and testing it properly...

-Dave.
Chi Zhiling Dec. 30, 2024, 2:42 a.m. UTC | #4
On 2024/12/29 06:17, Dave Chinner wrote:
> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
>>
>>
>> On 2024/12/27 05:50, Dave Chinner wrote:
>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
>>>> From: Chi Zhiling <chizhiling@kylinos.cn>
>>>>
>>>> Using an rwsem to protect file data ensures that we can always obtain a
>>>> completed modification. But due to the lock, we need to wait for the
>>>> write process to release the rwsem before we can read it, even if we are
>>>> reading a different region of the file. This could take a lot of time
>>>> when many processes need to write and read this file.
>>>>
>>>> On the other hand, The ext4 filesystem and others do not hold the lock
>>>> during buffered reading, which make the ext4 have better performance in
>>>> that case. Therefore, I think it will be fine if we remove the lock in
>>>> xfs, as most applications can handle this situation.
>>>
>>> Nope.
>>>
>>> This means that XFS loses high level serialisation of incoming IO
>>> against operations like truncate, fallocate, pnfs operations, etc.
>>>
>>> We've been through this multiple times before; the solution lies in
>>> doing the work to make buffered writes use shared locking, not
>>> removing shared locking from buffered reads.
>>
>> You mean using shared locking for buffered reads and writes, right?
>>
>> I think it's a great idea. In theory, write operations can be performed
>> simultaneously if they write to different ranges.
> 
> Even if they overlap, the folio locks will prevent concurrent writes
> to the same range.
> 
> Now that we have atomic write support as native functionality (i.e.
> RWF_ATOMIC), we really should not have to care that much about
> normal buffered IO being atomic. i.e. if the application wants
> atomic writes, it can now specify that it wants atomic writes and so
> we can relax the constraints we have on existing IO...

Yes, I'm not particularly concerned about whether buffered I/O is 
atomic. I'm more concerned about the multithreading performance of 
buffered I/O.

Last week, it was mentioned that removing i_rwsem would have some 
impacts on truncate, fallocate, and PNFS operations.

(I'm not familiar with pNFS, so please correct me if I'm wrong.)

My understanding is that the current i_rwsem is used to protect both
the file's data and its size. Operations like truncate, fallocate,
and PNFS use i_rwsem because they modify both the file's data and its 
size. So, I'm thinking whether it's possible to use i_rwsem to protect 
only the file's size, without protecting the file's data.

So operations that modify the file's size need to be executed 
sequentially. For example, buffered writes to the EOF, fallocate 
operations without the "keep size" requirement, and truncate operations, 
etc, all need to hold an exclusive lock.

Other operations require a shared lock because they only need to access
the file's size without modifying it.

> 
>> So we should track all the ranges we are reading or writing,
>> and check whether the new read or write operations can be performed
>> concurrently with the current operations.
> 
> That is all discussed in detail in the discussions I linked.

Sorry, I overlooked some details from old discussion last time.
It seems that you are not satisfied with the effectiveness of
range locks.


Best regards,
Chi Zhiling
Amir Goldstein Jan. 7, 2025, 12:13 p.m. UTC | #5
On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote:
>
>
>
> On 2024/12/29 06:17, Dave Chinner wrote:
> > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> >>
> >>
> >> On 2024/12/27 05:50, Dave Chinner wrote:
> >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> >>>> From: Chi Zhiling <chizhiling@kylinos.cn>
> >>>>
> >>>> Using an rwsem to protect file data ensures that we can always obtain a
> >>>> completed modification. But due to the lock, we need to wait for the
> >>>> write process to release the rwsem before we can read it, even if we are
> >>>> reading a different region of the file. This could take a lot of time
> >>>> when many processes need to write and read this file.
> >>>>
> >>>> On the other hand, The ext4 filesystem and others do not hold the lock
> >>>> during buffered reading, which make the ext4 have better performance in
> >>>> that case. Therefore, I think it will be fine if we remove the lock in
> >>>> xfs, as most applications can handle this situation.
> >>>
> >>> Nope.
> >>>
> >>> This means that XFS loses high level serialisation of incoming IO
> >>> against operations like truncate, fallocate, pnfs operations, etc.
> >>>
> >>> We've been through this multiple times before; the solution lies in
> >>> doing the work to make buffered writes use shared locking, not
> >>> removing shared locking from buffered reads.
> >>
> >> You mean using shared locking for buffered reads and writes, right?
> >>
> >> I think it's a great idea. In theory, write operations can be performed
> >> simultaneously if they write to different ranges.
> >
> > Even if they overlap, the folio locks will prevent concurrent writes
> > to the same range.
> >
> > Now that we have atomic write support as native functionality (i.e.
> > RWF_ATOMIC), we really should not have to care that much about
> > normal buffered IO being atomic. i.e. if the application wants
> > atomic writes, it can now specify that it wants atomic writes and so
> > we can relax the constraints we have on existing IO...
>
> Yes, I'm not particularly concerned about whether buffered I/O is
> atomic. I'm more concerned about the multithreading performance of
> buffered I/O.

Hi Chi,

Sorry for joining late, I was on vacation.
I am very happy that you have taken on this task and your timing is good,
because  John Garry just posted his patches for large atomic writes [1].

I need to explain the relation to atomic buffered I/O, because it is not
easy to follow it from the old discussions and also some of the discussions
about the solution were held in-person during LSFMM2024.

Naturally, your interest is improving multithreading performance of
buffered I/O, so was mine when I first posted this question [2].

The issue with atomicity of buffered I/O is the xfs has traditionally
provided atomicity of write vs. read (a.k.a no torn writes), which is
not required by POSIX standard (because POSIX was not written with
threads in mind) and is not respected by any other in-tree filesystem.

It is obvious that the exclusive rw lock for buffered write hurts performance
in the case of mixed read/write on xfs, so the question was - if xfs provides
atomic semantics that portable applications cannot rely on, why bother
providing these atomic semantics at all?

Dave's answer to this question was that there are some legacy applications
(database applications IIRC) on production systems that do rely on the fact
that xfs provides this semantics and on the prerequisite that they run on xfs.

However, it was noted that:
1. Those application do not require atomicity for any size of IO, they
    typically work in I/O size that is larger than block size (e.g. 16K or 64K)
    and they only require no torn writes for this I/O size
2. Large folios and iomap can usually provide this semantics via folio lock,
    but application has currently no way of knowing if the semantics are
    provided or not
3. The upcoming ability of application to opt-in for atomic writes larger
    than fs block size [1] can be used to facilitate the applications that
    want to legacy xfs semantics and avoid the need to enforce the legacy
    semantics all the time for no good reason

Disclaimer: this is not a standard way to deal with potentially breaking
legacy semantics, because old applications will not usually be rebuilt
to opt-in for the old semantics, but the use case in hand is probably
so specific, of a specific application that relies on xfs specific semantics
that there are currently no objections for trying this solution.

[1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
[2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/

>
> Last week, it was mentioned that removing i_rwsem would have some
> impacts on truncate, fallocate, and PNFS operations.
>
> (I'm not familiar with pNFS, so please correct me if I'm wrong.)

You are not wrong. pNFS uses a "layout lease", which requires
that the blockmap of the file will not be modified while the lease is held.
but I think that block that are not mapped (i.e. holes) can be mapped
while the lease is held.

>
> My understanding is that the current i_rwsem is used to protect both
> the file's data and its size. Operations like truncate, fallocate,
> and PNFS use i_rwsem because they modify both the file's data and its
> size. So, I'm thinking whether it's possible to use i_rwsem to protect
> only the file's size, without protecting the file's data.
>

It also protects the file's blockmap, for example, punch hole
does not change the size, but it unmaps blocks from the blockmap,
leading to races that could end up reading stale data from disk
if the lock wasn't taken.

> So operations that modify the file's size need to be executed
> sequentially. For example, buffered writes to the EOF, fallocate
> operations without the "keep size" requirement, and truncate operations,
> etc, all need to hold an exclusive lock.
>
> Other operations require a shared lock because they only need to access
> the file's size without modifying it.
>

As far as I understand, exclusive lock is not needed for non-extending
writes, because it is ok to map new blocks.
I guess the need for exclusive lock for extending writes is related to
update of file size, but not 100% sure.
Anyway, exclusive lock on extending write is widely common in other fs,
while exclusive lock for non-extending write is unique to xfs.

> >
> >> So we should track all the ranges we are reading or writing,
> >> and check whether the new read or write operations can be performed
> >> concurrently with the current operations.
> >
> > That is all discussed in detail in the discussions I linked.
>
> Sorry, I overlooked some details from old discussion last time.
> It seems that you are not satisfied with the effectiveness of
> range locks.
>

Correct. This solution was taken off the table.

I hope my explanation was correct and clear.
If anything else is not clear, please feel free to ask.

Thanks,
Amir.
Christoph Hellwig Jan. 7, 2025, 5:12 p.m. UTC | #6
On Tue, Jan 07, 2025 at 01:13:17PM +0100, Amir Goldstein wrote:
> The issue with atomicity of buffered I/O is the xfs has traditionally
> provided atomicity of write vs. read (a.k.a no torn writes), which is
> not required by POSIX standard (because POSIX was not written with
> threads in mind) and is not respected by any other in-tree filesystem.

That is true for original Posix, but once Posix Threads joined the game
the behavior was and still is required.  See "2.9.7 Thread Interactions
with Regular File Operations" here:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html

Now most Linux filesystems ignored that and got away with ignoring
the requirement, but it still exists.
Chi Zhiling Jan. 8, 2025, 7:43 a.m. UTC | #7
On 2025/1/7 20:13, Amir Goldstein wrote:
> On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote:
>> On 2024/12/29 06:17, Dave Chinner wrote:
>>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
>>>> On 2024/12/27 05:50, Dave Chinner wrote:
>>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
>>>>>> From: Chi Zhiling <chizhiling@kylinos.cn>
>>>>>>
>>>>>> Using an rwsem to protect file data ensures that we can always obtain a
>>>>>> completed modification. But due to the lock, we need to wait for the
>>>>>> write process to release the rwsem before we can read it, even if we are
>>>>>> reading a different region of the file. This could take a lot of time
>>>>>> when many processes need to write and read this file.
>>>>>>
>>>>>> On the other hand, The ext4 filesystem and others do not hold the lock
>>>>>> during buffered reading, which make the ext4 have better performance in
>>>>>> that case. Therefore, I think it will be fine if we remove the lock in
>>>>>> xfs, as most applications can handle this situation.
>>>>>
>>>>> Nope.
>>>>>
>>>>> This means that XFS loses high level serialisation of incoming IO
>>>>> against operations like truncate, fallocate, pnfs operations, etc.
>>>>>
>>>>> We've been through this multiple times before; the solution lies in
>>>>> doing the work to make buffered writes use shared locking, not
>>>>> removing shared locking from buffered reads.
>>>>
>>>> You mean using shared locking for buffered reads and writes, right?
>>>>
>>>> I think it's a great idea. In theory, write operations can be performed
>>>> simultaneously if they write to different ranges.
>>>
>>> Even if they overlap, the folio locks will prevent concurrent writes
>>> to the same range.
>>>
>>> Now that we have atomic write support as native functionality (i.e.
>>> RWF_ATOMIC), we really should not have to care that much about
>>> normal buffered IO being atomic. i.e. if the application wants
>>> atomic writes, it can now specify that it wants atomic writes and so
>>> we can relax the constraints we have on existing IO...
>>
>> Yes, I'm not particularly concerned about whether buffered I/O is
>> atomic. I'm more concerned about the multithreading performance of
>> buffered I/O.
> 
> Hi Chi,
> 
> Sorry for joining late, I was on vacation.
> I am very happy that you have taken on this task and your timing is good,
> because  John Garry just posted his patches for large atomic writes [1].

I'm glad to have you on board. :)

I'm not sure if my understanding is correct, but it seems that our
discussion is about multithreading safety, while John's patch is about
providing power-failure safety, even though both mention atomicity.

> 
> I need to explain the relation to atomic buffered I/O, because it is not
> easy to follow it from the old discussions and also some of the discussions
> about the solution were held in-person during LSFMM2024.
> 
> Naturally, your interest is improving multithreading performance of
> buffered I/O, so was mine when I first posted this question [2].
> 
> The issue with atomicity of buffered I/O is the xfs has traditionally
> provided atomicity of write vs. read (a.k.a no torn writes), which is
> not required by POSIX standard (because POSIX was not written with
> threads in mind) and is not respected by any other in-tree filesystem.
> 
> It is obvious that the exclusive rw lock for buffered write hurts performance
> in the case of mixed read/write on xfs, so the question was - if xfs provides
> atomic semantics that portable applications cannot rely on, why bother
> providing these atomic semantics at all?

Perhaps we can add an option that allows distributions or users to
decide whether they need this semantics. I would not hesitate to
turn off this semantics on my system when the time comes.

> 
> Dave's answer to this question was that there are some legacy applications
> (database applications IIRC) on production systems that do rely on the fact
> that xfs provides this semantics and on the prerequisite that they run on xfs.
> 
> However, it was noted that:
> 1. Those application do not require atomicity for any size of IO, they
>      typically work in I/O size that is larger than block size (e.g. 16K or 64K)
>      and they only require no torn writes for this I/O size
> 2. Large folios and iomap can usually provide this semantics via folio lock,
>      but application has currently no way of knowing if the semantics are
>      provided or not

To be honest, it would be best if the folio lock could provide such
semantics, as it would not cause any potential problems for the
application, and we have hope to achieve concurrent writes.

However, I am not sure if this is easy to implement and will not cause
other problems.

> 3. The upcoming ability of application to opt-in for atomic writes larger
>      than fs block size [1] can be used to facilitate the applications that
>      want to legacy xfs semantics and avoid the need to enforce the legacy
>      semantics all the time for no good reason
> 
> Disclaimer: this is not a standard way to deal with potentially breaking
> legacy semantics, because old applications will not usually be rebuilt
> to opt-in for the old semantics, but the use case in hand is probably
> so specific, of a specific application that relies on xfs specific semantics
> that there are currently no objections for trying this solution.
> 
> [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> 
>>
>> Last week, it was mentioned that removing i_rwsem would have some
>> impacts on truncate, fallocate, and PNFS operations.
>>
>> (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> 
> You are not wrong. pNFS uses a "layout lease", which requires
> that the blockmap of the file will not be modified while the lease is held.
> but I think that block that are not mapped (i.e. holes) can be mapped
> while the lease is held.
> 
>>
>> My understanding is that the current i_rwsem is used to protect both
>> the file's data and its size. Operations like truncate, fallocate,
>> and PNFS use i_rwsem because they modify both the file's data and its
>> size. So, I'm thinking whether it's possible to use i_rwsem to protect
>> only the file's size, without protecting the file's data.
>>
> 
> It also protects the file's blockmap, for example, punch hole
> does not change the size, but it unmaps blocks from the blockmap,
> leading to races that could end up reading stale data from disk
> if the lock wasn't taken.
> 
>> So operations that modify the file's size need to be executed
>> sequentially. For example, buffered writes to the EOF, fallocate
>> operations without the "keep size" requirement, and truncate operations,
>> etc, all need to hold an exclusive lock.
>>
>> Other operations require a shared lock because they only need to access
>> the file's size without modifying it.
>>
> 
> As far as I understand, exclusive lock is not needed for non-extending
> writes, because it is ok to map new blocks.
> I guess the need for exclusive lock for extending writes is related to
> update of file size, but not 100% sure.
> Anyway, exclusive lock on extending write is widely common in other fs,
> while exclusive lock for non-extending write is unique to xfs.
> 
>>>
>>>> So we should track all the ranges we are reading or writing,
>>>> and check whether the new read or write operations can be performed
>>>> concurrently with the current operations.
>>>
>>> That is all discussed in detail in the discussions I linked.
>>
>> Sorry, I overlooked some details from old discussion last time.
>> It seems that you are not satisfied with the effectiveness of
>> range locks.
>>
> 
> Correct. This solution was taken off the table.
> 
> I hope my explanation was correct and clear.
> If anything else is not clear, please feel free to ask.
> 

I think your explanation is very clear.

Thanks,
Chi Zhiling
Amir Goldstein Jan. 8, 2025, 11:33 a.m. UTC | #8
On Wed, Jan 8, 2025 at 8:43 AM Chi Zhiling <chizhiling@163.com> wrote:
>
> On 2025/1/7 20:13, Amir Goldstein wrote:
> > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote:
> >> On 2024/12/29 06:17, Dave Chinner wrote:
> >>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> >>>> On 2024/12/27 05:50, Dave Chinner wrote:
> >>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> >>>>>> From: Chi Zhiling <chizhiling@kylinos.cn>
> >>>>>>
> >>>>>> Using an rwsem to protect file data ensures that we can always obtain a
> >>>>>> completed modification. But due to the lock, we need to wait for the
> >>>>>> write process to release the rwsem before we can read it, even if we are
> >>>>>> reading a different region of the file. This could take a lot of time
> >>>>>> when many processes need to write and read this file.
> >>>>>>
> >>>>>> On the other hand, The ext4 filesystem and others do not hold the lock
> >>>>>> during buffered reading, which make the ext4 have better performance in
> >>>>>> that case. Therefore, I think it will be fine if we remove the lock in
> >>>>>> xfs, as most applications can handle this situation.
> >>>>>
> >>>>> Nope.
> >>>>>
> >>>>> This means that XFS loses high level serialisation of incoming IO
> >>>>> against operations like truncate, fallocate, pnfs operations, etc.
> >>>>>
> >>>>> We've been through this multiple times before; the solution lies in
> >>>>> doing the work to make buffered writes use shared locking, not
> >>>>> removing shared locking from buffered reads.
> >>>>
> >>>> You mean using shared locking for buffered reads and writes, right?
> >>>>
> >>>> I think it's a great idea. In theory, write operations can be performed
> >>>> simultaneously if they write to different ranges.
> >>>
> >>> Even if they overlap, the folio locks will prevent concurrent writes
> >>> to the same range.
> >>>
> >>> Now that we have atomic write support as native functionality (i.e.
> >>> RWF_ATOMIC), we really should not have to care that much about
> >>> normal buffered IO being atomic. i.e. if the application wants
> >>> atomic writes, it can now specify that it wants atomic writes and so
> >>> we can relax the constraints we have on existing IO...
> >>
> >> Yes, I'm not particularly concerned about whether buffered I/O is
> >> atomic. I'm more concerned about the multithreading performance of
> >> buffered I/O.
> >
> > Hi Chi,
> >
> > Sorry for joining late, I was on vacation.
> > I am very happy that you have taken on this task and your timing is good,
> > because  John Garry just posted his patches for large atomic writes [1].
>
> I'm glad to have you on board. :)
>
> I'm not sure if my understanding is correct, but it seems that our
> discussion is about multithreading safety, while John's patch is about
> providing power-failure safety, even though both mention atomicity.
>

You are right about the *motivation* of John's work.
But as a by-product of his work, we get an API to opt-in for
atomic writes and we can piggy back this opt-in API to say
that whoever wants the legacy behavior can use the new API
to get both power failure safety and multithreading safety.

> >
> > I need to explain the relation to atomic buffered I/O, because it is not
> > easy to follow it from the old discussions and also some of the discussions
> > about the solution were held in-person during LSFMM2024.
> >
> > Naturally, your interest is improving multithreading performance of
> > buffered I/O, so was mine when I first posted this question [2].
> >
> > The issue with atomicity of buffered I/O is the xfs has traditionally
> > provided atomicity of write vs. read (a.k.a no torn writes), which is
> > not required by POSIX standard (because POSIX was not written with
> > threads in mind) and is not respected by any other in-tree filesystem.
> >
> > It is obvious that the exclusive rw lock for buffered write hurts performance
> > in the case of mixed read/write on xfs, so the question was - if xfs provides
> > atomic semantics that portable applications cannot rely on, why bother
> > providing these atomic semantics at all?
>
> Perhaps we can add an option that allows distributions or users to
> decide whether they need this semantics. I would not hesitate to
> turn off this semantics on my system when the time comes.
>

Yes, we can, but we do not want to do it unless we have to.
99% of the users do not want the legacy semantics, so it would
be better if they get the best performance without having to opt-in to get it.

> >
> > Dave's answer to this question was that there are some legacy applications
> > (database applications IIRC) on production systems that do rely on the fact
> > that xfs provides this semantics and on the prerequisite that they run on xfs.
> >
> > However, it was noted that:
> > 1. Those application do not require atomicity for any size of IO, they
> >      typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> >      and they only require no torn writes for this I/O size
> > 2. Large folios and iomap can usually provide this semantics via folio lock,
> >      but application has currently no way of knowing if the semantics are
> >      provided or not
>
> To be honest, it would be best if the folio lock could provide such
> semantics, as it would not cause any potential problems for the
> application, and we have hope to achieve concurrent writes.
>
> However, I am not sure if this is easy to implement and will not cause
> other problems.
>
> > 3. The upcoming ability of application to opt-in for atomic writes larger
> >      than fs block size [1] can be used to facilitate the applications that
> >      want to legacy xfs semantics and avoid the need to enforce the legacy
> >      semantics all the time for no good reason
> >
> > Disclaimer: this is not a standard way to deal with potentially breaking
> > legacy semantics, because old applications will not usually be rebuilt
> > to opt-in for the old semantics, but the use case in hand is probably
> > so specific, of a specific application that relies on xfs specific semantics
> > that there are currently no objections for trying this solution.
> >
> > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> >
> >>
> >> Last week, it was mentioned that removing i_rwsem would have some
> >> impacts on truncate, fallocate, and PNFS operations.
> >>
> >> (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> >
> > You are not wrong. pNFS uses a "layout lease", which requires
> > that the blockmap of the file will not be modified while the lease is held.
> > but I think that block that are not mapped (i.e. holes) can be mapped
> > while the lease is held.
> >
> >>
> >> My understanding is that the current i_rwsem is used to protect both
> >> the file's data and its size. Operations like truncate, fallocate,
> >> and PNFS use i_rwsem because they modify both the file's data and its
> >> size. So, I'm thinking whether it's possible to use i_rwsem to protect
> >> only the file's size, without protecting the file's data.
> >>
> >
> > It also protects the file's blockmap, for example, punch hole
> > does not change the size, but it unmaps blocks from the blockmap,
> > leading to races that could end up reading stale data from disk
> > if the lock wasn't taken.
> >
> >> So operations that modify the file's size need to be executed
> >> sequentially. For example, buffered writes to the EOF, fallocate
> >> operations without the "keep size" requirement, and truncate operations,
> >> etc, all need to hold an exclusive lock.
> >>
> >> Other operations require a shared lock because they only need to access
> >> the file's size without modifying it.
> >>
> >
> > As far as I understand, exclusive lock is not needed for non-extending
> > writes, because it is ok to map new blocks.
> > I guess the need for exclusive lock for extending writes is related to
> > update of file size, but not 100% sure.
> > Anyway, exclusive lock on extending write is widely common in other fs,
> > while exclusive lock for non-extending write is unique to xfs.
> >
> >>>
> >>>> So we should track all the ranges we are reading or writing,
> >>>> and check whether the new read or write operations can be performed
> >>>> concurrently with the current operations.
> >>>
> >>> That is all discussed in detail in the discussions I linked.
> >>
> >> Sorry, I overlooked some details from old discussion last time.
> >> It seems that you are not satisfied with the effectiveness of
> >> range locks.
> >>
> >
> > Correct. This solution was taken off the table.
> >
> > I hope my explanation was correct and clear.
> > If anything else is not clear, please feel free to ask.
> >
>
> I think your explanation is very clear.

One more thing I should mention.
You do not need to wait for atomic large writes patches to land.
There is nothing stopping you from implementing the suggested
solution based on the xfs code already in master (v6.13-rc1),
which has support for the RWF_ATOMIC flag for writes.

It just means that the API will not be usable for applications that
want to do IO larger than block size, but concurrent read/write
performance of 4K IO could be improved already.

It's possible that all you need to do is:

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c488ae26b23d0..2542f15496488 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -777,9 +777,10 @@ xfs_file_buffered_write(
        ssize_t                 ret;
        bool                    cleared_space = false;
        unsigned int            iolock;
+       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;

 write_retry:
-       iolock = XFS_IOLOCK_EXCL;
+       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;
        ret = xfs_ilock_iocb(iocb, iolock);
--

xfs_file_write_checks() afterwards already takes care of promoting
XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.

It is possible that XFS_IOLOCK_EXCL could be immediately demoted
back to XFS_IOLOCK_SHARED for atomic_writes as done in
xfs_file_dio_write_aligned().

TBH, I am not sure which blockdevs support 4K atomic writes that could
be used to test this.

John, can you share your test setup instructions for atomic writes?

Thanks,
Amir.
Amir Goldstein Jan. 8, 2025, 11:45 a.m. UTC | #9
On Wed, Jan 8, 2025 at 12:33 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Jan 8, 2025 at 8:43 AM Chi Zhiling <chizhiling@163.com> wrote:
> >
> > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote:
> > >> On 2024/12/29 06:17, Dave Chinner wrote:
> > >>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> > >>>> On 2024/12/27 05:50, Dave Chinner wrote:
> > >>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > >>>>>> From: Chi Zhiling <chizhiling@kylinos.cn>
> > >>>>>>
> > >>>>>> Using an rwsem to protect file data ensures that we can always obtain a
> > >>>>>> completed modification. But due to the lock, we need to wait for the
> > >>>>>> write process to release the rwsem before we can read it, even if we are
> > >>>>>> reading a different region of the file. This could take a lot of time
> > >>>>>> when many processes need to write and read this file.
> > >>>>>>
> > >>>>>> On the other hand, The ext4 filesystem and others do not hold the lock
> > >>>>>> during buffered reading, which make the ext4 have better performance in
> > >>>>>> that case. Therefore, I think it will be fine if we remove the lock in
> > >>>>>> xfs, as most applications can handle this situation.
> > >>>>>
> > >>>>> Nope.
> > >>>>>
> > >>>>> This means that XFS loses high level serialisation of incoming IO
> > >>>>> against operations like truncate, fallocate, pnfs operations, etc.
> > >>>>>
> > >>>>> We've been through this multiple times before; the solution lies in
> > >>>>> doing the work to make buffered writes use shared locking, not
> > >>>>> removing shared locking from buffered reads.
> > >>>>
> > >>>> You mean using shared locking for buffered reads and writes, right?
> > >>>>
> > >>>> I think it's a great idea. In theory, write operations can be performed
> > >>>> simultaneously if they write to different ranges.
> > >>>
> > >>> Even if they overlap, the folio locks will prevent concurrent writes
> > >>> to the same range.
> > >>>
> > >>> Now that we have atomic write support as native functionality (i.e.
> > >>> RWF_ATOMIC), we really should not have to care that much about
> > >>> normal buffered IO being atomic. i.e. if the application wants
> > >>> atomic writes, it can now specify that it wants atomic writes and so
> > >>> we can relax the constraints we have on existing IO...
> > >>
> > >> Yes, I'm not particularly concerned about whether buffered I/O is
> > >> atomic. I'm more concerned about the multithreading performance of
> > >> buffered I/O.
> > >
> > > Hi Chi,
> > >
> > > Sorry for joining late, I was on vacation.
> > > I am very happy that you have taken on this task and your timing is good,
> > > because  John Garry just posted his patches for large atomic writes [1].
> >
> > I'm glad to have you on board. :)
> >
> > I'm not sure if my understanding is correct, but it seems that our
> > discussion is about multithreading safety, while John's patch is about
> > providing power-failure safety, even though both mention atomicity.
> >
>
> You are right about the *motivation* of John's work.
> But as a by-product of his work, we get an API to opt-in for
> atomic writes and we can piggy back this opt-in API to say
> that whoever wants the legacy behavior can use the new API
> to get both power failure safety and multithreading safety.
>
> > >
> > > I need to explain the relation to atomic buffered I/O, because it is not
> > > easy to follow it from the old discussions and also some of the discussions
> > > about the solution were held in-person during LSFMM2024.
> > >
> > > Naturally, your interest is improving multithreading performance of
> > > buffered I/O, so was mine when I first posted this question [2].
> > >
> > > The issue with atomicity of buffered I/O is the xfs has traditionally
> > > provided atomicity of write vs. read (a.k.a no torn writes), which is
> > > not required by POSIX standard (because POSIX was not written with
> > > threads in mind) and is not respected by any other in-tree filesystem.
> > >
> > > It is obvious that the exclusive rw lock for buffered write hurts performance
> > > in the case of mixed read/write on xfs, so the question was - if xfs provides
> > > atomic semantics that portable applications cannot rely on, why bother
> > > providing these atomic semantics at all?
> >
> > Perhaps we can add an option that allows distributions or users to
> > decide whether they need this semantics. I would not hesitate to
> > turn off this semantics on my system when the time comes.
> >
>
> Yes, we can, but we do not want to do it unless we have to.
> 99% of the users do not want the legacy semantics, so it would
> be better if they get the best performance without having to opt-in to get it.
>
> > >
> > > Dave's answer to this question was that there are some legacy applications
> > > (database applications IIRC) on production systems that do rely on the fact
> > > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > >
> > > However, it was noted that:
> > > 1. Those application do not require atomicity for any size of IO, they
> > >      typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> > >      and they only require no torn writes for this I/O size
> > > 2. Large folios and iomap can usually provide this semantics via folio lock,
> > >      but application has currently no way of knowing if the semantics are
> > >      provided or not
> >
> > To be honest, it would be best if the folio lock could provide such
> > semantics, as it would not cause any potential problems for the
> > application, and we have hope to achieve concurrent writes.
> >
> > However, I am not sure if this is easy to implement and will not cause
> > other problems.
> >
> > > 3. The upcoming ability of application to opt-in for atomic writes larger
> > >      than fs block size [1] can be used to facilitate the applications that
> > >      want to legacy xfs semantics and avoid the need to enforce the legacy
> > >      semantics all the time for no good reason
> > >
> > > Disclaimer: this is not a standard way to deal with potentially breaking
> > > legacy semantics, because old applications will not usually be rebuilt
> > > to opt-in for the old semantics, but the use case in hand is probably
> > > so specific, of a specific application that relies on xfs specific semantics
> > > that there are currently no objections for trying this solution.
> > >
> > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> > > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> > >
> > >>
> > >> Last week, it was mentioned that removing i_rwsem would have some
> > >> impacts on truncate, fallocate, and PNFS operations.
> > >>
> > >> (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> > >
> > > You are not wrong. pNFS uses a "layout lease", which requires
> > > that the blockmap of the file will not be modified while the lease is held.
> > > but I think that block that are not mapped (i.e. holes) can be mapped
> > > while the lease is held.
> > >
> > >>
> > >> My understanding is that the current i_rwsem is used to protect both
> > >> the file's data and its size. Operations like truncate, fallocate,
> > >> and PNFS use i_rwsem because they modify both the file's data and its
> > >> size. So, I'm thinking whether it's possible to use i_rwsem to protect
> > >> only the file's size, without protecting the file's data.
> > >>
> > >
> > > It also protects the file's blockmap, for example, punch hole
> > > does not change the size, but it unmaps blocks from the blockmap,
> > > leading to races that could end up reading stale data from disk
> > > if the lock wasn't taken.
> > >
> > >> So operations that modify the file's size need to be executed
> > >> sequentially. For example, buffered writes to the EOF, fallocate
> > >> operations without the "keep size" requirement, and truncate operations,
> > >> etc, all need to hold an exclusive lock.
> > >>
> > >> Other operations require a shared lock because they only need to access
> > >> the file's size without modifying it.
> > >>
> > >
> > > As far as I understand, exclusive lock is not needed for non-extending
> > > writes, because it is ok to map new blocks.
> > > I guess the need for exclusive lock for extending writes is related to
> > > update of file size, but not 100% sure.
> > > Anyway, exclusive lock on extending write is widely common in other fs,
> > > while exclusive lock for non-extending write is unique to xfs.
> > >
> > >>>
> > >>>> So we should track all the ranges we are reading or writing,
> > >>>> and check whether the new read or write operations can be performed
> > >>>> concurrently with the current operations.
> > >>>
> > >>> That is all discussed in detail in the discussions I linked.
> > >>
> > >> Sorry, I overlooked some details from old discussion last time.
> > >> It seems that you are not satisfied with the effectiveness of
> > >> range locks.
> > >>
> > >
> > > Correct. This solution was taken off the table.
> > >
> > > I hope my explanation was correct and clear.
> > > If anything else is not clear, please feel free to ask.
> > >
> >
> > I think your explanation is very clear.
>
> One more thing I should mention.
> You do not need to wait for atomic large writes patches to land.
> There is nothing stopping you from implementing the suggested
> solution based on the xfs code already in master (v6.13-rc1),
> which has support for the RWF_ATOMIC flag for writes.
>
> It just means that the API will not be usable for applications that
> want to do IO larger than block size, but concurrent read/write
> performance of 4K IO could be improved already.
>
> It's possible that all you need to do is:
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c488ae26b23d0..2542f15496488 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -777,9 +777,10 @@ xfs_file_buffered_write(
>         ssize_t                 ret;
>         bool                    cleared_space = false;
>         unsigned int            iolock;
> +       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;
>
>  write_retry:
> -       iolock = XFS_IOLOCK_EXCL;
> +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;

It's the other way around of course :)
and did not state the obvious - this mock patch is NOT TESTED!

Thanks,
Amir.

>         ret = xfs_ilock_iocb(iocb, iolock);
> --
>
> xfs_file_write_checks() afterwards already takes care of promoting
> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
>
> It is possible that XFS_IOLOCK_EXCL could be immediately demoted
> back to XFS_IOLOCK_SHARED for atomic_writes as done in
> xfs_file_dio_write_aligned().
>
> TBH, I am not sure which blockdevs support 4K atomic writes that could
> be used to test this.
>
> John, can you share your test setup instructions for atomic writes?
>
> Thanks,
> Amir.
John Garry Jan. 8, 2025, 12:15 p.m. UTC | #10
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c488ae26b23d0..2542f15496488 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -777,9 +777,10 @@ xfs_file_buffered_write(
>          ssize_t                 ret;
>          bool                    cleared_space = false;
>          unsigned int            iolock;
> +       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;
> 
>   write_retry:
> -       iolock = XFS_IOLOCK_EXCL;
> +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;
>          ret = xfs_ilock_iocb(iocb, iolock);
> --
> 
> xfs_file_write_checks() afterwards already takes care of promoting
> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
> 
> It is possible that XFS_IOLOCK_EXCL could be immediately demoted
> back to XFS_IOLOCK_SHARED for atomic_writes as done in
> xfs_file_dio_write_aligned().
> 
> TBH, I am not sure which blockdevs support 4K atomic writes that could
> be used to test this.
> 
> John, can you share your test setup instructions for atomic writes?

Please note that IOCB_ATOMIC is not supported for buffered IO, so we 
can't do this - we only support direct IO today.

And supporting buffered IO has its challenges; how to handle overlapping 
atomic writes of differing sizes sitting in the page cache is the main 
issue which comes to mind.

Thanks,
John
Darrick J. Wong Jan. 8, 2025, 5:35 p.m. UTC | #11
On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
> On 2025/1/7 20:13, Amir Goldstein wrote:
> > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote:
> > > On 2024/12/29 06:17, Dave Chinner wrote:
> > > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> > > > > On 2024/12/27 05:50, Dave Chinner wrote:
> > > > > > On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > > > > > > From: Chi Zhiling <chizhiling@kylinos.cn>
> > > > > > > 
> > > > > > > Using an rwsem to protect file data ensures that we can always obtain a
> > > > > > > completed modification. But due to the lock, we need to wait for the
> > > > > > > write process to release the rwsem before we can read it, even if we are
> > > > > > > reading a different region of the file. This could take a lot of time
> > > > > > > when many processes need to write and read this file.
> > > > > > > 
> > > > > > > On the other hand, The ext4 filesystem and others do not hold the lock
> > > > > > > during buffered reading, which make the ext4 have better performance in
> > > > > > > that case. Therefore, I think it will be fine if we remove the lock in
> > > > > > > xfs, as most applications can handle this situation.
> > > > > > 
> > > > > > Nope.
> > > > > > 
> > > > > > This means that XFS loses high level serialisation of incoming IO
> > > > > > against operations like truncate, fallocate, pnfs operations, etc.
> > > > > > 
> > > > > > We've been through this multiple times before; the solution lies in
> > > > > > doing the work to make buffered writes use shared locking, not
> > > > > > removing shared locking from buffered reads.
> > > > > 
> > > > > You mean using shared locking for buffered reads and writes, right?
> > > > > 
> > > > > I think it's a great idea. In theory, write operations can be performed
> > > > > simultaneously if they write to different ranges.
> > > > 
> > > > Even if they overlap, the folio locks will prevent concurrent writes
> > > > to the same range.
> > > > 
> > > > Now that we have atomic write support as native functionality (i.e.
> > > > RWF_ATOMIC), we really should not have to care that much about
> > > > normal buffered IO being atomic. i.e. if the application wants
> > > > atomic writes, it can now specify that it wants atomic writes and so
> > > > we can relax the constraints we have on existing IO...
> > > 
> > > Yes, I'm not particularly concerned about whether buffered I/O is
> > > atomic. I'm more concerned about the multithreading performance of
> > > buffered I/O.
> > 
> > Hi Chi,
> > 
> > Sorry for joining late, I was on vacation.
> > I am very happy that you have taken on this task and your timing is good,
> > because  John Garry just posted his patches for large atomic writes [1].
> 
> I'm glad to have you on board. :)
> 
> I'm not sure if my understanding is correct, but it seems that our
> discussion is about multithreading safety, while John's patch is about
> providing power-failure safety, even though both mention atomicity.
> 
> > 
> > I need to explain the relation to atomic buffered I/O, because it is not
> > easy to follow it from the old discussions and also some of the discussions
> > about the solution were held in-person during LSFMM2024.
> > 
> > Naturally, your interest is improving multithreading performance of
> > buffered I/O, so was mine when I first posted this question [2].
> > 
> > The issue with atomicity of buffered I/O is the xfs has traditionally
> > provided atomicity of write vs. read (a.k.a no torn writes), which is
> > not required by POSIX standard (because POSIX was not written with
> > threads in mind) and is not respected by any other in-tree filesystem.
> > 
> > It is obvious that the exclusive rw lock for buffered write hurts performance
> > in the case of mixed read/write on xfs, so the question was - if xfs provides
> > atomic semantics that portable applications cannot rely on, why bother
> > providing these atomic semantics at all?
> 
> Perhaps we can add an option that allows distributions or users to
> decide whether they need this semantics. I would not hesitate to
> turn off this semantics on my system when the time comes.
> 
> > 
> > Dave's answer to this question was that there are some legacy applications
> > (database applications IIRC) on production systems that do rely on the fact
> > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > 
> > However, it was noted that:
> > 1. Those application do not require atomicity for any size of IO, they
> >      typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> >      and they only require no torn writes for this I/O size
> > 2. Large folios and iomap can usually provide this semantics via folio lock,
> >      but application has currently no way of knowing if the semantics are
> >      provided or not
> 
> To be honest, it would be best if the folio lock could provide such
> semantics, as it would not cause any potential problems for the
> application, and we have hope to achieve concurrent writes.
> 
> However, I am not sure if this is easy to implement and will not cause
> other problems.

Assuming we're not abandoning POSIX "Thread Interactions with Regular
File Operations", you can't use the folio lock for coordination, for
several reasons:

a) Apps can't directly control the size of the folio in the page cache

b) The folio size can (theoretically) change underneath the program at
any time (reclaim can take your large folio and the next read gets a
smaller folio)

c) If your write crosses folios, you've just crossed a synchronization
boundary and all bets are off, though all the other filesystems behave
this way and there seem not to be complaints

d) If you try to "guarantee" folio granularity by messing with min/max
folio size, you run the risk of ENOMEM if the base pages get fragmented

I think that's why Dave suggested range locks as the correct solution to
this; though it is a pity that so far nobody has come up with a
performant implementation.

--D

> > 3. The upcoming ability of application to opt-in for atomic writes larger
> >      than fs block size [1] can be used to facilitate the applications that
> >      want to legacy xfs semantics and avoid the need to enforce the legacy
> >      semantics all the time for no good reason
> > 
> > Disclaimer: this is not a standard way to deal with potentially breaking
> > legacy semantics, because old applications will not usually be rebuilt
> > to opt-in for the old semantics, but the use case in hand is probably
> > so specific, of a specific application that relies on xfs specific semantics
> > that there are currently no objections for trying this solution.
> > 
> > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> > 
> > > 
> > > Last week, it was mentioned that removing i_rwsem would have some
> > > impacts on truncate, fallocate, and PNFS operations.
> > > 
> > > (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> > 
> > You are not wrong. pNFS uses a "layout lease", which requires
> > that the blockmap of the file will not be modified while the lease is held.
> > but I think that block that are not mapped (i.e. holes) can be mapped
> > while the lease is held.
> > 
> > > 
> > > My understanding is that the current i_rwsem is used to protect both
> > > the file's data and its size. Operations like truncate, fallocate,
> > > and PNFS use i_rwsem because they modify both the file's data and its
> > > size. So, I'm thinking whether it's possible to use i_rwsem to protect
> > > only the file's size, without protecting the file's data.
> > > 
> > 
> > It also protects the file's blockmap, for example, punch hole
> > does not change the size, but it unmaps blocks from the blockmap,
> > leading to races that could end up reading stale data from disk
> > if the lock wasn't taken.
> > 
> > > So operations that modify the file's size need to be executed
> > > sequentially. For example, buffered writes to the EOF, fallocate
> > > operations without the "keep size" requirement, and truncate operations,
> > > etc, all need to hold an exclusive lock.
> > > 
> > > Other operations require a shared lock because they only need to access
> > > the file's size without modifying it.
> > > 
> > 
> > As far as I understand, exclusive lock is not needed for non-extending
> > writes, because it is ok to map new blocks.
> > I guess the need for exclusive lock for extending writes is related to
> > update of file size, but not 100% sure.
> > Anyway, exclusive lock on extending write is widely common in other fs,
> > while exclusive lock for non-extending write is unique to xfs.
> > 
> > > > 
> > > > > So we should track all the ranges we are reading or writing,
> > > > > and check whether the new read or write operations can be performed
> > > > > concurrently with the current operations.
> > > > 
> > > > That is all discussed in detail in the discussions I linked.
> > > 
> > > Sorry, I overlooked some details from old discussion last time.
> > > It seems that you are not satisfied with the effectiveness of
> > > range locks.
> > > 
> > 
> > Correct. This solution was taken off the table.
> > 
> > I hope my explanation was correct and clear.
> > If anything else is not clear, please feel free to ask.
> > 
> 
> I think your explanation is very clear.
> 
> Thanks,
> Chi Zhiling
> 
>
Chi Zhiling Jan. 9, 2025, 8:37 a.m. UTC | #12
On 2025/1/8 19:33, Amir Goldstein wrote:
> On Wed, Jan 8, 2025 at 8:43 AM Chi Zhiling <chizhiling@163.com> wrote:
>> On 2025/1/7 20:13, Amir Goldstein wrote:
>>> On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote:
>>>> On 2024/12/29 06:17, Dave Chinner wrote:
>>>>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
>>>>>> On 2024/12/27 05:50, Dave Chinner wrote:
>>>>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
>>>>>>>> From: Chi Zhiling <chizhiling@kylinos.cn>
>>>>>>>>
>>>>>>>> Using an rwsem to protect file data ensures that we can always obtain a
>>>>>>>> completed modification. But due to the lock, we need to wait for the
>>>>>>>> write process to release the rwsem before we can read it, even if we are
>>>>>>>> reading a different region of the file. This could take a lot of time
>>>>>>>> when many processes need to write and read this file.
>>>>>>>>
>>>>>>>> On the other hand, The ext4 filesystem and others do not hold the lock
>>>>>>>> during buffered reading, which make the ext4 have better performance in
>>>>>>>> that case. Therefore, I think it will be fine if we remove the lock in
>>>>>>>> xfs, as most applications can handle this situation.
>>>>>>>
>>>>>>> Nope.
>>>>>>>
>>>>>>> This means that XFS loses high level serialisation of incoming IO
>>>>>>> against operations like truncate, fallocate, pnfs operations, etc.
>>>>>>>
>>>>>>> We've been through this multiple times before; the solution lies in
>>>>>>> doing the work to make buffered writes use shared locking, not
>>>>>>> removing shared locking from buffered reads.
>>>>>>
>>>>>> You mean using shared locking for buffered reads and writes, right?
>>>>>>
>>>>>> I think it's a great idea. In theory, write operations can be performed
>>>>>> simultaneously if they write to different ranges.
>>>>>
>>>>> Even if they overlap, the folio locks will prevent concurrent writes
>>>>> to the same range.
>>>>>
>>>>> Now that we have atomic write support as native functionality (i.e.
>>>>> RWF_ATOMIC), we really should not have to care that much about
>>>>> normal buffered IO being atomic. i.e. if the application wants
>>>>> atomic writes, it can now specify that it wants atomic writes and so
>>>>> we can relax the constraints we have on existing IO...
>>>>
>>>> Yes, I'm not particularly concerned about whether buffered I/O is
>>>> atomic. I'm more concerned about the multithreading performance of
>>>> buffered I/O.
>>>
>>> Hi Chi,
>>>
>>> Sorry for joining late, I was on vacation.
>>> I am very happy that you have taken on this task and your timing is good,
>>> because  John Garry just posted his patches for large atomic writes [1].
>>
>> I'm glad to have you on board. :)
>>
>> I'm not sure if my understanding is correct, but it seems that our
>> discussion is about multithreading safety, while John's patch is about
>> providing power-failure safety, even though both mention atomicity.
>>
> 
> You are right about the *motivation* of John's work.
> But as a by-product of his work, we get an API to opt-in for
> atomic writes and we can piggy back this opt-in API to say
> that whoever wants the legacy behavior can use the new API
> to get both power failure safety and multithreading safety.

Okay, I understand what you mean.

> 
>>>
>>> I need to explain the relation to atomic buffered I/O, because it is not
>>> easy to follow it from the old discussions and also some of the discussions
>>> about the solution were held in-person during LSFMM2024.
>>>
>>> Naturally, your interest is improving multithreading performance of
>>> buffered I/O, so was mine when I first posted this question [2].
>>>
>>> The issue with atomicity of buffered I/O is the xfs has traditionally
>>> provided atomicity of write vs. read (a.k.a no torn writes), which is
>>> not required by POSIX standard (because POSIX was not written with
>>> threads in mind) and is not respected by any other in-tree filesystem.
>>>
>>> It is obvious that the exclusive rw lock for buffered write hurts performance
>>> in the case of mixed read/write on xfs, so the question was - if xfs provides
>>> atomic semantics that portable applications cannot rely on, why bother
>>> providing these atomic semantics at all?
>>
>> Perhaps we can add an option that allows distributions or users to
>> decide whether they need this semantics. I would not hesitate to
>> turn off this semantics on my system when the time comes.
>>
> 
> Yes, we can, but we do not want to do it unless we have to.
> 99% of the users do not want the legacy semantics, so it would
> be better if they get the best performance without having to opt-in to get it.
> 
>>>
>>> Dave's answer to this question was that there are some legacy applications
>>> (database applications IIRC) on production systems that do rely on the fact
>>> that xfs provides this semantics and on the prerequisite that they run on xfs.
>>>
>>> However, it was noted that:
>>> 1. Those application do not require atomicity for any size of IO, they
>>>       typically work in I/O size that is larger than block size (e.g. 16K or 64K)
>>>       and they only require no torn writes for this I/O size
>>> 2. Large folios and iomap can usually provide this semantics via folio lock,
>>>       but application has currently no way of knowing if the semantics are
>>>       provided or not
>>
>> To be honest, it would be best if the folio lock could provide such
>> semantics, as it would not cause any potential problems for the
>> application, and we have hope to achieve concurrent writes.
>>
>> However, I am not sure if this is easy to implement and will not cause
>> other problems.
>>
>>> 3. The upcoming ability of application to opt-in for atomic writes larger
>>>       than fs block size [1] can be used to facilitate the applications that
>>>       want to legacy xfs semantics and avoid the need to enforce the legacy
>>>       semantics all the time for no good reason
>>>
>>> Disclaimer: this is not a standard way to deal with potentially breaking
>>> legacy semantics, because old applications will not usually be rebuilt
>>> to opt-in for the old semantics, but the use case in hand is probably
>>> so specific, of a specific application that relies on xfs specific semantics
>>> that there are currently no objections for trying this solution.
>>>
>>> [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
>>> [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
>>>
>>>>
>>>> Last week, it was mentioned that removing i_rwsem would have some
>>>> impacts on truncate, fallocate, and PNFS operations.
>>>>
>>>> (I'm not familiar with pNFS, so please correct me if I'm wrong.)
>>>
>>> You are not wrong. pNFS uses a "layout lease", which requires
>>> that the blockmap of the file will not be modified while the lease is held.
>>> but I think that block that are not mapped (i.e. holes) can be mapped
>>> while the lease is held.
>>>
>>>>
>>>> My understanding is that the current i_rwsem is used to protect both
>>>> the file's data and its size. Operations like truncate, fallocate,
>>>> and PNFS use i_rwsem because they modify both the file's data and its
>>>> size. So, I'm thinking whether it's possible to use i_rwsem to protect
>>>> only the file's size, without protecting the file's data.
>>>>
>>>
>>> It also protects the file's blockmap, for example, punch hole
>>> does not change the size, but it unmaps blocks from the blockmap,
>>> leading to races that could end up reading stale data from disk
>>> if the lock wasn't taken.
>>>
>>>> So operations that modify the file's size need to be executed
>>>> sequentially. For example, buffered writes to the EOF, fallocate
>>>> operations without the "keep size" requirement, and truncate operations,
>>>> etc, all need to hold an exclusive lock.
>>>>
>>>> Other operations require a shared lock because they only need to access
>>>> the file's size without modifying it.
>>>>
>>>
>>> As far as I understand, exclusive lock is not needed for non-extending
>>> writes, because it is ok to map new blocks.
>>> I guess the need for exclusive lock for extending writes is related to
>>> update of file size, but not 100% sure.
>>> Anyway, exclusive lock on extending write is widely common in other fs,
>>> while exclusive lock for non-extending write is unique to xfs.
>>>
>>>>>
>>>>>> So we should track all the ranges we are reading or writing,
>>>>>> and check whether the new read or write operations can be performed
>>>>>> concurrently with the current operations.
>>>>>
>>>>> That is all discussed in detail in the discussions I linked.
>>>>
>>>> Sorry, I overlooked some details from old discussion last time.
>>>> It seems that you are not satisfied with the effectiveness of
>>>> range locks.
>>>>
>>>
>>> Correct. This solution was taken off the table.
>>>
>>> I hope my explanation was correct and clear.
>>> If anything else is not clear, please feel free to ask.
>>>
>>
>> I think your explanation is very clear.
> 
> One more thing I should mention.
> You do not need to wait for atomic large writes patches to land.
> There is nothing stopping you from implementing the suggested
> solution based on the xfs code already in master (v6.13-rc1),
> which has support for the RWF_ATOMIC flag for writes.
> 
> It just means that the API will not be usable for applications that
> want to do IO larger than block size, but concurrent read/write
                               ^
To be precise, this is the page size, not the block size, right?

> performance of 4K IO could be improved already.

Great, which means that IO operations aligned within a single page
can be executed concurrently, because the folio lock already
provides atomicity guarantees.

If the write does not exceed the boundary of a page, we can
downgrade the iolock to XFS_IOLOCK_SHARED. It seems to be safe
and will not change the current behavior.

--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -454,6 +454,11 @@ xfs_file_write_checks(
         if (error)
                 return error;

+       if ( iocb->ki_pos >> PAGE_SHIFT == (iocb->ki_pos + count) >> 
PAGE_SHIFT) {
+               *iolock = XFS_IOLOCK_SHARED;
+       }
+
         /*
          * For changing security info in file_remove_privs() we need 
i_rwsem
          * exclusively.

> 
> It's possible that all you need to do is:
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c488ae26b23d0..2542f15496488 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -777,9 +777,10 @@ xfs_file_buffered_write(
>          ssize_t                 ret;
>          bool                    cleared_space = false;
>          unsigned int            iolock;
> +       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;
> 
>   write_retry:
> -       iolock = XFS_IOLOCK_EXCL;
> +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;
>          ret = xfs_ilock_iocb(iocb, iolock);
> --
> 
> xfs_file_write_checks() afterwards already takes care of promoting
> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.

Yeah, for writes that exceed the PAGE boundary, we can also promote
the lock to XFS_IOLOCK_EXCL. Otherwise, I am concerned that it may
lead to old data being retained in the file.

For example, two processes writing four pages of data to the same area.

process A           process B
--------------------------------
write AA--
<sleep>
                     new write BBBB
write --AA

The final data is BBAA.

> 
> It is possible that XFS_IOLOCK_EXCL could be immediately demoted
> back to XFS_IOLOCK_SHARED for atomic_writes as done in
> xfs_file_dio_write_aligned().
> 



Thanks,
Chi Zhiling
Amir Goldstein Jan. 9, 2025, 10:07 a.m. UTC | #13
On Wed, Jan 8, 2025 at 1:16 PM John Garry <john.g.garry@oracle.com> wrote:
>
>
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index c488ae26b23d0..2542f15496488 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -777,9 +777,10 @@ xfs_file_buffered_write(
> >          ssize_t                 ret;
> >          bool                    cleared_space = false;
> >          unsigned int            iolock;
> > +       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;
> >
> >   write_retry:
> > -       iolock = XFS_IOLOCK_EXCL;
> > +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;
> >          ret = xfs_ilock_iocb(iocb, iolock);
> > --
> >
> > xfs_file_write_checks() afterwards already takes care of promoting
> > XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
> >
> > It is possible that XFS_IOLOCK_EXCL could be immediately demoted
> > back to XFS_IOLOCK_SHARED for atomic_writes as done in
> > xfs_file_dio_write_aligned().
> >
> > TBH, I am not sure which blockdevs support 4K atomic writes that could
> > be used to test this.
> >
> > John, can you share your test setup instructions for atomic writes?
>
> Please note that IOCB_ATOMIC is not supported for buffered IO, so we
> can't do this - we only support direct IO today.

Oops. I see now.

>
> And supporting buffered IO has its challenges; how to handle overlapping
> atomic writes of differing sizes sitting in the page cache is the main
> issue which comes to mind.
>

How about the combination of RWF_ATOMIC | RWF_UNCACHED [1]
Would it be easier/possible to support this considering that the write of folio
is started before the write system call returns?

Note that application that desires mutithreaded atomicity of writes vs. reads
will only need to opt-in for RWF_ATOMIC | RWF_UNCACHED writes, so this
is not expected to actually break its performance by killing the read caching.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20241220154831.1086649-1-axboe@kernel.dk/
Amir Goldstein Jan. 9, 2025, 10:25 a.m. UTC | #14
> > One more thing I should mention.
> > You do not need to wait for atomic large writes patches to land.
> > There is nothing stopping you from implementing the suggested
> > solution based on the xfs code already in master (v6.13-rc1),
> > which has support for the RWF_ATOMIC flag for writes.

Only I missed the fact that there is not yet a plan to support
atomic buffered writes :-/

> >
> > It just means that the API will not be usable for applications that
> > want to do IO larger than block size, but concurrent read/write
>                                ^
> To be precise, this is the page size, not the block size, right?
>

fs block size:

        if (iocb->ki_flags & IOCB_ATOMIC) {
                /*
                 * Currently only atomic writing of a single FS block is
                 * supported. It would be possible to atomic write smaller than
                 * a FS block, but there is no requirement to support this.
                 * Note that iomap also does not support this yet.
                 */
                if (ocount != ip->i_mount->m_sb.sb_blocksize)
                        return -EINVAL;
                ret = generic_atomic_write_valid(iocb, from);
                if (ret)
                        return ret;
        }

> > performance of 4K IO could be improved already.
>
> Great, which means that IO operations aligned within a single page
> can be executed concurrently, because the folio lock already
> provides atomicity guarantees.
>
> If the write does not exceed the boundary of a page, we can
> downgrade the iolock to XFS_IOLOCK_SHARED. It seems to be safe
> and will not change the current behavior.
>
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -454,6 +454,11 @@ xfs_file_write_checks(
>          if (error)
>                  return error;
>
> +       if ( iocb->ki_pos >> PAGE_SHIFT == (iocb->ki_pos + count) >>
> PAGE_SHIFT) {
> +               *iolock = XFS_IOLOCK_SHARED;
> +       }
> +
>          /*
>           * For changing security info in file_remove_privs() we need
> i_rwsem
>           * exclusively.
>

I think that may be possible, but you should do it in the buffered write
code as the patch below.
xfs_file_write_checks() is called from code paths like
xfs_file_dio_write_unaligned() where you should not demote to shared lock.


> >
> > It's possible that all you need to do is:
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index c488ae26b23d0..2542f15496488 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -777,9 +777,10 @@ xfs_file_buffered_write(
> >          ssize_t                 ret;
> >          bool                    cleared_space = false;
> >          unsigned int            iolock;
> > +       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;
> >
> >   write_retry:
> > -       iolock = XFS_IOLOCK_EXCL;
> > +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;
> >          ret = xfs_ilock_iocb(iocb, iolock);
> > --
> >
> > xfs_file_write_checks() afterwards already takes care of promoting
> > XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
>
> Yeah, for writes that exceed the PAGE boundary, we can also promote
> the lock to XFS_IOLOCK_EXCL. Otherwise, I am concerned that it may
> lead to old data being retained in the file.
>
> For example, two processes writing four pages of data to the same area.
>
> process A           process B
> --------------------------------
> write AA--
> <sleep>
>                      new write BBBB
> write --AA
>
> The final data is BBAA.
>

What is the use case for which you are trying to fix performance?
Is it a use case with single block IO? if not then it does not help to implement
a partial solution for single block size IO.

Thanks,
Amir.
Chi Zhiling Jan. 9, 2025, 12:10 p.m. UTC | #15
On 2025/1/9 18:25, Amir Goldstein wrote:
>>> One more thing I should mention.
>>> You do not need to wait for atomic large writes patches to land.
>>> There is nothing stopping you from implementing the suggested
>>> solution based on the xfs code already in master (v6.13-rc1),
>>> which has support for the RWF_ATOMIC flag for writes.
> 
> Only I missed the fact that there is not yet a plan to support
> atomic buffered writes :-/

I think it's necessary to support atomic buffered writes.

> 
>>>
>>> It just means that the API will not be usable for applications that
>>> want to do IO larger than block size, but concurrent read/write
>>                                 ^
>> To be precise, this is the page size, not the block size, right?
>>
> 
> fs block size:
> 
>          if (iocb->ki_flags & IOCB_ATOMIC) {
>                  /*
>                   * Currently only atomic writing of a single FS block is
>                   * supported. It would be possible to atomic write smaller than
>                   * a FS block, but there is no requirement to support this.
>                   * Note that iomap also does not support this yet.
>                   */
>                  if (ocount != ip->i_mount->m_sb.sb_blocksize)
>                          return -EINVAL;
>                  ret = generic_atomic_write_valid(iocb, from);
>                  if (ret)
>                          return ret;
>          }

Uh, okay, maybe I didn't understand it very accurately. I thought
you were talking about buffered IO. I was quite curious as to why
you mentioned block size in the context of buffered IO.

> 
>>> performance of 4K IO could be improved already.
>>
>> Great, which means that IO operations aligned within a single page
>> can be executed concurrently, because the folio lock already
>> provides atomicity guarantees.
>>
>> If the write does not exceed the boundary of a page, we can
>> downgrade the iolock to XFS_IOLOCK_SHARED. It seems to be safe
>> and will not change the current behavior.
>>
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -454,6 +454,11 @@ xfs_file_write_checks(
>>           if (error)
>>                   return error;
>>
>> +       if ( iocb->ki_pos >> PAGE_SHIFT == (iocb->ki_pos + count) >>
>> PAGE_SHIFT) {
>> +               *iolock = XFS_IOLOCK_SHARED;
>> +       }
>> +
>>           /*
>>            * For changing security info in file_remove_privs() we need
>> i_rwsem
>>            * exclusively.
>>
> 
> I think that may be possible, but you should do it in the buffered write
> code as the patch below.
> xfs_file_write_checks() is called from code paths like
> xfs_file_dio_write_unaligned() where you should not demote to shared lock.

Wow, thank you for the reminder. This is the prototype of the patch.
I might need to consider more scenarios and conduct testing before
sending the patch.


> 
> 
>>>
>>> It's possible that all you need to do is:
>>>
>>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>>> index c488ae26b23d0..2542f15496488 100644
>>> --- a/fs/xfs/xfs_file.c
>>> +++ b/fs/xfs/xfs_file.c
>>> @@ -777,9 +777,10 @@ xfs_file_buffered_write(
>>>           ssize_t                 ret;
>>>           bool                    cleared_space = false;
>>>           unsigned int            iolock;
>>> +       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;
>>>
>>>    write_retry:
>>> -       iolock = XFS_IOLOCK_EXCL;
>>> +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;
>>>           ret = xfs_ilock_iocb(iocb, iolock);
>>> --
>>>
>>> xfs_file_write_checks() afterwards already takes care of promoting
>>> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
>>
>> Yeah, for writes that exceed the PAGE boundary, we can also promote
>> the lock to XFS_IOLOCK_EXCL. Otherwise, I am concerned that it may
>> lead to old data being retained in the file.
>>
>> For example, two processes writing four pages of data to the same area.
>>
>> process A           process B
>> --------------------------------
>> write AA--
>> <sleep>
>>                       new write BBBB
>> write --AA
>>
>> The final data is BBAA.
>>
> 
> What is the use case for which you are trying to fix performance?
> Is it a use case with single block IO? if not then it does not help to implement
> a partial solution for single block size IO.

I want to improve the UnixBench score for XFS. UnixBench uses buffered
IO for testing, including both single-threaded and multi-threaded tests.
Additionally, the IO size tested in UnixBench is below 4K, so this will
be very helpful.


Thanks,
Chi Zhiling
John Garry Jan. 9, 2025, 12:25 p.m. UTC | #16
On 09/01/2025 12:10, Chi Zhiling wrote:
> On 2025/1/9 18:25, Amir Goldstein wrote:
>>>> One more thing I should mention.
>>>> You do not need to wait for atomic large writes patches to land.
>>>> There is nothing stopping you from implementing the suggested
>>>> solution based on the xfs code already in master (v6.13-rc1),
>>>> which has support for the RWF_ATOMIC flag for writes.
>>
>> Only I missed the fact that there is not yet a plan to support
>> atomic buffered writes :-/
> 
> I think it's necessary to support atomic buffered writes.

Please check this thread then:
https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@mit.edu/
John Garry Jan. 9, 2025, 12:40 p.m. UTC | #17
On 09/01/2025 10:07, Amir Goldstein wrote:
>> Please note that IOCB_ATOMIC is not supported for buffered IO, so we
>> can't do this - we only support direct IO today.
> Oops. I see now.
> 
>> And supporting buffered IO has its challenges; how to handle overlapping
>> atomic writes of differing sizes sitting in the page cache is the main
>> issue which comes to mind.
>>
> How about the combination of RWF_ATOMIC | RWF_UNCACHED [1]
> Would it be easier/possible to support this considering that the write of folio
> is started before the write system call returns?

I am not sure exactly what you are proposing. Is it that RWF_ATOMIC for 
buffered IO auto-sets RWF_UNCACHED? Or that RWF_ATOMIC requires 
RWF_UNCACHED to be set?

But that is not so important, as I just think that future users of 
RWF_ATOMIC may not want the behavior of RWF_UNCACHED always (for 
buffered IO).

And I don't think that RWF_UNCACHED even properly solves the issues of 
RWF_ATOMIC for buffered IO in terms of handling overlapping atomic 
writes in the page cache.

Thanks,
John

> 
> Note that application that desires mutithreaded atomicity of writes vs. reads
> will only need to opt-in for RWF_ATOMIC | RWF_UNCACHED writes, so this
> is not expected to actually break its performance by killing the read caching.
> 
> Thanks,
> Amir.
> 
> [1]https://urldefense.com/v3/__https://lore.kernel.org/linux- 
> fsdevel/20241220154831.1086649-1-axboe@kernel.dk/__;!!ACWV5N9M2RV99hQ! 
> J7_5N_kSixl5iSy8IX37Cup3uKTHAaC5Oy-RlvsJeTE2kr3iJ2IXNww_rApK7TwI_ocCBSE- 
> G2vZSKSRHqY$
Dave Chinner Jan. 9, 2025, 11:28 p.m. UTC | #18
On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
> On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
> > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > Dave's answer to this question was that there are some legacy applications
> > > (database applications IIRC) on production systems that do rely on the fact
> > > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > > 
> > > However, it was noted that:
> > > 1. Those application do not require atomicity for any size of IO, they
> > >      typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> > >      and they only require no torn writes for this I/O size
> > > 2. Large folios and iomap can usually provide this semantics via folio lock,
> > >      but application has currently no way of knowing if the semantics are
> > >      provided or not
> > 
> > To be honest, it would be best if the folio lock could provide such
> > semantics, as it would not cause any potential problems for the
> > application, and we have hope to achieve concurrent writes.
> > 
> > However, I am not sure if this is easy to implement and will not cause
> > other problems.
> 
> Assuming we're not abandoning POSIX "Thread Interactions with Regular
> File Operations", you can't use the folio lock for coordination, for
> several reasons:
> 
> a) Apps can't directly control the size of the folio in the page cache
> 
> b) The folio size can (theoretically) change underneath the program at
> any time (reclaim can take your large folio and the next read gets a
> smaller folio)
> 
> c) If your write crosses folios, you've just crossed a synchronization
> boundary and all bets are off, though all the other filesystems behave
> this way and there seem not to be complaints
> 
> d) If you try to "guarantee" folio granularity by messing with min/max
> folio size, you run the risk of ENOMEM if the base pages get fragmented
> 
> I think that's why Dave suggested range locks as the correct solution to
> this; though it is a pity that so far nobody has come up with a
> performant implementation.

Yes, that's a fair summary of the situation.

That said, I just had a left-field idea for a quasi-range lock
that may allow random writes to run concurrently and atomically
with reads.

Essentially, we add an unsigned long to the inode, and use it as a
lock bitmap. That gives up to 64 "lock segments" for the buffered
write. We may also need a "segment size" variable....

The existing i_rwsem gets taken shared unless it is an extending
write.

For a non-extending write, we then do an offset->segment translation
and lock that bit in the bit mask. If it's already locked, we wait
on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.

The segments are evenly sized - say a minimum of 64kB each, but when
EOF is extended or truncated (which is done with the i_rwsem held
exclusive) the segment size is rescaled. As nothing can hold bit
locks while the i_rwsem is held exclusive, this will not race with
anything.

If we are doing an extending write, we take the i_rwsem shared
first, then check if the extension will rescale the locks. If lock
rescaling is needed, we have to take the i_rwsem exclusive to do the
EOF extension. Otherwise, the bit lock that covers EOF will
serialise file extensions so it can be done under a shared i_rwsem
safely.

This will allow buffered writes to remain atomic w.r.t. each other,
and potentially allow buffered reads to wait on writes to the same
segment and so potentially provide buffered read vs buffered write
atomicity as well.

If we need more concurrency than an unsigned long worth of bits for
buffered writes, then maybe we can enlarge the bitmap further.

I suspect this can be extended to direct IO in a similar way to
buffered reads, and that then opens up the possibility of truncate
and fallocate() being able to use the bitmap for range exclusion,
too.

The overhead is likely minimal - setting and clearing bits in a
bitmap, as opposed to tracking ranges in a tree structure....

Thoughts?

-Dave.
Chi Zhiling Jan. 10, 2025, 1:31 a.m. UTC | #19
On 2025/1/10 07:28, Dave Chinner wrote:
> On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
>> On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
>>> On 2025/1/7 20:13, Amir Goldstein wrote:
>>>> Dave's answer to this question was that there are some legacy applications
>>>> (database applications IIRC) on production systems that do rely on the fact
>>>> that xfs provides this semantics and on the prerequisite that they run on xfs.
>>>>
>>>> However, it was noted that:
>>>> 1. Those application do not require atomicity for any size of IO, they
>>>>       typically work in I/O size that is larger than block size (e.g. 16K or 64K)
>>>>       and they only require no torn writes for this I/O size
>>>> 2. Large folios and iomap can usually provide this semantics via folio lock,
>>>>       but application has currently no way of knowing if the semantics are
>>>>       provided or not
>>>
>>> To be honest, it would be best if the folio lock could provide such
>>> semantics, as it would not cause any potential problems for the
>>> application, and we have hope to achieve concurrent writes.
>>>
>>> However, I am not sure if this is easy to implement and will not cause
>>> other problems.
>>
>> Assuming we're not abandoning POSIX "Thread Interactions with Regular
>> File Operations", you can't use the folio lock for coordination, for
>> several reasons:
>>
>> a) Apps can't directly control the size of the folio in the page cache
>>
>> b) The folio size can (theoretically) change underneath the program at
>> any time (reclaim can take your large folio and the next read gets a
>> smaller folio)
>>
>> c) If your write crosses folios, you've just crossed a synchronization
>> boundary and all bets are off, though all the other filesystems behave
>> this way and there seem not to be complaints
>>
>> d) If you try to "guarantee" folio granularity by messing with min/max
>> folio size, you run the risk of ENOMEM if the base pages get fragmented
>>
>> I think that's why Dave suggested range locks as the correct solution to
>> this; though it is a pity that so far nobody has come up with a
>> performant implementation.
> 
> Yes, that's a fair summary of the situation.
> 
> That said, I just had a left-field idea for a quasi-range lock
> that may allow random writes to run concurrently and atomically
> with reads.
> 
> Essentially, we add an unsigned long to the inode, and use it as a
> lock bitmap. That gives up to 64 "lock segments" for the buffered
> write. We may also need a "segment size" variable....
> 
> The existing i_rwsem gets taken shared unless it is an extending
> write.
> 
> For a non-extending write, we then do an offset->segment translation
> and lock that bit in the bit mask. If it's already locked, we wait
> on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
> 
> The segments are evenly sized - say a minimum of 64kB each, but when
> EOF is extended or truncated (which is done with the i_rwsem held
> exclusive) the segment size is rescaled. As nothing can hold bit
> locks while the i_rwsem is held exclusive, this will not race with
> anything.
> 
> If we are doing an extending write, we take the i_rwsem shared
> first, then check if the extension will rescale the locks. If lock
> rescaling is needed, we have to take the i_rwsem exclusive to do the
> EOF extension. Otherwise, the bit lock that covers EOF will
> serialise file extensions so it can be done under a shared i_rwsem
> safely.
> 
> This will allow buffered writes to remain atomic w.r.t. each other,
> and potentially allow buffered reads to wait on writes to the same
> segment and so potentially provide buffered read vs buffered write
> atomicity as well.
> 
> If we need more concurrency than an unsigned long worth of bits for
> buffered writes, then maybe we can enlarge the bitmap further.
> 
> I suspect this can be extended to direct IO in a similar way to
> buffered reads, and that then opens up the possibility of truncate
> and fallocate() being able to use the bitmap for range exclusion,
> too.
> 
> The overhead is likely minimal - setting and clearing bits in a
> bitmap, as opposed to tracking ranges in a tree structure....
> 
> Thoughts?

I think it's fine. Additionally, even if multiple writes occur
in the same segment, if the write operations are within a single
page, we can still acquire the i_rwsem lock in shared mode,
right?
Amir Goldstein Jan. 10, 2025, 5:07 p.m. UTC | #20
On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
> > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
> > > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > > Dave's answer to this question was that there are some legacy applications
> > > > (database applications IIRC) on production systems that do rely on the fact
> > > > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > > >
> > > > However, it was noted that:
> > > > 1. Those application do not require atomicity for any size of IO, they
> > > >      typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> > > >      and they only require no torn writes for this I/O size
> > > > 2. Large folios and iomap can usually provide this semantics via folio lock,
> > > >      but application has currently no way of knowing if the semantics are
> > > >      provided or not
> > >
> > > To be honest, it would be best if the folio lock could provide such
> > > semantics, as it would not cause any potential problems for the
> > > application, and we have hope to achieve concurrent writes.
> > >
> > > However, I am not sure if this is easy to implement and will not cause
> > > other problems.
> >
> > Assuming we're not abandoning POSIX "Thread Interactions with Regular
> > File Operations", you can't use the folio lock for coordination, for
> > several reasons:
> >
> > a) Apps can't directly control the size of the folio in the page cache
> >
> > b) The folio size can (theoretically) change underneath the program at
> > any time (reclaim can take your large folio and the next read gets a
> > smaller folio)
> >
> > c) If your write crosses folios, you've just crossed a synchronization
> > boundary and all bets are off, though all the other filesystems behave
> > this way and there seem not to be complaints
> >
> > d) If you try to "guarantee" folio granularity by messing with min/max
> > folio size, you run the risk of ENOMEM if the base pages get fragmented
> >
> > I think that's why Dave suggested range locks as the correct solution to
> > this; though it is a pity that so far nobody has come up with a
> > performant implementation.
>
> Yes, that's a fair summary of the situation.
>
> That said, I just had a left-field idea for a quasi-range lock
> that may allow random writes to run concurrently and atomically
> with reads.
>
> Essentially, we add an unsigned long to the inode, and use it as a
> lock bitmap. That gives up to 64 "lock segments" for the buffered
> write. We may also need a "segment size" variable....
>
> The existing i_rwsem gets taken shared unless it is an extending
> write.
>
> For a non-extending write, we then do an offset->segment translation
> and lock that bit in the bit mask. If it's already locked, we wait
> on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
>
> The segments are evenly sized - say a minimum of 64kB each, but when
> EOF is extended or truncated (which is done with the i_rwsem held
> exclusive) the segment size is rescaled. As nothing can hold bit
> locks while the i_rwsem is held exclusive, this will not race with
> anything.
>
> If we are doing an extending write, we take the i_rwsem shared
> first, then check if the extension will rescale the locks. If lock
> rescaling is needed, we have to take the i_rwsem exclusive to do the
> EOF extension. Otherwise, the bit lock that covers EOF will
> serialise file extensions so it can be done under a shared i_rwsem
> safely.
>
> This will allow buffered writes to remain atomic w.r.t. each other,
> and potentially allow buffered reads to wait on writes to the same
> segment and so potentially provide buffered read vs buffered write
> atomicity as well.
>
> If we need more concurrency than an unsigned long worth of bits for
> buffered writes, then maybe we can enlarge the bitmap further.
>
> I suspect this can be extended to direct IO in a similar way to
> buffered reads, and that then opens up the possibility of truncate
> and fallocate() being able to use the bitmap for range exclusion,
> too.
>
> The overhead is likely minimal - setting and clearing bits in a
> bitmap, as opposed to tracking ranges in a tree structure....
>
> Thoughts?

I think that's a very neat idea, but it will not address the reference
benchmark.
The reference benchmark I started the original report with which is similar
to my understanding to the benchmark that Chi is running simulates the
workload of a database writing with buffered IO.

That means a very large file and small IO size ~64K.
Leaving the probability of intersecting writes in the same segment quite high.

Can we do this opportunistically based on available large folios?
If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
if it is not, use IOLOCK_EXCL?

for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
naturally align to IO size and above?

Thanks,
Amir.
Chi Zhiling Jan. 12, 2025, 10:05 a.m. UTC | #21
On 2025/1/11 01:07, Amir Goldstein wrote:
> On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
>>> On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
>>>> On 2025/1/7 20:13, Amir Goldstein wrote:
>>>>> Dave's answer to this question was that there are some legacy applications
>>>>> (database applications IIRC) on production systems that do rely on the fact
>>>>> that xfs provides this semantics and on the prerequisite that they run on xfs.
>>>>>
>>>>> However, it was noted that:
>>>>> 1. Those application do not require atomicity for any size of IO, they
>>>>>       typically work in I/O size that is larger than block size (e.g. 16K or 64K)
>>>>>       and they only require no torn writes for this I/O size
>>>>> 2. Large folios and iomap can usually provide this semantics via folio lock,
>>>>>       but application has currently no way of knowing if the semantics are
>>>>>       provided or not
>>>>
>>>> To be honest, it would be best if the folio lock could provide such
>>>> semantics, as it would not cause any potential problems for the
>>>> application, and we have hope to achieve concurrent writes.
>>>>
>>>> However, I am not sure if this is easy to implement and will not cause
>>>> other problems.
>>>
>>> Assuming we're not abandoning POSIX "Thread Interactions with Regular
>>> File Operations", you can't use the folio lock for coordination, for
>>> several reasons:
>>>
>>> a) Apps can't directly control the size of the folio in the page cache
>>>
>>> b) The folio size can (theoretically) change underneath the program at
>>> any time (reclaim can take your large folio and the next read gets a
>>> smaller folio)
>>>
>>> c) If your write crosses folios, you've just crossed a synchronization
>>> boundary and all bets are off, though all the other filesystems behave
>>> this way and there seem not to be complaints
>>>
>>> d) If you try to "guarantee" folio granularity by messing with min/max
>>> folio size, you run the risk of ENOMEM if the base pages get fragmented
>>>
>>> I think that's why Dave suggested range locks as the correct solution to
>>> this; though it is a pity that so far nobody has come up with a
>>> performant implementation.
>>
>> Yes, that's a fair summary of the situation.
>>
>> That said, I just had a left-field idea for a quasi-range lock
>> that may allow random writes to run concurrently and atomically
>> with reads.
>>
>> Essentially, we add an unsigned long to the inode, and use it as a
>> lock bitmap. That gives up to 64 "lock segments" for the buffered
>> write. We may also need a "segment size" variable....
>>
>> The existing i_rwsem gets taken shared unless it is an extending
>> write.
>>
>> For a non-extending write, we then do an offset->segment translation
>> and lock that bit in the bit mask. If it's already locked, we wait
>> on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
>>
>> The segments are evenly sized - say a minimum of 64kB each, but when
>> EOF is extended or truncated (which is done with the i_rwsem held
>> exclusive) the segment size is rescaled. As nothing can hold bit
>> locks while the i_rwsem is held exclusive, this will not race with
>> anything.
>>
>> If we are doing an extending write, we take the i_rwsem shared
>> first, then check if the extension will rescale the locks. If lock
>> rescaling is needed, we have to take the i_rwsem exclusive to do the
>> EOF extension. Otherwise, the bit lock that covers EOF will
>> serialise file extensions so it can be done under a shared i_rwsem
>> safely.
>>
>> This will allow buffered writes to remain atomic w.r.t. each other,
>> and potentially allow buffered reads to wait on writes to the same
>> segment and so potentially provide buffered read vs buffered write
>> atomicity as well.
>>
>> If we need more concurrency than an unsigned long worth of bits for
>> buffered writes, then maybe we can enlarge the bitmap further.
>>
>> I suspect this can be extended to direct IO in a similar way to
>> buffered reads, and that then opens up the possibility of truncate
>> and fallocate() being able to use the bitmap for range exclusion,
>> too.
>>
>> The overhead is likely minimal - setting and clearing bits in a
>> bitmap, as opposed to tracking ranges in a tree structure....
>>
>> Thoughts?
> 
> I think that's a very neat idea, but it will not address the reference
> benchmark.
> The reference benchmark I started the original report with which is similar
> to my understanding to the benchmark that Chi is running simulates the
> workload of a database writing with buffered IO.
> 
> That means a very large file and small IO size ~64K.
> Leaving the probability of intersecting writes in the same segment quite high.
> 
> Can we do this opportunistically based on available large folios?
> If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
> if it is not, use IOLOCK_EXCL?
> 
> for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
> naturally align to IO size and above?
> 

Great, I think we're getting close to aligning our thoughts.

IMO, we shouldn't use a shared lock for write operations that are
larger than page size.

I believe the current issue is that when acquiring the i_rwsem lock,
we have no way of knowing the size of a large folio [1] (as Darrick
mentioned earlier), so we can't determine if only one large folio will
be written.

There's only one certainty: if the IO size fits within one page size,
it will definitely fit within one large folio.

So for now, we can only use IOLOCK_SHARED if we verify that the IO fits
within page size.

[1]: Maybe we can find a way to obtain the size of a folio from the page
cache, but it might come with some performance costs.


Thanks,
Chi Zhiling
Darrick J. Wong Jan. 13, 2025, 2:44 a.m. UTC | #22
On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote:
> On 2025/1/11 01:07, Amir Goldstein wrote:
> > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
> > > > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
> > > > > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > > > > Dave's answer to this question was that there are some legacy applications
> > > > > > (database applications IIRC) on production systems that do rely on the fact
> > > > > > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > > > > > 
> > > > > > However, it was noted that:
> > > > > > 1. Those application do not require atomicity for any size of IO, they
> > > > > >       typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> > > > > >       and they only require no torn writes for this I/O size
> > > > > > 2. Large folios and iomap can usually provide this semantics via folio lock,
> > > > > >       but application has currently no way of knowing if the semantics are
> > > > > >       provided or not
> > > > > 
> > > > > To be honest, it would be best if the folio lock could provide such
> > > > > semantics, as it would not cause any potential problems for the
> > > > > application, and we have hope to achieve concurrent writes.
> > > > > 
> > > > > However, I am not sure if this is easy to implement and will not cause
> > > > > other problems.
> > > > 
> > > > Assuming we're not abandoning POSIX "Thread Interactions with Regular
> > > > File Operations", you can't use the folio lock for coordination, for
> > > > several reasons:
> > > > 
> > > > a) Apps can't directly control the size of the folio in the page cache
> > > > 
> > > > b) The folio size can (theoretically) change underneath the program at
> > > > any time (reclaim can take your large folio and the next read gets a
> > > > smaller folio)
> > > > 
> > > > c) If your write crosses folios, you've just crossed a synchronization
> > > > boundary and all bets are off, though all the other filesystems behave
> > > > this way and there seem not to be complaints
> > > > 
> > > > d) If you try to "guarantee" folio granularity by messing with min/max
> > > > folio size, you run the risk of ENOMEM if the base pages get fragmented
> > > > 
> > > > I think that's why Dave suggested range locks as the correct solution to
> > > > this; though it is a pity that so far nobody has come up with a
> > > > performant implementation.
> > > 
> > > Yes, that's a fair summary of the situation.
> > > 
> > > That said, I just had a left-field idea for a quasi-range lock
> > > that may allow random writes to run concurrently and atomically
> > > with reads.
> > > 
> > > Essentially, we add an unsigned long to the inode, and use it as a
> > > lock bitmap. That gives up to 64 "lock segments" for the buffered
> > > write. We may also need a "segment size" variable....
> > > 
> > > The existing i_rwsem gets taken shared unless it is an extending
> > > write.
> > > 
> > > For a non-extending write, we then do an offset->segment translation
> > > and lock that bit in the bit mask. If it's already locked, we wait
> > > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
> > > 
> > > The segments are evenly sized - say a minimum of 64kB each, but when
> > > EOF is extended or truncated (which is done with the i_rwsem held
> > > exclusive) the segment size is rescaled. As nothing can hold bit
> > > locks while the i_rwsem is held exclusive, this will not race with
> > > anything.
> > > 
> > > If we are doing an extending write, we take the i_rwsem shared
> > > first, then check if the extension will rescale the locks. If lock
> > > rescaling is needed, we have to take the i_rwsem exclusive to do the
> > > EOF extension. Otherwise, the bit lock that covers EOF will
> > > serialise file extensions so it can be done under a shared i_rwsem
> > > safely.
> > > 
> > > This will allow buffered writes to remain atomic w.r.t. each other,
> > > and potentially allow buffered reads to wait on writes to the same
> > > segment and so potentially provide buffered read vs buffered write
> > > atomicity as well.
> > > 
> > > If we need more concurrency than an unsigned long worth of bits for
> > > buffered writes, then maybe we can enlarge the bitmap further.
> > > 
> > > I suspect this can be extended to direct IO in a similar way to
> > > buffered reads, and that then opens up the possibility of truncate
> > > and fallocate() being able to use the bitmap for range exclusion,
> > > too.
> > > 
> > > The overhead is likely minimal - setting and clearing bits in a
> > > bitmap, as opposed to tracking ranges in a tree structure....
> > > 
> > > Thoughts?
> > 
> > I think that's a very neat idea, but it will not address the reference
> > benchmark.
> > The reference benchmark I started the original report with which is similar
> > to my understanding to the benchmark that Chi is running simulates the
> > workload of a database writing with buffered IO.
> > 
> > That means a very large file and small IO size ~64K.
> > Leaving the probability of intersecting writes in the same segment quite high.
> > 
> > Can we do this opportunistically based on available large folios?
> > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
> > if it is not, use IOLOCK_EXCL?
> > 
> > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
> > naturally align to IO size and above?
> > 
> 
> Great, I think we're getting close to aligning our thoughts.
> 
> IMO, we shouldn't use a shared lock for write operations that are
> larger than page size.
> 
> I believe the current issue is that when acquiring the i_rwsem lock,
> we have no way of knowing the size of a large folio [1] (as Darrick
> mentioned earlier), so we can't determine if only one large folio will
> be written.
> 
> There's only one certainty: if the IO size fits within one page size,
> it will definitely fit within one large folio.
> 
> So for now, we can only use IOLOCK_SHARED if we verify that the IO fits
> within page size.

For filesystems that /do/ support large folios (xfs), I suppose you
could have it tell iomap that it only took i_rwsem in shared mode; and
then the iomap buffered write implementation could proceed if it got a
folio covering the entire write range, or return some magic code that
means "take i_rwsem in exclusive mode and try again".

Though you're correct that we should always take IOLOCK_EXCL if the
write size is larger than whatever the max folio size is for that file.

--D

> [1]: Maybe we can find a way to obtain the size of a folio from the page
> cache, but it might come with some performance costs.
> 
> 
> Thanks,
> Chi Zhiling
> 
>
Chi Zhiling Jan. 13, 2025, 5:59 a.m. UTC | #23
On 2025/1/13 10:44, Darrick J. Wong wrote:
> On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote:
>> On 2025/1/11 01:07, Amir Goldstein wrote:
>>> On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote:
>>>>
>>>> On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
>>>>> On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
>>>>>> On 2025/1/7 20:13, Amir Goldstein wrote:
>>>>>>> Dave's answer to this question was that there are some legacy applications
>>>>>>> (database applications IIRC) on production systems that do rely on the fact
>>>>>>> that xfs provides this semantics and on the prerequisite that they run on xfs.
>>>>>>>
>>>>>>> However, it was noted that:
>>>>>>> 1. Those application do not require atomicity for any size of IO, they
>>>>>>>        typically work in I/O size that is larger than block size (e.g. 16K or 64K)
>>>>>>>        and they only require no torn writes for this I/O size
>>>>>>> 2. Large folios and iomap can usually provide this semantics via folio lock,
>>>>>>>        but application has currently no way of knowing if the semantics are
>>>>>>>        provided or not
>>>>>>
>>>>>> To be honest, it would be best if the folio lock could provide such
>>>>>> semantics, as it would not cause any potential problems for the
>>>>>> application, and we have hope to achieve concurrent writes.
>>>>>>
>>>>>> However, I am not sure if this is easy to implement and will not cause
>>>>>> other problems.
>>>>>
>>>>> Assuming we're not abandoning POSIX "Thread Interactions with Regular
>>>>> File Operations", you can't use the folio lock for coordination, for
>>>>> several reasons:
>>>>>
>>>>> a) Apps can't directly control the size of the folio in the page cache
>>>>>
>>>>> b) The folio size can (theoretically) change underneath the program at
>>>>> any time (reclaim can take your large folio and the next read gets a
>>>>> smaller folio)
>>>>>
>>>>> c) If your write crosses folios, you've just crossed a synchronization
>>>>> boundary and all bets are off, though all the other filesystems behave
>>>>> this way and there seem not to be complaints
>>>>>
>>>>> d) If you try to "guarantee" folio granularity by messing with min/max
>>>>> folio size, you run the risk of ENOMEM if the base pages get fragmented
>>>>>
>>>>> I think that's why Dave suggested range locks as the correct solution to
>>>>> this; though it is a pity that so far nobody has come up with a
>>>>> performant implementation.
>>>>
>>>> Yes, that's a fair summary of the situation.
>>>>
>>>> That said, I just had a left-field idea for a quasi-range lock
>>>> that may allow random writes to run concurrently and atomically
>>>> with reads.
>>>>
>>>> Essentially, we add an unsigned long to the inode, and use it as a
>>>> lock bitmap. That gives up to 64 "lock segments" for the buffered
>>>> write. We may also need a "segment size" variable....
>>>>
>>>> The existing i_rwsem gets taken shared unless it is an extending
>>>> write.
>>>>
>>>> For a non-extending write, we then do an offset->segment translation
>>>> and lock that bit in the bit mask. If it's already locked, we wait
>>>> on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
>>>>
>>>> The segments are evenly sized - say a minimum of 64kB each, but when
>>>> EOF is extended or truncated (which is done with the i_rwsem held
>>>> exclusive) the segment size is rescaled. As nothing can hold bit
>>>> locks while the i_rwsem is held exclusive, this will not race with
>>>> anything.
>>>>
>>>> If we are doing an extending write, we take the i_rwsem shared
>>>> first, then check if the extension will rescale the locks. If lock
>>>> rescaling is needed, we have to take the i_rwsem exclusive to do the
>>>> EOF extension. Otherwise, the bit lock that covers EOF will
>>>> serialise file extensions so it can be done under a shared i_rwsem
>>>> safely.
>>>>
>>>> This will allow buffered writes to remain atomic w.r.t. each other,
>>>> and potentially allow buffered reads to wait on writes to the same
>>>> segment and so potentially provide buffered read vs buffered write
>>>> atomicity as well.
>>>>
>>>> If we need more concurrency than an unsigned long worth of bits for
>>>> buffered writes, then maybe we can enlarge the bitmap further.
>>>>
>>>> I suspect this can be extended to direct IO in a similar way to
>>>> buffered reads, and that then opens up the possibility of truncate
>>>> and fallocate() being able to use the bitmap for range exclusion,
>>>> too.
>>>>
>>>> The overhead is likely minimal - setting and clearing bits in a
>>>> bitmap, as opposed to tracking ranges in a tree structure....
>>>>
>>>> Thoughts?
>>>
>>> I think that's a very neat idea, but it will not address the reference
>>> benchmark.
>>> The reference benchmark I started the original report with which is similar
>>> to my understanding to the benchmark that Chi is running simulates the
>>> workload of a database writing with buffered IO.
>>>
>>> That means a very large file and small IO size ~64K.
>>> Leaving the probability of intersecting writes in the same segment quite high.
>>>
>>> Can we do this opportunistically based on available large folios?
>>> If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
>>> if it is not, use IOLOCK_EXCL?
>>>
>>> for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
>>> naturally align to IO size and above?
>>>
>>
>> Great, I think we're getting close to aligning our thoughts.
>>
>> IMO, we shouldn't use a shared lock for write operations that are
>> larger than page size.
>>
>> I believe the current issue is that when acquiring the i_rwsem lock,
>> we have no way of knowing the size of a large folio [1] (as Darrick
>> mentioned earlier), so we can't determine if only one large folio will
>> be written.
>>
>> There's only one certainty: if the IO size fits within one page size,
>> it will definitely fit within one large folio.
>>
>> So for now, we can only use IOLOCK_SHARED if we verify that the IO fits
>> within page size.
> 
> For filesystems that /do/ support large folios (xfs), I suppose you
> could have it tell iomap that it only took i_rwsem in shared mode; and
> then the iomap buffered write implementation could proceed if it got a
> folio covering the entire write range, or return some magic code that
> means "take i_rwsem in exclusive mode and try again".
> 
> Though you're correct that we should always take IOLOCK_EXCL if the
> write size is larger than whatever the max folio size is for that file.
> 

I think this is a pretty good idea.

I'll implement the basic functionality as soon as possible, and then we
can discuss some details for optimization.


Thanks,
Chi Zhiling
Brian Foster Jan. 13, 2025, 1:40 p.m. UTC | #24
On Sun, Jan 12, 2025 at 06:44:01PM -0800, Darrick J. Wong wrote:
> On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote:
> > On 2025/1/11 01:07, Amir Goldstein wrote:
> > > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > 
> > > > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
> > > > > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
> > > > > > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > > > > > Dave's answer to this question was that there are some legacy applications
> > > > > > > (database applications IIRC) on production systems that do rely on the fact
> > > > > > > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > > > > > > 
> > > > > > > However, it was noted that:
> > > > > > > 1. Those application do not require atomicity for any size of IO, they
> > > > > > >       typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> > > > > > >       and they only require no torn writes for this I/O size
> > > > > > > 2. Large folios and iomap can usually provide this semantics via folio lock,
> > > > > > >       but application has currently no way of knowing if the semantics are
> > > > > > >       provided or not
> > > > > > 
> > > > > > To be honest, it would be best if the folio lock could provide such
> > > > > > semantics, as it would not cause any potential problems for the
> > > > > > application, and we have hope to achieve concurrent writes.
> > > > > > 
> > > > > > However, I am not sure if this is easy to implement and will not cause
> > > > > > other problems.
> > > > > 
> > > > > Assuming we're not abandoning POSIX "Thread Interactions with Regular
> > > > > File Operations", you can't use the folio lock for coordination, for
> > > > > several reasons:
> > > > > 
> > > > > a) Apps can't directly control the size of the folio in the page cache
> > > > > 
> > > > > b) The folio size can (theoretically) change underneath the program at
> > > > > any time (reclaim can take your large folio and the next read gets a
> > > > > smaller folio)
> > > > > 
> > > > > c) If your write crosses folios, you've just crossed a synchronization
> > > > > boundary and all bets are off, though all the other filesystems behave
> > > > > this way and there seem not to be complaints
> > > > > 
> > > > > d) If you try to "guarantee" folio granularity by messing with min/max
> > > > > folio size, you run the risk of ENOMEM if the base pages get fragmented
> > > > > 
> > > > > I think that's why Dave suggested range locks as the correct solution to
> > > > > this; though it is a pity that so far nobody has come up with a
> > > > > performant implementation.
> > > > 
> > > > Yes, that's a fair summary of the situation.
> > > > 
> > > > That said, I just had a left-field idea for a quasi-range lock
> > > > that may allow random writes to run concurrently and atomically
> > > > with reads.
> > > > 
> > > > Essentially, we add an unsigned long to the inode, and use it as a
> > > > lock bitmap. That gives up to 64 "lock segments" for the buffered
> > > > write. We may also need a "segment size" variable....
> > > > 
> > > > The existing i_rwsem gets taken shared unless it is an extending
> > > > write.
> > > > 
> > > > For a non-extending write, we then do an offset->segment translation
> > > > and lock that bit in the bit mask. If it's already locked, we wait
> > > > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
> > > > 
> > > > The segments are evenly sized - say a minimum of 64kB each, but when
> > > > EOF is extended or truncated (which is done with the i_rwsem held
> > > > exclusive) the segment size is rescaled. As nothing can hold bit
> > > > locks while the i_rwsem is held exclusive, this will not race with
> > > > anything.
> > > > 
> > > > If we are doing an extending write, we take the i_rwsem shared
> > > > first, then check if the extension will rescale the locks. If lock
> > > > rescaling is needed, we have to take the i_rwsem exclusive to do the
> > > > EOF extension. Otherwise, the bit lock that covers EOF will
> > > > serialise file extensions so it can be done under a shared i_rwsem
> > > > safely.
> > > > 
> > > > This will allow buffered writes to remain atomic w.r.t. each other,
> > > > and potentially allow buffered reads to wait on writes to the same
> > > > segment and so potentially provide buffered read vs buffered write
> > > > atomicity as well.
> > > > 
> > > > If we need more concurrency than an unsigned long worth of bits for
> > > > buffered writes, then maybe we can enlarge the bitmap further.
> > > > 
> > > > I suspect this can be extended to direct IO in a similar way to
> > > > buffered reads, and that then opens up the possibility of truncate
> > > > and fallocate() being able to use the bitmap for range exclusion,
> > > > too.
> > > > 
> > > > The overhead is likely minimal - setting and clearing bits in a
> > > > bitmap, as opposed to tracking ranges in a tree structure....
> > > > 
> > > > Thoughts?
> > > 
> > > I think that's a very neat idea, but it will not address the reference
> > > benchmark.
> > > The reference benchmark I started the original report with which is similar
> > > to my understanding to the benchmark that Chi is running simulates the
> > > workload of a database writing with buffered IO.
> > > 
> > > That means a very large file and small IO size ~64K.
> > > Leaving the probability of intersecting writes in the same segment quite high.
> > > 
> > > Can we do this opportunistically based on available large folios?
> > > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
> > > if it is not, use IOLOCK_EXCL?
> > > 
> > > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
> > > naturally align to IO size and above?
> > > 
> > 
> > Great, I think we're getting close to aligning our thoughts.
> > 
> > IMO, we shouldn't use a shared lock for write operations that are
> > larger than page size.
> > 
> > I believe the current issue is that when acquiring the i_rwsem lock,
> > we have no way of knowing the size of a large folio [1] (as Darrick
> > mentioned earlier), so we can't determine if only one large folio will
> > be written.
> > 
> > There's only one certainty: if the IO size fits within one page size,
> > it will definitely fit within one large folio.
> > 
> > So for now, we can only use IOLOCK_SHARED if we verify that the IO fits
> > within page size.
> 
> For filesystems that /do/ support large folios (xfs), I suppose you
> could have it tell iomap that it only took i_rwsem in shared mode; and
> then the iomap buffered write implementation could proceed if it got a
> folio covering the entire write range, or return some magic code that
> means "take i_rwsem in exclusive mode and try again".
> 

Sorry if this is out of left field as I haven't followed the discussion
closely, but I presumed one of the reasons Darrick and Christoph raised
the idea of using the folio batch thing I'm playing around with on zero
range for buffered writes would be to acquire and lock all targeted
folios up front. If so, would that help with what you're trying to
achieve here? (If not, nothing to see here, move along.. ;).

Brian

> Though you're correct that we should always take IOLOCK_EXCL if the
> write size is larger than whatever the max folio size is for that file.
> 
> --D
> 
> > [1]: Maybe we can find a way to obtain the size of a folio from the page
> > cache, but it might come with some performance costs.
> > 
> > 
> > Thanks,
> > Chi Zhiling
> > 
> > 
>
Darrick J. Wong Jan. 13, 2025, 4:19 p.m. UTC | #25
On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> On Sun, Jan 12, 2025 at 06:44:01PM -0800, Darrick J. Wong wrote:
> > On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote:
> > > On 2025/1/11 01:07, Amir Goldstein wrote:
> > > > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > 
> > > > > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
> > > > > > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
> > > > > > > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > > > > > > Dave's answer to this question was that there are some legacy applications
> > > > > > > > (database applications IIRC) on production systems that do rely on the fact
> > > > > > > > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > > > > > > > 
> > > > > > > > However, it was noted that:
> > > > > > > > 1. Those application do not require atomicity for any size of IO, they
> > > > > > > >       typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> > > > > > > >       and they only require no torn writes for this I/O size
> > > > > > > > 2. Large folios and iomap can usually provide this semantics via folio lock,
> > > > > > > >       but application has currently no way of knowing if the semantics are
> > > > > > > >       provided or not
> > > > > > > 
> > > > > > > To be honest, it would be best if the folio lock could provide such
> > > > > > > semantics, as it would not cause any potential problems for the
> > > > > > > application, and we have hope to achieve concurrent writes.
> > > > > > > 
> > > > > > > However, I am not sure if this is easy to implement and will not cause
> > > > > > > other problems.
> > > > > > 
> > > > > > Assuming we're not abandoning POSIX "Thread Interactions with Regular
> > > > > > File Operations", you can't use the folio lock for coordination, for
> > > > > > several reasons:
> > > > > > 
> > > > > > a) Apps can't directly control the size of the folio in the page cache
> > > > > > 
> > > > > > b) The folio size can (theoretically) change underneath the program at
> > > > > > any time (reclaim can take your large folio and the next read gets a
> > > > > > smaller folio)
> > > > > > 
> > > > > > c) If your write crosses folios, you've just crossed a synchronization
> > > > > > boundary and all bets are off, though all the other filesystems behave
> > > > > > this way and there seem not to be complaints
> > > > > > 
> > > > > > d) If you try to "guarantee" folio granularity by messing with min/max
> > > > > > folio size, you run the risk of ENOMEM if the base pages get fragmented
> > > > > > 
> > > > > > I think that's why Dave suggested range locks as the correct solution to
> > > > > > this; though it is a pity that so far nobody has come up with a
> > > > > > performant implementation.
> > > > > 
> > > > > Yes, that's a fair summary of the situation.
> > > > > 
> > > > > That said, I just had a left-field idea for a quasi-range lock
> > > > > that may allow random writes to run concurrently and atomically
> > > > > with reads.
> > > > > 
> > > > > Essentially, we add an unsigned long to the inode, and use it as a
> > > > > lock bitmap. That gives up to 64 "lock segments" for the buffered
> > > > > write. We may also need a "segment size" variable....
> > > > > 
> > > > > The existing i_rwsem gets taken shared unless it is an extending
> > > > > write.
> > > > > 
> > > > > For a non-extending write, we then do an offset->segment translation
> > > > > and lock that bit in the bit mask. If it's already locked, we wait
> > > > > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
> > > > > 
> > > > > The segments are evenly sized - say a minimum of 64kB each, but when
> > > > > EOF is extended or truncated (which is done with the i_rwsem held
> > > > > exclusive) the segment size is rescaled. As nothing can hold bit
> > > > > locks while the i_rwsem is held exclusive, this will not race with
> > > > > anything.
> > > > > 
> > > > > If we are doing an extending write, we take the i_rwsem shared
> > > > > first, then check if the extension will rescale the locks. If lock
> > > > > rescaling is needed, we have to take the i_rwsem exclusive to do the
> > > > > EOF extension. Otherwise, the bit lock that covers EOF will
> > > > > serialise file extensions so it can be done under a shared i_rwsem
> > > > > safely.
> > > > > 
> > > > > This will allow buffered writes to remain atomic w.r.t. each other,
> > > > > and potentially allow buffered reads to wait on writes to the same
> > > > > segment and so potentially provide buffered read vs buffered write
> > > > > atomicity as well.
> > > > > 
> > > > > If we need more concurrency than an unsigned long worth of bits for
> > > > > buffered writes, then maybe we can enlarge the bitmap further.
> > > > > 
> > > > > I suspect this can be extended to direct IO in a similar way to
> > > > > buffered reads, and that then opens up the possibility of truncate
> > > > > and fallocate() being able to use the bitmap for range exclusion,
> > > > > too.
> > > > > 
> > > > > The overhead is likely minimal - setting and clearing bits in a
> > > > > bitmap, as opposed to tracking ranges in a tree structure....
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > I think that's a very neat idea, but it will not address the reference
> > > > benchmark.
> > > > The reference benchmark I started the original report with which is similar
> > > > to my understanding to the benchmark that Chi is running simulates the
> > > > workload of a database writing with buffered IO.
> > > > 
> > > > That means a very large file and small IO size ~64K.
> > > > Leaving the probability of intersecting writes in the same segment quite high.
> > > > 
> > > > Can we do this opportunistically based on available large folios?
> > > > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
> > > > if it is not, use IOLOCK_EXCL?
> > > > 
> > > > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
> > > > naturally align to IO size and above?
> > > > 
> > > 
> > > Great, I think we're getting close to aligning our thoughts.
> > > 
> > > IMO, we shouldn't use a shared lock for write operations that are
> > > larger than page size.
> > > 
> > > I believe the current issue is that when acquiring the i_rwsem lock,
> > > we have no way of knowing the size of a large folio [1] (as Darrick
> > > mentioned earlier), so we can't determine if only one large folio will
> > > be written.
> > > 
> > > There's only one certainty: if the IO size fits within one page size,
> > > it will definitely fit within one large folio.
> > > 
> > > So for now, we can only use IOLOCK_SHARED if we verify that the IO fits
> > > within page size.
> > 
> > For filesystems that /do/ support large folios (xfs), I suppose you
> > could have it tell iomap that it only took i_rwsem in shared mode; and
> > then the iomap buffered write implementation could proceed if it got a
> > folio covering the entire write range, or return some magic code that
> > means "take i_rwsem in exclusive mode and try again".
> > 
> 
> Sorry if this is out of left field as I haven't followed the discussion
> closely, but I presumed one of the reasons Darrick and Christoph raised
> the idea of using the folio batch thing I'm playing around with on zero
> range for buffered writes would be to acquire and lock all targeted
> folios up front. If so, would that help with what you're trying to
> achieve here? (If not, nothing to see here, move along.. ;).

I think the folio batch thing would help here, since you then could just
lock all the folios you need for a write, which provides (in effect) a
range lock implementation.

--D

> Brian
> 
> > Though you're correct that we should always take IOLOCK_EXCL if the
> > write size is larger than whatever the max folio size is for that file.
> > 
> > --D
> > 
> > > [1]: Maybe we can find a way to obtain the size of a folio from the page
> > > cache, but it might come with some performance costs.
> > > 
> > > 
> > > Thanks,
> > > Chi Zhiling
> > > 
> > > 
> > 
> 
>
Dave Chinner Jan. 14, 2025, 12:09 a.m. UTC | #26
On Fri, Jan 10, 2025 at 06:07:48PM +0100, Amir Goldstein wrote:
> On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > That said, I just had a left-field idea for a quasi-range lock
> > that may allow random writes to run concurrently and atomically
> > with reads.
> >
> > Essentially, we add an unsigned long to the inode, and use it as a
> > lock bitmap. That gives up to 64 "lock segments" for the buffered
> > write. We may also need a "segment size" variable....
> >
> > The existing i_rwsem gets taken shared unless it is an extending
> > write.
> >
> > For a non-extending write, we then do an offset->segment translation
> > and lock that bit in the bit mask. If it's already locked, we wait
> > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
> >
> > The segments are evenly sized - say a minimum of 64kB each, but when
> > EOF is extended or truncated (which is done with the i_rwsem held
> > exclusive) the segment size is rescaled. As nothing can hold bit
> > locks while the i_rwsem is held exclusive, this will not race with
> > anything.
> >
> > If we are doing an extending write, we take the i_rwsem shared
> > first, then check if the extension will rescale the locks. If lock
> > rescaling is needed, we have to take the i_rwsem exclusive to do the
> > EOF extension. Otherwise, the bit lock that covers EOF will
> > serialise file extensions so it can be done under a shared i_rwsem
> > safely.
> >
> > This will allow buffered writes to remain atomic w.r.t. each other,
> > and potentially allow buffered reads to wait on writes to the same
> > segment and so potentially provide buffered read vs buffered write
> > atomicity as well.
> >
> > If we need more concurrency than an unsigned long worth of bits for
> > buffered writes, then maybe we can enlarge the bitmap further.
> >
> > I suspect this can be extended to direct IO in a similar way to
> > buffered reads, and that then opens up the possibility of truncate
> > and fallocate() being able to use the bitmap for range exclusion,
> > too.
> >
> > The overhead is likely minimal - setting and clearing bits in a
> > bitmap, as opposed to tracking ranges in a tree structure....
> >
> > Thoughts?
> 
> I think that's a very neat idea, but it will not address the reference
> benchmark.
> The reference benchmark I started the original report with which is similar
> to my understanding to the benchmark that Chi is running simulates the
> workload of a database writing with buffered IO.
> 
> That means a very large file and small IO size ~64K.
> Leaving the probability of intersecting writes in the same segment quite high.

Likely - I recognised this granularity problem, though:

| If we need more concurrency than an unsigned long worth of bits for
| buffered writes, then maybe we can enlarge the bitmap further.

We could also hash offsets into the bitmap so that offset-local IO
hit different locks - the bit lock doesn't necessarily need to be
range based. i.e. this is a "finer grained" lock that will typically
increase concurrency. If we keep striving for perfect (i.e. scalable
range locks) we're not going to improve the situation any time
soon...

> Can we do this opportunistically based on available large folios?
> If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
> if it is not, use IOLOCK_EXCL?

The biggest problem with this is that direct IO will -not- do
folio-by folio locking, and so folio based locking does not work for
direct IO exclusion. Currently we get coherency against buffered
writes by writing back or invalidation dirty folios before doing
the DIO read/write. Because we hold the IOLOCK shared, a buffered
write will not redirty the page cache until the DIO write has been
submitted (and completed for non-async DIO).

Hence moving to shared i_rwsem, folio-based range locking for
buffered writes will lose all serialisation against DIO operations.
We will lose what coherency we currently have between buffered write
ops and DIO, and I don't think that's an acceptible trade-off.

i.e. The problem with using the mapping tree for DIO coherency is,
once again, locking overhead. If we have to insert exceptional
entries to lock the range in the mapping tree because there is no
folio present (for DIO to serialise against new buffered IOs), then
we are simply back to the same exclusive tree update scalability
problem that tree based range lock algorithms have....

That's why I suggested a bitmap lock external to both buffered and
direct IO...

> for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
> naturally align to IO size and above?

Maybe, but I don't think folio/mapping tree based locking is a
workable solution. Hence the external bitmap idea...

-Dave.
Christoph Hellwig Jan. 15, 2025, 5:55 a.m. UTC | #27
On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> Sorry if this is out of left field as I haven't followed the discussion
> closely, but I presumed one of the reasons Darrick and Christoph raised
> the idea of using the folio batch thing I'm playing around with on zero
> range for buffered writes would be to acquire and lock all targeted
> folios up front. If so, would that help with what you're trying to
> achieve here? (If not, nothing to see here, move along.. ;).

I mostly thought about acquiring, as locking doesn't really have much
batching effects.  That being said, no that you got the idea in my mind
here's my early morning brainfart on it:

Let's ignore DIRECT I/O for the first step.  In that case lookup /
allocation and locking all folios for write before copying data will
remove the need for i_rwsem in the read and write path.  In a way that
sounds perfect, and given that btrfs already does that (although in a
very convoluted way) we know it's possible.

But direct I/O throws a big monkey wrench here as already mentioned by
others.  Now one interesting thing some file systems have done is
to serialize buffered against direct I/O, either by waiting for one
to finish, or by simply forcing buffered I/O when direct I/O would
conflict.  It's easy to detect outstanding direct I/O using i_dio_count
so buffered I/O could wait for that, and downgrading to buffered I/O
(potentially using the new uncached mode from Jens) if there are any
pages on the mapping after the invalidation also sounds pretty doable.
I don't really have time to turn this hand waving into, but maybe we 
should think if it's worthwhile or if I'm missing something important.
Dave Chinner Jan. 15, 2025, 9:41 p.m. UTC | #28
On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> > Sorry if this is out of left field as I haven't followed the discussion
> > closely, but I presumed one of the reasons Darrick and Christoph raised
> > the idea of using the folio batch thing I'm playing around with on zero
> > range for buffered writes would be to acquire and lock all targeted
> > folios up front. If so, would that help with what you're trying to
> > achieve here? (If not, nothing to see here, move along.. ;).
> 
> I mostly thought about acquiring, as locking doesn't really have much
> batching effects.  That being said, no that you got the idea in my mind
> here's my early morning brainfart on it:
> 
> Let's ignore DIRECT I/O for the first step.  In that case lookup /
> allocation and locking all folios for write before copying data will
> remove the need for i_rwsem in the read and write path.  In a way that
> sounds perfect, and given that btrfs already does that (although in a
> very convoluted way) we know it's possible.

Yes, this seems like a sane, general approach to allowing concurrent
buffered writes (and reads).

> But direct I/O throws a big monkey wrench here as already mentioned by
> others.  Now one interesting thing some file systems have done is
> to serialize buffered against direct I/O, either by waiting for one
> to finish, or by simply forcing buffered I/O when direct I/O would
> conflict. 

Right. We really don't want to downgrade to buffered IO if we can
help it, though.

> It's easy to detect outstanding direct I/O using i_dio_count
> so buffered I/O could wait for that, and downgrading to buffered I/O
> (potentially using the new uncached mode from Jens) if there are any
> pages on the mapping after the invalidation also sounds pretty doable.

It's much harder to sanely serialise DIO against buffered writes
this way, because i_dio_count only forms a submission barrier in
conjunction with the i_rwsem being held exclusively. e.g. ongoing
DIO would result in the buffered write being indefinitely delayed.

I think the model and method that bcachefs uses is probably the best
way to move forward - the "two-state exclusive shared" lock which it
uses to do buffered vs direct exclusion is a simple, easy way to
handle this problem. The same-state shared locking fast path is a
single atomic cmpxchg operation, so it has neglible extra overhead
compared to using a rwsem in the shared DIO fast path.

The lock also has non-owner semantics, so DIO can take it during
submission and then drop it during IO completion. This solves the
problem we currently use the i_rwsem and
inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission
barrier and waiting for all existing DIO to drain).

IOWs, a two-state shared lock provides the mechanism to allow DIO
to be done without holding the i_rwsem at all, as well as being able
to elide two atomic operations per DIO to track in-flight DIOs.

We'd get this whilst maintaining buffered/DIO coherency without
adding any new overhead to the DIO path, and allow concurrent
buffered reads and writes that have their atomicity defined by the
batched folio locking strategy that Brian is working on...

This only leaves DIO coherency issues with mmap() based IO as an
issue, but that's a problem for a different day...

> I don't really have time to turn this hand waving into, but maybe we 
> should think if it's worthwhile or if I'm missing something important.

If people are OK with XFS moving to exclusive buffered or DIO
submission model, then I can find some time to work on the
converting the IO path locking to use a two-state shared lock in
preparation for the batched folio stuff that will allow concurrent
buffered writes...

-Dave.
Christoph Hellwig Jan. 16, 2025, 4:36 a.m. UTC | #29
On Thu, Jan 16, 2025 at 08:41:21AM +1100, Dave Chinner wrote:
> > to finish, or by simply forcing buffered I/O when direct I/O would
> > conflict. 
> 
> Right. We really don't want to downgrade to buffered IO if we can
> help it, though.

Of course we never want it.  But if we do have invalidation failures
and thus still folios in the page cache it might the least bad of
all the bad options.

> It's much harder to sanely serialise DIO against buffered writes
> this way, because i_dio_count only forms a submission barrier in
> conjunction with the i_rwsem being held exclusively. e.g. ongoing
> DIO would result in the buffered write being indefinitely delayed.

Or any other exclusive lock taken by all submitters, yes.

> I think the model and method that bcachefs uses is probably the best
> way to move forward - the "two-state exclusive shared" lock which it
> uses to do buffered vs direct exclusion is a simple, easy way to
> handle this problem. The same-state shared locking fast path is a
> single atomic cmpxchg operation, so it has neglible extra overhead
> compared to using a rwsem in the shared DIO fast path.

NFS and ocfs2 have been doing this for about two decades as well.

> This only leaves DIO coherency issues with mmap() based IO as an
> issue, but that's a problem for a different day...

I think it's a generally unsolveable problem that we can just whack
enough to make it good enough in practice for the few workloads that
matter.

> 
> > I don't really have time to turn this hand waving into, but maybe we 
> > should think if it's worthwhile or if I'm missing something important.
> 
> If people are OK with XFS moving to exclusive buffered or DIO
> submission model, then I can find some time to work on the
> converting the IO path locking to use a two-state shared lock in
> preparation for the batched folio stuff that will allow concurrent
> buffered writes...

This does sound fine to me, but it's hard to judge without seeing
a prototype and results based on it.
Brian Foster Jan. 16, 2025, 2:23 p.m. UTC | #30
On Thu, Jan 16, 2025 at 08:41:21AM +1100, Dave Chinner wrote:
> On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
...
> 
> > I don't really have time to turn this hand waving into, but maybe we 
> > should think if it's worthwhile or if I'm missing something important.
> 
> If people are OK with XFS moving to exclusive buffered or DIO
> submission model, then I can find some time to work on the
> converting the IO path locking to use a two-state shared lock in
> preparation for the batched folio stuff that will allow concurrent
> buffered writes...
> 

Ack to this, FWIW. I think this is a natural/logical approach,
prototyping and whatnot notwithstanding.

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Amir Goldstein Jan. 17, 2025, 1:27 p.m. UTC | #31
On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> > > Sorry if this is out of left field as I haven't followed the discussion
> > > closely, but I presumed one of the reasons Darrick and Christoph raised
> > > the idea of using the folio batch thing I'm playing around with on zero
> > > range for buffered writes would be to acquire and lock all targeted
> > > folios up front. If so, would that help with what you're trying to
> > > achieve here? (If not, nothing to see here, move along.. ;).
> >
> > I mostly thought about acquiring, as locking doesn't really have much
> > batching effects.  That being said, no that you got the idea in my mind
> > here's my early morning brainfart on it:
> >
> > Let's ignore DIRECT I/O for the first step.  In that case lookup /
> > allocation and locking all folios for write before copying data will
> > remove the need for i_rwsem in the read and write path.  In a way that
> > sounds perfect, and given that btrfs already does that (although in a
> > very convoluted way) we know it's possible.
>
> Yes, this seems like a sane, general approach to allowing concurrent
> buffered writes (and reads).
>
> > But direct I/O throws a big monkey wrench here as already mentioned by
> > others.  Now one interesting thing some file systems have done is
> > to serialize buffered against direct I/O, either by waiting for one
> > to finish, or by simply forcing buffered I/O when direct I/O would
> > conflict.
>
> Right. We really don't want to downgrade to buffered IO if we can
> help it, though.
>
> > It's easy to detect outstanding direct I/O using i_dio_count
> > so buffered I/O could wait for that, and downgrading to buffered I/O
> > (potentially using the new uncached mode from Jens) if there are any
> > pages on the mapping after the invalidation also sounds pretty doable.
>
> It's much harder to sanely serialise DIO against buffered writes
> this way, because i_dio_count only forms a submission barrier in
> conjunction with the i_rwsem being held exclusively. e.g. ongoing
> DIO would result in the buffered write being indefinitely delayed.

Isn't this already the case today with EX vs. SH iolock?
I guess the answer depends whether or not i_rwsem
starves existing writers in the face of ongoing new readers.
If my memory serves me right, this exact behavior of i_rwsem
is what sparked the original regression report when xfs
moved from i_iolock to i_rwsem [1].

[1] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/

>
> I think the model and method that bcachefs uses is probably the best
> way to move forward - the "two-state exclusive shared" lock which it
> uses to do buffered vs direct exclusion is a simple, easy way to
> handle this problem. The same-state shared locking fast path is a
> single atomic cmpxchg operation, so it has neglible extra overhead
> compared to using a rwsem in the shared DIO fast path.
>
> The lock also has non-owner semantics, so DIO can take it during
> submission and then drop it during IO completion. This solves the
> problem we currently use the i_rwsem and
> inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission
> barrier and waiting for all existing DIO to drain).
>
> IOWs, a two-state shared lock provides the mechanism to allow DIO
> to be done without holding the i_rwsem at all, as well as being able
> to elide two atomic operations per DIO to track in-flight DIOs.
>
> We'd get this whilst maintaining buffered/DIO coherency without
> adding any new overhead to the DIO path, and allow concurrent
> buffered reads and writes that have their atomicity defined by the
> batched folio locking strategy that Brian is working on...
>
> This only leaves DIO coherency issues with mmap() based IO as an
> issue, but that's a problem for a different day...
>
> > I don't really have time to turn this hand waving into, but maybe we
> > should think if it's worthwhile or if I'm missing something important.
>
> If people are OK with XFS moving to exclusive buffered or DIO
> submission model, then I can find some time to work on the
> converting the IO path locking to use a two-state shared lock in
> preparation for the batched folio stuff that will allow concurrent
> buffered writes...

I won't object to getting the best of all worlds, but I have to say,
upfront, this sounds a bit like premature optimization and for
a workload (mixed buffered/dio write) that I don't think anybody
does in practice and nobody should care how it performs.
Am I wrong?

For all practical purposes, we could maintain a counter in inode
not for submitted DIO, but for files opened O_DIRECT.

The first open O_DIRECT could serialize with in-flight
buffered writes holding a shared iolock and then buffered writes
could take SH vs. EX iolock depending on folio state and on
i_dio_open_count.

I would personally prefer a simple solution that is good enough
and has a higher likelihood for allocating the development, review
and testing resources that are needed to bring it to the finish line.

Thanks,
Amir.
Chi Zhiling Jan. 17, 2025, 4:12 p.m. UTC | #32
On 2025/1/16 05:41, Dave Chinner wrote:
> On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
>> On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
>>> Sorry if this is out of left field as I haven't followed the discussion
>>> closely, but I presumed one of the reasons Darrick and Christoph raised
>>> the idea of using the folio batch thing I'm playing around with on zero
>>> range for buffered writes would be to acquire and lock all targeted
>>> folios up front. If so, would that help with what you're trying to
>>> achieve here? (If not, nothing to see here, move along.. ;).
>>
>> I mostly thought about acquiring, as locking doesn't really have much
>> batching effects.  That being said, no that you got the idea in my mind
>> here's my early morning brainfart on it:
>>
>> Let's ignore DIRECT I/O for the first step.  In that case lookup /
>> allocation and locking all folios for write before copying data will
>> remove the need for i_rwsem in the read and write path.  In a way that
>> sounds perfect, and given that btrfs already does that (although in a
>> very convoluted way) we know it's possible.
> 
> Yes, this seems like a sane, general approach to allowing concurrent
> buffered writes (and reads).
> 
>> But direct I/O throws a big monkey wrench here as already mentioned by
>> others.  Now one interesting thing some file systems have done is
>> to serialize buffered against direct I/O, either by waiting for one
>> to finish, or by simply forcing buffered I/O when direct I/O would
>> conflict.
> 
> Right. We really don't want to downgrade to buffered IO if we can
> help it, though.
> 
>> It's easy to detect outstanding direct I/O using i_dio_count
>> so buffered I/O could wait for that, and downgrading to buffered I/O
>> (potentially using the new uncached mode from Jens) if there are any
>> pages on the mapping after the invalidation also sounds pretty doable.
> 
> It's much harder to sanely serialise DIO against buffered writes
> this way, because i_dio_count only forms a submission barrier in
> conjunction with the i_rwsem being held exclusively. e.g. ongoing
> DIO would result in the buffered write being indefinitely delayed.
> 
> I think the model and method that bcachefs uses is probably the best
> way to move forward - the "two-state exclusive shared" lock which it
> uses to do buffered vs direct exclusion is a simple, easy way to
> handle this problem. The same-state shared locking fast path is a
> single atomic cmpxchg operation, so it has neglible extra overhead
> compared to using a rwsem in the shared DIO fast path.
> 
> The lock also has non-owner semantics, so DIO can take it during
> submission and then drop it during IO completion. This solves the
> problem we currently use the i_rwsem and
> inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission
> barrier and waiting for all existing DIO to drain).
> 
> IOWs, a two-state shared lock provides the mechanism to allow DIO
> to be done without holding the i_rwsem at all, as well as being able
> to elide two atomic operations per DIO to track in-flight DIOs.
> 
> We'd get this whilst maintaining buffered/DIO coherency without
> adding any new overhead to the DIO path, and allow concurrent
> buffered reads and writes that have their atomicity defined by the
> batched folio locking strategy that Brian is working on...
> 
> This only leaves DIO coherency issues with mmap() based IO as an
> issue, but that's a problem for a different day...
> 
>> I don't really have time to turn this hand waving into, but maybe we
>> should think if it's worthwhile or if I'm missing something important.
> 
> If people are OK with XFS moving to exclusive buffered or DIO
> submission model, then I can find some time to work on the
> converting the IO path locking to use a two-state shared lock in
> preparation for the batched folio stuff that will allow concurrent
> buffered writes...
> 

I really think it's time for someone to submit the basic patch, and then
we can continue with the discussion and testing. :)

BTW, I support the solution with folio batch and two-state shared lock.


Thanks,
Chi Zhiling
Dave Chinner Jan. 17, 2025, 10:19 p.m. UTC | #33
On Fri, Jan 17, 2025 at 02:27:46PM +0100, Amir Goldstein wrote:
> On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> > > > Sorry if this is out of left field as I haven't followed the discussion
> > > > closely, but I presumed one of the reasons Darrick and Christoph raised
> > > > the idea of using the folio batch thing I'm playing around with on zero
> > > > range for buffered writes would be to acquire and lock all targeted
> > > > folios up front. If so, would that help with what you're trying to
> > > > achieve here? (If not, nothing to see here, move along.. ;).
> > >
> > > I mostly thought about acquiring, as locking doesn't really have much
> > > batching effects.  That being said, no that you got the idea in my mind
> > > here's my early morning brainfart on it:
> > >
> > > Let's ignore DIRECT I/O for the first step.  In that case lookup /
> > > allocation and locking all folios for write before copying data will
> > > remove the need for i_rwsem in the read and write path.  In a way that
> > > sounds perfect, and given that btrfs already does that (although in a
> > > very convoluted way) we know it's possible.
> >
> > Yes, this seems like a sane, general approach to allowing concurrent
> > buffered writes (and reads).
> >
> > > But direct I/O throws a big monkey wrench here as already mentioned by
> > > others.  Now one interesting thing some file systems have done is
> > > to serialize buffered against direct I/O, either by waiting for one
> > > to finish, or by simply forcing buffered I/O when direct I/O would
> > > conflict.
> >
> > Right. We really don't want to downgrade to buffered IO if we can
> > help it, though.
> >
> > > It's easy to detect outstanding direct I/O using i_dio_count
> > > so buffered I/O could wait for that, and downgrading to buffered I/O
> > > (potentially using the new uncached mode from Jens) if there are any
> > > pages on the mapping after the invalidation also sounds pretty doable.
> >
> > It's much harder to sanely serialise DIO against buffered writes
> > this way, because i_dio_count only forms a submission barrier in
> > conjunction with the i_rwsem being held exclusively. e.g. ongoing
> > DIO would result in the buffered write being indefinitely delayed.
> 
> Isn't this already the case today with EX vs. SH iolock?

No. We do not hold the i_rwsem across async DIO read or write, so
we can have DIO in flight whilst a buffered write grabs and holds
the i_rwsem exclusive. This is the problem that i_dio_count solves
for truncate, etc. i.e. the only way to actually ensure there is no
DIO in flight is to grab the i_rwsem exclusive and then call
inode_dio_wait() to ensure all async DIO in flight completes before
continuing.

> I guess the answer depends whether or not i_rwsem
> starves existing writers in the face of ongoing new readers.

It does not. If the lock is held for read and there is a pending
write, new readers are queued behind the pending write waiters
(see rwsem_down_read_slowpath(), which is triggered if there are
pending waiters on an attempt to read lock).

> > > I don't really have time to turn this hand waving into, but maybe we
> > > should think if it's worthwhile or if I'm missing something important.
> >
> > If people are OK with XFS moving to exclusive buffered or DIO
> > submission model, then I can find some time to work on the
> > converting the IO path locking to use a two-state shared lock in
> > preparation for the batched folio stuff that will allow concurrent
> > buffered writes...
> 
> I won't object to getting the best of all worlds, but I have to say,
> upfront, this sounds a bit like premature optimization and for
> a workload (mixed buffered/dio write)

Say what? The dio/buffered exclusion change provides a different
data coherency and correctness model that is required to then
optimising the workload you care about - mixed buffered read/write.

This change doesn't optimise either DIO or buffered IO, nor is a
shared-exclusive BIO/DIO lock isn't going to improve performance of
such mixed IO workloads in any way.  IOWs, it's pretty hard to call
a locking model change like this "optimisation", let alone call it
"premature".

> that I don't think anybody
> does in practice and nobody should care how it performs.
> Am I wrong?

Yes and no. mixed buffered/direct IO worklaods are more common than
you think. e.g. many backup programs use direct IO and so inherently
mix DIO with buffered IO for the majority of files the backup
program touches. However, all we care about in this case is data
coherency, not performance, and this change should improve data
coherency between DIO and buffered IO...

> For all practical purposes, we could maintain a counter in inode
> not for submitted DIO, but for files opened O_DIRECT.
> 
> The first open O_DIRECT could serialize with in-flight
> buffered writes holding a shared iolock and then buffered writes
> could take SH vs. EX iolock depending on folio state and on
> i_dio_open_count.

I don't see how this can be made to work w.r.t. sane data coherency
behaviour. e.g. how does this model serialise a new DIO write that
is submitted after a share-locked buffered write has just started
and passed all the "i_dio_count == 0" checks that enable it to use
shared locking? i.e. we now have potentially overlapping buffered
writes and DIO writes being done concurrently because the buffered
write may not have instantiated folios in the page cache yet....

> I would personally prefer a simple solution that is good enough
> and has a higher likelihood for allocating the development, review
> and testing resources that are needed to bring it to the finish line.

Have you looked at how simple the bcachefs buffered/dio exclusion
implementation is? The lock mechanism itself is about 50 lines of
code, and there are only 4 or 5 places where the lock is actually
needed. It doesn't get much simpler than that.

And, quite frankly, the fact the bcachefs solution also covers AIO
DIO in flight (which i_rwsem based locking does not!) means it is a
more robust solution than trying to rely on racy i_dio_count hacks
and folio residency in the page cache...

-Dave.
Dave Chinner Jan. 17, 2025, 10:20 p.m. UTC | #34
On Wed, Jan 15, 2025 at 08:36:48PM -0800, Christoph Hellwig wrote:
> On Thu, Jan 16, 2025 at 08:41:21AM +1100, Dave Chinner wrote:
> > > I don't really have time to turn this hand waving into, but maybe we 
> > > should think if it's worthwhile or if I'm missing something important.
> > 
> > If people are OK with XFS moving to exclusive buffered or DIO
> > submission model, then I can find some time to work on the
> > converting the IO path locking to use a two-state shared lock in
> > preparation for the batched folio stuff that will allow concurrent
> > buffered writes...
> 
> This does sound fine to me, but it's hard to judge without seeing
> a prototype and results based on it.

OK, the consensus seems to be to run with this for the moment and
see how clean it ends up being. I'll look into it over the next
couple of weeks then.

-Dave.
Amir Goldstein Jan. 18, 2025, 1:03 p.m. UTC | #35
On Fri, Jan 17, 2025 at 11:19 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Jan 17, 2025 at 02:27:46PM +0100, Amir Goldstein wrote:
> > On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
> > > > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
> > > > > Sorry if this is out of left field as I haven't followed the discussion
> > > > > closely, but I presumed one of the reasons Darrick and Christoph raised
> > > > > the idea of using the folio batch thing I'm playing around with on zero
> > > > > range for buffered writes would be to acquire and lock all targeted
> > > > > folios up front. If so, would that help with what you're trying to
> > > > > achieve here? (If not, nothing to see here, move along.. ;).
> > > >
> > > > I mostly thought about acquiring, as locking doesn't really have much
> > > > batching effects.  That being said, no that you got the idea in my mind
> > > > here's my early morning brainfart on it:
> > > >
> > > > Let's ignore DIRECT I/O for the first step.  In that case lookup /
> > > > allocation and locking all folios for write before copying data will
> > > > remove the need for i_rwsem in the read and write path.  In a way that
> > > > sounds perfect, and given that btrfs already does that (although in a
> > > > very convoluted way) we know it's possible.
> > >
> > > Yes, this seems like a sane, general approach to allowing concurrent
> > > buffered writes (and reads).
> > >
> > > > But direct I/O throws a big monkey wrench here as already mentioned by
> > > > others.  Now one interesting thing some file systems have done is
> > > > to serialize buffered against direct I/O, either by waiting for one
> > > > to finish, or by simply forcing buffered I/O when direct I/O would
> > > > conflict.
> > >
> > > Right. We really don't want to downgrade to buffered IO if we can
> > > help it, though.
> > >
> > > > It's easy to detect outstanding direct I/O using i_dio_count
> > > > so buffered I/O could wait for that, and downgrading to buffered I/O
> > > > (potentially using the new uncached mode from Jens) if there are any
> > > > pages on the mapping after the invalidation also sounds pretty doable.
> > >
> > > It's much harder to sanely serialise DIO against buffered writes
> > > this way, because i_dio_count only forms a submission barrier in
> > > conjunction with the i_rwsem being held exclusively. e.g. ongoing
> > > DIO would result in the buffered write being indefinitely delayed.
> >
> > Isn't this already the case today with EX vs. SH iolock?
>
> No. We do not hold the i_rwsem across async DIO read or write, so
> we can have DIO in flight whilst a buffered write grabs and holds
> the i_rwsem exclusive. This is the problem that i_dio_count solves
> for truncate, etc. i.e. the only way to actually ensure there is no
> DIO in flight is to grab the i_rwsem exclusive and then call
> inode_dio_wait() to ensure all async DIO in flight completes before
> continuing.
>
> > I guess the answer depends whether or not i_rwsem
> > starves existing writers in the face of ongoing new readers.
>
> It does not. If the lock is held for read and there is a pending
> write, new readers are queued behind the pending write waiters
> (see rwsem_down_read_slowpath(), which is triggered if there are
> pending waiters on an attempt to read lock).
>
> > > > I don't really have time to turn this hand waving into, but maybe we
> > > > should think if it's worthwhile or if I'm missing something important.
> > >
> > > If people are OK with XFS moving to exclusive buffered or DIO
> > > submission model, then I can find some time to work on the
> > > converting the IO path locking to use a two-state shared lock in
> > > preparation for the batched folio stuff that will allow concurrent
> > > buffered writes...
> >
> > I won't object to getting the best of all worlds, but I have to say,
> > upfront, this sounds a bit like premature optimization and for
> > a workload (mixed buffered/dio write)
>
> Say what? The dio/buffered exclusion change provides a different
> data coherency and correctness model that is required to then
> optimising the workload you care about - mixed buffered read/write.
>
> This change doesn't optimise either DIO or buffered IO, nor is a
> shared-exclusive BIO/DIO lock isn't going to improve performance of
> such mixed IO workloads in any way.  IOWs, it's pretty hard to call
> a locking model change like this "optimisation", let alone call it
> "premature".
>

I was questioning the need to optimize the mixed buffered read/write
*while file is also open for O_DIRECT*
I thought that the solution would be easier if can take Christof's
"Let's ignore DIRECT I/O for the first step" and make it testable.
Then if the inode is open O_DIRECT, no change to locking.

> > that I don't think anybody
> > does in practice and nobody should care how it performs.
> > Am I wrong?
>
> Yes and no. mixed buffered/direct IO worklaods are more common than
> you think. e.g. many backup programs use direct IO and so inherently
> mix DIO with buffered IO for the majority of files the backup
> program touches.

Still feels like we can forego performance of mixed buffered rw
*while the backup program reads the database file*, should a backup
program ever really read a large database file...

> However, all we care about in this case is data
> coherency, not performance, and this change should improve data
> coherency between DIO and buffered IO...
>

I did not understand that you are proposing an improvement over the
existing state of affairs. Yeh, I certainly have no objections to fixing
things that are wrong.

> > For all practical purposes, we could maintain a counter in inode
> > not for submitted DIO, but for files opened O_DIRECT.
> >
> > The first open O_DIRECT could serialize with in-flight
> > buffered writes holding a shared iolock and then buffered writes
> > could take SH vs. EX iolock depending on folio state and on
> > i_dio_open_count.
>
> I don't see how this can be made to work w.r.t. sane data coherency
> behaviour. e.g. how does this model serialise a new DIO write that
> is submitted after a share-locked buffered write has just started
> and passed all the "i_dio_count == 0" checks that enable it to use
> shared locking? i.e. we now have potentially overlapping buffered
> writes and DIO writes being done concurrently because the buffered
> write may not have instantiated folios in the page cache yet....
>

A shared-locked buffered write can only start when there is no
O_DIRECT file open on the inode.

The first open for O_DIRECT to increment i_dio_open_count
needs to take exclusive iolock to wait for all in-flight buffered writes
then release the iolock.

All DIO submitted via O_DIRECT fds will be safe against in-flight
share-locked buffered write.

> > I would personally prefer a simple solution that is good enough
> > and has a higher likelihood for allocating the development, review
> > and testing resources that are needed to bring it to the finish line.
>
> Have you looked at how simple the bcachefs buffered/dio exclusion
> implementation is? The lock mechanism itself is about 50 lines of
> code, and there are only 4 or 5 places where the lock is actually
> needed. It doesn't get much simpler than that.
>
> And, quite frankly, the fact the bcachefs solution also covers AIO
> DIO in flight (which i_rwsem based locking does not!) means it is a
> more robust solution than trying to rely on racy i_dio_count hacks
> and folio residency in the page cache...

I will be happy to see this improvement.

Thanks,
Amir.
Dave Chinner Jan. 20, 2025, 5:11 a.m. UTC | #36
On Sat, Jan 18, 2025 at 02:03:41PM +0100, Amir Goldstein wrote:
> On Fri, Jan 17, 2025 at 11:19 PM Dave Chinner
> <david@fromorbit.com> wrote:
> > On Fri, Jan 17, 2025 at 02:27:46PM +0100, Amir Goldstein wrote:
> > > On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner
> > > <david@fromorbit.com> wrote: For all practical purposes, we
> > > could maintain a counter in inode not for submitted DIO, but
> > > for files opened O_DIRECT.
> > >
> > > The first open O_DIRECT could serialize with in-flight
> > > buffered writes holding a shared iolock and then buffered
> > > writes could take SH vs. EX iolock depending on folio state
> > > and on i_dio_open_count.
> >
> > I don't see how this can be made to work w.r.t. sane data
> > coherency behaviour. e.g. how does this model serialise a new
> > DIO write that is submitted after a share-locked buffered write
> > has just started and passed all the "i_dio_count == 0" checks
> > that enable it to use shared locking? i.e. we now have
> > potentially overlapping buffered writes and DIO writes being
> > done concurrently because the buffered write may not have
> > instantiated folios in the page cache yet....
> >
> 
> A shared-locked buffered write can only start when there is no
> O_DIRECT file open on the inode.  The first open for O_DIRECT to
> increment i_dio_open_count needs to take exclusive iolock to wait
> for all in-flight buffered writes then release the iolock.
> 
> All DIO submitted via O_DIRECT fds will be safe against in-flight
> share-locked buffered write.

Hooking ->open() and ->release() isn't sufficient. O_DIRECT can be
changed at any time via fcntl(F_SETFL) which isn't synchronised
against IO in progress at all. i.e. we can't track that, nor can we
even use f_ops->check_flags to reject changes to O_DIRECT state
because it doesn't pass either the filp or the inode....

IOWs, as it stands logic like "is the file opened for O_DIRECT" in
the IO path is inherently racy and the racing mechanisms are
directly under the control of unprivileged userspace applications.
Without mods from the VFS down and new hooks being implemented in
XFS along with all the whacky "which serialisiation method do we
use" logic and dealing with the complexities and switching
between them, tracking struct files that are open for O_DIRECT isn't
really a viable mechanism for enabling shared buffered writes...

-Dave.
Christoph Hellwig Jan. 22, 2025, 6:08 a.m. UTC | #37
On Sat, Jan 18, 2025 at 09:19:15AM +1100, Dave Chinner wrote:
> And, quite frankly, the fact the bcachefs solution also covers AIO
> DIO in flight (which i_rwsem based locking does not!) means it is a
> more robust solution than trying to rely on racy i_dio_count hacks
> and folio residency in the page cache...

The original i_rwsem (still i_iolock then) scheme did that, but the
core locking maintainers asked us to remove the non-owner unlocks,
so I did that.  It turns out later we got officially sanctioned
non-owner unlocks, so we could have easily add this back.  I did that
5 years ago, but reception was lukewarm:

http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/i_rwsem-non_owner
Dave Chinner Jan. 22, 2025, 11:35 p.m. UTC | #38
On Tue, Jan 21, 2025 at 10:08:10PM -0800, Christoph Hellwig wrote:
> On Sat, Jan 18, 2025 at 09:19:15AM +1100, Dave Chinner wrote:
> > And, quite frankly, the fact the bcachefs solution also covers AIO
> > DIO in flight (which i_rwsem based locking does not!) means it is a
> > more robust solution than trying to rely on racy i_dio_count hacks
> > and folio residency in the page cache...
> 
> The original i_rwsem (still i_iolock then) scheme did that, but the
> core locking maintainers asked us to remove the non-owner unlocks,
> so I did that.  It turns out later we got officially sanctioned
> non-owner unlocks, so we could have easily add this back.  I did that
> 5 years ago, but reception was lukewarm:
> 
> http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/i_rwsem-non_owner

I have no issues with the concept - I have always thought that
holding the IOLOCK to completion was how the IO locking should work
(like we do with the xfs_buf semaphore). The i_dio_count submission
barrier solution was a necessary hack because rwsems did not work
the way we needed to use them.

What I didn't like about the proposal above was the implementation
(i.e. the encoding of rwsem semantics in the iomap dio API and
implementation), but there was never any followup that addressed the
issues that were raised.....

-Dave.
Chi Zhiling Jan. 24, 2025, 7:57 a.m. UTC | #39
On 2025/1/16 05:41, Dave Chinner wrote:
> On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote:
>> On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote:
>>> Sorry if this is out of left field as I haven't followed the discussion
>>> closely, but I presumed one of the reasons Darrick and Christoph raised
>>> the idea of using the folio batch thing I'm playing around with on zero
>>> range for buffered writes would be to acquire and lock all targeted
>>> folios up front. If so, would that help with what you're trying to
>>> achieve here? (If not, nothing to see here, move along.. ;).
>>
>> I mostly thought about acquiring, as locking doesn't really have much
>> batching effects.  That being said, no that you got the idea in my mind
>> here's my early morning brainfart on it:
>>
>> Let's ignore DIRECT I/O for the first step.  In that case lookup /
>> allocation and locking all folios for write before copying data will
>> remove the need for i_rwsem in the read and write path.  In a way that
>> sounds perfect, and given that btrfs already does that (although in a
>> very convoluted way) we know it's possible.
> 
> Yes, this seems like a sane, general approach to allowing concurrent
> buffered writes (and reads).
> 
>> But direct I/O throws a big monkey wrench here as already mentioned by
>> others.  Now one interesting thing some file systems have done is
>> to serialize buffered against direct I/O, either by waiting for one
>> to finish, or by simply forcing buffered I/O when direct I/O would
>> conflict.
> 
> Right. We really don't want to downgrade to buffered IO if we can
> help it, though.
> 
>> It's easy to detect outstanding direct I/O using i_dio_count
>> so buffered I/O could wait for that, and downgrading to buffered I/O
>> (potentially using the new uncached mode from Jens) if there are any
>> pages on the mapping after the invalidation also sounds pretty doable.
> 
> It's much harder to sanely serialise DIO against buffered writes
> this way, because i_dio_count only forms a submission barrier in
> conjunction with the i_rwsem being held exclusively. e.g. ongoing
> DIO would result in the buffered write being indefinitely delayed.
> 
> I think the model and method that bcachefs uses is probably the best
> way to move forward - the "two-state exclusive shared" lock which it
> uses to do buffered vs direct exclusion is a simple, easy way to
> handle this problem. The same-state shared locking fast path is a
> single atomic cmpxchg operation, so it has neglible extra overhead
> compared to using a rwsem in the shared DIO fast path.
> 
> The lock also has non-owner semantics, so DIO can take it during
> submission and then drop it during IO completion. This solves the
> problem we currently use the i_rwsem and
> inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission
> barrier and waiting for all existing DIO to drain).
> 
> IOWs, a two-state shared lock provides the mechanism to allow DIO
> to be done without holding the i_rwsem at all, as well as being able
> to elide two atomic operations per DIO to track in-flight DIOs.
> 
> We'd get this whilst maintaining buffered/DIO coherency without
> adding any new overhead to the DIO path, and allow concurrent
> buffered reads and writes that have their atomicity defined by the
> batched folio locking strategy that Brian is working on...
> 
> This only leaves DIO coherency issues with mmap() based IO as an
> issue, but that's a problem for a different day...

When I try to implement those features, I think we could use exclusive
locks for RWF_APPEND writes, and shared locks for other writes.

The reason is that concurrent operations are also possible for extended
writes if there is no overlap in the regions being written.
So there is no need to determine whether it is an extended write in
advance.

As for why an exclusive lock is needed for append writes, it's because
we don't want the EOF to be modified during the append write.

The code is like that:
+       if (iocb->ki_flags & IOCB_APPEND)
+               iolock = XFS_IOLOCK_EXCL;
+       else
+               iolock = XFS_IOLOCK_SHARED;


If we use exclusive locks for all extended writes, we need to check if
it is an extended write before acquiring the lock, but this value could
become outdated if we do not hold the lock.

So if we do an extended write,
we might need to follow this process:

1. Get read lock.
2. Check if it is an extended write.
3. Release read lock.
4. Get write lock.
5. Do the write operation.


Thanks,
Chi Zhiling
Jinliang Zheng Jan. 25, 2025, 8:43 a.m. UTC | #40
On Tue, 7 Jan 2025 13:13:17 +0100, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote:
> >
> >
> >
> > On 2024/12/29 06:17, Dave Chinner wrote:
> > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> > >>
> > >>
> > >> On 2024/12/27 05:50, Dave Chinner wrote:
> > >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > >>>> From: Chi Zhiling <chizhiling@kylinos.cn>
> > >>>>
> > >>>> Using an rwsem to protect file data ensures that we can always obtain a
> > >>>> completed modification. But due to the lock, we need to wait for the
> > >>>> write process to release the rwsem before we can read it, even if we are
> > >>>> reading a different region of the file. This could take a lot of time
> > >>>> when many processes need to write and read this file.
> > >>>>
> > >>>> On the other hand, The ext4 filesystem and others do not hold the lock
> > >>>> during buffered reading, which make the ext4 have better performance in
> > >>>> that case. Therefore, I think it will be fine if we remove the lock in
> > >>>> xfs, as most applications can handle this situation.
> > >>>
> > >>> Nope.
> > >>>
> > >>> This means that XFS loses high level serialisation of incoming IO
> > >>> against operations like truncate, fallocate, pnfs operations, etc.
> > >>>
> > >>> We've been through this multiple times before; the solution lies in
> > >>> doing the work to make buffered writes use shared locking, not
> > >>> removing shared locking from buffered reads.
> > >>
> > >> You mean using shared locking for buffered reads and writes, right?
> > >>
> > >> I think it's a great idea. In theory, write operations can be performed
> > >> simultaneously if they write to different ranges.
> > >
> > > Even if they overlap, the folio locks will prevent concurrent writes
> > > to the same range.
> > >
> > > Now that we have atomic write support as native functionality (i.e.
> > > RWF_ATOMIC), we really should not have to care that much about
> > > normal buffered IO being atomic. i.e. if the application wants
> > > atomic writes, it can now specify that it wants atomic writes and so
> > > we can relax the constraints we have on existing IO...
> >
> > Yes, I'm not particularly concerned about whether buffered I/O is
> > atomic. I'm more concerned about the multithreading performance of
> > buffered I/O.
> 
> Hi Chi,
> 
> Sorry for joining late, I was on vacation.
> I am very happy that you have taken on this task and your timing is good,
> because  John Garry just posted his patches for large atomic writes [1].
> 
> I need to explain the relation to atomic buffered I/O, because it is not
> easy to follow it from the old discussions and also some of the discussions
> about the solution were held in-person during LSFMM2024.
> 
> Naturally, your interest is improving multithreading performance of
> buffered I/O, so was mine when I first posted this question [2].
> 
> The issue with atomicity of buffered I/O is the xfs has traditionally
> provided atomicity of write vs. read (a.k.a no torn writes), which is
> not required by POSIX standard (because POSIX was not written with
> threads in mind) and is not respected by any other in-tree filesystem.
> 
> It is obvious that the exclusive rw lock for buffered write hurts performance
> in the case of mixed read/write on xfs, so the question was - if xfs provides
> atomic semantics that portable applications cannot rely on, why bother
> providing these atomic semantics at all?
> 
> Dave's answer to this question was that there are some legacy applications
> (database applications IIRC) on production systems that do rely on the fact
> that xfs provides this semantics and on the prerequisite that they run on xfs.
> 
> However, it was noted that:
> 1. Those application do not require atomicity for any size of IO, they
>     typically work in I/O size that is larger than block size (e.g. 16K or 64K)
>     and they only require no torn writes for this I/O size
> 2. Large folios and iomap can usually provide this semantics via folio lock,
>     but application has currently no way of knowing if the semantics are
>     provided or not
> 3. The upcoming ability of application to opt-in for atomic writes larger
>     than fs block size [1] can be used to facilitate the applications that
>     want to legacy xfs semantics and avoid the need to enforce the legacy
>     semantics all the time for no good reason
> 
> Disclaimer: this is not a standard way to deal with potentially breaking
> legacy semantics, because old applications will not usually be rebuilt
> to opt-in for the old semantics, but the use case in hand is probably
> so specific, of a specific application that relies on xfs specific semantics
> that there are currently no objections for trying this solution.
> 
> [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> 
> >
> > Last week, it was mentioned that removing i_rwsem would have some
> > impacts on truncate, fallocate, and PNFS operations.
> >
> > (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> 
> You are not wrong. pNFS uses a "layout lease", which requires
> that the blockmap of the file will not be modified while the lease is held.
> but I think that block that are not mapped (i.e. holes) can be mapped
> while the lease is held.
> 
> >
> > My understanding is that the current i_rwsem is used to protect both
> > the file's data and its size. Operations like truncate, fallocate,
> > and PNFS use i_rwsem because they modify both the file's data and its
> > size. So, I'm thinking whether it's possible to use i_rwsem to protect
> > only the file's size, without protecting the file's data.
> >
> 
> It also protects the file's blockmap, for example, punch hole
> does not change the size, but it unmaps blocks from the blockmap,
> leading to races that could end up reading stale data from disk
> if the lock wasn't taken.
> 
> > So operations that modify the file's size need to be executed
> > sequentially. For example, buffered writes to the EOF, fallocate
> > operations without the "keep size" requirement, and truncate operations,
> > etc, all need to hold an exclusive lock.
> >
> > Other operations require a shared lock because they only need to access
> > the file's size without modifying it.
> >
> 
> As far as I understand, exclusive lock is not needed for non-extending
> writes, because it is ok to map new blocks.
> I guess the need for exclusive lock for extending writes is related to
> update of file size, but not 100% sure.


> Anyway, exclusive lock on extending write is widely common in other fs,
> while exclusive lock for non-extending write is unique to xfs.

I am sorry but I can't understand, because I found ext4_buffered_write_iter()
take exclusive lock unconditionally.

Did I miss some context of the discussion?

Thank you very much.
Jinliang Zheng

> 
> > >
> > >> So we should track all the ranges we are reading or writing,
> > >> and check whether the new read or write operations can be performed
> > >> concurrently with the current operations.
> > >
> > > That is all discussed in detail in the discussions I linked.
> >
> > Sorry, I overlooked some details from old discussion last time.
> > It seems that you are not satisfied with the effectiveness of
> > range locks.
> >
> 
> Correct. This solution was taken off the table.
> 
> I hope my explanation was correct and clear.
> If anything else is not clear, please feel free to ask.
> 
> Thanks,
> Amir.
Amir Goldstein Jan. 25, 2025, 2:14 p.m. UTC | #41
On Sat, Jan 25, 2025 at 9:43 AM Jinliang Zheng <alexjlzheng@gmail.com> wrote:
>
> On Tue, 7 Jan 2025 13:13:17 +0100, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote:
> > >
> > >
> > >
> > > On 2024/12/29 06:17, Dave Chinner wrote:
> > > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> > > >>
> > > >>
> > > >> On 2024/12/27 05:50, Dave Chinner wrote:
> > > >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > > >>>> From: Chi Zhiling <chizhiling@kylinos.cn>
> > > >>>>
> > > >>>> Using an rwsem to protect file data ensures that we can always obtain a
> > > >>>> completed modification. But due to the lock, we need to wait for the
> > > >>>> write process to release the rwsem before we can read it, even if we are
> > > >>>> reading a different region of the file. This could take a lot of time
> > > >>>> when many processes need to write and read this file.
> > > >>>>
> > > >>>> On the other hand, The ext4 filesystem and others do not hold the lock
> > > >>>> during buffered reading, which make the ext4 have better performance in
> > > >>>> that case. Therefore, I think it will be fine if we remove the lock in
> > > >>>> xfs, as most applications can handle this situation.
> > > >>>
> > > >>> Nope.
> > > >>>
> > > >>> This means that XFS loses high level serialisation of incoming IO
> > > >>> against operations like truncate, fallocate, pnfs operations, etc.
> > > >>>
> > > >>> We've been through this multiple times before; the solution lies in
> > > >>> doing the work to make buffered writes use shared locking, not
> > > >>> removing shared locking from buffered reads.
> > > >>
> > > >> You mean using shared locking for buffered reads and writes, right?
> > > >>
> > > >> I think it's a great idea. In theory, write operations can be performed
> > > >> simultaneously if they write to different ranges.
> > > >
> > > > Even if they overlap, the folio locks will prevent concurrent writes
> > > > to the same range.
> > > >
> > > > Now that we have atomic write support as native functionality (i.e.
> > > > RWF_ATOMIC), we really should not have to care that much about
> > > > normal buffered IO being atomic. i.e. if the application wants
> > > > atomic writes, it can now specify that it wants atomic writes and so
> > > > we can relax the constraints we have on existing IO...
> > >
> > > Yes, I'm not particularly concerned about whether buffered I/O is
> > > atomic. I'm more concerned about the multithreading performance of
> > > buffered I/O.
> >
> > Hi Chi,
> >
> > Sorry for joining late, I was on vacation.
> > I am very happy that you have taken on this task and your timing is good,
> > because  John Garry just posted his patches for large atomic writes [1].
> >
> > I need to explain the relation to atomic buffered I/O, because it is not
> > easy to follow it from the old discussions and also some of the discussions
> > about the solution were held in-person during LSFMM2024.
> >
> > Naturally, your interest is improving multithreading performance of
> > buffered I/O, so was mine when I first posted this question [2].
> >
> > The issue with atomicity of buffered I/O is the xfs has traditionally
> > provided atomicity of write vs. read (a.k.a no torn writes), which is
> > not required by POSIX standard (because POSIX was not written with
> > threads in mind) and is not respected by any other in-tree filesystem.
> >
> > It is obvious that the exclusive rw lock for buffered write hurts performance
> > in the case of mixed read/write on xfs, so the question was - if xfs provides
> > atomic semantics that portable applications cannot rely on, why bother
> > providing these atomic semantics at all?
> >
> > Dave's answer to this question was that there are some legacy applications
> > (database applications IIRC) on production systems that do rely on the fact
> > that xfs provides this semantics and on the prerequisite that they run on xfs.
> >
> > However, it was noted that:
> > 1. Those application do not require atomicity for any size of IO, they
> >     typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> >     and they only require no torn writes for this I/O size
> > 2. Large folios and iomap can usually provide this semantics via folio lock,
> >     but application has currently no way of knowing if the semantics are
> >     provided or not
> > 3. The upcoming ability of application to opt-in for atomic writes larger
> >     than fs block size [1] can be used to facilitate the applications that
> >     want to legacy xfs semantics and avoid the need to enforce the legacy
> >     semantics all the time for no good reason
> >
> > Disclaimer: this is not a standard way to deal with potentially breaking
> > legacy semantics, because old applications will not usually be rebuilt
> > to opt-in for the old semantics, but the use case in hand is probably
> > so specific, of a specific application that relies on xfs specific semantics
> > that there are currently no objections for trying this solution.
> >
> > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> >
> > >
> > > Last week, it was mentioned that removing i_rwsem would have some
> > > impacts on truncate, fallocate, and PNFS operations.
> > >
> > > (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> >
> > You are not wrong. pNFS uses a "layout lease", which requires
> > that the blockmap of the file will not be modified while the lease is held.
> > but I think that block that are not mapped (i.e. holes) can be mapped
> > while the lease is held.
> >
> > >
> > > My understanding is that the current i_rwsem is used to protect both
> > > the file's data and its size. Operations like truncate, fallocate,
> > > and PNFS use i_rwsem because they modify both the file's data and its
> > > size. So, I'm thinking whether it's possible to use i_rwsem to protect
> > > only the file's size, without protecting the file's data.
> > >
> >
> > It also protects the file's blockmap, for example, punch hole
> > does not change the size, but it unmaps blocks from the blockmap,
> > leading to races that could end up reading stale data from disk
> > if the lock wasn't taken.
> >
> > > So operations that modify the file's size need to be executed
> > > sequentially. For example, buffered writes to the EOF, fallocate
> > > operations without the "keep size" requirement, and truncate operations,
> > > etc, all need to hold an exclusive lock.
> > >
> > > Other operations require a shared lock because they only need to access
> > > the file's size without modifying it.
> > >
> >
> > As far as I understand, exclusive lock is not needed for non-extending
> > writes, because it is ok to map new blocks.
> > I guess the need for exclusive lock for extending writes is related to
> > update of file size, but not 100% sure.
>
>
> > Anyway, exclusive lock on extending write is widely common in other fs,
> > while exclusive lock for non-extending write is unique to xfs.
>
> I am sorry but I can't understand, because I found ext4_buffered_write_iter()
> take exclusive lock unconditionally.
>
> Did I miss some context of the discussion?
>

No, I misspoke. Sorry.
The behavior that is unique to xfs is shared iolock on buffered read.
ext4 (and the rest) do not take iolock on buffered reads.

Internally, filemap_read() takes filemap_invalidate_lock_shared() as needed
to avoid races with truncate and punch hole.

Before the introduction of filemap_invalidate_lock(), xfs was the only fs that
was safe from races of punch hole vs. buffered read leading to reading
stale data.

One way to fix the mixed buffered rw performance issue is to follow suit with
other fs and not take shared iolock on buffered reads in xfs, but that has the
downside of regressing the legacy "untorn writes" xfs behavior and it will make
it hard to maintain the existing buffered/dio coherency in xfs.

So instead of going in the direction of "follow ext4 buffered io
locking scheme",
this discussion took the direction of "follow dio locking scheme" when foilios
can be used to provide "untorn writes" and use another method to mutually
exclude dio writes and buffered writes.

I hope that my summary is correct and helps to clarify where we stand.

Thanks,
Amir.
Dave Chinner Jan. 27, 2025, 8:49 p.m. UTC | #42
On Fri, Jan 24, 2025 at 03:57:43PM +0800, Chi Zhiling wrote:
> On 2025/1/16 05:41, Dave Chinner wrote:
> > IOWs, a two-state shared lock provides the mechanism to allow DIO
> > to be done without holding the i_rwsem at all, as well as being able
> > to elide two atomic operations per DIO to track in-flight DIOs.
> > 
> > We'd get this whilst maintaining buffered/DIO coherency without
> > adding any new overhead to the DIO path, and allow concurrent
> > buffered reads and writes that have their atomicity defined by the
> > batched folio locking strategy that Brian is working on...
> > 
> > This only leaves DIO coherency issues with mmap() based IO as an
> > issue, but that's a problem for a different day...
> 
> When I try to implement those features, I think we could use exclusive
> locks for RWF_APPEND writes, and shared locks for other writes.
> 
> The reason is that concurrent operations are also possible for extended
> writes if there is no overlap in the regions being written.
> So there is no need to determine whether it is an extended write in
> advance.
> 
> As for why an exclusive lock is needed for append writes, it's because
> we don't want the EOF to be modified during the append write.

We don't care if the EOF moves during the append write at the
filesystem level. We set kiocb->ki_pos = i_size_read() from
generic_write_checks() under shared locking, and if we then race
with another extending append write there are two cases:

	1. the other task has already extended i_size; or
	2. we have two IOs at the same offset (i.e. at i_size).

In either case, we don't need exclusive locking for the IO because
the worst thing that happens is that two IOs hit the same file
offset. IOWs, it has always been left up to the application
serialise RWF_APPEND writes on XFS, not the filesystem.

There is good reason for not doing exclusive locking for extending
writes. When extending the file naturally (think sequential writes),
we need those IOs be able to run concurrently with other writes
in progress. i.e. it is common for
applications to submit multiple sequential extending writes in a
batch, and as long as we submit them all in order, they all hit the
(moving) EOF exactly and hence get run concurrently.

This also works with batch-submitted RWF_APPEND AIO-DIO - they just
turn into concurrent in-flight extending writes...

IOWs, forcing exclusive locking for writes at exactly EOF is
definitely not desirable, and existing behaviour is that they use
shared locking.

The only time we use exclusive locking for extending writes is when
they -start- beyond EOF (i.e. leave a hole between EOF and
kiocb->ki_pos) and so we have to guarantee that range does not have
stale data exposed from it while the write is in progress. i.e. we
have to zero the hole that moving EOF exposes if it contains
written extents, and we cannot allow reads or writes to that hole
before we are done.

> The code is like that:
> +       if (iocb->ki_flags & IOCB_APPEND)
> +               iolock = XFS_IOLOCK_EXCL;
> +       else
> +               iolock = XFS_IOLOCK_SHARED;
> 
> 
> If we use exclusive locks for all extended writes, we need to check if
> it is an extended write before acquiring the lock, but this value could
> become outdated if we do not hold the lock.
> 
> So if we do an extended write,
> we might need to follow this process:
> 
> 1. Get read lock.
> 2. Check if it is an extended write.
> 3. Release read lock.
> 4. Get write lock.
> 5. Do the write operation.

Please see xfs_file_write_checks() - it should already have all the
shared/exclusive relocking logic we need for temporarily excluding
buffered reads whilst submitting concurrent buffered writes (i.e. it
is the same as what we already do for concurrent DIO writes).

-Dave.
Christoph Hellwig Jan. 28, 2025, 5:15 a.m. UTC | #43
On Tue, Jan 28, 2025 at 07:49:17AM +1100, Dave Chinner wrote:
> > As for why an exclusive lock is needed for append writes, it's because
> > we don't want the EOF to be modified during the append write.
> 
> We don't care if the EOF moves during the append write at the
> filesystem level. We set kiocb->ki_pos = i_size_read() from
> generic_write_checks() under shared locking, and if we then race
> with another extending append write there are two cases:
> 
> 	1. the other task has already extended i_size; or
> 	2. we have two IOs at the same offset (i.e. at i_size).
> 
> In either case, we don't need exclusive locking for the IO because
> the worst thing that happens is that two IOs hit the same file
> offset. IOWs, it has always been left up to the application
> serialise RWF_APPEND writes on XFS, not the filesystem.

I disagree.  O_APPEND (RWF_APPEND is just the Linux-specific
per-I/O version of that) is extensively used for things like
multi-thread loggers where you have multiple threads doing O_APPEND
writes to a single log file, and they expect to not lose data
that way.  The fact that we currently don't do that for O_DIRECT
is a bug, which is just papered over that barely anyone uses
O_DIRECT | O_APPEND as that's not a very natural use case for
most applications (in fact NFS got away with never allowing it
at all).  But extending racy O_APPEND to buffered writes would
break a lot of applications.
David Laight Jan. 28, 2025, 9:23 p.m. UTC | #44
On Mon, 27 Jan 2025 21:15:41 -0800
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 28, 2025 at 07:49:17AM +1100, Dave Chinner wrote:
> > > As for why an exclusive lock is needed for append writes, it's because
> > > we don't want the EOF to be modified during the append write.  
> > 
> > We don't care if the EOF moves during the append write at the
> > filesystem level. We set kiocb->ki_pos = i_size_read() from
> > generic_write_checks() under shared locking, and if we then race
> > with another extending append write there are two cases:
> > 
> > 	1. the other task has already extended i_size; or
> > 	2. we have two IOs at the same offset (i.e. at i_size).
> > 
> > In either case, we don't need exclusive locking for the IO because
> > the worst thing that happens is that two IOs hit the same file
> > offset. IOWs, it has always been left up to the application
> > serialise RWF_APPEND writes on XFS, not the filesystem.  
> 
> I disagree.  O_APPEND (RWF_APPEND is just the Linux-specific
> per-I/O version of that) is extensively used for things like
> multi-thread loggers where you have multiple threads doing O_APPEND
> writes to a single log file, and they expect to not lose data
> that way.  The fact that we currently don't do that for O_DIRECT
> is a bug, which is just papered over that barely anyone uses
> O_DIRECT | O_APPEND as that's not a very natural use case for
> most applications (in fact NFS got away with never allowing it
> at all).  But extending racy O_APPEND to buffered writes would
> break a lot of applications.

It is broken in windows :-)
You get two writes to the same file offset and then (IIRC) two advances of EOF
(usually) giving a block of '\0' bytes in the file.

You might get away with doing an atomic update of EOF and then writing
into the gap.
But you have to decide what to do if there is a seg fault on the user buffer
It could be a multi-TB write from an mmap-ed file (maybe even over NFS) that
hits a disk read error.

Actually I suspect that if you let the two writes proceed in parallel you can't
let later ones complete first.
If the returns are sequenced a write can then be redone if an earlier write
got shortened.

	David
Dave Chinner Jan. 29, 2025, 12:59 a.m. UTC | #45
On Mon, Jan 27, 2025 at 09:15:41PM -0800, Christoph Hellwig wrote:
> On Tue, Jan 28, 2025 at 07:49:17AM +1100, Dave Chinner wrote:
> > > As for why an exclusive lock is needed for append writes, it's because
> > > we don't want the EOF to be modified during the append write.
> > 
> > We don't care if the EOF moves during the append write at the
> > filesystem level. We set kiocb->ki_pos = i_size_read() from
> > generic_write_checks() under shared locking, and if we then race
> > with another extending append write there are two cases:
> > 
> > 	1. the other task has already extended i_size; or
> > 	2. we have two IOs at the same offset (i.e. at i_size).
> > 
> > In either case, we don't need exclusive locking for the IO because
> > the worst thing that happens is that two IOs hit the same file
> > offset. IOWs, it has always been left up to the application
> > serialise RWF_APPEND writes on XFS, not the filesystem.
> 
> I disagree.  O_APPEND (RWF_APPEND is just the Linux-specific
> per-I/O version of that) is extensively used for things like
> multi-thread loggers where you have multiple threads doing O_APPEND
> writes to a single log file, and they expect to not lose data
> that way.

Sure, but we don't think we need full file offset-scope IO exclusion
to solve that problem.  We can still safely do concurrent writes
within EOF to occur whilst another buffered append write is doing
file extension work.

IOWs, where we really need to get to is a model that allows
concurrent buffered IO at all times, except for the case where IO
operations that change the inode size need to serialise against
other similar operations (e.g. other EOF extending IOs, truncate,
etc).

Hence I think we can largely ignore O_APPEND for the
purposes of prototyping shared buffered IO and getting rid of the
IOLOCK from the XFS IO path. I may end up re-using the i_rwsem as
a "EOF modification" serialisation mechanism for O_APPEND and
extending writes in general, but I don't think we need a general
write IO exclusion mechanism for this...

-Dave.
Christoph Hellwig Jan. 29, 2025, 5:20 a.m. UTC | #46
On Wed, Jan 29, 2025 at 11:59:53AM +1100, Dave Chinner wrote:
> Sure, but we don't think we need full file offset-scope IO exclusion
> to solve that problem.  We can still safely do concurrent writes
> within EOF to occur whilst another buffered append write is doing
> file extension work.

Sure.  The previous mail just sounded like you'd want to do away
with exclusion for assigning the offset.

> IOWs, where we really need to get to is a model that allows
> concurrent buffered IO at all times, except for the case where IO
> operations that change the inode size need to serialise against
> other similar operations (e.g. other EOF extending IOs, truncate,
> etc).
> 
> Hence I think we can largely ignore O_APPEND for the
> purposes of prototyping shared buffered IO and getting rid of the
> IOLOCK from the XFS IO path. I may end up re-using the i_rwsem as
> a "EOF modification" serialisation mechanism for O_APPEND and
> extending writes in general, but I don't think we need a general
> write IO exclusion mechanism for this...

That might be a chance to also fix O_DIRECT while we're at it..
diff mbox series

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9a435b1ff264..7d039cc3ae9e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -279,18 +279,9 @@  xfs_file_buffered_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
-	ssize_t			ret;
-
 	trace_xfs_file_buffered_read(iocb, to);
 
-	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
-	if (ret)
-		return ret;
-	ret = generic_file_read_iter(iocb, to);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
-	return ret;
+	return generic_file_read_iter(iocb, to);
 }
 
 STATIC ssize_t