Message ID | 20241226061602.2222985-1-chizhiling@163.com (mailing list archive)
---|---
State | New
Series | xfs: Remove i_rwsem lock in buffered read
On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> From: Chi Zhiling <chizhiling@kylinos.cn>
>
> Using an rwsem to protect file data ensures that we can always obtain a
> completed modification. But due to the lock, we need to wait for the
> write process to release the rwsem before we can read it, even if we are
> reading a different region of the file. This could take a lot of time
> when many processes need to write and read this file.
>
> On the other hand, The ext4 filesystem and others do not hold the lock
> during buffered reading, which make the ext4 have better performance in
> that case. Therefore, I think it will be fine if we remove the lock in
> xfs, as most applications can handle this situation.

Nope.

This means that XFS loses high level serialisation of incoming IO
against operations like truncate, fallocate, pnfs operations, etc.

We've been through this multiple times before; the solution lies in
doing the work to make buffered writes use shared locking, not
removing shared locking from buffered reads.

A couple of old discussions from the list:

https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
https://lore.kernel.org/linux-xfs/20190404165737.30889-1-amir73il@gmail.com/

There are likely others - you can search for them yourself to get
more background information.

Fundamentally, though, removing locking from the read side is not
the answer to this buffered write IO exclusion problem....

-Dave.
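For context, the high-level serialisation Dave is referring to is the XFS IOLOCK, which is backed by the inode's i_rwsem. A rough sketch of the locking pattern under discussion, paraphrased from fs/xfs/xfs_file.c rather than copied verbatim, looks like this:

/* Simplified paraphrase, not verbatim kernel code: buffered reads take the
 * IOLOCK shared, buffered writes take it exclusive.  The exclusive IOLOCK on
 * the write side is also what serialises I/O against truncate, fallocate and
 * pNFS layout operations. */
static ssize_t
xfs_file_buffered_read(struct kiocb *iocb, struct iov_iter *to)
{
	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
	ssize_t			ret;

	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);	/* readers share */
	if (ret)
		return ret;
	ret = generic_file_read_iter(iocb, to);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
	return ret;
}

/* buffered write path (the full snippet is quoted later in the thread): */
	iolock = XFS_IOLOCK_EXCL;		/* writers exclude everyone */
	ret = xfs_ilock_iocb(iocb, iolock);

The question the rest of the thread circles around is which of these two modes can be relaxed, and under what conditions.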
On 2024/12/27 05:50, Dave Chinner wrote: > On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: >> From: Chi Zhiling <chizhiling@kylinos.cn> >> >> Using an rwsem to protect file data ensures that we can always obtain a >> completed modification. But due to the lock, we need to wait for the >> write process to release the rwsem before we can read it, even if we are >> reading a different region of the file. This could take a lot of time >> when many processes need to write and read this file. >> >> On the other hand, The ext4 filesystem and others do not hold the lock >> during buffered reading, which make the ext4 have better performance in >> that case. Therefore, I think it will be fine if we remove the lock in >> xfs, as most applications can handle this situation. > > Nope. > > This means that XFS loses high level serialisation of incoming IO > against operations like truncate, fallocate, pnfs operations, etc. > > We've been through this multiple times before; the solution lies in > doing the work to make buffered writes use shared locking, not > removing shared locking from buffered reads. You mean using shared locking for buffered reads and writes, right? I think it's a great idea. In theory, write operations can be performed simultaneously if they write to different ranges. So we should track all the ranges we are reading or writing, and check whether the new read or write operations can be performed concurrently with the current operations. Do we have any plans to use shared locking for buffered writes? > > A couple of old discussions from the list: > > https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > https://lore.kernel.org/linux-xfs/20190404165737.30889-1-amir73il@gmail.com/ > > There are likely others - you can search for them yourself to get > more background information. Sorry, I didn't find those discussions earlier. > > Fundamentally, though, removing locking from the read side is not > the answer to this buffered write IO exclusion problem.... > > -Dave. Best regards, Chi Zhiling
On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: > > > On 2024/12/27 05:50, Dave Chinner wrote: > > On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: > > > From: Chi Zhiling <chizhiling@kylinos.cn> > > > > > > Using an rwsem to protect file data ensures that we can always obtain a > > > completed modification. But due to the lock, we need to wait for the > > > write process to release the rwsem before we can read it, even if we are > > > reading a different region of the file. This could take a lot of time > > > when many processes need to write and read this file. > > > > > > On the other hand, The ext4 filesystem and others do not hold the lock > > > during buffered reading, which make the ext4 have better performance in > > > that case. Therefore, I think it will be fine if we remove the lock in > > > xfs, as most applications can handle this situation. > > > > Nope. > > > > This means that XFS loses high level serialisation of incoming IO > > against operations like truncate, fallocate, pnfs operations, etc. > > > > We've been through this multiple times before; the solution lies in > > doing the work to make buffered writes use shared locking, not > > removing shared locking from buffered reads. > > You mean using shared locking for buffered reads and writes, right? > > I think it's a great idea. In theory, write operations can be performed > simultaneously if they write to different ranges. Even if they overlap, the folio locks will prevent concurrent writes to the same range. Now that we have atomic write support as native functionality (i.e. RWF_ATOMIC), we really should not have to care that much about normal buffered IO being atomic. i.e. if the application wants atomic writes, it can now specify that it wants atomic writes and so we can relax the constraints we have on existing IO... > So we should track all the ranges we are reading or writing, > and check whether the new read or write operations can be performed > concurrently with the current operations. That is all discussed in detail in the discussions I linked. > Do we have any plans to use shared locking for buffered writes? All we are waiting for is someone to put the resources into making the changes and testing it properly... -Dave.
On 2024/12/29 06:17, Dave Chinner wrote: > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: >> >> >> On 2024/12/27 05:50, Dave Chinner wrote: >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: >>>> From: Chi Zhiling <chizhiling@kylinos.cn> >>>> >>>> Using an rwsem to protect file data ensures that we can always obtain a >>>> completed modification. But due to the lock, we need to wait for the >>>> write process to release the rwsem before we can read it, even if we are >>>> reading a different region of the file. This could take a lot of time >>>> when many processes need to write and read this file. >>>> >>>> On the other hand, The ext4 filesystem and others do not hold the lock >>>> during buffered reading, which make the ext4 have better performance in >>>> that case. Therefore, I think it will be fine if we remove the lock in >>>> xfs, as most applications can handle this situation. >>> >>> Nope. >>> >>> This means that XFS loses high level serialisation of incoming IO >>> against operations like truncate, fallocate, pnfs operations, etc. >>> >>> We've been through this multiple times before; the solution lies in >>> doing the work to make buffered writes use shared locking, not >>> removing shared locking from buffered reads. >> >> You mean using shared locking for buffered reads and writes, right? >> >> I think it's a great idea. In theory, write operations can be performed >> simultaneously if they write to different ranges. > > Even if they overlap, the folio locks will prevent concurrent writes > to the same range. > > Now that we have atomic write support as native functionality (i.e. > RWF_ATOMIC), we really should not have to care that much about > normal buffered IO being atomic. i.e. if the application wants > atomic writes, it can now specify that it wants atomic writes and so > we can relax the constraints we have on existing IO... Yes, I'm not particularly concerned about whether buffered I/O is atomic. I'm more concerned about the multithreading performance of buffered I/O. Last week, it was mentioned that removing i_rwsem would have some impacts on truncate, fallocate, and PNFS operations. (I'm not familiar with pNFS, so please correct me if I'm wrong.) My understanding is that the current i_rwsem is used to protect both the file's data and its size. Operations like truncate, fallocate, and PNFS use i_rwsem because they modify both the file's data and its size. So, I'm thinking whether it's possible to use i_rwsem to protect only the file's size, without protecting the file's data. So operations that modify the file's size need to be executed sequentially. For example, buffered writes to the EOF, fallocate operations without the "keep size" requirement, and truncate operations, etc, all need to hold an exclusive lock. Other operations require a shared lock because they only need to access the file's size without modifying it. > >> So we should track all the ranges we are reading or writing, >> and check whether the new read or write operations can be performed >> concurrently with the current operations. > > That is all discussed in detail in the discussions I linked. Sorry, I overlooked some details from old discussion last time. It seems that you are not satisfied with the effectiveness of range locks. Best regards, Chi Zhiling
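To make the idea above concrete, here is a purely hypothetical sketch. The function name and placement are invented for illustration, and, as the replies below point out, the block map would also need protecting, so this is incomplete by design: the IOLOCK mode is chosen by whether the operation can change the file size.

/* Hypothetical illustration only, not actual XFS code: take the IOLOCK
 * exclusive only when the buffered write can change i_size; otherwise take
 * it shared and rely on folio locks for per-range serialisation. */
static unsigned int
iolock_mode_for_buffered_write(struct kiocb *iocb, struct iov_iter *from,
			       loff_t isize)
{
	if (iocb->ki_pos + iov_iter_count(from) > isize)
		return XFS_IOLOCK_EXCL;		/* extending write updates i_size */

	return XFS_IOLOCK_SHARED;		/* non-extending overwrite */
}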
On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote: > > > > On 2024/12/29 06:17, Dave Chinner wrote: > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: > >> > >> > >> On 2024/12/27 05:50, Dave Chinner wrote: > >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: > >>>> From: Chi Zhiling <chizhiling@kylinos.cn> > >>>> > >>>> Using an rwsem to protect file data ensures that we can always obtain a > >>>> completed modification. But due to the lock, we need to wait for the > >>>> write process to release the rwsem before we can read it, even if we are > >>>> reading a different region of the file. This could take a lot of time > >>>> when many processes need to write and read this file. > >>>> > >>>> On the other hand, The ext4 filesystem and others do not hold the lock > >>>> during buffered reading, which make the ext4 have better performance in > >>>> that case. Therefore, I think it will be fine if we remove the lock in > >>>> xfs, as most applications can handle this situation. > >>> > >>> Nope. > >>> > >>> This means that XFS loses high level serialisation of incoming IO > >>> against operations like truncate, fallocate, pnfs operations, etc. > >>> > >>> We've been through this multiple times before; the solution lies in > >>> doing the work to make buffered writes use shared locking, not > >>> removing shared locking from buffered reads. > >> > >> You mean using shared locking for buffered reads and writes, right? > >> > >> I think it's a great idea. In theory, write operations can be performed > >> simultaneously if they write to different ranges. > > > > Even if they overlap, the folio locks will prevent concurrent writes > > to the same range. > > > > Now that we have atomic write support as native functionality (i.e. > > RWF_ATOMIC), we really should not have to care that much about > > normal buffered IO being atomic. i.e. if the application wants > > atomic writes, it can now specify that it wants atomic writes and so > > we can relax the constraints we have on existing IO... > > Yes, I'm not particularly concerned about whether buffered I/O is > atomic. I'm more concerned about the multithreading performance of > buffered I/O. Hi Chi, Sorry for joining late, I was on vacation. I am very happy that you have taken on this task and your timing is good, because John Garry just posted his patches for large atomic writes [1]. I need to explain the relation to atomic buffered I/O, because it is not easy to follow it from the old discussions and also some of the discussions about the solution were held in-person during LSFMM2024. Naturally, your interest is improving multithreading performance of buffered I/O, so was mine when I first posted this question [2]. The issue with atomicity of buffered I/O is the xfs has traditionally provided atomicity of write vs. read (a.k.a no torn writes), which is not required by POSIX standard (because POSIX was not written with threads in mind) and is not respected by any other in-tree filesystem. It is obvious that the exclusive rw lock for buffered write hurts performance in the case of mixed read/write on xfs, so the question was - if xfs provides atomic semantics that portable applications cannot rely on, why bother providing these atomic semantics at all? Dave's answer to this question was that there are some legacy applications (database applications IIRC) on production systems that do rely on the fact that xfs provides this semantics and on the prerequisite that they run on xfs. 
However, it was noted that: 1. Those application do not require atomicity for any size of IO, they typically work in I/O size that is larger than block size (e.g. 16K or 64K) and they only require no torn writes for this I/O size 2. Large folios and iomap can usually provide this semantics via folio lock, but application has currently no way of knowing if the semantics are provided or not 3. The upcoming ability of application to opt-in for atomic writes larger than fs block size [1] can be used to facilitate the applications that want to legacy xfs semantics and avoid the need to enforce the legacy semantics all the time for no good reason Disclaimer: this is not a standard way to deal with potentially breaking legacy semantics, because old applications will not usually be rebuilt to opt-in for the old semantics, but the use case in hand is probably so specific, of a specific application that relies on xfs specific semantics that there are currently no objections for trying this solution. [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > > Last week, it was mentioned that removing i_rwsem would have some > impacts on truncate, fallocate, and PNFS operations. > > (I'm not familiar with pNFS, so please correct me if I'm wrong.) You are not wrong. pNFS uses a "layout lease", which requires that the blockmap of the file will not be modified while the lease is held. but I think that block that are not mapped (i.e. holes) can be mapped while the lease is held. > > My understanding is that the current i_rwsem is used to protect both > the file's data and its size. Operations like truncate, fallocate, > and PNFS use i_rwsem because they modify both the file's data and its > size. So, I'm thinking whether it's possible to use i_rwsem to protect > only the file's size, without protecting the file's data. > It also protects the file's blockmap, for example, punch hole does not change the size, but it unmaps blocks from the blockmap, leading to races that could end up reading stale data from disk if the lock wasn't taken. > So operations that modify the file's size need to be executed > sequentially. For example, buffered writes to the EOF, fallocate > operations without the "keep size" requirement, and truncate operations, > etc, all need to hold an exclusive lock. > > Other operations require a shared lock because they only need to access > the file's size without modifying it. > As far as I understand, exclusive lock is not needed for non-extending writes, because it is ok to map new blocks. I guess the need for exclusive lock for extending writes is related to update of file size, but not 100% sure. Anyway, exclusive lock on extending write is widely common in other fs, while exclusive lock for non-extending write is unique to xfs. > > > >> So we should track all the ranges we are reading or writing, > >> and check whether the new read or write operations can be performed > >> concurrently with the current operations. > > > > That is all discussed in detail in the discussions I linked. > > Sorry, I overlooked some details from old discussion last time. > It seems that you are not satisfied with the effectiveness of > range locks. > Correct. This solution was taken off the table. I hope my explanation was correct and clear. If anything else is not clear, please feel free to ask. Thanks, Amir.
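For reference, the opt-in API referred to in [1] looks roughly like the following from the application side. This is a sketch that assumes a v6.11+ kernel and userspace headers that define RWF_ATOMIC (older headers need the definition from <linux/fs.h>); the caller is assumed to have checked statx() with STATX_WRITE_ATOMIC so that the length fits stx_atomic_write_unit_min..max, and, as John notes later in the thread, only direct I/O honours this today.

/* Sketch: opting in to an untorn write with RWF_ATOMIC (direct I/O only for
 * now, so fd is expected to have been opened with O_DIRECT).  Length and
 * alignment must be within the limits advertised by statx(STATX_WRITE_ATOMIC). */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

static ssize_t write_untorn(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = {
		.iov_base = (void *)buf,
		.iov_len  = len,	/* e.g. one 16K or 64K database block */
	};

	/* RWF_ATOMIC: the whole write reaches the file, or none of it does */
	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}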
On Tue, Jan 07, 2025 at 01:13:17PM +0100, Amir Goldstein wrote:
> The issue with atomicity of buffered I/O is the xfs has traditionally
> provided atomicity of write vs. read (a.k.a no torn writes), which is
> not required by POSIX standard (because POSIX was not written with
> threads in mind) and is not respected by any other in-tree filesystem.

That is true for original Posix, but once Posix Threads joined the game
the behavior was and still is required. See "2.9.7 Thread Interactions
with Regular File Operations" here:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html

Now most Linux filesystems ignored that and got away with ignoring the
requirement, but it still exists.
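A minimal user-space sketch of the semantic difference being discussed (file name, iteration count and the 64K size are arbitrary; compile with -pthread): one thread repeatedly overwrites the same region with an all-'a' or all-'b' buffer while another reads it back, and any read that observes a mix of the two bytes is a torn read, which is exactly what 2.9.7 forbids for concurrent read()/write().

/* Torn-read detector sketch: the writer alternates between an all-'a' and an
 * all-'b' 64K buffer at offset 0; the reader flags any read() result that
 * contains both bytes.  Error handling is minimal on purpose. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define IOSZ	(64 * 1024)
#define LOOPS	100000

static int fd;

static void *writer(void *arg)
{
	static char buf[IOSZ];

	for (int i = 0; i < LOOPS; i++) {
		memset(buf, (i & 1) ? 'b' : 'a', IOSZ);
		pwrite(fd, buf, IOSZ, 0);
	}
	return NULL;
}

static void *reader(void *arg)
{
	static char buf[IOSZ];

	for (int i = 0; i < LOOPS; i++) {
		if (pread(fd, buf, IOSZ, 0) != IOSZ)
			continue;	/* file not fully written yet */
		if (memchr(buf, 'a', IOSZ) && memchr(buf, 'b', IOSZ)) {
			printf("torn read at iteration %d\n", i);
			return NULL;
		}
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t w, r;

	fd = open(argc > 1 ? argv[1] : "testfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	close(fd);
	return 0;
}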
On 2025/1/7 20:13, Amir Goldstein wrote: > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote: >> On 2024/12/29 06:17, Dave Chinner wrote: >>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: >>>> On 2024/12/27 05:50, Dave Chinner wrote: >>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: >>>>>> From: Chi Zhiling <chizhiling@kylinos.cn> >>>>>> >>>>>> Using an rwsem to protect file data ensures that we can always obtain a >>>>>> completed modification. But due to the lock, we need to wait for the >>>>>> write process to release the rwsem before we can read it, even if we are >>>>>> reading a different region of the file. This could take a lot of time >>>>>> when many processes need to write and read this file. >>>>>> >>>>>> On the other hand, The ext4 filesystem and others do not hold the lock >>>>>> during buffered reading, which make the ext4 have better performance in >>>>>> that case. Therefore, I think it will be fine if we remove the lock in >>>>>> xfs, as most applications can handle this situation. >>>>> >>>>> Nope. >>>>> >>>>> This means that XFS loses high level serialisation of incoming IO >>>>> against operations like truncate, fallocate, pnfs operations, etc. >>>>> >>>>> We've been through this multiple times before; the solution lies in >>>>> doing the work to make buffered writes use shared locking, not >>>>> removing shared locking from buffered reads. >>>> >>>> You mean using shared locking for buffered reads and writes, right? >>>> >>>> I think it's a great idea. In theory, write operations can be performed >>>> simultaneously if they write to different ranges. >>> >>> Even if they overlap, the folio locks will prevent concurrent writes >>> to the same range. >>> >>> Now that we have atomic write support as native functionality (i.e. >>> RWF_ATOMIC), we really should not have to care that much about >>> normal buffered IO being atomic. i.e. if the application wants >>> atomic writes, it can now specify that it wants atomic writes and so >>> we can relax the constraints we have on existing IO... >> >> Yes, I'm not particularly concerned about whether buffered I/O is >> atomic. I'm more concerned about the multithreading performance of >> buffered I/O. > > Hi Chi, > > Sorry for joining late, I was on vacation. > I am very happy that you have taken on this task and your timing is good, > because John Garry just posted his patches for large atomic writes [1]. I'm glad to have you on board. :) I'm not sure if my understanding is correct, but it seems that our discussion is about multithreading safety, while John's patch is about providing power-failure safety, even though both mention atomicity. > > I need to explain the relation to atomic buffered I/O, because it is not > easy to follow it from the old discussions and also some of the discussions > about the solution were held in-person during LSFMM2024. > > Naturally, your interest is improving multithreading performance of > buffered I/O, so was mine when I first posted this question [2]. > > The issue with atomicity of buffered I/O is the xfs has traditionally > provided atomicity of write vs. read (a.k.a no torn writes), which is > not required by POSIX standard (because POSIX was not written with > threads in mind) and is not respected by any other in-tree filesystem. 
> > It is obvious that the exclusive rw lock for buffered write hurts performance > in the case of mixed read/write on xfs, so the question was - if xfs provides > atomic semantics that portable applications cannot rely on, why bother > providing these atomic semantics at all? Perhaps we can add an option that allows distributions or users to decide whether they need this semantics. I would not hesitate to turn off this semantics on my system when the time comes. > > Dave's answer to this question was that there are some legacy applications > (database applications IIRC) on production systems that do rely on the fact > that xfs provides this semantics and on the prerequisite that they run on xfs. > > However, it was noted that: > 1. Those application do not require atomicity for any size of IO, they > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > and they only require no torn writes for this I/O size > 2. Large folios and iomap can usually provide this semantics via folio lock, > but application has currently no way of knowing if the semantics are > provided or not To be honest, it would be best if the folio lock could provide such semantics, as it would not cause any potential problems for the application, and we have hope to achieve concurrent writes. However, I am not sure if this is easy to implement and will not cause other problems. > 3. The upcoming ability of application to opt-in for atomic writes larger > than fs block size [1] can be used to facilitate the applications that > want to legacy xfs semantics and avoid the need to enforce the legacy > semantics all the time for no good reason > > Disclaimer: this is not a standard way to deal with potentially breaking > legacy semantics, because old applications will not usually be rebuilt > to opt-in for the old semantics, but the use case in hand is probably > so specific, of a specific application that relies on xfs specific semantics > that there are currently no objections for trying this solution. > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > >> >> Last week, it was mentioned that removing i_rwsem would have some >> impacts on truncate, fallocate, and PNFS operations. >> >> (I'm not familiar with pNFS, so please correct me if I'm wrong.) > > You are not wrong. pNFS uses a "layout lease", which requires > that the blockmap of the file will not be modified while the lease is held. > but I think that block that are not mapped (i.e. holes) can be mapped > while the lease is held. > >> >> My understanding is that the current i_rwsem is used to protect both >> the file's data and its size. Operations like truncate, fallocate, >> and PNFS use i_rwsem because they modify both the file's data and its >> size. So, I'm thinking whether it's possible to use i_rwsem to protect >> only the file's size, without protecting the file's data. >> > > It also protects the file's blockmap, for example, punch hole > does not change the size, but it unmaps blocks from the blockmap, > leading to races that could end up reading stale data from disk > if the lock wasn't taken. > >> So operations that modify the file's size need to be executed >> sequentially. For example, buffered writes to the EOF, fallocate >> operations without the "keep size" requirement, and truncate operations, >> etc, all need to hold an exclusive lock. 
>> >> Other operations require a shared lock because they only need to access >> the file's size without modifying it. >> > > As far as I understand, exclusive lock is not needed for non-extending > writes, because it is ok to map new blocks. > I guess the need for exclusive lock for extending writes is related to > update of file size, but not 100% sure. > Anyway, exclusive lock on extending write is widely common in other fs, > while exclusive lock for non-extending write is unique to xfs. > >>> >>>> So we should track all the ranges we are reading or writing, >>>> and check whether the new read or write operations can be performed >>>> concurrently with the current operations. >>> >>> That is all discussed in detail in the discussions I linked. >> >> Sorry, I overlooked some details from old discussion last time. >> It seems that you are not satisfied with the effectiveness of >> range locks. >> > > Correct. This solution was taken off the table. > > I hope my explanation was correct and clear. > If anything else is not clear, please feel free to ask. > I think your explanation is very clear. Thanks, Chi Zhiling
On Wed, Jan 8, 2025 at 8:43 AM Chi Zhiling <chizhiling@163.com> wrote: > > On 2025/1/7 20:13, Amir Goldstein wrote: > > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote: > >> On 2024/12/29 06:17, Dave Chinner wrote: > >>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: > >>>> On 2024/12/27 05:50, Dave Chinner wrote: > >>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: > >>>>>> From: Chi Zhiling <chizhiling@kylinos.cn> > >>>>>> > >>>>>> Using an rwsem to protect file data ensures that we can always obtain a > >>>>>> completed modification. But due to the lock, we need to wait for the > >>>>>> write process to release the rwsem before we can read it, even if we are > >>>>>> reading a different region of the file. This could take a lot of time > >>>>>> when many processes need to write and read this file. > >>>>>> > >>>>>> On the other hand, The ext4 filesystem and others do not hold the lock > >>>>>> during buffered reading, which make the ext4 have better performance in > >>>>>> that case. Therefore, I think it will be fine if we remove the lock in > >>>>>> xfs, as most applications can handle this situation. > >>>>> > >>>>> Nope. > >>>>> > >>>>> This means that XFS loses high level serialisation of incoming IO > >>>>> against operations like truncate, fallocate, pnfs operations, etc. > >>>>> > >>>>> We've been through this multiple times before; the solution lies in > >>>>> doing the work to make buffered writes use shared locking, not > >>>>> removing shared locking from buffered reads. > >>>> > >>>> You mean using shared locking for buffered reads and writes, right? > >>>> > >>>> I think it's a great idea. In theory, write operations can be performed > >>>> simultaneously if they write to different ranges. > >>> > >>> Even if they overlap, the folio locks will prevent concurrent writes > >>> to the same range. > >>> > >>> Now that we have atomic write support as native functionality (i.e. > >>> RWF_ATOMIC), we really should not have to care that much about > >>> normal buffered IO being atomic. i.e. if the application wants > >>> atomic writes, it can now specify that it wants atomic writes and so > >>> we can relax the constraints we have on existing IO... > >> > >> Yes, I'm not particularly concerned about whether buffered I/O is > >> atomic. I'm more concerned about the multithreading performance of > >> buffered I/O. > > > > Hi Chi, > > > > Sorry for joining late, I was on vacation. > > I am very happy that you have taken on this task and your timing is good, > > because John Garry just posted his patches for large atomic writes [1]. > > I'm glad to have you on board. :) > > I'm not sure if my understanding is correct, but it seems that our > discussion is about multithreading safety, while John's patch is about > providing power-failure safety, even though both mention atomicity. > You are right about the *motivation* of John's work. But as a by-product of his work, we get an API to opt-in for atomic writes and we can piggy back this opt-in API to say that whoever wants the legacy behavior can use the new API to get both power failure safety and multithreading safety. > > > > I need to explain the relation to atomic buffered I/O, because it is not > > easy to follow it from the old discussions and also some of the discussions > > about the solution were held in-person during LSFMM2024. > > > > Naturally, your interest is improving multithreading performance of > > buffered I/O, so was mine when I first posted this question [2]. 
> > > > The issue with atomicity of buffered I/O is the xfs has traditionally > > provided atomicity of write vs. read (a.k.a no torn writes), which is > > not required by POSIX standard (because POSIX was not written with > > threads in mind) and is not respected by any other in-tree filesystem. > > > > It is obvious that the exclusive rw lock for buffered write hurts performance > > in the case of mixed read/write on xfs, so the question was - if xfs provides > > atomic semantics that portable applications cannot rely on, why bother > > providing these atomic semantics at all? > > Perhaps we can add an option that allows distributions or users to > decide whether they need this semantics. I would not hesitate to > turn off this semantics on my system when the time comes. > Yes, we can, but we do not want to do it unless we have to. 99% of the users do not want the legacy semantics, so it would be better if they get the best performance without having to opt-in to get it. > > > > Dave's answer to this question was that there are some legacy applications > > (database applications IIRC) on production systems that do rely on the fact > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > However, it was noted that: > > 1. Those application do not require atomicity for any size of IO, they > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > and they only require no torn writes for this I/O size > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > but application has currently no way of knowing if the semantics are > > provided or not > > To be honest, it would be best if the folio lock could provide such > semantics, as it would not cause any potential problems for the > application, and we have hope to achieve concurrent writes. > > However, I am not sure if this is easy to implement and will not cause > other problems. > > > 3. The upcoming ability of application to opt-in for atomic writes larger > > than fs block size [1] can be used to facilitate the applications that > > want to legacy xfs semantics and avoid the need to enforce the legacy > > semantics all the time for no good reason > > > > Disclaimer: this is not a standard way to deal with potentially breaking > > legacy semantics, because old applications will not usually be rebuilt > > to opt-in for the old semantics, but the use case in hand is probably > > so specific, of a specific application that relies on xfs specific semantics > > that there are currently no objections for trying this solution. > > > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ > > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > > > >> > >> Last week, it was mentioned that removing i_rwsem would have some > >> impacts on truncate, fallocate, and PNFS operations. > >> > >> (I'm not familiar with pNFS, so please correct me if I'm wrong.) > > > > You are not wrong. pNFS uses a "layout lease", which requires > > that the blockmap of the file will not be modified while the lease is held. > > but I think that block that are not mapped (i.e. holes) can be mapped > > while the lease is held. > > > >> > >> My understanding is that the current i_rwsem is used to protect both > >> the file's data and its size. Operations like truncate, fallocate, > >> and PNFS use i_rwsem because they modify both the file's data and its > >> size. 
So, I'm thinking whether it's possible to use i_rwsem to protect > >> only the file's size, without protecting the file's data. > >> > > > > It also protects the file's blockmap, for example, punch hole > > does not change the size, but it unmaps blocks from the blockmap, > > leading to races that could end up reading stale data from disk > > if the lock wasn't taken. > > > >> So operations that modify the file's size need to be executed > >> sequentially. For example, buffered writes to the EOF, fallocate > >> operations without the "keep size" requirement, and truncate operations, > >> etc, all need to hold an exclusive lock. > >> > >> Other operations require a shared lock because they only need to access > >> the file's size without modifying it. > >> > > > > As far as I understand, exclusive lock is not needed for non-extending > > writes, because it is ok to map new blocks. > > I guess the need for exclusive lock for extending writes is related to > > update of file size, but not 100% sure. > > Anyway, exclusive lock on extending write is widely common in other fs, > > while exclusive lock for non-extending write is unique to xfs. > > > >>> > >>>> So we should track all the ranges we are reading or writing, > >>>> and check whether the new read or write operations can be performed > >>>> concurrently with the current operations. > >>> > >>> That is all discussed in detail in the discussions I linked. > >> > >> Sorry, I overlooked some details from old discussion last time. > >> It seems that you are not satisfied with the effectiveness of > >> range locks. > >> > > > > Correct. This solution was taken off the table. > > > > I hope my explanation was correct and clear. > > If anything else is not clear, please feel free to ask. > > > > I think your explanation is very clear. One more thing I should mention. You do not need to wait for atomic large writes patches to land. There is nothing stopping you from implementing the suggested solution based on the xfs code already in master (v6.13-rc1), which has support for the RWF_ATOMIC flag for writes. It just means that the API will not be usable for applications that want to do IO larger than block size, but concurrent read/write performance of 4K IO could be improved already. It's possible that all you need to do is: diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index c488ae26b23d0..2542f15496488 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -777,9 +777,10 @@ xfs_file_buffered_write( ssize_t ret; bool cleared_space = false; unsigned int iolock; + bool atomic_write = iocb->ki_flags & IOCB_ATOMIC; write_retry: - iolock = XFS_IOLOCK_EXCL; + iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL; ret = xfs_ilock_iocb(iocb, iolock); -- xfs_file_write_checks() afterwards already takes care of promoting XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes. It is possible that XFS_IOLOCK_EXCL could be immediately demoted back to XFS_IOLOCK_SHARED for atomic_writes as done in xfs_file_dio_write_aligned(). TBH, I am not sure which blockdevs support 4K atomic writes that could be used to test this. John, can you share your test setup instructions for atomic writes? Thanks, Amir.
On Wed, Jan 8, 2025 at 12:33 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Jan 8, 2025 at 8:43 AM Chi Zhiling <chizhiling@163.com> wrote: > > > > On 2025/1/7 20:13, Amir Goldstein wrote: > > > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote: > > >> On 2024/12/29 06:17, Dave Chinner wrote: > > >>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: > > >>>> On 2024/12/27 05:50, Dave Chinner wrote: > > >>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: > > >>>>>> From: Chi Zhiling <chizhiling@kylinos.cn> > > >>>>>> > > >>>>>> Using an rwsem to protect file data ensures that we can always obtain a > > >>>>>> completed modification. But due to the lock, we need to wait for the > > >>>>>> write process to release the rwsem before we can read it, even if we are > > >>>>>> reading a different region of the file. This could take a lot of time > > >>>>>> when many processes need to write and read this file. > > >>>>>> > > >>>>>> On the other hand, The ext4 filesystem and others do not hold the lock > > >>>>>> during buffered reading, which make the ext4 have better performance in > > >>>>>> that case. Therefore, I think it will be fine if we remove the lock in > > >>>>>> xfs, as most applications can handle this situation. > > >>>>> > > >>>>> Nope. > > >>>>> > > >>>>> This means that XFS loses high level serialisation of incoming IO > > >>>>> against operations like truncate, fallocate, pnfs operations, etc. > > >>>>> > > >>>>> We've been through this multiple times before; the solution lies in > > >>>>> doing the work to make buffered writes use shared locking, not > > >>>>> removing shared locking from buffered reads. > > >>>> > > >>>> You mean using shared locking for buffered reads and writes, right? > > >>>> > > >>>> I think it's a great idea. In theory, write operations can be performed > > >>>> simultaneously if they write to different ranges. > > >>> > > >>> Even if they overlap, the folio locks will prevent concurrent writes > > >>> to the same range. > > >>> > > >>> Now that we have atomic write support as native functionality (i.e. > > >>> RWF_ATOMIC), we really should not have to care that much about > > >>> normal buffered IO being atomic. i.e. if the application wants > > >>> atomic writes, it can now specify that it wants atomic writes and so > > >>> we can relax the constraints we have on existing IO... > > >> > > >> Yes, I'm not particularly concerned about whether buffered I/O is > > >> atomic. I'm more concerned about the multithreading performance of > > >> buffered I/O. > > > > > > Hi Chi, > > > > > > Sorry for joining late, I was on vacation. > > > I am very happy that you have taken on this task and your timing is good, > > > because John Garry just posted his patches for large atomic writes [1]. > > > > I'm glad to have you on board. :) > > > > I'm not sure if my understanding is correct, but it seems that our > > discussion is about multithreading safety, while John's patch is about > > providing power-failure safety, even though both mention atomicity. > > > > You are right about the *motivation* of John's work. > But as a by-product of his work, we get an API to opt-in for > atomic writes and we can piggy back this opt-in API to say > that whoever wants the legacy behavior can use the new API > to get both power failure safety and multithreading safety. 
> > > > > > > I need to explain the relation to atomic buffered I/O, because it is not > > > easy to follow it from the old discussions and also some of the discussions > > > about the solution were held in-person during LSFMM2024. > > > > > > Naturally, your interest is improving multithreading performance of > > > buffered I/O, so was mine when I first posted this question [2]. > > > > > > The issue with atomicity of buffered I/O is the xfs has traditionally > > > provided atomicity of write vs. read (a.k.a no torn writes), which is > > > not required by POSIX standard (because POSIX was not written with > > > threads in mind) and is not respected by any other in-tree filesystem. > > > > > > It is obvious that the exclusive rw lock for buffered write hurts performance > > > in the case of mixed read/write on xfs, so the question was - if xfs provides > > > atomic semantics that portable applications cannot rely on, why bother > > > providing these atomic semantics at all? > > > > Perhaps we can add an option that allows distributions or users to > > decide whether they need this semantics. I would not hesitate to > > turn off this semantics on my system when the time comes. > > > > Yes, we can, but we do not want to do it unless we have to. > 99% of the users do not want the legacy semantics, so it would > be better if they get the best performance without having to opt-in to get it. > > > > > > > Dave's answer to this question was that there are some legacy applications > > > (database applications IIRC) on production systems that do rely on the fact > > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > > > However, it was noted that: > > > 1. Those application do not require atomicity for any size of IO, they > > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > > and they only require no torn writes for this I/O size > > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > > but application has currently no way of knowing if the semantics are > > > provided or not > > > > To be honest, it would be best if the folio lock could provide such > > semantics, as it would not cause any potential problems for the > > application, and we have hope to achieve concurrent writes. > > > > However, I am not sure if this is easy to implement and will not cause > > other problems. > > > > > 3. The upcoming ability of application to opt-in for atomic writes larger > > > than fs block size [1] can be used to facilitate the applications that > > > want to legacy xfs semantics and avoid the need to enforce the legacy > > > semantics all the time for no good reason > > > > > > Disclaimer: this is not a standard way to deal with potentially breaking > > > legacy semantics, because old applications will not usually be rebuilt > > > to opt-in for the old semantics, but the use case in hand is probably > > > so specific, of a specific application that relies on xfs specific semantics > > > that there are currently no objections for trying this solution. > > > > > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ > > > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > > > > > >> > > >> Last week, it was mentioned that removing i_rwsem would have some > > >> impacts on truncate, fallocate, and PNFS operations. > > >> > > >> (I'm not familiar with pNFS, so please correct me if I'm wrong.) > > > > > > You are not wrong. 
pNFS uses a "layout lease", which requires > > > that the blockmap of the file will not be modified while the lease is held. > > > but I think that block that are not mapped (i.e. holes) can be mapped > > > while the lease is held. > > > > > >> > > >> My understanding is that the current i_rwsem is used to protect both > > >> the file's data and its size. Operations like truncate, fallocate, > > >> and PNFS use i_rwsem because they modify both the file's data and its > > >> size. So, I'm thinking whether it's possible to use i_rwsem to protect > > >> only the file's size, without protecting the file's data. > > >> > > > > > > It also protects the file's blockmap, for example, punch hole > > > does not change the size, but it unmaps blocks from the blockmap, > > > leading to races that could end up reading stale data from disk > > > if the lock wasn't taken. > > > > > >> So operations that modify the file's size need to be executed > > >> sequentially. For example, buffered writes to the EOF, fallocate > > >> operations without the "keep size" requirement, and truncate operations, > > >> etc, all need to hold an exclusive lock. > > >> > > >> Other operations require a shared lock because they only need to access > > >> the file's size without modifying it. > > >> > > > > > > As far as I understand, exclusive lock is not needed for non-extending > > > writes, because it is ok to map new blocks. > > > I guess the need for exclusive lock for extending writes is related to > > > update of file size, but not 100% sure. > > > Anyway, exclusive lock on extending write is widely common in other fs, > > > while exclusive lock for non-extending write is unique to xfs. > > > > > >>> > > >>>> So we should track all the ranges we are reading or writing, > > >>>> and check whether the new read or write operations can be performed > > >>>> concurrently with the current operations. > > >>> > > >>> That is all discussed in detail in the discussions I linked. > > >> > > >> Sorry, I overlooked some details from old discussion last time. > > >> It seems that you are not satisfied with the effectiveness of > > >> range locks. > > >> > > > > > > Correct. This solution was taken off the table. > > > > > > I hope my explanation was correct and clear. > > > If anything else is not clear, please feel free to ask. > > > > > > > I think your explanation is very clear. > > One more thing I should mention. > You do not need to wait for atomic large writes patches to land. > There is nothing stopping you from implementing the suggested > solution based on the xfs code already in master (v6.13-rc1), > which has support for the RWF_ATOMIC flag for writes. > > It just means that the API will not be usable for applications that > want to do IO larger than block size, but concurrent read/write > performance of 4K IO could be improved already. > > It's possible that all you need to do is: > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index c488ae26b23d0..2542f15496488 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -777,9 +777,10 @@ xfs_file_buffered_write( > ssize_t ret; > bool cleared_space = false; > unsigned int iolock; > + bool atomic_write = iocb->ki_flags & IOCB_ATOMIC; > > write_retry: > - iolock = XFS_IOLOCK_EXCL; > + iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL; It's the other way around of course :) and did not state the obvious - this mock patch is NOT TESTED! Thanks, Amir. 
> ret = xfs_ilock_iocb(iocb, iolock); > -- > > xfs_file_write_checks() afterwards already takes care of promoting > XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes. > > It is possible that XFS_IOLOCK_EXCL could be immediately demoted > back to XFS_IOLOCK_SHARED for atomic_writes as done in > xfs_file_dio_write_aligned(). > > TBH, I am not sure which blockdevs support 4K atomic writes that could > be used to test this. > > John, can you share your test setup instructions for atomic writes? > > Thanks, > Amir.
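Putting the correction above together with the quoted snippet, the intended (and explicitly untested) form is the one below; note also John's reply that follows, pointing out that IOCB_ATOMIC is not currently honoured for buffered I/O at all.

/* Corrected, still untested sketch of the mock patch: a writer that opted in
 * to atomic semantics keeps the exclusive IOLOCK, everyone else now takes it
 * shared (and xfs_file_write_checks() still promotes to exclusive for
 * extending writes). */
	bool	atomic_write = iocb->ki_flags & IOCB_ATOMIC;

write_retry:
	iolock = atomic_write ? XFS_IOLOCK_EXCL : XFS_IOLOCK_SHARED;
	ret = xfs_ilock_iocb(iocb, iolock);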
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c488ae26b23d0..2542f15496488 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -777,9 +777,10 @@ xfs_file_buffered_write(
>         ssize_t         ret;
>         bool            cleared_space = false;
>         unsigned int    iolock;
> +       bool            atomic_write = iocb->ki_flags & IOCB_ATOMIC;
>
>  write_retry:
> -       iolock = XFS_IOLOCK_EXCL;
> +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;
>         ret = xfs_ilock_iocb(iocb, iolock);
> --
>
> xfs_file_write_checks() afterwards already takes care of promoting
> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
>
> It is possible that XFS_IOLOCK_EXCL could be immediately demoted
> back to XFS_IOLOCK_SHARED for atomic_writes as done in
> xfs_file_dio_write_aligned().
>
> TBH, I am not sure which blockdevs support 4K atomic writes that could
> be used to test this.
>
> John, can you share your test setup instructions for atomic writes?

Please note that IOCB_ATOMIC is not supported for buffered IO, so we
can't do this - we only support direct IO today.

And supporting buffered IO has its challenges; how to handle overlapping
atomic writes of differing sizes sitting in the page cache is the main
issue which comes to mind.

Thanks,
John
On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: > On 2025/1/7 20:13, Amir Goldstein wrote: > > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote: > > > On 2024/12/29 06:17, Dave Chinner wrote: > > > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: > > > > > On 2024/12/27 05:50, Dave Chinner wrote: > > > > > > On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: > > > > > > > From: Chi Zhiling <chizhiling@kylinos.cn> > > > > > > > > > > > > > > Using an rwsem to protect file data ensures that we can always obtain a > > > > > > > completed modification. But due to the lock, we need to wait for the > > > > > > > write process to release the rwsem before we can read it, even if we are > > > > > > > reading a different region of the file. This could take a lot of time > > > > > > > when many processes need to write and read this file. > > > > > > > > > > > > > > On the other hand, The ext4 filesystem and others do not hold the lock > > > > > > > during buffered reading, which make the ext4 have better performance in > > > > > > > that case. Therefore, I think it will be fine if we remove the lock in > > > > > > > xfs, as most applications can handle this situation. > > > > > > > > > > > > Nope. > > > > > > > > > > > > This means that XFS loses high level serialisation of incoming IO > > > > > > against operations like truncate, fallocate, pnfs operations, etc. > > > > > > > > > > > > We've been through this multiple times before; the solution lies in > > > > > > doing the work to make buffered writes use shared locking, not > > > > > > removing shared locking from buffered reads. > > > > > > > > > > You mean using shared locking for buffered reads and writes, right? > > > > > > > > > > I think it's a great idea. In theory, write operations can be performed > > > > > simultaneously if they write to different ranges. > > > > > > > > Even if they overlap, the folio locks will prevent concurrent writes > > > > to the same range. > > > > > > > > Now that we have atomic write support as native functionality (i.e. > > > > RWF_ATOMIC), we really should not have to care that much about > > > > normal buffered IO being atomic. i.e. if the application wants > > > > atomic writes, it can now specify that it wants atomic writes and so > > > > we can relax the constraints we have on existing IO... > > > > > > Yes, I'm not particularly concerned about whether buffered I/O is > > > atomic. I'm more concerned about the multithreading performance of > > > buffered I/O. > > > > Hi Chi, > > > > Sorry for joining late, I was on vacation. > > I am very happy that you have taken on this task and your timing is good, > > because John Garry just posted his patches for large atomic writes [1]. > > I'm glad to have you on board. :) > > I'm not sure if my understanding is correct, but it seems that our > discussion is about multithreading safety, while John's patch is about > providing power-failure safety, even though both mention atomicity. > > > > > I need to explain the relation to atomic buffered I/O, because it is not > > easy to follow it from the old discussions and also some of the discussions > > about the solution were held in-person during LSFMM2024. > > > > Naturally, your interest is improving multithreading performance of > > buffered I/O, so was mine when I first posted this question [2]. > > > > The issue with atomicity of buffered I/O is the xfs has traditionally > > provided atomicity of write vs. 
read (a.k.a no torn writes), which is > > not required by POSIX standard (because POSIX was not written with > > threads in mind) and is not respected by any other in-tree filesystem. > > > > It is obvious that the exclusive rw lock for buffered write hurts performance > > in the case of mixed read/write on xfs, so the question was - if xfs provides > > atomic semantics that portable applications cannot rely on, why bother > > providing these atomic semantics at all? > > Perhaps we can add an option that allows distributions or users to > decide whether they need this semantics. I would not hesitate to > turn off this semantics on my system when the time comes. > > > > > Dave's answer to this question was that there are some legacy applications > > (database applications IIRC) on production systems that do rely on the fact > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > However, it was noted that: > > 1. Those application do not require atomicity for any size of IO, they > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > and they only require no torn writes for this I/O size > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > but application has currently no way of knowing if the semantics are > > provided or not > > To be honest, it would be best if the folio lock could provide such > semantics, as it would not cause any potential problems for the > application, and we have hope to achieve concurrent writes. > > However, I am not sure if this is easy to implement and will not cause > other problems. Assuming we're not abandoning POSIX "Thread Interactions with Regular File Operations", you can't use the folio lock for coordination, for several reasons: a) Apps can't directly control the size of the folio in the page cache b) The folio size can (theoretically) change underneath the program at any time (reclaim can take your large folio and the next read gets a smaller folio) c) If your write crosses folios, you've just crossed a synchronization boundary and all bets are off, though all the other filesystems behave this way and there seem not to be complaints d) If you try to "guarantee" folio granularity by messing with min/max folio size, you run the risk of ENOMEM if the base pages get fragmented I think that's why Dave suggested range locks as the correct solution to this; though it is a pity that so far nobody has come up with a performant implementation. --D > > 3. The upcoming ability of application to opt-in for atomic writes larger > > than fs block size [1] can be used to facilitate the applications that > > want to legacy xfs semantics and avoid the need to enforce the legacy > > semantics all the time for no good reason > > > > Disclaimer: this is not a standard way to deal with potentially breaking > > legacy semantics, because old applications will not usually be rebuilt > > to opt-in for the old semantics, but the use case in hand is probably > > so specific, of a specific application that relies on xfs specific semantics > > that there are currently no objections for trying this solution. > > > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ > > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > > > > > > > > Last week, it was mentioned that removing i_rwsem would have some > > > impacts on truncate, fallocate, and PNFS operations. 
> > > > > > (I'm not familiar with pNFS, so please correct me if I'm wrong.) > > > > You are not wrong. pNFS uses a "layout lease", which requires > > that the blockmap of the file will not be modified while the lease is held. > > but I think that block that are not mapped (i.e. holes) can be mapped > > while the lease is held. > > > > > > > > My understanding is that the current i_rwsem is used to protect both > > > the file's data and its size. Operations like truncate, fallocate, > > > and PNFS use i_rwsem because they modify both the file's data and its > > > size. So, I'm thinking whether it's possible to use i_rwsem to protect > > > only the file's size, without protecting the file's data. > > > > > > > It also protects the file's blockmap, for example, punch hole > > does not change the size, but it unmaps blocks from the blockmap, > > leading to races that could end up reading stale data from disk > > if the lock wasn't taken. > > > > > So operations that modify the file's size need to be executed > > > sequentially. For example, buffered writes to the EOF, fallocate > > > operations without the "keep size" requirement, and truncate operations, > > > etc, all need to hold an exclusive lock. > > > > > > Other operations require a shared lock because they only need to access > > > the file's size without modifying it. > > > > > > > As far as I understand, exclusive lock is not needed for non-extending > > writes, because it is ok to map new blocks. > > I guess the need for exclusive lock for extending writes is related to > > update of file size, but not 100% sure. > > Anyway, exclusive lock on extending write is widely common in other fs, > > while exclusive lock for non-extending write is unique to xfs. > > > > > > > > > > > So we should track all the ranges we are reading or writing, > > > > > and check whether the new read or write operations can be performed > > > > > concurrently with the current operations. > > > > > > > > That is all discussed in detail in the discussions I linked. > > > > > > Sorry, I overlooked some details from old discussion last time. > > > It seems that you are not satisfied with the effectiveness of > > > range locks. > > > > > > > Correct. This solution was taken off the table. > > > > I hope my explanation was correct and clear. > > If anything else is not clear, please feel free to ask. > > > > I think your explanation is very clear. > > Thanks, > Chi Zhiling > >
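Point (c) falls out of the shape of the buffered write path: the page cache is updated one folio at a time, with the folio lock held only while that folio is being copied into. A heavily simplified paraphrase of that loop follows; the helper names are placeholders for illustration, not the real iomap functions.

/* Paraphrased write loop, placeholder helper names: each folio is locked,
 * filled from the user buffer, then unlocked before the next one, so a write
 * spanning two folios is two independently-locked updates and a concurrent
 * shared-lock reader could observe the file in between them. */
while (iov_iter_count(iter)) {
	folio = get_and_lock_folio(mapping, pos, &len);	/* folio lock taken */
	copied = copy_from_user_to_folio(folio, pos, len, iter);
	dirty_and_unlock_folio(folio, pos, copied);	/* folio lock dropped */
	pos += copied;
}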
On 2025/1/8 19:33, Amir Goldstein wrote: > On Wed, Jan 8, 2025 at 8:43 AM Chi Zhiling <chizhiling@163.com> wrote: >> On 2025/1/7 20:13, Amir Goldstein wrote: >>> On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote: >>>> On 2024/12/29 06:17, Dave Chinner wrote: >>>>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: >>>>>> On 2024/12/27 05:50, Dave Chinner wrote: >>>>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: >>>>>>>> From: Chi Zhiling <chizhiling@kylinos.cn> >>>>>>>> >>>>>>>> Using an rwsem to protect file data ensures that we can always obtain a >>>>>>>> completed modification. But due to the lock, we need to wait for the >>>>>>>> write process to release the rwsem before we can read it, even if we are >>>>>>>> reading a different region of the file. This could take a lot of time >>>>>>>> when many processes need to write and read this file. >>>>>>>> >>>>>>>> On the other hand, The ext4 filesystem and others do not hold the lock >>>>>>>> during buffered reading, which make the ext4 have better performance in >>>>>>>> that case. Therefore, I think it will be fine if we remove the lock in >>>>>>>> xfs, as most applications can handle this situation. >>>>>>> >>>>>>> Nope. >>>>>>> >>>>>>> This means that XFS loses high level serialisation of incoming IO >>>>>>> against operations like truncate, fallocate, pnfs operations, etc. >>>>>>> >>>>>>> We've been through this multiple times before; the solution lies in >>>>>>> doing the work to make buffered writes use shared locking, not >>>>>>> removing shared locking from buffered reads. >>>>>> >>>>>> You mean using shared locking for buffered reads and writes, right? >>>>>> >>>>>> I think it's a great idea. In theory, write operations can be performed >>>>>> simultaneously if they write to different ranges. >>>>> >>>>> Even if they overlap, the folio locks will prevent concurrent writes >>>>> to the same range. >>>>> >>>>> Now that we have atomic write support as native functionality (i.e. >>>>> RWF_ATOMIC), we really should not have to care that much about >>>>> normal buffered IO being atomic. i.e. if the application wants >>>>> atomic writes, it can now specify that it wants atomic writes and so >>>>> we can relax the constraints we have on existing IO... >>>> >>>> Yes, I'm not particularly concerned about whether buffered I/O is >>>> atomic. I'm more concerned about the multithreading performance of >>>> buffered I/O. >>> >>> Hi Chi, >>> >>> Sorry for joining late, I was on vacation. >>> I am very happy that you have taken on this task and your timing is good, >>> because John Garry just posted his patches for large atomic writes [1]. >> >> I'm glad to have you on board. :) >> >> I'm not sure if my understanding is correct, but it seems that our >> discussion is about multithreading safety, while John's patch is about >> providing power-failure safety, even though both mention atomicity. >> > > You are right about the *motivation* of John's work. > But as a by-product of his work, we get an API to opt-in for > atomic writes and we can piggy back this opt-in API to say > that whoever wants the legacy behavior can use the new API > to get both power failure safety and multithreading safety. Okay, I understand what you mean. > >>> >>> I need to explain the relation to atomic buffered I/O, because it is not >>> easy to follow it from the old discussions and also some of the discussions >>> about the solution were held in-person during LSFMM2024. 
>>> >>> Naturally, your interest is improving multithreading performance of >>> buffered I/O, so was mine when I first posted this question [2]. >>> >>> The issue with atomicity of buffered I/O is the xfs has traditionally >>> provided atomicity of write vs. read (a.k.a no torn writes), which is >>> not required by POSIX standard (because POSIX was not written with >>> threads in mind) and is not respected by any other in-tree filesystem. >>> >>> It is obvious that the exclusive rw lock for buffered write hurts performance >>> in the case of mixed read/write on xfs, so the question was - if xfs provides >>> atomic semantics that portable applications cannot rely on, why bother >>> providing these atomic semantics at all? >> >> Perhaps we can add an option that allows distributions or users to >> decide whether they need this semantics. I would not hesitate to >> turn off this semantics on my system when the time comes. >> > > Yes, we can, but we do not want to do it unless we have to. > 99% of the users do not want the legacy semantics, so it would > be better if they get the best performance without having to opt-in to get it. > >>> >>> Dave's answer to this question was that there are some legacy applications >>> (database applications IIRC) on production systems that do rely on the fact >>> that xfs provides this semantics and on the prerequisite that they run on xfs. >>> >>> However, it was noted that: >>> 1. Those application do not require atomicity for any size of IO, they >>> typically work in I/O size that is larger than block size (e.g. 16K or 64K) >>> and they only require no torn writes for this I/O size >>> 2. Large folios and iomap can usually provide this semantics via folio lock, >>> but application has currently no way of knowing if the semantics are >>> provided or not >> >> To be honest, it would be best if the folio lock could provide such >> semantics, as it would not cause any potential problems for the >> application, and we have hope to achieve concurrent writes. >> >> However, I am not sure if this is easy to implement and will not cause >> other problems. >> >>> 3. The upcoming ability of application to opt-in for atomic writes larger >>> than fs block size [1] can be used to facilitate the applications that >>> want to legacy xfs semantics and avoid the need to enforce the legacy >>> semantics all the time for no good reason >>> >>> Disclaimer: this is not a standard way to deal with potentially breaking >>> legacy semantics, because old applications will not usually be rebuilt >>> to opt-in for the old semantics, but the use case in hand is probably >>> so specific, of a specific application that relies on xfs specific semantics >>> that there are currently no objections for trying this solution. >>> >>> [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ >>> [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ >>> >>>> >>>> Last week, it was mentioned that removing i_rwsem would have some >>>> impacts on truncate, fallocate, and PNFS operations. >>>> >>>> (I'm not familiar with pNFS, so please correct me if I'm wrong.) >>> >>> You are not wrong. pNFS uses a "layout lease", which requires >>> that the blockmap of the file will not be modified while the lease is held. >>> but I think that block that are not mapped (i.e. holes) can be mapped >>> while the lease is held. 
>>> >>>> >>>> My understanding is that the current i_rwsem is used to protect both >>>> the file's data and its size. Operations like truncate, fallocate, >>>> and PNFS use i_rwsem because they modify both the file's data and its >>>> size. So, I'm thinking whether it's possible to use i_rwsem to protect >>>> only the file's size, without protecting the file's data. >>>> >>> >>> It also protects the file's blockmap, for example, punch hole >>> does not change the size, but it unmaps blocks from the blockmap, >>> leading to races that could end up reading stale data from disk >>> if the lock wasn't taken. >>> >>>> So operations that modify the file's size need to be executed >>>> sequentially. For example, buffered writes to the EOF, fallocate >>>> operations without the "keep size" requirement, and truncate operations, >>>> etc, all need to hold an exclusive lock. >>>> >>>> Other operations require a shared lock because they only need to access >>>> the file's size without modifying it. >>>> >>> >>> As far as I understand, exclusive lock is not needed for non-extending >>> writes, because it is ok to map new blocks. >>> I guess the need for exclusive lock for extending writes is related to >>> update of file size, but not 100% sure. >>> Anyway, exclusive lock on extending write is widely common in other fs, >>> while exclusive lock for non-extending write is unique to xfs. >>> >>>>> >>>>>> So we should track all the ranges we are reading or writing, >>>>>> and check whether the new read or write operations can be performed >>>>>> concurrently with the current operations. >>>>> >>>>> That is all discussed in detail in the discussions I linked. >>>> >>>> Sorry, I overlooked some details from old discussion last time. >>>> It seems that you are not satisfied with the effectiveness of >>>> range locks. >>>> >>> >>> Correct. This solution was taken off the table. >>> >>> I hope my explanation was correct and clear. >>> If anything else is not clear, please feel free to ask. >>> >> >> I think your explanation is very clear. > > One more thing I should mention. > You do not need to wait for atomic large writes patches to land. > There is nothing stopping you from implementing the suggested > solution based on the xfs code already in master (v6.13-rc1), > which has support for the RWF_ATOMIC flag for writes. > > It just means that the API will not be usable for applications that > want to do IO larger than block size, but concurrent read/write ^ To be precise, this is the page size, not the block size, right? > performance of 4K IO could be improved already. Great, which means that IO operations aligned within a single page can be executed concurrently, because the folio lock already provides atomicity guarantees. If the write does not exceed the boundary of a page, we can downgrade the iolock to XFS_IOLOCK_SHARED. It seems to be safe and will not change the current behavior. --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -454,6 +454,11 @@ xfs_file_write_checks( if (error) return error; + if ( iocb->ki_pos >> PAGE_SHIFT == (iocb->ki_pos + count) >> PAGE_SHIFT) { + *iolock = XFS_IOLOCK_SHARED; + } + /* * For changing security info in file_remove_privs() we need i_rwsem * exclusively. 
> > It's possible that all you need to do is: > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index c488ae26b23d0..2542f15496488 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -777,9 +777,10 @@ xfs_file_buffered_write( > ssize_t ret; > bool cleared_space = false; > unsigned int iolock; > + bool atomic_write = iocb->ki_flags & IOCB_ATOMIC; > > write_retry: > - iolock = XFS_IOLOCK_EXCL; > + iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL; > ret = xfs_ilock_iocb(iocb, iolock); > -- > > xfs_file_write_checks() afterwards already takes care of promoting > XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes. Yeah, for writes that exceed the PAGE boundary, we can also promote the lock to XFS_IOLOCK_EXCL. Otherwise, I am concerned that it may lead to old data being retained in the file. For example, two processes writing four pages of data to the same area. process A process B -------------------------------- write AA-- <sleep> new write BBBB write --AA The final data is BBAA. > > It is possible that XFS_IOLOCK_EXCL could be immediately demoted > back to XFS_IOLOCK_SHARED for atomic_writes as done in > xfs_file_dio_write_aligned(). > Thanks, Chi Zhiling
On Wed, Jan 8, 2025 at 1:16 PM John Garry <john.g.garry@oracle.com> wrote: > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > index c488ae26b23d0..2542f15496488 100644 > > --- a/fs/xfs/xfs_file.c > > +++ b/fs/xfs/xfs_file.c > > @@ -777,9 +777,10 @@ xfs_file_buffered_write( > > ssize_t ret; > > bool cleared_space = false; > > unsigned int iolock; > > + bool atomic_write = iocb->ki_flags & IOCB_ATOMIC; > > > > write_retry: > > - iolock = XFS_IOLOCK_EXCL; > > + iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL; > > ret = xfs_ilock_iocb(iocb, iolock); > > -- > > > > xfs_file_write_checks() afterwards already takes care of promoting > > XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes. > > > > It is possible that XFS_IOLOCK_EXCL could be immediately demoted > > back to XFS_IOLOCK_SHARED for atomic_writes as done in > > xfs_file_dio_write_aligned(). > > > > TBH, I am not sure which blockdevs support 4K atomic writes that could > > be used to test this. > > > > John, can you share your test setup instructions for atomic writes? > > Please note that IOCB_ATOMIC is not supported for buffered IO, so we > can't do this - we only support direct IO today. Oops. I see now. > > And supporting buffered IO has its challenges; how to handle overlapping > atomic writes of differing sizes sitting in the page cache is the main > issue which comes to mind. > How about the combination of RWF_ATOMIC | RWF_UNCACHED [1] Would it be easier/possible to support this considering that the write of folio is started before the write system call returns? Note that application that desires mutithreaded atomicity of writes vs. reads will only need to opt-in for RWF_ATOMIC | RWF_UNCACHED writes, so this is not expected to actually break its performance by killing the read caching. Thanks, Amir. [1] https://lore.kernel.org/linux-fsdevel/20241220154831.1086649-1-axboe@kernel.dk/
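For reference, the RWF_ATOMIC opt-in being discussed here is already usable from user space for direct IO on kernels that carry the atomic write support. A minimal sketch, assuming a 4K filesystem block size, an atomic-write-capable device and reasonably new uapi headers (the fallback RWF_ATOMIC define is only there in case the installed userspace headers predate the flag):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* fallback; value from the v6.11+ uapi headers */
#endif

int main(int argc, char **argv)
{
	struct iovec iov;
	void *buf;
	ssize_t ret;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file on an atomic-write capable fs>\n", argv[0]);
		return 1;
	}

	/* RWF_ATOMIC is currently honoured for direct IO only */
	fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 'A', 4096);
	iov.iov_base = buf;
	iov.iov_len = 4096;

	/* either all 4096 bytes land or none do; fails if the fs/device cannot honour it */
	ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	if (ret < 0)
		perror("pwritev2(RWF_ATOMIC)");
	else
		printf("atomically wrote %zd bytes\n", ret);

	free(buf);
	close(fd);
	return 0;
}

On current XFS the write length also has to match the filesystem block size, as the xfs_file_write_iter() check quoted later in this thread enforces.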
> > One more thing I should mention. > > You do not need to wait for atomic large writes patches to land. > > There is nothing stopping you from implementing the suggested > > solution based on the xfs code already in master (v6.13-rc1), > > which has support for the RWF_ATOMIC flag for writes. Only I missed the fact that there is not yet a plan to support atomic buffered writes :-/ > > > > It just means that the API will not be usable for applications that > > want to do IO larger than block size, but concurrent read/write > ^ > To be precise, this is the page size, not the block size, right? > fs block size: if (iocb->ki_flags & IOCB_ATOMIC) { /* * Currently only atomic writing of a single FS block is * supported. It would be possible to atomic write smaller than * a FS block, but there is no requirement to support this. * Note that iomap also does not support this yet. */ if (ocount != ip->i_mount->m_sb.sb_blocksize) return -EINVAL; ret = generic_atomic_write_valid(iocb, from); if (ret) return ret; } > > performance of 4K IO could be improved already. > > Great, which means that IO operations aligned within a single page > can be executed concurrently, because the folio lock already > provides atomicity guarantees. > > If the write does not exceed the boundary of a page, we can > downgrade the iolock to XFS_IOLOCK_SHARED. It seems to be safe > and will not change the current behavior. > > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -454,6 +454,11 @@ xfs_file_write_checks( > if (error) > return error; > > + if ( iocb->ki_pos >> PAGE_SHIFT == (iocb->ki_pos + count) >> > PAGE_SHIFT) { > + *iolock = XFS_IOLOCK_SHARED; > + } > + > /* > * For changing security info in file_remove_privs() we need > i_rwsem > * exclusively. > I think that may be possible, but you should do it in the buffered write code as the patch below. xfs_file_write_checks() is called from code paths like xfs_file_dio_write_unaligned() where you should not demote to shared lock. > > > > It's possible that all you need to do is: > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > index c488ae26b23d0..2542f15496488 100644 > > --- a/fs/xfs/xfs_file.c > > +++ b/fs/xfs/xfs_file.c > > @@ -777,9 +777,10 @@ xfs_file_buffered_write( > > ssize_t ret; > > bool cleared_space = false; > > unsigned int iolock; > > + bool atomic_write = iocb->ki_flags & IOCB_ATOMIC; > > > > write_retry: > > - iolock = XFS_IOLOCK_EXCL; > > + iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL; > > ret = xfs_ilock_iocb(iocb, iolock); > > -- > > > > xfs_file_write_checks() afterwards already takes care of promoting > > XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes. > > Yeah, for writes that exceed the PAGE boundary, we can also promote > the lock to XFS_IOLOCK_EXCL. Otherwise, I am concerned that it may > lead to old data being retained in the file. > > For example, two processes writing four pages of data to the same area. > > process A process B > -------------------------------- > write AA-- > <sleep> > new write BBBB > write --AA > > The final data is BBAA. > What is the use case for which you are trying to fix performance? Is it a use case with single block IO? if not then it does not help to implement a partial solution for single block size IO. Thanks, Amir.
On 2025/1/9 18:25, Amir Goldstein wrote: >>> One more thing I should mention. >>> You do not need to wait for atomic large writes patches to land. >>> There is nothing stopping you from implementing the suggested >>> solution based on the xfs code already in master (v6.13-rc1), >>> which has support for the RWF_ATOMIC flag for writes. > > Only I missed the fact that there is not yet a plan to support > atomic buffered writes :-/ I think it's necessary to support atomic buffered writes. > >>> >>> It just means that the API will not be usable for applications that >>> want to do IO larger than block size, but concurrent read/write >> ^ >> To be precise, this is the page size, not the block size, right? >> > > fs block size: > > if (iocb->ki_flags & IOCB_ATOMIC) { > /* > * Currently only atomic writing of a single FS block is > * supported. It would be possible to atomic write smaller than > * a FS block, but there is no requirement to support this. > * Note that iomap also does not support this yet. > */ > if (ocount != ip->i_mount->m_sb.sb_blocksize) > return -EINVAL; > ret = generic_atomic_write_valid(iocb, from); > if (ret) > return ret; > } Uh, okay, maybe I didn't understand it very accurately. I thought you were talking about buffered IO. I was quite curious as to why you mentioned block size in the context of buffered IO. > >>> performance of 4K IO could be improved already. >> >> Great, which means that IO operations aligned within a single page >> can be executed concurrently, because the folio lock already >> provides atomicity guarantees. >> >> If the write does not exceed the boundary of a page, we can >> downgrade the iolock to XFS_IOLOCK_SHARED. It seems to be safe >> and will not change the current behavior. >> >> --- a/fs/xfs/xfs_file.c >> +++ b/fs/xfs/xfs_file.c >> @@ -454,6 +454,11 @@ xfs_file_write_checks( >> if (error) >> return error; >> >> + if ( iocb->ki_pos >> PAGE_SHIFT == (iocb->ki_pos + count) >> >> PAGE_SHIFT) { >> + *iolock = XFS_IOLOCK_SHARED; >> + } >> + >> /* >> * For changing security info in file_remove_privs() we need >> i_rwsem >> * exclusively. >> > > I think that may be possible, but you should do it in the buffered write > code as the patch below. > xfs_file_write_checks() is called from code paths like > xfs_file_dio_write_unaligned() where you should not demote to shared lock. Wow, thank you for the reminder. This is the prototype of the patch. I might need to consider more scenarios and conduct testing before sending the patch. > > >>> >>> It's possible that all you need to do is: >>> >>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c >>> index c488ae26b23d0..2542f15496488 100644 >>> --- a/fs/xfs/xfs_file.c >>> +++ b/fs/xfs/xfs_file.c >>> @@ -777,9 +777,10 @@ xfs_file_buffered_write( >>> ssize_t ret; >>> bool cleared_space = false; >>> unsigned int iolock; >>> + bool atomic_write = iocb->ki_flags & IOCB_ATOMIC; >>> >>> write_retry: >>> - iolock = XFS_IOLOCK_EXCL; >>> + iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL; >>> ret = xfs_ilock_iocb(iocb, iolock); >>> -- >>> >>> xfs_file_write_checks() afterwards already takes care of promoting >>> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes. >> >> Yeah, for writes that exceed the PAGE boundary, we can also promote >> the lock to XFS_IOLOCK_EXCL. Otherwise, I am concerned that it may >> lead to old data being retained in the file. >> >> For example, two processes writing four pages of data to the same area. 
>> >> process A process B >> -------------------------------- >> write AA-- >> <sleep> >> new write BBBB >> write --AA >> >> The final data is BBAA. >> > > What is the use case for which you are trying to fix performance? > Is it a use case with single block IO? if not then it does not help to implement > a partial solution for single block size IO. I want to improve the UnixBench score for XFS. UnixBench uses buffered IO for testing, including both single-threaded and multi-threaded tests. Additionally, the IO size tested in UnixBench is below 4K, so this will be very helpful. Thanks, Chi Zhiling
On 09/01/2025 12:10, Chi Zhiling wrote: > On 2025/1/9 18:25, Amir Goldstein wrote: >>>> One more thing I should mention. >>>> You do not need to wait for atomic large writes patches to land. >>>> There is nothing stopping you from implementing the suggested >>>> solution based on the xfs code already in master (v6.13-rc1), >>>> which has support for the RWF_ATOMIC flag for writes. >> >> Only I missed the fact that there is not yet a plan to support >> atomic buffered writes :-/ > > I think it's necessary to support atomic buffered writes. Please check this thread then: https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@mit.edu/
On 09/01/2025 10:07, Amir Goldstein wrote: >> Please note that IOCB_ATOMIC is not supported for buffered IO, so we >> can't do this - we only support direct IO today. > Oops. I see now. > >> And supporting buffered IO has its challenges; how to handle overlapping >> atomic writes of differing sizes sitting in the page cache is the main >> issue which comes to mind. >> > How about the combination of RWF_ATOMIC | RWF_UNCACHED [1] > Would it be easier/possible to support this considering that the write of folio > is started before the write system call returns? I am not sure exactly what you are proposing. Is it that RWF_ATOMIC for buffered IO auto-sets RWF_UNCACHED? Or that RWF_ATOMIC requires RWF_UNCACHED to be set? But that is not so important, as I just think that future users of RWF_ATOMIC may not want the behavior of RWF_UNCACHED always (for buffered IO). And I don't think that RWF_UNCACHED even properly solves the issues of RWF_ATOMIC for buffered IO in terms of handling overlapping atomic writes in the page cache. Thanks, John > > Note that application that desires mutithreaded atomicity of writes vs. reads > will only need to opt-in for RWF_ATOMIC | RWF_UNCACHED writes, so this > is not expected to actually break its performance by killing the read caching. > > Thanks, > Amir. > > [1] https://lore.kernel.org/linux-fsdevel/20241220154831.1086649-1-axboe@kernel.dk/
On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote: > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: > > On 2025/1/7 20:13, Amir Goldstein wrote: > > > Dave's answer to this question was that there are some legacy applications > > > (database applications IIRC) on production systems that do rely on the fact > > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > > > However, it was noted that: > > > 1. Those application do not require atomicity for any size of IO, they > > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > > and they only require no torn writes for this I/O size > > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > > but application has currently no way of knowing if the semantics are > > > provided or not > > > > To be honest, it would be best if the folio lock could provide such > > semantics, as it would not cause any potential problems for the > > application, and we have hope to achieve concurrent writes. > > > > However, I am not sure if this is easy to implement and will not cause > > other problems. > > Assuming we're not abandoning POSIX "Thread Interactions with Regular > File Operations", you can't use the folio lock for coordination, for > several reasons: > > a) Apps can't directly control the size of the folio in the page cache > > b) The folio size can (theoretically) change underneath the program at > any time (reclaim can take your large folio and the next read gets a > smaller folio) > > c) If your write crosses folios, you've just crossed a synchronization > boundary and all bets are off, though all the other filesystems behave > this way and there seem not to be complaints > > d) If you try to "guarantee" folio granularity by messing with min/max > folio size, you run the risk of ENOMEM if the base pages get fragmented > > I think that's why Dave suggested range locks as the correct solution to > this; though it is a pity that so far nobody has come up with a > performant implementation. Yes, that's a fair summary of the situation. That said, I just had a left-field idea for a quasi-range lock that may allow random writes to run concurrently and atomically with reads. Essentially, we add an unsigned long to the inode, and use it as a lock bitmap. That gives up to 64 "lock segments" for the buffered write. We may also need a "segment size" variable.... The existing i_rwsem gets taken shared unless it is an extending write. For a non-extending write, we then do an offset->segment translation and lock that bit in the bit mask. If it's already locked, we wait on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. The segments are evenly sized - say a minimum of 64kB each, but when EOF is extended or truncated (which is done with the i_rwsem held exclusive) the segment size is rescaled. As nothing can hold bit locks while the i_rwsem is held exclusive, this will not race with anything. If we are doing an extending write, we take the i_rwsem shared first, then check if the extension will rescale the locks. If lock rescaling is needed, we have to take the i_rwsem exclusive to do the EOF extension. Otherwise, the bit lock that covers EOF will serialise file extensions so it can be done under a shared i_rwsem safely. This will allow buffered writes to remain atomic w.r.t. 
each other, and potentially allow buffered reads to wait on writes to the same segment and so potentially provide buffered read vs buffered write atomicity as well. If we need more concurrency than an unsigned long worth of bits for buffered writes, then maybe we can enlarge the bitmap further. I suspect this can be extended to direct IO in a similar way to buffered reads, and that then opens up the possibility of truncate and fallocate() being able to use the bitmap for range exclusion, too. The overhead is likely minimal - setting and clearing bits in a bitmap, as opposed to tracking ranges in a tree structure.... Thoughts? -Dave.
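A minimal kernel-context sketch of the segment-bitmap locking described above, to make the idea concrete. Everything here is hypothetical: the i_iosegmap and i_iosegshift fields do not exist in struct xfs_inode, and for brevity only the segment containing the start offset is locked, whereas a real implementation would lock every segment the IO range touches (in ascending order, to avoid deadlock) and handle the rescaling described above.

/* hypothetical per-inode fields: unsigned long i_iosegmap; unsigned int i_iosegshift; */

static inline unsigned int
xfs_ioseg(struct xfs_inode *ip, loff_t pos)
{
	/*
	 * Offset -> one of up to BITS_PER_LONG evenly sized lock segments.
	 * i_iosegshift only changes with i_rwsem held exclusive, and these
	 * helpers run with it held shared, so a plain load is fine.
	 */
	return (pos >> ip->i_iosegshift) & (BITS_PER_LONG - 1);
}

static void
xfs_ioseg_lock(struct xfs_inode *ip, loff_t pos)
{
	/* called with i_rwsem held shared; sleep until the bit is free */
	wait_on_bit_lock(&ip->i_iosegmap, xfs_ioseg(ip, pos),
			 TASK_UNINTERRUPTIBLE);
}

static void
xfs_ioseg_unlock(struct xfs_inode *ip, loff_t pos)
{
	unsigned int bit = xfs_ioseg(ip, pos);

	clear_bit_unlock(bit, &ip->i_iosegmap);
	/* make the cleared bit visible before any waiter re-checks it */
	smp_mb__after_atomic();
	wake_up_bit(&ip->i_iosegmap, bit);
}

A real series would also need to decide how reads, direct IO, truncate and fallocate() participate, which is the extension Dave sketches above.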
On 2025/1/10 07:28, Dave Chinner wrote: > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote: >> On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: >>> On 2025/1/7 20:13, Amir Goldstein wrote: >>>> Dave's answer to this question was that there are some legacy applications >>>> (database applications IIRC) on production systems that do rely on the fact >>>> that xfs provides this semantics and on the prerequisite that they run on xfs. >>>> >>>> However, it was noted that: >>>> 1. Those application do not require atomicity for any size of IO, they >>>> typically work in I/O size that is larger than block size (e.g. 16K or 64K) >>>> and they only require no torn writes for this I/O size >>>> 2. Large folios and iomap can usually provide this semantics via folio lock, >>>> but application has currently no way of knowing if the semantics are >>>> provided or not >>> >>> To be honest, it would be best if the folio lock could provide such >>> semantics, as it would not cause any potential problems for the >>> application, and we have hope to achieve concurrent writes. >>> >>> However, I am not sure if this is easy to implement and will not cause >>> other problems. >> >> Assuming we're not abandoning POSIX "Thread Interactions with Regular >> File Operations", you can't use the folio lock for coordination, for >> several reasons: >> >> a) Apps can't directly control the size of the folio in the page cache >> >> b) The folio size can (theoretically) change underneath the program at >> any time (reclaim can take your large folio and the next read gets a >> smaller folio) >> >> c) If your write crosses folios, you've just crossed a synchronization >> boundary and all bets are off, though all the other filesystems behave >> this way and there seem not to be complaints >> >> d) If you try to "guarantee" folio granularity by messing with min/max >> folio size, you run the risk of ENOMEM if the base pages get fragmented >> >> I think that's why Dave suggested range locks as the correct solution to >> this; though it is a pity that so far nobody has come up with a >> performant implementation. > > Yes, that's a fair summary of the situation. > > That said, I just had a left-field idea for a quasi-range lock > that may allow random writes to run concurrently and atomically > with reads. > > Essentially, we add an unsigned long to the inode, and use it as a > lock bitmap. That gives up to 64 "lock segments" for the buffered > write. We may also need a "segment size" variable.... > > The existing i_rwsem gets taken shared unless it is an extending > write. > > For a non-extending write, we then do an offset->segment translation > and lock that bit in the bit mask. If it's already locked, we wait > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. > > The segments are evenly sized - say a minimum of 64kB each, but when > EOF is extended or truncated (which is done with the i_rwsem held > exclusive) the segment size is rescaled. As nothing can hold bit > locks while the i_rwsem is held exclusive, this will not race with > anything. > > If we are doing an extending write, we take the i_rwsem shared > first, then check if the extension will rescale the locks. If lock > rescaling is needed, we have to take the i_rwsem exclusive to do the > EOF extension. Otherwise, the bit lock that covers EOF will > serialise file extensions so it can be done under a shared i_rwsem > safely. > > This will allow buffered writes to remain atomic w.r.t. 
each other, > and potentially allow buffered reads to wait on writes to the same > segment and so potentially provide buffered read vs buffered write > atomicity as well. > > If we need more concurrency than an unsigned long worth of bits for > buffered writes, then maybe we can enlarge the bitmap further. > > I suspect this can be extended to direct IO in a similar way to > buffered reads, and that then opens up the possibility of truncate > and fallocate() being able to use the bitmap for range exclusion, > too. > > The overhead is likely minimal - setting and clearing bits in a > bitmap, as opposed to tracking ranges in a tree structure.... > > Thoughts? I think it's fine. Additionally, even if multiple writes occur in the same segment, if the write operations are within a single page, we can still acquire the i_rwsem lock in shared mode, right?
On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote: > > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: > > > On 2025/1/7 20:13, Amir Goldstein wrote: > > > > Dave's answer to this question was that there are some legacy applications > > > > (database applications IIRC) on production systems that do rely on the fact > > > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > > > > > However, it was noted that: > > > > 1. Those application do not require atomicity for any size of IO, they > > > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > > > and they only require no torn writes for this I/O size > > > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > > > but application has currently no way of knowing if the semantics are > > > > provided or not > > > > > > To be honest, it would be best if the folio lock could provide such > > > semantics, as it would not cause any potential problems for the > > > application, and we have hope to achieve concurrent writes. > > > > > > However, I am not sure if this is easy to implement and will not cause > > > other problems. > > > > Assuming we're not abandoning POSIX "Thread Interactions with Regular > > File Operations", you can't use the folio lock for coordination, for > > several reasons: > > > > a) Apps can't directly control the size of the folio in the page cache > > > > b) The folio size can (theoretically) change underneath the program at > > any time (reclaim can take your large folio and the next read gets a > > smaller folio) > > > > c) If your write crosses folios, you've just crossed a synchronization > > boundary and all bets are off, though all the other filesystems behave > > this way and there seem not to be complaints > > > > d) If you try to "guarantee" folio granularity by messing with min/max > > folio size, you run the risk of ENOMEM if the base pages get fragmented > > > > I think that's why Dave suggested range locks as the correct solution to > > this; though it is a pity that so far nobody has come up with a > > performant implementation. > > Yes, that's a fair summary of the situation. > > That said, I just had a left-field idea for a quasi-range lock > that may allow random writes to run concurrently and atomically > with reads. > > Essentially, we add an unsigned long to the inode, and use it as a > lock bitmap. That gives up to 64 "lock segments" for the buffered > write. We may also need a "segment size" variable.... > > The existing i_rwsem gets taken shared unless it is an extending > write. > > For a non-extending write, we then do an offset->segment translation > and lock that bit in the bit mask. If it's already locked, we wait > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. > > The segments are evenly sized - say a minimum of 64kB each, but when > EOF is extended or truncated (which is done with the i_rwsem held > exclusive) the segment size is rescaled. As nothing can hold bit > locks while the i_rwsem is held exclusive, this will not race with > anything. > > If we are doing an extending write, we take the i_rwsem shared > first, then check if the extension will rescale the locks. If lock > rescaling is needed, we have to take the i_rwsem exclusive to do the > EOF extension. 
Otherwise, the bit lock that covers EOF will > serialise file extensions so it can be done under a shared i_rwsem > safely. > > This will allow buffered writes to remain atomic w.r.t. each other, > and potentially allow buffered reads to wait on writes to the same > segment and so potentially provide buffered read vs buffered write > atomicity as well. > > If we need more concurrency than an unsigned long worth of bits for > buffered writes, then maybe we can enlarge the bitmap further. > > I suspect this can be extended to direct IO in a similar way to > buffered reads, and that then opens up the possibility of truncate > and fallocate() being able to use the bitmap for range exclusion, > too. > > The overhead is likely minimal - setting and clearing bits in a > bitmap, as opposed to tracking ranges in a tree structure.... > > Thoughts? I think that's a very neat idea, but it will not address the reference benchmark. The reference benchmark from my original report, which to my understanding is similar to the benchmark that Chi is running, simulates the workload of a database writing with buffered IO. That means a very large file and a small IO size of ~64K, which leaves the probability of intersecting writes landing in the same segment quite high. Can we do this opportunistically based on available large folios? If the IO size is within an existing folio, use the folio lock and IOLOCK_SHARED; if it is not, use IOLOCK_EXCL? For a benchmark that does all buffered IO 64K aligned, wouldn't large folios naturally align to the IO size and above? Thanks, Amir.
On 2025/1/11 01:07, Amir Goldstein wrote: > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote: >> >> On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote: >>> On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: >>>> On 2025/1/7 20:13, Amir Goldstein wrote: >>>>> Dave's answer to this question was that there are some legacy applications >>>>> (database applications IIRC) on production systems that do rely on the fact >>>>> that xfs provides this semantics and on the prerequisite that they run on xfs. >>>>> >>>>> However, it was noted that: >>>>> 1. Those application do not require atomicity for any size of IO, they >>>>> typically work in I/O size that is larger than block size (e.g. 16K or 64K) >>>>> and they only require no torn writes for this I/O size >>>>> 2. Large folios and iomap can usually provide this semantics via folio lock, >>>>> but application has currently no way of knowing if the semantics are >>>>> provided or not >>>> >>>> To be honest, it would be best if the folio lock could provide such >>>> semantics, as it would not cause any potential problems for the >>>> application, and we have hope to achieve concurrent writes. >>>> >>>> However, I am not sure if this is easy to implement and will not cause >>>> other problems. >>> >>> Assuming we're not abandoning POSIX "Thread Interactions with Regular >>> File Operations", you can't use the folio lock for coordination, for >>> several reasons: >>> >>> a) Apps can't directly control the size of the folio in the page cache >>> >>> b) The folio size can (theoretically) change underneath the program at >>> any time (reclaim can take your large folio and the next read gets a >>> smaller folio) >>> >>> c) If your write crosses folios, you've just crossed a synchronization >>> boundary and all bets are off, though all the other filesystems behave >>> this way and there seem not to be complaints >>> >>> d) If you try to "guarantee" folio granularity by messing with min/max >>> folio size, you run the risk of ENOMEM if the base pages get fragmented >>> >>> I think that's why Dave suggested range locks as the correct solution to >>> this; though it is a pity that so far nobody has come up with a >>> performant implementation. >> >> Yes, that's a fair summary of the situation. >> >> That said, I just had a left-field idea for a quasi-range lock >> that may allow random writes to run concurrently and atomically >> with reads. >> >> Essentially, we add an unsigned long to the inode, and use it as a >> lock bitmap. That gives up to 64 "lock segments" for the buffered >> write. We may also need a "segment size" variable.... >> >> The existing i_rwsem gets taken shared unless it is an extending >> write. >> >> For a non-extending write, we then do an offset->segment translation >> and lock that bit in the bit mask. If it's already locked, we wait >> on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. >> >> The segments are evenly sized - say a minimum of 64kB each, but when >> EOF is extended or truncated (which is done with the i_rwsem held >> exclusive) the segment size is rescaled. As nothing can hold bit >> locks while the i_rwsem is held exclusive, this will not race with >> anything. >> >> If we are doing an extending write, we take the i_rwsem shared >> first, then check if the extension will rescale the locks. If lock >> rescaling is needed, we have to take the i_rwsem exclusive to do the >> EOF extension. 
Otherwise, the bit lock that covers EOF will >> serialise file extensions so it can be done under a shared i_rwsem >> safely. >> >> This will allow buffered writes to remain atomic w.r.t. each other, >> and potentially allow buffered reads to wait on writes to the same >> segment and so potentially provide buffered read vs buffered write >> atomicity as well. >> >> If we need more concurrency than an unsigned long worth of bits for >> buffered writes, then maybe we can enlarge the bitmap further. >> >> I suspect this can be extended to direct IO in a similar way to >> buffered reads, and that then opens up the possibility of truncate >> and fallocate() being able to use the bitmap for range exclusion, >> too. >> >> The overhead is likely minimal - setting and clearing bits in a >> bitmap, as opposed to tracking ranges in a tree structure.... >> >> Thoughts? > > I think that's a very neat idea, but it will not address the reference > benchmark. > The reference benchmark I started the original report with which is similar > to my understanding to the benchmark that Chi is running simulates the > workload of a database writing with buffered IO. > > That means a very large file and small IO size ~64K. > Leaving the probability of intersecting writes in the same segment quite high. > > Can we do this opportunistically based on available large folios? > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED > if it is not, use IOLOCK_EXCL? > > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios > naturally align to IO size and above? > Great, I think we're getting close to aligning our thoughts. IMO, we shouldn't use a shared lock for write operations that are larger than page size. I believe the current issue is that when acquiring the i_rwsem lock, we have no way of knowing the size of a large folio [1] (as Darrick mentioned earlier), so we can't determine if only one large folio will be written. There's only one certainty: if the IO size fits within one page size, it will definitely fit within one large folio. So for now, we can only use IOLOCK_SHARED if we verify that the IO fits within page size. [1]: Maybe we can find a way to obtain the size of a folio from the page cache, but it might come with some performance costs. Thanks, Chi Zhiling
On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote: > On 2025/1/11 01:07, Amir Goldstein wrote: > > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote: > > > > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: > > > > > On 2025/1/7 20:13, Amir Goldstein wrote: > > > > > > Dave's answer to this question was that there are some legacy applications > > > > > > (database applications IIRC) on production systems that do rely on the fact > > > > > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > > > > > > > > > However, it was noted that: > > > > > > 1. Those application do not require atomicity for any size of IO, they > > > > > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > > > > > and they only require no torn writes for this I/O size > > > > > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > > > > > but application has currently no way of knowing if the semantics are > > > > > > provided or not > > > > > > > > > > To be honest, it would be best if the folio lock could provide such > > > > > semantics, as it would not cause any potential problems for the > > > > > application, and we have hope to achieve concurrent writes. > > > > > > > > > > However, I am not sure if this is easy to implement and will not cause > > > > > other problems. > > > > > > > > Assuming we're not abandoning POSIX "Thread Interactions with Regular > > > > File Operations", you can't use the folio lock for coordination, for > > > > several reasons: > > > > > > > > a) Apps can't directly control the size of the folio in the page cache > > > > > > > > b) The folio size can (theoretically) change underneath the program at > > > > any time (reclaim can take your large folio and the next read gets a > > > > smaller folio) > > > > > > > > c) If your write crosses folios, you've just crossed a synchronization > > > > boundary and all bets are off, though all the other filesystems behave > > > > this way and there seem not to be complaints > > > > > > > > d) If you try to "guarantee" folio granularity by messing with min/max > > > > folio size, you run the risk of ENOMEM if the base pages get fragmented > > > > > > > > I think that's why Dave suggested range locks as the correct solution to > > > > this; though it is a pity that so far nobody has come up with a > > > > performant implementation. > > > > > > Yes, that's a fair summary of the situation. > > > > > > That said, I just had a left-field idea for a quasi-range lock > > > that may allow random writes to run concurrently and atomically > > > with reads. > > > > > > Essentially, we add an unsigned long to the inode, and use it as a > > > lock bitmap. That gives up to 64 "lock segments" for the buffered > > > write. We may also need a "segment size" variable.... > > > > > > The existing i_rwsem gets taken shared unless it is an extending > > > write. > > > > > > For a non-extending write, we then do an offset->segment translation > > > and lock that bit in the bit mask. If it's already locked, we wait > > > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. > > > > > > The segments are evenly sized - say a minimum of 64kB each, but when > > > EOF is extended or truncated (which is done with the i_rwsem held > > > exclusive) the segment size is rescaled. 
As nothing can hold bit > > > locks while the i_rwsem is held exclusive, this will not race with > > > anything. > > > > > > If we are doing an extending write, we take the i_rwsem shared > > > first, then check if the extension will rescale the locks. If lock > > > rescaling is needed, we have to take the i_rwsem exclusive to do the > > > EOF extension. Otherwise, the bit lock that covers EOF will > > > serialise file extensions so it can be done under a shared i_rwsem > > > safely. > > > > > > This will allow buffered writes to remain atomic w.r.t. each other, > > > and potentially allow buffered reads to wait on writes to the same > > > segment and so potentially provide buffered read vs buffered write > > > atomicity as well. > > > > > > If we need more concurrency than an unsigned long worth of bits for > > > buffered writes, then maybe we can enlarge the bitmap further. > > > > > > I suspect this can be extended to direct IO in a similar way to > > > buffered reads, and that then opens up the possibility of truncate > > > and fallocate() being able to use the bitmap for range exclusion, > > > too. > > > > > > The overhead is likely minimal - setting and clearing bits in a > > > bitmap, as opposed to tracking ranges in a tree structure.... > > > > > > Thoughts? > > > > I think that's a very neat idea, but it will not address the reference > > benchmark. > > The reference benchmark I started the original report with which is similar > > to my understanding to the benchmark that Chi is running simulates the > > workload of a database writing with buffered IO. > > > > That means a very large file and small IO size ~64K. > > Leaving the probability of intersecting writes in the same segment quite high. > > > > Can we do this opportunistically based on available large folios? > > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED > > if it is not, use IOLOCK_EXCL? > > > > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios > > naturally align to IO size and above? > > > > Great, I think we're getting close to aligning our thoughts. > > IMO, we shouldn't use a shared lock for write operations that are > larger than page size. > > I believe the current issue is that when acquiring the i_rwsem lock, > we have no way of knowing the size of a large folio [1] (as Darrick > mentioned earlier), so we can't determine if only one large folio will > be written. > > There's only one certainty: if the IO size fits within one page size, > it will definitely fit within one large folio. > > So for now, we can only use IOLOCK_SHARED if we verify that the IO fits > within page size. For filesystems that /do/ support large folios (xfs), I suppose you could have it tell iomap that it only took i_rwsem in shared mode; and then the iomap buffered write implementation could proceed if it got a folio covering the entire write range, or return some magic code that means "take i_rwsem in exclusive mode and try again". Though you're correct that we should always take IOLOCK_EXCL if the write size is larger than whatever the max folio size is for that file. --D > [1]: Maybe we can find a way to obtain the size of a folio from the page > cache, but it might come with some performance costs. > > > Thanks, > Chi Zhiling > >
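A rough sketch of that control flow, with the caveat that everything below is hypothetical: there is currently no way for iomap to report "the write was not covered by one locked folio", so the need_excl flag and the xfs_buffered_write_locked() helper are placeholders for whatever plumbing would have to be added:

/* illustrative only; helper and flag are hypothetical */
static ssize_t
xfs_file_buffered_write_opportunistic(struct kiocb *iocb, struct iov_iter *from)
{
	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
	unsigned int		iolock = XFS_IOLOCK_SHARED;
	bool			need_excl = false;
	ssize_t			ret;

retry:
	xfs_ilock(ip, iolock);

	/*
	 * Hypothetical helper: runs the usual write checks and the iomap
	 * buffered write, and sets need_excl if the write could not be
	 * covered by a single locked folio (or exceeds the max folio size
	 * for this file).
	 */
	ret = xfs_buffered_write_locked(iocb, from, &iolock, &need_excl);

	xfs_iunlock(ip, iolock);

	if (need_excl && iolock == XFS_IOLOCK_SHARED) {
		/* fall back to the old untorn-write behaviour */
		iolock = XFS_IOLOCK_EXCL;
		need_excl = false;
		goto retry;
	}
	return ret;
}

Whether that signal comes back as a flag, a magic return value, or is handled entirely inside iomap is exactly the design question raised above.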
On 2025/1/13 10:44, Darrick J. Wong wrote: > On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote: >> On 2025/1/11 01:07, Amir Goldstein wrote: >>> On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote: >>>> >>>> On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote: >>>>> On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: >>>>>> On 2025/1/7 20:13, Amir Goldstein wrote: >>>>>>> Dave's answer to this question was that there are some legacy applications >>>>>>> (database applications IIRC) on production systems that do rely on the fact >>>>>>> that xfs provides this semantics and on the prerequisite that they run on xfs. >>>>>>> >>>>>>> However, it was noted that: >>>>>>> 1. Those application do not require atomicity for any size of IO, they >>>>>>> typically work in I/O size that is larger than block size (e.g. 16K or 64K) >>>>>>> and they only require no torn writes for this I/O size >>>>>>> 2. Large folios and iomap can usually provide this semantics via folio lock, >>>>>>> but application has currently no way of knowing if the semantics are >>>>>>> provided or not >>>>>> >>>>>> To be honest, it would be best if the folio lock could provide such >>>>>> semantics, as it would not cause any potential problems for the >>>>>> application, and we have hope to achieve concurrent writes. >>>>>> >>>>>> However, I am not sure if this is easy to implement and will not cause >>>>>> other problems. >>>>> >>>>> Assuming we're not abandoning POSIX "Thread Interactions with Regular >>>>> File Operations", you can't use the folio lock for coordination, for >>>>> several reasons: >>>>> >>>>> a) Apps can't directly control the size of the folio in the page cache >>>>> >>>>> b) The folio size can (theoretically) change underneath the program at >>>>> any time (reclaim can take your large folio and the next read gets a >>>>> smaller folio) >>>>> >>>>> c) If your write crosses folios, you've just crossed a synchronization >>>>> boundary and all bets are off, though all the other filesystems behave >>>>> this way and there seem not to be complaints >>>>> >>>>> d) If you try to "guarantee" folio granularity by messing with min/max >>>>> folio size, you run the risk of ENOMEM if the base pages get fragmented >>>>> >>>>> I think that's why Dave suggested range locks as the correct solution to >>>>> this; though it is a pity that so far nobody has come up with a >>>>> performant implementation. >>>> >>>> Yes, that's a fair summary of the situation. >>>> >>>> That said, I just had a left-field idea for a quasi-range lock >>>> that may allow random writes to run concurrently and atomically >>>> with reads. >>>> >>>> Essentially, we add an unsigned long to the inode, and use it as a >>>> lock bitmap. That gives up to 64 "lock segments" for the buffered >>>> write. We may also need a "segment size" variable.... >>>> >>>> The existing i_rwsem gets taken shared unless it is an extending >>>> write. >>>> >>>> For a non-extending write, we then do an offset->segment translation >>>> and lock that bit in the bit mask. If it's already locked, we wait >>>> on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. >>>> >>>> The segments are evenly sized - say a minimum of 64kB each, but when >>>> EOF is extended or truncated (which is done with the i_rwsem held >>>> exclusive) the segment size is rescaled. As nothing can hold bit >>>> locks while the i_rwsem is held exclusive, this will not race with >>>> anything. 
>>>> >>>> If we are doing an extending write, we take the i_rwsem shared >>>> first, then check if the extension will rescale the locks. If lock >>>> rescaling is needed, we have to take the i_rwsem exclusive to do the >>>> EOF extension. Otherwise, the bit lock that covers EOF will >>>> serialise file extensions so it can be done under a shared i_rwsem >>>> safely. >>>> >>>> This will allow buffered writes to remain atomic w.r.t. each other, >>>> and potentially allow buffered reads to wait on writes to the same >>>> segment and so potentially provide buffered read vs buffered write >>>> atomicity as well. >>>> >>>> If we need more concurrency than an unsigned long worth of bits for >>>> buffered writes, then maybe we can enlarge the bitmap further. >>>> >>>> I suspect this can be extended to direct IO in a similar way to >>>> buffered reads, and that then opens up the possibility of truncate >>>> and fallocate() being able to use the bitmap for range exclusion, >>>> too. >>>> >>>> The overhead is likely minimal - setting and clearing bits in a >>>> bitmap, as opposed to tracking ranges in a tree structure.... >>>> >>>> Thoughts? >>> >>> I think that's a very neat idea, but it will not address the reference >>> benchmark. >>> The reference benchmark I started the original report with which is similar >>> to my understanding to the benchmark that Chi is running simulates the >>> workload of a database writing with buffered IO. >>> >>> That means a very large file and small IO size ~64K. >>> Leaving the probability of intersecting writes in the same segment quite high. >>> >>> Can we do this opportunistically based on available large folios? >>> If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED >>> if it is not, use IOLOCK_EXCL? >>> >>> for a benchmark that does all buffered IO 64K aligned, wouldn't large folios >>> naturally align to IO size and above? >>> >> >> Great, I think we're getting close to aligning our thoughts. >> >> IMO, we shouldn't use a shared lock for write operations that are >> larger than page size. >> >> I believe the current issue is that when acquiring the i_rwsem lock, >> we have no way of knowing the size of a large folio [1] (as Darrick >> mentioned earlier), so we can't determine if only one large folio will >> be written. >> >> There's only one certainty: if the IO size fits within one page size, >> it will definitely fit within one large folio. >> >> So for now, we can only use IOLOCK_SHARED if we verify that the IO fits >> within page size. > > For filesystems that /do/ support large folios (xfs), I suppose you > could have it tell iomap that it only took i_rwsem in shared mode; and > then the iomap buffered write implementation could proceed if it got a > folio covering the entire write range, or return some magic code that > means "take i_rwsem in exclusive mode and try again". > > Though you're correct that we should always take IOLOCK_EXCL if the > write size is larger than whatever the max folio size is for that file. > I think this is a pretty good idea. I'll implement the basic functionality as soon as possible, and then we can discuss some details for optimization. Thanks, Chi Zhiling
On Sun, Jan 12, 2025 at 06:44:01PM -0800, Darrick J. Wong wrote: > On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote: > > On 2025/1/11 01:07, Amir Goldstein wrote: > > > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote: > > > > > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: > > > > > > On 2025/1/7 20:13, Amir Goldstein wrote: > > > > > > > Dave's answer to this question was that there are some legacy applications > > > > > > > (database applications IIRC) on production systems that do rely on the fact > > > > > > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > > > > > > > > > > > However, it was noted that: > > > > > > > 1. Those application do not require atomicity for any size of IO, they > > > > > > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > > > > > > and they only require no torn writes for this I/O size > > > > > > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > > > > > > but application has currently no way of knowing if the semantics are > > > > > > > provided or not > > > > > > > > > > > > To be honest, it would be best if the folio lock could provide such > > > > > > semantics, as it would not cause any potential problems for the > > > > > > application, and we have hope to achieve concurrent writes. > > > > > > > > > > > > However, I am not sure if this is easy to implement and will not cause > > > > > > other problems. > > > > > > > > > > Assuming we're not abandoning POSIX "Thread Interactions with Regular > > > > > File Operations", you can't use the folio lock for coordination, for > > > > > several reasons: > > > > > > > > > > a) Apps can't directly control the size of the folio in the page cache > > > > > > > > > > b) The folio size can (theoretically) change underneath the program at > > > > > any time (reclaim can take your large folio and the next read gets a > > > > > smaller folio) > > > > > > > > > > c) If your write crosses folios, you've just crossed a synchronization > > > > > boundary and all bets are off, though all the other filesystems behave > > > > > this way and there seem not to be complaints > > > > > > > > > > d) If you try to "guarantee" folio granularity by messing with min/max > > > > > folio size, you run the risk of ENOMEM if the base pages get fragmented > > > > > > > > > > I think that's why Dave suggested range locks as the correct solution to > > > > > this; though it is a pity that so far nobody has come up with a > > > > > performant implementation. > > > > > > > > Yes, that's a fair summary of the situation. > > > > > > > > That said, I just had a left-field idea for a quasi-range lock > > > > that may allow random writes to run concurrently and atomically > > > > with reads. > > > > > > > > Essentially, we add an unsigned long to the inode, and use it as a > > > > lock bitmap. That gives up to 64 "lock segments" for the buffered > > > > write. We may also need a "segment size" variable.... > > > > > > > > The existing i_rwsem gets taken shared unless it is an extending > > > > write. > > > > > > > > For a non-extending write, we then do an offset->segment translation > > > > and lock that bit in the bit mask. If it's already locked, we wait > > > > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. 
> > > > > > > > The segments are evenly sized - say a minimum of 64kB each, but when > > > > EOF is extended or truncated (which is done with the i_rwsem held > > > > exclusive) the segment size is rescaled. As nothing can hold bit > > > > locks while the i_rwsem is held exclusive, this will not race with > > > > anything. > > > > > > > > If we are doing an extending write, we take the i_rwsem shared > > > > first, then check if the extension will rescale the locks. If lock > > > > rescaling is needed, we have to take the i_rwsem exclusive to do the > > > > EOF extension. Otherwise, the bit lock that covers EOF will > > > > serialise file extensions so it can be done under a shared i_rwsem > > > > safely. > > > > > > > > This will allow buffered writes to remain atomic w.r.t. each other, > > > > and potentially allow buffered reads to wait on writes to the same > > > > segment and so potentially provide buffered read vs buffered write > > > > atomicity as well. > > > > > > > > If we need more concurrency than an unsigned long worth of bits for > > > > buffered writes, then maybe we can enlarge the bitmap further. > > > > > > > > I suspect this can be extended to direct IO in a similar way to > > > > buffered reads, and that then opens up the possibility of truncate > > > > and fallocate() being able to use the bitmap for range exclusion, > > > > too. > > > > > > > > The overhead is likely minimal - setting and clearing bits in a > > > > bitmap, as opposed to tracking ranges in a tree structure.... > > > > > > > > Thoughts? > > > > > > I think that's a very neat idea, but it will not address the reference > > > benchmark. > > > The reference benchmark I started the original report with which is similar > > > to my understanding to the benchmark that Chi is running simulates the > > > workload of a database writing with buffered IO. > > > > > > That means a very large file and small IO size ~64K. > > > Leaving the probability of intersecting writes in the same segment quite high. > > > > > > Can we do this opportunistically based on available large folios? > > > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED > > > if it is not, use IOLOCK_EXCL? > > > > > > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios > > > naturally align to IO size and above? > > > > > > > Great, I think we're getting close to aligning our thoughts. > > > > IMO, we shouldn't use a shared lock for write operations that are > > larger than page size. > > > > I believe the current issue is that when acquiring the i_rwsem lock, > > we have no way of knowing the size of a large folio [1] (as Darrick > > mentioned earlier), so we can't determine if only one large folio will > > be written. > > > > There's only one certainty: if the IO size fits within one page size, > > it will definitely fit within one large folio. > > > > So for now, we can only use IOLOCK_SHARED if we verify that the IO fits > > within page size. > > For filesystems that /do/ support large folios (xfs), I suppose you > could have it tell iomap that it only took i_rwsem in shared mode; and > then the iomap buffered write implementation could proceed if it got a > folio covering the entire write range, or return some magic code that > means "take i_rwsem in exclusive mode and try again". 
> Sorry if this is out of left field as I haven't followed the discussion closely, but I presumed one of the reasons Darrick and Christoph raised the idea of using the folio batch thing I'm playing around with on zero range for buffered writes would be to acquire and lock all targeted folios up front. If so, would that help with what you're trying to achieve here? (If not, nothing to see here, move along.. ;). Brian > Though you're correct that we should always take IOLOCK_EXCL if the > write size is larger than whatever the max folio size is for that file. > > --D > > > [1]: Maybe we can find a way to obtain the size of a folio from the page > > cache, but it might come with some performance costs. > > > > > > Thanks, > > Chi Zhiling > > > > >
On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: > On Sun, Jan 12, 2025 at 06:44:01PM -0800, Darrick J. Wong wrote: > > On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote: > > > On 2025/1/11 01:07, Amir Goldstein wrote: > > > > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote: > > > > > > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote: > > > > > > > On 2025/1/7 20:13, Amir Goldstein wrote: > > > > > > > > Dave's answer to this question was that there are some legacy applications > > > > > > > > (database applications IIRC) on production systems that do rely on the fact > > > > > > > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > > > > > > > > > > > > > However, it was noted that: > > > > > > > > 1. Those application do not require atomicity for any size of IO, they > > > > > > > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > > > > > > > and they only require no torn writes for this I/O size > > > > > > > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > > > > > > > but application has currently no way of knowing if the semantics are > > > > > > > > provided or not > > > > > > > > > > > > > > To be honest, it would be best if the folio lock could provide such > > > > > > > semantics, as it would not cause any potential problems for the > > > > > > > application, and we have hope to achieve concurrent writes. > > > > > > > > > > > > > > However, I am not sure if this is easy to implement and will not cause > > > > > > > other problems. > > > > > > > > > > > > Assuming we're not abandoning POSIX "Thread Interactions with Regular > > > > > > File Operations", you can't use the folio lock for coordination, for > > > > > > several reasons: > > > > > > > > > > > > a) Apps can't directly control the size of the folio in the page cache > > > > > > > > > > > > b) The folio size can (theoretically) change underneath the program at > > > > > > any time (reclaim can take your large folio and the next read gets a > > > > > > smaller folio) > > > > > > > > > > > > c) If your write crosses folios, you've just crossed a synchronization > > > > > > boundary and all bets are off, though all the other filesystems behave > > > > > > this way and there seem not to be complaints > > > > > > > > > > > > d) If you try to "guarantee" folio granularity by messing with min/max > > > > > > folio size, you run the risk of ENOMEM if the base pages get fragmented > > > > > > > > > > > > I think that's why Dave suggested range locks as the correct solution to > > > > > > this; though it is a pity that so far nobody has come up with a > > > > > > performant implementation. > > > > > > > > > > Yes, that's a fair summary of the situation. > > > > > > > > > > That said, I just had a left-field idea for a quasi-range lock > > > > > that may allow random writes to run concurrently and atomically > > > > > with reads. > > > > > > > > > > Essentially, we add an unsigned long to the inode, and use it as a > > > > > lock bitmap. That gives up to 64 "lock segments" for the buffered > > > > > write. We may also need a "segment size" variable.... > > > > > > > > > > The existing i_rwsem gets taken shared unless it is an extending > > > > > write. > > > > > > > > > > For a non-extending write, we then do an offset->segment translation > > > > > and lock that bit in the bit mask. 
If it's already locked, we wait > > > > > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. > > > > > > > > > > The segments are evenly sized - say a minimum of 64kB each, but when > > > > > EOF is extended or truncated (which is done with the i_rwsem held > > > > > exclusive) the segment size is rescaled. As nothing can hold bit > > > > > locks while the i_rwsem is held exclusive, this will not race with > > > > > anything. > > > > > > > > > > If we are doing an extending write, we take the i_rwsem shared > > > > > first, then check if the extension will rescale the locks. If lock > > > > > rescaling is needed, we have to take the i_rwsem exclusive to do the > > > > > EOF extension. Otherwise, the bit lock that covers EOF will > > > > > serialise file extensions so it can be done under a shared i_rwsem > > > > > safely. > > > > > > > > > > This will allow buffered writes to remain atomic w.r.t. each other, > > > > > and potentially allow buffered reads to wait on writes to the same > > > > > segment and so potentially provide buffered read vs buffered write > > > > > atomicity as well. > > > > > > > > > > If we need more concurrency than an unsigned long worth of bits for > > > > > buffered writes, then maybe we can enlarge the bitmap further. > > > > > > > > > > I suspect this can be extended to direct IO in a similar way to > > > > > buffered reads, and that then opens up the possibility of truncate > > > > > and fallocate() being able to use the bitmap for range exclusion, > > > > > too. > > > > > > > > > > The overhead is likely minimal - setting and clearing bits in a > > > > > bitmap, as opposed to tracking ranges in a tree structure.... > > > > > > > > > > Thoughts? > > > > > > > > I think that's a very neat idea, but it will not address the reference > > > > benchmark. > > > > The reference benchmark I started the original report with which is similar > > > > to my understanding to the benchmark that Chi is running simulates the > > > > workload of a database writing with buffered IO. > > > > > > > > That means a very large file and small IO size ~64K. > > > > Leaving the probability of intersecting writes in the same segment quite high. > > > > > > > > Can we do this opportunistically based on available large folios? > > > > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED > > > > if it is not, use IOLOCK_EXCL? > > > > > > > > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios > > > > naturally align to IO size and above? > > > > > > > > > > Great, I think we're getting close to aligning our thoughts. > > > > > > IMO, we shouldn't use a shared lock for write operations that are > > > larger than page size. > > > > > > I believe the current issue is that when acquiring the i_rwsem lock, > > > we have no way of knowing the size of a large folio [1] (as Darrick > > > mentioned earlier), so we can't determine if only one large folio will > > > be written. > > > > > > There's only one certainty: if the IO size fits within one page size, > > > it will definitely fit within one large folio. > > > > > > So for now, we can only use IOLOCK_SHARED if we verify that the IO fits > > > within page size. 
> > > > For filesystems that /do/ support large folios (xfs), I suppose you > > could have it tell iomap that it only took i_rwsem in shared mode; and > > then the iomap buffered write implementation could proceed if it got a > > folio covering the entire write range, or return some magic code that > > means "take i_rwsem in exclusive mode and try again". > > > > Sorry if this is out of left field as I haven't followed the discussion > closely, but I presumed one of the reasons Darrick and Christoph raised > the idea of using the folio batch thing I'm playing around with on zero > range for buffered writes would be to acquire and lock all targeted > folios up front. If so, would that help with what you're trying to > achieve here? (If not, nothing to see here, move along.. ;). I think the folio batch thing would help here, since you then could just lock all the folios you need for a write, which provides (in effect) a range lock implementation. --D > Brian > > > Though you're correct that we should always take IOLOCK_EXCL if the > > write size is larger than whatever the max folio size is for that file. > > > > --D > > > > > [1]: Maybe we can find a way to obtain the size of a folio from the page > > > cache, but it might come with some performance costs. > > > > > > > > > Thanks, > > > Chi Zhiling > > > > > > > > > >
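As a rough illustration of the folio-batch idea (this is not Brian's actual code, just the shape of it under the stated assumptions): instantiate and lock every folio backing the write range, in ascending index order, before copying any data. While those folio locks are held, any other buffered write to the same range blocks on them, which is what gives the effective range-lock behaviour described above.

/*
 * Sketch only: lock all folios backing the write range up front.
 * Error handling, short coverage and direct I/O are ignored.
 */
static int lock_folios_for_write(struct address_space *mapping,
		loff_t pos, size_t len, struct folio **folios, int max)
{
	pgoff_t index = pos >> PAGE_SHIFT;
	pgoff_t last = (pos + len - 1) >> PAGE_SHIFT;
	int nr = 0;

	while (index <= last && nr < max) {
		struct folio *folio;

		/* Instantiate and lock; lookups may return large folios. */
		folio = __filemap_get_folio(mapping, index,
				FGP_LOCK | FGP_CREAT,
				mapping_gfp_mask(mapping));
		if (IS_ERR(folio)) {
			while (nr--) {
				folio_unlock(folios[nr]);
				folio_put(folios[nr]);
			}
			return PTR_ERR(folio);
		}
		folios[nr++] = folio;
		index = folio_next_index(folio);
	}
	return nr;
}

Once the copy is done, the folios are dirtied, unlocked and released as usual.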
On Fri, Jan 10, 2025 at 06:07:48PM +0100, Amir Goldstein wrote: > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote: > > That said, I just had a left-field idea for a quasi-range lock > > that may allow random writes to run concurrently and atomically > > with reads. > > > > Essentially, we add an unsigned long to the inode, and use it as a > > lock bitmap. That gives up to 64 "lock segments" for the buffered > > write. We may also need a "segment size" variable.... > > > > The existing i_rwsem gets taken shared unless it is an extending > > write. > > > > For a non-extending write, we then do an offset->segment translation > > and lock that bit in the bit mask. If it's already locked, we wait > > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock. > > > > The segments are evenly sized - say a minimum of 64kB each, but when > > EOF is extended or truncated (which is done with the i_rwsem held > > exclusive) the segment size is rescaled. As nothing can hold bit > > locks while the i_rwsem is held exclusive, this will not race with > > anything. > > > > If we are doing an extending write, we take the i_rwsem shared > > first, then check if the extension will rescale the locks. If lock > > rescaling is needed, we have to take the i_rwsem exclusive to do the > > EOF extension. Otherwise, the bit lock that covers EOF will > > serialise file extensions so it can be done under a shared i_rwsem > > safely. > > > > This will allow buffered writes to remain atomic w.r.t. each other, > > and potentially allow buffered reads to wait on writes to the same > > segment and so potentially provide buffered read vs buffered write > > atomicity as well. > > > > If we need more concurrency than an unsigned long worth of bits for > > buffered writes, then maybe we can enlarge the bitmap further. > > > > I suspect this can be extended to direct IO in a similar way to > > buffered reads, and that then opens up the possibility of truncate > > and fallocate() being able to use the bitmap for range exclusion, > > too. > > > > The overhead is likely minimal - setting and clearing bits in a > > bitmap, as opposed to tracking ranges in a tree structure.... > > > > Thoughts? > > I think that's a very neat idea, but it will not address the reference > benchmark. > The reference benchmark I started the original report with which is similar > to my understanding to the benchmark that Chi is running simulates the > workload of a database writing with buffered IO. > > That means a very large file and small IO size ~64K. > Leaving the probability of intersecting writes in the same segment quite high. Likely - I recognised this granularity problem, though: | If we need more concurrency than an unsigned long worth of bits for | buffered writes, then maybe we can enlarge the bitmap further. We could also hash offsets into the bitmap so that offset-local IO hit different locks - the bit lock doesn't necessarily need to be range based. i.e. this is a "finer grained" lock that will typically increase concurrency. If we keep striving for perfect (i.e. scalable range locks) we're not going to improve the situation any time soon... > Can we do this opportunistically based on available large folios? > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED > if it is not, use IOLOCK_EXCL? The biggest problem with this is that direct IO will -not- do folio-by folio locking, and so folio based locking does not work for direct IO exclusion. 
Currently we get coherency against buffered writes by writing back or invalidating dirty folios before doing the DIO read/write. Because we hold the IOLOCK shared, a buffered write will not redirty the page cache until the DIO write has been submitted (and completed for non-async DIO). Hence moving to shared i_rwsem, folio-based range locking for buffered writes will lose all serialisation against DIO operations. We will lose what coherency we currently have between buffered write ops and DIO, and I don't think that's an acceptable trade-off. i.e. The problem with using the mapping tree for DIO coherency is, once again, locking overhead. If we have to insert exceptional entries to lock the range in the mapping tree because there is no folio present (for DIO to serialise against new buffered IOs), then we are simply back to the same exclusive tree update scalability problem that tree based range lock algorithms have.... That's why I suggested a bitmap lock external to both buffered and direct IO... > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios > naturally align to IO size and above? Maybe, but I don't think folio/mapping tree based locking is a workable solution. Hence the external bitmap idea... -Dave.
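As a rough sketch of the hashed bitmap variant described above (the io_seg_map field would be a hypothetical per-inode member, the 64kB granularity is the minimum segment size mentioned earlier, and EOF-driven rescaling is omitted):

/*
 * Sketch only: hash the write offset onto one of 64 bit locks held
 * under a shared i_rwsem.  The map pointer is assumed to be a
 * hypothetical unsigned long embedded in the inode.
 */
#define IO_SEG_SHIFT	16	/* 64kB minimum segment size */

static int io_segment_lock(unsigned long *map, loff_t pos)
{
	int bit = hash_64(pos >> IO_SEG_SHIFT, 6);

	/* shared IOLOCK already held; exclusive bit lock for the segment */
	return wait_on_bit_lock(map, bit, TASK_UNINTERRUPTIBLE);
}

static void io_segment_unlock(unsigned long *map, loff_t pos)
{
	int bit = hash_64(pos >> IO_SEG_SHIFT, 6);

	clear_bit_unlock(bit, map);
	smp_mb__after_atomic();
	wake_up_bit(map, bit);
}

A write that straddles a segment boundary would need to take both bits in a fixed order to avoid deadlock, and buffered reads could optionally wait on the same bit to get read-vs-write atomicity, as suggested in the quoted text.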
On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: > Sorry if this is out of left field as I haven't followed the discussion > closely, but I presumed one of the reasons Darrick and Christoph raised > the idea of using the folio batch thing I'm playing around with on zero > range for buffered writes would be to acquire and lock all targeted > folios up front. If so, would that help with what you're trying to > achieve here? (If not, nothing to see here, move along.. ;). I mostly thought about acquiring, as locking doesn't really have much of a batching effect. That being said, now that you've got the idea in my mind, here's my early morning brainfart on it: Let's ignore DIRECT I/O for the first step. In that case lookup / allocation and locking all folios for write before copying data will remove the need for i_rwsem in the read and write path. In a way that sounds perfect, and given that btrfs already does that (although in a very convoluted way) we know it's possible. But direct I/O throws a big monkey wrench here as already mentioned by others. Now one interesting thing some file systems have done is to serialize buffered against direct I/O, either by waiting for one to finish, or by simply forcing buffered I/O when direct I/O would conflict. It's easy to detect outstanding direct I/O using i_dio_count so buffered I/O could wait for that, and downgrading to buffered I/O (potentially using the new uncached mode from Jens) if there are any pages on the mapping after the invalidation also sounds pretty doable. I don't really have time to turn this hand waving into code, but maybe we should think about whether it's worthwhile or if I'm missing something important.
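A rough illustration of the two checks being hand-waved at here: buffered I/O waiting for in-flight direct I/O, and direct I/O falling back when invalidation leaves folios behind. Where these checks would live and what the fallback policy would be are open questions, and the reply below explains why the i_dio_count wait on its own is not a submission barrier.

/*
 * Sketch only: rough shape of the serialisation proposed above.
 */
static void buffered_write_wait_for_dio(struct inode *inode)
{
	/*
	 * If direct I/O is in flight, wait for it to drain before
	 * dirtying the page cache under a shared lock.
	 */
	if (atomic_read(&inode->i_dio_count))
		inode_dio_wait(inode);
}

static bool dio_should_fall_back_to_buffered(struct kiocb *iocb, size_t count)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t end = iocb->ki_pos + count - 1;

	/* Flush dirty folios covering the DIO range... */
	if (filemap_write_and_wait_range(mapping, iocb->ki_pos, end))
		return true;

	/*
	 * ...and if invalidation still leaves folios behind (e.g. mmap
	 * pins), give up on direct I/O for this request.
	 */
	return invalidate_inode_pages2_range(mapping,
			iocb->ki_pos >> PAGE_SHIFT, end >> PAGE_SHIFT) != 0;
}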
On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote: > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: > > Sorry if this is out of left field as I haven't followed the discussion > > closely, but I presumed one of the reasons Darrick and Christoph raised > > the idea of using the folio batch thing I'm playing around with on zero > > range for buffered writes would be to acquire and lock all targeted > > folios up front. If so, would that help with what you're trying to > > achieve here? (If not, nothing to see here, move along.. ;). > > I mostly thought about acquiring, as locking doesn't really have much > batching effects. That being said, no that you got the idea in my mind > here's my early morning brainfart on it: > > Let's ignore DIRECT I/O for the first step. In that case lookup / > allocation and locking all folios for write before copying data will > remove the need for i_rwsem in the read and write path. In a way that > sounds perfect, and given that btrfs already does that (although in a > very convoluted way) we know it's possible. Yes, this seems like a sane, general approach to allowing concurrent buffered writes (and reads). > But direct I/O throws a big monkey wrench here as already mentioned by > others. Now one interesting thing some file systems have done is > to serialize buffered against direct I/O, either by waiting for one > to finish, or by simply forcing buffered I/O when direct I/O would > conflict. Right. We really don't want to downgrade to buffered IO if we can help it, though. > It's easy to detect outstanding direct I/O using i_dio_count > so buffered I/O could wait for that, and downgrading to buffered I/O > (potentially using the new uncached mode from Jens) if there are any > pages on the mapping after the invalidation also sounds pretty doable. It's much harder to sanely serialise DIO against buffered writes this way, because i_dio_count only forms a submission barrier in conjunction with the i_rwsem being held exclusively. e.g. ongoing DIO would result in the buffered write being indefinitely delayed. I think the model and method that bcachefs uses is probably the best way to move forward - the "two-state exclusive shared" lock which it uses to do buffered vs direct exclusion is a simple, easy way to handle this problem. The same-state shared locking fast path is a single atomic cmpxchg operation, so it has neglible extra overhead compared to using a rwsem in the shared DIO fast path. The lock also has non-owner semantics, so DIO can take it during submission and then drop it during IO completion. This solves the problem we currently use the i_rwsem and inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission barrier and waiting for all existing DIO to drain). IOWs, a two-state shared lock provides the mechanism to allow DIO to be done without holding the i_rwsem at all, as well as being able to elide two atomic operations per DIO to track in-flight DIOs. We'd get this whilst maintaining buffered/DIO coherency without adding any new overhead to the DIO path, and allow concurrent buffered reads and writes that have their atomicity defined by the batched folio locking strategy that Brian is working on... This only leaves DIO coherency issues with mmap() based IO as an issue, but that's a problem for a different day... > I don't really have time to turn this hand waving into, but maybe we > should think if it's worthwhile or if I'm missing something important. 
If people are OK with XFS moving to an exclusive buffered or DIO submission model, then I can find some time to work on converting the IO path locking to use a two-state shared lock in preparation for the batched folio stuff that will allow concurrent buffered writes... -Dave.
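For readers unfamiliar with the bcachefs mechanism referred to above, here is a generic sketch of a two-state exclusive-shared lock (not the bcachefs code): positive counts are buffered I/O holders, negative counts are direct I/O holders, and the same-state fast path is a single cmpxchg. Because no owning task is recorded, DIO can take it at submission and release it at completion, which is the non-owner behaviour mentioned above.

/*
 * Sketch only: a generic two-state exclusive-shared lock.  Holders of
 * the same state share the lock; the other state must wait until the
 * count returns to zero.
 */
struct two_state_lock {
	atomic_long_t		v;	/* > 0: buffered, < 0: direct */
	wait_queue_head_t	wait;	/* init with init_waitqueue_head() */
};

static bool two_state_trylock(struct two_state_lock *lock, bool dio)
{
	long old = atomic_long_read(&lock->v);
	long new;

	do {
		if (dio ? old > 0 : old < 0)
			return false;		/* other state holds it */
		new = old + (dio ? -1 : 1);
	} while (!atomic_long_try_cmpxchg(&lock->v, &old, new));
	return true;
}

static void two_state_lock(struct two_state_lock *lock, bool dio)
{
	wait_event(lock->wait, two_state_trylock(lock, dio));
}

static void two_state_unlock(struct two_state_lock *lock, bool dio)
{
	/* Last holder of a state wakes up waiters for the other state. */
	if (!atomic_long_add_return(dio ? 1 : -1, &lock->v))
		wake_up_all(&lock->wait);
}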
On Thu, Jan 16, 2025 at 08:41:21AM +1100, Dave Chinner wrote: > > to finish, or by simply forcing buffered I/O when direct I/O would > > conflict. > > Right. We really don't want to downgrade to buffered IO if we can > help it, though. Of course we never want it. But if we do have invalidation failures and thus still folios in the page cache it might be the least bad of all the bad options. > It's much harder to sanely serialise DIO against buffered writes > this way, because i_dio_count only forms a submission barrier in > conjunction with the i_rwsem being held exclusively. e.g. ongoing > DIO would result in the buffered write being indefinitely delayed. Or any other exclusive lock taken by all submitters, yes. > I think the model and method that bcachefs uses is probably the best > way to move forward - the "two-state exclusive shared" lock which it > uses to do buffered vs direct exclusion is a simple, easy way to > handle this problem. The same-state shared locking fast path is a > single atomic cmpxchg operation, so it has neglible extra overhead > compared to using a rwsem in the shared DIO fast path. NFS and ocfs2 have been doing this for about two decades as well. > This only leaves DIO coherency issues with mmap() based IO as an > issue, but that's a problem for a different day... I think it's a generally unsolvable problem that we can just whack enough to make it good enough in practice for the few workloads that matter. > > > I don't really have time to turn this hand waving into, but maybe we > > should think if it's worthwhile or if I'm missing something important. > > If people are OK with XFS moving to exclusive buffered or DIO > submission model, then I can find some time to work on the > converting the IO path locking to use a two-state shared lock in > preparation for the batched folio stuff that will allow concurrent > buffered writes... This does sound fine to me, but it's hard to judge without seeing a prototype and results based on it.
On Thu, Jan 16, 2025 at 08:41:21AM +1100, Dave Chinner wrote: > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote: > > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: ... > > > I don't really have time to turn this hand waving into, but maybe we > > should think if it's worthwhile or if I'm missing something important. > > If people are OK with XFS moving to exclusive buffered or DIO > submission model, then I can find some time to work on the > converting the IO path locking to use a two-state shared lock in > preparation for the batched folio stuff that will allow concurrent > buffered writes... > Ack to this, FWIW. I think this is a natural/logical approach, prototyping and whatnot notwithstanding. Brian > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote: > > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: > > > Sorry if this is out of left field as I haven't followed the discussion > > > closely, but I presumed one of the reasons Darrick and Christoph raised > > > the idea of using the folio batch thing I'm playing around with on zero > > > range for buffered writes would be to acquire and lock all targeted > > > folios up front. If so, would that help with what you're trying to > > > achieve here? (If not, nothing to see here, move along.. ;). > > > > I mostly thought about acquiring, as locking doesn't really have much > > batching effects. That being said, no that you got the idea in my mind > > here's my early morning brainfart on it: > > > > Let's ignore DIRECT I/O for the first step. In that case lookup / > > allocation and locking all folios for write before copying data will > > remove the need for i_rwsem in the read and write path. In a way that > > sounds perfect, and given that btrfs already does that (although in a > > very convoluted way) we know it's possible. > > Yes, this seems like a sane, general approach to allowing concurrent > buffered writes (and reads). > > > But direct I/O throws a big monkey wrench here as already mentioned by > > others. Now one interesting thing some file systems have done is > > to serialize buffered against direct I/O, either by waiting for one > > to finish, or by simply forcing buffered I/O when direct I/O would > > conflict. > > Right. We really don't want to downgrade to buffered IO if we can > help it, though. > > > It's easy to detect outstanding direct I/O using i_dio_count > > so buffered I/O could wait for that, and downgrading to buffered I/O > > (potentially using the new uncached mode from Jens) if there are any > > pages on the mapping after the invalidation also sounds pretty doable. > > It's much harder to sanely serialise DIO against buffered writes > this way, because i_dio_count only forms a submission barrier in > conjunction with the i_rwsem being held exclusively. e.g. ongoing > DIO would result in the buffered write being indefinitely delayed. Isn't this already the case today with EX vs. SH iolock? I guess the answer depends whether or not i_rwsem starves existing writers in the face of ongoing new readers. If my memory serves me right, this exact behavior of i_rwsem is what sparked the original regression report when xfs moved from i_iolock to i_rwsem [1]. [1] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > > I think the model and method that bcachefs uses is probably the best > way to move forward - the "two-state exclusive shared" lock which it > uses to do buffered vs direct exclusion is a simple, easy way to > handle this problem. The same-state shared locking fast path is a > single atomic cmpxchg operation, so it has neglible extra overhead > compared to using a rwsem in the shared DIO fast path. > > The lock also has non-owner semantics, so DIO can take it during > submission and then drop it during IO completion. This solves the > problem we currently use the i_rwsem and > inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission > barrier and waiting for all existing DIO to drain). 
> > IOWs, a two-state shared lock provides the mechanism to allow DIO > to be done without holding the i_rwsem at all, as well as being able > to elide two atomic operations per DIO to track in-flight DIOs. > > We'd get this whilst maintaining buffered/DIO coherency without > adding any new overhead to the DIO path, and allow concurrent > buffered reads and writes that have their atomicity defined by the > batched folio locking strategy that Brian is working on... > > This only leaves DIO coherency issues with mmap() based IO as an > issue, but that's a problem for a different day... > > > I don't really have time to turn this hand waving into, but maybe we > > should think if it's worthwhile or if I'm missing something important. > > If people are OK with XFS moving to exclusive buffered or DIO > submission model, then I can find some time to work on the > converting the IO path locking to use a two-state shared lock in > preparation for the batched folio stuff that will allow concurrent > buffered writes... I won't object to getting the best of all worlds, but I have to say, upfront, this sounds a bit like premature optimization and for a workload (mixed buffered/dio write) that I don't think anybody does in practice and nobody should care how it performs. Am I wrong? For all practical purposes, we could maintain a counter in inode not for submitted DIO, but for files opened O_DIRECT. The first open O_DIRECT could serialize with in-flight buffered writes holding a shared iolock and then buffered writes could take SH vs. EX iolock depending on folio state and on i_dio_open_count. I would personally prefer a simple solution that is good enough and has a higher likelihood for allocating the development, review and testing resources that are needed to bring it to the finish line. Thanks, Amir.
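Purely to illustrate the i_dio_open_count proposal (the field and the open hook are hypothetical, and the replies below point out that fcntl(F_SETFL) can toggle O_DIRECT without going through ->open(), so this is not a workable design as written):

/*
 * Sketch only: i_dio_open_count is an assumed, non-existent xfs_inode
 * field counting files currently open with O_DIRECT.
 */
static int example_open(struct inode *inode, struct file *filp)
{
	struct xfs_inode *ip = XFS_I(inode);

	if (filp->f_flags & O_DIRECT) {
		/*
		 * First O_DIRECT opener drains share-locked buffered
		 * writers before any DIO can be submitted.
		 */
		if (atomic_inc_return(&ip->i_dio_open_count) == 1) {
			xfs_ilock(ip, XFS_IOLOCK_EXCL);
			xfs_iunlock(ip, XFS_IOLOCK_EXCL);
		}
	}
	return 0;
}

static unsigned int buffered_write_iolock(struct xfs_inode *ip)
{
	/* Any O_DIRECT opener forces buffered writes back to exclusive. */
	return atomic_read(&ip->i_dio_open_count) ?
			XFS_IOLOCK_EXCL : XFS_IOLOCK_SHARED;
}

A matching ->release() hook would decrement the counter when the last O_DIRECT file is closed.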
On 2025/1/16 05:41, Dave Chinner wrote: > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote: >> On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: >>> Sorry if this is out of left field as I haven't followed the discussion >>> closely, but I presumed one of the reasons Darrick and Christoph raised >>> the idea of using the folio batch thing I'm playing around with on zero >>> range for buffered writes would be to acquire and lock all targeted >>> folios up front. If so, would that help with what you're trying to >>> achieve here? (If not, nothing to see here, move along.. ;). >> >> I mostly thought about acquiring, as locking doesn't really have much >> batching effects. That being said, no that you got the idea in my mind >> here's my early morning brainfart on it: >> >> Let's ignore DIRECT I/O for the first step. In that case lookup / >> allocation and locking all folios for write before copying data will >> remove the need for i_rwsem in the read and write path. In a way that >> sounds perfect, and given that btrfs already does that (although in a >> very convoluted way) we know it's possible. > > Yes, this seems like a sane, general approach to allowing concurrent > buffered writes (and reads). > >> But direct I/O throws a big monkey wrench here as already mentioned by >> others. Now one interesting thing some file systems have done is >> to serialize buffered against direct I/O, either by waiting for one >> to finish, or by simply forcing buffered I/O when direct I/O would >> conflict. > > Right. We really don't want to downgrade to buffered IO if we can > help it, though. > >> It's easy to detect outstanding direct I/O using i_dio_count >> so buffered I/O could wait for that, and downgrading to buffered I/O >> (potentially using the new uncached mode from Jens) if there are any >> pages on the mapping after the invalidation also sounds pretty doable. > > It's much harder to sanely serialise DIO against buffered writes > this way, because i_dio_count only forms a submission barrier in > conjunction with the i_rwsem being held exclusively. e.g. ongoing > DIO would result in the buffered write being indefinitely delayed. > > I think the model and method that bcachefs uses is probably the best > way to move forward - the "two-state exclusive shared" lock which it > uses to do buffered vs direct exclusion is a simple, easy way to > handle this problem. The same-state shared locking fast path is a > single atomic cmpxchg operation, so it has neglible extra overhead > compared to using a rwsem in the shared DIO fast path. > > The lock also has non-owner semantics, so DIO can take it during > submission and then drop it during IO completion. This solves the > problem we currently use the i_rwsem and > inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission > barrier and waiting for all existing DIO to drain). > > IOWs, a two-state shared lock provides the mechanism to allow DIO > to be done without holding the i_rwsem at all, as well as being able > to elide two atomic operations per DIO to track in-flight DIOs. > > We'd get this whilst maintaining buffered/DIO coherency without > adding any new overhead to the DIO path, and allow concurrent > buffered reads and writes that have their atomicity defined by the > batched folio locking strategy that Brian is working on... > > This only leaves DIO coherency issues with mmap() based IO as an > issue, but that's a problem for a different day... 
> >> I don't really have time to turn this hand waving into, but maybe we >> should think if it's worthwhile or if I'm missing something important. > > If people are OK with XFS moving to exclusive buffered or DIO > submission model, then I can find some time to work on the > converting the IO path locking to use a two-state shared lock in > preparation for the batched folio stuff that will allow concurrent > buffered writes... > I really think it's time for someone to submit the basic patch, and then we can continue with the discussion and testing. :) BTW, I support the solution with folio batch and two-state shared lock. Thanks, Chi Zhiling
On Fri, Jan 17, 2025 at 02:27:46PM +0100, Amir Goldstein wrote: > On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote: > > > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: > > > > Sorry if this is out of left field as I haven't followed the discussion > > > > closely, but I presumed one of the reasons Darrick and Christoph raised > > > > the idea of using the folio batch thing I'm playing around with on zero > > > > range for buffered writes would be to acquire and lock all targeted > > > > folios up front. If so, would that help with what you're trying to > > > > achieve here? (If not, nothing to see here, move along.. ;). > > > > > > I mostly thought about acquiring, as locking doesn't really have much > > > batching effects. That being said, no that you got the idea in my mind > > > here's my early morning brainfart on it: > > > > > > Let's ignore DIRECT I/O for the first step. In that case lookup / > > > allocation and locking all folios for write before copying data will > > > remove the need for i_rwsem in the read and write path. In a way that > > > sounds perfect, and given that btrfs already does that (although in a > > > very convoluted way) we know it's possible. > > > > Yes, this seems like a sane, general approach to allowing concurrent > > buffered writes (and reads). > > > > > But direct I/O throws a big monkey wrench here as already mentioned by > > > others. Now one interesting thing some file systems have done is > > > to serialize buffered against direct I/O, either by waiting for one > > > to finish, or by simply forcing buffered I/O when direct I/O would > > > conflict. > > > > Right. We really don't want to downgrade to buffered IO if we can > > help it, though. > > > > > It's easy to detect outstanding direct I/O using i_dio_count > > > so buffered I/O could wait for that, and downgrading to buffered I/O > > > (potentially using the new uncached mode from Jens) if there are any > > > pages on the mapping after the invalidation also sounds pretty doable. > > > > It's much harder to sanely serialise DIO against buffered writes > > this way, because i_dio_count only forms a submission barrier in > > conjunction with the i_rwsem being held exclusively. e.g. ongoing > > DIO would result in the buffered write being indefinitely delayed. > > Isn't this already the case today with EX vs. SH iolock? No. We do not hold the i_rwsem across async DIO read or write, so we can have DIO in flight whilst a buffered write grabs and holds the i_rwsem exclusive. This is the problem that i_dio_count solves for truncate, etc. i.e. the only way to actually ensure there is no DIO in flight is to grab the i_rwsem exclusive and then call inode_dio_wait() to ensure all async DIO in flight completes before continuing. > I guess the answer depends whether or not i_rwsem > starves existing writers in the face of ongoing new readers. It does not. If the lock is held for read and there is a pending write, new readers are queued behind the pending write waiters (see rwsem_down_read_slowpath(), which is triggered if there are pending waiters on an attempt to read lock). > > > I don't really have time to turn this hand waving into, but maybe we > > > should think if it's worthwhile or if I'm missing something important. 
> > > > If people are OK with XFS moving to exclusive buffered or DIO > > submission model, then I can find some time to work on the > > converting the IO path locking to use a two-state shared lock in > > preparation for the batched folio stuff that will allow concurrent > > buffered writes... > > I won't object to getting the best of all worlds, but I have to say, > upfront, this sounds a bit like premature optimization and for > a workload (mixed buffered/dio write) Say what? The dio/buffered exclusion change provides a different data coherency and correctness model that is required before optimising the workload you care about - mixed buffered read/write. This change doesn't optimise either DIO or buffered IO, nor is a shared-exclusive BIO/DIO lock going to improve performance of such mixed IO workloads in any way. IOWs, it's pretty hard to call a locking model change like this "optimisation", let alone call it "premature". > that I don't think anybody > does in practice and nobody should care how it performs. > Am I wrong? Yes and no. Mixed buffered/direct IO workloads are more common than you think. e.g. many backup programs use direct IO and so inherently mix DIO with buffered IO for the majority of files the backup program touches. However, all we care about in this case is data coherency, not performance, and this change should improve data coherency between DIO and buffered IO... > For all practical purposes, we could maintain a counter in inode > not for submitted DIO, but for files opened O_DIRECT. > > The first open O_DIRECT could serialize with in-flight > buffered writes holding a shared iolock and then buffered writes > could take SH vs. EX iolock depending on folio state and on > i_dio_open_count. I don't see how this can be made to work w.r.t. sane data coherency behaviour. e.g. how does this model serialise a new DIO write that is submitted after a share-locked buffered write has just started and passed all the "i_dio_count == 0" checks that enable it to use shared locking? i.e. we now have potentially overlapping buffered writes and DIO writes being done concurrently because the buffered write may not have instantiated folios in the page cache yet.... > I would personally prefer a simple solution that is good enough > and has a higher likelihood for allocating the development, review > and testing resources that are needed to bring it to the finish line. Have you looked at how simple the bcachefs buffered/dio exclusion implementation is? The lock mechanism itself is about 50 lines of code, and there are only 4 or 5 places where the lock is actually needed. It doesn't get much simpler than that. And, quite frankly, the fact the bcachefs solution also covers AIO DIO in flight (which i_rwsem based locking does not!) means it is a more robust solution than trying to rely on racy i_dio_count hacks and folio residency in the page cache... -Dave.
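For reference, the existing DIO submission barrier described earlier in this mail looks roughly like this in the truncate/fallocate-style paths (simplified sketch, not the actual XFS code):

/*
 * Sketch only: the current DIO submission barrier.  Holding the IOLOCK
 * exclusive stops new submissions; inode_dio_wait() then waits for
 * i_dio_count to drop to zero so no async DIO remains in flight.
 */
static void drain_dio_for_example_op(struct xfs_inode *ip)
{
	xfs_ilock(ip, XFS_IOLOCK_EXCL);	/* block new DIO submissions */
	inode_dio_wait(VFS_I(ip));	/* wait for in-flight async DIO */

	/* ... truncate/fallocate-style work happens here ... */

	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
}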
On Wed, Jan 15, 2025 at 08:36:48PM -0800, Christoph Hellwig wrote: > On Thu, Jan 16, 2025 at 08:41:21AM +1100, Dave Chinner wrote: > > > I don't really have time to turn this hand waving into, but maybe we > > > should think if it's worthwhile or if I'm missing something important. > > > > If people are OK with XFS moving to exclusive buffered or DIO > > submission model, then I can find some time to work on the > > converting the IO path locking to use a two-state shared lock in > > preparation for the batched folio stuff that will allow concurrent > > buffered writes... > > This does sound fine to me, but it's hard to judge without seeing > a prototype and results based on it. OK, the consensus seems to be to run with this for the moment and see how clean it ends up being. I'll look into it over the next couple of weeks then. -Dave.
On Fri, Jan 17, 2025 at 11:19 PM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Jan 17, 2025 at 02:27:46PM +0100, Amir Goldstein wrote: > > On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote: > > > > On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: > > > > > Sorry if this is out of left field as I haven't followed the discussion > > > > > closely, but I presumed one of the reasons Darrick and Christoph raised > > > > > the idea of using the folio batch thing I'm playing around with on zero > > > > > range for buffered writes would be to acquire and lock all targeted > > > > > folios up front. If so, would that help with what you're trying to > > > > > achieve here? (If not, nothing to see here, move along.. ;). > > > > > > > > I mostly thought about acquiring, as locking doesn't really have much > > > > batching effects. That being said, no that you got the idea in my mind > > > > here's my early morning brainfart on it: > > > > > > > > Let's ignore DIRECT I/O for the first step. In that case lookup / > > > > allocation and locking all folios for write before copying data will > > > > remove the need for i_rwsem in the read and write path. In a way that > > > > sounds perfect, and given that btrfs already does that (although in a > > > > very convoluted way) we know it's possible. > > > > > > Yes, this seems like a sane, general approach to allowing concurrent > > > buffered writes (and reads). > > > > > > > But direct I/O throws a big monkey wrench here as already mentioned by > > > > others. Now one interesting thing some file systems have done is > > > > to serialize buffered against direct I/O, either by waiting for one > > > > to finish, or by simply forcing buffered I/O when direct I/O would > > > > conflict. > > > > > > Right. We really don't want to downgrade to buffered IO if we can > > > help it, though. > > > > > > > It's easy to detect outstanding direct I/O using i_dio_count > > > > so buffered I/O could wait for that, and downgrading to buffered I/O > > > > (potentially using the new uncached mode from Jens) if there are any > > > > pages on the mapping after the invalidation also sounds pretty doable. > > > > > > It's much harder to sanely serialise DIO against buffered writes > > > this way, because i_dio_count only forms a submission barrier in > > > conjunction with the i_rwsem being held exclusively. e.g. ongoing > > > DIO would result in the buffered write being indefinitely delayed. > > > > Isn't this already the case today with EX vs. SH iolock? > > No. We do not hold the i_rwsem across async DIO read or write, so > we can have DIO in flight whilst a buffered write grabs and holds > the i_rwsem exclusive. This is the problem that i_dio_count solves > for truncate, etc. i.e. the only way to actually ensure there is no > DIO in flight is to grab the i_rwsem exclusive and then call > inode_dio_wait() to ensure all async DIO in flight completes before > continuing. > > > I guess the answer depends whether or not i_rwsem > > starves existing writers in the face of ongoing new readers. > > It does not. If the lock is held for read and there is a pending > write, new readers are queued behind the pending write waiters > (see rwsem_down_read_slowpath(), which is triggered if there are > pending waiters on an attempt to read lock). 
> > > > > I don't really have time to turn this hand waving into, but maybe we > > > > should think if it's worthwhile or if I'm missing something important. > > > > > > If people are OK with XFS moving to exclusive buffered or DIO > > > submission model, then I can find some time to work on the > > > converting the IO path locking to use a two-state shared lock in > > > preparation for the batched folio stuff that will allow concurrent > > > buffered writes... > > > > I won't object to getting the best of all worlds, but I have to say, > > upfront, this sounds a bit like premature optimization and for > > a workload (mixed buffered/dio write) > > Say what? The dio/buffered exclusion change provides a different > data coherency and correctness model that is required to then > optimising the workload you care about - mixed buffered read/write. > > This change doesn't optimise either DIO or buffered IO, nor is a > shared-exclusive BIO/DIO lock isn't going to improve performance of > such mixed IO workloads in any way. IOWs, it's pretty hard to call > a locking model change like this "optimisation", let alone call it > "premature". > I was questioning the need to optimize the mixed buffered read/write *while file is also open for O_DIRECT* I thought that the solution would be easier if can take Christof's "Let's ignore DIRECT I/O for the first step" and make it testable. Then if the inode is open O_DIRECT, no change to locking. > > that I don't think anybody > > does in practice and nobody should care how it performs. > > Am I wrong? > > Yes and no. mixed buffered/direct IO worklaods are more common than > you think. e.g. many backup programs use direct IO and so inherently > mix DIO with buffered IO for the majority of files the backup > program touches. Still feels like we can forego performance of mixed buffered rw *while the backup program reads the database file*, should a backup program ever really read a large database file... > However, all we care about in this case is data > coherency, not performance, and this change should improve data > coherency between DIO and buffered IO... > I did not understand that you are proposing an improvement over the existing state of affairs. Yeh, I certainly have no objections to fixing things that are wrong. > > For all practical purposes, we could maintain a counter in inode > > not for submitted DIO, but for files opened O_DIRECT. > > > > The first open O_DIRECT could serialize with in-flight > > buffered writes holding a shared iolock and then buffered writes > > could take SH vs. EX iolock depending on folio state and on > > i_dio_open_count. > > I don't see how this can be made to work w.r.t. sane data coherency > behaviour. e.g. how does this model serialise a new DIO write that > is submitted after a share-locked buffered write has just started > and passed all the "i_dio_count == 0" checks that enable it to use > shared locking? i.e. we now have potentially overlapping buffered > writes and DIO writes being done concurrently because the buffered > write may not have instantiated folios in the page cache yet.... > A shared-locked buffered write can only start when there is no O_DIRECT file open on the inode. The first open for O_DIRECT to increment i_dio_open_count needs to take exclusive iolock to wait for all in-flight buffered writes then release the iolock. All DIO submitted via O_DIRECT fds will be safe against in-flight share-locked buffered write. 
> > I would personally prefer a simple solution that is good enough > > and has a higher likelihood for allocating the development, review > > and testing resources that are needed to bring it to the finish line. > > Have you looked at how simple the bcachefs buffered/dio exclusion > implementation is? The lock mechanism itself is about 50 lines of > code, and there are only 4 or 5 places where the lock is actually > needed. It doesn't get much simpler than that. > > And, quite frankly, the fact the bcachefs solution also covers AIO > DIO in flight (which i_rwsem based locking does not!) means it is a > more robust solution than trying to rely on racy i_dio_count hacks > and folio residency in the page cache... I will be happy to see this improvement. Thanks, Amir.
On Sat, Jan 18, 2025 at 02:03:41PM +0100, Amir Goldstein wrote: > On Fri, Jan 17, 2025 at 11:19 PM Dave Chinner > <david@fromorbit.com> wrote: > > On Fri, Jan 17, 2025 at 02:27:46PM +0100, Amir Goldstein wrote: > > > On Wed, Jan 15, 2025 at 10:41 PM Dave Chinner > > > <david@fromorbit.com> wrote: For all practical purposes, we > > > could maintain a counter in inode not for submitted DIO, but > > > for files opened O_DIRECT. > > > > > > The first open O_DIRECT could serialize with in-flight > > > buffered writes holding a shared iolock and then buffered > > > writes could take SH vs. EX iolock depending on folio state > > > and on i_dio_open_count. > > > > I don't see how this can be made to work w.r.t. sane data > > coherency behaviour. e.g. how does this model serialise a new > > DIO write that is submitted after a share-locked buffered write > > has just started and passed all the "i_dio_count == 0" checks > > that enable it to use shared locking? i.e. we now have > > potentially overlapping buffered writes and DIO writes being > > done concurrently because the buffered write may not have > > instantiated folios in the page cache yet.... > > > > A shared-locked buffered write can only start when there is no > O_DIRECT file open on the inode. The first open for O_DIRECT to > increment i_dio_open_count needs to take exclusive iolock to wait > for all in-flight buffered writes then release the iolock. > > All DIO submitted via O_DIRECT fds will be safe against in-flight > share-locked buffered write. Hooking ->open() and ->release() isn't sufficient. O_DIRECT can be changed at any time via fcntl(F_SETFL) which isn't synchronised against IO in progress at all. i.e. we can't track that, nor can we even use f_ops->check_flags to reject changes to O_DIRECT state because it doesn't pass either the filp or the inode.... IOWs, as it stands logic like "is the file opened for O_DIRECT" in the IO path is inherently racy and the racing mechanisms are directly under the control of unprivileged userspace applications. Without mods from the VFS down and new hooks being implemented in XFS along with all the whacky "which serialisiation method do we use" logic and dealing with the complexities and switching between them, tracking struct files that are open for O_DIRECT isn't really a viable mechanism for enabling shared buffered writes... -Dave.
On Sat, Jan 18, 2025 at 09:19:15AM +1100, Dave Chinner wrote: > And, quite frankly, the fact the bcachefs solution also covers AIO > DIO in flight (which i_rwsem based locking does not!) means it is a > more robust solution than trying to rely on racy i_dio_count hacks > and folio residency in the page cache... The original i_rwsem (still i_iolock then) scheme did that, but the core locking maintainers asked us to remove the non-owner unlocks, so I did that. It turns out we later got officially sanctioned non-owner unlocks, so we could easily have added this back. I did that 5 years ago, but reception was lukewarm: http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/i_rwsem-non_owner
On Tue, Jan 21, 2025 at 10:08:10PM -0800, Christoph Hellwig wrote: > On Sat, Jan 18, 2025 at 09:19:15AM +1100, Dave Chinner wrote: > > And, quite frankly, the fact the bcachefs solution also covers AIO > > DIO in flight (which i_rwsem based locking does not!) means it is a > > more robust solution than trying to rely on racy i_dio_count hacks > > and folio residency in the page cache... > > The original i_rwsem (still i_iolock then) scheme did that, but the > core locking maintainers asked us to remove the non-owner unlocks, > so I did that. It turns out later we got officially sanctioned > non-owner unlocks, so we could have easily add this back. I did that > 5 years ago, but reception was lukewarm: > > http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/i_rwsem-non_owner I have no issues with the concept - I have always thought that holding the IOLOCK to completion was how the IO locking should work (like we do with the xfs_buf semaphore). The i_dio_count submission barrier solution was a necessary hack because rwsems did not work the way we needed to use them. What I didn't like about the proposal above was the implementation (i.e. the encoding of rwsem semantics in the iomap dio API and implementation), but there was never any followup that addressed the issues that were raised..... -Dave.
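For context, the sanctioned non-owner primitives being discussed are down_read_non_owner()/up_read_non_owner(); holding the IOLOCK to I/O completion would look something like the sketch below. Exactly where the unlock runs (bio completion vs. deferred dio completion work) is glossed over.

/*
 * Sketch only: take the i_rwsem shared at DIO submission and release
 * it from whatever context runs completion.
 */
static void dio_submit_hold_iolock(struct inode *inode)
{
	/* Taken at submission; completion may run in another context. */
	down_read_non_owner(&inode->i_rwsem);
}

static void dio_complete_release_iolock(struct inode *inode)
{
	/* Released by whoever runs completion, not the submitter. */
	up_read_non_owner(&inode->i_rwsem);
}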
On 2025/1/16 05:41, Dave Chinner wrote: > On Tue, Jan 14, 2025 at 09:55:21PM -0800, Christoph Hellwig wrote: >> On Mon, Jan 13, 2025 at 08:40:51AM -0500, Brian Foster wrote: >>> Sorry if this is out of left field as I haven't followed the discussion >>> closely, but I presumed one of the reasons Darrick and Christoph raised >>> the idea of using the folio batch thing I'm playing around with on zero >>> range for buffered writes would be to acquire and lock all targeted >>> folios up front. If so, would that help with what you're trying to >>> achieve here? (If not, nothing to see here, move along.. ;). >> >> I mostly thought about acquiring, as locking doesn't really have much >> batching effects. That being said, no that you got the idea in my mind >> here's my early morning brainfart on it: >> >> Let's ignore DIRECT I/O for the first step. In that case lookup / >> allocation and locking all folios for write before copying data will >> remove the need for i_rwsem in the read and write path. In a way that >> sounds perfect, and given that btrfs already does that (although in a >> very convoluted way) we know it's possible. > > Yes, this seems like a sane, general approach to allowing concurrent > buffered writes (and reads). > >> But direct I/O throws a big monkey wrench here as already mentioned by >> others. Now one interesting thing some file systems have done is >> to serialize buffered against direct I/O, either by waiting for one >> to finish, or by simply forcing buffered I/O when direct I/O would >> conflict. > > Right. We really don't want to downgrade to buffered IO if we can > help it, though. > >> It's easy to detect outstanding direct I/O using i_dio_count >> so buffered I/O could wait for that, and downgrading to buffered I/O >> (potentially using the new uncached mode from Jens) if there are any >> pages on the mapping after the invalidation also sounds pretty doable. > > It's much harder to sanely serialise DIO against buffered writes > this way, because i_dio_count only forms a submission barrier in > conjunction with the i_rwsem being held exclusively. e.g. ongoing > DIO would result in the buffered write being indefinitely delayed. > > I think the model and method that bcachefs uses is probably the best > way to move forward - the "two-state exclusive shared" lock which it > uses to do buffered vs direct exclusion is a simple, easy way to > handle this problem. The same-state shared locking fast path is a > single atomic cmpxchg operation, so it has neglible extra overhead > compared to using a rwsem in the shared DIO fast path. > > The lock also has non-owner semantics, so DIO can take it during > submission and then drop it during IO completion. This solves the > problem we currently use the i_rwsem and > inode_dio_{start,end/wait}() to solve (i.e. create a DIO submission > barrier and waiting for all existing DIO to drain). > > IOWs, a two-state shared lock provides the mechanism to allow DIO > to be done without holding the i_rwsem at all, as well as being able > to elide two atomic operations per DIO to track in-flight DIOs. > > We'd get this whilst maintaining buffered/DIO coherency without > adding any new overhead to the DIO path, and allow concurrent > buffered reads and writes that have their atomicity defined by the > batched folio locking strategy that Brian is working on... > > This only leaves DIO coherency issues with mmap() based IO as an > issue, but that's a problem for a different day... 
When I try to implement those features, I think we could use exclusive locks for RWF_APPEND writes, and shared locks for other writes.

The reason is that concurrent operations are also possible for extended writes if there is no overlap in the regions being written. So there is no need to determine whether it is an extended write in advance.

As for why an exclusive lock is needed for append writes, it's because we don't want the EOF to be modified during the append write.

The code is like that:

+	if (iocb->ki_flags & IOCB_APPEND)
+		iolock = XFS_IOLOCK_EXCL;
+	else
+		iolock = XFS_IOLOCK_SHARED;

If we use exclusive locks for all extended writes, we need to check if it is an extended write before acquiring the lock, but this value could become outdated if we do not hold the lock.

So if we do an extended write, we might need to follow this process:
1. Get read lock.
2. Check if it is an extended write.
3. Release read lock.
4. Get write lock.
5. Do the write operation.

Thanks,
Chi Zhiling
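For reference, here is a rough sketch of the "two-state exclusive shared" lock semantics Dave describes in the quoted mail above. This is not bcachefs's actual implementation; the type and function names are made up, and fairness/starvation handling is deliberately ignored:

#include <linux/atomic.h>
#include <linux/types.h>
#include <linux/wait.h>

/*
 * v > 0: held shared by |v| buffered I/O tasks
 * v < 0: held shared by |v| direct I/O tasks
 * v == 0: unlocked
 */
struct two_state_iolock {
	atomic_t		v;
	wait_queue_head_t	wait;
};

static bool two_state_trylock(struct two_state_iolock *l, bool dio)
{
	int old = atomic_read(&l->v);

	do {
		if (dio ? old > 0 : old < 0)
			return false;	/* held in the conflicting state */
	} while (!atomic_try_cmpxchg_acquire(&l->v, &old, old + (dio ? -1 : 1)));

	return true;
}

static void two_state_lock(struct two_state_iolock *l, bool dio)
{
	wait_event(l->wait, two_state_trylock(l, dio));
}

static void two_state_unlock(struct two_state_iolock *l, bool dio)
{
	/* DIO holders drove v negative, so they undo with +1. */
	if (atomic_add_return_release(dio ? 1 : -1, &l->v) == 0)
		wake_up_all(&l->wait);
}

The same-state fast path is a single cmpxchg, and there is no owner tracking, so a DIO submitter can take the lock and the completion context can drop it.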
On Tue, 7 Jan 2025 13:13:17 +0100, Amir Goldstein <amir73il@gmail.com> wrote: > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote: > > > > > > > > On 2024/12/29 06:17, Dave Chinner wrote: > > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: > > >> > > >> > > >> On 2024/12/27 05:50, Dave Chinner wrote: > > >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: > > >>>> From: Chi Zhiling <chizhiling@kylinos.cn> > > >>>> > > >>>> Using an rwsem to protect file data ensures that we can always obtain a > > >>>> completed modification. But due to the lock, we need to wait for the > > >>>> write process to release the rwsem before we can read it, even if we are > > >>>> reading a different region of the file. This could take a lot of time > > >>>> when many processes need to write and read this file. > > >>>> > > >>>> On the other hand, The ext4 filesystem and others do not hold the lock > > >>>> during buffered reading, which make the ext4 have better performance in > > >>>> that case. Therefore, I think it will be fine if we remove the lock in > > >>>> xfs, as most applications can handle this situation. > > >>> > > >>> Nope. > > >>> > > >>> This means that XFS loses high level serialisation of incoming IO > > >>> against operations like truncate, fallocate, pnfs operations, etc. > > >>> > > >>> We've been through this multiple times before; the solution lies in > > >>> doing the work to make buffered writes use shared locking, not > > >>> removing shared locking from buffered reads. > > >> > > >> You mean using shared locking for buffered reads and writes, right? > > >> > > >> I think it's a great idea. In theory, write operations can be performed > > >> simultaneously if they write to different ranges. > > > > > > Even if they overlap, the folio locks will prevent concurrent writes > > > to the same range. > > > > > > Now that we have atomic write support as native functionality (i.e. > > > RWF_ATOMIC), we really should not have to care that much about > > > normal buffered IO being atomic. i.e. if the application wants > > > atomic writes, it can now specify that it wants atomic writes and so > > > we can relax the constraints we have on existing IO... > > > > Yes, I'm not particularly concerned about whether buffered I/O is > > atomic. I'm more concerned about the multithreading performance of > > buffered I/O. > > Hi Chi, > > Sorry for joining late, I was on vacation. > I am very happy that you have taken on this task and your timing is good, > because John Garry just posted his patches for large atomic writes [1]. > > I need to explain the relation to atomic buffered I/O, because it is not > easy to follow it from the old discussions and also some of the discussions > about the solution were held in-person during LSFMM2024. > > Naturally, your interest is improving multithreading performance of > buffered I/O, so was mine when I first posted this question [2]. > > The issue with atomicity of buffered I/O is the xfs has traditionally > provided atomicity of write vs. read (a.k.a no torn writes), which is > not required by POSIX standard (because POSIX was not written with > threads in mind) and is not respected by any other in-tree filesystem. > > It is obvious that the exclusive rw lock for buffered write hurts performance > in the case of mixed read/write on xfs, so the question was - if xfs provides > atomic semantics that portable applications cannot rely on, why bother > providing these atomic semantics at all? 
> > Dave's answer to this question was that there are some legacy applications > (database applications IIRC) on production systems that do rely on the fact > that xfs provides this semantics and on the prerequisite that they run on xfs. > > However, it was noted that: > 1. Those application do not require atomicity for any size of IO, they > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > and they only require no torn writes for this I/O size > 2. Large folios and iomap can usually provide this semantics via folio lock, > but application has currently no way of knowing if the semantics are > provided or not > 3. The upcoming ability of application to opt-in for atomic writes larger > than fs block size [1] can be used to facilitate the applications that > want to legacy xfs semantics and avoid the need to enforce the legacy > semantics all the time for no good reason > > Disclaimer: this is not a standard way to deal with potentially breaking > legacy semantics, because old applications will not usually be rebuilt > to opt-in for the old semantics, but the use case in hand is probably > so specific, of a specific application that relies on xfs specific semantics > that there are currently no objections for trying this solution. > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > > > > > Last week, it was mentioned that removing i_rwsem would have some > > impacts on truncate, fallocate, and PNFS operations. > > > > (I'm not familiar with pNFS, so please correct me if I'm wrong.) > > You are not wrong. pNFS uses a "layout lease", which requires > that the blockmap of the file will not be modified while the lease is held. > but I think that block that are not mapped (i.e. holes) can be mapped > while the lease is held. > > > > > My understanding is that the current i_rwsem is used to protect both > > the file's data and its size. Operations like truncate, fallocate, > > and PNFS use i_rwsem because they modify both the file's data and its > > size. So, I'm thinking whether it's possible to use i_rwsem to protect > > only the file's size, without protecting the file's data. > > > > It also protects the file's blockmap, for example, punch hole > does not change the size, but it unmaps blocks from the blockmap, > leading to races that could end up reading stale data from disk > if the lock wasn't taken. > > > So operations that modify the file's size need to be executed > > sequentially. For example, buffered writes to the EOF, fallocate > > operations without the "keep size" requirement, and truncate operations, > > etc, all need to hold an exclusive lock. > > > > Other operations require a shared lock because they only need to access > > the file's size without modifying it. > > > > As far as I understand, exclusive lock is not needed for non-extending > writes, because it is ok to map new blocks. > I guess the need for exclusive lock for extending writes is related to > update of file size, but not 100% sure. > Anyway, exclusive lock on extending write is widely common in other fs, > while exclusive lock for non-extending write is unique to xfs. I am sorry but I can't understand, because I found ext4_buffered_write_iter() take exclusive lock unconditionally. Did I miss some context of the discussion? Thank you very much. 
Jinliang Zheng > > > > > > >> So we should track all the ranges we are reading or writing, > > >> and check whether the new read or write operations can be performed > > >> concurrently with the current operations. > > > > > > That is all discussed in detail in the discussions I linked. > > > > Sorry, I overlooked some details from old discussion last time. > > It seems that you are not satisfied with the effectiveness of > > range locks. > > > > Correct. This solution was taken off the table. > > I hope my explanation was correct and clear. > If anything else is not clear, please feel free to ask. > > Thanks, > Amir.
On Sat, Jan 25, 2025 at 9:43 AM Jinliang Zheng <alexjlzheng@gmail.com> wrote: > > On Tue, 7 Jan 2025 13:13:17 +0100, Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@163.com> wrote: > > > > > > > > > > > > On 2024/12/29 06:17, Dave Chinner wrote: > > > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote: > > > >> > > > >> > > > >> On 2024/12/27 05:50, Dave Chinner wrote: > > > >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote: > > > >>>> From: Chi Zhiling <chizhiling@kylinos.cn> > > > >>>> > > > >>>> Using an rwsem to protect file data ensures that we can always obtain a > > > >>>> completed modification. But due to the lock, we need to wait for the > > > >>>> write process to release the rwsem before we can read it, even if we are > > > >>>> reading a different region of the file. This could take a lot of time > > > >>>> when many processes need to write and read this file. > > > >>>> > > > >>>> On the other hand, The ext4 filesystem and others do not hold the lock > > > >>>> during buffered reading, which make the ext4 have better performance in > > > >>>> that case. Therefore, I think it will be fine if we remove the lock in > > > >>>> xfs, as most applications can handle this situation. > > > >>> > > > >>> Nope. > > > >>> > > > >>> This means that XFS loses high level serialisation of incoming IO > > > >>> against operations like truncate, fallocate, pnfs operations, etc. > > > >>> > > > >>> We've been through this multiple times before; the solution lies in > > > >>> doing the work to make buffered writes use shared locking, not > > > >>> removing shared locking from buffered reads. > > > >> > > > >> You mean using shared locking for buffered reads and writes, right? > > > >> > > > >> I think it's a great idea. In theory, write operations can be performed > > > >> simultaneously if they write to different ranges. > > > > > > > > Even if they overlap, the folio locks will prevent concurrent writes > > > > to the same range. > > > > > > > > Now that we have atomic write support as native functionality (i.e. > > > > RWF_ATOMIC), we really should not have to care that much about > > > > normal buffered IO being atomic. i.e. if the application wants > > > > atomic writes, it can now specify that it wants atomic writes and so > > > > we can relax the constraints we have on existing IO... > > > > > > Yes, I'm not particularly concerned about whether buffered I/O is > > > atomic. I'm more concerned about the multithreading performance of > > > buffered I/O. > > > > Hi Chi, > > > > Sorry for joining late, I was on vacation. > > I am very happy that you have taken on this task and your timing is good, > > because John Garry just posted his patches for large atomic writes [1]. > > > > I need to explain the relation to atomic buffered I/O, because it is not > > easy to follow it from the old discussions and also some of the discussions > > about the solution were held in-person during LSFMM2024. > > > > Naturally, your interest is improving multithreading performance of > > buffered I/O, so was mine when I first posted this question [2]. > > > > The issue with atomicity of buffered I/O is the xfs has traditionally > > provided atomicity of write vs. read (a.k.a no torn writes), which is > > not required by POSIX standard (because POSIX was not written with > > threads in mind) and is not respected by any other in-tree filesystem. 
> > > > It is obvious that the exclusive rw lock for buffered write hurts performance > > in the case of mixed read/write on xfs, so the question was - if xfs provides > > atomic semantics that portable applications cannot rely on, why bother > > providing these atomic semantics at all? > > > > Dave's answer to this question was that there are some legacy applications > > (database applications IIRC) on production systems that do rely on the fact > > that xfs provides this semantics and on the prerequisite that they run on xfs. > > > > However, it was noted that: > > 1. Those application do not require atomicity for any size of IO, they > > typically work in I/O size that is larger than block size (e.g. 16K or 64K) > > and they only require no torn writes for this I/O size > > 2. Large folios and iomap can usually provide this semantics via folio lock, > > but application has currently no way of knowing if the semantics are > > provided or not > > 3. The upcoming ability of application to opt-in for atomic writes larger > > than fs block size [1] can be used to facilitate the applications that > > want to legacy xfs semantics and avoid the need to enforce the legacy > > semantics all the time for no good reason > > > > Disclaimer: this is not a standard way to deal with potentially breaking > > legacy semantics, because old applications will not usually be rebuilt > > to opt-in for the old semantics, but the use case in hand is probably > > so specific, of a specific application that relies on xfs specific semantics > > that there are currently no objections for trying this solution. > > > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ > > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/ > > > > > > > > Last week, it was mentioned that removing i_rwsem would have some > > > impacts on truncate, fallocate, and PNFS operations. > > > > > > (I'm not familiar with pNFS, so please correct me if I'm wrong.) > > > > You are not wrong. pNFS uses a "layout lease", which requires > > that the blockmap of the file will not be modified while the lease is held. > > but I think that block that are not mapped (i.e. holes) can be mapped > > while the lease is held. > > > > > > > > My understanding is that the current i_rwsem is used to protect both > > > the file's data and its size. Operations like truncate, fallocate, > > > and PNFS use i_rwsem because they modify both the file's data and its > > > size. So, I'm thinking whether it's possible to use i_rwsem to protect > > > only the file's size, without protecting the file's data. > > > > > > > It also protects the file's blockmap, for example, punch hole > > does not change the size, but it unmaps blocks from the blockmap, > > leading to races that could end up reading stale data from disk > > if the lock wasn't taken. > > > > > So operations that modify the file's size need to be executed > > > sequentially. For example, buffered writes to the EOF, fallocate > > > operations without the "keep size" requirement, and truncate operations, > > > etc, all need to hold an exclusive lock. > > > > > > Other operations require a shared lock because they only need to access > > > the file's size without modifying it. > > > > > > > As far as I understand, exclusive lock is not needed for non-extending > > writes, because it is ok to map new blocks. > > I guess the need for exclusive lock for extending writes is related to > > update of file size, but not 100% sure. 
> > > > Anyway, exclusive lock on extending write is widely common in other fs, > > while exclusive lock for non-extending write is unique to xfs. > > I am sorry but I can't understand, because I found ext4_buffered_write_iter() > take exclusive lock unconditionally. > > Did I miss some context of the discussion? >

No, I misspoke. Sorry. The behavior that is unique to xfs is shared iolock on buffered read. ext4 (and the rest) do not take iolock on buffered reads. Internally, filemap_read() takes filemap_invalidate_lock_shared() as needed to avoid races with truncate and punch hole. Before the introduction of filemap_invalidate_lock(), xfs was the only fs that was safe from races of punch hole vs. buffered read leading to reading stale data.

One way to fix the mixed buffered rw performance issue is to follow suit with other fs and not take shared iolock on buffered reads in xfs, but that has the downside of regressing the legacy "untorn writes" xfs behavior and it will make it hard to maintain the existing buffered/dio coherency in xfs. So instead of going in the direction of "follow ext4 buffered io locking scheme", this discussion took the direction of "follow dio locking scheme", where folios can be used to provide "untorn writes" and another method is used to mutually exclude dio writes and buffered writes.

I hope that my summary is correct and helps to clarify where we stand.

Thanks,
Amir.
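A simplified illustration of the VFS-level protection Amir refers to; this is not the actual ext4 or filemap code, and the example_* functions are hypothetical:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Block-unmapping side (e.g. hole punch): take the invalidate lock
 * exclusively so no new page cache can be instantiated over the range
 * while its blocks are being removed.
 */
static void example_punch_hole(struct inode *inode, loff_t start, loff_t end)
{
	filemap_invalidate_lock(inode->i_mapping);
	truncate_pagecache_range(inode, start, end);
	/* ... unmap the blocks from the extent map ... */
	filemap_invalidate_unlock(inode->i_mapping);
}

/*
 * Buffered read side: no iolock at all.  Only the slow path that has
 * to instantiate a missing folio takes the lock, and only shared.
 */
static void example_read_miss(struct inode *inode)
{
	filemap_invalidate_lock_shared(inode->i_mapping);
	/* ... allocate the folio and issue the read ... */
	filemap_invalidate_unlock_shared(inode->i_mapping);
}

Hole-punch style operations take mapping->invalidate_lock exclusively, while the buffered read path only takes it shared when it has to instantiate a missing folio, which is what prevents reads of stale data even when no iolock is held.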
On Fri, Jan 24, 2025 at 03:57:43PM +0800, Chi Zhiling wrote: > On 2025/1/16 05:41, Dave Chinner wrote: > > IOWs, a two-state shared lock provides the mechanism to allow DIO > > to be done without holding the i_rwsem at all, as well as being able > > to elide two atomic operations per DIO to track in-flight DIOs. > > > > We'd get this whilst maintaining buffered/DIO coherency without > > adding any new overhead to the DIO path, and allow concurrent > > buffered reads and writes that have their atomicity defined by the > > batched folio locking strategy that Brian is working on... > > > > This only leaves DIO coherency issues with mmap() based IO as an > > issue, but that's a problem for a different day... > > When I try to implement those features, I think we could use exclusive > locks for RWF_APPEND writes, and shared locks for other writes. > > The reason is that concurrent operations are also possible for extended > writes if there is no overlap in the regions being written. > So there is no need to determine whether it is an extended write in > advance. > > As for why an exclusive lock is needed for append writes, it's because > we don't want the EOF to be modified during the append write. We don't care if the EOF moves during the append write at the filesystem level. We set kiocb->ki_pos = i_size_read() from generic_write_checks() under shared locking, and if we then race with another extending append write there are two cases: 1. the other task has already extended i_size; or 2. we have two IOs at the same offset (i.e. at i_size). In either case, we don't need exclusive locking for the IO because the worst thing that happens is that two IOs hit the same file offset. IOWs, it has always been left up to the application serialise RWF_APPEND writes on XFS, not the filesystem. There is good reason for not doing exclusive locking for extending writes. When extending the file naturally (think sequential writes), we need those IOs be able to run concurrently with other writes in progress. i.e. it is common for applications to submit multiple sequential extending writes in a batch, and as long as we submit them all in order, they all hit the (moving) EOF exactly and hence get run concurrently. This also works with batch-submitted RWF_APPEND AIO-DIO - they just turn into concurrent in-flight extending writes... IOWs, forcing exclusive locking for writes at exactly EOF is definitely not desirable, and existing behaviour is that they use shared locking. The only time we use exclusive locking for extending writes is when they -start- beyond EOF (i.e. leave a hole between EOF and kiocb->ki_pos) and so we have to guarantee that range does not have stale data exposed from it while the write is in progress. i.e. we have to zero the hole that moving EOF exposes if it contains written extents, and we cannot allow reads or writes to that hole before we are done. > The code is like that: > + if (iocb->ki_flags & IOCB_APPEND) > + iolock = XFS_IOLOCK_EXCL; > + else > + iolock = XFS_IOLOCK_SHARED; > > > If we use exclusive locks for all extended writes, we need to check if > it is an extended write before acquiring the lock, but this value could > become outdated if we do not hold the lock. > > So if we do an extended write, > we might need to follow this process: > > 1. Get read lock. > 2. Check if it is an extended write. > 3. Release read lock. > 4. Get write lock. > 5. Do the write operation. 
Please see xfs_file_write_checks() - it should already have all the shared/exclusive relocking logic we need for temporarily excluding buffered reads whilst submitting concurrent buffered writes (i.e. it is the same as what we already do for concurrent DIO writes). -Dave.
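A hedged sketch of the shared-to-exclusive relock pattern being referred to, modelled loosely on what xfs_file_write_checks() does today but simplified and with a hypothetical function name:

static int
example_write_lock_mode(
	struct kiocb		*iocb,
	struct xfs_inode	*ip,
	unsigned int		*iolock)
{
	int			ret;

relock:
	ret = xfs_ilock_iocb(iocb, *iolock);
	if (ret)
		return ret;

	/*
	 * Writes within EOF, and writes starting exactly at EOF, can run
	 * under the shared lock.  Only a write that starts beyond EOF
	 * needs exclusion, so the newly exposed range can be zeroed
	 * without racing against concurrent reads.
	 */
	if (*iolock == XFS_IOLOCK_SHARED &&
	    iocb->ki_pos > i_size_read(VFS_I(ip))) {
		xfs_iunlock(ip, *iolock);
		*iolock = XFS_IOLOCK_EXCL;
		goto relock;	/* i_size may have moved while unlocked */
	}

	return 0;
}

Re-checking i_size after retaking the lock exclusively is what resolves the "the value could become outdated" problem in the five-step sequence quoted above.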
On Tue, Jan 28, 2025 at 07:49:17AM +1100, Dave Chinner wrote: > > As for why an exclusive lock is needed for append writes, it's because > > we don't want the EOF to be modified during the append write. > > We don't care if the EOF moves during the append write at the > filesystem level. We set kiocb->ki_pos = i_size_read() from > generic_write_checks() under shared locking, and if we then race > with another extending append write there are two cases: > > 1. the other task has already extended i_size; or > 2. we have two IOs at the same offset (i.e. at i_size). > > In either case, we don't need exclusive locking for the IO because > the worst thing that happens is that two IOs hit the same file > offset. IOWs, it has always been left up to the application > serialise RWF_APPEND writes on XFS, not the filesystem. I disagree. O_APPEND (RWF_APPEND is just the Linux-specific per-I/O version of that) is extensively used for things like multi-thread loggers where you have multiple threads doing O_APPEND writes to a single log file, and they expect to not lose data that way. The fact that we currently don't do that for O_DIRECT is a bug, which is just papered over that barely anyone uses O_DIRECT | O_APPEND as that's not a very natural use case for most applications (in fact NFS got away with never allowing it at all). But extending racy O_APPEND to buffered writes would break a lot of applications.
On Mon, 27 Jan 2025 21:15:41 -0800 Christoph Hellwig <hch@infradead.org> wrote: > On Tue, Jan 28, 2025 at 07:49:17AM +1100, Dave Chinner wrote: > > > As for why an exclusive lock is needed for append writes, it's because > > > we don't want the EOF to be modified during the append write. > > > > We don't care if the EOF moves during the append write at the > > filesystem level. We set kiocb->ki_pos = i_size_read() from > > generic_write_checks() under shared locking, and if we then race > > with another extending append write there are two cases: > > > > 1. the other task has already extended i_size; or > > 2. we have two IOs at the same offset (i.e. at i_size). > > > > In either case, we don't need exclusive locking for the IO because > > the worst thing that happens is that two IOs hit the same file > > offset. IOWs, it has always been left up to the application > > serialise RWF_APPEND writes on XFS, not the filesystem. > > I disagree. O_APPEND (RWF_APPEND is just the Linux-specific > per-I/O version of that) is extensively used for things like > multi-thread loggers where you have multiple threads doing O_APPEND > writes to a single log file, and they expect to not lose data > that way. The fact that we currently don't do that for O_DIRECT > is a bug, which is just papered over that barely anyone uses > O_DIRECT | O_APPEND as that's not a very natural use case for > most applications (in fact NFS got away with never allowing it > at all). But extending racy O_APPEND to buffered writes would > break a lot of applications. It is broken in windows :-) You get two writes to the same file offset and then (IIRC) two advances of EOF (usually) giving a block of '\0' bytes in the file. You might get away with doing an atomic update of EOF and then writing into the gap. But you have to decide what to do if there is a seg fault on the user buffer It could be a multi-TB write from an mmap-ed file (maybe even over NFS) that hits a disk read error. Actually I suspect that if you let the two writes proceed in parallel you can't let later ones complete first. If the returns are sequenced a write can then be redone if an earlier write got shortened. David
On Mon, Jan 27, 2025 at 09:15:41PM -0800, Christoph Hellwig wrote: > On Tue, Jan 28, 2025 at 07:49:17AM +1100, Dave Chinner wrote: > > > As for why an exclusive lock is needed for append writes, it's because > > > we don't want the EOF to be modified during the append write. > > > > We don't care if the EOF moves during the append write at the > > filesystem level. We set kiocb->ki_pos = i_size_read() from > > generic_write_checks() under shared locking, and if we then race > > with another extending append write there are two cases: > > > > 1. the other task has already extended i_size; or > > 2. we have two IOs at the same offset (i.e. at i_size). > > > > In either case, we don't need exclusive locking for the IO because > > the worst thing that happens is that two IOs hit the same file > > offset. IOWs, it has always been left up to the application > > serialise RWF_APPEND writes on XFS, not the filesystem. > > I disagree. O_APPEND (RWF_APPEND is just the Linux-specific > per-I/O version of that) is extensively used for things like > multi-thread loggers where you have multiple threads doing O_APPEND > writes to a single log file, and they expect to not lose data > that way. Sure, but we don't think we need full file offset-scope IO exclusion to solve that problem. We can still safely do concurrent writes within EOF to occur whilst another buffered append write is doing file extension work. IOWs, where we really need to get to is a model that allows concurrent buffered IO at all times, except for the case where IO operations that change the inode size need to serialise against other similar operations (e.g. other EOF extending IOs, truncate, etc). Hence I think we can largely ignore O_APPEND for the purposes of prototyping shared buffered IO and getting rid of the IOLOCK from the XFS IO path. I may end up re-using the i_rwsem as a "EOF modification" serialisation mechanism for O_APPEND and extending writes in general, but I don't think we need a general write IO exclusion mechanism for this... -Dave.
On Wed, Jan 29, 2025 at 11:59:53AM +1100, Dave Chinner wrote: > Sure, but we don't think we need full file offset-scope IO exclusion > to solve that problem. We can still safely do concurrent writes > within EOF to occur whilst another buffered append write is doing > file extension work. Sure. The previous mail just sounded like you'd want to do away with exclusion for assigning the offset. > IOWs, where we really need to get to is a model that allows > concurrent buffered IO at all times, except for the case where IO > operations that change the inode size need to serialise against > other similar operations (e.g. other EOF extending IOs, truncate, > etc). > > Hence I think we can largely ignore O_APPEND for the > purposes of prototyping shared buffered IO and getting rid of the > IOLOCK from the XFS IO path. I may end up re-using the i_rwsem as > a "EOF modification" serialisation mechanism for O_APPEND and > extending writes in general, but I don't think we need a general > write IO exclusion mechanism for this... That might be a chance to also fix O_DIRECT while we're at it..
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9a435b1ff264..7d039cc3ae9e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -279,18 +279,9 @@ xfs_file_buffered_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
-	ssize_t			ret;
-
 	trace_xfs_file_buffered_read(iocb, to);
 
-	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
-	if (ret)
-		return ret;
-	ret = generic_file_read_iter(iocb, to);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
-	return ret;
+	return generic_file_read_iter(iocb, to);
 }
 
 STATIC ssize_t