Message ID | 9b564309ec83dc89ffd90676e593f9d7ce24ae77.1728880585.git.wqu@suse.com (mailing list archive)
State      | New, archived
Series     | btrfs: use FGP_STABLE to wait for folio writeback
LGTM
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Thx
On Mon, Oct 14, 2024 at 03:06:31PM +1030, Qu Wenruo wrote:
> __filemap_get_folio() provides the flag FGP_STABLE to wait for
> writeback.
>
> There are two call sites doing __filemap_get_folio() then
> folio_wait_writeback():
>
> - btrfs_truncate_block()
> - defrag_prepare_one_folio()
>
> We can directly utilize that flag instead of manually calling
> folio_wait_writeback().

We can do that, but I'm missing a justification. The explicit writeback
waits are done at different points than where FGP_STABLE waits, so
what's the difference?
On 2024/10/15 00:46, David Sterba wrote:
> We can do that, but I'm missing a justification. The explicit writeback
> waits are done at different points than where FGP_STABLE waits, so
> what's the difference?

TL;DR: in theory it is not safe to read a folio before waiting for its
writeback.

There is a possible race, mostly related to my previous attempt at
subpage partial-uptodate support:

  Thread A                        | Thread B
  --------------------------------+---------------------------------
  extent_writepage_io()           |
  |- submit_one_sector()          |
     |- folio_set_writeback()     |
        The folio is partially    |
        dirty, and the uninvolved |
        sectors are not uptodate  |
                                  | btrfs_truncate_block()
                                  | |- btrfs_do_readpage()
                                  |    |- submit_one_folio()
                                  |       This reads all sectors from
                                  |       disk, but the sector under
                                  |       writeback is not finished yet

In this case we can read garbage from disk, since the write is not yet
finished.

This is not yet possible, because we always read out the whole page, so
in that case thread B won't trigger a read. But it already shows that
the way we wait for writeback is not safe, and that's why no other
filesystem manually waits for writeback after reading the folio.

Thus I think FGP_STABLE is much safer, and that's why all the other
filesystems do it this way instead.

Thanks,
Qu
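For reference, the two patterns under discussion reduce to the condensed
sketch below, extracted from the patch at the end of this thread (error
handling and the code in between are omitted):

	/* Before: the lookup returns a locked folio that may still be
	 * under writeback; the explicit wait happens later, after the
	 * folio may already have been read. */
	folio = __filemap_get_folio(mapping, index,
				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
	/* ... read/prepare the folio ... */
	folio_wait_writeback(folio);

	/* After: FGP_STABLE makes the lookup itself do the wait, so the
	 * folio is already stable when it is returned. */
	folio = __filemap_get_folio(mapping, index,
				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT |
				    FGP_STABLE, mask);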
On 2024/10/15 07:25, Qu Wenruo wrote:
[...]
> This is not yet possible, because we always read out the whole page, so
> in that case thread B won't trigger a read.

Here I mean that at page-dirtying time we always read out the whole page
if the write range is not page aligned, so we won't have a partially
uptodate page yet. But the race is still there.

[...]
On Tue, Oct 15, 2024 at 07:25:20AM +1030, Qu Wenruo wrote:
[...]
> But it already shows that the way we wait for writeback is not safe,
> and that's why no other filesystem manually waits for writeback after
> reading the folio.
>
> Thus I think FGP_STABLE is much safer, and that's why all the other
> filesystems do it this way instead.

I'm not disputing that we need it, and I may be missing something, but
what I noticed in the patch is perhaps a generic pattern: a structure is
read at some point and later synced/written, but there could be some
change in between. One example is the one you show (theoretical or not).

The writeback is a kind of synchronization point, but it also runs in
parallel with the data/metadata in the page cache. If the state
regarding writeback is not safe, and we can get either old data or
partially synced data (i.e. OK in the page cache but not regarding
writeback), it is a problematic pattern.

You found two cases, truncate and defrag. They are different, I think:
truncate comes from normal MM operations, while defrag is triggered by
an ioctl (though it still tries to stay in sync with MM).

I'm not sure we can copy what other filesystems do, even if it's just on
the basic principle of COW vs. update-in-place + journaling. We copy,
then do the next update, and don't have to care about the previous
state, so even a split between the read and the writeback does no harm.
On 2024/10/15 11:01, David Sterba wrote:
[...]
> The writeback is a kind of synchronization point, but it also runs in
> parallel with the data/metadata in the page cache. If the state
> regarding writeback is not safe, and we can get either old data or
> partially synced data (i.e. OK in the page cache but not regarding
> writeback), it is a problematic pattern.

The writeback is a sync point, but it's more of an optimization to
reduce page-lock contention. E.g. when a page contains several blocks,
some of them may be dirty and being written back while some other
sectors still need to be read. If implemented properly, the not-uptodate
blocks can be read in the meantime, without waiting for the writeback to
finish.

But from the safety point of view, I strongly prefer to wait for the
folio writeback in such a case. Especially considering that all the
existing get folio + read + wait-for-writeback sequences date from the
time when we only considered sectorsize == page size. We have supported
sectorsize < page size since 5.15, and we should have dropped that wrong
assumption years ago.

> You found two cases, truncate and defrag. They are different, I think:
> truncate comes from normal MM operations, while defrag is triggered by
> an ioctl (though it still tries to stay in sync with MM).
>
> I'm not sure we can copy what other filesystems do, even if it's just on
> the basic principle of COW vs. update-in-place + journaling.

I do not think COW makes any difference. All the COW handling happens at
delalloc time, while the folio get/lock/read/wait sequence is common to
page dirtying in general (regular buffered write, defrag and truncate
are all in a similar situation, as sketched below).

The page cache is only there to cache file content; it doesn't care
whether we COW or not.

Furthermore, COW is no longer our exclusive feature: XFS has supported
it for quite some time, and it has no special handling just for the page
cache. (Meanwhile XFS and ext4 have had much better blocksize < page
size handling than us for years.)

And I have already explained that waiting for the writeback at folio-get
time is much safer in that case (although it reduces concurrency). Just
for data safety, I believe you need to provide a much stronger argument
than COW vs. overwrite (which is completely unrelated).

> We copy, then do the next update, and don't have to care about the
> previous state, so even a split between the read and the writeback does
> no harm.

Although I love csum/datacow, I do not see any strong reason not to
follow the more common (and IMHO safer) way of waiting for the
writeback.

I have explained the possible race, and I do not think I need to repeat
that example. If you didn't fully understand my example, please let me
know where it's unclear, so that I can explain it better.

Thanks,
Qu
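To make the shared sequence concrete: the page-dirtying paths named
above (buffered write, truncate block, defrag) all follow roughly the
shape below. This is a conceptual sketch, not actual btrfs code;
prepare_block_content() and modify_and_dirty() are placeholder names
standing in for the per-caller logic:

	folio = __filemap_get_folio(mapping, index, fgp_flags, mask);
	if (IS_ERR(folio))
		return PTR_ERR(folio);
	if (!folio_test_uptodate(folio))
		prepare_block_content(folio);	/* read missing content from disk */
	folio_wait_writeback(folio);		/* the wait this patch moves into
						 * the lookup via FGP_STABLE */
	modify_and_dirty(folio);		/* COW decisions happen later, at
						 * delalloc/writeback time */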
On 15/10/24 09:26, Qu Wenruo wrote:
[...]
> The writeback is a sync point, but it's more of an optimization to
> reduce page-lock contention.

The commit/ref below avoids using %FGP_STABLE (and possibly
%FGP_WRITEBEGIN) in f2fs due to a potential deadlock in the writeback
code, but I'm unsure how; the reasoning isn't clear. The two changes in
our case aren't in the writeback path, though. On second thought, any
idea whether a similar deadlock could apply to our case, in your view?

Ref:
----
commit dfd2e81d37e1 ("f2fs: Convert f2fs_write_begin() to use a folio")

%FGP_STABLE - Wait for the folio to be stable (finished writeback)

#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
----

Thanks, Anand
On 2024/10/15 14:34, Anand Jain wrote:
[...]
> The commit/ref below avoids using %FGP_STABLE (and possibly
> %FGP_WRITEBEGIN) in f2fs due to a potential deadlock in the writeback
> code, but I'm unsure how; the reasoning isn't clear. The two changes in
> our case aren't in the writeback path, though. On second thought, any
> idea whether a similar deadlock could apply to our case, in your view?

I'm not an expert on f2fs, but judging from
f2fs_wait_one_page_writeback(), it does extra bio submission and
merging. So my current guess is that, due to f2fs-specific behavior,
there may be pages that are marked writeback but not yet submitted. If
that's the case, f2fs would deadlock by waiting on writeback that it has
not yet submitted.

Thankfully btrfs has no such case, so it doesn't apply to us.

Thanks,
Qu
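For context on where FGP_STABLE actually waits: in mainline kernels of
this period, __filemap_get_folio() handles the flag by calling
folio_wait_stable(). The sketch below is simplified from mm/filemap.c
and mm/page-writeback.c and may differ in detail between kernel
versions:

	/* mm/filemap.c, inside __filemap_get_folio(), simplified: */
	if (fgp_flags & FGP_STABLE)
		folio_wait_stable(folio);

	/* mm/page-writeback.c: wait only if the backing device requires
	 * folio contents to stay stable during writeback. */
	void folio_wait_stable(struct folio *folio)
	{
		if (mapping_stable_writes(folio_mapping(folio)))
			folio_wait_writeback(folio);
	}

A filesystem that marks folios writeback before actually submitting the
I/O could therefore block here on I/O it has not issued yet, which
matches the deadlock concern discussed above.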
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index b95ef44c326b..1644470b9df7 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -865,7 +865,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 again:
 	folio = __filemap_get_folio(mapping, index,
-			FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
+			FGP_LOCK | FGP_ACCESSED | FGP_CREAT | FGP_STABLE, mask);
 	if (IS_ERR(folio))
 		return folio;
 
@@ -1222,8 +1222,6 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 			goto free_folios;
 		}
 	}
-	for (i = 0; i < nr_pages; i++)
-		folio_wait_writeback(folios[i]);
 
 	/* Lock the pages range */
 	lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9ca74e5e7aa6..a21701571cbb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4743,7 +4743,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 	}
 again:
 	folio = __filemap_get_folio(mapping, index,
-			FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
+			FGP_LOCK | FGP_ACCESSED | FGP_CREAT | FGP_STABLE, mask);
 	if (IS_ERR(folio)) {
 		btrfs_delalloc_release_space(inode, data_reserved,
 					     block_start, blocksize, true);
@@ -4776,8 +4776,6 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 	if (ret < 0)
 		goto out_unlock;
 
-	folio_wait_writeback(folio);
-
 	lock_extent(io_tree, block_start, block_end, &cached_state);
 
 	ordered = btrfs_lookup_ordered_extent(inode, block_start);
__filemap_get_folio() provides the flag FGP_STABLE to wait for
writeback.

There are two call sites doing __filemap_get_folio() then
folio_wait_writeback():

- btrfs_truncate_block()
- defrag_prepare_one_folio()

We can directly utilize that flag instead of manually calling
folio_wait_writeback().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c | 4 +---
 fs/btrfs/inode.c  | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)