| Message ID | 20190607131025.31996-18-naohiro.aota@wdc.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | btrfs zoned block device support |
On Fri, Jun 07, 2019 at 10:10:23PM +0900, Naohiro Aota wrote:

```
> In a write heavy workload, the following scenario can occur:
>
> 1. Mark page #0 to page #2 (and their corresponding extent region) as
>    dirty and as candidates for delayed allocation
>
>     pages    0 1 2 3 4
>     dirty    o o o - -
>     towrite  - - - - -
>     delayed  o o o - -
>     alloc
>
> 2. extent_write_cache_pages() marks the dirty pages as TOWRITE
>
>     pages    0 1 2 3 4
>     dirty    o o o - -
>     towrite  o o o - -
>     delayed  o o o - -
>     alloc
>
> 3. Meanwhile, another write dirties page #3 and page #4
>
>     pages    0 1 2 3 4
>     dirty    o o o o o
>     towrite  o o o - -
>     delayed  o o o o o
>     alloc
>
> 4. find_lock_delalloc_range() decides to allocate a region covering
>    page #0 to page #4
>
> 5. But extent_write_cache_pages() only initiates writeback for the
>    TOWRITE-tagged pages (#0 to #2)
>
> The above process leaves page #3 and page #4 behind. Usually, the
> periodic dirty flush kicks off write I/O for pages #3 and #4. However,
> if we try to mount a subvolume at this point, the mount process takes
> the s_umount write lock, which blocks the periodic flush from coming
> in.
>
> To deal with the problem, shrink the delayed allocation region so that
> it contains only the pages that are expected to be written.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c73c69e2bef4..ea582ff85c73 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
>  			delalloc_start = delalloc_end + 1;
>  			continue;
>  		}
> +
> +		if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
> +		    ((delalloc_start >> PAGE_SHIFT) <
> +		     (delalloc_end >> PAGE_SHIFT))) {
> +			unsigned long i;
> +			unsigned long end_index = delalloc_end >> PAGE_SHIFT;
> +
> +			for (i = delalloc_start >> PAGE_SHIFT;
> +			     i <= end_index; i++)
> +				if (!xa_get_mark(&inode->i_mapping->i_pages, i,
> +						 PAGECACHE_TAG_TOWRITE))
> +					break;
> +
> +			if (i <= end_index) {
> +				u64 unlock_start = (u64)i << PAGE_SHIFT;
> +
> +				if (i == delalloc_start >> PAGE_SHIFT)
> +					unlock_start += PAGE_SIZE;
> +
> +				unlock_extent(tree, unlock_start, delalloc_end);
> +				__unlock_for_delalloc(inode, page, unlock_start,
> +						      delalloc_end);
> +				delalloc_end = unlock_start - 1;
> +			}
> +		}
> +
```

Helper, please. Really, for all this hmzoned stuff I want it segregated as
much as possible, so that when I'm debugging or cleaning other stuff up I
can easily say "oh, this is for zoned devices, it doesn't matter."

Thanks,

Josef
```diff
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c73c69e2bef4..ea582ff85c73 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
 			delalloc_start = delalloc_end + 1;
 			continue;
 		}
+
+		if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
+		    ((delalloc_start >> PAGE_SHIFT) <
+		     (delalloc_end >> PAGE_SHIFT))) {
+			unsigned long i;
+			unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+
+			for (i = delalloc_start >> PAGE_SHIFT;
+			     i <= end_index; i++)
+				if (!xa_get_mark(&inode->i_mapping->i_pages, i,
+						 PAGECACHE_TAG_TOWRITE))
+					break;
+
+			if (i <= end_index) {
+				u64 unlock_start = (u64)i << PAGE_SHIFT;
+
+				if (i == delalloc_start >> PAGE_SHIFT)
+					unlock_start += PAGE_SIZE;
+
+				unlock_extent(tree, unlock_start, delalloc_end);
+				__unlock_for_delalloc(inode, page, unlock_start,
+						      delalloc_end);
+				delalloc_end = unlock_start - 1;
+			}
+		}
+
 		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
 				delalloc_end, &page_started, nr_written, wbc);
 		/* File system has been set read-only */
```
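Josef's request above is to pull this hunk out into its own function. A
minimal sketch of what that could look like, keeping the hunk's logic
unchanged; the helper name, its signature, and the choice of returning the
shrunk range end are assumptions for illustration, not part of the posted
patch:

```c
/*
 * Hypothetical helper (name and signature are assumptions): isolate the
 * HMZONED-only logic so writepage_delalloc() stays readable.  For tagged
 * writeback, trim the delalloc range at the first page that is not
 * tagged TOWRITE and unlock the trimmed tail.  Returns the (possibly
 * shrunk) end of the range.
 */
static u64 btrfs_hmzoned_shrink_delalloc(struct inode *inode,
					 struct page *page,
					 struct extent_io_tree *tree,
					 struct writeback_control *wbc,
					 u64 delalloc_start, u64 delalloc_end)
{
	unsigned long start_index = delalloc_start >> PAGE_SHIFT;
	unsigned long end_index = delalloc_end >> PAGE_SHIFT;
	unsigned long i;
	u64 unlock_start;

	if (!btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
		return delalloc_end;
	if (wbc->sync_mode != WB_SYNC_ALL && !wbc->tagged_writepages)
		return delalloc_end;
	if (start_index >= end_index)
		return delalloc_end;

	/* Find the first page in the range that is not tagged TOWRITE. */
	for (i = start_index; i <= end_index; i++)
		if (!xa_get_mark(&inode->i_mapping->i_pages, i,
				 PAGECACHE_TAG_TOWRITE))
			break;
	if (i > end_index)
		return delalloc_end;

	unlock_start = (u64)i << PAGE_SHIFT;
	/* Never trim the range down to nothing: keep the first page. */
	if (i == start_index)
		unlock_start += PAGE_SIZE;

	unlock_extent(tree, unlock_start, delalloc_end);
	__unlock_for_delalloc(inode, page, unlock_start, delalloc_end);
	return unlock_start - 1;
}
```

The call site in writepage_delalloc() would then shrink the range in one
line, e.g. `delalloc_end = btrfs_hmzoned_shrink_delalloc(inode, page, tree,
wbc, delalloc_start, delalloc_end);`, so all the HMZONED checks live behind
the helper.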
In a write heavy workload, the following scenario can occur:

1. Mark page #0 to page #2 (and their corresponding extent region) as
   dirty and as candidates for delayed allocation

    pages    0 1 2 3 4
    dirty    o o o - -
    towrite  - - - - -
    delayed  o o o - -
    alloc

2. extent_write_cache_pages() marks the dirty pages as TOWRITE

    pages    0 1 2 3 4
    dirty    o o o - -
    towrite  o o o - -
    delayed  o o o - -
    alloc

3. Meanwhile, another write dirties page #3 and page #4

    pages    0 1 2 3 4
    dirty    o o o o o
    towrite  o o o - -
    delayed  o o o o o
    alloc

4. find_lock_delalloc_range() decides to allocate a region covering
   page #0 to page #4

5. But extent_write_cache_pages() only initiates writeback for the
   TOWRITE-tagged pages (#0 to #2)

The above process leaves page #3 and page #4 behind. Usually, the periodic
dirty flush kicks off write I/O for pages #3 and #4. However, if we try to
mount a subvolume at this point, the mount process takes the s_umount
write lock, which blocks the periodic flush from coming in.

To deal with the problem, shrink the delayed allocation region so that it
contains only the pages that are expected to be written.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
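For context on steps 2 and 5 of the scenario: extent_write_cache_pages()
follows the generic tagged-writepages pattern from the kernel's writeback
code. A simplified sketch of that pattern (paraphrased for illustration,
not taken from this patch):

```c
/*
 * Simplified sketch of the tagged-writepages pattern, paraphrased from
 * the generic writeback code.  Dirty pages are tagged TOWRITE once, up
 * front, and the writeback loop then only looks up pages carrying that
 * tag.  Pages dirtied after the tagging pass (pages #3 and #4 in the
 * scenario above) only carry the DIRTY tag, so this writeback cycle
 * skips them even though find_lock_delalloc_range() included them in
 * the delalloc region.
 */
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) {
	tag_pages_for_writeback(mapping, index, end);	/* DIRTY -> TOWRITE */
	tag = PAGECACHE_TAG_TOWRITE;
} else {
	tag = PAGECACHE_TAG_DIRTY;
}
/* ... the writeback loop then looks up only pages carrying 'tag' ... */
```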