
[v5,3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()

Message ID b067a8a2c97f58f569991987ad8c3e9a2018cf06.1716008374.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: subpage + zoned fixes

Commit Message

Qu Wenruo May 18, 2024, 5:07 a.m. UTC
If we have a subpage range like this for a 16K page with 4K sectorsize:

    0     4K     8K     12K     16K
    |/////|      |//////|       |

    |/////| = dirty range

Currently writepage_delalloc() would go through the following steps:

- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K, 12K)

So far this is fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() or run_delalloc_compressed().

But there is a special pitfall for zoned subpage, where we go through
run_delalloc_cow(), which creates the ordered extent for the range and
immediately submits it.
This unlocks the whole page range, triggering all kinds of ASSERT()s
related to the locked page.

This patch addresses the page unlocking problem of run_delalloc_cow()
by changing the workflow to the following one:

- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)

So that run_delalloc_cow() can only unlock the full page once the last
lock user has released its range.
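
Condensed, the first (locking) pass looks like this — a sketch
simplified from the diff below; the find_lock_delalloc_range()
arguments after @page are cut off by a hunk header in the posted diff,
so they are assumed here:

    /* Pass 1: find and lock every delalloc range inside the page. */
    while (delalloc_start < page_end) {
            delalloc_end = page_end;
            if (!find_lock_delalloc_range(&inode->vfs_inode, page,
                                          &delalloc_start, &delalloc_end)) {
                    delalloc_start = delalloc_end + 1;
                    continue;
            }
            /* Record only the in-page part in the subpage locked bitmap. */
            btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
                                        min(delalloc_end, page_end) + 1 -
                                        delalloc_start);
            last_delalloc_end = delalloc_end;
            delalloc_start = delalloc_end + 1;
    }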

To do that, this patch:

- Utilizes the subpage locked bitmap
  For every delalloc range we find, call btrfs_folio_set_writer_lock()
  to populate the subpage locked bitmap, and later call
  btrfs_folio_end_all_writers() to unlock the folio once all its locked
  ranges are released.

  This way we know there is a delalloc range that needs to be run later.

- Saves the @delalloc_end as @last_delalloc_end inside
  writepage_delalloc()
  Since the subpage locked bitmap only covers ranges inside the page,
  while a delalloc range can end beyond the page boundary, we have to
  save @last_delalloc_end in case it goes past our page boundary.

Although there is one extra point to notice:

- We need to handle errors from a previous iteration
  Since we can have multiple locked delalloc ranges, we have to call
  btrfs_run_delalloc_range() multiple times.
  If we hit an error halfway through, we still need to unlock the
  remaining ranges (see the excerpt below).
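
In the second loop, the error handling then works as below (excerpted
from the diff further down): once @ret has gone negative, the remaining
iterations only release the locks that the first pass took, instead of
running the ranges:

    if (ret < 0) {
            /* A previous iteration failed, clean up the locked ranges. */
            unlock_extent(&inode->io_tree, found_start,
                          found_start + found_len - 1, NULL);
            __unlock_for_delalloc(&inode->vfs_inode, page, found_start,
                                  found_start + found_len - 1);
    } else {
            ret = btrfs_run_delalloc_range(inode, page, found_start,
                                           found_start + found_len - 1, wbc);
    }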

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 77 ++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/subpage.c   |  6 ++++
 2 files changed, 76 insertions(+), 7 deletions(-)

Comments

Naohiro Aota May 21, 2024, 8:11 a.m. UTC | #1
On Sat, May 18, 2024 at 02:37:41PM GMT, Qu Wenruo wrote:
> If we have a subpage range like this for a 16K page with 4K sectorsize:
> 
>     0     4K     8K     12K     16K
>     |/////|      |//////|       |
> 
>     |/////| = dirty range
> 
> Currently writepage_delalloc() would go through the following steps:
> 
> - lock range [0, 4K)
> - run delalloc range for [0, 4K)
> - lock range [8K, 12K)
> - run delalloc range for [8K 12K)
> 
> So far it's fine for regular subpage writeback, as
> btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
> cow_file_range() and run_delalloc_compressed().
> 
> But there is a special pitfall for zoned subpage, where we will go
> through run_delalloc_cow(), which would create the ordered extent for the
> range and immediately submit the range.
> This would unlock the whole page range, causing all kinds of different
> ASSERT()s related to locked page.
> 
> This patch would address the page unlocking problem of run_delalloc_cow(),
> by changing the workflow to the following one:
> 
> - lock range [0, 4K)
> - lock range [8K, 12K)
> - run delalloc range for [0, 4K)
> - run delalloc range for [8K, 12K)
> 
> So that run_delalloc_cow() can only unlock the full page until the
> last lock user released.
> 
> To do that, this patch would:
> 
> - Utilizing subpage locked bitmap
>   So for every delalloc range we found, call
>   btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
>   and later btrfs_folio_end_all_writers() if the page is fully unlocked.
> 
>   So we know there is a delalloc range that needs to be run later.
> 
> - Save the @delalloc_end as @last_delalloc_end inside
>   writepage_delalloc()
>   Since subpage locked bitmap is only for ranges inside the page,
>   meanwhile we can have delalloc range ends beyond our page boundary,
>   we have to save the @last_delalloc_end just in case it's beyond our
>   page boundary.
> 
> Although there is one extra point to notice:
> 
> - We need to handle errors in previous iteration
>   Since we can have multiple locked delalloc ranges thus we have to call
>   run_delalloc_ranges() multiple times.
>   If we hit an error half way, we still need to unlock the remaining
>   ranges.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 77 ++++++++++++++++++++++++++++++++++++++++----
>  fs/btrfs/subpage.c   |  6 ++++
>  2 files changed, 76 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 8a4a7d00795f..b6dc9308105d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1226,13 +1226,22 @@ static inline void contiguous_readpages(struct page *pages[], int nr_pages,
>  static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  		struct page *page, struct writeback_control *wbc)
>  {
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(&inode->vfs_inode);
> +	struct folio *folio = page_folio(page);
>  	const u64 page_start = page_offset(page);
>  	const u64 page_end = page_start + PAGE_SIZE - 1;
> +	/*
> +	 * Saves the last found delalloc end. As the delalloc end can go beyond
> +	 * page boundary, thus we can not rely on subpage bitmap to locate
> +	 * the last dealloc end.

typo: dealloc -> delalloc

> +	 */
> +	u64 last_delalloc_end = 0;
>  	u64 delalloc_start = page_start;
>  	u64 delalloc_end = page_end;
>  	u64 delalloc_to_write = 0;
>  	int ret = 0;
>  
> +	/* Lock all (subpage) dealloc ranges inside the page first. */

Same here.

>  	while (delalloc_start < page_end) {
>  		delalloc_end = page_end;
>  		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  			delalloc_start = delalloc_end + 1;
>  			continue;
>  		}
> -
> -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> -					       delalloc_end, wbc);
> -		if (ret < 0)
> -			return ret;
> -
> +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> +					    min(delalloc_end, page_end) + 1 -
> +					    delalloc_start);
> +		last_delalloc_end = delalloc_end;
>  		delalloc_start = delalloc_end + 1;
>  	}

Can we bail out on the "if (!last_delalloc_end)" case? It would make the
following code simpler.

> +	delalloc_start = page_start;
> +	/* Run the delalloc ranges for above locked ranges. */
> +	while (last_delalloc_end && delalloc_start < page_end) {
> +		u64 found_start;
> +		u32 found_len;
> +		bool found;
>  
> +		if (!btrfs_is_subpage(fs_info, page->mapping)) {
> +			/*
> +			 * For non-subpage case, the found delalloc range must
> +			 * cover this page and there must be only one locked
> +			 * delalloc range.
> +			 */
> +			found_start = page_start;
> +			found_len = last_delalloc_end + 1 - found_start;
> +			found = true;
> +		} else {
> +			found = btrfs_subpage_find_writer_locked(fs_info, folio,
> +					delalloc_start, &found_start, &found_len);
> +		}
> +		if (!found)
> +			break;
> +		/*
> +		 * The subpage range covers the last sector, the delalloc range may
> +		 * end beyonds the page boundary, use the saved delalloc_end
> +		 * instead.
> +		 */
> +		if (found_start + found_len >= page_end)
> +			found_len = last_delalloc_end + 1 - found_start;
> +
> +		if (ret < 0) {

At first glance, it is strange because "ret" is not set above. But it is
executed for the remaining iterations once btrfs_run_delalloc_range()
returns an error in an earlier one...

I'd like to have a dedicated clean-up path ... but I agree it is difficult
to make such a cleanup loop clean.

Would flipping the if-conditions look better? Or, adding more comments would be nice.

> +			/* Cleanup the remaining locked ranges. */
> +			unlock_extent(&inode->io_tree, found_start,
> +				      found_start + found_len - 1, NULL);
> +			__unlock_for_delalloc(&inode->vfs_inode, page, found_start,
> +					      found_start + found_len - 1);
> +		} else {
> +			ret = btrfs_run_delalloc_range(inode, page, found_start,
> +						       found_start + found_len - 1, wbc);

Also, what happens if the first range returns "1" and a later one returns
"0"? Is it OK to override "ret" in that case? Actually, I guess it
won't happen now because (as said in patch 5) subpage disables inline
extents, but having an ASSERT() would be good to catch a future mistake.

> +		}
> +		/*
> +		 * Above btrfs_run_delalloc_range() may have unlocked the page,
> +		 * Thus for the last range, we can not touch the page anymore.
> +		 */
> +		if (found_start + found_len >= last_delalloc_end + 1)
> +			break;
> +
> +		delalloc_start = found_start + found_len;
> +	}
> +	if (ret < 0)
> +		return ret;
> +
> +	if (last_delalloc_end)
> +		delalloc_end = last_delalloc_end;
> +	else
> +		delalloc_end = page_end;
>  	/*
>  	 * delalloc_end is already one less than the total length, so
>  	 * we don't subtract one from PAGE_SIZE
> @@ -1520,7 +1582,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
>  					       PAGE_SIZE, !ret);
>  		mapping_set_error(page->mapping, ret);
>  	}
> -	unlock_page(page);
> +
> +	btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
>  	ASSERT(ret <= 0);
>  	return ret;
>  }
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 3c957d03324e..81b862d7ab53 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -862,6 +862,7 @@ bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
>  void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>  				 struct folio *folio)
>  {
> +	struct btrfs_subpage *subpage = folio_get_private(folio);
>  	u64 folio_start = folio_pos(folio);
>  	u64 cur = folio_start;
>  
> @@ -871,6 +872,11 @@ void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>  		return;
>  	}
>  
> +	/* The page has no new delalloc range locked on it. Just plain unlock. */
> +	if (atomic_read(&subpage->writers) == 0) {
> +		folio_unlock(folio);
> +		return;
> +	}
>  	while (cur < folio_start + PAGE_SIZE) {
>  		u64 found_start;
>  		u32 found_len;
> -- 
> 2.45.0
>
Qu Wenruo May 21, 2024, 8:45 a.m. UTC | #2
On 2024/5/21 17:41, Naohiro Aota wrote:
[...]
> Same here.
> 
>>   	while (delalloc_start < page_end) {
>>   		delalloc_end = page_end;
>>   		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
>> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>>   			delalloc_start = delalloc_end + 1;
>>   			continue;
>>   		}
>> -
>> -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
>> -					       delalloc_end, wbc);
>> -		if (ret < 0)
>> -			return ret;
>> -
>> +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
>> +					    min(delalloc_end, page_end) + 1 -
>> +					    delalloc_start);
>> +		last_delalloc_end = delalloc_end;
>>   		delalloc_start = delalloc_end + 1;
>>   	}
> 
> Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> following code simpler.

Mind explaining it a little more?

Did you mean something like this:

	while (delalloc_start < page_end) {
		/* lock all subpage delalloc range code */
	}
	if (!last_delalloc_end)
		goto finish;
	while (delalloc_start < page_end) {
		/* run the delalloc ranges code */
	}

If so, I can definitely go that way.

> 
>> +	delalloc_start = page_start;
>> +	/* Run the delalloc ranges for above locked ranges. */
>> +	while (last_delalloc_end && delalloc_start < page_end) {
>> +		u64 found_start;
>> +		u32 found_len;
>> +		bool found;
>>   
>> +		if (!btrfs_is_subpage(fs_info, page->mapping)) {
>> +			/*
>> +			 * For non-subpage case, the found delalloc range must
>> +			 * cover this page and there must be only one locked
>> +			 * delalloc range.
>> +			 */
>> +			found_start = page_start;
>> +			found_len = last_delalloc_end + 1 - found_start;
>> +			found = true;
>> +		} else {
>> +			found = btrfs_subpage_find_writer_locked(fs_info, folio,
>> +					delalloc_start, &found_start, &found_len);
>> +		}
>> +		if (!found)
>> +			break;
>> +		/*
>> +		 * The subpage range covers the last sector, the delalloc range may
>> +		 * end beyonds the page boundary, use the saved delalloc_end
>> +		 * instead.
>> +		 */
>> +		if (found_start + found_len >= page_end)
>> +			found_len = last_delalloc_end + 1 - found_start;
>> +
>> +		if (ret < 0) {
> 
> At first glance, it is strange because "ret" is not set above. But, it is
> executed when btrfs_run_delalloc_range() returns an error in an iteration,
> for the remaining iterations...
> 
> I'd like to have a dedicated clean-up path ... but I agree it is difficult
> to make such cleanup loop clean.

I can add an extra bool to indicate if we have any error, but overall 
it's not much different.

> 
> Flipping the if-conditions looks better? Or, adding more comments would be nice.

I guess I would go that path: flipping the if-conditions and adding
extra comments.

> 
>> +			/* Cleanup the remaining locked ranges. */
>> +			unlock_extent(&inode->io_tree, found_start,
>> +				      found_start + found_len - 1, NULL);
>> +			__unlock_for_delalloc(&inode->vfs_inode, page, found_start,
>> +					      found_start + found_len - 1);
>> +		} else {
>> +			ret = btrfs_run_delalloc_range(inode, page, found_start,
>> +						       found_start + found_len - 1, wbc);
> 
> Also, what happens if the first range returns "1" and a later one returns
> "0"? Is it OK to override the "ret" for the case? Actually, I guess it
> won't happen now because (as said in patch 5) subpage disables an inline
> extent, but having an ASSERT() would be good to catch a future mistake.

It's not only inline; compression can also return 1.

Thankfully for subpage, inline is disabled, while compression can only
be done for a fully page-aligned range (start and end are both page
aligned).

Considering you're mentioning this, I would definitely add an ASSERT() 
with comments explaining this.
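
Something like the following sketch (not in the posted v5; the exact
condition and placement are still an assumption):

    ret = btrfs_run_delalloc_range(inode, page, found_start,
                                   found_start + found_len - 1, wbc);
    /*
     * btrfs_run_delalloc_range() can return 1 for inline extents or
     * fully compressed ranges.  For subpage, inline is disabled and
     * compression only acts on ranges with page-aligned start and end,
     * so an individual subpage delalloc range must not return 1.
     */
    if (btrfs_is_subpage(fs_info, page->mapping))
            ASSERT(ret != 1);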

Thanks for the feedback!
Qu

> 
>> +		}
>> +		/*
>> +		 * Above btrfs_run_delalloc_range() may have unlocked the page,
>> +		 * Thus for the last range, we can not touch the page anymore.
>> +		 */
>> +		if (found_start + found_len >= last_delalloc_end + 1)
>> +			break;
>> +
>> +		delalloc_start = found_start + found_len;
>> +	}
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (last_delalloc_end)
>> +		delalloc_end = last_delalloc_end;
>> +	else
>> +		delalloc_end = page_end;
>>   	/*
>>   	 * delalloc_end is already one less than the total length, so
>>   	 * we don't subtract one from PAGE_SIZE
>> @@ -1520,7 +1582,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
>>   					       PAGE_SIZE, !ret);
>>   		mapping_set_error(page->mapping, ret);
>>   	}
>> -	unlock_page(page);
>> +
>> +	btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
>>   	ASSERT(ret <= 0);
>>   	return ret;
>>   }
>> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
>> index 3c957d03324e..81b862d7ab53 100644
>> --- a/fs/btrfs/subpage.c
>> +++ b/fs/btrfs/subpage.c
>> @@ -862,6 +862,7 @@ bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
>>   void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>>   				 struct folio *folio)
>>   {
>> +	struct btrfs_subpage *subpage = folio_get_private(folio);
>>   	u64 folio_start = folio_pos(folio);
>>   	u64 cur = folio_start;
>>   
>> @@ -871,6 +872,11 @@ void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>>   		return;
>>   	}
>>   
>> +	/* The page has no new delalloc range locked on it. Just plain unlock. */
>> +	if (atomic_read(&subpage->writers) == 0) {
>> +		folio_unlock(folio);
>> +		return;
>> +	}
>>   	while (cur < folio_start + PAGE_SIZE) {
>>   		u64 found_start;
>>   		u32 found_len;
>> -- 
>> 2.45.0
Naohiro Aota May 21, 2024, 11:54 a.m. UTC | #3
On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
> 
> 
> On 2024/5/21 17:41, Naohiro Aota wrote:
> [...]
> > Same here.
> > 
> > >   	while (delalloc_start < page_end) {
> > >   		delalloc_end = page_end;
> > >   		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> > > @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> > >   			delalloc_start = delalloc_end + 1;
> > >   			continue;
> > >   		}
> > > -
> > > -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> > > -					       delalloc_end, wbc);
> > > -		if (ret < 0)
> > > -			return ret;
> > > -
> > > +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> > > +					    min(delalloc_end, page_end) + 1 -
> > > +					    delalloc_start);
> > > +		last_delalloc_end = delalloc_end;
> > >   		delalloc_start = delalloc_end + 1;
> > >   	}
> > 
> > Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> > following code simpler.
> 
> Mind to explain it a little more?
> 
> Did you mean something like this:
> 
> 	while (delalloc_start < page_end) {
> 		/* lock all subpage delalloc range code */
> 	}
> 	if (!last_delalloc_end)
> 		goto finish;
> 	while (delalloc_start < page_end) {
> 		/* run the delalloc ranges code* /
> 	}
> 
> If so, I can definitely go that way.

Yes, I meant that way. Apparently, "!last_delalloc_end" means it got no
delalloc region. So, we can just return 0 in that case without touching
"wbc->nr_to_write" as the current code does.

BTW, is this actually an overlooked error case? Is it OK to progress in
__extent_writepage() even if we don't run btrfs_run_delalloc_range()?
Qu Wenruo May 21, 2024, 10:16 p.m. UTC | #4
On 2024/5/21 21:24, Naohiro Aota wrote:
> On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
>>
>>
>> On 2024/5/21 17:41, Naohiro Aota wrote:
>> [...]
>>> Same here.
>>>
>>>>    	while (delalloc_start < page_end) {
>>>>    		delalloc_end = page_end;
>>>>    		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
>>>> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>>>>    			delalloc_start = delalloc_end + 1;
>>>>    			continue;
>>>>    		}
>>>> -
>>>> -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
>>>> -					       delalloc_end, wbc);
>>>> -		if (ret < 0)
>>>> -			return ret;
>>>> -
>>>> +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
>>>> +					    min(delalloc_end, page_end) + 1 -
>>>> +					    delalloc_start);
>>>> +		last_delalloc_end = delalloc_end;
>>>>    		delalloc_start = delalloc_end + 1;
>>>>    	}
>>>
>>> Can we bail out on the "if (!last_delalloc_end)" case? It would make the
>>> following code simpler.
>>
>> Mind to explain it a little more?
>>
>> Did you mean something like this:
>>
>> 	while (delalloc_start < page_end) {
>> 		/* lock all subpage delalloc range code */
>> 	}
>> 	if (!last_delalloc_end)
>> 		goto finish;
>> 	while (delalloc_start < page_end) {
>> 		/* run the delalloc ranges code* /
>> 	}
>>
>> If so, I can definitely go that way.
>
> Yes, I meant that way. Apparently, "!last_delalloc_end" means it get no
> delalloc region. So, we can just return 0 in that case without touching
> "wbc->nr_to_write" as the current code does.
>
> BTW, is this actually an overlooked error case? Is it OK to progress in
> __extent_writepage() even if we don't run run_delalloc_range() ?

That's totally expected; in fact, it would even be the more common case.

Consider a very ordinary case like this:

    0             4K              8K            12K
    |/////////////|///////////////|/////////////|

When running extent_writepage() for page 0, we run the delalloc range for
the whole [0, 12K) range, and create an OE (ordered extent) for it.
Then __extent_writepage_io() adds page range [0, 4K) to the bio.

Then for extent_writepage() on page 4K, find_lock_delalloc_range() would
not find any range, as the previous iteration at page 0 has already
created the OE for the whole [0, 12K) range.

Although we would still run __extent_writepage_io() to add page range
[4K, 8K) to the bio.

The same goes for page 8K.

Thanks,
Qu
Naohiro Aota May 22, 2024, 1:10 a.m. UTC | #5
On Wed, May 22, 2024 at 07:46:46AM GMT, Qu Wenruo wrote:
> 
> 
> On 2024/5/21 21:24, Naohiro Aota wrote:
> > On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2024/5/21 17:41, Naohiro Aota wrote:
> > > [...]
> > > > Same here.
> > > > 
> > > > >    	while (delalloc_start < page_end) {
> > > > >    		delalloc_end = page_end;
> > > > >    		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> > > > > @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> > > > >    			delalloc_start = delalloc_end + 1;
> > > > >    			continue;
> > > > >    		}
> > > > > -
> > > > > -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> > > > > -					       delalloc_end, wbc);
> > > > > -		if (ret < 0)
> > > > > -			return ret;
> > > > > -
> > > > > +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> > > > > +					    min(delalloc_end, page_end) + 1 -
> > > > > +					    delalloc_start);
> > > > > +		last_delalloc_end = delalloc_end;
> > > > >    		delalloc_start = delalloc_end + 1;
> > > > >    	}
> > > > 
> > > > Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> > > > following code simpler.
> > > 
> > > Mind to explain it a little more?
> > > 
> > > Did you mean something like this:
> > > 
> > > 	while (delalloc_start < page_end) {
> > > 		/* lock all subpage delalloc range code */
> > > 	}
> > > 	if (!last_delalloc_end)
> > > 		goto finish;
> > > 	while (delalloc_start < page_end) {
> > > 		/* run the delalloc ranges code* /
> > > 	}
> > > 
> > > If so, I can definitely go that way.
> > 
> > Yes, I meant that way. Apparently, "!last_delalloc_end" means it get no
> > delalloc region. So, we can just return 0 in that case without touching
> > "wbc->nr_to_write" as the current code does.
> > 
> > BTW, is this actually an overlooked error case? Is it OK to progress in
> > __extent_writepage() even if we don't run run_delalloc_range() ?
> 
> That's totally expected, and it would even be more common in fact.
> 
> Consider a very ordinary case like this:
> 
>    0             4K              8K            12K
>    |/////////////|///////////////|/////////////|
> 
> When running extent_writepage() for page 0, we run delalloc range for
> the whole [0, 12K) range, and created an OE for it.
> Then __extent_writepage_io() add page range [0, 4k) for bio.
> 
> Then extent_writepage() for page 4K, find_lock_delalloc() would not find
> any range, as previous iteration at page 0 has already created OE for
> the whole [0, 12K) range.
> 
> Although we would still run __extent_writepage_io() to add page range
> [4k, 8K) to the bio.
> 
> The same for page 8K.
> 
> Thanks,
> Qu

Ah, yes, that's true. I forgot about the following-pages case. Thank you for
your explanation.

Regards,

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8a4a7d00795f..b6dc9308105d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1226,13 +1226,22 @@  static inline void contiguous_readpages(struct page *pages[], int nr_pages,
 static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 		struct page *page, struct writeback_control *wbc)
 {
+	struct btrfs_fs_info *fs_info = inode_to_fs_info(&inode->vfs_inode);
+	struct folio *folio = page_folio(page);
 	const u64 page_start = page_offset(page);
 	const u64 page_end = page_start + PAGE_SIZE - 1;
+	/*
+	 * Saves the last found delalloc end. As the delalloc end can go beyond
+	 * page boundary, thus we can not rely on subpage bitmap to locate
+	 * the last dealloc end.
+	 */
+	u64 last_delalloc_end = 0;
 	u64 delalloc_start = page_start;
 	u64 delalloc_end = page_end;
 	u64 delalloc_to_write = 0;
 	int ret = 0;
 
+	/* Lock all (subpage) dealloc ranges inside the page first. */
 	while (delalloc_start < page_end) {
 		delalloc_end = page_end;
 		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
@@ -1240,15 +1249,68 @@  static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 			delalloc_start = delalloc_end + 1;
 			continue;
 		}
-
-		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
-					       delalloc_end, wbc);
-		if (ret < 0)
-			return ret;
-
+		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
+					    min(delalloc_end, page_end) + 1 -
+					    delalloc_start);
+		last_delalloc_end = delalloc_end;
 		delalloc_start = delalloc_end + 1;
 	}
+	delalloc_start = page_start;
+	/* Run the delalloc ranges for above locked ranges. */
+	while (last_delalloc_end && delalloc_start < page_end) {
+		u64 found_start;
+		u32 found_len;
+		bool found;
 
+		if (!btrfs_is_subpage(fs_info, page->mapping)) {
+			/*
+			 * For non-subpage case, the found delalloc range must
+			 * cover this page and there must be only one locked
+			 * delalloc range.
+			 */
+			found_start = page_start;
+			found_len = last_delalloc_end + 1 - found_start;
+			found = true;
+		} else {
+			found = btrfs_subpage_find_writer_locked(fs_info, folio,
+					delalloc_start, &found_start, &found_len);
+		}
+		if (!found)
+			break;
+		/*
+		 * The subpage range covers the last sector, the delalloc range may
+		 * end beyonds the page boundary, use the saved delalloc_end
+		 * instead.
+		 */
+		if (found_start + found_len >= page_end)
+			found_len = last_delalloc_end + 1 - found_start;
+
+		if (ret < 0) {
+			/* Cleanup the remaining locked ranges. */
+			unlock_extent(&inode->io_tree, found_start,
+				      found_start + found_len - 1, NULL);
+			__unlock_for_delalloc(&inode->vfs_inode, page, found_start,
+					      found_start + found_len - 1);
+		} else {
+			ret = btrfs_run_delalloc_range(inode, page, found_start,
+						       found_start + found_len - 1, wbc);
+		}
+		/*
+		 * Above btrfs_run_delalloc_range() may have unlocked the page,
+		 * Thus for the last range, we can not touch the page anymore.
+		 */
+		if (found_start + found_len >= last_delalloc_end + 1)
+			break;
+
+		delalloc_start = found_start + found_len;
+	}
+	if (ret < 0)
+		return ret;
+
+	if (last_delalloc_end)
+		delalloc_end = last_delalloc_end;
+	else
+		delalloc_end = page_end;
 	/*
 	 * delalloc_end is already one less than the total length, so
 	 * we don't subtract one from PAGE_SIZE
@@ -1520,7 +1582,8 @@  static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
 					       PAGE_SIZE, !ret);
 		mapping_set_error(page->mapping, ret);
 	}
-	unlock_page(page);
+
+	btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
 	ASSERT(ret <= 0);
 	return ret;
 }
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 3c957d03324e..81b862d7ab53 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -862,6 +862,7 @@  bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
 void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
 				 struct folio *folio)
 {
+	struct btrfs_subpage *subpage = folio_get_private(folio);
 	u64 folio_start = folio_pos(folio);
 	u64 cur = folio_start;
 
@@ -871,6 +872,11 @@  void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
 		return;
 	}
 
+	/* The page has no new delalloc range locked on it. Just plain unlock. */
+	if (atomic_read(&subpage->writers) == 0) {
+		folio_unlock(folio);
+		return;
+	}
 	while (cur < folio_start + PAGE_SIZE) {
 		u64 found_start;
 		u32 found_len;