
[f2fs-dev,02/14] btrfs: convert get_next_extent_buffer() to take a folio

Message ID 20240822013714.3278193-3-lizetao1@huawei.com (mailing list archive)
State Accepted
Commit d4aeb5f7a7e67d780e3eaae0b6e7d4e2d31042ee
Series btrfs: Cleaned up folio->page conversion

Commit Message

Li Zetao Aug. 22, 2024, 1:37 a.m. UTC
The old page API is being gradually replaced and converted to use folios
to improve code readability and avoid repeated conversion between page
and folio. Also use folio_pos() instead of page_offset(), which is more
consistent with folio usage.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
---
 fs/btrfs/extent_io.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Comments

Matthew Wilcox Aug. 22, 2024, 3:05 a.m. UTC | #1
On Thu, Aug 22, 2024 at 09:37:02AM +0800, Li Zetao wrote:
>  static struct extent_buffer *get_next_extent_buffer(
> -		const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
> +		const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
>  {
>  	struct extent_buffer *gang[GANG_LOOKUP_SIZE];
>  	struct extent_buffer *found = NULL;
> -	u64 page_start = page_offset(page);
> -	u64 cur = page_start;
> +	u64 folio_start = folio_pos(folio);
> +	u64 cur = folio_start;
>  
> -	ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
> +	ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
>  	lockdep_assert_held(&fs_info->buffer_lock);
>  
> -	while (cur < page_start + PAGE_SIZE) {
> +	while (cur < folio_start + PAGE_SIZE) {

Presumably we want to support large folios in btrfs at some point?
I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
be a bit of a regression for btrfs if it doesn't have large folio
support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?
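For illustration, a minimal sketch of the bound that substitution would give;
bytenr_in_folio() is a hypothetical helper made up for this thread, not part
of the applied patch:

static bool bytenr_in_folio(struct folio *folio, u64 bytenr)
{
        u64 folio_start = folio_pos(folio);

        /*
         * Bound by the folio's own size instead of PAGE_SIZE so the same
         * check keeps working once btrfs uses large folios.
         */
        return bytenr >= folio_start &&
               bytenr < folio_start + folio_size(folio);
}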
Qu Wenruo Aug. 22, 2024, 10:58 a.m. UTC | #2
On 2024/8/22 12:35, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 09:37:02AM +0800, Li Zetao wrote:
>>   static struct extent_buffer *get_next_extent_buffer(
>> -		const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
>> +		const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
>>   {
>>   	struct extent_buffer *gang[GANG_LOOKUP_SIZE];
>>   	struct extent_buffer *found = NULL;
>> -	u64 page_start = page_offset(page);
>> -	u64 cur = page_start;
>> +	u64 folio_start = folio_pos(folio);
>> +	u64 cur = folio_start;
>>
>> -	ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
>> +	ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
>>   	lockdep_assert_held(&fs_info->buffer_lock);
>>
>> -	while (cur < page_start + PAGE_SIZE) {
>> +	while (cur < folio_start + PAGE_SIZE) {
>
> Presumably we want to support large folios in btrfs at some point?

Yes, and we're already working in that direction.

> I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
> be a bit of a regression for btrfs if it doesn't have large folio
> support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?

AFAIK we're only going to support larger folios in order to support a
larger-than-PAGE_SIZE sector size, so far.
So every folio is still a fixed size (sector size, >= PAGE_SIZE).

I'm not familiar with transparent huge pages; I thought transparent huge
pages were transparent to the fs.

Or do we need some special handling?
My uneducated guess is that we will get a larger folio passed to the
readpage callback directly?
It's straightforward enough to read all the contents of a larger folio;
it's no different from subpage handling.

But what will happen if some writes happen to that larger folio?
Does the MM layer detect that and split the folio? Or does the fs have to
go through the subpage routine (with an extra structure recording all the
subpage flag bitmaps)?

Thanks,
Qu
Qu Wenruo Aug. 22, 2024, 11:01 a.m. UTC | #3
On 2024/8/22 12:35, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 09:37:02AM +0800, Li Zetao wrote:
>>   static struct extent_buffer *get_next_extent_buffer(
>> -		const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
>> +		const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
>>   {
>>   	struct extent_buffer *gang[GANG_LOOKUP_SIZE];
>>   	struct extent_buffer *found = NULL;
>> -	u64 page_start = page_offset(page);
>> -	u64 cur = page_start;
>> +	u64 folio_start = folio_pos(folio);
>> +	u64 cur = folio_start;
>>
>> -	ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
>> +	ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
>>   	lockdep_assert_held(&fs_info->buffer_lock);
>>
>> -	while (cur < page_start + PAGE_SIZE) {
>> +	while (cur < folio_start + PAGE_SIZE) {
>
> Presumably we want to support large folios in btrfs at some point?
> I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
> be a bit of a regression for btrfs if it doesn't have large folio
> support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?

Forgot to mention that this function is only called inside the subpage
routine (sectorsize < PAGE_SIZE and nodesize, aka metadata size, < PAGE_SIZE).

So PAGE_SIZE is correct. Going with folio_size() would only waste CPU time,
but if you do not feel safe, we can add an extra ASSERT() to make sure it's
only called from the subpage routine.
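For what it's worth, a minimal sketch of such an extra assertion, assuming
only the existing btrfs_fs_info fields; assert_subpage_only() is a
hypothetical helper, not existing code:

static inline void assert_subpage_only(const struct btrfs_fs_info *fs_info)
{
        /*
         * Document that this path is subpage-only: both data and metadata
         * block sizes are smaller than PAGE_SIZE, so PAGE_SIZE bounds hold.
         */
        ASSERT(fs_info->sectorsize < PAGE_SIZE);
        ASSERT(fs_info->nodesize < PAGE_SIZE);
}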

Thanks,
Qu
Matthew Wilcox Aug. 22, 2024, 12:07 p.m. UTC | #4
On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> On 2024/8/22 12:35, Matthew Wilcox wrote:
> > > -	while (cur < page_start + PAGE_SIZE) {
> > > +	while (cur < folio_start + PAGE_SIZE) {
> > 
> > Presumably we want to support large folios in btrfs at some point?
> 
> Yes, and we're already working towards that direction.
> 
> > I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
> > be a bit of a regression for btrfs if it doesn't have large folio
> > support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?
> 
> AFAIK we're only going to support larger folios to support larger than
> PAGE_SIZE sector size so far.

Why do you not want the performance gains from using larger folios?

> So every folio is still in a fixed size (sector size, >= PAGE_SIZE).
> 
> Not familiar with transparent huge page, I thought transparent huge page
> is transparent to fs.
> 
> Or do we need some special handling?
> My uneducated guess is, we will get a larger folio passed to readpage
> call back directly?

Why do you choose to remain uneducated?  It's not like I've been keeping
all of this to myself for the past five years.  I've given dozens of
presentations on it, including plenary sessions at LSFMM.  As a filesystem
developer, you must want to not know about it at this point.

> It's straightforward enough to read all contents for a larger folio,
> it's no different to subpage handling.
> 
> But what will happen if some writes happened to that larger folio?
> Do MM layer detects that and split the folios? Or the fs has to go the
> subpage routine (with an extra structure recording all the subpage flags
> bitmap)?

Entirely up to the filesystem.  It would help if btrfs used the same
terminology as the rest of the filesystems instead of inventing its own
"subpage" thing.  As far as I can tell, "subpage" means "fs block size",
but maybe it has a different meaning that I haven't ascertained.

Tracking dirtiness on a per-folio basis does not seem to be good enough.
Various people have workloads that regress in performance if you do
that.  So having some data structure attached to folio->private which
tracks dirtiness on a per-fs-block basis works pretty well.  iomap also
tracks the uptodate bit on a per-fs-block basis, but I'm less convinced
that's necessary.

I have no idea why btrfs thinks it needs to track writeback, ordered,
checked and locked in a bitmap.  Those make no sense to me.  But they
make no sense to me if you're supporting a 4KiB filesystem on a machine
with a 64KiB PAGE_SIZE, not just in the context of "larger folios".
Writeback is something the VM tells you to do; why do you need to tag
individual blocks for writeback?
Qu Wenruo Aug. 22, 2024, 10:25 p.m. UTC | #5
On 2024/8/22 21:37, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
>> On 2024/8/22 12:35, Matthew Wilcox wrote:
>>>> -	while (cur < page_start + PAGE_SIZE) {
>>>> +	while (cur < folio_start + PAGE_SIZE) {
>>>
>>> Presumably we want to support large folios in btrfs at some point?
>>
>> Yes, and we're already working towards that direction.
>>
>>> I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
>>> be a bit of a regression for btrfs if it doesn't have large folio
>>> support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?
>>
>> AFAIK we're only going to support larger folios to support larger than
>> PAGE_SIZE sector size so far.
>
> Why do you not want the performance gains from using larger folios?
>
>> So every folio is still in a fixed size (sector size, >= PAGE_SIZE).
>>
>> Not familiar with transparent huge page, I thought transparent huge page
>> is transparent to fs.
>>
>> Or do we need some special handling?
>> My uneducated guess is, we will get a larger folio passed to readpage
>> call back directly?
>
> Why do you choose to remain uneducated?  It's not like I've been keeping
> all of this to myself for the past five years.  I've given dozens of
> presentations on it, including plenary sessions at LSFMM.  As a filesystem
> developer, you must want to not know about it at this point.
>
>> It's straightforward enough to read all contents for a larger folio,
>> it's no different to subpage handling.
>>
>> But what will happen if some writes happened to that larger folio?
>> Do MM layer detects that and split the folios? Or the fs has to go the
>> subpage routine (with an extra structure recording all the subpage flags
>> bitmap)?
>
> Entirely up to the filesystem.  It would help if btrfs used the same
> terminology as the rest of the filesystems instead of inventing its own
> "subpage" thing.  As far as I can tell, "subpage" means "fs block size",
> but maybe it has a different meaning that I haven't ascertained.

Then tell me the correct terminology to describe fs block size smaller
than page size in the first place.

"fs block size" is not good enough, we want a terminology to describe
"fs block size" smaller than page size.

>
> Tracking dirtiness on a per-folio basis does not seem to be good enough.
> Various people have workloads that regress in performance if you do
> that.  So having some data structure attached to folio->private which
> tracks dirtiness on a per-fs-block basis works pretty well.  iomap also
> tracks the uptodate bit on a per-fs-block basis, but I'm less convinced
> that's necessary.
>
> I have no idea why btrfs thinks it needs to track writeback, ordered,
> checked and locked in a bitmap.  Those make no sense to me.  But they
> make no sense to me if you're support a 4KiB filesystem on a machine
> with a 64KiB PAGE_SIZE, not just in the context of "larger folios".
> Writeback is something the VM tells you to do; why do you need to tag
> individual blocks for writeback?

Because there are cases where btrfs needs to only write back part of the
folio independently.

And especially for mixing compression and non-compression writes inside
a page, e.g:

       0     16K     32K     48K      64K
       |//|          |///////|
          4K

In the above case, if we need to write back the above page with a 4K sector
size, then the first 4K is not suitable for compression (the result will
still take a full 4K block), while the range [32K, 48K) will be compressed.

In that case, the [0, 4K) range will be submitted directly for IO.
Meanwhile [32K, 48K) will be submitted for compression in another wq.
(Otherwise the time-consuming compression would delay the writeback of the
remaining pages.)

This means the dirty/writeback flags will be changed at different times.

I think compression is no longer a btrfs-exclusive feature, so this
should be obvious?
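As a rough illustration of that kind of per-block tracking, a hypothetical
sketch of state attached to folio->private; the names are made up and btrfs'
real struct btrfs_subpage differs in detail:

struct block_state {
        spinlock_t lock;
        unsigned long dirty;            /* one bit per fs block */
        unsigned long writeback;        /* one bit per fs block */
};

static void mark_block_range_writeback(struct folio *folio, u64 start,
                                       u32 len, u32 blocksize)
{
        struct block_state *bs = folio_get_private(folio);
        unsigned int first = (start - folio_pos(folio)) / blocksize;
        unsigned int nr = len / blocksize;
        unsigned long flags;

        /*
         * Only the blocks actually submitted (for direct IO or compression)
         * are flagged, so the rest of the folio can be written back later.
         */
        spin_lock_irqsave(&bs->lock, flags);
        bitmap_set(&bs->writeback, first, nr);
        spin_unlock_irqrestore(&bs->lock, flags);
}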

Thanks,
Qu
Qu Wenruo Aug. 23, 2024, 2:13 a.m. UTC | #6
On 2024/8/23 07:55, Qu Wenruo wrote:
>
>
> On 2024/8/22 21:37, Matthew Wilcox wrote:
>> On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
>> On 2024/8/22 12:35, Matthew Wilcox wrote:
>>>>> -    while (cur < page_start + PAGE_SIZE) {
>>>>> +    while (cur < folio_start + PAGE_SIZE) {
>>>>
>>>> Presumably we want to support large folios in btrfs at some point?
>>>
>>> Yes, and we're already working towards that direction.
>>>
>>>> I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
>>>> be a bit of a regression for btrfs if it doesn't have large folio
>>>> support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?
>>>
>>> AFAIK we're only going to support larger folios to support larger than
>>> PAGE_SIZE sector size so far.
>>
>> Why do you not want the performance gains from using larger folios?
>>
>>> So every folio is still in a fixed size (sector size, >= PAGE_SIZE).
>>>
>>> Not familiar with transparent huge page, I thought transparent huge page
>>> is transparent to fs.
>>>
>>> Or do we need some special handling?
>>> My uneducated guess is, we will get a larger folio passed to readpage
>>> call back directly?
>>
>> Why do you choose to remain uneducated?  It's not like I've been keeping
>> all of this to myself for the past five years.  I've given dozens of
>> presentations on it, including plenary sessions at LSFMM.  As a
>> filesystem
>> developer, you must want to not know about it at this point.
>>
>>> It's straightforward enough to read all contents for a larger folio,
>>> it's no different to subpage handling.
>>>
>>> But what will happen if some writes happened to that larger folio?
>>> Do MM layer detects that and split the folios? Or the fs has to go the
>>> subpage routine (with an extra structure recording all the subpage flags
>>> bitmap)?
>>
>> Entirely up to the filesystem.  It would help if btrfs used the same
>> terminology as the rest of the filesystems instead of inventing its own
>> "subpage" thing.  As far as I can tell, "subpage" means "fs block size",
>> but maybe it has a different meaning that I haven't ascertained.
>
> Then tell me the correct terminology to describe fs block size smaller
> than page size in the first place.
>
> "fs block size" is not good enough, we want a terminology to describe
> "fs block size" smaller than page size.
>
>>
>> Tracking dirtiness on a per-folio basis does not seem to be good enough.
>> Various people have workloads that regress in performance if you do
>> that.  So having some data structure attached to folio->private which
>> tracks dirtiness on a per-fs-block basis works pretty well.  iomap also
>> tracks the uptodate bit on a per-fs-block basis, but I'm less convinced
>> that's necessary.
>>
>> I have no idea why btrfs thinks it needs to track writeback, ordered,
>> checked and locked in a bitmap.  Those make no sense to me.  But they
>> make no sense to me if you're support a 4KiB filesystem on a machine
>> with a 64KiB PAGE_SIZE, not just in the context of "larger folios".
>> Writeback is something the VM tells you to do; why do you need to tag
>> individual blocks for writeback?
>
> Because there are cases where btrfs needs to only write back part of the
> folio independently.
>
> And especially for mixing compression and non-compression writes inside
> a page, e.g:
>
>        0     16K     32K     48K      64K
>        |//|          |///////|
>           4K
>
> In above case, if we need to writeback above page with 4K sector size,
> then the first 4K is not suitable for compression (result will still
> take a full 4K block), while the range [32K, 48K) will be compressed.
>
> In that case, [0, 4K) range will be submitted directly for IO.
> Meanwhile [32K, 48) will be submitted for compression in antoher wq.
> (Or time consuming compression will delay the writeback of the remaining
> pages)
>
> This means the dirty/writeback flags will have a different timing to be
> changed.

Just in case you mean using an atomic to track the writeback/lock
progress: it's possible to go that path, but for now it's not space
efficient.

For the 16-blocks-per-page case (4K sectorsize, 64K page size), each atomic
takes 4 bytes while a bitmap only takes 2 bytes.

And for the 4K sectorsize, 16K page size case it's worse, so btrfs compacts
all the bitmaps into a larger one to save more space, while each atomic
still takes 4 bytes.

Thanks,
Qu

>
> I think compression is no long a btrfs exclusive feature, thus this
> should be obvious?
>
> Thanks,
> Qu
>
Josef Bacik Aug. 23, 2024, 3:17 p.m. UTC | #7
On Thu, Aug 22, 2024 at 01:07:52PM +0100, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > On 2024/8/22 12:35, Matthew Wilcox wrote:
> > > > -	while (cur < page_start + PAGE_SIZE) {
> > > > +	while (cur < folio_start + PAGE_SIZE) {
> > > 
> > > Presumably we want to support large folios in btrfs at some point?
> > 
> > Yes, and we're already working towards that direction.
> > 
> > > I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
> > > be a bit of a regression for btrfs if it doesn't have large folio
> > > support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?
> > 
> > AFAIK we're only going to support larger folios to support larger than
> > PAGE_SIZE sector size so far.
> 
> Why do you not want the performance gains from using larger folios?
> 
> > So every folio is still in a fixed size (sector size, >= PAGE_SIZE).
> > 
> > Not familiar with transparent huge page, I thought transparent huge page
> > is transparent to fs.
> > 
> > Or do we need some special handling?
> > My uneducated guess is, we will get a larger folio passed to readpage
> > call back directly?
> 
> Why do you choose to remain uneducated?  It's not like I've been keeping
> all of this to myself for the past five years.  I've given dozens of
> presentations on it, including plenary sessions at LSFMM.  As a filesystem
> developer, you must want to not know about it at this point.
> 

Or, a more charitable read is that this particular developer has never been to
LSFMMBPF/Plumbers/anywhere you've given a talk.

This stuff is complicated, I don't even know what's going on half the time, much
less a developer who exclusively works on btrfs internals.

There are a lot of things to know in this space; we can't all know what's going
on in every corner.

As for the topic at hand, I'm moving us in the direction of an iomap conversion
so we can have the large folio support, regardless of the underlying
sectorsize/metadata block size.  Unfortunately there are a lot of fundamental
changes that need to be made to facilitate that, and those are going to take
some time to test and validate to make sure we didn't break anything.  In the
meantime we're going to be in this weird limbo state while I tackle the
individual problem areas.

My priorities are split between getting this to work and fuse improvements to
eventually no longer need to have file systems in the kernel to avoid all this
in general, so it's going to be spurty, but I hope to have this work done by the
next LSFMMBPF.  Thanks,

Josef
Matthew Wilcox Aug. 23, 2024, 3:38 p.m. UTC | #8
On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
> On 2024/8/23 07:55, Qu Wenruo wrote:
> > On 2024/8/22 21:37, Matthew Wilcox wrote:
> > > On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > > > But what will happen if some writes happened to that larger folio?
> > > > Do MM layer detects that and split the folios? Or the fs has to go the
> > > > subpage routine (with an extra structure recording all the subpage flags
> > > > bitmap)?
> > > 
> > > Entirely up to the filesystem.  It would help if btrfs used the same
> > > terminology as the rest of the filesystems instead of inventing its own
> > > "subpage" thing.  As far as I can tell, "subpage" means "fs block size",
> > > but maybe it has a different meaning that I haven't ascertained.
> > 
> > Then tell me the correct terminology to describe fs block size smaller
> > than page size in the first place.
> > 
> > "fs block size" is not good enough, we want a terminology to describe
> > "fs block size" smaller than page size.

Oh dear.  btrfs really has corrupted your brain.  Here's the terminology
used in the rest of Linux:

SECTOR_SIZE.  Fixed at 512 bytes.  This is the unit used to communicate
with the block layer.  It has no real meaning, other than Linux doesn't
support block devices with 128 and 256 byte sector sizes (I have used
such systems, but not in the last 30 years).

LBA size.  This is the unit that the block layer uses to communicate
with the block device.  Must be at least SECTOR_SIZE.  I/O cannot be
performed in smaller chunks than this.

Physical block size.  This is the unit that the device advertises as
its efficient minimum size.  I/Os smaller than this / not aligned to
this will probably incur a performance penalty as the device will need
to do a read-modify-write cycle.

fs block size.  Known as s_blocksize or i_blocksize.  Must be a multiple
of LBA size, but may be smaller than physical block size.  Files are
allocated in multiples of this unit.

PAGE_SIZE.  Unit that memory can be mapped in.  This applies to both
userspace mapping of files as well as calls to kmap_local_*().

folio size.  The size that the page cache has decided to manage this
chunk of the file in.  A multiple of PAGE_SIZE.


I've mostly listed this in smallest to largest.  The relationships that
must be true:

SS <= LBA <= Phys
LBA <= fsb
PS <= folio
fsb <= folio

ocfs2 supports fsb > PAGE_SIZE, but this is a rarity.  Most filesystems
require fsb <= PAGE_SIZE.

Filesystems like UFS also support a fragment size which is less than fs
block size.  It's kind of like tail packing.  Anyway, that's internal to
the filesystem and not exposed to the VFS.

> > > I have no idea why btrfs thinks it needs to track writeback, ordered,
> > > checked and locked in a bitmap.  Those make no sense to me.  But they
> > > make no sense to me if you're support a 4KiB filesystem on a machine
> > > with a 64KiB PAGE_SIZE, not just in the context of "larger folios".
> > > Writeback is something the VM tells you to do; why do you need to tag
> > > individual blocks for writeback?
> > 
> > Because there are cases where btrfs needs to only write back part of the
> > folio independently.

iomap manages to do this with only tracking per-block dirty bits.

> > And especially for mixing compression and non-compression writes inside
> > a page, e.g:
> > 
> >        0     16K     32K     48K      64K
> >        |//|          |///////|
> >           4K
> > 
> > In above case, if we need to writeback above page with 4K sector size,
> > then the first 4K is not suitable for compression (result will still
> > take a full 4K block), while the range [32K, 48K) will be compressed.
> > 
> > In that case, [0, 4K) range will be submitted directly for IO.
> > Meanwhile [32K, 48) will be submitted for compression in antoher wq.
> > (Or time consuming compression will delay the writeback of the remaining
> > pages)
> > 
> > This means the dirty/writeback flags will have a different timing to be
> > changed.
> 
> Just in case if you mean using an atomic to trace the writeback/lock
> progress, then it's possible to go that path, but for now it's not space
> efficient.
> 
> For 16 blocks per page case (4K sectorsize 64K page size), each atomic
> takes 4 bytes while a bitmap only takes 2 bytes.
> 
> And for 4K sectorsize 16K page size case, it's worse and btrfs compact
> all the bitmaps into a larger one to save more space, while each atomic
> still takes 4 bytes.

Sure, but it doesn't scale up well.  And it's kind of irrelevant whether
you occupy 2 or 4 bytes at the low end because you're allocating all
this through slab, so you get rounded to 8 bytes anyway.
iomap_folio_state currently occupies 12 bytes + 2 bits per block.  So
for a 16 block folio (4k in 64k), that's 32 bits for a total of 16
bytes.  For a 2MB folio on a 4kB block size fs, that's 1024 bits for
a total of 140 bytes (rounded to 192 bytes by slab).
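For reference, a back-of-envelope sketch of that arithmetic, assuming the
"12 bytes plus 2 bits per block" layout described above; the helper is purely
illustrative, not the real iomap_folio_state definition:

static size_t folio_state_bytes(size_t fsize, size_t bsize)
{
        size_t blocks = fsize / bsize;

        /* 12-byte header plus a 2-bits-per-block bitmap, rounded up. */
        return 12 + DIV_ROUND_UP(2 * blocks, BITS_PER_BYTE);
}

/*
 * folio_state_bytes(SZ_64K, SZ_4K) == 16   -> a 16-byte slab object
 * folio_state_bytes(SZ_2M,  SZ_4K) == 140  -> rounded to 192 bytes by slab
 */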

Hm, it might be worth adding a kmalloc-160, we'd get 25 objects per 4KiB
page instead of 21 192-byte objects ...
Qu Wenruo Aug. 23, 2024, 9:32 p.m. UTC | #9
On 2024/8/24 01:08, Matthew Wilcox wrote:
> On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
>> On 2024/8/23 07:55, Qu Wenruo wrote:
>>> On 2024/8/22 21:37, Matthew Wilcox wrote:
>>>> On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
>>>>> But what will happen if some writes happened to that larger folio?
>>>>> Do MM layer detects that and split the folios? Or the fs has to go the
>>>>> subpage routine (with an extra structure recording all the subpage flags
>>>>> bitmap)?
>>>>
>>>> Entirely up to the filesystem.  It would help if btrfs used the same
>>>> terminology as the rest of the filesystems instead of inventing its own
>>>> "subpage" thing.  As far as I can tell, "subpage" means "fs block size",
>>>> but maybe it has a different meaning that I haven't ascertained.
>>>
>>> Then tell me the correct terminology to describe fs block size smaller
>>> than page size in the first place.
>>>
>>> "fs block size" is not good enough, we want a terminology to describe
>>> "fs block size" smaller than page size.
>
> Oh dear.  btrfs really has corrupted your brain.  Here's the terminology
> used in the rest of Linux:
>
> SECTOR_SIZE.  Fixed at 512 bytes.  This is the unit used to communicate
> with the block layer.  It has no real meaning, other than Linux doesn't
> support block devices with 128 and 256 byte sector sizes (I have used
> such systems, but not in the last 30 years).
>
> LBA size.  This is the unit that the block layer uses to communicate
> with the block device.  Must be at least SECTOR_SIZE.  I/O cannot be
> performed in smaller chunks than this.
>
> Physical block size.  This is the unit that the device advertises as
> its efficient minimum size.  I/Os smaller than this / not aligned to
> this will probably incur a performance penalty as the device will need
> to do a read-modify-write cycle.
>
> fs block size.  Known as s_blocksize or i_blocksize.  Must be a multiple
> of LBA size, but may be smaller than physical block size.  Files are
> allocated in multiples of this unit.
>
> PAGE_SIZE.  Unit that memory can be mapped in.  This applies to both
> userspace mapping of files as well as calls to kmap_local_*().
>
> folio size.  The size that the page cache has decided to manage this
> chunk of the file in.  A multiple of PAGE_SIZE.
>
>
> I've mostly listed this in smallest to largest.  The relationships that
> must be true:
>
> SS <= LBA <= Phys
> LBA <= fsb
> PS <= folio
> fsb <= folio
>
> ocfs2 supports fsb > PAGE_SIZE, but this is a rarity.  Most filesystems
> require fsb <= PAGE_SIZE.
>
> Filesystems like UFS also support a fragment size which is less than fs
> block size.  It's kind of like tail packing.  Anyway, that's internal to
> the filesystem and not exposed to the VFS.

I know all these things; the terminology I need is a short one to
describe the fsb < PAGE_SIZE case.

So far, in the fs realm, "subpage" (sub-page-sized block size) is the
shortest and simplest one.

Sure, you will get confused with a "subpage" range inside a page, but
that's because you're mostly working in the MM layer.

So please give me a better alternative to describe exactly the "fsb <
PAGE_SIZE" case.
Otherwise it's just complaining without any constructive advice.

>
>>>> I have no idea why btrfs thinks it needs to track writeback, ordered,
>>>> checked and locked in a bitmap.  Those make no sense to me.  But they
>>>> make no sense to me if you're support a 4KiB filesystem on a machine
>>>> with a 64KiB PAGE_SIZE, not just in the context of "larger folios".
>>>> Writeback is something the VM tells you to do; why do you need to tag
>>>> individual blocks for writeback?
>>>
>>> Because there are cases where btrfs needs to only write back part of the
>>> folio independently.
>
> iomap manages to do this with only tracking per-block dirty bits.

Well, does iomap support asynchronous compression?

This proves Josef's point: different people have different focuses.
Please do not assume everyone knows the realm you're working in, nor that
there will always be a one-size-fits-all solution.

>
>>> And especially for mixing compression and non-compression writes inside
>>> a page, e.g:
>>>
>>>         0     16K     32K     48K      64K
>>>         |//|          |///////|
>>>            4K
>>>
>>> In above case, if we need to writeback above page with 4K sector size,
>>> then the first 4K is not suitable for compression (result will still
>>> take a full 4K block), while the range [32K, 48K) will be compressed.
>>>
>>> In that case, [0, 4K) range will be submitted directly for IO.
>>> Meanwhile [32K, 48) will be submitted for compression in antoher wq.
>>> (Or time consuming compression will delay the writeback of the remaining
>>> pages)
>>>
>>> This means the dirty/writeback flags will have a different timing to be
>>> changed.
>>
>> Just in case if you mean using an atomic to trace the writeback/lock
>> progress, then it's possible to go that path, but for now it's not space
>> efficient.
>>
>> For 16 blocks per page case (4K sectorsize 64K page size), each atomic
>> takes 4 bytes while a bitmap only takes 2 bytes.
>>
>> And for 4K sectorsize 16K page size case, it's worse and btrfs compact
>> all the bitmaps into a larger one to save more space, while each atomic
>> still takes 4 bytes.
>
> Sure, but it doesn't scale up well.  And it's kind of irrelevant whether
> you occupy 2 or 4 bytes at the low end because you're allocating all
> this through slab, so you get rounded to 8 bytes anyway.
> iomap_folio_state currently occupies 12 bytes + 2 bits per block.  So
> for a 16 block folio (4k in 64k), that's 32 bits for a total of 16
> bytes.  For a 2MB folio on a 4kB block size fs, that's 1024 bits for
> a total of 140 bytes (rounded to 192 bytes by slab).

Yes, it's not scalable for all folio sizes, but the turning point is 32
bits, which means a 128K folio for a 4K page sized system.
Since the folio code already considers order > 3 as costly, I'm totally
fine sacrificing the higher order ones, but not the other way around.

Although the real determining factor is the real world distribution of
folio sizes.

But for now, since btrfs only supports 4K block size with 64K/16K page
size, it's still a win for us.

Another point for the bitmap is that it helps a lot (at least for me) with
debugging, but that can always be hidden behind some debug flag.


I'm not denying the possibility of fully migrating to the iomap way, but
that will need a lot of extra work first, like cleaning up the cow_fixup
thing to reduce the extra page flag tracking.
(That always causes a lot of discussion but seldom leads to patches.)

Thanks,
Qu
>
> Hm, it might be worth adding a kmalloc-160, we'd get 25 objects per 4KiB
> page instead of 21 192-byte objects ...
>
>
Josef Bacik Aug. 26, 2024, 2:13 p.m. UTC | #10
On Fri, Aug 23, 2024 at 04:38:27PM +0100, Matthew Wilcox wrote:
> On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
> > On 2024/8/23 07:55, Qu Wenruo wrote:
> > > On 2024/8/22 21:37, Matthew Wilcox wrote:
> > > > On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > > > > But what will happen if some writes happened to that larger folio?
> > > > > Do MM layer detects that and split the folios? Or the fs has to go the
> > > > > subpage routine (with an extra structure recording all the subpage flags
> > > > > bitmap)?
> > > > 
> > > > Entirely up to the filesystem.  It would help if btrfs used the same
> > > > terminology as the rest of the filesystems instead of inventing its own
> > > > "subpage" thing.  As far as I can tell, "subpage" means "fs block size",
> > > > but maybe it has a different meaning that I haven't ascertained.
> > > 
> > > Then tell me the correct terminology to describe fs block size smaller
> > > than page size in the first place.
> > > 
> > > "fs block size" is not good enough, we want a terminology to describe
> > > "fs block size" smaller than page size.
> 
> Oh dear.  btrfs really has corrupted your brain.  Here's the terminology
> used in the rest of Linux:

This isn't necessary commentary; it gives the impression that we're
wrong/stupid/etc.  We're all in this community together; having public, negative
commentary like this is unnecessary, and frankly contributes to my growing
desire/priorities to shift most of my development outside of the kernel
community.  And if somebody with my experience and history in this community is
becoming more and more disillusioned with this work and making serious efforts
to extricate themselves from the project, then what does that say about our
ability to bring in new developers?  Thanks,

Josef
Matthew Wilcox Aug. 26, 2024, 4:56 p.m. UTC | #11
On Mon, Aug 26, 2024 at 10:13:01AM -0400, Josef Bacik wrote:
> On Fri, Aug 23, 2024 at 04:38:27PM +0100, Matthew Wilcox wrote:
> > On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
> > > On 2024/8/23 07:55, Qu Wenruo wrote:
> > > > On 2024/8/22 21:37, Matthew Wilcox wrote:
> > > > > On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > > > > > But what will happen if some writes happened to that larger folio?
> > > > > > Do MM layer detects that and split the folios? Or the fs has to go the
> > > > > > subpage routine (with an extra structure recording all the subpage flags
> > > > > > bitmap)?
> > > > > 
> > > > > Entirely up to the filesystem.  It would help if btrfs used the same
> > > > > terminology as the rest of the filesystems instead of inventing its own
> > > > > "subpage" thing.  As far as I can tell, "subpage" means "fs block size",
> > > > > but maybe it has a different meaning that I haven't ascertained.
> > > > 
> > > > Then tell me the correct terminology to describe fs block size smaller
> > > > than page size in the first place.
> > > > 
> > > > "fs block size" is not good enough, we want a terminology to describe
> > > > "fs block size" smaller than page size.
> > 
> > Oh dear.  btrfs really has corrupted your brain.  Here's the terminology
> > used in the rest of Linux:
> 
> This isn't necessary commentary, this gives the impression that we're
> wrong/stupid/etc.  We're all in this community together, having public, negative
> commentary like this is unnecessary, and frankly contributes to my growing
> desire/priorities to shift most of my development outside of the kernel
> community.  And if somebody with my experience and history in this community is
> becoming more and more disillusioned with this work and making serious efforts
> to extricate themselves from the project, then what does that say about our
> ability to bring in new developers?  Thanks,

You know what?  I'm disillusioned too.  It's been over two and a half
years since folios were added (v5.16 was the first release with folios).
I knew it would be a long project (I was anticipating five years).
I was expecting to have to slog through all the old unmaintained crap
filesystems doing minimal conversions.  What I wasn't expecting was for
all the allegedly maintained filesystems to sit on their fucking hands and
ignore it as long as possible.  The biggest pains right now are btrfs,
ceph and f2fs, all of which have people who are paid to work on them!
I had to do ext4 largely myself.

Some filesystems have been great.  XFS worked with me as I did that
filesystem first.  nfs took care of their own code.  Dave Howells has
done most of the other network filesystems.  Many other people have
also helped.

But we can't even talk to each other unless we agree on what words mean.
And btrfs inventing its own terminology for things which already exist
in other filesystems is extremely unhelpful.

We need to remove the temporary hack that is CONFIG_READ_ONLY_THP_FOR_FS.
That went in with the understanding that filesystems that mattered would
add large folio support.  When I see someone purporting to represent
btrfs say "Oh, we're not going to do that", that's a breach of trust.

When I'm accused of not understanding things from the filesystem
perspective, that's just a lie.  I have 192 commits in fs/ between 6.6
and 6.10 versus 160 in mm/ (346 commits in both combined, so 6 commits
are double-counted because they touch both).  This whole project has
been filesystem-centric from the beginning.
Josef Bacik Aug. 26, 2024, 9:32 p.m. UTC | #12
On Mon, Aug 26, 2024 at 05:56:01PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 26, 2024 at 10:13:01AM -0400, Josef Bacik wrote:
> > On Fri, Aug 23, 2024 at 04:38:27PM +0100, Matthew Wilcox wrote:
> > > On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
> > > > 在 2024/8/23 07:55, Qu Wenruo 写道:
> > > > > 在 2024/8/22 21:37, Matthew Wilcox 写道:
> > > > > > On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > > > > > > But what will happen if some writes happened to that larger folio?
> > > > > > > Do MM layer detects that and split the folios? Or the fs has to go the
> > > > > > > subpage routine (with an extra structure recording all the subpage flags
> > > > > > > bitmap)?
> > > > > > 
> > > > > > Entirely up to the filesystem.  It would help if btrfs used the same
> > > > > > terminology as the rest of the filesystems instead of inventing its own
> > > > > > "subpage" thing.  As far as I can tell, "subpage" means "fs block size",
> > > > > > but maybe it has a different meaning that I haven't ascertained.
> > > > > 
> > > > > Then tell me the correct terminology to describe fs block size smaller
> > > > > than page size in the first place.
> > > > > 
> > > > > "fs block size" is not good enough, we want a terminology to describe
> > > > > "fs block size" smaller than page size.
> > > 
> > > Oh dear.  btrfs really has corrupted your brain.  Here's the terminology
> > > used in the rest of Linux:
> > 
> > This isn't necessary commentary, this gives the impression that we're
> > wrong/stupid/etc.  We're all in this community together, having public, negative
> > commentary like this is unnecessary, and frankly contributes to my growing
> > desire/priorities to shift most of my development outside of the kernel
> > community.  And if somebody with my experience and history in this community is
> > becoming more and more disillusioned with this work and making serious efforts
> > to extricate themselves from the project, then what does that say about our
> > ability to bring in new developers?  Thanks,
> 
> You know what?  I'm disillusioned too.  It's been over two and a half
> years since folios were added (v5.16 was the first release with folios).
> I knew it would be a long project (I was anticipating five years).
> I was expecting to have to slog through all the old unmaintained crap
> filesystems doing minimal conversions.  What I wasn't expecting was for
> all the allegedly maintained filesystems to sit on their fucking hands and
> ignore it as long as possible.  The biggest pains right now are btrfs,
> ceph and f2fs, all of which have people who are paid to work on them!
> I had to do ext4 largely myself.
> 
> Some filesystems have been great.  XFS worked with me as I did that
> filesystem first.  nfs took care of their own code.  Dave Howells has
> done most of the other network filesystems.  Many other people have
> also helped.
> 
> But we can't even talk to each other unless we agree on what words mean.
> And btrfs inventing its own terminology for things which already exist
> in other filesystems is extremely unhelpful.
> 
> We need to remove the temporary hack that is CONFIG_READ_ONLY_THP_FOR_FS.
> That went in with the understanding that filesystems that mattered would
> add large folio support.  When I see someone purporting to represent
> btrfs say "Oh, we're not going to do that", that's a breach of trust.
> 
> When I'm accused of not understanding things from the filesystem
> perspective, that's just a lie.  I have 192 commits in fs/ between 6.6
> and 6.10 versus 160 in mm/ (346 commits in both combined, so 6 commits
> are double-counted because they touch both).  This whole project has
> been filesystem-centric from the beginning.

I'm not talking about the pace of change; I'm talking about basic, professional
communication standards.  Being frustrated is one thing, taking it out on a
developer or a project in a public forum is another.  Thanks,

Josef

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3c2ad5c9990d..b9d159fcbbc5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4135,17 +4135,17 @@  void memmove_extent_buffer(const struct extent_buffer *dst,
 
 #define GANG_LOOKUP_SIZE	16
 static struct extent_buffer *get_next_extent_buffer(
-		const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
+		const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
 {
 	struct extent_buffer *gang[GANG_LOOKUP_SIZE];
 	struct extent_buffer *found = NULL;
-	u64 page_start = page_offset(page);
-	u64 cur = page_start;
+	u64 folio_start = folio_pos(folio);
+	u64 cur = folio_start;
 
-	ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
+	ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
 	lockdep_assert_held(&fs_info->buffer_lock);
 
-	while (cur < page_start + PAGE_SIZE) {
+	while (cur < folio_start + PAGE_SIZE) {
 		int ret;
 		int i;
 
@@ -4157,7 +4157,7 @@  static struct extent_buffer *get_next_extent_buffer(
 			goto out;
 		for (i = 0; i < ret; i++) {
 			/* Already beyond page end */
-			if (gang[i]->start >= page_start + PAGE_SIZE)
+			if (gang[i]->start >= folio_start + PAGE_SIZE)
 				goto out;
 			/* Found one */
 			if (gang[i]->start >= bytenr) {
@@ -4190,7 +4190,7 @@  static int try_release_subpage_extent_buffer(struct page *page)
 		 * with spinlock rather than RCU.
 		 */
 		spin_lock(&fs_info->buffer_lock);
-		eb = get_next_extent_buffer(fs_info, page, cur);
+		eb = get_next_extent_buffer(fs_info, page_folio(page), cur);
 		if (!eb) {
 			/* No more eb in the page range after or at cur */
 			spin_unlock(&fs_info->buffer_lock);