Message ID | 20240822013714.3278193-3-lizetao1@huawei.com
---|---
State | Accepted
Commit | d4aeb5f7a7e67d780e3eaae0b6e7d4e2d31042ee
Series | btrfs: Cleaned up folio->page conversion
On Thu, Aug 22, 2024 at 09:37:02AM +0800, Li Zetao wrote:
>  static struct extent_buffer *get_next_extent_buffer(
> -        const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
> +        const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
>  {
>      struct extent_buffer *gang[GANG_LOOKUP_SIZE];
>      struct extent_buffer *found = NULL;
> -    u64 page_start = page_offset(page);
> -    u64 cur = page_start;
> +    u64 folio_start = folio_pos(folio);
> +    u64 cur = folio_start;
>
> -    ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
> +    ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
>      lockdep_assert_held(&fs_info->buffer_lock);
>
> -    while (cur < page_start + PAGE_SIZE) {
> +    while (cur < folio_start + PAGE_SIZE) {

Presumably we want to support large folios in btrfs at some point?
I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
be a bit of a regression for btrfs if it doesn't have large folio
support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?
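The suggested substitution changes the scan bound from one page to the whole folio. A userspace model of the difference (illustrative only — the real kernel loop finds extent buffers via a radix-tree gang lookup rather than stepping by a fixed block size):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u

/* Counts how many block-sized steps a scan makes before hitting the
 * bound.  With a PAGE_SIZE bound, only the first page of a large
 * folio is covered; with a folio_size()-style bound, all of it is. */
static unsigned int blocks_scanned(uint64_t folio_start, size_t bound,
                                   unsigned int blocksize)
{
    unsigned int n = 0;
    uint64_t cur;

    for (cur = folio_start; cur < folio_start + bound; cur += blocksize)
        n++;
    return n;
}
```

For an order-2 (16K) folio of 4K blocks, a PAGE_SIZE bound scans only a quarter of the folio, which is why the substitution matters once folios can be larger than a page.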
On 2024/8/22 12:35, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 09:37:02AM +0800, Li Zetao wrote:
>>  static struct extent_buffer *get_next_extent_buffer(
>> -        const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
>> +        const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
>>  {
>>      struct extent_buffer *gang[GANG_LOOKUP_SIZE];
>>      struct extent_buffer *found = NULL;
>> -    u64 page_start = page_offset(page);
>> -    u64 cur = page_start;
>> +    u64 folio_start = folio_pos(folio);
>> +    u64 cur = folio_start;
>>
>> -    ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
>> +    ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
>>      lockdep_assert_held(&fs_info->buffer_lock);
>>
>> -    while (cur < page_start + PAGE_SIZE) {
>> +    while (cur < folio_start + PAGE_SIZE) {
>
> Presumably we want to support large folios in btrfs at some point?

Yes, and we're already working towards that direction.

> I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
> be a bit of a regression for btrfs if it doesn't have large folio
> support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?

AFAIK we're only going to support larger folios for larger than
PAGE_SIZE sector sizes so far.  So every folio is still a fixed size
(sector size, >= PAGE_SIZE).

I'm not familiar with transparent huge pages; I thought they were
transparent to the fs.  Or do we need some special handling?  My
uneducated guess is that we will get a larger folio passed to the
readpage callback directly?

It's straightforward enough to read all the contents of a larger
folio; it's no different from subpage handling.

But what happens if some writes happen to that larger folio?  Does the
MM layer detect that and split the folio, or does the fs have to go
through the subpage routine (with an extra structure recording all the
subpage flag bitmaps)?

Thanks,
Qu
On 2024/8/22 12:35, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 09:37:02AM +0800, Li Zetao wrote:
>>  static struct extent_buffer *get_next_extent_buffer(
>> -        const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
>> +        const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
>>  {
>>      struct extent_buffer *gang[GANG_LOOKUP_SIZE];
>>      struct extent_buffer *found = NULL;
>> -    u64 page_start = page_offset(page);
>> -    u64 cur = page_start;
>> +    u64 folio_start = folio_pos(folio);
>> +    u64 cur = folio_start;
>>
>> -    ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
>> +    ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
>>      lockdep_assert_held(&fs_info->buffer_lock);
>>
>> -    while (cur < page_start + PAGE_SIZE) {
>> +    while (cur < folio_start + PAGE_SIZE) {
>
> Presumably we want to support large folios in btrfs at some point?
> I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
> be a bit of a regression for btrfs if it doesn't have large folio
> support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?

Forgot to mention that this function is only called inside the subpage
routine (sectorsize < PAGE_SIZE, and nodesize, aka metadata size, also
< PAGE_SIZE), so PAGE_SIZE is correct here.

Using folio_size() would only waste CPU time, but if you do not feel
safe, we can add an extra ASSERT() to make sure it's only called for
the subpage routine.

Thanks,
Qu
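The precondition stated above — the helper only runs when both the data block size and the metadata block size are smaller than the page size — can be modelled as a small predicate. This is a userspace sketch; the name is illustrative, not a btrfs API:

```c
#include <assert.h>
#include <stdbool.h>

/* Models the "subpage routine" precondition described above: both
 * sectorsize (data blocks) and nodesize (metadata blocks) must be
 * smaller than the page size.  Illustrative only. */
static bool is_subpage_case(unsigned int sectorsize, unsigned int nodesize,
                            unsigned int page_size)
{
    return sectorsize < page_size && nodesize < page_size;
}
```

An extra ASSERT() in the kernel function would express the same invariant, making the PAGE_SIZE bound safe by construction.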
On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> On 2024/8/22 12:35, Matthew Wilcox wrote:
> > > -    while (cur < page_start + PAGE_SIZE) {
> > > +    while (cur < folio_start + PAGE_SIZE) {
> >
> > Presumably we want to support large folios in btrfs at some point?
>
> Yes, and we're already working towards that direction.
>
> > I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and that'll
> > be a bit of a regression for btrfs if it doesn't have large folio
> > support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?
>
> AFAIK we're only going to support larger folios for larger than
> PAGE_SIZE sector sizes so far.

Why do you not want the performance gains from using larger folios?

> So every folio is still a fixed size (sector size, >= PAGE_SIZE).
>
> I'm not familiar with transparent huge pages; I thought they were
> transparent to the fs.
>
> Or do we need some special handling?
> My uneducated guess is that we will get a larger folio passed to the
> readpage callback directly?

Why do you choose to remain uneducated?  It's not like I've been keeping
all of this to myself for the past five years.  I've given dozens of
presentations on it, including plenary sessions at LSFMM.  As a
filesystem developer, you must want to not know about it at this point.

> It's straightforward enough to read all the contents of a larger
> folio; it's no different from subpage handling.
>
> But what happens if some writes happen to that larger folio?
> Does the MM layer detect that and split the folio, or does the fs
> have to go through the subpage routine (with an extra structure
> recording all the subpage flag bitmaps)?

Entirely up to the filesystem.  It would help if btrfs used the same
terminology as the rest of the filesystems instead of inventing its own
"subpage" thing.  As far as I can tell, "subpage" means "fs block size",
but maybe it has a different meaning that I haven't ascertained.

Tracking dirtiness on a per-folio basis does not seem to be good enough.
Various people have workloads that regress in performance if you do
that.  So having some data structure attached to folio->private which
tracks dirtiness on a per-fs-block basis works pretty well.  iomap also
tracks the uptodate bit on a per-fs-block basis, but I'm less convinced
that's necessary.

I have no idea why btrfs thinks it needs to track writeback, ordered,
checked and locked in a bitmap.  Those make no sense to me.  But they
make no sense to me even if you're supporting a 4KiB filesystem on a
machine with a 64KiB PAGE_SIZE, not just in the context of "larger
folios".  Writeback is something the VM tells you to do; why do you
need to tag individual blocks for writeback?
On 2024/8/22 21:37, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
>> On 2024/8/22 12:35, Matthew Wilcox wrote:
>>>> -    while (cur < page_start + PAGE_SIZE) {
>>>> +    while (cur < folio_start + PAGE_SIZE) {
>>>
>>> Presumably we want to support large folios in btrfs at some point?
>>
>> Yes, and we're already working towards that direction.
>>
>>> I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and
>>> that'll be a bit of a regression for btrfs if it doesn't have large
>>> folio support.  So shouldn't we also s/PAGE_SIZE/folio_size(folio)/ ?
>>
>> AFAIK we're only going to support larger folios for larger than
>> PAGE_SIZE sector sizes so far.
>
> Why do you not want the performance gains from using larger folios?
>
>> So every folio is still a fixed size (sector size, >= PAGE_SIZE).
>>
>> I'm not familiar with transparent huge pages; I thought they were
>> transparent to the fs.
>>
>> Or do we need some special handling?
>> My uneducated guess is that we will get a larger folio passed to the
>> readpage callback directly?
>
> Why do you choose to remain uneducated?  It's not like I've been
> keeping all of this to myself for the past five years.  I've given
> dozens of presentations on it, including plenary sessions at LSFMM.
> As a filesystem developer, you must want to not know about it at this
> point.
>
>> It's straightforward enough to read all the contents of a larger
>> folio; it's no different from subpage handling.
>>
>> But what happens if some writes happen to that larger folio?
>> Does the MM layer detect that and split the folio, or does the fs
>> have to go through the subpage routine (with an extra structure
>> recording all the subpage flag bitmaps)?
>
> Entirely up to the filesystem.  It would help if btrfs used the same
> terminology as the rest of the filesystems instead of inventing its
> own "subpage" thing.  As far as I can tell, "subpage" means "fs block
> size", but maybe it has a different meaning that I haven't
> ascertained.

Then tell me the correct terminology to describe an fs block size
smaller than the page size in the first place.

"fs block size" is not good enough; we want a terminology that
describes an "fs block size" smaller than the page size.

> Tracking dirtiness on a per-folio basis does not seem to be good
> enough.  Various people have workloads that regress in performance if
> you do that.  So having some data structure attached to
> folio->private which tracks dirtiness on a per-fs-block basis works
> pretty well.  iomap also tracks the uptodate bit on a per-fs-block
> basis, but I'm less convinced that's necessary.
>
> I have no idea why btrfs thinks it needs to track writeback, ordered,
> checked and locked in a bitmap.  Those make no sense to me.  But they
> make no sense to me even if you're supporting a 4KiB filesystem on a
> machine with a 64KiB PAGE_SIZE, not just in the context of "larger
> folios".  Writeback is something the VM tells you to do; why do you
> need to tag individual blocks for writeback?

Because there are cases where btrfs needs to write back only part of
the folio independently, especially when mixing compressed and
uncompressed writes inside a page, e.g.:

    0     16K           32K      48K    64K
    |//|                |///////|
       4K

In the above case, if we need to write back the above page with a 4K
sector size, the first 4K is not suitable for compression (the result
would still take a full 4K block), while the range [32K, 48K) will be
compressed.

In that case, the [0, 4K) range will be submitted directly for IO,
while [32K, 48K) will be submitted for compression in another
workqueue.  (Otherwise time-consuming compression would delay the
writeback of the remaining pages.)

This means the dirty/writeback flags will change at different times.

Compression is no longer a btrfs-exclusive feature, so this should be
obvious?

Thanks,
Qu
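The split writeback described above can be modelled with two per-block bitmaps for one 64K page of 4K blocks. This is a toy userspace sketch, not btrfs's actual subpage structure; it only shows why dirty and writeback need independent per-block state when sub-ranges are submitted at different times:

```c
#include <assert.h>
#include <stdint.h>

/* Toy per-block state for one 64K page of 4K blocks (16 bits per
 * flag).  Bit n covers byte range [n * 4K, (n + 1) * 4K). */
struct page_state {
    uint16_t dirty;
    uint16_t writeback;
};

static uint16_t block_mask(unsigned int first, unsigned int nblocks)
{
    return (uint16_t)(((1u << nblocks) - 1) << first);
}

static void mark_dirty(struct page_state *s, unsigned int first,
                       unsigned int nblocks)
{
    s->dirty |= block_mask(first, nblocks);
}

/* One sub-range starts writeback (e.g. direct IO for [0, 4K)) while
 * another (e.g. [32K, 48K) queued for compression) stays dirty. */
static void start_writeback(struct page_state *s, unsigned int first,
                            unsigned int nblocks)
{
    uint16_t mask = block_mask(first, nblocks);

    s->dirty &= (uint16_t)~mask;
    s->writeback |= mask;
}
```

A single per-folio dirty bit could not express the intermediate state where the direct-IO range is already under writeback but the compressed range is still dirty.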
On 2024/8/23 07:55, Qu Wenruo wrote:
>
> On 2024/8/22 21:37, Matthew Wilcox wrote:
>> On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
>>> On 2024/8/22 12:35, Matthew Wilcox wrote:
>>>>> -    while (cur < page_start + PAGE_SIZE) {
>>>>> +    while (cur < folio_start + PAGE_SIZE) {
>>>>
>>>> Presumably we want to support large folios in btrfs at some point?
>>>
>>> Yes, and we're already working towards that direction.
>>>
>>>> I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and
>>>> that'll be a bit of a regression for btrfs if it doesn't have large
>>>> folio support.  So shouldn't we also
>>>> s/PAGE_SIZE/folio_size(folio)/ ?
>>>
>>> AFAIK we're only going to support larger folios for larger than
>>> PAGE_SIZE sector sizes so far.
>>
>> Why do you not want the performance gains from using larger folios?
>>
>>> So every folio is still a fixed size (sector size, >= PAGE_SIZE).
>>>
>>> I'm not familiar with transparent huge pages; I thought they were
>>> transparent to the fs.
>>>
>>> Or do we need some special handling?
>>> My uneducated guess is that we will get a larger folio passed to the
>>> readpage callback directly?
>>
>> Why do you choose to remain uneducated?  It's not like I've been
>> keeping all of this to myself for the past five years.  I've given
>> dozens of presentations on it, including plenary sessions at LSFMM.
>> As a filesystem developer, you must want to not know about it at
>> this point.
>>
>>> It's straightforward enough to read all the contents of a larger
>>> folio; it's no different from subpage handling.
>>>
>>> But what happens if some writes happen to that larger folio?
>>> Does the MM layer detect that and split the folio, or does the fs
>>> have to go through the subpage routine (with an extra structure
>>> recording all the subpage flag bitmaps)?
>>
>> Entirely up to the filesystem.  It would help if btrfs used the same
>> terminology as the rest of the filesystems instead of inventing its
>> own "subpage" thing.  As far as I can tell, "subpage" means "fs
>> block size", but maybe it has a different meaning that I haven't
>> ascertained.
>
> Then tell me the correct terminology to describe an fs block size
> smaller than the page size in the first place.
>
> "fs block size" is not good enough; we want a terminology that
> describes an "fs block size" smaller than the page size.
>
>> Tracking dirtiness on a per-folio basis does not seem to be good
>> enough.  Various people have workloads that regress in performance
>> if you do that.  So having some data structure attached to
>> folio->private which tracks dirtiness on a per-fs-block basis works
>> pretty well.  iomap also tracks the uptodate bit on a per-fs-block
>> basis, but I'm less convinced that's necessary.
>>
>> I have no idea why btrfs thinks it needs to track writeback,
>> ordered, checked and locked in a bitmap.  Those make no sense to me.
>> But they make no sense to me even if you're supporting a 4KiB
>> filesystem on a machine with a 64KiB PAGE_SIZE, not just in the
>> context of "larger folios".  Writeback is something the VM tells you
>> to do; why do you need to tag individual blocks for writeback?
>
> Because there are cases where btrfs needs to write back only part of
> the folio independently, especially when mixing compressed and
> uncompressed writes inside a page, e.g.:
>
>     0     16K           32K      48K    64K
>     |//|                |///////|
>        4K
>
> In the above case, if we need to write back the above page with a 4K
> sector size, the first 4K is not suitable for compression (the result
> would still take a full 4K block), while the range [32K, 48K) will be
> compressed.
>
> In that case, the [0, 4K) range will be submitted directly for IO,
> while [32K, 48K) will be submitted for compression in another
> workqueue.  (Otherwise time-consuming compression would delay the
> writeback of the remaining pages.)
>
> This means the dirty/writeback flags will change at different times.

In case you mean using an atomic to track the writeback/lock progress:
it's possible to go that path, but for now it's not space efficient.

For the 16-blocks-per-page case (4K sectorsize, 64K page size), each
atomic takes 4 bytes while a bitmap only takes 2 bytes.

And for the 4K sectorsize, 16K page size case it's even worse, so
btrfs compacts all the bitmaps into a larger one to save more space,
while each atomic still takes 4 bytes.

Thanks,
Qu

> Compression is no longer a btrfs-exclusive feature, so this should be
> obvious?
>
> Thanks,
> Qu
>
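The space comparison above works out as follows. This is a worked-arithmetic sketch; the helper is hypothetical, and the 4-byte atomic figure is simply the size quoted in the mail:

```c
#include <assert.h>

/* Bytes needed to track one flag for nblocks blocks as a bitmap,
 * rounded up to whole bytes.  Compare with one 4-byte atomic per
 * flag, as quoted above. */
static unsigned int bitmap_bytes(unsigned int nblocks)
{
    return (nblocks + 7) / 8;
}

#define ATOMIC_BYTES 4u
```

For 4K sectors in a 64K page (16 blocks) the bitmap needs 2 bytes per flag; for 4K sectors in a 16K page (4 blocks) it needs only 1 byte, while an atomic stays at 4 bytes either way.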
On Thu, Aug 22, 2024 at 01:07:52PM +0100, Matthew Wilcox wrote:
> On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > On 2024/8/22 12:35, Matthew Wilcox wrote:
> > > > -    while (cur < page_start + PAGE_SIZE) {
> > > > +    while (cur < folio_start + PAGE_SIZE) {
> > >
> > > Presumably we want to support large folios in btrfs at some point?
> >
> > Yes, and we're already working towards that direction.
> >
> > > I certainly want to remove CONFIG_READ_ONLY_THP_FOR_FS soon and
> > > that'll be a bit of a regression for btrfs if it doesn't have large
> > > folio support.  So shouldn't we also
> > > s/PAGE_SIZE/folio_size(folio)/ ?
> >
> > AFAIK we're only going to support larger folios for larger than
> > PAGE_SIZE sector sizes so far.
>
> Why do you not want the performance gains from using larger folios?
>
> > So every folio is still a fixed size (sector size, >= PAGE_SIZE).
> >
> > I'm not familiar with transparent huge pages; I thought they were
> > transparent to the fs.
> >
> > Or do we need some special handling?
> > My uneducated guess is that we will get a larger folio passed to the
> > readpage callback directly?
>
> Why do you choose to remain uneducated?  It's not like I've been
> keeping all of this to myself for the past five years.  I've given
> dozens of presentations on it, including plenary sessions at LSFMM.
> As a filesystem developer, you must want to not know about it at this
> point.
>

Or, a more charitable read is that this particular developer has never
been to LSFMM/BPF, Plumbers, or anywhere else you've given a talk.
This stuff is complicated; I don't even know what's going on half the
time, much less a developer who works exclusively on btrfs internals.
There's a lot to know in this space; we can't all know what's going on
in every corner.

As for the topic at hand, I'm moving us in the direction of an iomap
conversion so we can have large folio support regardless of the
underlying sectorsize/metadata block size.  Unfortunately there are a
lot of fundamental changes that need to be made to facilitate that,
and those are going to take some time to test and validate to make
sure we didn't break anything.  In the meantime we're going to be in
this weird limbo state while I tackle the individual problem areas.

My priorities are split between getting this to work and fuse
improvements (to eventually no longer need file systems in the kernel
and avoid all this in general), so progress is going to be spurty, but
I hope to have this work done by the next LSFMM/BPF.

Thanks,
Josef
On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
> On 2024/8/23 07:55, Qu Wenruo wrote:
> > On 2024/8/22 21:37, Matthew Wilcox wrote:
> > > On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > > > But what happens if some writes happen to that larger folio?
> > > > Does the MM layer detect that and split the folio, or does the
> > > > fs have to go through the subpage routine (with an extra
> > > > structure recording all the subpage flag bitmaps)?
> > >
> > > Entirely up to the filesystem.  It would help if btrfs used the
> > > same terminology as the rest of the filesystems instead of
> > > inventing its own "subpage" thing.  As far as I can tell,
> > > "subpage" means "fs block size", but maybe it has a different
> > > meaning that I haven't ascertained.
> >
> > Then tell me the correct terminology to describe an fs block size
> > smaller than the page size in the first place.
> >
> > "fs block size" is not good enough; we want a terminology that
> > describes an "fs block size" smaller than the page size.

Oh dear.  btrfs really has corrupted your brain.  Here's the
terminology used in the rest of Linux:

SECTOR_SIZE.  Fixed at 512 bytes.  This is the unit used to
communicate with the block layer.  It has no real meaning, other than
that Linux doesn't support block devices with 128 and 256 byte sector
sizes (I have used such systems, but not in the last 30 years).

LBA size.  This is the unit that the block layer uses to communicate
with the block device.  Must be at least SECTOR_SIZE.  I/O cannot be
performed in smaller chunks than this.

Physical block size.  This is the unit that the device advertises as
its efficient minimum size.  I/Os smaller than this / not aligned to
this will probably incur a performance penalty as the device will need
to do a read-modify-write cycle.

fs block size.  Known as s_blocksize or i_blocksize.  Must be a
multiple of the LBA size, but may be smaller than the physical block
size.  Files are allocated in multiples of this unit.

PAGE_SIZE.  The unit that memory can be mapped in.  This applies both
to userspace mappings of files and to calls to kmap_local_*().

folio size.  The size that the page cache has decided to manage this
chunk of the file in.  A multiple of PAGE_SIZE.

I've mostly listed these from smallest to largest.  The relationships
that must hold:

    SS <= LBA <= Phys
    LBA <= fsb
    PS <= folio
    fsb <= folio

ocfs2 supports fsb > PAGE_SIZE, but this is a rarity.  Most
filesystems require fsb <= PAGE_SIZE.

Filesystems like UFS also support a fragment size which is less than
the fs block size.  It's kind of like tail packing.  Anyway, that's
internal to the filesystem and not exposed to the VFS.

> > > I have no idea why btrfs thinks it needs to track writeback,
> > > ordered, checked and locked in a bitmap.  Those make no sense to
> > > me.  But they make no sense to me even if you're supporting a
> > > 4KiB filesystem on a machine with a 64KiB PAGE_SIZE, not just in
> > > the context of "larger folios".  Writeback is something the VM
> > > tells you to do; why do you need to tag individual blocks for
> > > writeback?
> >
> > Because there are cases where btrfs needs to write back only part
> > of the folio independently.

iomap manages to do this while tracking only per-block dirty bits.

> > And especially for mixing compressed and uncompressed writes
> > inside a page, e.g.:
> >
> >     0     16K           32K      48K    64K
> >     |//|                |///////|
> >        4K
> >
> > In the above case, if we need to write back the above page with a
> > 4K sector size, the first 4K is not suitable for compression (the
> > result would still take a full 4K block), while the range
> > [32K, 48K) will be compressed.
> >
> > In that case, the [0, 4K) range will be submitted directly for IO,
> > while [32K, 48K) will be submitted for compression in another
> > workqueue.  (Otherwise time-consuming compression would delay the
> > writeback of the remaining pages.)
> >
> > This means the dirty/writeback flags will change at different
> > times.
>
> In case you mean using an atomic to track the writeback/lock
> progress: it's possible to go that path, but for now it's not space
> efficient.
>
> For the 16-blocks-per-page case (4K sectorsize, 64K page size), each
> atomic takes 4 bytes while a bitmap only takes 2 bytes.
>
> And for the 4K sectorsize, 16K page size case it's even worse, so
> btrfs compacts all the bitmaps into a larger one to save more space,
> while each atomic still takes 4 bytes.

Sure, but it doesn't scale up well.  And it's kind of irrelevant
whether you occupy 2 or 4 bytes at the low end because you're
allocating all of this through slab, so you get rounded to 8 bytes
anyway.  iomap_folio_state currently occupies 12 bytes + 2 bits per
block.  So for a 16-block folio (4k in 64k), that's 32 bits for a
total of 16 bytes.  For a 2MB folio on a 4kB block size fs, that's
1024 bits for a total of 140 bytes (rounded to 192 bytes by slab).

Hm, it might be worth adding a kmalloc-160; we'd get 25 objects per
4KiB page instead of 21 192-byte objects ...
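The footprint arithmetic above can be checked directly. The 12-byte header and the kmalloc bucket sizes are the figures quoted in the mail itself, not derived from current kernel source:

```c
#include <assert.h>
#include <stddef.h>

/* Footprint of an iomap_folio_state-like structure: a 12-byte header
 * plus 2 bits (uptodate + dirty) per block, rounded up to bytes. */
static unsigned int ifs_bytes(unsigned int nblocks)
{
    return 12 + (2 * nblocks + 7) / 8;
}

/* Round an allocation up to the next common kmalloc bucket size. */
static unsigned int slab_round(unsigned int n)
{
    static const unsigned int buckets[] = {
        8, 16, 32, 64, 96, 128, 192, 256,
    };
    size_t i;

    for (i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++)
        if (n <= buckets[i])
            return buckets[i];
    return n;
}
```

A 16-block folio fits the structure in 16 bytes exactly; a 512-block (2MB, 4kB-block) folio needs 140 bytes, which slab rounds to 192, hence the musing about a kmalloc-160 bucket.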
On 2024/8/24 01:08, Matthew Wilcox wrote:
> On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
>> On 2024/8/23 07:55, Qu Wenruo wrote:
>>> On 2024/8/22 21:37, Matthew Wilcox wrote:
>>>> On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
>>>>> But what happens if some writes happen to that larger folio?
>>>>> Does the MM layer detect that and split the folio, or does the fs
>>>>> have to go through the subpage routine (with an extra structure
>>>>> recording all the subpage flag bitmaps)?
>>>>
>>>> Entirely up to the filesystem.  It would help if btrfs used the
>>>> same terminology as the rest of the filesystems instead of
>>>> inventing its own "subpage" thing.  As far as I can tell,
>>>> "subpage" means "fs block size", but maybe it has a different
>>>> meaning that I haven't ascertained.
>>>
>>> Then tell me the correct terminology to describe an fs block size
>>> smaller than the page size in the first place.
>>>
>>> "fs block size" is not good enough; we want a terminology that
>>> describes an "fs block size" smaller than the page size.
>
> Oh dear.  btrfs really has corrupted your brain.  Here's the
> terminology used in the rest of Linux:
>
> SECTOR_SIZE.  Fixed at 512 bytes.  This is the unit used to
> communicate with the block layer.  It has no real meaning, other than
> that Linux doesn't support block devices with 128 and 256 byte sector
> sizes (I have used such systems, but not in the last 30 years).
>
> LBA size.  This is the unit that the block layer uses to communicate
> with the block device.  Must be at least SECTOR_SIZE.  I/O cannot be
> performed in smaller chunks than this.
>
> Physical block size.  This is the unit that the device advertises as
> its efficient minimum size.  I/Os smaller than this / not aligned to
> this will probably incur a performance penalty as the device will
> need to do a read-modify-write cycle.
>
> fs block size.  Known as s_blocksize or i_blocksize.  Must be a
> multiple of the LBA size, but may be smaller than the physical block
> size.  Files are allocated in multiples of this unit.
>
> PAGE_SIZE.  The unit that memory can be mapped in.  This applies both
> to userspace mappings of files and to calls to kmap_local_*().
>
> folio size.  The size that the page cache has decided to manage this
> chunk of the file in.  A multiple of PAGE_SIZE.
>
> I've mostly listed these from smallest to largest.  The relationships
> that must hold:
>
>     SS <= LBA <= Phys
>     LBA <= fsb
>     PS <= folio
>     fsb <= folio
>
> ocfs2 supports fsb > PAGE_SIZE, but this is a rarity.  Most
> filesystems require fsb <= PAGE_SIZE.
>
> Filesystems like UFS also support a fragment size which is less than
> the fs block size.  It's kind of like tail packing.  Anyway, that's
> internal to the filesystem and not exposed to the VFS.

I know all of these things; the terminology I need is a short one to
describe the fsb < PAGE_SIZE case.  So far, in the fs realm, "subpage"
(sub-page-sized block size) is the shortest and simplest one.

Sure, you may confuse it with a "subpage" range inside a page, but
that's because you're mostly working in the MM layer.

So please give me a better alternative to describe exactly the
"fsb < PAGE_SIZE" case.  Otherwise it's just complaining without any
constructive advice.

>>>> I have no idea why btrfs thinks it needs to track writeback,
>>>> ordered, checked and locked in a bitmap.  Those make no sense to
>>>> me.  But they make no sense to me even if you're supporting a 4KiB
>>>> filesystem on a machine with a 64KiB PAGE_SIZE, not just in the
>>>> context of "larger folios".  Writeback is something the VM tells
>>>> you to do; why do you need to tag individual blocks for writeback?
>>>
>>> Because there are cases where btrfs needs to write back only part
>>> of the folio independently.
>
> iomap manages to do this while tracking only per-block dirty bits.

Well, does iomap support asynchronous compression?

This proves Josef's point: different people have different focuses.
Please do not assume everyone knows the realm you're working in, nor
assume there will always be a one-size-fits-all solution.

>>> And especially for mixing compressed and uncompressed writes inside
>>> a page, e.g.:
>>>
>>>     0     16K           32K      48K    64K
>>>     |//|                |///////|
>>>        4K
>>>
>>> In the above case, if we need to write back the above page with a
>>> 4K sector size, the first 4K is not suitable for compression (the
>>> result would still take a full 4K block), while the range
>>> [32K, 48K) will be compressed.
>>>
>>> In that case, the [0, 4K) range will be submitted directly for IO,
>>> while [32K, 48K) will be submitted for compression in another
>>> workqueue.  (Otherwise time-consuming compression would delay the
>>> writeback of the remaining pages.)
>>>
>>> This means the dirty/writeback flags will change at different
>>> times.
>>
>> In case you mean using an atomic to track the writeback/lock
>> progress: it's possible to go that path, but for now it's not space
>> efficient.
>>
>> For the 16-blocks-per-page case (4K sectorsize, 64K page size), each
>> atomic takes 4 bytes while a bitmap only takes 2 bytes.
>>
>> And for the 4K sectorsize, 16K page size case it's even worse, so
>> btrfs compacts all the bitmaps into a larger one to save more space,
>> while each atomic still takes 4 bytes.
>
> Sure, but it doesn't scale up well.  And it's kind of irrelevant
> whether you occupy 2 or 4 bytes at the low end because you're
> allocating all of this through slab, so you get rounded to 8 bytes
> anyway.  iomap_folio_state currently occupies 12 bytes + 2 bits per
> block.  So for a 16-block folio (4k in 64k), that's 32 bits for a
> total of 16 bytes.  For a 2MB folio on a 4kB block size fs, that's
> 1024 bits for a total of 140 bytes (rounded to 192 bytes by slab).

Yes, it's not scalable to all folio sizes, but the turning point is at
32 blocks (where the bitmap itself reaches 4 bytes), which means a
128K folio on a 4K page sized system.

Since the folio code already considers order > 3 as costly, I'm
totally fine sacrificing the higher-order ones, not the other way
around.  Although the real determining factor is the real-world
distribution of folio sizes.

But for now, since btrfs only supports 4K block size with 64K/16K page
sizes, it's still a win for us.

Another point in favour of the bitmap is that it helps a lot (at least
for me) with debugging, but that can always be hidden behind some
debug flag.

I'm not denying the possibility of fully migrating to the iomap way,
but that will need a lot of extra work first, like cleaning up the
cow_fixup thing to reduce the extra page flag tracking.  (That always
causes a lot of discussion but seldom leads to patches.)

Thanks,
Qu

> Hm, it might be worth adding a kmalloc-160; we'd get 25 objects per
> 4KiB page instead of 21 192-byte objects ...
>
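The size relationships from the quoted terminology list (SS <= LBA <= Phys, LBA <= fsb, PS <= folio, fsb <= folio) can be written down as a quick sanity check. A userspace sketch with a hypothetical validator:

```c
#include <assert.h>
#include <stdbool.h>

/* Checks the invariants from the terminology list quoted above.
 * ss = SECTOR_SIZE, lba = LBA size, phys = physical block size,
 * fsb = fs block size, ps = PAGE_SIZE, folio = folio size. */
static bool sizes_valid(unsigned int ss, unsigned int lba,
                        unsigned int phys, unsigned int fsb,
                        unsigned int ps, unsigned int folio)
{
    return ss <= lba && lba <= phys &&
           lba <= fsb &&
           ps <= folio &&
           fsb <= folio;
}
```

Note that fsb <= PAGE_SIZE is deliberately not asserted: as the quoted mail says, most filesystems require it, but ocfs2 is an exception.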
On Fri, Aug 23, 2024 at 04:38:27PM +0100, Matthew Wilcox wrote:
> On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
> > On 2024/8/23 07:55, Qu Wenruo wrote:
> > > On 2024/8/22 21:37, Matthew Wilcox wrote:
> > > > On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > > > > But what happens if some writes happen to that larger folio?
> > > > > Does the MM layer detect that and split the folio, or does
> > > > > the fs have to go through the subpage routine (with an extra
> > > > > structure recording all the subpage flag bitmaps)?
> > > >
> > > > Entirely up to the filesystem.  It would help if btrfs used the
> > > > same terminology as the rest of the filesystems instead of
> > > > inventing its own "subpage" thing.  As far as I can tell,
> > > > "subpage" means "fs block size", but maybe it has a different
> > > > meaning that I haven't ascertained.
> > >
> > > Then tell me the correct terminology to describe an fs block size
> > > smaller than the page size in the first place.
> > >
> > > "fs block size" is not good enough; we want a terminology that
> > > describes an "fs block size" smaller than the page size.
>
> Oh dear.  btrfs really has corrupted your brain.  Here's the
> terminology used in the rest of Linux:

This commentary isn't necessary; it gives the impression that we're
wrong/stupid/etc.  We're all in this community together; having
public, negative commentary like this is unnecessary, and frankly it
contributes to my growing desire/priority to shift most of my
development outside of the kernel community.  And if somebody with my
experience and history in this community is becoming more and more
disillusioned with this work and making serious efforts to extricate
themselves from the project, then what does that say about our ability
to bring in new developers?

Thanks,
Josef
On Mon, Aug 26, 2024 at 10:13:01AM -0400, Josef Bacik wrote:
> On Fri, Aug 23, 2024 at 04:38:27PM +0100, Matthew Wilcox wrote:
> > On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
> > > On 2024/8/23 07:55, Qu Wenruo wrote:
> > > > On 2024/8/22 21:37, Matthew Wilcox wrote:
> > > > > On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > > > > > But what happens if some writes happen to that larger
> > > > > > folio?  Does the MM layer detect that and split the folio,
> > > > > > or does the fs have to go through the subpage routine (with
> > > > > > an extra structure recording all the subpage flag bitmaps)?
> > > > >
> > > > > Entirely up to the filesystem.  It would help if btrfs used
> > > > > the same terminology as the rest of the filesystems instead
> > > > > of inventing its own "subpage" thing.  As far as I can tell,
> > > > > "subpage" means "fs block size", but maybe it has a different
> > > > > meaning that I haven't ascertained.
> > > >
> > > > Then tell me the correct terminology to describe an fs block
> > > > size smaller than the page size in the first place.
> > > >
> > > > "fs block size" is not good enough; we want a terminology that
> > > > describes an "fs block size" smaller than the page size.
> >
> > Oh dear.  btrfs really has corrupted your brain.  Here's the
> > terminology used in the rest of Linux:
>
> This commentary isn't necessary; it gives the impression that we're
> wrong/stupid/etc.  We're all in this community together; having
> public, negative commentary like this is unnecessary, and frankly it
> contributes to my growing desire/priority to shift most of my
> development outside of the kernel community.  And if somebody with my
> experience and history in this community is becoming more and more
> disillusioned with this work and making serious efforts to extricate
> themselves from the project, then what does that say about our
> ability to bring in new developers?

You know what?  I'm disillusioned too.

It's been over two and a half years since folios were added (v5.16 was
the first release with folios).  I knew it would be a long project (I
was anticipating five years).  I was expecting to have to slog through
all the old unmaintained crap filesystems doing minimal conversions.
What I wasn't expecting was for all the allegedly maintained
filesystems to sit on their fucking hands and ignore it as long as
possible.

The biggest pains right now are btrfs, ceph and f2fs, all of which
have people who are paid to work on them!  I had to do ext4 largely
myself.

Some filesystems have been great.  XFS worked with me as I did that
filesystem first.  nfs took care of their own code.  Dave Howells has
done most of the other network filesystems.  Many other people have
also helped.

But we can't even talk to each other unless we agree on what words
mean.  And btrfs inventing its own terminology for things which
already exist in other filesystems is extremely unhelpful.

We need to remove the temporary hack that is
CONFIG_READ_ONLY_THP_FOR_FS.  That went in with the understanding that
filesystems that mattered would add large folio support.  When I see
someone purporting to represent btrfs say "Oh, we're not going to do
that", that's a breach of trust.

When I'm accused of not understanding things from the filesystem
perspective, that's just a lie.  I have 192 commits in fs/ between 6.6
and 6.10 versus 160 in mm/ (346 commits in both combined, so 6 commits
are double-counted because they touch both).  This whole project has
been filesystem-centric from the beginning.
On Mon, Aug 26, 2024 at 05:56:01PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 26, 2024 at 10:13:01AM -0400, Josef Bacik wrote:
> > On Fri, Aug 23, 2024 at 04:38:27PM +0100, Matthew Wilcox wrote:
> > > On Fri, Aug 23, 2024 at 11:43:41AM +0930, Qu Wenruo wrote:
> > > > On 2024/8/23 07:55, Qu Wenruo wrote:
> > > > > On 2024/8/22 21:37, Matthew Wilcox wrote:
> > > > > > On Thu, Aug 22, 2024 at 08:28:09PM +0930, Qu Wenruo wrote:
> > > > > > > But what will happen if some writes happened to that larger folio?
> > > > > > > Do MM layer detects that and split the folios? Or the fs has to go the
> > > > > > > subpage routine (with an extra structure recording all the subpage flags
> > > > > > > bitmap)?
> > > > > >
> > > > > > Entirely up to the filesystem. It would help if btrfs used the same
> > > > > > terminology as the rest of the filesystems instead of inventing its own
> > > > > > "subpage" thing. As far as I can tell, "subpage" means "fs block size",
> > > > > > but maybe it has a different meaning that I haven't ascertained.
> > > > >
> > > > > Then tell me the correct terminology to describe fs block size smaller
> > > > > than page size in the first place.
> > > > >
> > > > > "fs block size" is not good enough, we want a terminology to describe
> > > > > "fs block size" smaller than page size.
> > >
> > > Oh dear. btrfs really has corrupted your brain. Here's the terminology
> > > used in the rest of Linux:
> >
> > This isn't necessary commentary, this gives the impression that we're
> > wrong/stupid/etc. We're all in this community together, having public,
> > negative commentary like this is unnecessary, and frankly contributes
> > to my growing desire/priorities to shift most of my development outside
> > of the kernel community. And if somebody with my experience and history
> > in this community is becoming more and more disillusioned with this
> > work and making serious efforts to extricate themselves from the
> > project, then what does that say about our ability to bring in new
> > developers? Thanks,
>
> You know what? I'm disillusioned too. It's been over two and a half
> years since folios were added (v5.16 was the first release with folios).
> I knew it would be a long project (I was anticipating five years).
> I was expecting to have to slog through all the old unmaintained crap
> filesystems doing minimal conversions. What I wasn't expecting was for
> all the allegedly maintained filesystems to sit on their fucking hands
> and ignore it as long as possible. The biggest pains right now are
> btrfs, ceph and f2fs, all of which have people who are paid to work on
> them! I had to do ext4 largely myself.
>
> Some filesystems have been great. XFS worked with me as I did that
> filesystem first. nfs took care of their own code. Dave Howells has
> done most of the other network filesystems. Many other people have
> also helped.
>
> But we can't even talk to each other unless we agree on what words mean.
> And btrfs inventing its own terminology for things which already exist
> in other filesystems is extremely unhelpful.
>
> We need to remove the temporary hack that is CONFIG_READ_ONLY_THP_FOR_FS.
> That went in with the understanding that filesystems that mattered would
> add large folio support. When I see someone purporting to represent
> btrfs say "Oh, we're not going to do that", that's a breach of trust.
>
> When I'm accused of not understanding things from the filesystem
> perspective, that's just a lie. I have 192 commits in fs/ between 6.6
> and 6.10 versus 160 in mm/ (346 commits in both combined, so 6 commits
> are double-counted because they touch both). This whole project has
> been filesystem-centric from the beginning.

I'm not talking about the pace of change, I'm talking about basic,
professional communication standards. Being frustrated is one thing,
taking it out on a developer or a project in a public forum is another.

Thanks,

Josef
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3c2ad5c9990d..b9d159fcbbc5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4135,17 +4135,17 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
 
 #define GANG_LOOKUP_SIZE 16
 static struct extent_buffer *get_next_extent_buffer(
-		const struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
+		const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
 {
 	struct extent_buffer *gang[GANG_LOOKUP_SIZE];
 	struct extent_buffer *found = NULL;
-	u64 page_start = page_offset(page);
-	u64 cur = page_start;
+	u64 folio_start = folio_pos(folio);
+	u64 cur = folio_start;
 
-	ASSERT(in_range(bytenr, page_start, PAGE_SIZE));
+	ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
 	lockdep_assert_held(&fs_info->buffer_lock);
 
-	while (cur < page_start + PAGE_SIZE) {
+	while (cur < folio_start + PAGE_SIZE) {
 		int ret;
 		int i;
 
@@ -4157,7 +4157,7 @@ static struct extent_buffer *get_next_extent_buffer(
 			goto out;
 		for (i = 0; i < ret; i++) {
 			/* Already beyond page end */
-			if (gang[i]->start >= page_start + PAGE_SIZE)
+			if (gang[i]->start >= folio_start + PAGE_SIZE)
 				goto out;
 			/* Found one */
 			if (gang[i]->start >= bytenr) {
@@ -4190,7 +4190,7 @@ static int try_release_subpage_extent_buffer(struct page *page)
 	 * with spinlock rather than RCU.
 	 */
 	spin_lock(&fs_info->buffer_lock);
-	eb = get_next_extent_buffer(fs_info, page, cur);
+	eb = get_next_extent_buffer(fs_info, page_folio(page), cur);
 	if (!eb) {
 		/* No more eb in the page range after or at cur */
 		spin_unlock(&fs_info->buffer_lock);
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversions between page
and folio. Use folio_pos() instead of page_offset(), which is more
consistent with folio usage.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
---
 fs/btrfs/extent_io.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)