Message ID: 20230915213254.2724586-1-mcgrof@kernel.org
Series: bdev: LBS devices support to coexist with buffer-heads
On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> for many Linux distributions since it also means disabling support for most
> filesystems other than btrfs and XFS. So we either support larger order folios
> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> 2023 it was decided we would not support larger order folios on buffer-heads,

Um, I didn't agree to that.  If block size is equal to folio size, there
are no problems supporting one buffer head per folio.  In fact, we could
probably go up to 8 buffer heads per folio without any trouble (but I'm
not signing up to do that work).
On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > for many Linux distributions since it also means disabling support for most
> > filesystems other than btrfs and XFS. So we either support larger order folios
> > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > 2023 it was decided we would not support larger order folios on buffer-heads,
>
> Um, I didn't agree to that.  If block size is equal to folio size, there
> are no problems supporting one buffer head per folio.  In fact, we could
> probably go up to 8 buffer heads per folio without any trouble (but I'm
> not signing up to do that work).

Patches welcomed for buffer-head larger order folio support :)

But in the meantime...

  Luis
On 9/15/23 23:51, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
>> for many Linux distributions since it also means disabling support for most
>> filesystems other than btrfs and XFS. So we either support larger order folios
>> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
>> 2023 it was decided we would not support larger order folios on buffer-heads,
>
> Um, I didn't agree to that.  If block size is equal to folio size, there
> are no problems supporting one buffer head per folio.  In fact, we could
> probably go up to 8 buffer heads per folio without any trouble (but I'm
> not signing up to do that work).
>
Entirely correct. I have a patchset ready for doing just that (ie having
one buffer head per folio); hope I'll find some time next week to get it
cleaned up and posted.

Cheers,

Hannes
On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> Christoph added CONFIG_BUFFER_HEAD on v6.1 enabling a world where we can
> live without buffer-heads. When we opt into that world we end up also
> using the address space operations of the block device cache using
> iomap. Since iomap supports higher order folios it means then that block
> devices which do use the iomap aops can end up having a logical block
> size or physical block size greater than PAGE_SIZE. We refer to these as
> LBS devices. This in turn allows filesystems which support bs > 4k to be
> enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> devices then to take advantage of the recently posted work today to enable
> LBS support for filesystems [0].

Why do we need LBS devices to support bs > ps in XFS?

As long as the filesystem block size is >= logical sector size of the
device, XFS just doesn't care what the "block size" of the underlying
block device is. i.e. XFS will allow minimum IO sizes of the logical
sector size of the device for select metadata and sub-block direct IO,
but otherwise all other IO will be aligned to filesystem block size and
so the underlying device block sizes are completely irrelevant...

> To experiment with larger LBA formats you can also use kdevops and enable
> CONFIG_QEMU_ENABLE_EXTRA_DRIVE_LARGEIO. That enables a ton of drives with
> logical and physical block sizes >= 4k up to a desirable max target for
> experimentation. Since filesystems today only support up to 32k sector sizes,
> in practice you may only want to experiment up to 32k physical / logical.
>
> Support for 64k sector sizes requires an XFS format change, which is something
> Daniel Gomez has experimental patches for, in case folks are interested in
> messing with.

Please don't do this without first talking to the upstream XFS
developers about the intended use cases and design of the new format.
Especially when the problem involves requiring a whole new journal
header and indexing format to be implemented...

As a general rule, nobody should *ever* be writing patches to change
the on-disk format of -any- filesystem without first engaging the
people who maintain that filesystem. Architecture and design come
first, not implementation. The last thing we want is to have someone
spend weeks or months on something that takes the experts half a
minute to NACK because it is so obviously flawed....

> Patch 6 could probably be squashed with patch 5, but I wanted to be
> explicit about this, as this should be decided with the community.
>
> There might be a better way to do this than to deal with the switching
> of the aops dynamically, ideas welcomed!

Is it even safe to switch aops dynamically? We know there are
inherent race conditions in doing this w.r.t. mmap and page faults,
as the write fault part of the processing is directly dependent
on the page being correctly initialised during the initial
population of the page data (the "read fault" side of the write
fault).

Hence it's not generally considered safe to change aops from one
mechanism to another dynamically. Block devices can be mmap()d, but
I don't see anything in this patch set that ensures there are no
other users of the block device when the swaps are done. What am I
missing?

Cheers,

Dave.
On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > devices then to take advantage of the recently posted work today to enable
> > LBS support for filesystems [0].
>
> Why do we need LBS devices to support bs > ps in XFS?

It's the other way round -- we need the support in the page cache to
reject sub-block-size folios (which is in the other patches) before we
can sensibly talk about enabling any filesystems on top of LBS devices.

Even XFS, or for that matter ext2 which support 16k block sizes on
CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.

[snipping the parts I agree with; this should not be the first you're
hearing about a format change to XFS]

> > There might be a better way to do this than to deal with the switching
> > of the aops dynamically, ideas welcomed!
>
> Is it even safe to switch aops dynamically? We know there are
> inherent race conditions in doing this w.r.t. mmap and page faults,
> as the write fault part of the processing is directly dependent
> on the page being correctly initialised during the initial
> population of the page data (the "read fault" side of the write
> fault).
>
> Hence it's not generally considered safe to change aops from one
> mechanism to another dynamically. Block devices can be mmap()d, but
> I don't see anything in this patch set that ensures there are no
> other users of the block device when the swaps are done. What am I
> missing?

We need to evict all pages from the page cache before switching aops to
prevent misinterpretation of folio->private.  If switching aops is even
the right thing to do.  I don't see the problem with allowing buffer heads
on block devices, but I haven't been involved with the discussion here.
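As a rough sketch of the evict-then-switch ordering described above (the
helper name here is made up rather than taken from the posted series, and
as Dave points out in the follow-up, emptying the cache alone does not
close the race with new page faults):

	static void bdev_swap_aops(struct block_device *bdev,
				   const struct address_space_operations *aops)
	{
		struct address_space *mapping = bdev->bd_inode->i_mapping;

		/* Hold off new page cache instantiation from read/write paths. */
		filemap_invalidate_lock(mapping);

		/*
		 * Empty the page cache (clean, unmapped pages only; dirty and
		 * mapped pages are the hard part Dave raises) so no folio still
		 * carries ->private state interpreted by the old aops.
		 */
		invalidate_bdev(bdev);

		mapping->a_ops = aops;
		filemap_invalidate_unlock(mapping);
	}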
On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > devices then to take advantage of the recently posted work today to enable
> > > LBS support for filesystems [0].
> >
> > Why do we need LBS devices to support bs > ps in XFS?
>
> It's the other way round -- we need the support in the page cache to
> reject sub-block-size folios (which is in the other patches) before we
> can sensibly talk about enabling any filesystems on top of LBS devices.
>
> Even XFS, or for that matter ext2 which support 16k block sizes on
> CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.

Well, yes, I know that. But the statement above implies that we
can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
systems. If it's meant to mean the exact opposite, then it is
extremely poorly worded....

> > > There might be a better way to do this than to deal with the switching
> > > of the aops dynamically, ideas welcomed!
> >
> > Is it even safe to switch aops dynamically? We know there are
> > inherent race conditions in doing this w.r.t. mmap and page faults,
> > as the write fault part of the processing is directly dependent
> > on the page being correctly initialised during the initial
> > population of the page data (the "read fault" side of the write
> > fault).
> >
> > Hence it's not generally considered safe to change aops from one
> > mechanism to another dynamically. Block devices can be mmap()d, but
> > I don't see anything in this patch set that ensures there are no
> > other users of the block device when the swaps are done. What am I
> > missing?
>
> We need to evict all pages from the page cache before switching aops to
> prevent misinterpretation of folio->private.

Yes, but if the device is mapped, even after an invalidation, we can
still race with a new fault instantiating a page whilst the aops are
being swapped, right? That was the problem that sunk dynamic
swapping of the aops when turning DAX on and off on an inode, right?

> If switching aops is even
> the right thing to do. I don't see the problem with allowing buffer heads
> on block devices, but I haven't been involved with the discussion here.

iomap supports bufferheads as a transitional thing (e.g. for gfs2).
Hence I suspect that a better solution is to always use iomap and
the same aops, but just switch from iomap page state to buffer heads
in the bdev mapping interface via a synchronised invalidation +
setting/clearing IOMAP_F_BUFFER_HEAD in all new mapping requests...

Cheers,

Dave.
On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > > devices then to take advantage of the recently posted work today to enable
> > > > LBS support for filesystems [0].
> > >
> > > Why do we need LBS devices to support bs > ps in XFS?
> >
> > It's the other way round -- we need the support in the page cache to
> > reject sub-block-size folios (which is in the other patches) before we
> > can sensibly talk about enabling any filesystems on top of LBS devices.
> >
> > Even XFS, or for that matter ext2 which support 16k block sizes on
> > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
>
> Well, yes, I know that. But the statement above implies that we
> can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> systems. If it's meant to mean the exact opposite, then it is
> extremely poorly worded....

Let's ignore the above statement for a second just to clarify the goal here.
The patches posted by Pankaj enable bs > ps even on 4k sector drives.

This patch series by definition is suggesting that an LBS device is one
where the minimum sector size is > ps, in practice I'm talking about
devices where the logical block size is > 4k, or where the physical
block size can be > 4k. There are two situations in the NVMe world where
this can happen. One is where the LBA format used is > 4k, the other is
where the npwg & awupf | nawupf is > 4k. The first makes the logical
block size > 4k, the latter allows the physical block size to be > 4k.
The first *needs* an aops which groks > ps on the block device cache.
The second lets you remain backward compatible with 4k sector size, but
if you want to use a larger sector size you can too, but that also
requires a block device cache which groks > ps. Devices which use > ps
for the logical block size or the physical block size are what I am
suggesting we should call LBS devices. I'm happy for us to use other
names but it just seemed like a natural progression to me.

> > > > There might be a better way to do this than to deal with the switching
> > > > of the aops dynamically, ideas welcomed!
> > >
> > > Is it even safe to switch aops dynamically? We know there are
> > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > as the write fault part of the processing is directly dependent
> > > on the page being correctly initialised during the initial
> > > population of the page data (the "read fault" side of the write
> > > fault).
> > >
> > > Hence it's not generally considered safe to change aops from one
> > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > I don't see anything in this patch set that ensures there are no
> > > other users of the block device when the swaps are done. What am I
> > > missing?
> >
> > We need to evict all pages from the page cache before switching aops to
> > prevent misinterpretation of folio->private.
>
> Yes, but if the device is mapped, even after an invalidation, we can
> still race with a new fault instantiating a page whilst the aops are
> being swapped, right? That was the problem that sunk dynamic
> swapping of the aops when turning DAX on and off on an inode, right?

I was not aware of that, thanks!

> > If switching aops is even
> > the right thing to do. I don't see the problem with allowing buffer heads
> > on block devices, but I haven't been involved with the discussion here.
>
> iomap supports bufferheads as a transitional thing (e.g. for gfs2).

But not for the block device cache.

> Hence I suspect that a better solution is to always use iomap and
> the same aops, but just switch from iomap page state to buffer heads
> in the bdev mapping

Not sure what this means, any chance I can trouble you to clarify a bit
more?

> interface via a synchronised invalidation +
> setting/clearing IOMAP_F_BUFFER_HEAD in all new mapping requests...

I'm happy we use whatever makes more sense and is safer, just keep in
mind that we can't default to the iomap aops, otherwise buffer-head
based filesystems won't work. I tried that. The first aops which the
block device cache needs, if we want co-existence, is buffer-heads.

If buffer-heads get large folio support then this is a non-issue as far
as I can tell, but only because XFS does not use buffer-heads for
metadata. If and when we do have a filesystem which does use
buffer-heads for metadata to support LBS, it could be a problem I think.

  Luis
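For anyone following along, the logical vs. physical block size split
Luis describes is visible from userspace; a small stand-alone example
(the device path given to it is only illustrative):

	/*
	 * Print a block device's logical and physical block sizes as
	 * reported by the kernel, e.g. ./blksz /dev/nvme0n1
	 */
	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(int argc, char **argv)
	{
		int lbs = 0;
		unsigned int pbs = 0;
		int fd;

		if (argc < 2) {
			fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
			return 1;
		}

		fd = open(argv[1], O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		ioctl(fd, BLKSSZGET, &lbs);	/* logical block (sector) size */
		ioctl(fd, BLKPBSZGET, &pbs);	/* physical block size */
		printf("logical %d bytes, physical %u bytes\n", lbs, pbs);
		close(fd);
		return 0;
	}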
On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > > > devices then to take advantage of the recently posted work today to enable
> > > > > LBS support for filesystems [0].
> > > >
> > > > Why do we need LBS devices to support bs > ps in XFS?
> > >
> > > It's the other way round -- we need the support in the page cache to
> > > reject sub-block-size folios (which is in the other patches) before we
> > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > >
> > > Even XFS, or for that matter ext2 which support 16k block sizes on
> > > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> >
> > Well, yes, I know that. But the statement above implies that we
> > can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> > systems. If it's meant to mean the exact opposite, then it is
> > extremely poorly worded....
>
> Let's ignore the above statement for a second just to clarify the goal here.
> The patches posted by Pankaj enable bs > ps even on 4k sector drives.

Yes. They also enable XFS to support bs > ps on devices up to 32kB
sector sizes, too. All the sector size does is define the minimum
filesystem block size that can be supported by the filesystem on
that device and apart from that we just don't care what the sector
size on the underlying device is.

> This patch series by definition is suggesting that an LBS device is one
> where the minimum sector size is > ps, in practice I'm talking about
> devices where the logical block size is > 4k, or where the physical
> block size can be > 4k.

Which XFS with bs > ps just doesn't care about. As long as the
logical sector size is a power of 2 between 512 bytes and 32kB, it
will just work.

> There are two situations in the NVMe world where
> this can happen. One is where the LBA format used is > 4k, the other is
> where the npwg & awupf | nawupf is > 4k. The first makes the logical

FWIW, I have no idea what these acronyms actually mean....

> block size > 4k, the latter allows the physical block size to be > 4k.
> The first *needs* an aops which groks > ps on the block device cache.
> The second lets you remain backward compatible with 4k sector size, but
> if you want to use a larger sector size you can too, but that also
> requires a block device cache which groks > ps. Devices which use > ps
> for the logical block size or the physical block size are what I am
> suggesting we should call LBS devices.

Sure. LBS means sector size > page size for the block device. That
much has always been clear.

But telling me that again doesn't explain what LBS support has to do
with the filesystem implementation. mkfs.xfs doesn't require LBS
support to make a new XFS filesystem with a 32kB sector size and
64kB filesystem block size. For the mounted filesystem that
supports bs > ps, it also doesn't care if the device sector size
is smaller than what mkfs told it to use. e.g. this is how we run 4kB
sector size filesystem testing on 512 byte sector devices today....

What I'm asking is why LBS support even mentions filesystems? It's
not necessary for filesystems to support bs > ps for LBS to be
implemented, and it's not necessary for LBS to be supported for
filesystems to implement bs > ps. Conflating them seems a recipe
for confusion....

> > > > There might be a better way to do this than to deal with the switching
> > > > of the aops dynamically, ideas welcomed!
> > >
> > > Is it even safe to switch aops dynamically? We know there are
> > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > as the write fault part of the processing is directly dependent
> > > on the page being correctly initialised during the initial
> > > population of the page data (the "read fault" side of the write
> > > fault).
> > >
> > > Hence it's not generally considered safe to change aops from one
> > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > I don't see anything in this patch set that ensures there are no
> > > other users of the block device when the swaps are done. What am I
> > > missing?
> >
> > We need to evict all pages from the page cache before switching aops to
> > prevent misinterpretation of folio->private.
>
> Yes, but if the device is mapped, even after an invalidation, we can
> still race with a new fault instantiating a page whilst the aops are
> being swapped, right? That was the problem that sunk dynamic
> swapping of the aops when turning DAX on and off on an inode, right?

I was not aware of that, thanks!

> > If switching aops is even
> > the right thing to do. I don't see the problem with allowing buffer heads
> > on block devices, but I haven't been involved with the discussion here.
>
> iomap supports bufferheads as a transitional thing (e.g. for gfs2).

But not for the block device cache.

That's why I'm suggesting that you implement support for bufferheads
through the existing iomap infrastructure instead of trying to
dynamically switch aops structures....

> Hence I suspect that a better solution is to always use iomap and
> the same aops, but just switch from iomap page state to buffer heads
> in the bdev mapping

Not sure what this means, any chance I can trouble you to clarify a bit
more?

bdev_use_buggerheads()
{
	/*
	 * Set the bufferhead bit atomically with invalidation emptying the
	 * page cache to prevent repopulation races.
	 */
	filemap_invalidate_lock()
	invalidate_bdev()
	if (turn_on_bufferheads)
		set_bit(bdev->state, BDEV_USE_BUFFERHEADS);
	else
		clear_bit(bdev->state, BDEV_USE_BUFFERHEADS);
	filemap_invalidate_unlock()
}

bdev_iomap_begin()
{
	.....
	if (test_bit(bdev->state, BDEV_USE_BUFFERHEADS))
		iomap->flags |= IOMAP_F_BUFFER_HEAD;
}

/*
 * If an indexing switch happened while processing the iomap, make
 * sure to get the iomap marked stale to force a new mapping to be
 * looked up.
 */
bdev_iomap_valid()
{
	bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
	bool use_bhs = test_bit(bdev->state, BDEV_USE_BUFFERHEADS);

	return need_bhs == use_bhs;
}

-Dave.
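For reference, the format Dave describes above (32kB sectors, 64kB
filesystem blocks) corresponds to a mkfs.xfs invocation along these
lines; the device path is only an example, and as Luis notes further
down, a 4k PAGE_SIZE kernel cannot mount the result without the
bs > ps work:

	# example device; mkfs itself needs no LBS or bs > ps support
	mkfs.xfs -f -b size=64k -s size=32k /dev/nvme0n1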
On 9/18/23 01:14, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
>> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>>> LBS devices. This in turn allows filesystems which support bs > 4k to be
>>> enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
>>> devices then to take advantage of the recently posted work today to enable
>>> LBS support for filesystems [0].
>>
>> Why do we need LBS devices to support bs > ps in XFS?
>
> It's the other way round -- we need the support in the page cache to
> reject sub-block-size folios (which is in the other patches) before we
> can sensibly talk about enabling any filesystems on top of LBS devices.
>
> Even XFS, or for that matter ext2 which support 16k block sizes on
> CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
>
> [snipping the parts I agree with; this should not be the first you're
> hearing about a format change to XFS]
>
>>> There might be a better way to do this than to deal with the switching
>>> of the aops dynamically, ideas welcomed!
>>
>> Is it even safe to switch aops dynamically? We know there are
>> inherent race conditions in doing this w.r.t. mmap and page faults,
>> as the write fault part of the processing is directly dependent
>> on the page being correctly initialised during the initial
>> population of the page data (the "read fault" side of the write
>> fault).
>>
>> Hence it's not generally considered safe to change aops from one
>> mechanism to another dynamically. Block devices can be mmap()d, but
>> I don't see anything in this patch set that ensures there are no
>> other users of the block device when the swaps are done. What am I
>> missing?
>
> We need to evict all pages from the page cache before switching aops to
> prevent misinterpretation of folio->private. If switching aops is even
> the right thing to do. I don't see the problem with allowing buffer heads
> on block devices, but I haven't been involved with the discussion here.

Did we even have that conversation?
That's one of the first things I stumbled across when doing my patchset,
and I found the implications too horrible to consider.
Not a big fan, plus I don't think we need that. Cf my patchset :-)

Cheers,

Hannes
On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > for many Linux distributions since it also means disabling support for most
> > filesystems other than btrfs and XFS. So we either support larger order folios
> > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > 2023 it was decided we would not support larger order folios on buffer-heads,
>
> Um, I didn't agree to that.

Coverage of the sunsetting buffer-heads talk by LWN:

https://lwn.net/Articles/931809/

"the apparent conclusion from the session: the buffer-head layer will be
converted to use folios internally while minimizing changes visible to
the filesystems using it. Only single-page folios will be used within
this new buffer-head layer. Any other desires, he said, can be addressed
later after this problem has been solved."

And so I think we're at the later part of this. I'm happy to see efforts
toward buffer-head support for large order folios, and so at this point
I think it would be wise to look for commonalities and collaborate on
what things could / should be shared.

  Luis
On Mon, Sep 18, 2023 at 12:49:36PM +1000, Dave Chinner wrote:
> On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> > On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > > > > devices then to take advantage of the recently posted work today to enable
> > > > > > LBS support for filesystems [0].
> > > > >
> > > > > Why do we need LBS devices to support bs > ps in XFS?
> > > >
> > > > It's the other way round -- we need the support in the page cache to
> > > > reject sub-block-size folios (which is in the other patches) before we
> > > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > > >
> > > > Even XFS, or for that matter ext2 which support 16k block sizes on
> > > > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> > >
> > > Well, yes, I know that. But the statement above implies that we
> > > can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> > > systems. If it's meant to mean the exact opposite, then it is
> > > extremely poorly worded....
> >
> > Let's ignore the above statement for a second just to clarify the goal here.
> > The patches posted by Pankaj enable bs > ps even on 4k sector drives.
>
> Yes. They also enable XFS to support bs > ps on devices up to 32kB
> sector sizes, too. All the sector size does is define the minimum
> filesystem block size that can be supported by the filesystem on
> that device and apart from that we just don't care what the sector
> size on the underlying device is.

Ah yes, but on a system with 4k page size, my point is that you still
can't use a 32k sector size drive.

> > This patch series by definition is suggesting that an LBS device is one
> > where the minimum sector size is > ps, in practice I'm talking about
> > devices where the logical block size is > 4k, or where the physical
> > block size can be > 4k.
>
> Which XFS with bs > ps just doesn't care about. As long as the
> logical sector size is a power of 2 between 512 bytes and 32kB, it
> will just work.
>
> > There are two situations in the NVMe world where
> > this can happen. One is where the LBA format used is > 4k, the other is
> > where the npwg & awupf | nawupf is > 4k. The first makes the logical
>
> FWIW, I have no idea what these acronyms actually mean....

Sorry, I had not described them; they are documented here now:

https://kernelnewbies.org/KernelProjects/large-block-size#LBS_NVMe_support

What matters though is that in practice it is a combination of two
values which today also allows the nvme driver to have a higher than 4k
physical block size.

> > block size > 4k, the latter allows the physical block size to be > 4k.
> > The first *needs* an aops which groks > ps on the block device cache.
> > The second lets you remain backward compatible with 4k sector size, but
> > if you want to use a larger sector size you can too, but that also
> > requires a block device cache which groks > ps. Devices which use > ps
> > for the logical block size or the physical block size are what I am
> > suggesting we should call LBS devices.
>
> Sure. LBS means sector size > page size for the block device. That
> much has always been clear.
>
> But telling me that again doesn't explain what LBS support has to do
> with the filesystem implementation. mkfs.xfs doesn't require LBS
> support to make a new XFS filesystem with a 32kB sector size and
> 64kB filesystem block size. For the mounted filesystem that
> supports bs > ps, it also doesn't care if the device sector size
> is smaller than what mkfs told it to use. e.g. this is how we run 4kB
> sector size filesystem testing on 512 byte sector devices today....
>
> What I'm asking is why LBS support even mentions filesystems?

It's just that without this patchset you can mkfs filesystems with
bs > ps, but you will only be able to mount them if the sector size
matches the page size. This patchset allows them to be mounted and used
where the sector size is > ps. The main issue there is just the block
device cache aops and the current size limitation on set_blocksize().

> It's
> not necessary for filesystems to support bs > ps for LBS to be
> implemented, and it's not necessary for LBS to be supported for
> filesystems to implement bs > ps. Conflating them seems a recipe
> for confusion....

I see, so we should just limit the scope to enabling LBS devices on the
block device cache?

> > > iomap supports bufferheads as a transitional thing (e.g. for gfs2).
> >
> > But not for the block device cache.
>
> That's why I'm suggesting that you implement support for bufferheads
> through the existing iomap infrastructure instead of trying to
> dynamically switch aops structures....
>
> > > Hence I suspect that a better solution is to always use iomap and
> > > the same aops, but just switch from iomap page state to buffer heads
> > > in the bdev mapping
> >
> > Not sure what this means, any chance I can trouble you to clarify a bit
> > more?
>
> bdev_use_buggerheads()
> {
> 	/*
> 	 * Set the bufferhead bit atomically with invalidation emptying the
> 	 * page cache to prevent repopulation races.
> 	 */
> 	filemap_invalidate_lock()
> 	invalidate_bdev()
> 	if (turn_on_bufferheads)
> 		set_bit(bdev->state, BDEV_USE_BUFFERHEADS);
> 	else
> 		clear_bit(bdev->state, BDEV_USE_BUFFERHEADS);
> 	filemap_invalidate_unlock()
> }
>
> bdev_iomap_begin()
> {
> 	.....
> 	if (test_bit(bdev->state, BDEV_USE_BUFFERHEADS))
> 		iomap->flags |= IOMAP_F_BUFFER_HEAD;
> }
>
> /*
>  * If an indexing switch happened while processing the iomap, make
>  * sure to get the iomap marked stale to force a new mapping to be
>  * looked up.
>  */
> bdev_iomap_valid()
> {
> 	bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
> 	bool use_bhs = test_bit(bdev->state, BDEV_USE_BUFFERHEADS);
>
> 	return need_bhs == use_bhs;
> }

Will give this a shot, thanks!

  Luis
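The set_blocksize() limitation Luis refers to is the long-standing guard
that rejects block sizes above PAGE_SIZE for the block device cache;
abridged and paraphrased from memory, the relevant check is roughly:

	int set_blocksize(struct block_device *bdev, int size)
	{
		/* Size must be a power of two, and between 512 and PAGE_SIZE */
		if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
			return -EINVAL;

		/* Size cannot be smaller than the size supported by the device */
		if (size < bdev_logical_block_size(bdev))
			return -EINVAL;

		/* ... the rest of the function is omitted here ... */
	}

The PAGE_SIZE ceiling is what stops the bdev page cache from being set up
with a block size larger than the page size today.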
On Mon, Sep 18, 2023 at 10:12:04AM -0700, Luis Chamberlain wrote:
> On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > > for many Linux distributions since it also means disabling support for most
> > > filesystems other than btrfs and XFS. So we either support larger order folios
> > > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > > 2023 it was decided we would not support larger order folios on buffer-heads,
> >
> > Um, I didn't agree to that.
>
> Coverage of the sunsetting buffer-heads talk by LWN:
>
> https://lwn.net/Articles/931809/
>
> "the apparent conclusion from the session: the buffer-head layer will be
> converted to use folios internally while minimizing changes visible to
> the filesystems using it. Only single-page folios will be used within
> this new buffer-head layer. Any other desires, he said, can be addressed
> later after this problem has been solved."

Other people said that.  Not me.  I said it was fine for single
buffer_head per folio.
On 9/18/23 20:15, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 10:12:04AM -0700, Luis Chamberlain wrote:
>> On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
>>> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>>>> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
>>>> for many Linux distributions since it also means disabling support for most
>>>> filesystems other than btrfs and XFS. So we either support larger order folios
>>>> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
>>>> 2023 it was decided we would not support larger order folios on buffer-heads,
>>>
>>> Um, I didn't agree to that.
>>
>> Coverage of the sunsetting buffer-heads talk by LWN:
>>
>> https://lwn.net/Articles/931809/
>>
>> "the apparent conclusion from the session: the buffer-head layer will be
>> converted to use folios internally while minimizing changes visible to
>> the filesystems using it. Only single-page folios will be used within
>> this new buffer-head layer. Any other desires, he said, can be addressed
>> later after this problem has been solved."
>
> Other people said that.  Not me.  I said it was fine for single
> buffer_head per folio.

And my patchset proves that to be the case.

Cheers,

Hannes