[RFC,v2,00/10] bdev: LBS devices support to coexist with buffer-heads

Message ID 20230915213254.2724586-1-mcgrof@kernel.org (mailing list archive)

Message

Luis Chamberlain Sept. 15, 2023, 9:32 p.m. UTC
Christoph added CONFIG_BUFFER_HEAD on v6.1 enabling a world where we can
live without buffer-heads. When we opt into that world we end up also
using the address space operations of the block device cache using
iomap. Since iomap supports higher order folios, it means that block
devices which do use the iomap aops can end up having a logical block
size or physical block size greater than PAGE_SIZE. We refer to these as
LBS devices. This in turn allows filesystems which support bs > 4k to be
enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
devices to take advantage of the recently posted work to enable
LBS support for filesystems [0].

However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
for many Linux distributions since it also means disabling support for most
filesystems other than btrfs and XFS. So we either support larger order folios
on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
2023 it was decided we would not support larger order folios on buffer-heads,
this series takes the approach of supporting co-existence between buffer-head
based filesystems and the iomap aops, switching dynamically when needed on
block devices. Folks who want to enhance buffer-heads to support larger order
folios can do so in parallel to this effort. This just allows iomap based
filesystems to support LBS block devices today while also allowing buffer-head
based filesystems to be used on them, as long as the LBS device also supports
a logical block size of PAGE_SIZE.
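
To sketch the intended policy in code, a minimal illustration follows; the
helper name and the two aops symbols below are made up for this sketch and
are not the names used in the patches:

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical aops instances, named only for this sketch. */
extern const struct address_space_operations def_blk_aops_bh;
extern const struct address_space_operations def_blk_aops_iomap;

/*
 * Default to the buffer-head based aops so buffer-head based filesystems
 * keep working; only use the iomap aops when the requested block size or
 * the device's logical block size exceeds PAGE_SIZE.
 */
static const struct address_space_operations *
bdev_pick_aops(struct block_device *bdev, unsigned int blocksize)
{
	if (blocksize > PAGE_SIZE ||
	    bdev_logical_block_size(bdev) > PAGE_SIZE)
		return &def_blk_aops_iomap;
	return &def_blk_aops_bh;
}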

LBS devices can come in a few flavors. Adopting a larger LBA format enables
logical block sizes > 4k. Other block device mechanisms can also increase a
block device's physical block size beyond 4k while still retaining backward
compatibility with a logical block size of 4k. An example is NVMe drives with
an LBA format of 4k and an npwg and awupf | nawupf of 16k. With this you get a
logical block size of 4k and a physical block size of 16k. Drives which do this
may become a reality as QLC drives adopt larger Indirection Units (IUs), larger
than 4k, so as to better support larger capacities. At least OCP 2.0 does
require you to expose the IU through npwg. Whether or not a device supports the
same value for awupf | nawupf (a large atomic) is device specific. Devices which
*do* sport an LBA format of 4k and npwg & awupf | nawupf >= 16k can essentially
benefit from LBS support, such as with XFS [1].
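
To make the arithmetic concrete, here is a small stand-alone illustration,
not nvme driver code; it assumes the NVMe convention that npwg and nawupf
are 0's based counts of logical blocks, and taking the minimum of the two
is only a policy choice for this sketch:

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: derive a physical block size candidate from a
 * drive's LBA size and its npwg / nawupf values.
 */
static uint32_t lbs_phys_bs(uint32_t lba_size, uint16_t npwg, uint16_t nawupf)
{
	uint32_t pref = (uint32_t)(npwg + 1) * lba_size;
	uint32_t atomic = (uint32_t)(nawupf + 1) * lba_size;

	/* Only advertise what is both preferred and power-fail atomic. */
	return pref < atomic ? pref : atomic;
}

int main(void)
{
	/* 4k LBA format, npwg = nawupf = 3, i.e. 4 logical blocks = 16k */
	printf("physical block size: %u\n", lbs_phys_bs(4096, 3, 3));
	return 0;
}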

While LBS devices come to market, folks can also experiment with NVMe today
to see what some of this might be like, by ignoring power failure and faking
the larger atomic up to the NVMe awun. When the NVMe awun is 0xffff it means
awun is MDTS, so for those drives you can in practice experiment today with
LBS up to MDTS on real drives. This is documented in the kdevops documentation
for LBS, using something like nvme_core.debug_large_atomics=16384 [2].

To experiment with larger LBA formats you can also use kdevops and enable
CONFIG_QEMU_ENABLE_EXTRA_DRIVE_LARGEIO. That enables a ton of drives with
logical and physical block sizes >= 4k up to a desirable max target for
experimentation. Since filesystems today only support up to 32k sector sizes,
in practice you may only want to experiment up to 32k physical / logical.

Support for 64k sector sizes requires an XFS format change, which is something
Daniel Gomez has experimental patches for, in case folks are interested in
messing with them.

Patch 6 could probably be squashed with patch 5, but I wanted to be
explicit about this, as this should be decided with the community.

There might be a better way to do this than to deal with the switching
of the aops dynamically, ideas welcomed!

The NVMe debug patches are posted for pure experimentation purposes, but
the AWUN patch might be welcomed upstream. Unless folks really want it,
the goal here is *not* to upstream the support for the debug module
parameter nvme_core.debug_large_atomics=16384. On a v3 RFC I can drop those
patches and just aim to simplify the logic to support LBA formats > ps.

Changes since v1 RFC:

  o Modified the approach to accept we can have different aops per super block
    on the block device cache as suggested by Christoph
  o Try to allow dynamically switching between iomap aops and buffer-head aops
  o Use the iomap aops if the target block size order or min order > ps
  o Loosen restrictions on set_blocksize() so as to just check for the iomap
    aops for now, making it easy to allow buffer-heads to later add support
    for bs > ps as well (see the sketch after this list)
  o Always use the buffer-head aops first unless the min order is > ps, so as
    to always support buffer-head based filesystems
  o Adopt a flag for buffer-heads so as to check for support; buffer-head based
    filesystems cannot be used while an LBS filesystem is being used, for
    instance
  o Allow iomap aops to be used when the physical block size is > ps as well
  o NVMe changes to experiment with LBS devices without power failure
  o NVMe changes to support LBS devices
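
As a sketch of the loosened set_blocksize() check referenced in the list
above (the iomap aops symbol is hypothetical and the exact placement differs
in the patches), the idea is to keep the existing sanity checks but only
allow a block size above PAGE_SIZE when the block device inode is using an
aops that groks large folios:

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/log2.h>

/* Hypothetical iomap based bdev aops symbol, for illustration only. */
extern const struct address_space_operations def_blk_aops_iomap;

static int bdev_validate_blocksize(struct block_device *bdev, int size)
{
	/* The usual set_blocksize() sanity checks still apply. */
	if (size < 512 || !is_power_of_2(size) ||
	    size < bdev_logical_block_size(bdev))
		return -EINVAL;

	/* bs > ps is only allowed with an aops that groks large folios. */
	if (size > PAGE_SIZE &&
	    bdev->bd_inode->i_mapping->a_ops != &def_blk_aops_iomap)
		return -EINVAL;

	return 0;
}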

[0] https://lkml.kernel.org/r/20230608032404.1887046-1-mcgrof@kernel.org
[1] https://lkml.kernel.org/r/20230915183848.1018717-1-kernel@pankajraghav.com
[2] https://github.com/linux-kdevops/kdevops/blob/master/docs/lbs.md

Luis Chamberlain (10):
  bdev: rename iomap aops
  bdev: dynamically set aops to enable LBS support
  bdev: increase bdev max blocksize depending on the aops used
  filesystems: add filesystem buffer-head flag
  bdev: allow to switch between bdev aops
  bdev: simplify coexistance
  nvme: enhance max supported LBA format check
  nvme: add awun / nawun sanity check
  nvme: add nvme_core.debug_large_atomics to force high awun as phys_bs
  nvme: enable LBS support

 block/bdev.c             | 73 ++++++++++++++++++++++++++++++++++++++--
 block/blk.h              |  1 +
 block/fops.c             | 14 ++++----
 drivers/nvme/host/core.c | 70 +++++++++++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h |  1 +
 fs/adfs/super.c          |  2 +-
 fs/affs/super.c          |  2 +-
 fs/befs/linuxvfs.c       |  2 +-
 fs/bfs/inode.c           |  2 +-
 fs/efs/super.c           |  2 +-
 fs/exfat/super.c         |  2 +-
 fs/ext2/super.c          |  2 +-
 fs/ext4/super.c          |  7 ++--
 fs/f2fs/super.c          |  2 +-
 fs/fat/namei_msdos.c     |  2 +-
 fs/fat/namei_vfat.c      |  2 +-
 fs/freevxfs/vxfs_super.c |  2 +-
 fs/gfs2/ops_fstype.c     |  4 +--
 fs/hfs/super.c           |  2 +-
 fs/hfsplus/super.c       |  2 +-
 fs/isofs/inode.c         |  2 +-
 fs/jfs/super.c           |  2 +-
 fs/minix/inode.c         |  2 +-
 fs/nilfs2/super.c        |  2 +-
 fs/ntfs/super.c          |  2 +-
 fs/ntfs3/super.c         |  2 +-
 fs/ocfs2/super.c         |  2 +-
 fs/omfs/inode.c          |  2 +-
 fs/qnx4/inode.c          |  2 +-
 fs/qnx6/inode.c          |  2 +-
 fs/reiserfs/super.c      |  2 +-
 fs/super.c               |  3 +-
 fs/sysv/super.c          |  4 +--
 fs/udf/super.c           |  2 +-
 fs/ufs/super.c           |  2 +-
 include/linux/blkdev.h   |  7 ++++
 include/linux/fs.h       |  1 +
 37 files changed, 189 insertions(+), 48 deletions(-)

Comments

Matthew Wilcox (Oracle) Sept. 15, 2023, 9:51 p.m. UTC | #1
On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> for many Linux distributions since it also means disabling support for most
> filesystems other than btrfs and XFS. So we either support larger order folios
> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> 2023 it was decided we would not support larger order folios on buffer-heads,

Um, I didn't agree to that.  If block size is equal to folio size, there
are no problems supporting one buffer head per folio.  In fact, we could
probably go up to 8 buffer heads per folio without any trouble (but I'm
not signing up to do that work).
Luis Chamberlain Sept. 15, 2023, 10:26 p.m. UTC | #2
On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > for many Linux distributions since it also means disabling support for most
> > filesystems other than btrfs and XFS. So we either support larger order folios
> > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > 2023 it was decided we would not support larger order folios on buffer-heads,
> 
> Um, I didn't agree to that.  If block size is equal to folio size, there
> are no problems supporting one buffer head per folio.  In fact, we could
> probably go up to 8 buffer heads per folio without any trouble (but I'm
> not signing up to do that work).

Patches welcomed for buffer-head larger order folio support :)

But in the meantime...

  Luis
Hannes Reinecke Sept. 17, 2023, 11:50 a.m. UTC | #3
On 9/15/23 23:51, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
>> for many Linux distributions since it also means disabling support for most
>> filesystems other than btrfs and XFS. So we either support larger order folios
>> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
>> 2023 it was decided we would not support larger order folios on buffer-heads,
> 
> Um, I didn't agree to that.  If block size is equal to folio size, there
> are no problems supporting one buffer head per folio.  In fact, we could
> probably go up to 8 buffer heads per folio without any trouble (but I'm
> not signing up to do that work).
> 
Entirely correct.
I have a patchset ready for doing just that (ie having one buffer head 
per folio); hope I'll find some time next week to get it cleaned up and 
posted.

Cheers,

Hannes
Dave Chinner Sept. 17, 2023, 10:38 p.m. UTC | #4
On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> Christoph added CONFIG_BUFFER_HEAD on v6.1 enabling a world where we can
> live without buffer-heads. When we opt into that world we end up also
> using the address space operations of the block device cache using
> iomap. Since iomap supports higher order folios it means then that block
> devices which do use the iomap aops can end up having a logical block
> size or physical block size greater than PAGE_SIZE. We refer to these as
> LBS devices. This in turn allows filesystems which support bs > 4k to be
> enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> devices to take advantage of the recently posted work to enable
> LBS support for filesystems [0].

Why do we need LBS devices to support bs > ps in XFS?

As long as the filesystem block size is >= logical sector size of
the device, XFS just doesn't care what the "block size" of the
underlying block device is. i.e. XFS will allow minimum IO sizes of
the logical sector size of the device for select metadata and
sub-block direct IO, but otherwise all other IO will be aligned to
filesystem block size and so the underlying device block sizes are
completely irrelevant...

> To experiment with larger LBA formats you can also use kdevops and enable
> CONFIG_QEMU_ENABLE_EXTRA_DRIVE_LARGEIO. That enables a ton of drives with
> logical and physical block sizes >= 4k up to a desirable max target for
> experimentation. Since filesystems today only support up to 32k sector sizes,
> in practice you may only want to experiment up to 32k physical / logical.
> 
> Support for 64k sector sizes requires an XFS format change, which is something
> Daniel Gomez has experimental patches for, in case folks are interested in
> messing with them.

Please don't do this without first talking to the upstream XFS
developers about the intended use cases and design of the new
format. Especially when the problem involves requiring a whole new
journal header and indexing format to be implemented...

As a general rule, nobody should *ever* be writing patches to change
the on-disk format of a -any- filesystem without first engaging the
people who maintain that filesystem. Architecture and design come
first, not implementation. The last thing we want is to have someone
spend weeks or months on something that takes the experts half a
minute to NACK because it is so obviously flawed....

> Patch 6 could probably be squashed with patch 5, but I wanted to be
> explicit about this, as this should be decided with the community.
> 
> There might be a better way to do this than to deal with the switching
> of the aops dynamically, ideas welcomed!

Is it even safe to switch aops dynamically? We know there are
inherent race conditions in doing this w.r.t. mmap and page faults,
as the write fault part of the processing is directly dependent
on the page being correctly initialised during the initial
population of the page data (the "read fault" side of the write
fault).

Hence it's not generally considered safe to change aops from one
mechanism to another dynamically. Block devices can be mmap()d, but
I don't see anything in this patch set that ensures there are no
other users of the block device when the swaps are done. What am I
missing?

Cheers,

Dave.
Matthew Wilcox (Oracle) Sept. 17, 2023, 11:14 p.m. UTC | #5
On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > devices to take advantage of the recently posted work to enable
> > LBS support for filesystems [0].
> 
> Why do we need LBS devices to support bs > ps in XFS?

It's the other way round -- we need the support in the page cache to
reject sub-block-size folios (which is in the other patches) before we
can sensibly talk about enabling any filesystems on top of LBS devices.
Even XFS, or for that matter ext2 which support 16k block sizes on
CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.

[snipping the parts I agree with; this should not be the first you're
hearing about a format change to XFS]

> > There might be a better way to do this than to deal with the switching
> > of the aops dynamically, ideas welcomed!
> 
> Is it even safe to switch aops dynamically? We know there are
> inherent race conditions in doing this w.r.t. mmap and page faults,
> as the write fault part of the processing is directly dependent
> on the page being correctly initialised during the initial
> population of the page data (the "read fault" side of the write
> fault).
> 
> Hence it's not generally considered safe to change aops from one
> mechanism to another dynamically. Block devices can be mmap()d, but
> I don't see anything in this patch set that ensures there are no
> other users of the block device when the swaps are done. What am I
> missing?

We need to evict all pages from the page cache before switching aops to
prevent misinterpretation of folio->private.  If switching aops is even
the right thing to do.  I don't see the problem with allowing buffer heads
on block devices, but I haven't been involved with the discussion here.
Dave Chinner Sept. 18, 2023, 12:59 a.m. UTC | #6
On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > devices to take advantage of the recently posted work to enable
> > > LBS support for filesystems [0].
> > 
> > Why do we need LBS devices to support bs > ps in XFS?
> 
> It's the other way round -- we need the support in the page cache to
> reject sub-block-size folios (which is in the other patches) before we
> can sensibly talk about enabling any filesystems on top of LBS devices.
>
> Even XFS, or for that matter ext2 which support 16k block sizes on
> CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.

Well, yes, I know that. But the statement above implies that we
can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
systems. If it's meant to mean the exact opposite, then it is
extremely poorly worded....

> > > There might be a better way to do this than to deal with the switching
> > > of the aops dynamically, ideas welcomed!
> > 
> > Is it even safe to switch aops dynamically? We know there are
> > inherent race conditions in doing this w.r.t. mmap and page faults,
> > as the write fault part of the processing is directly dependent
> > on the page being correctly initialised during the initial
> > population of the page data (the "read fault" side of the write
> > fault).
> > 
> > Hence it's not generally considered safe to change aops from one
> > mechanism to another dynamically. Block devices can be mmap()d, but
> > I don't see anything in this patch set that ensures there are no
> > other users of the block device when the swaps are done. What am I
> > missing?
> 
> We need to evict all pages from the page cache before switching aops to
> prevent misinterpretation of folio->private. 

Yes, but if the device is mapped, even after an invalidation, we can
still race with a new fault instantiating a page whilst the aops are
being swapped, right? That was the problem that sunk dynamic
swapping of the aops when turning DAX on and off on an inode, right?

> If switching aops is even
> the right thing to do.  I don't see the problem with allowing buffer heads
> on block devices, but I haven't been involved with the discussion here.

iomap supports bufferheads as a transitional thing (e.g. for gfs2).

Hence I suspect that a better solution is to always use iomap and
the same aops, but just switch from iomap page state to buffer heads
in the bdev mapping interface via a synchronised invalidation +
setting/clearing IOMAP_F_BUFFER_HEAD in all new mapping requests...

Cheers,

Dave.
Luis Chamberlain Sept. 18, 2023, 1:13 a.m. UTC | #7
On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > > devices to take advantage of the recently posted work to enable
> > > > LBS support for filesystems [0].
> > > 
> > > Why do we need LBS devices to support bs > ps in XFS?
> > 
> > It's the other way round -- we need the support in the page cache to
> > reject sub-block-size folios (which is in the other patches) before we
> > can sensibly talk about enabling any filesystems on top of LBS devices.
> >
> > Even XFS, or for that matter ext2 which support 16k block sizes on
> > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> 
> Well, yes, I know that. But the statement above implies that we
> can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> systems. If it's meant to mean the exact opposite, then it is
> extremely poorly worded....

Let's ignore the above statement for a second just to clarify the goal here.
The patches posted by Pankaj enable bs > ps even on 4k sector drives.

This patch series by definition is suggesting that an LBS device is one
where the minimum sector size is > ps, in practice I'm talking about
devices where the logical block size is > 4k, or where the physical
block size can be > 4k. There are two situations in the NVMe world where
this can happen. One is where the LBA format used is > 4k, the other is
where the npwg & awupf | nawupf is > 4k. The first makes the logical
block size > 4k, the latter allows the physical block size to be > 4k.
The first *needs* an aops which groks > ps on the block device cache.
The second lets you remain backward compatible with a 4k sector size, but
if you want to use a larger sector size you can too, but that also
requires a block device cache which groks > ps. Devices using > ps for the
logical block size or physical block size are what I am suggesting we
should call LBS devices.

I'm happy for us to use other names but it just seemed like a natural
progression to me.

> > > > There might be a better way to do this than to deal with the switching
> > > > of the aops dynamically, ideas welcomed!
> > > 
> > > Is it even safe to switch aops dynamically? We know there are
> > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > as the write fault part of the processing is directly dependent
> > > on the page being correctly initialised during the initial
> > > population of the page data (the "read fault" side of the write
> > > fault).
> > > 
> > > Hence it's not generally considered safe to change aops from one
> > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > I don't see anything in this patch set that ensures there are no
> > > other users of the block device when the swaps are done. What am I
> > > missing?
> > 
> > We need to evict all pages from the page cache before switching aops to
> > prevent misinterpretation of folio->private. 
> 
> Yes, but if the device is mapped, even after an invalidation, we can
> still race with a new fault instantiating a page whilst the aops are
> being swapped, right? That was the problem that sunk dynamic
> swapping of the aops when turning DAX on and off on an inode, right?

I was not aware of that, thanks!

> > If switching aops is even
> > the right thing to do.  I don't see the problem with allowing buffer heads
> > on block devices, but I haven't been involved with the discussion here.
> 
> iomap supports bufferheads as a transitional thing (e.g. for gfs2).

But not for the block device cache.

> Hence I suspect that a better solution is to always use iomap and
> the same aops, but just switch from iomap page state to buffer heads
> in the bdev mapping 

Not sure what this means, any chance I can trouble you to clarify a bit more?

> interface via a synchronised invalidation +
> setting/clearing IOMAP_F_BUFFER_HEAD in all new mapping requests...

I'm happy for us to use whatever makes more sense and is safer, just keep in
mind that we can't default to the iomap aops, otherwise buffer-head
based filesystems won't work. I tried that. The first aops which the
block device cache needs if we want co-existence is buffer-heads.

If buffer-heads get large folio support then this is a non-issue as
far as I can tell, but only because XFS does not use buffer-heads for
metadata. If and when we do have a filesystem which does use
buffer-heads for metadata to support LBS it could be a problem I think.

  Luis
Dave Chinner Sept. 18, 2023, 2:49 a.m. UTC | #8
On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > > > devices to take advantage of the recently posted work to enable
> > > > > LBS support for filesystems [0].
> > > > 
> > > > Why do we need LBS devices to support bs > ps in XFS?
> > > 
> > > It's the other way round -- we need the support in the page cache to
> > > reject sub-block-size folios (which is in the other patches) before we
> > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > >
> > > Even XFS, or for that matter ext2 which support 16k block sizes on
> > > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> > 
> > Well, yes, I know that. But the statement above implies that we
> > can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> > systems. If it's meant to mean the exact opposite, then it is
> > extremely poorly worded....
> 
> Let's ignore the above statement for a second just to clarify the goal here.
> The patches posted by Pankaj enable bs > ps even on 4k sector drives.

Yes. They also enable XFS to support bs > ps on devices up to 32kB
sector sizes, too. All the sector size does is define the minimum
filesystem block size that can be supported by the filesystem on
that device and apart from that we just don't care what the sector
size on the underlying device is.

> This patch series by definition is suggesting that an LBS device is one
> where the minimum sector size is > ps, in practice I'm talking about
> devices where the logical block size is > 4k, or where the physical
> block size can be > 4k.

Which XFS with bs > ps just doesn't care about. As long as the
logical sector size is a power of 2 between 512 bytes and 32kB, it
will just work.

> There are two situations in the NVMe world where
> this can happen. One is where the LBA format used is > 4k, the other is
> where the npwg & awupf | nawupf is > 4k. The first makes the logical

FWIW, I have no idea what these acronyms actually mean....

> block size > 4k, the latter allows the physical block size to be > 4k.
> The first *needs* an aops which groks > ps on the block device cache.
> The second lets you remain backward compatible with a 4k sector size, but
> if you want to use a larger sector size you can too, but that also
> requires a block device cache which groks > ps. Devices using > ps for the
> logical block size or physical block size are what I am suggesting we
> should call LBS devices.

Sure. LBS means sector size > page size for the block device. That
much has always been clear.

But telling me that again doesn't explain what LBS support has to do
with the filesystem implementation. mkfs.xfs doesn't require LBS
support to make a new XFS filesystem with a 32kB sector size and
64kB filesystem block size.  For the mounted filesystem that
supports bs > ps, it also doesn't care if the device sector size
is smaller than what mkfs told it to use. e.g. this is how we run 4kB
sector size filesystem testing on 512 byte sector devices today....

What I'm asking is why LBS support even mentions filesystems? It's
not necessary for filesystems to support bs > ps for LBS to be
implemented, and it's not necessary for LBS to be supported for
filesystems to implement bs > ps. Conflating them seems a recipe
for confusion....

> > > > > There might be a better way to do this than to deal with the switching
> > > > > of the aops dynamically, ideas welcomed!
> > > > 
> > > > Is it even safe to switch aops dynamically? We know there are
> > > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > > as the write fault part of the processing is directly dependent
> > > > on the page being correctly initialised during the initial
> > > > population of the page data (the "read fault" side of the write
> > > > fault).
> > > > 
> > > > Hence it's not generally considered safe to change aops from one
> > > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > > I don't see anything in this patch set that ensures there are no
> > > > other users of the block device when the swaps are done. What am I
> > > > missing?
> > > 
> > > We need to evict all pages from the page cache before switching aops to
> > > prevent misinterpretation of folio->private. 
> > 
> > Yes, but if the device is mapped, even after an invalidation, we can
> > still race with a new fault instantiating a page whilst the aops are
> > being swapped, right? That was the problem that sunk dynamic
> > swapping of the aops when turning DAX on and off on an inode, right?
> 
> I was not aware of that, thanks!
> 
> > > If switching aops is even
> > > the right thing to do.  I don't see the problem with allowing buffer heads
> > > on block devices, but I haven't been involved with the discussion here.
> > 
> > iomap supports bufferheads as a transitional thing (e.g. for gfs2).
> 
> But not for the block device cache.

That's why I'm suggesting that you implement support for bufferheads
through the existing iomap infrastructure instead of trying to
dynamically switch aops structures....

> > Hence I suspect that a better solution is to always use iomap and
> > the same aops, but just switch from iomap page state to buffer heads
> > in the bdev mapping 
> 
> Not sure what this means, any chance I can trouble you to clarify a bit more?

bdev_use_buggerheads()
{
	/*
	 * set the bufferhead bit atomically with invalidation emptying the
	 * page cache to prevent repopulation races. 
	 */
	filemap_invalidate_lock()
	invalidate_bdev()
	if (turn_on_bufferheads)
		set_bit(bdev->state, BDEV_USE_BUFFERHEADS);
	else
		clear_bit(bdev->state, BDEV_USE_BUFFERHEADS);
	filemap_invalidate_unlock()
}

bdev_iomap_begin()
{
	.....
	if (test_bit(bdev->state, BDEV_USE_BUFFERHEADS))
		iomap->flags |= IOMAP_F_BUFFER_HEAD;
}

/*
 * If an indexing switch happened while processing the iomap, make
 * sure to get the iomap marked stale to force a new mapping to be
 * looked up.
 */
bdev_iomap_valid()
{
	bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
	bool use_bhs = test_bit(bdev->state, BDEV_USE_BUFFERHEADS);

	return need_bhs == use_bhs;
}

-Dave.
Hannes Reinecke Sept. 18, 2023, 11:34 a.m. UTC | #9
On 9/18/23 01:14, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
>> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>>> LBS devices. This in turn allows filesystems which support bs > 4k to be
>>> enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
>>> devices to take advantage of the recently posted work to enable
>>> LBS support for filesystems [0].
>>
>> Why do we need LBS devices to support bs > ps in XFS?
> 
> It's the other way round -- we need the support in the page cache to
> reject sub-block-size folios (which is in the other patches) before we
> can sensibly talk about enabling any filesystems on top of LBS devices.
> Even XFS, or for that matter ext2 which support 16k block sizes on
> CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> 
> [snipping the parts I agree with; this should not be the first you're
> hearing about a format change to XFS]
> 
>>> There might be a better way to do this than to deal with the switching
>>> of the aops dynamically, ideas welcomed!
>>
>> Is it even safe to switch aops dynamically? We know there are
>> inherent race conditions in doing this w.r.t. mmap and page faults,
>> as the write fault part of the processing is directly dependent
>> on the page being correctly initialised during the initial
>> population of the page data (the "read fault" side of the write
>> fault).
>>
>> Hence it's not generally considered safe to change aops from one
>> mechanism to another dynamically. Block devices can be mmap()d, but
>> I don't see anything in this patch set that ensures there are no
>> other users of the block device when the swaps are done. What am I
>> missing?
> 
> We need to evict all pages from the page cache before switching aops to
> prevent misinterpretation of folio->private.  If switching aops is even
> the right thing to do.  I don't see the problem with allowing buffer heads
> on block devices, but I haven't been involved with the discussion here.

Did we even have that conversation?
That's one of the first things I've stumbled across when doing my 
patchset, and found the implications too horrible to consider.

Not a big fan, plus I don't think we need that.
Cf my patchset :-)

Cheers,

Hannes
Luis Chamberlain Sept. 18, 2023, 5:12 p.m. UTC | #10
On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > for many Linux distributions since it also means disabling support for most
> > filesystems other than btrfs and XFS. So we either support larger order folios
> > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > 2023 it was decided we would not support larger order folios on buffer-heads,
> 
> Um, I didn't agree to that.

Coverage on sunsetting buffer-heads talk by LWN:

https://lwn.net/Articles/931809/

"the apparent conclusion from the session: the buffer-head layer will be
converted to use folios internally while minimizing changes visible to
the filesystems using it. Only single-page folios will be used within
this new buffer-head layer. Any other desires, he said, can be addressed
later after this problem has been solved."

And so I think we're at the later part of this. I'm happy to see efforts
for buffer-head support of large order folios, and so at this
point I think it would be wise to look for commonalities and collaborate
on what things could / should be shared.

  Luis
Luis Chamberlain Sept. 18, 2023, 5:51 p.m. UTC | #11
On Mon, Sep 18, 2023 at 12:49:36PM +1000, Dave Chinner wrote:
> On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> > On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > > > > devices to take advantage of the recently posted work to enable
> > > > > > LBS support for filesystems [0].
> > > > > 
> > > > > Why do we need LBS devices to support bs > ps in XFS?
> > > > 
> > > > It's the other way round -- we need the support in the page cache to
> > > > reject sub-block-size folios (which is in the other patches) before we
> > > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > > >
> > > > Even XFS, or for that matter ext2 which support 16k block sizes on
> > > > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> > > 
> > > Well, yes, I know that. But the statement above implies that we
> > > can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> > > systems. If it's meant to mean the exact opposite, then it is
> > > extremely poorly worded....
> > 
> > Let's ignore the above statement for a second just to clarify the goal here.
> > The patches posted by Pankaj enable bs > ps even on 4k sector drives.
> 
> Yes. They also enable XFS to support bs > ps on devices up to 32kB
> sector sizes, too. All the sector size does is define the minimum
> filesystem block size that can be supported by the filesystem on
> that device and apart from that we just don't care what the sector
> size on the underlying device is.

Ah yes, but on a system with 4k page size, my point is that you still
can't use a 32k sector size drive.

> > This patch series by definition is suggesting that an LBS device is one
> > where the minimum sector size is > ps, in practice I'm talking about
> > devices where the logical block size is > 4k, or where the physical
> > block size can be > 4k.
> 
> Which XFS with bs > ps just doesn't care about. As long as the
> logical sector size is a power of 2 between 512 bytes and 32kB, it
> will just work.
> 
> > There are two situations in the NVMe world where
> > this can happen. One is where the LBA format used is > 4k, the other is
> > where the npwg & awupf | nawupf is > 4k. The first makes the logical
> 
> FWIW, I have no idea what these acronyms actually mean....

Sorry, I've described them here now:

https://kernelnewbies.org/KernelProjects/large-block-size#LBS_NVMe_support

What matters though is that in practice it is the combination of two values
which today also allows the nvme driver to have a physical block size
higher than 4k.

> > block size > 4k, the latter allows the physical block size to be > 4k.
> > The first *needs* an aops which groks > ps on the block device cache.
> > The second lets you remain backward compatible with a 4k sector size, but
> > if you want to use a larger sector size you can too, but that also
> > requires a block device cache which groks > ps. Devices using > ps for the
> > logical block size or physical block size are what I am suggesting we
> > should call LBS devices.
> 
> Sure. LBS means sector size > page size for the block device. That
> much has always been clear.
> 
> But telling me that again doesn't explain what LBS support has to do
> with the filesystem implementation. mkfs.xfs doesn't require LBS
> support to make a new XFS filesystem with a 32kB sector size and
> 64kB filesystem block size.  For the mounted filesystem that
> supports bs > ps, it also doesn't care if the device sector size
> is smaller than what mkfs told it to use. e.g. this is how we run 4kB
> sector size filesystem testing on 512 byte sector devices today....
> 
> What I'm asking is why LBS support even mentions filesystems?

It's just that without this patchset you can mkfs a filesystem with
bs > ps, but will only be able to mount it if the sector size
matches the page size. This patchset allows such filesystems to be mounted
and used where sector size > ps. The main issue there is just the block
device cache aops and the current size limitation on set_blocksize().

> It's
> not necessary for filesystems to support bs > ps for LBS to be
> implemented, and it's not necessary for LBS to be supported for
> filesystems to implement bs > ps. Conflating them seems a recipe
> for confusion....

I see, so we should just limit the scope to enabling LBS devices on
the block device cache?

> > > iomap supports bufferheads as a transitional thing (e.g. for gfs2).
> > 
> > But not for the block device cache.
> 
> That's why I'm suggesting that you implement support for bufferheads
> through the existing iomap infrastructure instead of trying to
> dynamically switch aops structures....
> 
> > > Hence I suspect that a better solution is to always use iomap and
> > > the same aops, but just switch from iomap page state to buffer heads
> > > in the bdev mapping 
> > 
> > Not sure what this means, any chance I can trouble you to clarify a bit more?
> 
> bdev_use_buggerheads()
> {
> 	/*
> 	 * set the bufferhead bit atomically with invalidation emptying the
> 	 * page cache to prevent repopulation races. 
> 	 */
> 	filemap_invalidate_lock()
> 	invalidate_bdev()
> 	if (turn_on_bufferheads)
> 		set_bit(bdev->state, BDEV_USE_BUFFERHEADS);
> 	else
> 		clear_bit(bdev->state, BDEV_USE_BUFFERHEADS);
> 	filemap_invalidate_unlock()
> }
> 
> bdev_iomap_begin()
> {
> 	.....
> 	if (test_bit(bdev->state, BDEV_USE_BUFFERHEADS))
> 		iomap->flags |= IOMAP_F_BUFFER_HEAD;
> }
> 
> /*
>  * If an indexing switch happened while processing the iomap, make
>  * sure to get the iomap marked stale to force a new mapping to be
>  * looked up.
>  */
> bdev_iomap_valid()
> {
> 	bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
> 	bool use_bhs = test_bit(bdev->state, BDEV_USE_BUFFERHEADS);
> 
> 	return need_bhs == use_bhs;
> }

Will give this a shot, thanks!

  Luis
Matthew Wilcox (Oracle) Sept. 18, 2023, 6:15 p.m. UTC | #12
On Mon, Sep 18, 2023 at 10:12:04AM -0700, Luis Chamberlain wrote:
> On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > > for many Linux distributions since it also means disabling support for most
> > > filesystems other than btrfs and XFS. So we either support larger order folios
> > > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > > 2023 it was decided we would not support larger order folios on buffer-heads,
> > 
> > Um, I didn't agree to that.
> 
> Coverage on sunsetting buffer-heads talk by LWN:
> 
> https://lwn.net/Articles/931809/
> 
> "the apparent conclusion from the session: the buffer-head layer will be
> converted to use folios internally while minimizing changes visible to
> the filesystems using it. Only single-page folios will be used within
> this new buffer-head layer. Any other desires, he said, can be addressed
> later after this problem has been solved."

Other people said that.  Not me.  I said it was fine for single
buffer_head per folio.
Hannes Reinecke Sept. 18, 2023, 6:42 p.m. UTC | #13
On 9/18/23 20:15, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 10:12:04AM -0700, Luis Chamberlain wrote:
>> On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
>>> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>>>> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
>>>> for many Linux distributions since it also means disabling support for most
>>>> filesystems other than btrfs and XFS. So we either support larger order folios
>>>> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
>>>> 2023 it was decided we would not support larger order folios on buffer-heads,
>>>
>>> Um, I didn't agree to that.
>>
>> Coverage on sunsetting buffer-heads talk by LWN:
>>
>> https://lwn.net/Articles/931809/
>>
>> "the apparent conclusion from the session: the buffer-head layer will be
>> converted to use folios internally while minimizing changes visible to
>> the filesystems using it. Only single-page folios will be used within
>> this new buffer-head layer. Any other desires, he said, can be addressed
>> later after this problem has been solved."
> 
> Other people said that.  Not me.  I said it was fine for single
> buffer_head per folio.

And my patchset proves that to be the case.

Cheers,

Hannes