[v6,00/10] block atomic writes

Message ID 20240326133813.3224593-1-john.g.garry@oracle.com (mailing list archive)

Message

John Garry March 26, 2024, 1:38 p.m. UTC
This series introduces a proposal for implementing atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".

Only direct IO to block devices is supported here. For this, atomic
write HW support is required, such as SCSI WRITE ATOMIC (16).

XFS FS support has previously been posted at:
https://lore.kernel.org/linux-xfs/20240304130428.13026-1-john.g.garry@oracle.com/

I am working on a new version of that series, which I hope to post soon.

Updated man pages have been posted at:
https://lore.kernel.org/lkml/20240124112731.28579-1-john.g.garry@oracle.com/T/#m520dca97a9748de352b5a723d3155a4bb1e46456

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size as
reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. On a power failure, for each individual application
block, either all or none of the data is written. A racing atomic write and
read will mean that the read sees either all the old data or all the new
data, but never a mix of old and new.

Three new fields are added to struct statx - atomic_write_unit_min,
atomic_write_unit_max, and atomic_write_segments_max. For each individual
atomic write, the total length of the write must be between
atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
power-of-2. The write must also be at a natural offset in the file
with respect to the write length. For pwritev2, iovcnt is limited by
atomic_write_segments_max.
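
As an illustrative sketch only (not part of the patches themselves), a
userspace consumer might look roughly like the following. It assumes uapi
headers updated by this series (RWF_ATOMIC) and the statx names from the
posted man pages (STATX_WRITE_ATOMIC, stx_atomic_write_unit_min, etc.);
treat those exact spellings as assumptions here.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <linux/stat.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	struct statx stx = { 0 };
	struct iovec iov;
	void *buf;
	ssize_t ret;
	__u32 len;
	int fd;

	if (argc != 2)
		return 1;

	fd = open(argv[1], O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* Query the atomic write limits which this series adds to statx */
	if (syscall(SYS_statx, AT_FDCWD, argv[1], 0,
		    STATX_BASIC_STATS | STATX_WRITE_ATOMIC, &stx) < 0)
		return 1;

	/* Assumption: a zero unit_min means atomic writes are unsupported */
	if (!stx.stx_atomic_write_unit_min) {
		fprintf(stderr, "atomic writes not supported\n");
		return 1;
	}

	/* Pick a power-of-2 length within [unit_min, unit_max] */
	len = stx.stx_atomic_write_unit_min;

	if (posix_memalign(&buf, len, len))
		return 1;
	memset(buf, 0xab, len);

	iov.iov_base = buf;
	iov.iov_len = len;

	/*
	 * Offset 0 is naturally aligned for any power-of-2 length; in
	 * general the file offset must be a multiple of the write length,
	 * and iovcnt must not exceed stx_atomic_write_segments_max.
	 */
	ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	if (ret != (ssize_t)len) {
		perror("pwritev2");
		return 1;
	}
	return 0;
}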

There has been some discussion on supporting buffered IO and whether the
API is suitable for it, for example:
https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/

Specifically, the concern is that supporting a range of atomic IO sizes
in the pagecache is complex. For this, my idea is that FSes can fix
atomic_write_unit_min and atomic_write_unit_max at the same size, the
extent alignment size, which should be easier to support. We may need to
implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
have no proposed solution for atomic write buffered IO for bdev file
operations, but I know of no requirement for it.

Kernel support is added for SCSI (sd.c and scsi_debug) and NVMe.

This series is based on v6.9-rc1

Patches can be found at:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.9-v6

Changes since v5:
- Rebase and update NVMe support for new request_queue limits API
  - Keith, please check since I still have your RB tag
- Change request_queue limits to byte-based sizes to suit new queue limits
  API
- Pass rw_type to io_uring io_rw_init_file() (Jens)
- Add BLK_STS_INVAL
- Don't check size in generic_atomic_write_valid()

Alan Adamson (1):
  nvme: Atomic write support

John Garry (6):
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Call blkdev_dio_unaligned() from blkdev_direct_IO()
  block: Add core atomic write support
  block: Add fops atomic write support
  scsi: sd: Atomic write support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (3):
  fs: Initial atomic write support
  fs: Add initial atomic write support info to statx
  block: Add atomic write support for statx

 Documentation/ABI/stable/sysfs-block |  52 +++
 block/bdev.c                         |  36 +-
 block/blk-core.c                     |  19 +
 block/blk-merge.c                    |  98 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 | 109 +++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  47 ++-
 drivers/nvme/host/core.c             |  49 +++
 drivers/scsi/scsi_debug.c            | 588 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  93 ++++-
 drivers/scsi/sd.h                    |   8 +
 fs/aio.c                             |   8 +-
 fs/btrfs/ioctl.c                     |   2 +-
 fs/read_write.c                      |   2 +-
 fs/stat.c                            |  50 ++-
 include/linux/blk_types.h            |   8 +-
 include/linux/blkdev.h               |  67 ++-
 include/linux/fs.h                   |  36 +-
 include/linux/stat.h                 |   3 +
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   9 +-
 io_uring/rw.c                        |   8 +-
 27 files changed, 1173 insertions(+), 192 deletions(-)

Comments

Matthew Wilcox (Oracle) March 27, 2024, 3:50 a.m. UTC | #1
On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote:
> The goal here is to provide an interface that allows applications use
> application-specific block sizes larger than logical block size
> reported by the storage device or larger than filesystem block size as
> reported by stat().
> 
> With this new interface, application blocks will never be torn or
> fractured when written. For a power fail, for each individual application
> block, all or none of the data to be written. A racing atomic write and
> read will mean that the read sees all the old data or all the new data,
> but never a mix of old and new.
> 
> Three new fields are added to struct statx - atomic_write_unit_min,
> atomic_write_unit_max, and atomic_write_segments_max. For each atomic
> individual write, the total length of a write must be a between
> atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
> power-of-2. The write must also be at a natural offset in the file
> wrt the write length. For pwritev2, iovcnt is limited by
> atomic_write_segments_max.
> 
> There has been some discussion on supporting buffered IO and whether the
> API is suitable, like:
> https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/
> 
> Specifically the concern is that supporting a range of sizes of atomic IO
> in the pagecache is complex to support. For this, my idea is that FSes can
> fix atomic_write_unit_min and atomic_write_unit_max at the same size, the
> extent alignment size, which should be easier to support. We may need to
> implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
> have no proposed solution for atomic write buffered IO for bdev file
> operations, but I know of no requirement for this.

The thing is that there's no requirement for an interface as complex as
the one you're proposing here.  I've talked to a few database people
and all they want is to increase the untorn write boundary from "one
disc block" to one database block, typically 8kB or 16kB.

So they would be quite happy with a much simpler interface where they
set the inode block size at inode creation time, and then all writes to
that inode were guaranteed to be untorn.  This would also be simpler to
implement for buffered writes.

Who's asking for this more complex interface?
John Garry March 27, 2024, 1:37 p.m. UTC | #2
On 27/03/2024 03:50, Matthew Wilcox wrote:
> On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote:
>> The goal here is to provide an interface that allows applications use
>> application-specific block sizes larger than logical block size
>> reported by the storage device or larger than filesystem block size as
>> reported by stat().
>>
>> With this new interface, application blocks will never be torn or
>> fractured when written. For a power fail, for each individual application
>> block, all or none of the data to be written. A racing atomic write and
>> read will mean that the read sees all the old data or all the new data,
>> but never a mix of old and new.
>>
>> Three new fields are added to struct statx - atomic_write_unit_min,
>> atomic_write_unit_max, and atomic_write_segments_max. For each atomic
>> individual write, the total length of a write must be a between
>> atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
>> power-of-2. The write must also be at a natural offset in the file
>> wrt the write length. For pwritev2, iovcnt is limited by
>> atomic_write_segments_max.
>>
>> There has been some discussion on supporting buffered IO and whether the
>> API is suitable, like:
>> https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/
>>
>> Specifically the concern is that supporting a range of sizes of atomic IO
>> in the pagecache is complex to support. For this, my idea is that FSes can
>> fix atomic_write_unit_min and atomic_write_unit_max at the same size, the
>> extent alignment size, which should be easier to support. We may need to
>> implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
>> have no proposed solution for atomic write buffered IO for bdev file
>> operations, but I know of no requirement for this.
> 
> The thing is that there's no requirement for an interface as complex as
> the one you're proposing here.  I've talked to a few database people
> and all they want is to increase the untorn write boundary from "one
> disc block" to one database block, typically 8kB or 16kB.
> 
> So they would be quite happy with a much simpler interface where they
> set the inode block size at inode creation time,

We want to support untorn writes for bdev file operations - how can we 
set the inode block size there? Currently it is based on logical block size.

> and then all writes to
> that inode were guaranteed to be untorn.  This would also be simpler to
> implement for buffered writes.

We did consider that. Won't that lead to the possibility of breaking
existing applications which want to do regular unaligned writes to these
files? We do know that mysql/innodb does have a "compressed" mode of
operation, which involves regular writes to the same file that also
wants untorn writes.

Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we 
always want these for such a file?

We saw untorn writes as not being a property of the file or even the 
inode itself, but rather an attribute of the specific IO being issued 
from the userspace application.

> 
> Who's asking for this more complex interface?

It's not a case of someone specifically asking for this interface. This
is just a proposal to satisfy the userspace requirement to do untorn writes
in a generic way.

From a user point-of-view, untorn writes for a regular file can be
enabled for writes up to a specific size* with the FS_IOC_SETFLAGS API. Then
they need to follow alignment and size rules for issuing untorn writes, but
they would always need to do this. In addition, the user may still issue
regular (tearable) writes to the file.

* I think that we could change this to only allow writes for that 
specific size, which was my proposal for buffered IO.
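
As a rough illustration of this enablement flow (illustrative only, not
from this patchset or the XFS series): FS_IOC_GETFLAGS/FS_IOC_SETFLAGS are
the existing inode-flag ioctls, while the FS_ATOMIC_FL bit below is a
purely hypothetical placeholder for whatever flag the filesystem series
ends up defining.

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Hypothetical flag bit; the real name/value would come from the FS series */
#define FS_ATOMIC_FL	0x0

static int enable_untorn_writes(const char *path)
{
	int fd, attr;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Read-modify-write the inode attribute flags */
	if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0)
		goto out_err;
	attr |= FS_ATOMIC_FL;
	if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0)
		goto out_err;
	close(fd);
	return 0;

out_err:
	close(fd);
	return -1;
}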

Thanks,
John
Dave Chinner March 27, 2024, 8:31 p.m. UTC | #3
On Wed, Mar 27, 2024 at 03:50:07AM +0000, Matthew Wilcox wrote:
> On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote:
> > The goal here is to provide an interface that allows applications use
> > application-specific block sizes larger than logical block size
> > reported by the storage device or larger than filesystem block size as
> > reported by stat().
> > 
> > With this new interface, application blocks will never be torn or
> > fractured when written. For a power fail, for each individual application
> > block, all or none of the data to be written. A racing atomic write and
> > read will mean that the read sees all the old data or all the new data,
> > but never a mix of old and new.
> > 
> > Three new fields are added to struct statx - atomic_write_unit_min,
> > atomic_write_unit_max, and atomic_write_segments_max. For each atomic
> > individual write, the total length of a write must be a between
> > atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
> > power-of-2. The write must also be at a natural offset in the file
> > wrt the write length. For pwritev2, iovcnt is limited by
> > atomic_write_segments_max.
> > 
> > There has been some discussion on supporting buffered IO and whether the
> > API is suitable, like:
> > https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/
> > 
> > Specifically the concern is that supporting a range of sizes of atomic IO
> > in the pagecache is complex to support. For this, my idea is that FSes can
> > fix atomic_write_unit_min and atomic_write_unit_max at the same size, the
> > extent alignment size, which should be easier to support. We may need to
> > implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
> > have no proposed solution for atomic write buffered IO for bdev file
> > operations, but I know of no requirement for this.
> 
> The thing is that there's no requirement for an interface as complex as
> the one you're proposing here.  I've talked to a few database people
> and all they want is to increase the untorn write boundary from "one
> disc block" to one database block, typically 8kB or 16kB.
> 
> So they would be quite happy with a much simpler interface where they
> set the inode block size at inode creation time, and then all writes to
> that inode were guaranteed to be untorn.  This would also be simpler to
> implement for buffered writes.

You're conflating filesystem functionality that applications will use
with hardware and block-layer enablement that filesystems and
filesystem utilities need to configure the filesystem in ways that
allow users to make use of atomic write capability of the hardware.

The block layer functionality needs to export everything that the
hardware can do and filesystems will make use of. The actual
application usage and setup of atomic writes at the filesystem/page
cache layer is a separate problem.  i.e. The block layer interfaces
need only support direct IO and expose limits for issuing atomic
direct IO, and nothing more. All the more complex stuff to make it
"easy to use" is filesystem level functionality and completely
outside the scope of this patchset....

-Dave.
Matthew Wilcox (Oracle) April 4, 2024, 4:48 p.m. UTC | #4
On Wed, Mar 27, 2024 at 01:37:41PM +0000, John Garry wrote:
> On 27/03/2024 03:50, Matthew Wilcox wrote:
> > On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote:
> > > The goal here is to provide an interface that allows applications use
> > > application-specific block sizes larger than logical block size
> > > reported by the storage device or larger than filesystem block size as
> > > reported by stat().
> > > 
> > > With this new interface, application blocks will never be torn or
> > > fractured when written. For a power fail, for each individual application
> > > block, all or none of the data to be written. A racing atomic write and
> > > read will mean that the read sees all the old data or all the new data,
> > > but never a mix of old and new.
> > > 
> > > Three new fields are added to struct statx - atomic_write_unit_min,
> > > atomic_write_unit_max, and atomic_write_segments_max. For each atomic
> > > individual write, the total length of a write must be a between
> > > atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
> > > power-of-2. The write must also be at a natural offset in the file
> > > wrt the write length. For pwritev2, iovcnt is limited by
> > > atomic_write_segments_max.
> > > 
> > > There has been some discussion on supporting buffered IO and whether the
> > > API is suitable, like:
> > > https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/
> > > 
> > > Specifically the concern is that supporting a range of sizes of atomic IO
> > > in the pagecache is complex to support. For this, my idea is that FSes can
> > > fix atomic_write_unit_min and atomic_write_unit_max at the same size, the
> > > extent alignment size, which should be easier to support. We may need to
> > > implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
> > > have no proposed solution for atomic write buffered IO for bdev file
> > > operations, but I know of no requirement for this.
> > 
> > The thing is that there's no requirement for an interface as complex as
> > the one you're proposing here.  I've talked to a few database people
> > and all they want is to increase the untorn write boundary from "one
> > disc block" to one database block, typically 8kB or 16kB.
> > 
> > So they would be quite happy with a much simpler interface where they
> > set the inode block size at inode creation time,
> 
> We want to support untorn writes for bdev file operations - how can we set
> the inode block size there? Currently it is based on logical block size.

ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
think we can remove that limitation with the bs>PS patches.
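
For reference, a minimal sketch of that route using the existing BLKBSZSET
ioctl (nothing here is from this series; the caller needs CAP_SYS_ADMIN and
today the kernel rejects sizes above PAGE_SIZE):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	int fd, size = 16384;	/* larger than PAGE_SIZE on x86 */

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDWR);
	if (fd < 0)
		return 1;
	/* Set the bdev's soft block size; EINVAL expected for 16k today */
	if (ioctl(fd, BLKBSZSET, &size) < 0)
		perror("BLKBSZSET");
	close(fd);
	return 0;
}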

> > and then all writes to
> > that inode were guaranteed to be untorn.  This would also be simpler to
> > implement for buffered writes.
> 
> We did consider that. Won't that lead to the possibility of breaking
> existing applications which want to do regular unaligned writes to these
> files? We do know that mysql/innodb does have some "compressed" mode of
> operation, which involves regular writes to the same file which wants untorn
> writes.

If you're talking about "regular unaligned buffered writes", then that
won't break.  If you cross a folio boundary, the result may be torn,
but if you're crossing a block boundary you expect that.

> Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we
> always want these for such a file?

Do untorn writes actually exist in SCSI?  I was under the impression
nobody had actually implemented them in SCSI hardware.

> We saw untorn writes as not being a property of the file or even the inode
> itself, but rather an attribute of the specific IO being issued from the
> userspace application.

The problem is that keeping track of that is expensive for buffered
writes.  It's a model that only works for direct IO.  Arguably we
could make it work for O_SYNC buffered IO, but that'll require some
surgery.

> > Who's asking for this more complex interface?
> 
> It's not a case of someone specifically asking for this interface. This is
> just a proposal to satisfy userspace requirement to do untorn writes in a
> generic way.
> 
> From a user point-of-view, untorn writes for a regular file can be enabled
> for up to a specific size* with FS_IOC_SETFLAGS API. Then they need to
> follow alignment and size rules for issuing untorn writes, but they would
> always need to do this. In addition, the user may still issue regular
> (tearable) writes to the file.
> 
> * I think that we could change this to only allow writes for that specific
> size, which was my proposal for buffered IO.
> 
> Thanks,
> John
>
Kent Overstreet April 5, 2024, 6:14 a.m. UTC | #5
On Wed, Mar 27, 2024 at 03:50:07AM +0000, Matthew Wilcox wrote:
> On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote:
> > The goal here is to provide an interface that allows applications use
> > application-specific block sizes larger than logical block size
> > reported by the storage device or larger than filesystem block size as
> > reported by stat().
> > 
> > With this new interface, application blocks will never be torn or
> > fractured when written. For a power fail, for each individual application
> > block, all or none of the data to be written. A racing atomic write and
> > read will mean that the read sees all the old data or all the new data,
> > but never a mix of old and new.
> > 
> > Three new fields are added to struct statx - atomic_write_unit_min,
> > atomic_write_unit_max, and atomic_write_segments_max. For each atomic
> > individual write, the total length of a write must be a between
> > atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
> > power-of-2. The write must also be at a natural offset in the file
> > wrt the write length. For pwritev2, iovcnt is limited by
> > atomic_write_segments_max.
> > 
> > There has been some discussion on supporting buffered IO and whether the
> > API is suitable, like:
> > https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/
> > 
> > Specifically the concern is that supporting a range of sizes of atomic IO
> > in the pagecache is complex to support. For this, my idea is that FSes can
> > fix atomic_write_unit_min and atomic_write_unit_max at the same size, the
> > extent alignment size, which should be easier to support. We may need to
> > implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
> > have no proposed solution for atomic write buffered IO for bdev file
> > operations, but I know of no requirement for this.
> 
> The thing is that there's no requirement for an interface as complex as
> the one you're proposing here.  I've talked to a few database people
> and all they want is to increase the untorn write boundary from "one
> disc block" to one database block, typically 8kB or 16kB.
> 
> So they would be quite happy with a much simpler interface where they
> set the inode block size at inode creation time, and then all writes to
> that inode were guaranteed to be untorn.  This would also be simpler to
> implement for buffered writes.
> 
> Who's asking for this more complex interface?

I get the impression the atomic writes stuff has suffered from too
/much/ review and too many maintainers asking for and demanding all
their different must-haves.
John Garry April 5, 2024, 10:06 a.m. UTC | #6
On 04/04/2024 17:48, Matthew Wilcox wrote:
>>> The thing is that there's no requirement for an interface as complex as
>>> the one you're proposing here.  I've talked to a few database people
>>> and all they want is to increase the untorn write boundary from "one
>>> disc block" to one database block, typically 8kB or 16kB.
>>>
>>> So they would be quite happy with a much simpler interface where they
>>> set the inode block size at inode creation time,
>> We want to support untorn writes for bdev file operations - how can we set
>> the inode block size there? Currently it is based on logical block size.
> ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
> think we can remove that limitation with the bs>PS patches.

We want a consistent interface for bdev and regular files, so that would
need to work for FSes also. FSes (XFS) work based on a homogeneous inode
blocksize, which is the SB blocksize.

Furthermore, we would seem to be mixing different concepts here. 
Currently in Linux we say that a logical block size write is atomic. In 
the block layer, we split BIOs on LBS boundaries. iomap creates BIOs 
based on LBS boundaries. But writing a FS block is not always guaranteed 
to be atomic, as far as I'm concerned. So just increasing the inode 
block size / FS block size does not really change anything, in itself.

> 
>>> and then all writes to
>>> that inode were guaranteed to be untorn.  This would also be simpler to
>>> implement for buffered writes.
>> We did consider that. Won't that lead to the possibility of breaking
>> existing applications which want to do regular unaligned writes to these
>> files? We do know that mysql/innodb does have some "compressed" mode of
>> operation, which involves regular writes to the same file which wants untorn
>> writes.
> If you're talking about "regular unaligned buffered writes", then that
> won't break.  If you cross a folio boundary, the result may be torn,
> but if you're crossing a block boundary you expect that.
> 
>> Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we
>> always want these for such a file?
> Do untorn writes actually exist in SCSI?  I was under the impression
> nobody had actually implemented them in SCSI hardware.

I know that some SCSI targets actually atomically write data in chunks > 
LBS. Obviously atomic vs non-atomic performance is a moot point there, 
as data is implicitly always atomically written.

We actually have a mysql/innodb port of this API working on such a SCSI
target.

However I am not sure about atomic write support for other SCSI targets.

> 
>> We saw untorn writes as not being a property of the file or even the inode
>> itself, but rather an attribute of the specific IO being issued from the
>> userspace application.
> The problem is that keeping track of that is expensive for buffered
> writes.  It's a model that only works for direct IO.  Arguably we
> could make it work for O_SYNC buffered IO, but that'll require some
> surgery.

To me, O_ATOMIC would be required for buffered atomic write IO, as we
want fixed-size IO, so that would mean no mixing of atomic and
non-atomic IO.

Thanks,
John
Kent Overstreet April 5, 2024, 10:20 a.m. UTC | #7
On Thu, Mar 28, 2024 at 07:31:45AM +1100, Dave Chinner wrote:
> On Wed, Mar 27, 2024 at 03:50:07AM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote:
> > > The goal here is to provide an interface that allows applications use
> > > application-specific block sizes larger than logical block size
> > > reported by the storage device or larger than filesystem block size as
> > > reported by stat().
> > > 
> > > With this new interface, application blocks will never be torn or
> > > fractured when written. For a power fail, for each individual application
> > > block, all or none of the data to be written. A racing atomic write and
> > > read will mean that the read sees all the old data or all the new data,
> > > but never a mix of old and new.
> > > 
> > > Three new fields are added to struct statx - atomic_write_unit_min,
> > > atomic_write_unit_max, and atomic_write_segments_max. For each atomic
> > > individual write, the total length of a write must be a between
> > > atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
> > > power-of-2. The write must also be at a natural offset in the file
> > > wrt the write length. For pwritev2, iovcnt is limited by
> > > atomic_write_segments_max.
> > > 
> > > There has been some discussion on supporting buffered IO and whether the
> > > API is suitable, like:
> > > https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@casper.infradead.org/
> > > 
> > > Specifically the concern is that supporting a range of sizes of atomic IO
> > > in the pagecache is complex to support. For this, my idea is that FSes can
> > > fix atomic_write_unit_min and atomic_write_unit_max at the same size, the
> > > extent alignment size, which should be easier to support. We may need to
> > > implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
> > > have no proposed solution for atomic write buffered IO for bdev file
> > > operations, but I know of no requirement for this.
> > 
> > The thing is that there's no requirement for an interface as complex as
> > the one you're proposing here.  I've talked to a few database people
> > and all they want is to increase the untorn write boundary from "one
> > disc block" to one database block, typically 8kB or 16kB.
> > 
> > So they would be quite happy with a much simpler interface where they
> > set the inode block size at inode creation time, and then all writes to
> > that inode were guaranteed to be untorn.  This would also be simpler to
> > implement for buffered writes.
> 
> You're conflating filesystem functionality that applications will use
> with hardware and block-layer enablement that filesystems and
> filesystem utilities need to configure the filesystem in ways that
> allow users to make use of atomic write capability of the hardware.
> 
> The block layer functionality needs to export everything that the
> hardware can do and filesystems will make use of. The actual
> application usage and setup of atomic writes at the filesystem/page
> cache layer is a separate problem.  i.e. The block layer interfaces
> need only support direct IO and expose limits for issuing atomic
> direct IO, and nothing more. All the more complex stuff to make it
> "easy to use" is filesystem level functionality and completely
> outside the scope of this patchset....

A CoW filesystem can implement atomic writes without any block device
support. It seems to me that might have been the easier place to start -
start by getting the APIs right, then do all the plumbing for efficient
untorn writes on non-CoW filesystems...
John Garry April 5, 2024, 10:55 a.m. UTC | #8
On 05/04/2024 11:20, Kent Overstreet wrote:
>>> The thing is that there's no requirement for an interface as complex as
>>> the one you're proposing here.  I've talked to a few database people
>>> and all they want is to increase the untorn write boundary from "one
>>> disc block" to one database block, typically 8kB or 16kB.
>>>
>>> So they would be quite happy with a much simpler interface where they
>>> set the inode block size at inode creation time, and then all writes to
>>> that inode were guaranteed to be untorn.  This would also be simpler to
>>> implement for buffered writes.
>> You're conflating filesystem functionality that applications will use
>> with hardware and block-layer enablement that filesystems and
>> filesystem utilities need to configure the filesystem in ways that
>> allow users to make use of atomic write capability of the hardware.
>>
>> The block layer functionality needs to export everything that the
>> hardware can do and filesystems will make use of. The actual
>> application usage and setup of atomic writes at the filesystem/page
>> cache layer is a separate problem.  i.e. The block layer interfaces
>> need only support direct IO and expose limits for issuing atomic
>> direct IO, and nothing more. All the more complex stuff to make it
>> "easy to use" is filesystem level functionality and completely
>> outside the scope of this patchset....
> A CoW filesystem can implement atomic writes without any block device
> support. It seems to me that might have been the easier place to start -
> start by getting the APIs right, then do all the plumbing for efficient
> untorn writes on non CoW filesystems...

03/10 and 04/10 in this series define the user API, i.e. RWF_ATOMIC and 
statx updates.

Any filesystem-specific changes - like in 
https://lore.kernel.org/linux-xfs/20240304130428.13026-1-john.g.garry@oracle.com/ 
- are just for enabling this API for that filesystem.
Luis Chamberlain April 8, 2024, 5:50 p.m. UTC | #9
On Fri, Apr 05, 2024 at 11:06:00AM +0100, John Garry wrote:
> On 04/04/2024 17:48, Matthew Wilcox wrote:
> > > > The thing is that there's no requirement for an interface as complex as
> > > > the one you're proposing here.  I've talked to a few database people
> > > > and all they want is to increase the untorn write boundary from "one
> > > > disc block" to one database block, typically 8kB or 16kB.
> > > > 
> > > > So they would be quite happy with a much simpler interface where they
> > > > set the inode block size at inode creation time,
> > > We want to support untorn writes for bdev file operations - how can we set
> > > the inode block size there? Currently it is based on logical block size.
> > ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
> > think we can remove that limitation with the bs>PS patches.

I can say a bit more on this, as I explored that. Essentially Matthew,
yes, I got that to work, but it requires a set of different patches. We have
what we tried, and then based on feedback from Chinner we have a
direction on what to try next. The last effort on that front was having the
iomap aops for bdev be used and lifting the PAGE_SIZE limit up to the
page cache limits. The crux on that front was that we end up requiring
disabling BUFFER_HEAD and that is pretty limiting, so my old
implementation had dynamic aops so as to let us use the buffer-head aops
only when using filesystems which require it and use iomap aops
otherwise. But as Chinner noted, we learned through the DAX experience
that's not a route we want to try again, so the real solution is to
extend the iomap bdev aops code with buffer-head compatibility.

> We want a consistent interface for bdev and regular files, so that would
> need to work for FSes also. FSes(XFS) work based on a homogeneous inode
> blocksize, which is the SB blocksize.

There are two aspects to this and it is important to differentiate them.

1) LBA formats used
2) When a large atomic is supported and you want to use smaller LBA formats

When the LBA format, and so the logical block size, is say 16k, the LBS
patches together with the above-mentioned patches enable IOs to the block
device to be atomic at, say, 16k.

But to remain flexible we want to support a world where 512-byte and 4k
LBA formats are still used, and you *optionally* want to leverage say
16k atomics. Today's block device topology enables this only with a knob
to userspace to allow userspace to override the sector size for the
filesystem. In practice today, if you want to use 4k IOs you just format
the drive to use a 4k LBA format. However, an alternative, at least for
NVMe today, is to support say a 16k atomic with a 4k or 512B LBA format.
This essentially *lifts* the physical block size to 16k while keeping
the logical block size at the LBA format, so 4k or 512 bytes. What you
*could* do with this, from the userspace side of things, is at mkfs time
*opt* in to using a larger sector size, up to the physical block size.
When you do this, the block device still has a logical block size of the
LBA format, but all IOs the filesystem issues use the larger sector
size you opted in for.
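
For illustration, the two sizes discussed above are already visible to
userspace via the existing BLKSSZGET (logical) and BLKPBSZGET (physical)
ioctls; a minimal sketch, not part of this series:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	unsigned int lbs = 0, pbs = 0;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	ioctl(fd, BLKSSZGET, &lbs);	/* logical block size, i.e. the LBA format */
	ioctl(fd, BLKPBSZGET, &pbs);	/* physical block size, e.g. 16k on such a device */
	printf("logical %u physical %u\n", lbs, pbs);
	close(fd);
	return 0;
}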

I suspect this is a use case where perhaps the max folio order could be
set for the bdev in the future, with the logical block size as the min
order and the large atomic as the max order.

> Furthermore, we would seem to be mixing different concepts here. Currently
> in Linux we say that a logical block size write is atomic. In the block
> layer, we split BIOs on LBS boundaries. iomap creates BIOs based on LBS
> boundaries. But writing a FS block is not always guaranteed to be atomic, as
> far as I'm concerned.

True. To be clear, the above paragraph refers to LBS as the logical block size.

However, when a filesystem sets the min order, it should be respected.
I agree that when you don't set the sector size to 16k you are not forcing the
filesystem to use 16k IOs; the metadata can still be 4k. But when you
use a 16k sector size, 16k IOs should be respected by the
filesystem.

Do we break BIOs to below a min order if the sector size is also set to
16k? I haven't seen that, and it's unclear when or how that could happen.

At least for NVMe we don't need to yell at a device to inform it that we
want a 16k IO issued to it to be atomic; if we read that it has the
capability for it, it just does it. The IO verification can be done with
blkalgn [0].

Does SCSI *require* 16k atomic prep work, or can it be done implicitly?
Does it need WRITE_ATOMIC_16?

[0] https://github.com/dagmcr/bcc/tree/blkalgn

> So just increasing the inode block size / FS block size does not
> really change anything, in itself.

If we're breaking up IOs when a min order is set for an inode, that
would need to be looked into, but we're not seeing that.

> > Do untorn writes actually exist in SCSI?  I was under the impression
> > nobody had actually implemented them in SCSI hardware.
> 
> I know that some SCSI targets actually atomically write data in chunks >
> LBS. Obviously atomic vs non-atomic performance is a moot point there, as
> data is implicitly always atomically written.
> 
> We actually have an mysql/innodb port of this API working on such a SCSI
> target.

I suspect IO verification with the above tool would show the same if you
use a filesystem with a larger sector size set too, and you would not have
to make any changes to userspace other than at filesystem creation, with
say mkfs.xfs params of -b size=16k -s size=16k.

> However I am not sure about atomic write support for other SCSI targets.

Good to know!

> > > We saw untorn writes as not being a property of the file or even the inode
> > > itself, but rather an attribute of the specific IO being issued from the
> > > userspace application.
> > The problem is that keeping track of that is expensive for buffered
> > writes.  It's a model that only works for direct IO.  Arguably we
> > could make it work for O_SYNC buffered IO, but that'll require some
> > surgery.
> 
> To me, O_ATOMIC would be required for buffered atomic writes IO, as we want
> a fixed-sized IO, so that would mean no mixing of atomic and non-atomic IO.

Would using the same min and max order for the inode work instead?

  Luis
Matthew Wilcox (Oracle) April 10, 2024, 4:05 a.m. UTC | #10
On Mon, Apr 08, 2024 at 10:50:47AM -0700, Luis Chamberlain wrote:
> On Fri, Apr 05, 2024 at 11:06:00AM +0100, John Garry wrote:
> > On 04/04/2024 17:48, Matthew Wilcox wrote:
> > > > > The thing is that there's no requirement for an interface as complex as
> > > > > the one you're proposing here.  I've talked to a few database people
> > > > > and all they want is to increase the untorn write boundary from "one
> > > > > disc block" to one database block, typically 8kB or 16kB.
> > > > > 
> > > > > So they would be quite happy with a much simpler interface where they
> > > > > set the inode block size at inode creation time,
> > > > We want to support untorn writes for bdev file operations - how can we set
> > > > the inode block size there? Currently it is based on logical block size.
> > > ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
> > > think we can remove that limitation with the bs>PS patches.
> 
> I can say a bit more on this, as I explored that. Essentially Matthew,
> yes, I got that to work but it requires a set of different patches. We have
> what we tried and then based on feedback from Chinner we have a
> direction on what to try next. The last effort on that front was having the
> iomap aops for bdev be used and lifting the PAGE_SIZE limit up to the
> page cache limits. The crux on that front was that we end requiring
> disabling BUFFER_HEAD and that is pretty limitting, so my old
> implementation had dynamic aops so to let us use the buffer-head aops
> only when using filesystems which require it and use iomap aops
> otherwise. But as Chinner noted we learned through the DAX experience
> that's not a route we want to again try, so the real solution is to
> extend iomap bdev aops code with buffer-head compatibility.

Have you tried just using the buffer_head code?  I think you heard bad
advice at last LSFMM.  Since then I've landed a bunch of patches which
remove PAGE_SIZE assumptions throughout the buffer_head code, and while
I haven't tried it, it might work.  And it might be easier to make work
than adding more BH hacks to the iomap code.

A quick audit for problems ...

__getblk_slow:
       if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
                        (size < 512 || size > PAGE_SIZE))) {

cont_expand_zero (not used by bdev code)
cont_write_begin (ditto)

That's all I spot from a quick grep for PAGE, offset_in_page() and kmap.

You can't do a lot of buffer_heads per folio, because you'll overrun
        struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
in block_read_full_folio(), but you can certainly do _one_ buffer_head
per folio, and that's all you need for bs>PS.

> I suspect this is a use case where perhaps the max folio order could be
> set for the bdev in the future, the logical block size the min order,
> and max order the large atomic.

No, that's not what we want to do at all!  Minimum writeback size needs
to be the atomic size, otherwise we have to keep track of which writes
are atomic and which ones aren't.  So, just set the logical block size
to the atomic size, and we're done.
Hannes Reinecke April 10, 2024, 6:20 a.m. UTC | #11
On 4/10/24 06:05, Matthew Wilcox wrote:
> On Mon, Apr 08, 2024 at 10:50:47AM -0700, Luis Chamberlain wrote:
>> On Fri, Apr 05, 2024 at 11:06:00AM +0100, John Garry wrote:
>>> On 04/04/2024 17:48, Matthew Wilcox wrote:
>>>>>> The thing is that there's no requirement for an interface as complex as
>>>>>> the one you're proposing here.  I've talked to a few database people
>>>>>> and all they want is to increase the untorn write boundary from "one
>>>>>> disc block" to one database block, typically 8kB or 16kB.
>>>>>>
>>>>>> So they would be quite happy with a much simpler interface where they
>>>>>> set the inode block size at inode creation time,
>>>>> We want to support untorn writes for bdev file operations - how can we set
>>>>> the inode block size there? Currently it is based on logical block size.
>>>> ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
>>>> think we can remove that limitation with the bs>PS patches.
>>
>> I can say a bit more on this, as I explored that. Essentially Matthew,
>> yes, I got that to work but it requires a set of different patches. We have
>> what we tried and then based on feedback from Chinner we have a
>> direction on what to try next. The last effort on that front was having the
>> iomap aops for bdev be used and lifting the PAGE_SIZE limit up to the
>> page cache limits. The crux on that front was that we end requiring
>> disabling BUFFER_HEAD and that is pretty limitting, so my old
>> implementation had dynamic aops so to let us use the buffer-head aops
>> only when using filesystems which require it and use iomap aops
>> otherwise. But as Chinner noted we learned through the DAX experience
>> that's not a route we want to again try, so the real solution is to
>> extend iomap bdev aops code with buffer-head compatibility.
> 
> Have you tried just using the buffer_head code?  I think you heard bad
> advice at last LSFMM.  Since then I've landed a bunch of patches which
> remove PAGE_SIZE assumptions throughout the buffer_head code, and while
> I haven't tried it, it might work.  And it might be easier to make work
> than adding more BH hacks to the iomap code.
> 
> A quick audit for problems ...
> 
> __getblk_slow:
>         if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
>                          (size < 512 || size > PAGE_SIZE))) {
> 
> cont_expand_zero (not used by bdev code)
> cont_write_begin (ditto)
> 
> That's all I spot from a quick grep for PAGE, offset_in_page() and kmap.
> 
> You can't do a lot of buffer_heads per folio, because you'll overrun
>          struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> in block_read_full_folio(), but you can certainly do _one_ buffer_head
> per folio, and that's all you need for bs>PS.
> 
Indeed; I got a patch here to just restart the submission loop if one
reaches the end of the array. But maybe submitting one bh at a time and
using plugging would achieve the same thing. Let's see.

>> I suspect this is a use case where perhaps the max folio order could be
>> set for the bdev in the future, the logical block size the min order,
>> and max order the large atomic.
> 
> No, that's not what we want to do at all!  Minimum writeback size needs
> to be the atomic size, otherwise we have to keep track of which writes
> are atomic and which ones aren't.  So, just set the logical block size
> to the atomic size, and we're done.
> 
+1. My thoughts all along.

Cheers,

Hannes
John Garry April 10, 2024, 8:34 a.m. UTC | #12
On 08/04/2024 18:50, Luis Chamberlain wrote:
> I agree that when you don't set the sector size to 16k you are not forcing the
> filesystem to use 16k IOs, the metadata can still be 4k. But when you
> use a 16k sector size, the 16k IOs should be respected by the
> filesystem.
> 
> Do we break BIOs to below a min order if the sector size is also set to
> 16k?  I haven't seen that and its unclear when or how that could happen.

AFAICS, the only guarantee is to not split below LBS.

> 
> At least for NVMe we don't need to yell to a device to inform it we want
> a 16k IO issued to it to be atomic, if we read that it has the
> capability for it, it just does it. The IO verificaiton can be done with
> blkalgn [0].
> 
> Does SCSI*require*  an 16k atomic prep work, or can it be done implicitly?
> Does it need WRITE_ATOMIC_16?

The physical block size is what we can implicitly write atomically. So if 
you have a 4K PBS and 512B LBS, then WRITE_ATOMIC_16 would be required 
to write 16KB atomically.

> 
> [0] https://github.com/dagmcr/bcc/tree/blkalgn
> 
>> So just increasing the inode block size / FS block size does not
>> really change anything, in itself.
> If we're breaking up IOs when a min order is set for an inode, that
> would need to be looked into, but we're not seeing that.

In practice you won't see it, but I am talking about guarantees not to 
see it.

> 
>>> Do untorn writes actually exist in SCSI?  I was under the impression
>>> nobody had actually implemented them in SCSI hardware.
>> I know that some SCSI targets actually atomically write data in chunks >
>> LBS. Obviously atomic vs non-atomic performance is a moot point there, as
>> data is implicitly always atomically written.
>>
>> We actually have an mysql/innodb port of this API working on such a SCSI
>> target.
> I suspect IO verification with the above tool should prove to show the
> same if you use a filesystem with a larger sector size set too, and you
> just would not have to do any changes to userspace other than the
> filesystem creation with say mkfs.xfs params of -b size=16k -s size=16k

Ok, I see

> 
>> However I am not sure about atomic write support for other SCSI targets.
> Good to know!
> 
>>>> We saw untorn writes as not being a property of the file or even the inode
>>>> itself, but rather an attribute of the specific IO being issued from the
>>>> userspace application.
>>> The problem is that keeping track of that is expensive for buffered
>>> writes.  It's a model that only works for direct IO.  Arguably we
>>> could make it work for O_SYNC buffered IO, but that'll require some
>>> surgery.
>> To me, O_ATOMIC would be required for buffered atomic writes IO, as we want
>> a fixed-sized IO, so that would mean no mixing of atomic and non-atomic IO.
> Would using the same min and max order for the inode work instead?

Maybe, I would need to check further.

Thanks,
John
Luis Chamberlain April 11, 2024, 12:38 a.m. UTC | #13
On Wed, Apr 10, 2024 at 08:20:37AM +0200, Hannes Reinecke wrote:
> On 4/10/24 06:05, Matthew Wilcox wrote:
> > On Mon, Apr 08, 2024 at 10:50:47AM -0700, Luis Chamberlain wrote:
> > > On Fri, Apr 05, 2024 at 11:06:00AM +0100, John Garry wrote:
> > > > On 04/04/2024 17:48, Matthew Wilcox wrote:
> > > > > > > The thing is that there's no requirement for an interface as complex as
> > > > > > > the one you're proposing here.  I've talked to a few database people
> > > > > > > and all they want is to increase the untorn write boundary from "one
> > > > > > > disc block" to one database block, typically 8kB or 16kB.
> > > > > > > 
> > > > > > > So they would be quite happy with a much simpler interface where they
> > > > > > > set the inode block size at inode creation time,
> > > > > > We want to support untorn writes for bdev file operations - how can we set
> > > > > > the inode block size there? Currently it is based on logical block size.
> > > > > ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
> > > > > think we can remove that limitation with the bs>PS patches.
> > > 
> > > I can say a bit more on this, as I explored that. Essentially Matthew,
> > > yes, I got that to work but it requires a set of different patches. We have
> > > what we tried and then based on feedback from Chinner we have a
> > > direction on what to try next. The last effort on that front was having the
> > > iomap aops for bdev be used and lifting the PAGE_SIZE limit up to the
> > > page cache limits. The crux on that front was that we end requiring
> > > disabling BUFFER_HEAD and that is pretty limitting, so my old
> > > implementation had dynamic aops so to let us use the buffer-head aops
> > > only when using filesystems which require it and use iomap aops
> > > otherwise. But as Chinner noted we learned through the DAX experience
> > > that's not a route we want to again try, so the real solution is to
> > > extend iomap bdev aops code with buffer-head compatibility.
> > 
> > Have you tried just using the buffer_head code?  I think you heard bad
> > advice at last LSFMM.  Since then I've landed a bunch of patches which
> > remove PAGE_SIZE assumptions throughout the buffer_head code, and while
> > I haven't tried it, it might work.  And it might be easier to make work
> > than adding more BH hacks to the iomap code.
> > 
> > A quick audit for problems ...
> > 
> > __getblk_slow:
> >         if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
> >                          (size < 512 || size > PAGE_SIZE))) {
> > 
> > cont_expand_zero (not used by bdev code)
> > cont_write_begin (ditto)
> > 
> > That's all I spot from a quick grep for PAGE, offset_in_page() and kmap.
> > 
> > You can't do a lot of buffer_heads per folio, because you'll overrun
> >          struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> > in block_read_full_folio(), but you can certainly do _one_ buffer_head
> > per folio, and that's all you need for bs>PS.
> > 
> Indeed; I got a patch here to just restart the submission loop if one
> reaches the end of the array. But maybe submitting one bh at a time and
> using plugging should achieve that same thing. Let's see.

That's great to hear. What about a target filesystem? Without a
buffer-head filesystem to test, I'm not sure we'd get enough test
coverage.

The block device cache isn't exactly a great filesystem target for testing
correctness.

> > > I suspect this is a use case where perhaps the max folio order could be
> > > set for the bdev in the future, the logical block size the min order,
> > > and max order the large atomic.
> > 
> > No, that's not what we want to do at all!  Minimum writeback size needs
> > to be the atomic size, otherwise we have to keep track of which writes
> > are atomic and which ones aren't.  So, just set the logical block size
> > to the atomic size, and we're done.
> > 
> +1. My thoughts all along.

Oh, hrm yes, but let's test it out then...

  Luis
Luis Chamberlain April 11, 2024, 7:07 p.m. UTC | #14
On Wed, Apr 10, 2024 at 09:34:36AM +0100, John Garry wrote:
> On 08/04/2024 18:50, Luis Chamberlain wrote:
> > I agree that when you don't set the sector size to 16k you are not forcing the
> > filesystem to use 16k IOs, the metadata can still be 4k. But when you
> > use a 16k sector size, the 16k IOs should be respected by the
> > filesystem.
> > 
> > Do we break BIOs to below a min order if the sector size is also set to
> > 16k?  I haven't seen that and its unclear when or how that could happen.
> 
> AFAICS, the only guarantee is to not split below LBS.

It would be odd to split a BIO given an inode requirement size spelled
out, but indeed I don't recall verifying this guarantee.

> > At least for NVMe we don't need to yell to a device to inform it we want
> > a 16k IO issued to it to be atomic, if we read that it has the
> > capability for it, it just does it. The IO verificaiton can be done with
> > blkalgn [0].
> > 
> > Does SCSI*require*  an 16k atomic prep work, or can it be done implicitly?
> > Does it need WRITE_ATOMIC_16?
> 
> physical block size is what we can implicitly write atomically.

Yes, and also on flash to avoid read-modify-writes.

> So if you
> have a 4K PBS and 512B LBS, then WRITE_ATOMIC_16 would be required to write
> 16KB atomically.

Ugh. Why does SCSI require a special command for this?

Now we know what would be needed to bump the physical block size. It is
certainly a different feature, but I think it would be good to
evaluate that world too. For NVMe we don't have such special write
requirements.

I put together this kludge with the last patch series of LBS + the
bdev cache aops stuff (which, as I said before, needs an alternative
solution) and just the scsi atomics topology + physical block size
change, to easily experiment and see what would break:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20240408-lbs-scsi-kludge

Using a larger sector size works but it does not use the special scsi
atomic write.

> > > To me, O_ATOMIC would be required for buffered atomic writes IO, as we want
> > > a fixed-sized IO, so that would mean no mixing of atomic and non-atomic IO.
> > Would using the same min and max order for the inode work instead?
> 
> Maybe, I would need to check further.

I'd be happy to help review too.

  Luis
John Garry April 12, 2024, 8:15 a.m. UTC | #15
On 11/04/2024 20:07, Luis Chamberlain wrote:
>> So if you
>> have a 4K PBS and 512B LBS, then WRITE_ATOMIC_16 would be required to write
>> 16KB atomically.
> Ugh. Why does SCSI requires a special command for this?

The actual question from others is why NVMe does not have a dedicated 
command for this; see:
https://lore.kernel.org/linux-nvme/20240129062035.GB19796@lst.de/

It's a data integrity feature, and we want to know if it works properly.

> 
> Now we know what would be needed to bump the physical block size, it is
> certainly a different feature, however I think it would be good to
> evaluate that world too. For NVMe we don't have such special write
> requirements.
> 
> I put together this kludge with the last patches series of LBS + the
> bdev cache aops stuff (which as I said before needs an alternative
> solution) and just the scsi atomics topology + physical block size
> change to easily experiment to see what would break:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20240408-lbs-scsi-kludge
> 
> Using a larger sector size works but it does not use the special scsi
> atomic write.

If you are using the scsi_debug driver, then you can just pass the desired 
physblk_exp and sector_size args - they both default to 512B. Then you 
don't need to bother with the sd.c atomic stuff, which I think is what you want.

> 
>>>> To me, O_ATOMIC would be required for buffered atomic writes IO, as we want
>>>> a fixed-sized IO, so that would mean no mixing of atomic and non-atomic IO.
>>> Would using the same min and max order for the inode work instead?
>> Maybe, I would need to check further.
> I'd be happy to help review too.

Yeah, I'm starting to think that the same min and max order for the inode 
would make life easier, as we don't need to deal with the scenario of an 
atomic write to a folio larger than the atomic write size.

Thanks,
John
Luis Chamberlain April 12, 2024, 6:28 p.m. UTC | #16
+ Dan,

On Fri, Apr 12, 2024 at 09:15:57AM +0100, John Garry wrote:
> On 11/04/2024 20:07, Luis Chamberlain wrote:
> > > So if you
> > > have a 4K PBS and 512B LBS, then WRITE_ATOMIC_16 would be required to write
> > > 16KB atomically.
> > Ugh. Why does SCSI requires a special command for this?
> 
> The actual question from others is why does NVMe not have a dedicated
> command for this, like:
> https://lore.kernel.org/linux-nvme/20240129062035.GB19796@lst.de/

Because we don't really need it for the hardware that supports it, provided
the host does the respective topology checks. For instance, the respective
checks for NVMe are that atomic writes respect AWUN as the cap, as the drive
can already go up to AWUN, and the limit for power-fail is implicit in
checking AWUPF / NAWUPF. The alignment constraints can be dealt with
by the host software.

> It's a data integrity feature, and we want to know if it works properly.

For drives which already support this, integrity is already ensured for
you. An NVMe-specific atomic write command could be useful for
existing drives for other reasons or future uses, but it's not a requirement
for the existing use cases if the NVMe alignment / atomic limits are
respected by the host.

> > Now we know what would be needed to bump the physical block size, it is
> > certainly a different feature, however I think it would be good to
> > evaluate that world too. For NVMe we don't have such special write
> > requirements.
> > 
> > I put together this kludge with the last patches series of LBS + the
> > bdev cache aops stuff (which as I said before needs an alternative
> > solution) and just the scsi atomics topology + physical block size
> > change to easily experiment to see what would break:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20240408-lbs-scsi-kludge
> > 
> > Using a larger sector size works but it does not use the special scsi
> > atomic write.
> 
> If you are using scsi_debug driver, then you can just pass the desired
> physblk_exp and sector_size args - they both default to 512B. Then you don't
> need bother with sd.c atomic stuff, which I think is what you want.
> 
> > 
> > > > > To me, O_ATOMIC would be required for buffered atomic writes IO, as we want
> > > > > a fixed-sized IO, so that would mean no mixing of atomic and non-atomic IO.
> > > > Would using the same min and max order for the inode work instead?
> > > Maybe, I would need to check further.
> > I'd be happy to help review too.
> 
> Yeah, I'm starting to think that min and max inode would make life easier,
> as we don't need to deal with the scenario of an atomic write to a folio >
> atomic write size.

And alignment constraints could be dealt with as well.

  Luis
Luis Chamberlain April 14, 2024, 8:50 p.m. UTC | #17
On Wed, Apr 10, 2024 at 05:05:20AM +0100, Matthew Wilcox wrote:
> On Mon, Apr 08, 2024 at 10:50:47AM -0700, Luis Chamberlain wrote:
> > On Fri, Apr 05, 2024 at 11:06:00AM +0100, John Garry wrote:
> > > On 04/04/2024 17:48, Matthew Wilcox wrote:
> > > > > > The thing is that there's no requirement for an interface as complex as
> > > > > > the one you're proposing here.  I've talked to a few database people
> > > > > > and all they want is to increase the untorn write boundary from "one
> > > > > > disc block" to one database block, typically 8kB or 16kB.
> > > > > > 
> > > > > > So they would be quite happy with a much simpler interface where they
> > > > > > set the inode block size at inode creation time,
> > > > > We want to support untorn writes for bdev file operations - how can we set
> > > > > the inode block size there? Currently it is based on logical block size.
> > > > ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
> > > > think we can remove that limitation with the bs>PS patches.
> > 
> > I can say a bit more on this, as I explored that. Essentially Matthew,
> > yes, I got that to work but it requires a set of different patches. We have
> > what we tried and then based on feedback from Chinner we have a
> > direction on what to try next. The last effort on that front was having the
> > iomap aops for bdev be used and lifting the PAGE_SIZE limit up to the
> > page cache limits. The crux on that front was that we end requiring
> > disabling BUFFER_HEAD and that is pretty limitting, so my old
> > implementation had dynamic aops so to let us use the buffer-head aops
> > only when using filesystems which require it and use iomap aops
> > otherwise. But as Chinner noted we learned through the DAX experience
> > that's not a route we want to again try, so the real solution is to
> > extend iomap bdev aops code with buffer-head compatibility.
> 
> Have you tried just using the buffer_head code?  I think you heard bad
> advice at last LSFMM.  Since then I've landed a bunch of patches which
> remove PAGE_SIZE assumptions throughout the buffer_head code, and while
> I haven't tried it, it might work.  And it might be easier to make work
> than adding more BH hacks to the iomap code.

I have considered it, but the issue is that *may work* isn't good enough, and
without a test plan for buffer-heads on a real filesystem this may never
suffice. Addressing a buffer-head iomap compat layer for the block device
cache is less error prone here for now.

  Luis
Matthew Wilcox (Oracle) April 15, 2024, 9:18 p.m. UTC | #18
On Sun, Apr 14, 2024 at 01:50:16PM -0700, Luis Chamberlain wrote:
> On Wed, Apr 10, 2024 at 05:05:20AM +0100, Matthew Wilcox wrote:
> > Have you tried just using the buffer_head code?  I think you heard bad
> > advice at last LSFMM.  Since then I've landed a bunch of patches which
> > remove PAGE_SIZE assumptions throughout the buffer_head code, and while
> > I haven't tried it, it might work.  And it might be easier to make work
> > than adding more BH hacks to the iomap code.
> 
> I have considered it but the issue is that *may work* isn't good enough and
> without a test plan for buffer-heads on a real filesystem this may never
> suffice. Addressing a buffere-head iomap compat for the block device cache
> is less error prone here for now.

Is it really your position that testing the code I already wrote is
harder than writing and testing some entirely new code?  Surely the
tests are the same for both.

Besides, we aren't talking about a filesystem on top of the bdev here.
We're talking about accessing the bdev's page cache directly.
Luis Chamberlain April 16, 2024, 9:11 p.m. UTC | #19
On Mon, Apr 15, 2024 at 10:18:30PM +0100, Matthew Wilcox wrote:
> On Sun, Apr 14, 2024 at 01:50:16PM -0700, Luis Chamberlain wrote:
> > On Wed, Apr 10, 2024 at 05:05:20AM +0100, Matthew Wilcox wrote:
> > > Have you tried just using the buffer_head code?  I think you heard bad
> > > advice at last LSFMM.  Since then I've landed a bunch of patches which
> > > remove PAGE_SIZE assumptions throughout the buffer_head code, and while
> > > I haven't tried it, it might work.  And it might be easier to make work
> > > than adding more BH hacks to the iomap code.
> > 
> > I have considered it but the issue is that *may work* isn't good enough and
> > without a test plan for buffer-heads on a real filesystem this may never
> > suffice. Addressing a buffere-head iomap compat for the block device cache
> > is less error prone here for now.
> 
> Is it really your position that testing the code I already wrote is
> harder than writing and testing some entirely new code?  Surely the
> tests are the same for both.

The compat code would only allow large folios for iomap, and use
buffer-heads for non-large folios, so nothing much would change except
a special wrapper.

> Besides, we aren't talking about a filesystem on top of the bdev here.
> We're talking about accessing the bdev's page cache directly.

Sure, but my concern was the lack of testing for buffer-head large
folios, whereas for iomap we'd at least have done a ton of work
stress-testing large folios while testing XFS with it.

While the block device cache is not a proper full-blown filesystem, the
point is that, since no filesystem has been tested with buffer-heads with
large folios, it's a possible minefield waiting to explode due to lack of
testing.

Is writing a proper test plan for the block device cache code with
buffer-heads with large folios less work than writing the compat code
for the block device cache? I concede that I'm not sure.

I'm happy to try it out to see what blows up.

  Luis