[RFC,0/7] buffered block atomic writes

Message ID 20240422143923.3927601-1-john.g.garry@oracle.com (mailing list archive)

John Garry April 22, 2024, 2:39 p.m. UTC
This series introduces a proof-of-concept for buffered block atomic
writes.

There is a requirement for userspace to be able to issue a write which
will not be torn due to HW or some other failure. A solution is presented
in [0] and [1].

Those series mentioned only support atomic writes for direct IO. The
primary target of atomic (or untorn) writes is DBs like InnoDB/MySQL,
which require direct IO support. However, as mentioned in [2], there is
also demand for atomic writes from DBs which use buffered writes, like
Postgres.

The issue raised in [2] was that the API proposed is not suitable for
buffered atomic writes. Specifically, since the API permits a range of
sizes of atomic writes, it is too difficult to track in the pagecache the
geometry of atomic writes which overlap with other atomic writes of
differing sizes and alignment. In addition, tracking and handling
overlapping atomic and non-atomic writes is also difficult.
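
To make the tracking problem concrete, here is a minimal userspace sketch
(not kernel code; struct awrite and both helpers are hypothetical names
invented for illustration). It flags two atomic writes which overlap but
do not cover exactly the same bytes, which is the case the pagecache would
have to track if a range of sizes were permitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one atomic write: byte offset and length. */
struct awrite {
	uint64_t off;
	uint64_t len;
};

/* True if the two byte ranges share at least one byte. */
static bool ranges_overlap(struct awrite a, struct awrite b)
{
	assert(a.len > 0 && b.len > 0);
	return a.off < b.off + b.len && b.off < a.off + a.len;
}

/*
 * True if the writes overlap without being identical in offset and
 * length. With mixed sizes, a 16K write at offset 0 and a 4K write at
 * offset 8K conflict in this sense; with a single size per inode,
 * naturally aligned writes are either identical or disjoint.
 */
static bool geometry_conflict(struct awrite a, struct awrite b)
{
	return ranges_overlap(a, b) && (a.off != b.off || a.len != b.len);
}
```

With a single naturally aligned size, geometry_conflict() can never return
true for two atomic writes, which is the simplification the principles
below rely on.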

In this series, buffered atomic writes are supported based upon the
following principles:
- A buffered atomic write requires the RWF_ATOMIC flag to be set, same as
  direct IO. The other atomic write rules also apply, like power-of-2
  size and natural alignment.
- For an inode, only a single size of buffered write is allowed. So for
  statx, atomic_write_unit_min = atomic_write_unit_max always for
  buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. Folios match
  atomic writes well, as an atomic write must be a power-of-2 in size and
  naturally aligned.
- A folio is tagged as "atomic" when atomically written. If an "atomic"
  folio is fully or partially overwritten with a non-atomic write, the
  folio loses its atomicity. Indeed, issuing a non-atomic write over an
  atomic write would typically be seen as a userspace bug.
- If userspace wants to guarantee a buffered atomic write is written to
  media atomically after the write syscall returns, it must use RWF_SYNC
  or similar (along with RWF_ATOMIC).
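
The rules above can be sketched as a small userspace model. This is only
an illustration under stated assumptions, not the code in the patches:
struct folio_model, atomic_write_ok() and model_write() are hypothetical
names, and the "atomic" field merely stands in for the folio tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one pagecache folio; not the kernel's struct folio. */
struct folio_model {
	uint64_t off;	/* byte offset of the folio in the file */
	uint64_t len;	/* folio size in bytes, a power of 2 */
	bool atomic;	/* stands in for the "atomic" folio tag */
};

static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Atomic write rules from the list: power-of-2 length, naturally
 * aligned offset, and length equal to the single per-inode unit.
 */
static bool atomic_write_ok(uint64_t off, uint64_t len, uint64_t unit)
{
	return is_power_of_2(len) && len == unit && (off & (len - 1)) == 0;
}

/*
 * Apply a write to the model. An atomic write must map exactly onto the
 * folio and tags it; a non-atomic write touching any byte of the folio
 * clears the tag. Returns false if an atomic write is rejected.
 */
static bool model_write(struct folio_model *f, uint64_t off, uint64_t len,
			bool atomic)
{
	assert(is_power_of_2(f->len));

	if (atomic) {
		if (!atomic_write_ok(off, len, f->len) || off != f->off)
			return false;
		f->atomic = true;
		return true;
	}

	if (len > 0 && off < f->off + f->len && f->off < off + len)
		f->atomic = false;
	return true;
}
```

In this model, a 16K atomic write at offset 16K tags the folio, and a later
512-byte non-atomic write anywhere inside that range clears the tag again.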

This series just supports buffered atomic writes for XFS. I do have some
patches for buffered atomic writes through the bdev file operations. I did
not include them, as:
a. I don't know of any requirement for this support
b. atomic_write_unit_min and atomic_write_unit_max would be fixed at
   PAGE_SIZE there. This is very limiting. However, an API like BLKBSZSET
   could be added to allow userspace to program the values for
   atomic_write_unit_{min, max}.
c. We may want to support atomic_write_unit_{min, max} < PAGE_SIZE, and
   this becomes more complicated to support.
d. I would like to see what happens with bs > ps work there.

This series is just an early proof-of-concept, to prove that the API
proposed for block atomic writes can work for buffered IO. I would like to
unblock that direct IO series and have it merged.

Patches are based on [0], [1], and [3] (the bs > ps series). For the bs >
ps series, I had to borrow an earlier filemap change which allows the
folio min and max order to be selected.

All patches can be found at:
https://github.com/johnpgarry/linux/tree/atomic-writes-v6.9-v6-fs-v2-buffered

[0] https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/
[1] https://lore.kernel.org/linux-block/20240304130428.13026-1-john.g.garry@oracle.com/
[2] https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@mit.edu/
[3] https://lore.kernel.org/linux-xfs/20240313170253.2324812-1-kernel@pankajraghav.com/

John Garry (7):
  fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO
  filemap: Change mapping_set_folio_min_order() ->
    mapping_set_folio_orders()
  mm: Add PG_atomic
  fs: Add initial buffered atomic write support info to statx
  fs: iomap: buffered atomic write support
  fs: xfs: buffered atomic writes statx support
  fs: xfs: Enable buffered atomic writes

 block/bdev.c                   |  9 +++---
 fs/iomap/buffered-io.c         | 53 +++++++++++++++++++++++++++++-----
 fs/iomap/trace.h               |  3 +-
 fs/stat.c                      | 26 ++++++++++++-----
 fs/xfs/libxfs/xfs_inode_buf.c  |  8 +++++
 fs/xfs/xfs_file.c              | 12 ++++++--
 fs/xfs/xfs_icache.c            | 10 ++++---
 fs/xfs/xfs_ioctl.c             |  3 ++
 fs/xfs/xfs_iops.c              | 11 +++++--
 include/linux/fs.h             |  3 +-
 include/linux/iomap.h          |  1 +
 include/linux/page-flags.h     |  5 ++++
 include/linux/pagemap.h        | 20 ++++++++-----
 include/trace/events/mmflags.h |  3 +-
 include/uapi/linux/stat.h      |  6 ++--
 mm/filemap.c                   |  8 ++++-
 16 files changed, 141 insertions(+), 40 deletions(-)