diff mbox series

[v7,2/9] fs: Initial atomic write support

Message ID 20240602140912.970947-3-john.g.garry@oracle.com (mailing list archive)
State New, archived
Headers show
Series block atomic writes | expand

Commit Message

John Garry June 2, 2024, 2:09 p.m. UTC
From: Prasad Singamsetty <prasad.singamsetty@oracle.com>

An atomic write is a write issued with torn-write protection, meaning
that for a power failure or any other hardware failure, all or none of the
data from the write will be stored, but never a mix of old and new data.

Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn-write prevention, according to special
alignment and length rules.

For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
iocb->ki_flags field to indicate the same.

A call to statx will give the relevant atomic write info for a file:
- atomic_write_unit_min
- atomic_write_unit_max
- atomic_write_segments_max

Both min and max values must be a power-of-2.

Applications can avail of atomic write feature by ensuring that the total
length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
wrt the total write length. The value in atomic_write_segments_max
indicates the upper limit for IOV_ITER iovcnt.

Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
flag set will have RWF_ATOMIC rejected and not just ignored.

Add a type argument to kiocb_set_rw_flags() to allows reads which have
RWF_ATOMIC set to be rejected.

Helper function generic_atomic_write_valid() can be used by FSes to verify
compliant writes. There we check for iov_iter type is for ubuf, which
implies iovcnt==1 for pwritev2(), which is an initial restriction for
atomic_write_segments_max. Initially the only user will be bdev file
operations write handler. We will rely on the block BIO submission path to
ensure write sizes are compliant for the bdev, so we don't need to check
atomic writes sizes yet.

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
jpg: merge into single patch and much rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/aio.c                |  8 ++++----
 fs/btrfs/ioctl.c        |  2 +-
 fs/read_write.c         |  2 +-
 include/linux/fs.h      | 33 +++++++++++++++++++++++++++++++--
 include/uapi/linux/fs.h |  5 ++++-
 io_uring/rw.c           |  9 ++++-----
 6 files changed, 45 insertions(+), 14 deletions(-)

Comments

Christoph Hellwig June 5, 2024, 8:30 a.m. UTC | #1
Highlevel question:  in a lot of the discussions we've used the
term "untorn writes" instead, which feels better than atomic to
me as atomic is a highly overloaded term.  Should we switch the
naming to that?

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0283cf366c2a..6cb67882bcfd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -45,6 +45,7 @@
>  #include <linux/slab.h>
>  #include <linux/maple_tree.h>
>  #include <linux/rw_hint.h>
> +#include <linux/uio.h>

fs.h is included almost everywhere, so if we can avoid pulling in
even more dependencies that would be great.

It seems like it is pulled in just for this helper:

> +static inline
> +bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter)
> +{
> +	size_t len = iov_iter_count(iter);
> +
> +	if (!iter_is_ubuf(iter))
> +		return false;
> +
> +	if (!is_power_of_2(len))
> +		return false;
> +
> +	if (!IS_ALIGNED(pos, len))
> +		return false;
> +
> +	return true;
> +}

should that just go to uio.h instead, or move out of line?

Also the return type formatting is wrong, the two normal styles are
either:

static inline bool generic_atomic_write_valid(loff_t pos,
		struct iov_iter *iter)

or:

static inline bool
generic_atomic_write_valid(loff_t pos, struct iov_iter *iter)

(and while I'm at nitpicking, passing the pos before the iter
feels weird)

Last but not least: if READ/WRITE is passed to kiocb_set_rw_flags,
it should probably set IOCB_WRITE as well?  That might be a worthwile
prep patch on it's own.
John Garry June 5, 2024, 10:48 a.m. UTC | #2
On 05/06/2024 09:30, Christoph Hellwig wrote:
> Highlevel question:  in a lot of the discussions we've used the
> term "untorn writes" instead, which feels better than atomic to
> me as atomic is a highly overloaded term.  Should we switch the
> naming to that?

I have no strong attachment to that name (atomic).

For both SCSI and NVMe, it's an "atomic" feature and I was basing the 
naming on that.

We could have RWF_NOTEARS or RWF_UNTEARABLE_WRITE or RWF_UNTEARABLE or 
RWF_UNTORN or similar. Any preference?

> 
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 0283cf366c2a..6cb67882bcfd 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -45,6 +45,7 @@
>>   #include <linux/slab.h>
>>   #include <linux/maple_tree.h>
>>   #include <linux/rw_hint.h>
>> +#include <linux/uio.h>
> 
> fs.h is included almost everywhere, so if we can avoid pulling in
> even more dependencies that would be great.
> 
> It seems like it is pulled in just for this helper:

right

> 
>> +static inline
>> +bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter)
>> +{
>> +	size_t len = iov_iter_count(iter);
>> +
>> +	if (!iter_is_ubuf(iter))
>> +		return false;
>> +
>> +	if (!is_power_of_2(len))
>> +		return false;
>> +
>> +	if (!IS_ALIGNED(pos, len))
>> +		return false;
>> +
>> +	return true;
>> +}
> 
> should that just go to uio.h instead, or move out of line?

ok, I am not sure about moving to uio.h, but I'll try to do something 
about this issue

> 
> Also the return type formatting is wrong, the two normal styles are
> either:
> 
> static inline bool generic_atomic_write_valid(loff_t pos,
> 		struct iov_iter *iter)
> 
> or:
> 
> static inline bool
> generic_atomic_write_valid(loff_t pos, struct iov_iter *iter)
> 
> (and while I'm at nitpicking, passing the pos before the iter
> feels weird)

generally pos is first and then len (which iter provides) when a 
function accepts position and length, but then iter is the "larger" arg, 
and normally they go first. Anyway I don't mind changing that as you 
suggest.

> 
> Last but not least: if READ/WRITE is passed to kiocb_set_rw_flags,
> it should probably set IOCB_WRITE as well?  That might be a worthwile
> prep patch on it's own.

For io_uring/rw.c, we have io_write() -> io_rw_init_file(..., WRITE), 
and then later we set IOCB_WRITE, so would be neat to use there. But 
then do_iter_readv_writev() does not set IOCB_WRITE - I can't imagine 
that setting IOCB_WRITE would do any harm there. I see a similar change 
in 
https://lore.kernel.org/linux-fsdevel/167391048988.2311931.1567396746365286847.stgit@warthog.procyon.org.uk/

AFAICS, setting IOCB_WRITE is quite inconsistent. From browsing through 
fsdevel on lore, there was some history in trying to use IOCB_WRITE 
always instead of iov_iter direction. Any idea what happened to that?

I'm just getting the feeling that setting IOCB_WRITE in 
kiocb_set_rw_flags() is a small part - and maybe counter productive - of 
a larger job of fixing IOCB_WRITE usage.

Thanks,
John
Christoph Hellwig June 6, 2024, 5:41 a.m. UTC | #3
On Wed, Jun 05, 2024 at 11:48:12AM +0100, John Garry wrote:
> I have no strong attachment to that name (atomic).
>
> For both SCSI and NVMe, it's an "atomic" feature and I was basing the 
> naming on that.
>
> We could have RWF_NOTEARS or RWF_UNTEARABLE_WRITE or RWF_UNTEARABLE or 
> RWF_UNTORN or similar. Any preference?

No particular preference between any of the option including atomic.
Just mumbling out aloud my thoughts :)

> For io_uring/rw.c, we have io_write() -> io_rw_init_file(..., WRITE), and 
> then later we set IOCB_WRITE, so would be neat to use there. But then 
> do_iter_readv_writev() does not set IOCB_WRITE - I can't imagine that 
> setting IOCB_WRITE would do any harm there. I see a similar change in 
> https://lore.kernel.org/linux-fsdevel/167391048988.2311931.1567396746365286847.stgit@warthog.procyon.org.uk/
>
> AFAICS, setting IOCB_WRITE is quite inconsistent. From browsing through 
> fsdevel on lore, there was some history in trying to use IOCB_WRITE always 
> instead of iov_iter direction. Any idea what happened to that?
>
> I'm just getting the feeling that setting IOCB_WRITE in 
> kiocb_set_rw_flags() is a small part - and maybe counter productive - of a 
> larger job of fixing IOCB_WRITE usage.

Someone (IIRC Dave H.) want to move it into the iov_iter a while ago.
I think that is a bad idea - the iov_iter is a data container except
for the shoehorned in read/write information doesn't describe the
operation at all.  So using the flag in the iocb seems like the better
architecture.  But I can understand that you might want to stay out
of all of this, so let's not touch IOCB_WRITE here.
John Garry June 6, 2024, 6:38 a.m. UTC | #4
On 06/06/2024 06:41, Christoph Hellwig wrote:
> On Wed, Jun 05, 2024 at 11:48:12AM +0100, John Garry wrote:
>> I have no strong attachment to that name (atomic).
>>
>> For both SCSI and NVMe, it's an "atomic" feature and I was basing the
>> naming on that.
>>
>> We could have RWF_NOTEARS or RWF_UNTEARABLE_WRITE or RWF_UNTEARABLE or
>> RWF_UNTORN or similar. Any preference?
> 
> No particular preference between any of the option including atomic.
> Just mumbling out aloud my thoughts :)

Regardless of the userspace API, I think that the block layer 
terminology should match that of the underlying HW technology - so I 
would plan to keep "atomic" in the block layer, including request_queue 
sysfs limits.

If we used RWF_UNTORN, at some level the "atomic" and "untorn" 
terminology would need to interface with one another. If it's going to 
be insane to have RWF_UNTORN from userspace being translated into 
REQ_ATOMIC, then I could keep RWF_ATOMIC.

Someone please decide ....

> 
>> For io_uring/rw.c, we have io_write() -> io_rw_init_file(..., WRITE), and
>> then later we set IOCB_WRITE, so would be neat to use there. But then
>> do_iter_readv_writev() does not set IOCB_WRITE - I can't imagine that
>> setting IOCB_WRITE would do any harm there. I see a similar change in
>> https://lore.kernel.org/linux-fsdevel/167391048988.2311931.1567396746365286847.stgit@warthog.procyon.org.uk/
>>
>> AFAICS, setting IOCB_WRITE is quite inconsistent. From browsing through
>> fsdevel on lore, there was some history in trying to use IOCB_WRITE always
>> instead of iov_iter direction. Any idea what happened to that?
>>
>> I'm just getting the feeling that setting IOCB_WRITE in
>> kiocb_set_rw_flags() is a small part - and maybe counter productive - of a
>> larger job of fixing IOCB_WRITE usage.
> 
> Someone (IIRC Dave H.) want to move it into the iov_iter a while ago.
> I think that is a bad idea - the iov_iter is a data container except
> for the shoehorned in read/write information doesn't describe the
> operation at all.  So using the flag in the iocb seems like the better
> architecture.  But I can understand that you might want to stay out
> of all of this, so let's not touch IOCB_WRITE here.
> 

ok
diff mbox series

Patch

diff --git a/fs/aio.c b/fs/aio.c
index 57c9f7c077e6..93ef59d358b3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1516,7 +1516,7 @@  static void aio_complete_rw(struct kiocb *kiocb, long res)
 	iocb_put(iocb);
 }
 
-static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
+static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
 {
 	int ret;
 
@@ -1542,7 +1542,7 @@  static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 	} else
 		req->ki_ioprio = get_current_ioprio();
 
-	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
+	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags, rw_type);
 	if (unlikely(ret))
 		return ret;
 
@@ -1594,7 +1594,7 @@  static int aio_read(struct kiocb *req, const struct iocb *iocb,
 	struct file *file;
 	int ret;
 
-	ret = aio_prep_rw(req, iocb);
+	ret = aio_prep_rw(req, iocb, READ);
 	if (ret)
 		return ret;
 	file = req->ki_filp;
@@ -1621,7 +1621,7 @@  static int aio_write(struct kiocb *req, const struct iocb *iocb,
 	struct file *file;
 	int ret;
 
-	ret = aio_prep_rw(req, iocb);
+	ret = aio_prep_rw(req, iocb, WRITE);
 	if (ret)
 		return ret;
 	file = req->ki_filp;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index efd5d6e9589e..6ad524b894fc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4627,7 +4627,7 @@  static int btrfs_ioctl_encoded_write(struct file *file, void __user *argp, bool
 		goto out_iov;
 
 	init_sync_kiocb(&kiocb, file);
-	ret = kiocb_set_rw_flags(&kiocb, 0);
+	ret = kiocb_set_rw_flags(&kiocb, 0, WRITE);
 	if (ret)
 		goto out_iov;
 	kiocb.ki_pos = pos;
diff --git a/fs/read_write.c b/fs/read_write.c
index ef6339391351..819f065230fb 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -730,7 +730,7 @@  static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	ret = kiocb_set_rw_flags(&kiocb, flags);
+	ret = kiocb_set_rw_flags(&kiocb, flags, type);
 	if (ret)
 		return ret;
 	kiocb.ki_pos = (ppos ? *ppos : 0);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0283cf366c2a..6cb67882bcfd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -45,6 +45,7 @@ 
 #include <linux/slab.h>
 #include <linux/maple_tree.h>
 #include <linux/rw_hint.h>
+#include <linux/uio.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -125,8 +126,10 @@  typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_EXEC		((__force fmode_t)(1 << 5))
 /* File writes are restricted (block device specific) */
 #define FMODE_WRITE_RESTRICTED	((__force fmode_t)(1 << 6))
+/* File supports atomic writes */
+#define FMODE_CAN_ATOMIC_WRITE	((__force fmode_t)(1 << 7))
 
-/* FMODE_* bits 7 to 8 */
+/* FMODE_* bit 8 */
 
 /* 32bit hashes as llseek() offset (for directories) */
 #define FMODE_32BITHASH         ((__force fmode_t)(1 << 9))
@@ -317,6 +320,7 @@  struct readahead_control;
 #define IOCB_SYNC		(__force int) RWF_SYNC
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
+#define IOCB_ATOMIC		(__force int) RWF_ATOMIC
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
@@ -351,6 +355,7 @@  struct readahead_control;
 	{ IOCB_SYNC,		"SYNC" }, \
 	{ IOCB_NOWAIT,		"NOWAIT" }, \
 	{ IOCB_APPEND,		"APPEND" }, \
+	{ IOCB_ATOMIC,		"ATOMIC"}, \
 	{ IOCB_EVENTFD,		"EVENTFD"}, \
 	{ IOCB_DIRECT,		"DIRECT" }, \
 	{ IOCB_WRITE,		"WRITE" }, \
@@ -3403,7 +3408,8 @@  static inline int iocb_flags(struct file *file)
 	return res;
 }
 
-static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
+static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
+				     int rw_type)
 {
 	int kiocb_flags = 0;
 
@@ -3422,6 +3428,12 @@  static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 			return -EOPNOTSUPP;
 		kiocb_flags |= IOCB_NOIO;
 	}
+	if (flags & RWF_ATOMIC) {
+		if (rw_type != WRITE)
+			return -EOPNOTSUPP;
+		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
+			return -EOPNOTSUPP;
+	}
 	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
 	if (flags & RWF_SYNC)
 		kiocb_flags |= IOCB_DSYNC;
@@ -3613,4 +3625,21 @@  extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
 extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
 			   int advice);
 
+static inline
+bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter)
+{
+	size_t len = iov_iter_count(iter);
+
+	if (!iter_is_ubuf(iter))
+		return false;
+
+	if (!is_power_of_2(len))
+		return false;
+
+	if (!IS_ALIGNED(pos, len))
+		return false;
+
+	return true;
+}
+
 #endif /* _LINUX_FS_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 45e4e64fd664..191a7e88a8ab 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -329,9 +329,12 @@  typedef int __bitwise __kernel_rwf_t;
 /* per-IO negation of O_APPEND */
 #define RWF_NOAPPEND	((__force __kernel_rwf_t)0x00000020)
 
+/* Atomic Write */
+#define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000040)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND | RWF_NOAPPEND)
+			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
 
 /* Pagemap ioctl */
 #define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1a2128459cb4..c004d21e2f12 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -772,7 +772,7 @@  static bool need_complete_io(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
-static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
+static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 	struct kiocb *kiocb = &rw->kiocb;
@@ -787,7 +787,7 @@  static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
 		req->flags |= io_file_get_flags(file);
 
 	kiocb->ki_flags = file->f_iocb_flags;
-	ret = kiocb_set_rw_flags(kiocb, rw->flags);
+	ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type);
 	if (unlikely(ret))
 		return ret;
 	kiocb->ki_flags |= IOCB_ALLOC_CACHE;
@@ -832,8 +832,7 @@  static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 		if (unlikely(ret < 0))
 			return ret;
 	}
-
-	ret = io_rw_init_file(req, FMODE_READ);
+	ret = io_rw_init_file(req, FMODE_READ, READ);
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
@@ -1013,7 +1012,7 @@  int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	ssize_t ret, ret2;
 	loff_t *ppos;
 
-	ret = io_rw_init_file(req, FMODE_WRITE);
+	ret = io_rw_init_file(req, FMODE_WRITE, WRITE);
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);