Message ID | 20240602140912.970947-3-john.g.garry@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | block atomic writes | expand |
Highlevel question: in a lot of the discussions we've used the term "untorn writes" instead, which feels better than atomic to me as atomic is a highly overloaded term. Should we switch the naming to that? > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 0283cf366c2a..6cb67882bcfd 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -45,6 +45,7 @@ > #include <linux/slab.h> > #include <linux/maple_tree.h> > #include <linux/rw_hint.h> > +#include <linux/uio.h> fs.h is included almost everywhere, so if we can avoid pulling in even more dependencies that would be great. It seems like it is pulled in just for this helper: > +static inline > +bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter) > +{ > + size_t len = iov_iter_count(iter); > + > + if (!iter_is_ubuf(iter)) > + return false; > + > + if (!is_power_of_2(len)) > + return false; > + > + if (!IS_ALIGNED(pos, len)) > + return false; > + > + return true; > +} should that just go to uio.h instead, or move out of line? Also the return type formatting is wrong, the two normal styles are either: static inline bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter) or: static inline bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter) (and while I'm at nitpicking, passing the pos before the iter feels weird) Last but not least: if READ/WRITE is passed to kiocb_set_rw_flags, it should probably set IOCB_WRITE as well? That might be a worthwile prep patch on it's own.
On 05/06/2024 09:30, Christoph Hellwig wrote: > Highlevel question: in a lot of the discussions we've used the > term "untorn writes" instead, which feels better than atomic to > me as atomic is a highly overloaded term. Should we switch the > naming to that? I have no strong attachment to that name (atomic). For both SCSI and NVMe, it's an "atomic" feature and I was basing the naming on that. We could have RWF_NOTEARS or RWF_UNTEARABLE_WRITE or RWF_UNTEARABLE or RWF_UNTORN or similar. Any preference? > >> diff --git a/include/linux/fs.h b/include/linux/fs.h >> index 0283cf366c2a..6cb67882bcfd 100644 >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -45,6 +45,7 @@ >> #include <linux/slab.h> >> #include <linux/maple_tree.h> >> #include <linux/rw_hint.h> >> +#include <linux/uio.h> > > fs.h is included almost everywhere, so if we can avoid pulling in > even more dependencies that would be great. > > It seems like it is pulled in just for this helper: right > >> +static inline >> +bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter) >> +{ >> + size_t len = iov_iter_count(iter); >> + >> + if (!iter_is_ubuf(iter)) >> + return false; >> + >> + if (!is_power_of_2(len)) >> + return false; >> + >> + if (!IS_ALIGNED(pos, len)) >> + return false; >> + >> + return true; >> +} > > should that just go to uio.h instead, or move out of line? ok, I am not sure about moving to uio.h, but I'll try to do something about this issue > > Also the return type formatting is wrong, the two normal styles are > either: > > static inline bool generic_atomic_write_valid(loff_t pos, > struct iov_iter *iter) > > or: > > static inline bool > generic_atomic_write_valid(loff_t pos, struct iov_iter *iter) > > (and while I'm at nitpicking, passing the pos before the iter > feels weird) generally pos is first and then len (which iter provides) when a function accepts position and length, but then iter is the "larger" arg, and normally they go first. Anyway I don't mind changing that as you suggest. > > Last but not least: if READ/WRITE is passed to kiocb_set_rw_flags, > it should probably set IOCB_WRITE as well? That might be a worthwile > prep patch on it's own. For io_uring/rw.c, we have io_write() -> io_rw_init_file(..., WRITE), and then later we set IOCB_WRITE, so would be neat to use there. But then do_iter_readv_writev() does not set IOCB_WRITE - I can't imagine that setting IOCB_WRITE would do any harm there. I see a similar change in https://lore.kernel.org/linux-fsdevel/167391048988.2311931.1567396746365286847.stgit@warthog.procyon.org.uk/ AFAICS, setting IOCB_WRITE is quite inconsistent. From browsing through fsdevel on lore, there was some history in trying to use IOCB_WRITE always instead of iov_iter direction. Any idea what happened to that? I'm just getting the feeling that setting IOCB_WRITE in kiocb_set_rw_flags() is a small part - and maybe counter productive - of a larger job of fixing IOCB_WRITE usage. Thanks, John
On Wed, Jun 05, 2024 at 11:48:12AM +0100, John Garry wrote: > I have no strong attachment to that name (atomic). > > For both SCSI and NVMe, it's an "atomic" feature and I was basing the > naming on that. > > We could have RWF_NOTEARS or RWF_UNTEARABLE_WRITE or RWF_UNTEARABLE or > RWF_UNTORN or similar. Any preference? No particular preference between any of the option including atomic. Just mumbling out aloud my thoughts :) > For io_uring/rw.c, we have io_write() -> io_rw_init_file(..., WRITE), and > then later we set IOCB_WRITE, so would be neat to use there. But then > do_iter_readv_writev() does not set IOCB_WRITE - I can't imagine that > setting IOCB_WRITE would do any harm there. I see a similar change in > https://lore.kernel.org/linux-fsdevel/167391048988.2311931.1567396746365286847.stgit@warthog.procyon.org.uk/ > > AFAICS, setting IOCB_WRITE is quite inconsistent. From browsing through > fsdevel on lore, there was some history in trying to use IOCB_WRITE always > instead of iov_iter direction. Any idea what happened to that? > > I'm just getting the feeling that setting IOCB_WRITE in > kiocb_set_rw_flags() is a small part - and maybe counter productive - of a > larger job of fixing IOCB_WRITE usage. Someone (IIRC Dave H.) want to move it into the iov_iter a while ago. I think that is a bad idea - the iov_iter is a data container except for the shoehorned in read/write information doesn't describe the operation at all. So using the flag in the iocb seems like the better architecture. But I can understand that you might want to stay out of all of this, so let's not touch IOCB_WRITE here.
On 06/06/2024 06:41, Christoph Hellwig wrote: > On Wed, Jun 05, 2024 at 11:48:12AM +0100, John Garry wrote: >> I have no strong attachment to that name (atomic). >> >> For both SCSI and NVMe, it's an "atomic" feature and I was basing the >> naming on that. >> >> We could have RWF_NOTEARS or RWF_UNTEARABLE_WRITE or RWF_UNTEARABLE or >> RWF_UNTORN or similar. Any preference? > > No particular preference between any of the option including atomic. > Just mumbling out aloud my thoughts :) Regardless of the userspace API, I think that the block layer terminology should match that of the underlying HW technology - so I would plan to keep "atomic" in the block layer, including request_queue sysfs limits. If we used RWF_UNTORN, at some level the "atomic" and "untorn" terminology would need to interface with one another. If it's going to be insane to have RWF_UNTORN from userspace being translated into REQ_ATOMIC, then I could keep RWF_ATOMIC. Someone please decide .... > >> For io_uring/rw.c, we have io_write() -> io_rw_init_file(..., WRITE), and >> then later we set IOCB_WRITE, so would be neat to use there. But then >> do_iter_readv_writev() does not set IOCB_WRITE - I can't imagine that >> setting IOCB_WRITE would do any harm there. I see a similar change in >> https://lore.kernel.org/linux-fsdevel/167391048988.2311931.1567396746365286847.stgit@warthog.procyon.org.uk/ >> >> AFAICS, setting IOCB_WRITE is quite inconsistent. From browsing through >> fsdevel on lore, there was some history in trying to use IOCB_WRITE always >> instead of iov_iter direction. Any idea what happened to that? >> >> I'm just getting the feeling that setting IOCB_WRITE in >> kiocb_set_rw_flags() is a small part - and maybe counter productive - of a >> larger job of fixing IOCB_WRITE usage. > > Someone (IIRC Dave H.) want to move it into the iov_iter a while ago. > I think that is a bad idea - the iov_iter is a data container except > for the shoehorned in read/write information doesn't describe the > operation at all. So using the flag in the iocb seems like the better > architecture. But I can understand that you might want to stay out > of all of this, so let's not touch IOCB_WRITE here. > ok
diff --git a/fs/aio.c b/fs/aio.c index 57c9f7c077e6..93ef59d358b3 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1516,7 +1516,7 @@ static void aio_complete_rw(struct kiocb *kiocb, long res) iocb_put(iocb); } -static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) +static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type) { int ret; @@ -1542,7 +1542,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) } else req->ki_ioprio = get_current_ioprio(); - ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags); + ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags, rw_type); if (unlikely(ret)) return ret; @@ -1594,7 +1594,7 @@ static int aio_read(struct kiocb *req, const struct iocb *iocb, struct file *file; int ret; - ret = aio_prep_rw(req, iocb); + ret = aio_prep_rw(req, iocb, READ); if (ret) return ret; file = req->ki_filp; @@ -1621,7 +1621,7 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb, struct file *file; int ret; - ret = aio_prep_rw(req, iocb); + ret = aio_prep_rw(req, iocb, WRITE); if (ret) return ret; file = req->ki_filp; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index efd5d6e9589e..6ad524b894fc 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4627,7 +4627,7 @@ static int btrfs_ioctl_encoded_write(struct file *file, void __user *argp, bool goto out_iov; init_sync_kiocb(&kiocb, file); - ret = kiocb_set_rw_flags(&kiocb, 0); + ret = kiocb_set_rw_flags(&kiocb, 0, WRITE); if (ret) goto out_iov; kiocb.ki_pos = pos; diff --git a/fs/read_write.c b/fs/read_write.c index ef6339391351..819f065230fb 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -730,7 +730,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter, ssize_t ret; init_sync_kiocb(&kiocb, filp); - ret = kiocb_set_rw_flags(&kiocb, flags); + ret = kiocb_set_rw_flags(&kiocb, flags, type); if (ret) return ret; kiocb.ki_pos = (ppos ? *ppos : 0); diff --git a/include/linux/fs.h b/include/linux/fs.h index 0283cf366c2a..6cb67882bcfd 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -45,6 +45,7 @@ #include <linux/slab.h> #include <linux/maple_tree.h> #include <linux/rw_hint.h> +#include <linux/uio.h> #include <asm/byteorder.h> #include <uapi/linux/fs.h> @@ -125,8 +126,10 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, #define FMODE_EXEC ((__force fmode_t)(1 << 5)) /* File writes are restricted (block device specific) */ #define FMODE_WRITE_RESTRICTED ((__force fmode_t)(1 << 6)) +/* File supports atomic writes */ +#define FMODE_CAN_ATOMIC_WRITE ((__force fmode_t)(1 << 7)) -/* FMODE_* bits 7 to 8 */ +/* FMODE_* bit 8 */ /* 32bit hashes as llseek() offset (for directories) */ #define FMODE_32BITHASH ((__force fmode_t)(1 << 9)) @@ -317,6 +320,7 @@ struct readahead_control; #define IOCB_SYNC (__force int) RWF_SYNC #define IOCB_NOWAIT (__force int) RWF_NOWAIT #define IOCB_APPEND (__force int) RWF_APPEND +#define IOCB_ATOMIC (__force int) RWF_ATOMIC /* non-RWF related bits - start at 16 */ #define IOCB_EVENTFD (1 << 16) @@ -351,6 +355,7 @@ struct readahead_control; { IOCB_SYNC, "SYNC" }, \ { IOCB_NOWAIT, "NOWAIT" }, \ { IOCB_APPEND, "APPEND" }, \ + { IOCB_ATOMIC, "ATOMIC"}, \ { IOCB_EVENTFD, "EVENTFD"}, \ { IOCB_DIRECT, "DIRECT" }, \ { IOCB_WRITE, "WRITE" }, \ @@ -3403,7 +3408,8 @@ static inline int iocb_flags(struct file *file) return res; } -static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) +static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags, + int rw_type) { int kiocb_flags = 0; @@ -3422,6 +3428,12 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) return -EOPNOTSUPP; kiocb_flags |= IOCB_NOIO; } + if (flags & RWF_ATOMIC) { + if (rw_type != WRITE) + return -EOPNOTSUPP; + if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE)) + return -EOPNOTSUPP; + } kiocb_flags |= (__force int) (flags & RWF_SUPPORTED); if (flags & RWF_SYNC) kiocb_flags |= IOCB_DSYNC; @@ -3613,4 +3625,21 @@ extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len, extern int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice); +static inline +bool generic_atomic_write_valid(loff_t pos, struct iov_iter *iter) +{ + size_t len = iov_iter_count(iter); + + if (!iter_is_ubuf(iter)) + return false; + + if (!is_power_of_2(len)) + return false; + + if (!IS_ALIGNED(pos, len)) + return false; + + return true; +} + #endif /* _LINUX_FS_H */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 45e4e64fd664..191a7e88a8ab 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -329,9 +329,12 @@ typedef int __bitwise __kernel_rwf_t; /* per-IO negation of O_APPEND */ #define RWF_NOAPPEND ((__force __kernel_rwf_t)0x00000020) +/* Atomic Write */ +#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000040) + /* mask of flags supported by the kernel */ #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ - RWF_APPEND | RWF_NOAPPEND) + RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC) /* Pagemap ioctl */ #define PAGEMAP_SCAN _IOWR('f', 16, struct pm_scan_arg) diff --git a/io_uring/rw.c b/io_uring/rw.c index 1a2128459cb4..c004d21e2f12 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -772,7 +772,7 @@ static bool need_complete_io(struct io_kiocb *req) S_ISBLK(file_inode(req->file)->i_mode); } -static int io_rw_init_file(struct io_kiocb *req, fmode_t mode) +static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) { struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); struct kiocb *kiocb = &rw->kiocb; @@ -787,7 +787,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode) req->flags |= io_file_get_flags(file); kiocb->ki_flags = file->f_iocb_flags; - ret = kiocb_set_rw_flags(kiocb, rw->flags); + ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type); if (unlikely(ret)) return ret; kiocb->ki_flags |= IOCB_ALLOC_CACHE; @@ -832,8 +832,7 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(ret < 0)) return ret; } - - ret = io_rw_init_file(req, FMODE_READ); + ret = io_rw_init_file(req, FMODE_READ, READ); if (unlikely(ret)) return ret; req->cqe.res = iov_iter_count(&io->iter); @@ -1013,7 +1012,7 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) ssize_t ret, ret2; loff_t *ppos; - ret = io_rw_init_file(req, FMODE_WRITE); + ret = io_rw_init_file(req, FMODE_WRITE, WRITE); if (unlikely(ret)) return ret; req->cqe.res = iov_iter_count(&io->iter);