Message ID | 20240930181305.17286-4-joshi.k@samsung.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v7,1/3] nvme: enable FDP support | expand |
On 9/30/24 19:13, Kanchan Joshi wrote: > With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and > all the subsequent writes on the file pass that hint value down. > This can be limiting for large files (and for block device) as all the > writes can be tagged with only one lifetime hint value. > Concurrent writes (with different hint values) are hard to manage. > Per-IO hinting solves that problem. > > Allow userspace to pass additional metadata in the SQE. > The type of passed metadata is expressed by a new field > > __u16 meta_type; The new layout looks nicer, but let me elaborate on the previous comment. I don't believe we should be restricting to only one attribute per IO. What if someone wants to pass a lifetime hint together with integrity information? Instead, we might need something more extensible like an ability to pass a list / array of typed attributes / meta information / hints etc. An example from networking I gave last time was control messages, i.e. cmsg. In a basic oversimplified form the API from the user perspective could look like: struct meta_attr { u16 type; u64 data; }; struct meta_attr attr[] = {{HINT, hint_value}, {INTEGRITY, ptr}}; sqe->meta_attrs = attr; sqe->meta_nr = 2; I'm pretty sure there will be a bunch of considerations people will name the API should account for, like sizing the struct so it can fit integrity bits or even making it variable sized. > At this point one type META_TYPE_LIFETIME_HINT is supported. > With this type, user can pass lifetime hint values in the new field > > __u64 lifetime_val; > > This accepts all lifetime hint values that are possible with > F_SET_RW_HINT fcntl. > > The write handlers (io_prep_rw, io_write) send the hint value to > lower-layer using kiocb. This is good for upporting direct IO, > but not when kiocb is not available (e.g., buffered IO). > > When per-io hints are not passed, the per-inode hint values are set in > the kiocb (as before). Otherwise, these take the precedence on per-inode > hints.
On 9/30/24 11:13 AM, Kanchan Joshi wrote: > diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h > index 309ca72f2dfb..f4373a71ffed 100644 > --- a/include/linux/rw_hint.h > +++ b/include/linux/rw_hint.h > @@ -21,4 +21,28 @@ enum rw_hint { > static_assert(sizeof(enum rw_hint) == 1); > #endif > > +#define WRITE_LIFE_INVALID (RWH_WRITE_LIFE_EXTREME + 1) > + > +static inline bool rw_hint_valid(u64 hint) > +{ > + BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET); > + BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE); > + BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT); > + BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM); > + BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG); > + BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME); > + > + switch (hint) { > + case RWH_WRITE_LIFE_NOT_SET: > + case RWH_WRITE_LIFE_NONE: > + case RWH_WRITE_LIFE_SHORT: > + case RWH_WRITE_LIFE_MEDIUM: > + case RWH_WRITE_LIFE_LONG: > + case RWH_WRITE_LIFE_EXTREME: > + return true; > + default: > + return false; > + } > +} Moving a function that is not in the hot path from a .c file to a .h file is wrong. This increases the kernel size, slows down compilation and makes implementation details visible to callers that should not have access to these implementation details. Bart.
On 10/2/2024 7:56 PM, Pavel Begunkov wrote: > On 9/30/24 19:13, Kanchan Joshi wrote: >> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and >> all the subsequent writes on the file pass that hint value down. >> This can be limiting for large files (and for block device) as all the >> writes can be tagged with only one lifetime hint value. >> Concurrent writes (with different hint values) are hard to manage. >> Per-IO hinting solves that problem. >> >> Allow userspace to pass additional metadata in the SQE. >> The type of passed metadata is expressed by a new field >> >> __u16 meta_type; > > The new layout looks nicer, but let me elaborate on the previous > comment. I don't believe we should be restricting to only one > attribute per IO. What if someone wants to pass a lifetime hint > together with integrity information? For that reason only I made meta_type to accept multiple bit values. META_TYPE_LIFETIME_HINT and a new META_TYPE_INTEGRITY can coexist. Overall 16 meta types can coexist. > Instead, we might need something more extensible like an ability > to pass a list / array of typed attributes / meta information / hints > etc. An example from networking I gave last time was control messages, > i.e. cmsg. In a basic oversimplified form the API from the user > perspective could look like: > > struct meta_attr { > u16 type; > u64 data; > }; > > struct meta_attr attr[] = {{HINT, hint_value}, {INTEGRITY, ptr}}; > sqe->meta_attrs = attr; > sqe->meta_nr = 2; I did not feel like adding a pointer (and have copy_from_user cost) for integrity. Currently integrity uses space in second SQE which seems fine [*]. Down the line if meta-types increase and we are on verge of low SQE space, we can resort to add indirect reference. [*] https://lore.kernel.org/linux-nvme/20241016112912.63542-8-anuj20.g@samsung.com/
diff --git a/fs/fcntl.c b/fs/fcntl.c index 22dd9dcce7ec..a390a05f4ef8 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -334,28 +334,6 @@ static int f_getowner_uids(struct file *filp, unsigned long arg) } #endif -static bool rw_hint_valid(u64 hint) -{ - BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET); - BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE); - BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT); - BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM); - BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG); - BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME); - - switch (hint) { - case RWH_WRITE_LIFE_NOT_SET: - case RWH_WRITE_LIFE_NONE: - case RWH_WRITE_LIFE_SHORT: - case RWH_WRITE_LIFE_MEDIUM: - case RWH_WRITE_LIFE_LONG: - case RWH_WRITE_LIFE_EXTREME: - return true; - default: - return false; - } -} - static long fcntl_get_rw_hint(struct file *file, unsigned int cmd, unsigned long arg) { diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h index 309ca72f2dfb..f4373a71ffed 100644 --- a/include/linux/rw_hint.h +++ b/include/linux/rw_hint.h @@ -21,4 +21,28 @@ enum rw_hint { static_assert(sizeof(enum rw_hint) == 1); #endif +#define WRITE_LIFE_INVALID (RWH_WRITE_LIFE_EXTREME + 1) + +static inline bool rw_hint_valid(u64 hint) +{ + BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET); + BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE); + BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT); + BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM); + BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG); + BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME); + + switch (hint) { + case RWH_WRITE_LIFE_NOT_SET: + case RWH_WRITE_LIFE_NONE: + case RWH_WRITE_LIFE_SHORT: + case RWH_WRITE_LIFE_MEDIUM: + case RWH_WRITE_LIFE_LONG: + case RWH_WRITE_LIFE_EXTREME: + return true; + default: + return false; + } +} + #endif /* _LINUX_RW_HINT_H */ diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 86cb385fe0b5..951e35226229 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -92,12 +92,23 @@ struct io_uring_sqe { __u16 addr_len; __u16 __pad3[1]; }; + struct { + /* Bit field to express 16 meta types */ + __u16 meta_type; + __u16 __pad4[1]; + }; }; union { struct { __u64 addr3; __u64 __pad2[1]; }; + struct { + /* First meta type specific fields */ + __u64 lifetime_val; + /* For future use */ + __u64 __pad5[1]; + }; __u64 optval; /* * If the ring is initialized with IORING_SETUP_SQE128, then @@ -107,6 +118,14 @@ struct io_uring_sqe { }; }; +enum io_uring_sqe_meta_type_bits { + META_TYPE_LIFETIME_HINT_BIT +}; + +/* this meta type covers write hint values supported by F_SET_RW_HINT fcntl */ +#define META_TYPE_LIFETIME_HINT (1U << META_TYPE_LIFETIME_HINT_BIT) + + /* * If sqe->file_index is set to this for opcodes that instantiate a new * direct descriptor (like openat/openat2/accept), then io_uring will allocate diff --git a/io_uring/rw.c b/io_uring/rw.c index 510123d3d837..bf45ee8904a4 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -269,6 +269,24 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, rw->kiocb.ki_ioprio = get_current_ioprio(); } rw->kiocb.dio_complete = NULL; + if (ddir == ITER_SOURCE) { + u16 mtype = READ_ONCE(sqe->meta_type); + + rw->kiocb.ki_write_hint = WRITE_LIFE_INVALID; + if (mtype) { + u64 lhint = READ_ONCE(sqe->lifetime_val); + + if (READ_ONCE(sqe->__pad4[0]) || + READ_ONCE(sqe->__pad5[0])) + return -EINVAL; + + if (mtype != META_TYPE_LIFETIME_HINT || + !rw_hint_valid(lhint)) + return -EINVAL; + + rw->kiocb.ki_write_hint = lhint; + } + } rw->addr = READ_ONCE(sqe->addr); rw->len = READ_ONCE(sqe->len); @@ -1023,7 +1041,12 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(ret)) return ret; req->cqe.res = iov_iter_count(&io->iter); - rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp); + /* + * Use per-file hint only if per-io hint is not set. + * We need per-io hint to get precedence. + */ + if (rw->kiocb.ki_write_hint == WRITE_LIFE_INVALID) + rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp); if (force_nonblock) { /* If the file doesn't support async, just async punt */