Message ID | 20241017160937.2283225-1-kbusch@meta.com (mailing list archive) |
---|---|
Headers | show |
Series | write hints for nvme fdp | expand |
On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > From: Kanchan Joshi <joshi.k@samsung.com> > > struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53 > ("fs: remove kiocb.ki_hint"). But write hint returned with commit > 449813515d3e ("block, fs: Restore the per-bio/request data lifetime > fields"). > > This patch uses the leftover space in kiocb to carve 2 byte field > ki_write_hint. Restore the code that operates on kiocb to use > ki_write_hint instead of inode hint value. > > This does not change any behavior, but needed to enable per-io hints. In this version it doesn't really restore anything, but adds a new write hinting capability. Similarly to the bio patch we'll probably need to make clear what is in there instead of having it completely untyped (the exact same appraoch as for the bio should work). > index bbd05f1a21453..73629e26becbe 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio, > bio->bi_end_io = dio_bio_end_io; > if (dio->is_pinned) > bio_set_flag(bio, BIO_PAGE_PINNED); > - bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint; > + bio->bi_write_hint = dio->iocb->ki_write_hint; > > sdio->bio = bio; > sdio->logical_offset_in_bio = sdio->cur_page_fs_offset; > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index f637aa0706a31..fff43f121ee65 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, > fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits, > GFP_KERNEL); > bio->bi_iter.bi_sector = iomap_sector(iomap, pos); > - bio->bi_write_hint = inode->i_write_hint; > + bio->bi_write_hint = dio->iocb->ki_write_hint; File system (helper) code should not directly apply this limit, but the file system needs to set it. > +static inline enum rw_hint file_write_hint(struct file *filp) > +{ > + return file_inode(filp)->i_write_hint; > +} > + > static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) > { > *kiocb = (struct kiocb) { > .ki_filp = filp, > .ki_flags = filp->f_iocb_flags, > .ki_ioprio = get_current_ioprio(), > + .ki_write_hint = file_write_hint(filp), And we'll need to distinguish between the per-inode and per file hint. I.e. don't blindly initialize ki_write_hint to the per-inode one here, but make that conditional in the file operation.
On 10/17/2024 9:39 PM, Keith Busch wrote: > > +#define NVME_MAX_PLIDS (NVME_CTRL_PAGE_SIZE / sizeof(16)) > + Seems you intended sizeof(u16).
On 10/17/24 9:09 AM, Keith Busch wrote: > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3559446279c15..04e875a37f604 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -370,6 +370,7 @@ struct kiocb { > void *private; > int ki_flags; > u16 ki_ioprio; /* See linux/ioprio.h */ > + u16 ki_write_hint; > union { > /* > * Only used for async buffered reads, where it denotes the Why 'u16'? Shouldn't this patch use the type 'enum rw_hint' for ki_write_hint and shouldn't this type be changed into u16 in a later patch? Thanks, Bart.
On Fri, Oct 18, 2024 at 04:18:17PM +0530, Kanchan Joshi wrote: > On 10/17/2024 9:39 PM, Keith Busch wrote: > > > > +#define NVME_MAX_PLIDS (NVME_CTRL_PAGE_SIZE / sizeof(16)) > > + > > Seems you intended sizeof(u16). Ha, yes, that's the intended type. I don't have anything close to 1k placement hints, so would have never noticed this was the wrong size.
On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > > { > > *kiocb = (struct kiocb) { > > .ki_filp = filp, > > .ki_flags = filp->f_iocb_flags, > > .ki_ioprio = get_current_ioprio(), > > + .ki_write_hint = file_write_hint(filp), > > And we'll need to distinguish between the per-inode and per file > hint. I.e. don't blindly initialize ki_write_hint to the per-inode > one here, but make that conditional in the file operation. Maybe someone wants to do direct-io with partions where each partition has a different default "hint" when not provided a per-io hint? I don't know of such a case, but it doesn't sound terrible. In any case, I feel if you're directing writes through these interfaces, you get to keep all the pieces: user space controls policy, kernel just provides the mechanisms to do it.
On 10/21/24 8:47 AM, Keith Busch wrote: > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: >> On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: >>> { >>> *kiocb = (struct kiocb) { >>> .ki_filp = filp, >>> .ki_flags = filp->f_iocb_flags, >>> .ki_ioprio = get_current_ioprio(), >>> + .ki_write_hint = file_write_hint(filp), >> >> And we'll need to distinguish between the per-inode and per file >> hint. I.e. don't blindly initialize ki_write_hint to the per-inode >> one here, but make that conditional in the file operation. > > Maybe someone wants to do direct-io with partions where each partition > has a different default "hint" when not provided a per-io hint? I don't > know of such a case, but it doesn't sound terrible. In any case, I feel > if you're directing writes through these interfaces, you get to keep all > the pieces: user space controls policy, kernel just provides the > mechanisms to do it. Is it important to support partitions on top of FDP namespaces? We could follow the example of zoned block devices and not support partitions on top of FDP devices. From block/core.c, function add_partition(): /* * Partitions are not supported on zoned block devices that are used as * such. */ if (bdev_is_zoned(disk->part0)) { pr_warn("%s: partitions not supported on host managed zoned block device\n", disk->disk_name); return ERR_PTR(-ENXIO); } Thanks, Bart.
On Mon, Oct 21, 2024 at 10:09:57AM -0700, Bart Van Assche wrote: > On 10/21/24 8:47 AM, Keith Busch wrote: > > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: > > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > > > > { > > > > *kiocb = (struct kiocb) { > > > > .ki_filp = filp, > > > > .ki_flags = filp->f_iocb_flags, > > > > .ki_ioprio = get_current_ioprio(), > > > > + .ki_write_hint = file_write_hint(filp), > > > > > > And we'll need to distinguish between the per-inode and per file > > > hint. I.e. don't blindly initialize ki_write_hint to the per-inode > > > one here, but make that conditional in the file operation. > > > > Maybe someone wants to do direct-io with partions where each partition > > has a different default "hint" when not provided a per-io hint? I don't > > know of such a case, but it doesn't sound terrible. In any case, I feel > > if you're directing writes through these interfaces, you get to keep all > > the pieces: user space controls policy, kernel just provides the > > mechanisms to do it. > > Is it important to support partitions on top of FDP namespaces? It's already used with partitions, so yes, it's important. Breaking that as a condition to make it work with the block stack is a non-starter.
On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote: > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > > > { > > > *kiocb = (struct kiocb) { > > > .ki_filp = filp, > > > .ki_flags = filp->f_iocb_flags, > > > .ki_ioprio = get_current_ioprio(), > > > + .ki_write_hint = file_write_hint(filp), > > > > And we'll need to distinguish between the per-inode and per file > > hint. I.e. don't blindly initialize ki_write_hint to the per-inode > > one here, but make that conditional in the file operation. > > Maybe someone wants to do direct-io with partions where each partition > has a different default "hint" when not provided a per-io hint? I don't > know of such a case, but it doesn't sound terrible. In any case, I feel > if you're directing writes through these interfaces, you get to keep all > the pieces: user space controls policy, kernel just provides the > mechanisms to do it. Eww. You actually pointed out a real problem here: if a device has multiple partitions the write streams as of this series are shared by them, which breaks their use case as the applications or file systems in different partitions will get other users of the write stream randomly overlayed onto theirs. So either the available streams need to be split into smaller pools by partitions, or we just assigned them to the first partition to make these scheme work for partitioned devices. Either way mixing up the per-inode hint and the dynamic one remains a bad idea.
On Tue, Oct 22, 2024 at 08:43:09AM +0200, Christoph Hellwig wrote: > On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote: > > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: > > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > > > > { > > > > *kiocb = (struct kiocb) { > > > > .ki_filp = filp, > > > > .ki_flags = filp->f_iocb_flags, > > > > .ki_ioprio = get_current_ioprio(), > > > > + .ki_write_hint = file_write_hint(filp), > > > > > > And we'll need to distinguish between the per-inode and per file > > > hint. I.e. don't blindly initialize ki_write_hint to the per-inode > > > one here, but make that conditional in the file operation. > > > > Maybe someone wants to do direct-io with partions where each partition > > has a different default "hint" when not provided a per-io hint? I don't > > know of such a case, but it doesn't sound terrible. In any case, I feel > > if you're directing writes through these interfaces, you get to keep all > > the pieces: user space controls policy, kernel just provides the > > mechanisms to do it. > > Eww. You actually pointed out a real problem here: if a device > has multiple partitions the write streams as of this series are > shared by them, which breaks their use case as the applications or > file systems in different partitions will get other users of the > write stream randomly overlayed onto theirs. > > So either the available streams need to be split into smaller pools > by partitions, or we just assigned them to the first partition to > make these scheme work for partitioned devices. > > Either way mixing up the per-inode hint and the dynamic one remains > a bad idea. No doubt it's almost certainly not a good idea to mix different stream usages, but that's not the kernels problem. It's user space policy. I don't think the kernel needs to perform any heroic efforts to split anything here. Just keep it simple.
On Tue, Oct 22, 2024 at 08:37:56AM -0600, Keith Busch wrote: > No doubt it's almost certainly not a good idea to mix different stream > usages, but that's not the kernels problem. It's user space policy. I > don't think the kernel needs to perform any heroic efforts to split > anything here. Just keep it simple. Unfortunately it's not. This complicated breaks any intelligent scheme to manage them. You can't write portable code assuming you can know write streams when you need to cope with someone else using some or all of the same resources.
From: Keith Busch <kbusch@kernel.org> Changes from v7: Limits io_uring per-io hints to raw block, and only if the block device registers a new queue limit indicating support for it. The per-io hints are opaque to the kernel. Minor changelog and code organization changes. I don't really understand the io_uring suggestions, so I just made the write_hint a first class field without the "meta" indirection. It's kind of like ioprio, which has it's own field too. Actually, might be neat if we could use ioprio since it already has a "hints" field that is currently only used by command duration limits. Kanchan Joshi (3): block, fs: restore kiocb based write hint processing io_uring: enable per-io hinting capability nvme: enable FDP support Keith Busch (3): block: use generic u16 for write hints block: introduce max_write_hints queue limit fs: introduce per-io hint support flag Documentation/ABI/stable/sysfs-block | 7 +++ block/blk-settings.c | 3 + block/blk-sysfs.c | 3 + block/fops.c | 10 ++-- drivers/nvme/host/core.c | 82 ++++++++++++++++++++++++++++ drivers/nvme/host/nvme.h | 5 ++ fs/aio.c | 1 + fs/cachefiles/io.c | 1 + fs/direct-io.c | 2 +- fs/iomap/direct-io.c | 2 +- include/linux/blk-mq.h | 3 +- include/linux/blk_types.h | 2 +- include/linux/blkdev.h | 12 ++++ include/linux/fs.h | 10 ++++ include/linux/nvme.h | 19 +++++++ include/uapi/linux/io_uring.h | 4 ++ io_uring/rw.c | 10 +++- 17 files changed, 166 insertions(+), 10 deletions(-)