Message ID | 20241017160937.2283225-2-kbusch@meta.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | write hints for nvme fdp | expand |
On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > From: Kanchan Joshi <joshi.k@samsung.com> > > struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53 > ("fs: remove kiocb.ki_hint"). But write hint returned with commit > 449813515d3e ("block, fs: Restore the per-bio/request data lifetime > fields"). > > This patch uses the leftover space in kiocb to carve 2 byte field > ki_write_hint. Restore the code that operates on kiocb to use > ki_write_hint instead of inode hint value. > > This does not change any behavior, but needed to enable per-io hints. In this version it doesn't really restore anything, but adds a new write hinting capability. Similarly to the bio patch we'll probably need to make clear what is in there instead of having it completely untyped (the exact same appraoch as for the bio should work). > index bbd05f1a21453..73629e26becbe 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio, > bio->bi_end_io = dio_bio_end_io; > if (dio->is_pinned) > bio_set_flag(bio, BIO_PAGE_PINNED); > - bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint; > + bio->bi_write_hint = dio->iocb->ki_write_hint; > > sdio->bio = bio; > sdio->logical_offset_in_bio = sdio->cur_page_fs_offset; > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index f637aa0706a31..fff43f121ee65 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, > fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits, > GFP_KERNEL); > bio->bi_iter.bi_sector = iomap_sector(iomap, pos); > - bio->bi_write_hint = inode->i_write_hint; > + bio->bi_write_hint = dio->iocb->ki_write_hint; File system (helper) code should not directly apply this limit, but the file system needs to set it. > +static inline enum rw_hint file_write_hint(struct file *filp) > +{ > + return file_inode(filp)->i_write_hint; > +} > + > static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) > { > *kiocb = (struct kiocb) { > .ki_filp = filp, > .ki_flags = filp->f_iocb_flags, > .ki_ioprio = get_current_ioprio(), > + .ki_write_hint = file_write_hint(filp), And we'll need to distinguish between the per-inode and per file hint. I.e. don't blindly initialize ki_write_hint to the per-inode one here, but make that conditional in the file operation.
On 10/17/24 9:09 AM, Keith Busch wrote: > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3559446279c15..04e875a37f604 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -370,6 +370,7 @@ struct kiocb { > void *private; > int ki_flags; > u16 ki_ioprio; /* See linux/ioprio.h */ > + u16 ki_write_hint; > union { > /* > * Only used for async buffered reads, where it denotes the Why 'u16'? Shouldn't this patch use the type 'enum rw_hint' for ki_write_hint and shouldn't this type be changed into u16 in a later patch? Thanks, Bart.
On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > > { > > *kiocb = (struct kiocb) { > > .ki_filp = filp, > > .ki_flags = filp->f_iocb_flags, > > .ki_ioprio = get_current_ioprio(), > > + .ki_write_hint = file_write_hint(filp), > > And we'll need to distinguish between the per-inode and per file > hint. I.e. don't blindly initialize ki_write_hint to the per-inode > one here, but make that conditional in the file operation. Maybe someone wants to do direct-io with partions where each partition has a different default "hint" when not provided a per-io hint? I don't know of such a case, but it doesn't sound terrible. In any case, I feel if you're directing writes through these interfaces, you get to keep all the pieces: user space controls policy, kernel just provides the mechanisms to do it.
On 10/21/24 8:47 AM, Keith Busch wrote: > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: >> On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: >>> { >>> *kiocb = (struct kiocb) { >>> .ki_filp = filp, >>> .ki_flags = filp->f_iocb_flags, >>> .ki_ioprio = get_current_ioprio(), >>> + .ki_write_hint = file_write_hint(filp), >> >> And we'll need to distinguish between the per-inode and per file >> hint. I.e. don't blindly initialize ki_write_hint to the per-inode >> one here, but make that conditional in the file operation. > > Maybe someone wants to do direct-io with partions where each partition > has a different default "hint" when not provided a per-io hint? I don't > know of such a case, but it doesn't sound terrible. In any case, I feel > if you're directing writes through these interfaces, you get to keep all > the pieces: user space controls policy, kernel just provides the > mechanisms to do it. Is it important to support partitions on top of FDP namespaces? We could follow the example of zoned block devices and not support partitions on top of FDP devices. From block/core.c, function add_partition(): /* * Partitions are not supported on zoned block devices that are used as * such. */ if (bdev_is_zoned(disk->part0)) { pr_warn("%s: partitions not supported on host managed zoned block device\n", disk->disk_name); return ERR_PTR(-ENXIO); } Thanks, Bart.
On Mon, Oct 21, 2024 at 10:09:57AM -0700, Bart Van Assche wrote: > On 10/21/24 8:47 AM, Keith Busch wrote: > > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: > > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > > > > { > > > > *kiocb = (struct kiocb) { > > > > .ki_filp = filp, > > > > .ki_flags = filp->f_iocb_flags, > > > > .ki_ioprio = get_current_ioprio(), > > > > + .ki_write_hint = file_write_hint(filp), > > > > > > And we'll need to distinguish between the per-inode and per file > > > hint. I.e. don't blindly initialize ki_write_hint to the per-inode > > > one here, but make that conditional in the file operation. > > > > Maybe someone wants to do direct-io with partions where each partition > > has a different default "hint" when not provided a per-io hint? I don't > > know of such a case, but it doesn't sound terrible. In any case, I feel > > if you're directing writes through these interfaces, you get to keep all > > the pieces: user space controls policy, kernel just provides the > > mechanisms to do it. > > Is it important to support partitions on top of FDP namespaces? It's already used with partitions, so yes, it's important. Breaking that as a condition to make it work with the block stack is a non-starter.
On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote: > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > > > { > > > *kiocb = (struct kiocb) { > > > .ki_filp = filp, > > > .ki_flags = filp->f_iocb_flags, > > > .ki_ioprio = get_current_ioprio(), > > > + .ki_write_hint = file_write_hint(filp), > > > > And we'll need to distinguish between the per-inode and per file > > hint. I.e. don't blindly initialize ki_write_hint to the per-inode > > one here, but make that conditional in the file operation. > > Maybe someone wants to do direct-io with partions where each partition > has a different default "hint" when not provided a per-io hint? I don't > know of such a case, but it doesn't sound terrible. In any case, I feel > if you're directing writes through these interfaces, you get to keep all > the pieces: user space controls policy, kernel just provides the > mechanisms to do it. Eww. You actually pointed out a real problem here: if a device has multiple partitions the write streams as of this series are shared by them, which breaks their use case as the applications or file systems in different partitions will get other users of the write stream randomly overlayed onto theirs. So either the available streams need to be split into smaller pools by partitions, or we just assigned them to the first partition to make these scheme work for partitioned devices. Either way mixing up the per-inode hint and the dynamic one remains a bad idea.
On Tue, Oct 22, 2024 at 08:43:09AM +0200, Christoph Hellwig wrote: > On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote: > > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote: > > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote: > > > > { > > > > *kiocb = (struct kiocb) { > > > > .ki_filp = filp, > > > > .ki_flags = filp->f_iocb_flags, > > > > .ki_ioprio = get_current_ioprio(), > > > > + .ki_write_hint = file_write_hint(filp), > > > > > > And we'll need to distinguish between the per-inode and per file > > > hint. I.e. don't blindly initialize ki_write_hint to the per-inode > > > one here, but make that conditional in the file operation. > > > > Maybe someone wants to do direct-io with partions where each partition > > has a different default "hint" when not provided a per-io hint? I don't > > know of such a case, but it doesn't sound terrible. In any case, I feel > > if you're directing writes through these interfaces, you get to keep all > > the pieces: user space controls policy, kernel just provides the > > mechanisms to do it. > > Eww. You actually pointed out a real problem here: if a device > has multiple partitions the write streams as of this series are > shared by them, which breaks their use case as the applications or > file systems in different partitions will get other users of the > write stream randomly overlayed onto theirs. > > So either the available streams need to be split into smaller pools > by partitions, or we just assigned them to the first partition to > make these scheme work for partitioned devices. > > Either way mixing up the per-inode hint and the dynamic one remains > a bad idea. No doubt it's almost certainly not a good idea to mix different stream usages, but that's not the kernels problem. It's user space policy. I don't think the kernel needs to perform any heroic efforts to split anything here. Just keep it simple.
On Tue, Oct 22, 2024 at 08:37:56AM -0600, Keith Busch wrote: > No doubt it's almost certainly not a good idea to mix different stream > usages, but that's not the kernels problem. It's user space policy. I > don't think the kernel needs to perform any heroic efforts to split > anything here. Just keep it simple. Unfortunately it's not. This complicated breaks any intelligent scheme to manage them. You can't write portable code assuming you can know write streams when you need to cope with someone else using some or all of the same resources.
diff --git a/block/fops.c b/block/fops.c index e696ae53bf1e0..85b9b97d372c8 100644 --- a/block/fops.c +++ b/block/fops.c @@ -74,7 +74,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb)); } bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT; - bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint; + bio.bi_write_hint = iocb->ki_write_hint; bio.bi_ioprio = iocb->ki_ioprio; if (iocb->ki_flags & IOCB_ATOMIC) bio.bi_opf |= REQ_ATOMIC; @@ -203,7 +203,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, for (;;) { bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT; - bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint; + bio->bi_write_hint = iocb->ki_write_hint; bio->bi_private = dio; bio->bi_end_io = blkdev_bio_end_io; bio->bi_ioprio = iocb->ki_ioprio; @@ -319,7 +319,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, dio->flags = 0; dio->iocb = iocb; bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT; - bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint; + bio->bi_write_hint = iocb->ki_write_hint; bio->bi_end_io = blkdev_bio_end_io_async; bio->bi_ioprio = iocb->ki_ioprio; diff --git a/fs/aio.c b/fs/aio.c index e8920178b50f7..db618817e670d 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1517,6 +1517,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type) req->ki_flags = req->ki_filp->f_iocb_flags | IOCB_AIO_RW; if (iocb->aio_flags & IOCB_FLAG_RESFD) req->ki_flags |= IOCB_EVENTFD; + req->ki_write_hint = file_write_hint(req->ki_filp); if (iocb->aio_flags & IOCB_FLAG_IOPRIO) { /* * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c index 6a821a959b59e..c3db102ae64e2 100644 --- a/fs/cachefiles/io.c +++ b/fs/cachefiles/io.c @@ -309,6 +309,7 @@ int __cachefiles_write(struct cachefiles_object *object, ki->iocb.ki_pos = start_pos; ki->iocb.ki_flags = IOCB_DIRECT | IOCB_WRITE; ki->iocb.ki_ioprio = get_current_ioprio(); + ki->iocb.ki_write_hint = file_write_hint(file); ki->object = object; ki->start = start_pos; ki->len = len; diff --git a/fs/direct-io.c b/fs/direct-io.c index bbd05f1a21453..73629e26becbe 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio, bio->bi_end_io = dio_bio_end_io; if (dio->is_pinned) bio_set_flag(bio, BIO_PAGE_PINNED); - bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint; + bio->bi_write_hint = dio->iocb->ki_write_hint; sdio->bio = bio; sdio->logical_offset_in_bio = sdio->cur_page_fs_offset; diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index f637aa0706a31..fff43f121ee65 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits, GFP_KERNEL); bio->bi_iter.bi_sector = iomap_sector(iomap, pos); - bio->bi_write_hint = inode->i_write_hint; + bio->bi_write_hint = dio->iocb->ki_write_hint; bio->bi_ioprio = dio->iocb->ki_ioprio; bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; diff --git a/include/linux/fs.h b/include/linux/fs.h index 3559446279c15..04e875a37f604 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -370,6 +370,7 @@ struct kiocb { void *private; int ki_flags; u16 ki_ioprio; /* See linux/ioprio.h */ + u16 ki_write_hint; union { /* * Only used for async buffered reads, where it denotes the @@ -2337,12 +2338,18 @@ static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap, !vfsgid_valid(i_gid_into_vfsgid(idmap, inode)); } +static inline enum rw_hint file_write_hint(struct file *filp) +{ + return file_inode(filp)->i_write_hint; +} + static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) { *kiocb = (struct kiocb) { .ki_filp = filp, .ki_flags = filp->f_iocb_flags, .ki_ioprio = get_current_ioprio(), + .ki_write_hint = file_write_hint(filp), }; } @@ -2353,6 +2360,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, .ki_filp = filp, .ki_flags = kiocb_src->ki_flags, .ki_ioprio = kiocb_src->ki_ioprio, + .ki_write_hint = kiocb_src->ki_write_hint, .ki_pos = kiocb_src->ki_pos, }; } diff --git a/io_uring/rw.c b/io_uring/rw.c index 80ae3c2ebb70c..ffd637ca0bd17 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -1027,6 +1027,7 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(ret)) return ret; req->cqe.res = iov_iter_count(&io->iter); + rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp); if (force_nonblock) { /* If the file doesn't support async, just async punt */