mbox series

[PATCHv8,0/6] write hints for nvme fdp

Message ID 20241017160937.2283225-1-kbusch@meta.com (mailing list archive)
Headers show
Series write hints for nvme fdp | expand

Message

Keith Busch Oct. 17, 2024, 4:09 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

Changes from v7:

  Limits io_uring per-io hints to raw block, and only if the block
  device registers a new queue limit indicating support for it.

  The per-io hints are opaque to the kernel.

  Minor changelog and code organization changes.

  I don't really understand the io_uring suggestions, so I just made the
  write_hint a first class field without the "meta" indirection. It's
  kind of like ioprio, which has it's own field too. Actually, might be
  neat if we could use ioprio since it already has a "hints" field that
  is currently only used by command duration limits.

Kanchan Joshi (3):
  block, fs: restore kiocb based write hint processing
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (3):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  fs: introduce per-io hint support flag

 Documentation/ABI/stable/sysfs-block |  7 +++
 block/blk-settings.c                 |  3 +
 block/blk-sysfs.c                    |  3 +
 block/fops.c                         | 10 ++--
 drivers/nvme/host/core.c             | 82 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 fs/aio.c                             |  1 +
 fs/cachefiles/io.c                   |  1 +
 fs/direct-io.c                       |  2 +-
 fs/iomap/direct-io.c                 |  2 +-
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  2 +-
 include/linux/blkdev.h               | 12 ++++
 include/linux/fs.h                   | 10 ++++
 include/linux/nvme.h                 | 19 +++++++
 include/uapi/linux/io_uring.h        |  4 ++
 io_uring/rw.c                        | 10 +++-
 17 files changed, 166 insertions(+), 10 deletions(-)

Comments

Christoph Hellwig Oct. 18, 2024, 5:50 a.m. UTC | #1
On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> From: Kanchan Joshi <joshi.k@samsung.com>
> 
> struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53
> ("fs: remove kiocb.ki_hint"). But write hint returned with commit
> 449813515d3e ("block, fs: Restore the per-bio/request data lifetime
> fields").
> 
> This patch uses the leftover space in kiocb to carve 2 byte field
> ki_write_hint. Restore the code that operates on kiocb to use
> ki_write_hint instead of inode hint value.
> 
> This does not change any behavior, but needed to enable per-io hints.

In this version it doesn't really restore anything, but adds a new
write hinting capability.   Similarly to the bio patch we'll probably
need to make clear what is in there instead of having it completely
untyped (the exact same appraoch as for the bio should work).

> index bbd05f1a21453..73629e26becbe 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
>  		bio->bi_end_io = dio_bio_end_io;
>  	if (dio->is_pinned)
>  		bio_set_flag(bio, BIO_PAGE_PINNED);
> -	bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
> +	bio->bi_write_hint = dio->iocb->ki_write_hint;
>  
>  	sdio->bio = bio;
>  	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f637aa0706a31..fff43f121ee65 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  					  GFP_KERNEL);
>  		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> -		bio->bi_write_hint = inode->i_write_hint;
> +		bio->bi_write_hint = dio->iocb->ki_write_hint;

File system (helper) code should not directly apply this limit,
but the file system needs to set it.

> +static inline enum rw_hint file_write_hint(struct file *filp)
> +{
> +	return file_inode(filp)->i_write_hint;
> +}
> +
>  static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
>  {
>  	*kiocb = (struct kiocb) {
>  		.ki_filp = filp,
>  		.ki_flags = filp->f_iocb_flags,
>  		.ki_ioprio = get_current_ioprio(),
> +		.ki_write_hint = file_write_hint(filp),

And we'll need to distinguish between the per-inode and per file
hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
one here, but make that conditional in the file operation.
Kanchan Joshi Oct. 18, 2024, 10:48 a.m. UTC | #2
On 10/17/2024 9:39 PM, Keith Busch wrote:
>   
> +#define NVME_MAX_PLIDS   (NVME_CTRL_PAGE_SIZE / sizeof(16))
> +

Seems you intended sizeof(u16).
Bart Van Assche Oct. 18, 2024, 4:11 p.m. UTC | #3
On 10/17/24 9:09 AM, Keith Busch wrote:
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3559446279c15..04e875a37f604 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -370,6 +370,7 @@ struct kiocb {
>   	void			*private;
>   	int			ki_flags;
>   	u16			ki_ioprio; /* See linux/ioprio.h */
> +	u16			ki_write_hint;
>   	union {
>   		/*
>   		 * Only used for async buffered reads, where it denotes the

Why 'u16'? Shouldn't this patch use the type 'enum rw_hint' for 
ki_write_hint and shouldn't this type be changed into u16 in a later patch?

Thanks,

Bart.
Keith Busch Oct. 21, 2024, 3:08 p.m. UTC | #4
On Fri, Oct 18, 2024 at 04:18:17PM +0530, Kanchan Joshi wrote:
> On 10/17/2024 9:39 PM, Keith Busch wrote:
> >   
> > +#define NVME_MAX_PLIDS   (NVME_CTRL_PAGE_SIZE / sizeof(16))
> > +
> 
> Seems you intended sizeof(u16).

Ha, yes, that's the intended type. I don't have anything close to 1k
placement hints, so would have never noticed this was the wrong size.
Keith Busch Oct. 21, 2024, 3:47 p.m. UTC | #5
On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> >  {
> >  	*kiocb = (struct kiocb) {
> >  		.ki_filp = filp,
> >  		.ki_flags = filp->f_iocb_flags,
> >  		.ki_ioprio = get_current_ioprio(),
> > +		.ki_write_hint = file_write_hint(filp),
> 
> And we'll need to distinguish between the per-inode and per file
> hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
> one here, but make that conditional in the file operation.

Maybe someone wants to do direct-io with partions where each partition
has a different default "hint" when not provided a per-io hint? I don't
know of such a case, but it doesn't sound terrible. In any case, I feel
if you're directing writes through these interfaces, you get to keep all
the pieces: user space controls policy, kernel just provides the
mechanisms to do it.
Bart Van Assche Oct. 21, 2024, 5:09 p.m. UTC | #6
On 10/21/24 8:47 AM, Keith Busch wrote:
> On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
>> On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
>>>   {
>>>   	*kiocb = (struct kiocb) {
>>>   		.ki_filp = filp,
>>>   		.ki_flags = filp->f_iocb_flags,
>>>   		.ki_ioprio = get_current_ioprio(),
>>> +		.ki_write_hint = file_write_hint(filp),
>>
>> And we'll need to distinguish between the per-inode and per file
>> hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
>> one here, but make that conditional in the file operation.
> 
> Maybe someone wants to do direct-io with partions where each partition
> has a different default "hint" when not provided a per-io hint? I don't
> know of such a case, but it doesn't sound terrible. In any case, I feel
> if you're directing writes through these interfaces, you get to keep all
> the pieces: user space controls policy, kernel just provides the
> mechanisms to do it.

Is it important to support partitions on top of FDP namespaces? We could
follow the example of zoned block devices and not support partitions on
top of FDP devices. From block/core.c, function add_partition():

	/*
	 * Partitions are not supported on zoned block devices that are used as
	 * such.
	 */
	if (bdev_is_zoned(disk->part0)) {
		pr_warn("%s: partitions not supported on host managed zoned block 
device\n",
			disk->disk_name);
		return ERR_PTR(-ENXIO);
	}

Thanks,

Bart.
Keith Busch Oct. 21, 2024, 7:35 p.m. UTC | #7
On Mon, Oct 21, 2024 at 10:09:57AM -0700, Bart Van Assche wrote:
> On 10/21/24 8:47 AM, Keith Busch wrote:
> > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > > >   {
> > > >   	*kiocb = (struct kiocb) {
> > > >   		.ki_filp = filp,
> > > >   		.ki_flags = filp->f_iocb_flags,
> > > >   		.ki_ioprio = get_current_ioprio(),
> > > > +		.ki_write_hint = file_write_hint(filp),
> > > 
> > > And we'll need to distinguish between the per-inode and per file
> > > hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
> > > one here, but make that conditional in the file operation.
> > 
> > Maybe someone wants to do direct-io with partions where each partition
> > has a different default "hint" when not provided a per-io hint? I don't
> > know of such a case, but it doesn't sound terrible. In any case, I feel
> > if you're directing writes through these interfaces, you get to keep all
> > the pieces: user space controls policy, kernel just provides the
> > mechanisms to do it.
> 
> Is it important to support partitions on top of FDP namespaces? 

It's already used with partitions, so yes, it's important. Breaking that
as a condition to make it work with the block stack is a non-starter.
Christoph Hellwig Oct. 22, 2024, 6:43 a.m. UTC | #8
On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote:
> On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > >  {
> > >  	*kiocb = (struct kiocb) {
> > >  		.ki_filp = filp,
> > >  		.ki_flags = filp->f_iocb_flags,
> > >  		.ki_ioprio = get_current_ioprio(),
> > > +		.ki_write_hint = file_write_hint(filp),
> > 
> > And we'll need to distinguish between the per-inode and per file
> > hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
> > one here, but make that conditional in the file operation.
> 
> Maybe someone wants to do direct-io with partions where each partition
> has a different default "hint" when not provided a per-io hint? I don't
> know of such a case, but it doesn't sound terrible. In any case, I feel
> if you're directing writes through these interfaces, you get to keep all
> the pieces: user space controls policy, kernel just provides the
> mechanisms to do it.

Eww.  You actually pointed out a real problem here: if a device
has multiple partitions the write streams as of this series are
shared by them, which breaks their use case as the applications or
file systems in different partitions will get other users of the
write stream randomly overlayed onto theirs.

So either the available streams need to be split into smaller pools
by partitions, or we just assigned them to the first partition to
make these scheme work for partitioned devices.

Either way mixing up the per-inode hint and the dynamic one remains
a bad idea.
Keith Busch Oct. 22, 2024, 2:37 p.m. UTC | #9
On Tue, Oct 22, 2024 at 08:43:09AM +0200, Christoph Hellwig wrote:
> On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote:
> > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > > >  {
> > > >  	*kiocb = (struct kiocb) {
> > > >  		.ki_filp = filp,
> > > >  		.ki_flags = filp->f_iocb_flags,
> > > >  		.ki_ioprio = get_current_ioprio(),
> > > > +		.ki_write_hint = file_write_hint(filp),
> > > 
> > > And we'll need to distinguish between the per-inode and per file
> > > hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
> > > one here, but make that conditional in the file operation.
> > 
> > Maybe someone wants to do direct-io with partions where each partition
> > has a different default "hint" when not provided a per-io hint? I don't
> > know of such a case, but it doesn't sound terrible. In any case, I feel
> > if you're directing writes through these interfaces, you get to keep all
> > the pieces: user space controls policy, kernel just provides the
> > mechanisms to do it.
> 
> Eww.  You actually pointed out a real problem here: if a device
> has multiple partitions the write streams as of this series are
> shared by them, which breaks their use case as the applications or
> file systems in different partitions will get other users of the
> write stream randomly overlayed onto theirs.
> 
> So either the available streams need to be split into smaller pools
> by partitions, or we just assigned them to the first partition to
> make these scheme work for partitioned devices.
> 
> Either way mixing up the per-inode hint and the dynamic one remains
> a bad idea.

No doubt it's almost certainly not a good idea to mix different stream
usages, but that's not the kernels problem. It's user space policy. I
don't think the kernel needs to perform any heroic efforts to split
anything here. Just keep it simple.
Christoph Hellwig Oct. 22, 2024, 2:40 p.m. UTC | #10
On Tue, Oct 22, 2024 at 08:37:56AM -0600, Keith Busch wrote:
> No doubt it's almost certainly not a good idea to mix different stream
> usages, but that's not the kernels problem. It's user space policy. I
> don't think the kernel needs to perform any heroic efforts to split
> anything here. Just keep it simple.

Unfortunately it's not.  This complicated breaks any intelligent
scheme to manage them.  You can't write portable code assuming you
can know write streams when you need to cope with someone else using
some or all of the same resources.