[PATCHv10 0/9] write hints with nvme fdp, scsi streams

Message ID: 20241029151922.459139-1-kbusch@meta.com

Message

Keith Busch Oct. 29, 2024, 3:19 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

Changes from v9:

  Document the partition hint mask

  Use bitmap_alloc API

  Fixup bitmap memory leak

  Return invalid value if user requests an invalid write hint

  Added and exported a block device feature flag for indicating generic
  placement hint support

  Added statx write hint max field

  Added BUILD_BUG_ON check for new io_uring SQE fields.

  Added reviews

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (7):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  statx: add write hint information
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  block: export placement hint feature
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block | 13 +++++
 block/bdev.c                         | 18 ++++++
 block/blk-settings.c                 |  5 ++
 block/blk-sysfs.c                    |  6 ++
 block/fops.c                         | 31 +++++++++-
 block/partitions/core.c              | 44 ++++++++++++++-
 drivers/nvme/host/core.c             | 84 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 fs/stat.c                            |  1 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 15 +++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 19 +++++++
 include/linux/stat.h                 |  1 +
 include/uapi/linux/io_uring.h        |  4 ++
 include/uapi/linux/stat.h            |  3 +-
 io_uring/io_uring.c                  |  2 +
 io_uring/rw.c                        |  3 +-
 20 files changed, 253 insertions(+), 11 deletions(-)
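
For context, the per-file write hint interface this series builds on is
already reachable from userspace via fcntl(). A minimal sketch follows; the
fallback macro values are assumptions for older libc headers, and how these
lifetime hints end up mapping onto FDP placement identifiers is exactly what
the discussion below is about:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

/* Fallbacks if the libc headers predate these (values assumed to match
 * include/uapi/linux/fcntl.h). */
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		1036	/* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_MEDIUM
#define RWH_WRITE_LIFE_MEDIUM	3
#endif

/* Attach a per-inode lifetime hint to an open file descriptor. */
static int set_write_hint(int fd)
{
	uint64_t hint = RWH_WRITE_LIFE_MEDIUM;

	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
		perror("F_SET_RW_HINT");
		return -1;
	}
	return 0;
}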

Comments

Christoph Hellwig Oct. 29, 2024, 3:24 p.m. UTC | #1
On Tue, Oct 29, 2024 at 08:19:13AM -0700, Keith Busch wrote:
>   Return invalid value if user requests an invalid write hint
> 
>   Added and exported a block device feature flag for indicating generic
>   placement hint support

But it still talks of write hints everywhere and conflates the write
streams with the temperature hints, which are completely different
beasts.
Chaitanya Kulkarni Oct. 30, 2024, 12:24 a.m. UTC | #2
On 10/29/24 08:19, Keith Busch wrote:
> From: Kanchan Joshi <joshi.k@samsung.com>
>
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
> Userspace can send the write hint information using io_uring or fcntl.
>
> Fetch the placement-identifiers if the device supports FDP. The incoming
> write-hint is mapped to a placement-identifier, which in turn is set in
> the DSPEC field of the write command.
>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Hui Qi <hui81.qi@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/nvme/host/core.c | 84 ++++++++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h |  5 +++
>   include/linux/nvme.h     | 19 +++++++++
>   3 files changed, 108 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3de7555a7de74..bd7b89912ddb9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -44,6 +44,20 @@ struct nvme_ns_info {
>   	bool is_removed;
>   };
>   
> +struct nvme_fdp_ruh_status_desc {
> +	u16 pid;
> +	u16 ruhid;
> +	u32 earutr;
> +	u64 ruamw;
> +	u8  rsvd16[16];
> +};
> +
> +struct nvme_fdp_ruh_status {
> +	u8  rsvd0[14];
> +	__le16 nruhsd;
> +	struct nvme_fdp_ruh_status_desc ruhsd[];
> +};
> +
>   unsigned int admin_timeout = 60;
>   module_param(admin_timeout, uint, 0644);
>   MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
> @@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
>   	ida_free(&head->subsys->ns_ida, head->instance);
>   	cleanup_srcu_struct(&head->srcu);
>   	nvme_put_subsystem(head->subsys);
> +	kfree(head->plids);
>   	kfree(head);
>   }
>   
> @@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>   	if (req->cmd_flags & REQ_RAHEAD)
>   		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>   
> +	if (req->write_hint && ns->head->nr_plids) {
> +		u16 hint = max(req->write_hint, ns->head->nr_plids);
> +
> +		dsmgmt |= ns->head->plids[hint - 1] << 16;
> +		control |= NVME_RW_DTYPE_DPLCMT;
> +	}
> +
>   	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
>   		return BLK_STS_INVAL;
>   
> @@ -2105,6 +2127,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
>   	return ret;
>   }
>   
> +static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
> +{
> +	struct nvme_fdp_ruh_status_desc *ruhsd;
> +	struct nvme_ns_head *head = ns->head;
> +	struct nvme_fdp_ruh_status *ruhs;
> +	struct nvme_command c = {};
> +	int size, ret, i;
> +
> +	if (head->plids)
> +		return 0;
> +
> +	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
> +	ruhs = kzalloc(size, GFP_KERNEL);
> +	if (!ruhs)
> +		return -ENOMEM;
> +
> +	c.imr.opcode = nvme_cmd_io_mgmt_recv;
> +	c.imr.nsid = cpu_to_le32(nsid);
> +	c.imr.mo = 0x1;

Can we please add a comment where values are hardcoded?

> +	c.imr.numd =  cpu_to_le32((size >> 2) - 1);
> +
> +	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
> +	if (ret)
> +		goto out;
> +
> +	i = le16_to_cpu(ruhs->nruhsd);

Instead of i, why can't we use a local variable nr_plids?



> +	if (!i)
> +		goto out;
> +
> +	ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
> +	head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
> +			      GFP_KERNEL);
> +	if (!head->plids) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < ns->head->nr_plids; i++) {
> +		ruhsd = &ruhs->ruhsd[i];
> +		head->plids[i] = le16_to_cpu(ruhsd->pid);
> +	}
> +out:
> +	kfree(ruhs);
> +	return ret;
> +}
> +
>   static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   		struct nvme_ns_info *info)
>   {
> @@ -2141,6 +2209,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   			goto out;
>   	}
>   
> +	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
> +		ret = nvme_fetch_fdp_plids(ns, info->nsid);
> +		if (ret)
> +			dev_warn(ns->ctrl->device,
> +				"FDP failure status:0x%x\n", ret);
> +		if (ret < 0)
> +			goto out;
> +	} else {
> +		ns->head->nr_plids = 0;
> +		kfree(ns->head->plids);
> +		ns->head->plids = NULL;
> +	}
> +
>   	blk_mq_freeze_queue(ns->disk->queue);
>   	ns->head->lba_shift = id->lbaf[lbaf].ds;
>   	ns->head->nuse = le64_to_cpu(id->nuse);
> @@ -2171,6 +2252,9 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   	if (!nvme_init_integrity(ns->head, &lim, info))
>   		capacity = 0;
>   
> +	lim.max_write_hints = ns->head->nr_plids;
> +	if (lim.max_write_hints)
> +		lim.features |= BLK_FEAT_PLACEMENT_HINTS;
>   	ret = queue_limits_commit_update(ns->disk->queue, &lim);
>   	if (ret) {
>   		blk_mq_unfreeze_queue(ns->disk->queue);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 093cb423f536b..cec8e5d96377b 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -454,6 +454,8 @@ struct nvme_ns_ids {
>   	u8	csi;
>   };
>   
> +#define NVME_MAX_PLIDS   (NVME_CTRL_PAGE_SIZE / sizeof(16))

This calculates how many plids can fit into the ctrl page size?

Sorry, but I didn't understand sizeof(16) here, since plids are u16
(nvme_ns_head->plids is a u16 *). Should this be sizeof(u16)?

-ck
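
To illustrate the sizeof(16) question above: sizeof applied to the literal 16
is sizeof(int), so the macro as posted divides the controller page size by 4
instead of 2 and halves the number of PLIDs that fit. A quick standalone demo,
assuming a 4096-byte controller page (the macro names are illustrative):

#include <stdio.h>
#include <stdint.h>

typedef uint16_t u16;

#define CTRL_PAGE_SIZE	4096				/* assumed NVME_CTRL_PAGE_SIZE */
#define PLIDS_AS_POSTED	(CTRL_PAGE_SIZE / sizeof(16))	/* divides by sizeof(int) == 4 */
#define PLIDS_AS_U16	(CTRL_PAGE_SIZE / sizeof(u16))	/* what the review suggests */

int main(void)
{
	printf("sizeof(16) = %zu, sizeof(u16) = %zu\n", sizeof(16), sizeof(u16));
	printf("max PLIDs as posted: %zu, with sizeof(u16): %zu\n",
	       (size_t)PLIDS_AS_POSTED, (size_t)PLIDS_AS_U16);
	return 0;
}

The same applies to the hardcoded mo value earlier in the patch: 0x1 selects
the Reclaim Unit Handle Status management operation (which is what the code
then parses), so a named constant would document the intent.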
Christoph Hellwig Nov. 5, 2024, 3:50 p.m. UTC | #3
I've pushed my branch that tries to make this work with the XFS
data separation here:

http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams

This is basically my current WIP xfs zoned (aka always write out of place)
work optimistically destined for 6.14 + the patch set in this thread +
a little fix to make it work for nvme-multipath plus the tiny patch to
wire it up.

The good news is that the API from Keith mostly works.  I don't really
know how to cope with the streams per partition bitmap, and I suspect
this will need to be dealt with a bit better.  One option might be
to always have a bitmap, which would also support discontiguous
write stream numbers as actually supported by the underlying NVMe
implementation; another option would be to always map to consecutive
numbers.

The bad news is that for file systems or applications to make full use
of the API we also really need an API to expose how much space is left
in a write stream, as otherwise they can easily get out of sync on
a power fail.  I've left that code in as a TODO; it should not affect
basic testing.

We get the same kind of performance numbers as the ZNS support on
comparable hardware platforms, which is expected.  Testing on actual
state-of-the-art, non-prototype hardware will take more time, as the
capacities are big enough that getting serious numbers takes a lot
longer.
Keith Busch Nov. 6, 2024, 6:36 p.m. UTC | #4
On Tue, Nov 05, 2024 at 04:50:14PM +0100, Christoph Hellwig wrote:
> I've pushed my branch that tries to make this work with the XFS
> data separation here:
> 
> http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams
> 
> This is basically my current WIP xfs zoned (aka always write out place)
> work optimistically destined for 6.14 + the patch set in this thread +
> a little fix to make it work for nvme-multipath plus the tiny patch to
> wire it up.
> 
> The good news is that the API from Keith mostly works.  I don't really
> know how to cope with the streams per partition bitmap, and I suspect
> this will need to be dealt with a bit better.  One option might be
> to always have a bitmap, which would also support discontiguous
> write stream numbers as actually supported by the underlying NVMe
> implementation, another option would be to always map to consecutive
> numbers.

Thanks for sharing that. Seeing the code makes it much easier to
understand where you're trying to steer this. I'll take a look and
probably have some feedback after a couple days going through it.
Keith Busch Nov. 7, 2024, 8:36 p.m. UTC | #5
On Tue, Nov 05, 2024 at 04:50:14PM +0100, Christoph Hellwig wrote:
> I've pushed my branch that tries to make this work with the XFS
> data separation here:
> 
> http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams

The zoned block support all looks pretty neat, but I think you're making
this harder than necessary to support streams. You don't need to treat
these like a sequential write device. The controller side does its own
garbage collection, so no need to duplicate the effort on the host. And
it looks like the host side gc potentially merges multiple streams into
a single gc stream, so that's probably not desirable.
Christoph Hellwig Nov. 8, 2024, 2:18 p.m. UTC | #6
On Thu, Nov 07, 2024 at 01:36:35PM -0700, Keith Busch wrote:
> The zone block support all looks pretty neat, but I think you're making
> this harder than necessary to support streams. You don't need to treat
> these like a sequential write device. The controller side does its own
> garbage collection, so no need to duplicate the effort on the host. And
> it looks like the host side gc potentially merges multiple streams into
> a single gc stream, so that's probably not desirable.

We're not really duplicating much.  Writing sequential is pretty easy,
and tracking reclaim units separately means you need another tracking
data structure, and either that or the LBA one is always going to be
badly fragmented if they aren't the same.
Keith Busch Nov. 8, 2024, 3:51 p.m. UTC | #7
On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 07, 2024 at 01:36:35PM -0700, Keith Busch wrote:
> > The zone block support all looks pretty neat, but I think you're making
> > this harder than necessary to support streams. You don't need to treat
> > these like a sequential write device. The controller side does its own
> > garbage collection, so no need to duplicate the effort on the host. And
> > it looks like the host side gc potentially merges multiple streams into
> > a single gc stream, so that's probably not desirable.
> 
> We're not really duplicating much.  Writing sequential is pretty easy,
> and tracking reclaim units separately means you need another tracking
> data structure, and either that or the LBA one is always going to be
> badly fragmented if they aren't the same.

You're getting fragmentation anyway, which is why you had to implement
gc. You're just shifting who gets to deal with it from the controller to
the host. The host is further from the media, so you're starting from a
disadvantage. The host gc implementation would have to be quite a bit
better to justify the link and memory usage necessary for the copies
(...queue a copy-offload discussion? oom?).

This xfs implementation also has logic to recover from a power fail. The
device already does that if you use the LBA abstraction instead of
tracking sequential write pointers and free blocks.

I think you are underestimating the duplication of efforts going on
here.
Matthew Wilcox Nov. 8, 2024, 4:54 p.m. UTC | #8
On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> > We're not really duplicating much.  Writing sequential is pretty easy,
> > and tracking reclaim units separately means you need another tracking
> > data structure, and either that or the LBA one is always going to be
> > badly fragmented if they aren't the same.
> 
> You're getting fragmentation anyway, which is why you had to implement
> gc. You're just shifting who gets to deal with it from the controller to
> the host. The host is further from the media, so you're starting from a
> disadvantage. The host gc implementation would have to be quite a bit
> better to justify the link and memory usage necessary for the copies
> (...queue a copy-offload discussion? oom?).

But the filesystem knows which blocks are actually in use.  Sending
TRIM/DISCARD information to the drive at block-level granularity hasn't
worked out so well in the past.  So the drive is the one at a disadvantage
because it has to copy blocks which aren't actually in use.

I like the idea of using copy-offload though.
Javier Gonzalez Nov. 8, 2024, 5:43 p.m. UTC | #9
On Fri, Nov 08, 2024 at 04:54:34PM +0000, Matthew Wilcox wrote:
> 
> On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> > On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> > > We're not really duplicating much.  Writing sequential is pretty easy,
> > > and tracking reclaim units separately means you need another tracking
> > > data structure, and either that or the LBA one is always going to be
> > > badly fragmented if they aren't the same.
> >
> > You're getting fragmentation anyway, which is why you had to implement
> > gc. You're just shifting who gets to deal with it from the controller to
> > the host. The host is further from the media, so you're starting from a
> > disadvantage. The host gc implementation would have to be quite a bit
> > better to justify the link and memory usage necessary for the copies
> > (...queue a copy-offload discussion? oom?).
> 
> But the filesystem knows which blocks are actually in use.  Sending
> TRIM/DISCARD information to the drive at block-level granularity hasn't
> worked out so well in the past.  So the drive is the one at a disadvantage
> because it has to copy blocks which aren't actually in use.

It is true that trim has not been great. I would say that at least enterprise
SSDs have fixed this in general. For FDP, DSM Deallocate is respected, which
provides a good "erase" interface to the host.

It is true though that this is not properly described in the spec and we should
fix it.

> 
> I like the idea of using copy-offload though.

We have been iterating on these patches for years, but it is unfortunately
one of those series that go in circles forever. I don't think it is due
to any specific problem, but mostly due to unaligned requests from
different folks reviewing. Last time I talked to Damien he asked me to
send the patches again; we have not followed through due to bandwidth.

If there is an interest, we can re-spin this again...
Bart Van Assche Nov. 8, 2024, 6:51 p.m. UTC | #10
On 11/8/24 9:43 AM, Javier Gonzalez wrote:
> If there is an interest, we can re-spin this again...

I'm interested. Work is ongoing in JEDEC on support for copy offloading
for UFS devices. This work involves standardizing which SCSI copy
offloading features should be supported and which features are not
required. Implementations are expected to be available soon.

Thanks,

Bart.
Christoph Hellwig Nov. 11, 2024, 6:48 a.m. UTC | #11
On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> You're getting fragmentation anyway, which is why you had to implement
> gc.

A general purpose file system always has fragmentation of some kind,
even if it manages to avoid it for certain workloads with cooperative
applications.

If there were magic pixie dust to ensure free space never fragments, file
system development would be a solved problem :)

> You're just shifting who gets to deal with it from the controller to
> the host. The host is further from the media, so you're starting from a
> disadvantage.

And the controller is further from the application and misses a lot of
information like, say, the file structure, so it is inherently at a
disadvantage.

> The host gc implementation would have to be quite a bit
> better to justify the link and memory usage necessary for the copies

That assumes you still have to do device GC.  If you do align to the
zone/erase (super)block/reclaim unit boundaries, you don't.

> This xfs implementation also has logic to recover from a power fail. The
> device already does that if you use the LBA abstraction instead of
> tracking sequential write pointers and free blocks.

Every file system has logic to recover from a power fail.  I'm not sure
what kind of discussion you're trying to kick off here.

> I think you are underestimating the duplication of efforts going on
> here.

I'm still not sure what discussion you're trying to start here.
There is very little work in here, and it is work required to support
SMR drives.  It turns out for a fair amount of workloads it actually
works really well on SSDs as well, beating everything else we've tried.
Christoph Hellwig Nov. 11, 2024, 6:49 a.m. UTC | #12
On Fri, Nov 08, 2024 at 04:54:34PM +0000, Matthew Wilcox wrote:
> I like the idea of using copy-offload though.

FYI, the XFS GC code is written so that copy offload can be easily
plugged into it.  We'll have to see how beneficial it actually is,
but at least it should give us a good test platform.
Christoph Hellwig Nov. 11, 2024, 6:51 a.m. UTC | #13
On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
> We have been iterating in the patches for years, but it is unfortunately
> one of these series that go in circles forever. I don't think it is due
> to any specific problem, but mostly due to unaligned requests form
> different folks reviewing. Last time I talked to Damien he asked me to 
> send the patches again; we have not followed through due to bandwidth.

A big problem is that it actually lacks a killer use case.  If you'd
actually manage to plug it into an in-kernel user and show a real
speedup, people might actually be interested in it and help optimize
for it.
Javier Gonzalez Nov. 11, 2024, 9:30 a.m. UTC | #14
On 11.11.2024 07:51, Christoph Hellwig wrote:
>On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>> We have been iterating in the patches for years, but it is unfortunately
>> one of these series that go in circles forever. I don't think it is due
>> to any specific problem, but mostly due to unaligned requests form
>> different folks reviewing. Last time I talked to Damien he asked me to
>> send the patches again; we have not followed through due to bandwidth.
>
>A big problem is that it actually lacks a killer use case.  If you'd
>actually manage to plug it into an in-kernel user and show a real
>speedup people might actually be interested in it and help optimizing
>for it.
>

Agree. Initially it was all about ZNS. Seems ZUFS can use it.

Then we saw good results in offload to target on NVMe-OF, similar to
copy_file_range, but that does not seem to be enough. You seem to
indicate too that XFS can use it for GC.

We can try putting a new series out to see where we are...
Javier Gonzalez Nov. 11, 2024, 9:31 a.m. UTC | #15
On 08.11.2024 10:51, Bart Van Assche wrote:
>On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>If there is an interest, we can re-spin this again...
>
>I'm interested. Work is ongoing in JEDEC on support for copy offloading
>for UFS devices. This work involves standardizing which SCSI copy
>offloading features should be supported and which features are not
>required. Implementations are expected to be available soon.
>

Do you have any specific blockers on the last series? I know you have
left comments in many of the patches already, but I think we are all a
bit confused on where we are ATM.
Johannes Thumshirn Nov. 11, 2024, 9:37 a.m. UTC | #16
On 11.11.24 10:31, Javier Gonzalez wrote:
> On 11.11.2024 07:51, Christoph Hellwig wrote:
>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>> We have been iterating in the patches for years, but it is unfortunately
>>> one of these series that go in circles forever. I don't think it is due
>>> to any specific problem, but mostly due to unaligned requests form
>>> different folks reviewing. Last time I talked to Damien he asked me to
>>> send the patches again; we have not followed through due to bandwidth.
>>
>> A big problem is that it actually lacks a killer use case.  If you'd
>> actually manage to plug it into an in-kernel user and show a real
>> speedup people might actually be interested in it and help optimizing
>> for it.
>>
> 
> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
> 
> Then we saw good results in offload to target on NVMe-OF, similar to
> copy_file_range, but that does not seem to be enough. You seem to
> indicacte too that XFS can use it for GC.
> 
> We can try putting a new series out to see where we are...

I don't want to sound like a broken record, but I've said more than
once that btrfs (regardless of zoned or non-zoned) would be very
interested in that as well, and I'd be willing to help with the code or
even do it myself once the block bits are in.

But apparently my voice doesn't count here.
Javier Gonzalez Nov. 11, 2024, 9:41 a.m. UTC | #17
On 11.11.2024 09:37, Johannes Thumshirn wrote:
>On 11.11.24 10:31, Javier Gonzalez wrote:
>> On 11.11.2024 07:51, Christoph Hellwig wrote:
>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>>> We have been iterating in the patches for years, but it is unfortunately
>>>> one of these series that go in circles forever. I don't think it is due
>>>> to any specific problem, but mostly due to unaligned requests form
>>>> different folks reviewing. Last time I talked to Damien he asked me to
>>>> send the patches again; we have not followed through due to bandwidth.
>>>
>>> A big problem is that it actually lacks a killer use case.  If you'd
>>> actually manage to plug it into an in-kernel user and show a real
>>> speedup people might actually be interested in it and help optimizing
>>> for it.
>>>
>>
>> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
>>
>> Then we saw good results in offload to target on NVMe-OF, similar to
>> copy_file_range, but that does not seem to be enough. You seem to
>> indicacte too that XFS can use it for GC.
>>
>> We can try putting a new series out to see where we are...
>
>I don't want to sound like a broken record, but I've said more than
>once, that btrfs (regardless of zoned or non-zoned) would be very
>interested in that as well and I'd be willing to help with the code or
>even do it myself once the block bits are in.
>
>But apparently my voice doesn't count here

You are right. Sorry I forgot.

Would this be through copy_file_range or something different?
Christoph Hellwig Nov. 11, 2024, 9:42 a.m. UTC | #18
On Mon, Nov 11, 2024 at 10:41:33AM +0100, Javier Gonzalez wrote:
> You are right. Sorry I forgot.
>
> Would this be through copy_file_range or something different?

Just like for f2fs, nilfs2, or the upcoming zoned xfs the prime user
would be the file system GC code.
Johannes Thumshirn Nov. 11, 2024, 9:43 a.m. UTC | #19
On 11.11.24 10:41, Javier Gonzalez wrote:
> On 11.11.2024 09:37, Johannes Thumshirn wrote:
>> On 11.11.24 10:31, Javier Gonzalez wrote:
>>> On 11.11.2024 07:51, Christoph Hellwig wrote:
>>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>>>> We have been iterating in the patches for years, but it is unfortunately
>>>>> one of these series that go in circles forever. I don't think it is due
>>>>> to any specific problem, but mostly due to unaligned requests form
>>>>> different folks reviewing. Last time I talked to Damien he asked me to
>>>>> send the patches again; we have not followed through due to bandwidth.
>>>>
>>>> A big problem is that it actually lacks a killer use case.  If you'd
>>>> actually manage to plug it into an in-kernel user and show a real
>>>> speedup people might actually be interested in it and help optimizing
>>>> for it.
>>>>
>>>
>>> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
>>>
>>> Then we saw good results in offload to target on NVMe-OF, similar to
>>> copy_file_range, but that does not seem to be enough. You seem to
>>> indicacte too that XFS can use it for GC.
>>>
>>> We can try putting a new series out to see where we are...
>>
>> I don't want to sound like a broken record, but I've said more than
>> once, that btrfs (regardless of zoned or non-zoned) would be very
>> interested in that as well and I'd be willing to help with the code or
>> even do it myself once the block bits are in.
>>
>> But apparently my voice doesn't count here
> 
> You are right. Sorry I forgot.
> 
> Would this be through copy_file_range or something different?
> 

Unfortunately not, btrfs' reclaim/balance path is a wrapper on top of
buffered read and write (plus some extra things). _BUT_ this makes it
possible to switch the read/write part and do copy offload (where possible).
Javier Gonzalez Nov. 11, 2024, 10:37 a.m. UTC | #20
On 11.11.2024 09:43, Johannes Thumshirn wrote:
>On 11.11.24 10:41, Javier Gonzalez wrote:
>> On 11.11.2024 09:37, Johannes Thumshirn wrote:
>>> On 11.11.24 10:31, Javier Gonzalez wrote:
>>>> On 11.11.2024 07:51, Christoph Hellwig wrote:
>>>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>>>>> We have been iterating in the patches for years, but it is unfortunately
>>>>>> one of these series that go in circles forever. I don't think it is due
>>>>>> to any specific problem, but mostly due to unaligned requests form
>>>>>> different folks reviewing. Last time I talked to Damien he asked me to
>>>>>> send the patches again; we have not followed through due to bandwidth.
>>>>>
>>>>> A big problem is that it actually lacks a killer use case.  If you'd
>>>>> actually manage to plug it into an in-kernel user and show a real
>>>>> speedup people might actually be interested in it and help optimizing
>>>>> for it.
>>>>>
>>>>
>>>> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
>>>>
>>>> Then we saw good results in offload to target on NVMe-OF, similar to
>>>> copy_file_range, but that does not seem to be enough. You seem to
>>>> indicacte too that XFS can use it for GC.
>>>>
>>>> We can try putting a new series out to see where we are...
>>>
>>> I don't want to sound like a broken record, but I've said more than
>>> once, that btrfs (regardless of zoned or non-zoned) would be very
>>> interested in that as well and I'd be willing to help with the code or
>>> even do it myself once the block bits are in.
>>>
>>> But apparently my voice doesn't count here
>>
>> You are right. Sorry I forgot.
>>
>> Would this be through copy_file_range or something different?
>>
>
>Unfortunately not, brtfs' reclaim/balance path is a wrapper on top of
>buffered read and write (plus some extra things). _BUT_ this makes it
>possible to switch the read/write part and do copy offload (where possible).

On 11.11.2024 10:42, hch wrote:
>On Mon, Nov 11, 2024 at 10:41:33AM +0100, Javier Gonzalez wrote:
>> You are right. Sorry I forgot.
>>
>> Would this be through copy_file_range or something different?
>
>Just like for f2fs, nilfs2, or the upcoming zoned xfs the prime user
>would be the file system GC code.

Replying to both.

Thanks. Makes sense. Now that we can take a look at your branch, we can
think about how this would look.
Bart Van Assche Nov. 11, 2024, 5:45 p.m. UTC | #21
On 11/11/24 1:31 AM, Javier Gonzalez wrote:
> On 08.11.2024 10:51, Bart Van Assche wrote:
>> On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>> If there is an interest, we can re-spin this again...
>>
>> I'm interested. Work is ongoing in JEDEC on support for copy offloading
>> for UFS devices. This work involves standardizing which SCSI copy
>> offloading features should be supported and which features are not
>> required. Implementations are expected to be available soon.
> 
> Do you have any specific blockers on the last series? I know you have
> left comments in many of the patches already, but I think we are all a
> bit confused on where we are ATM.

Nobody replied to this question that was raised 4 months ago:
https://lore.kernel.org/linux-block/4c7f30af-9fbc-4f19-8f48-ad741aa557c4@acm.org/

I think we need to agree about the answer to that question before we can
continue with implementing copy offloading.

Thanks,

Bart.
Nitesh Shetty Nov. 12, 2024, 1:52 p.m. UTC | #22
On 11/11/24 09:45AM, Bart Van Assche wrote:
>On 11/11/24 1:31 AM, Javier Gonzalez wrote:
>>On 08.11.2024 10:51, Bart Van Assche wrote:
>>>On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>>>If there is an interest, we can re-spin this again...
>>>
>>>I'm interested. Work is ongoing in JEDEC on support for copy offloading
>>>for UFS devices. This work involves standardizing which SCSI copy
>>>offloading features should be supported and which features are not
>>>required. Implementations are expected to be available soon.
>>
>>Do you have any specific blockers on the last series? I know you have
>>left comments in many of the patches already, but I think we are all a
>>bit confused on where we are ATM.
>
>Nobody replied to this question that was raised 4 months ago:
>https://lore.kernel.org/linux-block/4c7f30af-9fbc-4f19-8f48-ad741aa557c4@acm.org/
>
>I think we need to agree about the answer to that question before we can
>continue with implementing copy offloading.
>
Yes, I feel the same.
The blocker with copy has been how we should plumb things in the block layer.
We tried a couple of approaches in the past [1].
Restating them for reference:

1. Payload-based approach:
   a. Based on Mikulas' patch; here a common payload is used for both the
      source and destination bio.
   b. Initially we send the source bio; upon reaching the driver we update
      the payload and complete the bio.
   c. Send the destination bio; in the driver layer we recover the source
      info from the payload and send the copy command to the device.

   Drawback:
   The request payload contains IO information rather than data.
   Based on past experience, Christoph and Bart suggested this is not a good
   way forward.
   An alternate suggestion from Christoph was to use separate BIOs for the
   src and destination and match them using a token/id.
   As Bart pointed out, I find it hard to see how to match them when an IO
   split happens.

2. Plug-based approach:
   a. Take a plug, send the destination bio, form a request and wait for the
      src bio.
   b. Send the source bio and merge it with the destination bio.
   c. Upon release of the plug, send the request down to the driver.

   Drawback:
   Doesn't work for stacked devices which have async submission.
   Bart suggested this is not a good solution overall.
   An alternate suggestion was to use a list-based approach, but we observed
   lifetime management problems, especially in failure handling.

3. Single-bio approach:
   a. Use a single bio to represent both src and dst info.
   b. Use abnormal IO handling similar to discard.

   Drawback:
   Christoph pointed out that this will have the issue of the payload
   containing information for both the IO stack and the wire.


I am really torn on how to proceed further.
-- Nitesh Shetty


[1] https://lore.kernel.org/linux-block/20240624103212.2donuac5apwwqaor@nj.shetty@samsung.com/
Martin K. Petersen Nov. 19, 2024, 2:03 a.m. UTC | #23
Nitesh,

> 1.payload based approach:
>   a. Based on Mikulas patch, here a common payload is used for both source
>     and destination bio.
>   b. Initially we send source bio, upon reaching driver we update payload
>     and complete the bio.
>   c. Send destination bio, in driver layer we recover the source info
>     from the payload and send the copy command to device.
>
>   Drawback:
>   Request payload contains IO information rather than data.
>   Based on past experience Christoph and Bart suggested not a good way
>   forward.
>   Alternate suggestion from Christoph was to used separate BIOs for src
>   and destination and match them using token/id.
>   As Bart pointed, I find it hard how to match when the IO split happens.

In my experience the payload-based approach was what made things work. I
tried many things before settling on that. Also note that to support
token-based SCSI devices, you inevitably need to separate the
read/copy_in operation from the write/copy_out ditto and carry the token
in the payload.

For "single copy command" devices, you can just synthesize the token in
the driver. Although I don't really know what the point of the token is
in that case because as far as I'm concerned, the only interesting
information is that the read/copy_in operation made it down the stack
without being split.

Handling splits made things way too complicated for my taste. Especially
with a potential many-to-many mapping. Better to just fall back to
regular read/writes if either the copy_in or the copy_out operation
needs to be split. If your stacked storage is configured with a
prohibitively small stripe chunk size, then your copy performance is
just going to approach that of regular read/write data movement.
Not a big deal as far as I'm concerned...
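
Purely as a conceptual sketch of the payload/token matching described above;
nothing here is from a posted series, and the names, the 512-byte token size
and the fallback rule are all illustrative:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/*
 * Illustrative only: the state the copy_out/write side would need to recover
 * after the copy_in/read side has completed.  For token-based SCSI copy the
 * device returns an opaque token; for "single copy command" devices the
 * driver would synthesize one.
 */
struct copy_token {
	void		*src_bdev;	/* source block device */
	sector_t	src_sector;	/* start of the source range */
	sector_t	nr_sectors;	/* length of the source range */
	bool		split;		/* set if the copy_in bio was split */
	uint8_t		token[512];	/* opaque device token (size illustrative) */
};

/*
 * Following Martin's suggestion: if either side of the copy was split on its
 * way down the stack, give up on the offload and fall back to regular
 * read/write data movement.
 */
static inline bool copy_offload_usable(const struct copy_token *tok)
{
	return !tok->split;
}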