Message ID | 20241029151922.459139-1-kbusch@meta.com (mailing list archive) |
---|---|
Series | write hints with nvme fdp, scsi streams |
On Tue, Oct 29, 2024 at 08:19:13AM -0700, Keith Busch wrote:
> Return invalid value if user requests an invalid write hint
>
> Added and exported a block device feature flag for indicating generic
> placement hint support

But it still talks of write hints everywhere and conflates the write
streams with the temperature hints, which are completely different
beasts.
I've pushed my branch that tries to make this work with the XFS
data separation here:

http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams

This is basically my current WIP xfs zoned (aka always write out of place)
work optimistically destined for 6.14 + the patch set in this thread +
a little fix to make it work for nvme-multipath plus the tiny patch to
wire it up.

The good news is that the API from Keith mostly works. I don't really
know how to cope with the streams per partition bitmap, and I suspect
this will need to be dealt with a bit better. One option might be
to always have a bitmap, which would also support discontiguous
write stream numbers as actually supported by the underlying NVMe
implementation, another option would be to always map to consecutive
numbers.

The bad news is that for file systems or applications to make full use
of the API we also really need an API to expose how much space is left
in a write stream, as otherwise they can easily get out of sync on a
power fail. I've left that code in as a TODO, it should not affect basic
testing.

We get the same kind of performance numbers as the ZNS support on
comparable hardware platforms, which is expected. Testing on actual
state-of-the-art non-prototype hardware will take more time, as the
capacities are big enough that getting serious numbers will take a lot
more time.
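The two options for the per-partition stream bitmap can be illustrated with a small sketch. This is only a toy model, not code from the series or from the XFS branch: the function names `streams_from_bitmap` and `to_hardware_stream` are hypothetical, and the bitmap is modelled as a plain Python integer. It shows how a bitmap can describe discontiguous hardware stream numbers while callers still see consecutive indices.

```python
def streams_from_bitmap(bitmap: int) -> list[int]:
    """Return the hardware stream numbers whose bits are set, lowest first."""
    return [bit for bit in range(bitmap.bit_length()) if (bitmap >> bit) & 1]


def to_hardware_stream(bitmap: int, logical: int) -> int:
    """Map a consecutive 0-based logical stream index to a hardware stream number.

    Hypothetical helper: raises ValueError if the partition does not own
    that many streams.
    """
    streams = streams_from_bitmap(bitmap)
    if not 0 <= logical < len(streams):
        raise ValueError(f"partition has only {len(streams)} streams")
    return streams[logical]


# Assumed example: a partition that owns hardware streams 1, 4 and 5.
partition_bitmap = 0b110010
hardware_streams = streams_from_bitmap(partition_bitmap)
second_stream = to_hardware_stream(partition_bitmap, 1)
```

With this layout the bitmap remains the source of truth for which hardware streams a partition owns, and the consecutive numbering is derived from it on demand.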
On Tue, Nov 05, 2024 at 04:50:14PM +0100, Christoph Hellwig wrote:
> I've pushed my branch that tries to make this work with the XFS
> data separation here:
>
> http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams
>
> This is basically my current WIP xfs zoned (aka always write out place)
> work optimistically destined for 6.14 + the patch set in this thread +
> a little fix to make it work for nvme-multipath plus the tiny patch to
> wire it up.
>
> The good news is that the API from Keith mostly works. I don't really
> know how to cope with the streams per partition bitmap, and I suspect
> this will need to be dealt with a bit better. One option might be
> to always have a bitmap, which would also support discontiguous
> write stream numbers as actually supported by the underlying NVMe
> implementation, another option would be to always map to consecutive
> numbers.

Thanks for sharing that. Seeing the code makes it much easier to
understand where you're trying to steer this. I'll take a look and
probably have some feedback after a couple days going through it.
On Tue, Nov 05, 2024 at 04:50:14PM +0100, Christoph Hellwig wrote:
> I've pushed my branch that tries to make this work with the XFS
> data separation here:
>
> http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams

The zone block support all looks pretty neat, but I think you're making
this harder than necessary to support streams. You don't need to treat
these like a sequential write device. The controller side does its own
garbage collection, so no need to duplicate the effort on the host. And
it looks like the host side gc potentially merges multiple streams into
a single gc stream, so that's probably not desirable.
On Thu, Nov 07, 2024 at 01:36:35PM -0700, Keith Busch wrote:
> The zone block support all looks pretty neat, but I think you're making
> this harder than necessary to support streams. You don't need to treat
> these like a sequential write device. The controller side does its own
> garbage collection, so no need to duplicate the effort on the host. And
> it looks like the host side gc potentially merges multiple streams into
> a single gc stream, so that's probably not desirable.

We're not really duplicating much. Writing sequential is pretty easy,
and tracking reclaim units separately means you need another tracking
data structure, and either that or the LBA one is always going to be
badly fragmented if they aren't the same.
On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 07, 2024 at 01:36:35PM -0700, Keith Busch wrote:
> > The zone block support all looks pretty neat, but I think you're making
> > this harder than necessary to support streams. You don't need to treat
> > these like a sequential write device. The controller side does its own
> > garbage collection, so no need to duplicate the effort on the host. And
> > it looks like the host side gc potentially merges multiple streams into
> > a single gc stream, so that's probably not desirable.
>
> We're not really duplicating much. Writing sequential is pretty easy,
> and tracking reclaim units separately means you need another tracking
> data structure, and either that or the LBA one is always going to be
> badly fragmented if they aren't the same.

You're getting fragmentation anyway, which is why you had to implement
gc. You're just shifting who gets to deal with it from the controller to
the host. The host is further from the media, so you're starting from a
disadvantage. The host gc implementation would have to be quite a bit
better to justify the link and memory usage necessary for the copies
(...cue a copy-offload discussion? oom?).

This xfs implementation also has logic to recover from a power fail. The
device already does that if you use the LBA abstraction instead of
tracking sequential write pointers and free blocks.

I think you are underestimating the duplication of efforts going on
here.
On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> > We're not really duplicating much. Writing sequential is pretty easy,
> > and tracking reclaim units separately means you need another tracking
> > data structure, and either that or the LBA one is always going to be
> > badly fragmented if they aren't the same.
>
> You're getting fragmentation anyway, which is why you had to implement
> gc. You're just shifting who gets to deal with it from the controller to
> the host. The host is further from the media, so you're starting from a
> disadvantage. The host gc implementation would have to be quite a bit
> better to justify the link and memory usage necessary for the copies
> (...queue a copy-offload discussion? oom?).

But the filesystem knows which blocks are actually in use. Sending
TRIM/DISCARD information to the drive at block-level granularity hasn't
worked out so well in the past. So the drive is the one at a disadvantage
because it has to copy blocks which aren't actually in use.

I like the idea of using copy-offload though.
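The argument about who knows which blocks are live can be made concrete with a toy model. This is an illustrative sketch only, with made-up numbers and a hypothetical helper `blocks_to_copy`; it does not model any real drive or file system. The point it shows is that a garbage collector must copy every block it cannot prove dead, so the party with less liveness information copies more.

```python
def blocks_to_copy(written: set[int], known_dead: set[int]) -> int:
    """Blocks a garbage collector must relocate before erasing a unit.

    Everything that was written and is not known to be dead has to be
    treated as live and copied.
    """
    return len(written - known_dead)


# Assumed example: one reclaim unit holding 8 blocks.
written = set(range(8))

# The file system has freed 5 of them...
freed_by_fs = {1, 2, 5, 6, 7}

# ...but only 2 of those frees reached the drive as discards.
discarded_to_drive = {1, 2}

host_gc_copies = blocks_to_copy(written, freed_by_fs)
device_gc_copies = blocks_to_copy(written, discarded_to_drive)
```

In this model the gap between the two counts closes only if every free is reliably communicated to the drive, which is the discard behaviour questioned in the message above.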
> -----Original Message-----
> From: Matthew Wilcox <willy@infradead.org>
> Sent: Friday, November 8, 2024 5:55 PM
> To: Keith Busch <kbusch@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>; Keith Busch <kbusch@meta.com>;
> linux-block@vger.kernel.org; linux-nvme@lists.infradead.org; linux-scsi@vger.kernel.org;
> io-uring@vger.kernel.org; linux-fsdevel@vger.kernel.org; joshi.k@samsung.com;
> Javier Gonzalez <javier.gonz@samsung.com>; bvanassche@acm.org
> Subject: Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
>
> On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> > On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> > > We're not really duplicating much. Writing sequential is pretty easy,
> > > and tracking reclaim units separately means you need another tracking
> > > data structure, and either that or the LBA one is always going to be
> > > badly fragmented if they aren't the same.
> >
> > You're getting fragmentation anyway, which is why you had to implement
> > gc. You're just shifting who gets to deal with it from the controller to
> > the host. The host is further from the media, so you're starting from a
> > disadvantage. The host gc implementation would have to be quite a bit
> > better to justify the link and memory usage necessary for the copies
> > (...queue a copy-offload discussion? oom?).
>
> But the filesystem knows which blocks are actually in use. Sending
> TRIM/DISCARD information to the drive at block-level granularity hasn't
> worked out so well in the past. So the drive is the one at a disadvantage
> because it has to copy blocks which aren't actually in use.

It is true that trim has not been great. I would say that at least
enterprise SSDs have fixed this in general. For FDP, DSM Deallocate is
respected, which provides a good "erase" interface to the host.

It is true though that this is not properly described in the spec and
we should fix it.

>
> I like the idea of using copy-offload though.

We have been iterating on the patches for years, but it is unfortunately
one of these series that go in circles forever. I don't think it is due
to any specific problem, but mostly due to unaligned requests from
different folks reviewing. Last time I talked to Damien he asked me to
send the patches again; we have not followed through due to bandwidth.

If there is an interest, we can re-spin this again...
On 11/8/24 9:43 AM, Javier Gonzalez wrote:
> If there is an interest, we can re-spin this again...
I'm interested. Work is ongoing in JEDEC on support for copy offloading
for UFS devices. This work involves standardizing which SCSI copy
offloading features should be supported and which features are not
required. Implementations are expected to be available soon.
Thanks,
Bart.
On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> You're getting fragmentation anyway, which is why you had to implement
> gc.

A general purpose file system always has fragmentation of some kind,
even if it manages to avoid it for certain workloads with cooperative
applications.

If there was magic pixie dust to ensure freespace never fragments, file
system development would be a solved problem :)

> You're just shifting who gets to deal with it from the controller to
> the host. The host is further from the media, so you're starting from a
> disadvantage.

And the controller is further from the application and misses a lot of
information like say the file structure, so it inherently is at a
disadvantage.

> The host gc implementation would have to be quite a bit
> better to justify the link and memory usage necessary for the copies

That assumes you still have to do device GC. If you do align to the
zone/erase (super)block/reclaim unit boundaries you don't.

> This xfs implementation also has logic to recover from a power fail. The
> device already does that if you use the LBA abstraction instead of
> tracking sequential write pointers and free blocks.

Every file system has logic to recover from a power fail. I'm not sure
what kind of discussion you're trying to kick off here.

> I think you are underestimating the duplication of efforts going on
> here.

I'm still not sure what discussion you're trying to start here.
There is very little work in here, and it is work required to support
SMR drives. It turns out that for a fair amount of workloads it actually
works really well on SSDs as well, beating everything else we've tried.
On Fri, Nov 08, 2024 at 04:54:34PM +0000, Matthew Wilcox wrote:
> I like the idea of using copy-offload though.
FYI, the XFS GC code is written so that copy offload can be easily
plugged into it. We'll have to see how beneficial it actually is,
but at least it should give us a good test platform.
On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
> We have been iterating in the patches for years, but it is unfortunately
> one of these series that go in circles forever. I don't think it is due
> to any specific problem, but mostly due to unaligned requests form
> different folks reviewing. Last time I talked to Damien he asked me to
> send the patches again; we have not followed through due to bandwidth.

A big problem is that it actually lacks a killer use case. If you'd
actually manage to plug it into an in-kernel user and show a real
speedup people might actually be interested in it and help optimizing
for it.
On 11.11.2024 07:51, Christoph Hellwig wrote:
>On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>> We have been iterating in the patches for years, but it is unfortunately
>> one of these series that go in circles forever. I don't think it is due
>> to any specific problem, but mostly due to unaligned requests form
>> different folks reviewing. Last time I talked to Damien he asked me to
>> send the patches again; we have not followed through due to bandwidth.
>
>A big problem is that it actually lacks a killer use case. If you'd
>actually manage to plug it into an in-kernel user and show a real
>speedup people might actually be interested in it and help optimizing
>for it.
>

Agree. Initially it was all about ZNS. Seems ZUFS can use it.

Then we saw good results in offload to target on NVMe-OF, similar to
copy_file_range, but that does not seem to be enough. You seem to
indicate too that XFS can use it for GC.

We can try putting a new series out to see where we are...
On 08.11.2024 10:51, Bart Van Assche wrote:
>On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>If there is an interest, we can re-spin this again...
>
>I'm interested. Work is ongoing in JEDEC on support for copy offloading
>for UFS devices. This work involves standardizing which SCSI copy
>offloading features should be supported and which features are not
>required. Implementations are expected to be available soon.
>

Do you have any specific blockers on the last series? I know you have
left comments in many of the patches already, but I think we are all a
bit confused on where we are ATM.
On 11.11.24 10:31, Javier Gonzalez wrote: > On 11.11.2024 07:51, Christoph Hellwig wrote: >> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: >>> We have been iterating in the patches for years, but it is unfortunately >>> one of these series that go in circles forever. I don't think it is due >>> to any specific problem, but mostly due to unaligned requests form >>> different folks reviewing. Last time I talked to Damien he asked me to >>> send the patches again; we have not followed through due to bandwidth. >> >> A big problem is that it actually lacks a killer use case. If you'd >> actually manage to plug it into an in-kernel user and show a real >> speedup people might actually be interested in it and help optimizing >> for it. >> > > Agree. Initially it was all about ZNS. Seems ZUFS can use it. > > Then we saw good results in offload to target on NVMe-OF, similar to > copy_file_range, but that does not seem to be enough. You seem to > indicacte too that XFS can use it for GC. > > We can try putting a new series out to see where we are... I don't want to sound like a broken record, but I've said more than once, that btrfs (regardless of zoned or non-zoned) would be very interested in that as well and I'd be willing to help with the code or even do it myself once the block bits are in. But apparently my voice doesn't count here
On 11.11.2024 09:37, Johannes Thumshirn wrote: >On 11.11.24 10:31, Javier Gonzalez wrote: >> On 11.11.2024 07:51, Christoph Hellwig wrote: >>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: >>>> We have been iterating in the patches for years, but it is unfortunately >>>> one of these series that go in circles forever. I don't think it is due >>>> to any specific problem, but mostly due to unaligned requests form >>>> different folks reviewing. Last time I talked to Damien he asked me to >>>> send the patches again; we have not followed through due to bandwidth. >>> >>> A big problem is that it actually lacks a killer use case. If you'd >>> actually manage to plug it into an in-kernel user and show a real >>> speedup people might actually be interested in it and help optimizing >>> for it. >>> >> >> Agree. Initially it was all about ZNS. Seems ZUFS can use it. >> >> Then we saw good results in offload to target on NVMe-OF, similar to >> copy_file_range, but that does not seem to be enough. You seem to >> indicacte too that XFS can use it for GC. >> >> We can try putting a new series out to see where we are... > >I don't want to sound like a broken record, but I've said more than >once, that btrfs (regardless of zoned or non-zoned) would be very >interested in that as well and I'd be willing to help with the code or >even do it myself once the block bits are in. > >But apparently my voice doesn't count here You are right. Sorry I forgot. Would this be through copy_file_range or something different?
On Mon, Nov 11, 2024 at 10:41:33AM +0100, Javier Gonzalez wrote: > You are right. Sorry I forgot. > > Would this be through copy_file_range or something different? Just like for f2fs, nilfs2, or the upcoming zoned xfs the prime user would be the file system GC code.
On 11.11.24 10:41, Javier Gonzalez wrote:
> On 11.11.2024 09:37, Johannes Thumshirn wrote:
>> On 11.11.24 10:31, Javier Gonzalez wrote:
>>> On 11.11.2024 07:51, Christoph Hellwig wrote:
>>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>>>> We have been iterating in the patches for years, but it is unfortunately
>>>>> one of these series that go in circles forever. I don't think it is due
>>>>> to any specific problem, but mostly due to unaligned requests form
>>>>> different folks reviewing. Last time I talked to Damien he asked me to
>>>>> send the patches again; we have not followed through due to bandwidth.
>>>>
>>>> A big problem is that it actually lacks a killer use case. If you'd
>>>> actually manage to plug it into an in-kernel user and show a real
>>>> speedup people might actually be interested in it and help optimizing
>>>> for it.
>>>>
>>>
>>> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
>>>
>>> Then we saw good results in offload to target on NVMe-OF, similar to
>>> copy_file_range, but that does not seem to be enough. You seem to
>>> indicacte too that XFS can use it for GC.
>>>
>>> We can try putting a new series out to see where we are...
>>
>> I don't want to sound like a broken record, but I've said more than
>> once, that btrfs (regardless of zoned or non-zoned) would be very
>> interested in that as well and I'd be willing to help with the code or
>> even do it myself once the block bits are in.
>>
>> But apparently my voice doesn't count here
>
> You are right. Sorry I forgot.
>
> Would this be through copy_file_range or something different?
>

Unfortunately not, btrfs' reclaim/balance path is a wrapper on top of
buffered read and write (plus some extra things). _BUT_ this makes it
possible to switch the read/write part and do copy offload (where possible).
On 11.11.2024 09:43, Johannes Thumshirn wrote: >On 11.11.24 10:41, Javier Gonzalez wrote: >> On 11.11.2024 09:37, Johannes Thumshirn wrote: >>> On 11.11.24 10:31, Javier Gonzalez wrote: >>>> On 11.11.2024 07:51, Christoph Hellwig wrote: >>>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote: >>>>>> We have been iterating in the patches for years, but it is unfortunately >>>>>> one of these series that go in circles forever. I don't think it is due >>>>>> to any specific problem, but mostly due to unaligned requests form >>>>>> different folks reviewing. Last time I talked to Damien he asked me to >>>>>> send the patches again; we have not followed through due to bandwidth. >>>>> >>>>> A big problem is that it actually lacks a killer use case. If you'd >>>>> actually manage to plug it into an in-kernel user and show a real >>>>> speedup people might actually be interested in it and help optimizing >>>>> for it. >>>>> >>>> >>>> Agree. Initially it was all about ZNS. Seems ZUFS can use it. >>>> >>>> Then we saw good results in offload to target on NVMe-OF, similar to >>>> copy_file_range, but that does not seem to be enough. You seem to >>>> indicacte too that XFS can use it for GC. >>>> >>>> We can try putting a new series out to see where we are... >>> >>> I don't want to sound like a broken record, but I've said more than >>> once, that btrfs (regardless of zoned or non-zoned) would be very >>> interested in that as well and I'd be willing to help with the code or >>> even do it myself once the block bits are in. >>> >>> But apparently my voice doesn't count here >> >> You are right. Sorry I forgot. >> >> Would this be through copy_file_range or something different? >> > >Unfortunately not, brtfs' reclaim/balance path is a wrapper on top of >buffered read and write (plus some extra things). _BUT_ this makes it >possible to switch the read/write part and do copy offload (where possible). 
On 11.11.2024 10:42, hch wrote:
>On Mon, Nov 11, 2024 at 10:41:33AM +0100, Javier Gonzalez wrote:
>> You are right. Sorry I forgot.
>>
>> Would this be through copy_file_range or something different?
>
>Just like for f2fs, nilfs2, or the upcoming zoned xfs the prime user
>would be the file system GC code.

Replying to both.

Thanks. Makes sense. Now that we can take a look at your branch, we can
think about what this would look like.
On 11/11/24 1:31 AM, Javier Gonzalez wrote:
> On 08.11.2024 10:51, Bart Van Assche wrote:
>> On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>> If there is an interest, we can re-spin this again...
>>
>> I'm interested. Work is ongoing in JEDEC on support for copy offloading
>> for UFS devices. This work involves standardizing which SCSI copy
>> offloading features should be supported and which features are not
>> required. Implementations are expected to be available soon.
>
> Do you have any specific blockers on the last series? I know you have
> left comments in many of the patches already, but I think we are all a
> bit confused on where we are ATM.

Nobody replied to this question that was raised 4 months ago:
https://lore.kernel.org/linux-block/4c7f30af-9fbc-4f19-8f48-ad741aa557c4@acm.org/

I think we need to agree about the answer to that question before we can
continue with implementing copy offloading.

Thanks,
Bart.
On 11/11/24 09:45AM, Bart Van Assche wrote:
>On 11/11/24 1:31 AM, Javier Gonzalez wrote:
>>On 08.11.2024 10:51, Bart Van Assche wrote:
>>>On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>>>If there is an interest, we can re-spin this again...
>>>
>>>I'm interested. Work is ongoing in JEDEC on support for copy offloading
>>>for UFS devices. This work involves standardizing which SCSI copy
>>>offloading features should be supported and which features are not
>>>required. Implementations are expected to be available soon.
>>
>>Do you have any specific blockers on the last series? I know you have
>>left comments in many of the patches already, but I think we are all a
>>bit confused on where we are ATM.
>
>Nobody replied to this question that was raised 4 months ago:
>https://lore.kernel.org/linux-block/4c7f30af-9fbc-4f19-8f48-ad741aa557c4@acm.org/
>
>I think we need to agree about the answer to that question before we can
>continue with implementing copy offloading.
>

Yes, even I feel the same.
The blocker with copy has been how we should plumb things in the block
layer. We tried a couple of approaches in the past[1].
Restating for reference:

1. Payload based approach:
  a. Based on Mikulas' patch; here a common payload is used for both the
     source and destination bio.
  b. Initially we send the source bio; upon reaching the driver we update
     the payload and complete the bio.
  c. Send the destination bio; in the driver layer we recover the source
     info from the payload and send the copy command to the device.

  Drawback:
  The request payload contains IO information rather than data.
  Based on past experience, Christoph and Bart suggested this is not a
  good way forward.
  The alternate suggestion from Christoph was to use separate BIOs for
  src and destination and match them using a token/id.
  As Bart pointed out, I find it hard to see how to match them when an
  IO split happens.

2. Plug based approach:
  a. Take a plug, send the destination bio, form a request and wait for
     the src bio.
  b. Send the source bio, merge it with the destination bio.
  c. Upon release of the plug, send the request down to the driver.

  Drawback:
  Doesn't work for stacked devices which have async submission.
  Bart suggested this is not a good solution overall.
  The alternate suggestion was to use a list based approach, but we
  observed lifetime management problems, especially in failure handling.

3. Single bio approach:
  a. Use a single bio to represent both src and dst info.
  b. Use abnormal IO handling similar to discard.

  Drawback:
  Christoph pointed out that this will have the issue of the payload
  containing information for both the IO stack and the wire.

I am really torn on how to proceed further?

-- Nitesh Shetty

[1] https://lore.kernel.org/linux-block/20240624103212.2donuac5apwwqaor@nj.shetty@samsung.com/
Nitesh,

> 1.payload based approach:
> a. Based on Mikulas patch, here a common payload is used for both source
> and destination bio.
> b. Initially we send source bio, upon reaching driver we update payload
> and complete the bio.
> c. Send destination bio, in driver layer we recover the source info
> from the payload and send the copy command to device.
>
> Drawback:
> Request payload contains IO information rather than data.
> Based on past experience Christoph and Bart suggested not a good way
> forward.
> Alternate suggestion from Christoph was to used separate BIOs for src
> and destination and match them using token/id.
> As Bart pointed, I find it hard how to match when the IO split happens.

In my experience the payload-based approach was what made things work. I
tried many things before settling on that. Also note that to support
token-based SCSI devices, you inevitably need to separate the
read/copy_in operation from the write/copy_out ditto and carry the token
in the payload.

For "single copy command" devices, you can just synthesize the token in
the driver. Although I don't really know what the point of the token is
in that case because as far as I'm concerned, the only interesting
information is that the read/copy_in operation made it down the stack
without being split.

Handling splits made things way too complicated for my taste. Especially
with a potential many-to-many mapping. Better to just fall back to
regular read/writes if either the copy_in or the copy_out operation
needs to be split. If your stacked storage is configured with a
prohibitively small stripe chunk size, then your copy performance is
just going to be approaching that of a regular read/write data movement.
Not a big deal as far as I'm concerned...
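The fallback rule described above (use offload only when neither half of the copy would be split) can be sketched as a small decision function. This is a simplified model under stated assumptions, not block layer code: a split is assumed to happen whenever a range crosses a fixed stripe chunk boundary, and the names `crosses_chunk_boundary` and `plan_copy` are hypothetical.

```python
def crosses_chunk_boundary(start: int, length: int, chunk: int) -> bool:
    """True if the range [start, start + length) spans more than one chunk."""
    if length <= 0 or chunk <= 0:
        raise ValueError("length and chunk must be positive")
    return start // chunk != (start + length - 1) // chunk


def plan_copy(src: int, dst: int, length: int, chunk: int) -> str:
    """Choose between copy offload and a plain read followed by a write.

    If either the copy_in (source) or the copy_out (destination) range
    would have to be split, fall back to regular reads and writes.
    """
    if crosses_chunk_boundary(src, length, chunk):
        return "read_write"
    if crosses_chunk_boundary(dst, length, chunk):
        return "read_write"
    return "offload"


# Assumed example: 256-sector stripe chunks, 128-sector copies.
aligned = plan_copy(src=0, dst=1024, length=128, chunk=256)
straddling = plan_copy(src=200, dst=1024, length=128, chunk=256)
```

The smaller the chunk size relative to the copy length, the more often this check selects the fallback path, which matches the observation that performance then approaches that of regular read/write data movement.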
On 11/18/24 6:03 PM, Martin K. Petersen wrote:
> In my experience the payload-based approach was what made things work. I
> tried many things before settling on that. Also note that to support
> token-based SCSI devices, you inevitably need to separate the
> read/copy_in operation from the write/copy_out ditto and carry the token
> in the payload.
>
> For "single copy command" devices, you can just synthesize the token in
> the driver. Although I don't really know what the point of the token is
> in that case because as far as I'm concerned, the only interesting
> information is that the read/copy_in operation made it down the stack
> without being split.

Hi Martin,

There are some strong arguments in this thread from May 2024 in favor of
representing the entire copy operation as a single REQ_OP_ operation:
https://lore.kernel.org/linux-block/20240520102033.9361-1-nj.shetty@samsung.com/

Token-based copy offloading (called ODX by Microsoft) could be
implemented by maintaining a state machine in the SCSI sd driver
and using a single block layer request to submit the three following
SCSI commands:
* POPULATE TOKEN
* RECEIVE ROD TOKEN INFORMATION
* WRITE USING TOKEN

I'm assuming that the IMMED bit will be set to zero in the WRITE USING
TOKEN command. Otherwise one or more additional RECEIVE ROD TOKEN
INFORMATION commands would be required to poll for the WRITE USING TOKEN
completion status.

I guess that the block layer maintainer wouldn't be happy if all block
drivers would have to deal with three or four phases for copy
offloading just because ODX is this complicated.

Thanks,
Bart.
Bart,

> There are some strong arguments in this thread from May 2024 in favor of
> representing the entire copy operation as a single REQ_OP_ operation:
> https://lore.kernel.org/linux-block/20240520102033.9361-1-nj.shetty@samsung.com/

As has been discussed many times, a copy operation is semantically a
read operation followed by a write operation. And, based on my
experience implementing support for both types of copy offload in Linux,
what made things elegant was treating the operation as a read followed
by a write throughout the stack. Exactly like the token-based offload
specification describes.

> Token-based copy offloading (called ODX by Microsoft) could be
> implemented by maintaining a state machine in the SCSI sd driver

I suspect the SCSI maintainer would object strongly to the idea of
maintaining cross-device copy offload state and associated object
lifetime issues in the sd driver.

> I'm assuming that the IMMED bit will be set to zero in the WRITE USING
> TOKEN command. Otherwise one or more additional RECEIVE ROD TOKEN
> INFORMATION commands would be required to poll for the WRITE USING TOKEN
> completion status.

What would be the benefit of making WRITE USING TOKEN a background
operation? That seems like a completely unnecessary complication.

> I guess that the block layer maintainer wouldn't be happy if all block
> drivers would have to deal with three or four phases for copy
> offloading just because ODX is this complicated.

Last I looked, EXTENDED COPY consumed something like 70 pages in the
spec. Token-based copy is trivially simple and elegant by comparison.
On 11/26/24 6:54 PM, Martin K. Petersen wrote: > Bart wrote: >> There are some strong arguments in this thread from May 2024 in favor of >> representing the entire copy operation as a single REQ_OP_ operation: >> https://lore.kernel.org/linux-block/20240520102033.9361-1-nj.shetty@samsung.com/ > > As has been discussed many times, a copy operation is semantically a > read operation followed by a write operation. And, based on my > experience implementing support for both types of copy offload in Linux, > what made things elegant was treating the operation as a read followed > by a write throughout the stack. Exactly like the token-based offload > specification describes. Submitting a copy operation as two bios or two requests means that there is a risk that one of the two operations never reaches the block driver at the bottom of the storage stack and hence that a deadlock occurs. I prefer not to introduce any mechanisms that can cause a deadlock. As one can see here, Damien Le Moal and Keith Busch both prefer to submit copy operations as a single operation: Keith Busch, Re: [PATCH v20 02/12] Add infrastructure for copy offload in block and request layer, linux-block mailing list, 2024-06-24 (https://lore.kernel.org/all/Znn6C-C73Tps3WJk@kbusch-mbp.dhcp.thefacebook.com/). >> Token-based copy offloading (called ODX by Microsoft) could be >> implemented by maintaining a state machine in the SCSI sd driver > > I suspect the SCSI maintainer would object strongly to the idea of > maintaining cross-device copy offload state and associated object > lifetime issues in the sd driver. Such information wouldn't have to be maintained inside the sd driver. A new kernel module could be introduced that tracks the state of copy operations and that interacts with the sd driver. >> I'm assuming that the IMMED bit will be set to zero in the WRITE USING >> TOKEN command. 
Otherwise one or more additional RECEIVE ROD TOKEN >> INFORMATION commands would be required to poll for the WRITE USING TOKEN >> completion status. > > What would the benefit of making WRITE USING TOKEN be a background > operation? That seems like a completely unnecessary complication. If a single copy operation takes significantly more time than the time required to switch between power states, power can be saved by using IMMED=1. Mechanisms like run-time power management (RPM) or the UFS host controller auto-hibernation mechanism can only be activated if no commands are in progress. With IMMED=0, the link between the host and the storage device will remain powered as long as the copy operation is in progress. With IMMED=1, the link between the host and the storage device can be powered down after the copy operation has been submitted until the host decides to check whether or not the copy operation has completed. >> I guess that the block layer maintainer wouldn't be happy if all block >> drivers would have to deal with three or four phases for copy >> offloading just because ODX is this complicated. > > Last I looked, EXTENDED COPY consumed something like 70 pages in the > spec. Token-based copy is trivially simple and elegant by comparison. I don't know of any storage device vendor who has implemented all EXTENDED COPY features that have been standardized. Assuming that 50 lines of code fit on a single page, here is an example of an EXTENDED COPY implementation that can be printed on 21 pages of paper: $ wc -l drivers/target/target_core_xcopy.c 1041 $ echo $(((1041 + 49) / 50)) 21 The advantages of EXTENDED COPY over ODX are as follows: - EXTENDED COPY is a single SCSI command and hence better suited for devices with a limited queue depth. While the UFS 3.0 standard restricts the queue depth to 32, most UFS 4.0 devices support a queue depth of 64. 
- The latency of setting up a copy command with EXTENDED COPY is lower since only a single command has to be sent to the device. ODX requires three round-trips to the device (assuming IMMED=0). - EXTENDED COPY requires less memory in storage devices. Each ODX token occupies some memory and the rules around token lifetimes are nontrivial. Thanks, Bart.
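The page estimate in the message above is a ceiling division of the line count by an assumed 50 lines per page. A minimal Python sketch of that arithmetic follows; the function name pages_needed is made up for illustration, and 1041 is the wc -l figure quoted above.

```python
def pages_needed(lines: int, lines_per_page: int = 50) -> int:
    """Round lines / lines_per_page up, using integer arithmetic only."""
    return (lines + lines_per_page - 1) // lines_per_page


# Same computation as the shell expression $(((1041 + 49) / 50)).
print(pages_needed(1041))
```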
Bart, > Submitting a copy operation as two bios or two requests means that > there is a risk that one of the two operations never reaches the block > driver at the bottom of the storage stack and hence that a deadlock > occurs. I prefer not to introduce any mechanisms that can cause a > deadlock. How do you copy a block range without offload? You perform a READ to read the data into memory. And once the READ completes, you do a WRITE of the data to the new location. Token-based copy offload works exactly the same way. You do a POPULATE TOKEN which is identical to a READ except you get a cookie instead of the actual data. And then once you have the cookie, you perform a WRITE USING TOKEN to perform the write operation. Semantically, it's exactly the same as a normal copy except for the lack of data movement. That's the whole point! Once I had support for token-based copy offload working, it became clear to me that this approach is much simpler than pointer matching, bio pairs, etc. The REQ_OP_COPY_IN operation and the REQ_OP_COPY_OUT operation are never in flight at the same time. There are no synchronization hassles, no lifetimes, no lookup tables in the sd driver, no nonsense. Semantically, it's a read followed by a write. For devices that implement single-command copy offload, the REQ_OP_COPY_IN operation only serves as a validation that no splitting took place. Once the bio reaches the ULD, the I/O is completed without ever sending a command to the device. blk-lib then issues a REQ_OP_COPY_OUT which gets turned into EXTENDED COPY or NVMe Copy and sent to the destination device. Aside from making things trivially simple, the COPY_IN/COPY_OUT semantic is a *requirement* for token-based offload devices. Why would we even consider having two incompatible sets of copy offload semantics coexist in the block layer?
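The "read followed by a write" semantics described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: ToyTokenDevice, populate_token, write_using_token and copy_blocks are hypothetical names, not kernel or SCSI APIs, and the toy snapshots the data when the token is created, which is a simplification of whatever a real device does internally.

```python
import itertools


class ToyTokenDevice:
    """Hypothetical in-memory stand-in for a token-based copy offload device."""

    def __init__(self, nr_blocks):
        self.blocks = [0] * nr_blocks
        self._tokens = {}
        self._ids = itertools.count(1)

    def populate_token(self, lba, nr):
        # Behaves like a READ, except the caller gets a cookie, not the data.
        token = next(self._ids)
        self._tokens[token] = list(self.blocks[lba:lba + nr])
        return token

    def write_using_token(self, token, lba):
        # Behaves like a WRITE whose payload is identified by the cookie.
        data = self._tokens.pop(token)
        self.blocks[lba:lba + len(data)] = data


def copy_blocks(dev, src, dst, nr):
    # The "COPY_IN" phase completes before the "COPY_OUT" phase is issued,
    # so the two are never in flight at the same time.
    token = dev.populate_token(src, nr)
    dev.write_using_token(token, dst)
```

The caller holds the token between the two phases, so the toy keeps no per-copy state once the write has consumed it.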
On 11/27/24 12:14 PM, Martin K. Petersen wrote: > Once I had support for token-based copy offload working, it became clear > to me that this approach is much simpler than pointer matching, bio > pairs, etc. The REQ_OP_COPY_IN operation and the REQ_OP_COPY_OUT > operation are never in flight at the same time. There are no > synchronization hassles, no lifetimes, no lookup tables in the sd > driver, no nonsense. Semantically, it's a read followed by a write. What if the source LBA range does not require splitting but the destination LBA range requires splitting, e.g. because it crosses a chunk_sectors boundary? Will the REQ_OP_COPY_IN operation succeed in this case and the REQ_OP_COPY_OUT operation fail? Does this mean that a third operation is needed to cancel REQ_OP_COPY_IN operations if the REQ_OP_COPY_OUT operation fails? Additionally, how to handle bugs in REQ_OP_COPY_* submitters where a large number of REQ_OP_COPY_IN operations is submitted without corresponding REQ_OP_COPY_OUT operation? Is perhaps a mechanism required to discard unmatched REQ_OP_COPY_IN operations after a certain time? > Aside from making things trivially simple, the COPY_IN/COPY_OUT semantic > is a *requirement* for token-based offload devices. Hmm ... we may each have a different opinion about whether or not the COPY_IN/COPY_OUT semantics are a requirement for token-based copy offloading. Additionally, I'm not convinced that implementing COPY_IN/COPY_OUT for ODX devices is that simple. The COPY_IN and COPY_OUT operations have to be translated into three SCSI commands, isn't it? I'm referring to the POPULATE TOKEN, RECEIVE ROD TOKEN INFORMATION and WRITE USING TOKEN commands. What is your opinion about how to translate the two block layer operations into these three SCSI commands? > Why would we even consider having two incompatible sets of copy > offload semantics coexist in the block layer? I am not aware of any proposal to implement two sets of copy operations in the block layer. 
All proposals I have seen so far involve adding a single set of copy operations to the block layer. Opinions differ however about whether to add a single copy operation primitive or separate IN and OUT primitives. Thanks, Bart.
Bart, > What if the source LBA range does not require splitting but the > destination LBA range requires splitting, e.g. because it crosses a > chunk_sectors boundary? Will the REQ_OP_COPY_IN operation succeed in > this case and the REQ_OP_COPY_OUT operation fail? Yes. I experimented with approaching splitting in an iterative fashion. And thus, if there was a split halfway through the COPY_IN I/O, we'd issue a corresponding COPY_OUT up to the split point and hope that the write subsequently didn't need a split. And then deal with the next segment. However, given that copy offload offers diminishing returns for small I/Os, it was not worth the hassle for the devices I used for development. It was cleaner and faster to just fall back to regular read/write when a split was required. > Does this mean that a third operation is needed to cancel > REQ_OP_COPY_IN operations if the REQ_OP_COPY_OUT operation fails? No. The device times out the token. > Additionally, how to handle bugs in REQ_OP_COPY_* submitters where a > large number of REQ_OP_COPY_IN operations is submitted without > corresponding REQ_OP_COPY_OUT operation? Is perhaps a mechanism > required to discard unmatched REQ_OP_COPY_IN operations after a > certain time? See above. For your EXTENDED COPY use case there is no token and thus the COPY_IN completes immediately. And for the token case, if you populate a million tokens and don't use them before they time out, it sounds like your submitting code is badly broken. But it doesn't matter because there are no I/Os in flight and thus nothing to discard. > Hmm ... we may each have a different opinion about whether or not the > COPY_IN/COPY_OUT semantics are a requirement for token-based copy > offloading. Maybe. But you'll have a hard time convincing me to add any kind of state machine or bio matching magic to the SCSI stack when the simplest solution is to treat copying like a read followed by a write. 
There is no concurrency, no kernel state, no dependency between two commands, nor two scsi_disk/scsi_device object lifetimes to manage. > Additionally, I'm not convinced that implementing COPY_IN/COPY_OUT for > ODX devices is that simple. The COPY_IN and COPY_OUT operations have > to be translated into three SCSI commands, isn't it? I'm referring to > the POPULATE TOKEN, RECEIVE ROD TOKEN INFORMATION and WRITE USING > TOKEN commands. What is your opinion about how to translate the two > block layer operations into these three SCSI commands? COPY_IN is translated to a NOP for devices implementing EXTENDED COPY and a POPULATE TOKEN for devices using tokens. COPY_OUT is translated to an EXTENDED COPY (or NVMe Copy) for devices using a single command approach and WRITE USING TOKEN for devices using tokens. There is no need for RECEIVE ROD TOKEN INFORMATION. I am not aware of UFS devices using the token-based approach. And for EXTENDED COPY there is only a single command sent to the device. If you want to do power management while that command is being processed, please deal with that in UFS. The block layer doesn't deal with the async variants of any of the other SCSI commands either...
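The translation described above can be written down as a small lookup. This is a sketch, not driver code: the Device enum and the translate function are hypothetical, and the strings are simply the command names used in this thread. A return value of None stands for "complete the bio without sending a command to the device".

```python
from enum import Enum
from typing import Optional


class Device(Enum):
    SCSI_EXTENDED_COPY = "scsi-xcopy"   # single-command offload
    NVME_COPY = "nvme-copy"             # single-command offload
    SCSI_TOKEN = "scsi-token"           # token-based offload


def translate(op: str, dev: Device) -> Optional[str]:
    """Map a block layer copy phase to the command named in the thread."""
    if op == "COPY_IN":
        # Only token-based devices need a command for the first phase.
        return "POPULATE TOKEN" if dev is Device.SCSI_TOKEN else None
    if op == "COPY_OUT":
        return {
            Device.SCSI_EXTENDED_COPY: "EXTENDED COPY",
            Device.NVME_COPY: "NVMe Copy",
            Device.SCSI_TOKEN: "WRITE USING TOKEN",
        }[dev]
    raise ValueError(f"unknown copy phase: {op}")
```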
On Wed, Nov 27, 2024 at 03:14:09PM -0500, Martin K. Petersen wrote: > How do you copy a block range without offload? You perform a READ to > read the data into memory. And once the READ completes, you do a WRITE > of the data to the new location. Yes. I.e. this is code that makes this pattern very clear, and for which I'd love to be able to use copy offload when available: http://git.infradead.org/?p=users/hch/xfs.git;a=blob;f=fs/xfs/xfs_zone_gc.c;h=ed8aa08b3c18d50afe79326e697d83e09542a9b6;hb=refs/heads/xfs-zoned#l820
On 11/28/24 11:09, Martin K. Petersen wrote: > > Bart, > >> What if the source LBA range does not require splitting but the >> destination LBA range requires splitting, e.g. because it crosses a >> chunk_sectors boundary? Will the REQ_OP_COPY_IN operation succeed in >> this case and the REQ_OP_COPY_OUT operation fail? > > Yes. > > I experimented with approaching splitting in an iterative fashion. And > thus, if there was a split halfway through the COPY_IN I/O, we'd issue a > corresponding COPY_OUT up to the split point and hope that the write > subsequently didn't need a split. And then deal with the next segment. > > However, given that copy offload offers diminishing returns for small > I/Os, it was not worth the hassle for the devices I used for > development. It was cleaner and faster to just fall back to regular > read/write when a split was required. > >> Does this mean that a third operation is needed to cancel >> REQ_OP_COPY_IN operations if the REQ_OP_COPY_OUT operation fails? > > No. The device times out the token. > >> Additionally, how to handle bugs in REQ_OP_COPY_* submitters where a >> large number of REQ_OP_COPY_IN operations is submitted without >> corresponding REQ_OP_COPY_OUT operation? Is perhaps a mechanism >> required to discard unmatched REQ_OP_COPY_IN operations after a >> certain time? > > See above. > > For your EXTENDED COPY use case there is no token and thus the COPY_IN > completes immediately. > > And for the token case, if you populate a million tokens and don't use > them before they time out, it sounds like your submitting code is badly > broken. But it doesn't matter because there are no I/Os in flight and > thus nothing to discard. > >> Hmm ... we may each have a different opinion about whether or not the >> COPY_IN/COPY_OUT semantics are a requirement for token-based copy >> offloading. > > Maybe. 
But you'll have a hard time convincing me to add any kind of > state machine or bio matching magic to the SCSI stack when the simplest > solution is to treat copying like a read followed by a write. There is > no concurrency, no kernel state, no dependency between two commands, nor > two scsi_disk/scsi_device object lifetimes to manage. And that also would allow supporting a fake copy offload with regular read/write BIOs very easily, I think. So all block devices can be presented as supporting "copy offload". That is nice for FSes. > >> Additionally, I'm not convinced that implementing COPY_IN/COPY_OUT for >> ODX devices is that simple. The COPY_IN and COPY_OUT operations have >> to be translated into three SCSI commands, isn't it? I'm referring to >> the POPULATE TOKEN, RECEIVE ROD TOKEN INFORMATION and WRITE USING >> TOKEN commands. What is your opinion about how to translate the two >> block layer operations into these three SCSI commands? > > COPY_IN is translated to a NOP for devices implementing EXTENDED COPY > and a POPULATE TOKEN for devices using tokens. > > COPY_OUT is translated to an EXTENDED COPY (or NVMe Copy) for devices > using a single command approach and WRITE USING TOKEN for devices using > tokens. ATA WRITE GATHERED command is also a single copy command. That matches and while I have not checked SAT, translation would likely work. While I was initially worried that the 2 BIO based approach would be overly complicated, it seems that I was wrong :) > > There is no need for RECEIVE ROD TOKEN INFORMATION. > > I am not aware of UFS devices using the token-based approach. And for > EXTENDED COPY there is only a single command sent to the device. If you > want to do power management while that command is being processed, > please deal with that in UFS. The block layer doesn't deal with the > async variants of any of the other SCSI commands either... >
On Wed, Nov 27, 2024 at 03:14:09PM -0500, Martin K. Petersen wrote: > > Bart, > > > Submitting a copy operation as two bios or two requests means that > > there is a risk that one of the two operations never reaches the block > > driver at the bottom of the storage stack and hence that a deadlock > > occurs. I prefer not to introduce any mechanisms that can cause a > > deadlock. > > How do you copy a block range without offload? You perform a READ to > read the data into memory. And once the READ completes, you do a WRITE > of the data to the new location. > > Token-based copy offload works exactly the same way. You do a POPULATE > TOKEN which is identical to a READ except you get a cookie instead of > the actual data. And then once you have the cookie, you perform a WRITE > USING TOKEN to perform the write operation. Semantically, it's exactly > the same as a normal copy except for the lack of data movement. That's > the whole point! I think of copy a little differently. When you do a normal write command, the host provides the controller a vector of sources and lengths. A copy command is like a write command, but the sources are just logical block addresses instead of memory addresses. Whatever solution happens, it would be a real shame if it doesn't allow vectored LBAs. The token based source bio doesn't seem to extend to that.
On Thu, Nov 28, 2024 at 08:21:16AM -0700, Keith Busch wrote: > I think of copy a little differently. When you do a normal write > command, the host provides the controller a vector of sources and > lengths. A copy command is like a write command, but the sources are > just logical block addresses instead of memory addresses. > > Whatever solution happens, it would be a real shame if it doesn't allow > vectored LBAs. The token based source bio doesn't seem to extend to > that. POPULATE TOKEN as defined by SCSI/SBC takes a list of LBA ranges as well.
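A copy whose source is a vector of LBA ranges, as discussed in the two messages above, might be modelled like this. It is a sketch only: LbaRange and vectored_copy are illustrative names, and the device is reduced to a Python list of block values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LbaRange:
    start: int   # first logical block of the range
    count: int   # number of logical blocks


def vectored_copy(blocks: List[int], sources: List[LbaRange], dst_start: int) -> int:
    """Gather all source ranges, then write them contiguously at dst_start."""
    # Gathering first gives read-then-write behaviour even if ranges overlap.
    data = [b for r in sources for b in blocks[r.start:r.start + r.count]]
    blocks[dst_start:dst_start + len(data)] = data
    return len(data)
```

The sources are addressed by LBA rather than by memory address, which is the analogy to a normal write drawn in the first of the two messages.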
On Thu, Nov 28, 2024 at 05:51:52PM +0900, Damien Le Moal wrote: > > Maybe. But you'll have a hard time convincing me to add any kind of > > state machine or bio matching magic to the SCSI stack when the simplest > > solution is to treat copying like a read followed by a write. There is > > no concurrency, no kernel state, no dependency between two commands, nor > > two scsi_disk/scsi_device object lifetimes to manage. > > And that also would allow supporting a fake copy offload with regular > read/write BIOs very easily, I think. So all block devices can be > presented as supporting "copy offload". That is nice for FSes. Just as when that showed up in one of the last copy offload series, I'm still very critical of a stateless copy offload emulation. The reason for that is that a host based copy helper needs scratch space to read into, and doing these large allocations on every copy puts a lot of pressure onto the allocator. Allocating the buffer once at mount time and then just cycling through it is generally a lot more efficient.
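The "allocate once, then cycle through it" idea from the message above, as a minimal sketch. Everything here is hypothetical: EmulatedCopier is not a kernel helper, the device is a bytearray, and the source and destination are assumed not to overlap.

```python
class EmulatedCopier:
    """Host-side copy emulation that reuses one scratch buffer."""

    def __init__(self, scratch_size: int):
        # Allocated once (think: at mount time), not once per copy.
        self.scratch = bytearray(scratch_size)

    def copy(self, dev: bytearray, src: int, dst: int, length: int) -> None:
        view = memoryview(self.scratch)
        done = 0
        while done < length:
            n = min(len(self.scratch), length - done)
            # "Read" a chunk into the scratch buffer ...
            view[:n] = dev[src + done:src + done + n]
            # ... then "write" it out to the destination.
            dev[dst + done:dst + done + n] = view[:n]
            done += n
```

A copy larger than the scratch buffer is simply processed in several passes, so the buffer size bounds memory use regardless of the copy length.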
On 11/29/24 15:19, Christoph Hellwig wrote: > On Thu, Nov 28, 2024 at 05:51:52PM +0900, Damien Le Moal wrote: >>> Maybe. But you'll have a hard time convincing me to add any kind of >>> state machine or bio matching magic to the SCSI stack when the simplest >>> solution is to treat copying like a read followed by a write. There is >>> no concurrency, no kernel state, no dependency between two commands, nor >>> two scsi_disk/scsi_device object lifetimes to manage. >> >> And that also would allow supporting a fake copy offload with regular >> read/write BIOs very easily, I think. So all block devices can be >> presented as supporting "copy offload". That is nice for FSes. > > Just as when that showed up in one of the last copy offload series, > I'm still very critical of a stateless copy offload emulation. The > reason for that is that a host based copy helper needs scratch space > to read into, and doing these large allocations on every copy puts a > lot of pressure onto the allocator. Allocating the buffer once at > mount time and then just cycling through it is generally a lot more > efficient. Sure, that sounds good. My point was that it seems that a token based copy offload design makes it relatively easy to emulate copy in software for devices that do not support copy offload in hardware. That emulation can certainly be implemented using a single buffer like you suggest.
On 27/11/24 03:14PM, Martin K. Petersen wrote: > >Bart, > >> Submitting a copy operation as two bios or two requests means that >> there is a risk that one of the two operations never reaches the block >> driver at the bottom of the storage stack and hence that a deadlock >> occurs. I prefer not to introduce any mechanisms that can cause a >> deadlock. > >How do you copy a block range without offload? You perform a READ to >read the data into memory. And once the READ completes, you do a WRITE >of the data to the new location. > >Token-based copy offload works exactly the same way. You do a POPULATE >TOKEN which is identical to a READ except you get a cookie instead of >the actual data. And then once you have the cookie, you perform a WRITE >USING TOKEN to perform the write operation. Semantically, it's exactly >the same as a normal copy except for the lack of data movement. That's >the whole point! > Martin, this approach looks simpler to me as well. But where do we store the read sector info before sending the write? I see two approaches here: 1. Should it be part of a payload along with the write? We did something similar in a previous series, which was not liked by Christoph and Bart. 2. Or should the driver store it as part of an internal list inside the namespace/ctrl data structure? As Bart pointed out, here we might need to send one more fail request later if copy_write fails to land in the same driver. Thanks, Nitesh Shetty
Nitesh, > This approach looks simpler to me as well. > But where do we store the read sector info before sending write. > I see 2 approaches here, > 1. Should it be part of a payload along with write ? We did something > similar in previous series which was not liked by Christoph and Bart. > 2. Or driver should store it as part of an internal list inside > namespace/ctrl data structure ? As Bart pointed out, here we might > need to send one more fail request later if copy_write fails to land > in same driver. The problem with option 2 is that when you're doing copy between two different LUNs, then you suddenly have to maintain state in one kernel object about stuff relating to another kernel object. I think that is messy. Seems unnecessarily complex. With option 1, for single command offload, there is no payload to worry about. Only command completion status matters for the COPY_IN phase. And once you have completion, you can issue a COPY_OUT. Done. For token based offload, I really don't understand the objection to storing the cookie in the bio. I fail to see the benefit of storing the cookie in the driver and then have the bio refer to something else which maps to the actual cookie returned by the storage. Again that introduces object lifetime complexity. It's much simpler to just have the cookie be part of the very command that is being executed. Once the COPY_IN completes, you can either use the cookie or throw it away. Doesn't matter. The device will time it out if you sit on it too long. And there is zero state in the kernel outside of the memory for the cookie that you, as the submitter, are responsible for deallocating.
On 12/5/24 12:03 AM, Nitesh Shetty wrote: > But where do we store the read sector info before sending write. > I see 2 approaches here, > 1. Should it be part of a payload along with write ? > We did something similar in previous series which was not liked > by Christoph and Bart. > 2. Or driver should store it as part of an internal list inside > namespace/ctrl data structure ? > As Bart pointed out, here we might need to send one more fail > request later if copy_write fails to land in same driver. Hi Nitesh, Consider the following example: dm-linear is used to concatenate two block devices. An NVMe device (LBA 0..999) and a SCSI device (LBA 1000..1999). Suppose that a copy operation is submitted to the dm-linear device to copy LBAs 1..998 to LBAs 2..1998. If the copy operation is submitted as two separate operations (REQ_OP_COPY_SRC and REQ_OP_COPY_DST) then the NVMe device will receive the REQ_OP_COPY_SRC operation and the SCSI device will receive the REQ_OP_COPY_DST operation. The NVMe and SCSI device drivers should fail the copy operations after a timeout because they only received half of the copy operation. After the timeout the block layer core can switch from offloading to emulating a copy operation. Waiting for a timeout is necessary because requests may be reordered. I think this is a strong argument in favor of representing copy operations as a single operation. This will allow stacking drivers such as dm-linear to deal in an elegant way with copy offload requests where source and destination LBA ranges map onto different block devices and potentially different block drivers. Thanks, Bart.
On 12/10/24 07:13, Bart Van Assche wrote: > On 12/5/24 12:03 AM, Nitesh Shetty wrote: >> But where do we store the read sector info before sending write. >> I see 2 approaches here, >> 1. Should it be part of a payload along with write ? >> We did something similar in previous series which was not liked >> by Christoph and Bart. >> 2. Or driver should store it as part of an internal list inside >> namespace/ctrl data structure ? >> As Bart pointed out, here we might need to send one more fail >> request later if copy_write fails to land in same driver. > > Hi Nitesh, > > Consider the following example: dm-linear is used to concatenate two > block devices. An NVMe device (LBA 0..999) and a SCSI device (LBA > 1000..1999). Suppose that a copy operation is submitted to the dm-linear > device to copy LBAs 1..998 to LBAs 2..1998. If the copy operation is > submitted as two separate operations (REQ_OP_COPY_SRC and > REQ_OP_COPY_DST) then the NVMe device will receive the REQ_OP_COPY_SRC > operation and the SCSI device will receive the REQ_OP_COPY_DST > operation. The NVMe and SCSI device drivers should fail the copy > operations after a timeout because they only received half of the copy > operation. After the timeout the block layer core can switch from > offloading to emulating a copy operation. Waiting for a timeout is > necessary because requests may be reordered. > > I think this is a strong argument in favor of representing copy > operations as a single operation. This will allow stacking drivers > such as dm-linear to deal in an elegant way with copy offload requests > where source and destination LBA ranges map onto different block > devices and potentially different block drivers. Why? As long as REQ_OP_COPY_SRC carries both source and destination information, DM can trivially detect that the copy is not within a single device and either return ENOTSUPP or switch to using regular read+write operations using block layer helpers.
Or the block layer can fall back to that emulation itself if it gets an ENOTSUPP from the device. I am not sure what a REQ_OP_COPY_SRC BIO definition would look like. Ideally, we want to be able to describe several source LBA ranges with it and, for the above issue, also carry the destination LBA range. If we can do that in a nice way, I do not see the need for switching back to a single BIO, though we could too I guess. From what Martin said for scsi token-based copy, it seems that 2 operations is easier. Knowing how the scsi stack works, I can see that too.
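The check suggested above (a stacking driver that sees both the source and the destination range can tell whether a copy stays within one underlying device) could be sketched as follows. LinearTable and plan_copy are hypothetical names and only model a dm-linear-style concatenation; the decision strings stand in for "pass the copy down" and "fall back to read+write emulation".

```python
from bisect import bisect_right


class LinearTable:
    """Toy concatenation of devices, each covering a contiguous LBA span."""

    def __init__(self, segments):
        # segments: list of (device name, number of blocks)
        self.names = [name for name, _ in segments]
        self.starts = []
        total = 0
        for _, nr in segments:
            self.starts.append(total)
            total += nr
        self.total = total

    def device_for(self, lba, nr):
        """Return the device holding [lba, lba + nr), or None if it spans two."""
        if nr <= 0 or lba < 0 or lba + nr > self.total:
            raise ValueError("range outside the mapped device")
        first = bisect_right(self.starts, lba) - 1
        last = bisect_right(self.starts, lba + nr - 1) - 1
        return self.names[first] if first == last else None


def plan_copy(table, src, dst, nr):
    src_dev = table.device_for(src, nr)
    dst_dev = table.device_for(dst, nr)
    if src_dev is not None and src_dev == dst_dev:
        return "offload"
    return "emulate"
```

With an NVMe segment covering LBA 0..999 and a SCSI segment covering LBA 1000..1999, a copy from LBA 1 to LBA 1001 maps to two different devices and is planned as an emulated copy.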
On Mon, Dec 09, 2024 at 02:13:40PM -0800, Bart Van Assche wrote: > On 12/5/24 12:03 AM, Nitesh Shetty wrote: > > But where do we store the read sector info before sending write. > > I see 2 approaches here, > > 1. Should it be part of a payload along with write ? > > We did something similar in previous series which was not liked > > by Christoph and Bart. > > 2. Or driver should store it as part of an internal list inside > > namespace/ctrl data structure ? > > As Bart pointed out, here we might need to send one more fail > > request later if copy_write fails to land in same driver. > > Hi Nitesh, > > Consider the following example: dm-linear is used to concatenate two > block devices. An NVMe device (LBA 0..999) and a SCSI device (LBA > 1000..1999). Suppose that a copy operation is submitted to the dm-linear > device to copy LBAs 1..998 to LBAs 2..1998. If the copy operation is Sorry, I don't think that's a valid operation -- 1998 - 2 = 1996 and 998 - 1 is 997, so these ranges are of different lengths. I presume you're trying to construct an operation which is entirely reading within the first device, and then is going to write across both devices. So let's say you want to read 1-900 and write to 501-1400. > submitted as two separate operations (REQ_OP_COPY_SRC and > REQ_OP_COPY_DST) then the NVMe device will receive the REQ_OP_COPY_SRC > operation and the SCSI device will receive the REQ_OP_COPY_DST > operation. The NVMe and SCSI device drivers should fail the copy operations > after a timeout because they only received half of the copy > operation. ... no? The SRC operation succeeds, but then the DM driver gets the DST operation and sees that it crosses the boundary and fails the DST op. Then the pair of ops can be retried using an in-memory buffer. I'm not quite clear on the atomicity; whether there could be an initial copy of 500-900 to 1000-1400 and then a remap of 1-499 to 501-999.
On 12/9/24 3:31 PM, Matthew Wilcox wrote: > On Mon, Dec 09, 2024 at 02:13:40PM -0800, Bart Van Assche wrote: >> Consider the following example: dm-linear is used to concatenate two >> block devices. An NVMe device (LBA 0..999) and a SCSI device (LBA >> 1000..1999). Suppose that a copy operation is submitted to the dm-linear >> device to copy LBAs 1..998 to LBAs 2..1998. If the copy operation is > > Sorry, I don't think that's a valid operation -- 1998 - 2 = 1996 and 998 > - 1 is 997, so these ranges are of different lengths. Agreed that the ranges should have the same length. I have been traveling and I'm under jet lag, hence the range length mismatch. I wanted to construct a copy operation from the first to the second block device: 1..998 to 1001..1998. >> submitted as two separate operations (REQ_OP_COPY_SRC and >> REQ_OP_COPY_DST) then the NVMe device will receive the REQ_OP_COPY_SRC >> operation and the SCSI device will receive the REQ_OP_COPY_DST >> operation. The NVMe and SCSI device drivers should fail the copy operations >> after a timeout because they only received half of the copy >> operation. > > ... no? The SRC operation succeeds, but then the DM driver gets the DST > operation and sees that it crosses the boundary and fails the DST op. > Then the pair of ops can be retried using an in-memory buffer. Since the second range can be mapped onto the second block device, the dm-linear driver can only fail the REQ_OP_COPY_DST operation if it keeps track of the source LBA regions of pending copy operations. Which would be an unnecessary complexity. A possible alternative is to specify the source and destination range information in every REQ_OP_COPY_SRC and in every REQ_OP_COPY_DST operation (see also Damien's email). Thanks, Bart.
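The length mismatch discussed above is easy to check mechanically. The sketch below assumes that a range written as "a..b" is inclusive at both ends, so it counts 998 and 1997 blocks where the earlier message, subtracting endpoints, arrives at 997 and 1996; the two ranges differ either way. block_count and same_length are illustrative names.

```python
def block_count(first_lba: int, last_lba: int) -> int:
    """Number of blocks in the inclusive range first_lba..last_lba."""
    if last_lba < first_lba:
        raise ValueError("range ends before it starts")
    return last_lba - first_lba + 1


def same_length(src, dst) -> bool:
    """src and dst are (first_lba, last_lba) tuples."""
    return block_count(*src) == block_count(*dst)
```

The ranges 1..900 to 501..1400 and 1..998 to 1001..1998 both pass this check; 1..998 to 2..1998 does not.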
On 12/5/24 12:37 PM, Martin K. Petersen wrote: > For token based offload, I really don't understand the objection to > storing the cookie in the bio. I fail to see the benefit of storing the > cookie in the driver and then have the bio refer to something else which > maps to the actual cookie returned by the storage. Does "cookie" refer to the SCSI ROD token? Storing the ROD token in the REQ_OP_COPY_DST bio implies that the REQ_OP_COPY_DST bio is only submitted after the REQ_OP_COPY_SRC bio has completed. NVMe users may prefer that REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios are submitted simultaneously. Thanks, Bart.
Bart, > Does "cookie" refer to the SCSI ROD token? Storing the ROD token in > the REQ_OP_COPY_DST bio implies that the REQ_OP_COPY_DST bio is only > submitted after the REQ_OP_COPY_SRC bio has completed. Obviously. You can't issue a WRITE USING TOKEN until you have the token. > NVMe users may prefer that REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios > are submitted simultaneously. What would be the benefit of submitting these operations concurrently? As I have explained, it adds substantial complexity and object lifetime issues throughout the stack. To what end?
On Thu, Dec 05, 2024 at 03:37:25PM -0500, Martin K. Petersen wrote: > The problem with option 2 is that when you're doing copy between two > different LUNs, then you suddenly have to maintain state in one kernel > object about stuff relating to another kernel object. I think that is > messy. Seems unnecessarily complex. Generally agreeing with all you said, but do we actually have any serious use case for cross-LU copies? They just seem incredibly complex and not all that useful.
On 10.12.24 08:13, Christoph Hellwig wrote: > On Thu, Dec 05, 2024 at 03:37:25PM -0500, Martin K. Petersen wrote: >> The problem with option 2 is that when you're doing copy between two >> different LUNs, then you suddenly have to maintain state in one kernel >> object about stuff relating to another kernel object. I think that is >> messy. Seems unnecessarily complex. > > Generally agreeing with all you said, but do we actually have any > serious use case for cross-LU copies? They just seem incredibly > complex and not all that useful. One use case I can think of is (again) btrfs balance (GC, convert, etc) on a multi drive filesystem. BUT this use case is something that can just use the fallback read-write path as it is doing now.
On 09/12/24 09:20PM, Martin K. Petersen wrote: > >Bart, > >> Does "cookie" refer to the SCSI ROD token? Storing the ROD token in >> the REQ_OP_COPY_DST bio implies that the REQ_OP_COPY_DST bio is only >> submitted after the REQ_OP_COPY_SRC bio has completed. > >Obviously. You can't issue a WRITE USING TOKEN until you have the token. > >> NVMe users may prefer that REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios >> are submitted simultaneously. > >What would be the benefit of submitting these operations concurrently? >As I have explained, it adds substantial complexity and object lifetime >issues throughout the stack. To what end? > Bart, We did implement a payload-based approach in the past [1], which aligns with this. Since we wait until the REQ_OP_COPY_SRC completes, there won't be an issue with asynchronous dm I/Os. Since this would be internal kernel plumbing, we can optimize or change the approach going forward. If you are okay with the approach, I can respin that version. Thanks, Nitesh Shetty [1] https://lore.kernel.org/linux-block/20230605121732.28468-1-nj.shetty@samsung.com/T/#mecd04c060cd4285a4b036ca79cc58713308771fe
On Tue, Dec 10, 2024 at 08:05:31AM +0000, Johannes Thumshirn wrote: > > Generally agreeing with all you said, but do we actually have any > > serious use case for cross-LU copies? They just seem incredibly > > complex and not all that useful. > > One use case I can think of is (again) btrfs balance (GC, convert, etc) > on a multi drive filesystem. BUT this use case is something that can > just use the fallback read-write path as it is doing now. Who uses multi-device file systems on multiple LUs of the same SCSI target or multiple namespaces on the same nvme subsystem?
On 12/10/24 2:58 AM, hch wrote: > On Tue, Dec 10, 2024 at 08:05:31AM +0000, Johannes Thumshirn wrote: >>> Generally agreeing with all you said, but do we actually have any >>> serious use case for cross-LU copies? They just seem incredibly >>> complex and not all that useful. >> >> One use case I can think of is (again) btrfs balance (GC, convert, etc) >> on a multi drive filesystem. BUT this use case is something that can >> just use the fallback read-write path as it is doing now. > > Who uses multi-device file systems on multiple LUs of the same SCSI > target or multiple namespaces on the same nvme subsystem? On Android systems F2FS combines a small conventional logical unit and a large zoned logical unit into a single filesystem. This use case will benefit from copy offloading between different logical units on the same SCSI device. While there may be disagreement about how desirable this setup is from a technical point of view, there is a real use case today for offloading data copying between different logical units. Bart.
On 12/9/24 6:20 PM, Martin K. Petersen wrote:
> What would be the benefit of submitting these operations concurrently?

I expect that submitting the two copy operations concurrently would
result in lower latency for NVMe devices because the REQ_OP_COPY_DST
operation can be submitted without waiting for the REQ_OP_COPY_SRC
result.

> As I have explained, it adds substantial complexity and object lifetime
> issues throughout the stack. To what end?

I think the approach of embedding the ROD token in the bio payload would
add complexity in the block layer. The token-based copy offload approach
involves submitting at least the following commands to the SCSI device:
* POPULATE TOKEN with a list identifier and source data ranges as
  parameters to send the source data ranges to the device.
* RECEIVE ROD TOKEN INFORMATION with a list identifier as parameter to
  receive the ROD token.
* WRITE USING TOKEN with the ROD token and the destination ranges as
  parameters to tell the device to start the copy operation.

If the block layer would have to manage the ROD token, how would the ROD
token be provided to the block layer? Bidirectional commands have been
removed from the Linux kernel a while ago so the REQ_OP_COPY_IN
parameter data would have to be used to pass parameters to the SCSI
driver and also to pass the ROD token back to the block layer. A
possible approach is to let the SCSI core allocate memory for the ROD
token with kmalloc and to pass that pointer back to the block layer
by writing that pointer into the REQ_OP_COPY_IN parameter data. While
this can be implemented, I'm not sure that we should integrate support
in the block layer for managing ROD tokens since ROD tokens are a
concept that is specific to the SCSI protocol.

Thanks,

Bart.
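The ordering constraint in the three-command list above can be shown with a toy model. This is only a sketch in Python, not kernel code: `FakeScsiTarget`, its method names and the 16-byte token are hypothetical stand-ins for a real device, and the token simply holds a copy of the source blocks.

```python
import os


class FakeScsiTarget:
    """Toy in-memory stand-in for a SCSI device (hypothetical, not a real API)."""

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self._pending = {}  # list identifier -> token created by POPULATE TOKEN
        self._tokens = {}   # token -> copy of the source blocks (toy behaviour)

    def populate_token(self, list_id, src_ranges):
        # Command 1: hand the source ranges to the device under a list identifier.
        data = [self.blocks[lba]
                for start, count in src_ranges
                for lba in range(start, start + count)]
        token = os.urandom(16)  # opaque to the host; size is arbitrary in this sketch
        self._tokens[token] = data
        self._pending[list_id] = token

    def receive_rod_token_information(self, list_id):
        # Command 2: fetch the token created for this list identifier.
        return self._pending.pop(list_id)

    def write_using_token(self, token, dst_ranges):
        # Command 3: the device writes the tokenised data to the destination ranges.
        data = iter(self._tokens[token])
        for start, count in dst_ranges:
            for lba in range(start, start + count):
                self.blocks[lba] = next(data)


def offload_copy(dev, list_id, src_ranges, dst_ranges):
    """Issue the three commands strictly in order; each needs the previous result."""
    dev.populate_token(list_id, src_ranges)
    token = dev.receive_rod_token_information(list_id)
    dev.write_using_token(token, dst_ranges)
```

In this model the host never interprets the token. It only carries it from the second command to the third, which is why the write cannot be issued before the token has been received.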
On 12/10/24 1:53 AM, Nitesh Shetty wrote:
> We did implement payload based approach in the past[1] which aligns
> with this. Since we wait till the REQ_OP_COPY_SRC completes, there won't
> be issue with async type of dm IOs.
> Since this would be an internal kernel plumbing, we can optimize/change
> the approach moving forward.
> If you are okay with the approach, I can give a respin to that version.

Yes, I remember this. Let's wait with respinning/reposting until there
is agreement about the approach for copy offloading.

Thanks,

Bart.
On 12/11/24 4:21 AM, Bart Van Assche wrote:
> On 12/10/24 2:58 AM, hch wrote:
>> On Tue, Dec 10, 2024 at 08:05:31AM +0000, Johannes Thumshirn wrote:
>>>> Generally agreeing with all you said, but do we actually have any
>>>> serious use case for cross-LU copies? They just seem incredibly
>>>> complex and not all that useful.
>>>
>>> One use case I can think of is (again) btrfs balance (GC, convert, etc)
>>> on a multi drive filesystem. BUT this use case is something that can
>>> just use the fallback read-write path as it is doing now.
>>
>> Who uses multi-device file systems on multiple LUs of the same SCSI
>> target or multiple namespaces on the same nvme subsystem?
>
> On Android systems F2FS combines a small conventional logical unit and a
> large zoned logical unit into a single filesystem. This use case will
> benefit from copy offloading between different logical units on the same
> SCSI device. While there may be disagreement about how desirable this
> setup is from a technical point of view, there is a real use case today
> for offloading data copying between different logical units.

But for F2FS, the conventional unit is used for metadata and the other zoned LU
for data. How come copying from one to the other can be useful ?

>
> Bart.
>
On 10/12/24 11:41AM, Bart Van Assche wrote:
>On 12/9/24 6:20 PM, Martin K. Petersen wrote:
>>What would be the benefit of submitting these operations concurrently?
>
>I expect that submitting the two copy operations concurrently would
>result in lower latency for NVMe devices because the REQ_OP_COPY_DST
>operation can be submitted without waiting for the REQ_OP_COPY_SRC
>result.
>
>>As I have explained, it adds substantial complexity and object lifetime
>>issues throughout the stack. To what end?
>
>I think the approach of embedding the ROD token in the bio payload would
>add complexity in the block layer. The token-based copy offload approach
>involves submitting at least the following commands to the SCSI device:
>* POPULATE TOKEN with a list identifier and source data ranges as
>  parameters to send the source data ranges to the device.
>* RECEIVE ROD TOKEN INFORMATION with a list identifier as parameter to
>  receive the ROD token.
>* WRITE USING TOKEN with the ROD token and the destination ranges as
>  parameters to tell the device to start the copy operation.
>
>If the block layer would have to manage the ROD token, how would the ROD
>token be provided to the block layer? Bidirectional commands have been
>removed from the Linux kernel a while ago so the REQ_OP_COPY_IN
>parameter data would have to be used to pass parameters to the SCSI
>driver and also to pass the ROD token back to the block layer. A
>possible approach is to let the SCSI core allocate memory for the ROD
>token with kmalloc and to pass that pointer back to the block layer
>by writing that pointer into the REQ_OP_COPY_IN parameter data. While
>this can be implemented, I'm not sure that we should integrate support
>in the block layer for managing ROD tokens since ROD tokens are a
>concept that is specific to the SCSI protocol.
>
Block layer can allocate a buffer and send this as part of copy
operation. Driver can store token/custom info inside the buffer sent
along with REQ_OP_COPY_SRC and expect that block layer sends back this
info/buffer again in REQ_OP_COPY_DST?
This will reduce the effort for block layer to manage the lifetime issues.
Is there any reason why we can't store the info inside this buffer in
the driver?

This scheme will require sequential submission of SRC and DST
bio's. This might increase in latency, but allows to have simpler design.
Main use case for copy is GC, which is mostly a background operation.

--
Nitesh Shetty
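The buffer round trip proposed above might look roughly like the following toy model. It is a sketch under stated assumptions, not the posted patches: `CopyBio`, `ToyDriver`, `copy_range`, the `"<QI"` payload layout and the 512-byte default buffer size are all hypothetical, and the "token" here is just the encoded source range.

```python
import struct
from dataclasses import dataclass

# Hypothetical payload layout chosen by the driver: source sector, sector count.
PAYLOAD_FMT = "<QI"


@dataclass
class CopyBio:
    op: str             # "COPY_SRC" or "COPY_DST", stand-ins for the REQ_OP_* values
    sector: int
    nr_sectors: int
    payload: bytearray  # allocated by the issuer, contents opaque to the issuer


class ToyDriver:
    """Keeps no per-copy state; everything it needs travels in the payload."""

    def __init__(self, nr_sectors):
        self.media = list(range(nr_sectors))

    def submit(self, bio):
        if bio.op == "COPY_SRC":
            # Record the driver-private info in the caller's buffer.
            struct.pack_into(PAYLOAD_FMT, bio.payload, 0, bio.sector, bio.nr_sectors)
        elif bio.op == "COPY_DST":
            # Read back what COPY_SRC stored and perform the copy.
            src, count = struct.unpack_from(PAYLOAD_FMT, bio.payload, 0)
            if count != bio.nr_sectors:
                raise ValueError("source and destination lengths differ")
            self.media[bio.sector:bio.sector + count] = self.media[src:src + count]
        else:
            raise ValueError(f"unknown op {bio.op!r}")


def copy_range(driver, src, dst, nr_sectors, payload_size=512):
    """Submit SRC, then DST, sequentially, handing the same buffer to both."""
    payload = bytearray(payload_size)
    driver.submit(CopyBio("COPY_SRC", src, nr_sectors, payload))
    driver.submit(CopyBio("COPY_DST", dst, nr_sectors, payload))
```

The buffer lives only as long as `copy_range` does, so in this model there is nothing for the driver to free if the copy is abandoned between the two submissions.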
On 12/11/24 1:36 AM, Nitesh Shetty wrote:
> Block layer can allocate a buffer and send this as part of copy
> operation.

The block layer can only do that if it knows how large the buffer should
be. Or in other words, if knowledge of a SCSI buffer size is embedded in
the block layer. That doesn't sound ideal to me.

> This scheme will require sequential submission of SRC and DST
> bio's. This might increase in latency, but allows to have simpler design.
> Main use case for copy is GC, which is mostly a background operation.

I still prefer a single REQ_OP_COPY operation instead of separate
REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations. While this will require
additional work in the SCSI disk (sd) driver (implementation of a state
machine), it prevents that any details about the SCSI copy offloading
approach have to be known by the block layer. Even if copy offloading
would be implemented as two operations (REQ_OP_COPY_SRC and
REQ_OP_COPY_DST), a state machine is required anyway in the SCSI disk
driver because REQ_OP_COPY_SRC would have to be translated into two SCSI
commands (POPULATE TOKEN + RECEIVE ROD TOKEN INFORMATION).

Thanks,

Bart.
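A minimal sketch of the kind of state machine mentioned above, written as a Python toy model. The state names mirror the SCSI commands named in the thread; `CopyState`, `on_command_complete` and `states_for` are hypothetical names and do not come from the sd driver.

```python
from enum import Enum, auto


class CopyState(Enum):
    POPULATE_TOKEN = auto()
    RECEIVE_ROD_TOKEN_INFORMATION = auto()
    WRITE_USING_TOKEN = auto()
    DONE = auto()
    FAILED = auto()


# Which command follows when the current one completes successfully.
_NEXT = {
    CopyState.POPULATE_TOKEN: CopyState.RECEIVE_ROD_TOKEN_INFORMATION,
    CopyState.RECEIVE_ROD_TOKEN_INFORMATION: CopyState.WRITE_USING_TOKEN,
    CopyState.WRITE_USING_TOKEN: CopyState.DONE,
}


def on_command_complete(state, ok):
    """Advance by one step when the command for `state` completes."""
    if state in (CopyState.DONE, CopyState.FAILED):
        raise ValueError("no command outstanding")
    return _NEXT[state] if ok else CopyState.FAILED


def states_for(op):
    """SCSI commands a block-layer operation would expand to in this model."""
    table = {
        "COPY": [CopyState.POPULATE_TOKEN,
                 CopyState.RECEIVE_ROD_TOKEN_INFORMATION,
                 CopyState.WRITE_USING_TOKEN],
        "COPY_SRC": [CopyState.POPULATE_TOKEN,
                     CopyState.RECEIVE_ROD_TOKEN_INFORMATION],
        "COPY_DST": [CopyState.WRITE_USING_TOKEN],
    }
    return table[op]
```

In this model `states_for("COPY_SRC")` already spans two commands, so the driver has to track progress between completions whether the block layer uses one operation or two.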
Bart,

>> What would be the benefit of submitting these operations concurrently?
>
> I expect that submitting the two copy operations concurrently would
> result in lower latency for NVMe devices because the REQ_OP_COPY_DST
> operation can be submitted without waiting for the REQ_OP_COPY_SRC
> result.

Perhaps you are engaging in premature optimization?

> If the block layer would have to manage the ROD token, how would the
> ROD token be provided to the block layer?

In the data buffer described by the bio, of course. Just like the data
buffer when we do a READ. Only difference here is that the data is
compressed to a fixed size and thus only 512 bytes long regardless of
the amount of logical blocks described by the operation.

> Bidirectional commands have been removed from the Linux kernel a while
> ago so the REQ_OP_COPY_IN parameter data would have to be used to pass
> parameters to the SCSI driver and also to pass the ROD token back to
> the block layer.

A normal READ operation also passes parameters to the SCSI driver. These
are the start LBA and the transfer length. That does not make it a
bidirectional command.

> While this can be implemented, I'm not sure that we should integrate
> support in the block layer for managing ROD tokens since ROD tokens
> are a concept that is specific to the SCSI protocol.

A well-known commercial operating system supports copy offload via the
token-based approach. I don't see any reason why our implementation
should exclude a wide variety of devices in the industry supported by
that platform. And obviously, given that this other operating system
uses a token-based implementation in their stack, one could perhaps
envision this capability appearing in other protocols in the future?

In any case. I only have two horses in this race:

 1. Make sure that our user API and block layer implementation are
    flexible enough to accommodate current and future offload
    specifications.

 2. Make sure our implementation is as simple as possible.

Splitting the block layer implementation into a semantic read followed
by a semantic write permits token-based offload to be supported. It also
makes the implementation simple because there is no concurrency element.
The only state is owned by the entity which issues the bio. No lookups,
no timeouts, no allocating things in sd.c and hoping that somebody
remembers to free them later despite the disk suddenly going away.

Even if we were to not support the token-based approach and only do
single-command offload, I still think the two-phase operation makes
things simpler and more elegant.
Christoph,

> Generally agreeing with all you said, but do we actually have any
> serious use case for cross-LU copies? They just seem incredibly
> complex and not all that useful.

It's still widely used to populate a new LUN from a golden image.
On 12/10/24 8:07 PM, Damien Le Moal wrote:
> But for F2FS, the conventional unit is used for metadata and the other zoned LU
> for data. How come copying from one to the other can be useful ?

Hi Damien,

What you wrote is correct in general. If a conventional and zoned LU are
combined, data is only written to the conventional LU once the zoned LU
is full. The data on the conventional LU may be migrated to the zoned LU
during garbage collection. This is why copying from the conventional LU
to the zoned LU is useful.

Jaegeuk, please correct me if I got this wrong.

Bart.
From: Keith Busch <kbusch@kernel.org>

Changes from v9:

  Document the partition hint mask

  Use bitmap_alloc API

  Fixup bitmap memory leak

  Return invalid value if user requests an invalid write hint

  Added and exported a block device feature flag for indicating generic
  placement hint support

  Added statx write hint max field

  Added BUILD_BUG_ON check for new io_uring SQE fields.

  Added reviews

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (7):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  statx: add write hint information
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  block: export placement hint feature
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block | 13 +++++
 block/bdev.c                         | 18 ++++++
 block/blk-settings.c                 |  5 ++
 block/blk-sysfs.c                    |  6 ++
 block/fops.c                         | 31 +++++++++-
 block/partitions/core.c              | 44 ++++++++++++++-
 drivers/nvme/host/core.c             | 84 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 fs/stat.c                            |  1 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 15 +++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 19 +++++++
 include/linux/stat.h                 |  1 +
 include/uapi/linux/io_uring.h        |  4 ++
 include/uapi/linux/stat.h            |  3 +-
 io_uring/io_uring.c                  |  2 +
 io_uring/rw.c                        |  3 +-
 20 files changed, 253 insertions(+), 11 deletions(-)
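The changelog mentions a partition hint mask and returning an invalid value for an invalid write hint, but it does not spell out the exact semantics. The toy check below is therefore only a guess at the general shape, written in Python rather than kernel C: `check_write_hint`, the meaning of hint 0, and the mapping of hint N to bit N-1 are all assumptions made for illustration.

```python
import errno


def check_write_hint(hint, max_write_hints, partition_mask):
    """Return 0 if the hint may be used on this partition, else -EINVAL.

    Assumptions for this sketch only: hint 0 means "no hint" and is always
    accepted; hints 1..max_write_hints map to bits 0..max_write_hints-1 of
    the partition's mask.
    """
    if hint == 0:
        return 0
    if hint < 0 or hint > max_write_hints:
        # Outside what the device advertises.
        return -errno.EINVAL
    if not (partition_mask >> (hint - 1)) & 1:
        # Supported by the device but not granted to this partition.
        return -errno.EINVAL
    return 0
```

For example, with four hints on the device and a partition mask of `0b0010`, only hint 2 (and the "no hint" value 0) would be accepted under these assumptions.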