Message ID: 20210817101423.12367-4-selvakuma.s1@samsung.com (mailing list archive)
State:      New, archived
Series:     [1/7] block: make bio_map_kern() non static
On 8/17/21 3:14 AM, SelvaKumar S wrote:
> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> bio with control information as payload and submit to the device.
> Larger copy operation may be divided if necessary by looking at device
> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
> submitted to zoned device.
> Native copy offload is not supported for stacked devices.

Using a single operation for copy-offloading instead of separate operations
for reading and writing is fundamentally incompatible with the device mapper.
I think we need a copy-offloading implementation that is compatible with the
device mapper.

Storing the parameters of the copy operation in the bio payload is
incompatible with the current implementation of bio_split().

In other words, I think there are fundamental problems with this patch series.

Bart.
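[To make the bio_split() concern concrete, a rough sketch (illustrative only, not
taken from the series) of what the generic splitting path would do to a
REQ_OP_COPY bio whose pages carry a struct blk_copy_payload rather than data:]

	/*
	 * bio_split() divides a bio purely by a sector count: the front
	 * 'sectors' go into the returned clone and the parent's bi_iter is
	 * advanced past them.  That only makes sense when the bio's pages
	 * are the data actually being transferred.
	 */
	struct bio *front = bio_split(bio, sectors, GFP_NOIO, &fs_bio_set);

	/*
	 * For the proposed REQ_OP_COPY the mapped pages hold the control
	 * payload (source ranges and counts), not data.  After the call
	 * above both 'front' and 'bio' still reference the same
	 * struct blk_copy_payload, and neither bi_iter says which entries
	 * of payload->range[] belong to which half - a sector-based split
	 * point has no defined meaning for a list of source ranges, so a
	 * stacking driver cannot use the generic splitting machinery here.
	 */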
Hi SelvaKumar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on block/for-next]
[also build test WARNING on dm/for-next linus/master v5.14-rc6 next-20210817]
[cannot apply to linux-nvme/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/SelvaKumar-S/block-make-bio_map_kern-non-static/20210817-193111
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: hexagon-randconfig-r013-20210816 (attached as .config)
compiler: clang version 12.0.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/35fc502a7f20a7cd42432cee2777a621c40a3bd3
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review SelvaKumar-S/block-make-bio_map_kern-non-static/20210817-193111
        git checkout 35fc502a7f20a7cd42432cee2777a621c40a3bd3
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=hexagon

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> block/blk-lib.c:197:5: warning: no previous prototype for function 'blk_copy_offload_submit_bio' [-Wmissing-prototypes]
   int blk_copy_offload_submit_bio(struct block_device *bdev,
       ^
   block/blk-lib.c:197:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int blk_copy_offload_submit_bio(struct block_device *bdev,
   ^
   static
>> block/blk-lib.c:250:5: warning: no previous prototype for function 'blk_copy_offload_scc' [-Wmissing-prototypes]
   int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
       ^
   block/blk-lib.c:250:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
   ^
   static
   2 warnings generated.

vim +/blk_copy_offload_submit_bio +197 block/blk-lib.c

   196	
 > 197	int blk_copy_offload_submit_bio(struct block_device *bdev,
   198			struct blk_copy_payload *payload, int payload_size,
   199			struct cio *cio, gfp_t gfp_mask)
   200	{
   201		struct request_queue *q = bdev_get_queue(bdev);
   202		struct bio *bio;
   203	
   204		bio = bio_map_kern(q, payload, payload_size, gfp_mask);
   205		if (IS_ERR(bio))
   206			return PTR_ERR(bio);
   207	
   208		bio_set_dev(bio, bdev);
   209		bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
   210		bio->bi_iter.bi_sector = payload->dest;
   211		bio->bi_end_io = cio_bio_end_io;
   212		bio->bi_private = cio;
   213		atomic_inc(&cio->refcount);
   214		submit_bio(bio);
   215	
   216		return 0;
   217	}
   218	
   219	/* Go through all the enrties inside user provided payload, and determine the
   220	 * maximum number of entries in a payload, based on device's scc-limits.
   221	 */
   222	static inline int blk_max_payload_entries(int nr_srcs, struct range_entry *rlist,
   223			int max_nr_srcs, sector_t max_copy_range_sectors, sector_t max_copy_len)
   224	{
   225		sector_t range_len, copy_len = 0, remaining = 0;
   226		int ri = 0, pi = 1, max_pi = 0;
   227	
   228		for (ri = 0; ri < nr_srcs; ri++) {
   229			for (remaining = rlist[ri].len; remaining > 0; remaining -= range_len) {
   230				range_len = min3(remaining, max_copy_range_sectors,
   231						max_copy_len - copy_len);
   232				pi++;
   233				copy_len += range_len;
   234	
   235				if ((pi == max_nr_srcs) || (copy_len == max_copy_len)) {
   236					max_pi = max(max_pi, pi);
   237					pi = 1;
   238					copy_len = 0;
   239				}
   240			}
   241		}
   242	
   243		return max(max_pi, pi);
   244	}
   245	
   246	/*
   247	 * blk_copy_offload_scc - Use device's native copy offload feature
   248	 * Go through user provide payload, prepare new payload based on device's copy offload limits.
   249	 */
 > 250	int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
   251			struct range_entry *rlist, struct block_device *dest_bdev,
   252			sector_t dest, gfp_t gfp_mask)
   253	{
   254		struct request_queue *q = bdev_get_queue(dest_bdev);
   255		struct cio *cio = NULL;
   256		struct blk_copy_payload *payload;
   257		sector_t range_len, copy_len = 0, remaining = 0;
   258		sector_t src_blk, cdest = dest;
   259		sector_t max_copy_range_sectors, max_copy_len;
   260		int ri = 0, pi = 0, ret = 0, payload_size, max_pi, max_nr_srcs;
   261	
   262		cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
   263		if (!cio)
   264			return -ENOMEM;
   265		atomic_set(&cio->refcount, 0);
   266	
   267		max_nr_srcs = q->limits.max_copy_nr_ranges;
   268		max_copy_range_sectors = q->limits.max_copy_range_sectors;
   269		max_copy_len = q->limits.max_copy_sectors;
   270	
   271		max_pi = blk_max_payload_entries(nr_srcs, rlist, max_nr_srcs,
   272				max_copy_range_sectors, max_copy_len);
   273		payload_size = struct_size(payload, range, max_pi);
   274	
   275		payload = kvmalloc(payload_size, gfp_mask);
   276		if (!payload) {
   277			ret = -ENOMEM;
   278			goto free_cio;
   279		}
   280		payload->src_bdev = src_bdev;
   281	
   282		for (ri = 0; ri < nr_srcs; ri++) {
   283			for (remaining = rlist[ri].len, src_blk = rlist[ri].src; remaining > 0;
   284					remaining -= range_len, src_blk += range_len) {
   285	
   286				range_len = min3(remaining, max_copy_range_sectors,
   287						max_copy_len - copy_len);
   288				payload->range[pi].len = range_len;
   289				payload->range[pi].src = src_blk;
   290				pi++;
   291				copy_len += range_len;
   292	
   293				/* Submit current payload, if crossing device copy limits */
   294				if ((pi == max_nr_srcs) || (copy_len == max_copy_len)) {
   295					payload->dest = cdest;
   296					payload->copy_nr_ranges = pi;
   297					ret = blk_copy_offload_submit_bio(dest_bdev, payload,
   298							payload_size, cio, gfp_mask);
   299					if (ret)
   300						goto free_payload;
   301	
   302					/* reset index, length and allocate new payload */
   303					pi = 0;
   304					cdest += copy_len;
   305					copy_len = 0;
   306					payload = kvmalloc(payload_size, gfp_mask);
   307					if (!payload) {
   308						ret = -ENOMEM;
   309						goto free_cio;
   310					}
   311					payload->src_bdev = src_bdev;
   312				}
   313			}
   314		}
   315	
   316		if (pi) {
   317			payload->dest = cdest;
   318			payload->copy_nr_ranges = pi;
   319			ret = blk_copy_offload_submit_bio(dest_bdev, payload, payload_size, cio, gfp_mask);
   320			if (ret)
   321				goto free_payload;
   322		}
   323	
   324		/* Wait for completion of all IO's*/
   325		ret = cio_await_completion(cio);
   326	
   327		return ret;
   328	
   329	free_payload:
   330		kvfree(payload);
   331	free_cio:
   332		cio_await_completion(cio);
   333		return ret;
   334	}
   335	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
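[Both flagged helpers are only called from within block/blk-lib.c in this
revision, so the warning can be silenced by giving them internal linkage; a
sketch of the likely fixup (alternatively a prototype could go into a shared
header such as block/blk.h if other callers are planned):]

-int blk_copy_offload_submit_bio(struct block_device *bdev,
+static int blk_copy_offload_submit_bio(struct block_device *bdev,
 		struct blk_copy_payload *payload, int payload_size,
 		struct cio *cio, gfp_t gfp_mask)

-int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
+static int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
 		struct range_entry *rlist, struct block_device *dest_bdev,
 		sector_t dest, gfp_t gfp_mask)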
On Tue, 17 Aug 2021, Bart Van Assche wrote:

> On 8/17/21 3:14 AM, SelvaKumar S wrote:
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > Larger copy operation may be divided if necessary by looking at device
> > limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
> > submitted to zoned device.
> > Native copy offload is not supported for stacked devices.
>
> Using a single operation for copy-offloading instead of separate operations
> for reading and writing is fundamentally incompatible with the device mapper.
> I think we need a copy-offloading implementation that is compatible with the
> device mapper.

I once wrote a copy offload implementation that is compatible with device
mapper. The copy operation creates two bios (one for reading and one for
writing), passes them independently through device mapper and pairs them
at the physical device driver.

It's here: http://people.redhat.com/~mpatocka/patches/kernel/xcopy/current

I verified that it works with iSCSI. Would you be interested in continuing
this work?

Mikulas

> Storing the parameters of the copy operation in the bio payload is
> incompatible with the current implementation of bio_split().
>
> In other words, I think there are fundamental problems with this patch series.
>
> Bart.
On 2021-08-17 4:41 p.m., Mikulas Patocka wrote:
>
> On Tue, 17 Aug 2021, Bart Van Assche wrote:
>
>> On 8/17/21 3:14 AM, SelvaKumar S wrote:
>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>> bio with control information as payload and submit to the device.
>>> Larger copy operation may be divided if necessary by looking at device
>>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
>>> submitted to zoned device.
>>> Native copy offload is not supported for stacked devices.
>>
>> Using a single operation for copy-offloading instead of separate operations
>> for reading and writing is fundamentally incompatible with the device mapper.
>> I think we need a copy-offloading implementation that is compatible with the
>> device mapper.
>
> I once wrote a copy offload implementation that is compatible with device
> mapper. The copy operation creates two bios (one for reading and one for
> writing), passes them independently through device mapper and pairs them
> at the physical device driver.
>
> It's here: http://people.redhat.com/~mpatocka/patches/kernel/xcopy/current

In my copy solution the read-side and write-side bio pairs share the same
storage (i.e. ram). This gets around the need to copy data between the bio_s.
See:
    https://sg.danny.cz/sg/sg_v40.html
in Section 8 on Request sharing. This technique can be efficiently extended
to source --> destination1,destination2,... copies.

Doug Gilbert

> I verified that it works with iSCSI. Would you be interested in continuing
> this work?
>
> Mikulas
>
>> Storing the parameters of the copy operation in the bio payload is
>> incompatible with the current implementation of bio_split().
>>
>> In other words, I think there are fundamental problems with this patch series.
>>
>> Bart.
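[A minimal sketch of the bio-pair idea discussed above (purely illustrative,
not taken from either patch set; pre-5.18 bio_alloc()/bio_add_page() calling
conventions and the pages[]/src_bdev/dst_bdev variables are assumed): the
read-side and write-side bios reference the same pages, so no data is copied
between them, and whichever driver receives both can pair them up:]

	/* Allocate the two bios and attach the same page set to both. */
	struct bio *rbio = bio_alloc(GFP_KERNEL, nr_pages);
	struct bio *wbio = bio_alloc(GFP_KERNEL, nr_pages);
	int i;

	for (i = 0; i < nr_pages; i++) {
		bio_add_page(rbio, pages[i], PAGE_SIZE, 0);
		bio_add_page(wbio, pages[i], PAGE_SIZE, 0); /* same pages, no memcpy */
	}

	bio_set_dev(rbio, src_bdev);
	rbio->bi_opf = REQ_OP_READ;
	rbio->bi_iter.bi_sector = src_sector;

	bio_set_dev(wbio, dst_bdev);
	wbio->bi_opf = REQ_OP_WRITE;
	wbio->bi_iter.bi_sector = dst_sector;

	/*
	 * Both bios travel through the stack independently (so device mapper
	 * can remap each one); only the bottom driver needs to recognise the
	 * pair and emit a single copy command.  If pairing fails it can fall
	 * back to executing the read and the write through the shared pages.
	 */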
On 8/17/21 2:53 PM, Douglas Gilbert wrote:
> On 2021-08-17 4:41 p.m., Mikulas Patocka wrote:
>> On Tue, 17 Aug 2021, Bart Van Assche wrote:
>>> On 8/17/21 3:14 AM, SelvaKumar S wrote:
>>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>>> bio with control information as payload and submit to the device.
>>>> Larger copy operation may be divided if necessary by looking at device
>>>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
>>>> submitted to zoned device.
>>>> Native copy offload is not supported for stacked devices.
>>>
>>> Using a single operation for copy-offloading instead of separate
>>> operations for reading and writing is fundamentally incompatible with
>>> the device mapper. I think we need a copy-offloading implementation
>>> that is compatible with the device mapper.
>>
>> I once wrote a copy offload implementation that is compatible with device
>> mapper. The copy operation creates two bios (one for reading and one for
>> writing), passes them independently through device mapper and pairs them
>> at the physical device driver.
>>
>> It's here:
>> http://people.redhat.com/~mpatocka/patches/kernel/xcopy/current
>
> In my copy solution the read-side and write-side bio pairs share the
> same storage (i.e. ram). This gets around the need to copy data between
> the bio_s.
> See:
>     https://sg.danny.cz/sg/sg_v40.html
> in Section 8 on Request sharing. This technique can be efficiently
> extended to source --> destination1,destination2,... copies.
>
> Doug Gilbert
>
>> I verified that it works with iSCSI. Would you be interested in
>> continuing this work?

Hi Mikulas and Doug,

Yes, I'm interested in continuing Mikulas' work on copy offloading. I will
take a look at Doug's approach too for sharing buffers between read-side
and write-side bios. It may take a few months however before I can find
the time to work on this.

Thanks,

Bart.
> Native copy offload is not supported for stacked devices.
One of the main reasons that the historic attempts at supporting copy
offload did not get merged was that the ubiquitous deployment scenario,
stacked block devices, was not handled well.
Pitfalls surrounding stacking have been brought up several times in
response to your series. It is critically important that both kernel
plumbing and user-facing interfaces are defined in a way that works for
the most common use cases. This includes copying between block devices
and handling block device stacking. Stacking being one of the most
fundamental operating principles of the Linux block layer!
Proposing a brand new interface that out of the gate is incompatible
with both stacking and the copy offload capability widely implemented in
shipping hardware makes little sense. While NVMe currently only supports
copy operations inside a single namespace, it is surely only a matter of
time before that restriction is lifted.
Changing existing interfaces is painful, especially when these are
exposed to userland. We obviously can't predict every field or feature
that may be needed in the future. But we should at the very least build
the infrastructure around what already exists. And that's where the
proposed design falls short...
Bart, Mikulas

On Tue, Aug 17, 2021 at 10:44 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 8/17/21 3:14 AM, SelvaKumar S wrote:
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > Larger copy operation may be divided if necessary by looking at device
> > limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
> > submitted to zoned device.
> > Native copy offload is not supported for stacked devices.
>
> Using a single operation for copy-offloading instead of separate
> operations for reading and writing is fundamentally incompatible with
> the device mapper. I think we need a copy-offloading implementation that
> is compatible with the device mapper.
>

While each read/write command is for a single contiguous range of
device, with simple-copy we get to operate on multiple discontiguous
ranges, with a single command.
That seemed like a good opportunity to reduce control-plane traffic
(compared to read/write operations) as well.

With a separate read-and-write bio approach, each source-range will
spawn at least one read, one write and eventually one SCC command. And
it only gets worse as there could be many such discontiguous ranges (for
GC use-case at least) coming from user-space in a single payload.
Overall sequence will be
- Receive a payload from user-space
- Disassemble into many read-write pair bios at block-layer
- Assemble those (somehow) in NVMe to reduce simple-copy commands
- Send commands to device

We thought payload could be a good way to reduce the
disassembly/assembly work and traffic between block-layer to nvme.
How do you see this tradeoff? What seems necessary for device-mapper
usecase, appears to be a cost when device-mapper isn't used.
Especially for SCC (since copy is within single ns), device-mappers
may not be too compelling anyway.

Must device-mapper support be a requirement for the initial support atop SCC?
Or do you think it will still be a progress if we finalize the
user-space interface to cover all that is foreseeable. And for
device-mapper compatible transport between block-layer and NVMe - we
do it in the later stage when NVMe too comes up with better copy
capabilities?
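[To ground the traffic argument above in the structures added by this patch,
a sketch of the payload for a GC pass with three valid extents (the
struct blk_copy_payload / struct range_entry definitions come from the patch;
the concrete numbers and error handling are made up), versus the three read
bios plus three write bios the pair-based approach would generate:]

	/*
	 * One REQ_OP_COPY bio carries this control payload; the bottom
	 * driver sees all three discontiguous source ranges at once instead
	 * of having to recombine individual read/write bio pairs.
	 */
	struct blk_copy_payload *payload;

	payload = kvmalloc(struct_size(payload, range, 3), GFP_KERNEL);
	if (!payload)
		return -ENOMEM;

	payload->src_bdev = src_bdev;	/* source block device */
	payload->dest = dest_sector;	/* destination start, in sectors */
	payload->copy_nr_ranges = 3;

	payload->range[0] = (struct range_entry){ .src = 0x1000, .len = 0x80 };
	payload->range[1] = (struct range_entry){ .src = 0x2200, .len = 0x40 };
	payload->range[2] = (struct range_entry){ .src = 0x5a00, .len = 0x100 };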
On Thu, Aug 19, 2021 at 12:05 AM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
> > Native copy offload is not supported for stacked devices.
>
> One of the main reasons that the historic attempts at supporting copy
> offload did not get merged was that the ubiquitous deployment scenario,
> stacked block devices, was not handled well.
>
> Pitfalls surrounding stacking have been brought up several times in
> response to your series. It is critically important that both kernel
> plumbing and user-facing interfaces are defined in a way that works for
> the most common use cases. This includes copying between block devices
> and handling block device stacking. Stacking being one of the most
> fundamental operating principles of the Linux block layer!
>
> Proposing a brand new interface that out of the gate is incompatible
> with both stacking and the copy offload capability widely implemented in
> shipping hardware makes little sense. While NVMe currently only supports
> copy operations inside a single namespace, it is surely only a matter of
> time before that restriction is lifted.
>
> Changing existing interfaces is painful, especially when these are
> exposed to userland. We obviously can't predict every field or feature
> that may be needed in the future. But we should at the very least build
> the infrastructure around what already exists. And that's where the
> proposed design falls short...
>

Certainly, on the user-space interface we've got a few cracks to be filled
there, having missed future viability. But on stacking, can that be
additive? Could you please take a look at the other response (comment from
Bart) for the trade-offs.
On 8/20/21 3:39 AM, Kanchan Joshi wrote:
> Bart, Mikulas
>
> On Tue, Aug 17, 2021 at 10:44 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 8/17/21 3:14 AM, SelvaKumar S wrote:
>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>> bio with control information as payload and submit to the device.
>>> Larger copy operation may be divided if necessary by looking at device
>>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
>>> submitted to zoned device.
>>> Native copy offload is not supported for stacked devices.
>>
>> Using a single operation for copy-offloading instead of separate
>> operations for reading and writing is fundamentally incompatible with
>> the device mapper. I think we need a copy-offloading implementation that
>> is compatible with the device mapper.
>>
>
> While each read/write command is for a single contiguous range of
> device, with simple-copy we get to operate on multiple discontiguous
> ranges, with a single command.
> That seemed like a good opportunity to reduce control-plane traffic
> (compared to read/write operations) as well.
>
> With a separate read-and-write bio approach, each source-range will
> spawn at least one read, one write and eventually one SCC command. And
> it only gets worse as there could be many such discontiguous ranges (for
> GC use-case at least) coming from user-space in a single payload.
> Overall sequence will be
> - Receive a payload from user-space
> - Disassemble into many read-write pair bios at block-layer
> - Assemble those (somehow) in NVMe to reduce simple-copy commands
> - Send commands to device
>
> We thought payload could be a good way to reduce the
> disassembly/assembly work and traffic between block-layer to nvme.
> How do you see this tradeoff? What seems necessary for device-mapper
> usecase, appears to be a cost when device-mapper isn't used.
> Especially for SCC (since copy is within single ns), device-mappers
> may not be too compelling anyway.
>
> Must device-mapper support be a requirement for the initial support atop SCC?
> Or do you think it will still be a progress if we finalize the
> user-space interface to cover all that is foreseeable. And for
> device-mapper compatible transport between block-layer and NVMe - we
> do it in the later stage when NVMe too comes up with better copy
> capabilities?

Hi Kanchan,

These days there might be more systems that run the device mapper on top
of the NVMe driver or a SCSI driver than systems that do not use the device
mapper. It is common practice these days to use dm-crypt on personal
workstations and laptops. LVM (dm-linear) is popular because it is more
flexible than a traditional partition table. Android phones use
dm-verity on top of hardware encryption. In other words, not supporting
the device mapper means that a very large number of use cases is
excluded. So I think supporting the device mapper from the start is
important, even if that means combining individual bios at the bottom of
the storage stack into simple copy commands.

Thanks,

Bart.
Hi Bart, Mikulas, Martin, Douglas,

We will go through your previous work and use this thread as a medium for
further discussion, if we come across issues to be sorted out.

Thank you,
Nitesh Shetty

On Sat, Aug 21, 2021 at 2:48 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 8/20/21 3:39 AM, Kanchan Joshi wrote:
> > Bart, Mikulas
> >
> > On Tue, Aug 17, 2021 at 10:44 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>
> >> On 8/17/21 3:14 AM, SelvaKumar S wrote:
> >>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> >>> bio with control information as payload and submit to the device.
> >>> Larger copy operation may be divided if necessary by looking at device
> >>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
> >>> submitted to zoned device.
> >>> Native copy offload is not supported for stacked devices.
> >>
> >> Using a single operation for copy-offloading instead of separate
> >> operations for reading and writing is fundamentally incompatible with
> >> the device mapper. I think we need a copy-offloading implementation that
> >> is compatible with the device mapper.
> >>
> >
> > While each read/write command is for a single contiguous range of
> > device, with simple-copy we get to operate on multiple discontiguous
> > ranges, with a single command.
> > That seemed like a good opportunity to reduce control-plane traffic
> > (compared to read/write operations) as well.
> >
> > With a separate read-and-write bio approach, each source-range will
> > spawn at least one read, one write and eventually one SCC command. And
> > it only gets worse as there could be many such discontiguous ranges (for
> > GC use-case at least) coming from user-space in a single payload.
> > Overall sequence will be
> > - Receive a payload from user-space
> > - Disassemble into many read-write pair bios at block-layer
> > - Assemble those (somehow) in NVMe to reduce simple-copy commands
> > - Send commands to device
> >
> > We thought payload could be a good way to reduce the
> > disassembly/assembly work and traffic between block-layer to nvme.
> > How do you see this tradeoff? What seems necessary for device-mapper
> > usecase, appears to be a cost when device-mapper isn't used.
> > Especially for SCC (since copy is within single ns), device-mappers
> > may not be too compelling anyway.
> >
> > Must device-mapper support be a requirement for the initial support atop SCC?
> > Or do you think it will still be a progress if we finalize the
> > user-space interface to cover all that is foreseeable. And for
> > device-mapper compatible transport between block-layer and NVMe - we
> > do it in the later stage when NVMe too comes up with better copy
> > capabilities?
>
> Hi Kanchan,
>
> These days there might be more systems that run the device mapper on top
> of the NVMe driver or a SCSI driver than systems that do not use the device
> mapper. It is common practice these days to use dm-crypt on personal
> workstations and laptops. LVM (dm-linear) is popular because it is more
> flexible than a traditional partition table. Android phones use
> dm-verity on top of hardware encryption. In other words, not supporting
> the device mapper means that a very large number of use cases is
> excluded. So I think supporting the device mapper from the start is
> important, even if that means combining individual bios at the bottom of
> the storage stack into simple copy commands.
>
> Thanks,
>
> Bart.
diff --git a/block/blk-core.c b/block/blk-core.c
index d2722ecd4d9b..541b1561b4af 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -704,6 +704,17 @@ static noinline int should_fail_bio(struct bio *bio)
 }
 ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
 
+static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
+		sector_t nr_sectors, sector_t max_sect)
+{
+	if (nr_sectors && max_sect &&
+	    (nr_sectors > max_sect || start > max_sect - nr_sectors)) {
+		handle_bad_sector(bio, max_sect);
+		return -EIO;
+	}
+	return 0;
+}
+
 /*
  * Check whether this bio extends beyond the end of the device or partition.
  * This may well happen - the kernel calls bread() without checking the size of
@@ -723,6 +734,61 @@ static inline int bio_check_eod(struct bio *bio)
 	return 0;
 }
 
+/*
+ * check for eod limits and remap ranges if needed
+ */
+static int blk_check_copy(struct bio *bio)
+{
+	struct blk_copy_payload *payload = bio_data(bio);
+	sector_t dst_max_sect, dst_start_sect, copy_size = 0;
+	sector_t src_max_sect, src_start_sect;
+	struct block_device *bd_part;
+	int i, ret = -EIO;
+
+	rcu_read_lock();
+
+	bd_part = bio->bi_bdev;
+	if (unlikely(!bd_part))
+		goto err;
+
+	dst_max_sect = bdev_nr_sectors(bd_part);
+	dst_start_sect = bd_part->bd_start_sect;
+
+	src_max_sect = bdev_nr_sectors(payload->src_bdev);
+	src_start_sect = payload->src_bdev->bd_start_sect;
+
+	if (unlikely(should_fail_request(bd_part, bio->bi_iter.bi_size)))
+		goto err;
+
+	if (unlikely(bio_check_ro(bio)))
+		goto err;
+
+	rcu_read_unlock();
+
+	for (i = 0; i < payload->copy_nr_ranges; i++) {
+		ret = bio_check_copy_eod(bio, payload->range[i].src,
+				payload->range[i].len, src_max_sect);
+		if (unlikely(ret))
+			goto out;
+
+		payload->range[i].src += src_start_sect;
+		copy_size += payload->range[i].len;
+	}
+
+	/* check if copy length crosses eod */
+	ret = bio_check_copy_eod(bio, bio->bi_iter.bi_sector,
+			copy_size, dst_max_sect);
+	if (unlikely(ret))
+		goto out;
+
+	bio->bi_iter.bi_sector += dst_start_sect;
+	return 0;
+err:
+	rcu_read_unlock();
+out:
+	return ret;
+}
+
 /*
  * Remap block n of partition p to block n+start(p) of the disk.
  */
@@ -799,13 +865,15 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 	if (should_fail_bio(bio))
 		goto end_io;
 
-	if (unlikely(bio_check_ro(bio)))
-		goto end_io;
-	if (!bio_flagged(bio, BIO_REMAPPED)) {
-		if (unlikely(bio_check_eod(bio)))
-			goto end_io;
-		if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
+	if (likely(!op_is_copy(bio->bi_opf))) {
+		if (unlikely(bio_check_ro(bio)))
 			goto end_io;
+		if (!bio_flagged(bio, BIO_REMAPPED)) {
+			if (unlikely(bio_check_eod(bio)))
+				goto end_io;
+			if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
+				goto end_io;
+		}
 	}
 
 	/*
@@ -829,6 +897,10 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		if (!blk_queue_discard(q))
 			goto not_supported;
 		break;
+	case REQ_OP_COPY:
+		if (unlikely(blk_check_copy(bio)))
+			goto end_io;
+		break;
 	case REQ_OP_SECURE_ERASE:
 		if (!blk_queue_secure_erase(q))
 			goto not_supported;
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9f09beadcbe3..7fee0ae95c44 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -151,6 +151,258 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * Wait on and process all in-flight BIOs. This must only be called once
+ * all bios have been issued so that the refcount can only decrease.
+ * This just waits for all bios to make it through cio_bio_end_io. IO
+ * errors are propagated through cio->io_error.
+ */
+static int cio_await_completion(struct cio *cio)
+{
+	int ret = 0;
+
+	while (atomic_read(&cio->refcount)) {
+		cio->waiter = current;
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		blk_io_schedule();
+		/* wake up sets us TASK_RUNNING */
+		cio->waiter = NULL;
+		ret = cio->io_err;
+	}
+	kvfree(cio);
+
+	return ret;
+}
+
+/*
+ * The BIO completion handler simply decrements refcount.
+ * Also wake up process, if this is the last bio to be completed.
+ *
+ * During I/O bi_private points at the cio.
+ */
+static void cio_bio_end_io(struct bio *bio)
+{
+	struct cio *cio = bio->bi_private;
+
+	if (bio->bi_status)
+		cio->io_err = bio->bi_status;
+	kvfree(page_address(bio_first_bvec_all(bio)->bv_page) +
+			bio_first_bvec_all(bio)->bv_offset);
+	bio_put(bio);
+
+	if (atomic_dec_and_test(&cio->refcount) && cio->waiter)
+		wake_up_process(cio->waiter);
+}
+
+int blk_copy_offload_submit_bio(struct block_device *bdev,
+		struct blk_copy_payload *payload, int payload_size,
+		struct cio *cio, gfp_t gfp_mask)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+	struct bio *bio;
+
+	bio = bio_map_kern(q, payload, payload_size, gfp_mask);
+	if (IS_ERR(bio))
+		return PTR_ERR(bio);
+
+	bio_set_dev(bio, bdev);
+	bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
+	bio->bi_iter.bi_sector = payload->dest;
+	bio->bi_end_io = cio_bio_end_io;
+	bio->bi_private = cio;
+	atomic_inc(&cio->refcount);
+	submit_bio(bio);
+
+	return 0;
+}
+
+/* Go through all the enrties inside user provided payload, and determine the
+ * maximum number of entries in a payload, based on device's scc-limits.
+ */
+static inline int blk_max_payload_entries(int nr_srcs, struct range_entry *rlist,
+		int max_nr_srcs, sector_t max_copy_range_sectors, sector_t max_copy_len)
+{
+	sector_t range_len, copy_len = 0, remaining = 0;
+	int ri = 0, pi = 1, max_pi = 0;
+
+	for (ri = 0; ri < nr_srcs; ri++) {
+		for (remaining = rlist[ri].len; remaining > 0; remaining -= range_len) {
+			range_len = min3(remaining, max_copy_range_sectors,
+					max_copy_len - copy_len);
+			pi++;
+			copy_len += range_len;
+
+			if ((pi == max_nr_srcs) || (copy_len == max_copy_len)) {
+				max_pi = max(max_pi, pi);
+				pi = 1;
+				copy_len = 0;
+			}
+		}
+	}
+
+	return max(max_pi, pi);
+}
+
+/*
+ * blk_copy_offload_scc - Use device's native copy offload feature
+ * Go through user provide payload, prepare new payload based on device's copy offload limits.
+ */
+int blk_copy_offload_scc(struct block_device *src_bdev, int nr_srcs,
+		struct range_entry *rlist, struct block_device *dest_bdev,
+		sector_t dest, gfp_t gfp_mask)
+{
+	struct request_queue *q = bdev_get_queue(dest_bdev);
+	struct cio *cio = NULL;
+	struct blk_copy_payload *payload;
+	sector_t range_len, copy_len = 0, remaining = 0;
+	sector_t src_blk, cdest = dest;
+	sector_t max_copy_range_sectors, max_copy_len;
+	int ri = 0, pi = 0, ret = 0, payload_size, max_pi, max_nr_srcs;
+
+	cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+	if (!cio)
+		return -ENOMEM;
+	atomic_set(&cio->refcount, 0);
+
+	max_nr_srcs = q->limits.max_copy_nr_ranges;
+	max_copy_range_sectors = q->limits.max_copy_range_sectors;
+	max_copy_len = q->limits.max_copy_sectors;
+
+	max_pi = blk_max_payload_entries(nr_srcs, rlist, max_nr_srcs,
+			max_copy_range_sectors, max_copy_len);
+	payload_size = struct_size(payload, range, max_pi);
+
+	payload = kvmalloc(payload_size, gfp_mask);
+	if (!payload) {
+		ret = -ENOMEM;
+		goto free_cio;
+	}
+	payload->src_bdev = src_bdev;
+
+	for (ri = 0; ri < nr_srcs; ri++) {
+		for (remaining = rlist[ri].len, src_blk = rlist[ri].src; remaining > 0;
+				remaining -= range_len, src_blk += range_len) {
+
+			range_len = min3(remaining, max_copy_range_sectors,
+					max_copy_len - copy_len);
+			payload->range[pi].len = range_len;
+			payload->range[pi].src = src_blk;
+			pi++;
+			copy_len += range_len;
+
+			/* Submit current payload, if crossing device copy limits */
+			if ((pi == max_nr_srcs) || (copy_len == max_copy_len)) {
+				payload->dest = cdest;
+				payload->copy_nr_ranges = pi;
+				ret = blk_copy_offload_submit_bio(dest_bdev, payload,
+						payload_size, cio, gfp_mask);
+				if (ret)
+					goto free_payload;
+
+				/* reset index, length and allocate new payload */
+				pi = 0;
+				cdest += copy_len;
+				copy_len = 0;
+				payload = kvmalloc(payload_size, gfp_mask);
+				if (!payload) {
+					ret = -ENOMEM;
+					goto free_cio;
+				}
+				payload->src_bdev = src_bdev;
+			}
+		}
+	}
+
+	if (pi) {
+		payload->dest = cdest;
+		payload->copy_nr_ranges = pi;
+		ret = blk_copy_offload_submit_bio(dest_bdev, payload, payload_size, cio, gfp_mask);
+		if (ret)
+			goto free_payload;
+	}
+
+	/* Wait for completion of all IO's*/
+	ret = cio_await_completion(cio);
+
+	return ret;
+
+free_payload:
+	kvfree(payload);
+free_cio:
+	cio_await_completion(cio);
+	return ret;
+}
+
+static inline sector_t blk_copy_len(struct range_entry *rlist, int nr_srcs)
+{
+	int i;
+	sector_t len = 0;
+
+	for (i = 0; i < nr_srcs; i++) {
+		if (rlist[i].len)
+			len += rlist[i].len;
+		else
+			return 0;
+	}
+
+	return len;
+}
+
+static inline bool blk_check_offload_scc(struct request_queue *src_q,
+		struct request_queue *dest_q)
+{
+	if (src_q == dest_q && src_q->limits.copy_offload == BLK_COPY_OFFLOAD_SCC)
+		return true;
+
+	return false;
+}
+
+/*
+ * blkdev_issue_copy - queue a copy
+ * @src_bdev:	source block device
+ * @nr_srcs:	number of source ranges to copy
+ * @src_rlist:	array of source ranges
+ * @dest_bdev:	destination block device
+ * @dest:	destination in sector
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @flags:	BLKDEV_COPY_* flags to control behaviour
+ *
+ * Description:
+ *	Copy source ranges from source block device to destination block device.
+ *	length of a source range cannot be zero.
+ */
+int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
+		struct range_entry *src_rlist, struct block_device *dest_bdev,
+		sector_t dest, gfp_t gfp_mask, int flags)
+{
+	struct request_queue *src_q = bdev_get_queue(src_bdev);
+	struct request_queue *dest_q = bdev_get_queue(dest_bdev);
+	sector_t copy_len;
+	int ret = -EINVAL;
+
+	if (!src_q || !dest_q)
+		return -ENXIO;
+
+	if (!nr_srcs)
+		return -EINVAL;
+
+	if (nr_srcs >= MAX_COPY_NR_RANGE)
+		return -EINVAL;
+
+	copy_len = blk_copy_len(src_rlist, nr_srcs);
+	if (!copy_len && copy_len >= MAX_COPY_TOTAL_LENGTH)
+		return -EINVAL;
+
+	if (bdev_read_only(dest_bdev))
+		return -EPERM;
+
+	if (blk_check_offload_scc(src_q, dest_q))
+		ret = blk_copy_offload_scc(src_bdev, nr_srcs, src_rlist, dest_bdev, dest, gfp_mask);
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_copy);
+
 /**
  * __blkdev_issue_write_same - generate number of bios with same page
  * @bdev:	target blockdev
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 86fce751bb17..7643fc868521 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -67,6 +67,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_WRITE:
+	case REQ_OP_COPY:
 		return blk_rq_zone_is_seq(rq);
 	default:
 		return false;
diff --git a/block/bounce.c b/block/bounce.c
index 05fc7148489d..d9b05aaf6e56 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -176,6 +176,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src)
 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
 
 	switch (bio_op(bio)) {
+	case REQ_OP_COPY:
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3d67d0fbc868..068fa2e8896a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -73,6 +73,7 @@ static inline bool bio_has_data(struct bio *bio)
 static inline bool bio_no_advance_iter(const struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
+	       bio_op(bio) == REQ_OP_COPY ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
 	       bio_op(bio) == REQ_OP_WRITE_SAME ||
 	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9e392daa1d7f..1ab77176cb46 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -347,6 +347,8 @@ enum req_opf {
 	REQ_OP_ZONE_RESET	= 15,
 	/* reset all the zone present on the device */
 	REQ_OP_ZONE_RESET_ALL	= 17,
+	/* copy ranges within device */
+	REQ_OP_COPY		= 19,
 
 	/* Driver private requests */
 	REQ_OP_DRV_IN		= 34,
@@ -470,6 +472,11 @@ static inline bool op_is_discard(unsigned int op)
 	return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
 }
 
+static inline bool op_is_copy(unsigned int op)
+{
+	return (op & REQ_OP_MASK) == REQ_OP_COPY;
+}
+
 /*
  * Check if a bio or request operation is a zone management operation, with
  * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
@@ -529,4 +536,17 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+struct cio {
+	atomic_t refcount;
+	blk_status_t io_err;
+	struct task_struct *waiter;	/* waiting task (NULL if none) */
+};
+
+struct blk_copy_payload {
+	struct block_device	*src_bdev;
+	sector_t		dest;
+	int			copy_nr_ranges;
+	struct range_entry	range[];
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fd4cfaadda5b..38369dff6a36 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -52,6 +52,12 @@ struct blk_keyslot_manager;
 /* Doing classic polling */
 #define BLK_MQ_POLL_CLASSIC -1
 
+/* Define copy offload options */
+enum blk_copy {
+	BLK_COPY_OFFLOAD_EMULATE = 0,
+	BLK_COPY_OFFLOAD_SCC,
+};
+
 /*
  * Maximum number of blkcg policies allowed to be registered concurrently.
  * Defined here to simplify include dependency.
@@ -1051,6 +1057,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
 
+	if (unlikely(op == REQ_OP_COPY))
+		return q->limits.max_copy_sectors;
+
 	if (unlikely(op == REQ_OP_WRITE_SAME))
 		return q->limits.max_write_same_sectors;
 
@@ -1326,6 +1335,10 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, int flags,
 		struct bio **biop);
 
+int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
+		struct range_entry *src_rlist, struct block_device *dest_bdev,
+		sector_t dest, gfp_t gfp_mask, int flags);
+
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..7a97b588d892 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,18 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/* Maximum no of entries supported */
+#define MAX_COPY_NR_RANGE	(1 << 12)
+
+/* maximum total copy length */
+#define MAX_COPY_TOTAL_LENGTH	(1 << 21)
+
+/* Source range entry for copy */
+struct range_entry {
+	__u64 src;
+	__u64 len;
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
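[To illustrate the calling convention added above, a minimal in-kernel usage
sketch; the caller, the block device and the range values are hypothetical
and not part of the patch. Per the code above, the copy is only offloaded
when source and destination share a queue whose copy_offload is
BLK_COPY_OFFLOAD_SCC, so callers need a fallback path:]

	/* Hypothetical caller of the new API (not part of the patch). */
	static int copy_two_extents(struct block_device *bdev, sector_t dest)
	{
		struct range_entry ranges[] = {
			{ .src = 2048, .len = 1024 },	/* sectors */
			{ .src = 8192, .len = 512 },
		};

		/*
		 * Per the documented constraints: nr_srcs must stay below
		 * MAX_COPY_NR_RANGE, no source range may have zero length,
		 * and -EINVAL is returned when the copy cannot be offloaded
		 * (e.g. source and destination queues differ), in which case
		 * the caller falls back to ordinary reads and writes.
		 */
		return blkdev_issue_copy(bdev, ARRAY_SIZE(ranges), ranges,
					 bdev, dest, GFP_KERNEL, 0);
	}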