Message ID | 20220909102136.3020-5-joshi.k@samsung.com (mailing list archive)
---|---
State | New
Series | fixed-buffer for uring-cmd/passthru
> -static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
> +static struct bio *bio_map_get(struct request *rq, unsigned int nr_vecs,
>  		gfp_t gfp_mask)

bio_map_get is a very confusing name.  And I also still think this is
the wrong way to go.  If plain slab allocations don't use proper
per-cpu caches we have a MM problem and need to talk to the slab
maintainers and not use the overkill bio_set here.

> +/* Prepare bio for passthrough IO given an existing bvec iter */
> +int blk_rq_map_user_bvec(struct request *rq, struct iov_iter *iter)

I'm a little confused about the interface we're trying to present from
the block layer to the driver here.

blk_rq_map_user_iov really should be able to detect that it is called
on a bvec iter and just do the right thing rather than needing different
helpers.

> +	/*
> +	 * If the queue doesn't support SG gaps and adding this
> +	 * offset would create a gap, disallow it.
> +	 */
> +	if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset))
> +		goto out_err;

So now you limit the input that is accepted?  That's not really how
iov_iters are used.  We can either try to reshuffle the bvecs, or
just fall back to the copy data version as blk_rq_map_user_iov does
for 'weird' iters.

> +
> +	/* check full condition */
> +	if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
> +		goto out_err;
> +
> +	if (bytes + bv->bv_len <= nr_iter &&
> +	    bv->bv_offset + bv->bv_len <= PAGE_SIZE) {
> +		nsegs++;
> +		bytes += bv->bv_len;
> +	} else
> +		goto out_err;

Nit: This would read much better as:

	if (bytes + bv->bv_len > nr_iter)
		goto out_err;
	if (bv->bv_offset + bv->bv_len > PAGE_SIZE)
		goto out_err;

	bytes += bv->bv_len;
	nsegs++;
On Tue, Sep 20, 2022 at 02:08:02PM +0200, Christoph Hellwig wrote:
>> -static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>> +static struct bio *bio_map_get(struct request *rq, unsigned int nr_vecs,
>>  		gfp_t gfp_mask)
>
>bio_map_get is a very confusing name.

So I chose that name because the functionality is the opposite of what
we do inside the existing bio_map_put helper. In that way it is
symmetric.

>And I also still think this is
>the wrong way to go.  If plain slab allocations don't use proper
>per-cpu caches we have a MM problem and need to talk to the slab
>maintainers and not use the overkill bio_set here.

This series is not about using (or not using) bio_set. The attempt here
has been to use the pre-mapped buffers (and bvec) that we got from
io_uring.

>> +/* Prepare bio for passthrough IO given an existing bvec iter */
>> +int blk_rq_map_user_bvec(struct request *rq, struct iov_iter *iter)
>
>I'm a little confused about the interface we're trying to present from
>the block layer to the driver here.
>
>blk_rq_map_user_iov really should be able to detect that it is called
>on a bvec iter and just do the right thing rather than needing different
>helpers.

I too explored that possibility, but found that it does not. It maps the
user-pages into bio either directly or by doing that copy (in certain odd
conditions) but does not know how to deal with existing bvec. The
reason, I guess, is that no one felt the need to try passthrough for
bvecs before; it makes sense only in the context of io_uring passthrough.

And it really felt cleaner to me write a new function rather than
overloading the blk_rq_map_user_iov with multiple if/else canals. I
tried that again after your comment, but it does not seem to produce any
good-looking code. The other factor is that it seemed safer to go this
way, as I am more sure that I will not break something else (using
blk_rq_map_user_iov).

>> +	/*
>> +	 * If the queue doesn't support SG gaps and adding this
>> +	 * offset would create a gap, disallow it.
>> +	 */
>> +	if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset))
>> +		goto out_err;
>
>So now you limit the input that is accepted?  That's not really how
>iov_iters are used.  We can either try to reshuffle the bvecs, or
>just fall back to the copy data version as blk_rq_map_user_iov does
>for 'weird' iters.

Since I was writing a 'new' helper for passthrough only, I thought it
would not be too bad to just bail out (rather than try to handle it
using a copy) if we hit this queue_virt_boundary related situation.
To handle it the 'copy data' way we would need this:

585         else if (queue_virt_boundary(q))
586                 copy = queue_virt_boundary(q) & iov_iter_gap_alignment(iter);
587

But iov_iter_gap_alignment does not work on bvec iters. See line #1274
below:

1264 unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
1265 {
1266         unsigned long res = 0;
1267         unsigned long v = 0;
1268         size_t size = i->count;
1269         unsigned k;
1270
1271         if (iter_is_ubuf(i))
1272                 return 0;
1273
1274         if (WARN_ON(!iter_is_iovec(i)))
1275                 return ~0U;

Do you see a way to overcome this? Or maybe this can be revisited as we
are not missing a lot?

>> +
>> +	/* check full condition */
>> +	if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
>> +		goto out_err;
>> +
>> +	if (bytes + bv->bv_len <= nr_iter &&
>> +	    bv->bv_offset + bv->bv_len <= PAGE_SIZE) {
>> +		nsegs++;
>> +		bytes += bv->bv_len;
>> +	} else
>> +		goto out_err;
>
>Nit: This would read much better as:
>
>	if (bytes + bv->bv_len > nr_iter)
>		goto out_err;
>	if (bv->bv_offset + bv->bv_len > PAGE_SIZE)
>		goto out_err;
>
>	bytes += bv->bv_len;
>	nsegs++;

Indeed, cleaner. Thanks.
On Thu, Sep 22, 2022 at 08:53:31PM +0530, Kanchan Joshi wrote:
>> blk_rq_map_user_iov really should be able to detect that it is called
>> on a bvec iter and just do the right thing rather than needing different
>> helpers.
>
> I too explored that possibility, but found that it does not. It maps the
> user-pages into bio either directly or by doing that copy (in certain odd
> conditions) but does not know how to deal with existing bvec.

What do you mean with existing bvec?  We allocate a brand new bio here
that we want to map the next chunk of the iov_iter to, and that
is exactly what blk_rq_map_user_iov does.  What blk_rq_map_user_iov
currently does not do is to implement this mapping efficiently
for ITER_BVEC iters, but that is something that could and should
be fixed.

> And it really felt cleaner to me write a new function rather than
> overloading the blk_rq_map_user_iov with multiple if/else canals.

No.  The whole point of the iov_iter is to support this "overload".

> But iov_iter_gap_alignment does not work on bvec iters. Line #1274 below

So we'll need to fix it.

> 1264 unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
> 1265 {
> 1266         unsigned long res = 0;
> 1267         unsigned long v = 0;
> 1268         size_t size = i->count;
> 1269         unsigned k;
> 1270
> 1271         if (iter_is_ubuf(i))
> 1272                 return 0;
> 1273
> 1274         if (WARN_ON(!iter_is_iovec(i)))
> 1275                 return ~0U;
>
> Do you see a way to overcome this? Or maybe this can be revisited as we
> are not missing a lot?

We just need to implement the equivalent functionality for bvecs.  It
isn't really hard, it just wasn't required so far.
On Fri, Sep 23, 2022 at 05:29:41PM +0200, Christoph Hellwig wrote:
>On Thu, Sep 22, 2022 at 08:53:31PM +0530, Kanchan Joshi wrote:
>>> blk_rq_map_user_iov really should be able to detect that it is called
>>> on a bvec iter and just do the right thing rather than needing different
>>> helpers.
>>
>> I too explored that possibility, but found that it does not. It maps the
>> user-pages into bio either directly or by doing that copy (in certain odd
>> conditions) but does not know how to deal with existing bvec.
>
>What do you mean with existing bvec? We allocate a brand new bio here
>that we want to map the next chunk of the iov_iter to, and that
>is exactly what blk_rq_map_user_iov does. What blk_rq_map_user_iov
>currently does not do is to implement this mapping efficiently
>for ITER_BVEC iters

It is clear that it was not written for ITER_BVEC iters.
Otherwise that WARN_ON would not have hit.

And efficiency is the concern as we are moving to a more heavyweight
helper that 'handles' weird conditions rather than just 'bails out'.
These alignment checks end up adding a loop that traverses
the entire ITER_BVEC.
Also blk_rq_map_user_iov uses bio_iter_advance, which also seems
cycle-consuming given the below code comment in io_import_fixed():

if (offset) {
	/*
	 * Don't use iov_iter_advance() here, as it's really slow for
	 * using the latter parts of a big fixed buffer - it iterates
	 * over each segment manually. We can cheat a bit here, because
	 * we know that:

So if at all I could move the code inside blk_rq_map_user_iov, I will
need to see that I skip doing iov_iter_advance.

I still think it would be better to take this route only when there are
other usecases/callers of this. And that is a future thing. For the
current requirement, it seems better to prioritize efficiency.

>, but that is something that could and should
>be fixed.
>
>> And it really felt cleaner to me write a new function rather than
>> overloading the blk_rq_map_user_iov with multiple if/else canals.
>
>No. The whole point of the iov_iter is to support this "overload".

Even if I try taking that route, the WARN_ON is a blocker that prevents
me from putting this code inside blk_rq_map_user_iov.

>> But iov_iter_gap_alignment does not work on bvec iters. Line #1274 below
>
>So we'll need to fix it.

Do you see a good way to trigger this virt-alignment condition? I have
not seen this hitting (the SG gap checks) when running with fixedbufs.

>> 1264 unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
>> 1265 {
>> 1266         unsigned long res = 0;
>> 1267         unsigned long v = 0;
>> 1268         size_t size = i->count;
>> 1269         unsigned k;
>> 1270
>> 1271         if (iter_is_ubuf(i))
>> 1272                 return 0;
>> 1273
>> 1274         if (WARN_ON(!iter_is_iovec(i)))
>> 1275                 return ~0U;
>>
>> Do you see a way to overcome this? Or maybe this can be revisited as we
>> are not missing a lot?
>
>We just need to implement the equivalent functionality for bvecs. It
>isn't really hard, it just wasn't required so far.

Can the virt-boundary alignment gap exist for ITER_BVEC iter in the
first place? Two reasons to ask this question:

1. Commit description of this code (from Al Viro) says:

"iov_iter_gap_alignment(): get rid of iterate_all_kinds()

For one thing, it's only used for iovec (and makes sense only for
those)."

2. I did not hit it so far, as I mentioned above.
On Sat, Sep 24, 2022 at 12:13:49AM +0530, Kanchan Joshi wrote:
>On Fri, Sep 23, 2022 at 05:29:41PM +0200, Christoph Hellwig wrote:
>>On Thu, Sep 22, 2022 at 08:53:31PM +0530, Kanchan Joshi wrote:
>>>>blk_rq_map_user_iov really should be able to detect that it is called
>>>>on a bvec iter and just do the right thing rather than needing different
>>>>helpers.
>>>
>>>I too explored that possibility, but found that it does not. It maps the
>>>user-pages into bio either directly or by doing that copy (in certain odd
>>>conditions) but does not know how to deal with existing bvec.
>>
>>What do you mean with existing bvec? We allocate a brand new bio here
>>that we want to map the next chunk of the iov_iter to, and that
>>is exactly what blk_rq_map_user_iov does. What blk_rq_map_user_iov
>>currently does not do is to implement this mapping efficiently
>>for ITER_BVEC iters
>
>It is clear that it was not written for ITER_BVEC iters.
>Otherwise that WARN_ON would not have hit.
>
>And efficiency is the concern as we are moving to a more heavyweight
>helper that 'handles' weird conditions rather than just 'bails out'.
>These alignment checks end up adding a loop that traverses
>the entire ITER_BVEC.
>Also blk_rq_map_user_iov uses bio_iter_advance, which also seems
>cycle-consuming given the below code comment in io_import_fixed():
>
>if (offset) {
>	/*
>	 * Don't use iov_iter_advance() here, as it's really slow for
>	 * using the latter parts of a big fixed buffer - it iterates
>	 * over each segment manually. We can cheat a bit here, because
>	 * we know that:
>
>So if at all I could move the code inside blk_rq_map_user_iov, I will
>need to see that I skip doing iov_iter_advance.
>
>I still think it would be better to take this route only when there are
>other usecases/callers of this. And that is a future thing. For the
>current requirement, it seems better to prioritize efficiency.
>
>>, but that is something that could and should
>>be fixed.
>>
>>>And it really felt cleaner to me write a new function rather than
>>>overloading the blk_rq_map_user_iov with multiple if/else canals.
>>
>>No. The whole point of the iov_iter is to support this "overload".
>
>Even if I try taking that route, the WARN_ON is a blocker that prevents
>me from putting this code inside blk_rq_map_user_iov.
>
>>>But iov_iter_gap_alignment does not work on bvec iters. Line #1274 below
>>
>>So we'll need to fix it.
>
>Do you see a good way to trigger this virt-alignment condition? I have
>not seen this hitting (the SG gap checks) when running with fixedbufs.
>
>>>1264 unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
>>>1265 {
>>>1266         unsigned long res = 0;
>>>1267         unsigned long v = 0;
>>>1268         size_t size = i->count;
>>>1269         unsigned k;
>>>1270
>>>1271         if (iter_is_ubuf(i))
>>>1272                 return 0;
>>>1273
>>>1274         if (WARN_ON(!iter_is_iovec(i)))
>>>1275                 return ~0U;
>>>
>>>Do you see a way to overcome this? Or maybe this can be revisited as we
>>>are not missing a lot?
>>
>>We just need to implement the equivalent functionality for bvecs. It
>>isn't really hard, it just wasn't required so far.
>
>Can the virt-boundary alignment gap exist for ITER_BVEC iter in the
>first place? Two reasons to ask this question:
>
>1. Commit description of this code (from Al Viro) says:
>
>"iov_iter_gap_alignment(): get rid of iterate_all_kinds()
>
>For one thing, it's only used for iovec (and makes sense only for
>those)."
>
>2. I did not hit it so far, as I mentioned above.

And we also have the below condition (patch from Linus) that restricts
blk_rq_map_user_iov to only iovec iterators:

commit a0ac402cfcdc904f9772e1762b3fda112dcc56a0
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Dec 6 16:18:14 2016 -0800

    Don't feed anything but regular iovec's to blk_rq_map_user_iov

    In theory we could map other things, but there's a reason that function
    is called "user_iov".  Using anything else (like splice can do) just
    confuses it.
    Reported-and-tested-by: Johannes Thumshirn <jthumshirn@suse.de>
    Cc: Al Viro <viro@ZenIV.linux.org.uk>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/block/blk-map.c b/block/blk-map.c
index b8657fa8dc9a..27fd8d92892d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -118,6 +118,9 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
 	struct iov_iter i;
 	int ret;
 
+	if (!iter_is_iovec(iter))
+		goto fail;
+
 	if (map_data)
 		copy = true;
 	else if (iov_iter_alignment(iter) & align)
@@ -140,6 +143,7 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
 
 unmap_rq:
 	__blk_rq_unmap_user(bio);
+fail:
 	rq->bio = NULL;
 	return -EINVAL;
 }
On Sat, Sep 24, 2022 at 12:13:49AM +0530, Kanchan Joshi wrote:
> And efficiency is the concern as we are moving to a more heavyweight
> helper that 'handles' weird conditions rather than just 'bails out'.
> These alignment checks end up adding a loop that traverses
> the entire ITER_BVEC.
> Also blk_rq_map_user_iov uses bio_iter_advance, which also seems
> cycle-consuming given the below code comment in io_import_fixed():

No one says you should use the existing loop in blk_rq_map_user_iov.
Just make it call your new helper early on when an ITER_BVEC iter is
passed in.

> Do you see a good way to trigger this virt-alignment condition? I have
> not seen this hitting (the SG gap checks) when running with fixedbufs.

You'd need to make sure the iovec passed to the fixed buffer
registration is chunked up smaller than the nvme page size.

E.g. if you pass lots of non-contiguous 512 byte sized iovecs to the
buffer registration.

>> We just need to implement the equivalent functionality for bvecs.  It
>> isn't really hard, it just wasn't required so far.
>
> Can the virt-boundary alignment gap exist for ITER_BVEC iter in the
> first place?

Yes.  bvecs are just a way to represent data.  If the individual
segments don't fit the virt boundary you still need to deal with it.
On Mon, Sep 26, 2022 at 04:50:40PM +0200, Christoph Hellwig wrote:
>On Sat, Sep 24, 2022 at 12:13:49AM +0530, Kanchan Joshi wrote:
>> And efficiency is the concern as we are moving to a more heavyweight
>> helper that 'handles' weird conditions rather than just 'bails out'.
>> These alignment checks end up adding a loop that traverses
>> the entire ITER_BVEC.
>> Also blk_rq_map_user_iov uses bio_iter_advance, which also seems
>> cycle-consuming given the below code comment in io_import_fixed():
>
>No one says you should use the existing loop in blk_rq_map_user_iov.
>Just make it call your new helper early on when an ITER_BVEC iter is
>passed in.

Indeed. I will send the v10 with that approach.

>> Do you see a good way to trigger this virt-alignment condition? I have
>> not seen this hitting (the SG gap checks) when running with fixedbufs.
>
>You'd need to make sure the iovec passed to the fixed buffer
>registration is chunked up smaller than the nvme page size.
>
>E.g. if you pass lots of non-contiguous 512 byte sized iovecs to the
>buffer registration.
>
>>> We just need to implement the equivalent functionality for bvecs.  It
>>> isn't really hard, it just wasn't required so far.
>>
>> Can the virt-boundary alignment gap exist for ITER_BVEC iter in the
>> first place?
>
>Yes.  bvecs are just a way to represent data.  If the individual
>segments don't fit the virt boundary you still need to deal with it.

Thanks for clearing this up.
diff --git a/block/blk-map.c b/block/blk-map.c
index 7693f8e3c454..5dcfa112f240 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -241,17 +241,10 @@ static void bio_map_put(struct bio *bio)
 	}
 }
 
-static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
+static struct bio *bio_map_get(struct request *rq, unsigned int nr_vecs,
 		gfp_t gfp_mask)
 {
-	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
-	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
 	struct bio *bio;
-	int ret;
-	int j;
-
-	if (!iov_iter_count(iter))
-		return -EINVAL;
 
 	if (rq->cmd_flags & REQ_POLLED) {
 		blk_opf_t opf = rq->cmd_flags | REQ_ALLOC_CACHE;
@@ -259,13 +252,31 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		bio = bio_alloc_bioset(NULL, nr_vecs, opf, gfp_mask,
 					&fs_bio_set);
 		if (!bio)
-			return -ENOMEM;
+			return NULL;
 	} else {
 		bio = bio_kmalloc(nr_vecs, gfp_mask);
 		if (!bio)
-			return -ENOMEM;
+			return NULL;
 		bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));
 	}
+	return bio;
+}
+
+static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
+		gfp_t gfp_mask)
+{
+	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
+	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
+	struct bio *bio;
+	int ret;
+	int j;
+
+	if (!iov_iter_count(iter))
+		return -EINVAL;
+
+	bio = bio_map_get(rq, nr_vecs, gfp_mask);
+	if (bio == NULL)
+		return -ENOMEM;
 
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
@@ -611,6 +622,62 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL(blk_rq_map_user);
 
+/* Prepare bio for passthrough IO given an existing bvec iter */
+int blk_rq_map_user_bvec(struct request *rq, struct iov_iter *iter)
+{
+	struct request_queue *q = rq->q;
+	size_t nr_iter, nr_segs, i;
+	struct bio *bio;
+	struct bio_vec *bv, *bvecs, *bvprvp = NULL;
+	struct queue_limits *lim = &q->limits;
+	unsigned int nsegs = 0, bytes = 0;
+
+	nr_iter = iov_iter_count(iter);
+	nr_segs = iter->nr_segs;
+
+	if (!nr_iter || (nr_iter >> SECTOR_SHIFT) > queue_max_hw_sectors(q))
+		return -EINVAL;
+	if (nr_segs > queue_max_segments(q))
+		return -EINVAL;
+
+	/* no iovecs to alloc, as we already have a BVEC iterator */
+	bio = bio_map_get(rq, 0, GFP_KERNEL);
+	if (bio == NULL)
+		return -ENOMEM;
+
+	bio_iov_bvec_set(bio, iter);
+	blk_rq_bio_prep(rq, bio, nr_segs);
+
+	/* loop to perform a bunch of sanity checks */
+	bvecs = (struct bio_vec *)iter->bvec;
+	for (i = 0; i < nr_segs; i++) {
+		bv = &bvecs[i];
+		/*
+		 * If the queue doesn't support SG gaps and adding this
+		 * offset would create a gap, disallow it.
+		 */
+		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset))
+			goto out_err;
+
+		/* check full condition */
+		if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
+			goto out_err;
+
+		if (bytes + bv->bv_len <= nr_iter &&
+		    bv->bv_offset + bv->bv_len <= PAGE_SIZE) {
+			nsegs++;
+			bytes += bv->bv_len;
+		} else
+			goto out_err;
+		bvprvp = bv;
+	}
+	return 0;
+out_err:
+	bio_map_put(bio);
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(blk_rq_map_user_bvec);
+
 /**
  * blk_rq_unmap_user - unmap a request with user data
  * @bio: start of bio list
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b43c81d91892..83bef362f0f9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -970,6 +970,7 @@ struct rq_map_data {
 	bool from_user;
 };
 
+int blk_rq_map_user_bvec(struct request *rq, struct iov_iter *iter);
 int blk_rq_map_user(struct request_queue *, struct request *,
 		struct rq_map_data *, void __user *, unsigned long, gfp_t);
 int blk_rq_map_user_iov(struct request_queue *, struct request *,