
block: streamline meta bounce buffer handling

Message ID: 20240506051047.4291-1-joshi.k@samsung.com
State New, archived
Series: block: streamline meta bounce buffer handling

Commit Message

Kanchan Joshi May 6, 2024, 5:10 a.m. UTC
Currently bio_integrity_copy_user() uses the bip_vec array to store two
different things: (a) the kernel-allocated bounce buffer and (b) a bunch
of extra bvecs to pinned user memory.
This leads to unwieldy handling in places.

This patch cleans that up and uses bip_vec for just the kernel-allocated
meta buffer. The bio_vecs (to pinned user memory) and their count are now
stored separately in the bip. The existing memory of the bip_work field
is reused so that the bip does not grow unconditionally.
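
Schematically, the payload ends up like this (a condensed view of the
include/linux/bio.h hunk at the end of this patch; only the relevant
fields are shown):

struct bio_integrity_payload {
	/* ... */
	union {
		/* block-layer generated integrity: completion work item */
		struct work_struct	bip_work;
		/* user mapped integrity: reuse the same space */
		struct {
			struct bio_vec	*copy_vec; /* pinned user memory */
			unsigned short	copy_vcnt; /* # of bio_vecs above */
		};
	};
	struct bio_vec		*bip_vec;  /* now only the bounce buffer */
	/* ... */
};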

Based on an earlier patch from Keith Busch.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 block/bio-integrity.c | 28 ++++++++++++++++++----------
 include/linux/bio.h   | 12 +++++++++++-
 2 files changed, 29 insertions(+), 11 deletions(-)

Comments

Christoph Hellwig May 6, 2024, 6:05 a.m. UTC | #1
Can we take a step back first?

Currently the blk-map user buffer handling decides to either pin
the memory and use it directly, or to go through the normal user copy
helpers copy_page_to_iter/copy_page_from_iter.

Why do we even pin the memory here to then do an in-kernel copy instead
of using copy_from/to_user, which is going to be a lot more efficient?
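
For reference, the two paths boil down to something like this (a minimal
sketch with a made-up function name, not the actual blk-map code):

#include <linux/mm.h>
#include <linux/uio.h>

/* sketch only: either pin the user page and use it in place, or
 * bounce through a kernel page via the iov_iter copy helpers */
static int user_buf_setup_sketch(unsigned long uaddr, struct page *bounce,
				 size_t len, struct iov_iter *iter, bool pin)
{
	if (pin) {
		struct page *pg;

		/* (a) pin: the bio references 'pg' directly; in reality
		 * the page stays pinned until completion */
		if (pin_user_pages_fast(uaddr, 1, FOLL_WRITE, &pg) != 1)
			return -EFAULT;
		unpin_user_page(pg);
		return 0;
	}
	/* (b) bounce: for a write, pull the user data into 'bounce' */
	if (copy_page_from_iter(bounce, 0, len, iter) != len)
		return -EFAULT;
	return 0;
}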

Sort of related to that: this drives the copy to user and the
unpin from bio_integrity_free, which is a low-level routine.  It really
should be driven from the high-level blk-map code that is the I/O
submitter, just like on the data side.  Shoe-horning uaccess into the
low-level block layer plumbing is just going to get us into trouble.
Kanchan Joshi May 6, 2024, 12:46 p.m. UTC | #2
On 5/6/2024 11:35 AM, Christoph Hellwig wrote:
> Can we take a step back first?
> 
> Currently the blk-map user buffer handling decides to either pin
> the memory and use it directly, or to go through the normal user copy
> helpers copy_page_to_iter/copy_page_from_iter.
> 
> Why do we even pin the memory here to then do an in-kernel copy instead
> of using copy_from/to_user, which is going to be a lot more efficient?

There are good reasons for that. Keeping the user meta memory pinned
allows us to update it nicely during completion. Otherwise it required
using task-work and lots of bookkeeping in the driver (nvme), and all of
that was pretty convoluted.
Please see this series (you have reviewed it in the past):
https://lore.kernel.org/linux-block/20231130215309.2923568-1-kbusch@meta.com/#r

Nvme and io_uring (patches 2, 3, 4) get nice wins from keeping the
user memory pinned even in the bounce-buffer case.
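
The completion-side copy-back onto that pinned memory then boils down to
roughly this (a sketch mirroring bio_integrity_uncopy_user() in the
patch below, using the copy_vec/copy_vcnt fields it adds):

static void uncopy_sketch(struct bio_integrity_payload *bip, void *buf,
			  size_t bytes)
{
	struct iov_iter iter;

	/* walk the pinned user bvecs as the destination ... */
	iov_iter_bvec(&iter, ITER_DEST, bip->copy_vec, bip->copy_vcnt, bytes);
	/* ... copy the kernel bounce buffer back into them ... */
	WARN_ON_ONCE(copy_to_iter(buf, bytes, &iter) != bytes);
	/* ... then drop the pin and the saved bvec array */
	bio_integrity_unpin_bvec(bip->copy_vec, bip->copy_vcnt, true);
	kfree(bip->copy_vec);
}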

> Sort of related to that: this drives the copy to user and the
> unpin from bio_integrity_free, which is a low-level routine.  It really
> should be driven from the high-level blk-map code that is the I/O
> submitter, just like on the data side.  Shoe-horning uaccess into the
> low-level block layer plumbing is just going to get us into trouble.
> 

Not sure I follow, but citing this nvme patch again:
https://lore.kernel.org/linux-block/20231130215309.2923568-3-kbusch@meta.com/
The driver does not need to know whether the meta was handled by pinning
or by using a bounce buffer. Everything is handled centrally in
block/bio-integrity.c.
Christoph Hellwig May 6, 2024, 2:56 p.m. UTC | #3
On Mon, May 06, 2024 at 06:16:45PM +0530, Kanchan Joshi wrote:
> Nvme and io_uring (patches 2, 3, 4) get nice wins from keeping the
> user memory pinned even in the bounce-buffer case.

In that case the data path should be doing the same.

> > Sort of related to that: this drives the copy to user and the
> > unpin from bio_integrity_free, which is a low-level routine.  It really
> > should be driven from the high-level blk-map code that is the I/O
> > submitter, just like on the data side.  Shoe-horning uaccess into the
> > low-level block layer plumbing is just going to get us into trouble.
> > 
> 
> Not sure I follow, but citing this nvme patch again:
> https://lore.kernel.org/linux-block/20231130215309.2923568-3-kbusch@meta.com/
> The driver does not need to know whether the meta was handled by pinning
> or by using a bounce buffer. Everything is handled centrally in
> block/bio-integrity.c.

But the low-level bio code does, and it absolutely should not.  And
while I should have caught this earlier, we really need to stop undoing
all the sanity we got by clearly splitting the submitter side from
the consumer side of the bio, as that will lead us straight back
into the mess we had before.
Keith Busch May 6, 2024, 8:36 p.m. UTC | #4
On Mon, May 06, 2024 at 08:05:09AM +0200, Christoph Hellwig wrote:
> Can we take a step back first?
> 
> Currently the blk-map user buffer handling decides to either pin
> the memory and use it directly, or to go through the normal user copy
> helpers copy_page_to_iter/copy_page_from_iter.
> 
> Why do we even pin the memory here to then do an in-kernel copy instead
> of using copy_from/to_user, which is going to be a lot more efficient?

Unlike blk-map, the integrity user buffer will fall back to a copy if the
ubuf has too many segments, whereas blk_rq_map_user() fails with EINVAL.

For user integrity, we have to pin the buffer anyway to get the true
segment count and check it against the queue limits, so the copy to/from
takes advantage of that needed pin.

That EINVAL has been the source of a lot of "bugs" where we have to
explain why huge pages are necessary for largish (>512k) nvme passthrough
transfers. It might be a nice feature if blk_rq_map_user() behaved like
blk_integrity_map_user() for that condition.
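
Schematically, the decision after pinning looks something like this (a
hedged sketch; the direct-mapping helper name here is made up):

/* sketch: pin first to learn the real segment count, then fall back to
 * the bounce copy instead of failing the way blk_rq_map_user() does */
if (nr_bvecs > queue_max_integrity_segments(q))
	ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, len,
				      direction, seed);
else
	ret = map_pinned_bvecs_directly(bio, bvec, nr_bvecs, len); /* made up */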
 
> Sort of related to that: this drives the copy to user and the
> unpin from bio_integrity_free, which is a low-level routine.  It really
> should be driven from the high-level blk-map code that is the I/O
> submitter, just like on the data side.  Shoe-horning uaccess into the
> low-level block layer plumbing is just going to get us into trouble.

Okay, I think I see what you're saying. We can make the existing use
more like the blk-map code for callers using struct request. The
proposed io_uring generic read/write user metadata would need something
different, but this looks reasonable.
Christoph Hellwig May 7, 2024, 5:45 a.m. UTC | #5
On Mon, May 06, 2024 at 02:36:06PM -0600, Keith Busch wrote:
> Unlike blk-map, the integrity user buffer will fall back to a copy if the
> ubuf has too many segments, whereas blk_rq_map_user() fails with EINVAL.
> 
> For user integrity, we have to pin the buffer anyway to get the true
> segment count and check it against the queue limits, so the copy to/from
> takes advantage of that needed pin.

Can we document that somewhere please?

> That EINVAL has been the source of a lot of "bugs" where we have to
> explain why huge pages are necessary for largish (>512k) nvme passthrough
> transfers. It might be a nice feature if blk_rq_map_user() behaved like
> blk_integrity_map_user() for that condition.

Who wants to sign up for it?  If we also clean up the mess in sg/st
with their own allocated pages, that would have the potential to
significantly simplify this code.

> > Sort of related to that: this drives the copy to user and the
> > unpin from bio_integrity_free, which is a low-level routine.  It really
> > should be driven from the high-level blk-map code that is the I/O
> > submitter, just like on the data side.  Shoe-horning uaccess into the
> > low-level block layer plumbing is just going to get us into trouble.
> 
> Okay, I think I see what you're saying. We can make the existing use
> more like the blk-map code for callers using struct request. The
> proposed io_uring generic read/write user metadata would need something
> different, but this looks reasonable.

The important point is that the unpin and copy back should be driven by
the submitter side, no matter whether it is bio or request based.
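
On the data side that submitter-driven model already looks like this;
the integrity side would follow the same shape (hedged sketch;
bio_integrity_unmap_user() is the patch's currently low-level helper,
called here from the submitter instead):

/* sketch: completion handled by the I/O submitter, as on the data side */
blk_rq_unmap_user(bio);		/* data: copy back and unpin */
bio_integrity_unmap_user(bip);	/* integrity: same model, driven by the
				 * submitter rather than bio_integrity_free() */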
Kanchan Joshi May 24, 2024, 10:28 a.m. UTC | #6
On 5/6/2024 11:35 AM, Christoph Hellwig wrote:
> Can we take a step back first?

Now that the step back has been taken[*], can this patch be looked at?
It still applies cleanly.

[*] https://lore.kernel.org/linux-block/20240520154943.GA1327@lst.de/
Christoph Hellwig May 28, 2024, 7:59 a.m. UTC | #7
On Fri, May 24, 2024 at 03:58:45PM +0530, Kanchan Joshi wrote:
> On 5/6/2024 11:35 AM, Christoph Hellwig wrote:
> > Can we take a step back first?
> 
> Now that the step back has been taken[*], can this patch be looked at?
> It still applies cleanly.

Please try the same approach for the blk-map data path first;
that should keep them consistent and also give much better test
coverage.

Patch

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 2e3e8e04961e..47634b89d27c 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -105,8 +105,8 @@  static void bio_integrity_unpin_bvec(struct bio_vec *bv, int nr_vecs,
 
 static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
 {
-	unsigned short nr_vecs = bip->bip_max_vcnt - 1;
-	struct bio_vec *copy = &bip->bip_vec[1];
+	unsigned short nr_vecs = bip->copy_vcnt;
+	struct bio_vec *copy = bip->copy_vec;
 	size_t bytes = bip->bip_iter.bi_size;
 	struct iov_iter iter;
 	int ret;
@@ -116,6 +116,7 @@  static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
 	WARN_ON_ONCE(ret != bytes);
 
 	bio_integrity_unpin_bvec(copy, nr_vecs, true);
+	kfree(copy);
 }
 
 static void bio_integrity_unmap_user(struct bio_integrity_payload *bip)
@@ -208,6 +209,7 @@  static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
 				   unsigned int direction, u32 seed)
 {
 	bool write = direction == ITER_SOURCE;
+	struct bio_vec *copy_vec = NULL;
 	struct bio_integrity_payload *bip;
 	struct iov_iter iter;
 	void *buf;
@@ -224,7 +226,7 @@  static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
 			goto free_buf;
 		}
 
-		bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
+		bio_integrity_unpin_bvec(bvec, nr_vecs, false);
 	} else {
 		memset(buf, 0, len);
 
@@ -232,19 +234,21 @@  static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
 		 * We need to preserve the original bvec and the number of vecs
 		 * in it for completion handling
 		 */
-		bip = bio_integrity_alloc(bio, GFP_KERNEL, nr_vecs + 1);
+		copy_vec = kcalloc(nr_vecs, sizeof(*bvec), GFP_KERNEL);
+		if (!copy_vec) {
+			ret = -ENOMEM;
+			goto free_buf;
+		}
+		memcpy(copy_vec, bvec, nr_vecs * sizeof(*bvec));
+
 	}
 
+	bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
 	if (IS_ERR(bip)) {
 		ret = PTR_ERR(bip);
-		goto free_buf;
+		goto free_copy;
 	}
 
-	if (write)
-		bio_integrity_unpin_bvec(bvec, nr_vecs, false);
-	else
-		memcpy(&bip->bip_vec[1], bvec, nr_vecs * sizeof(*bvec));
-
 	ret = bio_integrity_add_page(bio, virt_to_page(buf), len,
 				     offset_in_page(buf));
 	if (ret != len) {
@@ -254,9 +258,13 @@  static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
 
 	bip->bip_flags |= BIP_INTEGRITY_USER | BIP_COPY_USER;
 	bip->bip_iter.bi_sector = seed;
+	bip->copy_vec = copy_vec;
+	bip->copy_vcnt = nr_vecs;
 	return 0;
 free_bip:
 	bio_integrity_free(bio);
+free_copy:
+	kfree(copy_vec);
 free_buf:
 	kfree(buf);
 	return ret;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9b8a369f44bc..ffbfa43d51af 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -345,7 +345,17 @@  struct bio_integrity_payload {
 
 	struct bvec_iter	bio_iter;	/* for rewinding parent bio */
 
-	struct work_struct	bip_work;	/* I/O completion */
+	union {
+		/*
+		 * bip_work is used only for block-layer sent integrity.
+		 * For user sent integrity, reuse the same memory.
+		 */
+		struct work_struct	bip_work;	/* I/O completion */
+		struct {
+			struct bio_vec	*copy_vec; /* pinned user memory */
+			unsigned short	copy_vcnt; /* # of bio_vecs for above */
+		};
+	};
 
 	struct bio_vec		*bip_vec;
 	struct bio_vec		bip_inline_vecs[];/* embedded bvec array */