Message ID: 20241218022626.3668119-1-mcgrof@kernel.org
Series: fs/buffer: strack reduction on async read

On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> This splits up a minor enhancement from the bs > ps device support
> series into its own series for better review / focus / testing.
> This series just addresses the reducing the array size used and cleaning
> up the async read to be easier to read and maintain.

How about this approach instead -- get rid of the batch entirely?

diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..f50ebbc1f518 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 {
 	struct inode *inode = folio->mapping->host;
 	sector_t iblock, lblock;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head;
 	size_t blocksize;
-	int nr, i;
+	int i, submitted = 0;
 	int fully_mapped = 1;
 	bool page_error = false;
 	loff_t limit = i_size_read(inode);
@@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	iblock = div_u64(folio_pos(folio), blocksize);
 	lblock = div_u64(limit + blocksize - 1, blocksize);
 	bh = head;
-	nr = 0;
 	i = 0;
 
 	do {
@@ -2411,40 +2410,30 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 			if (buffer_uptodate(bh))
 				continue;
 		}
-		arr[nr++] = bh;
+
+		lock_buffer(bh);
+		if (buffer_uptodate(bh)) {
+			unlock_buffer(bh);
+			continue;
+		}
+
+		mark_buffer_async_read(bh);
+		submit_bh(REQ_OP_READ, bh);
+		submitted++;
 	} while (i++, iblock++, (bh = bh->b_this_page) != head);
 
 	if (fully_mapped)
 		folio_set_mappedtodisk(folio);
 
-	if (!nr) {
-		/*
-		 * All buffers are uptodate or get_block() returned an
-		 * error when trying to map them - we can finish the read.
-		 */
-		folio_end_read(folio, !page_error);
-		return 0;
-	}
-
-	/* Stage two: lock the buffers */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		lock_buffer(bh);
-		mark_buffer_async_read(bh);
-	}
-
 	/*
-	 * Stage 3: start the IO. Check for uptodateness
-	 * inside the buffer lock in case another process reading
-	 * the underlying blockdev brought it uptodate (the sct fix).
+	 * All buffers are uptodate or get_block() returned an error
+	 * when trying to map them - we must finish the read because
+	 * end_buffer_async_read() will never be called on any buffer
+	 * in this folio.
 	 */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		if (buffer_uptodate(bh))
-			end_buffer_async_read(bh, 1);
-		else
-			submit_bh(REQ_OP_READ, bh);
-	}
+	if (!submitted)
+		folio_end_read(folio, !page_error);
+
 	return 0;
 }
 EXPORT_SYMBOL(block_read_full_folio);

On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > This splits up a minor enhancement from the bs > ps device support
> > series into its own series for better review / focus / testing.
> > This series just addresses the reducing the array size used and cleaning
> > up the async read to be easier to read and maintain.
>
> How about this approach instead -- get rid of the batch entirely?

Less is more! I wish it worked, but we end up with a null pointer on
ext4/032 (and indeed this is the test that helped me find most bugs in
what I was working on):

[ 105.942462] loop0: detected capacity change from 0 to 1342177280
[ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 106.036903] #PF: supervisor read access in kernel mode
[ 106.038366] #PF: error_code(0x0000) - not-present page
[ 106.039819] PGD 0 P4D 0
[ 106.040574] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 106.041967] CPU: 2 UID: 0 PID: 29 Comm: ksoftirqd/2 Not tainted 6.13.0-rc3+ #42
[ 106.044018] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.08-1 09/18/2024
[ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
[ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
[ 106.053016] RSP: 0018:ffffa85880137dd0 EFLAGS: 00010246
[ 106.054499] RAX: 0000000000000000 RBX: ffff95e38f22e5b0 RCX: ffff95e39c8753e0
[ 106.056507] RDX: ffff95e3809f8000 RSI: 0000000000000001 RDI: ffff95e38f22e5b0
[ 106.058530] RBP: 0000000000000400 R08: ffff95e3a326b040 R09: 0000000000000001
[ 106.060546] R10: ffffffffbb6070c0 R11: 00000000002dc6c0 R12: 0000000000000000
[ 106.062426] R13: ffff95e3960ea800 R14: ffff95e39586ae40 R15: 0000000000000400
[ 106.064223] FS:  0000000000000000(0000) GS:ffff95e3fbc80000(0000) knlGS:0000000000000000
[ 106.066155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.067473] CR2: 0000000000000000 CR3: 00000001226e2006 CR4: 0000000000772ef0
[ 106.069085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.070571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 106.072050] PKRU: 55555554
[ 106.072632] Call Trace:
[ 106.073176]  <TASK>
[ 106.073611]  ? __die_body.cold+0x19/0x26
[ 106.074383]  ? page_fault_oops+0xa2/0x230
[ 106.075155]  ? __smp_call_single_queue+0xa7/0x110
[ 106.076077]  ? do_user_addr_fault+0x63/0x640
[ 106.076916]  ? exc_page_fault+0x7a/0x190
[ 106.077639]  ? asm_exc_page_fault+0x22/0x30
[ 106.078394]  ? end_buffer_async_read_io+0x11/0x90
[ 106.079245]  end_bio_bh_io_sync+0x23/0x40
[ 106.079973]  blk_update_request+0x178/0x420
[ 106.080727]  ? finish_task_switch.isra.0+0x94/0x290
[ 106.081600]  blk_mq_end_request+0x18/0x30
[ 106.082281]  blk_complete_reqs+0x3d/0x50
[ 106.082954]  handle_softirqs+0xf9/0x2c0
[ 106.083607]  ? __pfx_smpboot_thread_fn+0x10/0x10
[ 106.084393]  run_ksoftirqd+0x37/0x50
[ 106.085012]  smpboot_thread_fn+0x184/0x220
[ 106.085688]  kthread+0xda/0x110
[ 106.086208]  ? __pfx_kthread+0x10/0x10
[ 106.086824]  ret_from_fork+0x2d/0x50
[ 106.087409]  ? __pfx_kthread+0x10/0x10
[ 106.088013]  ret_from_fork_asm+0x1a/0x30
[ 106.088658]  </TASK>
[ 106.089045] Modules linked in: loop sunrpc 9p nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd virtio_balloon 9pnet_virtio virtio_console joydev button evdev serio_raw nvme_fabrics nvme_core dm_mod drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover virtio_blk failover crc32_pclmul crc32c_intel psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
[ 106.097895] CR2: 0000000000000000
[ 106.098326] ---[ end trace 0000000000000000 ]---

  Luis

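(For reference, ext4/032 here is an fstests test case. A minimal sketch of
running just that test from an xfstests-dev checkout follows; the FSTYP,
devices and mount points are illustrative placeholders, not values taken
from this thread.)

  # sketch only: local.config (or environment) pointing at disposable
  # test/scratch devices for fstests
  export FSTYP=ext4
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV=/dev/vdc
  export SCRATCH_MNT=/mnt/scratch

  # then, from the xfstests-dev checkout, run only this test
  ./check ext4/032
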
On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > This splits up a minor enhancement from the bs > ps device support
> > > series into its own series for better review / focus / testing.
> > > This series just addresses the reducing the array size used and cleaning
> > > up the async read to be easier to read and maintain.
> >
> > How about this approach instead -- get rid of the batch entirely?
>
> Less is more! I wish it worked, but we end up with a null pointer on
> ext4/032 (and indeed this is the test that helped me find most bugs in
> what I was working on):

Yeah, I did no testing; just wanted to give people a different approach
to consider.

> [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09

That decodes as:

   5:	53			push   %rbx
   6:	48 8b 47 10		mov    0x10(%rdi),%rax
   a:	48 89 fb		mov    %rdi,%rbx
   d:	48 8b 40 18		mov    0x18(%rax),%rax
  11:*	48 8b 00		mov    (%rax),%rax		<-- trapping instruction
  14:	f6 40 0d 40		testb  $0x40,0xd(%rax)

 6: bh->b_folio
 d: b_folio->mapping
11: mapping->host

So folio->mapping is NULL.

Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
test to decide if all buffers on the page are uptodate or not.  So both
having no batch (ie this patch) and having a batch which is smaller than
the number of buffers in the folio can lead to folio_end_read() being
called prematurely (ie we'll unlock the folio before finishing reading
every buffer in the folio).

Once the folio is unlocked, it can be truncated.  That's a second-order
problem, but it's the one your test happened to hit.

This should fix the problem; we always have at least one BH held in
the submission path with the async_read flag set, so
end_buffer_async_read() will not end it prematurely.

By the way, do you have CONFIG_VM_DEBUG enabled in your testing?
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
in folio_end_read() should have tripped before hitting the race with
truncate.

diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..fd2633e4a5d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 {
 	struct inode *inode = folio->mapping->host;
 	sector_t iblock, lblock;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head, *prev = NULL;
 	size_t blocksize;
-	int nr, i;
+	int i;
 	int fully_mapped = 1;
 	bool page_error = false;
 	loff_t limit = i_size_read(inode);
@@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	iblock = div_u64(folio_pos(folio), blocksize);
 	lblock = div_u64(limit + blocksize - 1, blocksize);
 	bh = head;
-	nr = 0;
 	i = 0;
 
 	do {
@@ -2411,40 +2410,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 			if (buffer_uptodate(bh))
 				continue;
 		}
-		arr[nr++] = bh;
+
+		lock_buffer(bh);
+		if (buffer_uptodate(bh)) {
+			unlock_buffer(bh);
+			continue;
+		}
+
+		mark_buffer_async_read(bh);
+		if (prev)
+			submit_bh(REQ_OP_READ, prev);
+		prev = bh;
 	} while (i++, iblock++, (bh = bh->b_this_page) != head);
 
 	if (fully_mapped)
 		folio_set_mappedtodisk(folio);
 
-	if (!nr) {
-		/*
-		 * All buffers are uptodate or get_block() returned an
-		 * error when trying to map them - we can finish the read.
-		 */
-		folio_end_read(folio, !page_error);
-		return 0;
-	}
-
-	/* Stage two: lock the buffers */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		lock_buffer(bh);
-		mark_buffer_async_read(bh);
-	}
-
 	/*
-	 * Stage 3: start the IO. Check for uptodateness
-	 * inside the buffer lock in case another process reading
-	 * the underlying blockdev brought it uptodate (the sct fix).
+	 * All buffers are uptodate or get_block() returned an error
+	 * when trying to map them - we must finish the read because
+	 * end_buffer_async_read() will never be called on any buffer
+	 * in this folio.
 	 */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		if (buffer_uptodate(bh))
-			end_buffer_async_read(bh, 1);
-		else
-			submit_bh(REQ_OP_READ, bh);
-	}
+	if (prev)
+		submit_bh(REQ_OP_READ, prev);
+	else
+		folio_end_read(folio, !page_error);
+
 	return 0;
 }
 EXPORT_SYMBOL(block_read_full_folio);

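(As an aside, the "That decodes as:" disassembly in the reply above is the
kind of output the kernel's scripts/decodecode helper produces; a minimal
sketch of feeding it the Code: line from the oops, assuming a kernel source
tree is at hand, the bytes are elided here for brevity:)

  # paste the oops, or just its Code: line, into scripts/decodecode
  echo 'Code: f2 ff 0f 1f 80 00 ... <48> 8b 00 f6 40 0d 40 74 0d ...' | \
          ./scripts/decodecode
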
What is this strack that gets reduced here?
On Thu, Dec 19, 2024 at 07:28:27AM +0100, Christoph Hellwig wrote:
> What is this strack that gets reduced here?
s/strack/stack
I seriously need a spell checker as part of my pipeline.
Luis
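(As an aside, one way to wire a spell check into a patch pipeline is to let
checkpatch use codespell's dictionary; a minimal sketch, assuming codespell
is installed and the patch to check is the top commit of the tree:)

  # check the top commit's patch, including its changelog text, for typos
  git format-patch -1 --stdout | ./scripts/checkpatch.pl --codespell -
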
On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> So folio->mapping is NULL.
>
> Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> test to decide if all buffers on the page are uptodate or not.  So both
> having no batch (ie this patch) and having a batch which is smaller than
> the number of buffers in the folio can lead to folio_end_read() being
> called prematurely (ie we'll unlock the folio before finishing reading
> every buffer in the folio).
>
> Once the folio is unlocked, it can be truncated.  That's a second-order
> problem, but it's the one your test happened to hit.
>
> This should fix the problem; we always have at least one BH held in
> the submission path with the async_read flag set, so
> end_buffer_async_read() will not end it prematurely.

Oh neat, yes.

> By the way, do you have CONFIG_VM_DEBUG enabled in your testing?

You mean DEBUG_VM? Yes:

grep DEBUG_VM .config
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_MAPLE_TREE is not set
# CONFIG_DEBUG_VM_RB is not set
CONFIG_DEBUG_VM_PGFLAGS=y
# CONFIG_DEBUG_VM_PGTABLE is not set

> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> in folio_end_read() should have tripped before hitting the race with
> truncate.

Odd that it did not. I had run into that folio_test_locked() splat in my
own attempts to simplify this without your trick, only to then hit the
similar truncate race; your resolution to this is nice.

> diff --git a/fs/buffer.c b/fs/buffer.c

This is a nice resolution and simplification, thanks. I've tested it and
it passes without regressions on ext4. I'll take this into the series as
an alternative.

  Luis

On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > This splits up a minor enhancement from the bs > ps device support
> > > > series into its own series for better review / focus / testing.
> > > > This series just addresses the reducing the array size used and cleaning
> > > > up the async read to be easier to read and maintain.
> > >
> > > How about this approach instead -- get rid of the batch entirely?
> >
> > Less is more! I wish it worked, but we end up with a null pointer on
> > ext4/032 (and indeed this is the test that helped me find most bugs in
> > what I was working on):
>
> Yeah, I did no testing; just wanted to give people a different approach
> to consider.
>
> > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
>
> That decodes as:
>
>    5:	53			push   %rbx
>    6:	48 8b 47 10		mov    0x10(%rdi),%rax
>    a:	48 89 fb		mov    %rdi,%rbx
>    d:	48 8b 40 18		mov    0x18(%rax),%rax
>   11:*	48 8b 00		mov    (%rax),%rax		<-- trapping instruction
>   14:	f6 40 0d 40		testb  $0x40,0xd(%rax)
>
>  6: bh->b_folio
>  d: b_folio->mapping
> 11: mapping->host
>
> So folio->mapping is NULL.
>
> Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> test to decide if all buffers on the page are uptodate or not.  So both
> having no batch (ie this patch) and having a batch which is smaller than
> the number of buffers in the folio can lead to folio_end_read() being
> called prematurely (ie we'll unlock the folio before finishing reading
> every buffer in the folio).

But:

a) all batched buffers are locked in the old code, we only unlock
the currently evaluated buffer, the buffers from our pivot are locked
and should also have the async flag set. That fact that buffers ahead
should have the async flag set should prevent from calling
folio_end_read() prematurely as I read the code, no?

b) In the case where we're evaluating the last buffer, we can unlock and
call folio_end_read(), but that seems fine given the previous batch work
was in charge of finding candidate buffers which need a read, so in this
case there should be no pending read.

So I don't see how we can yet call folio_end_read() prematurely. We do
however unlock the buffer in end_buffer_async_read(), but in case of an
inconsistency we simply bail on the loop, and since we only called
end_buffer_async_read() when the buffer was up to date, the only
inconsistency we check for is if (!buffer_uptodate(tmp)), in which case
folio_end_read() will be called but without success being annotated.

> Once the folio is unlocked, it can be truncated.  That's a second-order
> problem, but it's the one your test happened to hit.
>
> This should fix the problem; we always have at least one BH held in
> the submission path with the async_read flag set, so
> end_buffer_async_read() will not end it prematurely.

But this alternative does not call end_buffer_async_read(); in fact we
only call folio_end_read() when no pending reads are needed.

> diff --git a/fs/buffer.c b/fs/buffer.c
> index cc8452f60251..fd2633e4a5d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  {
>  	struct inode *inode = folio->mapping->host;
>  	sector_t iblock, lblock;
> -	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> +	struct buffer_head *bh, *head, *prev = NULL;
>  	size_t blocksize;
> -	int nr, i;
> +	int i;
>  	int fully_mapped = 1;
>  	bool page_error = false;
>  	loff_t limit = i_size_read(inode);
> @@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  	iblock = div_u64(folio_pos(folio), blocksize);
>  	lblock = div_u64(limit + blocksize - 1, blocksize);
>  	bh = head;
> -	nr = 0;
>  	i = 0;
>
>  	do {
> @@ -2411,40 +2410,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  			if (buffer_uptodate(bh))
>  				continue;
>  		}
> -		arr[nr++] = bh;
> +
> +		lock_buffer(bh);
> +		if (buffer_uptodate(bh)) {
> +			unlock_buffer(bh);
> +			continue;
> +		}
> +
> +		mark_buffer_async_read(bh);
> +		if (prev)
> +			submit_bh(REQ_OP_READ, prev);
> +		prev = bh;
>  	} while (i++, iblock++, (bh = bh->b_this_page) != head);
>
>  	if (fully_mapped)
>  		folio_set_mappedtodisk(folio);
>
> -	if (!nr) {
> -		/*
> -		 * All buffers are uptodate or get_block() returned an
> -		 * error when trying to map them - we can finish the read.
> -		 */
> -		folio_end_read(folio, !page_error);
> -		return 0;
> -	}
> -
> -	/* Stage two: lock the buffers */
> -	for (i = 0; i < nr; i++) {
> -		bh = arr[i];
> -		lock_buffer(bh);
> -		mark_buffer_async_read(bh);
> -	}
> -
>  	/*
> -	 * Stage 3: start the IO. Check for uptodateness
> -	 * inside the buffer lock in case another process reading
> -	 * the underlying blockdev brought it uptodate (the sct fix).
> +	 * All buffers are uptodate or get_block() returned an error
> +	 * when trying to map them - we must finish the read because
> +	 * end_buffer_async_read() will never be called on any buffer
> +	 * in this folio.

Do we want to keep mentioning end_buffer_async_read() here?

>  	 */
> -	for (i = 0; i < nr; i++) {
> -		bh = arr[i];
> -		if (buffer_uptodate(bh))
> -			end_buffer_async_read(bh, 1);
> -		else
> -			submit_bh(REQ_OP_READ, bh);
> -	}
> +	if (prev)
> +		submit_bh(REQ_OP_READ, prev);
> +	else
> +		folio_end_read(folio, !page_error);
> +
>  	return 0;

Because we only call folio_end_read() in the above code when we have
determined there is no pending read IO. In the case where we had to issue
at least one read for a buffer, we never call folio_end_read(). We didn't
before either, unless we ran into a race where a pending batched read
coincided with a read being issued and updating our buffer by chance, and
we determined we either completed that read fine or with an error.

The reason I'm asking these things is that I'm trying to determine whether
there was an issue before that we're trying to fix, other than the
simplification with the new un-batched strategy. I don't see it yet. If
we're fixing something here, it is still a bit obscure to me and I'd like
to make sure we document it properly.

  Luis

On Fri, Jan 31, 2025 at 08:54:31AM -0800, Luis Chamberlain wrote:
> On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> > On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > > This splits up a minor enhancement from the bs > ps device support
> > > > > series into its own series for better review / focus / testing.
> > > > > This series just addresses the reducing the array size used and cleaning
> > > > > up the async read to be easier to read and maintain.
> > > >
> > > > How about this approach instead -- get rid of the batch entirely?
> > >
> > > Less is more! I wish it worked, but we end up with a null pointer on
> > > ext4/032 (and indeed this is the test that helped me find most bugs in
> > > what I was working on):
> >
> > Yeah, I did no testing; just wanted to give people a different approach
> > to consider.
> >
> > > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
> >
> > That decodes as:
> >
> >    5:	53			push   %rbx
> >    6:	48 8b 47 10		mov    0x10(%rdi),%rax
> >    a:	48 89 fb		mov    %rdi,%rbx
> >    d:	48 8b 40 18		mov    0x18(%rax),%rax
> >   11:*	48 8b 00		mov    (%rax),%rax		<-- trapping instruction
> >   14:	f6 40 0d 40		testb  $0x40,0xd(%rax)
> >
> >  6: bh->b_folio
> >  d: b_folio->mapping
> > 11: mapping->host
> >
> > So folio->mapping is NULL.
> >
> > Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> > test to decide if all buffers on the page are uptodate or not.  So both
> > having no batch (ie this patch) and having a batch which is smaller than
> > the number of buffers in the folio can lead to folio_end_read() being
> > called prematurely (ie we'll unlock the folio before finishing reading
> > every buffer in the folio).
>
> But:
>
> a) all batched buffers are locked in the old code, we only unlock
> the currently evaluated buffer, the buffers from our pivot are locked
> and should also have the async flag set. That fact that buffers ahead
> should have the async flag set should prevent from calling
> folio_end_read() prematurely as I read the code, no?

I'm sure you know what you mean by "the old code", but I don't.

If you mean "the code in 6.13", here's what it does:

	tmp = bh;
	do {
		if (!buffer_uptodate(tmp))
			folio_uptodate = 0;
		if (buffer_async_read(tmp)) {
			BUG_ON(!buffer_locked(tmp));
			goto still_busy;
		}
		tmp = tmp->b_this_page;
	} while (tmp != bh);
	folio_end_read(folio, folio_uptodate);

so it's going to cycle around every buffer on the page, and if it finds
none which are marked async_read, it'll call folio_end_read().
That's fine in 6.13 because in stage 2, all buffers which are part of
this folio are marked as async_read.

On Fri, Jan 31, 2025 at 10:01:46PM +0000, Matthew Wilcox wrote:
> On Fri, Jan 31, 2025 at 08:54:31AM -0800, Luis Chamberlain wrote:
> > On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> > > On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > > > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > > > This splits up a minor enhancement from the bs > ps device support
> > > > > > series into its own series for better review / focus / testing.
> > > > > > This series just addresses the reducing the array size used and cleaning
> > > > > > up the async read to be easier to read and maintain.
> > > > >
> > > > > How about this approach instead -- get rid of the batch entirely?
> > > >
> > > > Less is more! I wish it worked, but we end up with a null pointer on
> > > > ext4/032 (and indeed this is the test that helped me find most bugs in
> > > > what I was working on):
> > >
> > > Yeah, I did no testing; just wanted to give people a different approach
> > > to consider.
> > >
> > > > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > > > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
> > >
> > > That decodes as:
> > >
> > >    5:	53			push   %rbx
> > >    6:	48 8b 47 10		mov    0x10(%rdi),%rax
> > >    a:	48 89 fb		mov    %rdi,%rbx
> > >    d:	48 8b 40 18		mov    0x18(%rax),%rax
> > >   11:*	48 8b 00		mov    (%rax),%rax		<-- trapping instruction
> > >   14:	f6 40 0d 40		testb  $0x40,0xd(%rax)
> > >
> > >  6: bh->b_folio
> > >  d: b_folio->mapping
> > > 11: mapping->host
> > >
> > > So folio->mapping is NULL.
> > >
> > > Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> > > test to decide if all buffers on the page are uptodate or not.  So both
> > > having no batch (ie this patch) and having a batch which is smaller than
> > > the number of buffers in the folio can lead to folio_end_read() being
> > > called prematurely (ie we'll unlock the folio before finishing reading
> > > every buffer in the folio).
> >
> > But:
> >
> > a) all batched buffers are locked in the old code, we only unlock
> > the currently evaluated buffer, the buffers from our pivot are locked
> > and should also have the async flag set. That fact that buffers ahead
> > should have the async flag set should prevent from calling
> > folio_end_read() prematurely as I read the code, no?
>
> I'm sure you know what you mean by "the old code", but I don't.
>
> If you mean "the code in 6.13", here's what it does:

Yes that is what I meant, sorry.

>
> 	tmp = bh;
> 	do {
> 		if (!buffer_uptodate(tmp))
> 			folio_uptodate = 0;
> 		if (buffer_async_read(tmp)) {
> 			BUG_ON(!buffer_locked(tmp));
> 			goto still_busy;
> 		}
> 		tmp = tmp->b_this_page;
> 	} while (tmp != bh);
> 	folio_end_read(folio, folio_uptodate);
>
> so it's going to cycle around every buffer on the page, and if it finds
> none which are marked async_read, it'll call folio_end_read().
> That's fine in 6.13 because in stage 2, all buffers which are part of
> this folio are marked as async_read.
Indeed. And it's not just one walk over every buffer on the page: since
end_buffer_async_read() can be called on every buffer in the page, in the
worst case we end up running that loop for every buffer, and on each run
the loop above starts from the pivot buffer and goes to the end of the
page.

> In your patch, you mark every buffer _in the batch_ as async_read
> and then submit the entire batch.  So if they all complete before you
> mark the next bh as being uptodate, it'll think the read is complete
> and call folio_end_read().

Ah yes, thanks, this clarifies what you meant!

  Luis