Message ID: 20241218022626.3668119-1-mcgrof@kernel.org
Series: fs/buffer: strack reduction on async read

On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> This splits up a minor enhancement from the bs > ps device support
> series into its own series for better review / focus / testing.
> This series just addresses the reducing the array size used and cleaning
> up the async read to be easier to read and maintain.

How about this approach instead -- get rid of the batch entirely?

diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..f50ebbc1f518 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 {
 	struct inode *inode = folio->mapping->host;
 	sector_t iblock, lblock;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head;
 	size_t blocksize;
-	int nr, i;
+	int i, submitted = 0;
 	int fully_mapped = 1;
 	bool page_error = false;
 	loff_t limit = i_size_read(inode);
@@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	iblock = div_u64(folio_pos(folio), blocksize);
 	lblock = div_u64(limit + blocksize - 1, blocksize);
 	bh = head;
-	nr = 0;
 	i = 0;
 
 	do {
@@ -2411,40 +2410,30 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 			if (buffer_uptodate(bh))
 				continue;
 		}
-		arr[nr++] = bh;
+
+		lock_buffer(bh);
+		if (buffer_uptodate(bh)) {
+			unlock_buffer(bh);
+			continue;
+		}
+
+		mark_buffer_async_read(bh);
+		submit_bh(REQ_OP_READ, bh);
+		submitted++;
 	} while (i++, iblock++, (bh = bh->b_this_page) != head);
 
 	if (fully_mapped)
 		folio_set_mappedtodisk(folio);
 
-	if (!nr) {
-		/*
-		 * All buffers are uptodate or get_block() returned an
-		 * error when trying to map them - we can finish the read.
-		 */
-		folio_end_read(folio, !page_error);
-		return 0;
-	}
-
-	/* Stage two: lock the buffers */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		lock_buffer(bh);
-		mark_buffer_async_read(bh);
-	}
-
 	/*
-	 * Stage 3: start the IO. Check for uptodateness
-	 * inside the buffer lock in case another process reading
-	 * the underlying blockdev brought it uptodate (the sct fix).
+	 * All buffers are uptodate or get_block() returned an error
+	 * when trying to map them - we must finish the read because
+	 * end_buffer_async_read() will never be called on any buffer
+	 * in this folio.
 	 */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		if (buffer_uptodate(bh))
-			end_buffer_async_read(bh, 1);
-		else
-			submit_bh(REQ_OP_READ, bh);
-	}
+	if (!submitted)
+		folio_end_read(folio, !page_error);
+
 	return 0;
 }
 EXPORT_SYMBOL(block_read_full_folio);

On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > This splits up a minor enhancement from the bs > ps device support
> > series into its own series for better review / focus / testing.
> > This series just addresses the reducing the array size used and cleaning
> > up the async read to be easier to read and maintain.
>
> How about this approach instead -- get rid of the batch entirely?

Less is more! I wish it worked, but we end up with a null pointer on
ext4/032 (and indeed this is the test that helped me find most bugs in
what I was working on):

[ 105.942462] loop0: detected capacity change from 0 to 1342177280
[ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 106.036903] #PF: supervisor read access in kernel mode
[ 106.038366] #PF: error_code(0x0000) - not-present page
[ 106.039819] PGD 0 P4D 0
[ 106.040574] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 106.041967] CPU: 2 UID: 0 PID: 29 Comm: ksoftirqd/2 Not tainted 6.13.0-rc3+ #42
[ 106.044018] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.08-1 09/18/2024
[ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
[ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
[ 106.053016] RSP: 0018:ffffa85880137dd0 EFLAGS: 00010246
[ 106.054499] RAX: 0000000000000000 RBX: ffff95e38f22e5b0 RCX: ffff95e39c8753e0
[ 106.056507] RDX: ffff95e3809f8000 RSI: 0000000000000001 RDI: ffff95e38f22e5b0
[ 106.058530] RBP: 0000000000000400 R08: ffff95e3a326b040 R09: 0000000000000001
[ 106.060546] R10: ffffffffbb6070c0 R11: 00000000002dc6c0 R12: 0000000000000000
[ 106.062426] R13: ffff95e3960ea800 R14: ffff95e39586ae40 R15: 0000000000000400
[ 106.064223] FS:  0000000000000000(0000) GS:ffff95e3fbc80000(0000) knlGS:0000000000000000
[ 106.066155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.067473] CR2: 0000000000000000 CR3: 00000001226e2006 CR4: 0000000000772ef0
[ 106.069085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.070571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 106.072050] PKRU: 55555554
[ 106.072632] Call Trace:
[ 106.073176]  <TASK>
[ 106.073611]  ? __die_body.cold+0x19/0x26
[ 106.074383]  ? page_fault_oops+0xa2/0x230
[ 106.075155]  ? __smp_call_single_queue+0xa7/0x110
[ 106.076077]  ? do_user_addr_fault+0x63/0x640
[ 106.076916]  ? exc_page_fault+0x7a/0x190
[ 106.077639]  ? asm_exc_page_fault+0x22/0x30
[ 106.078394]  ? end_buffer_async_read_io+0x11/0x90
[ 106.079245]  end_bio_bh_io_sync+0x23/0x40
[ 106.079973]  blk_update_request+0x178/0x420
[ 106.080727]  ? finish_task_switch.isra.0+0x94/0x290
[ 106.081600]  blk_mq_end_request+0x18/0x30
[ 106.082281]  blk_complete_reqs+0x3d/0x50
[ 106.082954]  handle_softirqs+0xf9/0x2c0
[ 106.083607]  ? __pfx_smpboot_thread_fn+0x10/0x10
[ 106.084393]  run_ksoftirqd+0x37/0x50
[ 106.085012]  smpboot_thread_fn+0x184/0x220
[ 106.085688]  kthread+0xda/0x110
[ 106.086208]  ? __pfx_kthread+0x10/0x10
[ 106.086824]  ret_from_fork+0x2d/0x50
[ 106.087409]  ? __pfx_kthread+0x10/0x10
[ 106.088013]  ret_from_fork_asm+0x1a/0x30
[ 106.088658]  </TASK>
[ 106.089045] Modules linked in: loop sunrpc 9p nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd virtio_balloon 9pnet_virtio virtio_console joydev button evdev serio_raw nvme_fabrics nvme_core dm_mod drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover virtio_blk failover crc32_pclmul crc32c_intel psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
[ 106.097895] CR2: 0000000000000000
[ 106.098326] ---[ end trace 0000000000000000 ]---

  Luis

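(For reference, ext4/032 here is an fstests test case. A minimal sketch of
running just that test from an xfstests-dev checkout follows; the FSTYP,
devices and mount points are illustrative placeholders, not values taken
from this thread.)

  # sketch only: local.config (or environment) pointing at disposable
  # test/scratch devices for fstests
  export FSTYP=ext4
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV=/dev/vdc
  export SCRATCH_MNT=/mnt/scratch

  # then, from the xfstests-dev checkout, run only this test
  ./check ext4/032
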
On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > This splits up a minor enhancement from the bs > ps device support
> > > series into its own series for better review / focus / testing.
> > > This series just addresses the reducing the array size used and cleaning
> > > up the async read to be easier to read and maintain.
> >
> > How about this approach instead -- get rid of the batch entirely?
>
> Less is more! I wish it worked, but we end up with a null pointer on
> ext4/032 (and indeed this is the test that helped me find most bugs in
> what I was working on):

Yeah, I did no testing; just wanted to give people a different approach
to consider.

> [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09

That decodes as:

   5:	53			push   %rbx
   6:	48 8b 47 10		mov    0x10(%rdi),%rax
   a:	48 89 fb		mov    %rdi,%rbx
   d:	48 8b 40 18		mov    0x18(%rax),%rax
  11:*	48 8b 00		mov    (%rax),%rax		<-- trapping instruction
  14:	f6 40 0d 40		testb  $0x40,0xd(%rax)

 6: bh->b_folio
 d: b_folio->mapping
11: mapping->host

So folio->mapping is NULL.

Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
test to decide if all buffers on the page are uptodate or not.  So both
having no batch (ie this patch) and having a batch which is smaller than
the number of buffers in the folio can lead to folio_end_read() being
called prematurely (ie we'll unlock the folio before finishing reading
every buffer in the folio).

Once the folio is unlocked, it can be truncated.  That's a second-order
problem, but it's the one your test happened to hit.

This should fix the problem; we always have at least one BH held in
the submission path with the async_read flag set, so
end_buffer_async_read() will not end it prematurely.

By the way, do you have CONFIG_VM_DEBUG enabled in your testing?
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
in folio_end_read() should have tripped before hitting the race with
truncate.

diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..fd2633e4a5d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 {
 	struct inode *inode = folio->mapping->host;
 	sector_t iblock, lblock;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head, *prev = NULL;
 	size_t blocksize;
-	int nr, i;
+	int i;
 	int fully_mapped = 1;
 	bool page_error = false;
 	loff_t limit = i_size_read(inode);
@@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	iblock = div_u64(folio_pos(folio), blocksize);
 	lblock = div_u64(limit + blocksize - 1, blocksize);
 	bh = head;
-	nr = 0;
 	i = 0;
 
 	do {
@@ -2411,40 +2410,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 			if (buffer_uptodate(bh))
 				continue;
 		}
-		arr[nr++] = bh;
+
+		lock_buffer(bh);
+		if (buffer_uptodate(bh)) {
+			unlock_buffer(bh);
+			continue;
+		}
+
+		mark_buffer_async_read(bh);
+		if (prev)
+			submit_bh(REQ_OP_READ, prev);
+		prev = bh;
 	} while (i++, iblock++, (bh = bh->b_this_page) != head);
 
 	if (fully_mapped)
 		folio_set_mappedtodisk(folio);
 
-	if (!nr) {
-		/*
-		 * All buffers are uptodate or get_block() returned an
-		 * error when trying to map them - we can finish the read.
-		 */
-		folio_end_read(folio, !page_error);
-		return 0;
-	}
-
-	/* Stage two: lock the buffers */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		lock_buffer(bh);
-		mark_buffer_async_read(bh);
-	}
-
 	/*
-	 * Stage 3: start the IO. Check for uptodateness
-	 * inside the buffer lock in case another process reading
-	 * the underlying blockdev brought it uptodate (the sct fix).
+	 * All buffers are uptodate or get_block() returned an error
+	 * when trying to map them - we must finish the read because
+	 * end_buffer_async_read() will never be called on any buffer
+	 * in this folio.
 	 */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		if (buffer_uptodate(bh))
-			end_buffer_async_read(bh, 1);
-		else
-			submit_bh(REQ_OP_READ, bh);
-	}
+	if (prev)
+		submit_bh(REQ_OP_READ, prev);
+	else
+		folio_end_read(folio, !page_error);
+
 	return 0;
 }
 EXPORT_SYMBOL(block_read_full_folio);

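(As an aside, the "That decodes as:" disassembly in the reply above is the
kind of output the kernel's scripts/decodecode helper produces; a minimal
sketch of feeding it the Code: line from the oops, assuming a kernel source
tree is at hand, the bytes are elided here for brevity:)

  # paste the oops, or just its Code: line, into scripts/decodecode
  echo 'Code: f2 ff 0f 1f 80 00 ... <48> 8b 00 f6 40 0d 40 74 0d ...' | \
          ./scripts/decodecode
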
What is this strack that gets reduced here?
On Thu, Dec 19, 2024 at 07:28:27AM +0100, Christoph Hellwig wrote:
> What is this strack that gets reduced here?
s/strack/stack
I seriously need a spell checker as part of my pipeline.
Luis
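(As an aside, one way to wire a spell check into a patch pipeline is to let
checkpatch use codespell's dictionary; a minimal sketch, assuming codespell
is installed and the patch to check is the top commit of the tree:)

  # check the top commit's patch, including its changelog text, for typos
  git format-patch -1 --stdout | ./scripts/checkpatch.pl --codespell -
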
On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> So folio->mapping is NULL.
>
> Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> test to decide if all buffers on the page are uptodate or not.  So both
> having no batch (ie this patch) and having a batch which is smaller than
> the number of buffers in the folio can lead to folio_end_read() being
> called prematurely (ie we'll unlock the folio before finishing reading
> every buffer in the folio).
>
> Once the folio is unlocked, it can be truncated.  That's a second-order
> problem, but it's the one your test happened to hit.
>
> This should fix the problem; we always have at least one BH held in
> the submission path with the async_read flag set, so
> end_buffer_async_read() will not end it prematurely.

Oh neat, yes.

> By the way, do you have CONFIG_VM_DEBUG enabled in your testing?

You mean DEBUG_VM? Yes:

grep DEBUG_VM .config
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_MAPLE_TREE is not set
# CONFIG_DEBUG_VM_RB is not set
CONFIG_DEBUG_VM_PGFLAGS=y
# CONFIG_DEBUG_VM_PGTABLE is not set

> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> in folio_end_read() should have tripped before hitting the race with
> truncate.

Odd that it did not. I had run into that folio_test_locked() splat in my
own attempts to simplify this without your trick, only to then hit the
similar truncate race; your resolution to this is nice.

> diff --git a/fs/buffer.c b/fs/buffer.c

This is a nice resolution and simplification, thanks. I've tested it and
it passes without regressions on ext4. I'll take this into the series as
an alternative.

  Luis

On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > This splits up a minor enhancement from the bs > ps device support
> > > > series into its own series for better review / focus / testing.
> > > > This series just addresses the reducing the array size used and cleaning
> > > > up the async read to be easier to read and maintain.
> > >
> > > How about this approach instead -- get rid of the batch entirely?
> >
> > Less is more! I wish it worked, but we end up with a null pointer on
> > ext4/032 (and indeed this is the test that helped me find most bugs in
> > what I was working on):
>
> Yeah, I did no testing; just wanted to give people a different approach
> to consider.
>
> > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
>
> That decodes as:
>
>    5:	53			push   %rbx
>    6:	48 8b 47 10		mov    0x10(%rdi),%rax
>    a:	48 89 fb		mov    %rdi,%rbx
>    d:	48 8b 40 18		mov    0x18(%rax),%rax
>   11:*	48 8b 00		mov    (%rax),%rax		<-- trapping instruction
>   14:	f6 40 0d 40		testb  $0x40,0xd(%rax)
>
>  6: bh->b_folio
>  d: b_folio->mapping
> 11: mapping->host
>
> So folio->mapping is NULL.
>
> Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> test to decide if all buffers on the page are uptodate or not.  So both
> having no batch (ie this patch) and having a batch which is smaller than
> the number of buffers in the folio can lead to folio_end_read() being
> called prematurely (ie we'll unlock the folio before finishing reading
> every buffer in the folio).

But:

a) all batched buffers are locked in the old code, we only unlock
the currently evaluated buffer, the buffers from our pivot are locked
and should also have the async flag set. That fact that buffers ahead
should have the async flag set should prevent from calling
folio_end_read() prematurely as I read the code, no?

b) In the case where we're evaluating the last buffer, we can unlock and
call folio_end_read(), but that seems fine given the previous batch work
was in charge of finding candidate buffers which need a read, so in this
case there should be no pending read.

So I don't see how we can yet call folio_end_read() prematurely. We do
however unlock the buffer in end_buffer_async_read(), but in case of an
inconsistency we simply bail on the loop, and since we only called
end_buffer_async_read() when the buffer was up to date, the only
inconsistency we check for is if (!buffer_uptodate(tmp)), in which case
folio_end_read() will be called but without success being annotated.

> Once the folio is unlocked, it can be truncated.  That's a second-order
> problem, but it's the one your test happened to hit.
>
> This should fix the problem; we always have at least one BH held in
> the submission path with the async_read flag set, so
> end_buffer_async_read() will not end it prematurely.

But this alternative does not call end_buffer_async_read(); in fact we
only call folio_end_read() when no pending reads are needed.

> diff --git a/fs/buffer.c b/fs/buffer.c
> index cc8452f60251..fd2633e4a5d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  {
>  	struct inode *inode = folio->mapping->host;
>  	sector_t iblock, lblock;
> -	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> +	struct buffer_head *bh, *head, *prev = NULL;
>  	size_t blocksize;
> -	int nr, i;
> +	int i;
>  	int fully_mapped = 1;
>  	bool page_error = false;
>  	loff_t limit = i_size_read(inode);
> @@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  	iblock = div_u64(folio_pos(folio), blocksize);
>  	lblock = div_u64(limit + blocksize - 1, blocksize);
>  	bh = head;
> -	nr = 0;
>  	i = 0;
>
>  	do {
> @@ -2411,40 +2410,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  			if (buffer_uptodate(bh))
>  				continue;
>  		}
> -		arr[nr++] = bh;
> +
> +		lock_buffer(bh);
> +		if (buffer_uptodate(bh)) {
> +			unlock_buffer(bh);
> +			continue;
> +		}
> +
> +		mark_buffer_async_read(bh);
> +		if (prev)
> +			submit_bh(REQ_OP_READ, prev);
> +		prev = bh;
>  	} while (i++, iblock++, (bh = bh->b_this_page) != head);
>
>  	if (fully_mapped)
>  		folio_set_mappedtodisk(folio);
>
> -	if (!nr) {
> -		/*
> -		 * All buffers are uptodate or get_block() returned an
> -		 * error when trying to map them - we can finish the read.
> -		 */
> -		folio_end_read(folio, !page_error);
> -		return 0;
> -	}
> -
> -	/* Stage two: lock the buffers */
> -	for (i = 0; i < nr; i++) {
> -		bh = arr[i];
> -		lock_buffer(bh);
> -		mark_buffer_async_read(bh);
> -	}
> -
>  	/*
> -	 * Stage 3: start the IO. Check for uptodateness
> -	 * inside the buffer lock in case another process reading
> -	 * the underlying blockdev brought it uptodate (the sct fix).
> +	 * All buffers are uptodate or get_block() returned an error
> +	 * when trying to map them - we must finish the read because
> +	 * end_buffer_async_read() will never be called on any buffer
> +	 * in this folio.

Do we want to keep mentioning end_buffer_async_read() here?

>  	 */
> -	for (i = 0; i < nr; i++) {
> -		bh = arr[i];
> -		if (buffer_uptodate(bh))
> -			end_buffer_async_read(bh, 1);
> -		else
> -			submit_bh(REQ_OP_READ, bh);
> -	}
> +	if (prev)
> +		submit_bh(REQ_OP_READ, prev);
> +	else
> +		folio_end_read(folio, !page_error);
> +
>  	return 0;

Because we only call folio_end_read() in the above code when we have
determined there is no pending read IO. In the case where we had to issue
at least one read for a buffer, we never call folio_end_read(). We didn't
before either, unless we ran into a race where a pending batched read
coincided with a read being issued and updating our buffer by chance, and
we determined we either completed that read fine or with an error.

The reason I'm asking these things is that I'm trying to determine whether
there was an issue before that we're trying to fix, other than the
simplification with the new un-batched strategy. I don't see it yet. If
we're fixing something here, it is still a bit obscure to me and I'd like
to make sure we document it properly.

  Luis

On Fri, Jan 31, 2025 at 08:54:31AM -0800, Luis Chamberlain wrote:
> On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> > On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > > This splits up a minor enhancement from the bs > ps device support
> > > > > series into its own series for better review / focus / testing.
> > > > > This series just addresses the reducing the array size used and cleaning
> > > > > up the async read to be easier to read and maintain.
> > > >
> > > > How about this approach instead -- get rid of the batch entirely?
> > >
> > > Less is more! I wish it worked, but we end up with a null pointer on
> > > ext4/032 (and indeed this is the test that helped me find most bugs in
> > > what I was working on):
> >
> > Yeah, I did no testing; just wanted to give people a different approach
> > to consider.
> >
> > > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
> >
> > That decodes as:
> >
> >    5:	53			push   %rbx
> >    6:	48 8b 47 10		mov    0x10(%rdi),%rax
> >    a:	48 89 fb		mov    %rdi,%rbx
> >    d:	48 8b 40 18		mov    0x18(%rax),%rax
> >   11:*	48 8b 00		mov    (%rax),%rax		<-- trapping instruction
> >   14:	f6 40 0d 40		testb  $0x40,0xd(%rax)
> >
> >  6: bh->b_folio
> >  d: b_folio->mapping
> > 11: mapping->host
> >
> > So folio->mapping is NULL.
> >
> > Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> > test to decide if all buffers on the page are uptodate or not.  So both
> > having no batch (ie this patch) and having a batch which is smaller than
> > the number of buffers in the folio can lead to folio_end_read() being
> > called prematurely (ie we'll unlock the folio before finishing reading
> > every buffer in the folio).
>
> But:
>
> a) all batched buffers are locked in the old code, we only unlock
> the currently evaluated buffer, the buffers from our pivot are locked
> and should also have the async flag set. That fact that buffers ahead
> should have the async flag set should prevent from calling
> folio_end_read() prematurely as I read the code, no?

I'm sure you know what you mean by "the old code", but I don't.

If you mean "the code in 6.13", here's what it does:

	tmp = bh;
	do {
		if (!buffer_uptodate(tmp))
			folio_uptodate = 0;
		if (buffer_async_read(tmp)) {
			BUG_ON(!buffer_locked(tmp));
			goto still_busy;
		}
		tmp = tmp->b_this_page;
	} while (tmp != bh);
	folio_end_read(folio, folio_uptodate);

so it's going to cycle around every buffer on the page, and if it finds
none which are marked async_read, it'll call folio_end_read().
That's fine in 6.13 because in stage 2, all buffers which are part of
this folio are marked as async_read.

On Fri, Jan 31, 2025 at 10:01:46PM +0000, Matthew Wilcox wrote:
> On Fri, Jan 31, 2025 at 08:54:31AM -0800, Luis Chamberlain wrote:
> > On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote:
> > > On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote:
> > > > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote:
> > > > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote:
> > > > > > This splits up a minor enhancement from the bs > ps device support
> > > > > > series into its own series for better review / focus / testing.
> > > > > > This series just addresses the reducing the array size used and cleaning
> > > > > > up the async read to be easier to read and maintain.
> > > > >
> > > > > How about this approach instead -- get rid of the batch entirely?
> > > >
> > > > Less is more! I wish it worked, but we end up with a null pointer on
> > > > ext4/032 (and indeed this is the test that helped me find most bugs in
> > > > what I was working on):
> > >
> > > Yeah, I did no testing; just wanted to give people a different approach
> > > to consider.
> > >
> > > > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90
> > > > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09
> > >
> > > That decodes as:
> > >
> > >    5:	53			push   %rbx
> > >    6:	48 8b 47 10		mov    0x10(%rdi),%rax
> > >    a:	48 89 fb		mov    %rdi,%rbx
> > >    d:	48 8b 40 18		mov    0x18(%rax),%rax
> > >   11:*	48 8b 00		mov    (%rax),%rax		<-- trapping instruction
> > >   14:	f6 40 0d 40		testb  $0x40,0xd(%rax)
> > >
> > >  6: bh->b_folio
> > >  d: b_folio->mapping
> > > 11: mapping->host
> > >
> > > So folio->mapping is NULL.
> > >
> > > Ah, I see the problem.  end_buffer_async_read() uses the buffer_async_read
> > > test to decide if all buffers on the page are uptodate or not.  So both
> > > having no batch (ie this patch) and having a batch which is smaller than
> > > the number of buffers in the folio can lead to folio_end_read() being
> > > called prematurely (ie we'll unlock the folio before finishing reading
> > > every buffer in the folio).
> >
> > But:
> >
> > a) all batched buffers are locked in the old code, we only unlock
> > the currently evaluated buffer, the buffers from our pivot are locked
> > and should also have the async flag set. That fact that buffers ahead
> > should have the async flag set should prevent from calling
> > folio_end_read() prematurely as I read the code, no?
>
> I'm sure you know what you mean by "the old code", but I don't.
>
> If you mean "the code in 6.13", here's what it does:

Yes that is what I meant, sorry.

>
> 	tmp = bh;
> 	do {
> 		if (!buffer_uptodate(tmp))
> 			folio_uptodate = 0;
> 		if (buffer_async_read(tmp)) {
> 			BUG_ON(!buffer_locked(tmp));
> 			goto still_busy;
> 		}
> 		tmp = tmp->b_this_page;
> 	} while (tmp != bh);
> 	folio_end_read(folio, folio_uptodate);
>
> so it's going to cycle around every buffer on the page, and if it finds
> none which are marked async_read, it'll call folio_end_read().
> That's fine in 6.13 because in stage 2, all buffers which are part of
> this folio are marked as async_read.
Indeed. And it's not just one walk over every buffer on the page: since
end_buffer_async_read() can be called on every buffer in the page, in the
worst case we end up running that loop for every buffer, and on each run
the loop above starts from the pivot buffer and goes to the end of the
page.

> In your patch, you mark every buffer _in the batch_ as async_read
> and then submit the entire batch.  So if they all complete before you
> mark the next bh as being uptodate, it'll think the read is complete
> and call folio_end_read().

Ah yes, thanks, this clarifies what you meant!

  Luis