block: fix request.queuelist usage in flush

Message ID 20240604064745.808610-1-chengming.zhou@linux.dev (mailing list archive)
State New, archived

Commit Message

Chengming Zhou June 4, 2024, 6:47 a.m. UTC
Friedrich Weber reported a kernel crash problem and bisected to commit
81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine").

The root cause is that we use "list_move_tail(&rq->queuelist, pending)"
in the PREFLUSH/POSTFLUSH sequences. But rq->queuelist.next == xxx since
it's popped out from plug->cached_rq in __blk_mq_alloc_requests_batch().
We don't initialize its queuelist just for this first request, although
the queuelist of all later popped requests will be initialized.

Fix it by using "list_add_tail(&rq->queuelist, pending)" instead, so
rq->queuelist doesn't need to be initialized. This should be OK since rq
can't be on any list in the PREFLUSH or POSTFLUSH sequences, so there is
nothing to move.

Please note that commit 81ada09cc25e ("blk-flush: reuse rq queuelist in
flush state machine") also has another requirement: no driver may touch
rq->queuelist after blk_mq_end_request(), since we reuse it to add rq to
the post-flush pending list in POSTFLUSH. If this does not hold, we will
have to revert that commit IMHO.

Reported-by: Friedrich Weber <f.weber@proxmox.com>
Closes: https://lore.kernel.org/lkml/14b89dfb-505c-49f7-aebb-01c54451db40@proxmox.com/
Fixes: 81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine")
Cc: Christoph Hellwig <hch@lst.de>
Cc: ming.lei@redhat.com
Cc: bvanassche@acm.org
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
---
 block/blk-flush.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
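
To illustrate the difference between the two list helpers discussed in the commit message, here is a minimal userspace sketch. It is not kernel code: `struct list_head` and the helpers below are simplified re-implementations, and `demo()`, `stale_a` and `stale_b` are hypothetical names standing in for whatever a request's queuelist happened to point at before it was reused. The only point is that a move dereferences the entry's old pointers, while an add overwrites them without reading them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Link 'entry' before 'head'. Writes entry's pointers, never reads them. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *last = head->prev;

	entry->next = head;
	entry->prev = last;
	last->next = entry;
	head->prev = entry;
}

/* Unlink first, then add. The unlink READS entry->next and entry->prev. */
static void list_move_tail(struct list_head *entry, struct list_head *head)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	list_add_tail(entry, head);
}

struct demo_result {
	bool neighbours_intact;	/* were the stale nodes left alone? */
	bool on_pending;	/* is rq the only entry on 'pending'? */
};

/* Hypothetical scenario: 'rq' still carries leftover pointers. */
static struct demo_result demo(bool use_move)
{
	struct list_head pending, stale_a, stale_b, rq;
	struct demo_result res;

	init_list_head(&pending);
	init_list_head(&stale_a);
	init_list_head(&stale_b);

	/* rq was never initialised for this use; it holds leftovers. */
	rq.next = &stale_a;
	rq.prev = &stale_b;

	if (use_move)
		list_move_tail(&rq, &pending);
	else
		list_add_tail(&rq, &pending);

	res.neighbours_intact = stale_a.prev == &stale_a &&
				stale_b.next == &stale_b;
	res.on_pending = pending.next == &rq && pending.prev == &rq;
	return res;
}
```

In this sketch `demo(false)` (add) leaves the two stale nodes untouched, while `demo(true)` (move) rewrites them. In the kernel the leftover pointers need not reference valid memory at all, which would be consistent with the crash described in the commit message.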

Comments

Jens Axboe June 4, 2024, 2:17 p.m. UTC | #1
On Tue, 04 Jun 2024 14:47:45 +0800, Chengming Zhou wrote:
> Friedrich Weber reported a kernel crash problem and bisected to commit
> 81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine").
> 
> The root cause is that we use "list_move_tail(&rq->queuelist, pending)"
> in the PREFLUSH/POSTFLUSH sequences. But rq->queuelist.next == xxx since
> it's popped out from plug->cached_rq in __blk_mq_alloc_requests_batch().
> We don't initialize its queuelist just for this first request, although
> the queuelist of all later popped requests will be initialized.
> 
> [...]

Applied, thanks!

[1/1] block: fix request.queuelist usage in flush
      commit: a315b96155e4c0362742aa3c3b3aebe2ec3844bd

Best regards,

Friedrich Weber June 5, 2024, 8:45 a.m. UTC | #2
Hi,

On 04/06/2024 08:47, Chengming Zhou wrote:
> [...]

Unfortunately, with this patch applied to kernel 6.9 I get a different
crash [2] on a Debian 12 (virtual) machine with root on LVM on boot (no
software RAID involved). See [1] for lsblk and findmnt output. addr2line
says:

# addr2line -f -e /usr/lib/debug/vmlinux-6.9.0-patch0604-nodebuglist+
blk_mq_request_bypass_insert+0x20
blk_mq_request_bypass_insert
[...]/linux/block/blk-mq.c:2456

No crashes seen so far if the root is on LVM on top of software RAID, or
if the root partition is directly on disk.

If I can provide any more information, just let me know.

Thanks!

Best,

Friedrich

[1]

# lsblk -o name,fstype,label --ascii
NAME                          FSTYPE      LABEL
sda
|-sda1                        ext2
|-sda2
`-sda5                        LVM2_member
  |-kernel684--deb--vg-root   ext4
  `-kernel684--deb--vg-swap_1 swap
sr0                           iso9660     Debian 12.5.0 amd64 n
# findmnt --ascii
TARGET                       SOURCE     FSTYPE    OPTIONS
/                            /dev/mapper/kernel684--deb--vg-root
                                        ext4      rw,relatime,errors=remount-ro
|-/sys                       sysfs      sysfs     rw,nosuid,nodev,noexec,relatime
| |-/sys/kernel/security     securityfs securityf rw,nosuid,nodev,noexec,relatime
| |-/sys/fs/cgroup           cgroup2    cgroup2   rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursive
| |-/sys/fs/pstore           pstore     pstore    rw,nosuid,nodev,noexec,relatime
| |-/sys/fs/bpf              bpf        bpf       rw,nosuid,nodev,noexec,relatime,mode=700
| |-/sys/kernel/debug        debugfs    debugfs   rw,nosuid,nodev,noexec,relatime
| |-/sys/kernel/tracing      tracefs    tracefs   rw,nosuid,nodev,noexec,relatime
| |-/sys/fs/fuse/connections fusectl    fusectl   rw,nosuid,nodev,noexec,relatime
| `-/sys/kernel/config       configfs   configfs  rw,nosuid,nodev,noexec,relatime
|-/proc                      proc       proc      rw,nosuid,nodev,noexec,relatime
| `-/proc/sys/fs/binfmt_misc systemd-1  autofs    rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,di
|-/dev                       udev       devtmpfs  rw,nosuid,relatime,size=4040780k,nr_inodes=1010195,mode=755
| |-/dev/pts                 devpts     devpts    rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000
| |-/dev/shm                 tmpfs      tmpfs     rw,nosuid,nodev,inode64
| |-/dev/hugepages           hugetlbfs  hugetlbfs rw,relatime,pagesize=2M
| `-/dev/mqueue              mqueue     mqueue    rw,nosuid,nodev,noexec,relatime
|-/run                       tmpfs      tmpfs     rw,nosuid,nodev,noexec,relatime,size=813456k,mode=755,inode
| |-/run/lock                tmpfs      tmpfs     rw,nosuid,nodev,noexec,relatime,size=5120k,inode64
| |-/run/credentials/systemd-sysctl.service
| |                          ramfs      ramfs     ro,nosuid,nodev,noexec,relatime,mode=700
| |-/run/credentials/systemd-sysusers.service
| |                          ramfs      ramfs     ro,nosuid,nodev,noexec,relatime,mode=700
| |-/run/credentials/systemd-tmpfiles-setup-dev.service
| |                          ramfs      ramfs     ro,nosuid,nodev,noexec,relatime,mode=700
| |-/run/user/0              tmpfs      tmpfs     rw,nosuid,nodev,relatime,size=813452k,nr_inodes=203363,mode
| `-/run/credentials/systemd-tmpfiles-setup.service
|                            ramfs      ramfs     ro,nosuid,nodev,noexec,relatime,mode=700
`-/boot                      /dev/sda1  ext2      rw,relatime

[2]
[    1.137443] BUG: kernel NULL pointer dereference, address:
0000000000000000
[    1.137951] #PF: supervisor write access in kernel mode
[    1.138332] #PF: error_code(0x0002) - not-present page
[    1.138695] PGD 0 P4D 0
[    1.138697] Oops: 0002 [#1] PREEMPT SMP NOPTI
[    1.138702] CPU: 1 PID: 27 Comm: kworker/1:0H Tainted: G            E
     6.9.0-patch0604-nodebuglist+ #35
[    1.138703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    1.138705] Workqueue: kblockd blk_mq_requeue_work
[    1.141021] RIP: 0010:_raw_spin_lock+0x13/0x60
[    1.141336] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 bc 94 cb 69 31 c0 ba 01 00 00
00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
[    1.142670] RSP: 0018:ffffa42a40103d78 EFLAGS: 00010246
[    1.143032] RAX: 0000000000000000 RBX: ffff91c4c0357c00 RCX:
00000000ffffffe0
[    1.143545] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
0000000000000000
[    1.144037] RBP: ffffa42a40103d98 R08: 0000000000000000 R09:
0000000000000000
[    1.144548] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[    1.145036] R13: 0000000000000001 R14: ffff91c5f7cc1d80 R15:
ffff91c4c153eb54
[    1.145542] FS:  0000000000000000(0000) GS:ffff91c5f7c80000(0000)
knlGS:0000000000000000
[    1.146092] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.146511] CR2: 0000000000000000 CR3: 000000010e514001 CR4:
0000000000370ef0
[    1.147003] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[    1.147507] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[    1.147997] Call Trace:
[    1.148177]  <TASK>
[    1.148332]  ? show_regs+0x6c/0x80
[    1.148603]  ? __die+0x24/0x80
[    1.148824]  ? page_fault_oops+0x175/0x5b0
[    1.149111]  ? do_user_addr_fault+0x311/0x680
[    1.149420]  ? exc_page_fault+0x82/0x1b0
[    1.149718]  ? asm_exc_page_fault+0x27/0x30
[    1.150013]  ? _raw_spin_lock+0x13/0x60
[    1.150282]  ? blk_mq_request_bypass_insert+0x20/0xe0
[    1.150663]  blk_mq_insert_request+0x120/0x1e0
[    1.150975]  blk_mq_requeue_work+0x18f/0x230
[    1.151277]  process_one_work+0x19b/0x3f0
[    1.151562]  worker_thread+0x32a/0x500
[    1.151847]  ? __pfx_worker_thread+0x10/0x10
[    1.152148]  kthread+0xe1/0x110
[    1.152373]  ? __pfx_kthread+0x10/0x10
[    1.152640]  ret_from_fork+0x44/0x70
[    1.152906]  ? __pfx_kthread+0x10/0x10
[    1.153169]  ret_from_fork_asm+0x1a/0x30
[    1.153449]  </TASK>
[    1.153608] Modules linked in: efi_pstore(E) dmi_sysfs(E)
qemu_fw_cfg(E) ip_tables(E) x_tables(E) autofs4(E) psmouse(E) bochs(E)
uhci_hcd(E) crc32_pclmul(E) drm_vram_helper(E) drm_ttm_helper(E)
i2c_piix4(E) ttm(E) ehci_hcd(E) pata_acpi(E) floppy(E)
[    1.155135] CR2: 0000000000000000
[    1.155370] ---[ end trace 0000000000000000 ]---
[    1.155694] RIP: 0010:_raw_spin_lock+0x13/0x60
[    1.156024] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 bc 94 cb 69 31 c0 ba 01 00 00
00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
[    1.157306] RSP: 0018:ffffa42a40103d78 EFLAGS: 00010246
[    1.157669] RAX: 0000000000000000 RBX: ffff91c4c0357c00 RCX:
00000000ffffffe0
[    1.158172] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
0000000000000000
[    1.158682] RBP: ffffa42a40103d98 R08: 0000000000000000 R09:
0000000000000000
[    1.159311] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[    1.159992] R13: 0000000000000001 R14: ffff91c5f7cc1d80 R15:
ffff91c4c153eb54
[    1.160575] FS:  0000000000000000(0000) GS:ffff91c5f7c80000(0000)
knlGS:0000000000000000
[    1.161186] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.161618] CR2: 0000000000000000 CR3: 000000010e514001 CR4:
0000000000370ef0
[    1.162158] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[    1.162691] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400

Chengming Zhou June 5, 2024, 10:30 a.m. UTC | #3
On 2024/6/5 16:45, Friedrich Weber wrote:
> Hi,
> 
> On 04/06/2024 08:47, Chengming Zhou wrote:
>> [...]
> 
> Unfortunately, with this patch applied to kernel 6.9 I get a different
> crash [2] on a Debian 12 (virtual) machine with root on LVM on boot (no
> software RAID involved). See [1] for lsblk and findmnt output. addr2line
> says:

Sorry, which commit is your kernel based on? Is it the mainline tag v6.9
or some other commit? And is it reproducible with the mainline kernel
v6.10-rc2?

> 
> # addr2line -f -e /usr/lib/debug/vmlinux-6.9.0-patch0604-nodebuglist+
> blk_mq_request_bypass_insert+0x20

I think you should use blk_mq_insert_request+0x120 here instead of
blk_mq_request_bypass_insert+0x20, since the latter is marked with "?"
(an entry the unwinder does not consider reliable).
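
The rule of thumb above (skip entries prefixed with "?" when choosing an address for addr2line) can be sketched as a small helper. This is an illustration only: `frame_is_unreliable()`, `first_reliable_frame()` and `sample_trace` are hypothetical names, and the sample entries are copied from the call trace in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entries copied from the call trace quoted in this thread. */
static const char *const sample_trace[] = {
	"? _raw_spin_lock+0x13/0x60",
	"? blk_mq_request_bypass_insert+0x20/0xe0",
	"blk_mq_insert_request+0x120/0x1e0",
	"blk_mq_requeue_work+0x18f/0x230",
};

/* Return 1 if a call-trace entry starts with the '?' marker. */
static int frame_is_unreliable(const char *entry)
{
	while (*entry == ' ')
		entry++;
	return *entry == '?';
}

/* Return the first entry without a '?' marker, or NULL if there is none. */
static const char *first_reliable_frame(const char *const *entries, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!frame_is_unreliable(entries[i]))
			return entries[i];
	return NULL;
}
```

Applied to `sample_trace`, the helper skips the two "?" entries and returns the blk_mq_insert_request entry, which is the symbol+offset suggested for addr2line above.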

> blk_mq_request_bypass_insert
> [...]/linux/block/blk-mq.c:2456
> 
> No crashes seen so far if the root is on LVM on top of software RAID, or
> if the root partition is directly on disk.

Ok, I will look into this ASAP, thank you for the information!

> 
> If I can provide any more information, just let me know.
> 
> Thanks!
> 
> Best,
> 
> Friedrich
> 
> [...]

Friedrich Weber June 5, 2024, 10:54 a.m. UTC | #4
On 05/06/2024 12:30, Chengming Zhou wrote:
> On 2024/6/5 16:45, Friedrich Weber wrote:
>> Hi,
>>
>> On 04/06/2024 08:47, Chengming Zhou wrote:
>>> [...]
>>
>> Unfortunately, with this patch applied to kernel 6.9 I get a different
>> crash [2] on a Debian 12 (virtual) machine with root on LVM on boot (no
>> software RAID involved). See [1] for lsblk and findmnt output. addr2line
>> says:
> 
> Sorry, which commit is your kernel? Is mainline tag v6.9 or at some commit?

Yes, by "kernel 6.9" I meant mainline tag v6.9, so commit a38297e3fb01.

If I boot this mainline kernel v6.9 in a Debian (virtual) machine with
root on LVM, I do not get a crash. If I apply the patch "block: fix
request.queuelist usage in flush" on top of this mainline kernel v6.9,
and boot the Debian machine into that patched kernel, I get a crash on boot.

> And is it reproducible using the mainline kernel v6.10-rc2?

I'll test mainline kernel v6.10-rc2, and "block: fix request.queuelist
usage in flush" applied on top of v6.10-rc2, and get back to you.

>> # addr2line -f -e /usr/lib/debug/vmlinux-6.9.0-patch0604-nodebuglist+
>> blk_mq_request_bypass_insert+0x20
> 
> I think here should use blk_mq_insert_request+0x120, instead of the
> blk_mq_request_bypass_insert+0x20, which has "?" at the beginning.
> 

Right, sorry:

# addr2line -f -e /usr/lib/debug/vmlinux-6.9.0-patch0604-nodebuglist+
blk_mq_insert_request+0x120
blk_mq_insert_request
[...]/linux/block/blk-mq.c:2539

which refers to this line [1]:

		blk_mq_request_bypass_insert(rq, BLK_MQ_INSERT_AT_HEAD);

Thanks!

Friedrich

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-mq.c?h=v6.9#n2539

Friedrich Weber June 5, 2024, 1:34 p.m. UTC | #5
On 05/06/2024 12:54, Friedrich Weber wrote:
> On 05/06/2024 12:30, Chengming Zhou wrote:
>> On 2024/6/5 16:45, Friedrich Weber wrote:
>>> [...]
>>> Unfortunately, with this patch applied to kernel 6.9 I get a different
>>> crash [2] on a Debian 12 (virtual) machine with root on LVM on boot (no
>>> software RAID involved). See [1] for lsblk and findmnt output. addr2line
>>> says:
>>
>> Sorry, which commit is your kernel? Is mainline tag v6.9 or at some commit?
> 
> Yes, by "kernel 6.9" I meant mainline tag v6.9, so commit a38297e3fb01.
> 
> If I boot this mainline kernel v6.9 in a Debian (virtual) machine with
> root on LVM, I do not get a crash. If I apply the patch "block: fix
> request.queuelist usage in flush" on top of this mainline kernel v6.9,
> and boot the Debian machine into that patched kernel, I get a crash on boot.
> 
>> And is it reproducible using the mainline kernel v6.10-rc2?
> 
> I'll test mainline kernel v6.10-rc2, and "block: fix request.queuelist
> usage in flush" applied on top of v6.10-rc2, and get back to you.

My results:

Booting the Debian (virtual) machine with mainline kernel v6.10-rc2
(c3f38fa61af77b49866b006939479069cd451173):
works fine, no crash

Booting the Debian (virtual) machine with patch "block: fix
request.queuelist usage in flush" applied on top of v6.10-rc2: The
Debian (virtual) machine crashes during boot with [1].

Hope this helps! If I can provide anything else, just let me know.

Best wishes,

Friedrich

[1]

[    1.091562] BUG: kernel NULL pointer dereference, address:
0000000000000000
[    1.092097] #PF: supervisor write access in kernel mode
[    1.092469] #PF: error_code(0x0002) - not-present page
[    1.092880] PGD 0 P4D 0
[    1.093064] Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
[    1.093193] systemd[1]: Finished systemd-sysusers.service - Create
System Users.
[    1.093422] CPU: 1 PID: 130 Comm: kworker/1:1H Tainted: G
E      6.10.0-rc2-patch0604-6-10rc2+ #37
[    1.095178] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    1.096005] Workqueue: kblockd blk_mq_requeue_work
[    1.096342] RIP: 0010:_raw_spin_lock+0x13/0x60
[    1.096707] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 3c 42 4a 6f 31 c0 ba 01 00 00
00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
[    1.098021] RSP: 0000:ffffb5ebc0343d78 EFLAGS: 00010246
[    1.098381] RAX: 0000000000000000 RBX: ffff9326c8c8c800 RCX:
00000000ffffffe0
[    1.098917] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
0000000000000000
[    1.099409] RBP: ffffb5ebc0343d98 R08: 0000000000000000 R09:
0000000000000000
[    1.099944] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[    1.100440] R13: 0000000000000001 R14: ffff9327f7cc2180 R15:
ffff9326c8c91894
[    1.100969] FS:  0000000000000000(0000) GS:ffff9327f7c80000(0000)
knlGS:0000000000000000
[    1.101526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.101950] CR2: 0000000000000000 CR3: 0000000100eaa005 CR4:
0000000000370ef0
[    1.102443] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[    1.102951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[    1.103447] Call Trace:
[    1.103626]  <TASK>
[    1.103805]  ? show_regs+0x6c/0x80
[    1.104053]  ? __die+0x24/0x80
[    1.104055]  ? page_fault_oops+0x175/0x5e0
[    1.104059]  ? do_user_addr_fault+0x325/0x690
[    1.104062]  ? exc_page_fault+0x82/0x1b0
[    1.105390]  ? asm_exc_page_fault+0x27/0x30
[    1.105716]  ? _raw_spin_lock+0x13/0x60
[    1.106033]  ? blk_mq_request_bypass_insert+0x20/0xe0
[    1.106385]  blk_mq_insert_request+0x120/0x1e0
[    1.106704]  blk_mq_requeue_work+0x18f/0x230
[    1.107033]  process_one_work+0x196/0x3e0
[    1.107316]  worker_thread+0x32a/0x500
[    1.107587]  ? __pfx_worker_thread+0x10/0x10
[    1.107915]  kthread+0xe1/0x110
[    1.108140]  ? __pfx_kthread+0x10/0x10
[    1.108409]  ret_from_fork+0x44/0x70
[    1.108662]  ? __pfx_kthread+0x10/0x10
[    1.108952]  ret_from_fork_asm+0x1a/0x30
[    1.109228]  </TASK>
[    1.109386] Modules linked in: efi_pstore(E) dmi_sysfs(E)
qemu_fw_cfg(E) ip_tables(E) x_tables(E) autofs4(E) crc32_pclmul(E)
bochs(E) drm_vram_helper(E) drm_ttm_helper(E) psmouse(E) uhci_hcd(E)
ttm(E) ehci_hcd(E) i2c_piix4(E) pata_acpi(E) floppy(E)
[    1.110910] CR2: 0000000000000000
[    1.111161] ---[ end trace 0000000000000000 ]---
[    1.111489] RIP: 0010:_raw_spin_lock+0x13/0x60
[    1.111802] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 3c 42 4a 6f 31 c0 ba 01 00 00
00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
[    1.113119] RSP: 0000:ffffb5ebc0343d78 EFLAGS: 00010246
[    1.113489] RAX: 0000000000000000 RBX: ffff9326c8c8c800 RCX:
00000000ffffffe0
[    1.114001] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
0000000000000000
[    1.114497] RBP: ffffb5ebc0343d98 R08: 0000000000000000 R09:
0000000000000000
[    1.114998] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[    1.115508] R13: 0000000000000001 R14: ffff9327f7cc2180 R15:
ffff9326c8c91894
[    1.115997] FS:  0000000000000000(0000) GS:ffff9327f7c80000(0000)
knlGS:0000000000000000
[    1.116578] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.116975] CR2: 0000000000000000 CR3: 0000000100eaa005 CR4:
0000000000370ef0
[    1.117494] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[    1.117982] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[    1.118526] note: kworker/1:1H[130] exited with irqs disabled
[    1.118947] note: kworker/1:1H[130] exited with preempt_count 1

Chengming Zhou June 5, 2024, 2:27 p.m. UTC | #6
On 2024/6/5 21:34, Friedrich Weber wrote:
> On 05/06/2024 12:54, Friedrich Weber wrote:
>> On 05/06/2024 12:30, Chengming Zhou wrote:
>>> On 2024/6/5 16:45, Friedrich Weber wrote:
>>>> [...]
>>>> Unfortunately, with this patch applied to kernel 6.9 I get a different
>>>> crash [2] on a Debian 12 (virtual) machine with root on LVM on boot (no
>>>> software RAID involved). See [1] for lsblk and findmnt output. addr2line
>>>> says:
>>>
>>> Sorry, which commit is your kernel? Is mainline tag v6.9 or at some commit?
>>
>> Yes, by "kernel 6.9" I meant mainline tag v6.9, so commit a38297e3fb01.
>>
>> If I boot this mainline kernel v6.9 in a Debian (virtual) machine with
>> root on LVM, I do not get a crash. If I apply the patch "block: fix
>> request.queuelist usage in flush" on top of this mainline kernel v6.9,
>> and boot the Debian machine into that patched kernel, I get a crash on boot.
>>
>>> And is it reproducible using the mainline kernel v6.10-rc2?
>>
>> I'll test mainline kernel v6.10-rc2, and "block: fix request.queuelist
>> usage in flush" applied on top of v6.10-rc2, and get back to you.
> 
> My results:
> 
> Booting the Debian (virtual) machine with mainline kernel v6.10-rc2
> (c3f38fa61af77b49866b006939479069cd451173):
> works fine, no crash
> 
> Booting the Debian (virtual) machine with patch "block: fix
> request.queuelist usage in flush" applied on top of v6.10-rc2: The
> Debian (virtual) machine crashes during boot with [1].
> 
> Hope this helps! If I can provide anything else, just let me know.

Thanks for your help. I still can't reproduce it myself, and I don't know why.

Could you please test with this diff?

diff --git a/block/blk-flush.c b/block/blk-flush.c
index e7aebcf00714..cca4f9131f79 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -263,6 +263,7 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
                unsigned int seq = blk_flush_cur_seq(rq);

                BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
+               list_del_init(&rq->queuelist);
                blk_flush_complete_seq(rq, fq, seq, error);
        }


I don't know whether a request can have PREFLUSH and POSTFLUSH set but no
DATA; maybe in some special cases? I hope someone can give me some hints.
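
One possible reading of why the extra list_del_init() in the diff above could matter, sketched in userspace C. This is an assumption and not confirmed in the thread: if a request is still linked on one list (called `running` here) when it is added to another (`pending`) with list_add_tail(), the first list is left inconsistent, whereas unlinking it first keeps both lists well formed. All names below are hypothetical and the helpers are simplified re-implementations, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Link 'entry' before 'head' without looking at entry's old pointers. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *last = head->prev;

	entry->next = head;
	entry->prev = last;
	last->next = entry;
	head->prev = entry;
}

/* Unlink the entry and leave it pointing at itself. */
static void list_del_init(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	init_list_head(entry);
}

/* Walk from head; report whether we get back to it within 'limit' steps. */
static bool list_is_well_formed(const struct list_head *head, int limit)
{
	const struct list_head *pos = head->next;

	while (limit-- > 0) {
		if (pos == head)
			return true;
		pos = pos->next;
	}
	return false;
}

/* Move rq from 'running' to 'pending', optionally unlinking it first. */
static bool running_list_ok_after_requeue(bool unlink_first)
{
	struct list_head running, pending, rq;

	init_list_head(&running);
	init_list_head(&pending);
	list_add_tail(&rq, &running);	/* rq sits on the running list */

	if (unlink_first)
		list_del_init(&rq);
	list_add_tail(&rq, &pending);

	return list_is_well_formed(&running, 8) && list_is_empty(&running);
}
```

With `unlink_first` set, `running` ends up empty and intact. Without it, `running` still points at `rq`, whose links now belong to `pending`, so a walk starting from `running` never returns to its head.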

The panic below suggests that something went badly wrong: the hctx (or
maybe ctx) ->lock obtained from the request is NULL.

Thanks!

> 
> Best wishes,
> 
> Friedrich
> 
> [1]
> 
> [    1.091562] BUG: kernel NULL pointer dereference, address:
> 0000000000000000
> [    1.092097] #PF: supervisor write access in kernel mode
> [    1.092469] #PF: error_code(0x0002) - not-present page
> [    1.092880] PGD 0 P4D 0
> [    1.093064] Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
> [    1.093193] systemd[1]: Finished systemd-sysusers.service - Create
> System Users.
> [    1.093422] CPU: 1 PID: 130 Comm: kworker/1:1H Tainted: G
> E      6.10.0-rc2-patch0604-6-10rc2+ #37
> [    1.095178] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [    1.096005] Workqueue: kblockd blk_mq_requeue_work
> [    1.096342] RIP: 0010:_raw_spin_lock+0x13/0x60
> [    1.096707] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 3c 42 4a 6f 31 c0 ba 01 00 00
> 00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
> [    1.098021] RSP: 0000:ffffb5ebc0343d78 EFLAGS: 00010246
> [    1.098381] RAX: 0000000000000000 RBX: ffff9326c8c8c800 RCX:
> 00000000ffffffe0
> [    1.098917] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
> 0000000000000000
> [    1.099409] RBP: ffffb5ebc0343d98 R08: 0000000000000000 R09:
> 0000000000000000
> [    1.099944] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000000
> [    1.100440] R13: 0000000000000001 R14: ffff9327f7cc2180 R15:
> ffff9326c8c91894
> [    1.100969] FS:  0000000000000000(0000) GS:ffff9327f7c80000(0000)
> knlGS:0000000000000000
> [    1.101526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.101950] CR2: 0000000000000000 CR3: 0000000100eaa005 CR4:
> 0000000000370ef0
> [    1.102443] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [    1.102951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [    1.103447] Call Trace:
> [    1.103626]  <TASK>
> [    1.103805]  ? show_regs+0x6c/0x80
> [    1.104053]  ? __die+0x24/0x80
> [    1.104055]  ? page_fault_oops+0x175/0x5e0
> [    1.104059]  ? do_user_addr_fault+0x325/0x690
> [    1.104062]  ? exc_page_fault+0x82/0x1b0
> [    1.105390]  ? asm_exc_page_fault+0x27/0x30
> [    1.105716]  ? _raw_spin_lock+0x13/0x60
> [    1.106033]  ? blk_mq_request_bypass_insert+0x20/0xe0
> [    1.106385]  blk_mq_insert_request+0x120/0x1e0
> [    1.106704]  blk_mq_requeue_work+0x18f/0x230
> [    1.107033]  process_one_work+0x196/0x3e0
> [    1.107316]  worker_thread+0x32a/0x500
> [    1.107587]  ? __pfx_worker_thread+0x10/0x10
> [    1.107915]  kthread+0xe1/0x110
> [    1.108140]  ? __pfx_kthread+0x10/0x10
> [    1.108409]  ret_from_fork+0x44/0x70
> [    1.108662]  ? __pfx_kthread+0x10/0x10
> [    1.108952]  ret_from_fork_asm+0x1a/0x30
> [    1.109228]  </TASK>
> [    1.109386] Modules linked in: efi_pstore(E) dmi_sysfs(E)
> qemu_fw_cfg(E) ip_tables(E) x_tables(E) autofs4(E) crc32_pclmul(E)
> bochs(E) drm_vram_helper(E) drm_ttm_helper(E) psmouse(E) uhci_hcd(E)
> ttm(E) ehci_hcd(E) i2c_piix4(E) pata_acpi(E) floppy(E)
> [    1.110910] CR2: 0000000000000000
> [    1.111161] ---[ end trace 0000000000000000 ]---
> [    1.111489] RIP: 0010:_raw_spin_lock+0x13/0x60
> [    1.111802] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 3c 42 4a 6f 31 c0 ba 01 00 00
> 00 <f0> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
> [    1.113119] RSP: 0000:ffffb5ebc0343d78 EFLAGS: 00010246
> [    1.113489] RAX: 0000000000000000 RBX: ffff9326c8c8c800 RCX:
> 00000000ffffffe0
> [    1.114001] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
> 0000000000000000
> [    1.114497] RBP: ffffb5ebc0343d98 R08: 0000000000000000 R09:
> 0000000000000000
> [    1.114998] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000000
> [    1.115508] R13: 0000000000000001 R14: ffff9327f7cc2180 R15:
> ffff9326c8c91894
> [    1.115997] FS:  0000000000000000(0000) GS:ffff9327f7c80000(0000)
> knlGS:0000000000000000
> [    1.116578] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.116975] CR2: 0000000000000000 CR3: 0000000100eaa005 CR4:
> 0000000000370ef0
> [    1.117494] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [    1.117982] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [    1.118526] note: kworker/1:1H[130] exited with irqs disabled
> [    1.118947] note: kworker/1:1H[130] exited with preempt_count 1
> 
>
Jens Axboe June 5, 2024, 6:14 p.m. UTC | #7
On 6/4/24 8:17 AM, Jens Axboe wrote:
> 
> On Tue, 04 Jun 2024 14:47:45 +0800, Chengming Zhou wrote:
>> Friedrich Weber reported a kernel crash problem and bisected to commit
>> 81ada09cc25e ("blk-flush: reuse rq queuelist in flush state machine").
>>
>> The root cause is that we use "list_move_tail(&rq->queuelist, pending)"
>> in the PREFLUSH/POSTFLUSH sequences. But rq->queuelist.next == xxx since
>> it's popped out from plug->cached_rq in __blk_mq_alloc_requests_batch().
>> We don't initialize its queuelist just for this first request, although
>> the queuelist of all later popped requests will be initialized.
>>
>> [...]
> 
> Applied, thanks!
> 
> [1/1] block: fix request.queuelist usage in flush
>       commit: a315b96155e4c0362742aa3c3b3aebe2ec3844bd

Given the pending investigation into crashes potentially caused by this
patch, I've dropped it from the 6.10 tree for now.
Friedrich Weber June 6, 2024, 8:44 a.m. UTC | #8
On 05/06/2024 16:27, Chengming Zhou wrote:
> On 2024/6/5 21:34, Friedrich Weber wrote:
>> On 05/06/2024 12:54, Friedrich Weber wrote:
>> [...]
>>
>> My results:
>>
>> Booting the Debian (virtual) machine with mainline kernel v6.10-rc2
>> (c3f38fa61af77b49866b006939479069cd451173):
>> works fine, no crash
>>
>> Booting the Debian (virtual) machine with patch "block: fix
>> request.queuelist usage in flush" applied on top of v6.10-rc2: The
>> Debian (virtual) machine crashes during boot with [1].
>>
>> Hope this helps! If I can provide anything else, just let me know.
> 
> Thanks for your help, I still can't reproduce it myself, don't know why.

Weird -- when booting the Debian machine into mainline kernel v6.10-rc2
with "block: fix request.queuelist usage in flush" applied on top, it
crashes reliably for me. The machine having its root on LVM seems to be
essential to reproduce the crash, though.

Maybe the fact that I'm running the Debian machine virtualized makes the
crash more likely to trigger. I'll try to reproduce on bare metal to
narrow down the reproducer and get back to you.

> Could you help to test with this diff?
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index e7aebcf00714..cca4f9131f79 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -263,6 +263,7 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
>                 unsigned int seq = blk_flush_cur_seq(rq);
> 
>                 BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
> +               list_del_init(&rq->queuelist);
>                 blk_flush_complete_seq(rq, fq, seq, error);
>         }

I used mainline kernel v6.10-rc2 as base and applied:

- "block: fix request.queuelist usage in flush"
- Your `list_del_init` addition from above

and if I boot the Debian machine into this kernel, I do not get the
crash anymore.

Happy to run more tests for you, just let me know.

Thanks!

Friedrich
Friedrich Weber June 6, 2024, 4:05 p.m. UTC | #9
On 06/06/2024 10:44, Friedrich Weber wrote:
> On 05/06/2024 16:27, Chengming Zhou wrote:
>> On 2024/6/5 21:34, Friedrich Weber wrote:
>>> On 05/06/2024 12:54, Friedrich Weber wrote:
>>> [...]
>>>
>>> My results:
>>>
>>> Booting the Debian (virtual) machine with mainline kernel v6.10-rc2
>>> (c3f38fa61af77b49866b006939479069cd451173):
>>> works fine, no crash
>>>
>>> Booting the Debian (virtual) machine with patch "block: fix
>>> request.queuelist usage in flush" applied on top of v6.10-rc2: The
>>> Debian (virtual) machine crashes during boot with [1].
>>>
>>> Hope this helps! If I can provide anything else, just let me know.
>>
>> Thanks for your help, I still can't reproduce it myself, don't know why.
> 
> Weird -- when booting the Debian machine into mainline kernel v6.10-rc2
> with "block: fix request.queuelist usage in flush" applied on top, it
> crashes reliably for me. The machine having its root on LVM seems to be
> essential to reproduce the crash, though.
> 
> Maybe the fact that I'm running the Debian machine virtualized makes the
> crash more likely to trigger. I'll try to reproduce on bare metal to
> narrow down the reproducer and get back to you.

The crashing Debian VM (QEMU/KVM) has its root on an LVM Logical Volume.
As it turns out, whether there is an in-guest kernel crash on boot
depends on the cache mode of the disk backing the LVM Physical Volume
(Christoph also mentioned the write cache being relevant for the
original issue with software RAID [1]). Steps to reproduce:

- Install kernel v6.10-rc2 with "block: fix request.queuelist usage in
flush" applied inside the Debian VM

- Start the Debian VM with `cache=writethrough` for its disk (see [2]
for upstream QEMU 8.2.2 command line)
  => no in-guest crash on boot

- Start the VM with `cache=writeback` for its disk (see [3] for diff to
previous command line)
  => in-guest crash [4] on boot

Maybe the cache=writeback was missing to reproduce the crash on your
end? If you still cannot reproduce the crash, let me know -- I'll
provide more detailed steps how to generate the VM disk then.

I was also able to reproduce the crash on bare metal, with a Proxmox VE
8.2 installation -- if needed I can also retry with e.g. a bare-metal
Debian installation.

However, also here the write cache seems to be relevant:

- Running kernel is again mainline v6.10-rc2 with "block: fix
request.queuelist usage in flush" applied. root is on LVM (see [5]).

- If I enable write caching for the disk backing the LVM PV via `hdparm
-W1 /dev/sdb` and reboot, I see the crash in blk_mq_requeue_work on boot.

- If I disable write caching via `hdparm -W0 /dev/sdb` and reboot, I see
no crash.
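
A possible reason the write cache matters, sketched as a small userspace
model. This is an assumption based on a reading of how the block layer
derives the flush steps for a request, not code taken from this thread; the
model_* names and the reduced structs are hypothetical. In the model, flush
steps are only planned when the queue reports a volatile write cache, so
with caching disabled a request never enters the PREFLUSH/POSTFLUSH
sequences at all.

```c
#include <stdbool.h>

/* Step bits, named after the kernel's REQ_FSEQ_* values (simplified model). */
enum {
	FSEQ_PREFLUSH  = 1 << 0,
	FSEQ_DATA      = 1 << 1,
	FSEQ_POSTFLUSH = 1 << 2,
};

/* Hypothetical, reduced description of a request. */
struct model_rq {
	unsigned int sectors;	/* 0 for a request that carries no data */
	bool preflush;		/* submitter asked for a pre-flush */
	bool fua;		/* submitter asked for forced unit access */
};

/* Hypothetical, reduced description of the queue's cache features. */
struct model_queue {
	bool write_cache;	/* volatile write cache enabled */
	bool native_fua;	/* device handles FUA itself */
};

static unsigned int model_flush_policy(const struct model_queue *q,
				       const struct model_rq *rq)
{
	unsigned int policy = 0;

	if (rq->sectors)
		policy |= FSEQ_DATA;

	/* Flush steps are only planned when a volatile write cache exists. */
	if (q->write_cache) {
		if (rq->preflush)
			policy |= FSEQ_PREFLUSH;
		/* Without native FUA, FUA is emulated by a post-flush. */
		if (rq->fua && !q->native_fua)
			policy |= FSEQ_POSTFLUSH;
	}
	return policy;
}
```

Under these assumptions, a data-less request with both flags set on a
write-back device without native FUA gets PREFLUSH | POSTFLUSH and no DATA
step, while the same request on a device with caching disabled gets an
empty policy.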

Hope this helps. Let me know if I can provide any more information.

Best,

Friedrich

[1] https://lore.kernel.org/all/20240531061708.GB18075@lst.de/
[2]

[...]/qemu-stable-8.2/qemu-8.2.2/build/qemu-system-x86_64 \
  -accel kvm \
  -name 'kernel684deb,debug-threads=on' \
  -chardev
'socket,id=qmp,path=/var/run/qemu-server/198.qmp,server=on,wait=off' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/198.pid \
  -smbios 'type=1,uuid=49351322-c4a0-420b-a780-d445a638973a' \
  -smp '1,sockets=1,cores=1,maxcpus=1' \
  -nodefaults \
  -vnc 'unix:/var/run/qemu-server/198.vnc,password=on' \
  -cpu qemu64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt \
  -m 8192 \
  -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
  -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
  -device 'vmgenid,guid=78d4ee6d-a73a-43bd-9e3a-5395449af862' \
  -chardev
'socket,id=serial0,path=/var/run/qemu-server/198.serial0,server=on,wait=off'
\
  -device 'isa-serial,chardev=serial0' \
  -device 'VGA,id=vga,bus=pci.0,addr=0x2' \
  -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
  -drive
'file=/dev/pve/vm-198-disk-0,if=none,id=drive-sata0,format=raw,aio=threads,detect-zeroes=on,cache=writethrough'
\
  -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' \
  -machine 'type=pc'

[3]
--- run-198-nocrash.sh	2024-06-06 17:19:10.725256488 +0200
+++ run-198-crash.sh	2024-06-06 17:18:39.444974266 +0200
@@ -19,6 +19,6 @@
   -device 'isa-serial,chardev=serial0' \
   -device 'VGA,id=vga,bus=pci.0,addr=0x2' \
   -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
-  -drive
'file=/dev/pve/vm-198-disk-0,if=none,id=drive-sata0,format=raw,aio=threads,detect-zeroes=on,cache=writethrough'
\
+  -drive
'file=/dev/pve/vm-198-disk-0,if=none,id=drive-sata0,format=raw,aio=threads,detect-zeroes=on,cache=writeback'
\
   -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' \
   -machine 'type=pc'

[4]
[    1.911631] BUG: kernel NULL pointer dereference, address:
0000000000000000
[    1.912249] #PF: supervisor write access in kernel mode
[    1.912723] #PF: error_code(0x0002) - not-present page
[    1.913157] PGD 0 P4D 0
[    1.913405] Oops: Oops: 0002 [#1] PREEMPT SMP PTI
[    1.913787] CPU: 0 PID: 45 Comm: kworker/0:1H Tainted: G            E
     6.10.0-rc2-patch0604-6-10rc2+ #37
[    1.915161] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    1.916098] Workqueue: kblockd blk_mq_requeue_work
[    1.916501] RIP: 0010:_raw_spin_lock+0x13/0x60
[    1.916895] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 3c 42 4a 59 31 c0 ba 01 00 00
00 <3e> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
[    1.918434] RSP: 0000:ffffa31fc0177d78 EFLAGS: 00010246
[    1.918880] RAX: 0000000000000000 RBX: ffff8dcf4dcd0400 RCX:
00000000ffffffe0
[    1.919393] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
0000000000000000
[    1.920014] RBP: ffffa31fc0177d98 R08: 0000000000000000 R09:
0000000000000000
[    1.920615] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[    1.921216] R13: 0000000000000001 R14: ffffc31fbfc013c0 R15:
ffff8dcf4dcd3c54
[    1.921811] FS:  0000000000000000(0000) GS:ffff8dd077c00000(0000)
knlGS:0000000000000000
[    1.922471] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.922964] CR2: 0000000000000000 CR3: 000000010235a000 CR4:
00000000000006f0
[    1.923566] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[    1.924166] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[    1.924768] Call Trace:
[    1.924968]  <TASK>
[    1.925120]  ? show_regs+0x6c/0x80
[    1.925366]  ? __die+0x24/0x80
[    1.925580]  ? page_fault_oops+0x175/0x5e0
[    1.925862]  ? _raw_spin_lock+0x13/0x60
[    1.926127]  ? kernelmode_fixup_or_oops.constprop.0+0x69/0x90
[    1.926523]  ? __bad_area_nosemaphore+0x19f/0x280
[    1.926844]  ? bad_area_nosemaphore+0x16/0x30
[    1.927143]  ? do_user_addr_fault+0x2ce/0x690
[    1.927450]  ? exc_page_fault+0x82/0x1b0
[    1.927719]  ? asm_exc_page_fault+0x27/0x30
[    1.928006]  ? _raw_spin_lock+0x13/0x60
[    1.928273]  ? blk_mq_request_bypass_insert+0x20/0xe0
[    1.928617]  blk_mq_insert_request+0x120/0x1e0
[    1.928922]  blk_mq_requeue_work+0x18f/0x230
[    1.929216]  process_one_work+0x199/0x3e0
[    1.929496]  worker_thread+0x32a/0x500
[    1.929754]  ? __pfx_worker_thread+0x10/0x10
[    1.930047]  kthread+0xe4/0x110
[    1.930271]  ? __pfx_kthread+0x10/0x10
[    1.930531]  ret_from_fork+0x47/0x70
[    1.930781]  ? __pfx_kthread+0x10/0x10
[    1.931040]  ret_from_fork_asm+0x1a/0x30
[    1.931318]  </TASK>
[    1.931473] Modules linked in: efi_pstore(E) dmi_sysfs(E)
qemu_fw_cfg(E) ip_tables(E) x_tables(E) autofs4(E) psmouse(E) bochs(E)
drm_vram_helper(E) drm_ttm_helper(E) ahci(E) ttm(E) libahci(E)
i2c_piix4(E) pata_acpi(E) floppy(E)
[    1.932778] CR2: 0000000000000000
[    1.932993] ---[ end trace 0000000000000000 ]---
[    1.933289] RIP: 0010:_raw_spin_lock+0x13/0x60
[    1.933572] Code: 31 db c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 0f 1f 44 00 00 65 ff 05 3c 42 4a 59 31 c0 ba 01 00 00
00 <3e> 0f b1 17 75 1b 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9
[    1.934720] RSP: 0000:ffffa31fc0177d78 EFLAGS: 00010246
[    1.935049] RAX: 0000000000000000 RBX: ffff8dcf4dcd0400 RCX:
00000000ffffffe0
[    1.935510] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
0000000000000000
[    1.935967] RBP: ffffa31fc0177d98 R08: 0000000000000000 R09:
0000000000000000
[    1.936437] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[    1.937001] R13: 0000000000000001 R14: ffffc31fbfc013c0 R15:
ffff8dcf4dcd3c54
[    1.937481] FS:  0000000000000000(0000) GS:ffff8dd077c00000(0000)
knlGS:0000000000000000
[    1.937995] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.938599] CR2: 0000000000000000 CR3: 000000010235a000 CR4:
00000000000006f0
[    1.939320] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[    1.940067] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400

[5]
# lsblk --ascii
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                  8:0    1 465.8G  0 disk
sdb                  8:16   1 119.2G  0 disk
|-sdb1               8:17   1  1007K  0 part
|-sdb2               8:18   1     1G  0 part /boot/efi
`-sdb3               8:19   1 118.2G  0 part
  |-pve-swap       252:0    0     8G  0 lvm  [SWAP]
  |-pve-root       252:2    0  39.6G  0 lvm  /
  |-pve-data_tmeta 252:3    0     1G  0 lvm
  | `-pve-data     252:5    0  53.9G  0 lvm
  `-pve-data_tdata 252:4    0  53.9G  0 lvm
    `-pve-data     252:5    0  53.9G  0 lvm
sdc                  8:32   1 476.9G  0 disk
sdd                  8:48   1 446.9G  0 disk
# pvs
  PV         VG  Fmt  Attr PSize    PFree
  /dev/sdb3  pve lvm2 a--  <118.24g 14.75g
# vgs
  VG  #PV #LV #SN Attr   VSize    VFree
  pve   1   3   0 wz--n- <118.24g 14.75g
# lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log
Cpy%Sync Convert
  data pve twi-a-tz-- <53.93g             0.00   1.59
  root pve -wi-ao---- <39.56g
  swap pve -wi-ao----   8.00g
# findmnt -ascii
TARGET    SOURCE               FSTYPE OPTIONS
/         /dev/mapper/pve-root ext4   errors=remount-ro
/boot/efi UUID=6314-15A9       vfat   defaults
none      /dev/mapper/pve-swap swap   sw
/proc     proc                 proc   defaults
Chengming Zhou June 7, 2024, 2:37 a.m. UTC | #10
On 2024/6/6 16:44, Friedrich Weber wrote:
> On 05/06/2024 16:27, Chengming Zhou wrote:
>> On 2024/6/5 21:34, Friedrich Weber wrote:
>>> On 05/06/2024 12:54, Friedrich Weber wrote:
>>> [...]
>>>
>>> My results:
>>>
>>> Booting the Debian (virtual) machine with mainline kernel v6.10-rc2
>>> (c3f38fa61af77b49866b006939479069cd451173):
>>> works fine, no crash
>>>
>>> Booting the Debian (virtual) machine with patch "block: fix
>>> request.queuelist usage in flush" applied on top of v6.10-rc2: The
>>> Debian (virtual) machine crashes during boot with [1].
>>>
>>> Hope this helps! If I can provide anything else, just let me know.
>>
>> Thanks for your help, I still can't reproduce it myself, don't know why.
> 
> Weird -- when booting the Debian machine into mainline kernel v6.10-rc2
> with "block: fix request.queuelist usage in flush" applied on top, it
> crashes reliably for me. The machine having its root on LVM seems to be
> essential to reproduce the crash, though.

Yeah, right, it seems LVM may create this special request that has only
PREFLUSH | POSTFLUSH without any DATA, and it goes into the flush state
machine. That causes a double list_add_tail() on the request without a
list_del_init() in between. I don't know the reason behind it, but well,
it's allowed by the current flush code.
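
The double list_add_tail() can be illustrated with a minimal userspace
sketch. It assumes a simplified stand-in for the kernel's struct list_head
and two pending lists; first_list_left_stale() is a hypothetical helper
written for this illustration, not a function in blk-flush.c.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head (simplified). */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Link entry at the tail of head; the entry's old links are ignored. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev;

	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
	head->prev = entry;
}

/* Unlink entry from its current list and make it point to itself. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/*
 * Hypothetical helper: queue one request for PREFLUSH and then for
 * POSTFLUSH, with no DATA step in between. Returns true if the first
 * list still references the request after it was queued on the second.
 */
static bool first_list_left_stale(bool del_before_requeue)
{
	struct list_head pending[2], rq_queuelist;

	INIT_LIST_HEAD(&pending[0]);
	INIT_LIST_HEAD(&pending[1]);

	list_add_tail(&rq_queuelist, &pending[0]);	/* PREFLUSH */
	if (del_before_requeue)
		list_del_init(&rq_queuelist);		/* the tested addition */
	list_add_tail(&rq_queuelist, &pending[1]);	/* POSTFLUSH */

	return !list_empty(&pending[0]);
}
```

In this sketch, skipping the list_del_init() leaves the first list pointing
at a node whose links now belong to the second list, so walking the first
list would wander into the second. With the list_del_init() in place, the
first list is empty again before the request is requeued.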

> 
> Maybe the fact that I'm running the Debian machine virtualized makes the
> crash more likely to trigger. I'll try to reproduce on bare metal to
> narrow down the reproducer and get back to you.

Thanks a lot for the very detailed steps you posted in that thread!

> 
>> Could you help to test with this diff?
>>
>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>> index e7aebcf00714..cca4f9131f79 100644
>> --- a/block/blk-flush.c
>> +++ b/block/blk-flush.c
>> @@ -263,6 +263,7 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
>>                 unsigned int seq = blk_flush_cur_seq(rq);
>>
>>                 BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
>> +               list_del_init(&rq->queuelist);
>>                 blk_flush_complete_seq(rq, fq, seq, error);
>>         }
> 
> I used mainline kernel v6.10-rc2 as base and applied:
> 
> - "block: fix request.queuelist usage in flush"
> - Your `list_del_init` addition from above
> 
> and if I boot the Debian machine into this kernel, I do not get the
> crash anymore.

Good to hear. So can I merge these two diffs into one patch and add
your Tested-by?

> 
> Happy to run more tests for you, just let me know.

Thanks again!
Christoph Hellwig June 7, 2024, 4:55 a.m. UTC | #11
On Fri, Jun 07, 2024 at 10:37:58AM +0800, Chengming Zhou wrote:
> Yeah, right, it seems LVM may create this special request that only has
> PREFLUSH | POSTFLUSH without any DATA, goes into the flush state machine.
> Then, cause the request double list_add_tail() without list_del_init().
> I don't know the reason behind it, but well, it's allowable in the current
> flush code.

PREFLUSH | POSTFLUSH is a weird invalid format.  We'll need to fix this
in dm, and probably also catch it in the block layer submission path.
Chengming Zhou June 7, 2024, 6:24 a.m. UTC | #12
On 2024/6/7 12:55, Christoph Hellwig wrote:
> On Fri, Jun 07, 2024 at 10:37:58AM +0800, Chengming Zhou wrote:
>> Yeah, right, it seems LVM may create this special request that only has
>> PREFLUSH | POSTFLUSH without any DATA, goes into the flush state machine.
>> Then, cause the request double list_add_tail() without list_del_init().
>> I don't know the reason behind it, but well, it's allowable in the current
>> flush code.
> 
> PREFLUSH | POSTFLUSH is a weird invalid format.  We'll need to fix this
> in dm, and probably also catch it in the block layer submission path.
> 

Right, how about adding a WARN here to catch it? Or just setting it to PREFLUSH?
I'm not familiar with the dm code, so I'd need help if we need to fix it in dm. :)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index c17cf8ed8113..3ce9ed78c375 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -185,7 +185,7 @@ static void blk_flush_complete_seq(struct request *rq,
                /* queue for flush */
                if (list_empty(pending))
                        fq->flush_pending_since = jiffies;
-               list_move_tail(&rq->queuelist, pending);
+               list_add_tail(&rq->queuelist, pending);
                break;

        case REQ_FSEQ_DATA:
@@ -263,6 +263,7 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
                unsigned int seq = blk_flush_cur_seq(rq);

                BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
+               list_del_init(&rq->queuelist);
                blk_flush_complete_seq(rq, fq, seq, error);
        }

@@ -402,6 +403,12 @@ bool blk_insert_flush(struct request *rq)
        unsigned int policy = blk_flush_policy(fflags, rq);
        struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);

+       /*
+        * PREFLUSH | POSTFLUSH is a weird invalid format,
+        * need to fix in the upper layer, catch it here.
+        */
+       WARN_ON_ONCE(policy == (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH));
+
        /* FLUSH/FUA request must never be merged */
        WARN_ON_ONCE(rq->bio != rq->biotail);
Christoph Hellwig June 7, 2024, 6:31 a.m. UTC | #13
On Fri, Jun 07, 2024 at 02:24:52PM +0800, Chengming Zhou wrote:
> Right, how about add WARN here to catch it? Or just set it to PREFLUSH?
> Not familiar with dm code, need help if we need to fix it in dm. :)

We'll need to fix dm first.  I'll take a look if I can reproduce it.
Let's keep the list_del_init fix in first; I hope I can allocate some
time to this soon.
Chengming Zhou June 7, 2024, 6:33 a.m. UTC | #14
On 2024/6/7 14:31, Christoph Hellwig wrote:
> On Fri, Jun 07, 2024 at 02:24:52PM +0800, Chengming Zhou wrote:
>> Right, how about add WARN here to catch it? Or just set it to PREFLUSH?
>> Not familiar with dm code, need help if we need to fix it in dm. :)
> 
> We'll need to fix dm first.  I'll take a look if I can reproduce it.
> Let's kept the list_del_init fix in first, I hope I can allocate some
> time to this soon.

Ok, it's great, thanks!
Friedrich Weber June 7, 2024, 3:13 p.m. UTC | #15
On 07/06/2024 04:37, Chengming Zhou wrote:
> On 2024/6/6 16:44, Friedrich Weber wrote:
>> [...]
>>
>> I used mainline kernel v6.10-rc2 as base and applied:
>>
>> - "block: fix request.queuelist usage in flush"
>> - Your `list_del_init` addition from above
>>
>> and if I boot the Debian machine into this kernel, I do not get the
>> crash anymore.
> 
> Good to hear. So can I merge these two diffs into one patch and add
> your Tested-by?

I applied your merged patch [1] on top of mainline v6.10-rc2 (c3f38fa61a):

- I cannot reproduce the crash from [0] anymore in the (virtual) machine
with root (on LVM) on software RAID1

- I cannot reproduce the `blk_mq_requeue_work` crash from this thread
anymore in the Debian VM with root on LVM. With cache=writeback for the
VM disk, I get the expected in-guest WARNING [2] on VM boot.

- No crashes on bare-metal either. If write caching is enabled, I get
the expected WARNING.

So this looks good to me:

Tested-by: Friedrich Weber <f.weber@proxmox.com>

We might backport the merged patch (minus the WARN, possibly) to our
downstream Proxmox VE kernel 6.9 to fix the software RAID crash [0] --
if I understand correctly, the merged patch should be safe for now until
dm is fixed.

Thanks a lot for your work on this!

Best,

Friedrich

[0]
https://lore.kernel.org/all/14b89dfb-505c-49f7-aebb-01c54451db40@proxmox.com/

[1]

diff --git a/block/blk-flush.c b/block/blk-flush.c
index c17cf8ed8113..3d72393a1710 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -185,7 +185,7 @@ static void blk_flush_complete_seq(struct request *rq,
 		/* queue for flush */
 		if (list_empty(pending))
 			fq->flush_pending_since = jiffies;
-		list_move_tail(&rq->queuelist, pending);
+               list_add_tail(&rq->queuelist, pending);
 		break;

 	case REQ_FSEQ_DATA:
@@ -263,6 +263,7 @@ static enum rq_end_io_ret flush_end_io(struct
request *flush_rq,
 		unsigned int seq = blk_flush_cur_seq(rq);

 		BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
+               list_del_init(&rq->queuelist);
 		blk_flush_complete_seq(rq, fq, seq, error);
 	}

@@ -402,6 +403,12 @@ bool blk_insert_flush(struct request *rq)
 	unsigned int policy = blk_flush_policy(fflags, rq);
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);

+       /*
+        * PREFLUSH | POSTFLUSH is a weird invalid format,
+        * need to fix in the upper layer, catch it here.
+        */
+       WARN_ON_ONCE(policy == (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH));
+
 	/* FLUSH/FUA request must never be merged */
 	WARN_ON_ONCE(rq->bio != rq->biotail);

[2]

[    2.142204] ------------[ cut here ]------------
[    2.142206] WARNING: CPU: 0 PID: 179 at block/blk-flush.c:410
blk_insert_flush+0xff/0x270
[    2.142211] Modules linked in: efi_pstore(E) dmi_sysfs(E)
qemu_fw_cfg(E) ip_tables(E) x_tables(E) autofs4(E) hid_generic(E)
usbhid(E) hid(E) psmouse(E) bochs(E) drm_vram_helper(E)
drm_ttm_helper(E) ahci(E) ttm(E) i2c_piix4(E) uhci_hcd(E) ehci_hcd(E)
libahci(E) pata_acpi(E) floppy(E)
[    2.142225] CPU: 0 PID: 179 Comm: jbd2/dm-0-8 Tainted: G            E
     6.10.0-rc2-nohardened-patch0607+ #41
[    2.142226] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    2.142227] RIP: 0010:blk_insert_flush+0xff/0x270
[    2.142229] Code: cc cc cc cc a9 00 00 04 00 74 3d 44 89 e2 83 ca 01
4d 85 c0 75 69 a9 00 00 02 00 0f 84 15 01 00 00 45 85 e4 0f 85 59 01 00
00 <0f> 0b 48 39 ce 0f 85 44 01 00 00 25 ff ff f9 ff 41 bc 05 00 00 00
[    2.142230] RSP: 0018:ffffa608c0303a30 EFLAGS: 00010246
[    2.142231] RAX: 0000000000069801 RBX: ffff93b70dc89600 RCX:
ffff93b70dd7baf8
[    2.142233] RDX: 0000000000000001 RSI: ffff93b70dd7baf8 RDI:
ffff93b70d545e00
[    2.142233] RBP: ffffa608c0303a48 R08: 0000000000000000 R09:
0000000000000000
[    2.142234] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[    2.142235] R13: ffff93b70d127980 R14: 0000000000000000 R15:
ffff93b70dc89600
[    2.142236] FS:  0000000000000000(0000) GS:ffff93b837c00000(0000)
knlGS:0000000000000000
[    2.142237] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.142238] CR2: 00005d28ed347fb8 CR3: 000000010d62e000 CR4:
00000000000006f0
[    2.142240] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[    2.142240] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[    2.142241] Call Trace:
[    2.142244]  <TASK>
[    2.142248]  ? show_regs+0x6c/0x80
[    2.142251]  ? __warn+0x88/0x140
[    2.142253]  ? blk_insert_flush+0xff/0x270
[    2.142254]  ? report_bug+0x182/0x1b0
[    2.142256]  ? handle_bug+0x46/0x90
[    2.142258]  ? exc_invalid_op+0x18/0x80
[    2.142259]  ? asm_exc_invalid_op+0x1b/0x20
[    2.142261]  ? blk_insert_flush+0xff/0x270
[    2.142262]  blk_mq_submit_bio+0x5c9/0x740
[    2.142265]  __submit_bio+0x6e/0x250
[    2.142267]  submit_bio_noacct_nocheck+0x1a3/0x3c0
[    2.142269]  submit_bio_noacct+0x1dc/0x650
[    2.142271]  submit_bio+0xb1/0x110
[    2.142272]  submit_bh_wbc+0x163/0x1a0
[    2.142274]  submit_bh+0x12/0x20
[    2.142275]  journal_submit_commit_record+0x1c5/0x250
[    2.142278]  jbd2_journal_commit_transaction+0x120d/0x1960
[    2.142281]  ? __schedule+0x408/0x15d0
[    2.142284]  kjournald2+0xaa/0x280
[    2.142285]  ? __pfx_autoremove_wake_function+0x10/0x10
[    2.142288]  ? __pfx_kjournald2+0x10/0x10
[    2.142289]  kthread+0xe4/0x110
[    2.142291]  ? __pfx_kthread+0x10/0x10
[    2.142292]  ret_from_fork+0x47/0x70
[    2.142294]  ? __pfx_kthread+0x10/0x10
[    2.142295]  ret_from_fork_asm+0x1a/0x30
[    2.142298]  </TASK>
[    2.142298] ---[ end trace 0000000000000000 ]---

Patch

diff --git a/block/blk-flush.c b/block/blk-flush.c
index c17cf8ed8113..e7aebcf00714 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -185,7 +185,7 @@  static void blk_flush_complete_seq(struct request *rq,
 		/* queue for flush */
 		if (list_empty(pending))
 			fq->flush_pending_since = jiffies;
-		list_move_tail(&rq->queuelist, pending);
+		list_add_tail(&rq->queuelist, pending);
 		break;
 
 	case REQ_FSEQ_DATA: