block:added printing when bio->bi_status fails

Message ID	tencent_9A3345EA79C1EE9DC4464BB576C6A602A105@qq.com (mailing list archive)
State	New, archived
Headers	show Received: from out162-62-58-211.mail.qq.com (out162-62-58-211.mail.qq.com [162.62.58.211]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 131FA189F33; Thu, 8 Aug 2024 10:13:03 +0000 (UTC) Message-ID: <tencent_9A3345EA79C1EE9DC4464BB576C6A602A105@qq.com> From: 824731276@qq.com To: axboe@kernel.dk Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, baiguo@kylinos.cn Subject: [PATCH] block:added printing when bio->bi_status fails Date: Thu, 8 Aug 2024 17:54:48 +0800 Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	block:added printing when bio->bi_status fails \| expand block:added printing when bio->bi_status fails

824731276@qq.com Aug. 8, 2024, 9:54 a.m. UTC

From: baiguo <baiguo@kylinos.cn>

    When ftrace is not enabled and bio is not OK,
    the system cannot actively record which disk is abnormal.
    Add a message record to bio_endio.

Signed-off-by: baiguo <baiguo@kylinos.cn>
---
 block/bio.c | 5 +++++
 1 file changed, 5 insertions(+)

kernel test robot Aug. 16, 2024, 4:04 a.m. UTC | #1

Hello,

kernel test robot noticed "WARNING:at_fs/buffer.c:#mark_buffer_dirty" on:

commit: 0824beb1d430c30731166484b8c26e37147d4dbb ("[PATCH] block:added printing when bio->bi_status fails")
url: https://github.com/intel-lab-lkp/linux/commits/824731276-qq-com/block-added-printing-when-bio-bi_status-fails/20240808-181758
base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/all/tencent_9A3345EA79C1EE9DC4464BB576C6A602A105@qq.com/
patch subject: [PATCH] block:added printing when bio->bi_status fails

in testcase: xfstests
version: xfstests-x86_64-f5ada754-1_20240812
with following parameters:

	disk: 4HDD
	fs: udf
	test: generic-081



compiler: gcc-12
test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Skylake) with 28G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202408161114.dfe9cc49-oliver.sang@intel.com


we do see a lot of print that:

[   68.449409][   T89] bio: bio_endio status is 10, disk[253:2]
[   68.454169][ T2095] bio: bio_endio status is 10, disk[253:3]
[   68.455863][   T89] bio: bio_endio status is 10, disk[253:2]
[   68.466598][   T89] bio: bio_endio status is 10, disk[253:2]
[   68.472314][   T89] bio: bio_endio status is 10, disk[253:2]

...

[   74.216172][   T89] bio: bio_endio status is 10, disk[253:2]
[   74.221903][   T89] bio: bio_endio status is 10, disk[253:2]
[   74.2:2]
[   74.348274][   T89] bio: bio_endio status is 10, disk[253:3]
[   74.356178][ T2096] ------------[ cut here ]------------
[   74.361531][ T2096] WARNING: CPU: 0 PID: 2096 at fs/buffer.c:1181 mark_buffer_dirty+0x1e6/0x240


then see below WARNING


[   72.605562][ T2097] ------------[ cut here ]------------
[ 72.610907][ T2097] WARNING: CPU: 7 PID: 2097 at fs/buffer.c:1181 mark_buffer_dirty (fs/buffer.c:1181 (discriminator 1)) 
[   72.619661][ T2097] Modules linked in: dm_snapshot dm_bufio udf crc_itu_t cdrom btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp sd_mod ipmi_devintf sg ipmi_msghandler kvm_intel i915 kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 drm_buddy ahci intel_gtt rapl mei_wdt drm_display_helper libahci intel_cstate wmi_bmof ttm mei_me i2c_i801 intel_uncore libata drm_kms_helper i2c_smbus mei intel_pch_thermal video wmi acpi_pad binfmt_misc loop fuse drm dm_mod ip_tables
[   72.671579][ T2097] CPU: 7 UID: 0 PID: 2097 Comm: xfs_io Not tainted 6.11.0-rc1-00021-g0824beb1d430 #1
[   72.680936][ T2097] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 01/26/2016
[ 72.689071][ T2097] RIP: 0010:mark_buffer_dirty (fs/buffer.c:1181 (discriminator 1)) 
[ 72.694676][ T2097] Code: 58 c6 ff 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 75 60 48 8b 7d 00 5b be 04 00 00 00 5d e9 3a fc fc ff <0f> 0b e9 34 fe ff ff 48 89 df e8 5b 23 e5 ff e9 54 fe ff ff 48 89
All code
========
   0:	58                   	pop    %rax
   1:	c6                   	(bad)  
   2:	ff 48 89             	decl   -0x77(%rax)
   5:	ea                   	(bad)  
   6:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
   d:	fc ff df 
  10:	48 c1 ea 03          	shr    $0x3,%rdx
  14:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
  18:	75 60                	jne    0x7a
  1a:	48 8b 7d 00          	mov    0x0(%rbp),%rdi
  1e:	5b                   	pop    %rbx
  1f:	be 04 00 00 00       	mov    $0x4,%esi
  24:	5d                   	pop    %rbp
  25:	e9 3a fc fc ff       	jmpq   0xfffffffffffcfc64
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	e9 34 fe ff ff       	jmpq   0xfffffffffffffe65
  31:	48 89 df             	mov    %rbx,%rdi
  34:	e8 5b 23 e5 ff       	callq  0xffffffffffe52394
  39:	e9 54 fe ff ff       	jmpq   0xfffffffffffffe92
  3e:	48                   	rex.W
  3f:	89                   	.byte 0x89

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	e9 34 fe ff ff       	jmpq   0xfffffffffffffe3b
   7:	48 89 df             	mov    %rbx,%rdi
   a:	e8 5b 23 e5 ff       	callq  0xffffffffffe5236a
   f:	e9 54 fe ff ff       	jmpq   0xfffffffffffffe68
  14:	48                   	rex.W
  15:	89                   	.byte 0x89
[   72.714238][ T2097] RSP: 0018:ffffc900033df8a0 EFLAGS: 00010246
[   72.720206][ T2097] RAX: 0000000000000001 RBX: ffff888120d219d8 RCX: ffffffff81c2b878
[   72.728093][ T2097] RDX: ffffed10241a433c RSI: 0000000000000008 RDI: ffff888120d219d8
[   72.735965][ T2097] RBP: ffff888120d219d8 R08: 0000000000000000 R09: ffffed10241a433b
[   72.743867][ T2097] R10: ffff888120d219df R11: 0000000000000008 R12: ffffed10241a4340
[   72.751752][ T2097] R13: 0000000000000004 R14: ffff8887472e8000 R15: 0000000000000948
[   72.759638][ T2097] FS:  0000000000000000(0000) GS:ffff888634180000(0000) knlGS:0000000000000000
[   72.768470][ T2097] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   72.774945][ T2097] CR2: 000055d09397fc48 CR3: 000000075685c004 CR4: 00000000003706f0
[   72.782843][ T2097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   72.790713][ T2097] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   72.798583][ T2097] Call Trace:
[   72.801752][ T2097]  <TASK>
[ 72.804567][ T2097] ? __warn (kernel/panic.c:735) 
[ 72.808515][ T2097] ? mark_buffer_dirty (fs/buffer.c:1181 (discriminator 1)) 
[ 72.813514][ T2097] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 72.817908][ T2097] ? handle_bug (arch/x86/kernel/traps.c:239) 
[ 72.822125][ T2097] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) 
[ 72.826692][ T2097] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621) 
[ 72.831612][ T2097] ? mark_buffer_dirty (arch/x86/include/asm/bitops.h:213 arch/x86/include/asm/bitops.h:245 include/asm-generic/bitops/instrumented-non-atomic.h:154 include/linux/buffer_head.h:171 fs/buffer.c:1181) 
[ 72.836529][ T2097] ? mark_buffer_dirty (fs/buffer.c:1181 (discriminator 1)) 
[ 72.841536][ T2097] udf_bitmap_free_blocks (fs/udf/balloc.c:164) udf
[ 72.847326][ T2097] udf_free_blocks (fs/udf/balloc.c:662) udf
[ 72.852530][ T2097] udf_discard_prealloc (fs/udf/truncate.c:147) udf
[ 72.858161][ T2097] ? __pfx_udf_discard_prealloc (fs/udf/truncate.c:118) udf
[ 72.864303][ T2097] ? __pfx_down_write (kernel/locking/rwsem.c:1577) 
[ 72.869051][ T2097] ? __pfx_locks_remove_file (fs/locks.c:2687) 
[ 72.874412][ T2097] udf_release_file (fs/udf/file.c:185 fs/udf/file.c:174) udf
[ 72.879584][ T2097] ? security_file_release (security/security.c:2754 (discriminator 11)) 
[ 72.884757][ T2097] __fput (fs/file_table.c:422) 
[ 72.888638][ T2097] task_work_run (kernel/task_work.c:222 (discriminator 1)) 
[ 72.893108][ T2097] ? __pfx_task_work_run (kernel/task_work.c:190) 
[ 72.898101][ T2097] do_exit (kernel/exit.c:883) 
[ 72.902048][ T2097] ? __pfx_do_exit (kernel/exit.c:821) 
[ 72.906518][ T2097] ? _raw_spin_lock_irq (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-arch-fallback.h:2170 include/linux/atomic/atomic-instrumented.h:1302 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:187 include/linux/spinlock_api_smp.h:120 kernel/locking/spinlock.c:170) 
[ 72.911423][ T2097] do_group_exit (kernel/exit.c:1012) 
[ 72.915818][ T2097] get_signal (include/linux/signal.h:78 kernel/signal.c:2751) 
[ 72.920215][ T2097] ? finish_task_switch+0x495/0x750 
[ 72.925907][ T2097] ? __switch_to (arch/x86/include/asm/bitops.h:55 include/asm-generic/bitops/instrumented-atomic.h:29 include/linux/thread_info.h:89 include/linux/sched.h:1945 arch/x86/include/asm/fpu/sched.h:68 arch/x86/kernel/process_64.c:674) 
[ 72.930377][ T2097] ? __pfx_get_signal (kernel/signal.c:2682) 
[ 72.935108][ T2097] ? __schedule (kernel/sched/core.c:6399) 
[ 72.939576][ T2097] arch_do_signal_or_restart (arch/x86/kernel/signal.c:310) 
[ 72.945005][ T2097] ? __pfx_arch_do_signal_or_restart (arch/x86/kernel/signal.c:307) 
[ 72.951046][ T2097] syscall_exit_to_user_mode (kernel/entry/common.c:111 include/linux/entry-common.h:328 kernel/entry/common.c:207 kernel/entry/common.c:218) 
[ 72.956561][ T2097] do_syscall_64 (arch/x86/entry/common.c:102) 
[ 72.960942][ T2097] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[   72.966720][ T2097] RIP: 0033:0x7f896a5efd32
[ 72.971012][ T2097] Code: Unable to access opcode bytes at 0x7f896a5efd08.

Code starting with the faulting instruction
===========================================
[   72.977922][ T2097] RSP: 002b:00007f896a1ffdb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000022
[   72.986232][ T2097] RAX: fffffffffffffdfe RBX: 00007f896a2006c0 RCX: 00007f896a5efd32
[   72.994101][ T2097] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000
[   73.001969][ T2097] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffdb5bc88c7
[   73.009844][ T2097] R10: 00007f896a535f80 R11: 0000000000000293 R12: ffffffffffffff80
[   73.017712][ T2097] R13: 0000000000000002 R14: 00007ffdb5bc87d0 R15: 00007f8969a00000
[   73.025582][ T2097]  </TASK>
[   73.028478][ T2097] ---[ end trace 0000000000000000 ]---
[   73.034147][   T66] bio: bio_endio status is 10, disk[253:3]
[   73.034167][ T2097] bio: bio_endio status is 10, disk[253:3]
[   73.039846][   T66] Buffer I/O error on dev dm-3, logical block 259, lost async page write
[   73.045565][ T2097] Buffer I/O error on dev dm-3, logical block 128, lost async page write
[   73.053894][   T66] bio: bio_endio status is 10, disk[253:3]
[   73.067887][   T66] Buffer I/O error on dev dm-3, logical block 387, lost async page write
[   73.076268][   T66] bio: bio_endio status is 10, disk[253:3]
[   73.081976][   T66] Buffer I/O error on dev dm-3, logical block 388, lost async page write
[   73.180045][ T2097] bio: bio_endio status is 10, disk[253:3]
[   73.185743][ T2097] Buffer I/O error on dev dm-3, logical block 128, lost sync page write



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240816/202408161114.dfe9cc49-oliver.sang@intel.com

Yu Kuai Aug. 16, 2024, 7:45 a.m. UTC | #2

Hi,

在 2024/08/08 17:54, 824731276@qq.com 写道:
> From: baiguo <baiguo@kylinos.cn>
> 
>      When ftrace is not enabled and bio is not OK,
>      the system cannot actively record which disk is abnormal.
>      Add a message record to bio_endio.
> 
> Signed-off-by: baiguo <baiguo@kylinos.cn>
> ---
>   block/bio.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index c4053d496..fb07589c8 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1617,6 +1617,11 @@ void bio_endio(struct bio *bio)
>   		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
>   	}
>   
> +	if (bio->bi_status && bio->bi_bdev)
> +		printk(KERN_ERR "bio: %s status is %d, disk[%d:%d]\n",\
> +				__func__, bio->bi_status, bio->bi_bdev->bd_disk->major,\
> +				bio->bi_bdev->bd_disk->first_minor);

I don't understand why you'll need this, bio_endio() will still be
called for unsupported bio from submit_bio_noacct() when the disk is
fine.

For real disks blk_print_req_error() already print message for failed IO
that are submitted to disk.

Thanks,
Kuai

> +
>   	/*
>   	 * Need to have a real endio function for chained bios, otherwise
>   	 * various corner cases will break (like stacking block devices that
>

block:added printing when bio->bi_status fails

Commit Message

Comments

Patch