diff mbox series

[v2] blk-mq: release scheduler resource when request complete

Message ID 20230813152325.3017343-1-chengming.zhou@linux.dev (mailing list archive)
State New, archived
Headers show
Series [v2] blk-mq: release scheduler resource when request complete | expand

Commit Message

Chengming Zhou Aug. 13, 2023, 3:23 p.m. UTC
From: Chengming Zhou <zhouchengming@bytedance.com>

Chuck reported [1] a IO hang problem on NFS exports that reside on SATA
devices and bisected to commit 615939a2ae73 ("blk-mq: defer to the normal
submission path for post-flush requests").

We analysed the IO hang problem, found there are two postflush requests
are waiting for each other.

The first postflush request completed the REQ_FSEQ_DATA sequence, so go to
the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but
failed to blk_kick_flush() because of the second postflush request which
is inflight waiting in scheduler queue.

The second postflush waiting in scheduler queue can't be dispatched because
the first postflush hasn't released scheduler resource even though it has
completed by itself.

Fix it by releasing scheduler resource when the first postflush request
completed, so the second postflush can be dispatched and completed, then
make blk_kick_flush() succeed.

[1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/

Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
---
v2:
 - All IO schedulers do set ->finish_request(), so remove the
   check and warn on not setting when register.
---
 block/blk-mq.c   | 16 ++++++++++++----
 block/elevator.c |  3 +++
 2 files changed, 15 insertions(+), 4 deletions(-)

Comments

Jens Axboe Aug. 13, 2023, 3:34 p.m. UTC | #1
On 8/13/23 9:23 AM, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Chuck reported [1] a IO hang problem on NFS exports that reside on SATA
> devices and bisected to commit 615939a2ae73 ("blk-mq: defer to the normal
> submission path for post-flush requests").
> 
> We analysed the IO hang problem, found there are two postflush requests
> are waiting for each other.
> 
> The first postflush request completed the REQ_FSEQ_DATA sequence, so go to
> the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but
> failed to blk_kick_flush() because of the second postflush request which
> is inflight waiting in scheduler queue.
> 
> The second postflush waiting in scheduler queue can't be dispatched because
> the first postflush hasn't released scheduler resource even though it has
> completed by itself.
> 
> Fix it by releasing scheduler resource when the first postflush request
> completed, so the second postflush can be dispatched and completed, then
> make blk_kick_flush() succeed.

Thanks, applied with a bit of commit message massaging and adding a
comment for the newly added WARN_ON_ONCE().
Chengming Zhou Aug. 13, 2023, 3:45 p.m. UTC | #2
On 2023/8/13 23:34, Jens Axboe wrote:
> On 8/13/23 9:23 AM, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> Chuck reported [1] a IO hang problem on NFS exports that reside on SATA
>> devices and bisected to commit 615939a2ae73 ("blk-mq: defer to the normal
>> submission path for post-flush requests").
>>
>> We analysed the IO hang problem, found there are two postflush requests
>> are waiting for each other.
>>
>> The first postflush request completed the REQ_FSEQ_DATA sequence, so go to
>> the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but
>> failed to blk_kick_flush() because of the second postflush request which
>> is inflight waiting in scheduler queue.
>>
>> The second postflush waiting in scheduler queue can't be dispatched because
>> the first postflush hasn't released scheduler resource even though it has
>> completed by itself.
>>
>> Fix it by releasing scheduler resource when the first postflush request
>> completed, so the second postflush can be dispatched and completed, then
>> make blk_kick_flush() succeed.
> 
> Thanks, applied with a bit of commit message massaging and adding a
> comment for the newly added WARN_ON_ONCE().
> 

Ok, thanks!
Jens Axboe Aug. 14, 2023, 9:42 p.m. UTC | #3
On Sun, 13 Aug 2023 23:23:25 +0800, chengming.zhou@linux.dev wrote:
> Chuck reported [1] a IO hang problem on NFS exports that reside on SATA
> devices and bisected to commit 615939a2ae73 ("blk-mq: defer to the normal
> submission path for post-flush requests").
> 
> We analysed the IO hang problem, found there are two postflush requests
> are waiting for each other.
> 
> [...]

Applied, thanks!

[1/1] blk-mq: release scheduler resource when request complete
      (no commit info)

Best regards,
kernel test robot Aug. 17, 2023, 2:41 p.m. UTC | #4
Hello,

kernel test robot noticed "WARNING:at_block/mq-deadline.c:#dd_exit_sched" on:

commit: 567b5a4f2f8b9e9fe97d5f8dca266d95c734ae91 ("[PATCH v2] blk-mq: release scheduler resource when request complete")
url: https://github.com/intel-lab-lkp/linux/commits/chengming-zhou-linux-dev/blk-mq-release-scheduler-resource-when-request-complete/20230813-232513
base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/all/20230813152325.3017343-1-chengming.zhou@linux.dev/
patch subject: [PATCH v2] blk-mq: release scheduler resource when request complete

this also caused xfstests.generic.704.fail

in testcase: xfstests
version: xfstests-x86_64-bb8af9c-1_20230807
with following parameters:

	disk: 4HDD
	fs: ext4
	test: generic-704



compiler: gcc-12
test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202308172100.8ce4b853-oliver.sang@intel.com


[  222.617396][ T2216] ------------[ cut here ]------------
[  222.622837][ T2216] statistics for priority 1: i 276 m 0 d 276 c 278
[ 222.629307][ T2216] WARNING: CPU: 0 PID: 2216 at block/mq-deadline.c:680 dd_exit_sched (block/mq-deadline.c:680 (discriminator 3)) 
[  222.638218][ T2216] Modules linked in: scsi_debug(-) dm_mod btrfs blake2b_generic xor intel_rapl_msr raid6_pq intel_rapl_common zstd_compress libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel sd_mod t10_pi kvm crc64_rocksoft_generic crc64_rocksoft crc64 irqbypass sg crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 ipmi_devintf ipmi_msghandler mei_wdt i915 rapl drm_buddy intel_cstate intel_gtt wmi_bmof drm_display_helper intel_uncore drm_kms_helper ahci libahci mei_me libata intel_pch_thermal ttm mei video wmi intel_pmc_core acpi_pad drm fuse ip_tables
[  222.690679][ T2216] CPU: 0 PID: 2216 Comm: modprobe Tainted: G          I        6.5.0-rc5-00190-g567b5a4f2f8b #1
[  222.700878][ T2216] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
[ 222.708919][ T2216] RIP: 0010:dd_exit_sched (block/mq-deadline.c:680 (discriminator 3)) 
[ 222.714130][ T2216] Code: 89 da 8b 4b 04 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e c4 00 00 00 8b 13 44 89 e6 48 c7 c7 20 d3 e2 83 e8 e2 d1 08 ff <0f> 0b e9 f4 fe ff ff 0f 0b e9 78 fe ff ff 48 89 ee 48 c7 c7 20 cb
All code
========
   0:	89 da                	mov    %ebx,%edx
   2:	8b 4b 04             	mov    0x4(%rbx),%ecx
   5:	48 c1 ea 03          	shr    $0x3,%rdx
   9:	0f b6 04 02          	movzbl (%rdx,%rax,1),%eax
   d:	84 c0                	test   %al,%al
   f:	74 08                	je     0x19
  11:	3c 03                	cmp    $0x3,%al
  13:	0f 8e c4 00 00 00    	jle    0xdd
  19:	8b 13                	mov    (%rbx),%edx
  1b:	44 89 e6             	mov    %r12d,%esi
  1e:	48 c7 c7 20 d3 e2 83 	mov    $0xffffffff83e2d320,%rdi
  25:	e8 e2 d1 08 ff       	callq  0xffffffffff08d20c
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	e9 f4 fe ff ff       	jmpq   0xffffffffffffff25
  31:	0f 0b                	ud2    
  33:	e9 78 fe ff ff       	jmpq   0xfffffffffffffeb0
  38:	48 89 ee             	mov    %rbp,%rsi
  3b:	48                   	rex.W
  3c:	c7                   	.byte 0xc7
  3d:	c7                   	(bad)  
  3e:	20 cb                	and    %cl,%bl

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	e9 f4 fe ff ff       	jmpq   0xfffffffffffffefb
   7:	0f 0b                	ud2    
   9:	e9 78 fe ff ff       	jmpq   0xfffffffffffffe86
   e:	48 89 ee             	mov    %rbp,%rsi
  11:	48                   	rex.W
  12:	c7                   	.byte 0xc7
  13:	c7                   	(bad)  
  14:	20 cb                	and    %cl,%bl
[  222.733468][ T2216] RSP: 0018:ffffc9000092f858 EFLAGS: 00010282
[  222.739367][ T2216] RAX: 0000000000000000 RBX: ffff88819e33e0b0 RCX: ffffffff821fb87e
[  222.747166][ T2216] RDX: 1ffff110fe1c57b8 RSI: 0000000000000008 RDI: ffffc9000092f610
[  222.754975][ T2216] RBP: 0000000000000116 R08: 0000000000000001 R09: fffff52000125ec2
[  222.762782][ T2216] R10: ffffc9000092f617 R11: 0000000000000001 R12: 0000000000000001
[  222.770591][ T2216] R13: dffffc0000000000 R14: ffff88819e33e000 R15: ffff88819e33e0bc
[  222.778386][ T2216] FS:  00007fd9aea5e540(0000) GS:ffff8887f0e00000(0000) knlGS:0000000000000000
[  222.787144][ T2216] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.793561][ T2216] CR2: 00007fbe00432bb0 CR3: 0000000126856005 CR4: 00000000003706f0
[  222.801359][ T2216] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  222.809156][ T2216] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  222.816970][ T2216] Call Trace:
[  222.820093][ T2216]  <TASK>
[ 222.822872][ T2216] ? __warn (kernel/panic.c:673) 
[ 222.826794][ T2216] ? dd_exit_sched (block/mq-deadline.c:680 (discriminator 3)) 
[ 222.831403][ T2216] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 222.835752][ T2216] ? handle_bug (arch/x86/kernel/traps.c:324) 
[ 222.839912][ T2216] ? exc_invalid_op (arch/x86/kernel/traps.c:345 (discriminator 1)) 
[ 222.844433][ T2216] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
[ 222.849286][ T2216] ? llist_add_batch (lib/llist.c:33 (discriminator 14)) 
[ 222.853980][ T2216] ? dd_exit_sched (block/mq-deadline.c:680 (discriminator 3)) 
[ 222.858586][ T2216] ? dd_exit_sched (block/mq-deadline.c:680 (discriminator 3)) 
[ 222.863193][ T2216] blk_mq_exit_sched (block/blk-mq-sched.c:550) 
[ 222.867973][ T2216] ? blk_mq_sched_free_rqs (block/blk-mq-sched.c:527) 
[ 222.873257][ T2216] ? __mutex_lock_slowpath (kernel/locking/mutex.c:282) 
[ 222.878384][ T2216] ? ioc_clear_queue (include/linux/list.h:292 block/blk-ioc.c:174) 
[ 222.883076][ T2216] ? blk_mq_poll (block/blk-mq.c:4832) 
[ 222.887324][ T2216] elevator_exit (block/elevator.c:168) 
[ 222.891572][ T2216] del_gendisk (block/genhd.c:701) 
[ 222.895834][ T2216] ? diskstats_show (block/genhd.c:631) 
[ 222.900526][ T2216] ? __pm_runtime_resume (drivers/base/power/runtime.c:1174) 
[ 222.905568][ T2216] sd_remove (drivers/scsi/sd.c:3736) sd_mod
[ 222.910436][ T2216] device_release_driver_internal (drivers/base/dd.c:1272 drivers/base/dd.c:1293) 
[ 222.916335][ T2216] bus_remove_device (include/linux/kobject.h:191 drivers/base/base.h:73 drivers/base/bus.c:581) 
[ 222.921225][ T2216] device_del (drivers/base/core.c:3815) 
[ 222.925388][ T2216] ? attribute_container_device_trigger (drivers/base/attribute_container.c:357) 
[ 222.931807][ T2216] ? __device_link_del (drivers/base/core.c:3769) 
[ 222.936760][ T2216] ? attribute_container_device_trigger_safe (drivers/base/attribute_container.c:357) 
[ 222.943612][ T2216] __scsi_remove_device (drivers/scsi/scsi_sysfs.c:1489) 
[ 222.948651][ T2216] scsi_forget_host (drivers/scsi/scsi_scan.c:1988) 
[ 222.953243][ T2216] scsi_remove_host (drivers/scsi/hosts.c:182) 
[ 222.957838][ T2216] sdebug_driver_remove (drivers/scsi/scsi_debug.c:7841) scsi_debug
[ 222.963933][ T2216] ? kernfs_remove_by_name_ns (fs/kernfs/dir.c:1679) 
[ 222.969492][ T2216] device_release_driver_internal (drivers/base/dd.c:1272 drivers/base/dd.c:1293) 
[ 222.975392][ T2216] bus_remove_device (include/linux/kobject.h:191 drivers/base/base.h:73 drivers/base/bus.c:581) 
[ 222.980173][ T2216] device_del (drivers/base/core.c:3815) 
[ 222.984349][ T2216] ? __device_link_del (drivers/base/core.c:3769) 
[ 222.989299][ T2216] ? mutex_unlock (arch/x86/include/asm/atomic64_64.h:109 include/linux/atomic/atomic-arch-fallback.h:4303 include/linux/atomic/atomic-long.h:1499 include/linux/atomic/atomic-instrumented.h:4446 kernel/locking/mutex.c:181 kernel/locking/mutex.c:540) 
[ 222.993645][ T2216] ? __mutex_unlock_slowpath+0x2b0/0x2b0 
[ 223.000160][ T2216] device_unregister (drivers/base/core.c:3732 drivers/base/core.c:3845) 
[ 223.004772][ T2216] sdebug_do_remove_host (drivers/scsi/scsi_debug.c:7299) scsi_debug
[ 223.011039][ T2216] scsi_debug_exit (drivers/scsi/scsi_debug.c:7670) scsi_debug
[ 223.016625][ T2216] __do_sys_delete_module+0x316/0x540 
[ 223.022880][ T2216] ? module_flags (kernel/module/main.c:698) 
[ 223.027401][ T2216] ? task_work_cancel (kernel/task_work.c:147) 
[ 223.032081][ T2216] ? _raw_spin_lock (arch/x86/include/asm/atomic.h:115 include/linux/atomic/atomic-arch-fallback.h:2155 include/linux/atomic/atomic-instrumented.h:1296 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:187 include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) 
[ 223.036590][ T2216] ? exit_to_user_mode_loop (include/linux/sched.h:2337 include/linux/resume_user_mode.h:61 kernel/entry/common.c:171) 
[ 223.041890][ T2216] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
[ 223.046154][ T2216] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) 
[  223.051867][ T2216] RIP: 0033:0x7fd9aeb83417
[ 223.056129][ T2216] Code: 73 01 c3 48 8b 0d 79 1a 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 49 1a 0d 00 f7 d8 64 89 01 48
All code
========
   0:	73 01                	jae    0x3
   2:	c3                   	retq   
   3:	48 8b 0d 79 1a 0d 00 	mov    0xd1a79(%rip),%rcx        # 0xd1a83
   a:	f7 d8                	neg    %eax
   c:	64 89 01             	mov    %eax,%fs:(%rcx)
   f:	48 83 c8 ff          	or     $0xffffffffffffffff,%rax
  13:	c3                   	retq   
  14:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1b:	00 00 00 
  1e:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  23:	b8 b0 00 00 00       	mov    $0xb0,%eax
  28:	0f 05                	syscall 
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	retq   
  33:	48 8b 0d 49 1a 0d 00 	mov    0xd1a49(%rip),%rcx        # 0xd1a83
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	retq   
   9:	48 8b 0d 49 1a 0d 00 	mov    0xd1a49(%rip),%rcx        # 0xd1a59
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230817/202308172100.8ce4b853-oliver.sang@intel.com
Bart Van Assche Aug. 17, 2023, 2:50 p.m. UTC | #5
On 8/17/23 07:41, kernel test robot wrote:
> [  222.622837][ T2216] statistics for priority 1: i 276 m 0 d 276 c 278
> [ 222.629307][ T2216] WARNING: CPU: 0 PID: 2216 at block/mq-deadline.c:680 dd_exit_sched (block/mq-deadline.c:680 (discriminator 3))

The above information shows that dd_inserted_request() has been called
276 times and also that dd_finish_request() has been called 278 times.
Calling dd_finish_request() more than once per request breaks the code
for priority handling since that code checks how many requests are
pending per priority level by subtracting the number of completion calls
from the number of insertion calls (see also dd_queued()). I think the
above output indicates that this patch introduced a regression.

Bart.
Chengming Zhou Aug. 17, 2023, 3:29 p.m. UTC | #6
On 2023/8/17 22:50, Bart Van Assche wrote:
> On 8/17/23 07:41, kernel test robot wrote:
>> [  222.622837][ T2216] statistics for priority 1: i 276 m 0 d 276 c 278
>> [ 222.629307][ T2216] WARNING: CPU: 0 PID: 2216 at block/mq-deadline.c:680 dd_exit_sched (block/mq-deadline.c:680 (discriminator 3))
> 
> The above information shows that dd_inserted_request() has been called
> 276 times and also that dd_finish_request() has been called 278 times.

Thanks much for your help.

This patch indeed introduced a regression, postflush requests will be completed
twice, so here dd_finish_request() is more than dd_inserted_request().

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a8c63bef8ff1..7cd47ffc04ce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -686,8 +686,10 @@ static void blk_mq_finish_request(struct request *rq)
 {
        struct request_queue *q = rq->q;

-       if (rq->rq_flags & RQF_USE_SCHED)
+       if (rq->rq_flags & RQF_USE_SCHED) {
                q->elevator->type->ops.finish_request(rq);
+               rq->rq_flags &= ~RQF_USE_SCHED;
+       }
 }


Clear RQF_USE_SCHED flag here should fix this problem, which should be ok
since finish_request() is the last callback, this flag isn't needed anymore.

Jens, should I send this diff as another patch or resend updated v3?

Thanks.

> Calling dd_finish_request() more than once per request breaks the code
> for priority handling since that code checks how many requests are
> pending per priority level by subtracting the number of completion calls
> from the number of insertion calls (see also dd_queued()). I think the
> above output indicates that this patch introduced a regression.
> 
> Bart.
>
Chengming Zhou Aug. 17, 2023, 5:17 p.m. UTC | #7
On 2023/8/17 23:29, Chengming Zhou wrote:
> On 2023/8/17 22:50, Bart Van Assche wrote:
>> On 8/17/23 07:41, kernel test robot wrote:
>>> [  222.622837][ T2216] statistics for priority 1: i 276 m 0 d 276 c 278
>>> [ 222.629307][ T2216] WARNING: CPU: 0 PID: 2216 at block/mq-deadline.c:680 dd_exit_sched (block/mq-deadline.c:680 (discriminator 3))
>>
>> The above information shows that dd_inserted_request() has been called
>> 276 times and also that dd_finish_request() has been called 278 times.
> 
> Thanks much for your help.
> 
> This patch indeed introduced a regression, postflush requests will be completed
> twice, so here dd_finish_request() is more than dd_inserted_request().
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a8c63bef8ff1..7cd47ffc04ce 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -686,8 +686,10 @@ static void blk_mq_finish_request(struct request *rq)
>  {
>         struct request_queue *q = rq->q;
> 
> -       if (rq->rq_flags & RQF_USE_SCHED)
> +       if (rq->rq_flags & RQF_USE_SCHED) {
>                 q->elevator->type->ops.finish_request(rq);
> +               rq->rq_flags &= ~RQF_USE_SCHED;
> +       }
>  }
> 

I just tried to run LKP and xfstests, firstly failed to run LKP on my server
which seems to miss some dependencies. Then I ran xfstests successfully.

But xfstests generic/704 always pass and no WARN in dmesg. (I don't know why,
maybe my server settings are some different from the test robot.)

So I try to reproduce it manually. Steps:

```
echo mq-deadline > /sys/block/sdb/queue/scheduler

mkfs.ext4 /dev/sdb
mount /dev/sdb /fs/sdb
cd /fs/sdb
stress-ng --symlink 4 --timeout 60

echo none > /sys/block/sdb/queue/scheduler
```

This way the WARNING in mq-deadline can be reproduced easily.

Then retest with the diff, mq-deadline WARNING still happened... So there
are still other requests which have RQF_USE_SCHED flag completed without
being inserted into elevator.

Will use some tracing and look again.

Thanks.

> 
> Clear RQF_USE_SCHED flag here should fix this problem, which should be ok
> since finish_request() is the last callback, this flag isn't needed anymore.
> 
> Jens, should I send this diff as another patch or resend updated v3?
> 
> Thanks.
> 
>> Calling dd_finish_request() more than once per request breaks the code
>> for priority handling since that code checks how many requests are
>> pending per priority level by subtracting the number of completion calls
>> from the number of insertion calls (see also dd_queued()). I think the
>> above output indicates that this patch introduced a regression.
>>
>> Bart.
>>
Jens Axboe Aug. 17, 2023, 5:20 p.m. UTC | #8
On 8/17/23 9:29 AM, Chengming Zhou wrote:
> On 2023/8/17 22:50, Bart Van Assche wrote:
>> On 8/17/23 07:41, kernel test robot wrote:
>>> [  222.622837][ T2216] statistics for priority 1: i 276 m 0 d 276 c 278
>>> [ 222.629307][ T2216] WARNING: CPU: 0 PID: 2216 at block/mq-deadline.c:680 dd_exit_sched (block/mq-deadline.c:680 (discriminator 3))
>>
>> The above information shows that dd_inserted_request() has been called
>> 276 times and also that dd_finish_request() has been called 278 times.
> 
> Thanks much for your help.
> 
> This patch indeed introduced a regression, postflush requests will be completed
> twice, so here dd_finish_request() is more than dd_inserted_request().
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a8c63bef8ff1..7cd47ffc04ce 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -686,8 +686,10 @@ static void blk_mq_finish_request(struct request *rq)
>  {
>         struct request_queue *q = rq->q;
> 
> -       if (rq->rq_flags & RQF_USE_SCHED)
> +       if (rq->rq_flags & RQF_USE_SCHED) {
>                 q->elevator->type->ops.finish_request(rq);
> +               rq->rq_flags &= ~RQF_USE_SCHED;
> +       }
>  }
> 
> 
> Clear RQF_USE_SCHED flag here should fix this problem, which should be ok
> since finish_request() is the last callback, this flag isn't needed anymore.
> 
> Jens, should I send this diff as another patch or resend updated v3?

I don't think this is the right solution, it makes all kinds of
assumptions on what that flag is and when it's safe to clear it. It's a
very fragile fix, I think we need to do better than that.
Jens Axboe Aug. 17, 2023, 5:24 p.m. UTC | #9
On 8/17/23 11:20 AM, Jens Axboe wrote:
> On 8/17/23 9:29 AM, Chengming Zhou wrote:
>> On 2023/8/17 22:50, Bart Van Assche wrote:
>>> On 8/17/23 07:41, kernel test robot wrote:
>>>> [  222.622837][ T2216] statistics for priority 1: i 276 m 0 d 276 c 278
>>>> [ 222.629307][ T2216] WARNING: CPU: 0 PID: 2216 at block/mq-deadline.c:680 dd_exit_sched (block/mq-deadline.c:680 (discriminator 3))
>>>
>>> The above information shows that dd_inserted_request() has been called
>>> 276 times and also that dd_finish_request() has been called 278 times.
>>
>> Thanks much for your help.
>>
>> This patch indeed introduced a regression, postflush requests will be completed
>> twice, so here dd_finish_request() is more than dd_inserted_request().
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index a8c63bef8ff1..7cd47ffc04ce 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -686,8 +686,10 @@ static void blk_mq_finish_request(struct request *rq)
>>  {
>>         struct request_queue *q = rq->q;
>>
>> -       if (rq->rq_flags & RQF_USE_SCHED)
>> +       if (rq->rq_flags & RQF_USE_SCHED) {
>>                 q->elevator->type->ops.finish_request(rq);
>> +               rq->rq_flags &= ~RQF_USE_SCHED;
>> +       }
>>  }
>>
>>
>> Clear RQF_USE_SCHED flag here should fix this problem, which should be ok
>> since finish_request() is the last callback, this flag isn't needed anymore.
>>
>> Jens, should I send this diff as another patch or resend updated v3?
> 
> I don't think this is the right solution, it makes all kinds of
> assumptions on what that flag is and when it's safe to clear it. It's a
> very fragile fix, I think we need to do better than that.

Well maybe this is actually fine, since we're freeing the request now
anyway. I can fold it in the fix, I'll add a comment as well. If this is
subtle enough that it caused this issue, we definitely should have a
comment on why we're clearing this flag.
Jens Axboe Aug. 17, 2023, 5:26 p.m. UTC | #10
On 8/17/23 11:17 AM, Chengming Zhou wrote:
> On 2023/8/17 23:29, Chengming Zhou wrote:
>> On 2023/8/17 22:50, Bart Van Assche wrote:
>>> On 8/17/23 07:41, kernel test robot wrote:
>>>> [  222.622837][ T2216] statistics for priority 1: i 276 m 0 d 276 c 278
>>>> [ 222.629307][ T2216] WARNING: CPU: 0 PID: 2216 at block/mq-deadline.c:680 dd_exit_sched (block/mq-deadline.c:680 (discriminator 3))
>>>
>>> The above information shows that dd_inserted_request() has been called
>>> 276 times and also that dd_finish_request() has been called 278 times.
>>
>> Thanks much for your help.
>>
>> This patch indeed introduced a regression, postflush requests will be completed
>> twice, so here dd_finish_request() is more than dd_inserted_request().
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index a8c63bef8ff1..7cd47ffc04ce 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -686,8 +686,10 @@ static void blk_mq_finish_request(struct request *rq)
>>  {
>>         struct request_queue *q = rq->q;
>>
>> -       if (rq->rq_flags & RQF_USE_SCHED)
>> +       if (rq->rq_flags & RQF_USE_SCHED) {
>>                 q->elevator->type->ops.finish_request(rq);
>> +               rq->rq_flags &= ~RQF_USE_SCHED;
>> +       }
>>  }
>>
> 
> I just tried to run LKP and xfstests, firstly failed to run LKP on my server
> which seems to miss some dependencies. Then I ran xfstests successfully.
> 
> But xfstests generic/704 always pass and no WARN in dmesg. (I don't know why,
> maybe my server settings are some different from the test robot.)
> 
> So I try to reproduce it manually. Steps:
> 
> ```
> echo mq-deadline > /sys/block/sdb/queue/scheduler
> 
> mkfs.ext4 /dev/sdb
> mount /dev/sdb /fs/sdb
> cd /fs/sdb
> stress-ng --symlink 4 --timeout 60
> 
> echo none > /sys/block/sdb/queue/scheduler
> ```
> 
> This way the WARNING in mq-deadline can be reproduced easily.
> 
> Then retest with the diff, mq-deadline WARNING still happened... So there
> are still other requests which have RQF_USE_SCHED flag completed without
> being inserted into elevator.
> 
> Will use some tracing and look again.

Ah missed this, thanks for doing this testing. I'll wait for an update
version. We can just fold in whatever change we need, and most likely
just push the patch to next week rather than send off a pull request for
this week. It's the only important thing pending on the block side.
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f14b8669ac69..a8c63bef8ff1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -682,6 +682,14 @@  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+static void blk_mq_finish_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+
+	if (rq->rq_flags & RQF_USE_SCHED)
+		q->elevator->type->ops.finish_request(rq);
+}
+
 static void __blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -708,10 +716,6 @@  void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
-	if ((rq->rq_flags & RQF_USE_SCHED) &&
-	    q->elevator->type->ops.finish_request)
-		q->elevator->type->ops.finish_request(rq);
-
 	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
 		laptop_io_completion(q->disk->bdi);
 
@@ -1021,6 +1025,8 @@  inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 	if (blk_mq_need_time_stamp(rq))
 		__blk_mq_end_request_acct(rq, ktime_get_ns());
 
+	blk_mq_finish_request(rq);
+
 	if (rq->end_io) {
 		rq_qos_done(rq->q, rq);
 		if (rq->end_io(rq, error) == RQ_END_IO_FREE)
@@ -1075,6 +1081,8 @@  void blk_mq_end_request_batch(struct io_comp_batch *iob)
 		if (iob->need_ts)
 			__blk_mq_end_request_acct(rq, now);
 
+		blk_mq_finish_request(rq);
+
 		rq_qos_done(rq->q, rq);
 
 		/*
diff --git a/block/elevator.c b/block/elevator.c
index 8400e303fbcb..ac2cb3814eac 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -499,6 +499,9 @@  void elv_unregister_queue(struct request_queue *q)
 
 int elv_register(struct elevator_type *e)
 {
+	if (WARN_ON_ONCE(!e->ops.finish_request))
+		return -EINVAL;
+
 	/* insert_requests and dispatch_request are mandatory */
 	if (WARN_ON_ONCE(!e->ops.insert_requests || !e->ops.dispatch_request))
 		return -EINVAL;