Message ID | 77ab74b3fdff491db2a5596b1edc86b6@huawei.com (mailing list archive) |
---|---|
State | New |
Series | test/defer: fix deadlock when io_uring_submit fail |
On 1/15/25 6:10 AM, lizetao wrote:
> While performing fault injection testing, a bug report was triggered:
>
> FAULT_INJECTION: forcing a failure.
> name fail_usercopy, interval 1, probability 0, space 0, times 0
> CPU: 12 UID: 0 PID: 18795 Comm: defer.t Tainted: G O 6.13.0-rc6-gf2a0a37b174b #17
> Tainted: [O]=OOT_MODULE
> Hardware name: linux,dummy-virt (DT)
> Call trace:
> show_stack+0x20/0x38 (C)
> dump_stack_lvl+0x78/0x90
> dump_stack+0x1c/0x28
> should_fail_ex+0x544/0x648
> should_fail+0x14/0x20
> should_fail_usercopy+0x1c/0x28
> get_timespec64+0x7c/0x258
> __io_timeout_prep+0x31c/0x798
> io_link_timeout_prep+0x1c/0x30
> io_submit_sqes+0x59c/0x1d50
> __arm64_sys_io_uring_enter+0x8dc/0xfa0
> invoke_syscall+0x74/0x270
> el0_svc_common.constprop.0+0xb4/0x240
> do_el0_svc+0x48/0x68
> el0_svc+0x38/0x78
> el0t_64_sync_handler+0xc8/0xd0
> el0t_64_sync+0x198/0x1a0
>
> The deadlock stack is as follows:
>
> io_cqring_wait+0xa64/0x1060
> __arm64_sys_io_uring_enter+0x46c/0xfa0
> invoke_syscall+0x74/0x270
> el0_svc_common.constprop.0+0xb4/0x240
> do_el0_svc+0x48/0x68
> el0_svc+0x38/0x78
> el0t_64_sync_handler+0xc8/0xd0
> el0t_64_sync+0x198/0x1a0
>
> This is because after the submission fails, the defer.t testcase is still waiting to submit the failed request, resulting in an eventual deadlock.
> Solve the problem by telling wait_cqes the number of requests to wait for.

I suspect this would be fixed by setting IORING_SETUP_SUBMIT_ALL for ring init,
something probably all/most tests should set.
Hi,

> -----Original Message-----
> From: Jens Axboe <axboe@kernel.dk>
> Sent: Thursday, January 16, 2025 10:51 PM
> To: lizetao <lizetao1@huawei.com>; Pavel Begunkov <asml.silence@gmail.com>
> Cc: io-uring@vger.kernel.org
> Subject: Re: [PATCH] test/defer: fix deadlock when io_uring_submit fail
>
> On 1/15/25 6:10 AM, lizetao wrote:
> > [...]
> > This is because after the submission fails, the defer.t testcase is still waiting to
> > submit the failed request, resulting in an eventual deadlock.
> > Solve the problem by telling wait_cqes the number of requests to wait for.
>
> I suspect this would be fixed by setting IORING_SETUP_SUBMIT_ALL for ring init,
> something probably all/most tests should set.

I tested it and found that IORING_SETUP_SUBMIT_ALL does indeed solve this problem.
Should I fix only this test, or set IORING_SETUP_SUBMIT_ALL in the common path so that most similar problems are covered as well?

> --
> Jens Axboe

---
Li Zetao
On 1/18/25 2:42 AM, lizetao wrote:
>> [...]
>> I suspect this would be fixed by setting IORING_SETUP_SUBMIT_ALL for ring init,
>> something probably all/most tests should set.
>
> I tested it and found that IORING_SETUP_SUBMIT_ALL can indeed solve
> this problem. Should I just modify this problem or add
> IORING_SETUP_SUBMIT_ALL to the general path to solve most possible
> problems?

I think just fix up this one. We really should have all the tests use
t_create_ring*() first, and those helpers should just set SUBMIT_ALL.
But that's a separate change.
diff --git a/test/defer.c b/test/defer.c
index b0770ef..2447be0 100644
--- a/test/defer.c
+++ b/test/defer.c
@@ -69,12 +69,12 @@ err:
 	return 1;
 }
 
-static int wait_cqes(struct test_context *ctx)
+static int wait_cqes(struct test_context *ctx, int num)
 {
 	int ret, i;
 	struct io_uring_cqe *cqe;
 
-	for (i = 0; i < ctx->nr; i++) {
+	for (i = 0; i < num; i++) {
 		ret = io_uring_wait_cqe(ctx->ring, &cqe);
 
 		if (ret < 0) {
@@ -105,7 +105,7 @@ static int test_canceled_userdata(struct io_uring *ring)
 		goto err;
 	}
 
-	if (wait_cqes(&ctx))
+	if (wait_cqes(&ctx, ret))
 		goto err;
 
 	for (i = 0; i < nr; i++) {
@@ -139,7 +139,7 @@ static int test_thread_link_cancel(struct io_uring *ring)
 		goto err;
 	}
 
-	if (wait_cqes(&ctx))
+	if (wait_cqes(&ctx, ret))
 		goto err;
 
 	for (i = 0; i < nr; i++) {
@@ -185,7 +185,7 @@ static int test_drain_with_linked_timeout(struct io_uring *ring)
 		goto err;
 	}
 
-	if (wait_cqes(&ctx))
+	if (wait_cqes(&ctx, ret))
 		goto err;
 
 	free_context(&ctx);
@@ -212,7 +212,7 @@ static int run_drained(struct io_uring *ring, int nr)
 		goto err;
 	}
 
-	if (wait_cqes(&ctx))
+	if (wait_cqes(&ctx, ret))
 		goto err;
 
 	free_context(&ctx);
While performing fault injection testing, a bug report was triggered:

FAULT_INJECTION: forcing a failure.
name fail_usercopy, interval 1, probability 0, space 0, times 0
CPU: 12 UID: 0 PID: 18795 Comm: defer.t Tainted: G O 6.13.0-rc6-gf2a0a37b174b #17
Tainted: [O]=OOT_MODULE
Hardware name: linux,dummy-virt (DT)
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x78/0x90
dump_stack+0x1c/0x28
should_fail_ex+0x544/0x648
should_fail+0x14/0x20
should_fail_usercopy+0x1c/0x28
get_timespec64+0x7c/0x258
__io_timeout_prep+0x31c/0x798
io_link_timeout_prep+0x1c/0x30
io_submit_sqes+0x59c/0x1d50
__arm64_sys_io_uring_enter+0x8dc/0xfa0
invoke_syscall+0x74/0x270
el0_svc_common.constprop.0+0xb4/0x240
do_el0_svc+0x48/0x68
el0_svc+0x38/0x78
el0t_64_sync_handler+0xc8/0xd0
el0t_64_sync+0x198/0x1a0

The deadlock stack is as follows:

io_cqring_wait+0xa64/0x1060
__arm64_sys_io_uring_enter+0x46c/0xfa0
invoke_syscall+0x74/0x270
el0_svc_common.constprop.0+0xb4/0x240
do_el0_svc+0x48/0x68
el0_svc+0x38/0x78
el0t_64_sync_handler+0xc8/0xd0
el0t_64_sync+0x198/0x1a0

This is because after the submission fails, the defer.t testcase is still
waiting to submit the failed request, resulting in an eventual deadlock.
Solve the problem by telling wait_cqes the number of requests to wait for.

Fixes: 6f6de47d6126 ("test/defer: Test deferring with drain and links")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
---
 test/defer.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--
2.33.0