Message ID | 20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | io_uring: Fix a null-ptr-deref in io_tctx_exit_cb() | expand |
On 12/6/22 10:38, Harshit Mogalapalli wrote: > Syzkaller reports a NULL deref bug as follows: > > BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 [...] > Add a NULL check on tctx to prevent this. > > Fixes: d56d938b4bef ("io_uring: do ctx initiated file note removal") > Reported-by: syzkaller <syzkaller@googlegroups.com> > Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> > --- > Could not find the root cause of this. Hi, I don't think the patch is correct as-is -- we should in any case probably understand better what's going on. I think what's happening is something like this, where tsk->io_uring is set to NULL in begin_new_exec() while we have a pending callback: fd = io_uring_setup() [...] close(fd) ? - __fput() - io_uring_release() - io_ring_ctx_wait_and_kill() - init_task_work(..., io_tctx_exit_cb) // callback posted exec() - begin_new_exec() - io_uring_task_cancel() - __io_uring_cancel() - io_uring_cancel_generic() - __io_uring_free() - tsk->io_uring = NULL // pointer nulled - syscall_exit_to_user_mode() - [...] - task_work_run() - io_tctx_exit_cb() - *current->io_uring // callback runs: oops As far as I can tell, whatever is happening in io_ring_exit_work() is happening too late, as task->io_uring has already been set to NULL. It looks a bit like this is supposed to be handled in io_uring_cancel_generic() already where it tries to cancel and wait for all the outstanding work items to finish, but maybe that is not taking into account the fact that the exit callback is still pending? Should io_ring_ctx_wait_and_kill() bump the inflight counter..? It's unclear to me whether the io_ring_ctx_wait_and_kill() call is coming through close(), dup2(), or simply exec(), but it looks like this could potentially get delayed (from the current syscall) and thus pushed into the exec() call. Maybe flush_delayed_fput() needs to be called somewhere..? Anyway, I could be completely off base here as I'm not really familiar with the code, just wanted to share my notes. Vegard
On 12/6/22 2:38?AM, Harshit Mogalapalli wrote: > Syzkaller reports a NULL deref bug as follows: > > BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 > Read of size 4 at addr 0000000000000138 by task file1/1955 > > CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 > Call Trace: > <TASK> > dump_stack_lvl+0xcd/0x134 > ? io_tctx_exit_cb+0x53/0xd3 > kasan_report+0xbb/0x1f0 > ? io_tctx_exit_cb+0x53/0xd3 > kasan_check_range+0x140/0x190 > io_tctx_exit_cb+0x53/0xd3 > task_work_run+0x164/0x250 > ? task_work_cancel+0x30/0x30 > get_signal+0x1c3/0x2440 > ? lock_downgrade+0x6e0/0x6e0 > ? lock_downgrade+0x6e0/0x6e0 > ? exit_signals+0x8b0/0x8b0 > ? do_raw_read_unlock+0x3b/0x70 > ? do_raw_spin_unlock+0x50/0x230 > arch_do_signal_or_restart+0x82/0x2470 > ? kmem_cache_free+0x260/0x4b0 > ? putname+0xfe/0x140 > ? get_sigframe_size+0x10/0x10 > ? do_execveat_common.isra.0+0x226/0x710 > ? lockdep_hardirqs_on+0x79/0x100 > ? putname+0xfe/0x140 > ? do_execveat_common.isra.0+0x238/0x710 > exit_to_user_mode_prepare+0x15f/0x250 > syscall_exit_to_user_mode+0x19/0x50 > do_syscall_64+0x42/0xb0 > entry_SYSCALL_64_after_hwframe+0x63/0xcd > RIP: 0023:0x0 > Code: Unable to access opcode bytes at 0xffffffffffffffd6. > RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > </TASK> > Kernel panic - not syncing: panic_on_warn set ... > > Add a NULL check on tctx to prevent this. I agree with Vegard that I don't think this is fixing the core of the issue. I think what is happening here is that we don't run the task_work in io_uring_cancel_generic() unconditionally, if we don't need to in the loop above. But we do need to ensure we run it before we clear current->io_uring. Do you have a reproducer? If so, can you try the below? I _think_ this is all we need. We can't be hitting the delayed fput path as the task isn't exiting, and we're dealing with current here. diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 36cb63e4174f..4791d94c88f5 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3125,6 +3125,15 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd) io_uring_clean_tctx(tctx); if (cancel_all) { + /* + * If we didn't run task_work in the loop above, ensure we + * do so here. If an fput() queued up exit task_work for the + * ring descriptor before we started the exec that led to this + * cancelation, then we need to have that run before we proceed + * with tearing down current->io_uring. + */ + io_run_task_work(); + /* * We shouldn't run task_works after cancel, so just leave * ->in_idle set for normal exit.
On 12/6/22 2:15?PM, Jens Axboe wrote: > On 12/6/22 2:38?AM, Harshit Mogalapalli wrote: >> Syzkaller reports a NULL deref bug as follows: >> >> BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 >> Read of size 4 at addr 0000000000000138 by task file1/1955 >> >> CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 >> Call Trace: >> <TASK> >> dump_stack_lvl+0xcd/0x134 >> ? io_tctx_exit_cb+0x53/0xd3 >> kasan_report+0xbb/0x1f0 >> ? io_tctx_exit_cb+0x53/0xd3 >> kasan_check_range+0x140/0x190 >> io_tctx_exit_cb+0x53/0xd3 >> task_work_run+0x164/0x250 >> ? task_work_cancel+0x30/0x30 >> get_signal+0x1c3/0x2440 >> ? lock_downgrade+0x6e0/0x6e0 >> ? lock_downgrade+0x6e0/0x6e0 >> ? exit_signals+0x8b0/0x8b0 >> ? do_raw_read_unlock+0x3b/0x70 >> ? do_raw_spin_unlock+0x50/0x230 >> arch_do_signal_or_restart+0x82/0x2470 >> ? kmem_cache_free+0x260/0x4b0 >> ? putname+0xfe/0x140 >> ? get_sigframe_size+0x10/0x10 >> ? do_execveat_common.isra.0+0x226/0x710 >> ? lockdep_hardirqs_on+0x79/0x100 >> ? putname+0xfe/0x140 >> ? do_execveat_common.isra.0+0x238/0x710 >> exit_to_user_mode_prepare+0x15f/0x250 >> syscall_exit_to_user_mode+0x19/0x50 >> do_syscall_64+0x42/0xb0 >> entry_SYSCALL_64_after_hwframe+0x63/0xcd >> RIP: 0023:0x0 >> Code: Unable to access opcode bytes at 0xffffffffffffffd6. >> RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b >> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 >> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 >> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 >> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 >> </TASK> >> Kernel panic - not syncing: panic_on_warn set ... >> >> Add a NULL check on tctx to prevent this. > > I agree with Vegard that I don't think this is fixing the core of > the issue. I think what is happening here is that we don't run the > task_work in io_uring_cancel_generic() unconditionally, if we don't > need to in the loop above. But we do need to ensure we run it before > we clear current->io_uring. > > Do you have a reproducer? If so, can you try the below? I _think_ > this is all we need. We can't be hitting the delayed fput path as > the task isn't exiting, and we're dealing with current here. While I think the above is the right description of what happens, I think there's still a race with the proposed solution. If the task_work gets added right after the newly inserted io_run_task_work(), then we'll still crash when the targeted task exits to userspace and runs the task_work. It should actually be fine to add that NULL check in io_tctx_exit_cb. We can't be racing here, as both the clear and io_tctx_exit_cb() are run by current itself. It's really just an ordering issue.
On 12/6/22 2:30 PM, Jens Axboe wrote: > On 12/6/22 2:15?PM, Jens Axboe wrote: >> On 12/6/22 2:38?AM, Harshit Mogalapalli wrote: >>> Syzkaller reports a NULL deref bug as follows: >>> >>> BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 >>> Read of size 4 at addr 0000000000000138 by task file1/1955 >>> >>> CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 >>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 >>> Call Trace: >>> <TASK> >>> dump_stack_lvl+0xcd/0x134 >>> ? io_tctx_exit_cb+0x53/0xd3 >>> kasan_report+0xbb/0x1f0 >>> ? io_tctx_exit_cb+0x53/0xd3 >>> kasan_check_range+0x140/0x190 >>> io_tctx_exit_cb+0x53/0xd3 >>> task_work_run+0x164/0x250 >>> ? task_work_cancel+0x30/0x30 >>> get_signal+0x1c3/0x2440 >>> ? lock_downgrade+0x6e0/0x6e0 >>> ? lock_downgrade+0x6e0/0x6e0 >>> ? exit_signals+0x8b0/0x8b0 >>> ? do_raw_read_unlock+0x3b/0x70 >>> ? do_raw_spin_unlock+0x50/0x230 >>> arch_do_signal_or_restart+0x82/0x2470 >>> ? kmem_cache_free+0x260/0x4b0 >>> ? putname+0xfe/0x140 >>> ? get_sigframe_size+0x10/0x10 >>> ? do_execveat_common.isra.0+0x226/0x710 >>> ? lockdep_hardirqs_on+0x79/0x100 >>> ? putname+0xfe/0x140 >>> ? do_execveat_common.isra.0+0x238/0x710 >>> exit_to_user_mode_prepare+0x15f/0x250 >>> syscall_exit_to_user_mode+0x19/0x50 >>> do_syscall_64+0x42/0xb0 >>> entry_SYSCALL_64_after_hwframe+0x63/0xcd >>> RIP: 0023:0x0 >>> Code: Unable to access opcode bytes at 0xffffffffffffffd6. >>> RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b >>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 >>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 >>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 >>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 >>> </TASK> >>> Kernel panic - not syncing: panic_on_warn set ... >>> >>> Add a NULL check on tctx to prevent this. >> >> I agree with Vegard that I don't think this is fixing the core of >> the issue. I think what is happening here is that we don't run the >> task_work in io_uring_cancel_generic() unconditionally, if we don't >> need to in the loop above. But we do need to ensure we run it before >> we clear current->io_uring. >> >> Do you have a reproducer? If so, can you try the below? I _think_ >> this is all we need. We can't be hitting the delayed fput path as >> the task isn't exiting, and we're dealing with current here. > > While I think the above is the right description of what happens, I > think there's still a race with the proposed solution. If the task_work > gets added right after the newly inserted io_run_task_work(), then we'll > still crash when the targeted task exits to userspace and runs the > task_work. > > It should actually be fine to add that NULL check in io_tctx_exit_cb. We > can't be racing here, as both the clear and io_tctx_exit_cb() are run by > current itself. It's really just an ordering issue. I've queued it up with an improved commit message, and also a code comment: https://git.kernel.dk/cgit/linux-block/commit/?h=for-6.2/io_uring-next&id=6d1b48314b989d059642958fc94ef0a58b25fc8c
On 07/12/22 2:45 am, Jens Axboe wrote: > On 12/6/22 2:38?AM, Harshit Mogalapalli wrote: >> Syzkaller reports a NULL deref bug as follows: >> >> BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 >> Read of size 4 at addr 0000000000000138 by task file1/1955 >> >> CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 >> Call Trace: >> <TASK> >> dump_stack_lvl+0xcd/0x134 >> ? io_tctx_exit_cb+0x53/0xd3 >> kasan_report+0xbb/0x1f0 >> ? io_tctx_exit_cb+0x53/0xd3 >> kasan_check_range+0x140/0x190 >> io_tctx_exit_cb+0x53/0xd3 >> task_work_run+0x164/0x250 >> ? task_work_cancel+0x30/0x30 >> get_signal+0x1c3/0x2440 >> ? lock_downgrade+0x6e0/0x6e0 >> ? lock_downgrade+0x6e0/0x6e0 >> ? exit_signals+0x8b0/0x8b0 >> ? do_raw_read_unlock+0x3b/0x70 >> ? do_raw_spin_unlock+0x50/0x230 >> arch_do_signal_or_restart+0x82/0x2470 >> ? kmem_cache_free+0x260/0x4b0 >> ? putname+0xfe/0x140 >> ? get_sigframe_size+0x10/0x10 >> ? do_execveat_common.isra.0+0x226/0x710 >> ? lockdep_hardirqs_on+0x79/0x100 >> ? putname+0xfe/0x140 >> ? do_execveat_common.isra.0+0x238/0x710 >> exit_to_user_mode_prepare+0x15f/0x250 >> syscall_exit_to_user_mode+0x19/0x50 >> do_syscall_64+0x42/0xb0 >> entry_SYSCALL_64_after_hwframe+0x63/0xcd >> RIP: 0023:0x0 >> Code: Unable to access opcode bytes at 0xffffffffffffffd6. >> RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b >> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 >> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 >> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 >> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 >> </TASK> >> Kernel panic - not syncing: panic_on_warn set ... >> >> Add a NULL check on tctx to prevent this. > > I agree with Vegard that I don't think this is fixing the core of > the issue. I think what is happening here is that we don't run the > task_work in io_uring_cancel_generic() unconditionally, if we don't > need to in the loop above. But we do need to ensure we run it before > we clear current->io_uring. > > Do you have a reproducer? If so, can you try the below? I _think_ > this is all we need. We can't be hitting the delayed fput path as > the task isn't exiting, and we're dealing with current here. > > Thanks Jens and Vegard for the suggestions and analysis. Yes, the below patch silences the reproducer. > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 36cb63e4174f..4791d94c88f5 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -3125,6 +3125,15 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd) > > io_uring_clean_tctx(tctx); > if (cancel_all) { > + /* > + * If we didn't run task_work in the loop above, ensure we > + * do so here. If an fput() queued up exit task_work for the > + * ring descriptor before we started the exec that led to this > + * cancelation, then we need to have that run before we proceed > + * with tearing down current->io_uring. > + */ > + io_run_task_work(); > + > /* > * We shouldn't run task_works after cancel, so just leave > * ->in_idle set for normal exit. > Thanks, Harshit
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 8840cf3e20f2..20f7d8655b50 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -2708,7 +2708,7 @@ static __cold void io_tctx_exit_cb(struct callback_head *cb) * When @in_idle, we're in cancellation and it's racy to remove the * node. It'll be removed by the end of cancellation, just ignore it. */ - if (!atomic_read(&tctx->in_idle)) + if (tctx && !atomic_read(&tctx->in_idle)) io_uring_del_tctx_node((unsigned long)work->ctx); complete(&work->completion); }
Syzkaller reports a NULL deref bug as follows: BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 Read of size 4 at addr 0000000000000138 by task file1/1955 CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xcd/0x134 ? io_tctx_exit_cb+0x53/0xd3 kasan_report+0xbb/0x1f0 ? io_tctx_exit_cb+0x53/0xd3 kasan_check_range+0x140/0x190 io_tctx_exit_cb+0x53/0xd3 task_work_run+0x164/0x250 ? task_work_cancel+0x30/0x30 get_signal+0x1c3/0x2440 ? lock_downgrade+0x6e0/0x6e0 ? lock_downgrade+0x6e0/0x6e0 ? exit_signals+0x8b0/0x8b0 ? do_raw_read_unlock+0x3b/0x70 ? do_raw_spin_unlock+0x50/0x230 arch_do_signal_or_restart+0x82/0x2470 ? kmem_cache_free+0x260/0x4b0 ? putname+0xfe/0x140 ? get_sigframe_size+0x10/0x10 ? do_execveat_common.isra.0+0x226/0x710 ? lockdep_hardirqs_on+0x79/0x100 ? putname+0xfe/0x140 ? do_execveat_common.isra.0+0x238/0x710 exit_to_user_mode_prepare+0x15f/0x250 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x42/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0023:0x0 Code: Unable to access opcode bytes at 0xffffffffffffffd6. RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Kernel panic - not syncing: panic_on_warn set ... Add a NULL check on tctx to prevent this. Fixes: d56d938b4bef ("io_uring: do ctx initiated file note removal") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> --- Could not find the root cause of this. --- io_uring/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)