Message ID: 20210126001219.845816-1-yhs@fb.com (mailing list archive)
State: Changes Requested
Delegated to: BPF
Series: [bpf] x86/bpf: handle bpf-program-triggered exceptions properly
On Mon, Jan 25, 2021 at 04:12:19PM -0800, Yonghong Song wrote:
> When the test is run normally after login prompt, cpu_feature_enabled(X86_FEATURE_SMAP)
> is true and bad_area_nosemaphore() is called and then fixup_exception() is called,
> where bpf specific handler is able to fixup the exception.
>
> But when the test is run at /sbin/init time, cpu_feature_enabled(X86_FEATURE_SMAP) is
> false, the control reaches

That makes no sense, cpu features should be set in stone long before we
reach userspace.

> To fix the issue, before the above mmap_read_trylock(), we will check
> whether fault ip can be served by bpf exception handler or not, if
> yes, the exception will be fixed up and return.

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f1f1b5a0956a..e8182d30bf67 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1317,6 +1317,15 @@ void do_user_addr_fault(struct pt_regs *regs,
>  		if (emulate_vsyscall(hw_error_code, regs, address))
>  			return;
>  	}
> +
> +#ifdef CONFIG_BPF_JIT
> +	/*
> +	 * Faults incurred by bpf program might need emulation, i.e.,
> +	 * clearing the dest register.
> +	 */
> +	if (fixup_bpf_exception(regs, X86_TRAP_PF, hw_error_code, address))
> +		return;
> +#endif
>  #endif

NAK, this is broken. You're now disallowing faults that should've gone
through.
On 1/26/21 1:15 AM, Peter Zijlstra wrote:
> On Mon, Jan 25, 2021 at 04:12:19PM -0800, Yonghong Song wrote:
>> When the test is run normally after login prompt, cpu_feature_enabled(X86_FEATURE_SMAP)
>> is true and bad_area_nosemaphore() is called and then fixup_exception() is called,
>> where bpf specific handler is able to fixup the exception.
>>
>> But when the test is run at /sbin/init time, cpu_feature_enabled(X86_FEATURE_SMAP) is
>> false, the control reaches
>
> That makes no sense, cpu features should be set in stone long before we
> reach userspace.

You are correct. Sorry for the misleading description. The reason is
that I used two different qemu scripts, one from my local environment
and one from the CI test, and each works in its own setting. I thought
they would have roughly the same kernel setup, but apparently they do
not.

My local qemu script uses the "-cpu host" option, and with that,
X86_FEATURE_SMAP is set up, probably in get_cpu_cap() in
arch/x86/kernel/cpu/common.c.

The CI qemu script (in
https://lore.kernel.org/bpf/20210123004445.299149-1-kpsingh@kernel.org/)
passes "-cpu kvm64" instead. That cpu model does not enable
X86_FEATURE_SMAP, so we see the kernel warning.

Changing the CI script to use "-cpu host" resolved the issue. I think
"-cpu kvm64" may emulate old x86 cpus and ignore X86_FEATURE_SMAP.

Do we have any x64 cpus which does not support X86_FEATURE_SMAP?

For CI, "-cpu kvm64" is convenient because it can specify the number of
cores with e.g. "-smp 4" regardless of the underlying host's core count.
I think we could change CI to use "-cpu host" by ensuring the CI host
has at least 4 cores.

Thanks.

>
>> To fix the issue, before the above mmap_read_trylock(), we will check
>> whether fault ip can be served by bpf exception handler or not, if
>> yes, the exception will be fixed up and return.
>
> [...]
>
> NAK, this is broken. You're now disallowing faults that should've gone
> through.
On Tue, Jan 26, 2021 at 11:21:01AM -0800, Yonghong Song wrote:
> Do we have any x64 cpus which does not support X86_FEATURE_SMAP?
x64 is not an architecture I know of. If you typoed x86_64 then sure, I
think SMAP is fairly recent, Wikipedia seems to suggest Broadwell and
later. AMD seems to sport it since Zen.
On Tue, Jan 26, 2021 at 2:57 PM Yonghong Song <yhs@fb.com> wrote:
>
> On 1/26/21 1:15 AM, Peter Zijlstra wrote:
> > [...]
> > That makes no sense, cpu features should be set in stone long before we
> > reach userspace.
>
> You are correct. Sorry for the misleading description. [...]
>
> Changing the CI script to use "-cpu host" resolved the issue. I think
> "-cpu kvm64" may emulate old x86 cpus and ignore X86_FEATURE_SMAP.
>
> Do we have any x64 cpus which does not support X86_FEATURE_SMAP?

We don't control what CPUs are used in our CIs, which is why we didn't
use "-cpu host". Is there some other way to make the necessary features
available in qemu for this to work and not emit warnings?

But also, what will happen in this case on CPUs that really don't
support X86_FEATURE_SMAP? Should that be addressed instead?

> For CI, "-cpu kvm64" is convenient because it can specify the number of
> cores with e.g. "-smp 4" regardless of the underlying host's core count.
> I think we could change CI to use "-cpu host" by ensuring the CI host
> has at least 4 cores.
>
> Thanks.
>
> [...]
On Mon, Jan 25, 2021 at 4:12 PM Yonghong Song <yhs@fb.com> wrote:
>
> When reviewing patch ([1]), which adds a script to run bpf selftest
> through qemu at /sbin/init stage, I found the following kernel bug
> warning:
>
> [  112.118892] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1351
> [  112.119805] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 354, name: new_name
> [  112.120512] 3 locks held by new_name/354:
> [  112.120868]  #0: ffff88800476e0a0 (&p->lock){+.+.}-{3:3}, at: bpf_seq_read+0x3a/0x3d0
> [  112.121573]  #1: ffffffff82d69800 (rcu_read_lock){....}-{1:2}, at: bpf_iter_run_prog+0x5/0x160
> [  112.122348]  #2: ffff8880061c2088 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x1a1/0x640
> [  112.123128] Preemption disabled at:
> [  112.123130] [<ffffffff8108f913>] migrate_disable+0x33/0x80
> [  112.123942] CPU: 0 PID: 354 Comm: new_name Tainted: G           O      5.11.0-rc4-00524-g6e66fbb10597-dirty #1249
> [  112.124822] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
> [  112.125614] Call Trace:
> [  112.125835]  dump_stack+0x77/0x97
> [  112.126137]  ___might_sleep.cold.119+0xf2/0x106
> [  112.126537]  exc_page_fault+0x1c1/0x640
> [  112.126888]  asm_exc_page_fault+0x1e/0x30
> [  112.127241] RIP: 0010:bpf_prog_0a182df2d34af188_dump_bpf_prog+0xf5/0xb3c
> [  112.127825] Code: 00 00 8b 7d f4 41 8b 76 44 48 39 f7 73 06 48 01 fb 49 89 df 4c 89 7d d8 49 8b bd 20 01 00 00 48 89 7d e0 49 8b bd e0 00 00 00 <48> 8b 7f 20 48 01 d7 48 89 7d e8 48 89 e9 48 83 c1 d0 48 8b 7d c8
> [  112.129433] RSP: 0018:ffffc9000035fdc8 EFLAGS: 00010282
> [  112.129895] RAX: 0000000000000000 RBX: ffff888005a49458 RCX: 0000000000000024
> [  112.130509] RDX: 00000000000002f0 RSI: 0000000000000509 RDI: 0000000000000000
> [  112.131126] RBP: ffffc9000035fe20 R08: 0000000000000001 R09: 0000000000000000
> [  112.131737] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000400
> [  112.132355] R13: ffff888006085800 R14: ffff888004718540 R15: ffff888005a49458
> [  112.132990]  ? bpf_prog_0a182df2d34af188_dump_bpf_prog+0xc8/0xb3c
> [  112.133526]  bpf_iter_run_prog+0x75/0x160
> [  112.133880]  __bpf_prog_seq_show+0x39/0x40
> [  112.134258]  bpf_seq_read+0xf6/0x3d0
> [  112.134582]  vfs_read+0xa3/0x1b0
> [  112.134873]  ksys_read+0x4f/0xc0
> [  112.135166]  do_syscall_64+0x2d/0x40
> [  112.135482]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> To reproduce the issue, with patch [1], use the following script:
>   tools/testing/selftests/bpf/run_in_vm.sh -- cat /sys/fs/bpf/progs.debug
>
> The reason for the above kernel warning is that the bpf program tries
> to dereference an address of 0 and this is not caught by the bpf
> exception handling logic.
>
>   ...
>   SEC("iter/bpf_prog")
>   int dump_bpf_prog(struct bpf_iter__bpf_prog *ctx)
>   {
>           struct bpf_prog *prog = ctx->prog;
>           struct bpf_prog_aux *aux;
>           ...
>           if (!prog)
>                   return 0;
>           aux = prog->aux;
>           ...
>           ... aux->dst_prog->aux->name ...
>           return 0;
>   }
>
> If aux->dst_prog is a NULL pointer, a fault will happen when trying
> to access aux->dst_prog->aux.

Which would be a bug in the bpf verifier, no? This is a bpf program
that apparently passed verification, and it somehow dereferenced a
NULL pointer.

Let's enumerate some cases.

1. x86-like architecture, SMAP enabled. The CPU determines that this
is bogus, and bpf gets lucky, because the x86 code runs the exception
handler instead of summarily OOPSing. IMO it would be entirely
reasonable to OOPS.

2. x86-like architecture, SMAP disabled. This looks like a valid user
access, and for all bpf knows, 0 might be an actual mapped address,
and it might have userfaultfd attached, etc. And it's called from a
non-sleepable context. The warning is entirely correct.

3. s390x-like architecture. This is a dereference of a bad kernel
pointer. OOPS, unless you get lucky.


This patch is totally bogus. You can't work around this by some magic
exception handling. Unless I'm missing something big, this is a bpf
bug. The following is not a valid kernel code pattern:

label:
    dereference might-be-NULL kernel pointer
_EXHANDLER_...(label, ...); /* try to recover */

This appears to have been introduced by:

commit 3dec541b2e632d630fe7142ed44f0b3702ef1f8c
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Tue Oct 15 20:25:03 2019 -0700

    bpf: Add support for BTF pointers to x86 JIT

Alexei, this looks like a very long-winded and ultimately incorrect
way to remove the branch from:

if (ptr)
    val = *ptr;

Wouldn't it be better to either just emit the branch directly or to
make sure that the pointer is valid in the first place?

--Andy
On 1/27/21 1:47 PM, Andy Lutomirski wrote:
> On Mon, Jan 25, 2021 at 4:12 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> When reviewing patch ([1]), which adds a script to run bpf selftest
>> through qemu at /sbin/init stage, I found the following kernel bug
>> warning:
>>
>> [...]
>>
>> If aux->dst_prog is a NULL pointer, a fault will happen when trying
>> to access aux->dst_prog->aux.
>
> Which would be a bug in the bpf verifier, no? This is a bpf program
> that apparently passed verification, and it somehow dereferenced a
> NULL pointer.
>
> [...]
>
> Alexei, this looks like a very long-winded and ultimately incorrect
> way to remove the branch from:
>
> if (ptr)
>     val = *ptr;
>
> Wouldn't it be better to either just emit the branch directly or to
> make sure that the pointer is valid in the first place?

Let me explain the motivation for this patch.

Previously, for any kernel data structure access, people had to use the
bpf_probe_read() or bpf_probe_read_kernel() helper even though most of
these accesses are okay and will not fault. For example, for

    int t = a->b->c->d

three bpf_probe_read()'s are needed, e.g.,

    bpf_probe_read_kernel(&t1, sizeof(t1), &a->b);
    bpf_probe_read_kernel(&t2, sizeof(t2), &t1->c);
    bpf_probe_read_kernel(&t, sizeof(t), &t2->d);

If there is a fault, the bpf_probe_read_kernel() helper will suppress
the exception, clear the dest buffer and return.

The above usage of bpf_probe_read_kernel() is complicated, not C-like,
and bpf developers do not like it.

bcc (https://github.com/iovisor/bcc/) permits users to write the
"a->b->c->d" style and then, through a clang rewriter, converts it to a
series of bpf_probe_read_kernel()'s. But most users compile their
programs directly with clang, so they have to write the
bpf_probe_read_kernel()'s themselves.

The motivation here is to improve the user experience so users can just
write

    int t = a->b->c->d

and the kernel will automatically take care of exceptions, maintaining
the bpf_probe_read_kernel() semantics. So what the above patch
essentially does is check whether regs->ip is one of the ips that
perform a "bpf_probe_read_kernel()" (actually a direct access); if so,
it fixes up the exception (clears the dest register) and returns.

For a->b->c->d, some users may add an "if (ptr)" check for some of the
pointers, and if that is the case, the compiler/verifier will honor
that. Some users may not add checks if they are certain from the code
that in most or all cases the pointer will not be NULL. From the
verifier's perspective, it is hard to decide whether a->b or a->b->c is
NULL or not, and adding NULL checks to every kernel dereference might
be excessive. Also, even with a NULL check, I am not 100% sure the
pointer dereference is guaranteed never to fault; without such a
guarantee, bpf_probe_read_kernel() semantics will still be needed.

I see the page fault handler already has special processing for kprobe
and vsyscall page faults; maybe page faults generated by these special
bpf insns (simulating bpf_probe_read_kernel()) can also be specially
processed here.
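As a concrete illustration of the two styles contrasted above, here is
a minimal sketch using hypothetical struct layouts; it is not code from
the patch under discussion, and the function names are invented for the
example:

    /* Hypothetical structs standing in for real kernel types. */
    struct c { int d; };
    struct b { struct c *c; };
    struct a { struct b *b; };

    /* Old style: every dereference is an explicit helper call that
     * suppresses faults and zeroes the destination on error. */
    static int read_d_explicit(struct a *a)
    {
            struct b *t1;
            struct c *t2;
            int t;

            bpf_probe_read_kernel(&t1, sizeof(t1), &a->b);
            bpf_probe_read_kernel(&t2, sizeof(t2), &t1->c);
            bpf_probe_read_kernel(&t, sizeof(t), &t2->d);
            return t;
    }

    /* New style with BTF-typed pointers: plain C dereferences; the
     * verifier and JIT arrange for a faulting load to be fixed up
     * (dest register cleared) instead of oopsing. */
    static int read_d_direct(struct a *a)
    {
            return a->b->c->d;
    }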
On Wed, Jan 27, 2021 at 5:06 PM Yonghong Song <yhs@fb.com> wrote:
>
> On 1/27/21 1:47 PM, Andy Lutomirski wrote:
> > [...]
> > Wouldn't it be better to either just emit the branch directly or to
> > make sure that the pointer is valid in the first place?
>
> Let me explain the motivation for this patch.
>
> [...]
>
> The motivation here is to improve the user experience so users can just
> write
>
>     int t = a->b->c->d
>
> and the kernel will automatically take care of exceptions, maintaining
> the bpf_probe_read_kernel() semantics. So what the above patch
> essentially does is check whether regs->ip is one of the ips that
> perform a "bpf_probe_read_kernel()" (actually a direct access); if so,
> it fixes up the exception (clears the dest register) and returns.

Okay, so I guess you're trying to inline probe_read_kernel(). But
that means you have to inline a valid implementation. In particular,
you need to check that you're accessing *kernel* memory. Just like
how get_user() validates that the pointer points into user memory,
your helper should bounds check the pointer. On x86, you could check
the high bit.

As an extra complication, we should really add logic to
get_kernel_nofault() to verify that the pointer points into actual
memory as opposed to MMIO space (or future incoherent MKTME space or
something like that, sigh). This will severely complicate inlining
it. And we should *really* make the same fix to get_kernel_nofault()
-- it should validate that the pointer is a kernel pointer.

Is this really worth inlining instead of having the BPF JIT generate
an out of line call to a real C function? That would let us put in a
sane implementation.
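For reference, a minimal sketch of the kind of bound check being asked
for here, modeled on x86-64's split address space; the helper name is
illustrative, not an existing kernel function, and real code would also
need the MMIO check mentioned above:

    /* Return true only if [src, src + size) lies entirely in the
     * kernel half of the address space. This rejects NULL and every
     * other user address before any access is attempted, the same way
     * get_user() bounds-checks user pointers. */
    static bool kernel_range_ok(const void *src, size_t size)
    {
            unsigned long vaddr = (unsigned long)src;

            if (vaddr < TASK_SIZE_MAX)      /* user half, incl. NULL */
                    return false;
            if (vaddr + size < vaddr)       /* wrap-around */
                    return false;
            return true;
    }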
On Thu, Jan 28, 2021 at 03:51:13PM -0800, Andy Lutomirski wrote:
>
> Okay, so I guess you're trying to inline probe_read_kernel(). But
> that means you have to inline a valid implementation. In particular,
> you need to check that you're accessing *kernel* memory. Just like

That check is on the verifier side. It only does it for kernel
pointers with known types.
In a sequence a->b->c the verifier guarantees that 'a' is a valid
kernel pointer and it's also !null. Then it guarantees that offsetof(b)
points to a valid kernel field which is also a pointer. What it doesn't
check is that b != null, so that users don't have to write silly code
with 'if (p)' after every dereference.

> how get_user() validates that the pointer points into user memory,
> your helper should bounds check the pointer. On x86, you could check
> the high bit.
>
> [...]
>
> Is this really worth inlining instead of having the BPF JIT generate
> an out of line call to a real C function? That would let us put in a
> sane implementation.

It's out of the question.
The JIT cannot generate a helper call for a single bpf insn without
huge overhead. All registers are used. It needs a full save/restore,
stack increase, etc.

Anyhow I bet the bug we're discussing has nothing to do with bpf and
jit. Something got changed and now probe_kernel_read(NULL) warns on
!SMAP. This is something to debug.
On Thu, Jan 28, 2021 at 4:11 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> That check is on the verifier side. It only does it for kernel
> pointers with known types.
> In a sequence a->b->c the verifier guarantees that 'a' is a valid
> kernel pointer and it's also !null. Then it guarantees that offsetof(b)
> points to a valid kernel field which is also a pointer. What it doesn't
> check is that b != null, so that users don't have to write silly code
> with 'if (p)' after every dereference.

That sounds like a verifier and/or JIT bug. If you have a pointer p
(doesn't matter whether p is what you call a or a->b) and you have not
confirmed that p points to the kernel range, you may not generate a
load from that pointer.

> It's out of the question.
> The JIT cannot generate a helper call for a single bpf insn without
> huge overhead. All registers are used. It needs a full save/restore,
> stack increase, etc.
>
> Anyhow I bet the bug we're discussing has nothing to do with bpf and
> jit. Something got changed and now probe_kernel_read(NULL) warns on
> !SMAP. This is something to debug.

The bug is in bpf.
On Thu, Jan 28, 2021 at 04:18:24PM -0800, Andy Lutomirski wrote:
> That sounds like a verifier and/or JIT bug. If you have a pointer p
> (doesn't matter whether p is what you call a or a->b) and you have not
> confirmed that p points to the kernel range, you may not generate a
> load from that pointer.

Please read the explanation again. It's an inlined probe_kernel_read.

> [...]
>
> The bug is in bpf.

If you don't care to debug please don't provide wrong guesses.
On Thu, Jan 28, 2021 at 4:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Please read the explanation again. It's an inlined probe_kernel_read.

Can you point me at the uninlined implementation? Does it still
exist? I see get_kernel_nofault(), which is currently buggy, and I
will fix it.

> If you don't care to debug please don't provide wrong guesses.

BPF generated a NULL pointer dereference (where NULL is a user
pointer) and expected it to recover cleanly. What exactly am I
supposed to debug? IMO the only thing wrong with the x86 code is that
it doesn't complain more loudly. I will fix that, too.

--Andy
On Thu, Jan 28, 2021 at 4:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> Can you point me at the uninlined implementation? Does it still
> exist? I see get_kernel_nofault(), which is currently buggy, and I
> will fix it.

I spoke too soon. get_kernel_nofault() is just fine. You have
inlined something like __get_kernel_nofault(), which is incorrect.
get_kernel_nofault() would have done the right thing.
On Fri, Jan 29, 2021 at 1:11 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Jan 28, 2021 at 03:51:13PM -0800, Andy Lutomirski wrote:
> > Okay, so I guess you're trying to inline probe_read_kernel(). But
> > that means you have to inline a valid implementation. In particular,
> > you need to check that you're accessing *kernel* memory. Just like
>
> That check is on the verifier side. It only does it for kernel
> pointers with known types.
> In a sequence a->b->c the verifier guarantees that 'a' is a valid
> kernel pointer and it's also !null. Then it guarantees that offsetof(b)
> points to a valid kernel field which is also a pointer. What it doesn't
> check is that b != null, so that users don't have to write silly code
> with 'if (p)' after every dereference.

How is that supposed to work? If I e.g. have a pointer to a
task_struct, and I do something like:

    task->mm->mmap->vm_file->f_inode

and another thread concurrently mutates the VMA tree and frees the VMA
that we're traversing here, how can BPF guarantee that
task->mm->mmap->vm_file is a valid pointer and not whatever garbage we
read from freed memory?
On Thu, Jan 28, 2021 at 04:29:51PM -0800, Andy Lutomirski wrote:
> [...]
>
> BPF generated a NULL pointer dereference (where NULL is a user
> pointer) and expected it to recover cleanly. What exactly am I
> supposed to debug? IMO the only thing wrong with the x86 code is that
> it doesn't complain more loudly. I will fix that, too.

are you saying that NULL is a _user_ pointer?!
It's NULL. All zeros.
probe_read_kernel(NULL) was returning EFAULT on it and should continue
doing so.
On Fri, Jan 29, 2021 at 01:35:16AM +0100, Jann Horn wrote:
> How is that supposed to work? If I e.g. have a pointer to a
> task_struct, and I do something like:
>
>     task->mm->mmap->vm_file->f_inode
>
> and another thread concurrently mutates the VMA tree and frees the VMA
> that we're traversing here, how can BPF guarantee that
> task->mm->mmap->vm_file is a valid pointer and not whatever garbage we
> read from freed memory?

Please read upthread. Every -> is replaced with probe_kernel_read.
That's what kprobes were doing for years. That's what bpf was doing
for years.
On Thu, Jan 28, 2021 at 4:41 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> are you saying that NULL is a _user_ pointer?!
> It's NULL. All zeros.
> probe_read_kernel(NULL) was returning EFAULT on it and should continue
> doing so.

probe_read_kernel() does not exist. get_kernel_nofault() returns
-ERANGE.

And yes, NULL is a user pointer. I can write you a little Linux
program that maps some real valid data at user address 0. As I noted
when I first analyzed this bug, because NULL is a user address, bpf is
incorrectly triggering the *user* fault handling code, and that code
is objecting.

I propose the following fix to the x86 code. I'll send it as a real
patch tomorrow.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=f61282777772f375bba7130ae39ccbd7e83878b2

--Andy
On Fri, Jan 29, 2021 at 1:43 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Jan 29, 2021 at 01:35:16AM +0100, Jann Horn wrote:
> > How is that supposed to work? If I e.g. have a pointer to a
> > task_struct, and I do something like:
> >
> >     task->mm->mmap->vm_file->f_inode
> >
> > and another thread concurrently mutates the VMA tree and frees the VMA
> > that we're traversing here, how can BPF guarantee that
> > task->mm->mmap->vm_file is a valid pointer and not whatever garbage we
> > read from freed memory?
>
> Please read upthread. Every -> is replaced with probe_kernel_read.
> That's what kprobes were doing for years. That's what bpf was doing
> for years.

Uh... but -> on PTR_TO_BTF_ID pointers is not replaced with
probe_kernel_read() and can be done directly with BPF_LDX, right? And
dereferencing a PTR_TO_BTF_ID pointer returns another PTR_TO_BTF_ID
pointer if type information is available, right? (See
btf_struct_access().) And stuff like BPF LSM programs or some of the
XDP stuff receives BTF-typed pointers to kernel data structures as
arguments, right?

And as an example, this is visible in
tools/testing/selftests/bpf/progs/ima.c , which does:

SEC("lsm.s/bprm_committed_creds")
int BPF_PROG(ima, struct linux_binprm *bprm)
{
        u32 pid = bpf_get_current_pid_tgid() >> 32;

        if (pid == monitored_pid)
                ima_hash_ret = bpf_ima_inode_hash(bprm->file->f_inode,
                                                  &ima_hash,
                                                  sizeof(ima_hash));

        return 0;
}

As far as I can tell, we are getting a BTF-typed pointer "bprm", doing
dereferences that again yield BTF-typed pointers, and then pass the
BTF-typed pointer at the end of it into bpf_ima_inode_hash()?

What am I missing?
On Thu, Jan 28, 2021 at 04:45:41PM -0800, Andy Lutomirski wrote:
>
> probe_read_kernel() does not exist. get_kernel_nofault() returns
> -ERANGE.

That was an old name. bpf_probe_read_kernel() is using
copy_from_kernel_nofault() now.

> And yes, NULL is a user pointer. I can write you a little Linux
> program that maps some real valid data at user address 0. As I noted

are you sure? I thought mmap of addr zero was disallowed long ago.

> when I first analyzed this bug, because NULL is a user address, bpf is
> incorrectly triggering the *user* fault handling code, and that code
> is objecting.
>
> I propose the following fix to the x86 code. I'll send it as a real
> patch tomorrow.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=f61282777772f375bba7130ae39ccbd7e83878b2

You realize that you propose to panic kernels for all existing tracing
users, right?

Do you have a specific security concern with treating a fault on NULL
as special?
On Fri, Jan 29, 2021 at 02:03:02AM +0100, Jann Horn wrote:
>
> Uh... but -> on PTR_TO_BTF_ID pointers is not replaced with
> probe_kernel_read() and can be done directly with BPF_LDX, right?

Almost. They're replaced with BPF_LDX | BPF_PROBE_MEM. The interpreter
calls bpf_probe_read_kernel() to process them. The JIT replaces these
insns with atomic loads plus extable entries.

> And dereferencing a PTR_TO_BTF_ID pointer returns another PTR_TO_BTF_ID
> pointer if type information is available, right? (See
> btf_struct_access().) And stuff like BPF LSM programs or some of the
> XDP stuff receives BTF-typed pointers to kernel data structures as
> arguments, right?
>
> [...]
>
> As far as I can tell, we are getting a BTF-typed pointer "bprm", doing
> dereferences that again yield BTF-typed pointers, and then pass the
> BTF-typed pointer at the end of it into bpf_ima_inode_hash()?

correct. all of them bpf_probe_read_kernel() ==
copy_from_kernel_nofault().
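For readers following along: the extable fixup mechanism referred to
here works roughly like the sketch below, paraphrased from the x86 BPF
JIT of this era (arch/x86/net/bpf_jit_comp.c); the exact handler
signature in any given tree may differ:

    /* Called from the page fault path when the faulting ip matches a
     * BPF extable entry. The JIT encoded, in the fixup word, which
     * pt_regs slot holds the destination register (high bits) and the
     * length of the faulting load instruction (low byte). */
    static bool ex_handler_bpf(const struct exception_table_entry *x,
                               struct pt_regs *regs)
    {
            u32 reg = x->fixup >> 8;    /* dest reg offset in pt_regs */

            /* Emulate bpf_probe_read_kernel() semantics: zero the
             * destination register and skip the faulting load. */
            *(unsigned long *)((void *)regs + reg) = 0;
            regs->ip += x->fixup & 0xff;
            return true;
    }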
On Thu, Jan 28, 2021 at 5:04 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> > And yes, NULL is a user pointer. I can write you a little Linux
> > program that maps some real valid data at user address 0. As I noted
>
> are you sure? I thought mmap of addr zero was disallowed long ago.

Quite sure.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <err.h>

int main()
{
        int *ptr = mmap(0, 4096, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
        if (ptr == MAP_FAILED)
                err(1, "mmap");
        *ptr = 1;
        printf("I just wrote %d to %p\n", *ptr, ptr);
        return 0;
}

Whether this works or not depends on a complicated combination of
sysctl settings, process capabilities, and whether SELinux feels like
adding its own restrictions. But it does work on current kernels.

> You realize that you propose to panic kernels for all existing tracing
> users, right?
>
> Do you have a specific security concern with treating a fault on NULL
> as special?

The security concern is probably not that severe because of the
aforementioned restrictions, but I haven't analyzed it very carefully.

But I do have a *functionality* concern. As the original email that
prompted this whole discussion noted, the current x86 fault code does
not do what you want it to do if SMAP is off. The fact that it mostly
works with SMAP on is luck. With SMAP off, we will behave erratically
at best.

What exactly could the fault code even do to fix this up? Something like:

    if (addr == 0 && SMAP off && error_code says it's kernel mode && we
        don't have permission to map NULL) {
            special care for bpf;
    }

This seems arbitrary and fragile. And it's not obviously implementable
safely without taking locks, which we really ought not to be doing from
inside arbitrary bpf programs. Keep in mind that, if SMAP is
unavailable, the fault code literally does not know whether the page
fault originated from a valid uaccess region.

There's also always the possibility that you simultaneously run bpf
and something like Wine or DOSEMU2 that actually maps the zero page,
in which case the effect of the bpf code is going to be quite erratic
and will depend on which mm is loaded.

--Andy
On Thu, Jan 28, 2021 at 05:31:35PM -0800, Andy Lutomirski wrote:
>
> What exactly could the fault code even do to fix this up? Something like:
>
>     if (addr == 0 && SMAP off && error_code says it's kernel mode && we
>         don't have permission to map NULL) {
>             special care for bpf;
>     }

right. where 'special care' is checking the extable and skipping the
single load instruction.

> This seems arbitrary and fragile. And it's not obviously
> implementable safely without taking locks,

search_bpf_extables() needs rcu_read_lock() only.
Not the locks you're talking about.

> which we really ought not to be doing from inside arbitrary bpf
> programs.

Not in arbitrary progs. It's only available to progs loaded by root.

> Keep in mind that, if SMAP is unavailable, the fault code literally
> does not know whether the page fault originated from a valid uaccess
> region.
>
> There's also always the possibility that you simultaneously run bpf
> and something like Wine or DOSEMU2 that actually maps the zero page,
> in which case the effect of the bpf code is going to be quite erratic
> and will depend on which mm is loaded.

It's tracing. Folks who write those progs have accepted the fact that
the data returned by probe_read is not going to be 100% accurate.

The bpf jit can insert a !null check before every such special load
(since it knows all of them), but that is an obvious performance loss
that would be good to avoid. If the fault handler can do

    if (addr == 0 && ...) search_bpf_extables()

at least under some conditions and cpu flags, that is already a win.
In all other cases the bpf jit will insert !null checks.
On Thu, Jan 28, 2021 at 5:53 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> right. where 'special care' is checking the extable and skipping the
> single load instruction.
>
> search_bpf_extables() needs rcu_read_lock() only.
> Not the locks you're talking about.

I mean the locks in the if statement. How am I supposed to tell
whether this fault is a special bpf fault or a normal page fault
without taking a lock to look up the VMA or to do some other hack?

Also, why should BPF get special dispensation to generate a special
kind of nonsensical page fault that no other kernel code is allowed to
do?

> Not in arbitrary progs. It's only available to progs loaded by root.
>
> It's tracing. Folks who write those progs have accepted the fact that
> the data returned by probe_read is not going to be 100% accurate.

Is this really all tracing or is some of it actual live network code?

> The bpf jit can insert a !null check before every such special load
> (since it knows all of them), but that is an obvious performance loss
> that would be good to avoid. If the fault handler can do
>
>     if (addr == 0 && ...) search_bpf_extables()
>
> at least under some conditions and cpu flags, that is already a win.
> In all other cases the bpf jit will insert !null checks.

Again, there is no guarantee that a page fault is even generated for
this. And it doesn't seem very reasonable for the fault code to have
to decide whether a NULL pointer dereference is a special BPF fault or
a real user access fault against a VMA at address 9.

--Andy
On Thu, Jan 28, 2021 at 06:18:37PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 28, 2021 at 5:53 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jan 28, 2021 at 05:31:35PM -0800, Andy Lutomirski wrote:
> > >
> > > What exactly could the fault code even do to fix this up? Something like:
> > >
> > > if (addr == 0 && SMAP off && error_code says it's kernel mode && we
> > > don't have permission to map NULL) {
> > >     special care for bpf;
> > > }
> >
> > Right, where 'special care' means checking the extable and skipping a
> > single load instruction.
> >
> > > This seems arbitrary and fragile. And it's not obviously
> > > implementable safely without taking locks,
> >
> > search_bpf_extables() needs rcu_read_lock() only.
> > Not the locks you're talking about.
>
> I mean the locks in the if statement. How am I supposed to tell
> whether this fault is a special bpf fault or a normal page fault
> without taking a lock to look up the VMA or doing some other hack?

search_bpf_extables() only needs the faulting rip.
There is no need to look up the vma.

	if (addr == 0 && search_bpf_extables(regs->ip)) ...

is trivial enough and won't penalize page faults in general.
These conditions are not going to happen in normal kernel code.

> Also, why should BPF get special dispensation to generate a special
> kind of nonsensical page fault that no other kernel code is allowed
> to generate?

bpf is special, since it cares a lot about performance and saving
branches in the critical path is important.

> > > which we really ought not
> > > to be doing from inside arbitrary bpf programs.
> >
> > Not in arbitrary progs. It's only available to progs loaded by root.
> >
> > > Keep in mind that, if
> > > SMAP is unavailable, the fault code literally does not know whether
> > > the page fault originated from a valid uaccess region.
> > >
> > > There's also always the possibility that you simultaneously run bpf
> > > and something like Wine or DOSEMU2 that actually maps the zero page,
> > > in which case the effect of the bpf code is going to be quite erratic
> > > and will depend on which mm is loaded.
> >
> > It's tracing. Folks who write those progs have accepted that the data
> > returned by probe_read is not going to be 100% accurate.
>
> Is this really all tracing, or is some of it actual live network code?

Networking bpf progs don't use this. bpf tracing does. But I'm not sure
why you're asking. Tracing has higher performance demands than networking.

> > The bpf JIT can insert a !NULL check before every such special load
> > (since it knows all of them), but that is an obvious performance loss
> > which would be good to avoid. If the fault handler can do
> >	if (addr == 0 && ...) search_bpf_extables()
> > at least under some conditions and cpu flags, it's already a win. In
> > all other cases the bpf JIT will insert !NULL checks.
>
> Again, there is no guarantee that a page fault is even generated for
> this. And it doesn't seem very reasonable for the fault code to have
> to decide whether a NULL pointer dereference is a special BPF fault or
> a real user access fault against a VMA at address 0.

The faulting address and faulting ip will precisely identify this
situation. There is no guesswork.
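Concretely, the gate proposed above might look like the following
inside do_user_addr_fault(). This is a sketch of the proposal rather
than merged code, and the exact set of extra conditions (cpu flags,
error-code bits) is precisely what is being debated in this thread:

	/*
	 * Sketch: only consult the bpf extables for a kernel-mode
	 * fault on address zero, so ordinary page faults pay at most
	 * one extra compare before the VMA lookup.
	 */
	if (unlikely(address == 0 && !user_mode(regs)) &&
	    fixup_bpf_exception(regs, X86_TRAP_PF, hw_error_code, address))
		return;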
> On Jan 28, 2021, at 6:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jan 28, 2021 at 06:18:37PM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 28, 2021 at 5:53 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Thu, Jan 28, 2021 at 05:31:35PM -0800, Andy Lutomirski wrote:
>>>>
>>>> What exactly could the fault code even do to fix this up? Something like:
>>>>
>>>> if (addr == 0 && SMAP off && error_code says it's kernel mode && we
>>>> don't have permission to map NULL) {
>>>>     special care for bpf;
>>>> }
>>>
>>> Right, where 'special care' means checking the extable and skipping a
>>> single load instruction.
>>>
>>>> This seems arbitrary and fragile. And it's not obviously
>>>> implementable safely without taking locks,
>>>
>>> search_bpf_extables() needs rcu_read_lock() only.
>>> Not the locks you're talking about.
>>
>> I mean the locks in the if statement. How am I supposed to tell
>> whether this fault is a special bpf fault or a normal page fault
>> without taking a lock to look up the VMA or doing some other hack?
>
> search_bpf_extables() only needs the faulting rip.
> There is no need to look up the vma.
>
>	if (addr == 0 && search_bpf_extables(regs->ip)) ...
>
> is trivial enough and won't penalize page faults in general.
> These conditions are not going to happen in normal kernel code.

The need to make this decision will happen in normal code.

The page fault code is a critical piece of the kernel. The absolute
top priority is correctness. Then comes performance for the actual
common case, which is a regular page fault against a valid VMA.
Everything else is lower priority.

This means I’m not about to add an if (special bpf insn) check before
the VMA lookup. And by the time we do the VMA lookup, we have already
lost.

You could try to play games with pagefault_disable(), but that will
have its own problems.

> The faulting address and faulting ip will precisely identify this
> situation. There is no guesswork.

If there is a fault on an instruction with an exception handler and
the faulting address is a valid user pointer, it is absolutely
ambiguous whether it’s BPF chasing a NULL pointer or something else
following a valid user pointer. The only reason this thing works
somewhat reliably right now is that SMAP lessens the ambiguity to
some extent.
On Thu, Jan 28, 2021 at 07:09:29PM -0800, Andy Lutomirski wrote:
>
>> On Jan 28, 2021, at 6:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>> On Thu, Jan 28, 2021 at 06:18:37PM -0800, Andy Lutomirski wrote:
>>> On Thu, Jan 28, 2021 at 5:53 PM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>>
>>>> On Thu, Jan 28, 2021 at 05:31:35PM -0800, Andy Lutomirski wrote:
>>>>>
>>>>> What exactly could the fault code even do to fix this up? Something like:
>>>>>
>>>>> if (addr == 0 && SMAP off && error_code says it's kernel mode && we
>>>>> don't have permission to map NULL) {
>>>>>     special care for bpf;
>>>>> }
>>>>
>>>> Right, where 'special care' means checking the extable and skipping a
>>>> single load instruction.
>>>>
>>>>> This seems arbitrary and fragile. And it's not obviously
>>>>> implementable safely without taking locks,
>>>>
>>>> search_bpf_extables() needs rcu_read_lock() only.
>>>> Not the locks you're talking about.
>>>
>>> I mean the locks in the if statement. How am I supposed to tell
>>> whether this fault is a special bpf fault or a normal page fault
>>> without taking a lock to look up the VMA or doing some other hack?
>>
>> search_bpf_extables() only needs the faulting rip.
>> There is no need to look up the vma.
>>
>>	if (addr == 0 && search_bpf_extables(regs->ip)) ...
>>
>> is trivial enough and won't penalize page faults in general.
>> These conditions are not going to happen in normal kernel code.
>
> The need to make this decision will happen in normal code.
>
> The page fault code is a critical piece of the kernel. The absolute
> top priority is correctness. Then comes performance for the actual
> common case, which is a regular page fault against a valid VMA.
> Everything else is lower priority.
>
> This means I’m not about to add an if (special bpf insn) check before
> the VMA lookup. And by the time we do the VMA lookup, we have already
> lost.

What's the difference between a bpf prog and a kprobe? Not much from
the page-fault point of view. They both do things that are not normal
for the rest of the kernel. if (kprobe_page_fault()) is handled first
by the fault handler, so there is a precedent for handling special
things.

> You could try to play games with pagefault_disable(), but that will
> have its own problems.
>
>> The faulting address and faulting ip will precisely identify this
>> situation. There is no guesswork.
>
> If there is a fault on an instruction with an exception handler and
> the faulting address is a valid user pointer, it is absolutely
> ambiguous whether it’s BPF chasing a NULL pointer or something else
> following a valid user pointer.

It's not ambiguous. Please take a look at search_bpf_extables().
It's not like the whole bpf program is one special region. Only
certain specific instructions are in the extable. That table was
dynamically generated by the JIT, just like kernel modules' extables.
bpf helpers are not in the extable. Most of the bpf prog code is not
in it either.
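The precedent referred to here: the x86 fault entry code calls
kprobe_page_fault() before taking mmap_lock. Its definition is roughly
the following (simplified from include/linux/kprobes.h; the exact
shape varies by kernel version):

	static nokprobe_inline bool kprobe_page_fault(struct pt_regs *regs,
						      unsigned int trap)
	{
		if (!kprobes_built_in() || user_mode(regs))
			return false;
		/*
		 * To be potentially processing a kprobe fault and to be
		 * allowed to call kprobe_running(), we have to be
		 * non-preemptible.
		 */
		if (preemptible() || !kprobe_running())
			return false;
		return kprobe_fault_handler(regs, trap);
	}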
diff --git a/arch/x86/include/asm/extable.h b/arch/x86/include/asm/extable.h
index 1f0cbc52937c..1e99da15336c 100644
--- a/arch/x86/include/asm/extable.h
+++ b/arch/x86/include/asm/extable.h
@@ -38,6 +38,9 @@ enum handler_type {
 
 extern int fixup_exception(struct pt_regs *regs, int trapnr,
 			   unsigned long error_code, unsigned long fault_addr);
+extern int fixup_bpf_exception(struct pt_regs *regs, int trapnr,
+			       unsigned long error_code,
+			       unsigned long fault_addr);
 extern int fixup_bug(struct pt_regs *regs, int trapnr);
 extern enum handler_type ex_get_fault_handler_type(unsigned long ip);
 extern void early_fixup_exception(struct pt_regs *regs, int trapnr);
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index b93d6cd08a7f..311da42c0aec 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -155,6 +155,20 @@ enum handler_type ex_get_fault_handler_type(unsigned long ip)
 	return EX_HANDLER_OTHER;
 }
 
+int fixup_bpf_exception(struct pt_regs *regs, int trapnr,
+			unsigned long error_code, unsigned long fault_addr)
+{
+	const struct exception_table_entry *e;
+	ex_handler_t handler;
+
+	e = search_bpf_extables(regs->ip);
+	if (!e)
+		return 0;
+
+	handler = ex_fixup_handler(e);
+	return handler(e, regs, trapnr, error_code, fault_addr);
+}
+
 int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 		    unsigned long fault_addr)
 {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a0956a..e8182d30bf67 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1317,6 +1317,15 @@ void do_user_addr_fault(struct pt_regs *regs,
 		if (emulate_vsyscall(hw_error_code, regs, address))
 			return;
 	}
+
+#ifdef CONFIG_BPF_JIT
+	/*
+	 * Faults incurred by bpf program might need emulation, i.e.,
+	 * clearing the dest register.
+	 */
+	if (fixup_bpf_exception(regs, X86_TRAP_PF, hw_error_code, address))
+		return;
+#endif
 #endif
 
 	/*
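For context, the handler that fixup_bpf_exception() ends up calling is
the per-load fixup the x86 bpf JIT registers in the program's extable:
it zeroes the destination register and skips over the faulting load.
In arch/x86/net/bpf_jit_comp.c of this era it looks roughly like the
following (shown as a sketch; the fixup encoding is JIT-internal):

	static bool ex_handler_bpf(const struct exception_table_entry *x,
				   struct pt_regs *regs, int trapnr,
				   unsigned long error_code,
				   unsigned long fault_addr)
	{
		/*
		 * The JIT packed the pt_regs offset of the destination
		 * register and the length of the faulting instruction
		 * into x->fixup.
		 */
		u32 reg = x->fixup >> 8;

		/* Emulate "dst_reg = 0" and jump over the faulting load. */
		*(unsigned long *)((void *)regs + reg) = 0;
		regs->ip += x->fixup & 0xff;
		return true;
	}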
When reviewing patch ([1]), which adds a script to run bpf selftests
through qemu at the /sbin/init stage, I found the following kernel bug
warning:

[ 112.118892] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1351
[ 112.119805] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 354, name: new_name
[ 112.120512] 3 locks held by new_name/354:
[ 112.120868] #0: ffff88800476e0a0 (&p->lock){+.+.}-{3:3}, at: bpf_seq_read+0x3a/0x3d0
[ 112.121573] #1: ffffffff82d69800 (rcu_read_lock){....}-{1:2}, at: bpf_iter_run_prog+0x5/0x160
[ 112.122348] #2: ffff8880061c2088 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x1a1/0x640
[ 112.123128] Preemption disabled at:
[ 112.123130] [<ffffffff8108f913>] migrate_disable+0x33/0x80
[ 112.123942] CPU: 0 PID: 354 Comm: new_name Tainted: G O 5.11.0-rc4-00524-g6e66fbb10597-dirty #1249
[ 112.124822] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[ 112.125614] Call Trace:
[ 112.125835]  dump_stack+0x77/0x97
[ 112.126137]  ___might_sleep.cold.119+0xf2/0x106
[ 112.126537]  exc_page_fault+0x1c1/0x640
[ 112.126888]  asm_exc_page_fault+0x1e/0x30
[ 112.127241] RIP: 0010:bpf_prog_0a182df2d34af188_dump_bpf_prog+0xf5/0xb3c
[ 112.127825] Code: 00 00 8b 7d f4 41 8b 76 44 48 39 f7 73 06 48 01 fb 49 89 df 4c 89 7d d8 49 8b bd 20 01 00 00 48 89 7d e0 49 8b bd e0 00 00 00 <48> 8b 7f 20 48 01 d7 48 89 7d e8 48 89 e9 48 83 c1 d0 48 8b 7d c8
[ 112.129433] RSP: 0018:ffffc9000035fdc8 EFLAGS: 00010282
[ 112.129895] RAX: 0000000000000000 RBX: ffff888005a49458 RCX: 0000000000000024
[ 112.130509] RDX: 00000000000002f0 RSI: 0000000000000509 RDI: 0000000000000000
[ 112.131126] RBP: ffffc9000035fe20 R08: 0000000000000001 R09: 0000000000000000
[ 112.131737] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000400
[ 112.132355] R13: ffff888006085800 R14: ffff888004718540 R15: ffff888005a49458
[ 112.132990]  ? bpf_prog_0a182df2d34af188_dump_bpf_prog+0xc8/0xb3c
[ 112.133526]  bpf_iter_run_prog+0x75/0x160
[ 112.133880]  __bpf_prog_seq_show+0x39/0x40
[ 112.134258]  bpf_seq_read+0xf6/0x3d0
[ 112.134582]  vfs_read+0xa3/0x1b0
[ 112.134873]  ksys_read+0x4f/0xc0
[ 112.135166]  do_syscall_64+0x2d/0x40
[ 112.135482]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

To reproduce the issue, apply patch [1] and run the following command:

  tools/testing/selftests/bpf/run_in_vm.sh -- cat /sys/fs/bpf/progs.debug

The above kernel warning is triggered because the bpf program
dereferences address 0, and at this point in the fault path that
access is not caught by the bpf exception-handling logic.

  ...
  SEC("iter/bpf_prog")
  int dump_bpf_prog(struct bpf_iter__bpf_prog *ctx)
  {
          struct bpf_prog *prog = ctx->prog;
          struct bpf_prog_aux *aux;
          ...
          if (!prog)
                  return 0;

          aux = prog->aux;
          ...
          ... aux->dst_prog->aux->name ...
          return 0;
  }

If aux->dst_prog is a NULL pointer, a fault happens when
aux->dst_prog->aux is accessed. In arch/x86/mm/fault.c, function
do_user_addr_fault() contains the following code:

        if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
                     !(hw_error_code & X86_PF_USER) &&
                     !(regs->flags & X86_EFLAGS_AC))) {
                bad_area_nosemaphore(regs, hw_error_code, address);
                return;
        }

When the test is run normally after the login prompt,
cpu_feature_enabled(X86_FEATURE_SMAP) is true, so
bad_area_nosemaphore() is called and then fixup_exception(), where the
bpf-specific handler is able to fix up the exception.
But when the test is run at /sbin/init time,
cpu_feature_enabled(X86_FEATURE_SMAP) is false and control reaches:

        if (unlikely(!mmap_read_trylock(mm))) {
                if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
                        /*
                         * Fault from code in kernel from
                         * which we do not expect faults.
                         */
                        bad_area_nosemaphore(regs, hw_error_code, address);
                        return;
                }
retry:
                mmap_read_lock(mm);
        } else {
                /*
                 * The above down_read_trylock() might have succeeded in
                 * which case we'll have missed the might_sleep() from
                 * down_read():
                 */
                might_sleep();
        }

Here might_sleep() is triggered and the above kernel warning is
printed.

To fix the issue, before the above mmap_read_trylock() we check
whether the faulting ip can be served by the bpf exception handler;
if so, the exception is fixed up and we return.

[1] https://lore.kernel.org/bpf/20210123004445.299149-1-kpsingh@kernel.org/

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 arch/x86/include/asm/extable.h |  3 +++
 arch/x86/mm/extable.c          | 14 ++++++++++++++
 arch/x86/mm/fault.c            |  9 +++++++++
 3 files changed, 26 insertions(+)
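Independent of any fault-path change, a bpf program can sidestep the
fault entirely by guarding the pointer itself. A sketch of a guarded
variant of the iterator program above (same field names as the
original program):

	SEC("iter/bpf_prog")
	int dump_bpf_prog(struct bpf_iter__bpf_prog *ctx)
	{
		struct bpf_prog *prog = ctx->prog;
		struct bpf_prog_aux *aux;

		if (!prog)
			return 0;

		aux = prog->aux;
		/* Explicit guard instead of relying on extable fixup. */
		if (!aux->dst_prog)
			return 0;
		/* ... use aux->dst_prog->aux->name ... */
		return 0;
	}

This is essentially the source-level version of the !NULL check the
JIT would otherwise have to emit before each guarded load.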