
[bpf-next,v3,1/2] x86: Perform BPF exception fixup in do_user_addr_fault

Message ID: 20241103193512.4076710-2-memxor@gmail.com
State: New, archived
Delegated to: BPF
Series: Zero overhead PROBE_MEM

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: hpa@zytor.com
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4 this patch: 4
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 23 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-13 fail Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-17 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-18 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-PR fail PR summary

Commit Message

Kumar Kartikeya Dwivedi Nov. 3, 2024, 7:35 p.m. UTC
Currently, on x86, when SMAP is enabled and a page fault occurs in
kernel mode while accessing a user address, the kernel will rightly
panic: no valid kernel code can generate such a fault (unless it is
buggy), so this behavior is correct.

BPF programs that encounter user addresses when performing PROBE_MEM
loads (load instructions which are allowed to read any kernel address,
available only to root users) currently avoid a page fault by bounds
checking the address. This requires the JIT to emit a jump over each
PROBE_MEM load instruction to avoid hitting page faults.
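
For reference, a C-level sketch of the guard the x86 JIT emits today
around such a load (illustrative only: the real sequence is emitted as
machine code in arch/x86/net/bpf_jit_comp.c, and the exact limit and
register usage are simplified here):

	if (src + off < TASK_SIZE_MAX + PAGE_SIZE) {
		dst = 0;		/* user/NULL address: skip the load */
		goto next_insn;		/* the jump we want to eliminate */
	}
	dst = *(u64 *)(src + off);	/* kernel address: plain load */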

We would prefer to avoid these jump instructions to improve the
performance of programs which use PROBE_MEM loads pervasively. For
correct behavior, programs already rely on the kernel addresses being
valid when they are executing, but BPF's safety properties must still
ensure kernel safety in the presence of invalid addresses. Therefore,
for correct programs, the bounds checking is an added cost meant only to
ensure kernel safety. If the do_user_addr_fault handler could perform
fixups for the BPF program in such a case, the bounds checking could be
eliminated and the load instruction emitted directly, without any
checking.

Thus, when SMAP is enabled (which means the kernel traps on accessing a
user address) and the instruction pointer belongs to a BPF program,
perform fixup for the access by searching the exception tables. All BPF
programs already execute with SMAP protection. When SMAP is not enabled,
the BPF JIT will continue to emit bounds checking instructions.

Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 arch/x86/mm/fault.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Dave Hansen Nov. 4, 2024, 5:16 p.m. UTC | #1
On 11/3/24 11:35, Kumar Kartikeya Dwivedi wrote:
> Currently, on x86, when SMAP is enabled and a page fault occurs in
> kernel mode while accessing a user address, the kernel will rightly
> panic: no valid kernel code can generate such a fault (unless it is
> buggy), so this behavior is correct.
> 
> BPF programs that encounter user addresses when performing PROBE_MEM
> loads (load instructions which are allowed to read any kernel address,
> available only to root users) currently avoid a page fault by bounds
> checking the address. This requires the JIT to emit a jump over each
> PROBE_MEM load instruction to avoid hitting page faults.

To be honest, I think the overhead (and complexity) is in the right spot
today: the BPF program.  I don't want to complicate the x86 page fault
handler for a debugging feature that already has a functional solution
today.
Kumar Kartikeya Dwivedi Nov. 4, 2024, 5:50 p.m. UTC | #2
On Mon, 4 Nov 2024 at 11:16, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/3/24 11:35, Kumar Kartikeya Dwivedi wrote:
> > Currently, on x86, when SMAP is enabled and a page fault occurs in
> > kernel mode while accessing a user address, the kernel will rightly
> > panic: no valid kernel code can generate such a fault (unless it is
> > buggy), so this behavior is correct.
> >
> > BPF programs that encounter user addresses when performing PROBE_MEM
> > loads (load instructions which are allowed to read any kernel address,
> > available only to root users) currently avoid a page fault by bounds
> > checking the address. This requires the JIT to emit a jump over each
> > PROBE_MEM load instruction to avoid hitting page faults.
>
> To be honest, I think the overhead (and complexity) is in the right spot
> today: the BPF program.  I don't want to complicate the x86 page fault
> handler for a debugging feature that already has a functional solution
> today.

This is not a debugging feature. Almost all tracing BPF programs that
execute inside the kernel contain loads which have to be marked
PROBE_MEM, because they may end up reading a user pointer and
dereferencing it. The BPF runtime relies on PROBE_MEM whenever reading
from pointers (in root-only programs) that cannot be statically proven
safe. While reading invalid memory is a rare case that should not
happen, it is possible if some kernel field contains a stale address
etc., so loads through such pointers are allowed statically and
handled/trapped at runtime by zeroing out the destination register. So
far x86 won't invoke fixup_exception for user faults, as such faults
were not supposed to happen, which is correct behavior (hence the
bounds check).
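
For context, the fixup that runs for BPF text is the BPF extable
handler; roughly (lightly simplified from ex_handler_bpf() in
arch/x86/net/bpf_jit_comp.c):

	bool ex_handler_bpf(const struct exception_table_entry *x,
			    struct pt_regs *regs)
	{
		/* upper bits of fixup encode which pt_regs field to zero */
		u32 reg = x->fixup >> 8;

		/* zero the destination register, jump over the faulting load */
		*(unsigned long *)((void *)regs + reg) = 0;
		regs->ip += x->fixup & 0xff;
		return true;
	}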

However, this bounds-checking cost is paid in the common path of BPF
programs to handle a rare case. It would be nice to avoid it by
deferring to fixup_exception->ex_handler_bpf when the pc belongs to BPF
text. This was the least intrusive way I could think of doing it.

On arm64, the fault handler already performs exception handling in such
a case, so with this patch we can think of removing the bounds checking
across architectures. We can do something similar for RISC-V as well.

I am fine with changing the commit log/title to reflect that it's an
added cost for the fault handler. I felt that it wouldn't matter
because the kernel was about to panic, so we had bailed out anyway, but
we _do_ end up spending extra cycles when this path is hit now.
Dave Hansen Nov. 4, 2024, 6:09 p.m. UTC | #3
On 11/4/24 09:50, Kumar Kartikeya Dwivedi wrote:
> While reading invalid memory is a rare case that should not happen,
> it may be possible if some kernel field contains a stale address
> etc.

Reading random (unaccepted) memory in a TDX guest is fatal, even from
kernel addresses.  We had a lot of fun even getting
load_unaligned_zeropad() to work.

So, honestly, if you're letting buggy BPF programs get loaded and read
random memory (kernel or user), you've got bigger problems than a
verbose kernel panic when the buggy program happens to touch userspace.

I'd rather not hack code into the page fault handler to add to the
illusion that this is a good idea or safe in any way.

Patch

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e6c469b323cc..189e93d88bd4 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -21,6 +21,7 @@ 
 #include <linux/mm_types.h>
 #include <linux/mm.h>			/* find_and_lock_vma() */
 #include <linux/vmalloc.h>
+#include <linux/filter.h>		/* is_bpf_text_address()	*/
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1257,6 +1258,16 @@  void do_user_addr_fault(struct pt_regs *regs,
 	if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
 		     !(error_code & X86_PF_USER) &&
 		     !(regs->flags & X86_EFLAGS_AC))) {
+		/*
+		 * If the kernel access happened to an invalid user pointer
+		 * under SMAP by a BPF program, we will have an extable entry
+		 * here, and need to perform the fixup.
+		 */
+		if (is_bpf_text_address(regs->ip)) {
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 0, 0, ARCH_DEFAULT_PKEY);
+			return;
+		}
 		/*
 		 * No extable entry here.  This was a kernel access to an
 		 * invalid pointer.  get_kernel_nofault() will not get here.
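
For completeness, the control flow this hunk sets up, sketched in
simplified form (the real kernelmode_fixup_or_oops() in
arch/x86/mm/fault.c also deals with pending signals and kthreads):

	if (is_bpf_text_address(regs->ip)) {
		/*
		 * fixup_exception() searches the exception tables; for
		 * JITed BPF text it dispatches to ex_handler_bpf(), which
		 * zeroes the destination register and skips the faulting
		 * load.
		 */
		if (fixup_exception(regs, X86_TRAP_PF, error_code, address))
			return;
		/* No fixup entry: oops, as before this patch. */
		page_fault_oops(regs, error_code, address);
	}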