mbox series

[bpf-next,0/1] arm64: Add BPF exception tables

Message ID 20200728152122.1292756-1-jean-philippe@linaro.org (mailing list archive)
Headers show
Series arm64: Add BPF exception tables | expand

Message

Jean-Philippe Brucker July 28, 2020, 3:21 p.m. UTC
The following patch adds support for BPF_PROBE_MEM on arm64. The
implementation is simple but I wanted to give a bit of background first.
If you're familiar with recent BPF development you can skip to the patch
(or fact-check the following blurb).

BPF programs used for tracing can inspect any of the traced function's
arguments and follow pointers in struct members. Traditionally the BPF
program would get a struct pt_regs as argument and cast the register
values to the appropriate struct pointer. The BPF verifier would mandate
that any memory access uses the bpf_probe_read() helper, to suppress
page faults (see samples/bpf/tracex1_kern.c).

With BPF Type Format embedded into the kernel (CONFIG_DEBUG_INFO_BTF),
the verifier can now check the type of any access performed by a BPF
program. It rejects for example programs that cast to a different
structure and perform out-of-bounds accesses, or programs that attempt
to dereference something that isn't a pointer, or that hasn't gone
through a NULL check.

As this makes tracing programs safer, the verifier now allows loading
programs that access struct members without bpf_probe_read(). It is
however still possible to trigger page faults. For example in the
following example with which I've tested this patch, the verifier does
not mandate a NULL check for the second-level pointer:

/*
 * From tools/testing/selftests/bpf/progs/bpf_iter_task.c
 * dump_task() is called for each task.
 */
SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;

	/* Program would be rejected without this check */
	if (task == NULL)
		return 0;

	/*
	 * However the verifier does not currently mandate
	 * checking task->mm, and the following faults for kernel
	 * threads.
	 */
	BPF_SEQ_PRINTF(seq, "pid=%d vm=%d", task->pid, task->mm->total_vm);
	return 0;
}

Even if it checked this case, the verifier couldn't guarantee that all
accesses are safe since kernel structures could in theory contain
garbage or error pointers. So to allow fast access without
bpf_probe_read(), a JIT implementation must support BPF exception
tables. For each access to a BTF pointer, the JIT generates an entry
into an exception table appended to the BPF program. If the access
faults at runtime, the handler skips the faulting instruction. The
example above will display vm=0 for kernel threads.

See also
* The original implementation on x86
  https://lore.kernel.org/bpf/20191016032505.2089704-1-ast@kernel.org/
* The s390 implementation
  https://lore.kernel.org/bpf/20200715233301.933201-1-iii@linux.ibm.com/

Jean-Philippe Brucker (1):
  arm64: bpf: Add BPF exception tables

 arch/arm64/include/asm/extable.h |  3 ++
 arch/arm64/mm/extable.c          | 11 ++--
 arch/arm64/net/bpf_jit_comp.c    | 93 +++++++++++++++++++++++++++++---
 3 files changed, 98 insertions(+), 9 deletions(-)

Comments

Ravi Bangoria June 9, 2021, 12:04 p.m. UTC | #1
Hi Alexei,

On 7/28/20 8:51 PM, Jean-Philippe Brucker wrote:
> The following patch adds support for BPF_PROBE_MEM on arm64. The
> implementation is simple but I wanted to give a bit of background first.
> If you're familiar with recent BPF development you can skip to the patch
> (or fact-check the following blurb).
> 
> BPF programs used for tracing can inspect any of the traced function's
> arguments and follow pointers in struct members. Traditionally the BPF
> program would get a struct pt_regs as argument and cast the register
> values to the appropriate struct pointer. The BPF verifier would mandate
> that any memory access uses the bpf_probe_read() helper, to suppress
> page faults (see samples/bpf/tracex1_kern.c).
> 
> With BPF Type Format embedded into the kernel (CONFIG_DEBUG_INFO_BTF),
> the verifier can now check the type of any access performed by a BPF
> program. It rejects for example programs that cast to a different
> structure and perform out-of-bounds accesses, or programs that attempt
> to dereference something that isn't a pointer, or that hasn't gone
> through a NULL check.
> 
> As this makes tracing programs safer, the verifier now allows loading
> programs that access struct members without bpf_probe_read(). It is
> however still possible to trigger page faults. For example in the
> following example with which I've tested this patch, the verifier does
> not mandate a NULL check for the second-level pointer:
> 
> /*
>   * From tools/testing/selftests/bpf/progs/bpf_iter_task.c
>   * dump_task() is called for each task.
>   */
> SEC("iter/task")
> int dump_task(struct bpf_iter__task *ctx)
> {
> 	struct seq_file *seq = ctx->meta->seq;
> 	struct task_struct *task = ctx->task;
> 
> 	/* Program would be rejected without this check */
> 	if (task == NULL)
> 		return 0;
> 
> 	/*
> 	 * However the verifier does not currently mandate
> 	 * checking task->mm, and the following faults for kernel
> 	 * threads.
> 	 */
> 	BPF_SEQ_PRINTF(seq, "pid=%d vm=%d", task->pid, task->mm->total_vm);
> 	return 0;
> }
> 
> Even if it checked this case, the verifier couldn't guarantee that all
> accesses are safe since kernel structures could in theory contain
> garbage or error pointers. So to allow fast access without
> bpf_probe_read(), a JIT implementation must support BPF exception
> tables. For each access to a BTF pointer, the JIT generates an entry
> into an exception table appended to the BPF program. If the access
> faults at runtime, the handler skips the faulting instruction. The
> example above will display vm=0 for kernel threads.

I'm trying with the example above (task->mm->total_vm) on x86 machine
with bpf/master (11fc79fc9f2e3) plus commit 4c5de127598e1 ("bpf: Emit
explicit NULL pointer checks for PROBE_LDX instructions.") *reverted*,
I'm seeing the app getting killed with error in dmesg.

   $ sudo bpftool iter pin bpf_iter_task.o /sys/fs/bpf/task
   $ sudo cat /sys/fs/bpf/task
   Killed

   $ dmesg
   [  188.810020] BUG: kernel NULL pointer dereference, address: 00000000000000c8
   [  188.810030] #PF: supervisor read access in kernel mode
   [  188.810034] #PF: error_code(0x0000) - not-present page

IIUC, this should be handled by bpf exception table rather than killing
the app. Am I missing anything?

Ravi
Alexei Starovoitov June 11, 2021, 12:12 a.m. UTC | #2
On Wed, Jun 9, 2021 at 5:05 AM Ravi Bangoria
<ravi.bangoria@linux.ibm.com> wrote:
>
> Hi Alexei,
>
> On 7/28/20 8:51 PM, Jean-Philippe Brucker wrote:
> > The following patch adds support for BPF_PROBE_MEM on arm64. The
> > implementation is simple but I wanted to give a bit of background first.
> > If you're familiar with recent BPF development you can skip to the patch
> > (or fact-check the following blurb).
> >
> > BPF programs used for tracing can inspect any of the traced function's
> > arguments and follow pointers in struct members. Traditionally the BPF
> > program would get a struct pt_regs as argument and cast the register
> > values to the appropriate struct pointer. The BPF verifier would mandate
> > that any memory access uses the bpf_probe_read() helper, to suppress
> > page faults (see samples/bpf/tracex1_kern.c).
> >
> > With BPF Type Format embedded into the kernel (CONFIG_DEBUG_INFO_BTF),
> > the verifier can now check the type of any access performed by a BPF
> > program. It rejects for example programs that cast to a different
> > structure and perform out-of-bounds accesses, or programs that attempt
> > to dereference something that isn't a pointer, or that hasn't gone
> > through a NULL check.
> >
> > As this makes tracing programs safer, the verifier now allows loading
> > programs that access struct members without bpf_probe_read(). It is
> > however still possible to trigger page faults. For example in the
> > following example with which I've tested this patch, the verifier does
> > not mandate a NULL check for the second-level pointer:
> >
> > /*
> >   * From tools/testing/selftests/bpf/progs/bpf_iter_task.c
> >   * dump_task() is called for each task.
> >   */
> > SEC("iter/task")
> > int dump_task(struct bpf_iter__task *ctx)
> > {
> >       struct seq_file *seq = ctx->meta->seq;
> >       struct task_struct *task = ctx->task;
> >
> >       /* Program would be rejected without this check */
> >       if (task == NULL)
> >               return 0;
> >
> >       /*
> >        * However the verifier does not currently mandate
> >        * checking task->mm, and the following faults for kernel
> >        * threads.
> >        */
> >       BPF_SEQ_PRINTF(seq, "pid=%d vm=%d", task->pid, task->mm->total_vm);
> >       return 0;
> > }
> >
> > Even if it checked this case, the verifier couldn't guarantee that all
> > accesses are safe since kernel structures could in theory contain
> > garbage or error pointers. So to allow fast access without
> > bpf_probe_read(), a JIT implementation must support BPF exception
> > tables. For each access to a BTF pointer, the JIT generates an entry
> > into an exception table appended to the BPF program. If the access
> > faults at runtime, the handler skips the faulting instruction. The
> > example above will display vm=0 for kernel threads.
>
> I'm trying with the example above (task->mm->total_vm) on x86 machine
> with bpf/master (11fc79fc9f2e3) plus commit 4c5de127598e1 ("bpf: Emit
> explicit NULL pointer checks for PROBE_LDX instructions.") *reverted*,
> I'm seeing the app getting killed with error in dmesg.
>
>    $ sudo bpftool iter pin bpf_iter_task.o /sys/fs/bpf/task
>    $ sudo cat /sys/fs/bpf/task
>    Killed
>
>    $ dmesg
>    [  188.810020] BUG: kernel NULL pointer dereference, address: 00000000000000c8
>    [  188.810030] #PF: supervisor read access in kernel mode
>    [  188.810034] #PF: error_code(0x0000) - not-present page
>
> IIUC, this should be handled by bpf exception table rather than killing
> the app. Am I missing anything?

For PROBE_LDX the verifier guarantees that the address is either
a very likely valid kernel address or NULL. On x86 the user and kernel
address spaces are shared and NULL is a user address, so there cannot be
an exception table for NULL. Hence x86-64 JIT inserts NULL check when
it converts PROBE_LDX into load insn.
Ravi Bangoria June 17, 2021, 6:58 a.m. UTC | #3
Hi Alexei,

Sorry to bother you again!

On 6/11/21 5:42 AM, Alexei Starovoitov wrote:
> On Wed, Jun 9, 2021 at 5:05 AM Ravi Bangoria
> <ravi.bangoria@linux.ibm.com> wrote:
>>
>> Hi Alexei,
>>
>> On 7/28/20 8:51 PM, Jean-Philippe Brucker wrote:
>>> The following patch adds support for BPF_PROBE_MEM on arm64. The
>>> implementation is simple but I wanted to give a bit of background first.
>>> If you're familiar with recent BPF development you can skip to the patch
>>> (or fact-check the following blurb).
>>>
>>> BPF programs used for tracing can inspect any of the traced function's
>>> arguments and follow pointers in struct members. Traditionally the BPF
>>> program would get a struct pt_regs as argument and cast the register
>>> values to the appropriate struct pointer. The BPF verifier would mandate
>>> that any memory access uses the bpf_probe_read() helper, to suppress
>>> page faults (see samples/bpf/tracex1_kern.c).
>>>
>>> With BPF Type Format embedded into the kernel (CONFIG_DEBUG_INFO_BTF),
>>> the verifier can now check the type of any access performed by a BPF
>>> program. It rejects for example programs that cast to a different
>>> structure and perform out-of-bounds accesses, or programs that attempt
>>> to dereference something that isn't a pointer, or that hasn't gone
>>> through a NULL check.
>>>
>>> As this makes tracing programs safer, the verifier now allows loading
>>> programs that access struct members without bpf_probe_read(). It is
>>> however still possible to trigger page faults. For example in the
>>> following example with which I've tested this patch, the verifier does
>>> not mandate a NULL check for the second-level pointer:
>>>
>>> /*
>>>    * From tools/testing/selftests/bpf/progs/bpf_iter_task.c
>>>    * dump_task() is called for each task.
>>>    */
>>> SEC("iter/task")
>>> int dump_task(struct bpf_iter__task *ctx)
>>> {
>>>        struct seq_file *seq = ctx->meta->seq;
>>>        struct task_struct *task = ctx->task;
>>>
>>>        /* Program would be rejected without this check */
>>>        if (task == NULL)
>>>                return 0;
>>>
>>>        /*
>>>         * However the verifier does not currently mandate
>>>         * checking task->mm, and the following faults for kernel
>>>         * threads.
>>>         */
>>>        BPF_SEQ_PRINTF(seq, "pid=%d vm=%d", task->pid, task->mm->total_vm);
>>>        return 0;
>>> }
>>>
>>> Even if it checked this case, the verifier couldn't guarantee that all
>>> accesses are safe since kernel structures could in theory contain
>>> garbage or error pointers. So to allow fast access without
>>> bpf_probe_read(), a JIT implementation must support BPF exception
>>> tables. For each access to a BTF pointer, the JIT generates an entry
>>> into an exception table appended to the BPF program. If the access
>>> faults at runtime, the handler skips the faulting instruction. The
>>> example above will display vm=0 for kernel threads.
>>
>> I'm trying with the example above (task->mm->total_vm) on x86 machine
>> with bpf/master (11fc79fc9f2e3) plus commit 4c5de127598e1 ("bpf: Emit
>> explicit NULL pointer checks for PROBE_LDX instructions.") *reverted*,
>> I'm seeing the app getting killed with error in dmesg.
>>
>>     $ sudo bpftool iter pin bpf_iter_task.o /sys/fs/bpf/task
>>     $ sudo cat /sys/fs/bpf/task
>>     Killed
>>
>>     $ dmesg
>>     [  188.810020] BUG: kernel NULL pointer dereference, address: 00000000000000c8
>>     [  188.810030] #PF: supervisor read access in kernel mode
>>     [  188.810034] #PF: error_code(0x0000) - not-present page
>>
>> IIUC, this should be handled by bpf exception table rather than killing
>> the app. Am I missing anything?
> 
> For PROBE_LDX the verifier guarantees that the address is either
> a very likely valid kernel address or NULL. On x86 the user and kernel
> address spaces are shared and NULL is a user address, so there cannot be
> an exception table for NULL. Hence x86-64 JIT inserts NULL check when
> it converts PROBE_LDX into load insn.

IIUC, there could be 3 types of addresses a BPF prog can have:

1. NULL ptr. Which is handled by adding additional check in BPF program.
    So this won't cause a fault.
2. Valid kernel address. This will never cause a fault.
3. Bad address. This is very unlikely but possible. IIUC, BPF extable
    is introduced to handle such scenarios. Unfortunately, with any type
    of bad address (user or kernel), extable on x86 never gets involved.
    Kernel always kills the application with error in dmesg.

Please let me know if I understood it incorrectly.

TLDR;
To check the case 3 further, I tried with bpf/master, this time without
reverting your patch. I added a dummy structure and a bad pointer to it
in task_struct.

   diff --git a/include/linux/sched.h b/include/linux/sched.h
   index d2c881384517..4698188bcf45 100644
   --- a/include/linux/sched.h
   +++ b/include/linux/sched.h
   @@ -646,6 +646,10 @@ struct kmap_ctrl {
    #endif
    };
   +struct dummy_task_ele {
   +       int dummy;
   +};
   +
    struct task_struct {
    #ifdef CONFIG_THREAD_INFO_IN_TASK
           /*
   @@ -771,6 +775,8 @@ struct task_struct {
           struct mm_struct                *mm;
           struct mm_struct                *active_mm;
   +       struct dummy_task_ele           *dte;
   +
           /* Per-thread vma caching: */
           struct vmacache                 vmacache;
   diff --git a/kernel/fork.c b/kernel/fork.c
   index dc06afd725cb..ed01f25edd8e 100644
   --- a/kernel/fork.c
   +++ b/kernel/fork.c
   @@ -2116,6 +2116,9 @@ static __latent_entropy struct task_struct *copy_process(
           retval = copy_mm(clone_flags, p);
           if (retval)
                   goto bad_fork_cleanup_signal;
   +
   +       p->dte = (void *)0xd12345;
   +
           retval = copy_namespaces(clone_flags, p);
           if (retval)
                   goto bad_fork_cleanup_mm;

And with below change in testcase:

   diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task.c b/tools/testing/selftests/bpf/progs/bpf_iter_task.c
   index b7f32c160f4e..391c1b3da638 100644
   --- a/tools/testing/selftests/bpf/progs/bpf_iter_task.c
   +++ b/tools/testing/selftests/bpf/progs/bpf_iter_task.c
   @@ -21,6 +21,6 @@ int dump_task(struct bpf_iter__task *ctx)
           if (ctx->meta->seq_num == 0)
                   BPF_SEQ_PRINTF(seq, "    tgid      gid\n");
   -       BPF_SEQ_PRINTF(seq, "%8d %8d\n", task->tgid, task->pid);
   +       BPF_SEQ_PRINTF(seq, "%8d %8d %d\n", task->tgid, task->pid, task->dte->dummy);
           return 0;
    }

I see the same issue:

   $ sudo bpftool iter pin bpf_iter_task.o /sys/fs/bpf/task
   $ sudo cat /sys/fs/bpf/task
   Killed

   $ dmesg
   [  166.864325] BUG: unable to handle page fault for address: 0000000000d12345
   [  166.864336] #PF: supervisor read access in kernel mode
   [  166.864338] #PF: error_code(0x0000) - not-present page

0xd12345 is unallocated userspace address. Similarly, I also tried with
p->dte = (void *)0xffffffffc1234567 after confirming it's not allocated
to kernel or any module address. I see the same failure with it too.

Though, the same test with bpf_probe_read(task->dte->dummy) works fine.

Ravi
Alexei Starovoitov June 18, 2021, 4:34 p.m. UTC | #4
On Wed, Jun 16, 2021 at 11:58 PM Ravi Bangoria
<ravi.bangoria@linux.ibm.com> wrote:
>
>    $ dmesg
>    [  166.864325] BUG: unable to handle page fault for address: 0000000000d12345
>    [  166.864336] #PF: supervisor read access in kernel mode
>    [  166.864338] #PF: error_code(0x0000) - not-present page
>
> 0xd12345 is unallocated userspace address. Similarly, I also tried with

that's unfortunately expected, since this is a user address.

> p->dte = (void *)0xffffffffc1234567 after confirming it's not allocated
> to kernel or any module address. I see the same failure with it too.

This one is surprising though. Sounds like a bug in exception table
construction. Can you debug it to see what's causing it?
First check that do_kern_addr_fault() is invoked in this case.
And then fixup_exception() and why search_bpf_extables()
cannot find it.
Separately we probably need to replace the NULL check
with addr >= TASK_SIZE_MAX to close this issue though it's a bit artificial.
Ravi Bangoria June 22, 2021, 7:10 a.m. UTC | #5
Hi Alexei,

On 6/18/21 10:04 PM, Alexei Starovoitov wrote:
> On Wed, Jun 16, 2021 at 11:58 PM Ravi Bangoria
> <ravi.bangoria@linux.ibm.com> wrote:
>>
>>     $ dmesg
>>     [  166.864325] BUG: unable to handle page fault for address: 0000000000d12345
>>     [  166.864336] #PF: supervisor read access in kernel mode
>>     [  166.864338] #PF: error_code(0x0000) - not-present page
>>
>> 0xd12345 is unallocated userspace address. Similarly, I also tried with
> 
> that's unfortunately expected, since this is a user address.

Sure. fwiw, it works with bpf_probe_read().

>> p->dte = (void *)0xffffffffc1234567 after confirming it's not allocated
>> to kernel or any module address. I see the same failure with it too.
> 
> This one is surprising though. Sounds like a bug in exception table
> construction. Can you debug it to see what's causing it?
> First check that do_kern_addr_fault() is invoked in this case.
> And then fixup_exception() and why search_bpf_extables()
> cannot find it.

It seems the commit 4c5de127598e1 ("bpf: Emit explicit NULL pointer checks
for PROBE_LDX instructions.") added few instructions before actual load
but does not consider those additional instruction while calculating
extable offset. Let me prepare a fix.

> Separately we probably need to replace the NULL check
> with addr >= TASK_SIZE_MAX to close this issue though it's a bit artificial.

Ravi