Message ID   | 20230118051443.78988-1-alexei.starovoitov@gmail.com (mailing list archive)
State        | Changes Requested
Delegated to | BPF
Series       | [bpf,1/2] mm: Fix copy_from_user_nofault().
After applying the patches, running the fuzzer with the BPF PoC program no
longer triggers the warning.
Tested-by: Hsin-Wei Hung <hsinweih@uci.edu>
On Tue, Jan 17, 2023 at 09:14:42PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> There are several issues with copy_from_user_nofault():
>
> - access_ok() is designed for user context only and for that reason
>   it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
>   and perf on ppc are calling it from irq.
>
> - it's missing nmi_uaccess_okay() which is a nop on all architectures
>   except x86 where it's required.
>   The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
>   Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
>
> - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
>   check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
>   which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.

Er, this drops check_object_size() -- that needs to stay. The vmap area
test in check_object_size is likely what needs fixing. It was discussed
before:
https://lore.kernel.org/lkml/YySML2HfqaE%2FwXBU@casper.infradead.org/

The only reason it was ultimately tolerable to remove the check from
the x86-only _nmi function was because it was being used on compile-time
sized copies.

We need to fix the vmap lookup so the checking doesn't regress --
especially for trace, bpf, etc, where we could have much more interested
dest/source/size combinations. :)

-Kees
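The locking hazard under discussion sits in the hardened-usercopy path. The sketch below approximates the relevant code from mm/usercopy.c and mm/vmalloc.c around this era; it is a simplified illustration, not the exact upstream source:

/* Simplified sketch (not verbatim) of check_heap_object() for vmalloc
 * addresses under CONFIG_HARDENED_USERCOPY.
 */
static void check_heap_object(const void *ptr, unsigned long n, bool to_user)
{
    unsigned long addr = (unsigned long)ptr;

    if (is_vmalloc_addr(ptr)) {
        /* find_vmap_area() takes vmap_area_lock, a plain spinlock.
         * If this runs from a tracer that fired while the lock was
         * already held, the CPU deadlocks on itself.
         */
        struct vmap_area *area = find_vmap_area(addr);

        if (!area)
            usercopy_abort("vmalloc", "no area", to_user, 0, n);
        if (n > area->va_end - addr)
            usercopy_abort("vmalloc", NULL, to_user,
                           addr - area->va_start, n);
        return;
    }

    /* ... slab and page-span checks follow ... */
}

/* Simplified sketch of find_vmap_area(): the lookup is serialized
 * with a spinlock, which is what makes it unsafe in atomic tracing
 * context.
 */
struct vmap_area *find_vmap_area(unsigned long addr)
{
    struct vmap_area *va;

    spin_lock(&vmap_area_lock);
    va = __find_vmap_area(addr, &vmap_area_root);
    spin_unlock(&vmap_area_lock);

    return va;
}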
On Thu, Jan 19, 2023 at 8:52 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Tue, Jan 17, 2023 at 09:14:42PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > There are several issues with copy_from_user_nofault():
> >
> > - access_ok() is designed for user context only and for that reason
> >   it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
> >   and perf on ppc are calling it from irq.
> >
> > - it's missing nmi_uaccess_okay() which is a nop on all architectures
> >   except x86 where it's required.
> >   The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
> >   Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
> >
> > - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
> >   check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
> >   which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.
>
> Er, this drops check_object_size() -- that needs to stay. The vmap area
> test in check_object_size is likely what needs fixing. It was discussed
> before:
> https://lore.kernel.org/lkml/YySML2HfqaE%2FwXBU@casper.infradead.org/

Thanks for the link.
Unfortunately all options discussed in that link won't work,
since all of them rely on in_interrupt() which will not catch the condition.
[ke]probe, bpf, perf can run after spin_lock is taken.
Like via trace_lock_release tracepoint.
It's only with lockdep=on, but still.
Or via trace_contention_begin tracepoint with lockdep=off.
check_object_size() will not execute in_interrupt().

> The only reason it was ultimately tolerable to remove the check from
> the x86-only _nmi function was because it was being used on compile-time
> sized copies.

It doesn't look to be the case.
copy_from_user_nmi() is called via __output_copy_user by perf
with run-time 'size'.

> We need to fix the vmap lookup so the checking doesn't regress --
> especially for trace, bpf, etc, where we could have much more interested
> dest/source/size combinations. :)

Well, for bpf the 'dst' is never a vmalloc area, so
is_vmalloc_addr() and later spin_lock() in check_heap_object()
won't trigger.
Also for bpf the 'dst' area is statically checked by the verifier
at program load time, so at run-time the dst pointer is
guaranteed to be valid and of correct dimensions.
So doing check_object_size() is pointless unless there is a bug
in the verifier, but if there is a bug kasan and friends
will find it sooner. The 'dst' checks are generic and
not copy_from_user_nofault() specific.

For trace, kprobe and perf would be nice to keep check_object_size()
working, of course.

What do you suggest?
I frankly don't see other options other than done in this patch,
though it's not great.
Happy to be proven otherwise.
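To make the re-entrancy concrete, the hypothetical tracing program below performs a user-space read from the contention_begin tracepoint; the tracepoint and helper are real, but the program, its loader-provided user_ptr and the attachment scenario are illustrative. The read lands in copy_from_user_nofault(), and with CONFIG_HARDENED_USERCOPY the verifier-checked stack buffer is still handed to check_object_size():

// SPDX-License-Identifier: GPL-2.0
/* Hypothetical BPF program (illustrative sketch): attached to the
 * lock-contention tracepoint, it reads user memory with
 * bpf_probe_read_user(), which ends up in copy_from_user_nofault().
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

const volatile __u64 user_ptr = 0;      /* set by the loader; illustrative */

SEC("tp_btf/contention_begin")
int BPF_PROG(on_contention, void *lock, unsigned int flags)
{
    char buf[64];

    /* May execute while a spinlock (possibly vmap_area_lock itself)
     * is already held on this CPU's call chain.  The destination is
     * verifier-validated stack memory, yet the hardened-usercopy
     * check still runs on it.
     */
    bpf_probe_read_user(buf, sizeof(buf), (const void *)user_ptr);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";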
On Thu, Jan 19, 2023 at 11:21:33AM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 19, 2023 at 8:52 AM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 09:14:42PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > There are several issues with copy_from_user_nofault():
> > >
> > > - access_ok() is designed for user context only and for that reason
> > >   it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
> > >   and perf on ppc are calling it from irq.
> > >
> > > - it's missing nmi_uaccess_okay() which is a nop on all architectures
> > >   except x86 where it's required.
> > >   The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
> > >   Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
> > >
> > > - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
> > >   check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
> > >   which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.
> >
> > Er, this drops check_object_size() -- that needs to stay. The vmap area
> > test in check_object_size is likely what needs fixing. It was discussed
> > before:
> > https://lore.kernel.org/lkml/YySML2HfqaE%2FwXBU@casper.infradead.org/
>
> Thanks for the link.
> Unfortunately all options discussed in that link won't work,
> since all of them rely on in_interrupt() which will not catch the condition.
> [ke]probe, bpf, perf can run after spin_lock is taken.
> Like via trace_lock_release tracepoint.
> It's only with lockdep=on, but still.
> Or via trace_contention_begin tracepoint with lockdep=off.
> check_object_size() will not execute in_interrupt().
>
> > The only reason it was ultimately tolerable to remove the check from
> > the x86-only _nmi function was because it was being used on compile-time
> > sized copies.
>
> It doesn't look to be the case.
> copy_from_user_nmi() is called via __output_copy_user by perf
> with run-time 'size'.

Perhaps this changed recently? It was only called in copy_code() before
when I looked last. Regardless, it still needs solving.

> > We need to fix the vmap lookup so the checking doesn't regress --
> > especially for trace, bpf, etc, where we could have much more interested
> > dest/source/size combinations. :)
>
> Well, for bpf the 'dst' is never a vmalloc area, so
> is_vmalloc_addr() and later spin_lock() in check_heap_object()
> won't trigger.
> Also for bpf the 'dst' area is statically checked by the verifier
> at program load time, so at run-time the dst pointer is
> guaranteed to be valid and of correct dimensions.
> So doing check_object_size() is pointless unless there is a bug
> in the verifier, but if there is a bug kasan and friends
> will find it sooner. The 'dst' checks are generic and
> not copy_from_user_nofault() specific.
>
> For trace, kprobe and perf would be nice to keep check_object_size()
> working, of course.
>
> What do you suggest?
> I frankly don't see other options other than done in this patch,
> though it's not great.
> Happy to be proven otherwise.

Matthew, do you have any thoughts on dealing with this? Can we use a
counter instead of a spin lock?

-Kees
On Thu, Jan 19, 2023 at 12:08 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Jan 19, 2023 at 11:21:33AM -0800, Alexei Starovoitov wrote:
> > On Thu, Jan 19, 2023 at 8:52 AM Kees Cook <keescook@chromium.org> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 09:14:42PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > There are several issues with copy_from_user_nofault():
> > > >
> > > > - access_ok() is designed for user context only and for that reason
> > > >   it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
> > > >   and perf on ppc are calling it from irq.
> > > >
> > > > - it's missing nmi_uaccess_okay() which is a nop on all architectures
> > > >   except x86 where it's required.
> > > >   The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
> > > >   Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
> > > >
> > > > - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
> > > >   check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
> > > >   which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.
> > >
> > > Er, this drops check_object_size() -- that needs to stay. The vmap area
> > > test in check_object_size is likely what needs fixing. It was discussed
> > > before:
> > > https://lore.kernel.org/lkml/YySML2HfqaE%2FwXBU@casper.infradead.org/
> >
> > Thanks for the link.
> > Unfortunately all options discussed in that link won't work,
> > since all of them rely on in_interrupt() which will not catch the condition.
> > [ke]probe, bpf, perf can run after spin_lock is taken.
> > Like via trace_lock_release tracepoint.
> > It's only with lockdep=on, but still.
> > Or via trace_contention_begin tracepoint with lockdep=off.
> > check_object_size() will not execute in_interrupt().
> >
> > > The only reason it was ultimately tolerable to remove the check from
> > > the x86-only _nmi function was because it was being used on compile-time
> > > sized copies.
> >
> > It doesn't look to be the case.
> > copy_from_user_nmi() is called via __output_copy_user by perf
> > with run-time 'size'.
>
> Perhaps this changed recently? It was only called in copy_code() before
> when I looked last. Regardless, it still needs solving.

I think it was this way forever:

  perf_output_sample_ustack(handle, data->stack_user_size,
                            data->regs_user.regs);
    __output_copy_user(handle, (void *) sp, dump_size);

kernel/events/internal.h:#define arch_perf_out_copy_user copy_from_user_nmi
kernel/events/internal.h:DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)

> > > We need to fix the vmap lookup so the checking doesn't regress --
> > > especially for trace, bpf, etc, where we could have much more interested
> > > dest/source/size combinations. :)
> >
> > Well, for bpf the 'dst' is never a vmalloc area, so
> > is_vmalloc_addr() and later spin_lock() in check_heap_object()
> > won't trigger.
> > Also for bpf the 'dst' area is statically checked by the verifier
> > at program load time, so at run-time the dst pointer is
> > guaranteed to be valid and of correct dimensions.
> > So doing check_object_size() is pointless unless there is a bug
> > in the verifier, but if there is a bug kasan and friends
> > will find it sooner. The 'dst' checks are generic and
> > not copy_from_user_nofault() specific.
> >
> > For trace, kprobe and perf would be nice to keep check_object_size()
> > working, of course.
> >
> > What do you suggest?
> > I frankly don't see other options other than done in this patch,
> > though it's not great.
> > Happy to be proven otherwise.
>
> Matthew, do you have any thoughts on dealing with this? Can we use a
> counter instead of a spin lock?
>
> -Kees
>
> --
> Kees Cook
On Thu, Jan 19, 2023 at 12:08 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > What do you suggest?
> > I frankly don't see other options other than done in this patch,
> > though it's not great.
> > Happy to be proven otherwise.
>
> Matthew, do you have any thoughts on dealing with this? Can we use a
> counter instead of a spin lock?

Have you consider using pagefault_disabled() instead of in_interrupt()?

spin_trylock() and if (pagefault_disabled()) out ?

or

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 4c3164beacec..83c164aba6e0 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -173,7 +173,7 @@ static inline void check_heap_object(const void *ptr, unsigned long n,
                return;
        }

-       if (is_vmalloc_addr(ptr)) {
+       if (is_vmalloc_addr(ptr) && !pagefault_disabled()) {
                struct vmap_area *area = find_vmap_area(addr);

effectively gutting that part of check for *_nofault() and *_nmi() ?
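Why pagefault_disabled() distinguishes these callers: the *_nofault() and _nmi() helpers call pagefault_disable() before the copy, which bumps a per-task counter that pagefault_disabled() reads. A rough sketch of the relevant helpers (simplified from include/linux/uaccess.h, not verbatim):

/* Sketch: per-task pagefault-disable nesting counter. */
static inline void pagefault_disable(void)
{
    current->pagefault_disabled++;   /* incremented before the atomic copy */
    barrier();
}

static inline bool pagefault_disabled(void)
{
    return current->pagefault_disabled != 0;
}

/*
 * copy_from_user_nofault() (and copy_from_user_nmi()) call
 * pagefault_disable() before __copy_from_user_inatomic(), so by the time
 * check_object_size() -> check_heap_object() runs, the counter is
 * non-zero.  The proposed "&& !pagefault_disabled()" clause therefore
 * skips the find_vmap_area() lookup only for these atomic callers, while
 * an ordinary copy_from_user() keeps the full vmalloc bounds check.
 */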
With this patch applied on top of bpf/bpf-next (55fbae05) the system no longer runs into a total freeze as reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398.
Tested-by: Florian Lehner <dev@der-flo.net>
On Sat, Mar 25, 2023 at 7:55 AM Florian Lehner <dev@der-flo.net> wrote:
>
> With this patch applied on top of bpf/bpf-next (55fbae05) the system no longer runs into a total freeze as reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398.
>
> Tested-by: Florian Lehner <dev@der-flo.net>

Thanks for testing and for bumping the thread.
The fix slipped through the cracks.

Looking at the stack trace in bugzilla the patch set
should indeed fix the issue, since the kernel is deadlocking on:
copy_from_user_nofault -> check_object_size -> find_vmap_area -> spin_lock

I'm travelling this and next week, so if you can take over
the whole patch set and roll in the tweak that was proposed back in January:

- if (is_vmalloc_addr(ptr)) {
+ if (is_vmalloc_addr(ptr) && !pagefault_disabled())

and respin for the bpf tree our group maintainers can review and apply
while I'm travelling.
Hi,

On Sat, Mar 25, 2023 at 12:47:17PM -0700, Alexei Starovoitov wrote:
> On Sat, Mar 25, 2023 at 7:55 AM Florian Lehner <dev@der-flo.net> wrote:
> >
> > With this patch applied on top of bpf/bpf-next (55fbae05) the system no longer runs into a total freeze as reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398.
> >
> > Tested-by: Florian Lehner <dev@der-flo.net>
>
> Thanks for testing and for bumping the thread.
> The fix slipped through the cracks.
>
> Looking at the stack trace in bugzilla the patch set
> should indeed fix the issue, since the kernel is deadlocking on:
> copy_from_user_nofault -> check_object_size -> find_vmap_area -> spin_lock
>
> I'm travelling this and next week, so if you can take over
> the whole patch set and roll in the tweak that was proposed back in January:
>
> - if (is_vmalloc_addr(ptr)) {
> + if (is_vmalloc_addr(ptr) && !pagefault_disabled())
>
> and respin for the bpf tree our group maintainers can review and apply
> while I'm travelling.

Anyone can pick it up as suggested by Alexei, and propose that to the
bpf tree maintainers?

Regards,
Salvatore
On Thu, Apr 6, 2023 at 1:17 PM Salvatore Bonaccorso <carnil@debian.org> wrote:
>
> Hi,
>
> On Sat, Mar 25, 2023 at 12:47:17PM -0700, Alexei Starovoitov wrote:
> > On Sat, Mar 25, 2023 at 7:55 AM Florian Lehner <dev@der-flo.net> wrote:
> > >
> > > With this patch applied on top of bpf/bpf-next (55fbae05) the system no longer runs into a total freeze as reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398.
> > >
> > > Tested-by: Florian Lehner <dev@der-flo.net>
> >
> > Thanks for testing and for bumping the thread.
> > The fix slipped through the cracks.
> >
> > Looking at the stack trace in bugzilla the patch set
> > should indeed fix the issue, since the kernel is deadlocking on:
> > copy_from_user_nofault -> check_object_size -> find_vmap_area -> spin_lock
> >
> > I'm travelling this and next week, so if you can take over
> > the whole patch set and roll in the tweak that was proposed back in January:
> >
> > - if (is_vmalloc_addr(ptr)) {
> > + if (is_vmalloc_addr(ptr) && !pagefault_disabled())
> >
> > and respin for the bpf tree our group maintainers can review and apply
> > while I'm travelling.
>
> Anyone can pick it up as suggested by Alexei, and propose that to the
> bpf tree maintainers?

Florian already did. Changes were requested.
https://patchwork.kernel.org/project/netdevbpf/patch/20230329193931.320642-3-dev@der-flo.net/
diff --git a/mm/maccess.c b/mm/maccess.c
index 074f6b086671..6ee9b337c501 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -5,6 +5,7 @@
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/uaccess.h>
+#include <asm/tlb.h>

 bool __weak copy_from_kernel_nofault_allowed(const void *unsafe_src,
         size_t size)
@@ -113,11 +114,18 @@ long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count)
 long copy_from_user_nofault(void *dst, const void __user *src, size_t size)
 {
     long ret = -EFAULT;
-    if (access_ok(src, size)) {
-        pagefault_disable();
-        ret = __copy_from_user_inatomic(dst, src, size);
-        pagefault_enable();
-    }
+
+    if (!__access_ok(src, size))
+        return ret;
+
+    if (!nmi_uaccess_okay())
+        return ret;
+
+    pagefault_disable();
+    instrument_copy_from_user_before(dst, src, size);
+    ret = raw_copy_from_user(dst, src, size);
+    instrument_copy_from_user_after(dst, src, size, ret);
+    pagefault_enable();

     if (ret)
         return -EFAULT;
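For context on the nmi_uaccess_okay() check that the patch adds, the x86 implementation looks roughly like the condensed sketch below (based on arch/x86/mm/tlb.c; other architectures provide a stub that always returns true, and the in-tree comment explains the details):

/* Condensed sketch of x86 nmi_uaccess_okay(); comments paraphrase the
 * in-tree explanation.
 */
bool nmi_uaccess_okay(void)
{
    struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
    struct mm_struct *current_mm = current->mm;

    VM_WARN_ON_ONCE(!loaded_mm);

    /* An NMI or tracer can fire in the middle of a context switch,
     * after current->mm has been updated but before CR3 was written,
     * or while a temporary mm (e.g. for text poking) is loaded.  In
     * that window user accesses would walk the wrong page tables, so
     * refuse the access instead.
     */
    if (loaded_mm != current_mm)
        return false;

    VM_WARN_ON_ONCE(current_mm->pgd != __va(read_cr3_pa()));

    return true;
}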