
[bpf,1/2] mm: Fix copy_from_user_nofault().

Message ID 20230118051443.78988-1-alexei.starovoitov@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series [bpf,1/2] mm: Fix copy_from_user_nofault().

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers warning 3 maintainers not CCed: linux-mm@kvack.org akpm@linux-foundation.org linux-hardening@vger.kernel.org
netdev/build_clang success Errors and warnings before: 1 this patch: 1
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 30 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-PR fail PR summary
bpf/vmtest-bpf-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-4 fail Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-5 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-6 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-7 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-8 success Logs for llvm-toolchain
bpf/vmtest-bpf-VM_Test-9 success Logs for set-matrix

Commit Message

Alexei Starovoitov Jan. 18, 2023, 5:14 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

There are several issues with copy_from_user_nofault():

- access_ok() is designed for user context only and for that reason
it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
and perf on ppc are calling it from irq.

- it's missing nmi_uaccess_okay() which is a nop on all architectures
except x86 where it's required.
The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.

- __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.

Fix all three issues. With this change copy_from_user_nofault() becomes
equivalent to copy_from_user_nmi() from a safety point of view, with
a difference only in the return value.

Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/maccess.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)
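The deadlock in the third bullet can be reproduced in miniature outside the kernel. Below is a hedged userspace sketch (not kernel code; C11 atomics stand in for the kernel spinlock, and the names vmap_area_lock and lookup_with_trylock are invented for illustration): a helper that unconditionally spins on the lock would hang forever when invoked while the same lock is already held, for example from a probe firing inside the locked region, whereas a trylock fails gracefully.

```c
/* Userspace sketch of the self-deadlock hazard: a helper that takes a
 * non-recursive lock is unsafe to call from a context (probe, tracepoint)
 * that may fire while that lock is already held. A trylock lets the
 * helper fail instead of spinning forever. */
#include <stdatomic.h>
#include <errno.h>

/* stand-in for the vmap_area spinlock */
static atomic_flag vmap_area_lock = ATOMIC_FLAG_INIT;

/* returns nonzero if the lock was acquired */
static int lock_trylock(void)
{
	return !atomic_flag_test_and_set(&vmap_area_lock);
}

static void lock_unlock(void)
{
	atomic_flag_clear(&vmap_area_lock);
}

/* Modeled loosely on find_vmap_area(): with a plain spin_lock() this
 * would hang when called while vmap_area_lock is already held; with
 * trylock it returns -EBUSY and the caller can skip the lookup. */
static int lookup_with_trylock(int *out)
{
	if (!lock_trylock())
		return -EBUSY;
	*out = 42;	/* the lookup itself, elided */
	lock_unlock();
	return 0;
}
```

This is only an analogy for the hazard the commit message describes; the patch itself avoids the problem by not reaching check_heap_object() at all from the nofault path.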

Comments

Hsin-Wei Hung Jan. 18, 2023, 9:32 p.m. UTC | #1
After applying the patches, running the fuzzer with the BPF PoC program no 
longer triggers the warning.

Tested-by: Hsin-Wei Hung <hsinweih@uci.edu>
Kees Cook Jan. 19, 2023, 4:52 p.m. UTC | #2
On Tue, Jan 17, 2023 at 09:14:42PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> There are several issues with copy_from_user_nofault():
> 
> - access_ok() is designed for user context only and for that reason
> it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
> and perf on ppc are calling it from irq.
> 
> - it's missing nmi_uaccess_okay() which is a nop on all architectures
> except x86 where it's required.
> The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
> Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
> 
> - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
> check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
> which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.

Er, this drops check_object_size() -- that needs to stay. The vmap area
test in check_object_size is likely what needs fixing. It was discussed
before:
https://lore.kernel.org/lkml/YySML2HfqaE%2FwXBU@casper.infradead.org/

The only reason it was ultimately tolerable to remove the check from
the x86-only _nmi function was because it was being used on compile-time
sized copies.

We need to fix the vmap lookup so the checking doesn't regress --
especially for trace, bpf, etc, where we could have much more interesting
dest/source/size combinations. :)

-Kees
Alexei Starovoitov Jan. 19, 2023, 7:21 p.m. UTC | #3
On Thu, Jan 19, 2023 at 8:52 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Tue, Jan 17, 2023 at 09:14:42PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > There are several issues with copy_from_user_nofault():
> >
> > - access_ok() is designed for user context only and for that reason
> > it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
> > and perf on ppc are calling it from irq.
> >
> > - it's missing nmi_uaccess_okay() which is a nop on all architectures
> > except x86 where it's required.
> > The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
> > Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
> >
> > - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
> > check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
> > which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.
>
> Er, this drops check_object_size() -- that needs to stay. The vmap area
> test in check_object_size is likely what needs fixing. It was discussed
> before:
> https://lore.kernel.org/lkml/YySML2HfqaE%2FwXBU@casper.infradead.org/

Thanks for the link.
Unfortunately all options discussed in that link won't work,
since all of them rely on in_interrupt() which will not catch the condition.
[ke]probe, bpf, perf can run after spin_lock is taken.
Like via trace_lock_release tracepoint.
It's only with lockdep=on, but still.
Or via trace_contention_begin tracepoint with lockdep=off.
check_object_size() will not execute in_interrupt().

> The only reason it was ultimately tolerable to remove the check from
> the x86-only _nmi function was because it was being used on compile-time
> sized copies.

It doesn't look to be the case.
copy_from_user_nmi() is called via __output_copy_user by perf
with run-time 'size'.

> We need to fix the vmap lookup so the checking doesn't regress --
> especially for trace, bpf, etc, where we could have much more interesting
> dest/source/size combinations. :)

Well, for bpf the 'dst' is never a vmalloc area, so
is_vmalloc_addr() and later spin_lock() in check_heap_object()
won't trigger.
Also for bpf the 'dst' area is statically checked by the verifier
at program load time, so at run-time the dst pointer is
guaranteed to be valid and of correct dimensions.
So doing check_object_size() is pointless unless there is a bug
in the verifier, but if there is a bug kasan and friends
will find it sooner. The 'dst' checks are generic and
not copy_from_user_nofault() specific.

For trace, kprobe and perf would be nice to keep check_object_size()
working, of course.

What do you suggest?
I frankly don't see other options other than done in this patch,
though it's not great.
Happy to be proven otherwise.
Kees Cook Jan. 19, 2023, 8:08 p.m. UTC | #4
On Thu, Jan 19, 2023 at 11:21:33AM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 19, 2023 at 8:52 AM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 09:14:42PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > There are several issues with copy_from_user_nofault():
> > >
> > > - access_ok() is designed for user context only and for that reason
> > > it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
> > > and perf on ppc are calling it from irq.
> > >
> > > - it's missing nmi_uaccess_okay() which is a nop on all architectures
> > > except x86 where it's required.
> > > The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
> > > Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
> > >
> > > - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
> > > check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
> > > which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.
> >
> > Er, this drops check_object_size() -- that needs to stay. The vmap area
> > test in check_object_size is likely what needs fixing. It was discussed
> > before:
> > https://lore.kernel.org/lkml/YySML2HfqaE%2FwXBU@casper.infradead.org/
> 
> Thanks for the link.
> Unfortunately all options discussed in that link won't work,
> since all of them rely on in_interrupt() which will not catch the condition.
> [ke]probe, bpf, perf can run after spin_lock is taken.
> Like via trace_lock_release tracepoint.
> It's only with lockdep=on, but still.
> Or via trace_contention_begin tracepoint with lockdep=off.
> check_object_size() will not execute in_interrupt().
> 
> > The only reason it was ultimately tolerable to remove the check from
> > the x86-only _nmi function was because it was being used on compile-time
> > sized copies.
> 
> It doesn't look to be the case.
> copy_from_user_nmi() is called via __output_copy_user by perf
> with run-time 'size'.

Perhaps this changed recently? It was only called in copy_code() before
when I looked last. Regardless, it still needs solving.

> > We need to fix the vmap lookup so the checking doesn't regress --
> > especially for trace, bpf, etc, where we could have much more interesting
> > dest/source/size combinations. :)
> 
> Well, for bpf the 'dst' is never a vmalloc area, so
> is_vmalloc_addr() and later spin_lock() in check_heap_object()
> won't trigger.
> Also for bpf the 'dst' area is statically checked by the verifier
> at program load time, so at run-time the dst pointer is
> guaranteed to be valid and of correct dimensions.
> So doing check_object_size() is pointless unless there is a bug
> in the verifier, but if there is a bug kasan and friends
> will find it sooner. The 'dst' checks are generic and
> not copy_from_user_nofault() specific.
> 
> For trace, kprobe and perf would be nice to keep check_object_size()
> working, of course.
> 
> What do you suggest?
> I frankly don't see other options other than done in this patch,
> though it's not great.
> Happy to be proven otherwise.

Matthew, do you have any thoughts on dealing with this? Can we use a
counter instead of a spin lock?

-Kees
Alexei Starovoitov Jan. 19, 2023, 8:14 p.m. UTC | #5
On Thu, Jan 19, 2023 at 12:08 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Jan 19, 2023 at 11:21:33AM -0800, Alexei Starovoitov wrote:
> > On Thu, Jan 19, 2023 at 8:52 AM Kees Cook <keescook@chromium.org> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 09:14:42PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > There are several issues with copy_from_user_nofault():
> > > >
> > > > - access_ok() is designed for user context only and for that reason
> > > > it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
> > > > and perf on ppc are calling it from irq.
> > > >
> > > > - it's missing nmi_uaccess_okay() which is a nop on all architectures
> > > > except x86 where it's required.
> > > > The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
> > > > Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.
> > > >
> > > > - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
> > > > check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
> > > > which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.
> > >
> > > Er, this drops check_object_size() -- that needs to stay. The vmap area
> > > test in check_object_size is likely what needs fixing. It was discussed
> > > before:
> > > https://lore.kernel.org/lkml/YySML2HfqaE%2FwXBU@casper.infradead.org/
> >
> > Thanks for the link.
> > Unfortunately all options discussed in that link won't work,
> > since all of them rely on in_interrupt() which will not catch the condition.
> > [ke]probe, bpf, perf can run after spin_lock is taken.
> > Like via trace_lock_release tracepoint.
> > It's only with lockdep=on, but still.
> > Or via trace_contention_begin tracepoint with lockdep=off.
> > check_object_size() will not execute in_interrupt().
> >
> > > The only reason it was ultimately tolerable to remove the check from
> > > the x86-only _nmi function was because it was being used on compile-time
> > > sized copies.
> >
> > It doesn't look to be the case.
> > copy_from_user_nmi() is called via __output_copy_user by perf
> > with run-time 'size'.
>
> Perhaps this changed recently? It was only called in copy_code() before
> when I looked last. Regardless, it still needs solving.

I think it was this way forever:
perf_output_sample_ustack(handle,
                          data->stack_user_size,
                          data->regs_user.regs);
__output_copy_user(handle, (void *) sp, dump_size);

kernel/events/internal.h:#define arch_perf_out_copy_user copy_from_user_nmi
kernel/events/internal.h:DEFINE_OUTPUT_COPY(__output_copy_user,
arch_perf_out_copy_user)


> > > We need to fix the vmap lookup so the checking doesn't regress --
> > > especially for trace, bpf, etc, where we could have much more interesting
> > > dest/source/size combinations. :)
> >
> > Well, for bpf the 'dst' is never a vmalloc area, so
> > is_vmalloc_addr() and later spin_lock() in check_heap_object()
> > won't trigger.
> > Also for bpf the 'dst' area is statically checked by the verifier
> > at program load time, so at run-time the dst pointer is
> > guaranteed to be valid and of correct dimensions.
> > So doing check_object_size() is pointless unless there is a bug
> > in the verifier, but if there is a bug kasan and friends
> > will find it sooner. The 'dst' checks are generic and
> > not copy_from_user_nofault() specific.
> >
> > For trace, kprobe and perf would be nice to keep check_object_size()
> > working, of course.
> >
> > What do you suggest?
> > I frankly don't see other options other than done in this patch,
> > though it's not great.
> > Happy to be proven otherwise.
>
> Matthew, do you have any thoughts on dealing with this? Can we use a
> counter instead of a spin lock?
>
> -Kees
>
> --
> Kees Cook
Alexei Starovoitov Jan. 19, 2023, 8:28 p.m. UTC | #6
On Thu, Jan 19, 2023 at 12:08 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > What do you suggest?
> > I frankly don't see other options other than done in this patch,
> > though it's not great.
> > Happy to be proven otherwise.
>
> Matthew, do you have any thoughts on dealing with this? Can we use a
> counter instead of a spin lock?

Have you considered using pagefault_disabled() instead of in_interrupt()?

spin_trylock() and if (pagefault_disabled()) out ?

or
diff --git a/mm/usercopy.c b/mm/usercopy.c
index 4c3164beacec..83c164aba6e0 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -173,7 +173,7 @@ static inline void check_heap_object(const void
*ptr, unsigned long n,
                return;
        }

-       if (is_vmalloc_addr(ptr)) {
+       if (is_vmalloc_addr(ptr) && !pagefault_disabled()) {
                struct vmap_area *area = find_vmap_area(addr);

effectively gutting that part of check for *_nofault() and *_nmi() ?
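The suggestion amounts to gating the lock-taking vmap lookup on the pagefault state. As a hedged sketch (a userspace model, not the real mm code; pagefault_disable_count and check_vmalloc_object are invented names standing in for the per-task counter and for check_heap_object()'s vmalloc branch), the guard skips the check exactly in the contexts where taking the spinlock could deadlock:

```c
/* Userspace model of the proposed guard: skip the lock-taking
 * find_vmap_area() lookup when page faults are disabled, i.e. when we
 * may be in an atomic context where the vmap spinlock could already be
 * held by the current CPU. */
#include <stdbool.h>

/* stand-in for the per-task pagefault-disable counter */
static _Thread_local int pagefault_disable_count;

static bool pagefault_disabled(void)
{
	return pagefault_disable_count != 0;
}

enum check_result { CHECK_DONE, CHECK_SKIPPED };

static enum check_result check_vmalloc_object(bool is_vmalloc)
{
	if (is_vmalloc && !pagefault_disabled()) {
		/* the real code would call find_vmap_area(addr) here,
		 * which takes a spinlock */
		return CHECK_DONE;
	}
	/* nofault/nmi callers fall through with the bounds check skipped */
	return CHECK_SKIPPED;
}
```

The trade-off matches what the thread discusses: ordinary copy_from_user() paths keep full hardened-usercopy checking, while the *_nofault()/*_nmi() paths give up the vmap bounds check in exchange for deadlock safety.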
Florian Lehner March 25, 2023, 2:55 p.m. UTC | #7
With this patch applied on top of bpf/bpf-next (55fbae05) the system no longer runs into a total freeze as reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398.

Tested-by: Florian Lehner <dev@der-flo.net>
Alexei Starovoitov March 25, 2023, 7:47 p.m. UTC | #8
On Sat, Mar 25, 2023 at 7:55 AM Florian Lehner <dev@der-flo.net> wrote:
>
> With this patch applied on top of bpf/bpf-next (55fbae05) the system no longer runs into a total freeze as reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398.
>
> Tested-by: Florian Lehner <dev@der-flo.net>

Thanks for testing and for bumping the thread.
The fix slipped through the cracks.

Looking at the stack trace in bugzilla the patch set
should indeed fix the issue, since the kernel is deadlocking on:
copy_from_user_nofault -> check_object_size -> find_vmap_area -> spin_lock

I'm travelling this and next week, so if you can take over
the whole patch set and roll in the tweak that was proposed back in January:

-       if (is_vmalloc_addr(ptr)) {
+       if (is_vmalloc_addr(ptr) && !pagefault_disabled())

and respin for the bpf tree our group maintainers can review and apply
while I'm travelling.
Salvatore Bonaccorso April 6, 2023, 8:17 p.m. UTC | #9
Hi,

On Sat, Mar 25, 2023 at 12:47:17PM -0700, Alexei Starovoitov wrote:
> On Sat, Mar 25, 2023 at 7:55 AM Florian Lehner <dev@der-flo.net> wrote:
> >
> > With this patch applied on top of bpf/bpf-next (55fbae05) the system no longer runs into a total freeze as reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398.
> >
> > Tested-by: Florian Lehner <dev@der-flo.net>
> 
> Thanks for testing and for bumping the thread.
> The fix slipped through the cracks.
> 
> Looking at the stack trace in bugzilla the patch set
> should indeed fix the issue, since the kernel is deadlocking on:
> copy_from_user_nofault -> check_object_size -> find_vmap_area -> spin_lock
> 
> I'm travelling this and next week, so if you can take over
> the whole patch set and roll in the tweak that was proposed back in January:
> 
> -       if (is_vmalloc_addr(ptr)) {
> +       if (is_vmalloc_addr(ptr) && !pagefault_disabled())
> 
> and respin for the bpf tree our group maintainers can review and apply
> while I'm travelling.

Can anyone pick this up as suggested by Alexei and propose it to the
bpf tree maintainers?

Regards,
Salvatore
Alexei Starovoitov April 6, 2023, 8:24 p.m. UTC | #10
On Thu, Apr 6, 2023 at 1:17 PM Salvatore Bonaccorso <carnil@debian.org> wrote:
>
> Hi,
>
> On Sat, Mar 25, 2023 at 12:47:17PM -0700, Alexei Starovoitov wrote:
> > On Sat, Mar 25, 2023 at 7:55 AM Florian Lehner <dev@der-flo.net> wrote:
> > >
> > > With this patch applied on top of bpf/bpf-next (55fbae05) the system no longer runs into a total freeze as reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398.
> > >
> > > Tested-by: Florian Lehner <dev@der-flo.net>
> >
> > Thanks for testing and for bumping the thread.
> > The fix slipped through the cracks.
> >
> > Looking at the stack trace in bugzilla the patch set
> > should indeed fix the issue, since the kernel is deadlocking on:
> > copy_from_user_nofault -> check_object_size -> find_vmap_area -> spin_lock
> >
> > I'm travelling this and next week, so if you can take over
> > the whole patch set and roll in the tweak that was proposed back in January:
> >
> > -       if (is_vmalloc_addr(ptr)) {
> > +       if (is_vmalloc_addr(ptr) && !pagefault_disabled())
> >
> > and respin for the bpf tree our group maintainers can review and apply
> > while I'm travelling.
>
> Anyone can pick it up as suggested by Alexei, and propose that to the
> bpf tree maintainers?

Florian already did.
Changes were requested.
https://patchwork.kernel.org/project/netdevbpf/patch/20230329193931.320642-3-dev@der-flo.net/

Patch

diff --git a/mm/maccess.c b/mm/maccess.c
index 074f6b086671..6ee9b337c501 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -5,6 +5,7 @@ 
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/uaccess.h>
+#include <asm/tlb.h>
 
 bool __weak copy_from_kernel_nofault_allowed(const void *unsafe_src,
 		size_t size)
@@ -113,11 +114,18 @@  long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count)
 long copy_from_user_nofault(void *dst, const void __user *src, size_t size)
 {
 	long ret = -EFAULT;
-	if (access_ok(src, size)) {
-		pagefault_disable();
-		ret = __copy_from_user_inatomic(dst, src, size);
-		pagefault_enable();
-	}
+
+	if (!__access_ok(src, size))
+		return ret;
+
+	if (!nmi_uaccess_okay())
+		return ret;
+
+	pagefault_disable();
+	instrument_copy_from_user_before(dst, src, size);
+	ret = raw_copy_from_user(dst, src, size);
+	instrument_copy_from_user_after(dst, src, size, ret);
+	pagefault_enable();
 
 	if (ret)
 		return -EFAULT;