[v4,0/7] kernel: introduce uaccess logging

Message ID 20211209221545.2333249-1-pcc@google.com (mailing list archive)

Message

Peter Collingbourne Dec. 9, 2021, 10:15 p.m. UTC
This patch series introduces a kernel feature known as uaccess
logging, which allows userspace programs to be made aware of the
address and size of uaccesses performed by the kernel during
the servicing of a syscall. More details on the motivation
for and interface to this feature are available in the file
Documentation/admin-guide/uaccess-logging.rst added by the final
patch in the series.
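
To give a flavour of the intended usage, the sketch below shows the
userspace side in miniature. It assumes the uapi headers added by this
series; the prctl constant and structure/field names are reproduced from
the series as I understand them, and the Documentation patch remains the
authoritative description of the interface.

/*
 * Usage sketch only. Assumes the uapi headers added by this series;
 * identifier names here should be checked against
 * include/uapi/linux/uaccess-buffer.h and
 * Documentation/admin-guide/uaccess-logging.rst.
 */
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/utsname.h>
#include <linux/prctl.h>		/* PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR */
#include <linux/uaccess-buffer.h>	/* uaccess_descriptor, uaccess_buffer_entry */

int main(void)
{
	struct uaccess_buffer_entry ents[64];
	struct uaccess_descriptor desc = {
		.addr = (uint64_t)(unsigned long)ents,
		.size = sizeof(ents),
	};
	uint64_t desc_addr = (uint64_t)(unsigned long)&desc;
	struct utsname un;

	/*
	 * Tell the kernel where to find the *address of* the descriptor.
	 * The kernel consumes it at the next syscall entry, which is what
	 * preserves the log/syscall correspondence even if an asynchronous
	 * signal is delivered in between.
	 */
	prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR,
	      (unsigned long)&desc_addr, 0, 0, 0);

	uname(&un);	/* the uaccesses for this syscall are logged */

	/*
	 * ents[] now starts with one record per uaccess performed by
	 * uname(); how the end of the log is indicated is described in
	 * the Documentation patch.
	 */
	return 0;
}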

Because we don't have a common kernel entry/exit code path that is used
on all architectures, uaccess logging is only implemented for arm64
and architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.

The proposed interface is the result of numerous iterations and
prototyping and is based on a proposal by Dmitry Vyukov. The interface
preserves the correspondence between uaccess log identity and syscall
identity while tolerating incoming asynchronous signals in the interval
between setting up the logging and the actual syscall. We considered
a number of alternative designs but rejected them for various reasons:

- The design from v1 of this patch [1] proposed notifying the kernel
  of the address and size of the uaccess buffer via a prctl that
  would also automatically mask and unmask asynchronous signals as
  needed, but this would require multiple syscalls per "real" syscall,
  harming performance.

- We considered extending the syscall calling convention to
  designate currently-unused registers to be used to pass the
  location of the uaccess buffer, but this was rejected for being
  architecture-specific.

- One idea that we considered involved using the stack pointer address
  as a unique identifier for the syscall, but this would need to be
  arch-specific, since we do not currently appear to have an
  arch-generic way of retrieving the stack pointer; the userspace side
  would also need some arch-specific code for this to work. It's also
  possible that a longjmp() past the signal handler would make the
  stack pointer address not unique enough for this purpose.

We also evaluated implementing this on top of the existing tracepoint
facility, but concluded that it is not suitable for this purpose:

- Tracepoints have a per-task granularity at best, whereas we really want
  to trace per-syscall. This is so that we can exclude syscalls that
  should not be traced, such as syscalls that make up part of the
  sanitizer implementation (to avoid infinite recursion when e.g. printing
  an error report).

- Tracing would need to be synchronous in order to produce useful
  stack traces. For example, this could be achieved using the new
  SIGTRAP-on-perf-events mechanism. However, that would require logging
  each access to the stack (in the form of a sigcontext), which is much
  more likely to overflow the stack: a sigcontext is far larger than a
  uaccess buffer entry, and the number of entries is unbounded, in
  contrast to the bounded buffer size passed to prctl(). An approach
  based on signal handlers is also likely to fall foul of the
  asynchronous signal issues mentioned previously, and sigreturn would
  need to be handled specially (because it copies a sigcontext from
  userspace), otherwise we could never return from the signal handler.
  Furthermore, arguments to the trace events are not available to
  SIGTRAP. (This on its own wouldn't be insurmountable though -- we
  could add the arguments as fields to siginfo.)

- The ftrace API (https://www.kernel.org/doc/Documentation/trace/ftrace.txt)
  -- e.g. trace_pipe_raw -- gives access to the internal ring buffer,
  but I don't think it's usable here because the buffer is per-CPU
  rather than per-task.

- Tracepoints can be used by eBPF programs, but eBPF programs can
  generally only be loaded by root, among other potential headaches.

[1] https://lore.kernel.org/all/20210922061809.736124-1-pcc@google.com/

Peter Collingbourne (7):
  include: split out uaccess instrumentation into a separate header
  uaccess-buffer: add core code
  fs: use copy_from_user_nolog() to copy mount() data
  uaccess-buffer: add CONFIG_GENERIC_ENTRY support
  arm64: add support for uaccess logging
  Documentation: document uaccess logging
  selftests: test uaccess logging

 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/uaccess-logging.rst | 151 +++++++++++++++++
 arch/Kconfig                                  |  14 ++
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/include/asm/thread_info.h          |   7 +-
 arch/arm64/kernel/ptrace.c                    |   7 +
 arch/arm64/kernel/signal.c                    |   5 +
 fs/exec.c                                     |   3 +
 fs/namespace.c                                |   8 +-
 include/linux/entry-common.h                  |   2 +
 include/linux/instrumented-uaccess.h          |  53 ++++++
 include/linux/instrumented.h                  |  34 ----
 include/linux/sched.h                         |   5 +
 include/linux/thread_info.h                   |   4 +
 include/linux/uaccess-buffer-info.h           |  46 ++++++
 include/linux/uaccess-buffer.h                | 152 ++++++++++++++++++
 include/linux/uaccess.h                       |   2 +-
 include/uapi/linux/prctl.h                    |   3 +
 include/uapi/linux/uaccess-buffer.h           |  27 ++++
 kernel/Makefile                               |   1 +
 kernel/bpf/helpers.c                          |   7 +-
 kernel/entry/common.c                         |  14 +-
 kernel/fork.c                                 |   4 +
 kernel/signal.c                               |   9 +-
 kernel/sys.c                                  |   6 +
 kernel/uaccess-buffer.c                       | 145 +++++++++++++++++
 lib/iov_iter.c                                |   2 +-
 lib/usercopy.c                                |   2 +-
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/uaccess_buffer/Makefile |   4 +
 .../uaccess_buffer/uaccess_buffer_test.c      | 126 +++++++++++++++
 31 files changed, 802 insertions(+), 44 deletions(-)
 create mode 100644 Documentation/admin-guide/uaccess-logging.rst
 create mode 100644 include/linux/instrumented-uaccess.h
 create mode 100644 include/linux/uaccess-buffer-info.h
 create mode 100644 include/linux/uaccess-buffer.h
 create mode 100644 include/uapi/linux/uaccess-buffer.h
 create mode 100644 kernel/uaccess-buffer.c
 create mode 100644 tools/testing/selftests/uaccess_buffer/Makefile
 create mode 100644 tools/testing/selftests/uaccess_buffer/uaccess_buffer_test.c

Comments

David Laight Dec. 11, 2021, 5:23 p.m. UTC | #1
From: Peter Collingbourne
> Sent: 09 December 2021 22:16
> 
> This patch series introduces a kernel feature known as uaccess
> logging, which allows userspace programs to be made aware of the
> address and size of uaccesses performed by the kernel during
> the servicing of a syscall. More details on the motivation
> for and interface to this feature are available in the file
> Documentation/admin-guide/uaccess-logging.rst added by the final
> patch in the series.

How does this work when get_user() and put_user() are used to
do optimised copies?

While adding checks to copy_to/from_user() is going to have
a measurable performance impact - even if nothing is done,
adding them to get/put_user() (and friends) is going to
make some hot paths really slow.

So maybe you could add it to KASAN test kernels, but you can't
sensibly enable it on a production kernel.

Now, it might be that you could semi-sensibly log 'data' transfers.
But have you actually looked at all the transfers that happen
for something like sendmsg().
The 'user copy hardening' code already has a significant impact
on that code (in many places).


	David

Peter Collingbourne Dec. 13, 2021, 7:48 p.m. UTC | #2
On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Peter Collingbourne
> > Sent: 09 December 2021 22:16
> >
> > This patch series introduces a kernel feature known as uaccess
> > logging, which allows userspace programs to be made aware of the
> > address and size of uaccesses performed by the kernel during
> > the servicing of a syscall. More details on the motivation
> > for and interface to this feature are available in the file
> > Documentation/admin-guide/uaccess-logging.rst added by the final
> > patch in the series.
>
> How does this work when get_user() and put_user() are used to
> do optimised copies?
>
> While adding checks to copy_to/from_user() is going to have
> a measurable performance impact - even if nothing is done,
> adding them to get/put_user() (and friends) is going to
> make some hot paths really slow.
>
> So maybe you could add it so KASAN test kernels, but you can't
> sensibly enable it on a production kernel.
>
> Now, it might be that you could semi-sensibly log 'data' transfers.
> But have you actually looked at all the transfers that happen
> for something like sendmsg().
> The 'user copy hardening' code already has a significant impact
> on that code (in many places).

Hi David,

Yes, I realised after I sent out my patch (and while writing test
cases for it) that it didn't cover get_user()/put_user(). I have a
patch under development that will add this coverage. I used it to run
my invalid syscall and uname benchmarks and the results were basically
the same as without the coverage.

Are you aware of any benchmarks that cover sendmsg()? I can try to
look at writing my own if not. I was also planning to write a
benchmark that uses getresuid() as this was the simplest syscall that
I could find that does multiple put_user() calls.
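
As a concrete illustration of the kind of loop I mean (just a sketch
using the libc syscall() wrapper, not the actual benchmark), getresuid()
is attractive because it performs three separate put_user()-sized writes
per call:

/* Sketch of a getresuid() micro-benchmark: each call writes the real,
 * effective and saved uids back to userspace, so the uaccess logging
 * fast path is exercised several times per syscall. */
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	unsigned int r, e, s;
	struct timespec t0, t1;
	long i, n = 10 * 1000 * 1000;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++)
		syscall(__NR_getresuid, &r, &e, &s);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns/call\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
	        (t1.tv_nsec - t0.tv_nsec)) / n);
	return 0;
}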

Peter
David Laight Dec. 13, 2021, 11:07 p.m. UTC | #3
From: Peter Collingbourne
> Sent: 13 December 2021 19:49
> 
> On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Peter Collingbourne
> > > Sent: 09 December 2021 22:16
> > >
> > > This patch series introduces a kernel feature known as uaccess
> > > logging, which allows userspace programs to be made aware of the
> > > address and size of uaccesses performed by the kernel during
> > > the servicing of a syscall. More details on the motivation
> > > for and interface to this feature are available in the file
> > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > patch in the series.
> >
> > How does this work when get_user() and put_user() are used to
> > do optimised copies?
> >
> > While adding checks to copy_to/from_user() is going to have
> > a measurable performance impact - even if nothing is done,
> > adding them to get/put_user() (and friends) is going to
> > make some hot paths really slow.
> >
> > So maybe you could add it so KASAN test kernels, but you can't
> > sensibly enable it on a production kernel.
> >
> > Now, it might be that you could semi-sensibly log 'data' transfers.
> > But have you actually looked at all the transfers that happen
> > for something like sendmsg().
> > The 'user copy hardening' code already has a significant impact
> > on that code (in many places).
> 
> Hi David,
> 
> Yes, I realised after I sent out my patch (and while writing test
> cases for it) that it didn't cover get_user()/put_user(). I have a
> patch under development that will add this coverage. I used it to run
> my invalid syscall and uname benchmarks and the results were basically
> the same as without the coverage.
> 
> Are you aware of any benchmarks that cover sendmsg()? I can try to
> look at writing my own if not. I was also planning to write a
> benchmark that uses getresuid() as this was the simplest syscall that
> I could find that does multiple put_user() calls.

Also look at sys_poll(); I think that uses __put/get_user().

I think you'll find some of the socket option code also uses get_user().

There is also the compat code for import_iovec().
IIRC that is actually faster than the non-compat version at the moment.

I did some benchmarking of writev("/dev/null", iov, 10);
The cost of reading in the iovec is significant in that case.
Maybe I ought to find time to sort out my patches.

For sendmsg(), using __copy_from_user() to avoid the user-copy
hardening checks also makes a measurable difference when sending UDP
through raw sockets - which we do a lot of.

I think you'd need to instrument user_access_begin() and also be able
to merge trace entries (for multiple get_user() calls).

You really don't have to look far to find places where copy_to/from_user()
is optimised to multiple get/put_user() or __get/put_user() calls (or are
they the 'nofault' variants?).
Those are all hot paths - at least for some workloads.
So adding anything there isn't likely to be accepted for production kernels.

	David

Peter Collingbourne Dec. 14, 2021, 3:47 a.m. UTC | #4
On Mon, Dec 13, 2021 at 3:07 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Peter Collingbourne
> > Sent: 13 December 2021 19:49
> >
> > On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Peter Collingbourne
> > > > Sent: 09 December 2021 22:16
> > > >
> > > > This patch series introduces a kernel feature known as uaccess
> > > > logging, which allows userspace programs to be made aware of the
> > > > address and size of uaccesses performed by the kernel during
> > > > the servicing of a syscall. More details on the motivation
> > > > for and interface to this feature are available in the file
> > > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > > patch in the series.
> > >
> > > How does this work when get_user() and put_user() are used to
> > > do optimised copies?
> > >
> > > While adding checks to copy_to/from_user() is going to have
> > > a measurable performance impact - even if nothing is done,
> > > adding them to get/put_user() (and friends) is going to
> > > make some hot paths really slow.
> > >
> > > So maybe you could add it so KASAN test kernels, but you can't
> > > sensibly enable it on a production kernel.
> > >
> > > Now, it might be that you could semi-sensibly log 'data' transfers.
> > > But have you actually looked at all the transfers that happen
> > > for something like sendmsg().
> > > The 'user copy hardening' code already has a significant impact
> > > on that code (in many places).
> >
> > Hi David,
> >
> > Yes, I realised after I sent out my patch (and while writing test
> > cases for it) that it didn't cover get_user()/put_user(). I have a
> > patch under development that will add this coverage. I used it to run
> > my invalid syscall and uname benchmarks and the results were basically
> > the same as without the coverage.
> >
> > Are you aware of any benchmarks that cover sendmsg()? I can try to
> > look at writing my own if not. I was also planning to write a
> > benchmark that uses getresuid() as this was the simplest syscall that
> > I could find that does multiple put_user() calls.
>
> Also look at sys_poll() I think that uses __put/get_user().
>
> I think you'll find some of the socket option code also uses get_user().
>
> There is also the compat code for import_iovec().
> IIRC that is actually faster than the non-compat version at the moment.
>
> I did some benchmarking of writev("/dev/null", iov, 10);
> The cost of reading in the iovec is significant in that case.
> Maybe I ought to find time to sort out my patches.
>
> For sendmsg() using __copy_from_user() to avoid the user-copy
> hardening checks also makes a measurable difference when sending UDP
> through raw sockets - which we do a lot of.
>
> I think you'd need to instrument user_access_begin() and also be able
> to merge trace entries (for multiple get_user() calls).
>
> You really don't have to look far to find places where copy_to/from_user()
> is optimised to multiple get/put_user() or __get/put_user() (or are they
> the 'nofault' variants?)
> Those are all hot paths - at least for some workloads.
> So adding anything there isn't likely to be accepted for production kernels.

Okay, but let's see what the benchmarks say first.

I added calls to uaccess_buffer_log_{read,write}() to
__{get,put}_user() in arch/arm64/include/asm/uaccess.h and wrote a
variant of my usual test program that does getresuid() in a loop and
measured an overhead of 5.9% on the small cores and 2.5% on the big
cores. The overhead appears to come from two sources:
1) The calling convention for the call to
__uaccess_buffer_log_read/write() transforms some leaf functions into
non-leaf functions, and as a result we end up spilling more registers.
2) We need to reload the flags from the task_struct every time to
determine the feature enabled state.
I think it should be possible to reduce the cost down to one
instruction (a conditional branch) per get/put_user() call, at least
on arm64, by using a different calling convention for the call and
maintaining the feature enabled state in one of the currently-ignored
bits 60-63 of a reserved register (x18).

I have the patch below, which reduces the overhead with my test program
to 2.3% on the small cores and 1.5% on the big cores (not fully
functional yet because the rest of
the code never actually flips that bit in x18). For syscalls like
sendmsg() and poll() I would expect the overall overhead to be lower
because of other required work done by these syscalls (glancing
through their implementations this appears to be the case). But I'll
benchmark it of course. The preserve_most attribute is Clang-only but
I think it should be possible to use inline asm to get the same effect
with GCC. We currently only reserve x18 with CONFIG_SHADOW_CALL_STACK
enabled but I was unable to measure an overhead for getresuid() when
reserving x18 unconditionally.

Peter

diff --git a/include/linux/uaccess-buffer.h b/include/linux/uaccess-buffer.h
index 122d9e1ccbbd..81b43a0a5505 100644
--- a/include/linux/uaccess-buffer.h
+++ b/include/linux/uaccess-buffer.h
@@ -99,7 +99,8 @@ unsigned long copy_from_user_with_log_flags(void *to, const void __user *from,
  */
 void uaccess_buffer_free(struct task_struct *tsk);

-void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
+__attribute__((preserve_most)) void
+__uaccess_buffer_log_read(const void __user *from, unsigned long n);
 /**
  * uaccess_buffer_log_read - log a read access
  * @from: the address of the access.
@@ -107,11 +108,14 @@ void __uaccess_buffer_log_read(const void __user *from, unsigned long n);
  */
 static inline void uaccess_buffer_log_read(const void __user *from,
                                            unsigned long n)
 {
-       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
-               __uaccess_buffer_log_read(from, n);
+       asm volatile goto("tbz x18, #60, %l0" :::: doit);
+       return;
+doit:
+       __uaccess_buffer_log_read(from, n);
 }

-void __uaccess_buffer_log_write(void __user *to, unsigned long n);
+__attribute__((preserve_most)) void __uaccess_buffer_log_write(void __user *to,
+                                                              unsigned long n);
 /**
  * uaccess_buffer_log_write - log a write access
  * @to: the address of the access.
@@ -119,8 +123,10 @@ void __uaccess_buffer_log_write(void __user *to, unsigned long n);
  */
 static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
 {
-       if (unlikely(test_syscall_work(UACCESS_BUFFER_EXIT)))
-               __uaccess_buffer_log_write(to, n);
+       asm volatile goto("tbz x18, #60, %l0" :::: doit);
+       return;
+doit:
+       __uaccess_buffer_log_write(to, n);
 }

 #else
Peter Collingbourne Dec. 15, 2021, 4:27 a.m. UTC | #5
On Mon, Dec 13, 2021 at 7:47 PM Peter Collingbourne <pcc@google.com> wrote:
>
> On Mon, Dec 13, 2021 at 3:07 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Peter Collingbourne
> > > Sent: 13 December 2021 19:49
> > >
> > > On Sat, Dec 11, 2021 at 9:23 AM David Laight <David.Laight@aculab.com> wrote:
> > > >
> > > > From: Peter Collingbourne
> > > > > Sent: 09 December 2021 22:16
> > > > >
> > > > > This patch series introduces a kernel feature known as uaccess
> > > > > logging, which allows userspace programs to be made aware of the
> > > > > address and size of uaccesses performed by the kernel during
> > > > > the servicing of a syscall. More details on the motivation
> > > > > for and interface to this feature are available in the file
> > > > > Documentation/admin-guide/uaccess-logging.rst added by the final
> > > > > patch in the series.
> > > >
> > > > How does this work when get_user() and put_user() are used to
> > > > do optimised copies?
> > > >
> > > > While adding checks to copy_to/from_user() is going to have
> > > > a measurable performance impact - even if nothing is done,
> > > > adding them to get/put_user() (and friends) is going to
> > > > make some hot paths really slow.
> > > >
> > > > So maybe you could add it so KASAN test kernels, but you can't
> > > > sensibly enable it on a production kernel.
> > > >
> > > > Now, it might be that you could semi-sensibly log 'data' transfers.
> > > > But have you actually looked at all the transfers that happen
> > > > for something like sendmsg().
> > > > The 'user copy hardening' code already has a significant impact
> > > > on that code (in many places).
> > >
> > > Hi David,
> > >
> > > Yes, I realised after I sent out my patch (and while writing test
> > > cases for it) that it didn't cover get_user()/put_user(). I have a
> > > patch under development that will add this coverage. I used it to run
> > > my invalid syscall and uname benchmarks and the results were basically
> > > the same as without the coverage.
> > >
> > > Are you aware of any benchmarks that cover sendmsg()? I can try to
> > > look at writing my own if not. I was also planning to write a
> > > benchmark that uses getresuid() as this was the simplest syscall that
> > > I could find that does multiple put_user() calls.
> >
> > Also look at sys_poll() I think that uses __put/get_user().
> >
> > I think you'll find some of the socket option code also uses get_user().
> >
> > There is also the compat code for import_iovec().
> > IIRC that is actually faster than the non-compat version at the moment.
> >
> > I did some benchmarking of writev("/dev/null", iov, 10);
> > The cost of reading in the iovec is significant in that case.
> > Maybe I ought to find time to sort out my patches.
> >
> > For sendmsg() using __copy_from_user() to avoid the user-copy
> > hardening checks also makes a measurable difference when sending UDP
> > through raw sockets - which we do a lot of.
> >
> > I think you'd need to instrument user_access_begin() and also be able
> > to merge trace entries (for multiple get_user() calls).
> >
> > You really don't have to look far to find places where copy_to/from_user()
> > is optimised to multiple get/put_user() or __get/put_user() (or are they
> > the 'nofault' variants?)
> > Those are all hot paths - at least for some workloads.
> > So adding anything there isn't likely to be accepted for production kernels.
>
> Okay, but let's see what the benchmarks say first.
>
> I added calls to uaccess_buffer_log_{read,write}() to
> __{get,put}_user() in arch/arm64/include/asm/uaccess.h and wrote a
> variant of my usual test program that does getresuid() in a loop and
> measured an overhead of 5.9% on the small cores and 2.5% on the big
> cores. The overhead appears to come from two sources:
> 1) The calling convention for the call to
> __uaccess_buffer_log_read/write() transforms some leaf functions into
> non-leaf functions, and as a result we end up spilling more registers.
> 2) We need to reload the flags from the task_struct every time to
> determine the feature enabled state.
> I think it should be possible to reduce the cost down to one
> instruction (a conditional branch) per get/put_user() call, at least
> on arm64, by using a different calling convention for the call and
> maintaining the feature enabled state in one of the currently-ignored
> bits 60-63 of a reserved register (x18).
>
> I have the patch below which reduces the overhead with my test program
> to 2.3% small 1.5% big (not fully functional yet because the rest of
> the code never actually flips that bit in x18). For syscalls like
> sendmsg() and poll() I would expect the overall overhead to be lower
> because of other required work done by these syscalls (glancing
> through their implementations this appears to be the case). But I'll
> benchmark it of course. The preserve_most attribute is Clang-only but
> I think it should be possible to use inline asm to get the same effect
> with GCC. We currently only reserve x18 with CONFIG_SHADOW_CALL_STACK
> enabled but I was unable to measure an overhead for getresuid() when
> reserving x18 unconditionally.

I developed benchmarks for sendmsg() and poll() and, as I expected,
the performance overhead of my patch was in the noise, except that I
measured 1.0% overhead for poll(), but only on the little cores. So it
looks like getresuid() still represents the worst case. I'm not sure
it's worth spending time developing additional benchmarks because I
would expect them to show similar results.

By the way, is there any interest in maintaining my syscall latency
benchmarks in tree? They have been invaluable for measuring the
performance impact of my kernel changes. The benchmarks that I
developed are written in arm64 assembly to avoid measurement error due
to overheads introduced by the libc syscall wrappers (i.e. they make
my benchmark results look as bad as possible, by spending as little
time in userspace as possible). I looked around in the tree and found
"perf bench syscall" but that uses the libc wrappers.

Peter