Message ID | 20211123051658.3195589-6-pcc@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | kernel: introduce uaccess logging |
On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <pcc@google.com> wrote:
>
> Add documentation for the uaccess logging feature.
>
> Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
> Signed-off-by: Peter Collingbourne <pcc@google.com>
> ---
>  Documentation/admin-guide/index.rst           |   1 +
>  Documentation/admin-guide/uaccess-logging.rst | 149 ++++++++++++++++++
>  2 files changed, 150 insertions(+)
>  create mode 100644 Documentation/admin-guide/uaccess-logging.rst
>
> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index 1bedab498104..4f6ee447ab2f 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -54,6 +54,7 @@ ABI will be found here.
>     :maxdepth: 1
>
>     sysfs-rules
> +   uaccess-logging
>
>  The rest of this manual consists of various unordered guides on how to
>  configure specific aspects of kernel behavior to your liking.
> diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
> new file mode 100644
> index 000000000000..4b2b297afc00
> --- /dev/null
> +++ b/Documentation/admin-guide/uaccess-logging.rst
> @@ -0,0 +1,149 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +Uaccess Logging
> +===============
> +
> +Background
> +----------
> +
> +Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
> +making use of the ARM Memory Tagging Extension (MTE) need to
> +monitor all memory accesses in a program so that they can detect
> +memory errors. Furthermore, fuzzing tools such as syzkaller need to
> +monitor all memory accesses so that they know which parts of memory
> +to fuzz. For accesses made purely in userspace, this is achieved
> +via compiler instrumentation, or for MTE, via direct hardware
> +support. However, accesses made by the kernel on behalf of the user
> +program via syscalls (i.e. uaccesses) are normally invisible to
> +these tools.
> +
> +Traditionally, the sanitizers have handled this by interposing the libc
> +syscall stubs with a wrapper that checks the memory based on what we
> +believe the uaccesses will be. However, this creates a maintenance
> +burden: each syscall must be annotated with its uaccesses in order
> +to be recognized by the sanitizer, and these annotations must be
> +continuously updated as the kernel changes.
> +
> +The kernel's uaccess logging feature provides userspace tools with
> +the address and size of each userspace access, thereby allowing these
> +tools to report memory errors involving these accesses without needing
> +annotations for every syscall.
> +
> +By relying on the kernel's actual uaccesses, rather than a
> +reimplementation of them, the userspace memory safety tools may
> +play a dual role of verifying the validity of kernel accesses. Even
> +a sanitizer whose syscall wrappers have complete knowledge of the
> +kernel's intended API may vary from the kernel's actual uaccesses due
> +to kernel bugs. A sanitizer with knowledge of the kernel's actual
> +uaccesses may produce more accurate error reports that reveal such
> +bugs. For example, a kernel that accesses more memory than expected
> +by the userspace program could indicate that either userspace or the
> +kernel has the wrong idea about which kernel functionality is being
> +requested -- either way, there is a bug.
> +
> +Interface
> +---------
> +
> +The feature may be used via the following prctl:
> +
> +.. code-block:: c
> +
> +  uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
> +  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
> +
> +Supplying a non-zero address as the second argument to ``prctl``

Is it possible to unregister it? Is it what happens when 0 is passed
as addr? If so, please describe.
It may be handy to do one-off tracing with the address on stack.

> +will cause the kernel to read an address (referred to as the *uaccess
> +descriptor address*) from that address on each kernel entry.
> +
> +When entering the kernel with a non-zero uaccess descriptor address
> +to handle a syscall, the kernel will read a data structure of type
> +``struct uaccess_descriptor`` from the uaccess descriptor address,
> +which is defined as follows:
> +
> +.. code-block:: c
> +
> +  struct uaccess_descriptor {
> +    uint64_t addr, size;
> +  };

Want to double check the extension story. If we ever want flags in
uaccess_descriptor, we can add a flag to prctl that would say that
address must point to uaccess_descriptor_v2 that contains flags,
right?
And similarly we can extend uaccess_buffer_entry, right?

> +This data structure contains the address and size (in array elements)
> +of a *uaccess buffer*, which is an array of data structures of type
> +``struct uaccess_buffer_entry``. Before returning to userspace, the
> +kernel will log information about uaccesses to sequential entries
> +in the uaccess buffer. It will also store ``NULL`` to the uaccess
> +descriptor address, and store the address and size of the unused
> +portion of the uaccess buffer to the uaccess descriptor.
> +
> +The format of a uaccess buffer entry is defined as follows:
> +
> +.. code-block:: c
> +
> +  struct uaccess_buffer_entry {
> +    uint64_t addr, size, flags;
> +  };
> +
> +The meaning of ``addr`` and ``size`` should be obvious. On arm64,

I would say explicitly "addr and size contain address and size of the
user memory access".

> +tag bits are preserved in the ``addr`` field. There is currently
> +one flag bit assignment for the ``flags`` field:
> +
> +.. code-block:: c
> +
> +  #define UACCESS_BUFFER_FLAG_WRITE 1
> +
> +This flag is set if the access was a write, or clear if it was a
> +read. The meaning of all other flag bits is reserved.
> +
> +When entering the kernel with a non-zero uaccess descriptor
> +address for a reason other than a syscall (for example, when
> +IPI'd due to an incoming asynchronous signal), any signals other
> +than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> +``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> +initialized with ``sigfillset(set)``. This is to prevent incoming
> +signals from interfering with uaccess logging.
> +
> +Example
> +-------
> +
> +Here is an example of a code snippet that will enumerate the accesses
> +performed by a ``uname(2)`` syscall:
> +
> +.. code-block:: c
> +
> +  struct uaccess_buffer_entry entries[64];
> +  struct uaccess_descriptor desc;
> +  uint64_t desc_addr = 0;
> +  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
> +
> +  desc.addr = (uint64_t)&entries;
> +  desc.size = 64;
> +  desc_addr = (uint64_t)&desc;

We don't need any additional compiler barriers here, right?
It seems that we only need to prevent re-ordering of these writes with
the next and previous syscalls, which the compiler should do already.

> +  struct utsname un;
> +  uname(&un);
> +
> +  struct uaccess_buffer_entry* entries_end = (struct uaccess_buffer_entry*)desc.addr;
> +  for (struct uaccess_buffer_entry* entry = entries; entry != entries_end; ++entry) {
> +    printf("%s at 0x%lx size 0x%lx\n", entry->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
> +           (unsigned long)entry->addr, (unsigned long)entry->size);
> +  }
> +
> +Limitations
> +-----------
> +
> +This feature is currently only supported on the arm64, s390 and x86
> +architectures.
> +
> +Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> +course, not all of the accesses may fit in the buffer, but aside from
> +that, not all internal kernel APIs that access userspace memory are
> +covered. Therefore, userspace programs should tolerate unreported
> +accesses.
> +
> +On the other hand, the kernel guarantees that it will not
> +(intentionally) report accessing more data than it is specified
> +to read. For example, if the kernel implements a syscall that is
> +specified to read a data structure of size ``N`` bytes by first
> +reading a page's worth of data and then only using the first ``N``
> +bytes from it, the kernel will either report reading ``N`` bytes or
> +not report the access at all.
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>
On Mon, Nov 22, 2021 at 11:46 PM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Tue, 23 Nov 2021 at 06:17, Peter Collingbourne <pcc@google.com> wrote:
> >
> > Add documentation for the uaccess logging feature.
> >
> > Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
> > Signed-off-by: Peter Collingbourne <pcc@google.com>
> > ---
> >  Documentation/admin-guide/index.rst           |   1 +
> >  Documentation/admin-guide/uaccess-logging.rst | 149 ++++++++++++++++++
> >  2 files changed, 150 insertions(+)
> >  create mode 100644 Documentation/admin-guide/uaccess-logging.rst
> >
> > diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> > index 1bedab498104..4f6ee447ab2f 100644
> > --- a/Documentation/admin-guide/index.rst
> > +++ b/Documentation/admin-guide/index.rst
> > @@ -54,6 +54,7 @@ ABI will be found here.
> >     :maxdepth: 1
> >
> >     sysfs-rules
> > +   uaccess-logging
> >
> >  The rest of this manual consists of various unordered guides on how to
> >  configure specific aspects of kernel behavior to your liking.
> > diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
> > new file mode 100644
> > index 000000000000..4b2b297afc00
> > --- /dev/null
> > +++ b/Documentation/admin-guide/uaccess-logging.rst
> > @@ -0,0 +1,149 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===============
> > +Uaccess Logging
> > +===============
> > +
> > +Background
> > +----------
> > +
> > +Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
> > +making use of the ARM Memory Tagging Extension (MTE) need to
> > +monitor all memory accesses in a program so that they can detect
> > +memory errors. Furthermore, fuzzing tools such as syzkaller need to
> > +monitor all memory accesses so that they know which parts of memory
> > +to fuzz. For accesses made purely in userspace, this is achieved
> > +via compiler instrumentation, or for MTE, via direct hardware
> > +support. However, accesses made by the kernel on behalf of the user
> > +program via syscalls (i.e. uaccesses) are normally invisible to
> > +these tools.
> > +
> > +Traditionally, the sanitizers have handled this by interposing the libc
> > +syscall stubs with a wrapper that checks the memory based on what we
> > +believe the uaccesses will be. However, this creates a maintenance
> > +burden: each syscall must be annotated with its uaccesses in order
> > +to be recognized by the sanitizer, and these annotations must be
> > +continuously updated as the kernel changes.
> > +
> > +The kernel's uaccess logging feature provides userspace tools with
> > +the address and size of each userspace access, thereby allowing these
> > +tools to report memory errors involving these accesses without needing
> > +annotations for every syscall.
> > +
> > +By relying on the kernel's actual uaccesses, rather than a
> > +reimplementation of them, the userspace memory safety tools may
> > +play a dual role of verifying the validity of kernel accesses. Even
> > +a sanitizer whose syscall wrappers have complete knowledge of the
> > +kernel's intended API may vary from the kernel's actual uaccesses due
> > +to kernel bugs. A sanitizer with knowledge of the kernel's actual
> > +uaccesses may produce more accurate error reports that reveal such
> > +bugs. For example, a kernel that accesses more memory than expected
> > +by the userspace program could indicate that either userspace or the
> > +kernel has the wrong idea about which kernel functionality is being
> > +requested -- either way, there is a bug.
> > +
> > +Interface
> > +---------
> > +
> > +The feature may be used via the following prctl:
> > +
> > +.. code-block:: c
> > +
> > +  uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
> > +  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
> > +
> > +Supplying a non-zero address as the second argument to ``prctl``
>
> Is it possible to unregister it? Is it what happens when 0 is passed
> as addr? If so, please describe.
> It may be handy to do one-off tracing with the address on stack.

Yes, done in v3.

> > +will cause the kernel to read an address (referred to as the *uaccess
> > +descriptor address*) from that address on each kernel entry.
> > +
> > +When entering the kernel with a non-zero uaccess descriptor address
> > +to handle a syscall, the kernel will read a data structure of type
> > +``struct uaccess_descriptor`` from the uaccess descriptor address,
> > +which is defined as follows:
> > +
> > +.. code-block:: c
> > +
> > +  struct uaccess_descriptor {
> > +    uint64_t addr, size;
> > +  };
>
> Want to double check the extension story. If we ever want flags in
> uaccess_descriptor, we can add a flag to prctl that would say that
> address must point to uaccess_descriptor_v2 that contains flags,
> right?
> And similarly we can extend uaccess_buffer_entry, right?

Yes, we can specify a flag bit in e.g. the third argument to prctl,
which could switch us to using new struct definitions for the uaccess
descriptor and uaccess buffer entries.

> > +This data structure contains the address and size (in array elements)
> > +of a *uaccess buffer*, which is an array of data structures of type
> > +``struct uaccess_buffer_entry``. Before returning to userspace, the
> > +kernel will log information about uaccesses to sequential entries
> > +in the uaccess buffer. It will also store ``NULL`` to the uaccess
> > +descriptor address, and store the address and size of the unused
> > +portion of the uaccess buffer to the uaccess descriptor.
> > +
> > +The format of a uaccess buffer entry is defined as follows:
> > +
> > +.. code-block:: c
> > +
> > +  struct uaccess_buffer_entry {
> > +    uint64_t addr, size, flags;
> > +  };
> > +
> > +The meaning of ``addr`` and ``size`` should be obvious. On arm64,
>
> I would say explicitly "addr and size contain address and size of the
> user memory access".

Done in v3.

> > +tag bits are preserved in the ``addr`` field. There is currently
> > +one flag bit assignment for the ``flags`` field:
> > +
> > +.. code-block:: c
> > +
> > +  #define UACCESS_BUFFER_FLAG_WRITE 1
> > +
> > +This flag is set if the access was a write, or clear if it was a
> > +read. The meaning of all other flag bits is reserved.
> > +
> > +When entering the kernel with a non-zero uaccess descriptor
> > +address for a reason other than a syscall (for example, when
> > +IPI'd due to an incoming asynchronous signal), any signals other
> > +than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
> > +``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
> > +initialized with ``sigfillset(set)``. This is to prevent incoming
> > +signals from interfering with uaccess logging.
> > +
> > +Example
> > +-------
> > +
> > +Here is an example of a code snippet that will enumerate the accesses
> > +performed by a ``uname(2)`` syscall:
> > +
> > +.. code-block:: c
> > +
> > +  struct uaccess_buffer_entry entries[64];
> > +  struct uaccess_descriptor desc;
> > +  uint64_t desc_addr = 0;
> > +  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
> > +
> > +  desc.addr = (uint64_t)&entries;
> > +  desc.size = 64;
> > +  desc_addr = (uint64_t)&desc;
>
> We don't need any additional compiler barriers here, right?
> It seems that we only need to prevent re-ordering of these writes with
> the next and previous syscalls, which the compiler should do already.

Right. From the compiler's perspective the address of desc_addr is
leaked at the prctl call site, so any external function call
(including syscalls) could read or write to it.

Peter
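The descriptor hand-off discussed in this thread can be followed more easily with a small userspace model. The sketch below does not call the proposed prctl. `fake_kernel_syscall` and `trace_once` are hypothetical names introduced here for illustration, and the "kernel" side only mimics what the quoted documentation describes: log to sequential entries, store the address and size of the unused portion back to the descriptor, and store zero to the descriptor address. The struct layouts and the flag value are copied from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layouts and flag value copied from the patch under review. */
struct uaccess_descriptor {
	uint64_t addr, size;
};

struct uaccess_buffer_entry {
	uint64_t addr, size, flags;
};

#define UACCESS_BUFFER_FLAG_WRITE 1

/*
 * HYPOTHETICAL stand-in for the kernel side of one syscall, modelled only
 * on the behaviour described in the documentation text. It is not kernel
 * code and not part of the proposed ABI.
 */
static void fake_kernel_syscall(uint64_t *desc_addr,
				const struct uaccess_buffer_entry *accesses,
				size_t n_accesses)
{
	if (*desc_addr == 0)
		return;	/* not armed: nothing is logged */

	struct uaccess_descriptor *desc =
		(struct uaccess_descriptor *)(uintptr_t)*desc_addr;
	struct uaccess_buffer_entry *out =
		(struct uaccess_buffer_entry *)(uintptr_t)desc->addr;

	/* Log to sequential entries until the buffer is full. */
	for (size_t i = 0; i < n_accesses && desc->size > 0; i++) {
		*out++ = accesses[i];
		desc->size--;
	}

	/* The descriptor now describes the unused portion of the buffer. */
	desc->addr = (uint64_t)(uintptr_t)out;
	/* The descriptor address is reset, so logging is one-shot. */
	*desc_addr = 0;
}

/* Arm logging for exactly one simulated syscall; return entries logged. */
static size_t trace_once(uint64_t *desc_addr,
			 struct uaccess_buffer_entry *buf, size_t capacity,
			 const struct uaccess_buffer_entry *accesses,
			 size_t n_accesses)
{
	struct uaccess_descriptor desc = {
		.addr = (uint64_t)(uintptr_t)buf,
		.size = capacity,
	};

	*desc_addr = (uint64_t)(uintptr_t)&desc;
	fake_kernel_syscall(desc_addr, accesses, n_accesses);

	assert(*desc_addr == 0);	/* must be re-armed before the next call */
	return (size_t)((struct uaccess_buffer_entry *)(uintptr_t)desc.addr - buf);
}
```

Because the descriptor address is cleared on every traced entry, a tool built this way has to re-arm it before each syscall it wants logged. Under the assumptions above, that is also why a descriptor living on the stack for a single call is workable.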
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..4f6ee447ab2f 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -54,6 +54,7 @@ ABI will be found here.
    :maxdepth: 1

    sysfs-rules
+   uaccess-logging

 The rest of this manual consists of various unordered guides on how to
 configure specific aspects of kernel behavior to your liking.
diff --git a/Documentation/admin-guide/uaccess-logging.rst b/Documentation/admin-guide/uaccess-logging.rst
new file mode 100644
index 000000000000..4b2b297afc00
--- /dev/null
+++ b/Documentation/admin-guide/uaccess-logging.rst
@@ -0,0 +1,149 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Uaccess Logging
+===============
+
+Background
+----------
+
+Userspace tools such as sanitizers (ASan, MSan, HWASan) and tools
+making use of the ARM Memory Tagging Extension (MTE) need to
+monitor all memory accesses in a program so that they can detect
+memory errors. Furthermore, fuzzing tools such as syzkaller need to
+monitor all memory accesses so that they know which parts of memory
+to fuzz. For accesses made purely in userspace, this is achieved
+via compiler instrumentation, or for MTE, via direct hardware
+support. However, accesses made by the kernel on behalf of the user
+program via syscalls (i.e. uaccesses) are normally invisible to
+these tools.
+
+Traditionally, the sanitizers have handled this by interposing the libc
+syscall stubs with a wrapper that checks the memory based on what we
+believe the uaccesses will be. However, this creates a maintenance
+burden: each syscall must be annotated with its uaccesses in order
+to be recognized by the sanitizer, and these annotations must be
+continuously updated as the kernel changes.
+
+The kernel's uaccess logging feature provides userspace tools with
+the address and size of each userspace access, thereby allowing these
+tools to report memory errors involving these accesses without needing
+annotations for every syscall.
+
+By relying on the kernel's actual uaccesses, rather than a
+reimplementation of them, the userspace memory safety tools may
+play a dual role of verifying the validity of kernel accesses. Even
+a sanitizer whose syscall wrappers have complete knowledge of the
+kernel's intended API may vary from the kernel's actual uaccesses due
+to kernel bugs. A sanitizer with knowledge of the kernel's actual
+uaccesses may produce more accurate error reports that reveal such
+bugs. For example, a kernel that accesses more memory than expected
+by the userspace program could indicate that either userspace or the
+kernel has the wrong idea about which kernel functionality is being
+requested -- either way, there is a bug.
+
+Interface
+---------
+
+The feature may be used via the following prctl:
+
+.. code-block:: c
+
+  uint64_t addr = 0; /* Generally will be a TLS slot or equivalent */
+  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &addr, 0, 0, 0);
+
+Supplying a non-zero address as the second argument to ``prctl``
+will cause the kernel to read an address (referred to as the *uaccess
+descriptor address*) from that address on each kernel entry.
+
+When entering the kernel with a non-zero uaccess descriptor address
+to handle a syscall, the kernel will read a data structure of type
+``struct uaccess_descriptor`` from the uaccess descriptor address,
+which is defined as follows:
+
+.. code-block:: c
+
+  struct uaccess_descriptor {
+    uint64_t addr, size;
+  };
+
+This data structure contains the address and size (in array elements)
+of a *uaccess buffer*, which is an array of data structures of type
+``struct uaccess_buffer_entry``. Before returning to userspace, the
+kernel will log information about uaccesses to sequential entries
+in the uaccess buffer. It will also store ``NULL`` to the uaccess
+descriptor address, and store the address and size of the unused
+portion of the uaccess buffer to the uaccess descriptor.
+
+The format of a uaccess buffer entry is defined as follows:
+
+.. code-block:: c
+
+  struct uaccess_buffer_entry {
+    uint64_t addr, size, flags;
+  };
+
+The meaning of ``addr`` and ``size`` should be obvious. On arm64,
+tag bits are preserved in the ``addr`` field. There is currently
+one flag bit assignment for the ``flags`` field:
+
+.. code-block:: c
+
+  #define UACCESS_BUFFER_FLAG_WRITE 1
+
+This flag is set if the access was a write, or clear if it was a
+read. The meaning of all other flag bits is reserved.
+
+When entering the kernel with a non-zero uaccess descriptor
+address for a reason other than a syscall (for example, when
+IPI'd due to an incoming asynchronous signal), any signals other
+than ``SIGKILL`` and ``SIGSTOP`` are masked as if by calling
+``sigprocmask(SIG_SETMASK, set, NULL)`` where ``set`` has been
+initialized with ``sigfillset(set)``. This is to prevent incoming
+signals from interfering with uaccess logging.
+
+Example
+-------
+
+Here is an example of a code snippet that will enumerate the accesses
+performed by a ``uname(2)`` syscall:
+
+.. code-block:: c
+
+  struct uaccess_buffer_entry entries[64];
+  struct uaccess_descriptor desc;
+  uint64_t desc_addr = 0;
+  prctl(PR_SET_UACCESS_DESCRIPTOR_ADDR_ADDR, &desc_addr, 0, 0, 0);
+
+  desc.addr = (uint64_t)&entries;
+  desc.size = 64;
+  desc_addr = (uint64_t)&desc;
+
+  struct utsname un;
+  uname(&un);
+
+  struct uaccess_buffer_entry* entries_end = (struct uaccess_buffer_entry*)desc.addr;
+  for (struct uaccess_buffer_entry* entry = entries; entry != entries_end; ++entry) {
+    printf("%s at 0x%lx size 0x%lx\n", entry->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
+           (unsigned long)entry->addr, (unsigned long)entry->size);
+  }
+
+Limitations
+-----------
+
+This feature is currently only supported on the arm64, s390 and x86
+architectures.
+
+Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
+course, not all of the accesses may fit in the buffer, but aside from
+that, not all internal kernel APIs that access userspace memory are
+covered. Therefore, userspace programs should tolerate unreported
+accesses.
+
+On the other hand, the kernel guarantees that it will not
+(intentionally) report accessing more data than it is specified
+to read. For example, if the kernel implements a syscall that is
+specified to read a data structure of size ``N`` bytes by first
+reading a page's worth of data and then only using the first ``N``
+bytes from it, the kernel will either report reading ``N`` bytes or
+not report the access at all.
Add documentation for the uaccess logging feature.

Link: https://linux-review.googlesource.com/id/Ia626c0ca91bc0a3d8067d7f28406aa40693b65a2
Signed-off-by: Peter Collingbourne <pcc@google.com>
---
 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/uaccess-logging.rst | 149 ++++++++++++++++++
 2 files changed, 150 insertions(+)
 create mode 100644 Documentation/admin-guide/uaccess-logging.rst
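The two guarantees in the Limitations section suggest how a consumer might treat the log: a missing entry is not an error, but a logged access that leaves the region the program meant to hand to the kernel is worth reporting. The following is a minimal sketch of that check, not anything from the patch itself. `access_in_bounds` and `count_out_of_bounds` are hypothetical helper names, the entry layout is copied from the patch, and addresses are assumed to be untagged (on arm64 a real tool would need to account for the preserved tag bits).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Entry layout copied from the patch. */
struct uaccess_buffer_entry {
	uint64_t addr, size, flags;
};

/*
 * Hypothetical helper: true if [addr, addr + size) lies entirely inside
 * [base, base + len). Written so that none of the arithmetic can wrap.
 */
static bool access_in_bounds(const struct uaccess_buffer_entry *e,
			     uint64_t base, uint64_t len)
{
	return e->addr >= base && e->size <= len &&
	       e->addr - base <= len - e->size;
}

/*
 * Hypothetical helper: count logged accesses that fall outside the region
 * the caller expected the kernel to touch. An empty or truncated log yields
 * fewer findings, never false ones, which matches the best-effort contract.
 */
static size_t count_out_of_bounds(const struct uaccess_buffer_entry *entries,
				  size_t n, uint64_t base, uint64_t len)
{
	size_t bad = 0;

	for (size_t i = 0; i < n; i++)
		if (!access_in_bounds(&entries[i], base, len))
			bad++;
	return bad;
}
```

A sanitizer would presumably consult its own shadow state rather than a single range, so this only illustrates the direction of the check: entries are trusted not to over-report, and silence is tolerated.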