[RFC,v1,00/14] Exceptions - Resource Cleanup

Message ID	20240201042109.1150490-1-memxor@gmail.com (mailing list archive)
Headers	show Received: from mail-ed1-f66.google.com (mail-ed1-f66.google.com [209.85.208.66]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8EEF23B78D for <bpf@vger.kernel.org>; Thu, 1 Feb 2024 04:21:13 +0000 (UTC) From: Kumar Kartikeya Dwivedi <memxor@gmail.com> To: bpf@vger.kernel.org Cc: Alexei Starovoitov <ast@kernel.org>, Andrii Nakryiko <andrii@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Martin KaFai Lau <martin.lau@kernel.org>, David Vernet <void@manifault.com>, Tejun Heo <tj@kernel.org>, Raj Sahu <rjsu26@vt.edu>, Dan Williams <djwillia@vt.edu>, Rishabh Iyer <rishabh.iyer@epfl.ch>, Sanidhya Kashyap <sanidhya.kashyap@epfl.ch> Subject: [RFC PATCH v1 00/14] Exceptions - Resource Cleanup Date: Thu, 1 Feb 2024 04:20:55 +0000 Message-Id: <20240201042109.1150490-1-memxor@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Exceptions - Resource Cleanup \| expand [RFC,v1,00/14] Exceptions - Resource Cleanup [RFC,v1,01/14] bpf: Mark subprogs as throw reachable before do_check pass [RFC,v1,02/14] bpf: Process global subprog's exception propagation [RFC,v1,03/14] selftests/bpf: Add test for throwing global subprog with acquired refs [RFC,v1,04/14] bpf: Refactor check_pseudo_btf_id's BTF reference bump [RFC,v1,05/14] bpf: Implement BPF exception frame descriptor generation [RFC,v1,06/14] bpf: Adjust frame descriptor pc on instruction patching [RFC,v1,07/14] bpf: Use hidden subprog trampoline for bpf_throw [RFC,v1,08/14] bpf: Compute used callee saved registers for subprogs [RFC,v1,09/14] bpf, x86: Fix up pc offsets for frame descriptor entries [RFC,v1,10/14] bpf, x86: Implement runtime resource cleanup for exceptions [RFC,v1,11/14] bpf: Release references in verifier state when throwing exceptions [RFC,v1,12/14] bpf: Register cleanup dtors for runtime unwinding [RFC,v1,13/14] bpf: Make bpf_throw available to all program types [RFC,v1,14/14] selftests/bpf: Add tests for exceptions runtime cleanup

Message ID

20240201042109.1150490-1-memxor@gmail.com (mailing list archive)

Headers

From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <martin.lau@kernel.org>,
	David Vernet <void@manifault.com>,
	Tejun Heo <tj@kernel.org>,
	Raj Sahu <rjsu26@vt.edu>,
	Dan Williams <djwillia@vt.edu>,
	Rishabh Iyer <rishabh.iyer@epfl.ch>,
	Sanidhya Kashyap <sanidhya.kashyap@epfl.ch>
Subject: [RFC PATCH v1 00/14] Exceptions - Resource Cleanup
Date: Thu,  1 Feb 2024 04:20:55 +0000
Message-Id: <20240201042109.1150490-1-memxor@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

Exceptions - Resource Cleanup | expand

Message

Kumar Kartikeya Dwivedi Feb. 1, 2024, 4:20 a.m. UTC

This set implements the second part of the exceptions set, with support
for releasing resources at runtime during the unwinding phase. This
allows programs to throw an exception at any point of time and terminate
their execution immediately.

Currently, any acquired resources held by the program will cause a
thrown exception to fail during verification. This is because safely
unwinding the stack requires releasing these resources which represent
kernel objects and locks. Not doing this and continuing the stack
unwinding will destroy the stack frames containing such objects and will
lead to all kinds of issues and the violation of BPF's safety
properties.

Note that while the current mechanism only supports throwing exceptions
synchronously, the unwinding mechanism only requires a valid frame
descriptor to perform the cleanup. Thus, in a followup, we can build on
top of this series to introduce support for preempting execution of BPF
programs and terminating them, in cases where termination is not
statically provable and the program exceeds some runtime threshold.
This can also allow holding multiple locks at the same time, detecting
deadlocks and aborting programs, etc.

In this set, we implement support that allows the kernel to release such
resources when performing the unwinding step for the program and
aborting it. To enable this, the kernel needs to be made aware about the
layout of the program stack (for each BPF frame) and the type of each
kernel object residing therein. This information is retrieved at runtime
by tying this metadata to the program counter.

Everytime a bpf_throw call is processed by the verifier, it generates a
frame descriptor for the caller and all of its caller frames, and ties
them to the program counter. At runtime, the kernel will only process
unwinding requests for such program counters, and this mapping of frame
descriptors to the PC will allow their discovery of resources for each
frame.

Note that at the same program point, there cannot be a case where the
verifier may have to produce distinct frame descriptors. If such a case
is encountered, then the verification will fail. This is an uncommon
case where depending on the program path taken at runtime, the same
stack slot may contain pointers of different types. In most of these
examples, the program would not pass the verifier anyway, since the
value that has to be freed would be lost in verifier state and cannot be
recovered.

A special provision is made for cases where the stack slot contains NULL
or a pointer in two different program paths, it is quite common for such
a case to occur where a resource may be acquired conditionally, and the
release occurs later in the program guarded by the same conditional.

Notes
=====

 * Releasing bpf_spin_lock, RCU read locks is not supported in the RFC,
   but will be done as a follow up or added to the next revision on top
   of this set.
 * A few known rough edges/minor bugs which will be fixed in the next
   version.
 * Adding more tests for corner cases.

Kumar Kartikeya Dwivedi (14):
  bpf: Mark subprogs as throw reachable before do_check pass
  bpf: Process global subprog's exception propagation
  selftests/bpf: Add test for throwing global subprog with acquired refs
  bpf: Refactor check_pseudo_btf_id's BTF reference bump
  bpf: Implement BPF exception frame descriptor generation
  bpf: Adjust frame descriptor pc on instruction patching
  bpf: Use hidden subprog trampoline for bpf_throw
  bpf: Compute used callee saved registers for subprogs
  bpf, x86: Fix up pc offsets for frame descriptor entries
  bpf, x86: Implement runtime resource cleanup for exceptions
  bpf: Release references in verifier state when throwing exceptions
  bpf: Register cleanup dtors for runtime unwinding
  bpf: Make bpf_throw available to all program types
  selftests/bpf: Add tests for exceptions runtime cleanup

 arch/x86/net/bpf_jit_comp.c                   | 116 ++-
 drivers/hid/bpf/hid_bpf_dispatch.c            |  17 +
 include/linux/bpf.h                           |  57 ++
 include/linux/bpf_verifier.h                  |   9 +-
 include/linux/btf.h                           |  10 +-
 include/linux/filter.h                        |   3 +
 kernel/bpf/btf.c                              |  11 +-
 kernel/bpf/core.c                             |  18 +
 kernel/bpf/cpumask.c                          |   3 +-
 kernel/bpf/helpers.c                          | 165 +++-
 kernel/bpf/verifier.c                         | 714 +++++++++++++++++-
 kernel/trace/bpf_trace.c                      |  16 +
 net/bpf/test_run.c                            |   4 +-
 net/core/filter.c                             |   5 +
 net/netfilter/nf_conntrack_bpf.c              |  14 +-
 net/xfrm/xfrm_state_bpf.c                     |  16 +
 tools/testing/selftests/bpf/DENYLIST.aarch64  |   1 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
 .../bpf/prog_tests/exceptions_cleanup.c       |  65 ++
 .../selftests/bpf/progs/exceptions_cleanup.c  | 468 ++++++++++++
 .../bpf/progs/exceptions_cleanup_fail.c       | 154 ++++
 .../selftests/bpf/progs/exceptions_fail.c     |  38 +-
 22 files changed, 1817 insertions(+), 88 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/exceptions_cleanup.c
 create mode 100644 tools/testing/selftests/bpf/progs/exceptions_cleanup.c
 create mode 100644 tools/testing/selftests/bpf/progs/exceptions_cleanup_fail.c


base-commit: 77326a4a06e1e97432322f403cb439880871d34d

Comments

Eduard Zingerman March 14, 2024, 11:08 a.m. UTC | #1

Hi Kumar,

Sorry for delayed response. Theoretical possibility that frame merge
might fail because of some e.g. clang version changes bugs me,
so I thought a bit on what alternatives are there.
For the sake of discussion, what do you think about runtime-based
approach:
- for each throwing sub-program reserve a stack region where
  references to currently acquired resources (and their types)
  would be stored;
- upon a call to something that acquires resource push pointer/type
  pair to this region;
- upon a call to something that releases resource delete pointer/type
  pair from this region;
- when bpf_throw() is executed, walk this region for each active frame
  and execute corresponding destructors.
- (alternatively, reserve one region for all frames).

Technically, this could be implemented as follows:
- during main verification pass:
  - when verifier processes a call to something that acquires resource,
    mark call instruction and resource type in insn_aux_data data;
  - same for processing of a call to something that releases resource;
  - keep track of a maximal number of simultaneously acquired resources;
- after main verification pass:
  - bump stack size for each subprogram by amount, necessary to hold
    the acquired resource table and assume that this table is at the
    end of the subprogram stack;
  - after each acquire call, insert a kfunc call that would add
    resource reference to the table;
  - after each release call, insert a kfunc call that would remove
    resource reference from the table.

On a surface it appears that this approach has a few advantages:
- seems simpler than frame descriptors tracking and merging
  (but only implementation would tell if this is so);
- should be resilient to program compilation changes;
- abort is possible at any program point, which might be interesting for:
  - cancelable BPF programs (where abort might be needed in the middle
    of the e.g. loop);
  - arenas, where it might be desirable to stop the program after e.g.
    faulting arena access.

The obvious disadvantage is incurred runtime cost.
On the other hand, it might be not that big.

What do you think?

Thanks,
Eduard

Kumar Kartikeya Dwivedi March 18, 2024, 5:40 a.m. UTC | #2

On Thu, 14 Mar 2024 at 12:08, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Hi Kumar,
>

Hello Eduard,

> Sorry for delayed response. Theoretical possibility that frame merge
> might fail because of some e.g. clang version changes bugs me,
> so I thought a bit on what alternatives are there.

I spent some time this weekend addressing feedback, and prepping the
next version.
As you point out (and as I explained in one of the replies to you),
the possibility of the
same stack slot / register containing a conflicting resource type can
cause merge problems.
I have had a similar thought to what you describe below (emitting
resource hints in such cases) to address this,
but I slept over your idea and have a few thoughts. Please let me know
what you think.

> For the sake of discussion, what do you think about runtime-based
> approach:
> - for each throwing sub-program reserve a stack region where
>   references to currently acquired resources (and their types)
>   would be stored;
> - upon a call to something that acquires resource push pointer/type
>   pair to this region;
> - upon a call to something that releases resource delete pointer/type
>   pair from this region;
> - when bpf_throw() is executed, walk this region for each active frame
>   and execute corresponding destructors.
> - (alternatively, reserve one region for all frames).
>

I went over this option early on but rejected it due to the typical
downsides you observed.
The maximum stack usage due to this will be peak resource acquisition
during the runtime of a subprogram (N) x 16 bytes.
The maximum will be the sum of all subprogs in a call chain.
This seems a bit problematic to me, given this can basically greatly
exceed typical program stack usage. Compared to
extra stack slot usage by bpf_loop inlining, and may_goto instruction,
this appears to be too much.
We can mitigate it by using a more space efficient encoding to
identify resources, but the core problem still remains.
Apart from that, there's the overhead to know zero of this stack
portion on entry every time.

I just think this is all unnecessary especially when most of the time
the exception is not going to be thrown anyway,
so it's all cost incurred for a case that will practically never
happen in a correct program.

> Technically, this could be implemented as follows:
> - during main verification pass:
>   - when verifier processes a call to something that acquires resource,
>     mark call instruction and resource type in insn_aux_data data;
>   - same for processing of a call to something that releases resource;
>   - keep track of a maximal number of simultaneously acquired resources;
> - after main verification pass:
>   - bump stack size for each subprogram by amount, necessary to hold
>     the acquired resource table and assume that this table is at the
>     end of the subprogram stack;
>   - after each acquire call, insert a kfunc call that would add
>     resource reference to the table;
>   - after each release call, insert a kfunc call that would remove
>     resource reference from the table.
>
> On a surface it appears that this approach has a few advantages:
> - seems simpler than frame descriptors tracking and merging
>   (but only implementation would tell if this is so);
> - should be resilient to program compilation changes;
> - abort is possible at any program point, which might be interesting for:
>   - cancelable BPF programs (where abort might be needed in the middle
>     of the e.g. loop);

Let us discuss cancellation in another set that I plan to send in some
time after this one,
but I think aborting arbitrarily at any point of the program is not
the way to go. I have also
reviewed literature on how this happens in other language runtimes,
and the broad consensus
is to have explicit cancellation points in programs, and only allow
cancellation for them.
Synchronous and asynchronous is irrelevant to the mechanism, but
usually it is at explicit program
points.

Not only can you encounter a program executing in the middle of a
loop, you can interrupt it within a kfunc,
in such a case, you cannot really abort it immediately, as that may
not be safe. It is thus better to designate
cancellation points in the program, which are visited even in cases
where the program may possibly be stuck
for a long time (like iterator next kfunc) and only do cancellation
when executing them.

Anyway, let's continue this discussion once I send the RFC out for
cancellation. Will try to also include a PoC
for arena fault aborting.

>   - arenas, where it might be desirable to stop the program after e.g.
>     faulting arena access.
>
> The obvious disadvantage is incurred runtime cost.
> On the other hand, it might be not that big.
>
> What do you think?

Let's go back to the problem we can encounter with frame descriptors.
The case of multiple paths having distinct resource types in a common
stack slot.

What I would do is only force a spill of the specific resource type
(right after the pc acquires it)
only if their stack slots are unmergeable. Since we go over the stack
first, any conflicts in the register
types will be resolved since all states for the ref_obj_id will be
erased. For conflicts in registers, we'd
do something similar. A new 8-byte entry would be reserved for the pc
after max_stack_depth and
all these entries will be zeroed upon entry. A similar thing could be
done by the compiler itself and the
verifier wouldn't have to do all this, but for now let's just do it in
the verifier.

In this way, when we encounter an unlikely case where the same slot
has conflicting entries, we'll erase
state by storing its entry in the stack slot reserved for the pc, and
clear its state from the verifier state,
allowing all respective entries for the same resource to now merge
easily. The same can be repeated
for each resource if all of them conflict.

So far playing around with this series applying it to program
cancellation and sched-ext cases, I haven't
encountered merging issues, so it seems like an edge case that is
unlikely to often occur in practice. But when it does,
it can be resolved by the logic above. This should address all issues
and should be resilient to changes triggered by different clang
versions.

Let me know what you think.

>
> Thanks,
> Eduard