Message ID: 20250206105435.2159977-1-memxor@gmail.com (mailing list archive)
Series:     Resilient Queued Spin Lock
On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:

> Additionally, eBPF programs attached to different parts of the kernel
> can introduce new control flow into the kernel, which increases the
> likelihood of deadlocks in code not written to handle reentrancy. There
> have been multiple syzbot reports surfacing deadlocks in internal kernel
> code due to the diverse ways in which eBPF programs can be attached to
> different parts of the kernel. By switching the BPF subsystem’s lock
> usage to rqspinlock, all of these issues can be mitigated at runtime.

Only if the called stuff is using this new lock. IIRC we've had a number
of cases where eBPF was used to tie together 'normal' kernel functions
in a way that wasn't sound. You can't help there.

As an example, eBPF calling strncpy_from_user(), which ends up in fault
injection and badness happens -- this has been since fixed, but still.
On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:

> Deadlock Detection
> ~~~~~~~~~~~~~~~~~~
> We handle two cases of deadlocks: AA deadlocks (attempts to acquire the
> same lock again), and ABBA deadlocks (attempts to acquire two locks in
> the opposite order from two distinct threads). Variants of ABBA
> deadlocks may be encountered with more than two locks being held in the
> incorrect order. These are not diagnosed explicitly, as they reduce to
> ABBA deadlocks.
>
> Deadlock detection is triggered immediately when beginning the waiting
> loop of a lock slow path.
>
> While timeouts ensure that any waiting loops in the locking slow path
> terminate and return to the caller, it can be excessively long in some
> situations. While the default timeout is short (0.5s), a stall for this
> duration inside the kernel can set off alerts for latency-critical
> services with strict SLOs. Ideally, the kernel should recover from an
> undesired state of the lock as soon as possible.
>
> A multi-step strategy is used to recover the kernel from waiting loops
> in the locking algorithm which may fail to terminate in a bounded amount
> of time.
>
> * Each CPU maintains a table of held locks. Entries are inserted and
>   removed upon entry into lock, and exit from unlock, respectively.
> * Deadlock detection for AA locks is thus simple: we have an AA
>   deadlock if we find a held lock entry for the lock we’re attempting
>   to acquire on the same CPU.
> * During deadlock detection for ABBA, we search through the tables of
>   all other CPUs to find situations where we are holding a lock the
>   remote CPU is attempting to acquire, and they are holding a lock we
>   are attempting to acquire. Upon encountering such a condition, we
>   report an ABBA deadlock.
> * We divide the duration between entry time point into the waiting loop
>   and the timeout time point into intervals of 1 ms, and perform
>   deadlock detection until timeout happens. Upon entry into the slow
>   path, and then completion of each 1 ms interval, we perform detection
>   of both AA and ABBA deadlocks. In the event that deadlock detection
>   yields a positive result, the recovery happens sooner than the
>   timeout. Otherwise, it happens as a last resort upon completion of
>   the timeout.
>
> Timeouts
> ~~~~~~~~
> Timeouts act as final line of defense against stalls for waiting loops.
> The ‘ktime_get_mono_fast_ns’ function is used to poll for the current
> time, and it is compared to the timestamp indicating the end time in the
> waiter loop. Each waiting loop is instrumented to check an extra
> condition using a macro. Internally, the macro implementation amortizes
> the checking of the timeout to avoid sampling the clock in every
> iteration. Precisely, the timeout checks are invoked every 64k
> iterations.
>
> Recovery
> ~~~~~~~~

I'm probably bad at reading, but I failed to find anything that
explained how you recover from a deadlock.

Do you force unload the BPF program?
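[ Editorial note: a rough sketch of the mechanism described in the quoted
  text above -- a small per-CPU table of held locks consulted for AA
  detection, plus a waiting-loop check that only samples the clock every
  64k iterations. struct rqspinlock_held and RES_NR_HELD appear later in
  the thread; all other names here are illustrative, not the actual
  patch code. ]

struct rqspinlock_held {
        int cnt;
        void *locks[RES_NR_HELD];       /* small, fixed-size table */
};
static DEFINE_PER_CPU(struct rqspinlock_held, held_locks);

static bool check_aa_deadlock(void *lock)
{
        struct rqspinlock_held *h = this_cpu_ptr(&held_locks);
        int i;

        /* AA: this CPU already holds the lock it is about to spin on. */
        for (i = 0; i < h->cnt; i++)
                if (h->locks[i] == lock)
                        return true;
        return false;
}

/* Amortized check used in the waiting loops: the clock
 * (ktime_get_mono_fast_ns()) is only sampled once every 64k iterations;
 * the hypothetical check_timeout() helper is also where the AA/ABBA
 * scans run on each 1 ms boundary until the final deadline.
 */
#define RES_CHECK_TIMEOUT(ts, ret)                              \
        ({                                                      \
                if (!((ts).spin++ & 0xffff))                    \
                        (ret) = check_timeout(&(ts));           \
                (ret);                                          \
        })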
On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:

> Changelog:
> ----------
> v1 -> v2
> v1: https://lore.kernel.org/bpf/20250107140004.2732830-1-memxor@gmail.com
>
> * Address nits from Waiman and Peter
> * Fix arm64 WFE bug pointed out by Peter.

What's the state of that smp_cond_relaxed_timeout() patch-set? That
still seems like what you're needing, right?
On Mon, Feb 10, 2025 at 10:38:41AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:
> >
> > Deadlock Detection
> > ~~~~~~~~~~~~~~~~~~
> > We handle two cases of deadlocks: AA deadlocks (attempts to acquire the
> > same lock again), and ABBA deadlocks (attempts to acquire two locks in
> > the opposite order from two distinct threads). Variants of ABBA
> > deadlocks may be encountered with more than two locks being held in the
> > incorrect order. These are not diagnosed explicitly, as they reduce to
> > ABBA deadlocks.
> >
> > Deadlock detection is triggered immediately when beginning the waiting
> > loop of a lock slow path.
> >
> > While timeouts ensure that any waiting loops in the locking slow path
> > terminate and return to the caller, it can be excessively long in some
> > situations. While the default timeout is short (0.5s), a stall for this
> > duration inside the kernel can set off alerts for latency-critical
> > services with strict SLOs. Ideally, the kernel should recover from an
> > undesired state of the lock as soon as possible.
> >
> > A multi-step strategy is used to recover the kernel from waiting loops
> > in the locking algorithm which may fail to terminate in a bounded amount
> > of time.
> >
> > * Each CPU maintains a table of held locks. Entries are inserted and
> >   removed upon entry into lock, and exit from unlock, respectively.
> > * Deadlock detection for AA locks is thus simple: we have an AA
> >   deadlock if we find a held lock entry for the lock we’re attempting
> >   to acquire on the same CPU.
> > * During deadlock detection for ABBA, we search through the tables of
> >   all other CPUs to find situations where we are holding a lock the
> >   remote CPU is attempting to acquire, and they are holding a lock we
> >   are attempting to acquire. Upon encountering such a condition, we
> >   report an ABBA deadlock.
> > * We divide the duration between entry time point into the waiting loop
> >   and the timeout time point into intervals of 1 ms, and perform
> >   deadlock detection until timeout happens. Upon entry into the slow
> >   path, and then completion of each 1 ms interval, we perform detection
> >   of both AA and ABBA deadlocks. In the event that deadlock detection
> >   yields a positive result, the recovery happens sooner than the
> >   timeout. Otherwise, it happens as a last resort upon completion of
> >   the timeout.
> >
> > Timeouts
> > ~~~~~~~~
> > Timeouts act as final line of defense against stalls for waiting loops.
> > The ‘ktime_get_mono_fast_ns’ function is used to poll for the current
> > time, and it is compared to the timestamp indicating the end time in the
> > waiter loop. Each waiting loop is instrumented to check an extra
> > condition using a macro. Internally, the macro implementation amortizes
> > the checking of the timeout to avoid sampling the clock in every
> > iteration. Precisely, the timeout checks are invoked every 64k
> > iterations.
> >
> > Recovery
> > ~~~~~~~~
>
> I'm probably bad at reading, but I failed to find anything that
> explained how you recover from a deadlock.
>
> Do you force unload the BPF program?

Even the simple AB-BA case,

        CPU0            CPU1
        lock-A          lock-B
        lock-B          lock-A <-

just having a random lock op return -ETIMO doesn't actually solve
anything. Suppose CPU1's lock-A will time out; it will have to unwind
and release lock-B before CPU0 can make progress.

Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
lock-B will also time out.

At which point they'll both try again and you're stuck in the same
place, no?

Given you *have* to unwind to make progress; why not move the entire
thing to a wound-wait style lock? Then you also get rid of the whole
timeout mess.
Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:
>> Changelog:
>> ----------
>> v1 -> v2
>> v1: https://lore.kernel.org/bpf/20250107140004.2732830-1-memxor@gmail.com
>>
>> * Address nits from Waiman and Peter
>> * Fix arm64 WFE bug pointed out by Peter.
>
> What's the state of that smp_cond_relaxed_timeout() patch-set?

Just waiting for review comments:
https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/

--
ankur
On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Do you force unload the BPF program?

Not yet. As you can imagine, cancelling bpf program is much
harder than sending sigkill to the user space process.
The prog needs to safely free all the resources it holds.
This work was ongoing for a couple years now with numerous discussions.
Many steps in-between are being considered as well.
Including detaching misbehaving prog, but there is always
a counter argument.

> Even the simple AB-BA case,
>
>         CPU0            CPU1
>         lock-A          lock-B
>         lock-B          lock-A <-
>
> just having a random lock op return -ETIMO doesn't actually solve
> anything. Suppose CPU1's lock-A will time out; it will have to unwind
> and release lock-B before CPU0 can make progress.
>
> Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> lock-B will also time out.
>
> At which point they'll both try again and you're stuck in the same
> place, no?

Not really. You're missing that deadlock is not a normal case.
As soon as we have cancellation logic working we will be "sigkilling"
prog where deadlock was detected.
In this patch the verifier guarantees that the prog must check
the return value from bpf_res_spin_lock().
The prog cannot keep re-trying. The only thing it can do is to exit.
Failing to grab res_spin_lock() is not a normal condition.
The prog has to implement a fallback path for it,
but it has the look and feel of normal spin_lock and
algorithms are written assuming that the lock will be taken.
If res_spin_lock errors, it's a bug in the prog or the prog was
invoked from an unexpected context.

Same thing for patches 19,20,21 where we're addressing years of
accumulated tech debt in the bpf core parts, like bpf hashmap.
Once res_spin_lock() fails in kernel/bpf/hashtab.c the
bpf_map_update_elem() will return EBUSY (just like it does now when
it detects re-entrance on bucket lock). There is no retry.
If res_spin_lock fails in bpf hashmap it's 99% case of syzbot doing
"clever" attaching of bpf progs to bpf internals and trying hard
to break things.

> Given you *have* to unwind to make progress; why not move the entire
> thing to a wound-wait style lock? Then you also get rid of the whole
> timeout mess.

We looked at things like ww_mutex_lock, but they don't fit.
wound-wait is for databases where deadlock is normal and expected.
The transaction has to be aborted and retried.
res_spin_lock is different. It's kinda safe spin_lock that doesn't
brick the kernel.
To be a drop in replacement it has to perform at the same speed
as spin_lock. Hence the massive benchmarking effort that you see
in the cover letter.
That's also the reason to keep it 4 bytes.
We don't want to increase it to 8 or whatever unless it's
absolutely necessary.

In the other email you say:

> And it seems to me this thing might benefit somewhat significantly from
> adding this little extra bit.

referring to optimization that 8 byte res_spin_lock can potentially do
O(1) ABBA deadlock detection instead of O(NR_CPUS).
That was a conscious trade-off. Deadlocks are not normal.
If it takes a bit longer to detect it's fine.
The res_spin_lock is optimized to proceed as normal qspinlock.
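[ Editorial note: to illustrate the "check the return value, fall back,
  no retry" contract described above, here is a rough sketch of what a
  program using these kfuncs might look like. The map layout, section
  and function names are made up for illustration; only
  bpf_res_spin_lock()/bpf_res_spin_unlock() come from the series. ]

struct val {
        struct bpf_res_spin_lock lock;  /* lock lives in the map value */
        long counter;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct val);
} counters SEC(".maps");

SEC("tc")
int update_counter(struct __sk_buff *skb)
{
        int key = 0;
        struct val *v = bpf_map_lookup_elem(&counters, &key);

        if (!v)
                return 0;
        /* The verifier rejects the program unless this return value is
         * checked. On -EDEADLK/-ETIMEDOUT the only option is a fallback
         * path and exit; there is no retry loop.
         */
        if (bpf_res_spin_lock(&v->lock))
                return 0;
        v->counter++;
        bpf_res_spin_unlock(&v->lock);
        return 0;
}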
On Mon, Feb 10, 2025 at 08:37:06PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Do you force unload the BPF program?
>
> Not yet. As you can imagine, cancelling bpf program is much
> harder than sending sigkill to the user space process.

So you are killing the user program? Because it wasn't at all clear what
if anything is done when this failure case is tripped.

> The prog needs to safely free all the resources it holds.
> This work was ongoing for a couple years now with numerous discussions.

Well, for you maybe, I'm new here. This is only the second submission,
and really only the first one I got to mostly read.

> > Even the simple AB-BA case,
> >
> >         CPU0            CPU1
> >         lock-A          lock-B
> >         lock-B          lock-A <-
> >
> > just having a random lock op return -ETIMO doesn't actually solve
> > anything. Suppose CPU1's lock-A will time out; it will have to unwind
> > and release lock-B before CPU0 can make progress.
> >
> > Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> > lock-B will also time out.
> >
> > At which point they'll both try again and you're stuck in the same
> > place, no?
>
> Not really. You're missing that deadlock is not a normal case.

Well, if this is unpriv user programs, you should most definitely
consider them the normal case. Must assume user space is malicious.

> As soon as we have cancellation logic working we will be "sigkilling"
> prog where deadlock was detected.

Ah, so that's the plan, but not yet included here? This means that every
BPF program invocation must be 'cancellable'? What if kernel thread is
hitting tracepoint or somesuch?

So much details not clear to me and not explained either :/

> In this patch the verifier guarantees that the prog must check
> the return value from bpf_res_spin_lock().

Yeah, but so what? It can check and still not do the right thing. Only
checking the return value is consumed somehow doesn't really help much.

> The prog cannot keep re-trying.
> The only thing it can do is to exit.

Right, but it might have already modified things, how are you going to
recover from that?

> Failing to grab res_spin_lock() is not a normal condition.

If you're going to be exposing this to unpriv, I really do think you
should assume it to be the normal case.

> The prog has to implement a fallback path for it,

But verifier must verify it is sane fallback, how can it do that?

> > Given you *have* to unwind to make progress; why not move the entire
> > thing to a wound-wait style lock? Then you also get rid of the whole
> > timeout mess.
>
> We looked at things like ww_mutex_lock, but they don't fit.
> wound-wait is for databases where deadlock is normal and expected.
> The transaction has to be aborted and retried.

Right, which to me sounds exactly like what you want for unpriv.

Have the program structured such that it must acquire all locks before
it does a modification / store -- and have the verifier enforce this.
Then any lock failure can be handled by the bpf core, not the program
itself. Core can unlock all previously acquired locks, and core can
either re-attempt the program or 'skip' it after N failures.

It does mean the bpf core needs to track the acquired locks -- which you
already do, except it becomes mandatory, prog cannot acquire more than
~32 locks.

> res_spin_lock is different. It's kinda safe spin_lock that doesn't
> brick the kernel.

Well, 1/2 second is pretty much bricked imo.

> That was a conscious trade-off. Deadlocks are not normal.

I really do think you should assume they are normal, unpriv and all
that.
On Tue, Feb 11, 2025 at 2:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 10, 2025 at 08:37:06PM -0800, Alexei Starovoitov wrote:
> > On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > Do you force unload the BPF program?
> >
> > Not yet. As you can imagine, cancelling bpf program is much
> > harder than sending sigkill to the user space process.
>
> So you are killing the user program? Because it wasn't at all clear what
> if anything is done when this failure case is tripped.

No. We're not killing the user process.
bpf progs often run when there is no owner process.
They're just attached somewhere and doing things.
Like XDP firewall will work just fine without any user space.

> > The prog needs to safely free all the resources it holds.
> > This work was ongoing for a couple years now with numerous discussions.
>
> Well, for you maybe, I'm new here. This is only the second submission,
> and really only the first one I got to mostly read.
>
> > > Even the simple AB-BA case,
> > >
> > >         CPU0            CPU1
> > >         lock-A          lock-B
> > >         lock-B          lock-A <-
> > >
> > > just having a random lock op return -ETIMO doesn't actually solve
> > > anything. Suppose CPU1's lock-A will time out; it will have to unwind
> > > and release lock-B before CPU0 can make progress.
> > >
> > > Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> > > lock-B will also time out.
> > >
> > > At which point they'll both try again and you're stuck in the same
> > > place, no?
> >
> > Not really. You're missing that deadlock is not a normal case.
>
> Well, if this is unpriv user programs, you should most definitely
> consider them the normal case. Must assume user space is malicious.

Ohh. No unpriv here.
Since spectre was discovered unpriv bpf died.
BPF_UNPRIV_DEFAULT_OFF=y was the default for distros and
all hyperscalers for quite some time.

> > As soon as we have cancellation logic working we will be "sigkilling"
> > prog where deadlock was detected.
>
> Ah, so that's the plan, but not yet included here? This means that every
> BPF program invocation must be 'cancellable'? What if kernel thread is
> hitting tracepoint or somesuch?
>
> So much details not clear to me and not explained either :/

Yes. The plan is to "kill" bpf prog when it misbehaves.
But this is orthogonal to this res_spin_lock set which is
a building block.

> Right, but it might have already modified things, how are you going to
> recover from that?

Tracking resources acquisition and release by the bpf prog
is a normal verifier job.
When bpf prog does bpf_rcu_read_lock() the verifier makes sure
that all execution paths from there on have bpf_rcu_read_unlock()
before program reaches the exit.
Same thing with locks.
If bpf_res_spin_lock() succeeds the verifier will make sure
there is matching bpf_res_spin_unlock().
If some resource was acquired before bpf_res_spin_lock() and
it returned -EDEADLK the verifier will not allow early return
without releasing all acquired resources.

> > Failing to grab res_spin_lock() is not a normal condition.
>
> If you're going to be exposing this to unpriv, I really do think you
> should assume it to be the normal case.

No unpriv for foreseeable future.

> > The prog has to implement a fallback path for it,
>
> But verifier must verify it is sane fallback, how can it do that?
>
> > > Given you *have* to unwind to make progress; why not move the entire
> > > thing to a wound-wait style lock? Then you also get rid of the whole
> > > timeout mess.
> >
> > We looked at things like ww_mutex_lock, but they don't fit.
> > wound-wait is for databases where deadlock is normal and expected.
> > The transaction has to be aborted and retried.
>
> Right, which to me sounds exactly like what you want for unpriv.
>
> Have the program structured such that it must acquire all locks before
> it does a modification / store -- and have the verifier enforce this.
> Then any lock failure can be handled by the bpf core, not the program
> itself. Core can unlock all previously acquired locks, and core can
> either re-attempt the program or 'skip' it after N failures.

We definitely don't want to bpf core to keep track of acquired resources.
That just doesn't scale.
There could be rcu_read_locks, all kinds of refcounted objects,
locks taken, and so on.
The verifier makes sure that the program does the release no matter
what the execution path.
That's how it scales.
On my devserver I have 152 bpf programs running.
All of them keep acquiring and releasing resources (locks, sockets,
memory) million times a second.
The verifier checks that each prog is doing its job individually.

> It does mean the bpf core needs to track the acquired locks -- which you
> already do,

We don't.
The bpf infra does static checks only.
The core doesn't track objects at run-time.
The only exceptions are map elements.
bpf prog might store an acquired object in a map.
Only in that case bpf infra will free that object when it frees
the whole map.
But that doesn't apply to short lived things like RCU CS and
locks. Those cannot last long. They must complete within single
execution of the prog.

> > That was a conscious trade-off. Deadlocks are not normal.
>
> I really do think you should assume they are normal, unpriv and all
> that.

No unpriv and no, we don't want deadlocks to be considered normal
by bpf users. They need to hear "fix your broken prog" message loud
and clear. Patch 14 splat is a step in that direction.
Currently it's only for in-kernel res_spin_lock() usage
(like in bpf hashtab). Eventually we will deliver the message to users
without polluting dmesg. Still debating the actual mechanism.
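[ Editorial note: a sketch of the verifier rule described above -- a
  resource acquired before a failing bpf_res_spin_lock() must be
  released on the error path before the program may exit, or the load
  is rejected. This continues the map value 'v' from the earlier
  editorial sketch; the ringbuf map 'events' and struct event are
  likewise illustrative. ]

        struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);

        if (!e)         /* 'e' is an acquired reference from here on */
                return 0;
        if (bpf_res_spin_lock(&v->lock)) {
                /* Lock failed (-EDEADLK/-ETIMEDOUT): the verifier
                 * insists the reservation is released on this path too.
                 */
                bpf_ringbuf_discard(e, 0);
                return 0;
        }
        e->value = v->counter++;
        bpf_res_spin_unlock(&v->lock);
        bpf_ringbuf_submit(e, 0);
        return 0;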
On Tue, Feb 11, 2025 at 10:33:00AM -0800, Alexei Starovoitov wrote:

> Ohh. No unpriv here.
> Since spectre was discovered unpriv bpf died.
> BPF_UNPRIV_DEFAULT_OFF=y was the default for distros and
> all hyperscalers for quite some time.

Ah, okay. Time to remove the option then?

> > So much details not clear to me and not explained either :/
>
> Yes. The plan is to "kill" bpf prog when it misbehaves.
> But this is orthogonal to this res_spin_lock set which is
> a building block.
>
> > Right, but it might have already modified things, how are you going to
> > recover from that?
>
> Tracking resources acquisition and release by the bpf prog
> is a normal verifier job.
> When bpf prog does bpf_rcu_read_lock() the verifier makes sure
> that all execution paths from there on have bpf_rcu_read_unlock()
> before program reaches the exit.
> Same thing with locks.

Ah, okay, this wasn't stated anywhere. This is rather crucial
information.

> If bpf_res_spin_lock() succeeds the verifier will make sure
> there is matching bpf_res_spin_unlock().
> If some resource was acquired before bpf_res_spin_lock() and
> it returned -EDEADLK the verifier will not allow early return
> without releasing all acquired resources.

Good.

> > Have the program structured such that it must acquire all locks before
> > it does a modification / store -- and have the verifier enforce this.
> > Then any lock failure can be handled by the bpf core, not the program
> > itself. Core can unlock all previously acquired locks, and core can
> > either re-attempt the program or 'skip' it after N failures.
>
> We definitely don't want to bpf core to keep track of acquired resources.
> That just doesn't scale.
> There could be rcu_read_locks, all kinds of refcounted objects,
> locks taken, and so on.
> The verifier makes sure that the program does the release no matter
> what the execution path.
> That's how it scales.
> On my devserver I have 152 bpf programs running.
> All of them keep acquiring and releasing resources (locks, sockets,
> memory) million times a second.
> The verifier checks that each prog is doing its job individually.

Well, this patch set tracks the held lock stack -- which is required in
order to do the deadlock thing after all.

> > It does mean the bpf core needs to track the acquired locks -- which you
> > already do,
>
> We don't.

This patch set does exactly that. Is required for deadlock analysis.

> The bpf infra does static checks only.
> The core doesn't track objects at run-time.
> The only exceptions are map elements.
> bpf prog might store an acquired object in a map.
> Only in that case bpf infra will free that object when it frees
> the whole map.
> But that doesn't apply to short lived things like RCU CS and
> locks. Those cannot last long. They must complete within single
> execution of the prog.

Right. Held lock stack is like that.

> > > That was a conscious trade-off. Deadlocks are not normal.
> >
> > I really do think you should assume they are normal, unpriv and all
> > that.
>
> No unpriv and no, we don't want deadlocks to be considered normal
> by bpf users. They need to hear "fix your broken prog" message loud
> and clear. Patch 14 splat is a step in that direction.
> Currently it's only for in-kernel res_spin_lock() usage
> (like in bpf hashtab). Eventually we will deliver the message to users
> without polluting dmesg. Still debating the actual mechanism.

OK; how is the user supposed to handle locking two hash buckets? Does
the BPF prog create some global lock to serialize the multi bucket case?

Anyway, I wonder. Since the verifier tracks all this, it can determine
lock order for the prog. Can't it do what lockdep does and maintain lock
order graph of all loaded BPF programs?

This is load-time overhead, rather than runtime.
On Thu, Feb 13, 2025 at 1:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 11, 2025 at 10:33:00AM -0800, Alexei Starovoitov wrote:
>
> > Ohh. No unpriv here.
> > Since spectre was discovered unpriv bpf died.
> > BPF_UNPRIV_DEFAULT_OFF=y was the default for distros and
> > all hyperscalers for quite some time.
>
> Ah, okay. Time to remove the option then?

Good point. Indeed.
Will accept the patch if anyone has cycles to prep it, test it.

> > > So much details not clear to me and not explained either :/
> >
> > Yes. The plan is to "kill" bpf prog when it misbehaves.
> > But this is orthogonal to this res_spin_lock set which is
> > a building block.
> >
> > > Right, but it might have already modified things, how are you going to
> > > recover from that?
> >
> > Tracking resources acquisition and release by the bpf prog
> > is a normal verifier job.
> > When bpf prog does bpf_rcu_read_lock() the verifier makes sure
> > that all execution paths from there on have bpf_rcu_read_unlock()
> > before program reaches the exit.
> > Same thing with locks.
>
> Ah, okay, this wasn't stated anywhere. This is rather crucial
> information.

This is kinda verifier 101. I don't think it needs to be in the log.

> > We definitely don't want to bpf core to keep track of acquired resources.
> > That just doesn't scale.
> > There could be rcu_read_locks, all kinds of refcounted objects,
> > locks taken, and so on.
> > The verifier makes sure that the program does the release no matter
> > what the execution path.
> > That's how it scales.
> > On my devserver I have 152 bpf programs running.
> > All of them keep acquiring and releasing resources (locks, sockets,
> > memory) million times a second.
> > The verifier checks that each prog is doing its job individually.
>
> Well, this patch set tracks the held lock stack -- which is required in
> order to do the deadlock thing after all.

Right, but the held lock set is per-cpu global and not exhaustive.
It cannot detect 3-lock circles _by design_.
We rely on timeout for extreme cases.

> > The bpf infra does static checks only.
> > The core doesn't track objects at run-time.
> > The only exceptions are map elements.
> > bpf prog might store an acquired object in a map.
> > Only in that case bpf infra will free that object when it frees
> > the whole map.
> > But that doesn't apply to short lived things like RCU CS and
> > locks. Those cannot last long. They must complete within single
> > execution of the prog.
>
> Right. Held lock stack is like that.

They're not equivalent and not used for correctness.
See patch 26 and res_spin_lock_test_held_lock_max() selftest
that was added specifically to overwhelm:

+struct rqspinlock_held {
+        int cnt;
+        void *locks[RES_NR_HELD];
+};

It's an impossible case in reality, but the res_spin_lock code
should be prepared for extreme cases like that.
Just like existing qspinlock has 4 percpu qnodes and test-and-set
fallback in case "if (unlikely(idx >= MAX_NODES))" line qspinlock.c:413.
Can it happen in practice ? Probably never.
But the code has to be ready to handle it.

> > > > That was a conscious trade-off. Deadlocks are not normal.
> > >
> > > I really do think you should assume they are normal, unpriv and all
> > > that.
> >
> > No unpriv and no, we don't want deadlocks to be considered normal
> > by bpf users. They need to hear "fix your broken prog" message loud
> > and clear. Patch 14 splat is a step in that direction.
> > Currently it's only for in-kernel res_spin_lock() usage
> > (like in bpf hashtab). Eventually we will deliver the message to users
> > without polluting dmesg. Still debating the actual mechanism.
>
> OK; how is the user supposed to handle locking two hash buckets? Does
> the BPF prog create some global lock to serialize the multi bucket case?

Not following.
Are you talking about patch 19 where we convert per-bucket
raw_spinlock_t in bpf hashmap to rqspinlock_t ?

Only one bucket lock is held at a time by map update code,
but due to reentrance and crazy kprobes in the wrong places
two bucket locks of a single map can be held on the same cpu.

bpf_prog_A -> bpf_map_update -> res_spin_lock(bucket_A)
  -> kprobe or tracepoint
    -> bpf_prog_B -> bpf_map_update -> res_spin_lock(bucket_B)

and that's why we currently have:

        if (__this_cpu_inc_return(*(htab->map_locked[hash])) ...
                return -EBUSY;
        ..

workaround to prevent the most obvious AA deadlock,
but it's not enough. People were able to hit ABBA.

Note, raw_spin_lock today (and res_spin_lock after patch 19)
is used by proper kernel code in kernel/bpf/hashtab.c.
bpf prog just calls bpf_map_update() which is a normal helper call
from the verifier point of view. It doesn't know whether there are
locks inside or not.
bpf_ktime_get_ns() helper is similar.
The verifier knows that it's safe from NMI, but what kinds of locks
inside it doesn't care.

> Anyway, I wonder. Since the verifier tracks all this, it can determine
> lock order for the prog. Can't it do what lockdep does and maintain lock
> order graph of all loaded BPF programs?
>
> This is load-time overhead, rather than runtime.

I wish it was possible. Locks are dynamic. They protect dynamically
allocated objects, so the order cannot be statically verified.
We pushed the limit of static analysis a lot. Maybe too much.
For example, the verifier can statically validate the following code:

        struct node_data *n, *m, *o;
        struct bpf_rb_node *res, *res2;

        // here we allocate an object of type known to the verifier
        n = bpf_obj_new(typeof(*n));
        if (!n)
                return 1;
        n->key = 41;
        n->data = 42;

        // here the verifier knows that glock spin_lock
        // protect rbtree groot
        bpf_spin_lock(&glock);

        // here it checks that the lock is held and type of
        // objects in rbtree matches the type of 'n'
        bpf_rbtree_add(&groot, &n->node, less);
        bpf_spin_unlock(&glock);

and all kinds of other more complex stuff,
but it is not enough to cover necessary algorithms.

Here is an example from real code that shows
why we cannot verify two held locks:

        struct bpf_vqueue {
                struct bpf_spin_lock lock;
                int credit;
                unsigned long long lasttime;
                unsigned int rate;
        };

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, ...);
                __type(key, int);
                __type(value, struct bpf_vqueue);
        } vqueue SEC(".maps");

        q = bpf_map_lookup_elem(&vqueue, &key);
        if (!q)
                goto err;
        curtime = bpf_ktime_get_ns();
        bpf_spin_lock(&q->lock);
        q->lasttime = curtime;
        q->credit -= ...;
        credit = q->credit;
        bpf_spin_unlock(&q->lock);

the above is safe, but if there are two lookups:

        q1 = bpf_map_lookup_elem(&vqueue, &key1);
        q2 = bpf_map_lookup_elem(&vqueue, &key2);

both will point to two different locks,
and since the key is dynamic there is no way to know
the order of q1->lock vs q2->lock.
So we allow only one lock at a time with bare minimal
operations while holding the lock,
but it's not enough to do any meaningful work.

The top feature request is to allow calls while holding locks
(currently they're disallowed, like above bpf_ktime_get_ns()
cannot be done while holding the lock)
and allow grabbing more than one lock.
That's what res_spin_lock() is achieving.

Having said all that I think the discussion is diverging into
all-thing-bpf instead of focusing on res_spin_lock.

Just to make it clear... there is a patch 18:

 F:     kernel/bpf/
 F:     kernel/trace/bpf_trace.c
 F:     lib/buildid.c
+F:     arch/*/include/asm/rqspinlock.h
+F:     include/asm-generic/rqspinlock.h
+F:     kernel/locking/rqspinlock.c
 F:     lib/test_bpf.c
 F:     net/bpf/

that adds maintainer entries to BPF scope.
We're not asking locking experts to maintain this new res_spin_lock.
It's not a generic kernel infra. It will only be used by bpf infra
and by bpf progs.
We will maintain it and we will fix whatever bugs we introduce.
We can place it in kernel/bpf/rqspinlock.c to make things more obvious,
but kernel/locking/ feels a bit cleaner.

We're not asking to review patches 14 and higher.
They are presented for completeness.
(patch 17 was out-of-order. It will be moved sooner. Sorry about that)
But welcome feedback for patches 1-13.
Like the way you spotted broken smp_cond_load_acquire() on arm64
due to WFE. That was a great catch. We really appreciate it.