[bpf-next,v2,11/26] rqspinlock: Add deadlock detection and recovery

Message ID 20250206105435.2159977-12-memxor@gmail.com (mailing list archive)
State New
Series Resilient Queued Spin Lock

Commit Message

Kumar Kartikeya Dwivedi Feb. 6, 2025, 10:54 a.m. UTC
While the timeout logic provides guarantees for the waiter's forward
progress, the time until a stalling waiter unblocks can still be long.
The default timeout of 1/2 second can be excessively long for some use
cases, and custom timeouts may make recovery take even longer.

Introduce logic to detect common cases of deadlocks and perform quicker
recovery. This is done by dividing the time from entry into the locking
slow path until the timeout into intervals of 1 ms. Then, after each
interval elapses, deadlock detection is performed, while also polling
the lock word to ensure we can quickly break out of the detection logic
and proceed with lock acquisition.
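
In rough terms, the shape of this wait loop is as follows (a simplified,
self-contained sketch rather than the patch code; deadlock_check() is a
hypothetical stand-in for the AA/ABBA checks, which the patch actually
folds into check_timeout() and the RES_CHECK_TIMEOUT() callers in the
diff below):

static int wait_or_detect(rqspinlock_t *lock, u32 mask, u64 timeout_ns)
{
	u64 start = ktime_get_mono_fast_ns();
	u64 last_check = start;

	for (;;) {
		u64 now;

		/* Poll the lock word first so acquisition can proceed ASAP. */
		if (!(atomic_read_acquire(&lock->val) & mask))
			return 0;

		now = ktime_get_mono_fast_ns();
		if (now > start + timeout_ns)
			return -ETIMEDOUT;

		/* Run the (slower) deadlock checks once every 1 ms. */
		if (now > last_check + NSEC_PER_MSEC) {
			last_check = now;
			if (deadlock_check(lock))
				return -EDEADLK;
		}
		cpu_relax();
	}
}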

A 'held_locks' table is maintained per-CPU, where the topmost (most
recent) entry denotes the lock being waited for or just acquired, and
the entries before it denote locks that are already held. The current
CPU's table can thus be scanned to detect AA deadlocks, while the tables
of other CPUs can be scanned to discover ABBA situations. When the lock
being acquired on the current CPU is found as held in another CPU's
table, and that CPU is in turn attempting a lock held by us, a deadlock
is detected. This scan can take a long time, therefore the lock word is
polled in each loop iteration, using the is_lock_released check, so that
detection can be abandoned and lock acquisition can proceed as soon as
the lock becomes available.
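
To make the ABBA case concrete (an illustration, not an excerpt from the
patch): suppose CPU 0 holds lock A and is attempting lock B, while CPU 1
holds B and is attempting A. The per-CPU tables then look like:

  CPU 0: locks[0] = A (held)    locks[1] = B (top, being attempted)
  CPU 1: locks[0] = B (held)    locks[1] = A (top, being attempted)

CPU 0's scan finds its attempted lock B in a non-top slot of CPU 1's
table, and CPU 1's top entry A is a lock CPU 0 already holds, so
-EDEADLK is returned and the acquisition attempt is aborted.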

The 'spin' member of the rqspinlock_timeout struct is set to 0 so that
the first deadlock check is triggered immediately, allowing faster
recovery.

Note: Extending the lock word by 4 bytes to record the owner CPU would
allow faster ABBA detection, as it is typically the owner that
participates in an ABBA situation. However, to keep compatibility with
existing lock words in the kernel (struct qspinlock), and given that
deadlocks are a rare event triggered by bugs, we choose to favor
compatibility over faster detection.

The release_held_lock_entry function requires an smp_wmb, while for the
unlock path the release store on the lock word provides the necessary
ordering. Comments are added to document the subtleties of why this is
correct. Stores may still be reordered, but in the context of the
deadlock detection algorithm a release barrier is sufficient; nothing
stronger is needed for the unlock case.
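
Concretely, the hazard on the unwind path looks roughly like this (a
sketch of the scenario described in the comments, assuming the failed
attempt is immediately followed by another acquisition):

  res_spin_lock(B) fails
    release_held_lock_entry()
      WRITE_ONCE(locks[top], NULL);
      this_cpu_dec(cnt);
  res_spin_lock(A)
    try_cmpxchg_acquire(&A->val, ...)   /* acquire only, no release */
    grab_held_lock_entry(A)

Without the smp_wmb(), the table updates for B are not guaranteed to be
ordered before the stores made on behalf of A, so a remote CPU holding B
could still observe B as being attempted here and misdetect a deadlock.
The unlock path does not need an extra barrier because its release store
on the lock word already orders the entry clearing before the lock can
be re-acquired and observed elsewhere.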

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/asm-generic/rqspinlock.h |  83 +++++++++++++-
 kernel/locking/rqspinlock.c      | 183 ++++++++++++++++++++++++++++---
 2 files changed, 252 insertions(+), 14 deletions(-)

Comments

Alexei Starovoitov Feb. 8, 2025, 1:53 a.m. UTC | #1
On Thu, Feb 6, 2025 at 2:54 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> +/*
> + * It is possible to run into misdetection scenarios of AA deadlocks on the same
> + * CPU, and missed ABBA deadlocks on remote CPUs when this function pops entries
> + * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
> + * logic to preserve right entries in the table would be to walk the array of
> + * held locks and swap and clear out-of-order entries, but that's too
> + * complicated and we don't have a compelling use case for out of order unlocking.
> + *
> + * Therefore, we simply don't support such cases and keep the logic simple here.
> + */

The comment looks like a leftover from the old version of this patch.
Patch 25 now enforces the FIFO order in the verifier,
and code review will do the same for use of res_spin_lock()
in bpf internals. So pls drop the comment or reword.

> +static __always_inline void release_held_lock_entry(void)
> +{
> +       struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
> +
> +       if (unlikely(rqh->cnt > RES_NR_HELD))
> +               goto dec;
> +       WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
> +dec:
> +       this_cpu_dec(rqspinlock_held_locks.cnt);

..
> +        * We don't have a problem if the dec and WRITE_ONCE above get reordered
> +        * with each other, we either notice an empty NULL entry on top (if dec
> +        * succeeds WRITE_ONCE), or a potentially stale entry which cannot be
> +        * observed (if dec precedes WRITE_ONCE).
> +        */
> +       smp_wmb();

Since smp_wmb() is needed to address the ordering weakness vs try_cmpxchg_acquire(),
would it make sense to move it before this_cpu_dec() to address
the 2nd part of the harmless race as well?
Kumar Kartikeya Dwivedi Feb. 8, 2025, 3:03 a.m. UTC | #2
On Sat, 8 Feb 2025 at 02:54, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 6, 2025 at 2:54 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > +/*
> > + * It is possible to run into misdetection scenarios of AA deadlocks on the same
> > + * CPU, and missed ABBA deadlocks on remote CPUs when this function pops entries
> > + * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
> > + * logic to preserve right entries in the table would be to walk the array of
> > + * held locks and swap and clear out-of-order entries, but that's too
> > + * complicated and we don't have a compelling use case for out of order unlocking.
> > + *
> > + * Therefore, we simply don't support such cases and keep the logic simple here.
> > + */
>
> The comment looks like a leftover from the old version of this patch.
> Patch 25 now enforces the FIFO order in the verifier,
> and code review will do the same for use of res_spin_lock()
> in bpf internals. So pls drop the comment or reword.
>

Ok.

> > +static __always_inline void release_held_lock_entry(void)
> > +{
> > +       struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
> > +
> > +       if (unlikely(rqh->cnt > RES_NR_HELD))
> > +               goto dec;
> > +       WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
> > +dec:
> > +       this_cpu_dec(rqspinlock_held_locks.cnt);
>
> ..
> > +        * We don't have a problem if the dec and WRITE_ONCE above get reordered
> > +        * with each other, we either notice an empty NULL entry on top (if dec
> > +        * succeeds WRITE_ONCE), or a potentially stale entry which cannot be
> > +        * observed (if dec precedes WRITE_ONCE).
> > +        */
> > +       smp_wmb();
>
> Since smp_wmb() is needed to address the ordering weakness vs try_cmpxchg_acquire(),
> would it make sense to move it before this_cpu_dec() to address
> the 2nd part of the harmless race as well?

So you mean that even if the dec gets reordered with the inc, the other
side is bound to notice a NULL entry and not a stale entry, since we'll
ensure the NULL write is always visible before the dec.
Sounds like it should also work; I will think it over for a bit and
probably make this change (and perhaps likewise on the unlock side).
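
For reference, the variant being discussed would look roughly like this
(a sketch of the suggestion, not code applied in this series):

static __always_inline void release_held_lock_entry(void)
{
	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);

	if (likely(rqh->cnt <= RES_NR_HELD))
		WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
	/*
	 * Order the NULL write before the decrement (and before any stores
	 * of a subsequent lock attempt), so a remote CPU can only observe
	 * a NULL top entry, never a stale one.
	 */
	smp_wmb();
	this_cpu_dec(rqspinlock_held_locks.cnt);
}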
Peter Zijlstra Feb. 10, 2025, 10:21 a.m. UTC | #3
On Thu, Feb 06, 2025 at 02:54:19AM -0800, Kumar Kartikeya Dwivedi wrote:
> +#define RES_NR_HELD 32
> +
> +struct rqspinlock_held {
> +	int cnt;
> +	void *locks[RES_NR_HELD];
> +};

That cnt field makes the whole thing overflow a cacheline boundary.
Making it 31 makes it fit again.
Peter Zijlstra Feb. 10, 2025, 10:36 a.m. UTC | #4
On Thu, Feb 06, 2025 at 02:54:19AM -0800, Kumar Kartikeya Dwivedi wrote:
> +	/*
> +	 * Find the CPU holding the lock that we want to acquire. If there is a
> +	 * deadlock scenario, we will read a stable set on the remote CPU and
> +	 * find the target. This would be a constant time operation instead of
> +	 * O(NR_CPUS) if we could determine the owning CPU from a lock value, but
> +	 * that requires increasing the size of the lock word.
> +	 */

Is increasing the size of rqspinlock_t really a problem? For the kernel
as a whole there's very little code that really relies on spinlock_t
being u32 (lockref is an example that does care).

And it seems to me this thing might benefit somewhat significantly from
adding this little extra bit.
Kumar Kartikeya Dwivedi Feb. 13, 2025, 6:11 a.m. UTC | #5
On Mon, 10 Feb 2025 at 11:21, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 06, 2025 at 02:54:19AM -0800, Kumar Kartikeya Dwivedi wrote:
> > +#define RES_NR_HELD 32
> > +
> > +struct rqspinlock_held {
> > +     int cnt;
> > +     void *locks[RES_NR_HELD];
> > +};
>
> That cnt field makes the whole thing overflow a cacheline boundary.
> Making it 31 makes it fit again.

Makes sense, I can make RES_NR_HELD 31, it doesn't matter too much.
That's one less cacheline to pull into the local CPU during remote CPU reads.
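
For reference, the arithmetic behind this (assuming 8-byte pointers and
64-byte cachelines):

  RES_NR_HELD == 32: 4 (cnt) + 4 (padding) + 32 * 8 = 264 bytes -> spills into a 5th cacheline
  RES_NR_HELD == 31: 4 (cnt) + 4 (padding) + 31 * 8 = 256 bytes -> exactly 4 cachelines

Since the per-CPU variable is cacheline-aligned, the 31-entry table plus
the count packs into exactly four lines.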

Patch

diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 0981162c8ac7..c1dbd25287a1 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -11,15 +11,96 @@ 
 
 #include <linux/types.h>
 #include <vdso/time64.h>
+#include <linux/percpu.h>
 
 struct qspinlock;
 typedef struct qspinlock rqspinlock_t;
 
+extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
+
 /*
  * Default timeout for waiting loops is 0.5 seconds
  */
 #define RES_DEF_TIMEOUT (NSEC_PER_SEC / 2)
 
-extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
+#define RES_NR_HELD 32
+
+struct rqspinlock_held {
+	int cnt;
+	void *locks[RES_NR_HELD];
+};
+
+DECLARE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
+
+static __always_inline void grab_held_lock_entry(void *lock)
+{
+	int cnt = this_cpu_inc_return(rqspinlock_held_locks.cnt);
+
+	if (unlikely(cnt > RES_NR_HELD)) {
+		/* Still keep the inc so we decrement later. */
+		return;
+	}
+
+	/*
+	 * Implied compiler barrier in per-CPU operations; otherwise we can have
+	 * the compiler reorder inc with write to table, allowing interrupts to
+	 * overwrite and erase our write to the table (as on interrupt exit it
+	 * will be reset to NULL).
+	 */
+	this_cpu_write(rqspinlock_held_locks.locks[cnt - 1], lock);
+}
+
+/*
+ * It is possible to run into misdetection scenarios of AA deadlocks on the same
+ * CPU, and missed ABBA deadlocks on remote CPUs when this function pops entries
+ * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
+ * logic to preserve right entries in the table would be to walk the array of
+ * held locks and swap and clear out-of-order entries, but that's too
+ * complicated and we don't have a compelling use case for out of order unlocking.
+ *
+ * Therefore, we simply don't support such cases and keep the logic simple here.
+ */
+static __always_inline void release_held_lock_entry(void)
+{
+	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+
+	if (unlikely(rqh->cnt > RES_NR_HELD))
+		goto dec;
+	WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
+dec:
+	this_cpu_dec(rqspinlock_held_locks.cnt);
+	/*
+	 * This helper is invoked when we unwind upon failing to acquire the
+	 * lock. Unlike the unlock path which constitutes a release store after
+	 * we clear the entry, we need to emit a write barrier here. Otherwise,
+	 * we may have a situation as follows:
+	 *
+	 * <error> for lock B
+	 * release_held_lock_entry
+	 *
+	 * try_cmpxchg_acquire for lock A
+	 * grab_held_lock_entry
+	 *
+	 * Since these are attempts for different locks, no sequentiality is
+	 * guaranteed and reordering may occur such that dec, inc are done
+	 * before entry is overwritten. This permits a remote lock holder of
+	 * lock B to now observe it as being attempted on this CPU, and may lead
+	 * to misdetection.
+	 *
+	 * In case of unlock, we will always do a release on the lock word after
+	 * releasing the entry, ensuring that other CPUs cannot hold the lock
+	 * (and make conclusions about deadlocks) until the entry has been
+	 * cleared on the local CPU, preventing any anomalies. Reordering is
+	 * still possible there, but a remote CPU cannot observe a lock in our
+	 * table which it is already holding, since visibility entails our
+	 * release store for the said lock has not retired.
+	 *
+	 * We don't have a problem if the dec and WRITE_ONCE above get reordered
+	 * with each other, we either notice an empty NULL entry on top (if dec
+	 * succeeds WRITE_ONCE), or a potentially stale entry which cannot be
+	 * observed (if dec precedes WRITE_ONCE).
+	 */
+	smp_wmb();
+}
 
 #endif /* __ASM_GENERIC_RQSPINLOCK_H */
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index df7adec59cec..42e8a56534b6 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -30,6 +30,7 @@ 
  * Include queued spinlock definitions and statistics code
  */
 #include "qspinlock.h"
+#include "rqspinlock.h"
 #include "qspinlock_stat.h"
 
 /*
@@ -74,16 +75,146 @@ 
 struct rqspinlock_timeout {
 	u64 timeout_end;
 	u64 duration;
+	u64 cur;
 	u16 spin;
 };
 
 #define RES_TIMEOUT_VAL	2
 
-static noinline int check_timeout(struct rqspinlock_timeout *ts)
+DEFINE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
+
+static bool is_lock_released(rqspinlock_t *lock, u32 mask, struct rqspinlock_timeout *ts)
+{
+	if (!(atomic_read_acquire(&lock->val) & (mask)))
+		return true;
+	return false;
+}
+
+static noinline int check_deadlock_AA(rqspinlock_t *lock, u32 mask,
+				      struct rqspinlock_timeout *ts)
+{
+	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+	int cnt = min(RES_NR_HELD, rqh->cnt);
+
+	/*
+	 * Return an error if we hold the lock we are attempting to acquire.
+	 * We'll iterate over max 32 locks; no need to do is_lock_released.
+	 */
+	for (int i = 0; i < cnt - 1; i++) {
+		if (rqh->locks[i] == lock)
+			return -EDEADLK;
+	}
+	return 0;
+}
+
+/*
+ * This focuses on the most common case of ABBA deadlocks (or ABBA involving
+ * more locks, which reduce to ABBA). This is not exhaustive, and we rely on
+ * timeouts as the final line of defense.
+ */
+static noinline int check_deadlock_ABBA(rqspinlock_t *lock, u32 mask,
+					struct rqspinlock_timeout *ts)
+{
+	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+	int rqh_cnt = min(RES_NR_HELD, rqh->cnt);
+	void *remote_lock;
+	int cpu;
+
+	/*
+	 * Find the CPU holding the lock that we want to acquire. If there is a
+	 * deadlock scenario, we will read a stable set on the remote CPU and
+	 * find the target. This would be a constant time operation instead of
+	 * O(NR_CPUS) if we could determine the owning CPU from a lock value, but
+	 * that requires increasing the size of the lock word.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct rqspinlock_held *rqh_cpu = per_cpu_ptr(&rqspinlock_held_locks, cpu);
+		int real_cnt = READ_ONCE(rqh_cpu->cnt);
+		int cnt = min(RES_NR_HELD, real_cnt);
+
+		/*
+		 * Let's ensure to break out of this loop if the lock is available for
+		 * us to potentially acquire.
+		 */
+		if (is_lock_released(lock, mask, ts))
+			return 0;
+
+		/*
+		 * Skip ourselves, and CPUs whose count is less than 2, as they need at
+		 * least one held lock and one acquisition attempt (reflected as top
+		 * most entry) to participate in an ABBA deadlock.
+		 *
+		 * If cnt is more than RES_NR_HELD, it means the current lock being
+		 * acquired won't appear in the table, and other locks in the table are
+		 * already held, so we can't determine ABBA.
+		 */
+		if (cpu == smp_processor_id() || real_cnt < 2 || real_cnt > RES_NR_HELD)
+			continue;
+
+		/*
+		 * Obtain the entry at the top, this corresponds to the lock the
+		 * remote CPU is attempting to acquire in a deadlock situation,
+		 * and would be one of the locks we hold on the current CPU.
+		 */
+		remote_lock = READ_ONCE(rqh_cpu->locks[cnt - 1]);
+		/*
+		 * If it is NULL, we've raced and cannot determine a deadlock
+		 * conclusively, skip this CPU.
+		 */
+		if (!remote_lock)
+			continue;
+		/*
+		 * Find if the lock we're attempting to acquire is held by this CPU.
+		 * Don't consider the topmost entry, as that must be the latest lock
+		 * being held or acquired.  For a deadlock, the target CPU must also
+		 * attempt to acquire a lock we hold, so for this search only 'cnt - 1'
+		 * entries are important.
+		 */
+		for (int i = 0; i < cnt - 1; i++) {
+			if (READ_ONCE(rqh_cpu->locks[i]) != lock)
+				continue;
+			/*
+			 * We found our lock as held on the remote CPU.  Is the
+			 * acquisition attempt on the remote CPU for a lock held
+			 * by us?  If so, we have a deadlock situation, and need
+			 * to recover.
+			 */
+			for (int i = 0; i < rqh_cnt - 1; i++) {
+				if (rqh->locks[i] == remote_lock)
+					return -EDEADLK;
+			}
+			/*
+			 * Inconclusive; retry again later.
+			 */
+			return 0;
+		}
+	}
+	return 0;
+}
+
+static noinline int check_deadlock(rqspinlock_t *lock, u32 mask,
+				   struct rqspinlock_timeout *ts)
+{
+	int ret;
+
+	ret = check_deadlock_AA(lock, mask, ts);
+	if (ret)
+		return ret;
+	ret = check_deadlock_ABBA(lock, mask, ts);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
+				  struct rqspinlock_timeout *ts)
 {
 	u64 time = ktime_get_mono_fast_ns();
+	u64 prev = ts->cur;
 
 	if (!ts->timeout_end) {
+		ts->cur = time;
 		ts->timeout_end = time + ts->duration;
 		return 0;
 	}
@@ -91,20 +222,30 @@  static noinline int check_timeout(struct rqspinlock_timeout *ts)
 	if (time > ts->timeout_end)
 		return -ETIMEDOUT;
 
+	/*
+	 * A millisecond interval passed from last time? Trigger deadlock
+	 * checks.
+	 */
+	if (prev + NSEC_PER_MSEC < time) {
+		ts->cur = time;
+		return check_deadlock(lock, mask, ts);
+	}
+
 	return 0;
 }
 
-#define RES_CHECK_TIMEOUT(ts, ret)                    \
-	({                                            \
-		if (!(ts).spin++)                     \
-			(ret) = check_timeout(&(ts)); \
-		(ret);                                \
+#define RES_CHECK_TIMEOUT(ts, ret, mask)                              \
+	({                                                            \
+		if (!(ts).spin++)                                     \
+			(ret) = check_timeout((lock), (mask), &(ts)); \
+		(ret);                                                \
 	})
 
 /*
  * Initialize the 'duration' member with the chosen timeout.
+ * Set spin member to 0 to trigger AA/ABBA checks immediately.
  */
-#define RES_INIT_TIMEOUT(ts, _timeout) ({ (ts).spin = 1; (ts).duration = _timeout; })
+#define RES_INIT_TIMEOUT(ts, _timeout) ({ (ts).spin = 0; (ts).duration = _timeout; })
 
 /*
  * We only need to reset 'timeout_end', 'spin' will just wrap around as necessary.
@@ -192,6 +333,11 @@  int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
 		goto queue;
 	}
 
+	/*
+	 * Grab an entry in the held locks array, to enable deadlock detection.
+	 */
+	grab_held_lock_entry(lock);
+
 	/*
 	 * We're pending, wait for the owner to go away.
 	 *
@@ -205,7 +351,7 @@  int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
 	 */
 	if (val & _Q_LOCKED_MASK) {
 		RES_RESET_TIMEOUT(ts);
-		smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret));
+		smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
 	}
 
 	if (ret) {
@@ -220,7 +366,7 @@  int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
 		 */
 		clear_pending(lock);
 		lockevent_inc(rqspinlock_lock_timeout);
-		return ret;
+		goto err_release_entry;
 	}
 
 	/*
@@ -238,6 +384,11 @@  int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
 	 */
 queue:
 	lockevent_inc(lock_slowpath);
+	/*
+	 * Grab deadlock detection entry for the queue path.
+	 */
+	grab_held_lock_entry(lock);
+
 	node = this_cpu_ptr(&qnodes[0].mcs);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
@@ -257,9 +408,9 @@  int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
 		lockevent_inc(lock_no_node);
 		RES_RESET_TIMEOUT(ts);
 		while (!queued_spin_trylock(lock)) {
-			if (RES_CHECK_TIMEOUT(ts, ret)) {
+			if (RES_CHECK_TIMEOUT(ts, ret, ~0u)) {
 				lockevent_inc(rqspinlock_lock_timeout);
-				break;
+				goto err_release_node;
 			}
 			cpu_relax();
 		}
@@ -350,7 +501,7 @@  int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
 	 */
 	RES_RESET_TIMEOUT(ts);
 	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
-				       RES_CHECK_TIMEOUT(ts, ret));
+				       RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));
 
 waitq_timeout:
 	if (ret) {
@@ -375,7 +526,7 @@  int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
 			WRITE_ONCE(next->locked, RES_TIMEOUT_VAL);
 		}
 		lockevent_inc(rqspinlock_lock_timeout);
-		goto release;
+		goto err_release_node;
 	}
 
 	/*
@@ -422,5 +573,11 @@  int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
 	 */
 	__this_cpu_dec(qnodes[0].mcs.count);
 	return ret;
+err_release_node:
+	trace_contention_end(lock, ret);
+	__this_cpu_dec(qnodes[0].mcs.count);
+err_release_entry:
+	release_held_lock_entry();
+	return ret;
 }
 EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);