From patchwork Mon Mar  3 15:22:50 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kumar Kartikeya Dwivedi <memxor@gmail.com>
X-Patchwork-Id: 13999062
Return-Path: 
 <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from bombadil.infradead.org (bombadil.infradead.org
 [198.137.202.133])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id C11AAC282CD
	for <linux-arm-kernel@archiver.kernel.org>;
 Mon,  3 Mar 2025 15:41:03 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help
	:List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding:
	MIME-Version:References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From:
	Reply-To:Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From:
	Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner;
	bh=TdqbAA6BbOHNHxPeedO3e4MUz1jdw47Vp3CnPvreE48=; b=AtvVEENlmfE6SIHVjCt4rhk0ys
	NhB4IL5mxjCyLLzWEUrIRiDEq1fWYgoN/KJh1eD5cqFjYw0LVeZGEa7W4a8R8IJpuEot9QahL24qS
	/CtSitZWB4sz93YAICrGfc6tX2sRHTSxV5QBt8hgR+5xzjseyyPWkwsoqLmX1mAlQgN14IFGN3Wtb
	Xvgr7q/mHmF92HgfBUtOvGFIMaB4Qz4b7MUPMmyDBM2tYUAQuWsnAV6p1vJtXAcm0H9vi/txWMqpG
	/bV9OzNW5VDQVQWP7v/ZiiCCyw5ivHl2Uaiako2uI70HL/KDVsPt6ZQt+cGuz0LOPbatm4UQ1inG2
	vxDH4o0w==;
Received: from localhost ([::1] helo=bombadil.infradead.org)
	by bombadil.infradead.org with esmtp (Exim 4.98 #2 (Red Hat Linux))
	id 1tp7uY-00000001MME-3ZTE;
	Mon, 03 Mar 2025 15:40:54 +0000
Received: from mail-wm1-f67.google.com ([209.85.128.67])
	by bombadil.infradead.org with esmtps (Exim 4.98 #2 (Red Hat Linux))
	id 1tp7dd-00000001IF1-0yTd
	for linux-arm-kernel@lists.infradead.org;
	Mon, 03 Mar 2025 15:23:26 +0000
Received: by mail-wm1-f67.google.com with SMTP id
 5b1f17b1804b1-43bbb440520so14147045e9.2
        for <linux-arm-kernel@lists.infradead.org>;
 Mon, 03 Mar 2025 07:23:24 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1741015403; x=1741620203;
 darn=lists.infradead.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=TdqbAA6BbOHNHxPeedO3e4MUz1jdw47Vp3CnPvreE48=;
        b=YpyfxFhyeb2LCNr5X509ou23qx35VU8AMZivuPwkOTtec3m8+9VJxj+v2ci3T/kNLk
         5dpU5v86dC9iK0xBSruusrzpX223mVeA5z7lmXQLUn7yqX7qVEKfHVtGzTyDltuvQ/wF
         GiBV/KlUbbGUl1vGU9kkNyL5cYeDXR8fVoNYwMB0dM1qbI7L/HyNxqNe68Bdoi87NVit
         fAKFpGsCoHZOVtqKIuTrlbRS2MrcT9d6zSZwNnhKkbpUKqFXr3lZXPPG6fTojtd6ZFla
         NKseu5Lrf/1HqZCwnjNiSiDzQa58WwmSraW4UJ8kR3CROFUASvOMsdAp8uZobY4Jmwzq
         RNMg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1741015403; x=1741620203;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=TdqbAA6BbOHNHxPeedO3e4MUz1jdw47Vp3CnPvreE48=;
        b=ABMnEVOazxDgeNrTLO6a4zYKru3I/LXj+QZAELvF/idc44bSMZj1zytINnHuMORAGG
         3LsZa1jerTkvO8sGw6MIMo0SrK8XulU2oakRTkh+DQJBGkSkVAXge8BncghYQa5fWrMX
         oDPbdEgssNCGhZOtUbbZtUQ6uAIBh6Y2mI2SN3bUJsIh1djANIoljLcTA4dLoiWuXOvj
         7fqyuhgwuaWNairt2PlremuMLYPnMn3KfvM77soBrBLq9m3l/g78HTeMJJEyPXVkr5eh
         dXmackdtEP4v2fT74SoI6ecP0UPj67bETmD0HxREQmRd9qNYxqZrnpSaqQuNvD1cpr9J
         3Ueg==
X-Forwarded-Encrypted: i=1;
 AJvYcCX9i2m7n5KzKxkE5IzM4+Oy+OSm1dgcT3IELO9k7gQ03Y8CnvkM5F5vBaLYv+kbDBP2CqoqCp8fTDwk6ZWv863f@lists.infradead.org
X-Gm-Message-State: AOJu0YxQOUI0zrXQGtFnTg7SNZuy7s8aHkUbHZeR60U1ls48u+eJ7T8E
	qHWQd7Duhjp0qbTIOE9aY5d5Yk0uSStEma+z3Tv38E2PGvUvW1bE
X-Gm-Gg: ASbGncsWCA2B3c8jpBis8kLDpvNck7wYYV20cOdCvbvvMz3kNybpmHGteD55ClZWTbu
	Xon8agNVEdHfdcWyrXb0c1Q3ywqEplOqFPBL316lsK/DjQLY3QFrBw54ja/28V3zUADo5Wv+8nv
	6eK+2ogn/7UoF/ULPiT3SXhFz8MoCoTf8VYJfq4ZPMws9q3lpzJeLynfMUMUbQFuJP8SxhkkUqq
	RyFYJCnJHfOXBrDuL8NXaElfUCgRcMOYbsjphaqD8ucKc29Pu5RCCw3rDTlOquSBUkQCvNRCmez
	rjMUFc0iDJE3d9oSFQEgaRDIdVGqzCaPde8=
X-Google-Smtp-Source: 
 AGHT+IGFxxitWQfJahw973p0I2xV7/kZwy8gicFDigUlKkHKzWL+qRQIH+fD/xVRKEKwh58Qm9F6xw==
X-Received: by 2002:a05:600c:5014:b0:439:89d1:30dc with SMTP id
 5b1f17b1804b1-43ba730d5b6mr142469095e9.10.1741015402850;
        Mon, 03 Mar 2025 07:23:22 -0800 (PST)
Received: from localhost ([2a03:2880:31ff:54::])
        by smtp.gmail.com with ESMTPSA id
 5b1f17b1804b1-43bc6a8ff01sm23489995e9.39.2025.03.03.07.23.22
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Mon, 03 Mar 2025 07:23:22 -0800 (PST)
From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: Barret Rhoden <brho@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>,
	Waiman Long <llong@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <martin.lau@kernel.org>,
	Eduard Zingerman <eddyz87@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Tejun Heo <tj@kernel.org>,
	Josh Don <joshdon@google.com>,
	Dohyun Kim <dohyunkim@google.com>,
	linux-arm-kernel@lists.infradead.org,
	kkd@meta.com,
	kernel-team@meta.com
Subject: [PATCH bpf-next v3 10/25] rqspinlock: Protect waiters in queue from
 stalls
Date: Mon,  3 Mar 2025 07:22:50 -0800
Message-ID: <20250303152305.3195648-11-memxor@gmail.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20250303152305.3195648-1-memxor@gmail.com>
References: <20250303152305.3195648-1-memxor@gmail.com>
MIME-Version: 1.0
X-Developer-Signature: v=1; a=openpgp-sha256; l=8548; h=from:subject;
 bh=/tO7WsHRnQ1y7o7jGLLRZ75jx2rvQksn4PRPhYxulGw=;
 b=owEBbQKS/ZANAwAIAUzgyIZIvxHKAcsmYgBnxcWXBmmTn9Q/JpHcUWagL6D0LQj7D9XnazfmRv20
 F3ZP162JAjMEAAEIAB0WIQRLvip+Buz51YI8YRFM4MiGSL8RygUCZ8XFlwAKCRBM4MiGSL8RyoaOD/
 9RNH1I8JHE99sQbIlMiKmtHTFHuJCcV0nvMTdA9wCT6zW9DamNJ7yjyl81IYTGgDbSp7m2QaZu788r
 wMj5fcd2iwhWcBBc3PlDuAKaTQwSz8B5BN1i6UV7vcK1E/AtpCEpvKXpqZH1SSiPx2YkgVYq8KmZxt
 Z/dIeZESDo9b9d1uSbJmJosgaNhZz1AbbuYEp6kniedp3jozmo3YFBcOue3YFKPe2Kg4NTJhwyRrXv
 tULCtm9RRxbu6QxV+kLPyqhYo74CyHJaolrRbLzf8we5rzaMwdJnBzFvKd7i2ZpNaoWQxykdCanhPC
 kPtXLBOdzzqKzwxaC1t4g3BJ7UIeouteWNdMi5qZqLhvMLByPg/i0nacGi4CD64Lriiq0FL+AMSlTM
 GZPhan3dSlPjsN45LIuBIPnN5XtrmgKV629AYm01KKITf7cSzqgl2iJ3dpqSasXs//53rgJxbnJaVq
 VOv1pbHmn+VLbJiV2X88Q43nhN1CwA5HGLBirqzzZ4K0hRzxuo9oGT6S2r76PZFDucgeMDRJB9c7Mm
 GUAgAQ0RCNDP+ehlhqzyWE0+v7m1qVYImUE0vh4QbgRxK8wEBGaupTPCZ0wHn0ruv4PtHa7PABtP+B
 PRV0YKX9hGtJ//vt5pVqTc7HjT1GByFMSa0MhSbB3s6F+v0Qkm6vjB3b7DBQ==
X-Developer-Key: i=memxor@gmail.com; a=openpgp;
 fpr=4BBE2A7E06ECF9D5823C61114CE0C88648BF11CA
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 
X-CRM114-CacheID: sfid-20250303_072325_267848_6EFEBAA4 
X-CRM114-Status: GOOD (  38.26  )
X-BeenThere: linux-arm-kernel@lists.infradead.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: <linux-arm-kernel.lists.infradead.org>
List-Unsubscribe: 
 <http://lists.infradead.org/mailman/options/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
List-Subscribe: 
 <http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: 
 linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org

Implement the wait queue cleanup algorithm for rqspinlock. There are
three forms of waiters in the original queued spin lock algorithm. The
first is the waiter which acquires the pending bit and spins on the lock
word without forming a wait queue. The second is the head waiter, the
first waiter in the wait queue. The third form comprises all the
non-head waiters queued behind the head, waiting to be signalled through
their MCS node to take over the responsibility of the head.

In this commit, we are concerned with the second and third kind. First,
we augment the waiting loop of the head of the wait queue with a
timeout. When this timeout fires, all waiters that are part of the wait
queue abort their lock acquisition attempts. This happens in three
steps. First, the head breaks out of its loop waiting for pending and
locked bits to turn to 0, and non-head waiters break out of their MCS
node spin (more on that later). Next, every waiter (head or non-head)
checks whether it is also the tail waiter; if so, it attempts to zero
out the tail word and allow a new queue to be built up for this lock. If
it succeeds, it has no one to signal next in the queue to stop spinning.
Otherwise, it signals the MCS node of the next waiter to break out of
its spin and try resetting the tail word back to 0. This goes on until
the tail waiter is found. In case of races, the new tail will be
responsible for performing the same task, as the old tail will then fail
to reset the tail word and wait for its next pointer to be updated
before it signals the new tail to do the same.
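
The hand-off described above can be sketched in userspace C. This is a
single-threaded model for illustration only, not the kernel code: the
demo_* names, the node layout, and the "index + 1" tail encoding are
assumptions made for this sketch, and the real slow path spins on
node->next before signalling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TIMEOUT_VAL 2	/* mirrors RES_TIMEOUT_VAL from this patch */
#define NR_WAITERS 3

/* Hypothetical, simplified MCS node (not the kernel's struct). */
struct demo_node {
	struct demo_node *next;
	int locked;
};

/* Hypothetical tail word: (index of tail node) + 1, or 0 if empty. */
static atomic_uint demo_tail;

/*
 * Exit step of one waiter: reset the tail if it still points to us,
 * otherwise pass the timeout signal on to the next waiter.
 */
static bool demo_exit_queue(struct demo_node *node, unsigned int my_tail)
{
	unsigned int expected = my_tail;

	if (atomic_compare_exchange_strong(&demo_tail, &expected, 0))
		return true;	/* we were the tail, nobody to signal */
	/* The kernel code would first spin until node->next is non-NULL. */
	node->next->locked = TIMEOUT_VAL;
	return false;
}

/* Tear down a queue of NR_WAITERS; returns how many waiters exited. */
static int demo_teardown(void)
{
	struct demo_node q[NR_WAITERS] = { 0 };
	int i, exited = 0;

	for (i = 0; i + 1 < NR_WAITERS; i++)
		q[i].next = &q[i + 1];
	atomic_store(&demo_tail, NR_WAITERS);	/* last node is the tail */

	/* q[0] is the head whose timeout fired; the rest wait for a signal. */
	for (i = 0; i < NR_WAITERS; i++) {
		if (i > 0 && q[i].locked != TIMEOUT_VAL)
			break;	/* not signalled, would keep spinning */
		exited++;
		if (demo_exit_queue(&q[i], i + 1))
			break;	/* tail reset, teardown is complete */
	}
	return exited;
}
```

Calling demo_teardown() walks the signal from the head to the tail; only
the last waiter succeeds in resetting the tail word.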

We terminate the whole wait queue for two main reasons. Firstly, we
eschew per-waiter timeouts in favor of one applied at the head of the
wait queue.  This allows everyone to break out faster once we've seen
the owner / pending waiter not responding for the timeout duration from
the head.  Secondly, it avoids complicated synchronization, because when
not leaving in FIFO order, prev's next pointer needs to be fixed up etc.

Lastly, all of these waiters release the rqnode and return to the
caller. This patch underscores the point that rqspinlock's timeout does
not apply to each waiter individually, and cannot be relied upon as an
upper bound. It is possible for the rqspinlock waiters to return early
from a failed lock acquisition attempt as soon as stalls are detected.

The head waiter cannot directly WRITE_ONCE the tail to zero, as it may
race with a concurrent xchg and a non-head waiter linking its MCS node
to the head's MCS node through 'prev->next' assignment.
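
To illustrate why a compare-and-swap is needed rather than a blind
store, here is a userspace sketch using C11 atomics. The 16/16 bit
split and the DEMO_* / demo_* names are assumptions for illustration;
the kernel helper added by this patch is try_cmpxchg_tail().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch: low half locked/pending, high half tail. */
#define DEMO_LOCKED_PENDING_MASK 0x0000ffffu
#define DEMO_TAIL_MASK           0xffff0000u

/*
 * Clear the tail only if it still equals @tail, preserving whatever
 * locked/pending state is current at the time the swap succeeds.
 */
static bool demo_try_reset_tail(_Atomic uint32_t *val, uint32_t tail)
{
	uint32_t old = atomic_load_explicit(val, memory_order_relaxed);
	uint32_t new;

	do {
		/* Someone queued behind us: the tail moved on, so fail. */
		if ((old & DEMO_TAIL_MASK) != tail)
			return false;
		new = old & DEMO_LOCKED_PENDING_MASK;
	} while (!atomic_compare_exchange_weak_explicit(val, &old, new,
							memory_order_relaxed,
							memory_order_relaxed));
	return true;
}
```

A plain store of 0 to the tail would silently discard a tail published
by a concurrent waiter; the loop above instead fails and leaves the word
untouched, so the caller knows it must signal its successor.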

One notable thing is that we must use RES_DEF_TIMEOUT * 2 as our maximum
duration for the waiting loop (for the wait queue head), since we may
have both the owner and pending bit waiter ahead of us, and in the worst
case, need to span their maximum permitted critical section lengths.
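
The worst-case budget can be sketched as follows. DEMO_TIMEOUT_NS is an
arbitrary stand-in chosen for this example, not the real value of
RES_DEF_TIMEOUT, and the demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary stand-in for RES_DEF_TIMEOUT; the real value is not shown here. */
#define DEMO_TIMEOUT_NS 1000ull

/* The head may sit behind both the owner and the pending bit waiter. */
static uint64_t demo_head_budget_ns(void)
{
	return DEMO_TIMEOUT_NS * 2;
}

/* True once the head has waited longer than its doubled budget. */
static bool demo_head_timed_out(uint64_t start_ns, uint64_t now_ns)
{
	return now_ns - start_ns > demo_head_budget_ns();
}
```

With a budget of only one timeout period, the head could give up while
the pending bit waiter is still within its own permitted critical
section; doubling the budget covers both in sequence.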

Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/locking/rqspinlock.c | 55 +++++++++++++++++++++++++++++++++++--
 kernel/locking/rqspinlock.h | 48 ++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+), 3 deletions(-)
 create mode 100644 kernel/locking/rqspinlock.h

diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 6be36798ded9..9ad18b3c46f7 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -77,6 +77,8 @@ struct rqspinlock_timeout {
 	u16 spin;
 };
 
+#define RES_TIMEOUT_VAL	2
+
 static noinline int check_timeout(struct rqspinlock_timeout *ts)
 {
 	u64 time = ktime_get_mono_fast_ns();
@@ -321,12 +323,18 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 * head of the waitqueue.
 	 */
 	if (old & _Q_TAIL_MASK) {
+		int val;
+
 		prev = decode_tail(old, rqnodes);
 
 		/* Link @node into the waitqueue. */
 		WRITE_ONCE(prev->next, node);
 
-		arch_mcs_spin_lock_contended(&node->locked);
+		val = arch_mcs_spin_lock_contended(&node->locked);
+		if (val == RES_TIMEOUT_VAL) {
+			ret = -EDEADLK;
+			goto waitq_timeout;
+		}
 
 		/*
 		 * While waiting for the MCS lock, the next pointer may have
@@ -349,8 +357,49 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 * store-release that clears the locked bit and create lock
 	 * sequentiality; this is because the set_locked() function below
 	 * does not imply a full barrier.
+	 *
+	 * We use RES_DEF_TIMEOUT * 2 as the duration, as RES_DEF_TIMEOUT is
+	 * meant to span maximum allowed time per critical section, and we may
+	 * have both the owner of the lock and the pending bit waiter ahead of
+	 * us.
 	 */
-	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+	RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 2);
+	val = res_atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
+					   RES_CHECK_TIMEOUT(ts, ret));
+
+waitq_timeout:
+	if (ret) {
+		/*
+		 * If the tail is still pointing to us, then we are the final waiter,
+		 * and are responsible for resetting the tail back to 0. Otherwise, if
+		 * the cmpxchg operation fails, we signal the next waiter to exit
+		 * and try the same. For a waiter with tail node 'n':
+		 *
+		 * n,*,* -> 0,*,*
+		 *
+		 * When performing cmpxchg for the whole word (NR_CPUS > 16k), it is
+		 * possible locked/pending bits keep changing and we see failures even
+		 * when we remain the head of wait queue. However, eventually,
+		 * pending bit owner will unset the pending bit, and new waiters
+		 * will queue behind us. This will leave the lock owner in
+		 * charge, and it will eventually either set locked bit to 0, or
+		 * leave it as 1, allowing us to make progress.
+		 *
+		 * We terminate the whole wait queue for two reasons. Firstly,
+		 * we eschew per-waiter timeouts with one applied at the head of
+		 * the wait queue.  This allows everyone to break out faster
+		 * once we've seen the owner / pending waiter not responding for
+		 * the timeout duration from the head.  Secondly, it avoids
+		 * complicated synchronization, because when not leaving in FIFO
+		 * order, prev's next pointer needs to be fixed up etc.
+		 */
+		if (!try_cmpxchg_tail(lock, tail, 0)) {
+			next = smp_cond_load_relaxed(&node->next, VAL);
+			WRITE_ONCE(next->locked, RES_TIMEOUT_VAL);
+		}
+		lockevent_inc(rqspinlock_lock_timeout);
+		goto release;
+	}
 
 	/*
 	 * claim the lock:
@@ -395,6 +444,6 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 * release the node
 	 */
 	__this_cpu_dec(rqnodes[0].mcs.count);
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
diff --git a/kernel/locking/rqspinlock.h b/kernel/locking/rqspinlock.h
new file mode 100644
index 000000000000..3cec3a0f2d7e
--- /dev/null
+++ b/kernel/locking/rqspinlock.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Resilient Queued Spin Lock defines
+ *
+ * (C) Copyright 2024 Meta Platforms, Inc. and affiliates.
+ *
+ * Authors: Kumar Kartikeya Dwivedi <memxor@gmail.com>
+ */
+#ifndef __LINUX_RQSPINLOCK_H
+#define __LINUX_RQSPINLOCK_H
+
+#include "qspinlock.h"
+
+/*
+ * try_cmpxchg_tail - Return result of cmpxchg of tail word with a new value
+ * @lock: Pointer to queued spinlock structure
+ * @tail: The tail to compare against
+ * @new_tail: The new queue tail code word
+ * Return: Bool to indicate whether the cmpxchg operation succeeded
+ *
+ * This is used by the head of the wait queue to clean up the queue.
+ * Provides relaxed ordering, since observers only rely on initialized
+ * state of the node which was made visible through the xchg_tail operation,
+ * i.e. through the smp_wmb preceding xchg_tail.
+ *
+ * We avoid using 16-bit cmpxchg, which is not available on all architectures.
+ */
+static __always_inline bool try_cmpxchg_tail(struct qspinlock *lock, u32 tail, u32 new_tail)
+{
+	u32 old, new;
+
+	old = atomic_read(&lock->val);
+	do {
+		/*
+		 * Is the tail part we compare to already stale? Fail.
+		 */
+		if ((old & _Q_TAIL_MASK) != tail)
+			return false;
+		/*
+		 * Encode latest locked/pending state for new tail.
+		 */
+		new = (old & _Q_LOCKED_PENDING_MASK) | new_tail;
+	} while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
+
+	return true;
+}
+
+#endif /* __LINUX_RQSPINLOCK_H */