From patchwork Mon Apr 9 14:54:09 2018
X-Patchwork-Submitter: Will Deacon
X-Patchwork-Id: 10331627
Date: Mon, 9 Apr 2018 15:54:09 +0100
From: Will Deacon
To: Waiman Long
Subject: Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
Message-ID: <20180409145409.GA9661@arm.com>
References: <1522947547-24081-1-git-send-email-will.deacon@arm.com>
 <1522947547-24081-3-git-send-email-will.deacon@arm.com>
 <20180409105835.GC23134@arm.com>
In-Reply-To: <20180409105835.GC23134@arm.com>
Cc: peterz@infradead.org, catalin.marinas@arm.com, boqun.feng@gmail.com,
 linux-kernel@vger.kernel.org, paulmck@linux.vnet.ibm.com, mingo@kernel.org,
 linux-arm-kernel@lists.infradead.org

On Mon, Apr 09, 2018 at 11:58:35AM +0100, Will Deacon wrote:
> On Fri, Apr 06, 2018 at 04:50:19PM -0400, Waiman Long wrote:
> > The pending bit was added to the qspinlock design to counter performance
> > degradation compared with ticket lock for workloads with light
> > spinlock contention. I ran my spinlock stress test on an Intel Skylake
> > server running the vanilla 4.16 kernel vs a kernel patched with this
> > patchset. The locking rates with different numbers of locking threads
> > were as follows:
> >
> >   # of threads   4.16 kernel    patched 4.16 kernel
> >   ------------   -----------    -------------------
> >        1         7,417 kop/s        7,408 kop/s
> >        2         5,755 kop/s        4,486 kop/s
> >        3         4,214 kop/s        4,169 kop/s
> >        4         4,396 kop/s        4,383 kop/s
> >
> > The 2 contending threads case is the one that exercises the pending bit
> > code path the most. So it is obvious that this is the one that is most
> > impacted by this patchset. The differences in the other cases are mostly
> > noise, with perhaps a small effect in the 3 contending threads case.
>
> That is bizarre. A few questions:
>
>   1. Is this with my patches as posted, or also with your WRITE_ONCE change?
>   2. Could you try to bisect my series to see which patch is responsible
>      for this degradation, please?
>   3. Could you point me at your stress test, so I can try to reproduce these
>      numbers on arm64 systems, please?
>
> > I am not against this patch, but we certainly need to find a way to
> > bring the performance numbers back up closer to what they were before
> > applying the patch.
>
> We certainly need to *understand* where the drop is coming from, because
> the two-threaded case is still just a CAS on x86 with and without this
> patch series. Generally, there's a throughput cost when ensuring fairness
> and forward-progress; otherwise we'd all be using test-and-set.

Whilst I think we still need to address my questions above, I've had a crack
at the diff below. Please can you give it a spin? It sticks a trylock on the
slowpath before setting pending and replaces the CAS-based set with an xchg
(which I *think* is safe, but will need to ponder it some more).

Thanks,

Will

--->8

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 19261af9f61e..71eb5e3a3d91 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -139,6 +139,20 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
+/**
+ * set_pending_fetch_acquire - set the pending bit and return the old lock
+ * value with acquire semantics.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
+	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
+	return val;
+}
+
 /*
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
  * @lock : Pointer to queued spinlock structure
@@ -184,6 +198,18 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 }
 
 /**
+ * set_pending_fetch_acquire - set the pending bit and return the old lock
+ * value with acquire semantics.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+
+/**
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
  * @lock : Pointer to queued spinlock structure
  * @tail : The new queue tail code word
@@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * If we observe any contention; queue.
+	 * If we observe queueing, then queue ourselves.
 	 */
-	if (val & ~_Q_LOCKED_MASK)
+	if (val & _Q_TAIL_MASK)
 		goto queue;
 
 	/*
+	 * We didn't see any queueing, so have one more try at snatching
+	 * the lock in case it became available whilst we were taking the
+	 * slow path.
+	 */
+	if (queued_spin_trylock(lock))
+		return;
+
+	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
	 * 0,0,1 -> 0,1,1 ; pending
	 */
-	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	val = set_pending_fetch_acquire(lock);
	if (!(val & ~_Q_LOCKED_MASK)) {
		/*
		 * we're pending, wait for the owner to go away.
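
For readers following along: the stress test behind the kop/s table above was
not posted in this thread, so the sketch below is only an illustration of that
style of lock-throughput microbenchmark, not Waiman's actual test. Everything
in it is assumed for the example: it uses a userspace pthread spinlock rather
than the in-kernel qspinlock, a fixed 10-second measurement window, and an
empty critical section, so its numbers are not comparable with those above.

/* spinstress.c: build with gcc -O2 -pthread spinstress.c -o spinstress */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RUN_SECONDS	10

static pthread_spinlock_t lock;
static atomic_bool stop;
static atomic_ulong total_ops;

/* Each thread acquires and releases the shared lock in a tight loop. */
static void *worker(void *arg)
{
	unsigned long ops = 0;

	(void)arg;
	while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
		pthread_spin_lock(&lock);
		/* Empty critical section: we only measure lock hand-over. */
		pthread_spin_unlock(&lock);
		ops++;
	}

	atomic_fetch_add(&total_ops, ops);
	return NULL;
}

int main(int argc, char **argv)
{
	int i, nthreads = argc > 1 ? atoi(argv[1]) : 2;
	pthread_t *tids = calloc(nthreads, sizeof(*tids));

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

	for (i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, worker, NULL);

	sleep(RUN_SECONDS);		/* fixed measurement window */
	atomic_store(&stop, true);

	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	/* Aggregate locking rate in kop/s, as in the table above. */
	printf("%d threads: %lu kop/s\n", nthreads,
	       atomic_load(&total_ops) / (RUN_SECONDS * 1000UL));
	return 0;
}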