From patchwork Fri Aug 18 18:04:22 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dario Faggioli <dario.faggioli@citrix.com>
X-Patchwork-Id: 9909697
From: Dario Faggioli <dario.faggioli@citrix.com>
To: xen-devel@lists.xenproject.org
Date: Fri, 18 Aug 2017 20:04:22 +0200
Message-ID: <150307946273.29525.14036690240810795204.stgit@Solace.fritz.box>
In-Reply-To: <150307710991.29525.3681195976643263117.stgit@Solace.fritz.box>
References: <150307710991.29525.3681195976643263117.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
Cc: Stefano Stabellini <sstabellini@kernel.org>,
	Wei Liu <wei.liu2@citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>, Tim Deegan <tim@xen.org>,
	Julien Grall <julien.grall@arm.com>, Jan Beulich <jbeulich@suse.com>
Subject: [Xen-devel] [PATCH v3 3/6] xen: RCU/x86/ARM: discount CPUs that
	were idle when grace period started.
X-BeenThere: xen-devel@lists.xen.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Xen developer discussion <xen-devel.lists.xen.org>
List-Unsubscribe: <https://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <https://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Errors-To: xen-devel-bounces@lists.xen.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>
X-Virus-Scanned: ClamAV using ClamSMTP

Xen is a tickless (micro-)kernel, i.e., when a CPU becomes
idle there is no timer tick that will periodically wake the
CPU up.
OTOH, when we imported RCU from Linux, Linux was (on x86) a
ticking kernel, i.e., there was a periodic timer tick always
running, even on idle CPUs. This was bad for power consumption,
but, for instance, made it easy to monitor the quiescent states
of all the CPUs, and hence tell when RCU grace periods ended.

In Xen, that is impossible, and that's particularly problematic
when the system is very lightly loaded, as some CPUs may never
have the chance to tell the RCU core logic about their quiescence,
and grace periods could extend indefinitely!

This has led, on x86, to long (and unpredictable) delays between
the queueing of RCU callbacks and their actual invocation. On ARM,
we've even seen infinite grace periods (e.g., complete_domain_destroy()
never being actually invoked!). See here:

 https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg02454.html

The first step for fixing this situation is for RCU to record,
at the beginning of a grace period, which CPUs are already idle.
In fact, being idle, they can't be in the middle of any read-side
critical section, and we don't have to wait for their quiescence.
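
As a rough illustration of the idea (a standalone sketch, not the
Xen code: masks are modelled as plain 64-bit words, and struct
toy_rcu, toy_idle_enter(), toy_idle_exit() and toy_start_batch()
are names made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one bit per CPU, at most 64 CPUs. All names are hypothetical. */
struct toy_rcu {
    uint64_t online;   /* CPUs that are online                          */
    uint64_t idle;     /* CPUs that are currently idle                  */
    uint64_t pending;  /* CPUs the current grace period still waits for */
};

/* The CPU is becoming idle: record it, it cannot be in a read side section. */
static void toy_idle_enter(struct toy_rcu *r, unsigned int cpu)
{
    assert(cpu < 64 && !(r->idle & (UINT64_C(1) << cpu)));
    r->idle |= UINT64_C(1) << cpu;
}

/* The CPU is leaving idle: from now on, grace periods must wait for it. */
static void toy_idle_exit(struct toy_rcu *r, unsigned int cpu)
{
    assert(cpu < 64 && (r->idle & (UINT64_C(1) << cpu)));
    r->idle &= ~(UINT64_C(1) << cpu);
}

/* Start a grace period: wait only for CPUs that are online and not idle. */
static void toy_start_batch(struct toy_rcu *r)
{
    r->pending = r->online & ~r->idle;
}
```

With 4 CPUs online and CPUs 1 and 3 idle, a grace period started
at that point only waits for CPUs 0 and 2.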

This is tracked in a cpumask, in a similar way to how it was
done in Linux (on s390, which was tickless already). It is also
basically the same approach used for making Linux x86 tickless,
from 2.6.21 onwards (see commit 79bf2bb3 "tick-management: dyntick /
highres functionality").

For correctness, we also add barriers. One is also present in
Linux (see commit c3f59023, "Fix RCU race in access of nohz_cpu_mask",
although we change the code comment to something that makes better
sense for us). The other (which is its pair) is put in the newly
introduced function rcu_idle_enter(), right after updating the
cpumask. They prevent races with CPUs that go idle right when a
grace period is beginning.
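
The pairing can be modelled, very roughly, with C11 atomics (a
sketch under assumptions, not the Xen code: smp_mb() is
approximated by atomic_thread_fence(memory_order_seq_cst), and
sb_cur, sb_idle_mask, sb_start_batch() and sb_idle_enter() are
hypothetical stand-ins for rcp->cur, rcp->idle_cpumask and the
two code paths touched here):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for rcp->cur and rcp->idle_cpumask. */
static atomic_long sb_cur;
static atomic_uint_fast64_t sb_idle_mask;

/*
 * Grace period starter: publish the new batch number first, then
 * sample the idle mask. Returns the set of CPUs to wait for.
 */
static uint64_t sb_start_batch(uint64_t online)
{
    atomic_fetch_add_explicit(&sb_cur, 1, memory_order_relaxed);
    /* Pairs with the fence in sb_idle_enter(). */
    atomic_thread_fence(memory_order_seq_cst);
    return online &
           ~atomic_load_explicit(&sb_idle_mask, memory_order_relaxed);
}

/*
 * CPU going idle: publish the idle bit first, then sample the batch
 * number. Returns true if a batch other than the one this CPU last
 * acknowledged (seen_batch) is current, i.e., the CPU still has to
 * report quiescence before it can really go idle.
 */
static bool sb_idle_enter(unsigned int cpu, long seen_batch)
{
    assert(cpu < 64);
    atomic_fetch_or_explicit(&sb_idle_mask, UINT64_C(1) << cpu,
                             memory_order_relaxed);
    /* Pairs with the fence in sb_start_batch(). */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&sb_cur, memory_order_relaxed) != seen_batch;
}
```

With both fences in place, the bad interleaving is ruled out in
this model: it cannot happen that the starter misses the idle bit
(and hence waits for the CPU) while, at the same time, the CPU
going idle misses the new batch number (and hence never reports
quiescence). At least one of the two sides sees the other's update.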

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
---
Changes from v2:
* initialize idle_cpumask to "all clear", i.e., all the CPUs are busy;
  they'll set their bit themselves as soon as they run the idle
  loop (pretty soon anyway).

Changes from v1:
* call rcu_idle_{enter,exit}() from tick suspension/restarting logic.  This
  widens the window during which a CPU has its bit set in the idle cpumask.
  During review, it was suggested to do the opposite (narrow it), and that's
  what I did first. But then, I changed my mind, as doing things as they look
  now (wide window) cures another pre-existing (and independent) race, which
  Tim discovered, still during v1 review;
* add a barrier in rcu_idle_enter() too, to properly deal with the race Tim
  pointed out during review;
* mark CPU where RCU initialization happens, at boot, as non-idle.
---
 xen/common/rcupdate.c      |   41 +++++++++++++++++++++++++++++++++++++++--
 xen/common/schedule.c      |    2 ++
 xen/include/xen/rcupdate.h |    3 +++
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 8cc5a82..12ae7da 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -52,7 +52,8 @@ static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting?         */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order    */
+    cpumask_t   cpumask; /* CPUs that need to switch in order ... */
+    cpumask_t   idle_cpumask; /* ... unless they are already idle */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -248,7 +249,16 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         smp_wmb();
         rcp->cur++;
 
-        cpumask_copy(&rcp->cpumask, &cpu_online_map);
+        /*
+         * Make sure the increment of rcp->cur is visible so that, even if a
+         * CPU that is about to go idle is captured inside rcp->cpumask,
+         * rcu_pending() will return false, which then means cpu_quiet()
+         * will be invoked before the CPU actually enters idle.
+         *
+         * This barrier is paired with the one in rcu_idle_enter().
+         */
+        smp_mb();
+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
     }
 }
 
@@ -474,7 +484,34 @@ static struct notifier_block cpu_nfb = {
 void __init rcu_init(void)
 {
     void *cpu = (void *)(long)smp_processor_id();
+
+    cpumask_clear(&rcu_ctrlblk.idle_cpumask);
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
 }
+
+/*
+ * The CPU is becoming idle, so no more read side critical
+ * sections, and one more step toward grace period.
+ */
+void rcu_idle_enter(unsigned int cpu)
+{
+    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
+    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    /*
+     * If some other CPU is starting a new grace period, we'll notice that
+     * by seeing a new value in rcp->cur (different than our quiescbatch).
+     * That will force us all the way until cpu_quiet(), clearing our bit
+     * in rcp->cpumask, even in case we managed to get in there.
+     *
+     * See the comment before cpumask_andnot() in rcu_start_batch().
+     */
+    smp_mb();
+}
+
+void rcu_idle_exit(unsigned int cpu)
+{
+    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
+    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+}
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index e83f4c7..c6f4817 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1903,6 +1903,7 @@ void sched_tick_suspend(void)
 
     sched = per_cpu(scheduler, cpu);
     SCHED_OP(sched, tick_suspend, cpu);
+    rcu_idle_enter(cpu);
 }
 
 void sched_tick_resume(void)
@@ -1910,6 +1911,7 @@ void sched_tick_resume(void)
     struct scheduler *sched;
     unsigned int cpu = smp_processor_id();
 
+    rcu_idle_exit(cpu);
     sched = per_cpu(scheduler, cpu);
     SCHED_OP(sched, tick_resume, cpu);
 }
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 557a7b1..561ac43 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -146,4 +146,7 @@ void call_rcu(struct rcu_head *head,
 
 int rcu_barrier(void);
 
+void rcu_idle_enter(unsigned int cpu);
+void rcu_idle_exit(unsigned int cpu);
+
 #endif /* __XEN_RCUPDATE_H */