From patchwork Wed Aug 16 16:45:49 2017
X-Patchwork-Submitter: Dario Faggioli
X-Patchwork-Id: 9904149
From: Dario Faggioli
To: xen-devel@lists.xenproject.org
Date: Wed, 16 Aug 2017 18:45:49 +0200
Message-ID: <150290194976.24854.16620727129787611822.stgit@Solace.fritz.box>
In-Reply-To: <150290125292.24854.17418548557562763544.stgit@Solace.fritz.box>
References: <150290125292.24854.17418548557562763544.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
    Ian Jackson, Tim Deegan, Julien Grall, Jan Beulich
Subject: [Xen-devel] [PATCH v2 3/6] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
Xen is a tickless (micro-)kernel, i.e., when a CPU becomes idle there is
no timer tick that will periodically wake the CPU up. OTOH, when we
imported RCU from Linux, Linux was (on x86) a ticking kernel, i.e.,
there was a periodic timer tick always running, even on idle CPUs. This
was bad for power consumption but, for instance, made it easy to monitor
the quiescent states of all the CPUs, and hence to tell when RCU grace
periods ended.

In Xen, that is impossible, which is particularly problematic when the
system is very lightly loaded, as some CPUs may never get the chance to
tell the RCU core logic about their quiescence, and grace periods could
extend indefinitely!

This has led, on x86, to long (and unpredictable) delays between the
queueing of RCU callbacks and their actual invocation. On ARM, we have
even seen infinite grace periods (e.g., complete_domain_destroy() never
being invoked!). See here:

 https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg02454.html

The first step toward fixing this situation is for RCU to record, at the
beginning of a grace period, which CPUs are already idle. In fact, being
idle, they can't be in the middle of any read-side critical section, and
we don't have to wait for their quiescence.

This is tracked in a cpumask, in a similar way to how it was also done
in Linux (on s390, which was tickless already). It is also basically the
same approach used for making Linux x86 tickless, from 2.6.21 onward
(see commit 79bf2bb3, "tick-management: dyntick / highres
functionality").

For correctness, we also add barriers.
One is also present in Linux (see commit c3f59023, "Fix RCU race in
access of nohz_cpu_mask"), although we change the code comment to
something that makes better sense for us. The other (its pair) is placed
in the newly introduced function rcu_idle_enter(), right after updating
the cpumask. Together, they prevent races with CPUs going idle while a
grace period is starting.

Signed-off-by: Dario Faggioli
---
Cc: Andrew Cooper
Cc: George Dunlap
Cc: Ian Jackson
Cc: Jan Beulich
Cc: Konrad Rzeszutek Wilk
Cc: Stefano Stabellini
Cc: Tim Deegan
Cc: Wei Liu
Cc: Julien Grall
---
Changes from v1:
 * call rcu_idle_{enter,exit}() from the tick suspension/resumption
   logic. This widens the window during which a CPU has its bit set in
   the idle cpumask. During review, it was suggested to do the opposite
   (narrow it), and that's what I did first. But then I changed my mind,
   as the current arrangement (wide window) also cures another
   pre-existing (and independent) race which Tim discovered, still
   during v1 review;
 * add a barrier in rcu_idle_enter() too, to properly deal with the race
   Tim pointed out during review;
 * mark the CPU where RCU initialization happens, at boot, as non-idle.
---
 xen/common/rcupdate.c      | 48 ++++++++++++++++++++++++++++++++++++++++++--
 xen/common/schedule.c      |  2 ++
 xen/include/xen/rcupdate.h |  3 +++
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 8cc5a82..9f7d41d 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -52,7 +52,8 @@ static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting? */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order ...  */
+    cpumask_t   cpumask; /* CPUs that need to switch in order ...  */
+    cpumask_t   idle_cpumask; /* ... unless they are already idle  */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -248,7 +249,16 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
 
         smp_wmb();
         rcp->cur++;
 
-        cpumask_copy(&rcp->cpumask, &cpu_online_map);
+        /*
+         * Make sure the increment of rcp->cur is visible so that, even
+         * if a CPU that is about to go idle is captured inside
+         * rcp->cpumask, rcu_pending() will return false, which then
+         * means cpu_quiet() will be invoked, before the CPU would
+         * actually enter idle.
+         *
+         * This barrier is paired with the one in rcu_idle_enter().
+         */
+        smp_mb();
+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
     }
 }
@@ -474,7 +484,41 @@ static struct notifier_block cpu_nfb = {
 void __init rcu_init(void)
 {
     void *cpu = (void *)(long)smp_processor_id();
+
+    cpumask_setall(&rcu_ctrlblk.idle_cpumask);
+    /* The CPU we're running on is certainly not idle */
+    cpumask_clear_cpu(smp_processor_id(), &rcu_ctrlblk.idle_cpumask);
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
 }
+
+/*
+ * The CPU is becoming idle, so no more read side critical
+ * sections, and one more step toward grace period.
+ */
+void rcu_idle_enter(unsigned int cpu)
+{
+    /*
+     * During non-boot CPU bringup and resume, until this function is
+     * called for the first time, it's fine to find our bit already set.
+     */
+    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask) ||
+           (system_state < SYS_STATE_active || system_state >= SYS_STATE_resume));
+    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    /*
+     * If some other CPU is starting a new grace period, we'll notice
+     * that by seeing a new value in rcp->cur (different than our
+     * quiescbatch). That will force us all the way until cpu_quiet(),
+     * clearing our bit in rcp->cpumask, even in case we managed to get
+     * in there.
+     *
+     * See the comment before cpumask_andnot() in rcu_start_batch().
+     */
+    smp_mb();
+}
+
+void rcu_idle_exit(unsigned int cpu)
+{
+    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
+    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+}
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index e83f4c7..c6f4817 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1903,6 +1903,7 @@ void sched_tick_suspend(void)
 
     sched = per_cpu(scheduler, cpu);
     SCHED_OP(sched, tick_suspend, cpu);
+    rcu_idle_enter(cpu);
 }
 
 void sched_tick_resume(void)
@@ -1910,6 +1911,7 @@ void sched_tick_resume(void)
     struct scheduler *sched;
     unsigned int cpu = smp_processor_id();
 
+    rcu_idle_exit(cpu);
     sched = per_cpu(scheduler, cpu);
     SCHED_OP(sched, tick_resume, cpu);
 }
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 557a7b1..561ac43 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -146,4 +146,7 @@ void call_rcu(struct rcu_head *head,
 
 int rcu_barrier(void);
 
+void rcu_idle_enter(unsigned int cpu);
+void rcu_idle_exit(unsigned int cpu);
+
 #endif /* __XEN_RCUPDATE_H */