From patchwork Thu Mar 2 10:38:19 2017
From: Dario Faggioli
To: xen-devel@lists.xenproject.org
Date: Thu, 02 Mar 2017 11:38:19 +0100
Message-ID: <148845109955.23452.14312315410693510946.stgit@Solace.fritz.box>
In-Reply-To: <148844531279.23452.17528540110704914171.stgit@Solace.fritz.box>
References: <148844531279.23452.17528540110704914171.stgit@Solace.fritz.box>
User-Agent: StGit/0.17.1-dirty
Cc: George Dunlap, Andrew Cooper
Subject: [Xen-devel] [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
During load balancing, we check the non-idle pCPUs to see if they have
runnable but not running vCPUs that can be stolen by, and set to run on,
currently idle pCPUs.

If a pCPU has only one running (or runnable) vCPU, though, we don't want
to steal it from there, and it's therefore pointless bothering with it
(especially considering that bothering means trying to take its runqueue
lock!).

On large systems, when load is only slightly higher than the number of
pCPUs (i.e., there are just a few more active vCPUs than the number of
the pCPUs), this may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue lock,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce here the concept of overloaded runqueues,
and a cpumask where to record what pCPUs are in such state. An overloaded
runqueue has at least 2 runnable vCPUs (plus the idle one, which is always
there). Typically, this means 1 vCPU is running, and 1 is sitting in the
runqueue, and can hence be stolen.

Then, in csched_load_balance(), it is enough to go over the overloaded
pCPUs, instead of all non-idle pCPUs, which is better.

Signed-off-by: Dario Faggioli
---
Cc: George Dunlap
Cc: Andrew Cooper
---
I'm Cc-ing Andy on this patch, because we've discussed once about doing
something like this upstream.
---
 xen/common/sched_credit.c |   56 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 2b13e99..529b6c7 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -171,6 +171,7 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
+    unsigned int nr_runnable;
 };
 
 /*
@@ -221,6 +222,7 @@ struct csched_private {
     uint32_t ncpus;
     struct timer master_ticker;
     unsigned int master;
+    cpumask_var_t overloaded;
     cpumask_var_t idlers;
     cpumask_var_t cpus;
     uint32_t weight;
@@ -263,7 +265,10 @@ static inline bool_t is_runq_idle(unsigned int cpu)
 static inline void
 __runq_insert(struct csched_vcpu *svc)
 {
-    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
+    unsigned int cpu = svc->vcpu->processor;
+    const struct list_head * const runq = RUNQ(cpu);
+    struct csched_private * const prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
     struct list_head *iter;
 
     BUG_ON( __vcpu_on_runq(svc) );
@@ -288,12 +293,37 @@ __runq_insert(struct csched_vcpu *svc)
     }
 
     list_add_tail(&svc->runq_elem, iter);
+
+    /*
+     * If there is more than just the idle vCPU and a "regular" vCPU runnable
+     * on the runqueue of this pCPU, mark it as overloaded (so other pCPUs
+     * can come and pick up some work).
+     */
+    if ( ++spc->nr_runnable > 2 &&
+         !cpumask_test_cpu(cpu, prv->overloaded) )
+        cpumask_set_cpu(cpu, prv->overloaded);
 }
 
 static inline void
 __runq_remove(struct csched_vcpu *svc)
 {
+    unsigned int cpu = svc->vcpu->processor;
+    struct csched_private * const prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
+
     BUG_ON( !__vcpu_on_runq(svc) );
+
+    /*
+     * Mark the CPU as no longer overloaded when we drop to having only
+     * 1 vCPU in its runqueue. In fact, this means that just the idle
+     * vCPU and a "regular" vCPU are around.
+     */
+    if ( --spc->nr_runnable <= 2 &&
+         cpumask_test_cpu(cpu, prv->overloaded) )
+        cpumask_clear_cpu(cpu, prv->overloaded);
+
+    ASSERT(spc->nr_runnable >= 1);
+
     list_del_init(&svc->runq_elem);
 }
 
@@ -590,6 +620,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+    spc->nr_runnable = 1;
 }
 
 static void
@@ -1704,8 +1735,8 @@ csched_load_balance(struct csched_private *prv, int cpu,
     peer_node = node;
     do
     {
-        /* Find out what the !idle are in this node */
-        cpumask_andnot(&workers, online, prv->idlers);
+        /* Select the pCPUs in this node that have work we can steal. */
+        cpumask_and(&workers, online, prv->overloaded);
         cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
         __cpumask_clear_cpu(cpu, &workers);
 
@@ -1989,7 +2020,8 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     runq = &spc->runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc->runq_sort_last, cpustr);
+    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
+           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);
 
@@ -2027,7 +2059,7 @@ csched_dump(const struct scheduler *ops)
 
     spin_lock_irqsave(&prv->lock, flags);
 
-#define idlers_buf keyhandler_scratch
+#define cpumask_buf keyhandler_scratch
 
     printk("info:\n"
            "\tncpus = %u\n"
@@ -2055,8 +2087,10 @@ csched_dump(const struct scheduler *ops)
            prv->ticks_per_tslice,
            vcpu_migration_delay);
 
-    cpumask_scnprintf(idlers_buf, sizeof(idlers_buf), prv->idlers);
-    printk("idlers: %s\n", idlers_buf);
+    cpumask_scnprintf(cpumask_buf, sizeof(cpumask_buf), prv->idlers);
+    printk("idlers: %s\n", cpumask_buf);
+    cpumask_scnprintf(cpumask_buf, sizeof(cpumask_buf), prv->overloaded);
+    printk("overloaded: %s\n", cpumask_buf);
 
     printk("active vcpus:\n");
     loop = 0;
@@ -2079,7 +2113,7 @@ csched_dump(const struct scheduler *ops)
             vcpu_schedule_unlock(lock, svc->vcpu);
         }
     }
-#undef idlers_buf
+#undef cpumask_buf
 
     spin_unlock_irqrestore(&prv->lock, flags);
 }
@@ -2093,8 +2127,11 @@ csched_init(struct scheduler *ops)
     if ( prv == NULL )
         return -ENOMEM;
     if ( !zalloc_cpumask_var(&prv->cpus) ||
-         !zalloc_cpumask_var(&prv->idlers) )
+         !zalloc_cpumask_var(&prv->idlers) ||
+         !zalloc_cpumask_var(&prv->overloaded) )
     {
+        free_cpumask_var(prv->overloaded);
+        free_cpumask_var(prv->idlers);
         free_cpumask_var(prv->cpus);
         xfree(prv);
         return -ENOMEM;
@@ -2141,6 +2178,7 @@ csched_deinit(struct scheduler *ops)
         ops->sched_data = NULL;
         free_cpumask_var(prv->cpus);
         free_cpumask_var(prv->idlers);
+        free_cpumask_var(prv->overloaded);
         xfree(prv);
     }
 }