From patchwork Thu Sep 21 12:23:51 2023
X-Patchwork-Submitter: George Dunlap
X-Patchwork-Id: 13393909
From: George Dunlap
To: xen-devel@lists.xenproject.org
Cc: George Dunlap, Dario Faggioli, Andrew Cooper, George Dunlap,
    Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu
Subject: [PATCH v2 1/2] credit: Limit load balancing to once per millisecond
Date: Thu, 21 Sep 2023 13:23:51 +0100
Message-Id: <20230921122352.2307574-1-george.dunlap@cloud.com>

The credit scheduler tries as hard as it can to ensure that it always
runs scheduling units with positive credit (PRI_TS_UNDER) before
running those with negative credit (PRI_TS_OVER).  If the next
runnable scheduling unit is of priority OVER, it will always run the
load balancer, which will scour the system looking for another
scheduling unit of the UNDER priority.

Unfortunately, as the number of cores on a system has grown, the cost
of the work-stealing algorithm has dramatically increased; a recent
trace on a system with 128 cores showed this taking over 50
microseconds.

Add a parameter, load_balance_ratelimit, to limit the frequency of
load balance operations on a given pcpu.  Default this to 1
millisecond.

Invert the load balancing conditional to make it more clear, and line
up more closely with the comment above it.

Overall it might be cleaner to have the last_load_balance checking
happen inside csched_load_balance(), but that would require either
passing both now and spc into the function, or looking them up again;
both of which seemed to be worse than simply checking and setting the
values before calling it.

On a system with a vcpu:pcpu ratio of 2:1, running Windows guests
(which will end up calling YIELD during spinlock contention), this
patch increased performance significantly.

Signed-off-by: George Dunlap
Reviewed-by: Juergen Gross
---
Changes since v1:
- Fix editing mistake in commit message
- Improve documentation
- global var is __ro_after_init
- Remove sysctl, as it's not used.  Define max value in credit.c.
- Fix some style issues
- Move comment tweak to the right patch
- In the event that the commandline-parameter value is too high, clip
  to the maximum value rather than setting to the default.

CC: Dario Faggioli
CC: Andrew Cooper
CC: George Dunlap
CC: Jan Beulich
CC: Julien Grall
CC: Stefano Stabellini
CC: Wei Liu
---
 docs/misc/xen-command-line.pandoc |  8 ++++++
 xen/common/sched/credit.c         | 47 +++++++++++++++++++++++++------
 2 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index f88e6a70ae..9c3c72a7f9 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1884,6 +1884,14 @@ By default, Xen will use the INVPCID instruction for TLB management if
 it is available. This option can be used to cause Xen to fall back to
 older mechanisms, which are generally slower.

+### load-balance-ratelimit
+> `= `
+
+The minimum interval between load balancing events on a given pcpu, in
+microseconds. A value of '0' will disable rate limiting. Maximum
+value 1 second. At the moment only credit honors this parameter.
+Default 1ms.
+
 ### noirqbalance (x86)
 > `= `

diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
index f2cd3d9da3..5c06f596d2 100644
--- a/xen/common/sched/credit.c
+++ b/xen/common/sched/credit.c
@@ -50,6 +50,10 @@
 #define CSCHED_TICKS_PER_TSLICE 3
 /* Default timeslice: 30ms */
 #define CSCHED_DEFAULT_TSLICE_MS 30
+/* Default load balancing ratelimit: 1ms */
+#define CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US 1000
+/* Max load balancing ratelimit: 1s */
+#define CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US 1000000
 #define CSCHED_CREDITS_PER_MSEC 10
 /* Never set a timer shorter than this value. */
 #define CSCHED_MIN_TIMER XEN_SYSCTL_SCHED_RATELIMIT_MIN
@@ -153,6 +157,7 @@ struct csched_pcpu {

     unsigned int idle_bias;
     unsigned int nr_runnable;
+    s_time_t last_load_balance;

     unsigned int tick;
     struct timer ticker;
@@ -218,7 +223,7 @@ struct csched_private {

     /* Period of master and tick in milliseconds */
     unsigned int tick_period_us, ticks_per_tslice;
-    s_time_t ratelimit, tslice, unit_migr_delay;
+    s_time_t ratelimit, tslice, unit_migr_delay, load_balance_ratelimit;

     struct list_head active_sdom;
     uint32_t weight;
@@ -612,6 +617,8 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     BUG_ON(!is_idle_unit(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
     spc->nr_runnable = 0;
+
+    spc->last_load_balance = NOW();
 }

 static void cf_check
@@ -1676,9 +1683,17 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
     return NULL;
 }

+/*
+ * Minimum delay, in microseconds, between load balance operations.
+ * This prevents spending too much time doing load balancing, particularly
+ * when the system has a high number of YIELDs due to spinlock priority inversion.
+ */
+static unsigned int __ro_after_init load_balance_ratelimit_us = CSCHED_DEFAULT_LOAD_BALANCE_RATELIMIT_US;
+integer_param("load-balance-ratelimit", load_balance_ratelimit_us);
+
 static struct csched_unit *
 csched_load_balance(struct csched_private *prv, int cpu,
-    struct csched_unit *snext, bool *stolen)
+                    struct csched_unit *snext, bool *stolen)
 {
     const struct cpupool *c = get_sched_res(cpu)->cpupool;
     struct csched_unit *speer;
@@ -1958,15 +1973,19 @@ static void cf_check csched_schedule(
         /*
          * SMP Load balance:
          *
-         * If the next highest priority local runnable UNIT has already eaten
-         * through its credits, look on other PCPUs to see if we have more
-         * urgent work... If not, csched_load_balance() will return snext, but
-         * already removed from the runq.
+         * If the next highest priority local runnable UNIT has
+         * already eaten through its credits (and we're below the
+         * balancing ratelimit), look on other PCPUs to see if we have
+         * more urgent work... If we don't, csched_load_balance() will
+         * return snext, but already removed from the runq.
          */
-        if ( snext->pri > CSCHED_PRI_TS_OVER )
-            __runq_remove(snext);
-        else
+        if ( snext->pri <= CSCHED_PRI_TS_OVER
+             && now - spc->last_load_balance > prv->load_balance_ratelimit) {
+            spc->last_load_balance = now;
             snext = csched_load_balance(prv, sched_cpu, snext, &migrated);
+        }
+        else
+            __runq_remove(snext);

     } while ( !unit_runnable_state(snext->unit) );

@@ -2181,6 +2200,14 @@ csched_global_init(void)
                XEN_SYSCTL_CSCHED_MGR_DLY_MAX_US, vcpu_migration_delay_us);
     }

+    if ( load_balance_ratelimit_us > CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US )
+    {
+        load_balance_ratelimit_us = CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US;
+        printk("WARNING: load-balance-ratelimit outside of valid range [0,%d]us.\n"
+               "Setting to max.\n",
+               CSCHED_MAX_LOAD_BALANCE_RATELIMIT_US);
+    }
+
     return 0;
 }

@@ -2223,6 +2250,8 @@ csched_init(struct scheduler *ops)

     prv->unit_migr_delay = MICROSECS(vcpu_migration_delay_us);

+    prv->load_balance_ratelimit = MICROSECS(load_balance_ratelimit_us);
+
     return 0;
 }


From patchwork Thu Sep 21 12:23:52 2023
X-Patchwork-Submitter: George Dunlap
X-Patchwork-Id: 13393910
From: George Dunlap
To: xen-devel@lists.xenproject.org
Cc: George Dunlap, Dario Faggioli
Subject: [PATCH v2 2/2] credit: Don't steal vcpus which have yielded
Date: Thu, 21 Sep 2023 13:23:52 +0100
Message-Id: <20230921122352.2307574-2-george.dunlap@cloud.com>
In-Reply-To: <20230921122352.2307574-1-george.dunlap@cloud.com>
References: <20230921122352.2307574-1-george.dunlap@cloud.com>

On large systems with many vcpus yielding due to spinlock priority
inversion, it's not uncommon for a vcpu to yield its timeslice, only
to be immediately stolen by another pcpu looking for higher-priority
work.

To prevent this:

* Keep the YIELD flag until a vcpu is removed from a runqueue

* When looking for work to steal, skip vcpus which have yielded

NB that this does mean that sometimes a VM is inserted into an empty
runqueue; handle that case.

Signed-off-by: George Dunlap
Reviewed-by: Juergen Gross
---
Changes since v1:
- Moved a comment tweak to the right patch

CC: Dario Faggioli
---
 xen/common/sched/credit.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
index 5c06f596d2..38a6f6fa6d 100644
--- a/xen/common/sched/credit.c
+++ b/xen/common/sched/credit.c
@@ -298,14 +298,10 @@ __runq_insert(struct csched_unit *svc)
      * runnable unit if we can. The next runq_sort will bring it forward
      * within 30ms if the queue too long. */
     if ( test_bit(CSCHED_FLAG_UNIT_YIELD, &svc->flags)
-         && __runq_elem(iter)->pri > CSCHED_PRI_IDLE )
-    {
+         && __runq_elem(iter)->pri > CSCHED_PRI_IDLE
+         && iter->next != runq)
         iter=iter->next;

-        /* Some sanity checks */
-        BUG_ON(iter == runq);
-    }
-
     list_add_tail(&svc->runq_elem, iter);
 }

@@ -321,6 +317,11 @@ __runq_remove(struct csched_unit *svc)
 {
     BUG_ON( !__unit_on_runq(svc) );
     list_del_init(&svc->runq_elem);
+
+    /*
+     * Clear YIELD flag when scheduling back in
+     */
+    clear_bit(CSCHED_FLAG_UNIT_YIELD, &svc->flags);
 }

 static inline void
@@ -1637,6 +1638,13 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
         if ( speer->pri <= pri )
             break;

+        /*
+         * Don't steal a UNIT which has yielded; it's waiting for a
+         * reason
+         */
+        if (test_bit(CSCHED_FLAG_UNIT_YIELD, &speer->flags))
+            continue;
+
         /* Is this UNIT runnable on our PCPU? */
         unit = speer->unit;
         BUG_ON( is_idle_unit(unit) );
@@ -1954,11 +1962,6 @@ static void cf_check csched_schedule(
         dec_nr_runnable(sched_cpu);
     }

-    /*
-     * Clear YIELD flag before scheduling out
-     */
-    clear_bit(CSCHED_FLAG_UNIT_YIELD, &scurr->flags);
-
     do {
         snext = __runq_elem(runq->next);
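
Illustrative note (not part of the series): the rate-limit check that
patch 1/2 adds to csched_schedule(), and the clipping done in
csched_global_init(), can be sketched outside Xen roughly as below.
This is a minimal standalone model under stated assumptions, not the
Xen code: struct pcpu_state, clamp_ratelimit_us() and
may_load_balance() are hypothetical names, s_time_t is modelled as a
signed 64-bit nanosecond count, and MICROSECS() is a local stand-in
for the Xen macro of the same name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: time is a signed 64-bit nanosecond count. */
typedef int64_t s_time_t;
#define MICROSECS(us) ((s_time_t)(us) * 1000)

/* Mirrors the 1s maximum described in the patch. */
#define MAX_LOAD_BALANCE_RATELIMIT_US 1000000u

/* Hypothetical, simplified per-pcpu state: only the new timestamp. */
struct pcpu_state {
    s_time_t last_load_balance;
};

/* Clip an out-of-range command-line value to the maximum. */
static unsigned int clamp_ratelimit_us(unsigned int us)
{
    return us > MAX_LOAD_BALANCE_RATELIMIT_US
           ? MAX_LOAD_BALANCE_RATELIMIT_US : us;
}

/*
 * Decide whether this pcpu may run the load balancer now.
 * Balancing is attempted only when the next local unit is of
 * priority OVER (or lower) and more than `ratelimit` nanoseconds
 * have passed since the last attempt.  The timestamp is updated
 * only when balancing is allowed.
 */
static bool may_load_balance(struct pcpu_state *spc, s_time_t now,
                             s_time_t ratelimit, bool next_is_over)
{
    assert(ratelimit >= 0);

    if ( next_is_over && now - spc->last_load_balance > ratelimit )
    {
        spc->last_load_balance = now;
        return true;
    }

    return false;
}
```

With the default of 1000 microseconds, a pcpu that balanced at time T
skips the balancer for any scheduling decision up to and including
T + 1ms and simply runs its local OVER unit instead. A ratelimit of 0
lets every decision with advancing time through, which matches the
documented "0 disables rate limiting" behaviour.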