From patchwork Tue Nov 12 14:22:38 2024
X-Patchwork-Submitter: Frederic Weisbecker
X-Patchwork-Id: 13872297
From: Frederic Weisbecker
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Kees Cook, Peter Zijlstra,
 Thomas Gleixner, Michal Hocko, Vlastimil Babka, linux-mm@kvack.org,
 "Paul E. McKenney", Neeraj Upadhyay, Joel Fernandes, Boqun Feng,
 Uladzislau Rezki, Zqiang, rcu@vger.kernel.org
Subject: [PATCH 14/21] kthread: Default affine kthread to its preferred NUMA node
Date: Tue, 12 Nov 2024 15:22:38 +0100
Message-ID: <20241112142248.20503-15-frederic@kernel.org>
In-Reply-To: <20241112142248.20503-1-frederic@kernel.org>
References: <20241112142248.20503-1-frederic@kernel.org>

Kthreads attached to a preferred NUMA node for their task structure
allocation can also be assumed to run preferably within that same node.

A more precise affinity is usually requested by calling
kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.

For the others, a default affinity to the node is desired and sometimes
implemented, with more or less success when it comes to dealing with
hotplug events and nohz_full / CPU Isolation interactions:

- kcompactd is affine to its node and handles hotplug but not CPU Isolation
- kswapd is affine to its node and ignores hotplug and CPU Isolation
- A bunch of drivers create their kthreads on a specific node and
  don't bother affining them any further.

Handle that default node affinity preference at the generic level
instead, provided a kthread is created on an actual node and doesn't
apply any specific affinity, such as a given CPU or a custom cpumask to
bind to, before its first wake-up.

This generic handling is aware of CPU hotplug events and CPU isolation
such that:

* When a housekeeping CPU that is part of the node of a given kthread
  goes up, the related task is re-affined to its own node if it was
  previously running on the default last resort online housekeeping set
  from other nodes.

* When a housekeeping CPU goes down while it was part of the node of a
  kthread, the running task is migrated (or the sleeping task is woken
  up) automatically by the scheduler to other housekeepers within the
  same node or, as a last resort, to all housekeepers from other nodes.

Acked-by: Vlastimil Babka
Signed-off-by: Frederic Weisbecker
---
 include/linux/cpuhotplug.h |   1 +
 kernel/kthread.c           | 106 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 2361ed4d2b15..228f27150a93 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -239,6 +239,7 @@ enum cpuhp_state {
 	CPUHP_AP_WORKQUEUE_ONLINE,
 	CPUHP_AP_RANDOM_ONLINE,
 	CPUHP_AP_RCUTREE_ONLINE,
+	CPUHP_AP_KTHREADS_ONLINE,
 	CPUHP_AP_BASE_CACHEINFO_ONLINE,
 	CPUHP_AP_ONLINE_DYN,
 	CPUHP_AP_ONLINE_DYN_END		= CPUHP_AP_ONLINE_DYN + 40,
diff --git a/kernel/kthread.c b/kernel/kthread.c
index b9bdb21a0101..df6a0551e8ba 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,6 +35,9 @@ static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
 struct task_struct *kthreadd_task;
 
+static LIST_HEAD(kthreads_hotplug);
+static DEFINE_MUTEX(kthreads_hotplug_lock);
+
 struct kthread_create_info
 {
 	/* Information passed to kthread() from kthreadd. */
@@ -53,6 +56,7 @@ struct kthread_create_info
 struct kthread {
 	unsigned long flags;
 	unsigned int cpu;
+	unsigned int node;
 	int started;
 	int result;
 	int (*threadfn)(void *);
@@ -64,6 +68,8 @@ struct kthread {
 #endif
 	/* To store the full name if task comm is truncated. */
 	char *full_name;
+	struct task_struct *task;
+	struct list_head hotplug_node;
 };
 
 enum KTHREAD_BITS {
@@ -122,8 +128,11 @@ bool set_kthread_struct(struct task_struct *p)
 
 	init_completion(&kthread->exited);
 	init_completion(&kthread->parked);
+	INIT_LIST_HEAD(&kthread->hotplug_node);
 	p->vfork_done = &kthread->exited;
 
+	kthread->task = p;
+	kthread->node = tsk_fork_get_node(current);
 	p->worker_private = kthread;
 	return true;
 }
@@ -314,6 +323,11 @@ void __noreturn kthread_exit(long result)
 {
 	struct kthread *kthread = to_kthread(current);
 	kthread->result = result;
+	if (!list_empty(&kthread->hotplug_node)) {
+		mutex_lock(&kthreads_hotplug_lock);
+		list_del(&kthread->hotplug_node);
+		mutex_unlock(&kthreads_hotplug_lock);
+	}
 	do_exit(0);
 }
 EXPORT_SYMBOL(kthread_exit);
@@ -339,6 +353,48 @@ void __noreturn kthread_complete_and_exit(struct completion *comp, long code)
 }
 EXPORT_SYMBOL(kthread_complete_and_exit);
 
+static void kthread_fetch_affinity(struct kthread *kthread, struct cpumask *cpumask)
+{
+	cpumask_and(cpumask, cpumask_of_node(kthread->node),
+		    housekeeping_cpumask(HK_TYPE_KTHREAD));
+
+	if (cpumask_empty(cpumask))
+		cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+}
+
+static void kthread_affine_node(void)
+{
+	struct kthread *kthread = to_kthread(current);
+	cpumask_var_t affinity;
+
+	WARN_ON_ONCE(kthread_is_per_cpu(current));
+
+	if (kthread->node == NUMA_NO_NODE) {
+		housekeeping_affine(current, HK_TYPE_RCU);
+	} else {
+		if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
+			WARN_ON_ONCE(1);
+			return;
+		}
+
+		mutex_lock(&kthreads_hotplug_lock);
+		WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
+		list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+		/*
+		 * The node cpumask is racy when read from kthread() but:
+		 * - a racing CPU going down will either fail on the subsequent
+		 *   call to set_cpus_allowed_ptr() or be migrated to housekeepers
+		 *   afterwards by the scheduler.
+		 * - a racing CPU going up will be handled by kthreads_online_cpu()
+		 */
+		kthread_fetch_affinity(kthread, affinity);
+		set_cpus_allowed_ptr(current, affinity);
+		mutex_unlock(&kthreads_hotplug_lock);
+
+		free_cpumask_var(affinity);
+	}
+}
+
 static int kthread(void *_create)
 {
 	static const struct sched_param param = { .sched_priority = 0 };
@@ -369,7 +425,6 @@ static int kthread(void *_create)
 	 * back to default in case they have been changed.
 	 */
 	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
-	set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
 
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
@@ -385,6 +440,9 @@ static int kthread(void *_create)
 
 	self->started = 1;
 
+	if (!(current->flags & PF_NO_SETAFFINITY))
+		kthread_affine_node();
+
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
 		cgroup_kthread_ready();
@@ -781,6 +839,52 @@ int kthreadd(void *unused)
 	return 0;
 }
 
+/*
+ * Re-affine kthreads according to their preferences
+ * and the newly online CPU. The CPU down part is handled
+ * by select_fallback_rq() which default re-affines to
+ * housekeepers in case the preferred affinity doesn't
+ * apply anymore.
+ */
+static int kthreads_online_cpu(unsigned int cpu)
+{
+	cpumask_var_t affinity;
+	struct kthread *k;
+	int ret;
+
+	guard(mutex)(&kthreads_hotplug_lock);
+
+	if (list_empty(&kthreads_hotplug))
+		return 0;
+
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = 0;
+
+	list_for_each_entry(k, &kthreads_hotplug, hotplug_node) {
+		if (WARN_ON_ONCE((k->task->flags & PF_NO_SETAFFINITY) ||
+				 kthread_is_per_cpu(k->task) ||
+				 k->node == NUMA_NO_NODE)) {
+			ret = -EINVAL;
+			continue;
+		}
+		kthread_fetch_affinity(k, affinity);
+		set_cpus_allowed_ptr(k->task, affinity);
+	}
+
+	free_cpumask_var(affinity);
+
+	return ret;
+}
+
+static int kthreads_init(void)
+{
+	return cpuhp_setup_state(CPUHP_AP_KTHREADS_ONLINE, "kthreads:online",
+				 kthreads_online_cpu, NULL);
+}
+early_initcall(kthreads_init);
+
 void __kthread_init_worker(struct kthread_worker *worker,
 				const char *name,
 				struct lock_class_key *key)