From patchwork Tue Feb 25 14:00:01 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chen Yu
X-Patchwork-Id: 13990061
From: Chen Yu
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Andrew Morton
Cc: Rik van Riel, Mel Gorman, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, "Liam R. Howlett", Lorenzo Stoakes, "Huang, Ying", Tim Chen, Aubrey Li, Michael Wang, Kaiyang Zhao, David Rientjes, Raghavendra K T, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Chen Yu
Subject: [RFC PATCH 1/3] sched/numa: Introduce numa balance task migration and swap in schedstats
Date: Tue, 25 Feb 2025 22:00:01 +0800
Message-Id: <1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com>
X-Mailer: git-send-email 2.25.1

There is a requirement to track task activities during NUMA balancing.
NUMA balancing has two mechanisms for task migration: one is to migrate
the task to an idle CPU in its preferred node, and the other is to swap
tasks on different nodes if they are on each other's preferred node.

The kernel already has NUMA page migration statistics. Add the task
migration and swap count described above in the per-task/cgroup scope.
The data will be displayed at /sys/fs/cgroup/mytest/memory.stat and
/proc/{PID}/sched.
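As an illustration of how the new counters could be consumed from user space, here is a minimal sketch that looks up one counter in text already read from memory.stat or /proc/{PID}/sched. The helper name find_counter() is hypothetical, and the "name value" and "name : value" line layouts are assumptions about how those files print their entries.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: look up counter `name` in `text`, which holds
 * lines shaped like "name value" (assumed memory.stat layout) or
 * "name : value" (assumed /proc/<pid>/sched layout).
 * Returns 0 and stores the value on success, -1 if it is not found.
 */
int find_counter(const char *text, const char *name,
		 unsigned long long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	assert(value != NULL);

	while (line && *line) {
		const char *p = line;

		if (strncmp(p, name, len) == 0) {
			p += len;
			/* Reject longer names that merely share this prefix. */
			if (*p == ' ' || *p == '\t' || *p == ':') {
				while (*p == ' ' || *p == '\t' || *p == ':')
					p++;
				if (isdigit((unsigned char)*p)) {
					*value = strtoull(p, NULL, 10);
					return 0;
				}
			}
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Calling it once before and once after a test run and subtracting the two values gives the per-run delta for a counter such as numa_task_migrated.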
Signed-off-by: Chen Yu
---
 include/linux/sched.h         |  4 ++++
 include/linux/vm_event_item.h |  2 ++
 kernel/sched/core.c           | 10 ++++++++--
 kernel/sched/debug.c          |  4 ++++
 mm/memcontrol.c               |  2 ++
 mm/vmstat.c                   |  2 ++
 6 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d..01faa608ed7c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -527,6 +527,10 @@ struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+#ifdef CONFIG_NUMA_BALANCING
+	u64				nr_numa_migrations;
+	u64				nr_numa_swap;
+#endif
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..aef817474781 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -64,6 +64,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
+		NUMA_TASK_MIGRATE,
+		NUMA_TASK_SWAP,
 #endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 165c90ba64ea..44efc725054a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3348,6 +3348,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #ifdef CONFIG_NUMA_BALANCING
 static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
+	__schedstat_inc(p->stats.nr_numa_swap);
+
+	if (p->mm)
+		count_memcg_events_mm(p->mm, NUMA_TASK_SWAP, 1);
+
 	if (task_on_rq_queued(p)) {
 		struct rq *src_rq, *dst_rq;
 		struct rq_flags srf, drf;
@@ -7901,8 +7906,9 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
 		return -EINVAL;
 
-	/* TODO: This is not properly updating schedstats */
-
+	__schedstat_inc(p->stats.nr_numa_migrations);
+	if (p->mm)
+		count_memcg_events_mm(p->mm, NUMA_TASK_MIGRATE, 1);
 	trace_sched_move_numa(p, curr_cpu, target_cpu);
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ef047add7f9e..ed801cc00bf1 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1204,6 +1204,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+#ifdef CONFIG_NUMA_BALANCING
+		P_SCHEDSTAT(nr_numa_migrations);
+		P_SCHEDSTAT(nr_numa_swap);
+#endif
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..496b5edc3db6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -460,6 +460,8 @@ static const unsigned int memcg_vm_event_stat[] = {
 	NUMA_PAGE_MIGRATE,
 	NUMA_PTE_UPDATES,
 	NUMA_HINT_FAULTS,
+	NUMA_TASK_MIGRATE,
+	NUMA_TASK_SWAP,
 #endif
 };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..d6651778e4bf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1339,6 +1339,8 @@ const char * const vmstat_text[] = {
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
+	"numa_task_migrated",
+	"numa_task_swaped",
 #endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",

From patchwork Tue Feb 25 14:00:15 2025
X-Patchwork-Submitter: Chen Yu
X-Patchwork-Id: 13990062
From: Chen Yu
Subject: [RFC PATCH 2/3] sched/numa: Introduce per cgroup numa balance control
Date: Tue, 25 Feb 2025 22:00:15 +0800

[Problem Statement]
Currently, NUMA balancing is configured system-wide. However, in some
production environments, different containers may have varying
requirements for NUMA balancing. Some containers are CPU-intensive,
while others are memory-intensive. Some do not benefit from NUMA
balancing due to the overhead associated with VMA scanning, while
others prefer NUMA balancing because it improves memory locality. In
this case, system-wide NUMA balancing is usually disabled to produce
stable results.

[Proposal]
Introduce a per-cgroup interface to enable NUMA balancing for specific
cgroups. The system administrator must set the NUMA balancing mode to
NUMA_BALANCING_CGROUP=4 to enable this feature. In global
NUMA_BALANCING_CGROUP mode, NUMA balancing is disabled for all cgroups
by default. Once the administrator enables it for a specific cgroup,
NUMA balancing is enabled for that cgroup.

A simple example showing how to use per-cgroup NUMA balancing:

Step 1
//switch to global per-cgroup NUMA balancing;
//NUMA balancing is disabled for all cgroups by default.
echo 4 > /proc/sys/kernel/numa_balancing

Step 2
//create a cgroup named mytest and enable its NUMA balancing
echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance

[Benchmark]
Tested on two systems. Both systems have 4 nodes. Created a cgroup
mytest that is bound to node0 and node1 (CPU affinity as well as memory
allocation policy). Launched autonumabench NUMA01_THREADLOCAL via
mmtests.

echo 0 > /sys/fs/cgroup/mytest/cpu.numa_load_balance
cgexec -g cpuset:mytest ./run-mmtests.sh --no-monitor \
       --config config-numa baseline

echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance
cgexec -g cpuset:mytest ./run-mmtests.sh --no-monitor \
       --config config-numa nb_cgroup

system1: 4 nodes, 24 Cores(48 CPUs)/node.

The baseline took a total of 191.32 seconds to finish, while per-cgroup
NUMA balancing took a total of 104.46 seconds, an improvement of about
45%.
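The percentage follows directly from the two elapsed times quoted above. A minimal sketch of the arithmetic, assuming a time-like metric where lower is better (the function name improvement_pct() is illustrative only):

```c
#include <assert.h>

/*
 * Relative change of `result` against `baseline`, in percent.
 * For time-like metrics a positive value means the result is faster,
 * a negative value means it is slower.
 */
double improvement_pct(double baseline, double result)
{
	assert(baseline > 0.0);
	return (baseline - result) / baseline * 100.0;
}
```

For the elapsed time, (191.32 - 104.46) / 191.32 is about 0.454, hence the 45% figure. Applying the same formula to the system time (69.65 seconds versus 106.73 seconds) gives a negative value, meaning system time increased.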
                                  baseline            nb_cgroup
Min       syst-NUMA01_THREADLOCAL    69.65 (  0.00%)    106.73 (-53.24%)
Min       elsp-NUMA01_THREADLOCAL   191.32 (  0.00%)    104.46 ( 45.40%)
Amean     syst-NUMA01_THREADLOCAL    69.65 (  0.00%)    106.73 *-53.24%*
Amean     elsp-NUMA01_THREADLOCAL   191.32 (  0.00%)    104.46 * 45.40%*  <---
Stddev    syst-NUMA01_THREADLOCAL     0.00 (  0.00%)      0.00 (  0.00%)
Stddev    elsp-NUMA01_THREADLOCAL     0.00 (  0.00%)      0.00 (  0.00%)
CoeffVar  syst-NUMA01_THREADLOCAL     0.00 (  0.00%)      0.00 (  0.00%)
CoeffVar  elsp-NUMA01_THREADLOCAL     0.00 (  0.00%)      0.00 (  0.00%)
Max       syst-NUMA01_THREADLOCAL    69.65 (  0.00%)    106.73 (-53.24%)
Max       elsp-NUMA01_THREADLOCAL   191.32 (  0.00%)    104.46 ( 45.40%)
BAmean-50 syst-NUMA01_THREADLOCAL    69.65 (  0.00%)    106.73 (-53.24%)
BAmean-50 elsp-NUMA01_THREADLOCAL   191.32 (  0.00%)    104.46 ( 45.40%)
BAmean-95 syst-NUMA01_THREADLOCAL    69.65 (  0.00%)    106.73 (-53.24%)
BAmean-95 elsp-NUMA01_THREADLOCAL   191.32 (  0.00%)    104.46 ( 45.40%)
BAmean-99 syst-NUMA01_THREADLOCAL    69.65 (  0.00%)    106.73 (-53.24%)
BAmean-99 elsp-NUMA01_THREADLOCAL   191.32 (  0.00%)    104.46 ( 45.40%)

The results vary from run to run because per-cgroup NUMA balancing
sometimes fails to improve the score, although no performance
regression has been observed.

delta of /sys/fs/cgroup/mytest/memory.stat during the test:
numa_pages_migrated: 979933
numa_pte_updates:    21007548  <-- introduced in previous patch
numa_hint_faults:    19663982  <-- introduced in previous patch

system2: 4 nodes, 40 Cores(80 CPUs)/node.

The baseline took a total of 212.94 seconds to finish, while per-cgroup
NUMA balancing took a total of 127.05 seconds, an improvement of
40.34%.
                                  baseline            nb_cgroup
Min       syst-NUMA01_THREADLOCAL  8356.05 (  0.00%)   8921.84 ( -6.77%)
Min       elsp-NUMA01_THREADLOCAL   212.94 (  0.00%)    127.05 ( 40.34%)
Amean     syst-NUMA01_THREADLOCAL  8356.05 (  0.00%)   8921.84 ( -6.77%)
Amean     elsp-NUMA01_THREADLOCAL   212.94 (  0.00%)    127.05 ( 40.34%)  <---
Stddev    syst-NUMA01_THREADLOCAL     0.00 (  0.00%)      0.00 (  0.00%)
Stddev    elsp-NUMA01_THREADLOCAL     0.00 (  0.00%)      0.00 (  0.00%)
CoeffVar  syst-NUMA01_THREADLOCAL     0.00 (  0.00%)      0.00 (  0.00%)
CoeffVar  elsp-NUMA01_THREADLOCAL     0.00 (  0.00%)      0.00 (  0.00%)
Max       syst-NUMA01_THREADLOCAL  8356.05 (  0.00%)   8921.84 ( -6.77%)
Max       elsp-NUMA01_THREADLOCAL   212.94 (  0.00%)    127.05 ( 40.34%)
BAmean-50 syst-NUMA01_THREADLOCAL  8356.05 (  0.00%)   8921.84 ( -6.77%)
BAmean-50 elsp-NUMA01_THREADLOCAL   212.94 (  0.00%)    127.05 ( 40.34%)
BAmean-95 syst-NUMA01_THREADLOCAL  8356.05 (  0.00%)   8921.84 ( -6.77%)
BAmean-95 elsp-NUMA01_THREADLOCAL   212.94 (  0.00%)    127.05 ( 40.34%)
BAmean-99 syst-NUMA01_THREADLOCAL  8356.05 (  0.00%)   8921.84 ( -6.77%)
BAmean-99 elsp-NUMA01_THREADLOCAL   212.94 (  0.00%)    127.05 ( 40.34%)

The NUMA statistics delta during the test:
numa_pages_migrated: 785848
numa_pte_updates:    2359714
numa_hint_faults:    2349857

[Limitations]
It has been observed that, in some runs, even with per-cgroup NUMA
balancing enabled, there is still remote node access and the benchmark
score does not improve over the baseline. According to the NUMA
statistics, little NUMA page migration is detected in those runs.
Further testing shows that global NUMA balancing has the same issue:
sometimes NUMA balancing does not help. This could be a generic issue
in the current kernel code, possibly due to either the NUMA page
migration or the task migration strategy, and it needs further
investigation.
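The gating described in the proposal (a global mode bit plus a per-cgroup switch that defaults to off) can be modelled in user space. The sketch below mirrors the condition used by tg_numa_balance_enabled() in this patch; the function names numa_balance_allowed() and numa_gate_selftest() and the boolean parameters are illustrative assumptions, not kernel interfaces. It models only this per-cgroup gate, not the rest of the NUMA balancing enablement logic.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2
#define NUMA_BALANCING_CGROUP		0x4

/*
 * User-space model of the per-cgroup gate: balancing is refused only
 * when the task belongs to a group, the CGROUP mode bit is set, and
 * the group has not opted in.
 */
bool numa_balance_allowed(int mode, bool in_group, bool nlb_enabled)
{
	if (in_group && (mode & NUMA_BALANCING_CGROUP) && !nlb_enabled)
		return false;

	return true;
}

/* Walk through the cases described in the proposal. */
void numa_gate_selftest(void)
{
	/* Mode 4: cgroups are off by default and on after opting in. */
	assert(!numa_balance_allowed(NUMA_BALANCING_CGROUP, true, false));
	assert(numa_balance_allowed(NUMA_BALANCING_CGROUP, true, true));
	/* Mode 1: the per-cgroup switch does not gate anything. */
	assert(numa_balance_allowed(NUMA_BALANCING_NORMAL, true, false));
}
```

Because the gate only ever returns false in CGROUP mode, existing behaviour under the other modes is unchanged by the per-cgroup switch.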
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c          | 32 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 18 ++++++++++++++++++
 kernel/sched/sched.h         |  3 +++
 mm/mprotect.c                |  5 +++--
 5 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5a64582b086b..1e4d5a9ddb26 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -22,6 +22,7 @@ enum sched_tunable_scaling {
 #define NUMA_BALANCING_DISABLED		0x0
 #define NUMA_BALANCING_NORMAL		0x1
 #define NUMA_BALANCING_MEMORY_TIERING	0x2
+#define NUMA_BALANCING_CGROUP		0x4
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int sysctl_numa_balancing_mode;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44efc725054a..f4f048b3da68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10023,6 +10023,31 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 }
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+static DEFINE_MUTEX(numa_balance_mutex);
+static int numa_balance_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 enable)
+{
+	struct task_group *tg;
+	int ret = 0;
+
+	guard(mutex)(&numa_balance_mutex);
+	tg = css_tg(css);
+	if (tg->nlb_enabled == enable)
+		return 0;
+
+	tg->nlb_enabled = enable;
+
+	return ret;
+}
+
+static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return css_tg(css)->nlb_enabled;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_GROUP_SCHED_WEIGHT
 	{
@@ -10071,6 +10096,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.name = "numa_load_balance",
+		.read_u64 = numa_balance_read_u64,
+		.write_u64 = numa_balance_write_u64,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..526cb33b007c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3146,6 +3146,18 @@ void task_numa_free(struct task_struct *p, bool final)
 	}
 }
 
+/* return true if the task group has enabled the numa balance */
+static bool tg_numa_balance_enabled(struct task_struct *p)
+{
+	struct task_group *tg = task_group(p);
+
+	if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
+	    !tg->nlb_enabled)
+		return false;
+
+	return true;
+}
+
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
@@ -3174,6 +3186,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	     !cpupid_valid(last_cpupid)))
 		return;
 
+	if (!tg_numa_balance_enabled(p))
+		return;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) *
@@ -3596,6 +3611,9 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
 		return;
 
+	if (!tg_numa_balance_enabled(curr))
+		return;
+
 	/*
 	 * Using runtime rather than walltime has the dual advantage that
 	 * we (mostly) drive the selection from busy threads and that the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38e0e323dda2..9f478fb2c03a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -491,6 +491,9 @@ struct task_group {
 	/* Effective clamp values used for a task group */
 	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	u64			nlb_enabled;
+#endif
 
 };
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 516b1d847e2c..ddaaf20ef94c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 				toptier = node_is_toptier(nid);
 
 				/*
-				 * Skip scanning top tier node if normal numa
+				 * Skip scanning top tier node if normal/cgroup numa
 				 * balancing is disabled
 				 */
-				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+				if (!(sysctl_numa_balancing_mode &
+				    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
 				    toptier)
 					continue;
 				if (folio_use_access_time(folio))

From patchwork Tue Feb 25 14:00:40 2025
X-Patchwork-Submitter: Chen Yu
X-Patchwork-Id: 13990065
From: Chen Yu
Subject: [RFC PATCH 3/3] sched/numa: Allow interleaved memory allocation for numa balance
Date: Tue, 25 Feb 2025 22:00:40 +0800

MPOL_INTERLEAVE is used to allocate pages interleaved across different
NUMA nodes to make the best use of memory bandwidth. Under
MPOL_INTERLEAVE mode, NUMA balancing page migration does not occur
because the page is already in its designated place. Similarly, NUMA
balancing task migration does not occur either: mpol_misplaced()
returns NUMA_NO_NODE, which instructs do_numa_page() to skip page/task
migration. However, there is a scenario in the production environment
where NUMA balancing could benefit MPOL_INTERLEAVE.

In a typical scenario, tasks within cgroup g_A are bound to two SNC
(Sub-NUMA Cluster) nodes via cpuset, and their pages are allocated only on
these two SNC nodes in an interleaved manner using MPOL_INTERLEAVE. This
setup gives g_A good resource isolation while making effective use of the
memory bandwidth of the two SNC nodes. However, tasks t1 and t2 in g_A
could still experience remote access patterns:

    Node 0            Node 1
    t1                t1.page
    t2.page           t2

Ideally, a NUMA balancing task swap would be beneficial:

    Node 0            Node 1
    t2                t1.page
    t2.page           t1

In other words, NUMA balancing can swap t1 and t2 to improve NUMA locality
without migrating pages, thereby still honoring the MPOL_INTERLEAVE policy.

To enable NUMA balancing to manage MPOL_INTERLEAVE, add MPOL_F_MOF to the
MPOL_INTERLEAVE policy if the user has requested it via
MPOL_F_NUMA_BALANCING (similar to MPOL_BIND). A new flag, MPOL_F_MOFT,
marks such a policy as allowing task migration but no page migration on
fault. In summary, pages will not be migrated for MPOL_INTERLEAVE, but
tasks will be migrated to their preferred nodes.

Tested on a system with 4 nodes and 40 cores (80 CPUs) per node, using
autonumabench NUMA01_THREADLOCAL with some minor changes to support
MPOL_INTERLEAVE:

    p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, \
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    set_mempolicy(MPOL_INTERLEAVE | MPOL_F_NUMA_BALANCING, \
                  &nodemask_global, max_nodes);
    ...
    // each thread accesses 4K of data every 8K,
    // so each thread should access the pages on one node.
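
To illustrate why that access pattern can keep each thread's pages on one
node, here is a simplified userspace model. It assumes 4 KiB base pages, a
page-aligned start offset, and plain round-robin placement where page i
lives on node (i % nr_nodes). The contents of nodemask_global and the
kernel's actual interleave index calculation are not shown in this message,
and model_nodes_touched() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL /* assumption: 4 KiB base pages */
#define MODEL_STRIDE    8192UL /* "4K of data every 8K" */

/*
 * Hypothetical helper: return a bitmask of the nodes touched by a thread
 * that accesses one 4K page every 8K, starting at start_offset, within a
 * buffer of buf_len bytes. Placement is modelled as round-robin by page
 * index: page i lives on node (i % nr_nodes).
 */
unsigned int model_nodes_touched(size_t start_offset, size_t buf_len,
				 unsigned int nr_nodes)
{
	unsigned int mask = 0;
	size_t off;

	assert(nr_nodes > 0 && nr_nodes <= 32);
	assert(start_offset % MODEL_PAGE_SIZE == 0);

	for (off = start_offset; off < buf_len; off += MODEL_STRIDE) {
		size_t page = off / MODEL_PAGE_SIZE;

		/* Record the node that holds this page in the model. */
		mask |= 1U << (page % nr_nodes);
	}
	return mask;
}
```

Under these assumptions, a thread starting at offset 0 with two interleaved
nodes only touches node 0, and a thread starting at offset 4096 only touches
node 1. With four interleaved nodes the same stride touches two nodes, so
the "one thread, one node" property depends on the nodemask that is used.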

No obvious score difference was observed, but some NUMA balancing task
migrations were noticed:

                                   baseline_nocg_interleav     nb_nocg_interlave
                                  baseline_nocg_interleave    nb_nocg_interlave/
Min       syst-NUMA01_THREADLOCAL     7156.34 (  0.00%)     7267.28 ( -1.55%)
Min       elsp-NUMA01_THREADLOCAL       90.73 (  0.00%)       90.88 ( -0.17%)
Amean     syst-NUMA01_THREADLOCAL     7156.34 (  0.00%)     7267.28 ( -1.55%)
Amean     elsp-NUMA01_THREADLOCAL       90.73 (  0.00%)       90.88 ( -0.17%)
Stddev    syst-NUMA01_THREADLOCAL        0.00 (  0.00%)        0.00 (  0.00%)
Stddev    elsp-NUMA01_THREADLOCAL        0.00 (  0.00%)        0.00 (  0.00%)
CoeffVar  syst-NUMA01_THREADLOCAL        0.00 (  0.00%)        0.00 (  0.00%)
CoeffVar  elsp-NUMA01_THREADLOCAL        0.00 (  0.00%)        0.00 (  0.00%)
Max       syst-NUMA01_THREADLOCAL     7156.34 (  0.00%)     7267.28 ( -1.55%)
Max       elsp-NUMA01_THREADLOCAL       90.73 (  0.00%)       90.88 ( -0.17%)
BAmean-50 syst-NUMA01_THREADLOCAL     7156.34 (  0.00%)     7267.28 ( -1.55%)
BAmean-50 elsp-NUMA01_THREADLOCAL       90.73 (  0.00%)       90.88 ( -0.17%)
BAmean-95 syst-NUMA01_THREADLOCAL     7156.34 (  0.00%)     7267.28 ( -1.55%)
BAmean-95 elsp-NUMA01_THREADLOCAL       90.73 (  0.00%)       90.88 ( -0.17%)
BAmean-99 syst-NUMA01_THREADLOCAL     7156.34 (  0.00%)     7267.28 ( -1.55%)
BAmean-99 elsp-NUMA01_THREADLOCAL       90.73 (  0.00%)       90.88 ( -0.17%)

delta of /sys/fs/cgroup/mytest/memory.stat during the test:

numa_pages_migrated: 0
numa_pte_updates: 9156154
numa_hint_faults: 8659673
numa_task_migrated: 282 <--- introduced in previous patch
numa_task_swaped: 114 <--- introduced in previous patch

More tests to come.

Suggested-by: Aubrey Li
Signed-off-by: Chen Yu
---
 include/linux/numa.h           | 1 +
 include/uapi/linux/mempolicy.h | 1 +
 mm/memory.c                    | 2 +-
 mm/mempolicy.c                 | 7 +++++++
 4 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/numa.h b/include/linux/numa.h
index 3567e40329eb..6c3f2d839c76 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -14,6 +14,7 @@
 
 #define NUMA_NO_NODE (-1)
 #define NUMA_NO_MEMBLK (-1)
+#define NUMA_TASK_MIG (1)
 
 static inline bool numa_valid_node(int nid)
 {
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 1f9bb10d1a47..2081365612ac 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -64,6 +64,7 @@ enum {
 #define MPOL_F_SHARED (1 << 0) /* identify shared policies */
 #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */
+#define MPOL_F_MOFT (1 << 5) /* allow task but no page migrate on fault */
 
 /*
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d54..4013bbcbf40f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5683,7 +5683,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
 					writable, &last_cpupid);
-	if (target_nid == NUMA_NO_NODE)
+	if (target_nid == NUMA_NO_NODE || target_nid == NUMA_TASK_MIG)
 		goto out_map;
 	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
 		flags |= TNF_MIGRATE_FAIL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bbaadbeeb291..0b88601ec22d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1510,6 +1510,8 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 	if (*flags & MPOL_F_NUMA_BALANCING) {
 		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
 			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
+		else if (*mode == MPOL_INTERLEAVE)
+			*flags |= (MPOL_F_MOF | MPOL_F_MOFT);
 		else
 			return -EINVAL;
 	}
@@ -2779,6 +2781,11 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out;
 
+	if (pol->flags & MPOL_F_MOFT) {
+		ret = NUMA_TASK_MIG;
+		goto out;
+	}
+
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
 		polnid = interleave_nid(pol, ilx);
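
The flag flow across the three hunks above can be sketched as a small
userspace model. This is only an illustration of the decision logic in the
diff, not kernel code: every MODEL_* name is a stand-in invented for this
sketch, the mode enum values are not meant to match the uapi header, and
MODEL_CHECK_MODE is a hypothetical sentinel standing for "fall through to
the per-mode switch in mpol_misplaced()".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Model-only policy modes; numeric values are arbitrary. */
enum model_mode {
	MODEL_MPOL_PREFERRED,
	MODEL_MPOL_BIND,
	MODEL_MPOL_INTERLEAVE,
	MODEL_MPOL_PREFERRED_MANY,
};

#define MODEL_F_MOF            (1 << 3)  /* migrate on fault */
#define MODEL_F_MORON          (1 << 4)  /* migrate on protnone reference */
#define MODEL_F_MOFT           (1 << 5)  /* new in the patch: task only */
#define MODEL_F_NUMA_BALANCING (1 << 13) /* model value for the user flag */

#define MODEL_NUMA_NO_NODE  (-1)
#define MODEL_NUMA_TASK_MIG (1)  /* value proposed by the patch */
#define MODEL_CHECK_MODE    (-2) /* hypothetical: continue to mode switch */

/* Mirrors the sanitize_mpol_flags() hunk: derive internal flags. */
int model_sanitize_flags(enum model_mode mode, unsigned short *flags)
{
	assert(flags);

	if (*flags & MODEL_F_NUMA_BALANCING) {
		if (mode == MODEL_MPOL_BIND || mode == MODEL_MPOL_PREFERRED_MANY)
			*flags |= (MODEL_F_MOF | MODEL_F_MORON);
		else if (mode == MODEL_MPOL_INTERLEAVE)
			*flags |= (MODEL_F_MOF | MODEL_F_MOFT);
		else
			return -EINVAL;
	}
	return 0;
}

/* Mirrors the early exits in the mpol_misplaced() hunk. */
int model_misplaced(unsigned short flags)
{
	if (!(flags & MODEL_F_MOF))
		return MODEL_NUMA_NO_NODE;
	if (flags & MODEL_F_MOFT)
		return MODEL_NUMA_TASK_MIG;
	return MODEL_CHECK_MODE;
}

/* Mirrors the do_numa_page() hunk: both values skip page migration. */
bool model_skips_page_migration(int target_nid)
{
	return target_nid == MODEL_NUMA_NO_NODE ||
	       target_nid == MODEL_NUMA_TASK_MIG;
}
```

In this model, an interleave policy requested with the balancing flag gets
both MOF and MOFT, model_misplaced() then returns the task-migration
sentinel, and the fault path skips page migration. The sketch also makes one
detail of the diff visible: NUMA_TASK_MIG is defined as 1, which is also a
valid node ID, so model_skips_page_migration(1) is true even when 1 was
meant as a page migration target for another policy.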