From patchwork Mon Feb 17 11:23:16 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Gabriele Monaco
X-Patchwork-Id: 13977624
From: Gabriele Monaco
To: linux-kernel@vger.kernel.org, Andrew Morton, Ingo Molnar,
 Peter Zijlstra, Mathieu Desnoyers, linux-mm@kvack.org
Cc: Gabriele Monaco, Ingo Molnar, "Paul E. McKenney"
Subject: [PATCH 1/2] sched: Compact RSEQ concurrency IDs in batches
Date: Mon, 17 Feb 2025 12:23:16 +0100
Message-ID: <20250217112317.258716-2-gmonaco@redhat.com>
In-Reply-To: <20250217112317.258716-1-gmonaco@redhat.com>
References: <20250217112317.258716-1-gmonaco@redhat.com>

Currently, the task_mm_cid_work function is called in a task work
triggered by a scheduler tick to frequently compact the mm_cids of each
process for each core. This can delay the execution of the corresponding
thread for the entire duration of the function, negatively affecting
response times for real-time tasks. In practice, we observe
task_mm_cid_work increasing latency by 30-35us on a 128-core system;
this order of magnitude is meaningful under PREEMPT_RT.

Run task_mm_cid_work in batches of up to CONFIG_RSEQ_CID_SCAN_BATCH
CPUs, which bounds the duration of the delay for each scan.
Also shorten each scan by iterating over all present CPUs rather than
all possible CPUs.

task_mm_cid_work already contains a mechanism to avoid running more
frequently than every 100ms. The function runs at every tick; assuming
a tick every 1ms (HZ=1000 is common on distros) and an unfavorable
scenario in which task T runs during only 1 in 10 ticks, setting
CONFIG_RSEQ_CID_SCAN_BATCH to 10 on a 128-core machine compacts the
CIDs for task T in about 130ms (13 batches, one every 10ms).
This value also drastically reduces the task work duration, bringing
the latency to a more acceptable level on the aforementioned machine.

Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Gabriele Monaco
---
 include/linux/mm_types.h |  8 ++++++++
 init/Kconfig             | 12 ++++++++++++
 kernel/sched/core.c      | 27 ++++++++++++++++++++++++---
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6b..1e0e491d2c5c2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -867,6 +867,13 @@ struct mm_struct {
 		 * When the next mm_cid scan is due (in jiffies).
 		 */
 		unsigned long mm_cid_next_scan;
+		/*
+		 * @mm_cid_scan_cpu: Which cpu to start from in the next scan
+		 *
+		 * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH after each scan
+		 * save the next cpu index here (or 0 if we are done)
+		 */
+		unsigned int mm_cid_scan_cpu;
 		/**
 		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
 		 *
@@ -1249,6 +1256,7 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 	raw_spin_lock_init(&mm->cpus_allowed_lock);
 	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
 	cpumask_clear(mm_cidmask(mm));
+	mm->mm_cid_scan_cpu = 0;
 }
 
 static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
diff --git a/init/Kconfig b/init/Kconfig
index d0d021b3fa3b3..39f1d4c7980c0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1813,6 +1813,18 @@ config DEBUG_RSEQ
 
 	  If unsure, say N.
 
+config RSEQ_CID_SCAN_BATCH
+	int "Number of CPUs to scan every time we attempt mm_cid compaction"
+	range 1 NR_CPUS
+	default 10
+	depends on SCHED_MM_CID
+	help
+	  CPUs are scanned pseudo-periodically to compact the CID of each task,
+	  this operation can take a longer amount of time on systems with many
+	  CPUs, resulting in higher scheduling latency for the current task.
+	  A higher value means the CID is compacted faster, but results in
+	  higher scheduling latency.
+
 config CACHESTAT_SYSCALL
 	bool "Enable cachestat() system call" if EXPERT
 	default y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9aecd914ac691..8d1cce4ed62c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10536,7 +10536,7 @@ static void task_mm_cid_work(struct callback_head *work)
 	struct task_struct *t = current;
 	struct cpumask *cidmask;
 	struct mm_struct *mm;
-	int weight, cpu;
+	int weight, cpu, from_cpu, to_cpu;
 
 	SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
 
@@ -10546,6 +10546,15 @@ static void task_mm_cid_work(struct callback_head *work)
 	mm = t->mm;
 	if (!mm)
 		return;
+	cpu = from_cpu = READ_ONCE(mm->mm_cid_scan_cpu);
+	to_cpu = from_cpu + CONFIG_RSEQ_CID_SCAN_BATCH;
+	if (from_cpu > cpumask_last(cpu_present_mask)) {
+		from_cpu = 0;
+		to_cpu = CONFIG_RSEQ_CID_SCAN_BATCH;
+	}
+	if (from_cpu != 0)
+		/* Delay scan only if we are done with all cpus. */
+		goto cid_compact;
 	old_scan = READ_ONCE(mm->mm_cid_next_scan);
 	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	if (!old_scan) {
@@ -10561,17 +10570,29 @@ static void task_mm_cid_work(struct callback_head *work)
 		return;
 	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
 		return;
+
+cid_compact:
+	if (!try_cmpxchg(&mm->mm_cid_scan_cpu, &cpu, to_cpu))
+		return;
 	cidmask = mm_cidmask(mm);
 	/* Clear cids that were not recently used. */
-	for_each_possible_cpu(cpu)
+	cpu = from_cpu;
+	for_each_cpu_from(cpu, cpu_present_mask) {
+		if (cpu == to_cpu)
+			break;
 		sched_mm_cid_remote_clear_old(mm, cpu);
+	}
 	weight = cpumask_weight(cidmask);
 	/*
 	 * Clear cids that are greater or equal to the cidmask weight to
 	 * recompact it.
 	 */
-	for_each_possible_cpu(cpu)
+	cpu = from_cpu;
+	for_each_cpu_from(cpu, cpu_present_mask) {
+		if (cpu == to_cpu)
+			break;
 		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
+	}
 }
 
 void init_sched_mm_cid(struct task_struct *t)
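
Note (illustration only, not part of the patch): the 130ms estimate in
the changelog can be reproduced with a small userspace model of the
batch cursor. This is a rough sketch under stated assumptions: present
CPUs are numbered contiguously from 0 to 127, there is no concurrent
update of the cursor (so the try_cmpxchg is modelled as a plain store),
and the 100ms scan delay between passes is ignored. All names and
constants below are hypothetical stand-ins, not kernel symbols.

```c
/* Hypothetical stand-ins for the values discussed in the changelog. */
#define NR_PRESENT_CPUS	128	/* CPUs 0..127, assumed contiguous */
#define SCAN_BATCH	10	/* models CONFIG_RSEQ_CID_SCAN_BATCH */
#define TICK_MS		1	/* HZ=1000 */
#define TICKS_PER_RUN	10	/* task T runs during 1 in 10 ticks */

/* Saved cursor, modelling mm->mm_cid_scan_cpu. */
static unsigned int scan_cpu;

/* Scan one batch starting at the cursor; return the CPUs visited. */
static int scan_one_batch(void)
{
	int from = (int)scan_cpu;
	int to, cpu, visited = 0;

	/* Wrap around once the cursor is past the last present CPU. */
	if (from > NR_PRESENT_CPUS - 1)
		from = 0;
	to = from + SCAN_BATCH;

	/* Plain store here; the patch uses try_cmpxchg for this step. */
	scan_cpu = (unsigned int)to;

	for (cpu = from; cpu < NR_PRESENT_CPUS && cpu != to; cpu++)
		visited++;
	return visited;
}

/* Count the invocations needed for one full pass over all CPUs. */
static int batches_per_pass(void)
{
	int total = 0, batches = 0;

	scan_cpu = 0;
	while (total < NR_PRESENT_CPUS) {
		total += scan_one_batch();
		batches++;
	}
	return batches;
}

/* One batch per run of task T, one run every TICKS_PER_RUN ticks. */
static int estimated_pass_ms(void)
{
	return batches_per_pass() * TICKS_PER_RUN * TICK_MS;
}
```

With these constants, batches_per_pass() returns 13 (twelve batches of
10 CPUs and a final batch of 8) and estimated_pass_ms() returns 130.
A larger SCAN_BATCH shortens the pass but lengthens each task work
invocation, which is the trade-off the Kconfig help text describes.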