
[v2,3/4] sched: Compact RSEQ concurrency IDs with reduced threads and affinity

Message ID 20241213095407.271357-4-gmonaco@redhat.com (mailing list archive)
State New
Series sched: Move task_mm_cid_work to mm delayed work

Commit Message

Gabriele Monaco Dec. 13, 2024, 9:54 a.m. UTC
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

When a process reduces its number of threads or clears bits in its CPU
affinity mask, the mm_cid allocation should eventually converge towards
smaller values.

However, the change introduced by:

commit 7e019dcc470f ("sched: Improve cache locality of RSEQ concurrency
IDs for intermittent workloads")

adds a per-mm/CPU recent_cid which is never unset unless a thread
migrates.

This is a tradeoff between:

A) Preserving cache locality after a transition from many threads to few
   threads, or after reducing the hamming weight of the allowed CPU mask.

B) Making the mm_cid upper bounds wrt nr threads and allowed CPU mask
   easy to document and understand.

C) Allowing applications to eventually react to mm_cid compaction after
   reduction of the nr threads or allowed CPU mask, making the tracking
   of mm_cid compaction easier by shrinking it back towards 0.

D) Making sure applications that periodically reduce and then increase
   again the nr threads or allowed CPU mask still benefit from good
   cache locality with mm_cid.

Introduce the following changes:

* After shrinking the number of threads or reducing the number of
  allowed CPUs, reduce the value of max_nr_cid so expansion of CID
  allocation will preserve cache locality if the number of threads or
  allowed CPUs increases again.

* Only re-use a recent_cid if it is within the max_nr_cid upper bound,
  else find the first available CID.

Fixes: 7e019dcc470f ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads")
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Marco Elver <elver@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/linux/mm_types.h |  7 ++++---
 kernel/sched/sched.h     | 25 ++++++++++++++++++++++---
 2 files changed, 26 insertions(+), 6 deletions(-)
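
For illustration, the two changes listed above can be modelled in user
space with C11 atomics. Everything below is a sketch, not kernel code:
struct mm_model, try_get_cid(), test_and_set_cid(), min_int() and the
numbers in main() are made up, atomic_compare_exchange_weak() stands in
for atomic_try_cmpxchg(), and a boolean array with a linear scan stands
in for the cid bitmask and the "find the first available CID" fallback.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_CIDS 64

/* Stand-in for the mm-wide state used by __mm_cid_try_get(). */
struct mm_model {
	atomic_int max_nr_cid;            /* highest number of cids handed out */
	atomic_int nr_cpus_allowed;       /* weight of the allowed CPU mask */
	atomic_int mm_users;              /* threads sharing the mm */
	atomic_bool cid_in_use[MAX_CIDS]; /* models the per-mm cid bitmask */
};

/* Returns the previous state of the bit, setting it in either case. */
static bool test_and_set_cid(struct mm_model *mm, int cid)
{
	return atomic_exchange(&mm->cid_in_use[cid], true);
}

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Simplified allocation path; recent_cid is passed in for brevity. */
static int try_get_cid(struct mm_model *mm, int recent_cid)
{
	int cid, max_nr_cid, allowed_max_nr_cid;

	/* Change 1: clamp max_nr_cid to min(nr_cpus_allowed, mm_users). */
	max_nr_cid = atomic_load(&mm->max_nr_cid);
	for (;;) {
		allowed_max_nr_cid = min_int(atomic_load(&mm->nr_cpus_allowed),
					     atomic_load(&mm->mm_users));
		if (max_nr_cid <= allowed_max_nr_cid)
			break;
		/* On failure, max_nr_cid is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&mm->max_nr_cid, &max_nr_cid,
						 allowed_max_nr_cid)) {
			max_nr_cid = allowed_max_nr_cid;
			break;
		}
	}

	/* Change 2: re-use recent_cid only if it respects the new bound. */
	if (recent_cid >= 0 && recent_cid < max_nr_cid &&
	    !test_and_set_cid(mm, recent_cid))
		return recent_cid;

	/* Expand allocation while it stays below the allowed maximum. */
	cid = max_nr_cid;
	while (cid < atomic_load(&mm->nr_cpus_allowed) &&
	       cid < atomic_load(&mm->mm_users)) {
		if (!atomic_compare_exchange_weak(&mm->max_nr_cid, &cid, cid + 1))
			continue;	/* cid now holds the reloaded value */
		if (!test_and_set_cid(mm, cid))
			return cid;
	}

	/* Else find the first available cid. */
	for (cid = 0; cid < MAX_CIDS; cid++)
		if (!test_and_set_cid(mm, cid))
			return cid;
	return -1;
}

int main(void)
{
	struct mm_model mm = { 0 };

	atomic_store(&mm.max_nr_cid, 8);      /* 8 cids were in use earlier */
	atomic_store(&mm.nr_cpus_allowed, 2); /* affinity reduced to 2 CPUs */
	atomic_store(&mm.mm_users, 2);        /* only 2 threads remain */

	/* The stale recent_cid 5 exceeds the clamped bound: not re-used. */
	printf("got cid %d\n", try_get_cid(&mm, 5)); /* prints "got cid 0" */
	return 0;
}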

Comments

Mathieu Desnoyers Dec. 13, 2024, 2:05 p.m. UTC | #1
On 2024-12-13 04:54, Gabriele Monaco wrote:
> From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 
> When a process reduces its number of threads or clears bits in its CPU
> affinity mask, the mm_cid allocation should eventually converge towards
> smaller values.

I am targeting v6.13 for this patch. As it fixes a commit that was
recently introduced in v6.13-rc1, I would be tempted to place this
patch early in your series (as the first patch).

Then the more elaborate change from task work to mm delayed work
can follow, and then the added selftest.

The reason for placing your change second is that I am not sure we
need to backport it as a fix.

Thanks,

Mathieu



Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8a76a1c09234..16076e70a6b9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -837,10 +837,11 @@  struct mm_struct {
 		 */
 		unsigned int nr_cpus_allowed;
 		/**
-		 * @max_nr_cid: Maximum number of concurrency IDs allocated.
+		 * @max_nr_cid: Maximum number of allowed concurrency
+		 *              IDs allocated.
 		 *
-		 * Track the highest number of concurrency IDs allocated for the
-		 * mm.
+		 * Track the highest number of allowed concurrency IDs
+		 * allocated for the mm.
 		 */
 		atomic_t max_nr_cid;
 		/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 21be461ff913..f3b0d1d86622 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3652,10 +3652,28 @@  static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
 	struct cpumask *cidmask = mm_cidmask(mm);
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid = __this_cpu_read(pcpu_cid->recent_cid);
+	int cid, max_nr_cid, allowed_max_nr_cid;
 
+	/*
+	 * After shrinking the number of threads or reducing the number
+	 * of allowed cpus, reduce the value of max_nr_cid so expansion
+	 * of cid allocation will preserve cache locality if the number
+	 * of threads or allowed cpus increase again.
+	 */
+	max_nr_cid = atomic_read(&mm->max_nr_cid);
+	while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed),
+					   atomic_read(&mm->mm_users))),
+	       max_nr_cid > allowed_max_nr_cid) {
+		/* atomic_try_cmpxchg loads previous mm->max_nr_cid into max_nr_cid. */
+		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid)) {
+			max_nr_cid = allowed_max_nr_cid;
+			break;
+		}
+	}
 	/* Try to re-use recent cid. This improves cache locality. */
-	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+	cid = __this_cpu_read(pcpu_cid->recent_cid);
+	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
+	    !cpumask_test_and_set_cpu(cid, cidmask))
 		return cid;
 	/*
 	 * Expand cid allocation if the maximum number of concurrency
@@ -3663,8 +3681,9 @@  static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 	 * and number of threads. Expanding cid allocation as much as
 	 * possible improves cache locality.
 	 */
-	cid = atomic_read(&mm->max_nr_cid);
+	cid = max_nr_cid;
 	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
+		/* atomic_try_cmpxchg loads previous mm->max_nr_cid into cid. */
 		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
 			continue;
 		if (!cpumask_test_and_set_cpu(cid, cidmask))
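
The two comments added in the hunk above ("atomic_try_cmpxchg loads
previous mm->max_nr_cid into ...") rely on the convention that a failed
compare-and-exchange writes the current value of the atomic back into
the local variable, so the loop can simply retry (or continue) without
an explicit re-read. A standalone C11 analogue of that behaviour, with
made-up values and atomic_compare_exchange_strong() standing in for
atomic_try_cmpxchg():

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
	atomic_int max_nr_cid = 7; /* another CPU already raised the bound */
	int cid = 4;               /* stale snapshot held by this CPU */

	/* Fails because 4 != 7; cid is updated with the current value 7. */
	if (!atomic_compare_exchange_strong(&max_nr_cid, &cid, cid + 1))
		printf("cmpxchg failed, cid reloaded to %d\n", cid); /* 7 */

	/* Retrying with the refreshed snapshot succeeds: 7 becomes 8. */
	if (atomic_compare_exchange_strong(&max_nr_cid, &cid, cid + 1))
		printf("max_nr_cid advanced to %d\n",
		       atomic_load(&max_nr_cid)); /* 8 */
	return 0;
}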