Message ID | 20241205083110.180134-2-gmonaco@redhat.com (mailing list archive)
---|---
State | New
Series | sched: Move task_mm_cid_work to mm delayed work
On 2024-12-05 09:33, Gabriele Monaco wrote:
> The patch is fundamentally broken since I somehow lost the line calling
> schedule_delayed_work in task_mm_cid_work to re-schedule itself.

Yes, I was puzzled about it when reviewing your patch.

> Before sending a V2, however, I'd like to get some more insights about
> the requirements of this function.
>
> The current behaviour upstream is to call task_mm_cid_work for the task
> running after the scheduler tick. The function checks that we don't run
> too often for the same mm, but it seems possible that some process with
> short runtime would rarely run during the tick.

So your concern is about a mm with threads running in short bursts, and
those would happen to rarely run while the tick interrupt is triggered.
We may indeed be missing something here, because the goal is to ensure
that we periodically do the task_mm_cid_work for each mm.

The side-effect of missing this work is not compacting the mm_cid
allocation cpumask. It won't cause rseq to fail per se, but it will
cause the mm_cid allocation to be less compact than it should be.

> The behaviour imposed by this patch (at least the intended one) is to
> run the task_mm_cid_work with the configured periodicity (plus
> scheduling latency) for each active mm.

What you propose looks like a more robust design than running under
the tick.

> This behaviour seem to me more predictable, but would that even be
> required for rseq or is it just an overkill?

Your approach looks more robust, so I would be tempted to introduce
it as a fix. Is the space/runtime overhead similar between the
tick/task work approach vs yours ?

> In other words, was the tick chosen out of simplicity or is there some
> property that has to be preserved?

Out of simplicity, and "do like what NUMA has done". But I am not
particularly attached to it. :-)

> P.S. I run the rseq self tests on both this and the previous patch
> (both broken) and saw no failure.

That's expected, because the tests do not so much depend on the
compactness of the mm_cid allocation. The way I validated this in the
past is by creating a simple multi-threaded program that periodically
prints the current mm_cid from userspace, and sleep for a few seconds
between printing, from many threads on a many-core system.

Then see how it reacts when run: are the mm_cid close to 0, or are
there large values of mm_cid allocated without compaction over time ?
I have not found a good way to translate this into an automated test
though. Ideas are welcome.

You can look at the librseq basic_test as a starting point. [1]

Thanks,

Mathieu

[1] https://github.com/compudj/librseq/blob/master/tests/basic_test.c

> Thanks,
> Gabriele
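A minimal sketch (not taken from this thread) of the kind of userspace mm_cid read such a program needs. It assumes glibc >= 2.35, which registers the rseq area and exports __rseq_offset/__rseq_size, a compiler providing __builtin_thread_pointer(), and a kernel new enough to expose the mm_cid field of struct rseq (it stays 0 otherwise); librseq wraps the same mechanism.

/*
 * Sketch: read the current mm_cid from userspace via the glibc-registered
 * rseq area. The read_mm_cid() helper name is only for illustration.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <linux/rseq.h>

extern ptrdiff_t __rseq_offset;		/* exported by glibc >= 2.35 */
extern unsigned int __rseq_size;	/* exported by glibc >= 2.35 */

static inline struct rseq *rseq_area(void)
{
	/* The rseq area lives at a fixed offset from the thread pointer. */
	return (struct rseq *)((char *)__builtin_thread_pointer() + __rseq_offset);
}

static inline int read_mm_cid(void)
{
	/* Make sure the registered area is large enough to contain mm_cid. */
	if (__rseq_size < offsetof(struct rseq, mm_cid) + sizeof(uint32_t))
		return -1;
	return (int)*(volatile uint32_t *)&rseq_area()->mm_cid;
}

int main(void)
{
	printf("mm_cid: %d\n", read_mm_cid());
	return 0;
}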
On Thu, 2024-12-05 at 11:25 -0500, Mathieu Desnoyers wrote:
> On 2024-12-05 09:33, Gabriele Monaco wrote:
> >
> > Before sending a V2, however, I'd like to get some more insights
> > about the requirements of this function.
> >
> > The current behaviour upstream is to call task_mm_cid_work for the
> > task running after the scheduler tick. The function checks that we
> > don't run too often for the same mm, but it seems possible that some
> > process with short runtime would rarely run during the tick.
>
> So your concern is about a mm with threads running in short bursts,
> and those would happen to rarely run while the tick interrupt is
> triggered. We may indeed be missing something here, because the goal
> is to ensure that we periodically do the task_mm_cid_work for each
> mm.
>
> The side-effect of missing this work is not compacting the
> mm_cid allocation cpumask. It won't cause rseq to fail per se,
> but it will cause the mm_cid allocation to be less compact than
> it should be.

Yes, that was exactly the case, tasks like timerlat/cyclictest running
periodically but doing very short work. Makes sense, now it's much
clearer.

> > The behaviour imposed by this patch (at least the intended one) is
> > to run the task_mm_cid_work with the configured periodicity (plus
> > scheduling latency) for each active mm.
>
> What you propose looks like a more robust design than running under
> the tick.
>
> > This behaviour seem to me more predictable, but would that even be
> > required for rseq or is it just an overkill?
>
> Your approach looks more robust, so I would be tempted to introduce
> it as a fix. Is the space/runtime overhead similar between the
> tick/task work approach vs yours ?

I'm going to fix the implementation and come up with some runtime stats
to compare the overhead of both methods.
As for the space overhead, I think I can answer this question already:
* The current approach uses a callback_head per thread (16 bytes)
* Mine relies on a delayed work per mm (88 bytes)

Tasks with 5 threads or less have lower memory footprint with the
current approach.
I checked quickly on some systems I have access to and I'd say my
approach introduces some memory overhead on an average system, but
considering a task_struct can be 7-13 kB and an mm_struct is about 1.4
kB, the overhead should be acceptable.

> > In other words, was the tick chosen out of simplicity or is there
> > some property that has to be preserved?
>
> Out of simplicity, and "do like what NUMA has done". But I am not
> particularly attached to it. :-)
>
> > P.S. I run the rseq self tests on both this and the previous patch
> > (both broken) and saw no failure.
>
> That's expected, because the tests do not so much depend on the
> compactness of the mm_cid allocation. The way I validated this
> in the past is by creating a simple multi-threaded program that
> periodically prints the current mm_cid from userspace, and
> sleep for a few seconds between printing, from many threads on
> a many-core system.
>
> Then see how it reacts when run: are the mm_cid close to 0, or
> are there large values of mm_cid allocated without compaction
> over time ? I have not found a good way to translate this into
> an automated test though. Ideas are welcome.
>
> You can look at the librseq basic_test as a starting point. [1]

Perfect, will try those!

Thanks,
Gabriele
On 2024-12-06 03:53, Gabriele Monaco wrote:
> On Thu, 2024-12-05 at 11:25 -0500, Mathieu Desnoyers wrote:

[...]

>>> The behaviour imposed by this patch (at least the intended one) is
>>> to run the task_mm_cid_work with the configured periodicity (plus
>>> scheduling latency) for each active mm.
>>
>> What you propose looks like a more robust design than running under
>> the tick.
>>
>>> This behaviour seem to me more predictable, but would that even be
>>> required for rseq or is it just an overkill?
>>
>> Your approach looks more robust, so I would be tempted to introduce
>> it as a fix. Is the space/runtime overhead similar between the
>> tick/task work approach vs yours ?
>
> I'm going to fix the implementation and come up with some runtime stats
> to compare the overhead of both methods.
> As for the space overhead, I think I can answer this question already:
> * The current approach uses a callback_head per thread (16 bytes)
> * Mine relies on a delayed work per mm (88 bytes)
>
> Tasks with 5 threads or less have lower memory footprint with the
> current approach.
> I checked quickly on some systems I have access to and I'd say my
> approach introduces some memory overhead on an average system, but
> considering a task_struct can be 7-13 kB and an mm_struct is about 1.4
> kB, the overhead should be acceptable.

ok!

>>> In other words, was the tick chosen out of simplicity or is there
>>> some property that has to be preserved?
>>
>> Out of simplicity, and "do like what NUMA has done". But I am not
>> particularly attached to it. :-)
>>
>>> P.S. I run the rseq self tests on both this and the previous patch
>>> (both broken) and saw no failure.
>>
>> That's expected, because the tests do not so much depend on the
>> compactness of the mm_cid allocation. The way I validated this
>> in the past is by creating a simple multi-threaded program that
>> periodically prints the current mm_cid from userspace, and
>> sleep for a few seconds between printing, from many threads on
>> a many-core system.
>>
>> Then see how it reacts when run: are the mm_cid close to 0, or
>> are there large values of mm_cid allocated without compaction
>> over time ? I have not found a good way to translate this into
>> an automated test though. Ideas are welcome.
>>
>> You can look at the librseq basic_test as a starting point. [1]
>
> Perfect, will try those!

Thinking back on this, you'll want a program that does the following
on a system with N CPUs:

- Phase 1: run one thread per cpu, pinned on each cpu. Print the
  mm_cid from each thread with the cpu number every second or so.

- Exit all threads except the main thread, join them from the main
  thread,

- Phase 2: the program is now single-threaded. We'd expect the
  mm_cid value to converge towards 0 as the periodic task clears
  unused CIDs.

So I think in phase 2 we can have an actual automated test: If after
an order of magnitude more time than the 100ms delay between periodic
tasks we still observe mm_cid > 0 in phase 2, then something is
wrong.

Thoughts ?

Mathieu
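A rough C skeleton (not taken from this thread) of the two phases described above, assuming the read_mm_cid() helper sketched earlier; the 100ms figure mirrors MM_CID_SCAN_DELAY and the margin is arbitrary. The follow-ups below refine this protocol, since the main thread may already hold mm_cid 0 and pass trivially.

/* Sketch of the phase 1 / phase 2 mm_cid compaction check. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int read_mm_cid(void);	/* helper from the earlier sketch (assumed) */

static void *worker(void *arg)
{
	int cpu = (int)(long)arg;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
	/* Phase 1: report what this pinned thread observes. */
	printf("cpu %d: mm_cid %d\n", cpu, read_mm_cid());
	sleep(1);
	return NULL;
}

int main(void)
{
	cpu_set_t allowed;
	pthread_t tids[CPU_SETSIZE];
	int cpus[CPU_SETSIZE], nr_cpus = 0, i;

	sched_getaffinity(0, sizeof(allowed), &allowed);
	for (i = 0; i < CPU_SETSIZE; i++)
		if (CPU_ISSET(i, &allowed))
			cpus[nr_cpus++] = i;

	/* Phase 1: one thread per allowed CPU, then join them all. */
	for (i = 0; i < nr_cpus; i++)
		pthread_create(&tids[i], NULL, worker, (void *)(long)cpus[i]);
	for (i = 0; i < nr_cpus; i++)
		pthread_join(tids[i], NULL);

	/*
	 * Phase 2: only the main thread remains. Wait well beyond the
	 * 100ms scan delay and expect the remaining mm_cid to be 0.
	 */
	for (i = 0; i < 10; i++) {
		if (read_mm_cid() == 0)
			return 0;
		usleep(200 * 1000);
	}
	fprintf(stderr, "mm_cid still %d, compaction did not happen\n",
		read_mm_cid());
	return 1;
}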
> Thinking back on this, you'll want a program that does the following
> on a system with N CPUs:
>
> - Phase 1: run one thread per cpu, pinned on each cpu. Print the
>   mm_cid from each thread with the cpu number every second or so.
>
> - Exit all threads except the main thread, join them from the main
>   thread,
>
> - Phase 2: the program is now single-threaded. We'd expect the
>   mm_cid value to converge towards 0 as the periodic task clears
>   unused CIDs.
>
> So I think in phase 2 we can have an actual automated test: If after
> an order of magnitude more time than the 100ms delay between periodic
> tasks we still observe mm_cid > 0 in phase 2, then something is
> wrong.

Been thinking about this and came up with a simple draft, I'll probably
send it as a separate patch.

Doing this can lead to false positives: the main thread may be assigned
the mm_cid 0 and keep it till the end, in this scenario the other
threads (CPUs) would get different mm_cids and exit, the main thread
will still have 0 and pass the test regardless.

I have an idea to make it a bit more robust: we can run threads as you
described in phase 1, stop all but one (let's say the one running on
the last core), make sure the main thread doesn't accidentally run on
the same core by pinning to core 0 and wait until we see the 2
remaining threads holding 0 and 1, in any order.
Besides a special case if we have only 1 available core, this should
work fine, sure we could get false positives but it seems to me much
less likely.

Does it make sense to you?

Gabriele
On 2024-12-09 08:45, Gabriele Monaco wrote:
>
>> Thinking back on this, you'll want a program that does the following
>> on a system with N CPUs:
>>
>> - Phase 1: run one thread per cpu, pinned on each cpu. Print the
>>   mm_cid from each thread with the cpu number every second or so.
>>
>> - Exit all threads except the main thread, join them from the main
>>   thread,
>>
>> - Phase 2: the program is now single-threaded. We'd expect the
>>   mm_cid value to converge towards 0 as the periodic task clears
>>   unused CIDs.
>>
>> So I think in phase 2 we can have an actual automated test: If after
>> an order of magnitude more time than the 100ms delay between periodic
>> tasks we still observe mm_cid > 0 in phase 2, then something is
>> wrong.
>
> Been thinking about this and came up with a simple draft, I'll probably
> send it as a separate patch.
>
> Doing this can lead to false positives: the main thread may be assigned
> the mm_cid 0 and keep it till the end, in this scenario the other
> threads (CPUs) would get different mm_cids and exit, the main thread
> will still have 0 and pass the test regardless.
>
> I have an idea to make it a bit more robust: we can run threads as you
> described in phase 1, stop all but one (let's say the one running on
> the last core), make sure the main thread doesn't accidentally run on
> the same core by pinning to core 0 and wait until we see the 2
> remaining threads holding 0 and 1, in any order.
> Besides a special case if we have only 1 available core, this should
> work fine, sure we could get false positives but it seems to me much
> less likely.
>
> Does it make sense to you?

A small tweak on your proposed approach: in phase 1, get each thread
to publish which mm_cid they observe, and select one thread which
has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
that will keep running in phase 2 (in addition to the main thread).

All threads other than the main thread and that selected thread exit
and are joined before phase 2.

So you end up in phase 2 with:

- main (observed any mm_cid)
- selected thread (observed mm_cid > 1, possibly largest)

Then after a while, the selected thread should observe a
mm_cid <= 1.

This test should be skipped if there are less than 3 CPUs in
allowed cpumask (sched_getaffinity).

Thoughts ?

Thanks,

Mathieu
On 2024-12-09 10:33, Mathieu Desnoyers wrote:
> On 2024-12-09 08:45, Gabriele Monaco wrote:
>>
>>> Thinking back on this, you'll want a program that does the following
>>> on a system with N CPUs:
>>>
>>> - Phase 1: run one thread per cpu, pinned on each cpu. Print the
>>>   mm_cid from each thread with the cpu number every second or so.
>>>
>>> - Exit all threads except the main thread, join them from the main
>>>   thread,
>>>
>>> - Phase 2: the program is now single-threaded. We'd expect the
>>>   mm_cid value to converge towards 0 as the periodic task clears
>>>   unused CIDs.
>>>
>>> So I think in phase 2 we can have an actual automated test: If after
>>> an order of magnitude more time than the 100ms delay between periodic
>>> tasks we still observe mm_cid > 0 in phase 2, then something is
>>> wrong.
>>
>> Been thinking about this and came up with a simple draft, I'll probably
>> send it as a separate patch.
>>
>> Doing this can lead to false positives: the main thread may be assigned
>> the mm_cid 0 and keep it till the end, in this scenario the other
>> threads (CPUs) would get different mm_cids and exit, the main thread
>> will still have 0 and pass the test regardless.
>>
>> I have an idea to make it a bit more robust: we can run threads as you
>> described in phase 1, stop all but one (let's say the one running on
>> the last core), make sure the main thread doesn't accidentally run on
>> the same core by pinning to core 0 and wait until we see the 2
>> remaining threads holding 0 and 1, in any order.
>> Besides a special case if we have only 1 available core, this should
>> work fine, sure we could get false positives but it seems to me much
>> less likely.
>>
>> Does it make sense to you?
>
> A small tweak on your proposed approach: in phase 1, get each thread
> to publish which mm_cid they observe, and select one thread which
> has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
> that will keep running in phase 2 (in addition to the main thread).
>
> All threads other than the main thread and that selected thread exit
> and are joined before phase 2.
>
> So you end up in phase 2 with:
>
> - main (observed any mm_cid)
> - selected thread (observed mm_cid > 1, possibly largest)
>
> Then after a while, the selected thread should observe a
> mm_cid <= 1.
>
> This test should be skipped if there are less than 3 CPUs in
> allowed cpumask (sched_getaffinity).

Even better:

For a sched_getaffinity with N cpus:

- If N == 1 -> skip (we cannot validate anything)

Phase 1: create N - 1 pthreads, each pinned to a CPU. main thread
also pinned to a cpu.

Publish the mm_cids observed by each thread, including main thread.

Select a new leader for phase 2: a thread which has observed nonzero
mm_cid. Each other thread including possibly main thread issue
pthread_exit, and the new leader does pthread join on each other.

Then check that the new leader eventually observe mm_cid == 0.

And it works with an allowed cpu mask that has only 2 cpus.

Thanks,

Mathieu
On Mon, 2024-12-09 at 10:48 -0500, Mathieu Desnoyers wrote:
> On 2024-12-09 10:33, Mathieu Desnoyers wrote:
> > A small tweak on your proposed approach: in phase 1, get each thread
> > to publish which mm_cid they observe, and select one thread which
> > has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
> > that will keep running in phase 2 (in addition to the main thread).
> >
> > All threads other than the main thread and that selected thread exit
> > and are joined before phase 2.
> >
> > So you end up in phase 2 with:
> >
> > - main (observed any mm_cid)
> > - selected thread (observed mm_cid > 1, possibly largest)
> >
> > Then after a while, the selected thread should observe a
> > mm_cid <= 1.
> >
> > This test should be skipped if there are less than 3 CPUs in
> > allowed cpumask (sched_getaffinity).
>
> Even better:
>
> For a sched_getaffinity with N cpus:
>
> - If N == 1 -> skip (we cannot validate anything)
>
> Phase 1: create N - 1 pthreads, each pinned to a CPU. main thread
> also pinned to a cpu.
>
> Publish the mm_cids observed by each thread, including main thread.
>
> Select a new leader for phase 2: a thread which has observed nonzero
> mm_cid. Each other thread including possibly main thread issue
> pthread_exit, and the new leader does pthread join on each other.
>
> Then check that the new leader eventually observe mm_cid == 0.
>
> And it works with an allowed cpu mask that has only 2 cpus.

Sounds even neater, thanks for the tips, I'll try this last one out!

Coming back to the implementation, I have been trying to validate my
approach with this test, wrapped my head around it, and found out that
the test can't actually pass on the latest upstream.

When an mm_cid is lazy dropped to compact the mask, it is again
re-assigned while switching in.
The change introduced in "sched: Improve cache locality of RSEQ
concurrency IDs for intermittent workloads" adds a recent_cid and it
seems it is never unset during the test (nothing migrates).

Now, I'm still running my first version of the test, so I have a thread
running on CPU0 with mm_cid=0 and another running on CPU127 with
mm_cid, say, 127 (weight=2).
In practice, the test is expecting 127 to be dropped (>2) but this is
not the case since 127 could exhibit better cache locality, so it is
selected on the next round.

Here's where I'm in doubt, is a compact map more desirable than reusing
the same mm_cids for cache locality?
If not, should we perhaps ignore the recent_cid if it's larger than the
map weight?
It seems the only way the recent_cid is unset is with migrations, but
I'm not sure if forcing one would make the test vain as the cid could
be dropped outside of task_mm_cid_work.

What do you think?

Thanks,
Gabriele
On 2024-12-11 07:27, Gabriele Monaco wrote:
>
> On Mon, 2024-12-09 at 10:48 -0500, Mathieu Desnoyers wrote:
>> On 2024-12-09 10:33, Mathieu Desnoyers wrote:
>>> A small tweak on your proposed approach: in phase 1, get each thread
>>> to publish which mm_cid they observe, and select one thread which
>>> has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
>>> that will keep running in phase 2 (in addition to the main thread).
>>>
>>> All threads other than the main thread and that selected thread exit
>>> and are joined before phase 2.
>>>
>>> So you end up in phase 2 with:
>>>
>>> - main (observed any mm_cid)
>>> - selected thread (observed mm_cid > 1, possibly largest)
>>>
>>> Then after a while, the selected thread should observe a
>>> mm_cid <= 1.
>>>
>>> This test should be skipped if there are less than 3 CPUs in
>>> allowed cpumask (sched_getaffinity).
>>
>> Even better:
>>
>> For a sched_getaffinity with N cpus:
>>
>> - If N == 1 -> skip (we cannot validate anything)
>>
>> Phase 1: create N - 1 pthreads, each pinned to a CPU. main thread
>> also pinned to a cpu.
>>
>> Publish the mm_cids observed by each thread, including main thread.
>>
>> Select a new leader for phase 2: a thread which has observed nonzero
>> mm_cid. Each other thread including possibly main thread issue
>> pthread_exit, and the new leader does pthread join on each other.
>>
>> Then check that the new leader eventually observe mm_cid == 0.
>>
>> And it works with an allowed cpu mask that has only 2 cpus.
>
> Sounds even neater, thanks for the tips, I'll try this last one out!
>
> Coming back to the implementation, I have been trying to validate my
> approach with this test, wrapped my head around it, and found out that
> the test can't actually pass on the latest upstream.
>
> When an mm_cid is lazy dropped to compact the mask, it is again
> re-assigned while switching in.
> The change introduced in "sched: Improve cache locality of RSEQ
> concurrency IDs for intermittent workloads" adds a recent_cid and it
> seems it is never unset during the test (nothing migrates).

Good point!

> Now, I'm still running my first version of the test, so I have a thread
> running on CPU0 with mm_cid=0 and another running on CPU127 with
> mm_cid, say, 127 (weight=2).
> In practice, the test is expecting 127 to be dropped (>2) but this is
> not the case since 127 could exhibit better cache locality, so it is
> selected on the next round.

Understood.

> Here's where I'm in doubt, is a compact map more desirable than reusing
> the same mm_cids for cache locality?

This is a tradeoff between:

A) Preserving cache locality after a transition from many threads to few
   threads, or after reducing the hamming weight of the allowed cpu mask.

B) Making the mm_cid guarantees wrt nr threads and allowed cpu mask easy
   to document and understand.

C) Allowing applications to eventually react to mm_cid compaction after
   reduction of the nr threads or allowed cpu mask, making the tracking
   of mm_cid compaction easier by shrinking it back towards 0 or not.

D) Making sure applications that periodically reduce and then increase
   again the nr threads or allowed cpu mask still benefit from good
   cache locality with mm_cid.

> If not, should we perhaps ignore the recent_cid if it's larger than the
> map weight?
> It seems the only way the recent_cid is unset is with migrations, but
> I'm not sure if forcing one would make the test vain as the cid could
> be dropped outside of task_mm_cid_work.
>
> What do you think?

Can you try this patch ? (compile-tested only)

commit 500649e03c5c28443f431829732c580750657326
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Wed Dec 11 11:53:01 2024 -0500

    sched: shrink mm_cid allocation with nr thread/affinity

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..b92e79770a93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3657,10 +3657,24 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
 	struct cpumask *cidmask = mm_cidmask(mm);
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid = __this_cpu_read(pcpu_cid->recent_cid);
+	int cid, max_nr_cid, allowed_max_nr_cid;
 
+	/*
+	 * After shrinking the number of threads or reducing the number
+	 * of allowed cpus, reduce the value of max_nr_cid so expansion
+	 * of cid allocation will preserve cache locality if the number
+	 * of threads or allowed cpus increase again.
+	 */
+	max_nr_cid = atomic_read(&mm->max_nr_cid);
+	while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed), atomic_read(&mm->mm_users))),
+	       max_nr_cid > allowed_max_nr_cid) {
+		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid))
+			break;
+	}
 	/* Try to re-use recent cid. This improves cache locality. */
-	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+	cid = __this_cpu_read(pcpu_cid->recent_cid);
+	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
+	    !cpumask_test_and_set_cpu(cid, cidmask))
 		return cid;
 	/*
 	 * Expand cid allocation if the maximum number of concurrency
@@ -3668,12 +3682,11 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 	 * and number of threads. Expanding cid allocation as much as
 	 * possible improves cache locality.
 	 */
-	cid = atomic_read(&mm->max_nr_cid);
-	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
-		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+	while (max_nr_cid < allowed_max_nr_cid) {
+		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, max_nr_cid + 1))
 			continue;
-		if (!cpumask_test_and_set_cpu(cid, cidmask))
-			return cid;
+		if (!cpumask_test_and_set_cpu(max_nr_cid, cidmask))
+			return max_nr_cid;
 	}
 	/*
 	 * Find the first available concurrency id.
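As a side note, here is a small userspace C11 model (not the kernel code above) of the clamping idea in that patch: a compare-exchange retry loop lowers a shared maximum to min(nr_cpus_allowed, mm_users), and a recent cid is only reused when it falls below that clamped maximum. Names mirror the kernel ones but are only illustrative.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int max_nr_cid;	/* models mm->max_nr_cid */

/* Lower max_nr_cid to allowed_max if it is currently above it. On a failed
 * compare-exchange, cur is reloaded with the latest value and the check is
 * redone, mirroring the atomic_try_cmpxchg() loop in the patch. */
static void clamp_max_nr_cid(int allowed_max)
{
	int cur = atomic_load(&max_nr_cid);

	while (cur > allowed_max &&
	       !atomic_compare_exchange_weak(&max_nr_cid, &cur, allowed_max))
		;
}

/* A recent cid is only reused when it lies below the clamped maximum, so
 * the cid range shrinks back after threads or allowed cpus are reduced. */
static bool may_reuse_recent_cid(int recent_cid, int nr_cpus_allowed, int mm_users)
{
	int allowed_max = nr_cpus_allowed < mm_users ? nr_cpus_allowed : mm_users;

	clamp_max_nr_cid(allowed_max);
	return recent_cid >= 0 && recent_cid < atomic_load(&max_nr_cid);
}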
On Wed, 2024-12-11 at 12:07 -0500, Mathieu Desnoyers wrote:
> > Here's where I'm in doubt, is a compact map more desirable than
> > reusing the same mm_cids for cache locality?
>
> This is a tradeoff between:

[...]

> Can you try this patch ? (compile-tested only)
>
> commit 500649e03c5c28443f431829732c580750657326
> Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Date:   Wed Dec 11 11:53:01 2024 -0500
>
>     sched: shrink mm_cid allocation with nr thread/affinity

[...]

Thanks for the patch, it seems much more robust than my simple
condition on the weight. It passes the test (both versions) we
previously discussed and doesn't seem to interfere with the general
rseq functionality as checked by the other selftests.
I'm not sure if I should run more tests on this one.

I will come up with a V2 shortly and attach some performance
evaluations.

Do you want to keep your patch separate or do I submit it together with
V2?

Thanks,
Gabriele
On 2024-12-12 06:09, Gabriele Monaco wrote:
>
> On Wed, 2024-12-11 at 12:07 -0500, Mathieu Desnoyers wrote:

[...]

> Thanks for the patch, it seems much more robust than my simple
> condition on the weight. It passes the test (both versions) we
> previously discussed and doesn't seem to interfere with the general
> rseq functionality as checked by the other selftests.

Great!

> I'm not sure if I should run more tests on this one.

The other thing I'd be interested in is to see if those patches
introduce any performance regression (e.g. the will-it-scale tests).
If you have spare cycles to try this out, that would be good, else
we can let the test bots complain.

> I will come up with a V2 shortly and attach some performance
> evaluations.

OK

> Do you want to keep your patch separate or do I submit it together with
> V2?

Let me prepare a proper patch with commit message, and then feel free
to add it into your series, so it benefits from your testing.

Thanks,

Mathieu

> Thanks,
> Gabriele
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..38c567f06dce 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -856,6 +856,7 @@ struct mm_struct {
 		 * mm nr_cpus_allowed updates.
 		 */
 		raw_spinlock_t cpus_allowed_lock;
+		struct delayed_work mm_cid_work;
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
@@ -1149,6 +1150,8 @@ enum mm_cid_state {
 	MM_CID_LAZY_PUT = (1U << 31),
 };
 
+extern void task_mm_cid_work(struct work_struct *work);
+
 static inline bool mm_cid_is_unset(int cid)
 {
 	return cid == MM_CID_UNSET;
@@ -1221,12 +1224,14 @@ static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *
 	if (!mm->pcpu_cid)
 		return -ENOMEM;
 	mm_init_cid(mm, p);
+	INIT_DELAYED_WORK(&mm->mm_cid_work, task_mm_cid_work);
 	return 0;
 }
 #define mm_alloc_cid(...)	alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))
 
 static inline void mm_destroy_cid(struct mm_struct *mm)
 {
+	disable_delayed_work(&mm->mm_cid_work);
 	free_percpu(mm->pcpu_cid);
 	mm->pcpu_cid = NULL;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a519..0c3a778c9cb5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5654,7 +5654,6 @@ void sched_tick(void)
 	resched_latency = cpu_resched_latency(rq);
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
-	task_tick_mm_cid(rq, donor);
 	scx_tick(rq);
 
 	rq_unlock(rq, &rf);
@@ -10520,22 +10519,14 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
 	sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
 }
 
-static void task_mm_cid_work(struct callback_head *work)
+void task_mm_cid_work(struct work_struct *work)
 {
 	unsigned long now = jiffies, old_scan, next_scan;
-	struct task_struct *t = current;
 	struct cpumask *cidmask;
-	struct mm_struct *mm;
+	struct delayed_work *delayed_work = container_of(work, struct delayed_work, work);
+	struct mm_struct *mm = container_of(delayed_work, struct mm_struct, mm_cid_work);
 	int weight, cpu;
 
-	SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
-
-	work->next = work;	/* Prevent double-add */
-	if (t->flags & PF_EXITING)
-		return;
-	mm = t->mm;
-	if (!mm)
-		return;
 	old_scan = READ_ONCE(mm->mm_cid_next_scan);
 	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	if (!old_scan) {
@@ -10571,26 +10562,12 @@ void init_sched_mm_cid(struct task_struct *t)
 
 	if (mm) {
 		mm_users = atomic_read(&mm->mm_users);
-		if (mm_users == 1)
+		if (mm_users == 1) {
 			mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
+			schedule_delayed_work(&mm->mm_cid_work,
+					      msecs_to_jiffies(MM_CID_SCAN_DELAY));
+		}
 	}
-	t->cid_work.next = &t->cid_work;	/* Protect against double add */
-	init_task_work(&t->cid_work, task_mm_cid_work);
-}
-
-void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
-{
-	struct callback_head *work = &curr->cid_work;
-	unsigned long now = jiffies;
-
-	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
-	    work->next != work)
-		return;
-	if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
-		return;
-
-	/* No page allocation under rq lock */
-	task_work_add(curr, work, TWA_RESUME | TWAF_NO_ALLOC);
 }
 
 void sched_mm_cid_exit_signals(struct task_struct *t)
Currently, the task_mm_cid_work function is called in a task work
triggered by a scheduler tick. This can delay the execution of the task
for the entire duration of the function, negatively affecting the
response of real time tasks.

This patch runs the task_mm_cid_work in a new delayed work connected to
the mm_struct rather than in the task context before returning to
userspace. This delayed work is initialised while allocating the mm and
disabled before freeing it; its execution is no longer triggered by
scheduler ticks, but runs periodically based on the defined
MM_CID_SCAN_DELAY.

The main advantage of this change is that the function can be offloaded
to a different CPU and even preempted by RT tasks. On a busy system,
this may mean the function gets called less often, but the current
behaviour already doesn't provide guarantees.
Moreover, this new behaviour could be more predictable in some
situations since the delayed work is always scheduled with the same
periodicity for each mm.

This deprecates the "sched: improve task_mm_cid_work impact on isolated
systems" patches by dropping the first patch and using a workqueue
instead of RCU callbacks.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/lkml/20241202140735.56368-1-gmonaco@redhat.com/
---
 include/linux/mm_types.h |  5 +++++
 kernel/sched/core.c      | 37 +++++++------------------------------
 2 files changed, 12 insertions(+), 30 deletions(-)


base-commit: feffde684ac29a3b7aec82d2df850fbdbdee55e4