Message ID | 20241205083110.180134-2-gmonaco@redhat.com (mailing list archive)
---|---
State | New
Series | sched: Move task_mm_cid_work to mm delayed work
On 2024-12-05 09:33, Gabriele Monaco wrote:
> The patch is fundamentally broken since I somehow lost the line calling
> schedule_delayed_work in task_mm_cid_work to re-schedule itself.

Yes, I was puzzled about it when reviewing your patch.

> Before sending a V2, however, I'd like to get some more insights about
> the requirements of this function.
>
> The current behaviour upstream is to call task_mm_cid_work for the task
> running after the scheduler tick. The function checks that we don't run
> too often for the same mm, but it seems possible that some process with
> short runtime would rarely run during the tick.

So your concern is about a mm with threads running in short bursts, and
those would happen to rarely run while the tick interrupt is triggered.
We may indeed be missing something here, because the goal is to ensure
that we periodically do the task_mm_cid_work for each mm.

The side-effect of missing this work is not compacting the mm_cid
allocation cpumask. It won't cause rseq to fail per se, but it will
cause the mm_cid allocation to be less compact than it should be.

> The behaviour imposed by this patch (at least the intended one) is to
> run the task_mm_cid_work with the configured periodicity (plus
> scheduling latency) for each active mm.

What you propose looks like a more robust design than running under
the tick.

> This behaviour seem to me more predictable, but would that even be
> required for rseq or is it just an overkill?

Your approach looks more robust, so I would be tempted to introduce
it as a fix. Is the space/runtime overhead similar between the
tick/task work approach vs yours ?

> In other words, was the tick chosen out of simplicity or is there some
> property that has to be preserved?

Out of simplicity, and "do like what NUMA has done". But I am not
particularly attached to it. :-)

> P.S. I run the rseq self tests on both this and the previous patch
> (both broken) and saw no failure.

That's expected, because the tests do not so much depend on the
compactness of the mm_cid allocation. The way I validated this in the
past is by creating a simple multi-threaded program that periodically
prints the current mm_cid from userspace, and sleep for a few seconds
between printing, from many threads on a many-core system.

Then see how it reacts when run: are the mm_cid close to 0, or are
there large values of mm_cid allocated without compaction over time ?
I have not found a good way to translate this into an automated test
though. Ideas are welcome.

You can look at the librseq basic_test as a starting point. [1]

Thanks,

Mathieu

[1] https://github.com/compudj/librseq/blob/master/tests/basic_test.c

> Thanks,
> Gabriele
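A minimal sketch (not taken from this thread) of the kind of userspace mm_cid read such a program needs. It assumes glibc >= 2.35, which registers the rseq area and exports __rseq_offset/__rseq_size, a compiler providing __builtin_thread_pointer(), and a kernel new enough to expose the mm_cid field of struct rseq (it stays 0 otherwise); librseq wraps the same mechanism.

/*
 * Sketch: read the current mm_cid from userspace via the glibc-registered
 * rseq area. The read_mm_cid() helper name is only for illustration.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <linux/rseq.h>

extern ptrdiff_t __rseq_offset;		/* exported by glibc >= 2.35 */
extern unsigned int __rseq_size;	/* exported by glibc >= 2.35 */

static inline struct rseq *rseq_area(void)
{
	/* The rseq area lives at a fixed offset from the thread pointer. */
	return (struct rseq *)((char *)__builtin_thread_pointer() + __rseq_offset);
}

static inline int read_mm_cid(void)
{
	/* Make sure the registered area is large enough to contain mm_cid. */
	if (__rseq_size < offsetof(struct rseq, mm_cid) + sizeof(uint32_t))
		return -1;
	return (int)*(volatile uint32_t *)&rseq_area()->mm_cid;
}

int main(void)
{
	printf("mm_cid: %d\n", read_mm_cid());
	return 0;
}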
On Thu, 2024-12-05 at 11:25 -0500, Mathieu Desnoyers wrote:
> On 2024-12-05 09:33, Gabriele Monaco wrote:
> >
> > Before sending a V2, however, I'd like to get some more insights
> > about the requirements of this function.
> >
> > The current behaviour upstream is to call task_mm_cid_work for the
> > task running after the scheduler tick. The function checks that we
> > don't run too often for the same mm, but it seems possible that some
> > process with short runtime would rarely run during the tick.
>
> So your concern is about a mm with threads running in short bursts,
> and those would happen to rarely run while the tick interrupt is
> triggered. We may indeed be missing something here, because the goal
> is to ensure that we periodically do the task_mm_cid_work for each
> mm.
>
> The side-effect of missing this work is not compacting the
> mm_cid allocation cpumask. It won't cause rseq to fail per se,
> but it will cause the mm_cid allocation to be less compact than
> it should be.

Yes, that was exactly the case, tasks like timerlat/cyclictest running
periodically but doing very short work. Makes sense, now it's much
clearer.

> > The behaviour imposed by this patch (at least the intended one) is
> > to run the task_mm_cid_work with the configured periodicity (plus
> > scheduling latency) for each active mm.
>
> What you propose looks like a more robust design than running under
> the tick.
>
> > This behaviour seem to me more predictable, but would that even be
> > required for rseq or is it just an overkill?
>
> Your approach looks more robust, so I would be tempted to introduce
> it as a fix. Is the space/runtime overhead similar between the
> tick/task work approach vs yours ?

I'm going to fix the implementation and come up with some runtime stats
to compare the overhead of both methods.
As for the space overhead, I think I can answer this question already:
* The current approach uses a callback_head per thread (16 bytes)
* Mine relies on a delayed work per mm (88 bytes)

Tasks with 5 threads or less have lower memory footprint with the
current approach.
I checked quickly on some systems I have access to and I'd say my
approach introduces some memory overhead on an average system, but
considering a task_struct can be 7-13 kB and an mm_struct is about 1.4
kB, the overhead should be acceptable.

> > In other words, was the tick chosen out of simplicity or is there
> > some property that has to be preserved?
>
> Out of simplicity, and "do like what NUMA has done". But I am not
> particularly attached to it. :-)
>
> > P.S. I run the rseq self tests on both this and the previous patch
> > (both broken) and saw no failure.
>
> That's expected, because the tests do not so much depend on the
> compactness of the mm_cid allocation. The way I validated this
> in the past is by creating a simple multi-threaded program that
> periodically prints the current mm_cid from userspace, and
> sleep for a few seconds between printing, from many threads on
> a many-core system.
>
> Then see how it reacts when run: are the mm_cid close to 0, or
> are there large values of mm_cid allocated without compaction
> over time ? I have not found a good way to translate this into
> an automated test though. Ideas are welcome.
>
> You can look at the librseq basic_test as a starting point. [1]

Perfect, will try those!

Thanks,
Gabriele
On 2024-12-06 03:53, Gabriele Monaco wrote:
> On Thu, 2024-12-05 at 11:25 -0500, Mathieu Desnoyers wrote:

[...]

>>> The behaviour imposed by this patch (at least the intended one) is
>>> to run the task_mm_cid_work with the configured periodicity (plus
>>> scheduling latency) for each active mm.
>>
>> What you propose looks like a more robust design than running under
>> the tick.
>>
>>> This behaviour seem to me more predictable, but would that even be
>>> required for rseq or is it just an overkill?
>>
>> Your approach looks more robust, so I would be tempted to introduce
>> it as a fix. Is the space/runtime overhead similar between the
>> tick/task work approach vs yours ?
>
> I'm going to fix the implementation and come up with some runtime stats
> to compare the overhead of both methods.
> As for the space overhead, I think I can answer this question already:
> * The current approach uses a callback_head per thread (16 bytes)
> * Mine relies on a delayed work per mm (88 bytes)
>
> Tasks with 5 threads or less have lower memory footprint with the
> current approach.
> I checked quickly on some systems I have access to and I'd say my
> approach introduces some memory overhead on an average system, but
> considering a task_struct can be 7-13 kB and an mm_struct is about 1.4
> kB, the overhead should be acceptable.

ok!

>>> In other words, was the tick chosen out of simplicity or is there
>>> some property that has to be preserved?
>>
>> Out of simplicity, and "do like what NUMA has done". But I am not
>> particularly attached to it. :-)
>>
>>> P.S. I run the rseq self tests on both this and the previous patch
>>> (both broken) and saw no failure.
>>
>> That's expected, because the tests do not so much depend on the
>> compactness of the mm_cid allocation. The way I validated this
>> in the past is by creating a simple multi-threaded program that
>> periodically prints the current mm_cid from userspace, and
>> sleep for a few seconds between printing, from many threads on
>> a many-core system.
>>
>> Then see how it reacts when run: are the mm_cid close to 0, or
>> are there large values of mm_cid allocated without compaction
>> over time ? I have not found a good way to translate this into
>> an automated test though. Ideas are welcome.
>>
>> You can look at the librseq basic_test as a starting point. [1]
>
> Perfect, will try those!

Thinking back on this, you'll want a program that does the following
on a system with N CPUs:

- Phase 1: run one thread per cpu, pinned on each cpu. Print the
  mm_cid from each thread with the cpu number every second or so.

- Exit all threads except the main thread, join them from the main
  thread,

- Phase 2: the program is now single-threaded. We'd expect the
  mm_cid value to converge towards 0 as the periodic task clears
  unused CIDs.

So I think in phase 2 we can have an actual automated test: If after
an order of magnitude more time than the 100ms delay between periodic
tasks we still observe mm_cid > 0 in phase 2, then something is
wrong.

Thoughts ?

Mathieu
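A rough C skeleton (not taken from this thread) of the two phases described above, assuming the read_mm_cid() helper sketched earlier; the 100ms figure mirrors MM_CID_SCAN_DELAY and the margin is arbitrary. The follow-ups below refine this protocol, since the main thread may already hold mm_cid 0 and pass trivially.

/* Sketch of the phase 1 / phase 2 mm_cid compaction check. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int read_mm_cid(void);	/* helper from the earlier sketch (assumed) */

static void *worker(void *arg)
{
	int cpu = (int)(long)arg;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
	/* Phase 1: report what this pinned thread observes. */
	printf("cpu %d: mm_cid %d\n", cpu, read_mm_cid());
	sleep(1);
	return NULL;
}

int main(void)
{
	cpu_set_t allowed;
	pthread_t tids[CPU_SETSIZE];
	int cpus[CPU_SETSIZE], nr_cpus = 0, i;

	sched_getaffinity(0, sizeof(allowed), &allowed);
	for (i = 0; i < CPU_SETSIZE; i++)
		if (CPU_ISSET(i, &allowed))
			cpus[nr_cpus++] = i;

	/* Phase 1: one thread per allowed CPU, then join them all. */
	for (i = 0; i < nr_cpus; i++)
		pthread_create(&tids[i], NULL, worker, (void *)(long)cpus[i]);
	for (i = 0; i < nr_cpus; i++)
		pthread_join(tids[i], NULL);

	/*
	 * Phase 2: only the main thread remains. Wait well beyond the
	 * 100ms scan delay and expect the remaining mm_cid to be 0.
	 */
	for (i = 0; i < 10; i++) {
		if (read_mm_cid() == 0)
			return 0;
		usleep(200 * 1000);
	}
	fprintf(stderr, "mm_cid still %d, compaction did not happen\n",
		read_mm_cid());
	return 1;
}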
> Thinking back on this, you'll want a program that does the following
> on a system with N CPUs:
>
> - Phase 1: run one thread per cpu, pinned on each cpu. Print the
>   mm_cid from each thread with the cpu number every second or so.
>
> - Exit all threads except the main thread, join them from the main
>   thread,
>
> - Phase 2: the program is now single-threaded. We'd expect the
>   mm_cid value to converge towards 0 as the periodic task clears
>   unused CIDs.
>
> So I think in phase 2 we can have an actual automated test: If after
> an order of magnitude more time than the 100ms delay between periodic
> tasks we still observe mm_cid > 0 in phase 2, then something is
> wrong.

Been thinking about this and came up with a simple draft, I'll probably
send it as a separate patch.

Doing this can lead to false positives: the main thread may be assigned
the mm_cid 0 and keep it till the end, in this scenario the other
threads (CPUs) would get different mm_cids and exit, the main thread
will still have 0 and pass the test regardless.

I have an idea to make it a bit more robust: we can run threads as you
described in phase 1, stop all but one (let's say the one running on
the last core), make sure the main thread doesn't accidentally run on
the same core by pinning to core 0 and wait until we see the 2
remaining threads holding 0 and 1, in any order.
Besides a special case if we have only 1 available core, this should
work fine, sure we could get false positives but it seems to me much
less likely.

Does it make sense to you?

Gabriele
On 2024-12-09 08:45, Gabriele Monaco wrote:
>
>> Thinking back on this, you'll want a program that does the following
>> on a system with N CPUs:
>>
>> - Phase 1: run one thread per cpu, pinned on each cpu. Print the
>>   mm_cid from each thread with the cpu number every second or so.
>>
>> - Exit all threads except the main thread, join them from the main
>>   thread,
>>
>> - Phase 2: the program is now single-threaded. We'd expect the
>>   mm_cid value to converge towards 0 as the periodic task clears
>>   unused CIDs.
>>
>> So I think in phase 2 we can have an actual automated test: If after
>> an order of magnitude more time than the 100ms delay between periodic
>> tasks we still observe mm_cid > 0 in phase 2, then something is
>> wrong.
>
> Been thinking about this and came up with a simple draft, I'll probably
> send it as a separate patch.
>
> Doing this can lead to false positives: the main thread may be assigned
> the mm_cid 0 and keep it till the end, in this scenario the other
> threads (CPUs) would get different mm_cids and exit, the main thread
> will still have 0 and pass the test regardless.
>
> I have an idea to make it a bit more robust: we can run threads as you
> described in phase 1, stop all but one (let's say the one running on
> the last core), make sure the main thread doesn't accidentally run on
> the same core by pinning to core 0 and wait until we see the 2
> remaining threads holding 0 and 1, in any order.
> Besides a special case if we have only 1 available core, this should
> work fine, sure we could get false positives but it seems to me much
> less likely.
>
> Does it make sense to you?

A small tweak on your proposed approach: in phase 1, get each thread
to publish which mm_cid they observe, and select one thread which
has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
that will keep running in phase 2 (in addition to the main thread).

All threads other than the main thread and that selected thread exit
and are joined before phase 2.

So you end up in phase 2 with:

- main (observed any mm_cid)
- selected thread (observed mm_cid > 1, possibly largest)

Then after a while, the selected thread should observe a
mm_cid <= 1.

This test should be skipped if there are less than 3 CPUs in
allowed cpumask (sched_getaffinity).

Thoughts ?

Thanks,

Mathieu
On 2024-12-09 10:33, Mathieu Desnoyers wrote:
> On 2024-12-09 08:45, Gabriele Monaco wrote:
>>
>>> Thinking back on this, you'll want a program that does the following
>>> on a system with N CPUs:
>>>
>>> - Phase 1: run one thread per cpu, pinned on each cpu. Print the
>>>   mm_cid from each thread with the cpu number every second or so.
>>>
>>> - Exit all threads except the main thread, join them from the main
>>>   thread,
>>>
>>> - Phase 2: the program is now single-threaded. We'd expect the
>>>   mm_cid value to converge towards 0 as the periodic task clears
>>>   unused CIDs.
>>>
>>> So I think in phase 2 we can have an actual automated test: If after
>>> an order of magnitude more time than the 100ms delay between periodic
>>> tasks we still observe mm_cid > 0 in phase 2, then something is
>>> wrong.
>>
>> Been thinking about this and came up with a simple draft, I'll probably
>> send it as a separate patch.
>>
>> Doing this can lead to false positives: the main thread may be assigned
>> the mm_cid 0 and keep it till the end, in this scenario the other
>> threads (CPUs) would get different mm_cids and exit, the main thread
>> will still have 0 and pass the test regardless.
>>
>> I have an idea to make it a bit more robust: we can run threads as you
>> described in phase 1, stop all but one (let's say the one running on
>> the last core), make sure the main thread doesn't accidentally run on
>> the same core by pinning to core 0 and wait until we see the 2
>> remaining threads holding 0 and 1, in any order.
>> Besides a special case if we have only 1 available core, this should
>> work fine, sure we could get false positives but it seems to me much
>> less likely.
>>
>> Does it make sense to you?
>
> A small tweak on your proposed approach: in phase 1, get each thread
> to publish which mm_cid they observe, and select one thread which
> has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
> that will keep running in phase 2 (in addition to the main thread).
>
> All threads other than the main thread and that selected thread exit
> and are joined before phase 2.
>
> So you end up in phase 2 with:
>
> - main (observed any mm_cid)
> - selected thread (observed mm_cid > 1, possibly largest)
>
> Then after a while, the selected thread should observe a
> mm_cid <= 1.
>
> This test should be skipped if there are less than 3 CPUs in
> allowed cpumask (sched_getaffinity).

Even better:

For a sched_getaffinity with N cpus:

- If N == 1 -> skip (we cannot validate anything)

Phase 1: create N - 1 pthreads, each pinned to a CPU. main thread
also pinned to a cpu.

Publish the mm_cids observed by each thread, including main thread.

Select a new leader for phase 2: a thread which has observed nonzero
mm_cid. Each other thread including possibly main thread issue
pthread_exit, and the new leader does pthread join on each other.

Then check that the new leader eventually observe mm_cid == 0.

And it works with an allowed cpu mask that has only 2 cpus.

Thanks,

Mathieu
On Mon, 2024-12-09 at 10:48 -0500, Mathieu Desnoyers wrote:
> On 2024-12-09 10:33, Mathieu Desnoyers wrote:
> > A small tweak on your proposed approach: in phase 1, get each thread
> > to publish which mm_cid they observe, and select one thread which
> > has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
> > that will keep running in phase 2 (in addition to the main thread).
> >
> > All threads other than the main thread and that selected thread exit
> > and are joined before phase 2.
> >
> > So you end up in phase 2 with:
> >
> > - main (observed any mm_cid)
> > - selected thread (observed mm_cid > 1, possibly largest)
> >
> > Then after a while, the selected thread should observe a
> > mm_cid <= 1.
> >
> > This test should be skipped if there are less than 3 CPUs in
> > allowed cpumask (sched_getaffinity).
>
> Even better:
>
> For a sched_getaffinity with N cpus:
>
> - If N == 1 -> skip (we cannot validate anything)
>
> Phase 1: create N - 1 pthreads, each pinned to a CPU. main thread
> also pinned to a cpu.
>
> Publish the mm_cids observed by each thread, including main thread.
>
> Select a new leader for phase 2: a thread which has observed nonzero
> mm_cid. Each other thread including possibly main thread issue
> pthread_exit, and the new leader does pthread join on each other.
>
> Then check that the new leader eventually observe mm_cid == 0.
>
> And it works with an allowed cpu mask that has only 2 cpus.

Sounds even neater, thanks for the tips, I'll try this last one out!

Coming back to the implementation, I have been trying to validate my
approach with this test, wrapped my head around it, and found out that
the test can't actually pass on the latest upstream.

When an mm_cid is lazy dropped to compact the mask, it is again
re-assigned while switching in.
The change introduced in "sched: Improve cache locality of RSEQ
concurrency IDs for intermittent workloads" adds a recent_cid and it
seems it is never unset during the test (nothing migrates).

Now, I'm still running my first version of the test, so I have a thread
running on CPU0 with mm_cid=0 and another running on CPU127 with
mm_cid, say, 127 (weight=2).
In practice, the test is expecting 127 to be dropped (>2) but this is
not the case since 127 could exhibit better cache locality, so it is
selected on the next round.

Here's where I'm in doubt, is a compact map more desirable than reusing
the same mm_cids for cache locality?
If not, should we perhaps ignore the recent_cid if it's larger than the
map weight?
It seems the only way the recent_cid is unset is with migrations, but
I'm not sure if forcing one would make the test vain as the cid could
be dropped outside of task_mm_cid_work.

What do you think?

Thanks,
Gabriele
On 2024-12-11 07:27, Gabriele Monaco wrote:
>
> On Mon, 2024-12-09 at 10:48 -0500, Mathieu Desnoyers wrote:
>> On 2024-12-09 10:33, Mathieu Desnoyers wrote:
>>> A small tweak on your proposed approach: in phase 1, get each thread
>>> to publish which mm_cid they observe, and select one thread which
>>> has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
>>> that will keep running in phase 2 (in addition to the main thread).
>>>
>>> All threads other than the main thread and that selected thread exit
>>> and are joined before phase 2.
>>>
>>> So you end up in phase 2 with:
>>>
>>> - main (observed any mm_cid)
>>> - selected thread (observed mm_cid > 1, possibly largest)
>>>
>>> Then after a while, the selected thread should observe a
>>> mm_cid <= 1.
>>>
>>> This test should be skipped if there are less than 3 CPUs in
>>> allowed cpumask (sched_getaffinity).
>>
>> Even better:
>>
>> For a sched_getaffinity with N cpus:
>>
>> - If N == 1 -> skip (we cannot validate anything)
>>
>> Phase 1: create N - 1 pthreads, each pinned to a CPU. main thread
>> also pinned to a cpu.
>>
>> Publish the mm_cids observed by each thread, including main thread.
>>
>> Select a new leader for phase 2: a thread which has observed nonzero
>> mm_cid. Each other thread including possibly main thread issue
>> pthread_exit, and the new leader does pthread join on each other.
>>
>> Then check that the new leader eventually observe mm_cid == 0.
>>
>> And it works with an allowed cpu mask that has only 2 cpus.
>
> Sounds even neater, thanks for the tips, I'll try this last one out!
>
> Coming back to the implementation, I have been trying to validate my
> approach with this test, wrapped my head around it, and found out that
> the test can't actually pass on the latest upstream.
>
> When an mm_cid is lazy dropped to compact the mask, it is again
> re-assigned while switching in.
> The change introduced in "sched: Improve cache locality of RSEQ
> concurrency IDs for intermittent workloads" adds a recent_cid and it
> seems it is never unset during the test (nothing migrates).

Good point!

> Now, I'm still running my first version of the test, so I have a thread
> running on CPU0 with mm_cid=0 and another running on CPU127 with
> mm_cid, say, 127 (weight=2).
> In practice, the test is expecting 127 to be dropped (>2) but this is
> not the case since 127 could exhibit better cache locality, so it is
> selected on the next round.

Understood.

> Here's where I'm in doubt, is a compact map more desirable than reusing
> the same mm_cids for cache locality?

This is a tradeoff between:

A) Preserving cache locality after a transition from many threads to few
   threads, or after reducing the hamming weight of the allowed cpu mask.

B) Making the mm_cid guarantees wrt nr threads and allowed cpu mask easy
   to document and understand.

C) Allowing applications to eventually react to mm_cid compaction after
   reduction of the nr threads or allowed cpu mask, making the tracking
   of mm_cid compaction easier by shrinking it back towards 0 or not.

D) Making sure applications that periodically reduce and then increase
   again the nr threads or allowed cpu mask still benefit from good
   cache locality with mm_cid.

> If not, should we perhaps ignore the recent_cid if it's larger than the
> map weight?
> It seems the only way the recent_cid is unset is with migrations, but
> I'm not sure if forcing one would make the test vain as the cid could
> be dropped outside of task_mm_cid_work.
>
> What do you think?

Can you try this patch ? (compile-tested only)

commit 500649e03c5c28443f431829732c580750657326
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Wed Dec 11 11:53:01 2024 -0500

    sched: shrink mm_cid allocation with nr thread/affinity

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..b92e79770a93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3657,10 +3657,24 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
 	struct cpumask *cidmask = mm_cidmask(mm);
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid = __this_cpu_read(pcpu_cid->recent_cid);
+	int cid, max_nr_cid, allowed_max_nr_cid;
 
+	/*
+	 * After shrinking the number of threads or reducing the number
+	 * of allowed cpus, reduce the value of max_nr_cid so expansion
+	 * of cid allocation will preserve cache locality if the number
+	 * of threads or allowed cpus increase again.
+	 */
+	max_nr_cid = atomic_read(&mm->max_nr_cid);
+	while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed), atomic_read(&mm->mm_users))),
+	       max_nr_cid > allowed_max_nr_cid) {
+		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid))
+			break;
+	}
 	/* Try to re-use recent cid. This improves cache locality. */
-	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+	cid = __this_cpu_read(pcpu_cid->recent_cid);
+	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
+	    !cpumask_test_and_set_cpu(cid, cidmask))
 		return cid;
 	/*
 	 * Expand cid allocation if the maximum number of concurrency
@@ -3668,12 +3682,11 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 	 * and number of threads. Expanding cid allocation as much as
 	 * possible improves cache locality.
 	 */
-	cid = atomic_read(&mm->max_nr_cid);
-	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
-		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+	while (max_nr_cid < allowed_max_nr_cid) {
+		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, max_nr_cid + 1))
 			continue;
-		if (!cpumask_test_and_set_cpu(cid, cidmask))
-			return cid;
+		if (!cpumask_test_and_set_cpu(max_nr_cid, cidmask))
+			return max_nr_cid;
 	}
 	/*
 	 * Find the first available concurrency id.
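As a side note, here is a small userspace C11 model (not the kernel code above) of the clamping idea in that patch: a compare-exchange retry loop lowers a shared maximum to min(nr_cpus_allowed, mm_users), and a recent cid is only reused when it falls below that clamped maximum. Names mirror the kernel ones but are only illustrative.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int max_nr_cid;	/* models mm->max_nr_cid */

/* Lower max_nr_cid to allowed_max if it is currently above it. On a failed
 * compare-exchange, cur is reloaded with the latest value and the check is
 * redone, mirroring the atomic_try_cmpxchg() loop in the patch. */
static void clamp_max_nr_cid(int allowed_max)
{
	int cur = atomic_load(&max_nr_cid);

	while (cur > allowed_max &&
	       !atomic_compare_exchange_weak(&max_nr_cid, &cur, allowed_max))
		;
}

/* A recent cid is only reused when it lies below the clamped maximum, so
 * the cid range shrinks back after threads or allowed cpus are reduced. */
static bool may_reuse_recent_cid(int recent_cid, int nr_cpus_allowed, int mm_users)
{
	int allowed_max = nr_cpus_allowed < mm_users ? nr_cpus_allowed : mm_users;

	clamp_max_nr_cid(allowed_max);
	return recent_cid >= 0 && recent_cid < atomic_load(&max_nr_cid);
}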
On Wed, 2024-12-11 at 12:07 -0500, Mathieu Desnoyers wrote:
> > Here's where I'm in doubt, is a compact map more desirable than
> > reusing the same mm_cids for cache locality?
>
> This is a tradeoff between:

[...]

> Can you try this patch ? (compile-tested only)
>
> commit 500649e03c5c28443f431829732c580750657326
> Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Date:   Wed Dec 11 11:53:01 2024 -0500
>
>     sched: shrink mm_cid allocation with nr thread/affinity

[...]

Thanks for the patch, it seems much more robust than my simple
condition on the weight. It passes the test (both versions) we
previously discussed and doesn't seem to interfere with the general
rseq functionality as checked by the other selftests.
I'm not sure if I should run more tests on this one.

I will come up with a V2 shortly and attach some performance
evaluations.

Do you want to keep your patch separate or do I submit it together with
V2?

Thanks,
Gabriele
On 2024-12-12 06:09, Gabriele Monaco wrote:
>
> On Wed, 2024-12-11 at 12:07 -0500, Mathieu Desnoyers wrote:

[...]

> Thanks for the patch, it seems much more robust than my simple
> condition on the weight. It passes the test (both versions) we
> previously discussed and doesn't seem to interfere with the general
> rseq functionality as checked by the other selftests.

Great!

> I'm not sure if I should run more tests on this one.

The other thing I'd be interested in is to see if those patches
introduce any performance regression (e.g. the will-it-scale tests).
If you have spare cycles to try this out, that would be good, else
we can let the test bots complain.

> I will come up with a V2 shortly and attach some performance
> evaluations.

OK

> Do you want to keep your patch separate or do I submit it together with
> V2?

Let me prepare a proper patch with commit message, and then feel free
to add it into your series, so it benefits from your testing.

Thanks,

Mathieu

> Thanks,
> Gabriele
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..38c567f06dce 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -856,6 +856,7 @@ struct mm_struct {
 		 * mm nr_cpus_allowed updates.
 		 */
 		raw_spinlock_t cpus_allowed_lock;
+		struct delayed_work mm_cid_work;
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
@@ -1149,6 +1150,8 @@ enum mm_cid_state {
 	MM_CID_LAZY_PUT = (1U << 31),
 };
 
+extern void task_mm_cid_work(struct work_struct *work);
+
 static inline bool mm_cid_is_unset(int cid)
 {
 	return cid == MM_CID_UNSET;
@@ -1221,12 +1224,14 @@ static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *
 	if (!mm->pcpu_cid)
 		return -ENOMEM;
 	mm_init_cid(mm, p);
+	INIT_DELAYED_WORK(&mm->mm_cid_work, task_mm_cid_work);
 	return 0;
 }
 #define mm_alloc_cid(...)	alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))
 
 static inline void mm_destroy_cid(struct mm_struct *mm)
 {
+	disable_delayed_work(&mm->mm_cid_work);
 	free_percpu(mm->pcpu_cid);
 	mm->pcpu_cid = NULL;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a519..0c3a778c9cb5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5654,7 +5654,6 @@ void sched_tick(void)
 	resched_latency = cpu_resched_latency(rq);
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
-	task_tick_mm_cid(rq, donor);
 	scx_tick(rq);
 
 	rq_unlock(rq, &rf);
@@ -10520,22 +10519,14 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
 	sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
 }
 
-static void task_mm_cid_work(struct callback_head *work)
+void task_mm_cid_work(struct work_struct *work)
 {
 	unsigned long now = jiffies, old_scan, next_scan;
-	struct task_struct *t = current;
 	struct cpumask *cidmask;
-	struct mm_struct *mm;
+	struct delayed_work *delayed_work = container_of(work, struct delayed_work, work);
+	struct mm_struct *mm = container_of(delayed_work, struct mm_struct, mm_cid_work);
 	int weight, cpu;
 
-	SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
-
-	work->next = work;	/* Prevent double-add */
-	if (t->flags & PF_EXITING)
-		return;
-	mm = t->mm;
-	if (!mm)
-		return;
 	old_scan = READ_ONCE(mm->mm_cid_next_scan);
 	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	if (!old_scan) {
@@ -10571,26 +10562,12 @@ void init_sched_mm_cid(struct task_struct *t)
 
 	if (mm) {
 		mm_users = atomic_read(&mm->mm_users);
-		if (mm_users == 1)
+		if (mm_users == 1) {
 			mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
+			schedule_delayed_work(&mm->mm_cid_work,
+					      msecs_to_jiffies(MM_CID_SCAN_DELAY));
+		}
 	}
-	t->cid_work.next = &t->cid_work;	/* Protect against double add */
-	init_task_work(&t->cid_work, task_mm_cid_work);
-}
-
-void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
-{
-	struct callback_head *work = &curr->cid_work;
-	unsigned long now = jiffies;
-
-	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
-	    work->next != work)
-		return;
-	if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
-		return;
-
-	/* No page allocation under rq lock */
-	task_work_add(curr, work, TWA_RESUME | TWAF_NO_ALLOC);
 }
 
 void sched_mm_cid_exit_signals(struct task_struct *t)
Currently, the task_mm_cid_work function is called in a task work
triggered by a scheduler tick. This can delay the execution of the task
for the entire duration of the function, negatively affecting the
response of real time tasks.

This patch runs the task_mm_cid_work in a new delayed work connected to
the mm_struct rather than in the task context before returning to
userspace. This delayed work is initialised while allocating the mm and
disabled before freeing it; its execution is no longer triggered by
scheduler ticks, but runs periodically based on the defined
MM_CID_SCAN_DELAY.

The main advantage of this change is that the function can be offloaded
to a different CPU and even preempted by RT tasks. On a busy system,
this may mean the function gets called less often, but the current
behaviour already doesn't provide guarantees.
Moreover, this new behaviour could be more predictable in some
situations since the delayed work is always scheduled with the same
periodicity for each mm.

This deprecates the "sched: improve task_mm_cid_work impact on isolated
systems" patches by dropping the first patch and using a workqueue
instead of RCU callbacks.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/lkml/20241202140735.56368-1-gmonaco@redhat.com/
---
 include/linux/mm_types.h |  5 +++++
 kernel/sched/core.c      | 37 +++++++------------------------------
 2 files changed, 12 insertions(+), 30 deletions(-)


base-commit: feffde684ac29a3b7aec82d2df850fbdbdee55e4