Message ID | 20230515180015.016409657@redhat.com |
---|---|
Series | fold per-CPU vmstats remotely |
The patchset still modifies the semantics of this_cpu operations, replacing the lockless read-modify-write (RMW) operations with locked ones. One of the rationales for using this_cpu operations is their efficiency, since locked RMW atomics are avoided. This patchset destroys that functionality. If you want locked RMW semantics then use them through cmpxchg() and friends. Do not modify this_cpu operations by changing the implementation in the arch code.
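For readers outside the thread, the distinction Christoph draws above looks roughly like the following sketch. The per-CPU counter and helper names are made up for illustration; only the operations (this_cpu_cmpxchg vs. cmpxchg on a per-CPU address) come from the discussion, and the "no LOCK prefix" behaviour described is the historical, pre-series one on x86.

	#include <linux/percpu.h>
	#include <linux/atomic.h>

	static DEFINE_PER_CPU(long, example_counter);

	/* Historical this_cpu_* fast path: unlocked cmpxchg (no LOCK prefix on x86). */
	static void example_add_local(long delta)
	{
		long old, new;

		do {
			old = this_cpu_read(example_counter);
			new = old + delta;
		} while (this_cpu_cmpxchg(example_counter, old, new) != old);
	}

	/* Explicitly locked RMW via cmpxchg() and friends, usable from any CPU. */
	static void example_add_locked(int cpu, long delta)
	{
		long *p = per_cpu_ptr(&example_counter, cpu);
		long old, new;

		do {
			old = READ_ONCE(*p);
			new = old + delta;
		} while (cmpxchg(p, old, new) != old);
	}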
Hi Christoph,

On Tue, May 16, 2023 at 10:09:02AM +0200, Christoph Lameter wrote:
> The patchset still modifies the semantics of this_cpu operations,
> replacing the lockless RMW operations with locked ones.

It does that to follow the pre-existing kernel convention:

	function-name			LOCK prefix
	cmpxchg				YES
	cmpxchg_local			NO

So the patchset introduces:

	function-name			LOCK prefix
	this_cpu_cmpxchg		YES
	this_cpu_cmpxchg_local		NO

> One of the rationales for using this_cpu operations is their efficiency,
> since locked RMW atomics are avoided.

And there is the freedom to choose between this_cpu_cmpxchg and
this_cpu_cmpxchg_local (depending on the intended usage).

> This patchset destroys that functionality.

Patch 6 is

	Subject: [PATCH v8 06/13] add this_cpu_cmpxchg_local and asm-generic definitions

which adds this_cpu_cmpxchg_local.

Patch 7 converts all other this_cpu_cmpxchg users (except the vmstat ones):

	[PATCH v8 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local

So the non-LOCK'ed behaviour is maintained for existing users.

> If you want locked RMW semantics then use them through cmpxchg() and
> friends. Do not modify this_cpu operations by changing the implementation
> in the arch code.

But then it would be necessary to disable preemption here:

	static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
					  long delta, int overstep_mode)
	{
		struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
		s32 __percpu *p = pcp->vm_stat_diff + item;
		long o, n, t, z;

		do {
			z = 0;  /* overflow to zone counters */

			/*
			 * The fetching of the stat_threshold is racy. We may apply
			 * a counter threshold to the wrong cpu if we get
			 * rescheduled while executing here. However, the next
			 * counter update will apply the threshold again and
			 * therefore bring the counter under the threshold again.
			 *
			 * Most of the time the thresholds are the same anyways
			 * for all cpus in a zone.
			 */
			t = this_cpu_read(pcp->stat_threshold);
			o = this_cpu_read(*p);
			n = delta + o;

			if (abs(n) > t) {
				int os = overstep_mode * (t >> 1);

				/* Overflow must be added to zone counters */
				z = n + os;
				n = -os;
			}
		} while (this_cpu_cmpxchg(*p, o, n) != o);
			 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		if (z)
			zone_page_state_add(z, zone, item);
	}

Earlier you objected to disabling preemption on this codepath (which is
what led to this patchset in the first place):

"Using preemption is a way to make this work correctly. However, doing so
would sacrifice the performance, low impact and the scalability of the vm
counters."

So it seems a this_cpu function which does a locked cmpxchg is desired.
Perhaps you disagree with the this_cpu_cmpxchg_local/this_cpu_cmpxchg
naming?
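For comparison, the preemption-disabling alternative referred to above would look roughly like the sketch below: keep the unlocked per-CPU cmpxchg (spelled this_cpu_cmpxchg_local in the series' naming) but pin the task to the CPU for the duration of the retry loop. This is an illustration only, not code from the posted series; the function name and the exact preempt_disable()/preempt_enable() placement are assumptions.

	static inline void mod_zone_state_preempt(struct zone *zone,
						  enum zone_stat_item item,
						  long delta, int overstep_mode)
	{
		struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
		s32 __percpu *p = pcp->vm_stat_diff + item;
		long o, n, t, z;

		/* Pin to this CPU so the unlocked cmpxchg cannot race with a migration. */
		preempt_disable();
		do {
			z = 0;	/* overflow to zone counters */

			t = this_cpu_read(pcp->stat_threshold);
			o = this_cpu_read(*p);
			n = delta + o;

			if (abs(n) > t) {
				int os = overstep_mode * (t >> 1);

				/* Overflow must be added to the zone counters */
				z = n + os;
				n = -os;
			}
		} while (this_cpu_cmpxchg_local(*p, o, n) != o);
		preempt_enable();

		if (z)
			zone_page_state_add(z, zone, item);
	}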
[Sorry for a late response but I was conferencing the last two weeks and am
now catching up]

On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
[...]
> v8
> - Add summary of discussion on -v7 to cover letter

Thanks, this is very useful! It helps to frame the further discussion.

I believe the most important question to answer is in fact this:

> I think what needs to be done is to avoid new queue_work_on()
> users from being introduced in the tree (the number of
> existing ones is finite and can therefore be fixed).
>
> Agree with the criticism here, however, I can't see other
> options than the following:
>
> 1) Given an activity, which contains a sequence of instructions
>    to execute on a CPU, to change the algorithm
>    to execute that code remotely (therefore avoiding interrupting a CPU),
>    or to avoid the interruption somehow (which must be dealt with
>    on a case-by-case basis).
>
> 2) To block that activity from happening in the first place,
>    for the sites where it can be blocked (those that return errors to
>    userspace, for example).
>
> 3) Completely isolate the CPU from the kernel (off-line it).

I agree that a reliable cpu isolation implementation needs to address the
queue_work_on problem. And it has to do that _reliably_. This cannot be
achieved by an endless whack-a-mole, chasing each new instance. There must
be a more systematic approach. One way would be to change the semantics of
schedule_work_on and have the call fail for an isolated CPU. The caller
would have a way to fall back and handle the operation by other means.
E.g. vmstat could simply skip folding pcp data because an imprecision
shouldn't really matter. Other callers might choose to do the operation
remotely. This is a lot of work, no doubt about that, but it is a long-term
maintainable solution that doesn't give you new surprises with any newly
released kernel. There are likely other remote interfaces that would need
to follow that scheme.

If cpu isolation is not deemed worth that time investment then I do not
think it is worth degrading the highly optimized vmstat code either. These
stats are invoked from many hot paths and the per-cpu implementation has
been optimized for that case. If your workload would like to avoid the
disturbance then you already have the quiet_vmstat precedent, so find a way
to use it for your workload instead.
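A hypothetical sketch of the fallback pattern Michal describes is shown below. The idea is that queuing work on an isolated CPU is refused (here the caller checks up front rather than relying on a changed schedule_work_on() return convention) and the caller degrades gracefully. The cpu_is_isolated() predicate and the per-CPU work item name are assumptions for illustration, not a claim about the current schedule_work_on() contract or the posted series.

	#include <linux/workqueue.h>
	#include <linux/cpu.h>
	#include <linux/sched/isolation.h>

	static DEFINE_PER_CPU(struct work_struct, fold_work);

	static void fold_all_cpus(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			if (cpu_is_isolated(cpu)) {
				/*
				 * Caller-chosen fallback: vmstat can simply skip
				 * folding this CPU's pcp data, since a small
				 * imprecision does not matter; other callers might
				 * instead perform the operation remotely.
				 */
				continue;
			}
			schedule_work_on(cpu, per_cpu_ptr(&fold_work, cpu));
		}
	}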
On Wed, May 24, 2023 at 02:51:55PM +0200, Michal Hocko wrote:
> [Sorry for a late response but I was conferencing the last two weeks and
> am now catching up]
>
> On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
> [...]
> > v8
> > - Add summary of discussion on -v7 to cover letter
>
> Thanks, this is very useful! It helps to frame the further discussion.
>
> I believe the most important question to answer is in fact this:
>
> > I think what needs to be done is to avoid new queue_work_on()
> > users from being introduced in the tree (the number of
> > existing ones is finite and can therefore be fixed).
> >
> > Agree with the criticism here, however, I can't see other
> > options than the following:
> >
> > 1) Given an activity, which contains a sequence of instructions
> >    to execute on a CPU, to change the algorithm
> >    to execute that code remotely (therefore avoiding interrupting a CPU),
> >    or to avoid the interruption somehow (which must be dealt with
> >    on a case-by-case basis).
> >
> > 2) To block that activity from happening in the first place,
> >    for the sites where it can be blocked (those that return errors to
> >    userspace, for example).
> >
> > 3) Completely isolate the CPU from the kernel (off-line it).
>
> I agree that a reliable cpu isolation implementation needs to address the
> queue_work_on problem. And it has to do that _reliably_. This cannot be
> achieved by an endless whack-a-mole, chasing each new instance. There must
> be a more systematic approach. One way would be to change the semantics of
> schedule_work_on and have the call fail for an isolated CPU. The caller
> would have a way to fall back and handle the operation by other means.
> E.g. vmstat could simply skip folding pcp data because an imprecision
> shouldn't really matter. Other callers might choose to do the operation
> remotely. This is a lot of work, no doubt about that, but it is a long-term
> maintainable solution that doesn't give you new surprises with any newly
> released kernel. There are likely other remote interfaces that would need
> to follow that scheme.
>
> If cpu isolation is not deemed worth that time investment then I do not
> think it is worth degrading the highly optimized vmstat code either. These
> stats are invoked from many hot paths and the per-cpu implementation has
> been optimized for that case.

It is exactly the same code, but now with a "LOCK" prefix on the CMPXCHG
instruction, which should not cost much due to cache locking (these are
per-CPU variables anyway).

> If your workload would like to avoid the disturbance then you already
> have the quiet_vmstat precedent, so find a way to use it for your
> workload instead.
>
> --
> Michal Hocko
> SUSE Labs

OK, so an alternative solution is to completely disable vmstat updates
for isolated CPUs. Are you OK with that?
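One possible shape of "completely disable vmstat updates for isolated CPUs" is sketched below: the vmstat shepherd simply never queues the per-CPU vmstat work on an isolated CPU. The surrounding names (vmstat_shepherd, vmstat_work, mm_percpu_wq, need_update) are borrowed from mm/vmstat.c; treat the cpu_is_isolated() check, and the omission of the shepherd's re-arming, as assumptions of this sketch rather than what the posted series does.

	static void vmstat_shepherd(struct work_struct *w)
	{
		int cpu;

		cpus_read_lock();
		for_each_online_cpu(cpu) {
			struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

			/*
			 * Leave isolated CPUs entirely alone; readers that need a
			 * precise value fall back to a per-cpu walk (see the
			 * snapshot helper sketched further below).
			 */
			if (cpu_is_isolated(cpu))
				continue;

			if (!delayed_work_pending(dw) && need_update(cpu))
				queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

			cond_resched();
		}
		cpus_read_unlock();
	}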
On Wed 24-05-23 10:53:23, Marcelo Tosatti wrote:
> On Wed, May 24, 2023 at 02:51:55PM +0200, Michal Hocko wrote:
> > [Sorry for a late response but I was conferencing the last two weeks and
> > am now catching up]
> >
> > On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
> > [...]
> > > v8
> > > - Add summary of discussion on -v7 to cover letter
> >
> > Thanks, this is very useful! It helps to frame the further discussion.
> >
> > I believe the most important question to answer is in fact this:
> >
> > > I think what needs to be done is to avoid new queue_work_on()
> > > users from being introduced in the tree (the number of
> > > existing ones is finite and can therefore be fixed).
> > >
> > > Agree with the criticism here, however, I can't see other
> > > options than the following:
> > >
> > > 1) Given an activity, which contains a sequence of instructions
> > >    to execute on a CPU, to change the algorithm
> > >    to execute that code remotely (therefore avoiding interrupting a CPU),
> > >    or to avoid the interruption somehow (which must be dealt with
> > >    on a case-by-case basis).
> > >
> > > 2) To block that activity from happening in the first place,
> > >    for the sites where it can be blocked (those that return errors to
> > >    userspace, for example).
> > >
> > > 3) Completely isolate the CPU from the kernel (off-line it).
> >
> > I agree that a reliable cpu isolation implementation needs to address the
> > queue_work_on problem. And it has to do that _reliably_. This cannot be
> > achieved by an endless whack-a-mole, chasing each new instance. There must
> > be a more systematic approach. One way would be to change the semantics of
> > schedule_work_on and have the call fail for an isolated CPU. The caller
> > would have a way to fall back and handle the operation by other means.
> > E.g. vmstat could simply skip folding pcp data because an imprecision
> > shouldn't really matter. Other callers might choose to do the operation
> > remotely. This is a lot of work, no doubt about that, but it is a long-term
> > maintainable solution that doesn't give you new surprises with any newly
> > released kernel. There are likely other remote interfaces that would need
> > to follow that scheme.
> >
> > If cpu isolation is not deemed worth that time investment then I do not
> > think it is worth degrading the highly optimized vmstat code either. These
> > stats are invoked from many hot paths and the per-cpu implementation has
> > been optimized for that case.
>
> It is exactly the same code, but now with a "LOCK" prefix on the CMPXCHG
> instruction, which should not cost much due to cache locking (these are
> per-CPU variables anyway).

Sorry, but "just a LOCK prefix" on a hot path is not a serious argument.

> > If your workload would like to avoid the disturbance then you already
> > have the quiet_vmstat precedent, so find a way to use it for your
> > workload instead.
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
> OK, so an alternative solution is to completely disable vmstat updates
> for isolated CPUs. Are you OK with that?

Yes. The number of events should be reasonably small, and those places in
the kernel which really need a precise value have to do a per-cpu walk
anyway. IIRC /proc/vmstat et al. also accumulate the pcp state.

But let me reiterate: even with vmstat updates out of the game, you have so
many other sources of disruption that your isolated workload will be
fragile until you actually try to deal with the problem on a more
fundamental level.
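The "per-cpu walk" Michal refers to is roughly what the existing zone_page_state_snapshot() helper in include/linux/vmstat.h does: readers that need a precise value add up the not-yet-folded per-CPU diffs themselves instead of relying on the shepherd having folded them. The sketch below follows that helper's shape; minor details may differ between kernel versions.

	static inline unsigned long zone_page_state_snapshot(struct zone *zone,
						enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);
	#ifdef CONFIG_SMP
		int cpu;

		/* Accumulate every CPU's pending diff on top of the zone counter. */
		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

		if (x < 0)
			x = 0;
	#endif
		return x;
	}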