
[v8,00/13] fold per-CPU vmstats remotely

Message ID 20230515180015.016409657@redhat.com

Message

Marcelo Tosatti May 15, 2023, 6 p.m. UTC
This patch series addresses the following two problems:

1. A customer provided evidence indicating that a process
   was stalled in direct reclaim:

 - The process was trapped in throttle_direct_reclaim().
   wait_event_killable() was called to wait for the condition
   allow_direct_reclaim(pgdat) to become true for the current node.
   allow_direct_reclaim(pgdat) examined the number of free pages
   on the node via zone_page_state(), which simply returns the value
   in zone->vm_stat[NR_FREE_PAGES].

 - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
   However, the freelist on this node was not empty.

 - This inconsistency in the vmstat value was caused by the per-CPU
   vmstat counters on nohz_full cpus. Every vmstat increment/decrement
   is performed on a per-CPU counter first, and the pooled diffs are
   then cumulated into the zone's vmstat counter in a timely manner.
   However, on nohz_full cpus (48 of the 52 cpus on this customer's
   system) these pooled diffs were never cumulated once a cpu had no
   events on it and went to sleep indefinitely.
   Checking the per-CPU vmstat data showed a total of 69 counts that
   had not yet been cumulated into the zone's vmstat counter.

 - In this situation, kswapd did not help the trapped process.
   In pgdat_balanced(), zone_watermark_ok_safe() examined the number
   of free pages on the node via zone_page_state_snapshot(), which
   also includes the pending per-CPU counts.
   Therefore kswapd correctly saw the 69 free pages.
   Since zone->_watermark = {8, 20, 32}, kswapd did not run, because
   69 was above the high watermark of 32.
   (A simplified sketch of both accessors follows the problem list.)

 2. With a task that busy loops on a given CPU,
    the kworker interruption to execute vmstat_update
    is undesired and may exceed latency thresholds
    for certain applications.
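
For reference, a simplified sketch of the two accessors involved
(closely following include/linux/vmstat.h, with the CONFIG_SMP
conditionals and some details trimmed):

static inline unsigned long zone_page_state(struct zone *zone,
                                            enum zone_stat_item item)
{
        /*
         * Reads only the global counter: per-CPU diffs that have not
         * been folded yet are invisible here.
         */
        long x = atomic_long_read(&zone->vm_stat[item]);

        if (x < 0)
                x = 0;
        return x;
}

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                                     enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);
        int cpu;

        /*
         * Also adds the pending per-CPU diffs, which is why kswapd
         * could see the 69 free pages that direct reclaim could not.
         */
        for_each_online_cpu(cpu)
                x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

        if (x < 0)
                x = 0;
        return x;
}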

Both problems are addressed by having vmstat_shepherd flush the
per-CPU counters into the global counters from remote CPUs.

This is done by using cmpxchg to manipulate the counters, both
CPU-locally (via the accounting functions) and remotely
(via cpu_vm_stats_fold).
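
As a rough sketch of the remote-fold idea (illustrative only: the
function below is not the literal patch code, it assumes the series'
s32 per-CPU counters and omits details such as updating the global
vm_zone_stat array):

static void fold_zone_diffs_remote(struct zone *zone, int cpu)
{
        struct per_cpu_zonestat *pzstats =
                per_cpu_ptr(zone->per_cpu_zonestats, cpu);
        int i;

        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
                /*
                 * Because the local accounting path now updates
                 * vm_stat_diff with an atomic cmpxchg, the shepherd can
                 * safely claim a remote CPU's pending delta with xchg,
                 * without queueing work on that CPU.
                 */
                s32 v = xchg(&pzstats->vm_stat_diff[i], 0);

                if (v)
                        atomic_long_add(v, &zone->vm_stat[i]);
        }
}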

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.


Performance details for the kworker interruption:

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
 
The example above shows an additional 7us for the

        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.

A summary of the arguments against this patchset, and the responses
to them, follows. The objections are numbered "O-x)", starting from 1.
A line with "*** Response ***" precedes the response to each numbered
objection.


O-1) `echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same
result without any kernel changes.

*** Response ***

The stat_refresh interface is not reliable, as it triggers queue_work_on
for every CPU that has dirty per-CPU mm counters. So if you have two
threads that must not be interrupted, starting one of them can interrupt
an already executing and isolated thread.

O-2) Why not always loop through all CPUs when reading the counters?
(That is, replace zone_page_state with zone_page_state_snapshot.)

*** Response ***

Consider zone_watermark_fast(), which is called from
get_page_from_freelist() in the page allocator fast path: looping over
all online CPUs there, as zone_page_state_snapshot() does, is far too
expensive. From https://lwn.net/Articles/684616/:

On x86 systems, DMA is not usually the problem; instead, memory
allocation is. He is working with a target of 14.8 million packets (full
wire speed on a 10Gb/s link) per second; on a 3GHz system, that gives
the kernel about 200 cycles in which to process each packet. Allocating
a single page, though, takes 277 cycles on that system, making the
14.8Mpps rate unattainable. He pointed out Mel Gorman's recent work to
reduce memory-allocator overhead; that work reduced the cost to 230
cycles — a significant improvement, but nowhere near enough.
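
(For reference: 3x10^9 cycles/s divided by 14.8x10^6 packets/s is
roughly 202 cycles per packet, which is where the "about 200 cycles"
budget comes from; a walk over every online CPU's counters on each
watermark check does not fit into such a budget.)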

O-3) Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC
this is used by NOHZ. 

*** Response ***:

The quiet_vmstat mechanism is not sufficient: it is only invoked by the
NOHZ code to synchronize the per-CPU mm stats to the global counters
when entering NOHZ mode.

Any subsequent use of the per-CPU mm counters allows the vmstat shepherd
to queue work on that CPU again (and therefore wake up a kworker thread
there); see the shepherd loop sketched below.

/*
 * Switch off vmstat processing and then fold all the remaining differentials
 * until the diffs stay at zero. The function is used by NOHZ and can only be
 * invoked when tick processing is not active.
 */
void quiet_vmstat(void)
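
For reference, the re-arming side looks roughly like this (simplified
from the existing vmstat_shepherd() in mm/vmstat.c):

        /*
         * Any CPU whose per-CPU diffs became dirty again after
         * quiet_vmstat() gets vmstat_update work queued on it,
         * waking a kworker on that CPU.
         */
        for_each_online_cpu(cpu) {
                struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

                if (!delayed_work_pending(dw) && need_update(cpu))
                        queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
        }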

O-4) The only applications of interest are those that do not enter the kernel.

> > >  2. With a task that busy loops on a given CPU,
> > >     the kworker interruption to execute vmstat_update
> > >     is undesired and may exceed latency thresholds
> > >     for certain applications.
> >
> > Yes it can but why does that matter?
>
> It matters for the application that is executing and expects
> not to be interrupted.                                                                                                                                                      
Those workloads shouldn't enter the kernel in the first place, no?
Otherwise the in kernel execution with all the direct or indirect
dependencies (e.g. via locks) can throw any latency expectations off the
window.

*** Response ***: a common counterexample is latency-sensitive
applications that do call into the kernel, e.g. via sys_nanosleep
(cyclictest and PLC programs do that).

O-5) Why not implement a syscall to flush all per-CPU caches?

> The practical problem we have been seeing is -RT app initialization.                                                                                                        
> For example:                                                                                                                                                                
>                                                                                                                                                                             
> 1) mlock();                                                                                                                                                                 
> 2) enter loop without system calls

OK, that is what I have kinda expected. Would have been better to
mention it explicitly.

I expect this to be a very common pattern and vmstat might not be the
only subsystem that could interfere later on. Would it make more sense
to address this by a more generic solution? E.g. a syscall to flush all
per-cpu caches so they won't interfere later unless userspace hits the
kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags)?
The above pattern could then be implemented as

        do_initial_setup()
        sched_setaffinity(getpid(), cpumask);
        flush_cpu_caches(cpumask, 0);
        do_userspace_loop()

*** Response ***:

A special mode in which caches are flushed upfront has been attempted
before:

https://lwn.net/Articles/883940/

However it has a number of drawbacks:

1) If the application is in kernel mode, the interruption
will not be avoided (this patchset will avoid the interruption
even in kernel space).

2) It requires modification of applications.

3) Applications which attempt to use this mode in combination with
phases that perform system calls will suffer performance slowdowns,
as the following report illustrates:

https://lore.kernel.org/linux-mm/87im9d4ezq.fsf@nanos.tec.linutronix.de/

"In a real-world usecase we had the situation of compute bursts and an
unfortunate hw enforced requirement to go into the kernel between them
for synchronization between the compute threads and hardware (A quick
hardware assisted save/load).

Unmodified NOHZ full accumulated to more than 6% loss compared to a
fully undisturbed run. Most of it was caused by cache effects and not by
the actually used CPU time.

A full enforced quiescing upfront gained ~2-3%, but a lazy approach of
accepting that some stuff might happen once and does not happen again
gained almost 5%. In that particular scenario 5% _is_ a huge win."


O-7) There is a lack of examples where this change is required.

*** Response ***:

Example-1: MAC scheduler processing must occur every 1ms, and a certain
amount of computation takes place (which must finish before the next
1ms timeframe). A > 50us latency spike, as observed by cyclictest, is
considered a failure.

The kworker interruption being solved is the one shown in the
oslat/kworker trace at the top of this cover letter: the
oslat -> kworker -> oslat switches add 7us on bare metal, and more than
50us as observed in a guest when vmstat_update interrupts the host CPU
running a qemu-kvm vcpu.

Example-2: Based on the tracing data above, but a slightly different
use case:

Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
as measured by cyclictest & oslat.  We cannot achieve under 20 usec with the vmstats
interruption.

Example-3: the 7us above was measured on a recent Intel Xeon processor.
There are use cases running on less powerful processors, such as
embedded ARM boards, where the switch to a kworker and back is much
more expensive (affecting a larger range of applications). For example,
3D printing:
https://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=2077&context=etd_projects

O-8) This is a general problem and therefore requires a general solution.

> But let me repeat, this is not just about vmstats. Just have a look at                                                                                                                        
> other queue_work_on users. You do not want to handy pick each and every                                                                                                                       
> one and do so in the future as well.                                                                                                                                                          

*** Response ***:

The problematic ones have been getting fixed for some time now. For example:

commit 2de79ee27fdb52626ac4ac48ec6d8d52ba6f9047
Author: Paolo Abeni <pabeni@redhat.com>

    net: try to avoid unneeded backlog flush

    flush_all_backlogs() may cause deadlock on systems
    running processes with FIFO scheduling policy.

    The above is critical in -RT scenarios, where user-space
    specifically ensure no network activity is scheduled on
    the CPU running the mentioned FIFO process, but still get
    stuck.

    This commit tries to address the problem checking the
    backlog status on the remote CPUs before scheduling the
    flush operation. If the backlog is empty, we can skip it.

    v1 -> v2:
     - explicitly clear flushed cpu mask - Eric

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


And this has been the normal process so far.

I think what needs to be done is to avoid new queue_work_on()
users from being introduced in the tree (the number of
existing ones is finite and can therefore be fixed).

I agree with the criticism here; however, I can't see options other
than the following:

	1) Given an activity, which contains a sequence of instructions
	   to execute on a CPU, to change the algorithm
	   to execute that code remotely (therefore avoid interrupting a CPU),
	   or to avoid the interruption somehow (which must be dealt with 
	   on a case-by-case basis).

	2) To block that activity from happening in the first place,
	   for the sites where it can be blocked (that return errors to
	   userspace, for example).

	3) Completely isolate the CPU from the kernel (off-line it).

Working on patches to improve #2.

v8:
- Add summary of discussion on -v7 to cover letter
- rebase

v7:
- Fix allow_direct_reclaim issue by using
  zone_page_state_snapshot				(Michal Hocko)

v6:
- Add more information on throttle_direct_reclaim problem 
  to commit logs	       			       (Michal Hocko)

v5:
- Drop "mm/vmstat: remove remote node draining"        (Vlastimil Babka)
- Implement remote node draining for cpu_vm_stats_fold (Vlastimil Babka)

v4:
- Switch per-CPU vmstat counters to s32, required 
  by RISC-V, ARC architectures			

v3:
- Removed unused drain_zone_pages and changes variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats  (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make
  stats more accurate				       (Peter Xu)
- Improve changelog of
  "mm/vmstat: switch counter modification to cmpxchg"  (Peter Xu / David)
- Improve changelog of
  "mm/vmstat: remove remote node draining"	       (David Hildenbrand)


v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
  (Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
  (Uros Bizjak / Matthew Wilcox)

 arch/arm64/include/asm/percpu.h     |   16 ++
 arch/loongarch/include/asm/percpu.h |   23 +++-
 arch/s390/include/asm/percpu.h      |    5 
 arch/x86/include/asm/percpu.h       |   39 +++---
 include/asm-generic/percpu.h        |   17 ++
 include/linux/mmzone.h              |    8 -
 include/linux/percpu-defs.h         |    2 
 include/linux/vmstat.h              |    2 
 kernel/fork.c                       |    2 
 kernel/scs.c                        |    2 
 mm/page_alloc.c                     |    5 
 mm/vmscan.c                         |    2 
 mm/vmstat.c                         |  440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------
 13 files changed, 361 insertions(+), 202 deletions(-)

Comments

Christoph Lameter May 16, 2023, 8:09 a.m. UTC | #1
The patchset still modifies the semantics of this_cpu operations,
replacing the lockless RMV operations with locked ones. One of the
rationales for the use of this_cpu operations is their efficiency,
since locked RMV atomics are avoided. This patchset destroys that
functionality.

If you want locked RMV semantics then use them through cmpxchg() and
friends. Do not modify this_cpu operations by changing the implementation
in the arch code.
Marcelo Tosatti May 16, 2023, 6:02 p.m. UTC | #2
Hi Christoph,

On Tue, May 16, 2023 at 10:09:02AM +0200, Christoph Lameter wrote:
> The patchset still modifies the semantics of this_cpu operations semantics
> replacing the lockless RMV operations with locked ones. 

It does that to follow the pre-existing kernel convention:

function-name			LOCK prefix
cmpxchg				YES
cmpxchg_local			NO

So the patchset introduces:

function-name			LOCK prefix
this_cpu_cmpxchg		YES
this_cpu_cmpxchg_local		NO

> One of the
> rationales for the use  this_cpu operations is their efficiency since
> locked RMV atomics are avoided. 

And there is the freedom to choose between this_cpu_cmpxchg and
this_cpu_cmpxchg_local (depending on intended usage).

> This patchset destroys that functionality.

Patch 6 is

Subject: [PATCH v8 06/13] add this_cpu_cmpxchg_local and asm-generic definitions

which adds this_cpu_cmpxchg_local.

Patch 7 converts all other this_cpu_cmpxchg users
(except the vmstat ones):

[PATCH v8 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local

So the non-LOCK'ed behaviour is maintained for existing users.
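
An illustrative hunk of that kind of conversion (hypothetical call
site, not an actual hunk from patch 7); the caller keeps the old
non-LOCK semantics by moving to the _local variant:

-       old = this_cpu_cmpxchg(pcp->counter, o, n);
+       old = this_cpu_cmpxchg_local(pcp->counter, o, n);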

> If you want locked RMV semantics then use them through cmpxchg() and
> friends. Do not modify this_cpu operations by changing the implementation
> in the arch code.

But then it would be necessary to disable preemption here:

static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
                                  long delta, int overstep_mode)
{
        struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
        s32 __percpu *p = pcp->vm_stat_diff + item;
        long o, n, t, z;

        do {
                z = 0;  /* overflow to zone counters */

                /*
                 * The fetching of the stat_threshold is racy. We may apply
                 * a counter threshold to the wrong cpu if we get
                 * rescheduled while executing here. However, the next
                 * counter update will apply the threshold again and
                 * therefore bring the counter under the threshold again.
                 *
                 * Most of the time the thresholds are the same anyways
                 * for all cpus in a zone.
                 */
                t = this_cpu_read(pcp->stat_threshold);

                o = this_cpu_read(*p);
                n = delta + o;

                if (abs(n) > t) {
                        int os = overstep_mode * (t >> 1);

                        /* Overflow must be added to zone counters */
                        z = n + os;
                        n = -os;
                }
        } while (this_cpu_cmpxchg(*p, o, n) != o);
		 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

        if (z)
                zone_page_state_add(z, zone, item);
}
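
For comparison, a sketch of what the plain-cmpxchg() alternative would
require (illustrative only, with the threshold handling elided):

        s32 *ptr;
        s32 o, n;

        /*
         * The pointer obtained via this_cpu_ptr() must stay stable
         * across the read-modify-write, so preemption has to be
         * disabled around the loop.
         */
        preempt_disable();
        ptr = this_cpu_ptr(pcp->vm_stat_diff + item);
        do {
                o = *ptr;
                n = o + delta;  /* plus the same threshold logic as above */
        } while (cmpxchg(ptr, o, n) != o);
        preempt_enable();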

Earlier you objected to disabling preemption on this codepath
(which is what led to this patchset in the first place):

"Using preemption is a way to make this work correctly. However, 
doing so would sacrifice the performance, low impact and the
scalability of the vm counters."

So it seems that a this_cpu function which performs a LOCK'ed cmpxchg
is desired.

Perhaps you disagree with the this_cpu_cmpxchg_local/this_cpu_cmpxchg
naming?
Michal Hocko May 24, 2023, 12:51 p.m. UTC | #3
[Sorry for a late response but I was conferencing last two weeks and now
catching up]

On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
[...]
> v8
> - Add summary of discussion on -v7 to cover letter

Thanks, this is very useful! It helps to frame the further discussion.

I believe the most important question to answer is in fact this:
> I think what needs to be done is to avoid new queue_work_on()
> users from being introduced in the tree (the number of
> existing ones is finite and can therefore be fixed).
> 
> Agree with the criticism here, however, i can't see other
> options than the following:
> 
>         1) Given an activity, which contains a sequence of instructions
>            to execute on a CPU, to change the algorithm
>            to execute that code remotely (therefore avoid interrupting a CPU),
>            or to avoid the interruption somehow (which must be dealt with
>            on a case-by-case basis).
> 
>         2) To block that activity from happening in the first place,
>            for the sites where it can be blocked (that return errors to
>            userspace, for example).
> 
>         3) Completly isolate the CPU from the kernel (off-line it).

I agree that a reliable cpu isolation implementation needs to address
the queue_work_on problem. And it has to do that _reliably_. This cannot
be achieved by an endless whack-a-mole, chasing each new instance. There
must be a more systematic approach. One way would be to change the
semantics of schedule_work_on and fail the call for an isolated CPU. The
caller would have a way to fall back and handle the operation by other
means. E.g. vmstat could simply ignore folding pcp data because an
imprecision shouldn't really matter. Other callers might choose to do the
operation remotely. This is a lot of work, no doubt about that, but it
is a long term maintainable solution that doesn't give you new surprises
with any new released kernel. There are likely other remote interfaces
that would need to follow that scheme.
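
A sketch of that proposed semantic, purely illustrative (neither the
failing-on-isolated-CPUs behaviour nor the vmstat_fold_work item below
exist today):

        if (!schedule_work_on(cpu, &vmstat_fold_work))
                return;         /* isolated CPU: caller tolerates imprecise counters */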

If cpu isolation is not deemed worth that time investment, then I do
not think it is worth degrading the highly optimized vmstat code
either. These stats are invoked from many hot paths and the per-cpu
implementation has been optimized for that case. If your workload would
like to avoid the disturbance then you already have the quiet_vmstat
precedent, so find a way to use it for your workload instead.
Marcelo Tosatti May 24, 2023, 1:53 p.m. UTC | #4
On Wed, May 24, 2023 at 02:51:55PM +0200, Michal Hocko wrote:
> [Sorry for a late response but I was conferencing last two weeks and now
> catching up]
> 
> On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
> [...]
> > v8
> > - Add summary of discussion on -v7 to cover letter
> 
> Thanks this is very useful! This helps to frame the further discussion.
> 
> I believe the most important question to answer is this in fact
> > I think what needs to be done is to avoid new queue_work_on()
> > users from being introduced in the tree (the number of
> > existing ones is finite and can therefore be fixed).
> > 
> > Agree with the criticism here, however, i can't see other
> > options than the following:
> > 
> >         1) Given an activity, which contains a sequence of instructions
> >            to execute on a CPU, to change the algorithm
> >            to execute that code remotely (therefore avoid interrupting a CPU),
> >            or to avoid the interruption somehow (which must be dealt with
> >            on a case-by-case basis).
> > 
> >         2) To block that activity from happening in the first place,
> >            for the sites where it can be blocked (that return errors to
> >            userspace, for example).
> > 
> >         3) Completly isolate the CPU from the kernel (off-line it).
> 
> I agree that a reliable cpu isolation implementation needs to address
> queue_work_on problem. And it has to do that _realiably_. This cannot by
> achieved by an endless whack-a-mole and chasing each new instance. There
> must be a more systematic approach. One way would be to change the
> semantic of schedule_work_on and fail call for an isolated CPU. The
> caller would have a way to fallback and handle the operation by other
> means. E.g. vmstat could simply ignore folding pcp data because an
> imprecision shouldn't really matter. Other callers might chose to do the
> operation remotely. This is a lot of work, no doubt about that, but it
> is a long term maintainable solution that doesn't give you new surprises
> with any new released kernel. There are likely other remote interfaces
> that would need to follow that scheme.
> 
> If the cpu isolation is not planned to be worth that time investment
> then I do not think it is also worth reducing a highly optimized vmstat
> code. These stats are invoked from many hot paths and per-cpu
> implementation has been optimized for that case.

It is exactly the same code, but now with a "LOCK" prefix for the
CMPXCHG instruction, which should not cost much due to cache locking
(these are per-CPU variables anyway).

> If your workload would
> like to avoid that as disturbing then you already have a quiet_vmstat
> precedence so find a way how to use it for your workload instead.
>  
> -- 
> Michal Hocko
> SUSE Labs

OK, so an alternative solution is to completely disable vmstat updates
for isolated CPUs. Are you OK with that?
Michal Hocko May 25, 2023, 6:47 a.m. UTC | #5
On Wed 24-05-23 10:53:23, Marcelo Tosatti wrote:
> On Wed, May 24, 2023 at 02:51:55PM +0200, Michal Hocko wrote:
> > [Sorry for a late response but I was conferencing last two weeks and now
> > catching up]
> > 
> > On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
> > [...]
> > > v8
> > > - Add summary of discussion on -v7 to cover letter
> > 
> > Thanks this is very useful! This helps to frame the further discussion.
> > 
> > I believe the most important question to answer is this in fact
> > > I think what needs to be done is to avoid new queue_work_on()
> > > users from being introduced in the tree (the number of
> > > existing ones is finite and can therefore be fixed).
> > > 
> > > Agree with the criticism here, however, i can't see other
> > > options than the following:
> > > 
> > >         1) Given an activity, which contains a sequence of instructions
> > >            to execute on a CPU, to change the algorithm
> > >            to execute that code remotely (therefore avoid interrupting a CPU),
> > >            or to avoid the interruption somehow (which must be dealt with
> > >            on a case-by-case basis).
> > > 
> > >         2) To block that activity from happening in the first place,
> > >            for the sites where it can be blocked (that return errors to
> > >            userspace, for example).
> > > 
> > >         3) Completly isolate the CPU from the kernel (off-line it).
> > 
> > I agree that a reliable cpu isolation implementation needs to address
> > queue_work_on problem. And it has to do that _realiably_. This cannot by
> > achieved by an endless whack-a-mole and chasing each new instance. There
> > must be a more systematic approach. One way would be to change the
> > semantic of schedule_work_on and fail call for an isolated CPU. The
> > caller would have a way to fallback and handle the operation by other
> > means. E.g. vmstat could simply ignore folding pcp data because an
> > imprecision shouldn't really matter. Other callers might chose to do the
> > operation remotely. This is a lot of work, no doubt about that, but it
> > is a long term maintainable solution that doesn't give you new surprises
> > with any new released kernel. There are likely other remote interfaces
> > that would need to follow that scheme.
> > 
> > If the cpu isolation is not planned to be worth that time investment
> > then I do not think it is also worth reducing a highly optimized vmstat
> > code. These stats are invoked from many hot paths and per-cpu
> > implementation has been optimized for that case.
> 
> It is exactly the same code, but now with a "LOCK" prefix for CMPXCHG
> instruction. Which should not cost much due to cache locking (these are
> per-CPU variables anyway).

Sorry but just a LOCK prefix for a hot path is not a serious argument.
 
> > If your workload would
> > like to avoid that as disturbing then you already have a quiet_vmstat
> > precedence so find a way how to use it for your workload instead.
> >  
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> OK so an alternative solution is to completly disable vmstat updates
> for isolated CPUs. Are you OK with that ?

Yes, the number of events should be reasonably small and those places in
the kernel which really need a precise value need to do a per-cpu walk
anyway. IIRC /proc/vmstat et al also do accumulate pcp state.

But let me reiterate. Even with vmstat updates out of the game, you have
so many other sources of disruption that your isolated workload will be
fragile until you actually try to deal with the problem on a more
fundamental level.