Message ID | 20231007062356.187621-1-ying.huang@intel.com (mailing list archive) |
---|---|
State | New |
Series | [-V2] mm: fix draining PCP of remote zone |
On Sat, 7 Oct 2023 14:23:56 +0800 Huang Ying <ying.huang@intel.com> wrote:

> If there is no memory allocation/freeing in the PCP (Per-CPU Pageset)
> of a remote zone (zone in remote NUMA node) after some time (3 seconds
> for now), the pages of the PCP of the remote zone will be drained to
> avoid memory wastage.
>
> This behavior was introduced in the commit 4ae7c03943fc ("[PATCH]
> Periodically drain non local pagesets") and the commit
> 4037d452202e ("Move remote node draining out of slab allocators")
>
> But, after the commit 7cc36bbddde5 ("vmstat: on-demand vmstat workers
> V8"), the vmstat updater worker which is used to drain the PCP of
> remote zones may not be re-queued when we are waiting for the
> timeout (pcp->expire != 0) if there are no vmstat changes on this CPU,
> for example, when the CPU goes idle or runs user space only workloads.
> This may cause the pages of a remote zone be kept in PCP of this CPU
> for long time. So that, the page reclaiming of the remote zone may be
> triggered prematurely. This isn't a severe problem in practice,
> because the PCP of the remote zone will be drained if some memory are
> allocated/freed again on this CPU. And, the PCP will eventually be
> drained during the direct reclaiming if necessary.
>
> Anyway, the problem still deserves a fix via guaranteeing that the
> vmstat updater worker will always be re-queued when we are waiting for
> the timeout. In effect, this restores the original behavior before
> the commit 7cc36bbddde5.
>
> We can reproduce the bug via allocating/freeing pages from a remote
> zone then go idle as follows. And the patch can fix it.
>
> - Run some workloads, use `numactl` to bind CPU to node 0 and memory to
>   node 1. So the PCP of the CPU on node 0 for zone on node 1 will be
>   filled.
>
> - After workloads finish, idle for 60s
>
> - Check /proc/zoneinfo
>
> With the original kernel, the number of pages in the PCP of the CPU on
> node 0 for zone on node 1 is non-zero after idle. With the patched
> kernel, it becomes 0 after idle. That is, we avoid to keep pages in
> the remote PCP during idle.
>

Thanks, I updated the changelog in place and queued this for mm-stable.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..7f1bf40e71e8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -855,8 +855,10 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 				continue;
 			}
 
-			if (__this_cpu_dec_return(pcp->expire))
+			if (__this_cpu_dec_return(pcp->expire)) {
+				changes++;
 				continue;
+			}
 
 			if (__this_cpu_read(pcp->count)) {
 				drain_zone_pages(zone, this_cpu_ptr(pcp));