Message ID | 20230811090819.60845-1-ying.huang@intel.com
---|---
State | New
Series | mm: fix draining remote pageset
On Fri 11-08-23 17:08:19, Huang Ying wrote:
> If there is no memory allocation/freeing in the remote pageset after
> some time (3 seconds for now), the remote pageset will be drained to
> avoid memory wastage.
>
> But in the current implementation, the vmstat updater worker may not be
> re-queued when we are waiting for the timeout (pcp->expire != 0) if
> there are no vmstat changes, for example, when the CPU goes idle.

Why is that a problem?

> This is fixed by guaranteeing that the vmstat updater worker will
> always be re-queued when we are waiting for the timeout.
>
> We can reproduce the bug by allocating/freeing pages from a remote
> node, then going idle. The patch fixes it.
>
> Fixes: 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8")
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>  mm/vmstat.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index b731d57996c5..111118741abf 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -856,8 +856,10 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>  			continue;
>  		}
>
> -		if (__this_cpu_dec_return(pcp->expire))
> +		if (__this_cpu_dec_return(pcp->expire)) {
> +			changes++;
>  			continue;
> +		}
>
>  		if (__this_cpu_read(pcp->count)) {
>  			drain_zone_pages(zone, this_cpu_ptr(pcp));
> --
> 2.39.2
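For context: the re-queue decision the fix piggybacks on lives in the caller of refresh_cpu_vm_stats(). Below is a simplified sketch of vmstat_update(), paraphrased from mm/vmstat.c (not verbatim upstream code; error handling and details elided):

/*
 * Simplified sketch of the vmstat updater worker (mm/vmstat.c).  The
 * worker only re-arms itself when the refresh reports outstanding work,
 * which is exactly why a silently counting-down pcp->expire must be
 * reported as a change.
 */
static void vmstat_update(struct work_struct *w)
{
	if (refresh_cpu_vm_stats(true)) {
		/*
		 * Counters were updated -- or, with the fix above, a remote
		 * pageset is still counting down -- so keep the periodic
		 * work queued on this CPU.
		 */
		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
				this_cpu_ptr(&vmstat_work),
				round_jiffies_relative(sysctl_stat_interval));
	}
	/* Otherwise the worker goes quiet until it is re-armed later. */
}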
Hi, Michal,

Michal Hocko <mhocko@suse.com> writes:

> On Fri 11-08-23 17:08:19, Huang Ying wrote:
>> If there is no memory allocation/freeing in the remote pageset after
>> some time (3 seconds for now), the remote pageset will be drained to
>> avoid memory wastage.
>>
>> But in the current implementation, the vmstat updater worker may not be
>> re-queued when we are waiting for the timeout (pcp->expire != 0) if
>> there are no vmstat changes, for example, when the CPU goes idle.
>
> Why is that a problem?

The pages of the remote zone may be kept in the local per-CPU pageset
for a long time, as long as there's no page allocation/freeing on the
logical CPU. Besides the logical CPU going idle, this is also possible
if the logical CPU stays busy in user space.

I will update the change log to include this.

--
Best Regards,
Huang, Ying
On Mon 14-08-23 09:59:51, Huang, Ying wrote:
> Hi, Michal,
>
> Michal Hocko <mhocko@suse.com> writes:
> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
> >> [...]
> >> But in the current implementation, the vmstat updater worker may not be
> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
> >> there are no vmstat changes, for example, when the CPU goes idle.
> >
> > Why is that a problem?
>
> The pages of the remote zone may be kept in the local per-CPU pageset
> for a long time, as long as there's no page allocation/freeing on the
> logical CPU. Besides the logical CPU going idle, this is also possible
> if the logical CPU stays busy in user space.

But why is this a problem? Is the scale of the problem sufficient to
trigger out of memory situations or be otherwise harmful?
Michal Hocko <mhocko@suse.com> writes:

> On Mon 14-08-23 09:59:51, Huang, Ying wrote:
>> [...]
>> The pages of the remote zone may be kept in the local per-CPU pageset
>> for a long time, as long as there's no page allocation/freeing on the
>> logical CPU. Besides the logical CPU going idle, this is also possible
>> if the logical CPU stays busy in user space.
>
> But why is this a problem? Is the scale of the problem sufficient to
> trigger out of memory situations or be otherwise harmful?

This may trigger premature page reclaiming. If the pages kept in the
PCP for the remote zone had been freed back to that zone, they could
have satisfied allocations there without any page reclaiming. It's
highly possible that the local CPU allocates/frees from/to the remote
zone only temporarily. So, we should free the PCP pages of the remote
zone if there has been no page allocation/freeing from/to that zone
for 3 seconds.

This will not trigger OOM, because all PCPs will be drained if an
allocation still fails after direct reclaim.

--
Best Regards,
Huang, Ying
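The drain-on-failure fallback referenced here sits in the direct reclaim slow path. A simplified sketch, modeled on __alloc_pages_direct_reclaim() in mm/page_alloc.c (paraphrased, with most details elided; the function name and framing below are for illustration, not verbatim upstream code):

/*
 * Paraphrased sketch of the direct-reclaim slow path (mm/page_alloc.c),
 * reduced to the drain fallback only.
 */
static struct page *direct_reclaim_sketch(gfp_t gfp_mask, unsigned int order,
					  int alloc_flags,
					  const struct alloc_context *ac)
{
	unsigned long progress = __perform_reclaim(gfp_mask, order, ac);
	struct page *page = NULL;
	bool drained = false;

	if (unlikely(!progress))
		return NULL;
retry:
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
	/*
	 * Reclaim made progress but the allocation still failed: flush
	 * every per-CPU pageset once and retry.  This is why pages stuck
	 * in a remote pcp cannot cause OOM by themselves -- only
	 * earlier-than-necessary reclaim.
	 */
	if (!page && !drained) {
		drain_all_pages(NULL);
		drained = true;
		goto retry;
	}
	return page;
}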
On Wed, 16 Aug 2023, Huang, Ying wrote:

> Michal Hocko <mhocko@suse.com> writes:
>>> [...]
>>> The pages of the remote zone may be kept in the local per-CPU pageset
>>> for a long time, as long as there's no page allocation/freeing on the
>>> logical CPU.
>>
>> But why is this a problem? Is the scale of the problem sufficient to
>> trigger out of memory situations or be otherwise harmful?
>
> This may trigger premature page reclaiming. If the pages kept in the
> PCP for the remote zone had been freed back to that zone, they could
> have satisfied allocations there without any page reclaiming. It's
> highly possible that the local CPU allocates/frees from/to the remote
> zone only temporarily. So, we should free the PCP pages of the remote
> zone if there has been no page allocation/freeing from/to that zone
> for 3 seconds.
>
> This will not trigger OOM, because all PCPs will be drained if an
> allocation still fails after direct reclaim.

I think this is a minor issue, but it's cleaner behavior. vmstat
refreshes should continue as long as there are pages queued in remote
pcps.
On Wed 16-08-23 15:08:23, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> > But why is this a problem? Is the scale of the problem sufficient to
> > trigger out of memory situations or be otherwise harmful?
>
> This may trigger premature page reclaiming. If the pages kept in the
> PCP for the remote zone had been freed back to that zone, they could
> have satisfied allocations there without any page reclaiming. It's
> highly possible that the local CPU allocates/frees from/to the remote
> zone only temporarily.

I am slightly confused here, but I suspect that by zone you mean the
remote pcp. More importantly, is this a concern seen in a real
workload? Can you quantify it in some manner? E.g. with this patch
there is X less kswapd scanning, or direct reclaim is hit much less
often.

> So, we should free the PCP pages of the remote zone if there has been
> no page allocation/freeing from/to that zone for 3 seconds.

Well, I would argue this depends a lot. There are workloads which
really like to have CPUs idle, yet they want to benefit from the
allocator fast path once a CPU comes out of idle: idling is their
power-saving opportunity, but they also want to act quickly when there
is something to run.

That being said, we really need some numbers (ideally from the real
world) to prove this is not just a theoretical concern.
Michal Hocko <mhocko@suse.com> writes:

> On Wed 16-08-23 15:08:23, Huang, Ying wrote:
>> [...]
>> So, we should free the PCP pages of the remote zone if there has been
>> no page allocation/freeing from/to that zone for 3 seconds.
>
> Well, I would argue this depends a lot. There are workloads which
> really like to have CPUs idle, yet they want to benefit from the
> allocator fast path once a CPU comes out of idle: idling is their
> power-saving opportunity, but they also want to act quickly when there
> is something to run.
>
> That being said, we really need some numbers (ideally from the real
> world) to prove this is not just a theoretical concern.

The behavior of draining the PCP of the remote zone (that is, the
remote PCP) was introduced in commit 4ae7c03943fc ("[PATCH]
Periodically drain non local pagesets"). The goal of the draining is
well documented in that change log; IIUC, some of your questions are
answered there.

This patch just restores the original behavior, which was changed by
commit 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8").

--
Best Regards,
Huang, Ying
On Mon 21-08-23 16:30:18, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> > That being said, we really need some numbers (ideally from the real
> > world) to prove this is not just a theoretical concern.
>
> The behavior of draining the PCP of the remote zone (that is, the
> remote PCP) was introduced in commit 4ae7c03943fc ("[PATCH]
> Periodically drain non local pagesets"). The goal of the draining is
> well documented in that change log; IIUC, some of your questions are
> answered there.
>
> This patch just restores the original behavior, which was changed by
> commit 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8").

Let me repeat. You need some numbers to show this is needed.
Michal Hocko <mhocko@suse.com> writes:

> On Mon 21-08-23 16:30:18, Huang, Ying wrote:
>> [...]
>> This patch just restores the original behavior, which was changed by
>> commit 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8").
>
> Let me repeat. You need some numbers to show this is needed.

I have done some tests for this patch as follows,

- Run some workloads, using `numactl` to bind the CPU to node 0 and
  memory to node 1, so that the PCP of the CPU on node 0 for the zone
  on node 1 gets filled.

- After the workloads finish, idle for 60s.

- Check /proc/zoneinfo.

With the original kernel, the number of pages in the PCP of the CPU on
node 0 for the zone on node 1 is non-zero after idle. With the patched
kernel, it becomes 0 after idle. That is, we avoid keeping pages in the
remote PCP during idle.

These are the numbers I have. If you think they aren't enough to
justify the patch, then I'm OK too (although I think they're enough),
because the remote PCP will be drained later, when some pages are
allocated/freed on the CPU.

--
Best Regards,
Huang, Ying
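To make the countdown mechanism concrete, here is a self-contained toy model in plain C. It is illustrative only: the names expire and count mirror pcp->expire and pcp->count, everything else is invented, and the kernel's real logic lives in refresh_cpu_vm_stats()/vmstat_update().

#include <stdio.h>
#include <stdbool.h>

static int expire;	/* armed when the CPU freed remote pages */
static int count;	/* pages cached for the remote zone */

/* Mimics the remote-pcp branch of refresh_cpu_vm_stats(). */
static int refresh(bool fixed)
{
	int changes = 0;

	if (expire && count) {
		if (--expire) {
			if (fixed)
				changes++;	/* the one-line fix */
		} else {
			count = 0;		/* drain_zone_pages() */
			changes++;
		}
	}
	return changes;
}

int main(void)
{
	/* The worker only re-runs while refresh() reports changes. */
	for (bool fixed = false; ; fixed = true) {
		int ticks = 0;

		expire = 3;
		count = 32;
		while (refresh(fixed))
			ticks++;
		printf("%s: ran %d more tick(s), %d page(s) left cached\n",
		       fixed ? "fixed" : "buggy", ticks, count);
		if (fixed)
			break;
	}
	return 0;
}

With the buggy behavior the loop stops immediately ("0 more ticks, 32 pages left cached"), matching the non-zero /proc/zoneinfo counts above; with the fix the countdown runs to zero and the pages are drained.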
On Tue 22-08-23 06:31:42, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> > Let me repeat. You need some numbers to show this is needed.
>
> I have done some tests for this patch as follows,
>
> - Run some workloads, using `numactl` to bind the CPU to node 0 and
>   memory to node 1, so that the PCP of the CPU on node 0 for the zone
>   on node 1 gets filled.
>
> - After the workloads finish, idle for 60s.
>
> - Check /proc/zoneinfo.
>
> With the original kernel, the number of pages in the PCP of the CPU on
> node 0 for the zone on node 1 is non-zero after idle. With the patched
> kernel, it becomes 0 after idle. That is, we avoid keeping pages in the
> remote PCP during idle.
>
> These are the numbers I have. If you think they aren't enough to
> justify the patch, then I'm OK too (although I think they're enough),
> because the remote PCP will be drained later, when some pages are
> allocated/freed on the CPU.

Yes, this doesn't really show any actual correctness problem, so I do
not think it is sufficient to change the code. You would need to show
that the existing behavior is actively harmful.
On Tue, 22 Aug 2023, Michal Hocko wrote:

> Yes, this doesn't really show any actual correctness problem, so I do
> not think it is sufficient to change the code. You would need to show
> that the existing behavior is actively harmful.

Having some pages from a remote NUMA node stuck in a pcp somewhere
makes that memory unusable. It is usually rare that these remote pages
are needed again, so they may remain there for a long time if the
situation is right.

And he is right that the intended behavior of freeing the remote pages
was disabled by commit 7cc36bbddde5.

So I think there is sufficient rationale to apply these fixes.
Hi, Christopher,

"Lameter, Christopher" <cl@os.amperecomputing.com> writes:

> On Tue, 22 Aug 2023, Michal Hocko wrote:
>
>> [...]
>
> Having some pages from a remote NUMA node stuck in a pcp somewhere
> makes that memory unusable. It is usually rare that these remote pages
> are needed again, so they may remain there for a long time if the
> situation is right.
>
> And he is right that the intended behavior of freeing the remote pages
> was disabled by commit 7cc36bbddde5.
>
> So I think there is sufficient rationale to apply these fixes.

Thanks! Can I get your "Acked-by" or "Reviewed-by" for the patch?

--
Best Regards,
Huang, Ying
On Tue, 29 Aug 2023, Huang, Ying wrote:

>> So I think there is sufficient rationale to apply these fixes.
>
> Thanks! Can I get your "Acked-by" or "Reviewed-by" for the patch?

Reviewed-by: Christoph Lameter <cl@linux.com>
On 8/25/23 19:06, Lameter, Christopher wrote:
> On Tue, 22 Aug 2023, Michal Hocko wrote:
>
>> Yes, this doesn't really show any actual correctness problem, so I do
>> not think it is sufficient to change the code. You would need to show
>> that the existing behavior is actively harmful.
>
> Having some pages from a remote NUMA node stuck in a pcp somewhere
> makes that memory unusable. It is usually rare that these remote pages
> are needed again, so they may remain there for a long time if the
> situation is right.
>
> And he is right that the intended behavior of freeing the remote pages
> was disabled by commit 7cc36bbddde5.
>
> So I think there is sufficient rationale to apply these fixes.

I wonder if this is the optimal way to handle the NOHZ case? IIUC,
there we use quiet_vmstat() to call refresh_cpu_vm_stats(). I'd expect
that if there were pending remote pages to flush, it would be best to
do it immediately, rather than keep a worker being requeued and only
flush once pcp->expire reaches zero.

However, quiet_vmstat() even calls the refresh with do_pagesets ==
false. Why do we refresh the stats at that moment at all if the delayed
update is pending anyway? And could we make sure the flush is done on
the first delayed update instead of expiring like this?
Vlastimil Babka <vbabka@suse.cz> writes:

> On 8/25/23 19:06, Lameter, Christopher wrote:
>> [...]
>> So I think there is sufficient rationale to apply these fixes.
>
> I wonder if this is the optimal way to handle the NOHZ case? IIUC,
> there we use quiet_vmstat() to call refresh_cpu_vm_stats(). I'd expect
> that if there were pending remote pages to flush, it would be best to
> do it immediately, rather than keep a worker being requeued and only
> flush once pcp->expire reaches zero.
>
> However, quiet_vmstat() even calls the refresh with do_pagesets ==
> false. Why do we refresh the stats at that moment at all if the
> delayed update is pending anyway?

According to commit f01f17d3705b ("mm, vmstat: make quiet_vmstat
lighter") and the comments in quiet_vmstat(), the pending worker is not
canceled there, to avoid adding latency to idle entry.

> And could we make sure the flush is done on the first delayed update
> instead of expiring like this?

This sounds reasonable. How do we identify whether the current CPU is
in the NOHZ state? Via tick_get_tick_sched()->tick_stopped?

--
Best Regards,
Huang, Ying
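A speculative sketch of the idea being floated here, and nothing more: quiet_vmstat() and tick_nohz_tick_stopped() do exist upstream (mm/vmstat.c and include/linux/tick.h), but wiring the tick state into the do_pagesets argument is purely hypothetical.

/*
 * Speculative sketch only -- NOT upstream behavior.  The guard checks
 * mirror quiet_vmstat() in mm/vmstat.c; the last line is the
 * hypothetical change: drain remote pagesets eagerly when the tick is
 * stopped, instead of waiting for pcp->expire to count down.
 */
void quiet_vmstat(void)
{
	if (system_state != SYSTEM_RUNNING)
		return;

	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
		return;

	if (!need_update(smp_processor_id()))
		return;

	/* Hypothetical: do_pagesets = true when entering NOHZ idle. */
	refresh_cpu_vm_stats(tick_nohz_tick_stopped());
}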
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b731d57996c5..111118741abf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -856,8 +856,10 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			continue;
 		}
 
-		if (__this_cpu_dec_return(pcp->expire))
+		if (__this_cpu_dec_return(pcp->expire)) {
+			changes++;
 			continue;
+		}
 
 		if (__this_cpu_read(pcp->count)) {
 			drain_zone_pages(zone, this_cpu_ptr(pcp));
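For orientation, the hunk above lands in the do_pagesets branch of refresh_cpu_vm_stats(). A paraphrased, commented sketch of that branch (simplified from mm/vmstat.c, not verbatim; zone, pcp, and changes come from the enclosing per-zone loop):

		if (do_pagesets) {
			cond_resched();
			/* Nothing armed, or nothing cached: skip this zone. */
			if (!__this_cpu_read(pcp->expire) ||
			    !__this_cpu_read(pcp->count))
				continue;

			/* We never drain zones local to this processor. */
			if (zone_to_nid(zone) == numa_node_id()) {
				__this_cpu_write(pcp->expire, 0);
				continue;
			}

			/*
			 * Still counting down: with the fix, report a pending
			 * change so vmstat_update() re-queues the worker and
			 * the countdown keeps ticking even when no counters
			 * changed.
			 */
			if (__this_cpu_dec_return(pcp->expire)) {
				changes++;
				continue;
			}

			/* Timed out: give the remote zone its pages back. */
			if (__this_cpu_read(pcp->count)) {
				drain_zone_pages(zone, this_cpu_ptr(pcp));
				changes++;
			}
		}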
If there is no memory allocation/freeing in the remote pageset after
some time (3 seconds for now), the remote pageset will be drained to
avoid memory wastage.

But in the current implementation, the vmstat updater worker may not be
re-queued when we are waiting for the timeout (pcp->expire != 0) if
there are no vmstat changes, for example, when the CPU goes idle.

This is fixed by guaranteeing that the vmstat updater worker will
always be re-queued when we are waiting for the timeout.

We can reproduce the bug by allocating/freeing pages from a remote
node, then going idle. The patch fixes it.

Fixes: 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
---
 mm/vmstat.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)