mm: vmscan: ensure kswapd is woken up if the wait queue is active

Message ID 20241126150612.114561-1-snishika@redhat.com (mailing list archive)
State New

Commit Message

Seiji Nishikawa Nov. 26, 2024, 3:06 p.m. UTC
Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
zone_page_state_snapshot"), a task may remain indefinitely stuck in
throttle_direct_reclaim() while holding mmap_sem.

__alloc_pages_nodemask
 try_to_free_pages
  throttle_direct_reclaim

This can cause numerous other tasks to wait on the same rwsem, leading
to severe system hangups:

[1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
[1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
[1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
[1088963.381869] Call trace:
[1088963.381872]  __switch_to+0xd0/0x120
[1088963.381877]  __schedule+0x340/0xac8
[1088963.381881]  schedule+0x68/0x118
[1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8

The issue arises when allow_direct_reclaim(pgdat) returns false and keeps
returning false: throttle_direct_reclaim() then loops without making
progress, even though the pgdat->pfmemalloc_wait wait queue is empty and
there are no throttled tasks left to wait for.

In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
 > 0), but calculations of pfmemalloc_reserve and free_pages result in
wmark_ok being false.

And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
is not woken up, further exacerbating the problem:

crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
$775 = __MAX_NR_ZONES

This patch modifies allow_direct_reclaim() to wake kswapd if the
pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
true or false. This change ensures kswapd does not miss wake-ups under
high memory pressure, reducing the risk of task stalls in the throttled
reclaim path.

Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Andrew Morton Nov. 28, 2024, 12:49 a.m. UTC | #1
On Wed, 27 Nov 2024 00:06:12 +0900 Seiji Nishikawa <snishika@redhat.com> wrote:

> Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
> zone_page_state_snapshot"), a task may remain indefinitely stuck in
> throttle_direct_reclaim() while holding mmap_sem.
> 
> __alloc_pages_nodemask
>  try_to_free_pages
>   throttle_direct_reclaim
> 
> This can cause numerous other tasks to wait on the same rwsem, leading
> to severe system hangups:
> 
> [1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
> [1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
> [1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
> [1088963.381869] Call trace:
> [1088963.381872]  __switch_to+0xd0/0x120
> [1088963.381877]  __schedule+0x340/0xac8
> [1088963.381881]  schedule+0x68/0x118
> [1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8
> 
> The issue arises when allow_direct_reclaim(pgdat) returns false and keeps
> returning false: throttle_direct_reclaim() then loops without making
> progress, even though the pgdat->pfmemalloc_wait wait queue is empty and
> there are no throttled tasks left to wait for.
> 
> In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
>  > 0), but calculations of pfmemalloc_reserve and free_pages result in
> wmark_ok being false.
> 
> And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
> is not woken up, further exacerbating the problem:
> 
> crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
> $775 = __MAX_NR_ZONES
> 
> This patch modifies allow_direct_reclaim() to wake kswapd if the
> pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
> true or false. This change ensures kswapd does not miss wake-ups under
> high memory pressure, reducing the risk of task stalls in the throttled
> reclaim path.

The code which is being altered is over 10 years old.  

Is this misbehavior more recent?  If so, are we able to identify which
commit caused this?

Otherwise, can you suggest why it took so long for this to be
discovered?  Your test case must be doing something unusual?

Thanks.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
>  
>  	wmark_ok = free_pages > pfmemalloc_reserve / 2;
>  
> -	/* kswapd must be awake if processes are being throttled */
> -	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> +	/* Always wake up kswapd if the wait queue is not empty */
> +	if (waitqueue_active(&pgdat->kswapd_wait)) {
>  		if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
>  			WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
>
Seiji Nishikawa Nov. 29, 2024, 4:39 a.m. UTC | #2
On Thu, Nov 28, 2024 at 9:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 27 Nov 2024 00:06:12 +0900 Seiji Nishikawa <snishika@redhat.com> wrote:
>
> > Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
> > zone_page_state_snapshot"), a task may remain indefinitely stuck in
> > throttle_direct_reclaim() while holding mmap_sem.
> >
> > __alloc_pages_nodemask
> >  try_to_free_pages
> >   throttle_direct_reclaim
> >
> > This can cause numerous other tasks to wait on the same rwsem, leading
> > to severe system hangups:
> >
> > [1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
> > [1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
> > [1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
> > [1088963.381869] Call trace:
> > [1088963.381872]  __switch_to+0xd0/0x120
> > [1088963.381877]  __schedule+0x340/0xac8
> > [1088963.381881]  schedule+0x68/0x118
> > [1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8
> >
> > The issue arises when allow_direct_reclaim(pgdat) returns false and keeps
> > returning false: throttle_direct_reclaim() then loops without making
> > progress, even though the pgdat->pfmemalloc_wait wait queue is empty and
> > there are no throttled tasks left to wait for.
> >
> > In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
> >  > 0), but calculations of pfmemalloc_reserve and free_pages result in
> > wmark_ok being false.
> >
> > And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
> > is not woken up, further exacerbating the problem:
> >
> > crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
> > $775 = __MAX_NR_ZONES
> >
> > This patch modifies allow_direct_reclaim() to wake kswapd if the
> > pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
> > true or false. This change ensures kswapd does not miss wake-ups under
> > high memory pressure, reducing the risk of task stalls in the throttled
> > reclaim path.
>
> The code which is being altered is over 10 years old.
>
> Is this misbehavior more recent?  If so, are we able to identify which
> commit caused this?

The issue is not new but may have become more noticeable after commit 
501b26510ae3, which made allow_direct_reclaim() use 
zone_page_state_snapshot() for a more precise free-page count. That 
change exposed edge cases where wmark_ok is false despite reclaimable 
pages being available.

> Otherwise, can you suggest why it took so long for this to be
> discovered?  Your test case must be doing something unusual?

The issue likely occurs only under a specific combination of conditions: 
high memory pressure with frequent direct reclaim, contention on mmap_sem 
from concurrent memory allocations, and zone states in which reclaimable 
pages exist yet wmark_ok evaluates to false.

Modern workloads (e.g., Python multiprocessing) and changes in kernel 
reclaim logic may have surfaced such edge cases more prominently than 
before.

The workload involves concurrent Python processes under high memory 
pressure, leading to contention on mmap_sem. While not unusual, this 
workload may trigger a rare combination of conditions that expose the 
issue.

>
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
> >
> >       wmark_ok = free_pages > pfmemalloc_reserve / 2;
> >
> > -     /* kswapd must be awake if processes are being throttled */
> > -     if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> > +     /* Always wake up kswapd if the wait queue is not empty */
> > +     if (waitqueue_active(&pgdat->kswapd_wait)) {
> >               if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
> >                       WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
> >

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e3..b1b3e5a116a8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
 
 	wmark_ok = free_pages > pfmemalloc_reserve / 2;
 
-	/* kswapd must be awake if processes are being throttled */
-	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+	/* Always wake up kswapd if the wait queue is not empty */
+	if (waitqueue_active(&pgdat->kswapd_wait)) {
 		if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
 			WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);