
mm: avoid blocking lock_page() in kcompactd

Message ID 20200109225646.22983-1-xiyou.wangcong@gmail.com (mailing list archive)
State New, archived
Series: mm: avoid blocking lock_page() in kcompactd

Commit Message

Cong Wang Jan. 9, 2020, 10:56 p.m. UTC
We observed kcompactd hung at __lock_page():

 INFO: task kcompactd0:57 blocked for more than 120 seconds.
       Not tainted 4.19.56.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 kcompactd0      D    0    57      2 0x80000000
 Call Trace:
  ? __schedule+0x236/0x860
  schedule+0x28/0x80
  io_schedule+0x12/0x40
  __lock_page+0xf9/0x120
  ? page_cache_tree_insert+0xb0/0xb0
  ? update_pageblock_skip+0xb0/0xb0
  migrate_pages+0x88c/0xb90
  ? isolate_freepages_block+0x3b0/0x3b0
  compact_zone+0x5f1/0x870
  kcompactd_do_work+0x130/0x2c0
  ? __switch_to_asm+0x35/0x70
  ? __switch_to_asm+0x41/0x70
  ? kcompactd_do_work+0x2c0/0x2c0
  ? kcompactd+0x73/0x180
  kcompactd+0x73/0x180
  ? finish_wait+0x80/0x80
  kthread+0x113/0x130
  ? kthread_create_worker_on_cpu+0x50/0x50
  ret_from_fork+0x35/0x40

which faddr2line maps to:

  migrate_pages+0x88c/0xb90:
  lock_page at include/linux/pagemap.h:483
  (inlined by) __unmap_and_move at mm/migrate.c:1024
  (inlined by) unmap_and_move at mm/migrate.c:1189
  (inlined by) migrate_pages at mm/migrate.c:1419

Sometimes kcompactd eventually got out of this situation, sometimes not.

I think for memory compaction, it is a best effort to migrate the pages,
so it doesn't have to wait for I/O to complete. It is fine to call
trylock_page() here, which is pretty much similar to
buffer_migrate_lock_buffers().

Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
check for it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 mm/migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
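
For reference, the patch body is the single hunk below (trailing context
trimmed; it is quoted in full in the replies that follow):

diff --git a/mm/migrate.c b/mm/migrate.c
index 86873b6f38a7..df60026779d2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1010,7 +1010,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	bool is_lru = !__PageMovable(page);
 
 	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC)
+		if (!force || mode == MIGRATE_ASYNC
+			   || mode == MIGRATE_SYNC_LIGHT)
 			goto out;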

Comments

Yang Shi Jan. 10, 2020, 12:28 a.m. UTC | #1
On Thu, Jan 9, 2020 at 2:57 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> We observed kcompactd hung at __lock_page():
>
>  INFO: task kcompactd0:57 blocked for more than 120 seconds.
>        Not tainted 4.19.56.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  kcompactd0      D    0    57      2 0x80000000
>  Call Trace:
>   ? __schedule+0x236/0x860
>   schedule+0x28/0x80
>   io_schedule+0x12/0x40
>   __lock_page+0xf9/0x120
>   ? page_cache_tree_insert+0xb0/0xb0
>   ? update_pageblock_skip+0xb0/0xb0
>   migrate_pages+0x88c/0xb90
>   ? isolate_freepages_block+0x3b0/0x3b0
>   compact_zone+0x5f1/0x870
>   kcompactd_do_work+0x130/0x2c0
>   ? __switch_to_asm+0x35/0x70
>   ? __switch_to_asm+0x41/0x70
>   ? kcompactd_do_work+0x2c0/0x2c0
>   ? kcompactd+0x73/0x180
>   kcompactd+0x73/0x180
>   ? finish_wait+0x80/0x80
>   kthread+0x113/0x130
>   ? kthread_create_worker_on_cpu+0x50/0x50
>   ret_from_fork+0x35/0x40
>
> which faddr2line maps to:
>
>   migrate_pages+0x88c/0xb90:
>   lock_page at include/linux/pagemap.h:483
>   (inlined by) __unmap_and_move at mm/migrate.c:1024
>   (inlined by) unmap_and_move at mm/migrate.c:1189
>   (inlined by) migrate_pages at mm/migrate.c:1419
>
> Sometimes kcompactd eventually got out of this situation, sometimes not.
>
> I think for memory compaction, it is a best effort to migrate the pages,
> so it doesn't have to wait for I/O to complete. It is fine to call
> trylock_page() here, which is pretty much similar to
> buffer_migrate_lock_buffers().
>
> Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
> check for it.

But this changed the semantics of MIGRATE_SYNC_LIGHT which means
blocking on most operations but not ->writepage. When
MIGRATE_SYNC_LIGHT is used it means compaction priority is increased
(the initial priority is ASYNC) due to whatever reason (i.e. not
enough clean, non-writeback and non-locked pages to migrate). So, it
has to wait for some pages to try not to back off prematurely.

If I read the code correctly, buffer_migrate_lock_buffers() also
blocks on page lock with non-ASYNC mode.
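
For reference, a trimmed sketch (not a verbatim copy) of the 4.19-era
buffer_migrate_lock_buffers() in mm/migrate.c: only MIGRATE_ASYNC avoids
blocking, while every sync mode, including MIGRATE_SYNC_LIGHT, blocks in
lock_buffer():

static bool buffer_migrate_lock_buffers(struct buffer_head *head,
					enum migrate_mode mode)
{
	struct buffer_head *bh = head;

	/* Sync modes (including MIGRATE_SYNC_LIGHT): block until every
	 * buffer lock on the page is taken.
	 */
	if (mode != MIGRATE_ASYNC) {
		do {
			lock_buffer(bh);
			bh = bh->b_this_page;
		} while (bh != head);
		return true;
	}

	/* MIGRATE_ASYNC: cannot block, so trylock only; on failure release
	 * the buffer locks already taken and give up on this page.
	 */
	do {
		if (!trylock_buffer(bh)) {
			struct buffer_head *failed_bh = bh;

			bh = head;
			while (bh != failed_bh) {
				unlock_buffer(bh);
				bh = bh->b_this_page;
			}
			return false;
		}
		bh = bh->b_this_page;
	} while (bh != head);
	return true;
}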

Since v5.1 Mel Gorman improved compaction a lot. So, I'm wondering if
this happens on the latest upstream or not.

And, did you figure out who is locking the page for such a long time? Or
there might be too many waiters on the list for this page?

>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  mm/migrate.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 86873b6f38a7..df60026779d2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1010,7 +1010,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>         bool is_lru = !__PageMovable(page);
>
>         if (!trylock_page(page)) {
> -               if (!force || mode == MIGRATE_ASYNC)
> +               if (!force || mode == MIGRATE_ASYNC
> +                          || mode == MIGRATE_SYNC_LIGHT)
>                         goto out;
>
>                 /*
> --
> 2.21.1
>
>
Cong Wang Jan. 10, 2020, 1:01 a.m. UTC | #2
On Thu, Jan 9, 2020 at 4:28 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 2:57 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > We observed kcompactd hung at __lock_page():
> >
> >  INFO: task kcompactd0:57 blocked for more than 120 seconds.
> >        Not tainted 4.19.56.x86_64 #1
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  kcompactd0      D    0    57      2 0x80000000
> >  Call Trace:
> >   ? __schedule+0x236/0x860
> >   schedule+0x28/0x80
> >   io_schedule+0x12/0x40
> >   __lock_page+0xf9/0x120
> >   ? page_cache_tree_insert+0xb0/0xb0
> >   ? update_pageblock_skip+0xb0/0xb0
> >   migrate_pages+0x88c/0xb90
> >   ? isolate_freepages_block+0x3b0/0x3b0
> >   compact_zone+0x5f1/0x870
> >   kcompactd_do_work+0x130/0x2c0
> >   ? __switch_to_asm+0x35/0x70
> >   ? __switch_to_asm+0x41/0x70
> >   ? kcompactd_do_work+0x2c0/0x2c0
> >   ? kcompactd+0x73/0x180
> >   kcompactd+0x73/0x180
> >   ? finish_wait+0x80/0x80
> >   kthread+0x113/0x130
> >   ? kthread_create_worker_on_cpu+0x50/0x50
> >   ret_from_fork+0x35/0x40
> >
> > which faddr2line maps to:
> >
> >   migrate_pages+0x88c/0xb90:
> >   lock_page at include/linux/pagemap.h:483
> >   (inlined by) __unmap_and_move at mm/migrate.c:1024
> >   (inlined by) unmap_and_move at mm/migrate.c:1189
> >   (inlined by) migrate_pages at mm/migrate.c:1419
> >
> > Sometimes kcompactd eventually got out of this situation, sometimes not.
> >
> > I think for memory compaction, it is a best effort to migrate the pages,
> > so it doesn't have to wait for I/O to complete. It is fine to call
> > trylock_page() here, which is pretty much similar to
> > buffer_migrate_lock_buffers().
> >
> > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
> > check for it.
>
> But this changed the semantics of MIGRATE_SYNC_LIGHT which means
> blocking on most operations but not ->writepage. When
> MIGRATE_SYNC_LIGHT is used it means compaction priority is increased
> (the initial priority is ASYNC) due to whatever reason (i.e. not
> enough clean, non-writeback and non-locked pages to migrate). So, it
> has to wait for some pages to try not to back off prematurely.

Thanks for explaining MIGRATE_SYNC_LIGHT. I didn't dig into the history
of MIGRATE_SYNC_LIGHT.

>
> If I read the code correctly, buffer_migrate_lock_buffers() also
> blocks on page lock with non-ASYNC mode.
>
> Since v5.1 Mel Gorman improved compaction a lot. So, I'm wondering if
> this happens on the latest upstream or not.

Unfortunately we can't test the upstream kernel because:
1) We don't know how to reproduce it.
2) It is not easy to deploy an upstream kernel in our environment.


>
> And, did you figure out who is locking the page for such a long time? Or
> there might be too many waiters on the list for this page?

I tried to dump the process stacks after we saw the hung task
with /proc/X/stack (apparently sysrq is not an option for production),
but I didn't find anything useful. I didn't see any other process
hung in lock_page() either; all mm-related kernel threads were sleeping
(non-D) at the time I debugged. So, it is possible there was some place
missing an unlock_page() too, which is too late to debug after the hung
task was reported.

Thanks.
Cong Wang Jan. 10, 2020, 4:51 a.m. UTC | #3
On Thu, Jan 9, 2020 at 5:01 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> I tried to dump the process stacks after we saw the hung task
> with /proc/X/stack (apparently sysrq is not an option for production),
> but I didn't find anything useful. I didn't see any other process
> hung in lock_page() either; all mm-related kernel threads were sleeping
> (non-D) at the time I debugged. So, it is possible there was some place
> missing an unlock_page() too, which is too late to debug after the hung
> task was reported.

Never mind. I found the process locking the page; /proc/X/stack apparently
fooled me by hiding the page fault stack. :-/
Michal Hocko Jan. 10, 2020, 7:38 a.m. UTC | #4
[CC Mel and Vlastimil]

On Thu 09-01-20 14:56:46, Cong Wang wrote:
> We observed kcompactd hung at __lock_page():
> 
>  INFO: task kcompactd0:57 blocked for more than 120 seconds.
>        Not tainted 4.19.56.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  kcompactd0      D    0    57      2 0x80000000
>  Call Trace:
>   ? __schedule+0x236/0x860
>   schedule+0x28/0x80
>   io_schedule+0x12/0x40
>   __lock_page+0xf9/0x120
>   ? page_cache_tree_insert+0xb0/0xb0
>   ? update_pageblock_skip+0xb0/0xb0
>   migrate_pages+0x88c/0xb90
>   ? isolate_freepages_block+0x3b0/0x3b0
>   compact_zone+0x5f1/0x870
>   kcompactd_do_work+0x130/0x2c0
>   ? __switch_to_asm+0x35/0x70
>   ? __switch_to_asm+0x41/0x70
>   ? kcompactd_do_work+0x2c0/0x2c0
>   ? kcompactd+0x73/0x180
>   kcompactd+0x73/0x180
>   ? finish_wait+0x80/0x80
>   kthread+0x113/0x130
>   ? kthread_create_worker_on_cpu+0x50/0x50
>   ret_from_fork+0x35/0x40
> 
> which faddr2line maps to:
> 
>   migrate_pages+0x88c/0xb90:
>   lock_page at include/linux/pagemap.h:483
>   (inlined by) __unmap_and_move at mm/migrate.c:1024
>   (inlined by) unmap_and_move at mm/migrate.c:1189
>   (inlined by) migrate_pages at mm/migrate.c:1419
> 
> Sometimes kcompactd eventually got out of this situation, sometimes not.

What does this mean exactly? Who is holding the page lock?

> I think for memory compaction, it is a best effort to migrate the pages,
> so it doesn't have to wait for I/O to complete. It is fine to call
> trylock_page() here, which is pretty much similar to
> buffer_migrate_lock_buffers().
> 
> Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
> check for it.

The exact definition of MIGRATE_SYNC_LIGHT is a bit fuzzy but bailing
out on the page lock sounds too early to me. So far we have tried to
block on everything but the writeback, which can take a lot of time.
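
For reference, a condensed sketch (not verbatim) of the 4.19-era checks at
the top of mm/migrate.c:__unmap_and_move(): the page lock is waited for
whenever force is set and the mode is not async, while writeback is only
waited for in the full MIGRATE_SYNC modes:

	if (!trylock_page(page)) {
		if (!force || mode == MIGRATE_ASYNC)
			goto out;
		/* direct compaction bails out here as well (PF_MEMALLOC
		 * check omitted); kcompactd falls through and may block,
		 * which is the stack in the report above.
		 */
		lock_page(page);
	}

	if (PageWriteback(page)) {
		/* only full sync migration waits for writeback */
		switch (mode) {
		case MIGRATE_SYNC:
		case MIGRATE_SYNC_NO_COPY:
			break;
		default:
			rc = -EBUSY;
			goto out_unlock;
		}
		if (!force)
			goto out_unlock;
		wait_on_page_writeback(page);
	}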

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  mm/migrate.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 86873b6f38a7..df60026779d2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1010,7 +1010,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  	bool is_lru = !__PageMovable(page);
>  
>  	if (!trylock_page(page)) {
> -		if (!force || mode == MIGRATE_ASYNC)
> +		if (!force || mode == MIGRATE_ASYNC
> +			   || mode == MIGRATE_SYNC_LIGHT)
>  			goto out;
>  
>  		/*
> -- 
> 2.21.1
>
Mel Gorman Jan. 10, 2020, 9:22 a.m. UTC | #5
On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote:
> We observed kcompactd hung at __lock_page():
> 
>  INFO: task kcompactd0:57 blocked for more than 120 seconds.
>        Not tainted 4.19.56.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  kcompactd0      D    0    57      2 0x80000000
>  Call Trace:
>   ? __schedule+0x236/0x860
>   schedule+0x28/0x80
>   io_schedule+0x12/0x40
>   __lock_page+0xf9/0x120
>   ? page_cache_tree_insert+0xb0/0xb0
>   ? update_pageblock_skip+0xb0/0xb0
>   migrate_pages+0x88c/0xb90
>   ? isolate_freepages_block+0x3b0/0x3b0
>   compact_zone+0x5f1/0x870
>   kcompactd_do_work+0x130/0x2c0
>   ? __switch_to_asm+0x35/0x70
>   ? __switch_to_asm+0x41/0x70
>   ? kcompactd_do_work+0x2c0/0x2c0
>   ? kcompactd+0x73/0x180
>   kcompactd+0x73/0x180
>   ? finish_wait+0x80/0x80
>   kthread+0x113/0x130
>   ? kthread_create_worker_on_cpu+0x50/0x50
>   ret_from_fork+0x35/0x40
> 
> which faddr2line maps to:
> 
>   migrate_pages+0x88c/0xb90:
>   lock_page at include/linux/pagemap.h:483
>   (inlined by) __unmap_and_move at mm/migrate.c:1024
>   (inlined by) unmap_and_move at mm/migrate.c:1189
>   (inlined by) migrate_pages at mm/migrate.c:1419
> 
> Sometimes kcompactd eventually got out of this situation, sometimes not.
> 
> I think for memory compaction, it is a best effort to migrate the pages,
> so it doesn't have to wait for I/O to complete. It is fine to call
> trylock_page() here, which is pretty much similar to
> buffer_migrate_lock_buffers().
> 
> Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
> check for it.
> 

Is this a single page being locked for a long time or multiple pages
being locked without reaching a reschedule point?

If it's a single page being locked, it's important to identify what held
page lock for 2 minutes because that is potentially a missing
unlock_page. The kernel in question is old -- 4.19.56. Are there any
other modifications to that kernel?
Cong Wang Jan. 20, 2020, 10:41 p.m. UTC | #6
On Fri, Jan 10, 2020 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote:
> > We observed kcompactd hung at __lock_page():
> >
> >  INFO: task kcompactd0:57 blocked for more than 120 seconds.
> >        Not tainted 4.19.56.x86_64 #1
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  kcompactd0      D    0    57      2 0x80000000
> >  Call Trace:
> >   ? __schedule+0x236/0x860
> >   schedule+0x28/0x80
> >   io_schedule+0x12/0x40
> >   __lock_page+0xf9/0x120
> >   ? page_cache_tree_insert+0xb0/0xb0
> >   ? update_pageblock_skip+0xb0/0xb0
> >   migrate_pages+0x88c/0xb90
> >   ? isolate_freepages_block+0x3b0/0x3b0
> >   compact_zone+0x5f1/0x870
> >   kcompactd_do_work+0x130/0x2c0
> >   ? __switch_to_asm+0x35/0x70
> >   ? __switch_to_asm+0x41/0x70
> >   ? kcompactd_do_work+0x2c0/0x2c0
> >   ? kcompactd+0x73/0x180
> >   kcompactd+0x73/0x180
> >   ? finish_wait+0x80/0x80
> >   kthread+0x113/0x130
> >   ? kthread_create_worker_on_cpu+0x50/0x50
> >   ret_from_fork+0x35/0x40
> >
> > which faddr2line maps to:
> >
> >   migrate_pages+0x88c/0xb90:
> >   lock_page at include/linux/pagemap.h:483
> >   (inlined by) __unmap_and_move at mm/migrate.c:1024
> >   (inlined by) unmap_and_move at mm/migrate.c:1189
> >   (inlined by) migrate_pages at mm/migrate.c:1419
> >
> > Sometimes kcompactd eventually got out of this situation, sometimes not.
> >
> > I think for memory compaction, it is a best effort to migrate the pages,
> > so it doesn't have to wait for I/O to complete. It is fine to call
> > trylock_page() here, which is pretty much similar to
> > buffer_migrate_lock_buffers().
> >
> > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
> > check for it.
> >
>
> Is this a single page being locked for a long time or multiple pages
> being locked without reaching a reschedule point?

Not sure whether it is single page or multiple pages, but I successfully
located the process locking the page (or pages), and I used perf to
capture its stack trace:


 ffffffffa722aa06 shrink_inactive_list
 ffffffffa722b3d7 shrink_node_memcg
 ffffffffa722b85f shrink_node
 ffffffffa722bc89 do_try_to_free_pages
 ffffffffa722c179 try_to_free_mem_cgroup_pages
 ffffffffa7298703 try_charge
 ffffffffa729a886 mem_cgroup_try_charge
 ffffffffa720ec03 __add_to_page_cache_locked
 ffffffffa720ee3a add_to_page_cache_lru
 ffffffffa7312ddb iomap_readpages_actor
 ffffffffa73133f7 iomap_apply
 ffffffffa73135da iomap_readpages
 ffffffffa722062e read_pages
 ffffffffa7220b3f __do_page_cache_readahead
 ffffffffa7210554 filemap_fault
 ffffffffc039e41f __xfs_filemap_fault
 ffffffffa724f5e7 __do_fault
 ffffffffa724c5f2 __handle_mm_fault
 ffffffffa724cbc6 handle_mm_fault
 ffffffffa70a313e __do_page_fault
 ffffffffa7a00dfe page_fault

This process got stuck in this situation for a long time (since I sent out
this patch) without making any progress. It behaves like stuck in an infinite
loop, although the EIP still moves around within mem_cgroup_try_charge().
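
For reference, a heavily condensed sketch (not verbatim) of the 4.19-era
retry loop in mm/memcontrol.c:try_charge(), which is consistent with the
EIP staying inside mem_cgroup_try_charge() while the page being charged
stays locked:

retry:
	if (consume_stock(memcg, nr_pages))
		return 0;
	if (page_counter_try_charge(&memcg->memory, batch, &counter))
		goto done_restock;

	/* over the limit: reclaim from the memcg and retry */
	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
						    gfp_mask, may_swap);
	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
		goto retry;

	if (nr_retries--)
		goto retry;

	/* a "successful" memcg OOM resets the retry budget, so the loop
	 * can run for a long time when reclaim keeps making no progress
	 */
	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
				    get_order(nr_pages * PAGE_SIZE));
	if (oom_status == OOM_SUCCESS) {
		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
		goto retry;
	}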

I also enabled trace event mm_vmscan_lru_shrink_inactive(), here is what
I collected:

           <...>-455459 [003] .... 2691911.664706:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664711:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664714:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664717:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664720:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664725:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=7 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664730:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664732:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664736:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664739:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664744:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664747:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664752:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=12 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664755:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664761:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664762:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664764:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664770:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664777:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=21 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664780:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664783:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC


>
> If it's a single page being locked, it's important to identify what held
> page lock for 2 minutes because that is potentially a missing
> unlock_page. The kernel in question is old -- 4.19.56. Are there any
> other modifications to that kernel?

We only backported some networking and hardware driver patches,
not any MM change.

Please let me know if I can collect any other information you need,
before it gets rebooted.

Thanks.
Cong Wang Jan. 20, 2020, 10:48 p.m. UTC | #7
Hi, Michal

On Thu, Jan 9, 2020 at 11:38 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> [CC Mel and Vlastimil]
>
> On Thu 09-01-20 14:56:46, Cong Wang wrote:
> > We observed kcompactd hung at __lock_page():
> >
> >  INFO: task kcompactd0:57 blocked for more than 120 seconds.
> >        Not tainted 4.19.56.x86_64 #1
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  kcompactd0      D    0    57      2 0x80000000
> >  Call Trace:
> >   ? __schedule+0x236/0x860
> >   schedule+0x28/0x80
> >   io_schedule+0x12/0x40
> >   __lock_page+0xf9/0x120
> >   ? page_cache_tree_insert+0xb0/0xb0
> >   ? update_pageblock_skip+0xb0/0xb0
> >   migrate_pages+0x88c/0xb90
> >   ? isolate_freepages_block+0x3b0/0x3b0
> >   compact_zone+0x5f1/0x870
> >   kcompactd_do_work+0x130/0x2c0
> >   ? __switch_to_asm+0x35/0x70
> >   ? __switch_to_asm+0x41/0x70
> >   ? kcompactd_do_work+0x2c0/0x2c0
> >   ? kcompactd+0x73/0x180
> >   kcompactd+0x73/0x180
> >   ? finish_wait+0x80/0x80
> >   kthread+0x113/0x130
> >   ? kthread_create_worker_on_cpu+0x50/0x50
> >   ret_from_fork+0x35/0x40
> >
> > which faddr2line maps to:
> >
> >   migrate_pages+0x88c/0xb90:
> >   lock_page at include/linux/pagemap.h:483
> >   (inlined by) __unmap_and_move at mm/migrate.c:1024
> >   (inlined by) unmap_and_move at mm/migrate.c:1189
> >   (inlined by) migrate_pages at mm/migrate.c:1419
> >
> > Sometimes kcompactd eventually got out of this situation, sometimes not.
>
> What does this mean exactly? Who is holding the page lock?

As I explained in another email, I didn't locate the process holding the page
lock before I sent out this patch, as I was fooled by /proc/X/stack.

But now I got its stack trace with `perf`:

 ffffffffa722aa06 shrink_inactive_list
 ffffffffa722b3d7 shrink_node_memcg
 ffffffffa722b85f shrink_node
 ffffffffa722bc89 do_try_to_free_pages
 ffffffffa722c179 try_to_free_mem_cgroup_pages
 ffffffffa7298703 try_charge
 ffffffffa729a886 mem_cgroup_try_charge
 ffffffffa720ec03 __add_to_page_cache_locked
 ffffffffa720ee3a add_to_page_cache_lru
 ffffffffa7312ddb iomap_readpages_actor
 ffffffffa73133f7 iomap_apply
 ffffffffa73135da iomap_readpages
 ffffffffa722062e read_pages
 ffffffffa7220b3f __do_page_cache_readahead
 ffffffffa7210554 filemap_fault
 ffffffffc039e41f __xfs_filemap_fault
 ffffffffa724f5e7 __do_fault
 ffffffffa724c5f2 __handle_mm_fault
 ffffffffa724cbc6 handle_mm_fault
 ffffffffa70a313e __do_page_fault
 ffffffffa7a00dfe page_fault

It got stuck somewhere along the call path of mem_cgroup_try_charge(),
and the trace events of mm_vmscan_lru_shrink_inactive() indicate this
too:

         <...>-455459 [003] .... 2691911.664706:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664711:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664714:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664717:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664720:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664725:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=7 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664730:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664732:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664736:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664739:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664744:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664747:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664752:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=12 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664755:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664761:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664762:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664764:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664770:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664777:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=21 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664780:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455459 [003] .... 2691911.664783:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

Thanks.
Hillf Danton Jan. 21, 2020, 8:26 a.m. UTC | #8
On Mon, 20 Jan 2020 14:41:50 -0800 Cong Wang wrote:
> On Fri, Jan 10, 2020 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote:
> > > We observed kcompactd hung at __lock_page():
> > >
> > >  INFO: task kcompactd0:57 blocked for more than 120 seconds.
> > >        Not tainted 4.19.56.x86_64 #1
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >  kcompactd0      D    0    57      2 0x80000000
> > >  Call Trace:
> > >   ? __schedule+0x236/0x860
> > >   schedule+0x28/0x80
> > >   io_schedule+0x12/0x40
> > >   __lock_page+0xf9/0x120
> > >   ? page_cache_tree_insert+0xb0/0xb0
> > >   ? update_pageblock_skip+0xb0/0xb0
> > >   migrate_pages+0x88c/0xb90
> > >   ? isolate_freepages_block+0x3b0/0x3b0
> > >   compact_zone+0x5f1/0x870
> > >   kcompactd_do_work+0x130/0x2c0
> > >   ? __switch_to_asm+0x35/0x70
> > >   ? __switch_to_asm+0x41/0x70
> > >   ? kcompactd_do_work+0x2c0/0x2c0
> > >   ? kcompactd+0x73/0x180
> > >   kcompactd+0x73/0x180
> > >   ? finish_wait+0x80/0x80
> > >   kthread+0x113/0x130
> > >   ? kthread_create_worker_on_cpu+0x50/0x50
> > >   ret_from_fork+0x35/0x40
> > >
> > > which faddr2line maps to:
> > >
> > >   migrate_pages+0x88c/0xb90:
> > >   lock_page at include/linux/pagemap.h:483
> > >   (inlined by) __unmap_and_move at mm/migrate.c:1024
> > >   (inlined by) unmap_and_move at mm/migrate.c:1189
> > >   (inlined by) migrate_pages at mm/migrate.c:1419
> > >
> > > Sometimes kcompactd eventually got out of this situation, sometimes not.
> > >
> > > I think for memory compaction, it is a best effort to migrate the pages,
> > > so it doesn't have to wait for I/O to complete. It is fine to call
> > > trylock_page() here, which is pretty much similar to
> > > buffer_migrate_lock_buffers().
> > >
> > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
> > > check for it.
> > >
> >
> > Is this a single page being locked for a long time or multiple pages
> > being locked without reaching a reschedule point?
> 
> Not sure whether it is single page or multiple pages, but I successfully
> located the process locking the page (or pages), and I used perf to
> capture its stack trace:
> 
> 
>  ffffffffa722aa06 shrink_inactive_list
>  ffffffffa722b3d7 shrink_node_memcg
>  ffffffffa722b85f shrink_node
>  ffffffffa722bc89 do_try_to_free_pages
>  ffffffffa722c179 try_to_free_mem_cgroup_pages
>  ffffffffa7298703 try_charge
>  ffffffffa729a886 mem_cgroup_try_charge
>  ffffffffa720ec03 __add_to_page_cache_locked
>  ffffffffa720ee3a add_to_page_cache_lru
>  ffffffffa7312ddb iomap_readpages_actor
>  ffffffffa73133f7 iomap_apply
>  ffffffffa73135da iomap_readpages
>  ffffffffa722062e read_pages
>  ffffffffa7220b3f __do_page_cache_readahead
>  ffffffffa7210554 filemap_fault
>  ffffffffc039e41f __xfs_filemap_fault
>  ffffffffa724f5e7 __do_fault
>  ffffffffa724c5f2 __handle_mm_fault
>  ffffffffa724cbc6 handle_mm_fault
>  ffffffffa70a313e __do_page_fault
>  ffffffffa7a00dfe page_fault
> 
> This process got stuck in this situation for a long time (since I sent out
> this patch) without making any progress. It behaves like stuck in an infinite
> loop, although the EIP still moves around within mem_cgroup_try_charge().

Make page reclaim in try_charge() async, assuming sync reclaim is unnecessary
without memory pressure and does not help much under heavy pressure. Skipping
reclaim is confined to page fault context to avoid adding too much time there.

--- linux-5.5-rc3/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2525,6 +2525,12 @@ force:
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
 	css_get_many(&memcg->css, nr_pages);
+	/*
+	 * reclaim high limit pages soon without holding resources like
+	 * page lock e.g in page fault context
+	 */
+	if (unlikely(current->flags & PF_MEMALLOC))
+		schedule_work(&memcg->high_work);
 
 	return 0;
 
--- linux-5.5-rc3/mm/filemap.c	
+++ b/mm/filemap.c
@@ -863,8 +863,14 @@ static int __add_to_page_cache_locked(st
 	mapping_set_update(&xas, mapping);
 
 	if (!huge) {
+		bool was_set = current->flags & PF_MEMALLOC;
+		if (!was_set)
+			current->flags |= PF_MEMALLOC;
+
 		error = mem_cgroup_try_charge(page, current->mm,
 					      gfp_mask, &memcg, false);
+		if (!was_set)
+			current->flags &= ~PF_MEMALLOC;
 		if (error)
 			return error;
 	}
--

> I also enabled trace event mm_vmscan_lru_shrink_inactive(), here is what
> I collected:
> 
>            <...>-455459 [003] .... 2691911.664706:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664711:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664714:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664717:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664720:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664725:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=7 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664730:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664732:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664736:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664739:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664744:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664747:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664752:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=12 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
> te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664755:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664761:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664762:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664764:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664770:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664777:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=21 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
> te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664780:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664783:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
> 
> 
> >
> > If it's a single page being locked, it's important to identify what held
> > page lock for 2 minutes because that is potentially a missing
> > unlock_page. The kernel in question is old -- 4.19.56. Are there any
> > other modifications to that kernel?
> 
> We only backported some networking and hardware driver patches,
> not any MM change.
> 
> Please let me know if I can collect any other information you need,
> before it gets rebooted.
> 
> Thanks.
Michal Hocko Jan. 21, 2020, 9 a.m. UTC | #9
On Mon 20-01-20 14:48:05, Cong Wang wrote:
> Hi, Michal
> 
> On Thu, Jan 9, 2020 at 11:38 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > [CC Mel and Vlastimil]
> >
> > On Thu 09-01-20 14:56:46, Cong Wang wrote:
> > > We observed kcompactd hung at __lock_page():
> > >
> > >  INFO: task kcompactd0:57 blocked for more than 120 seconds.
> > >        Not tainted 4.19.56.x86_64 #1
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >  kcompactd0      D    0    57      2 0x80000000
> > >  Call Trace:
> > >   ? __schedule+0x236/0x860
> > >   schedule+0x28/0x80
> > >   io_schedule+0x12/0x40
> > >   __lock_page+0xf9/0x120
> > >   ? page_cache_tree_insert+0xb0/0xb0
> > >   ? update_pageblock_skip+0xb0/0xb0
> > >   migrate_pages+0x88c/0xb90
> > >   ? isolate_freepages_block+0x3b0/0x3b0
> > >   compact_zone+0x5f1/0x870
> > >   kcompactd_do_work+0x130/0x2c0
> > >   ? __switch_to_asm+0x35/0x70
> > >   ? __switch_to_asm+0x41/0x70
> > >   ? kcompactd_do_work+0x2c0/0x2c0
> > >   ? kcompactd+0x73/0x180
> > >   kcompactd+0x73/0x180
> > >   ? finish_wait+0x80/0x80
> > >   kthread+0x113/0x130
> > >   ? kthread_create_worker_on_cpu+0x50/0x50
> > >   ret_from_fork+0x35/0x40
> > >
> > > which faddr2line maps to:
> > >
> > >   migrate_pages+0x88c/0xb90:
> > >   lock_page at include/linux/pagemap.h:483
> > >   (inlined by) __unmap_and_move at mm/migrate.c:1024
> > >   (inlined by) unmap_and_move at mm/migrate.c:1189
> > >   (inlined by) migrate_pages at mm/migrate.c:1419
> > >
> > > Sometimes kcompactd eventually got out of this situation, sometimes not.
> >
> > What does this mean exactly? Who is holding the page lock?
> 
> As I explained in another email, I didn't locate the process holding the page
> lock before I sent out this patch, as I was fooled by /proc/X/stack.
> 
> But now I got its stack trace with `perf`:
> 
>  ffffffffa722aa06 shrink_inactive_list
>  ffffffffa722b3d7 shrink_node_memcg
>  ffffffffa722b85f shrink_node
>  ffffffffa722bc89 do_try_to_free_pages
>  ffffffffa722c179 try_to_free_mem_cgroup_pages
>  ffffffffa7298703 try_charge
>  ffffffffa729a886 mem_cgroup_try_charge
>  ffffffffa720ec03 __add_to_page_cache_locked
>  ffffffffa720ee3a add_to_page_cache_lru
>  ffffffffa7312ddb iomap_readpages_actor
>  ffffffffa73133f7 iomap_apply
>  ffffffffa73135da iomap_readpages
>  ffffffffa722062e read_pages
>  ffffffffa7220b3f __do_page_cache_readahead
>  ffffffffa7210554 filemap_fault
>  ffffffffc039e41f __xfs_filemap_fault
>  ffffffffa724f5e7 __do_fault
>  ffffffffa724c5f2 __handle_mm_fault
>  ffffffffa724cbc6 handle_mm_fault
>  ffffffffa70a313e __do_page_fault
>  ffffffffa7a00dfe page_fault
> 
> It got stuck somewhere along the call path of mem_cgroup_try_charge(),
> and the trace events of mm_vmscan_lru_shrink_inactive() indicate this
> too:

So it seems that you are contending on the page lock. It is really
unexpected that the reclaim would take that long though. Please try to
enable more vmscan tracepoints to see where the time is spent.

Thanks!
Michal Hocko Jan. 21, 2020, 9:06 a.m. UTC | #10
On Tue 21-01-20 16:26:24, Hillf Danton wrote:
> 
> On Mon, 20 Jan 2020 14:41:50 -0800 Cong Wang wrote:
> > On Fri, Jan 10, 2020 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote:
> > > > We observed kcompactd hung at __lock_page():
> > > >
> > > >  INFO: task kcompactd0:57 blocked for more than 120 seconds.
> > > >        Not tainted 4.19.56.x86_64 #1
> > > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > >  kcompactd0      D    0    57      2 0x80000000
> > > >  Call Trace:
> > > >   ? __schedule+0x236/0x860
> > > >   schedule+0x28/0x80
> > > >   io_schedule+0x12/0x40
> > > >   __lock_page+0xf9/0x120
> > > >   ? page_cache_tree_insert+0xb0/0xb0
> > > >   ? update_pageblock_skip+0xb0/0xb0
> > > >   migrate_pages+0x88c/0xb90
> > > >   ? isolate_freepages_block+0x3b0/0x3b0
> > > >   compact_zone+0x5f1/0x870
> > > >   kcompactd_do_work+0x130/0x2c0
> > > >   ? __switch_to_asm+0x35/0x70
> > > >   ? __switch_to_asm+0x41/0x70
> > > >   ? kcompactd_do_work+0x2c0/0x2c0
> > > >   ? kcompactd+0x73/0x180
> > > >   kcompactd+0x73/0x180
> > > >   ? finish_wait+0x80/0x80
> > > >   kthread+0x113/0x130
> > > >   ? kthread_create_worker_on_cpu+0x50/0x50
> > > >   ret_from_fork+0x35/0x40
> > > >
> > > > which faddr2line maps to:
> > > >
> > > >   migrate_pages+0x88c/0xb90:
> > > >   lock_page at include/linux/pagemap.h:483
> > > >   (inlined by) __unmap_and_move at mm/migrate.c:1024
> > > >   (inlined by) unmap_and_move at mm/migrate.c:1189
> > > >   (inlined by) migrate_pages at mm/migrate.c:1419
> > > >
> > > > Sometimes kcompactd eventually got out of this situation, sometimes not.
> > > >
> > > > I think for memory compaction, it is a best effort to migrate the pages,
> > > > so it doesn't have to wait for I/O to complete. It is fine to call
> > > > trylock_page() here, which is pretty much similar to
> > > > buffer_migrate_lock_buffers().
> > > >
> > > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
> > > > check for it.
> > > >
> > >
> > > Is this a single page being locked for a long time or multiple pages
> > > being locked without reaching a reschedule point?
> > 
> > Not sure whether it is single page or multiple pages, but I successfully
> > located the process locking the page (or pages), and I used perf to
> > capture its stack trace:
> > 
> > 
> >  ffffffffa722aa06 shrink_inactive_list
> >  ffffffffa722b3d7 shrink_node_memcg
> >  ffffffffa722b85f shrink_node
> >  ffffffffa722bc89 do_try_to_free_pages
> >  ffffffffa722c179 try_to_free_mem_cgroup_pages
> >  ffffffffa7298703 try_charge
> >  ffffffffa729a886 mem_cgroup_try_charge
> >  ffffffffa720ec03 __add_to_page_cache_locked
> >  ffffffffa720ee3a add_to_page_cache_lru
> >  ffffffffa7312ddb iomap_readpages_actor
> >  ffffffffa73133f7 iomap_apply
> >  ffffffffa73135da iomap_readpages
> >  ffffffffa722062e read_pages
> >  ffffffffa7220b3f __do_page_cache_readahead
> >  ffffffffa7210554 filemap_fault
> >  ffffffffc039e41f __xfs_filemap_fault
> >  ffffffffa724f5e7 __do_fault
> >  ffffffffa724c5f2 __handle_mm_fault
> >  ffffffffa724cbc6 handle_mm_fault
> >  ffffffffa70a313e __do_page_fault
> >  ffffffffa7a00dfe page_fault
> > 
> > This process got stuck in this situation for a long time (since I sent out
> > this patch) without making any progress. It behaves like stuck in an infinite
> > loop, although the EIP still moves around within mem_cgroup_try_charge().
> 
> Make page reclaim in try_charge() async, assuming sync reclaim is unnecessary
> without memory pressure and does not help much under heavy pressure. Skipping
> reclaim is confined to page fault context to avoid adding too much time there.
> 
> --- linux-5.5-rc3/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2525,6 +2525,12 @@ force:
>  	if (do_memsw_account())
>  		page_counter_charge(&memcg->memsw, nr_pages);
>  	css_get_many(&memcg->css, nr_pages);
> +	/*
> +	 * reclaim high limit pages soon without holding resources like
> +	 * page lock e.g in page fault context
> +	 */
> +	if (unlikely(current->flags & PF_MEMALLOC))
> +		schedule_work(&memcg->high_work);
>  
>  	return 0;
>  
> --- linux-5.5-rc3/mm/filemap.c	
> +++ b/mm/filemap.c
> @@ -863,8 +863,14 @@ static int __add_to_page_cache_locked(st
>  	mapping_set_update(&xas, mapping);
>  
>  	if (!huge) {
> +		bool was_set = current->flags & PF_MEMALLOC;
> +		if (!was_set)
> +			current->flags |= PF_MEMALLOC;
> +
>  		error = mem_cgroup_try_charge(page, current->mm,
>  					      gfp_mask, &memcg, false);
> +		if (!was_set)
> +			current->flags &= ~PF_MEMALLOC;
>  		if (error)
>  			return error;

Not only does this not help at all, because the direct reclaim for the
hard limit is done already; the high limit is also already handled
properly when we are under the hard limit. So this patch doesn't really
make much sense to me.
Yang Shi Jan. 21, 2020, 7:21 p.m. UTC | #11
On Mon, Jan 20, 2020 at 2:42 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Fri, Jan 10, 2020 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote:
> > > We observed kcompactd hung at __lock_page():
> > >
> > >  INFO: task kcompactd0:57 blocked for more than 120 seconds.
> > >        Not tainted 4.19.56.x86_64 #1
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >  kcompactd0      D    0    57      2 0x80000000
> > >  Call Trace:
> > >   ? __schedule+0x236/0x860
> > >   schedule+0x28/0x80
> > >   io_schedule+0x12/0x40
> > >   __lock_page+0xf9/0x120
> > >   ? page_cache_tree_insert+0xb0/0xb0
> > >   ? update_pageblock_skip+0xb0/0xb0
> > >   migrate_pages+0x88c/0xb90
> > >   ? isolate_freepages_block+0x3b0/0x3b0
> > >   compact_zone+0x5f1/0x870
> > >   kcompactd_do_work+0x130/0x2c0
> > >   ? __switch_to_asm+0x35/0x70
> > >   ? __switch_to_asm+0x41/0x70
> > >   ? kcompactd_do_work+0x2c0/0x2c0
> > >   ? kcompactd+0x73/0x180
> > >   kcompactd+0x73/0x180
> > >   ? finish_wait+0x80/0x80
> > >   kthread+0x113/0x130
> > >   ? kthread_create_worker_on_cpu+0x50/0x50
> > >   ret_from_fork+0x35/0x40
> > >
> > > which faddr2line maps to:
> > >
> > >   migrate_pages+0x88c/0xb90:
> > >   lock_page at include/linux/pagemap.h:483
> > >   (inlined by) __unmap_and_move at mm/migrate.c:1024
> > >   (inlined by) unmap_and_move at mm/migrate.c:1189
> > >   (inlined by) migrate_pages at mm/migrate.c:1419
> > >
> > > Sometimes kcompactd eventually got out of this situation, sometimes not.
> > >
> > > I think for memory compaction, it is a best effort to migrate the pages,
> > > so it doesn't have to wait for I/O to complete. It is fine to call
> > > trylock_page() here, which is pretty much similar to
> > > buffer_migrate_lock_buffers().
> > >
> > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the
> > > check for it.
> > >
> >
> > Is this a single page being locked for a long time or multiple pages
> > being locked without reaching a reschedule point?
>
> Not sure whether it is single page or multiple pages, but I successfully
> located the process locking the page (or pages), and I used perf to
> capture its stack trace:
>
>
>  ffffffffa722aa06 shrink_inactive_list
>  ffffffffa722b3d7 shrink_node_memcg
>  ffffffffa722b85f shrink_node
>  ffffffffa722bc89 do_try_to_free_pages
>  ffffffffa722c179 try_to_free_mem_cgroup_pages
>  ffffffffa7298703 try_charge
>  ffffffffa729a886 mem_cgroup_try_charge
>  ffffffffa720ec03 __add_to_page_cache_locked
>  ffffffffa720ee3a add_to_page_cache_lru
>  ffffffffa7312ddb iomap_readpages_actor
>  ffffffffa73133f7 iomap_apply
>  ffffffffa73135da iomap_readpages
>  ffffffffa722062e read_pages
>  ffffffffa7220b3f __do_page_cache_readahead
>  ffffffffa7210554 filemap_fault
>  ffffffffc039e41f __xfs_filemap_fault
>  ffffffffa724f5e7 __do_fault
>  ffffffffa724c5f2 __handle_mm_fault
>  ffffffffa724cbc6 handle_mm_fault
>  ffffffffa70a313e __do_page_fault
>  ffffffffa7a00dfe page_fault
>
> This process got stuck in this situation for a long time (since I sent out
> this patch) without making any progress. It behaves like stuck in an infinite
> loop, although the EIP still moves around within mem_cgroup_try_charge().
>
> I also enabled trace event mm_vmscan_lru_shrink_inactive(), here is what
> I collected:
>
>            <...>-455459 [003] .... 2691911.664706:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664711:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664714:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664717:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664720:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664725:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=7 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664730:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664732:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664736:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664739:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664744:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664747:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664752:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=12 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
> te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664755:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664761:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664762:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664764:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664770:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664777:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=21 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
> te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664780:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>            <...>-455459 [003] .... 2691911.664783:
> mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0
> nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
> e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

Thanks for sharing the tracing result. I suspect the case might be:

                       CPU A                                CPU B
page fault
    locked the page
        try charge
            do limit reclaim
                                                           compaction
                                                           tried to isolate the locked page


But it looks like the limit reclaim took forever without reclaiming any
pages, and it neither hit OOM nor bailed out correctly. And, according to
the tracing result, it seems the inactive LRU is very short, since it only
scanned a few pages even at very high priority.

So, other than Michal's suggestion, I'd suggest you inspect what the
jobs are doing in that memcg.

>
>
> >
> > If it's a single page being locked, it's important to identify what held
> > page lock for 2 minutes because that is potentially a missing
> > unlock_page. The kernel in question is old -- 4.19.56. Are there any
> > other modifications to that kernel?
>
> We only backported some networking and hardware driver patches,
> not any MM change.
>
> Please let me know if I can collect any other information you need,
> before it gets rebooted.
>
> Thanks.
>
Cong Wang Jan. 26, 2020, 7:53 p.m. UTC | #12
On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 20-01-20 14:48:05, Cong Wang wrote:
> > It got stuck somewhere along the call path of mem_cgroup_try_charge(),
> > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this
> > too:
>
> So it seems that you are condending on the page lock. It is really
> unexpected that the reclaim would take that long though. Please try to
> enable more vmscan tracepoints to see where the time is spent.

Sorry for the delay. I have been trying to collect more data in one shot.

This is a typical round of the loop after enabling all vmscan tracepoints:

           <...>-455450 [007] .... 4048595.842992:
mm_vmscan_memcg_reclaim_begin: order=0 may_writepage=1
gfp_flags=GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE
classzone_idx=
4
           <...>-455450 [007] d... 4048595.842995:
mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0
nr_requested=1 nr_scanned=1 nr_skipped=0 nr_taken=1 lru=inactive_file
           <...>-455450 [007] .... 4048595.842995:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455450 [007] dN.. 4048595.842997:
mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0
nr_requested=1 nr_scanned=1 nr_skipped=0 nr_taken=1 lru=inactive_file
           <...>-455450 [007] .... 4048595.843001:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455450 [007] d... 4048595.843002:
mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0
nr_requested=5 nr_scanned=5 nr_skipped=0 nr_taken=5 lru=inactive_file
           <...>-455450 [007] .... 4048595.843004:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455450 [007] d... 4048595.843006:
mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0
nr_requested=9 nr_scanned=9 nr_skipped=0 nr_taken=9 lru=inactive_file
           <...>-455450 [007] .... 4048595.843007:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=9 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat
e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455450 [007] d... 4048595.843009:
mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0
nr_requested=17 nr_scanned=15 nr_skipped=0 nr_taken=15
lru=inactive_file
           <...>-455450 [007] .... 4048595.843011:
mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=15 nr_reclaimed=0
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa
te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-455450 [007] .... 4048595.843012:
mm_vmscan_memcg_reclaim_end: nr_reclaimed=0

The whole trace output is huge (33M), I can provide it on request.

I suspect the process gets stuck in the retry loop in try_charge(), as
the _shortest_ stacktrace of the perf samples indicated:

cycles:ppp:
        ffffffffa72963db mem_cgroup_iter
        ffffffffa72980ca mem_cgroup_oom_unlock
        ffffffffa7298c15 try_charge
        ffffffffa729a886 mem_cgroup_try_charge
        ffffffffa720ec03 __add_to_page_cache_locked
        ffffffffa720ee3a add_to_page_cache_lru
        ffffffffa7312ddb iomap_readpages_actor
        ffffffffa73133f7 iomap_apply
        ffffffffa73135da iomap_readpages
        ffffffffa722062e read_pages
        ffffffffa7220b3f __do_page_cache_readahead
        ffffffffa7210554 filemap_fault
        ffffffffc039e41f __xfs_filemap_fault
        ffffffffa724f5e7 __do_fault
        ffffffffa724c5f2 __handle_mm_fault
        ffffffffa724cbc6 handle_mm_fault
        ffffffffa70a313e __do_page_fault
        ffffffffa7a00dfe page_fault

But I don't see how that could be: the only possible case is when
mem_cgroup_oom() returns OOM_SUCCESS. However, I can't
find any clue in dmesg pointing to OOM. The processes in the
same memcg are either running or sleeping (that is, not exiting or
coredumping), so I don't see how or why they could be selected as
OOM-killer victims. I don't see any pending signal in
their /proc/X/status either.

Here is the status of the whole memcg:

$ sudo cat /sys/fs/cgroup/memory/system.slice/osqueryd.service/memory.oom_control
oom_kill_disable 0
under_oom 0
oom_kill 0

$ sudo cat /sys/fs/cgroup/memory/system.slice/osqueryd.service/memory.stat
cache 376832
rss 255864832
rss_huge 0
shmem 0
mapped_file 135168
dirty 135168
writeback 0
pgpgin 5327157
pgpgout 5264598
pgfault 4251687
pgmajfault 7491
inactive_anon 4096
active_anon 256184320
inactive_file 32768
active_file 0
unevictable 0
hierarchical_memory_limit 262144000
total_cache 376832
total_rss 255864832
total_rss_huge 0
total_shmem 0
total_mapped_file 135168
total_dirty 135168
total_writeback 0
total_pgpgin 5327157
total_pgpgout 5264598
total_pgfault 4251687
total_pgmajfault 7491
total_inactive_anon 4096
total_active_anon 256184320
total_inactive_file 86016
total_active_file 0
total_unevictable 0

Please let me know if I can provide anything else.

Thanks.
Matthew Wilcox Jan. 26, 2020, 11:39 p.m. UTC | #13
On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 20-01-20 14:48:05, Cong Wang wrote:
> > > It got stuck somewhere along the call path of mem_cgroup_try_charge(),
> > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this
> > > too:
> >
> > So it seems that you are condending on the page lock. It is really
> > unexpected that the reclaim would take that long though. Please try to
> > enable more vmscan tracepoints to see where the time is spent.
> 
> I suspect the process gets stuck in the retry loop in try_charge(), as
> the _shortest_ stacktrace of the perf samples indicated:
> 
> cycles:ppp:
>         ffffffffa72963db mem_cgroup_iter
>         ffffffffa72980ca mem_cgroup_oom_unlock
>         ffffffffa7298c15 try_charge
>         ffffffffa729a886 mem_cgroup_try_charge
>         ffffffffa720ec03 __add_to_page_cache_locked
>         ffffffffa720ee3a add_to_page_cache_lru
>         ffffffffa7312ddb iomap_readpages_actor
>         ffffffffa73133f7 iomap_apply
>         ffffffffa73135da iomap_readpages
>         ffffffffa722062e read_pages
>         ffffffffa7220b3f __do_page_cache_readahead
>         ffffffffa7210554 filemap_fault
>         ffffffffc039e41f __xfs_filemap_fault
>         ffffffffa724f5e7 __do_fault
>         ffffffffa724c5f2 __handle_mm_fault
>         ffffffffa724cbc6 handle_mm_fault
>         ffffffffa70a313e __do_page_fault
>         ffffffffa7a00dfe page_fault
> 
> But I don't see how it could be, the only possible case is when
> mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> find any clue in dmesg pointing to OOM. These processes in the
> same memcg are either running or sleeping (that is not exiting or
> coredump'ing), I don't see how and why they could be selected as
> a victim of OOM killer. I don't see any signal pending either from
> their /proc/X/status.

I think this is a situation where we might end up with a genuine deadlock
if we're not trylocking the pages.  readahead allocates a batch of
locked pages and adds them to the pagecache.  If it has allocated,
say, 5 pages, successfully inserted the first three into i_pages, then
needs to allocate memory to insert the fourth one into i_pages, and
the process then attempts to migrate the pages which are still locked,
they will never come unlocked because they haven't yet been submitted
to the filesystem for reading.

Or is this enough?

static inline gfp_t readahead_gfp_mask(struct address_space *x)
{
        return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
}
Michal Hocko Jan. 27, 2020, 2:49 p.m. UTC | #14
On Sun 26-01-20 11:53:55, Cong Wang wrote:
> On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 20-01-20 14:48:05, Cong Wang wrote:
> > > It got stuck somewhere along the call path of mem_cgroup_try_charge(),
> > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this
> > > too:
> >
> > So it seems that you are condending on the page lock. It is really
> > unexpected that the reclaim would take that long though. Please try to
> > enable more vmscan tracepoints to see where the time is spent.
> 
> Sorry for the delay. I have been trying to collect more data in one shot.
> 
> This is a a typical round of the loop after enabling all vmscan tracepoints:
> 
>            <...>-455450 [007] .... 4048595.842992:
> mm_vmscan_memcg_reclaim_begin: order=0 may_writepage=1
> gfp_flags=GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE
> classzone_idx=4
>            <...>-455450 [007] .... 4048595.843012:
> mm_vmscan_memcg_reclaim_end: nr_reclaimed=0

This doesn't tell us much though. This reclaim round has taken close to
no time. See timestamps.

> The whole trace output is huge (33M), I can provide it on request.

Focus on reclaim rounds that take a long time and see where it gets you.
Michal Hocko Jan. 27, 2020, 3 p.m. UTC | #15
On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 20-01-20 14:48:05, Cong Wang wrote:
> > > > It got stuck somewhere along the call path of mem_cgroup_try_charge(),
> > > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this
> > > > too:
> > >
> > > So it seems that you are condending on the page lock. It is really
> > > unexpected that the reclaim would take that long though. Please try to
> > > enable more vmscan tracepoints to see where the time is spent.
> > 
> > I suspect the process gets stuck in the retry loop in try_charge(), as
> > the _shortest_ stacktrace of the perf samples indicated:
> > 
> > cycles:ppp:
> >         ffffffffa72963db mem_cgroup_iter
> >         ffffffffa72980ca mem_cgroup_oom_unlock
> >         ffffffffa7298c15 try_charge
> >         ffffffffa729a886 mem_cgroup_try_charge
> >         ffffffffa720ec03 __add_to_page_cache_locked
> >         ffffffffa720ee3a add_to_page_cache_lru
> >         ffffffffa7312ddb iomap_readpages_actor
> >         ffffffffa73133f7 iomap_apply
> >         ffffffffa73135da iomap_readpages
> >         ffffffffa722062e read_pages
> >         ffffffffa7220b3f __do_page_cache_readahead
> >         ffffffffa7210554 filemap_fault
> >         ffffffffc039e41f __xfs_filemap_fault
> >         ffffffffa724f5e7 __do_fault
> >         ffffffffa724c5f2 __handle_mm_fault
> >         ffffffffa724cbc6 handle_mm_fault
> >         ffffffffa70a313e __do_page_fault
> >         ffffffffa7a00dfe page_fault
> > 
> > But I don't see how it could be, the only possible case is when
> > mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> > find any clue in dmesg pointing to OOM. These processes in the
> > same memcg are either running or sleeping (that is not exiting or
> > coredump'ing), I don't see how and why they could be selected as
> > a victim of OOM killer. I don't see any signal pending either from
> > their /proc/X/status.
> 
> I think this is a situation where we might end up with a genuine deadlock
> if we're not trylocking the pages.  readahead allocates a batch of
> locked pages and adds them to the pagecache.  If it has allocated,
> say, 5 pages, successfully inserted the first three into i_pages, then
> needs to allocate memory to insert the fourth one into i_pages, and
> the process then attempts to migrate the pages which are still locked,
> they will never come unlocked because they haven't yet been submitted
> to the filesystem for reading.

Just to make sure I understand. Do you mean this?
lock_page(A)
alloc_pages
  try_to_compact_pages
    compact_zone_order
      compact_zone(MIGRATE_SYNC_LIGHT)
        migrate_pages
	  unmap_and_move
	    __unmap_and_move
	      lock_page(A)
Matthew Wilcox Jan. 27, 2020, 7:06 p.m. UTC | #16
On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > the _shortest_ stacktrace of the perf samples indicated:
> > > 
> > > cycles:ppp:
> > >         ffffffffa72963db mem_cgroup_iter
> > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > >         ffffffffa7298c15 try_charge
> > >         ffffffffa729a886 mem_cgroup_try_charge
> > >         ffffffffa720ec03 __add_to_page_cache_locked
> > >         ffffffffa720ee3a add_to_page_cache_lru
> > >         ffffffffa7312ddb iomap_readpages_actor
> > >         ffffffffa73133f7 iomap_apply
> > >         ffffffffa73135da iomap_readpages
> > >         ffffffffa722062e read_pages
> > >         ffffffffa7220b3f __do_page_cache_readahead
> > >         ffffffffa7210554 filemap_fault
> > >         ffffffffc039e41f __xfs_filemap_fault
> > >         ffffffffa724f5e7 __do_fault
> > >         ffffffffa724c5f2 __handle_mm_fault
> > >         ffffffffa724cbc6 handle_mm_fault
> > >         ffffffffa70a313e __do_page_fault
> > >         ffffffffa7a00dfe page_fault
> > > 
> > > But I don't see how it could be, the only possible case is when
> > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> > > find any clue in dmesg pointing to OOM. These processes in the
> > > same memcg are either running or sleeping (that is not exiting or
> > > coredump'ing), I don't see how and why they could be selected as
> > > a victim of OOM killer. I don't see any signal pending either from
> > > their /proc/X/status.
> > 
> > I think this is a situation where we might end up with a genuine deadlock
> > if we're not trylocking the pages.  readahead allocates a batch of
> > locked pages and adds them to the pagecache.  If it has allocated,
> > say, 5 pages, successfully inserted the first three into i_pages, then
> > needs to allocate memory to insert the fourth one into i_pages, and
> > the process then attempts to migrate the pages which are still locked,
> > they will never come unlocked because they haven't yet been submitted
> > to the filesystem for reading.
> 
> Just to make sure I understand. Do you mean this?
> lock_page(A)
> alloc_pages
>   try_to_compact_pages
>     compact_zone_order
>       compact_zone(MIGRATE_SYNC_LIGHT)
>         migrate_pages
> 	  unmap_and_move
> 	    __unmap_and_move
> 	      lock_page(A)

Yes.  There's a little more to it than that, eg slab is involved, but
you have it in a nutshell.
Cong Wang Jan. 28, 2020, 12:46 a.m. UTC | #17
On Mon, Jan 27, 2020 at 6:49 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sun 26-01-20 11:53:55, Cong Wang wrote:
> > On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 20-01-20 14:48:05, Cong Wang wrote:
> > > > It got stuck somewhere along the call path of mem_cgroup_try_charge(),
> > > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this
> > > > too:
> > >
> > > So it seems that you are condending on the page lock. It is really
> > > unexpected that the reclaim would take that long though. Please try to
> > > enable more vmscan tracepoints to see where the time is spent.
> >
> > Sorry for the delay. I have been trying to collect more data in one shot.
> >
> > This is a a typical round of the loop after enabling all vmscan tracepoints:
> >
> >            <...>-455450 [007] .... 4048595.842992:
> > mm_vmscan_memcg_reclaim_begin: order=0 may_writepage=1
> > gfp_flags=GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE
> > classzone_idx=4
> >            <...>-455450 [007] .... 4048595.843012:
> > mm_vmscan_memcg_reclaim_end: nr_reclaimed=0
>
> This doesn't tell us much though. This reclaim round has taken close to
> no time. See timestamps.
>
> > The whole trace output is huge (33M), I can provide it on request.
>
> Focus on reclaim rounds that take a long time and see where it gets you.

I reviewed the tracing output by eye; the rounds I looked at all took
little time. But of course I can't review all of them, given the huge size.

For me, it seems that the loop happens in its caller, something
like:

retry:
  mm_vmscan_memcg_reclaim_begin();
  ...
  mm_vmscan_memcg_reclaim_end();
  goto retry;

So, I think we should focus on try_charge()?
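
For reference, this is roughly the loop I mean (a simplified sketch of the
4.19 try_charge() structure, not the exact code; variable setup and most
of the exit conditions are omitted):

retry:
	if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
		return 0;		/* charge fits under the limit */

	/* limit hit: this is what emits mm_vmscan_memcg_reclaim_begin/end */
	nr_reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages,
						    gfp_mask, may_swap);
	if (nr_reclaimed)
		goto retry;

	/* ... several other "made progress" checks also goto retry ... */

	if (mem_cgroup_oom(memcg, gfp_mask, order) == OOM_SUCCESS)
		goto retry;		/* OOM killer made room, try again */

	return -ENOMEM;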

More interestingly, the margin of that memcg is 0:

$ sudo cat /sys/fs/cgroup/memory/system.slice/osqueryd.service/memory.usage_in_bytes
262144000
$ sudo cat /sys/fs/cgroup/memory/system.slice/osqueryd.service/memory.limit_in_bytes
262144000

Thanks!
Yang Shi Jan. 28, 2020, 1:25 a.m. UTC | #18
On Mon, Jan 27, 2020 at 11:06 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > the _shortest_ stacktrace of the perf samples indicated:
> > > >
> > > > cycles:ppp:
> > > >         ffffffffa72963db mem_cgroup_iter
> > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > >         ffffffffa7298c15 try_charge
> > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > >         ffffffffa7312ddb iomap_readpages_actor
> > > >         ffffffffa73133f7 iomap_apply
> > > >         ffffffffa73135da iomap_readpages
> > > >         ffffffffa722062e read_pages
> > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > >         ffffffffa7210554 filemap_fault
> > > >         ffffffffc039e41f __xfs_filemap_fault
> > > >         ffffffffa724f5e7 __do_fault
> > > >         ffffffffa724c5f2 __handle_mm_fault
> > > >         ffffffffa724cbc6 handle_mm_fault
> > > >         ffffffffa70a313e __do_page_fault
> > > >         ffffffffa7a00dfe page_fault
> > > >
> > > > But I don't see how it could be, the only possible case is when
> > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> > > > find any clue in dmesg pointing to OOM. These processes in the
> > > > same memcg are either running or sleeping (that is not exiting or
> > > > coredump'ing), I don't see how and why they could be selected as
> > > > a victim of OOM killer. I don't see any signal pending either from
> > > > their /proc/X/status.
> > >
> > > I think this is a situation where we might end up with a genuine deadlock
> > > if we're not trylocking the pages.  readahead allocates a batch of
> > > locked pages and adds them to the pagecache.  If it has allocated,
> > > say, 5 pages, successfully inserted the first three into i_pages, then
> > > needs to allocate memory to insert the fourth one into i_pages, and
> > > the process then attempts to migrate the pages which are still locked,
> > > they will never come unlocked because they haven't yet been submitted
> > > to the filesystem for reading.
> >
> > Just to make sure I understand. Do you mean this?
> > lock_page(A)
> > alloc_pages
> >   try_to_compact_pages
> >     compact_zone_order
> >       compact_zone(MIGRATE_SYNC_LIGHT)
> >         migrate_pages
> >         unmap_and_move
> >           __unmap_and_move
> >             lock_page(A)
>
> Yes.  There's a little more to it than that, eg slab is involved, but
> you have it in a nutshell.

But how could compaction get blocked on a readahead page if it is not
on the LRU?

The page is charged before being added to the LRU, so if the kernel just
retries the charge or reclaims forever, the page should not be on the LRU
yet, and therefore should not block compaction.

>
Matthew Wilcox Jan. 28, 2020, 6:03 a.m. UTC | #19
On Mon, Jan 27, 2020 at 05:25:13PM -0800, Yang Shi wrote:
> On Mon, Jan 27, 2020 at 11:06 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > > the _shortest_ stacktrace of the perf samples indicated:
> > > > >
> > > > > cycles:ppp:
> > > > >         ffffffffa72963db mem_cgroup_iter
> > > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > > >         ffffffffa7298c15 try_charge
> > > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > > >         ffffffffa7312ddb iomap_readpages_actor
> > > > >         ffffffffa73133f7 iomap_apply
> > > > >         ffffffffa73135da iomap_readpages
> > > > >         ffffffffa722062e read_pages
> > > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > > >         ffffffffa7210554 filemap_fault
> > > > >         ffffffffc039e41f __xfs_filemap_fault
> > > > >         ffffffffa724f5e7 __do_fault
> > > > >         ffffffffa724c5f2 __handle_mm_fault
> > > > >         ffffffffa724cbc6 handle_mm_fault
> > > > >         ffffffffa70a313e __do_page_fault
> > > > >         ffffffffa7a00dfe page_fault
> > > > >
> > > > > But I don't see how it could be, the only possible case is when
> > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> > > > > find any clue in dmesg pointing to OOM. These processes in the
> > > > > same memcg are either running or sleeping (that is not exiting or
> > > > > coredump'ing), I don't see how and why they could be selected as
> > > > > a victim of OOM killer. I don't see any signal pending either from
> > > > > their /proc/X/status.
> > > >
> > > > I think this is a situation where we might end up with a genuine deadlock
> > > > if we're not trylocking the pages.  readahead allocates a batch of
> > > > locked pages and adds them to the pagecache.  If it has allocated,
> > > > say, 5 pages, successfully inserted the first three into i_pages, then
> > > > needs to allocate memory to insert the fourth one into i_pages, and
> > > > the process then attempts to migrate the pages which are still locked,
> > > > they will never come unlocked because they haven't yet been submitted
> > > > to the filesystem for reading.
> > >
> > > Just to make sure I understand. Do you mean this?
> > > lock_page(A)
> > > alloc_pages
> > >   try_to_compact_pages
> > >     compact_zone_order
> > >       compact_zone(MIGRATE_SYNC_LIGHT)
> > >         migrate_pages
> > >         unmap_and_move
> > >           __unmap_and_move
> > >             lock_page(A)
> >
> > Yes.  There's a little more to it than that, eg slab is involved, but
> > you have it in a nutshell.
> 
> But, how compact could get blocked for readahead page if it is not on LRU?
> 
> The page is charged before adding to LRU, so if kernel just retry
> charge or reclaim forever, the page should be not on LRU, so it should
> not block compaction.

The five pages are allocated ABCDE, then they are added one at a time to
both the LRU list and i_pages.  Once ABCDE have been added to the page
cache, they are submitted to the filesystem in a batch.  The scenario
here is that once ABC have been added, we need to allocate memory to
add D.  The page migration code then attempts to migrate A.  Deadlock;
it will never come unlocked.

This kind of problem can occur in any filesystem in either readpages or
readpage.  Once the page is locked and on the LRU list, if the filesystem
attempts to allocate memory (and many do), this kind of deadlock can
occur.  It's avoided if the filesystem uses a mempool to avoid memory
allocation in this path, but they certainly don't all do that.

This specific deadlock can be avoided if we skip !PageUptodate pages.
But I don't know what other situations there are where we allocate memory
while holding a page locked that is on the LRU list.
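
Concretely, the pattern I have in mind looks roughly like this (a loose
sketch of a ->readpages() style loop along the lines of mpage_readpages()
or iomap_readpages(), not the exact code; here 'pages' is the readahead
list and 'mapping' the file's address_space):

	struct page *page, *next;

	list_for_each_entry_safe(page, next, pages, lru) {
		list_del(&page->lru);
		/*
		 * add_to_page_cache_lru() locks the page and inserts it
		 * into i_pages.  The insertion may need to allocate a
		 * radix tree node, and that allocation can enter direct
		 * reclaim/compaction while the previously inserted pages
		 * (A, B, C above) are already locked and on the LRU, but
		 * not yet submitted for I/O.
		 */
		if (!add_to_page_cache_lru(page, mapping, page->index,
					   readahead_gfp_mask(mapping))) {
			/* add the page to the bio being built */
		}
		put_page(page);
	}
	/* only here is the bio submitted; the pages stay locked until
	   the read completes and unlocks them */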
Michal Hocko Jan. 28, 2020, 8:17 a.m. UTC | #20
On Mon 27-01-20 11:06:53, Matthew Wilcox wrote:
> On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > the _shortest_ stacktrace of the perf samples indicated:
> > > > 
> > > > cycles:ppp:
> > > >         ffffffffa72963db mem_cgroup_iter
> > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > >         ffffffffa7298c15 try_charge
> > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > >         ffffffffa7312ddb iomap_readpages_actor
> > > >         ffffffffa73133f7 iomap_apply
> > > >         ffffffffa73135da iomap_readpages
> > > >         ffffffffa722062e read_pages
> > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > >         ffffffffa7210554 filemap_fault
> > > >         ffffffffc039e41f __xfs_filemap_fault
> > > >         ffffffffa724f5e7 __do_fault
> > > >         ffffffffa724c5f2 __handle_mm_fault
> > > >         ffffffffa724cbc6 handle_mm_fault
> > > >         ffffffffa70a313e __do_page_fault
> > > >         ffffffffa7a00dfe page_fault
> > > > 
> > > > But I don't see how it could be, the only possible case is when
> > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> > > > find any clue in dmesg pointing to OOM. These processes in the
> > > > same memcg are either running or sleeping (that is not exiting or
> > > > coredump'ing), I don't see how and why they could be selected as
> > > > a victim of OOM killer. I don't see any signal pending either from
> > > > their /proc/X/status.
> > > 
> > > I think this is a situation where we might end up with a genuine deadlock
> > > if we're not trylocking the pages.  readahead allocates a batch of
> > > locked pages and adds them to the pagecache.  If it has allocated,
> > > say, 5 pages, successfully inserted the first three into i_pages, then
> > > needs to allocate memory to insert the fourth one into i_pages, and
> > > the process then attempts to migrate the pages which are still locked,
> > > they will never come unlocked because they haven't yet been submitted
> > > to the filesystem for reading.
> > 
> > Just to make sure I understand. Do you mean this?
> > lock_page(A)
> > alloc_pages
> >   try_to_compact_pages
> >     compact_zone_order
> >       compact_zone(MIGRATE_SYNC_LIGHT)
> >         migrate_pages
> > 	  unmap_and_move
> > 	    __unmap_and_move
> > 	      lock_page(A)
> 
> Yes.  There's a little more to it than that, eg slab is involved, but
> you have it in a nutshell.

I am not deeply familiar with the readahead code. But is there really a
high order allocation (order > 1) that would trigger compaction in the
phase when pages are locked?

Btw. compaction refuses to consider file-backed pages when __GFP_FS
is not present, AFAIR.
Michal Hocko Jan. 28, 2020, 8:22 a.m. UTC | #21
On Mon 27-01-20 16:46:16, Cong Wang wrote:
> On Mon, Jan 27, 2020 at 6:49 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Sun 26-01-20 11:53:55, Cong Wang wrote:
> > > On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Mon 20-01-20 14:48:05, Cong Wang wrote:
> > > > > It got stuck somewhere along the call path of mem_cgroup_try_charge(),
> > > > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this
> > > > > too:
> > > >
> > > > So it seems that you are condending on the page lock. It is really
> > > > unexpected that the reclaim would take that long though. Please try to
> > > > enable more vmscan tracepoints to see where the time is spent.
> > >
> > > Sorry for the delay. I have been trying to collect more data in one shot.
> > >
> > > This is a a typical round of the loop after enabling all vmscan tracepoints:
> > >
> > >            <...>-455450 [007] .... 4048595.842992:
> > > mm_vmscan_memcg_reclaim_begin: order=0 may_writepage=1
> > > gfp_flags=GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE
> > > classzone_idx=4
> > >            <...>-455450 [007] .... 4048595.843012:
> > > mm_vmscan_memcg_reclaim_end: nr_reclaimed=0
> >
> > This doesn't tell us much though. This reclaim round has taken close to
> > no time. See timestamps.
> >
> > > The whole trace output is huge (33M), I can provide it on request.
> >
> > Focus on reclaim rounds that take a long time and see where it gets you.
> 
> I reviewed the tracing output with my eyes, they all took little time.
> But of course I can't review all of them given the size is huge.

Sure, I would simply use a script to check for excessive reclaim rounds.
Also, it is important to find out whether the page of interest is locked
outside of the reclaim or inside it.
Matthew Wilcox Jan. 28, 2020, 8:30 a.m. UTC | #22
On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote:
> On Mon 27-01-20 11:06:53, Matthew Wilcox wrote:
> > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > > the _shortest_ stacktrace of the perf samples indicated:
> > > > > 
> > > > > cycles:ppp:
> > > > >         ffffffffa72963db mem_cgroup_iter
> > > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > > >         ffffffffa7298c15 try_charge
> > > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > > >         ffffffffa7312ddb iomap_readpages_actor
> > > > >         ffffffffa73133f7 iomap_apply
> > > > >         ffffffffa73135da iomap_readpages
> > > > >         ffffffffa722062e read_pages
> > > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > > >         ffffffffa7210554 filemap_fault
> > > > >         ffffffffc039e41f __xfs_filemap_fault
> > > > >         ffffffffa724f5e7 __do_fault
> > > > >         ffffffffa724c5f2 __handle_mm_fault
> > > > >         ffffffffa724cbc6 handle_mm_fault
> > > > >         ffffffffa70a313e __do_page_fault
> > > > >         ffffffffa7a00dfe page_fault
> > > > > 
> > > > > But I don't see how it could be, the only possible case is when
> > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> > > > > find any clue in dmesg pointing to OOM. These processes in the
> > > > > same memcg are either running or sleeping (that is not exiting or
> > > > > coredump'ing), I don't see how and why they could be selected as
> > > > > a victim of OOM killer. I don't see any signal pending either from
> > > > > their /proc/X/status.
> > > > 
> > > > I think this is a situation where we might end up with a genuine deadlock
> > > > if we're not trylocking the pages.  readahead allocates a batch of
> > > > locked pages and adds them to the pagecache.  If it has allocated,
> > > > say, 5 pages, successfully inserted the first three into i_pages, then
> > > > needs to allocate memory to insert the fourth one into i_pages, and
> > > > the process then attempts to migrate the pages which are still locked,
> > > > they will never come unlocked because they haven't yet been submitted
> > > > to the filesystem for reading.
> > > 
> > > Just to make sure I understand. Do you mean this?
> > > lock_page(A)
> > > alloc_pages
> > >   try_to_compact_pages
> > >     compact_zone_order
> > >       compact_zone(MIGRATE_SYNC_LIGHT)
> > >         migrate_pages
> > > 	  unmap_and_move
> > > 	    __unmap_and_move
> > > 	      lock_page(A)
> > 
> > Yes.  There's a little more to it than that, eg slab is involved, but
> > you have it in a nutshell.
> 
> I am not deeply familiar with the readahead code. But is there really a
> high oerder allocation (order > 1) that would trigger compaction in the
> phase when pages are locked?

Thanks to sl*b, yes:

radix_tree_node    80890 102536    584   28    4 : tunables    0    0    0 : slabdata   3662   3662      0

so it's allocating 4 pages for an allocation of a 576 byte node.
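
(From the slabinfo columns above -- objsize 584, objperslab 28,
pagesperslab 4 -- each 576-byte node lives in a 584-byte slab object, and
584 * 28 = 16,352 bytes per slab, i.e. an order-2, 4-page allocation,
assuming 4 KiB pages.)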

> Btw. the compaction rejects to consider file backed pages when __GFP_FS
> is not present AFAIR.

Ah, that would save us.
Michal Hocko Jan. 28, 2020, 9:13 a.m. UTC | #23
On Tue 28-01-20 00:30:44, Matthew Wilcox wrote:
> On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote:
> > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote:
> > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > > > the _shortest_ stacktrace of the perf samples indicated:
> > > > > > 
> > > > > > cycles:ppp:
> > > > > >         ffffffffa72963db mem_cgroup_iter
> > > > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > > > >         ffffffffa7298c15 try_charge
> > > > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > > > >         ffffffffa7312ddb iomap_readpages_actor
> > > > > >         ffffffffa73133f7 iomap_apply
> > > > > >         ffffffffa73135da iomap_readpages
> > > > > >         ffffffffa722062e read_pages
> > > > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > > > >         ffffffffa7210554 filemap_fault
> > > > > >         ffffffffc039e41f __xfs_filemap_fault
> > > > > >         ffffffffa724f5e7 __do_fault
> > > > > >         ffffffffa724c5f2 __handle_mm_fault
> > > > > >         ffffffffa724cbc6 handle_mm_fault
> > > > > >         ffffffffa70a313e __do_page_fault
> > > > > >         ffffffffa7a00dfe page_fault
> > > > > > 
> > > > > > But I don't see how it could be, the only possible case is when
> > > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> > > > > > find any clue in dmesg pointing to OOM. These processes in the
> > > > > > same memcg are either running or sleeping (that is not exiting or
> > > > > > coredump'ing), I don't see how and why they could be selected as
> > > > > > a victim of OOM killer. I don't see any signal pending either from
> > > > > > their /proc/X/status.
> > > > > 
> > > > > I think this is a situation where we might end up with a genuine deadlock
> > > > > if we're not trylocking the pages.  readahead allocates a batch of
> > > > > locked pages and adds them to the pagecache.  If it has allocated,
> > > > > say, 5 pages, successfully inserted the first three into i_pages, then
> > > > > needs to allocate memory to insert the fourth one into i_pages, and
> > > > > the process then attempts to migrate the pages which are still locked,
> > > > > they will never come unlocked because they haven't yet been submitted
> > > > > to the filesystem for reading.
> > > > 
> > > > Just to make sure I understand. Do you mean this?
> > > > lock_page(A)
> > > > alloc_pages
> > > >   try_to_compact_pages
> > > >     compact_zone_order
> > > >       compact_zone(MIGRATE_SYNC_LIGHT)
> > > >         migrate_pages
> > > > 	  unmap_and_move
> > > > 	    __unmap_and_move
> > > > 	      lock_page(A)
> > > 
> > > Yes.  There's a little more to it than that, eg slab is involved, but
> > > you have it in a nutshell.
> > 
> > I am not deeply familiar with the readahead code. But is there really a
> > high oerder allocation (order > 1) that would trigger compaction in the
> > phase when pages are locked?
> 
> Thanks to sl*b, yes:
> 
> radix_tree_node    80890 102536    584   28    4 : tunables    0    0    0 : slabdata   3662   3662      0
> 
> so it's allocating 4 pages for an allocation of a 576 byte node.

I am not really sure that we do sync migration for costly orders.

> > Btw. the compaction rejects to consider file backed pages when __GFP_FS
> > is not present AFAIR.
> 
> Ah, that would save us.

So the NOFS comes from the mapping GFP mask, right? That is something I
was hoping to get rid of eventually :/ Anyway it would be better to have
an explicit NOFS with a comment explaining why we need that. If for
nothing else then for documentation.
Matthew Wilcox Jan. 28, 2020, 10:48 a.m. UTC | #24
On Tue, Jan 28, 2020 at 10:13:52AM +0100, Michal Hocko wrote:
> On Tue 28-01-20 00:30:44, Matthew Wilcox wrote:
> > On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote:
> > > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote:
> > > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > > > > the _shortest_ stacktrace of the perf samples indicated:
> > > > > > > 
> > > > > > > cycles:ppp:
> > > > > > >         ffffffffa72963db mem_cgroup_iter
> > > > > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > > > > >         ffffffffa7298c15 try_charge
> > > > > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > > > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > > > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > > > > >         ffffffffa7312ddb iomap_readpages_actor
> > > > > > >         ffffffffa73133f7 iomap_apply
> > > > > > >         ffffffffa73135da iomap_readpages
> > > > > > >         ffffffffa722062e read_pages
> > > > > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > > > > >         ffffffffa7210554 filemap_fault
> > > > > > >         ffffffffc039e41f __xfs_filemap_fault
> > > > > > >         ffffffffa724f5e7 __do_fault
> > > > > > >         ffffffffa724c5f2 __handle_mm_fault
> > > > > > >         ffffffffa724cbc6 handle_mm_fault
> > > > > > >         ffffffffa70a313e __do_page_fault
> > > > > > >         ffffffffa7a00dfe page_fault
> > > 
> > > I am not deeply familiar with the readahead code. But is there really a
> > > high oerder allocation (order > 1) that would trigger compaction in the
> > > phase when pages are locked?
> > 
> > Thanks to sl*b, yes:
> > 
> > radix_tree_node    80890 102536    584   28    4 : tunables    0    0    0 : slabdata   3662   3662      0
> > 
> > so it's allocating 4 pages for an allocation of a 576 byte node.
> 
> I am not really sure that we do sync migration for costly orders.

Doesn't the stack trace above indicate that we're doing migration as
the result of an allocation in add_to_page_cache_lru()?

> > > Btw. the compaction rejects to consider file backed pages when __GFP_FS
> > > is not present AFAIR.
> > 
> > Ah, that would save us.
> 
> So the NOFS comes from the mapping GFP mask, right? That is something I
> was hoping to get rid of eventually :/ Anyway it would be better to have
> an explicit NOFS with a comment explaining why we need that. If for
> nothing else then for documentation.

I'd also like to see the mapping GFP mask go away, but rather than seeing
an explicit GFP_NOFS here, I'd rather see the memalloc_nofs API used.
I just don't understand the whole problem space well enough to know
where to put the call for best effect.
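
For reference, the scope API usage I mean is just this pair (a minimal
sketch; the right place to put it is exactly the open question):

	unsigned int flags = memalloc_nofs_save();

	/*
	 * Any allocation made in this scope behaves as if __GFP_FS were
	 * cleared from its gfp_mask, so -- per the point above about
	 * compaction and __GFP_FS -- it should not touch file-backed
	 * pages here.
	 */
	...

	memalloc_nofs_restore(flags);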
Michal Hocko Jan. 28, 2020, 11:39 a.m. UTC | #25
On Tue 28-01-20 02:48:57, Matthew Wilcox wrote:
> On Tue, Jan 28, 2020 at 10:13:52AM +0100, Michal Hocko wrote:
> > On Tue 28-01-20 00:30:44, Matthew Wilcox wrote:
> > > On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote:
> > > > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote:
> > > > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > > > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > > > > > the _shortest_ stacktrace of the perf samples indicated:
> > > > > > > > 
> > > > > > > > cycles:ppp:
> > > > > > > >         ffffffffa72963db mem_cgroup_iter
> > > > > > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > > > > > >         ffffffffa7298c15 try_charge
> > > > > > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > > > > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > > > > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > > > > > >         ffffffffa7312ddb iomap_readpages_actor
> > > > > > > >         ffffffffa73133f7 iomap_apply
> > > > > > > >         ffffffffa73135da iomap_readpages
> > > > > > > >         ffffffffa722062e read_pages
> > > > > > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > > > > > >         ffffffffa7210554 filemap_fault
> > > > > > > >         ffffffffc039e41f __xfs_filemap_fault
> > > > > > > >         ffffffffa724f5e7 __do_fault
> > > > > > > >         ffffffffa724c5f2 __handle_mm_fault
> > > > > > > >         ffffffffa724cbc6 handle_mm_fault
> > > > > > > >         ffffffffa70a313e __do_page_fault
> > > > > > > >         ffffffffa7a00dfe page_fault
> > > > 
> > > > I am not deeply familiar with the readahead code. But is there really a
> > > > high oerder allocation (order > 1) that would trigger compaction in the
> > > > phase when pages are locked?
> > > 
> > > Thanks to sl*b, yes:
> > > 
> > > radix_tree_node    80890 102536    584   28    4 : tunables    0    0    0 : slabdata   3662   3662      0
> > > 
> > > so it's allocating 4 pages for an allocation of a 576 byte node.
> > 
> > I am not really sure that we do sync migration for costly orders.
> 
> Doesn't the stack trace above indicate that we're doing migration as
> the result of an allocation in add_to_page_cache_lru()?

Which stack trace do you refer to? Because the one above doesn't show
much more beyond mem_cgroup_iter and likewise others in this email
thread. I do not really remember any stack with lock_page on the trace.
> 
> > > > Btw. the compaction rejects to consider file backed pages when __GFP_FS
> > > > is not present AFAIR.
> > > 
> > > Ah, that would save us.
> > 
> > So the NOFS comes from the mapping GFP mask, right? That is something I
> > was hoping to get rid of eventually :/ Anyway it would be better to have
> > an explicit NOFS with a comment explaining why we need that. If for
> > nothing else then for documentation.
> 
> I'd also like to see the mapping GFP mask go away, but rather than seeing
> an explicit GFP_NOFS here, I'd rather see the memalloc_nofs API used.

Completely agree here. The proper place for the scope would be
the place where pages are locked with an explanation that there are
other allocations down the line which might invoke sync migration and
that would be dangerous. Having that explicitly documented is clearly an
improvement.
Cong Wang Jan. 28, 2020, 7:44 p.m. UTC | #26
On Tue, Jan 28, 2020 at 3:39 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-01-20 02:48:57, Matthew Wilcox wrote:
> > Doesn't the stack trace above indicate that we're doing migration as
> > the result of an allocation in add_to_page_cache_lru()?
>
> Which stack trace do you refer to? Because the one above doesn't show
> much more beyond mem_cgroup_iter and likewise others in this email
> thread. I do not really remember any stack with lock_page on the trace.

I think the page is locked in add_to_page_cache_lru() by
__SetPageLocked(), as the stack trace shows __add_to_page_cache_locked().
It is not yet unlocked, as it is still looping inside try_charge().
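
To illustrate, the relevant part looks roughly like this (an abridged
sketch of the 4.19 add_to_page_cache_lru(), not the exact code):

int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
			  pgoff_t offset, gfp_t gfp_mask)
{
	void *shadow = NULL;
	int ret;

	__SetPageLocked(page);		/* the page is locked from here on */
	/*
	 * __add_to_page_cache_locked() is where mem_cgroup_try_charge()
	 * is called; if the charge keeps retrying, the page stays locked
	 * the whole time.
	 */
	ret = __add_to_page_cache_locked(page, mapping, offset,
					 gfp_mask, &shadow);
	if (unlikely(ret))
		__ClearPageLocked(page);
	else {
		/* workingset refault handling elided */
		lru_cache_add(page);
	}
	return ret;
}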

I will write a script to see if I can find the longest time spent in reclaim
as you suggested.

Thanks.
Cong Wang Jan. 30, 2020, 10:52 p.m. UTC | #27
On Tue, Jan 28, 2020 at 11:44 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 3:39 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 28-01-20 02:48:57, Matthew Wilcox wrote:
> > > Doesn't the stack trace above indicate that we're doing migration as
> > > the result of an allocation in add_to_page_cache_lru()?
> >
> > Which stack trace do you refer to? Because the one above doesn't show
> > much more beyond mem_cgroup_iter and likewise others in this email
> > thread. I do not really remember any stack with lock_page on the trace.
>
> I think the page is locked in add_to_page_cache_lru() by
> __SetPageLocked(), as the stack trace shows __add_to_page_cache_locked().
> It is not yet unlocked, as it is still looping inside try_charge().
>
> I will write a script to see if I can find the longest time spent in reclaim
> as you suggested.

After digging the changelog, I believe the following commit could fix
the problem:

commit f9c645621a28e37813a1de96d9cbd89cde94a1e4
Author: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Date:   Mon Sep 23 15:37:08 2019 -0700

    memcg, oom: don't require __GFP_FS when invoking memcg OOM killer

which is not in our 4.19 branch yet. We will sync with 4.19 stable soon.

Thanks.
Michal Hocko Feb. 13, 2020, 7:48 a.m. UTC | #28
On Tue 28-01-20 12:39:55, Michal Hocko wrote:
> On Tue 28-01-20 02:48:57, Matthew Wilcox wrote:
> > On Tue, Jan 28, 2020 at 10:13:52AM +0100, Michal Hocko wrote:
> > > On Tue 28-01-20 00:30:44, Matthew Wilcox wrote:
> > > > On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote:
> > > > > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote:
> > > > > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > > > > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > > > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > > > > > > the _shortest_ stacktrace of the perf samples indicated:
> > > > > > > > > 
> > > > > > > > > cycles:ppp:
> > > > > > > > >         ffffffffa72963db mem_cgroup_iter
> > > > > > > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > > > > > > >         ffffffffa7298c15 try_charge
> > > > > > > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > > > > > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > > > > > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > > > > > > >         ffffffffa7312ddb iomap_readpages_actor
> > > > > > > > >         ffffffffa73133f7 iomap_apply
> > > > > > > > >         ffffffffa73135da iomap_readpages
> > > > > > > > >         ffffffffa722062e read_pages
> > > > > > > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > > > > > > >         ffffffffa7210554 filemap_fault
> > > > > > > > >         ffffffffc039e41f __xfs_filemap_fault
> > > > > > > > >         ffffffffa724f5e7 __do_fault
> > > > > > > > >         ffffffffa724c5f2 __handle_mm_fault
> > > > > > > > >         ffffffffa724cbc6 handle_mm_fault
> > > > > > > > >         ffffffffa70a313e __do_page_fault
> > > > > > > > >         ffffffffa7a00dfe page_fault
> > > > > 
> > > > > I am not deeply familiar with the readahead code. But is there really a
> > > > > high oerder allocation (order > 1) that would trigger compaction in the
> > > > > phase when pages are locked?
> > > > 
> > > > Thanks to sl*b, yes:
> > > > 
> > > > radix_tree_node    80890 102536    584   28    4 : tunables    0    0    0 : slabdata   3662   3662      0
> > > > 
> > > > so it's allocating 4 pages for an allocation of a 576 byte node.
> > > 
> > > I am not really sure that we do sync migration for costly orders.
> > 
> > Doesn't the stack trace above indicate that we're doing migration as
> > the result of an allocation in add_to_page_cache_lru()?
> 
> Which stack trace do you refer to? Because the one above doesn't show
> much more beyond mem_cgroup_iter and likewise others in this email
> thread. I do not really remember any stack with lock_page on the trace.
> > 
> > > > > Btw. the compaction rejects to consider file backed pages when __GFP_FS
> > > > > is not present AFAIR.
> > > > 
> > > > Ah, that would save us.
> > > 
> > > So the NOFS comes from the mapping GFP mask, right? That is something I
> > > was hoping to get rid of eventually :/ Anyway it would be better to have
> > > an explicit NOFS with a comment explaining why we need that. If for
> > > nothing else then for documentation.
> > 
> > I'd also like to see the mapping GFP mask go away, but rather than seeing
> > an explicit GFP_NOFS here, I'd rather see the memalloc_nofs API used.
> 
> Completely agreed agree here. The proper place for the scope would be
> the place where pages are locked with an explanation that there are
> other allocations down the line which might invoke sync migration and
> that would be dangerous. Having that explicitly documented is clearly an
> improvement.

Can we pursue this, please? An explicit NOFS scope annotation with a
reference to compaction potentially locking up on pages in the readahead
would be a great start.
Matthew Wilcox Feb. 13, 2020, 4:46 p.m. UTC | #29
On Thu, Feb 13, 2020 at 08:48:47AM +0100, Michal Hocko wrote:
> Can we pursue on this please? An explicit NOFS scope annotation with a
> reference to compaction potentially locking up on pages in the readahead
> would be a great start.

How about this (on top of the current readahead series):

diff --git a/mm/readahead.c b/mm/readahead.c
index 29ca25c8f01e..32fd32b913da 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -160,6 +160,16 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping,
 		.nr_pages = 0,
 	};
 
+	/*
+	 * Partway through the readahead operation, we will have added
+	 * locked pages to the page cache, but will not yet have submitted
+	 * them for I/O.  Adding another page may need to allocate
+	 * memory, which can trigger memory migration.	Telling the VM
+	 * we're in the middle of a filesystem operation will cause it
+	 * to not touch file-backed pages, preventing a deadlock.
+	 */
+	unsigned int nofs = memalloc_nofs_save();
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -217,6 +227,7 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping,
 	 */
 	read_pages(&rac, &page_pool);
 	BUG_ON(!list_empty(&page_pool));
+	memalloc_nofs_restore(nofs);
 	return rac.nr_pages;
 }
 EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
Michal Hocko Feb. 13, 2020, 5:08 p.m. UTC | #30
On Thu 13-02-20 08:46:07, Matthew Wilcox wrote:
> On Thu, Feb 13, 2020 at 08:48:47AM +0100, Michal Hocko wrote:
> > Can we pursue on this please? An explicit NOFS scope annotation with a
> > reference to compaction potentially locking up on pages in the readahead
> > would be a great start.
> 
> How about this (on top of the current readahead series):
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 29ca25c8f01e..32fd32b913da 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -160,6 +160,16 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping,
>  		.nr_pages = 0,
>  	};
>  
> +	/*
> +	 * Partway through the readahead operation, we will have added
> +	 * locked pages to the page cache, but will not yet have submitted
> +	 * them for I/O.  Adding another page may need to allocate
> +	 * memory, which can trigger memory migration.	Telling the VM

I would go with s@migration@compaction@ because it would make it more
obvious that this is about high order allocations.

The patch looks OK other than that, but FS people would be better placed
than me to give it a proper review.

It would be great if you could mention that this shouldn't be a real
problem for many (most?) filesystems, because they tend to have a
NOFS-like mask in the mapping, but we would love to remove that in the
future, so this change is really desirable.

Thanks!

> +	 * we're in the middle of a filesystem operation will cause it
> +	 * to not touch file-backed pages, preventing a deadlock.
> +	 */
> +	unsigned int nofs = memalloc_nofs_save();
> +
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */
> @@ -217,6 +227,7 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping,
>  	 */
>  	read_pages(&rac, &page_pool);
>  	BUG_ON(!list_empty(&page_pool));
> +	memalloc_nofs_restore(nofs);
>  	return rac.nr_pages;
>  }
>  EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
Matthew Wilcox Feb. 14, 2020, 4:27 a.m. UTC | #31
On Thu, Feb 13, 2020 at 06:08:24PM +0100, Michal Hocko wrote:
> On Thu 13-02-20 08:46:07, Matthew Wilcox wrote:
> > On Thu, Feb 13, 2020 at 08:48:47AM +0100, Michal Hocko wrote:
> > > Can we pursue on this please? An explicit NOFS scope annotation with a
> > > reference to compaction potentially locking up on pages in the readahead
> > > would be a great start.
> > 
> > How about this (on top of the current readahead series):
> > 
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 29ca25c8f01e..32fd32b913da 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -160,6 +160,16 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping,
> >  		.nr_pages = 0,
> >  	};
> >  
> > +	/*
> > +	 * Partway through the readahead operation, we will have added
> > +	 * locked pages to the page cache, but will not yet have submitted
> > +	 * them for I/O.  Adding another page may need to allocate
> > +	 * memory, which can trigger memory migration.	Telling the VM
> 
> I would go with s@migration@compaction@ because it would make it more
> obvious that this is about high order allocations.

Perhaps even just 'reclaim' -- it's about compaction today, but tomorrow's
VM might try to reclaim these pages too.  They are on the LRU, after all.

So I currently have:

        /*
         * Partway through the readahead operation, we will have added
         * locked pages to the page cache, but will not yet have submitted
         * them for I/O.  Adding another page may need to allocate memory,
         * which can trigger memory reclaim.  Telling the VM we're in
         * the middle of a filesystem operation will cause it to not
         * touch file-backed pages, preventing a deadlock.  Most (all?)
         * filesystems already specify __GFP_NOFS in their mapping's
         * gfp_mask, but let's be explicit here.
         */
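
For completeness, the mechanism the comment relies on is the scoped
task flag: the allocator filters every gfp mask through roughly the
following (a simplified sketch of current_gfp_context(), not the
exact kernel code):

	static inline gfp_t current_gfp_context(gfp_t flags)
	{
		/* NOIO clears both flags and takes precedence. */
		if (current->flags & PF_MEMALLOC_NOIO)
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (current->flags & PF_MEMALLOC_NOFS)
			flags &= ~__GFP_FS;
		return flags;
	}

so with PF_MEMALLOC_NOFS set for the whole readahead window, even a
mapping that passes GFP_KERNEL cannot recurse into the filesystem.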

Thanks!
Michal Hocko Feb. 14, 2020, 6:55 a.m. UTC | #32
On Thu 13-02-20 20:27:24, Matthew Wilcox wrote:
> On Thu, Feb 13, 2020 at 06:08:24PM +0100, Michal Hocko wrote:
> > On Thu 13-02-20 08:46:07, Matthew Wilcox wrote:
> > > On Thu, Feb 13, 2020 at 08:48:47AM +0100, Michal Hocko wrote:
> > > > Can we pursue on this please? An explicit NOFS scope annotation with a
> > > > reference to compaction potentially locking up on pages in the readahead
> > > > would be a great start.
> > > 
> > > How about this (on top of the current readahead series):
> > > 
> > > diff --git a/mm/readahead.c b/mm/readahead.c
> > > index 29ca25c8f01e..32fd32b913da 100644
> > > --- a/mm/readahead.c
> > > +++ b/mm/readahead.c
> > > @@ -160,6 +160,16 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping,
> > >  		.nr_pages = 0,
> > >  	};
> > >  
> > > +	/*
> > > +	 * Partway through the readahead operation, we will have added
> > > +	 * locked pages to the page cache, but will not yet have submitted
> > > +	 * them for I/O.  Adding another page may need to allocate
> > > +	 * memory, which can trigger memory migration.	Telling the VM
> > 
> > I would go with s@migration@compaction@ because it would make it more
> > obvious that this is about high order allocations.
> 
> Perhaps even just 'reclaim' -- it's about compaction today, but tomorrow's
> VM might try to reclaim these pages too.  They are on the LRU, after all.
> 
> So I currently have:
> 
>         /*
>          * Partway through the readahead operation, we will have added
>          * locked pages to the page cache, but will not yet have submitted
>          * them for I/O.  Adding another page may need to allocate memory,
>          * which can trigger memory reclaim.  Telling the VM we're in
>          * the middle of a filesystem operation will cause it to not
>          * touch file-backed pages, preventing a deadlock.  Most (all?)
>          * filesystems already specify __GFP_NOFS in their mapping's
>          * gfp_mask, but let's be explicit here.
>          */

OK, Thanks!
diff mbox series

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index 86873b6f38a7..df60026779d2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1010,7 +1010,8 @@  static int __unmap_and_move(struct page *page, struct page *newpage,
 	bool is_lru = !__PageMovable(page);
 
 	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC)
+		if (!force || mode == MIGRATE_ASYNC
+			   || mode == MIGRATE_SYNC_LIGHT)
 			goto out;
 
 		/*