Message ID | 20200109225646.22983-1-xiyou.wangcong@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm: avoid blocking lock_page() in kcompactd |
On Thu, Jan 9, 2020 at 2:57 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > We observed kcompactd hung at __lock_page(): > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > Not tainted 4.19.56.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > kcompactd0 D 0 57 2 0x80000000 > Call Trace: > ? __schedule+0x236/0x860 > schedule+0x28/0x80 > io_schedule+0x12/0x40 > __lock_page+0xf9/0x120 > ? page_cache_tree_insert+0xb0/0xb0 > ? update_pageblock_skip+0xb0/0xb0 > migrate_pages+0x88c/0xb90 > ? isolate_freepages_block+0x3b0/0x3b0 > compact_zone+0x5f1/0x870 > kcompactd_do_work+0x130/0x2c0 > ? __switch_to_asm+0x35/0x70 > ? __switch_to_asm+0x41/0x70 > ? kcompactd_do_work+0x2c0/0x2c0 > ? kcompactd+0x73/0x180 > kcompactd+0x73/0x180 > ? finish_wait+0x80/0x80 > kthread+0x113/0x130 > ? kthread_create_worker_on_cpu+0x50/0x50 > ret_from_fork+0x35/0x40 > > which faddr2line maps to: > > migrate_pages+0x88c/0xb90: > lock_page at include/linux/pagemap.h:483 > (inlined by) __unmap_and_move at mm/migrate.c:1024 > (inlined by) unmap_and_move at mm/migrate.c:1189 > (inlined by) migrate_pages at mm/migrate.c:1419 > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > I think for memory compaction, it is a best effort to migrate the pages, > so it doesn't have to wait for I/O to complete. It is fine to call > trylock_page() here, which is pretty much similar to > buffer_migrate_lock_buffers(). > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the > check for it. But this changed the semantics of MIGRATE_SYNC_LIGHT which means blocking on most operations but not ->writepage. When MIGRATE_SYNC_LIGHT is used it means compaction priority is increased (the initial priority is ASYNC) due to whatever reason (i.e. not enough clean, non-writeback and non-locked pages to migrate). So, it has to wait for some pages to try to not backoff pre-maturely. If I read the code correctly, buffer_migrate_lock_buffers() also blocks on page lock with non-ASYNC mode. Since v5.1 Mel Gorman improved compaction a lot. So, I'm wondering if this happens on the latest upstream or not. And, did you figure out who is locking the page for such long time? Or there might be too many waiters on the list for this page? > > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: linux-mm@kvack.org > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> > --- > mm/migrate.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 86873b6f38a7..df60026779d2 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1010,7 +1010,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > bool is_lru = !__PageMovable(page); > > if (!trylock_page(page)) { > - if (!force || mode == MIGRATE_ASYNC) > + if (!force || mode == MIGRATE_ASYNC > + || mode == MIGRATE_SYNC_LIGHT) > goto out; > > /* > -- > 2.21.1 > >
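To make the disputed semantics concrete, here is a small stand-alone model of the check the patch touches, built only from the diff quoted above and Yang Shi's description of MIGRATE_SYNC_LIGHT; the enum and helpers are simplified stand-ins, not the kernel's definitions.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel's migrate modes. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/*
 * Models the check at the top of __unmap_and_move() shown in the diff:
 * when trylock_page() fails, does this migration pass sleep in
 * lock_page(), or does it bail out and let the page be retried later?
 */
static bool sleeps_on_contended_lock(enum migrate_mode mode, bool force,
				     bool patched)
{
	if (!force || mode == MIGRATE_ASYNC)
		return false;
	if (patched && mode == MIGRATE_SYNC_LIGHT)
		return false;		/* the proposed relaxation */
	return true;			/* falls through to lock_page() -> io_schedule() */
}

int main(void)
{
	static const char *const names[] = { "ASYNC", "SYNC_LIGHT", "SYNC" };

	/* force=true models a later compaction pass, as in the hung-task case.
	 * SYNC_LIGHT already skips ->writepage for dirty pages; the page lock
	 * is the obstacle this patch stops waiting on. */
	for (int m = MIGRATE_ASYNC; m <= MIGRATE_SYNC; m++)
		printf("%-10s  unpatched: %-5s  patched: %s\n", names[m],
		       sleeps_on_contended_lock(m, true, false) ? "sleep" : "bail",
		       sleeps_on_contended_lock(m, true, true) ? "sleep" : "bail");
	return 0;
}

With the patch, a forced SYNC_LIGHT pass treats a contended page lock the same way ASYNC does, which is exactly the contract change being objected to here: SYNC_LIGHT is the escalated compaction priority, so it is expected to wait rather than back off prematurely.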
On Thu, Jan 9, 2020 at 4:28 PM Yang Shi <shy828301@gmail.com> wrote: > > On Thu, Jan 9, 2020 at 2:57 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > We observed kcompactd hung at __lock_page(): > > > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > > Not tainted 4.19.56.x86_64 #1 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > kcompactd0 D 0 57 2 0x80000000 > > Call Trace: > > ? __schedule+0x236/0x860 > > schedule+0x28/0x80 > > io_schedule+0x12/0x40 > > __lock_page+0xf9/0x120 > > ? page_cache_tree_insert+0xb0/0xb0 > > ? update_pageblock_skip+0xb0/0xb0 > > migrate_pages+0x88c/0xb90 > > ? isolate_freepages_block+0x3b0/0x3b0 > > compact_zone+0x5f1/0x870 > > kcompactd_do_work+0x130/0x2c0 > > ? __switch_to_asm+0x35/0x70 > > ? __switch_to_asm+0x41/0x70 > > ? kcompactd_do_work+0x2c0/0x2c0 > > ? kcompactd+0x73/0x180 > > kcompactd+0x73/0x180 > > ? finish_wait+0x80/0x80 > > kthread+0x113/0x130 > > ? kthread_create_worker_on_cpu+0x50/0x50 > > ret_from_fork+0x35/0x40 > > > > which faddr2line maps to: > > > > migrate_pages+0x88c/0xb90: > > lock_page at include/linux/pagemap.h:483 > > (inlined by) __unmap_and_move at mm/migrate.c:1024 > > (inlined by) unmap_and_move at mm/migrate.c:1189 > > (inlined by) migrate_pages at mm/migrate.c:1419 > > > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > > > I think for memory compaction, it is a best effort to migrate the pages, > > so it doesn't have to wait for I/O to complete. It is fine to call > > trylock_page() here, which is pretty much similar to > > buffer_migrate_lock_buffers(). > > > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the > > check for it. > > But this changed the semantics of MIGRATE_SYNC_LIGHT which means > blocking on most operations but not ->writepage. When > MIGRATE_SYNC_LIGHT is used it means compaction priority is increased > (the initial priority is ASYNC) due to whatever reason (i.e. not > enough clean, non-writeback and non-locked pages to migrate). So, it > has to wait for some pages to try to not backoff pre-maturely. Thanks for explaining MIGRATE_SYNC_LIGHT. I didn't dig the history of MIGRATE_SYNC_LIGHT. > > If I read the code correctly, buffer_migrate_lock_buffers() also > blocks on page lock with non-ASYNC mode. > > Since v5.1 Mel Gorman improved compaction a lot. So, I'm wondering if > this happens on the latest upstream or not. Unfortunately we can't test upstream kernel because: 1) We don't know how to reproduce it. 2) It is not easy to deploy upstream kernel in our environment. > > And, did you figure out who is locking the page for such long time? Or > there might be too many waiters on the list for this page? I tried to dump the process stacks after we saw the hung task with /proc/X/stack (apparently sysrq is not an option for production), but I didn't find anything useful. I didn't see any other process hung in lock_page() either, all mm-related kernel threads were sleeping (non-D) at the time I debugged. So, it is possible there was some place missing a unlock_page() too, which is too late to debug after the hung task was reported. Thanks.
On Thu, Jan 9, 2020 at 5:01 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > I tried to dump the process stacks after we saw the hung task > with /proc/X/stack (apparently sysrq is not an option for production), > but I didn't find anything useful. I didn't see any other process > hung in lock_page() either, all mm-related kernel threads were sleeping > (non-D) at the time I debugged. So, it is possible there was some place > missing a unlock_page() too, which is too late to debug after the hung > task was reported. Never mind. I found the process locking the page; /proc/X/stack apparently fooled me by hiding the page-fault stack. :-/
[CC Mel and Vlastimil] On Thu 09-01-20 14:56:46, Cong Wang wrote: > We observed kcompactd hung at __lock_page(): > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > Not tainted 4.19.56.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > kcompactd0 D 0 57 2 0x80000000 > Call Trace: > ? __schedule+0x236/0x860 > schedule+0x28/0x80 > io_schedule+0x12/0x40 > __lock_page+0xf9/0x120 > ? page_cache_tree_insert+0xb0/0xb0 > ? update_pageblock_skip+0xb0/0xb0 > migrate_pages+0x88c/0xb90 > ? isolate_freepages_block+0x3b0/0x3b0 > compact_zone+0x5f1/0x870 > kcompactd_do_work+0x130/0x2c0 > ? __switch_to_asm+0x35/0x70 > ? __switch_to_asm+0x41/0x70 > ? kcompactd_do_work+0x2c0/0x2c0 > ? kcompactd+0x73/0x180 > kcompactd+0x73/0x180 > ? finish_wait+0x80/0x80 > kthread+0x113/0x130 > ? kthread_create_worker_on_cpu+0x50/0x50 > ret_from_fork+0x35/0x40 > > which faddr2line maps to: > > migrate_pages+0x88c/0xb90: > lock_page at include/linux/pagemap.h:483 > (inlined by) __unmap_and_move at mm/migrate.c:1024 > (inlined by) unmap_and_move at mm/migrate.c:1189 > (inlined by) migrate_pages at mm/migrate.c:1419 > > Sometimes kcompactd eventually got out of this situation, sometimes not. What does this mean exactly? Who is holding the page lock? > I think for memory compaction, it is a best effort to migrate the pages, > so it doesn't have to wait for I/O to complete. It is fine to call > trylock_page() here, which is pretty much similar to > buffer_migrate_lock_buffers(). > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the > check for it. The exact definition of MIGRATE_SYNC_LIGHT is a bit fuzzy but bailing out on the page lock sounds too early to me. So far we have tried to block on everything but the writeback which can take a lot of time. > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: linux-mm@kvack.org > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> > --- > mm/migrate.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 86873b6f38a7..df60026779d2 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1010,7 +1010,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > bool is_lru = !__PageMovable(page); > > if (!trylock_page(page)) { > - if (!force || mode == MIGRATE_ASYNC) > + if (!force || mode == MIGRATE_ASYNC > + || mode == MIGRATE_SYNC_LIGHT) > goto out; > > /* > -- > 2.21.1 >
On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote: > We observed kcompactd hung at __lock_page(): > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > Not tainted 4.19.56.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > kcompactd0 D 0 57 2 0x80000000 > Call Trace: > ? __schedule+0x236/0x860 > schedule+0x28/0x80 > io_schedule+0x12/0x40 > __lock_page+0xf9/0x120 > ? page_cache_tree_insert+0xb0/0xb0 > ? update_pageblock_skip+0xb0/0xb0 > migrate_pages+0x88c/0xb90 > ? isolate_freepages_block+0x3b0/0x3b0 > compact_zone+0x5f1/0x870 > kcompactd_do_work+0x130/0x2c0 > ? __switch_to_asm+0x35/0x70 > ? __switch_to_asm+0x41/0x70 > ? kcompactd_do_work+0x2c0/0x2c0 > ? kcompactd+0x73/0x180 > kcompactd+0x73/0x180 > ? finish_wait+0x80/0x80 > kthread+0x113/0x130 > ? kthread_create_worker_on_cpu+0x50/0x50 > ret_from_fork+0x35/0x40 > > which faddr2line maps to: > > migrate_pages+0x88c/0xb90: > lock_page at include/linux/pagemap.h:483 > (inlined by) __unmap_and_move at mm/migrate.c:1024 > (inlined by) unmap_and_move at mm/migrate.c:1189 > (inlined by) migrate_pages at mm/migrate.c:1419 > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > I think for memory compaction, it is a best effort to migrate the pages, > so it doesn't have to wait for I/O to complete. It is fine to call > trylock_page() here, which is pretty much similar to > buffer_migrate_lock_buffers(). > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the > check for it. > Is this a single page being locked for a long time or multiple pages being locked without reaching a reschedule point? If it's a single page being locked, it's important to identify what held page lock for 2 minutes because that is potentially a missing unlock_page. The kernel in question is old -- 4.19.56. Are there any other modifications to that kernel?
On Fri, Jan 10, 2020 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote: > > We observed kcompactd hung at __lock_page(): > > > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > > Not tainted 4.19.56.x86_64 #1 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > kcompactd0 D 0 57 2 0x80000000 > > Call Trace: > > ? __schedule+0x236/0x860 > > schedule+0x28/0x80 > > io_schedule+0x12/0x40 > > __lock_page+0xf9/0x120 > > ? page_cache_tree_insert+0xb0/0xb0 > > ? update_pageblock_skip+0xb0/0xb0 > > migrate_pages+0x88c/0xb90 > > ? isolate_freepages_block+0x3b0/0x3b0 > > compact_zone+0x5f1/0x870 > > kcompactd_do_work+0x130/0x2c0 > > ? __switch_to_asm+0x35/0x70 > > ? __switch_to_asm+0x41/0x70 > > ? kcompactd_do_work+0x2c0/0x2c0 > > ? kcompactd+0x73/0x180 > > kcompactd+0x73/0x180 > > ? finish_wait+0x80/0x80 > > kthread+0x113/0x130 > > ? kthread_create_worker_on_cpu+0x50/0x50 > > ret_from_fork+0x35/0x40 > > > > which faddr2line maps to: > > > > migrate_pages+0x88c/0xb90: > > lock_page at include/linux/pagemap.h:483 > > (inlined by) __unmap_and_move at mm/migrate.c:1024 > > (inlined by) unmap_and_move at mm/migrate.c:1189 > > (inlined by) migrate_pages at mm/migrate.c:1419 > > > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > > > I think for memory compaction, it is a best effort to migrate the pages, > > so it doesn't have to wait for I/O to complete. It is fine to call > > trylock_page() here, which is pretty much similar to > > buffer_migrate_lock_buffers(). > > > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the > > check for it. > > > > Is this a single page being locked for a long time or multiple pages > being locked without reaching a reschedule point? Not sure whether it is single page or multiple pages, but I successfully located the process locking the page (or pages), and I used perf to capture its stack trace: ffffffffa722aa06 shrink_inactive_list ffffffffa722b3d7 shrink_node_memcg ffffffffa722b85f shrink_node ffffffffa722bc89 do_try_to_free_pages ffffffffa722c179 try_to_free_mem_cgroup_pages ffffffffa7298703 try_charge ffffffffa729a886 mem_cgroup_try_charge ffffffffa720ec03 __add_to_page_cache_locked ffffffffa720ee3a add_to_page_cache_lru ffffffffa7312ddb iomap_readpages_actor ffffffffa73133f7 iomap_apply ffffffffa73135da iomap_readpages ffffffffa722062e read_pages ffffffffa7220b3f __do_page_cache_readahead ffffffffa7210554 filemap_fault ffffffffc039e41f __xfs_filemap_fault ffffffffa724f5e7 __do_fault ffffffffa724c5f2 __handle_mm_fault ffffffffa724cbc6 handle_mm_fault ffffffffa70a313e __do_page_fault ffffffffa7a00dfe page_fault This process got stuck in this situation for a long time (since I sent out this patch) without making any progress. It behaves like stuck in an infinite loop, although the EIP still moves around within mem_cgroup_try_charge(). I also enabled trace event mm_vmscan_lru_shrink_inactive(), here is what I collected: <...>-455459 [003] .... 2691911.664706: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 
2691911.664711: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664714: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664717: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664720: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664725: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=7 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664730: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664732: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664736: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664739: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664744: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664747: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664752: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=12 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664755: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 
2691911.664761: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664762: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664764: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664770: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664777: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=21 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664780: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664783: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > > If it's a single page being locked, it's important to identify what held > page lock for 2 minutes because that is potentially a missing > unlock_page. The kernel in question is old -- 4.19.56. Are there any > other modifications to that kernel? We only backported some networking and hardware driver patches, not any MM change. Please let me know if I can collect any other information you need, before it gets rebooted. Thanks.
Hi, Michal On Thu, Jan 9, 2020 at 11:38 PM Michal Hocko <mhocko@kernel.org> wrote: > > [CC Mel and Vlastimil] > > On Thu 09-01-20 14:56:46, Cong Wang wrote: > > We observed kcompactd hung at __lock_page(): > > > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > > Not tainted 4.19.56.x86_64 #1 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > kcompactd0 D 0 57 2 0x80000000 > > Call Trace: > > ? __schedule+0x236/0x860 > > schedule+0x28/0x80 > > io_schedule+0x12/0x40 > > __lock_page+0xf9/0x120 > > ? page_cache_tree_insert+0xb0/0xb0 > > ? update_pageblock_skip+0xb0/0xb0 > > migrate_pages+0x88c/0xb90 > > ? isolate_freepages_block+0x3b0/0x3b0 > > compact_zone+0x5f1/0x870 > > kcompactd_do_work+0x130/0x2c0 > > ? __switch_to_asm+0x35/0x70 > > ? __switch_to_asm+0x41/0x70 > > ? kcompactd_do_work+0x2c0/0x2c0 > > ? kcompactd+0x73/0x180 > > kcompactd+0x73/0x180 > > ? finish_wait+0x80/0x80 > > kthread+0x113/0x130 > > ? kthread_create_worker_on_cpu+0x50/0x50 > > ret_from_fork+0x35/0x40 > > > > which faddr2line maps to: > > > > migrate_pages+0x88c/0xb90: > > lock_page at include/linux/pagemap.h:483 > > (inlined by) __unmap_and_move at mm/migrate.c:1024 > > (inlined by) unmap_and_move at mm/migrate.c:1189 > > (inlined by) migrate_pages at mm/migrate.c:1419 > > > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > What does this mean exactly? Who is holding the page lock? As I explained in other email, I didn't locate the process holding the page lock before I sent out this patch, as I was fooled by /proc/X/stack. But now I got its stack trace with `perf`: ffffffffa722aa06 shrink_inactive_list ffffffffa722b3d7 shrink_node_memcg ffffffffa722b85f shrink_node ffffffffa722bc89 do_try_to_free_pages ffffffffa722c179 try_to_free_mem_cgroup_pages ffffffffa7298703 try_charge ffffffffa729a886 mem_cgroup_try_charge ffffffffa720ec03 __add_to_page_cache_locked ffffffffa720ee3a add_to_page_cache_lru ffffffffa7312ddb iomap_readpages_actor ffffffffa73133f7 iomap_apply ffffffffa73135da iomap_readpages ffffffffa722062e read_pages ffffffffa7220b3f __do_page_cache_readahead ffffffffa7210554 filemap_fault ffffffffc039e41f __xfs_filemap_fault ffffffffa724f5e7 __do_fault ffffffffa724c5f2 __handle_mm_fault ffffffffa724cbc6 handle_mm_fault ffffffffa70a313e __do_page_fault ffffffffa7a00dfe page_fault It got stuck somewhere along the call path of mem_cgroup_try_charge(), and the trace events of mm_vmscan_lru_shrink_inactive() indicates this too: <...>-455459 [003] .... 2691911.664706: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664711: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664714: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 
2691911.664717: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664720: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664725: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=7 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664730: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664732: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664736: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664739: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664744: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664747: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664752: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=12 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664755: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664761: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664762: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 
2691911.664764: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664770: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664777: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=21 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664780: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455459 [003] .... 2691911.664783: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC Thanks.
On Mon, 20 Jan 2020 14:41:50 -0800 Cong Wang wrote: > On Fri, Jan 10, 2020 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote: > > > We observed kcompactd hung at __lock_page(): > > > > > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > > > Not tainted 4.19.56.x86_64 #1 > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > kcompactd0 D 0 57 2 0x80000000 > > > Call Trace: > > > ? __schedule+0x236/0x860 > > > schedule+0x28/0x80 > > > io_schedule+0x12/0x40 > > > __lock_page+0xf9/0x120 > > > ? page_cache_tree_insert+0xb0/0xb0 > > > ? update_pageblock_skip+0xb0/0xb0 > > > migrate_pages+0x88c/0xb90 > > > ? isolate_freepages_block+0x3b0/0x3b0 > > > compact_zone+0x5f1/0x870 > > > kcompactd_do_work+0x130/0x2c0 > > > ? __switch_to_asm+0x35/0x70 > > > ? __switch_to_asm+0x41/0x70 > > > ? kcompactd_do_work+0x2c0/0x2c0 > > > ? kcompactd+0x73/0x180 > > > kcompactd+0x73/0x180 > > > ? finish_wait+0x80/0x80 > > > kthread+0x113/0x130 > > > ? kthread_create_worker_on_cpu+0x50/0x50 > > > ret_from_fork+0x35/0x40 > > > > > > which faddr2line maps to: > > > > > > migrate_pages+0x88c/0xb90: > > > lock_page at include/linux/pagemap.h:483 > > > (inlined by) __unmap_and_move at mm/migrate.c:1024 > > > (inlined by) unmap_and_move at mm/migrate.c:1189 > > > (inlined by) migrate_pages at mm/migrate.c:1419 > > > > > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > > > > > I think for memory compaction, it is a best effort to migrate the pages, > > > so it doesn't have to wait for I/O to complete. It is fine to call > > > trylock_page() here, which is pretty much similar to > > > buffer_migrate_lock_buffers(). > > > > > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the > > > check for it. > > > > > > > Is this a single page being locked for a long time or multiple pages > > being locked without reaching a reschedule point? > > Not sure whether it is single page or multiple pages, but I successfully > located the process locking the page (or pages), and I used perf to > capture its stack trace: > > > ffffffffa722aa06 shrink_inactive_list > ffffffffa722b3d7 shrink_node_memcg > ffffffffa722b85f shrink_node > ffffffffa722bc89 do_try_to_free_pages > ffffffffa722c179 try_to_free_mem_cgroup_pages > ffffffffa7298703 try_charge > ffffffffa729a886 mem_cgroup_try_charge > ffffffffa720ec03 __add_to_page_cache_locked > ffffffffa720ee3a add_to_page_cache_lru > ffffffffa7312ddb iomap_readpages_actor > ffffffffa73133f7 iomap_apply > ffffffffa73135da iomap_readpages > ffffffffa722062e read_pages > ffffffffa7220b3f __do_page_cache_readahead > ffffffffa7210554 filemap_fault > ffffffffc039e41f __xfs_filemap_fault > ffffffffa724f5e7 __do_fault > ffffffffa724c5f2 __handle_mm_fault > ffffffffa724cbc6 handle_mm_fault > ffffffffa70a313e __do_page_fault > ffffffffa7a00dfe page_fault > > This process got stuck in this situation for a long time (since I sent out > this patch) without making any progress. It behaves like stuck in an infinite > loop, although the EIP still moves around within mem_cgroup_try_charge(). Make page reclaim in try_charge() async assuming sync reclaim is unnecessary without memory pressure and it does not help much under heavy pressure. Skipping reclaim is only confined in page fault context to avoid adding too much a time. 
Skipping reclaim is only confined in page fault context to avoid adding too much a time.
--- linux-5.5-rc3/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2525,6 +2525,12 @@ force: if (do_memsw_account()) page_counter_charge(&memcg->memsw, nr_pages); css_get_many(&memcg->css, nr_pages); + /* + * reclaim high limit pages soon without holding resources like + * page lock e.g in page fault context + */ + if (unlikely(current->flags & PF_MEMALLOC)) + schedule_work(&memcg->high_work); return 0; --- linux-5.5-rc3/mm/filemap.c +++ b/mm/filemap.c @@ -863,8 +863,14 @@ static int __add_to_page_cache_locked(st mapping_set_update(&xas, mapping); if (!huge) { + bool was_set = current->flags & PF_MEMALLOC; + if (!was_set) + current->flags |= PF_MEMALLOC; + error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg, false); + if (!was_set) + current->flags &= ~PF_MEMALLOC; if (error) return error; } -- > I also enabled trace event mm_vmscan_lru_shrink_inactive(), here is what > I collected: > > <...>-455459 [003] .... 2691911.664706: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664711: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664714: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664717: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664720: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664725: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=7 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664730: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664732: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664736: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 
2691911.664739: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664744: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664747: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664752: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=12 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa > te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664755: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664761: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664762: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664764: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664770: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664777: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=21 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa > te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664780: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664783: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > > > > > > If it's a single page being locked, it's important to identify what held > > page lock for 2 minutes because that is potentially a missing > > unlock_page. The kernel in question is old -- 4.19.56. Are there any > > other modifications to that kernel? > > We only backported some networking and hardware driver patches, > not any MM change. 
> > Please let me know if I can collect any other information you need, > before it gets rebooted. > > Thanks.
On Mon 20-01-20 14:48:05, Cong Wang wrote: > Hi, Michal > > On Thu, Jan 9, 2020 at 11:38 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > [CC Mel and Vlastimil] > > > > On Thu 09-01-20 14:56:46, Cong Wang wrote: > > > We observed kcompactd hung at __lock_page(): > > > > > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > > > Not tainted 4.19.56.x86_64 #1 > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > kcompactd0 D 0 57 2 0x80000000 > > > Call Trace: > > > ? __schedule+0x236/0x860 > > > schedule+0x28/0x80 > > > io_schedule+0x12/0x40 > > > __lock_page+0xf9/0x120 > > > ? page_cache_tree_insert+0xb0/0xb0 > > > ? update_pageblock_skip+0xb0/0xb0 > > > migrate_pages+0x88c/0xb90 > > > ? isolate_freepages_block+0x3b0/0x3b0 > > > compact_zone+0x5f1/0x870 > > > kcompactd_do_work+0x130/0x2c0 > > > ? __switch_to_asm+0x35/0x70 > > > ? __switch_to_asm+0x41/0x70 > > > ? kcompactd_do_work+0x2c0/0x2c0 > > > ? kcompactd+0x73/0x180 > > > kcompactd+0x73/0x180 > > > ? finish_wait+0x80/0x80 > > > kthread+0x113/0x130 > > > ? kthread_create_worker_on_cpu+0x50/0x50 > > > ret_from_fork+0x35/0x40 > > > > > > which faddr2line maps to: > > > > > > migrate_pages+0x88c/0xb90: > > > lock_page at include/linux/pagemap.h:483 > > > (inlined by) __unmap_and_move at mm/migrate.c:1024 > > > (inlined by) unmap_and_move at mm/migrate.c:1189 > > > (inlined by) migrate_pages at mm/migrate.c:1419 > > > > > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > > > What does this mean exactly? Who is holding the page lock? > > As I explained in other email, I didn't locate the process holding the page > lock before I sent out this patch, as I was fooled by /proc/X/stack. > > But now I got its stack trace with `perf`: > > ffffffffa722aa06 shrink_inactive_list > ffffffffa722b3d7 shrink_node_memcg > ffffffffa722b85f shrink_node > ffffffffa722bc89 do_try_to_free_pages > ffffffffa722c179 try_to_free_mem_cgroup_pages > ffffffffa7298703 try_charge > ffffffffa729a886 mem_cgroup_try_charge > ffffffffa720ec03 __add_to_page_cache_locked > ffffffffa720ee3a add_to_page_cache_lru > ffffffffa7312ddb iomap_readpages_actor > ffffffffa73133f7 iomap_apply > ffffffffa73135da iomap_readpages > ffffffffa722062e read_pages > ffffffffa7220b3f __do_page_cache_readahead > ffffffffa7210554 filemap_fault > ffffffffc039e41f __xfs_filemap_fault > ffffffffa724f5e7 __do_fault > ffffffffa724c5f2 __handle_mm_fault > ffffffffa724cbc6 handle_mm_fault > ffffffffa70a313e __do_page_fault > ffffffffa7a00dfe page_fault > > It got stuck somewhere along the call path of mem_cgroup_try_charge(), > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this > too: So it seems that you are condending on the page lock. It is really unexpected that the reclaim would take that long though. Please try to enable more vmscan tracepoints to see where the time is spent. Thanks!
On Tue 21-01-20 16:26:24, Hillf Danton wrote: > > On Mon, 20 Jan 2020 14:41:50 -0800 Cong Wang wrote: > > On Fri, Jan 10, 2020 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote: > > > > We observed kcompactd hung at __lock_page(): > > > > > > > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > > > > Not tainted 4.19.56.x86_64 #1 > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > > kcompactd0 D 0 57 2 0x80000000 > > > > Call Trace: > > > > ? __schedule+0x236/0x860 > > > > schedule+0x28/0x80 > > > > io_schedule+0x12/0x40 > > > > __lock_page+0xf9/0x120 > > > > ? page_cache_tree_insert+0xb0/0xb0 > > > > ? update_pageblock_skip+0xb0/0xb0 > > > > migrate_pages+0x88c/0xb90 > > > > ? isolate_freepages_block+0x3b0/0x3b0 > > > > compact_zone+0x5f1/0x870 > > > > kcompactd_do_work+0x130/0x2c0 > > > > ? __switch_to_asm+0x35/0x70 > > > > ? __switch_to_asm+0x41/0x70 > > > > ? kcompactd_do_work+0x2c0/0x2c0 > > > > ? kcompactd+0x73/0x180 > > > > kcompactd+0x73/0x180 > > > > ? finish_wait+0x80/0x80 > > > > kthread+0x113/0x130 > > > > ? kthread_create_worker_on_cpu+0x50/0x50 > > > > ret_from_fork+0x35/0x40 > > > > > > > > which faddr2line maps to: > > > > > > > > migrate_pages+0x88c/0xb90: > > > > lock_page at include/linux/pagemap.h:483 > > > > (inlined by) __unmap_and_move at mm/migrate.c:1024 > > > > (inlined by) unmap_and_move at mm/migrate.c:1189 > > > > (inlined by) migrate_pages at mm/migrate.c:1419 > > > > > > > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > > > > > > > I think for memory compaction, it is a best effort to migrate the pages, > > > > so it doesn't have to wait for I/O to complete. It is fine to call > > > > trylock_page() here, which is pretty much similar to > > > > buffer_migrate_lock_buffers(). > > > > > > > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the > > > > check for it. > > > > > > > > > > Is this a single page being locked for a long time or multiple pages > > > being locked without reaching a reschedule point? > > > > Not sure whether it is single page or multiple pages, but I successfully > > located the process locking the page (or pages), and I used perf to > > capture its stack trace: > > > > > > ffffffffa722aa06 shrink_inactive_list > > ffffffffa722b3d7 shrink_node_memcg > > ffffffffa722b85f shrink_node > > ffffffffa722bc89 do_try_to_free_pages > > ffffffffa722c179 try_to_free_mem_cgroup_pages > > ffffffffa7298703 try_charge > > ffffffffa729a886 mem_cgroup_try_charge > > ffffffffa720ec03 __add_to_page_cache_locked > > ffffffffa720ee3a add_to_page_cache_lru > > ffffffffa7312ddb iomap_readpages_actor > > ffffffffa73133f7 iomap_apply > > ffffffffa73135da iomap_readpages > > ffffffffa722062e read_pages > > ffffffffa7220b3f __do_page_cache_readahead > > ffffffffa7210554 filemap_fault > > ffffffffc039e41f __xfs_filemap_fault > > ffffffffa724f5e7 __do_fault > > ffffffffa724c5f2 __handle_mm_fault > > ffffffffa724cbc6 handle_mm_fault > > ffffffffa70a313e __do_page_fault > > ffffffffa7a00dfe page_fault > > > > This process got stuck in this situation for a long time (since I sent out > > this patch) without making any progress. It behaves like stuck in an infinite > > loop, although the EIP still moves around within mem_cgroup_try_charge(). > > Make page reclaim in try_charge() async assuming sync reclaim is unnecessary > without memory pressure and it does not help much under heavy pressure. 
Skipping > reclaim is only confined in page fault context to avoid adding too much a time. > > --- linux-5.5-rc3/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2525,6 +2525,12 @@ force: > if (do_memsw_account()) > page_counter_charge(&memcg->memsw, nr_pages); > css_get_many(&memcg->css, nr_pages); > + /* > + * reclaim high limit pages soon without holding resources like > + * page lock e.g in page fault context > + */ > + if (unlikely(current->flags & PF_MEMALLOC)) > + schedule_work(&memcg->high_work); > > return 0; > > --- linux-5.5-rc3/mm/filemap.c > +++ b/mm/filemap.c > @@ -863,8 +863,14 @@ static int __add_to_page_cache_locked(st > mapping_set_update(&xas, mapping); > > if (!huge) { > + bool was_set = current->flags & PF_MEMALLOC; > + if (!was_set) > + current->flags |= PF_MEMALLOC; > + > error = mem_cgroup_try_charge(page, current->mm, > gfp_mask, &memcg, false); > + if (!was_set) > + current->flags &= ~PF_MEMALLOC; > if (error) > return error; Not only this doesn't help at all because the direct reclaim for the hard limit is done already. The high limit is also already handled properly when we are under the hard limit. So this patch doesn't really make much sense to me.
On Mon, Jan 20, 2020 at 2:42 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Fri, Jan 10, 2020 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Thu, Jan 09, 2020 at 02:56:46PM -0800, Cong Wang wrote: > > > We observed kcompactd hung at __lock_page(): > > > > > > INFO: task kcompactd0:57 blocked for more than 120 seconds. > > > Not tainted 4.19.56.x86_64 #1 > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > kcompactd0 D 0 57 2 0x80000000 > > > Call Trace: > > > ? __schedule+0x236/0x860 > > > schedule+0x28/0x80 > > > io_schedule+0x12/0x40 > > > __lock_page+0xf9/0x120 > > > ? page_cache_tree_insert+0xb0/0xb0 > > > ? update_pageblock_skip+0xb0/0xb0 > > > migrate_pages+0x88c/0xb90 > > > ? isolate_freepages_block+0x3b0/0x3b0 > > > compact_zone+0x5f1/0x870 > > > kcompactd_do_work+0x130/0x2c0 > > > ? __switch_to_asm+0x35/0x70 > > > ? __switch_to_asm+0x41/0x70 > > > ? kcompactd_do_work+0x2c0/0x2c0 > > > ? kcompactd+0x73/0x180 > > > kcompactd+0x73/0x180 > > > ? finish_wait+0x80/0x80 > > > kthread+0x113/0x130 > > > ? kthread_create_worker_on_cpu+0x50/0x50 > > > ret_from_fork+0x35/0x40 > > > > > > which faddr2line maps to: > > > > > > migrate_pages+0x88c/0xb90: > > > lock_page at include/linux/pagemap.h:483 > > > (inlined by) __unmap_and_move at mm/migrate.c:1024 > > > (inlined by) unmap_and_move at mm/migrate.c:1189 > > > (inlined by) migrate_pages at mm/migrate.c:1419 > > > > > > Sometimes kcompactd eventually got out of this situation, sometimes not. > > > > > > I think for memory compaction, it is a best effort to migrate the pages, > > > so it doesn't have to wait for I/O to complete. It is fine to call > > > trylock_page() here, which is pretty much similar to > > > buffer_migrate_lock_buffers(). > > > > > > Given MIGRATE_SYNC_LIGHT is used on compaction path, just relax the > > > check for it. > > > > > > > Is this a single page being locked for a long time or multiple pages > > being locked without reaching a reschedule point? > > Not sure whether it is single page or multiple pages, but I successfully > located the process locking the page (or pages), and I used perf to > capture its stack trace: > > > ffffffffa722aa06 shrink_inactive_list > ffffffffa722b3d7 shrink_node_memcg > ffffffffa722b85f shrink_node > ffffffffa722bc89 do_try_to_free_pages > ffffffffa722c179 try_to_free_mem_cgroup_pages > ffffffffa7298703 try_charge > ffffffffa729a886 mem_cgroup_try_charge > ffffffffa720ec03 __add_to_page_cache_locked > ffffffffa720ee3a add_to_page_cache_lru > ffffffffa7312ddb iomap_readpages_actor > ffffffffa73133f7 iomap_apply > ffffffffa73135da iomap_readpages > ffffffffa722062e read_pages > ffffffffa7220b3f __do_page_cache_readahead > ffffffffa7210554 filemap_fault > ffffffffc039e41f __xfs_filemap_fault > ffffffffa724f5e7 __do_fault > ffffffffa724c5f2 __handle_mm_fault > ffffffffa724cbc6 handle_mm_fault > ffffffffa70a313e __do_page_fault > ffffffffa7a00dfe page_fault > > This process got stuck in this situation for a long time (since I sent out > this patch) without making any progress. It behaves like stuck in an infinite > loop, although the EIP still moves around within mem_cgroup_try_charge(). > > I also enabled trace event mm_vmscan_lru_shrink_inactive(), here is what > I collected: > > <...>-455459 [003] .... 
2691911.664706: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664711: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664714: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664717: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664720: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664725: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=7 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664730: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664732: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664736: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664739: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664744: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664747: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664752: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=12 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa > te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 
2691911.664755: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664761: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664762: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664764: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664770: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=4 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664777: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=21 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa > te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664780: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC > <...>-455459 [003] .... 2691911.664783: > mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=2 nr_reclaimed=0 > nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat > e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 > flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC Thanks for sharing the tracing result. I suspect the case might be: CPU A CPU B page fault locked the page try charge do limit reclaim compaction tried to isolate the locked page But it looks the limit reclaim took forever without reclaiming any page, and no oom or bail out correctly. And, according to the tracing result, it seems the inactive lru is very short since it just scanned a few pages even with very high priority. So, other than Michal's suggestion, I'd suggest you inspect what the jobs are doing in that memcg. > > > > > > If it's a single page being locked, it's important to identify what held > > page lock for 2 minutes because that is potentially a missing > > unlock_page. The kernel in question is old -- 4.19.56. Are there any > > other modifications to that kernel? > > We only backported some networking and hardware driver patches, > not any MM change. > > Please let me know if I can collect any other information you need, > before it gets rebooted. > > Thanks. >
On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Mon 20-01-20 14:48:05, Cong Wang wrote: > > It got stuck somewhere along the call path of mem_cgroup_try_charge(), > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this > > too: > > So it seems that you are condending on the page lock. It is really > unexpected that the reclaim would take that long though. Please try to > enable more vmscan tracepoints to see where the time is spent. Sorry for the delay. I have been trying to collect more data in one shot. This is a a typical round of the loop after enabling all vmscan tracepoints: <...>-455450 [007] .... 4048595.842992: mm_vmscan_memcg_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE classzone_idx= 4 <...>-455450 [007] d... 4048595.842995: mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0 nr_requested=1 nr_scanned=1 nr_skipped=0 nr_taken=1 lru=inactive_file <...>-455450 [007] .... 4048595.842995: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=4 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455450 [007] dN.. 4048595.842997: mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0 nr_requested=1 nr_scanned=1 nr_skipped=0 nr_taken=1 lru=inactive_file <...>-455450 [007] .... 4048595.843001: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=1 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=3 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455450 [007] d... 4048595.843002: mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0 nr_requested=5 nr_scanned=5 nr_skipped=0 nr_taken=5 lru=inactive_file <...>-455450 [007] .... 4048595.843004: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=5 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455450 [007] d... 4048595.843006: mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0 nr_requested=9 nr_scanned=9 nr_skipped=0 nr_taken=9 lru=inactive_file <...>-455450 [007] .... 4048595.843007: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=9 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activat e=0 nr_ref_keep=0 nr_unmap_fail=0 priority=1 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455450 [007] d... 4048595.843009: mm_vmscan_lru_isolate: isolate_mode=0 classzone=3 order=0 nr_requested=17 nr_scanned=15 nr_skipped=0 nr_taken=15 lru=inactive_file <...>-455450 [007] .... 4048595.843011: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=15 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activa te=0 nr_ref_keep=0 nr_unmap_fail=0 priority=0 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC <...>-455450 [007] .... 4048595.843012: mm_vmscan_memcg_reclaim_end: nr_reclaimed=0 The whole trace output is huge (33M), I can provide it on request. 
I suspect the process gets stuck in the retry loop in try_charge(), as the _shortest_ stacktrace of the perf samples indicated: cycles:ppp: ffffffffa72963db mem_cgroup_iter ffffffffa72980ca mem_cgroup_oom_unlock ffffffffa7298c15 try_charge ffffffffa729a886 mem_cgroup_try_charge ffffffffa720ec03 __add_to_page_cache_locked ffffffffa720ee3a add_to_page_cache_lru ffffffffa7312ddb iomap_readpages_actor ffffffffa73133f7 iomap_apply ffffffffa73135da iomap_readpages ffffffffa722062e read_pages ffffffffa7220b3f __do_page_cache_readahead ffffffffa7210554 filemap_fault ffffffffc039e41f __xfs_filemap_fault ffffffffa724f5e7 __do_fault ffffffffa724c5f2 __handle_mm_fault ffffffffa724cbc6 handle_mm_fault ffffffffa70a313e __do_page_fault ffffffffa7a00dfe page_fault But I don't see how it could be, the only possible case is when mem_cgroup_oom() returns OOM_SUCCESS. However I can't find any clue in dmesg pointing to OOM. These processes in the same memcg are either running or sleeping (that is not exiting or coredump'ing), I don't see how and why they could be selected as a victim of OOM killer. I don't see any signal pending either from their /proc/X/status. Here is the status of the whole memcg: $ sudo cat /sys/fs/cgroup/memory/system.slice/osqueryd.service/memory.oom_control oom_kill_disable 0 under_oom 0 oom_kill 0 $ sudo cat /sys/fs/cgroup/memory/system.slice/osqueryd.service/memory.stat cache 376832 rss 255864832 rss_huge 0 shmem 0 mapped_file 135168 dirty 135168 writeback 0 pgpgin 5327157 pgpgout 5264598 pgfault 4251687 pgmajfault 7491 inactive_anon 4096 active_anon 256184320 inactive_file 32768 active_file 0 unevictable 0 hierarchical_memory_limit 262144000 total_cache 376832 total_rss 255864832 total_rss_huge 0 total_shmem 0 total_mapped_file 135168 total_dirty 135168 total_writeback 0 total_pgpgin 5327157 total_pgpgout 5264598 total_pgfault 4251687 total_pgmajfault 7491 total_inactive_anon 4096 total_active_anon 256184320 total_inactive_file 86016 total_active_file 0 total_unevictable 0 Please let me know if I can provide anything else. Thanks.
On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Mon 20-01-20 14:48:05, Cong Wang wrote: > > > It got stuck somewhere along the call path of mem_cgroup_try_charge(), > > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this > > > too: > > > > So it seems that you are condending on the page lock. It is really > > unexpected that the reclaim would take that long though. Please try to > > enable more vmscan tracepoints to see where the time is spent. > > I suspect the process gets stuck in the retry loop in try_charge(), as > the _shortest_ stacktrace of the perf samples indicated: > > cycles:ppp: > ffffffffa72963db mem_cgroup_iter > ffffffffa72980ca mem_cgroup_oom_unlock > ffffffffa7298c15 try_charge > ffffffffa729a886 mem_cgroup_try_charge > ffffffffa720ec03 __add_to_page_cache_locked > ffffffffa720ee3a add_to_page_cache_lru > ffffffffa7312ddb iomap_readpages_actor > ffffffffa73133f7 iomap_apply > ffffffffa73135da iomap_readpages > ffffffffa722062e read_pages > ffffffffa7220b3f __do_page_cache_readahead > ffffffffa7210554 filemap_fault > ffffffffc039e41f __xfs_filemap_fault > ffffffffa724f5e7 __do_fault > ffffffffa724c5f2 __handle_mm_fault > ffffffffa724cbc6 handle_mm_fault > ffffffffa70a313e __do_page_fault > ffffffffa7a00dfe page_fault > > But I don't see how it could be, the only possible case is when > mem_cgroup_oom() returns OOM_SUCCESS. However I can't > find any clue in dmesg pointing to OOM. These processes in the > same memcg are either running or sleeping (that is not exiting or > coredump'ing), I don't see how and why they could be selected as > a victim of OOM killer. I don't see any signal pending either from > their /proc/X/status. I think this is a situation where we might end up with a genuine deadlock if we're not trylocking the pages. readahead allocates a batch of locked pages and adds them to the pagecache. If it has allocated, say, 5 pages, successfully inserted the first three into i_pages, then needs to allocate memory to insert the fourth one into i_pages, and the process then attempts to migrate the pages which are still locked, they will never come unlocked because they haven't yet been submitted to the filesystem for reading. Or is this enough? static inline gfp_t readahead_gfp_mask(struct address_space *x) return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
On Sun 26-01-20 11:53:55, Cong Wang wrote: > On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Mon 20-01-20 14:48:05, Cong Wang wrote: > > > It got stuck somewhere along the call path of mem_cgroup_try_charge(), > > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this > > > too: > > > > So it seems that you are condending on the page lock. It is really > > unexpected that the reclaim would take that long though. Please try to > > enable more vmscan tracepoints to see where the time is spent. > > Sorry for the delay. I have been trying to collect more data in one shot. > > This is a a typical round of the loop after enabling all vmscan tracepoints: > > <...>-455450 [007] .... 4048595.842992: > mm_vmscan_memcg_reclaim_begin: order=0 may_writepage=1 > gfp_flags=GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE > classzone_idx=4 > <...>-455450 [007] .... 4048595.843012: > mm_vmscan_memcg_reclaim_end: nr_reclaimed=0 This doesn't tell us much though. This reclaim round has taken close to no time. See timestamps. > The whole trace output is huge (33M), I can provide it on request. Focus on reclaim rounds that take a long time and see where it gets you.
On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Mon 20-01-20 14:48:05, Cong Wang wrote: > > > > It got stuck somewhere along the call path of mem_cgroup_try_charge(), > > > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this > > > > too: > > > > > > So it seems that you are condending on the page lock. It is really > > > unexpected that the reclaim would take that long though. Please try to > > > enable more vmscan tracepoints to see where the time is spent. > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > the _shortest_ stacktrace of the perf samples indicated: > > > > cycles:ppp: > > ffffffffa72963db mem_cgroup_iter > > ffffffffa72980ca mem_cgroup_oom_unlock > > ffffffffa7298c15 try_charge > > ffffffffa729a886 mem_cgroup_try_charge > > ffffffffa720ec03 __add_to_page_cache_locked > > ffffffffa720ee3a add_to_page_cache_lru > > ffffffffa7312ddb iomap_readpages_actor > > ffffffffa73133f7 iomap_apply > > ffffffffa73135da iomap_readpages > > ffffffffa722062e read_pages > > ffffffffa7220b3f __do_page_cache_readahead > > ffffffffa7210554 filemap_fault > > ffffffffc039e41f __xfs_filemap_fault > > ffffffffa724f5e7 __do_fault > > ffffffffa724c5f2 __handle_mm_fault > > ffffffffa724cbc6 handle_mm_fault > > ffffffffa70a313e __do_page_fault > > ffffffffa7a00dfe page_fault > > > > But I don't see how it could be, the only possible case is when > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't > > find any clue in dmesg pointing to OOM. These processes in the > > same memcg are either running or sleeping (that is not exiting or > > coredump'ing), I don't see how and why they could be selected as > > a victim of OOM killer. I don't see any signal pending either from > > their /proc/X/status. > > I think this is a situation where we might end up with a genuine deadlock > if we're not trylocking the pages. readahead allocates a batch of > locked pages and adds them to the pagecache. If it has allocated, > say, 5 pages, successfully inserted the first three into i_pages, then > needs to allocate memory to insert the fourth one into i_pages, and > the process then attempts to migrate the pages which are still locked, > they will never come unlocked because they haven't yet been submitted > to the filesystem for reading. Just to make sure I understand. Do you mean this? lock_page(A) alloc_pages try_to_compact_pages compact_zone_order compact_zone(MIGRATE_SYNC_LIGHT) migrate_pages unmap_and_move __unmap_and_move lock_page(A)
On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > cycles:ppp: > > > ffffffffa72963db mem_cgroup_iter > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > ffffffffa7298c15 try_charge > > > ffffffffa729a886 mem_cgroup_try_charge > > > ffffffffa720ec03 __add_to_page_cache_locked > > > ffffffffa720ee3a add_to_page_cache_lru > > > ffffffffa7312ddb iomap_readpages_actor > > > ffffffffa73133f7 iomap_apply > > > ffffffffa73135da iomap_readpages > > > ffffffffa722062e read_pages > > > ffffffffa7220b3f __do_page_cache_readahead > > > ffffffffa7210554 filemap_fault > > > ffffffffc039e41f __xfs_filemap_fault > > > ffffffffa724f5e7 __do_fault > > > ffffffffa724c5f2 __handle_mm_fault > > > ffffffffa724cbc6 handle_mm_fault > > > ffffffffa70a313e __do_page_fault > > > ffffffffa7a00dfe page_fault > > > > > > But I don't see how it could be, the only possible case is when > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't > > > find any clue in dmesg pointing to OOM. These processes in the > > > same memcg are either running or sleeping (that is not exiting or > > > coredump'ing), I don't see how and why they could be selected as > > > a victim of OOM killer. I don't see any signal pending either from > > > their /proc/X/status. > > > > I think this is a situation where we might end up with a genuine deadlock > > if we're not trylocking the pages. readahead allocates a batch of > > locked pages and adds them to the pagecache. If it has allocated, > > say, 5 pages, successfully inserted the first three into i_pages, then > > needs to allocate memory to insert the fourth one into i_pages, and > > the process then attempts to migrate the pages which are still locked, > > they will never come unlocked because they haven't yet been submitted > > to the filesystem for reading. > > Just to make sure I understand. Do you mean this? > lock_page(A) > alloc_pages > try_to_compact_pages > compact_zone_order > compact_zone(MIGRATE_SYNC_LIGHT) > migrate_pages > unmap_and_move > __unmap_and_move > lock_page(A) Yes. There's a little more to it than that, eg slab is involved, but you have it in a nutshell.
On Mon, Jan 27, 2020 at 6:49 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Sun 26-01-20 11:53:55, Cong Wang wrote: > > On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Mon 20-01-20 14:48:05, Cong Wang wrote: > > > > It got stuck somewhere along the call path of mem_cgroup_try_charge(), > > > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this > > > > too: > > > > > > So it seems that you are condending on the page lock. It is really > > > unexpected that the reclaim would take that long though. Please try to > > > enable more vmscan tracepoints to see where the time is spent. > > > > Sorry for the delay. I have been trying to collect more data in one shot. > > > > This is a a typical round of the loop after enabling all vmscan tracepoints: > > > > <...>-455450 [007] .... 4048595.842992: > > mm_vmscan_memcg_reclaim_begin: order=0 may_writepage=1 > > gfp_flags=GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE > > classzone_idx=4 > > <...>-455450 [007] .... 4048595.843012: > > mm_vmscan_memcg_reclaim_end: nr_reclaimed=0 > > This doesn't tell us much though. This reclaim round has taken close to > no time. See timestamps. > > > The whole trace output is huge (33M), I can provide it on request. > > Focus on reclaim rounds that take a long time and see where it gets you. I reviewed the tracing output with my eyes, they all took little time. But of course I can't review all of them given the size is huge. For me, it seems that the loop happens in its caller, something like: retry: mm_vmscan_memcg_reclaim_begin(); ... mm_vmscan_memcg_reclaim_end(); goto retry; So, I think we should focus on try_charge()? More interestingly, the margin of that memcg is 0: $ sudo cat /sys/fs/cgroup/memory/system.slice/osqueryd.service/memory.usage_in_bytes 262144000 $ sudo cat /sys/fs/cgroup/memory/system.slice/osqueryd.service/memory.limit_in_bytes 262144000 Thanks!
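Putting rough numbers on the two reads above together with the memory.stat dump from the earlier mail (all figures taken verbatim from what was posted, assuming they were stable while the problem was live): usage_in_bytes equals limit_in_bytes at 262144000 bytes (250 MiB), so the charge margin is 262144000 - 262144000 = 0 and every new charge has to reclaim something before it can succeed. At the same time the reclaimable file LRU is tiny -- inactive_file 32768 + active_file 0 bytes, i.e. about 8 pages -- while roughly 244 MiB of the usage is active_anon, which can only be reclaimed by swapping it out. That is consistent with the mm_vmscan_lru_shrink_inactive events above scanning only 1-21 pages per round and reclaiming nothing.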
On Mon, Jan 27, 2020 at 11:06 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > > > cycles:ppp: > > > > ffffffffa72963db mem_cgroup_iter > > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > > ffffffffa7298c15 try_charge > > > > ffffffffa729a886 mem_cgroup_try_charge > > > > ffffffffa720ec03 __add_to_page_cache_locked > > > > ffffffffa720ee3a add_to_page_cache_lru > > > > ffffffffa7312ddb iomap_readpages_actor > > > > ffffffffa73133f7 iomap_apply > > > > ffffffffa73135da iomap_readpages > > > > ffffffffa722062e read_pages > > > > ffffffffa7220b3f __do_page_cache_readahead > > > > ffffffffa7210554 filemap_fault > > > > ffffffffc039e41f __xfs_filemap_fault > > > > ffffffffa724f5e7 __do_fault > > > > ffffffffa724c5f2 __handle_mm_fault > > > > ffffffffa724cbc6 handle_mm_fault > > > > ffffffffa70a313e __do_page_fault > > > > ffffffffa7a00dfe page_fault > > > > > > > > But I don't see how it could be, the only possible case is when > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't > > > > find any clue in dmesg pointing to OOM. These processes in the > > > > same memcg are either running or sleeping (that is not exiting or > > > > coredump'ing), I don't see how and why they could be selected as > > > > a victim of OOM killer. I don't see any signal pending either from > > > > their /proc/X/status. > > > > > > I think this is a situation where we might end up with a genuine deadlock > > > if we're not trylocking the pages. readahead allocates a batch of > > > locked pages and adds them to the pagecache. If it has allocated, > > > say, 5 pages, successfully inserted the first three into i_pages, then > > > needs to allocate memory to insert the fourth one into i_pages, and > > > the process then attempts to migrate the pages which are still locked, > > > they will never come unlocked because they haven't yet been submitted > > > to the filesystem for reading. > > > > Just to make sure I understand. Do you mean this? > > lock_page(A) > > alloc_pages > > try_to_compact_pages > > compact_zone_order > > compact_zone(MIGRATE_SYNC_LIGHT) > > migrate_pages > > unmap_and_move > > __unmap_and_move > > lock_page(A) > > Yes. There's a little more to it than that, eg slab is involved, but > you have it in a nutshell. But, how compact could get blocked for readahead page if it is not on LRU? The page is charged before adding to LRU, so if kernel just retry charge or reclaim forever, the page should be not on LRU, so it should not block compaction. >
On Mon, Jan 27, 2020 at 05:25:13PM -0800, Yang Shi wrote: > On Mon, Jan 27, 2020 at 11:06 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > > > > > cycles:ppp: > > > > > ffffffffa72963db mem_cgroup_iter > > > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > > > ffffffffa7298c15 try_charge > > > > > ffffffffa729a886 mem_cgroup_try_charge > > > > > ffffffffa720ec03 __add_to_page_cache_locked > > > > > ffffffffa720ee3a add_to_page_cache_lru > > > > > ffffffffa7312ddb iomap_readpages_actor > > > > > ffffffffa73133f7 iomap_apply > > > > > ffffffffa73135da iomap_readpages > > > > > ffffffffa722062e read_pages > > > > > ffffffffa7220b3f __do_page_cache_readahead > > > > > ffffffffa7210554 filemap_fault > > > > > ffffffffc039e41f __xfs_filemap_fault > > > > > ffffffffa724f5e7 __do_fault > > > > > ffffffffa724c5f2 __handle_mm_fault > > > > > ffffffffa724cbc6 handle_mm_fault > > > > > ffffffffa70a313e __do_page_fault > > > > > ffffffffa7a00dfe page_fault > > > > > > > > > > But I don't see how it could be, the only possible case is when > > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't > > > > > find any clue in dmesg pointing to OOM. These processes in the > > > > > same memcg are either running or sleeping (that is not exiting or > > > > > coredump'ing), I don't see how and why they could be selected as > > > > > a victim of OOM killer. I don't see any signal pending either from > > > > > their /proc/X/status. > > > > > > > > I think this is a situation where we might end up with a genuine deadlock > > > > if we're not trylocking the pages. readahead allocates a batch of > > > > locked pages and adds them to the pagecache. If it has allocated, > > > > say, 5 pages, successfully inserted the first three into i_pages, then > > > > needs to allocate memory to insert the fourth one into i_pages, and > > > > the process then attempts to migrate the pages which are still locked, > > > > they will never come unlocked because they haven't yet been submitted > > > > to the filesystem for reading. > > > > > > Just to make sure I understand. Do you mean this? > > > lock_page(A) > > > alloc_pages > > > try_to_compact_pages > > > compact_zone_order > > > compact_zone(MIGRATE_SYNC_LIGHT) > > > migrate_pages > > > unmap_and_move > > > __unmap_and_move > > > lock_page(A) > > > > Yes. There's a little more to it than that, eg slab is involved, but > > you have it in a nutshell. > > But, how compact could get blocked for readahead page if it is not on LRU? > > The page is charged before adding to LRU, so if kernel just retry > charge or reclaim forever, the page should be not on LRU, so it should > not block compaction. The five pages are allocated ABCDE, then they are added one at a time to both the LRU list and i_pages. Once ABCDE have been added to the page cache, they are submitted to the filesystem in a batch. The sceanrio here is that once ABC have been added, we need to allocate memory to add D. The page migration code then attempts to migrate A. Deadlock; it will never come unlocked. This kind of problem can occur in any filesystem in either readpages or readpage. 
Once the page is locked and on the LRU list, if the filesystem attempts to allocate memory (and many do), this kind of deadlock can occur. It's avoided if the filesystem uses a mempool to avoid memory allocation in this path, but they certainly don't all do that. This specific deadlock can be avoided if we skip !PageUptodate pages. But I don't know what other situations there are where we allocate memory while holding a page locked that is on the LRU list.
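To make the scenario concrete, here is a minimal sketch of the ->readpages()-style pattern being described, under stated assumptions: fs_readpages_sketch() and fs_add_page_to_bio() are illustrative names rather than real kernel functions, and error handling and reference counting are trimmed; only add_to_page_cache_lru(), readahead_gfp_mask(), put_page() and submit_bio() are real kernel APIs here. What it shows is that every page in the batch is locked and made visible (page cache plus LRU) before any I/O is submitted, so an allocation made while inserting page N can stall on the still-locked pages 1..N-1.

#include <linux/pagemap.h>
#include <linux/bio.h>
#include <linux/list.h>

/* Hypothetical helper: attach @page to @bio, allocating the bio on first use. */
static struct bio *fs_add_page_to_bio(struct bio *bio, struct page *page);

/* @pages: pages preallocated by readahead, page->index already set. */
static int fs_readpages_sketch(struct address_space *mapping,
			       struct list_head *pages)
{
	struct page *page, *tmp;
	struct bio *bio = NULL;

	list_for_each_entry_safe(page, tmp, pages, lru) {
		list_del(&page->lru);
		/*
		 * Locks the page and inserts it into i_pages and the LRU.
		 * This can allocate memory (radix tree nodes, memcg charge);
		 * if that allocation falls into sync compaction, the pages
		 * inserted on earlier iterations are still locked and not
		 * yet under I/O, so blocking on their page lock never
		 * makes progress.
		 */
		if (add_to_page_cache_lru(page, mapping, page->index,
					  readahead_gfp_mask(mapping))) {
			put_page(page);
			continue;
		}
		bio = fs_add_page_to_bio(bio, page);
	}
	if (bio)
		submit_bio(bio);	/* only now will the pages be read and unlocked */
	return 0;
}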
On Mon 27-01-20 11:06:53, Matthew Wilcox wrote: > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > > > cycles:ppp: > > > > ffffffffa72963db mem_cgroup_iter > > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > > ffffffffa7298c15 try_charge > > > > ffffffffa729a886 mem_cgroup_try_charge > > > > ffffffffa720ec03 __add_to_page_cache_locked > > > > ffffffffa720ee3a add_to_page_cache_lru > > > > ffffffffa7312ddb iomap_readpages_actor > > > > ffffffffa73133f7 iomap_apply > > > > ffffffffa73135da iomap_readpages > > > > ffffffffa722062e read_pages > > > > ffffffffa7220b3f __do_page_cache_readahead > > > > ffffffffa7210554 filemap_fault > > > > ffffffffc039e41f __xfs_filemap_fault > > > > ffffffffa724f5e7 __do_fault > > > > ffffffffa724c5f2 __handle_mm_fault > > > > ffffffffa724cbc6 handle_mm_fault > > > > ffffffffa70a313e __do_page_fault > > > > ffffffffa7a00dfe page_fault > > > > > > > > But I don't see how it could be, the only possible case is when > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't > > > > find any clue in dmesg pointing to OOM. These processes in the > > > > same memcg are either running or sleeping (that is not exiting or > > > > coredump'ing), I don't see how and why they could be selected as > > > > a victim of OOM killer. I don't see any signal pending either from > > > > their /proc/X/status. > > > > > > I think this is a situation where we might end up with a genuine deadlock > > > if we're not trylocking the pages. readahead allocates a batch of > > > locked pages and adds them to the pagecache. If it has allocated, > > > say, 5 pages, successfully inserted the first three into i_pages, then > > > needs to allocate memory to insert the fourth one into i_pages, and > > > the process then attempts to migrate the pages which are still locked, > > > they will never come unlocked because they haven't yet been submitted > > > to the filesystem for reading. > > > > Just to make sure I understand. Do you mean this? > > lock_page(A) > > alloc_pages > > try_to_compact_pages > > compact_zone_order > > compact_zone(MIGRATE_SYNC_LIGHT) > > migrate_pages > > unmap_and_move > > __unmap_and_move > > lock_page(A) > > Yes. There's a little more to it than that, eg slab is involved, but > you have it in a nutshell. I am not deeply familiar with the readahead code. But is there really a high oerder allocation (order > 1) that would trigger compaction in the phase when pages are locked? Btw. the compaction rejects to consider file backed pages when __GFP_FS is not present AFAIR.
On Mon 27-01-20 16:46:16, Cong Wang wrote: > On Mon, Jan 27, 2020 at 6:49 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Sun 26-01-20 11:53:55, Cong Wang wrote: > > > On Tue, Jan 21, 2020 at 1:00 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > On Mon 20-01-20 14:48:05, Cong Wang wrote: > > > > > It got stuck somewhere along the call path of mem_cgroup_try_charge(), > > > > > and the trace events of mm_vmscan_lru_shrink_inactive() indicates this > > > > > too: > > > > > > > > So it seems that you are condending on the page lock. It is really > > > > unexpected that the reclaim would take that long though. Please try to > > > > enable more vmscan tracepoints to see where the time is spent. > > > > > > Sorry for the delay. I have been trying to collect more data in one shot. > > > > > > This is a a typical round of the loop after enabling all vmscan tracepoints: > > > > > > <...>-455450 [007] .... 4048595.842992: > > > mm_vmscan_memcg_reclaim_begin: order=0 may_writepage=1 > > > gfp_flags=GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE > > > classzone_idx=4 > > > <...>-455450 [007] .... 4048595.843012: > > > mm_vmscan_memcg_reclaim_end: nr_reclaimed=0 > > > > This doesn't tell us much though. This reclaim round has taken close to > > no time. See timestamps. > > > > > The whole trace output is huge (33M), I can provide it on request. > > > > Focus on reclaim rounds that take a long time and see where it gets you. > > I reviewed the tracing output with my eyes, they all took little time. > But of course I can't review all of them given the size is huge. Sure, I would simply use a script to check for exessive reclaims rounds. Also it is important to find out whether the page of your interest is locked outside of the reclaim or inside.
On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote: > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote: > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > > > > > cycles:ppp: > > > > > ffffffffa72963db mem_cgroup_iter > > > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > > > ffffffffa7298c15 try_charge > > > > > ffffffffa729a886 mem_cgroup_try_charge > > > > > ffffffffa720ec03 __add_to_page_cache_locked > > > > > ffffffffa720ee3a add_to_page_cache_lru > > > > > ffffffffa7312ddb iomap_readpages_actor > > > > > ffffffffa73133f7 iomap_apply > > > > > ffffffffa73135da iomap_readpages > > > > > ffffffffa722062e read_pages > > > > > ffffffffa7220b3f __do_page_cache_readahead > > > > > ffffffffa7210554 filemap_fault > > > > > ffffffffc039e41f __xfs_filemap_fault > > > > > ffffffffa724f5e7 __do_fault > > > > > ffffffffa724c5f2 __handle_mm_fault > > > > > ffffffffa724cbc6 handle_mm_fault > > > > > ffffffffa70a313e __do_page_fault > > > > > ffffffffa7a00dfe page_fault > > > > > > > > > > But I don't see how it could be, the only possible case is when > > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't > > > > > find any clue in dmesg pointing to OOM. These processes in the > > > > > same memcg are either running or sleeping (that is not exiting or > > > > > coredump'ing), I don't see how and why they could be selected as > > > > > a victim of OOM killer. I don't see any signal pending either from > > > > > their /proc/X/status. > > > > > > > > I think this is a situation where we might end up with a genuine deadlock > > > > if we're not trylocking the pages. readahead allocates a batch of > > > > locked pages and adds them to the pagecache. If it has allocated, > > > > say, 5 pages, successfully inserted the first three into i_pages, then > > > > needs to allocate memory to insert the fourth one into i_pages, and > > > > the process then attempts to migrate the pages which are still locked, > > > > they will never come unlocked because they haven't yet been submitted > > > > to the filesystem for reading. > > > > > > Just to make sure I understand. Do you mean this? > > > lock_page(A) > > > alloc_pages > > > try_to_compact_pages > > > compact_zone_order > > > compact_zone(MIGRATE_SYNC_LIGHT) > > > migrate_pages > > > unmap_and_move > > > __unmap_and_move > > > lock_page(A) > > > > Yes. There's a little more to it than that, eg slab is involved, but > > you have it in a nutshell. > > I am not deeply familiar with the readahead code. But is there really a > high oerder allocation (order > 1) that would trigger compaction in the > phase when pages are locked? Thanks to sl*b, yes: radix_tree_node 80890 102536 584 28 4 : tunables 0 0 0 : slabdata 3662 3662 0 so it's allocating 4 pages for an allocation of a 576 byte node. > Btw. the compaction rejects to consider file backed pages when __GFP_FS > is not present AFAIR. Ah, that would save us.
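Reading the radix_tree_node slabinfo line above the usual way (object size 584, 28 objects per slab, 4 pages per slab), the arithmetic behind the "4 pages" is: 28 * 584 = 16352 bytes of objects per slab, which needs 4 * 4096 = 16384 bytes, i.e. an order-2 page allocation (assuming 4 KiB pages). Order 2 is above the "order > 1" threshold Michal asked about, so a 576-byte radix tree node allocation really can end up in compaction, while still being below PAGE_ALLOC_COSTLY_ORDER, so the costly-order bailouts do not apply.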
On Tue 28-01-20 00:30:44, Matthew Wilcox wrote: > On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote: > > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote: > > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > > > > > > > cycles:ppp: > > > > > > ffffffffa72963db mem_cgroup_iter > > > > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > > > > ffffffffa7298c15 try_charge > > > > > > ffffffffa729a886 mem_cgroup_try_charge > > > > > > ffffffffa720ec03 __add_to_page_cache_locked > > > > > > ffffffffa720ee3a add_to_page_cache_lru > > > > > > ffffffffa7312ddb iomap_readpages_actor > > > > > > ffffffffa73133f7 iomap_apply > > > > > > ffffffffa73135da iomap_readpages > > > > > > ffffffffa722062e read_pages > > > > > > ffffffffa7220b3f __do_page_cache_readahead > > > > > > ffffffffa7210554 filemap_fault > > > > > > ffffffffc039e41f __xfs_filemap_fault > > > > > > ffffffffa724f5e7 __do_fault > > > > > > ffffffffa724c5f2 __handle_mm_fault > > > > > > ffffffffa724cbc6 handle_mm_fault > > > > > > ffffffffa70a313e __do_page_fault > > > > > > ffffffffa7a00dfe page_fault > > > > > > > > > > > > But I don't see how it could be, the only possible case is when > > > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't > > > > > > find any clue in dmesg pointing to OOM. These processes in the > > > > > > same memcg are either running or sleeping (that is not exiting or > > > > > > coredump'ing), I don't see how and why they could be selected as > > > > > > a victim of OOM killer. I don't see any signal pending either from > > > > > > their /proc/X/status. > > > > > > > > > > I think this is a situation where we might end up with a genuine deadlock > > > > > if we're not trylocking the pages. readahead allocates a batch of > > > > > locked pages and adds them to the pagecache. If it has allocated, > > > > > say, 5 pages, successfully inserted the first three into i_pages, then > > > > > needs to allocate memory to insert the fourth one into i_pages, and > > > > > the process then attempts to migrate the pages which are still locked, > > > > > they will never come unlocked because they haven't yet been submitted > > > > > to the filesystem for reading. > > > > > > > > Just to make sure I understand. Do you mean this? > > > > lock_page(A) > > > > alloc_pages > > > > try_to_compact_pages > > > > compact_zone_order > > > > compact_zone(MIGRATE_SYNC_LIGHT) > > > > migrate_pages > > > > unmap_and_move > > > > __unmap_and_move > > > > lock_page(A) > > > > > > Yes. There's a little more to it than that, eg slab is involved, but > > > you have it in a nutshell. > > > > I am not deeply familiar with the readahead code. But is there really a > > high oerder allocation (order > 1) that would trigger compaction in the > > phase when pages are locked? > > Thanks to sl*b, yes: > > radix_tree_node 80890 102536 584 28 4 : tunables 0 0 0 : slabdata 3662 3662 0 > > so it's allocating 4 pages for an allocation of a 576 byte node. I am not really sure that we do sync migration for costly orders. > > Btw. the compaction rejects to consider file backed pages when __GFP_FS > > is not present AFAIR. > > Ah, that would save us. So the NOFS comes from the mapping GFP mask, right? 
That is something I was hoping to get rid of eventually :/ Anyway it would be better to have an explicit NOFS with a comment explaining why we need that. If for nothing else then for documentation.
On Tue, Jan 28, 2020 at 10:13:52AM +0100, Michal Hocko wrote: > On Tue 28-01-20 00:30:44, Matthew Wilcox wrote: > > On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote: > > > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote: > > > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > > > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > > > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > > > > > > > > > cycles:ppp: > > > > > > > ffffffffa72963db mem_cgroup_iter > > > > > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > > > > > ffffffffa7298c15 try_charge > > > > > > > ffffffffa729a886 mem_cgroup_try_charge > > > > > > > ffffffffa720ec03 __add_to_page_cache_locked > > > > > > > ffffffffa720ee3a add_to_page_cache_lru > > > > > > > ffffffffa7312ddb iomap_readpages_actor > > > > > > > ffffffffa73133f7 iomap_apply > > > > > > > ffffffffa73135da iomap_readpages > > > > > > > ffffffffa722062e read_pages > > > > > > > ffffffffa7220b3f __do_page_cache_readahead > > > > > > > ffffffffa7210554 filemap_fault > > > > > > > ffffffffc039e41f __xfs_filemap_fault > > > > > > > ffffffffa724f5e7 __do_fault > > > > > > > ffffffffa724c5f2 __handle_mm_fault > > > > > > > ffffffffa724cbc6 handle_mm_fault > > > > > > > ffffffffa70a313e __do_page_fault > > > > > > > ffffffffa7a00dfe page_fault > > > > > > I am not deeply familiar with the readahead code. But is there really a > > > high oerder allocation (order > 1) that would trigger compaction in the > > > phase when pages are locked? > > > > Thanks to sl*b, yes: > > > > radix_tree_node 80890 102536 584 28 4 : tunables 0 0 0 : slabdata 3662 3662 0 > > > > so it's allocating 4 pages for an allocation of a 576 byte node. > > I am not really sure that we do sync migration for costly orders. Doesn't the stack trace above indicate that we're doing migration as the result of an allocation in add_to_page_cache_lru()? > > > Btw. the compaction rejects to consider file backed pages when __GFP_FS > > > is not present AFAIR. > > > > Ah, that would save us. > > So the NOFS comes from the mapping GFP mask, right? That is something I > was hoping to get rid of eventually :/ Anyway it would be better to have > an explicit NOFS with a comment explaining why we need that. If for > nothing else then for documentation. I'd also like to see the mapping GFP mask go away, but rather than seeing an explicit GFP_NOFS here, I'd rather see the memalloc_nofs API used. I just don't understand the whole problem space well enough to know where to put the call for best effect.
On Tue 28-01-20 02:48:57, Matthew Wilcox wrote: > On Tue, Jan 28, 2020 at 10:13:52AM +0100, Michal Hocko wrote: > > On Tue 28-01-20 00:30:44, Matthew Wilcox wrote: > > > On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote: > > > > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote: > > > > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > > > > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > > > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > > > > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > > > > > > > > > > > cycles:ppp: > > > > > > > > ffffffffa72963db mem_cgroup_iter > > > > > > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > > > > > > ffffffffa7298c15 try_charge > > > > > > > > ffffffffa729a886 mem_cgroup_try_charge > > > > > > > > ffffffffa720ec03 __add_to_page_cache_locked > > > > > > > > ffffffffa720ee3a add_to_page_cache_lru > > > > > > > > ffffffffa7312ddb iomap_readpages_actor > > > > > > > > ffffffffa73133f7 iomap_apply > > > > > > > > ffffffffa73135da iomap_readpages > > > > > > > > ffffffffa722062e read_pages > > > > > > > > ffffffffa7220b3f __do_page_cache_readahead > > > > > > > > ffffffffa7210554 filemap_fault > > > > > > > > ffffffffc039e41f __xfs_filemap_fault > > > > > > > > ffffffffa724f5e7 __do_fault > > > > > > > > ffffffffa724c5f2 __handle_mm_fault > > > > > > > > ffffffffa724cbc6 handle_mm_fault > > > > > > > > ffffffffa70a313e __do_page_fault > > > > > > > > ffffffffa7a00dfe page_fault > > > > > > > > I am not deeply familiar with the readahead code. But is there really a > > > > high oerder allocation (order > 1) that would trigger compaction in the > > > > phase when pages are locked? > > > > > > Thanks to sl*b, yes: > > > > > > radix_tree_node 80890 102536 584 28 4 : tunables 0 0 0 : slabdata 3662 3662 0 > > > > > > so it's allocating 4 pages for an allocation of a 576 byte node. > > > > I am not really sure that we do sync migration for costly orders. > > Doesn't the stack trace above indicate that we're doing migration as > the result of an allocation in add_to_page_cache_lru()? Which stack trace do you refer to? Because the one above doesn't show much more beyond mem_cgroup_iter and likewise others in this email thread. I do not really remember any stack with lock_page on the trace. > > > > > Btw. the compaction rejects to consider file backed pages when __GFP_FS > > > > is not present AFAIR. > > > > > > Ah, that would save us. > > > > So the NOFS comes from the mapping GFP mask, right? That is something I > > was hoping to get rid of eventually :/ Anyway it would be better to have > > an explicit NOFS with a comment explaining why we need that. If for > > nothing else then for documentation. > > I'd also like to see the mapping GFP mask go away, but rather than seeing > an explicit GFP_NOFS here, I'd rather see the memalloc_nofs API used. Completely agreed agree here. The proper place for the scope would be the place where pages are locked with an explanation that there are other allocations down the line which might invoke sync migration and that would be dangerous. Having that explicitly documented is clearly an improvement.
On Tue, Jan 28, 2020 at 3:39 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 28-01-20 02:48:57, Matthew Wilcox wrote: > > Doesn't the stack trace above indicate that we're doing migration as > > the result of an allocation in add_to_page_cache_lru()? > > Which stack trace do you refer to? Because the one above doesn't show > much more beyond mem_cgroup_iter and likewise others in this email > thread. I do not really remember any stack with lock_page on the trace. I think the page is locked in add_to_page_cache_lru() by __SetPageLocked(), as the stack trace shows __add_to_page_cache_locked(). It is not yet unlocked, as it is still looping inside try_charge(). I will write a script to see if I can find the longest time spent in reclaim as you suggested. Thanks.
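For reference, a condensed paraphrase of what add_to_page_cache_lru() and __add_to_page_cache_locked() do around v4.19 -- not the literal mm/filemap.c source; the i_pages insertion and most error handling are collapsed into comments -- shows the ordering Cong is pointing at: the page lock is taken before the memcg charge, so a charge that loops in try_charge() leaves the page locked and !PageUptodate the whole time.

#include <linux/pagemap.h>
#include <linux/memcontrol.h>
#include <linux/swap.h>
#include <linux/sched.h>

static int add_to_page_cache_lru_sketch(struct page *page,
					struct address_space *mapping,
					pgoff_t offset, gfp_t gfp_mask)
{
	struct mem_cgroup *memcg;
	int err;

	__SetPageLocked(page);			/* locked here ...               */

	/* ... then charged: this is where try_charge() can keep retrying
	 * while memcg reclaim keeps coming back with nr_reclaimed=0.     */
	err = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg, false);
	if (err) {
		__ClearPageLocked(page);
		return err;
	}

	/* the real code inserts the page into mapping->i_pages at @offset here */

	mem_cgroup_commit_charge(page, memcg, false, false);
	lru_cache_add(page);			/* and only now goes on the LRU */
	return 0;
}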
On Tue, Jan 28, 2020 at 11:44 AM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Tue, Jan 28, 2020 at 3:39 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Tue 28-01-20 02:48:57, Matthew Wilcox wrote: > > > Doesn't the stack trace above indicate that we're doing migration as > > > the result of an allocation in add_to_page_cache_lru()? > > > > Which stack trace do you refer to? Because the one above doesn't show > > much more beyond mem_cgroup_iter and likewise others in this email > > thread. I do not really remember any stack with lock_page on the trace. > > I think the page is locked in add_to_page_cache_lru() by > __SetPageLocked(), as the stack trace shows __add_to_page_cache_locked(). > It is not yet unlocked, as it is still looping inside try_charge(). > > I will write a script to see if I can find the longest time spent in reclaim > as you suggested. After digging the changelog, I believe the following commit could fix the problem: commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 Author: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Date: Mon Sep 23 15:37:08 2019 -0700 memcg, oom: don't require __GFP_FS when invoking memcg OOM killer which is not yet in our 4.19 branch yet. We will sync with 4.19 stable soon. Thanks.
On Tue 28-01-20 12:39:55, Michal Hocko wrote: > On Tue 28-01-20 02:48:57, Matthew Wilcox wrote: > > On Tue, Jan 28, 2020 at 10:13:52AM +0100, Michal Hocko wrote: > > > On Tue 28-01-20 00:30:44, Matthew Wilcox wrote: > > > > On Tue, Jan 28, 2020 at 09:17:12AM +0100, Michal Hocko wrote: > > > > > On Mon 27-01-20 11:06:53, Matthew Wilcox wrote: > > > > > > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote: > > > > > > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote: > > > > > > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote: > > > > > > > > > I suspect the process gets stuck in the retry loop in try_charge(), as > > > > > > > > > the _shortest_ stacktrace of the perf samples indicated: > > > > > > > > > > > > > > > > > > cycles:ppp: > > > > > > > > > ffffffffa72963db mem_cgroup_iter > > > > > > > > > ffffffffa72980ca mem_cgroup_oom_unlock > > > > > > > > > ffffffffa7298c15 try_charge > > > > > > > > > ffffffffa729a886 mem_cgroup_try_charge > > > > > > > > > ffffffffa720ec03 __add_to_page_cache_locked > > > > > > > > > ffffffffa720ee3a add_to_page_cache_lru > > > > > > > > > ffffffffa7312ddb iomap_readpages_actor > > > > > > > > > ffffffffa73133f7 iomap_apply > > > > > > > > > ffffffffa73135da iomap_readpages > > > > > > > > > ffffffffa722062e read_pages > > > > > > > > > ffffffffa7220b3f __do_page_cache_readahead > > > > > > > > > ffffffffa7210554 filemap_fault > > > > > > > > > ffffffffc039e41f __xfs_filemap_fault > > > > > > > > > ffffffffa724f5e7 __do_fault > > > > > > > > > ffffffffa724c5f2 __handle_mm_fault > > > > > > > > > ffffffffa724cbc6 handle_mm_fault > > > > > > > > > ffffffffa70a313e __do_page_fault > > > > > > > > > ffffffffa7a00dfe page_fault > > > > > > > > > > I am not deeply familiar with the readahead code. But is there really a > > > > > high oerder allocation (order > 1) that would trigger compaction in the > > > > > phase when pages are locked? > > > > > > > > Thanks to sl*b, yes: > > > > > > > > radix_tree_node 80890 102536 584 28 4 : tunables 0 0 0 : slabdata 3662 3662 0 > > > > > > > > so it's allocating 4 pages for an allocation of a 576 byte node. > > > > > > I am not really sure that we do sync migration for costly orders. > > > > Doesn't the stack trace above indicate that we're doing migration as > > the result of an allocation in add_to_page_cache_lru()? > > Which stack trace do you refer to? Because the one above doesn't show > much more beyond mem_cgroup_iter and likewise others in this email > thread. I do not really remember any stack with lock_page on the trace. > > > > > > > Btw. the compaction rejects to consider file backed pages when __GFP_FS > > > > > is not present AFAIR. > > > > > > > > Ah, that would save us. > > > > > > So the NOFS comes from the mapping GFP mask, right? That is something I > > > was hoping to get rid of eventually :/ Anyway it would be better to have > > > an explicit NOFS with a comment explaining why we need that. If for > > > nothing else then for documentation. > > > > I'd also like to see the mapping GFP mask go away, but rather than seeing > > an explicit GFP_NOFS here, I'd rather see the memalloc_nofs API used. > > Completely agreed agree here. The proper place for the scope would be > the place where pages are locked with an explanation that there are > other allocations down the line which might invoke sync migration and > that would be dangerous. Having that explicitly documented is clearly an > improvement. Can we pursue on this please? 
An explicit NOFS scope annotation with a reference to compaction potentially locking up on pages in the readahead would be a great start.
On Thu, Feb 13, 2020 at 08:48:47AM +0100, Michal Hocko wrote: > Can we pursue on this please? An explicit NOFS scope annotation with a > reference to compaction potentially locking up on pages in the readahead > would be a great start. How about this (on top of the current readahead series): diff --git a/mm/readahead.c b/mm/readahead.c index 29ca25c8f01e..32fd32b913da 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -160,6 +160,16 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping, .nr_pages = 0, }; + /* + * Partway through the readahead operation, we will have added + * locked pages to the page cache, but will not yet have submitted + * them for I/O. Adding another page may need to allocate + * memory, which can trigger memory migration. Telling the VM + * we're in the middle of a filesystem operation will cause it + * to not touch file-backed pages, preventing a deadlock. + */ + unsigned int nofs = memalloc_nofs_save(); + /* * Preallocate as many pages as we will need. */ @@ -217,6 +227,7 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping, */ read_pages(&rac, &page_pool); BUG_ON(!list_empty(&page_pool)); + memalloc_nofs_restore(nofs); return rac.nr_pages; } EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
On Thu 13-02-20 08:46:07, Matthew Wilcox wrote: > On Thu, Feb 13, 2020 at 08:48:47AM +0100, Michal Hocko wrote: > > Can we pursue on this please? An explicit NOFS scope annotation with a > > reference to compaction potentially locking up on pages in the readahead > > would be a great start. > > How about this (on top of the current readahead series): > > diff --git a/mm/readahead.c b/mm/readahead.c > index 29ca25c8f01e..32fd32b913da 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -160,6 +160,16 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping, > .nr_pages = 0, > }; > > + /* > + * Partway through the readahead operation, we will have added > + * locked pages to the page cache, but will not yet have submitted > + * them for I/O. Adding another page may need to allocate > + * memory, which can trigger memory migration. Telling the VM I would go with s@migration@compaction@ because it would make it more obvious that this is about high order allocations. The patch looks ok other than that but FS people would be better than me to give it a proper review. It would be great if you could mention that this shouldn't be a real problem for many(/most?) filesystems because they tend to have NOFS like mask in the the mapping but we would love to remove that in future hopefully so this change is really desiable. Thanks! > + * we're in the middle of a filesystem operation will cause it > + * to not touch file-backed pages, preventing a deadlock. > + */ > + unsigned int nofs = memalloc_nofs_save(); > + > /* > * Preallocate as many pages as we will need. > */ > @@ -217,6 +227,7 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping, > */ > read_pages(&rac, &page_pool); > BUG_ON(!list_empty(&page_pool)); > + memalloc_nofs_restore(nofs); > return rac.nr_pages; > } > EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
On Thu, Feb 13, 2020 at 06:08:24PM +0100, Michal Hocko wrote: > On Thu 13-02-20 08:46:07, Matthew Wilcox wrote: > > On Thu, Feb 13, 2020 at 08:48:47AM +0100, Michal Hocko wrote: > > > Can we pursue on this please? An explicit NOFS scope annotation with a > > > reference to compaction potentially locking up on pages in the readahead > > > would be a great start. > > > > How about this (on top of the current readahead series): > > > > diff --git a/mm/readahead.c b/mm/readahead.c > > index 29ca25c8f01e..32fd32b913da 100644 > > --- a/mm/readahead.c > > +++ b/mm/readahead.c > > @@ -160,6 +160,16 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping, > > .nr_pages = 0, > > }; > > > > + /* > > + * Partway through the readahead operation, we will have added > > + * locked pages to the page cache, but will not yet have submitted > > + * them for I/O. Adding another page may need to allocate > > + * memory, which can trigger memory migration. Telling the VM > > I would go with s@migration@compaction@ because it would make it more > obvious that this is about high order allocations. Perhaps even just 'reclaim' -- it's about compaction today, but tomorrow's VM might try to reclaim these pages too. They are on the LRU, after all. So I currently have: /* * Partway through the readahead operation, we will have added * locked pages to the page cache, but will not yet have submitted * them for I/O. Adding another page may need to allocate memory, * which can trigger memory reclaim. Telling the VM we're in * the middle of a filesystem operation will cause it to not * touch file-backed pages, preventing a deadlock. Most (all?) * filesystems already specify __GFP_NOFS in their mapping's * gfp_mask, but let's be explicit here. */ Thanks!
On Thu 13-02-20 20:27:24, Matthew Wilcox wrote: > On Thu, Feb 13, 2020 at 06:08:24PM +0100, Michal Hocko wrote: > > On Thu 13-02-20 08:46:07, Matthew Wilcox wrote: > > > On Thu, Feb 13, 2020 at 08:48:47AM +0100, Michal Hocko wrote: > > > > Can we pursue on this please? An explicit NOFS scope annotation with a > > > > reference to compaction potentially locking up on pages in the readahead > > > > would be a great start. > > > > > > How about this (on top of the current readahead series): > > > > > > diff --git a/mm/readahead.c b/mm/readahead.c > > > index 29ca25c8f01e..32fd32b913da 100644 > > > --- a/mm/readahead.c > > > +++ b/mm/readahead.c > > > @@ -160,6 +160,16 @@ unsigned long page_cache_readahead_limit(struct address_space *mapping, > > > .nr_pages = 0, > > > }; > > > > > > + /* > > > + * Partway through the readahead operation, we will have added > > > + * locked pages to the page cache, but will not yet have submitted > > > + * them for I/O. Adding another page may need to allocate > > > + * memory, which can trigger memory migration. Telling the VM > > > > I would go with s@migration@compaction@ because it would make it more > > obvious that this is about high order allocations. > > Perhaps even just 'reclaim' -- it's about compaction today, but tomorrow's > VM might try to reclaim these pages too. They are on the LRU, after all. > > So I currently have: > > /* > * Partway through the readahead operation, we will have added > * locked pages to the page cache, but will not yet have submitted > * them for I/O. Adding another page may need to allocate memory, > * which can trigger memory reclaim. Telling the VM we're in > * the middle of a filesystem operation will cause it to not > * touch file-backed pages, preventing a deadlock. Most (all?) > * filesystems already specify __GFP_NOFS in their mapping's > * gfp_mask, but let's be explicit here. > */ OK, Thanks!