
[v11,0/6] Use pageblock_order for cma and alloc_contig_range alignment.

Message ID 20220425143118.2850746-1-zi.yan@sent.com (mailing list archive)

Message

Zi Yan April 25, 2022, 2:31 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

Hi David,

This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
and alloc_contig_range(). It prepares for my upcoming changes to make
MAX_ORDER adjustable at boot time[1]. It is on top of mmotm-2022-04-20-17-12.

Changelog
===
V11
---
1. Moved start_isolate_page_range()/undo_isolate_page_range() alignment
   change to a separate patch after the unmovable page check change and
   alloc_contig_range() change to avoid some unwanted memory
   hotplug/hotremove failures.
2. Cleaned up has_unmovable_pages() in Patch 2.

V10
---
1. Reverted to the original outer_start, outer_end range for
   test_pages_isolated() and isolate_freepages_range() in Patch 3;
   otherwise, isolation fails if start in alloc_contig_range() is in
   the middle of a free page.

V9
---
1. Limited the has_unmovable_pages() check to within a pageblock.
2. Added a check to ensure page isolation is done within a single zone
   in isolate_single_pageblock().
3. Fixed an off-by-one bug in isolate_single_pageblock().
4. Fixed a NULL-dereferencing bug in isolate_single_pageblock(), triggered
   when the pages before the to-be-isolated pageblock are not online.

V8
---
1. Cleaned up has_unmovable_pages() to remove page argument.

V7
---
1. Added a page validity check in isolate_single_pageblock() to avoid
   out-of-zone pages.
2. Fixed a bug in split_free_page() to split and free pages in the correct
   page order.

V6
---
1. Resolved compilation errors/warnings reported by the kernel test robot.
2. Tried to solve the coding concerns from Christophe Leroy.
3. Shortened lengthy lines (pointed out by Christoph Hellwig).

V5
---
1. Moved isolation address alignment handling into start_isolate_page_range().
2. Rewrote and simplified how alloc_contig_range() works at pageblock
   granularity (Patch 3). Only two pageblock migratetypes need to be saved and
   restored. start_isolate_page_range() might need to migrate pages in this
   version, but it saves the caller from worrying about
   max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) alignment after the page range
   is isolated.

V4
---
1. Dropped two irrelevant patches on non-lru compound page handling, as
   it is not supported upstream.
2. Renamed migratetype_has_fallback() to migratetype_is_mergeable().
3. Always check whether two pageblocks can be merged in
   __free_one_page() when the order is >= pageblock_order, as the
   non-mergeable cases (isolated, CMA, and HIGHATOMIC pageblocks) become
   more common.
4. Moving has_unmovable_pages() is now a separate patch.
5. Removed the MAX_ORDER-1 alignment requirement from the comment in the
   virtio_mem code.

Description
===

The MAX_ORDER - 1 alignment requirement comes from the fact that
alloc_contig_range() isolates pageblocks to remove free memory from the buddy
allocator, but isolating only a subset of the pageblocks within a free page
spanning multiple pageblocks causes free page accounting issues: an isolated
page might not be put on the right free list, since the code assumes the
migratetype of the first pageblock applies to the whole free page. This is
based on the discussion at [2].

To remove the requirement, this patchset:
1. isolates pages at pageblock granularity instead of
   max(MAX_ORDER_NR_PAGES, pageblock_nr_pages);
2. splits free pages that cross the specified range, or migrates in-use pages
   that cross it and then splits the freed page, to avoid free page accounting
   issues (these happen when multiple pageblocks within a single page have
   different migratetypes);
3. only checks for unmovable pages within the range, instead of the
   MAX_ORDER - 1 aligned range, during isolation, to avoid alloc_contig_range()
   failures when pageblocks within a MAX_ORDER - 1 aligned range are allocated
   separately;
4. returns pages not in the range, as it did before.

One optimization might come later:
1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
   migratetypes when isolation fails in the middle of the range.

Feel free to give comments and suggestions. Thanks.

[1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
[2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/

Zi Yan (6):
  mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
  mm: page_isolation: check specified range for unmovable pages
  mm: make alloc_contig_range work at pageblock granularity
  mm: page_isolation: enable arbitrary range page isolation.
  mm: cma: use pageblock_order as the single alignment
  drivers: virtio_mem: use pageblock size as the minimum virtio_mem
    size.

 drivers/virtio/virtio_mem.c    |   6 +-
 include/linux/cma.h            |   4 +-
 include/linux/mmzone.h         |   5 +-
 include/linux/page-isolation.h |   6 +-
 mm/internal.h                  |   6 +
 mm/memory_hotplug.c            |   3 +-
 mm/page_alloc.c                | 191 +++++-------------
 mm/page_isolation.c            | 345 +++++++++++++++++++++++++++++++--
 8 files changed, 392 insertions(+), 174 deletions(-)

Comments

Qian Cai April 26, 2022, 8:18 p.m. UTC | #1
On Mon, Apr 25, 2022 at 10:31:12AM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
> and alloc_contig_range(). It prepares for my upcoming changes to make
> MAX_ORDER adjustable at boot time[1]. It is on top of mmotm-2022-04-20-17-12.
> 
> [...]

Reverting this series fixed a deadlock during memory offline/online
tests, and a subsequent crash.

 INFO: task kmemleak:1027 blocked for more than 120 seconds.
       Not tainted 5.18.0-rc4-next-20220426-dirty #27
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kmemleak        state:D stack:27744 pid: 1027 ppid:     2 flags:0x00000008
 Call trace:
  __switch_to
  __schedule
  schedule
  percpu_rwsem_wait
  __percpu_down_read
  percpu_down_read.constprop.0
  get_online_mems
  kmemleak_scan
  kmemleak_scan_thread
  kthread
  ret_from_fork

 Showing all locks held in the system:
 1 lock held by rcu_tasks_kthre/11:
  #0: ffffc1e2cefc17f0 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
 1 lock held by rcu_tasks_rude_/12:
  #0: ffffc1e2cefc1a90 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
 1 lock held by rcu_tasks_trace/13:
  #0: ffffc1e2cefc1db0 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
 1 lock held by khungtaskd/824:
  #0: ffffc1e2cefc2820 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks
 2 locks held by kmemleak/1027:
  #0: ffffc1e2cf1aa628 (scan_mutex){+.+.}-{3:3}, at: kmemleak_scan_thread
  #1: ffffc1e2cf14e690 (mem_hotplug_lock){++++}-{0:0}, at: get_online_mems
 2 locks held by cppc_fie/1805:
 1 lock held by in:imklog/2822:
 8 locks held by tee/3334:
  #0: ffff0816d65c9438 (sb_writers#6){.+.+}-{0:0}, at: vfs_write
  #1: ffff40025438be88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter
  #2: ffff4000c8261eb0 (kn->active#298){.+.+}-{0:0}, at: kernfs_fop_write_iter
  #3: ffffc1e2d0013f68 (device_hotplug_lock){+.+.}-{3:3}, at: online_store
  #4: ffff0800cd8bb998 (&dev->mutex){....}-{3:3}, at: device_offline
  #5: ffffc1e2ceed3750 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock
  #6: ffffc1e2cf14e690 (mem_hotplug_lock){++++}-{0:0}, at: offline_pages
  #7: ffffc1e2cf13bf68 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable
 __zone_set_pageset_high_and_batch at mm/page_alloc.c:7005
 (inlined by) zone_pcp_disable at mm/page_alloc.c:9286

Later, running some kernel compilation workloads could trigger a crash.

 Unable to handle kernel paging request at virtual address fffffbfffe000030
 KASAN: maybe wild-memory-access in range [0x0003dffff0000180-0x0003dffff0000187]
 Mem abort info:
   ESR = 0x96000006
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x06: level 2 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000006
   CM = 0, WnR = 0
 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000817545fd000
 [fffffbfffe000030] pgd=00000817581e9003, p4d=00000817581e9003, pud=00000817581ea003, pmd=0000000000000000
 Internal error: Oops: 96000006 [#1] PREEMPT SMP
 Modules linked in: bridge stp llc cdc_ether usbnet ipmi_devintf ipmi_msghandler cppc_cpufreq fuse ip_tables x_tables ipv6 btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress dm_mod nouveau drm_ttm_helper ttm crct10dif_ce mlx5_core drm_display_helper drm_kms_helper nvme mpt3sas xhci_pci nvme_core drm raid_class xhci_pci_renesas
 CPU: 147 PID: 3334 Comm: tee Not tainted 5.18.0-rc4-next-20220426-dirty #27
 pstate: 10400009 (nzcV daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : isolate_single_pageblock
 lr : isolate_single_pageblock
 sp : ffff80003e767500
 x29: ffff80003e767500 x28: 0000000000000000 x27: ffff783c59963b1f
 x26: dfff800000000000 x25: ffffc1e2ccb1d000 x24: ffffc1e2ccb1d8f8
 x23: 00000000803bfe00 x22: ffffc1e2cee39098 x21: 0000000000000020
 x20: 00000000803c0000 x19: fffffbfffe000000 x18: ffffc1e2cee37d1c
 x17: 0000000000000000 x16: 1fffe8004a86f14c x15: 1fffe806c89e154a
 x14: 1fffe8004a86f11c x13: 0000000000000004 x12: ffff783c5c455e6d
 x11: 1ffff83c5c455e6c x10: ffff783c5c455e6c x9 : dfff800000000000
 x8 : ffffc1e2e22af363 x7 : 0000000000000001 x6 : 0000000000000003
 x5 : ffffc1e2e22af360 x4 : ffff783c5c455e6c x3 : ffff700007cece90
 x2 : 0000000000000003 x1 : 0000000000000000 x0 : fffffbfffe000030
 Call trace:
 Call trace:
  isolate_single_pageblock
  PageBuddy at ./include/linux/page-flags.h:969 (discriminator 3)
  (inlined by) isolate_single_pageblock at mm/page_isolation.c:414 (discriminator 3)
  start_isolate_page_range
  offline_pages
  memory_subsys_offline
  device_offline
  online_store
  dev_attr_store
  sysfs_kf_write
  kernfs_fop_write_iter
  new_sync_write
  vfs_write
  ksys_write
  __arm64_sys_write
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync
 Code: 38fa6821 7100003f 7a411041 54000dca (b9403260)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: 0x41e2c0720000 from 0xffff800008000000
 PHYS_OFFSET: 0x80000000
 CPU features: 0x000,0021700d,19801c82
 Memory Limit: none
Zi Yan April 26, 2022, 8:26 p.m. UTC | #2
On 26 Apr 2022, at 16:18, Qian Cai wrote:

> On Mon, Apr 25, 2022 at 10:31:12AM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
>> and alloc_contig_range().
>>
>> [...]
>
> Reverting this series fixed a deadlock during memory offline/online
> tests and then a crash.

Hi Qian,

Thanks for reporting the issue. Do you have a reproducer I can use to debug the code?

>
> [...]

--
Best Regards,
Yan, Zi
Qian Cai April 26, 2022, 9:08 p.m. UTC | #3
On Tue, Apr 26, 2022 at 04:26:08PM -0400, Zi Yan wrote:
> Thanks for reporting the issue. Do you have a reproducer I can use to debug the code?

Nothing fancy. It just tries to remove and add back each memory section.

#!/usr/bin/env python3
# SPDX-License-Identifier: GPL-2.0

import os
import re
import subprocess


def mem_iter():
    base_dir = '/sys/devices/system/memory/'
    for curr_dir in os.listdir(base_dir):
        if re.match(r'memory\d+', curr_dir):
            yield base_dir + curr_dir


if __name__ == '__main__':
    print('- Try to remove each memory section and then add it back.')
    for mem_dir in mem_iter():
        status = f'{mem_dir}/online'
        if open(status).read().rstrip() == '1':
            # This could expectedly fail due to many reasons.
            section = os.path.basename(mem_dir)
            print(f'- Try to remove {section}.')
            proc = subprocess.run([f'echo 0 | sudo tee {status}'], shell=True)
            if proc.returncode == 0:
                print(f'- Try to add {section}.')
                subprocess.check_call([f'echo 1 | sudo tee {status}'], shell=True)
Zi Yan April 26, 2022, 9:38 p.m. UTC | #4
On 26 Apr 2022, at 17:08, Qian Cai wrote:

> On Tue, Apr 26, 2022 at 04:26:08PM -0400, Zi Yan wrote:
>> Thanks for reporting the issue. Do you have a reproducer I can use to debug the code?
>
> Nothing fancy. It just tries to remove and add back each memory section.
>
> [...]

Thanks. Do you mind attaching your config file? I cannot reproduce
the deadlock locally using my own config. I also see kmemleak_scan
in the dumped stack, so it must be something else in addition to
memory online/offline causing the issue.

--
Best Regards,
Yan, Zi
Qian Cai April 27, 2022, 12:41 p.m. UTC | #5
On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
> Thanks. Do you mind attaching your config file? I cannot reproduce
> the deadlock locally using my own config. I also see kmemleak_scan
> in the dumped stack, so it must be something else in addition to
> memory online/offline causing the issue.

This is on an arm64 server.

$ make defconfig debug.config
Qian Cai April 27, 2022, 1:10 p.m. UTC | #6
On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
> Thanks. Do you mind attaching your config file? I cannot reproduce
> the deadlock locally using my own config. I also see kmemleak_scan
> in the dumped stack, so it must be something else in addition to
> memory online/offline causing the issue.

Of course, the options below also need to be enabled. The kmemleak_scan is
just a symptom of one of the online operations blocking forever, as the
locks were never released.

CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
CONFIG_MEMORY_HOTREMOVE=y
Qian Cai April 27, 2022, 1:27 p.m. UTC | #7
On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
> Thanks. Do you mind attaching your config file? I cannot reproduce
> the deadlock locally using my own config. I also see kmemleak_scan
> in the dumped stack, so it must be something else in addition to
> memory online/offline causing the issue.

Actually, it is one of those *offline* operations, i.e.,

echo 0 > /sys/devices/system/memory/memoryNNN/online

looping forever, never finishing after more than 2 hours.
Zi Yan April 27, 2022, 1:30 p.m. UTC | #8
On 27 Apr 2022, at 9:27, Qian Cai wrote:

> On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
>> Thanks. Do you mind attaching your config file? I cannot reproduce
>> the deadlock locally using my own config. I also see kmemleak_scan
>> in the dumped stack, so it must be something else in addition to
>> memory online/offline causing the issue.
>
> Actually, it is one of those *offline* operations, i.e.,
>
> echo 0 > /sys/devices/system/memory/memoryNNN/online
>
> looping forever, never finishing after more than 2 hours.

Thank you for the detailed information. I am able to reproduce the
issue locally. I will update the patch once I fix the bug.

--
Best Regards,
Yan, Zi
Zi Yan April 27, 2022, 9:04 p.m. UTC | #9
On 27 Apr 2022, at 9:30, Zi Yan wrote:

> On 27 Apr 2022, at 9:27, Qian Cai wrote:
>
>> On Tue, Apr 26, 2022 at 05:38:58PM -0400, Zi Yan wrote:
>>> Thanks. Do you mind attaching your config file? I cannot reproduce
>>> the deadlock locally using my own config. I also see kmemleak_scan
>>> in the dumped stack, so it must be something else in addition to
>>> memory online/offline causing the issue.
>>
>> Actually, it is one of those *offline* operations, i.e.,
>>
>> echo 0 > /sys/devices/system/memory/memoryNNN/online
>>
>> looping forever, never finishing after more than 2 hours.
>
> Thank you for the detailed information. I am able to reproduce the
> issue locally. I will update the patch once I fix the bug.

Hi Qian,

Do you mind checking if the patch below fixes the issue? It works
for me.

The original code was trying to migrate non-migratable compound pages
(high-order slab pages in my tests) during isolation, causing an
infinite loop. The patch skips non-migratable pages.

I will update my patch series once we confirm the patch fixes
the bug.

Thanks.

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 75e454f5cf45..c39980fce626 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -367,58 +367,68 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                }
                /*
                 * migrate compound pages then let the free page handling code
-                * above do the rest. If migration is not enabled, just fail.
+                * above do the rest. If migration is not possible, just fail.
                 */
-               if (PageHuge(page) || PageTransCompound(page)) {
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+               if (PageCompound(page)) {
                        unsigned long nr_pages = compound_nr(page);
-                       int order = compound_order(page);
                        struct page *head = compound_head(page);
                        unsigned long head_pfn = page_to_pfn(head);
-                       int ret;
-                       struct compact_control cc = {
-                               .nr_migratepages = 0,
-                               .order = -1,
-                               .zone = page_zone(pfn_to_page(head_pfn)),
-                               .mode = MIGRATE_SYNC,
-                               .ignore_skip_hint = true,
-                               .no_set_skip_hint = true,
-                               .gfp_mask = gfp_flags,
-                               .alloc_contig = true,
-                       };
-                       INIT_LIST_HEAD(&cc.migratepages);

                        if (head_pfn + nr_pages < boundary_pfn) {
-                               pfn += nr_pages;
+                               pfn = head_pfn + nr_pages;
                                continue;
                        }

-                       ret = __alloc_contig_migrate_range(&cc, head_pfn,
-                                               head_pfn + nr_pages);
-
-                       if (ret)
-                               goto failed;
+#if defined CONFIG_MIGRATION
                        /*
-                        * reset pfn, let the free page handling code above
-                        * split the free page to the right migratetype list.
-                        *
-                        * head_pfn is not used here as a hugetlb page order
-                        * can be bigger than MAX_ORDER-1, but after it is
-                        * freed, the free page order is not. Use pfn within
-                        * the range to find the head of the free page and
-                        * reset order to 0 if a hugetlb page with
-                        * >MAX_ORDER-1 order is encountered.
+                        * hugetlb, lru compound (THP), and movable compound pages
+                        * can be migrated. Otherwise, fail the isolation.
                         */
-                       if (order > MAX_ORDER-1)
+                       if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
+                               int order;
+                               unsigned long outer_pfn;
+                               int ret;
+                               struct compact_control cc = {
+                                       .nr_migratepages = 0,
+                                       .order = -1,
+                                       .zone = page_zone(pfn_to_page(head_pfn)),
+                                       .mode = MIGRATE_SYNC,
+                                       .ignore_skip_hint = true,
+                                       .no_set_skip_hint = true,
+                                       .gfp_mask = gfp_flags,
+                                       .alloc_contig = true,
+                               };
+                               INIT_LIST_HEAD(&cc.migratepages);
+
+                               ret = __alloc_contig_migrate_range(&cc, head_pfn,
+                                                       head_pfn + nr_pages);
+
+                               if (ret)
+                                       goto failed;
+                               /*
+                                * reset pfn to the head of the free page, so
+                                * that the free page handling code above can split
+                                * the free page to the right migratetype list.
+                                *
+                                * head_pfn is not used here as a hugetlb page order
+                                * can be bigger than MAX_ORDER-1, but after it is
+                                * freed, the free page order is not. Use pfn within
+                                * the range to find the head of the free page.
+                                */
                                order = 0;
-                       while (!PageBuddy(pfn_to_page(pfn))) {
-                               order++;
-                               pfn &= ~0UL << order;
-                       }
-                       continue;
-#else
-                       goto failed;
+                               outer_pfn = pfn;
+                               while (!PageBuddy(pfn_to_page(outer_pfn))) {
+                                       if (++order >= MAX_ORDER) {
+                                               outer_pfn = pfn;
+                                               break;
+                                       }
+                                       outer_pfn &= ~0UL << order;
+                               }
+                               pfn = outer_pfn;
+                               continue;
+                       } else
 #endif
+                               goto failed;
                }

                pfn++;
--
Best Regards,
Yan, Zi
Qian Cai April 28, 2022, 12:33 p.m. UTC | #10
On Wed, Apr 27, 2022 at 05:04:39PM -0400, Zi Yan wrote:
> Do you mind checking if the patch below fixes the issue? It works
> for me.
> 
> The original code was trying to migrate non-migratable compound pages
> (high-order slab pages in my tests) during isolation, which caused
> an infinite loop. The patch skips non-migratable pages instead.
> 
> I will update my patch series once we confirm the patch fixes
> the bug.

I am not able to apply it on today's linux-next tree.

$ patch -Np1 --dry-run < ../patch/migrate.patch
checking file mm/page_isolation.c
Hunk #1 FAILED at 367.
1 out of 1 hunk FAILED
Zi Yan April 28, 2022, 12:39 p.m. UTC | #11
On 28 Apr 2022, at 8:33, Qian Cai wrote:

> On Wed, Apr 27, 2022 at 05:04:39PM -0400, Zi Yan wrote:
>> Do you mind checking if the patch below fixes the issue? It works
>> for me.
>>
>> The original code was trying to migrate non-migratable compound pages
>> (high-order slab pages in my tests) during isolation, which caused
>> an infinite loop. The patch skips non-migratable pages instead.
>>
>> I will update my patch series once we confirm the patch fixes
>> the bug.
>
> I am not able to apply it on today's linux-next tree.
>
> $ patch -Np1 --dry-run < ../patch/migrate.patch
> checking file mm/page_isolation.c
> Hunk #1 FAILED at 367.
> 1 out of 1 hunk FAILED

How about the one attached? I can apply it to next-20220428. Let me know
if you are using a different branch. Thanks.


--
Best Regards,
Yan, Zi
From 1567f4dbc287f6fe2fa6d4dc63fa1f9137692cff Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Wed, 27 Apr 2022 16:49:22 -0400
Subject: [PATCH] fix what can be migrated and what cannot

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_isolation.c | 88 +++++++++++++++++++++++++--------------------
 1 file changed, 49 insertions(+), 39 deletions(-)

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 75e454f5cf45..7968a1dd692a 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -367,58 +367,68 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 		}
 		/*
 		 * migrate compound pages then let the free page handling code
-		 * above do the rest. If migration is not enabled, just fail.
+		 * above do the rest. If migration is not possible, just fail.
 		 */
-		if (PageHuge(page) || PageTransCompound(page)) {
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+		if (PageCompound(page)) {
 			unsigned long nr_pages = compound_nr(page);
-			int order = compound_order(page);
 			struct page *head = compound_head(page);
 			unsigned long head_pfn = page_to_pfn(head);
-			int ret;
-			struct compact_control cc = {
-				.nr_migratepages = 0,
-				.order = -1,
-				.zone = page_zone(pfn_to_page(head_pfn)),
-				.mode = MIGRATE_SYNC,
-				.ignore_skip_hint = true,
-				.no_set_skip_hint = true,
-				.gfp_mask = gfp_flags,
-				.alloc_contig = true,
-			};
-			INIT_LIST_HEAD(&cc.migratepages);
 
 			if (head_pfn + nr_pages < boundary_pfn) {
-				pfn += nr_pages;
+				pfn = head_pfn + nr_pages;
 				continue;
 			}
 
-			ret = __alloc_contig_migrate_range(&cc, head_pfn,
-						head_pfn + nr_pages);
-
-			if (ret)
-				goto failed;
+#if defined CONFIG_MIGRATION
 			/*
-			 * reset pfn, let the free page handling code above
-			 * split the free page to the right migratetype list.
-			 *
-			 * head_pfn is not used here as a hugetlb page order
-			 * can be bigger than MAX_ORDER-1, but after it is
-			 * freed, the free page order is not. Use pfn within
-			 * the range to find the head of the free page and
-			 * reset order to 0 if a hugetlb page with
-			 * >MAX_ORDER-1 order is encountered.
+			 * hugetlb, lru compound (THP), and movable compound pages
+			 * can be migrated. Otherwise, fail the isolation.
 			 */
-			if (order > MAX_ORDER-1)
+			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
+				int order;
+				unsigned long outer_pfn;
+				int ret;
+				struct compact_control cc = {
+					.nr_migratepages = 0,
+					.order = -1,
+					.zone = page_zone(pfn_to_page(head_pfn)),
+					.mode = MIGRATE_SYNC,
+					.ignore_skip_hint = true,
+					.no_set_skip_hint = true,
+					.gfp_mask = gfp_flags,
+					.alloc_contig = true,
+				};
+				INIT_LIST_HEAD(&cc.migratepages);
+
+				ret = __alloc_contig_migrate_range(&cc, head_pfn,
+							head_pfn + nr_pages);
+
+				if (ret)
+					goto failed;
+				/*
+				 * reset pfn to the head of the free page, so
+				 * that the free page handling code above can split
+				 * the free page to the right migratetype list.
+				 *
+				 * head_pfn is not used here as a hugetlb page order
+				 * can be bigger than MAX_ORDER-1, but after it is
+				 * freed, the free page order is not. Use pfn within
+				 * the range to find the head of the free page.
+				 */
 				order = 0;
-			while (!PageBuddy(pfn_to_page(pfn))) {
-				order++;
-				pfn &= ~0UL << order;
-			}
-			continue;
-#else
-			goto failed;
+				outer_pfn = pfn;
+				while (!PageBuddy(pfn_to_page(outer_pfn))) {
+					if (++order >= MAX_ORDER) {
+						outer_pfn = pfn;
+						break;
+					}
+					outer_pfn &= ~0UL << order;
+				}
+				pfn = outer_pfn;
+				continue;
+			} else
 #endif
+				goto failed;
 		}
 
 		pfn++;
Qian Cai April 28, 2022, 4:19 p.m. UTC | #12
On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
> How about the one attached? I can apply it to next-20220428. Let me know
> if you are using a different branch. Thanks.

The original endless loop is gone, but running some syscall fuzzer
afterwards for a while would trigger the warning here. I have yet to
figure out if this is related to this series.

        /*
         * There are several places where we assume that the order value is sane
         * so bail out early if the request is out of bound.
         */
        if (unlikely(order >= MAX_ORDER)) {
                WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
                return NULL;
        }

 WARNING: CPU: 26 PID: 172874 at mm/page_alloc.c:5368 __alloc_pages
 CPU: 26 PID: 172874 Comm: trinity-main Not tainted 5.18.0-rc4-next-20220428-dirty #67
 pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 tpidr_el2 : ffff28cf80a61000
 pc : __alloc_pages
 lr : alloc_pages
 sp : ffff8000597b70f0
 x29: ffff8000597b70f0 x28: ffff0801e68d34c0 x27: 0000000000000000
 x26: 1ffff0000b2f6ea2 x25: ffff8000597b7510 x24: 0000000000000dc0
 x23: ffff28cf80a61000 x22: 000000000000000e x21: 1ffff0000b2f6e28
 x20: 0000000000040dc0 x19: ffffdf670d4a6fe0 x18: ffffdf66fa017d1c
 x17: ffffdf66f42f8348 x16: 1fffe1003cd1a7b3 x15: 000000000000001a
 x14: 1fffe1003cd1a7a6 x13: 0000000000000004 x12: ffff70000b2f6e05
 x11: 1ffff0000b2f6e04 x10: 00000000f204f1f1 x9 : 000000000000f204
 x8 : dfff800000000000 x7 : 00000000f3000000 x6 : 00000000f3f3f3f3
 x5 : ffff70000b2f6e28 x4 : ffff0801e68d34c0 x3 : 0000000000000000
 x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0
 Call trace:
  __alloc_pages
  alloc_pages
  kmalloc_order
  kmalloc_order_trace
  __kmalloc
  __regset_get
  regset_get_alloc
  fill_thread_core_info
  fill_note_info
  elf_core_dump
  do_coredump
  get_signal
  do_signal
  do_notify_resume
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync
 irq event stamp: 3614
 hardirqs last  enabled at (3613):  _raw_spin_unlock_irqrestore
 hardirqs last disabled at (3614):  el1_dbg
 softirqs last  enabled at (2988):  fpsimd_preserve_current_state
 softirqs last disabled at (2986):  fpsimd_preserve_current_state
Zi Yan April 29, 2022, 1:38 p.m. UTC | #13
On 28 Apr 2022, at 12:19, Qian Cai wrote:

> On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
>> How about the one attached? I can apply it to next-20220428. Let me know
>> if you are using a different branch. Thanks.
>
> The original endless loop is gone, but running some syscall fuzzer

Thanks for the confirmation.

> afterwards for a while would trigger the warning here. I have yet to
> figure out if this is related to this series.
>
>         /*
>          * There are several places where we assume that the order value is sane
>          * so bail out early if the request is out of bound.
>          */
>         if (unlikely(order >= MAX_ORDER)) {
>                 WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
>                 return NULL;
>         }
>
>  WARNING: CPU: 26 PID: 172874 at mm/page_alloc.c:5368 __alloc_pages
>  CPU: 26 PID: 172874 Comm: trinity-main Not tainted 5.18.0-rc4-next-20220428-dirty #67
>  pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  tpidr_el2 : ffff28cf80a61000
>  pc : __alloc_pages
>  lr : alloc_pages
>  sp : ffff8000597b70f0
>  x29: ffff8000597b70f0 x28: ffff0801e68d34c0 x27: 0000000000000000
>  x26: 1ffff0000b2f6ea2 x25: ffff8000597b7510 x24: 0000000000000dc0
>  x23: ffff28cf80a61000 x22: 000000000000000e x21: 1ffff0000b2f6e28
>  x20: 0000000000040dc0 x19: ffffdf670d4a6fe0 x18: ffffdf66fa017d1c
>  x17: ffffdf66f42f8348 x16: 1fffe1003cd1a7b3 x15: 000000000000001a
>  x14: 1fffe1003cd1a7a6 x13: 0000000000000004 x12: ffff70000b2f6e05
>  x11: 1ffff0000b2f6e04 x10: 00000000f204f1f1 x9 : 000000000000f204
>  x8 : dfff800000000000 x7 : 00000000f3000000 x6 : 00000000f3f3f3f3
>  x5 : ffff70000b2f6e28 x4 : ffff0801e68d34c0 x3 : 0000000000000000
>  x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0
>  Call trace:
>   __alloc_pages
>   alloc_pages
>   kmalloc_order
>   kmalloc_order_trace
>   __kmalloc
>   __regset_get
>   regset_get_alloc
>   fill_thread_core_info
>   fill_note_info
>   elf_core_dump
>   do_coredump
>   get_signal
>   do_signal
>   do_notify_resume
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
>  irq event stamp: 3614
>  hardirqs last  enabled at (3613):  _raw_spin_unlock_irqrestore
>  hardirqs last disabled at (3614):  el1_dbg
>  softirqs last  enabled at (2988):  fpsimd_preserve_current_state
>  softirqs last disabled at (2986):  fpsimd_preserve_current_state

I got an email this morning reporting a warning with the same call trace:
https://lore.kernel.org/linux-mm/CA+G9fYveMF-NU-rvrsbaora2g2QWxrkF7AWViuDrJyN9mNScJg@mail.gmail.com/

The email says the warning first appeared in next-20220427, but my
patchset has been in linux-next since next-20220426. In addition,
my patches do not touch any function in the call trace. I assume
this warning is not related to my patchset, but let me know if it
turns out to be.

Thanks.

--
Best Regards,
Yan, Zi
Andrew Morton May 10, 2022, 1:03 a.m. UTC | #14
On Mon, 25 Apr 2022 10:31:12 -0400 Zi Yan <zi.yan@sent.com> wrote:

> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
> and alloc_contig_range(). It prepares for my upcoming changes to make
> MAX_ORDER adjustable at boot time[1].

I'm thinking this looks ready to be merged into mm-stable later this week, for
the 5.19-rc1 merge window.

I believe the build error at
https://lkml.kernel.org/r/CA+G9fYveMF-NU-rvrsbaora2g2QWxrkF7AWViuDrJyN9mNScJg@mail.gmail.com
was addressed in ARM?

I have one -fix to be squashed,
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-make-alloc_contig_range-work-at-pageblock-granularity-fix.patch
Zi Yan May 10, 2022, 1:07 a.m. UTC | #15
On 9 May 2022, at 21:03, Andrew Morton wrote:

> On Mon, 25 Apr 2022 10:31:12 -0400 Zi Yan <zi.yan@sent.com> wrote:
>
>> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
>> and alloc_contig_range(). It prepares for my upcoming changes to make
>> MAX_ORDER adjustable at boot time[1].
>
> I'm thinking this looks ready to be merged into mm-stable later this week, for
> the 5.19-rc1 merge window.
>
> I believe the build error at
> https://lkml.kernel.org/r/CA+G9fYveMF-NU-rvrsbaora2g2QWxrkF7AWViuDrJyN9mNScJg@mail.gmail.com
> was addressed in ARM?

Right. The warning is caused by CONFIG_ARM64_SME=y, not by this patchset;
see https://lore.kernel.org/all/YnGrbEt3oBBTly7u@qian/.

>
> I have one -fix to be squashed,
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-make-alloc_contig_range-work-at-pageblock-granularity-fix.patch

Yes. Thanks.

--
Best Regards,
Yan, Zi
Qian Cai May 19, 2022, 8:57 p.m. UTC | #16
On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
> How about the one attached? I can apply it to next-20220428. Let me know
> if you are using a different branch. Thanks.

Zi, it turns out that the endless loop in isolate_single_pageblock() can
still be reproduced on today's linux-next tree by running the reproducer a
few times. With this debug patch applied, it keeps printing the same
values.

--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -399,6 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                                };
                                INIT_LIST_HEAD(&cc.migratepages);

+                               printk_ratelimited("KK stucked pfn=%lu head_pfn=%lu nr_pages=%lu boundary_pfn=%lu\n", pfn, head_pfn, nr_pages, boundary_pfn);
                                ret = __alloc_contig_migrate_range(&cc, head_pfn,
                                                        head_pfn + nr_pages);

 isolate_single_pageblock: 179 callbacks suppressed
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
 KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
Zi Yan May 19, 2022, 9:35 p.m. UTC | #17
On 19 May 2022, at 16:57, Qian Cai wrote:

> On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
>> How about the one attached? I can apply it to next-20220428. Let me know
>> if you are using a different branch. Thanks.
>
> Zi, it turns out that the endless loop in isolate_single_pageblock() can
> still be reproduced on today's linux-next tree by running the reproducer a
> few times. With this debug patch applied, it keeps printing the same
> values.
>
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -399,6 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>                                 };
>                                 INIT_LIST_HEAD(&cc.migratepages);
>
> +                               printk_ratelimited("KK stucked pfn=%lu head_pfn=%lu nr_pages=%lu boundary_pfn=%lu\n", pfn, head_pfn, nr_pages, boundary_pfn);
>                                 ret = __alloc_contig_migrate_range(&cc, head_pfn,
>                                                         head_pfn + nr_pages);
>
>  isolate_single_pageblock: 179 callbacks suppressed
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896

Hi Qian,

Thanks for your testing.

Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
page caused the infinite loop: the page was not migrated and the code kept
retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
code the page cannot be migrated, so that the code goes to failed without retrying. It would
be great if you could share exactly what ran after boot, so that I can reproduce this locally
and identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.

Can you also try the patch below to see if it fixes the infinite loop?

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b3f074d1682e..abde1877bbcb 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                                order = 0;
                                outer_pfn = pfn;
                                while (!PageBuddy(pfn_to_page(outer_pfn))) {
-                                       if (++order >= MAX_ORDER) {
-                                               outer_pfn = pfn;
-                                               break;
-                                       }
+                                       /* abort if the free page cannot be found */
+                                       if (++order >= MAX_ORDER)
+                                               goto failed;
                                        outer_pfn &= ~0UL << order;
                                }
                                pfn = outer_pfn;

--
Best Regards,
Yan, Zi
Zi Yan May 19, 2022, 11:24 p.m. UTC | #18
On 19 May 2022, at 17:35, Zi Yan wrote:

> On 19 May 2022, at 16:57, Qian Cai wrote:
>
>> On Thu, Apr 28, 2022 at 08:39:06AM -0400, Zi Yan wrote:
>>> How about the one attached? I can apply it to next-20220428. Let me know
>>> if you are using a different branch. Thanks.
>>
>> Zi, it turns out that the endless loop in isolate_single_pageblock() can
>> still be reproduced on today's linux-next tree by running the reproducer a
>> few times. With this debug patch applied, it keeps printing the same
>> values.
>>
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -399,6 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>>                                 };
>>                                 INIT_LIST_HEAD(&cc.migratepages);
>>
>> +                               printk_ratelimited("KK stucked pfn=%lu head_pfn=%lu nr_pages=%lu boundary_pfn=%lu\n", pfn, head_pfn, nr_pages, boundary_pfn);
>>                                 ret = __alloc_contig_migrate_range(&cc, head_pfn,
>>                                                         head_pfn + nr_pages);
>>
>>  isolate_single_pageblock: 179 callbacks suppressed
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>>  KK stucked pfn=2151120384 head_pfn=2151120384 nr_pages=512 boundary_pfn=2151120896
>
> Hi Qian,
>
> Thanks for your testing.
>
> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
> page caused the infinite loop: the page was not migrated and the code kept
> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
> code the page cannot be migrated, so that the code goes to failed without retrying. It would
> be great if you could share exactly what ran after boot, so that I can reproduce this locally
> and identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.
>
> Can you also try the patch below to see if it fixes the infinite loop?

I also found an off-by-one error in the code. It made the code needlessly
try to migrate pages that end exactly at the boundary, and your endless
loop seems to be caused by it. Could you try the patch below instead? Thanks.

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b3f074d1682e..5c8099bb822f 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -374,7 +374,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                        struct page *head = compound_head(page);
                        unsigned long head_pfn = page_to_pfn(head);

-                       if (head_pfn + nr_pages < boundary_pfn) {
+                       if (head_pfn + nr_pages <= boundary_pfn) {
                                pfn = head_pfn + nr_pages;
                                continue;
                        }
@@ -417,10 +417,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
                                order = 0;
                                outer_pfn = pfn;
                                while (!PageBuddy(pfn_to_page(outer_pfn))) {
-                                       if (++order >= MAX_ORDER) {
-                                               outer_pfn = pfn;
-                                               break;
-                                       }
+                                       if (++order >= MAX_ORDER)
+                                               goto failed;
                                        outer_pfn &= ~0UL << order;
                                }
                                pfn = outer_pfn;

--
Best Regards,
Yan, Zi
Qian Cai May 20, 2022, 11:30 a.m. UTC | #19
On Thu, May 19, 2022 at 05:35:15PM -0400, Zi Yan wrote:
> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
> page caused the infinite loop: the page was not migrated and the code kept
> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
> code the page cannot be migrated, so that the code goes to failed without retrying. It would
> be great if you could share exactly what ran after boot, so that I can reproduce this locally
> and identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.

The reproducer is just running the same script I shared with you previously,
multiple times in a row. It is still quite reproducible here, as the hang
usually happens within an hour.

$ for i in `seq 1 100`; do ./flip_mem.py; done

> Can you also try the patch below to see if it fixes the infinite loop?
> 
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b3f074d1682e..abde1877bbcb 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>                                 order = 0;
>                                 outer_pfn = pfn;
>                                 while (!PageBuddy(pfn_to_page(outer_pfn))) {
> -                                       if (++order >= MAX_ORDER) {
> -                                               outer_pfn = pfn;
> -                                               break;
> -                                       }
> +                                       /* abort if the free page cannot be found */
> +                                       if (++order >= MAX_ORDER)
> +                                               goto failed;
>                                         outer_pfn &= ~0UL << order;
>                                 }
>                                 pfn = outer_pfn;
> 

Can you explain a bit why this patch is the right thing to do here? I am a
little worried about shooting in the dark. Otherwise, I'll run the
off-by-one part over the weekend to see if that helps.
Zi Yan May 20, 2022, 1:43 p.m. UTC | #20
On 20 May 2022, at 7:30, Qian Cai wrote:

> On Thu, May 19, 2022 at 05:35:15PM -0400, Zi Yan wrote:
>> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
>> page caused the infinite loop: the page was not migrated and the code kept
>> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
>> code the page cannot be migrated, so that the code goes to failed without retrying. It would
>> be great if you could share exactly what ran after boot, so that I can reproduce this locally
>> and identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.
>
> The reproducer is just running the same script I shared with you previously,
> multiple times in a row. It is still quite reproducible here, as the hang
> usually happens within an hour.
>
> $ for i in `seq 1 100`; do ./flip_mem.py; done
>
>> Can you also try the patch below to see if it fixes the infinite loop?
>>
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index b3f074d1682e..abde1877bbcb 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>>                                 order = 0;
>>                                 outer_pfn = pfn;
>>                                 while (!PageBuddy(pfn_to_page(outer_pfn))) {
>> -                                       if (++order >= MAX_ORDER) {
>> -                                               outer_pfn = pfn;
>> -                                               break;
>> -                                       }
>> +                                       /* abort if the free page cannot be found */
>> +                                       if (++order >= MAX_ORDER)
>> +                                               goto failed;
>>                                         outer_pfn &= ~0UL << order;
>>                                 }
>>                                 pfn = outer_pfn;
>>
>
> Can you explain a bit about why this patch is the right thing to do here? I am a
> little bit worried about shooting in the dark. Otherwise, I'll run the
> off-by-one part over the weekend to see if that helps.

The code kept retrying the migration of a 512-page compound page, so it seems to me
that __alloc_contig_migrate_range() did not migrate the page but returned
0 every time; otherwise, the "if (ret) goto failed;" check would have bailed out of
the loop already. The original code above assumed a free page can always be found
after __alloc_contig_migrate_range(), so it retries when no free page is found.
Your infinite loop shows that assumption does not always hold, so the new
code quits retrying when no free page can be found.

I will dig into it deeper to make sure it is the correct fix. I will
update you when I am done.

Thanks.

--
Best Regards,
Yan, Zi
Zi Yan May 20, 2022, 2:13 p.m. UTC | #21
On 20 May 2022, at 9:43, Zi Yan wrote:

> On 20 May 2022, at 7:30, Qian Cai wrote:
>
>> On Thu, May 19, 2022 at 05:35:15PM -0400, Zi Yan wrote:
>>> Do you have a complete reproducer? From your printout, it is clear that a 512-page compound
>>> page caused the infinite loop, because the page was not migrated and the code kept
>>> retrying. But __alloc_contig_migrate_range() is supposed to return non-zero to tell the
>>> code that the page cannot be migrated, and the code will then goto failed without retrying.
>>> It would be great if you could share exactly what ran after boot, so that I can reproduce
>>> locally and identify what makes __alloc_contig_migrate_range() return 0 without migrating the page.
>>
>> The reproducer is just to run the same script I shared with you previously,
>> but multiple times. It is still quite reproducible here, as it usually
>> happens within an hour.
>>
>> $ for i in `seq 1 100`; do ./flip_mem.py; done

Also, do you mind providing the page dump of the 512-page compound page? I would like
to know what page caused the issue.

Thanks.

>>
>>> Can you also try the patch below to see if it fixes the infinite loop?
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index b3f074d1682e..abde1877bbcb 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -417,10 +417,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
>>>                                 order = 0;
>>>                                 outer_pfn = pfn;
>>>                                 while (!PageBuddy(pfn_to_page(outer_pfn))) {
>>> -                                       if (++order >= MAX_ORDER) {
>>> -                                               outer_pfn = pfn;
>>> -                                               break;
>>> -                                       }
>>> +                                       /* abort if the free page cannot be found */
>>> +                                       if (++order >= MAX_ORDER)
>>> +                                               goto failed;
>>>                                         outer_pfn &= ~0UL << order;
>>>                                 }
>>>                                 pfn = outer_pfn;
>>>
>>
>> Can you explain a bit about why this patch is the right thing to do here? I am a
>> little bit worried about shooting in the dark. Otherwise, I'll run the
>> off-by-one part over the weekend to see if that helps.
>
> The code kept retrying the migration of a 512-page compound page, so it seems to me
> that __alloc_contig_migrate_range() did not migrate the page but returned
> 0 every time; otherwise, the "if (ret) goto failed;" check would have bailed out of
> the loop already. The original code above assumed a free page can always be found
> after __alloc_contig_migrate_range(), so it retries when no free page is found.
> Your infinite loop shows that assumption does not always hold, so the new
> code quits retrying when no free page can be found.
>
> I will dig into it deeper to make sure it is the correct fix. I will
> update you when I am done.
>
> Thanks.
>
> --
> Best Regards,
> Yan, Zi


--
Best Regards,
Yan, Zi
Qian Cai May 20, 2022, 7:41 p.m. UTC | #22
On Fri, May 20, 2022 at 10:13:51AM -0400, Zi Yan wrote:
> Also, do you mind providing the page dump of the 512-page compound page? I would like
> to know what page caused the issue.

 page last allocated via order 9, migratetype Movable, gfp_mask 0x3c24ca(GFP_TRANSHUGE|__GFP_THISNODE), pid 831, tgid 831 (khugepaged), ts 3899865924520, free_ts 3821953009040
  post_alloc_hook
  get_page_from_freelist
  __alloc_pages
  khugepaged_alloc_page
  collapse_huge_page
  khugepaged_scan_pmd
  khugepaged_scan_mm_slot
  khugepaged
  kthread
  ret_from_fork
 page last free stack trace:
  free_pcp_prepare
  free_unref_page
  free_compound_page
  free_transhuge_page
  __put_compound_page
  release_pages
  free_pages_and_swap_cache
  tlb_batch_pages_flush
  tlb_finish_mmu
  exit_mmap
  __mmput
  mmput
  exit_mm
  do_exit
  do_group_exit
  __arm64_sys_exit_group
Zi Yan May 20, 2022, 9:56 p.m. UTC | #23
On 20 May 2022, at 15:41, Qian Cai wrote:

> On Fri, May 20, 2022 at 10:13:51AM -0400, Zi Yan wrote:
>> Also, do you mind providing the page dump of the 512-page compound page? I would like
>> to know what page caused the issue.
>
>  page last allocated via order 9, migratetype Movable, gfp_mask 0x3c24ca(GFP_TRANSHUGE|__GFP_THISNODE), pid 831, tgid 831 (khugepaged), ts 3899865924520, free_ts 3821953009040
>   post_alloc_hook
>   get_page_from_freelist
>   __alloc_pages
>   khugepaged_alloc_page
>   collapse_huge_page
>   khugepaged_scan_pmd
>   khugepaged_scan_mm_slot
>   khugepaged
>   kthread
>   ret_from_fork
>  page last free stack trace:
>   free_pcp_prepare
>   free_unref_page
>   free_compound_page
>   free_transhuge_page
>   __put_compound_page
>   release_pages
>   free_pages_and_swap_cache
>   tlb_batch_pages_flush
>   tlb_finish_mmu
>   exit_mmap
>   __mmput
>   mmput
>   exit_mm
>   do_exit
>   do_group_exit
>   __arm64_sys_exit_group

Do you have the page information like refcount, map count, mapping, index, and
page flags? That would be more helpful. Thanks.

I cannot reproduce it locally after hundreds of iterations of flip_mem.py on my
x86_64 VM and bare metal.

What ARM machine are you using? I wonder if I am able to get one locally.

Thanks.

--
Best Regards,
Yan, Zi
Qian Cai May 20, 2022, 11:41 p.m. UTC | #24
On Fri, May 20, 2022 at 05:56:52PM -0400, Zi Yan wrote:
> Do you have the page information like refcount, map count, mapping, index, and
> page flags? That would be more helpful. Thanks.

page:fffffc200c7f8000 refcount:393 mapcount:1 mapping:0000000000000000 index:0xffffbb800 pfn:0x8039fe00
head:fffffc200c7f8000 order:9 compound_mapcount:0 compound_pincount:0
memcg:ffff40026005a000
anon flags: 0xbfffc000009001c(uptodate|dirty|lru|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
raw: 0bfffc000009001c fffffc2007b74048 fffffc2009c087c8 ffff08038dab9189
raw: 0000000ffffbb800 0000000000000000 0000018900000000 ffff40026005a000

> I cannot reproduce it locally after hundreds of iterations of flip_mem.py on my
> x86_64 VM and bare metal.
> 
> What ARM machine are you using? I wonder if I am able to get one locally.

Ampere Altra.
Zi Yan May 22, 2022, 4:54 p.m. UTC | #25
On 20 May 2022, at 19:41, Qian Cai wrote:

> On Fri, May 20, 2022 at 05:56:52PM -0400, Zi Yan wrote:
>> Do you have the page information like refcount, map count, mapping, index, and
>> page flags? That would be more helpful. Thanks.
>
> page:fffffc200c7f8000 refcount:393 mapcount:1 mapping:0000000000000000 index:0xffffbb800 pfn:0x8039fe00
> head:fffffc200c7f8000 order:9 compound_mapcount:0 compound_pincount:0
> memcg:ffff40026005a000
> anon flags: 0xbfffc000009001c(uptodate|dirty|lru|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
> raw: 0bfffc000009001c fffffc2007b74048 fffffc2009c087c8 ffff08038dab9189
> raw: 0000000ffffbb800 0000000000000000 0000018900000000 ffff40026005a000
>

This is a PTE-mapped THP. Unless fewer than 393 subpages are mapped, which would mean
an extra refcount is present, the page should be migratable. Even if it is not
migratable due to the extra pin, __alloc_contig_migrate_range() will return non-zero
and the code bails out. I have no idea why it caused the infinite loop.

>> I cannot reproduce it locally after hundreds of iterations of flip_mem.py on my
>> x86_64 VM and bare metal.
>>
>> What ARM machine are you using? I wonder if I am able to get one locally.
>
> Ampere Altra.

Sorry, I have no access to such a machine right now and cannot afford to buy one.

Can you try the patch below on top of linux-next to see if it fixes the infinite loop issue?
Thanks.

1. The split_free_page() change is unrelated to this issue, but it makes the code more robust.
2. Using set_migratetype_isolate() in isolate_single_pageblock() properly marks the pageblock
   MIGRATE_ISOLATE.
3. Setting the to-be-migrated page's pageblock to MIGRATE_ISOLATE avoids a possible race in
   which another thread takes the free page after migration.
4. The off-by-one fix, and no retry if a free page is not found after migration, like I added before.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4dcfa0ceca45..ad8f73b00466 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1122,13 +1122,16 @@ void split_free_page(struct page *free_page,
 	unsigned long flags;
 	int free_page_order;

+	if (split_pfn_offset == 0)
+		return;
+
 	spin_lock_irqsave(&zone->lock, flags);
 	del_page_from_free_list(free_page, zone, order);
 	for (pfn = free_page_pfn;
 	     pfn < free_page_pfn + (1UL << order);) {
 		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);

-		free_page_order = ffs(split_pfn_offset) - 1;
+		free_page_order = min(pfn ? __ffs(pfn) : order, __fls(split_pfn_offset));
 		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
 				mt, FPI_NONE);
 		pfn += 1UL << free_page_order;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b3f074d1682e..706915c9a380 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -283,6 +283,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * isolate_single_pageblock() -- tries to isolate a pageblock that might be
  * within a free or in-use page.
  * @boundary_pfn:		pageblock-aligned pfn that a page might cross
+ * @flags:			isolation flags
  * @gfp_flags:			GFP flags used for migrating pages
  * @isolate_before:	isolate the pageblock before the boundary_pfn
  *
@@ -298,14 +299,15 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * either. The function handles this by splitting the free page or migrating
  * the in-use page then splitting the free page.
  */
-static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
-			bool isolate_before)
+static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
+			gfp_t gfp_flags, bool isolate_before)
 {
 	unsigned char saved_mt;
 	unsigned long start_pfn;
 	unsigned long isolate_pageblock;
 	unsigned long pfn;
 	struct zone *zone;
+	int ret;

 	VM_BUG_ON(!IS_ALIGNED(boundary_pfn, pageblock_nr_pages));

@@ -325,7 +327,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				      zone->zone_start_pfn);

 	saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
-	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), MIGRATE_ISOLATE);
+	ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
+			isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
+
+	if (ret)
+		return ret;

 	/*
 	 * Bail out early when the to-be-isolated pageblock does not form
@@ -374,7 +380,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 			struct page *head = compound_head(page);
 			unsigned long head_pfn = page_to_pfn(head);

-			if (head_pfn + nr_pages < boundary_pfn) {
+			if (head_pfn + nr_pages <= boundary_pfn) {
 				pfn = head_pfn + nr_pages;
 				continue;
 			}
@@ -386,7 +392,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
 				int order;
 				unsigned long outer_pfn;
-				int ret;
+				int page_mt = get_pageblock_migratetype(page);
+				bool isolate_page = !is_migrate_isolate_page(page);
 				struct compact_control cc = {
 					.nr_migratepages = 0,
 					.order = -1,
@@ -399,9 +406,31 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				};
 				INIT_LIST_HEAD(&cc.migratepages);

+				/*
+				 * XXX: mark the page as MIGRATE_ISOLATE so that
+				 * no one else can grab the freed page after migration.
+				 * Ideally, the page should be freed as two separate
+				 * pages to be added into separate migratetype free
+				 * lists.
+				 */
+				if (isolate_page) {
+					ret = set_migratetype_isolate(page, page_mt,
+						flags, head_pfn, boundary_pfn - 1);
+					if (ret)
+						goto failed;
+				}
+
 				ret = __alloc_contig_migrate_range(&cc, head_pfn,
 							head_pfn + nr_pages);

+				/*
+				 * restore the page's migratetype so that it can
+				 * be split into separate migratetype free lists
+				 * later.
+				 */
+				if (isolate_page)
+					unset_migratetype_isolate(page, page_mt);
+
 				if (ret)
 					goto failed;
 				/*
@@ -417,10 +446,9 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 				order = 0;
 				outer_pfn = pfn;
 				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					if (++order >= MAX_ORDER) {
-						outer_pfn = pfn;
-						break;
-					}
+					/* stop if we cannot find the free page */
+					if (++order >= MAX_ORDER)
+						goto failed;
 					outer_pfn &= ~0UL << order;
 				}
 				pfn = outer_pfn;
@@ -435,7 +463,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, gfp_t gfp_flags,
 	return 0;
 failed:
 	/* restore the original migratetype */
-	set_pageblock_migratetype(pfn_to_page(isolate_pageblock), saved_mt);
+	unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
 	return -EBUSY;
 }

@@ -496,12 +524,12 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	int ret;

 	/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
-	ret = isolate_single_pageblock(isolate_start, gfp_flags, false);
+	ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false);
 	if (ret)
 		return ret;

 	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
-	ret = isolate_single_pageblock(isolate_end, gfp_flags, true);
+	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true);
 	if (ret) {
 		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
 		return ret;


--
Best Regards,
Yan, Zi
Zi Yan May 22, 2022, 7:33 p.m. UTC | #26
On 22 May 2022, at 12:54, Zi Yan wrote:

> On 20 May 2022, at 19:41, Qian Cai wrote:
>
>> On Fri, May 20, 2022 at 05:56:52PM -0400, Zi Yan wrote:
>>> Do you have the page information like refcount, map count, mapping, index, and
>>> page flags? That would be more helpful. Thanks.
>>
>> page:fffffc200c7f8000 refcount:393 mapcount:1 mapping:0000000000000000 index:0xffffbb800 pfn:0x8039fe00
>> head:fffffc200c7f8000 order:9 compound_mapcount:0 compound_pincount:0
>> memcg:ffff40026005a000
>> anon flags: 0xbfffc000009001c(uptodate|dirty|lru|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
>> raw: 0bfffc000009001c fffffc2007b74048 fffffc2009c087c8 ffff08038dab9189
>> raw: 0000000ffffbb800 0000000000000000 0000018900000000 ffff40026005a000

OK. I replicated two scenarios, which can have the above page dump:
1. a PTE-mapped THP with 393 subpages mapped without any extra pin,
2. a PTE-mapped THP with 392 subpages mapped with an extra pin on the first subpage.

For scenario 1, there is no infinite looping on next-20220519 and next-20220520.

For scenario 2, an infinite loop happens on next-20220519, on next-20220520, and on
next-20220520 with my fixup patch from another email, when the memory block in which
the page resides is being offlined. However, after reverting all my patches, the
infinite loop remains.

So based on the experiments I have done, it looks to me that the infinite loop during
memory offlining is not a regression. David Hildenbrand can correct me if I am wrong.
A better issue description than "infinite loop during memory offlining", and a better
reproducer, are needed for me to identify and fix potential bugs in my code.

Of course, my fixup patch should be applied anyway.

Thanks for your testing.


--
Best Regards,
Yan, Zi
Qian Cai May 24, 2022, 4:59 p.m. UTC | #27
On Sun, May 22, 2022 at 12:54:04PM -0400, Zi Yan wrote:
> Can you try the patch below on top of linux-next to see if it fixes the infinite loop issue?
> Thanks.
> 
> 1. The split_free_page() change is unrelated to this issue, but it makes the code more robust.
> 2. Using set_migratetype_isolate() in isolate_single_pageblock() properly marks the pageblock
>    MIGRATE_ISOLATE.
> 3. Setting the to-be-migrated page's pageblock to MIGRATE_ISOLATE avoids a possible race in
>    which another thread takes the free page after migration.
> 4. The off-by-one fix, and no retry if a free page is not found after migration, like I added before.

Cool. I'll run it this week and report back next week.