[1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

Message ID 20240412064353.133497-1-zhaoyang.huang@unisoc.com (mailing list archive)
State New

Commit Message

zhaoyang.huang April 12, 2024, 6:43 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

The livelock in [1] has been reported multiple times since v5.15, where
a zero-ref folio is repeatedly found in the page cache by find_get_entry.
A possible timing sequence is proposed in [2]; briefly, the lockless
xarray walk can be harmed by a zero-ref folio remaining in slot[offset].
This commit protects the xa split steps (folio_ref_freeze and
__split_huge_page) under lruvec->lru_lock to close the race window.
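
In short, with this change the ordering in split_huge_page_to_list()
becomes roughly the following (a condensed sketch of the change, not the
exact hunks):

	local_irq_disable();
	/* now taken before the refcount is frozen, instead of inside
	 * __split_huge_page() after the freeze */
	lruvec = folio_lruvec_lock(folio);
	...
	if (folio_ref_freeze(folio, 1 + extra_pins)) {
		/* tail stores and page_ref_add(head, 2) now happen
		 * entirely under lruvec->lru_lock */
		__split_huge_page(page, list, end);
		unlock_page_lruvec(lruvec);
		local_irq_enable();
	} else {
		...
		unlock_page_lruvec(lruvec);
		local_irq_enable();
	}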

[1]
[167789.800297] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167726.780305] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167726.780319] (detected by 3, t=17256977 jiffies, g=19883597, q=2397394)
[167726.780325] task:kswapd0         state:R  running task     stack:   24 pid:  155 ppid:     2 flags:0x00000008
[167789.800308] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167789.800322] (detected by 3, t=17272732 jiffies, g=19883597, q=2397470)
[167789.800328] task:kswapd0         state:R  running task     stack:   24 pid:  155 ppid:     2 flags:0x00000008
[167789.800339] Call trace:
[167789.800342]  dump_backtrace.cfi_jt+0x0/0x8
[167789.800355]  show_stack+0x1c/0x2c
[167789.800363]  sched_show_task+0x1ac/0x27c
[167789.800370]  print_other_cpu_stall+0x314/0x4dc
[167789.800377]  check_cpu_stall+0x1c4/0x36c
[167789.800382]  rcu_sched_clock_irq+0xe8/0x388
[167789.800389]  update_process_times+0xa0/0xe0
[167789.800396]  tick_sched_timer+0x7c/0xd4
[167789.800404]  __run_hrtimer+0xd8/0x30c
[167789.800408]  hrtimer_interrupt+0x1e4/0x2d0
[167789.800414]  arch_timer_handler_phys+0x5c/0xa0
[167789.800423]  handle_percpu_devid_irq+0xbc/0x318
[167789.800430]  handle_domain_irq+0x7c/0xf0
[167789.800437]  gic_handle_irq+0x54/0x12c
[167789.800445]  call_on_irq_stack+0x40/0x70
[167789.800451]  do_interrupt_handler+0x44/0xa0
[167789.800457]  el1_interrupt+0x34/0x64
[167789.800464]  el1h_64_irq_handler+0x1c/0x2c
[167789.800470]  el1h_64_irq+0x7c/0x80
[167789.800474]  xas_find+0xb4/0x28c
[167789.800481]  find_get_entry+0x3c/0x178
[167789.800487]  find_lock_entries+0x98/0x2f8
[167789.800492]  __invalidate_mapping_pages.llvm.3657204692649320853+0xc8/0x224
[167789.800500]  invalidate_mapping_pages+0x18/0x28
[167789.800506]  inode_lru_isolate+0x140/0x2a4
[167789.800512]  __list_lru_walk_one+0xd8/0x204
[167789.800519]  list_lru_walk_one+0x64/0x90
[167789.800524]  prune_icache_sb+0x54/0xe0
[167789.800529]  super_cache_scan+0x160/0x1ec
[167789.800535]  do_shrink_slab+0x20c/0x5c0
[167789.800541]  shrink_slab+0xf0/0x20c
[167789.800546]  shrink_node_memcgs+0x98/0x320
[167789.800553]  shrink_node+0xe8/0x45c
[167789.800557]  balance_pgdat+0x464/0x814
[167789.800563]  kswapd+0xfc/0x23c
[167789.800567]  kthread+0x164/0x1c8
[167789.800573]  ret_from_fork+0x10/0x20

[2]
Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolates a number of
pages to cc->migratepages via their pfn
       (the folio has refcount: 1 + n (alloc_pages, page_cache))

2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
extra_pins) sets folio->refcnt to 0

3. alloc_contig_range->migrate_pages->xas_split splits the entry so that
each slot from slot[offset] to slot[offset + sibs] carries its own folio

4. alloc_contig_range->migrate_pages->__split_huge_page->folio_lruvec_lock
blocks on the contended lock, so the path that sets the refcount back to
2 has not run yet

5. Thread_kswapd enters the livelock via the chain below (see the sketch
of find_get_entry after this sequence)
      rcu_read_lock();
   retry:
        find_get_entry
            folio = xas_find
            if (!folio_try_get_rcu)
                xas_reset;
                goto retry;
      rcu_read_unlock();

5'. Thread_holdlock, the lruvec->lru_lock holder, could be stalled on the
same core as Thread_kswapd.
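
For reference, the find_get_entry() loop that step 5 refers to looks
roughly like the sketch below (paraphrased and abridged from
mm/filemap.c of the affected kernels, not the exact upstream code).
folio_try_get_rcu() cannot take a reference on a folio whose refcount
has been frozen to 0, so while such a folio sits in the slot the walk
keeps resetting and retrying:

static void *find_get_entry(struct xa_state *xas, pgoff_t max, xa_mark_t mark)
{
	struct folio *folio;

retry:
	if (mark == XA_PRESENT)
		folio = xas_find(xas, max);
	else
		folio = xas_find_marked(xas, max, mark);

	if (xas_retry(xas, folio))
		goto retry;
	/* shadow/swap/DAX value entries are returned as-is */
	if (!folio || xa_is_value(folio))
		return folio;

	/*
	 * Fails while the refcount is frozen at 0, e.g. between
	 * folio_ref_freeze() and page_ref_add(head, 2) in the split
	 * path, so the walk resets and lands on the same slot again.
	 */
	if (!folio_try_get_rcu(folio))
		goto reset;

	if (unlikely(folio != xas_reload(xas))) {
		folio_put(folio);
		goto reset;
	}
	return folio;
reset:
	xas_reset(xas);
	goto retry;
}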

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 mm/huge_memory.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

Comments

Matthew Wilcox (Oracle) April 12, 2024, 12:24 p.m. UTC | #1
On Fri, Apr 12, 2024 at 02:43:53PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> The livelock in [1] has been reported multiple times since v5.15, where
> a zero-ref folio is repeatedly found in the page cache by find_get_entry.
> A possible timing sequence is proposed in [2]; briefly, the lockless

I have no patience for going through another one of your "analyses".

1. Can you reproduce this bug without this patch?
2. Does the reproducer stop working after this patch?

Otherwise I'm not interested.  Sorry.  You burnt all my good will.

> xarray walk can be harmed by a zero-ref folio remaining in slot[offset].
> This commit protects the xa split steps (folio_ref_freeze and
> __split_huge_page) under lruvec->lru_lock to close the race window.
Andrew Morton April 12, 2024, 9:34 p.m. UTC | #2
On Fri, 12 Apr 2024 14:43:53 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:

> The livelock in [1] has been reported multiple times since v5.15,

Are you able to provide us with a means by which others can reproduce this?

Thanks.
Zhaoyang Huang April 13, 2024, 2:01 a.m. UTC | #3
Looping in Dave, since he previously helped set up a reproducer in
https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
@Dave Chinner, could you kindly verify this patch in your environment
if convenient? Thanks a lot.


On Sat, Apr 13, 2024 at 5:34 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 12 Apr 2024 14:43:53 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:
>
> > The livelock in [1] has been reported multiple times since v5.15,
>
> Are you able to provide us with a means by which others can reproduce this?
>
> Thanks.
Zhaoyang Huang April 13, 2024, 7:10 a.m. UTC | #4
On Fri, Apr 12, 2024 at 8:24 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 12, 2024 at 02:43:53PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > The livelock in [1] has been reported multiple times since v5.15, where
> > a zero-ref folio is repeatedly found in the page cache by find_get_entry.
> > A possible timing sequence is proposed in [2]; briefly, the lockless
>
> I have no patience for going through another one of your "analyses".
>
> 1. Can you reproduce this bug without this patch?
> 2. Does the reproducer stop working after this patch?
>
> Otherwise I'm not interested.  Sorry.  You burnt all my good will.

This bug has been reported many times by three people, including me, as
shown below, for at least two years. Have you ever tried to solve it? Did
Dave and Brian also burn your good will, if you ever had any? Be aware
that you are the maintainer with the responsibility for maintaining this
code, not us. "To wear the crown is to bear its weight, or give it up."
Put me on your SPAM list, thank you.

https://lore.kernel.org/linux-mm/20221018223042.GJ2703033@dread.disaster.area/
https://lore.kernel.org/linux-mm/Y0%2FkZbIvMgkNhWpM@bfoster/

>
> > xarray walk can be harmed by a zero-ref folio remaining in slot[offset].
> > This commit protects the xa split steps (folio_ref_freeze and
> > __split_huge_page) under lruvec->lru_lock to close the race window.
Dave Chinner April 15, 2024, 12:09 a.m. UTC | #5
On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> Looping in Dave, since he previously helped set up a reproducer in
> https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> @Dave Chinner, could you kindly verify this patch in your environment
> if convenient? Thanks a lot.

I don't have the test environment from 18 months ago available any
more. Also, I haven't seen this problem since that specific test
environment tripped over the issue. Hence I don't have any way of
confirming that the problem is fixed, either, because first I'd have
to reproduce it...

-Dave.
Zhaoyang Huang April 15, 2024, 1:50 a.m. UTC | #6
On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > Looping in Dave, since he previously helped set up a reproducer in
> > https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > @Dave Chinner, could you kindly verify this patch in your environment
> > if convenient? Thanks a lot.
>
> I don't have the test environment from 18 months ago available any
> more. Also, I haven't seen this problem since that specific test
> environment tripped over the issue. Hence I don't have any way of
> confirming that the problem is fixed, either, because first I'd have
> to reproduce it...
Thanks for the information. I noticed that you reported another soft
lockup related to xas_load in Nov 2023. This patch is supposed to help
with that as well. With regard to the version timing, this commit is
effectively a revert of commit b6769834aac1d467fa1c71277d15688efcbb4d76
("mm/thp: narrow lru locking"), which was merged before v5.15.

To save your time, a brief description follows. IMO, b6769834aa
introduces a potential stall between freezing the folio's refcount and
storing it back to 2, which turns the xas_load->folio_try_get_rcu loop
into a livelock if it stalls the lru_lock holder.

b6769834aa
    split_huge_page_to_list
-       spin_lock(lru_lock)
        xas_split(&xas, folio, order)
        folio_ref_freeze(folio, 1 + folio_nr_pages(folio))
+       spin_lock(lru_lock)
        xas_store(&xas, offset++, head+i)
        page_ref_add(head, 2)
        spin_unlock(lru_lock)

Sorry in advance if the above doesn't make sense; I am just a
developer who is also suffering from this bug and trying to fix it.
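
For completeness, the freeze itself is just a cmpxchg of the refcount
down to 0, roughly as below (paraphrased from include/linux/page_ref.h,
tracepoint hook omitted):

static inline bool folio_ref_freeze(struct folio *folio, int count)
{
	return page_ref_freeze(&folio->page, count);
}

static inline int page_ref_freeze(struct page *page, int count)
{
	/* the refcount goes from 'count' straight to 0 on success */
	return likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
}

So between a successful freeze and the later page_ref_add(head, 2), any
folio_try_get_rcu() on that folio fails, which is the window the
pseudo-diff above is about.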
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
Marcin Wanat May 20, 2024, 7:42 p.m. UTC | #7
On 15.04.2024 03:50, Zhaoyang Huang wrote:
> On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
>>> Looping in Dave, since he previously helped set up a reproducer in
>>> https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
>>> @Dave Chinner, could you kindly verify this patch in your environment
>>> if convenient? Thanks a lot.
>>
>> I don't have the test environment from 18 months ago available any
>> more. Also, I haven't seen this problem since that specific test
>> environment tripped over the issue. Hence I don't have any way of
>> confirming that the problem is fixed, either, because first I'd have
>> to reproduce it...
> Thanks for the information. I noticed that you reported another soft
> lockup related to xas_load in Nov 2023. This patch is supposed to help
> with that as well. With regard to the version timing, this commit is
> effectively a revert of commit b6769834aac1d467fa1c71277d15688efcbb4d76
> ("mm/thp: narrow lru locking"), which was merged before v5.15.
>
> To save your time, a brief description follows. IMO, b6769834aa
> introduces a potential stall between freezing the folio's refcount and
> storing it back to 2, which turns the xas_load->folio_try_get_rcu loop
> into a livelock if it stalls the lru_lock holder.
>
> b6769834aa
>     split_huge_page_to_list
> -       spin_lock(lru_lock)
>         xas_split(&xas, folio, order)
>         folio_ref_freeze(folio, 1 + folio_nr_pages(folio))
> +       spin_lock(lru_lock)
>         xas_store(&xas, offset++, head+i)
>         page_ref_add(head, 2)
>         spin_unlock(lru_lock)
>
> Sorry in advance if the above doesn't make sense; I am just a
> developer who is also suffering from this bug and trying to fix it.
I am experiencing a similar error on dozens of hosts, with stack traces 
that are all similar:

[627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s! 
[file_get:953301]
[627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT 
xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat 
nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr 
intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common 
isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal 
intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl 
iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support 
dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core 
dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage 
i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf 
ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1 
sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel 
polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw 
drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel 
nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata 
scsi_transport_sas i2c_algo_bit wmi
[627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded 
Tainted: G             L     6.6.30.el9 #2
[627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS 
2.21.2 02/19/2024
[627163.727847] RIP: 0010:xas_descend+0x1b/0x70
[627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6 
0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6 
08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
[627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
[627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX: 
0000000000000020
[627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI: 
ffffc90034a67990
[627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09: 
0000000000008820
[627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12: 
ffffc90034a67a20
[627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15: 
ffffc90034a67a18
[627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000) 
knlGS:0000000000000000
[627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4: 
00000000007706e0
[627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[627163.727878] PKRU: 55555554
[627163.727879] Call Trace:
[627163.727882]  <IRQ>
[627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
[627163.727892]  ? softlockup_fn+0x70/0x70
[627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
[627163.727903]  ? hrtimer_interrupt+0x106/0x240
[627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
[627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
[627163.727917]  </IRQ>
[627163.727918]  <TASK>
[627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[627163.727927]  ? xas_descend+0x1b/0x70
[627163.727930]  xas_load+0x2c/0x40
[627163.727933]  xas_find+0x161/0x1a0
[627163.727937]  find_get_entries+0x77/0x1d0
[627163.727944]  truncate_inode_pages_range+0x244/0x3f0
[627163.727950]  truncate_pagecache+0x44/0x60
[627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
[627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
[627163.728153]  notify_change+0x34f/0x4f0
[627163.728158]  ? _raw_spin_lock+0x13/0x30
[627163.728165]  ? do_truncate+0x80/0xd0
[627163.728169]  do_truncate+0x80/0xd0
[627163.728172]  do_open+0x2ce/0x400
[627163.728177]  path_openat+0x10d/0x280
[627163.728181]  do_filp_open+0xb2/0x150
[627163.728186]  ? check_heap_object+0x34/0x190
[627163.728189]  ? __check_object_size.part.0+0x5a/0x130
[627163.728194]  do_sys_openat2+0x92/0xc0
[627163.728197]  __x64_sys_openat+0x53/0x90
[627163.728200]  do_syscall_64+0x35/0x80
[627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
[627163.728210] RIP: 0033:0x7fc5e493e7fb
[627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 
00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 
05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX: 
0000000000000101
[627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX: 
00007fc5e493e7fb
[627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI: 
00000000ffffff9c
[627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09: 
0000000000000001
[627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12: 
0000000000000241
[627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15: 
0000000000000000
[627163.728227]  </TASK>

I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
However, with the long-term kernels 6.1.XX and 6.6.XX (tested with at
least 10 different versions), this lockup always appears
after 2-30 days, similar to the report in the original thread.
The more load (for example, copying a lot of local files while
serving 20Gbps traffic), the higher the chance that the bug will appear.

I haven't been able to reproduce this during synthetic tests,
but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
If anyone can provide a patch, I can test it on multiple machines
over the next few days.

Regards,
Marcin
Zhaoyang Huang May 21, 2024, 12:58 a.m. UTC | #8
On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> [...]
> I haven't been able to reproduce this during synthetic tests,
> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> If anyone can provide a patch, I can test it on multiple machines
> over the next few days.
Could you please try this one which could be applied on 6.6 directly. Thank you!
>
> Regards,
> Marcin
Zhaoyang Huang May 21, 2024, 1 a.m. UTC | #9
On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> [...]
> Could you please try this one which could be applied on 6.6 directly. Thank you!
URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@unisoc.com/

> >
> > Regards,
> > Marcin
Marcin Wanat May 21, 2024, 3:47 p.m. UTC | #10
On 21.05.2024 03:00, Zhaoyang Huang wrote:
> [...]
>> Could you please try this one which could be applied on 6.6 directly. Thank you!
> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@unisoc.com/
> 

Unfortunately, I am unable to cleanly apply this patch against the 
latest 6.6.31
Zhaoyang Huang May 22, 2024, 5:37 a.m. UTC | #11
On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@marcinwanat.pl> wrote:
> [...]
> Unfortunately, I am unable to cleanly apply this patch against the
> latest 6.6.31
Please try the one below, which works on my v6.6-based Android tree.
Thank you in advance for your testing :D

mm/huge_memory.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..5899906c326a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 {
 	struct folio *folio = page_folio(page);
 	struct page *head = &folio->page;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = folio_lruvec(folio);
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
@@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);
-
 	ClearPageHasHWPoisoned(head);
 
 	for (i = nr - 1; i >= 1; i--) {
@@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	unlock_page_lruvec(lruvec);
-	/* Caller disabled irqs, so they are still disabled here */
-
 	split_page_owner(head, nr);
 
 	/* See comment in __split_huge_page_tail() */
@@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-	local_irq_enable();
 
 	if (nr_dropped)
 		shmem_uncharge(head->mapping->host, nr_dropped);
@@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
+	struct lruvec *lruvec;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
+
+	/*
+	 * take lruvec's lock before freeze the folio to prevent the folio
+	 * remains in the page cache with refcnt == 0, which could lead to
+	 * find_get_entry enters livelock by iterating the xarray.
+	 */
+	lruvec = folio_lruvec_lock(folio);
+
 	if (mapping) {
 		/*
 		 * Check if the folio is present in page cache.
@@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		}
 
 		__split_huge_page(page, list, end);
+		unlock_page_lruvec(lruvec);
+		local_irq_enable();
 		ret = 0;
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
 		if (mapping)
 			xas_unlock(&xas);
+
+		unlock_page_lruvec(lruvec);
 		local_irq_enable();
 		remap_page(folio, folio_nr_pages(folio));
 		ret = -EAGAIN;
Marcin Wanat May 22, 2024, 10:13 a.m. UTC | #12
On 22.05.2024 07:37, Zhaoyang Huang wrote:
> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@marcinwanat.pl> wrote:
>>
>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>>>>
>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
>>>>>
>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
>>>>> (tested at least 10 different versions), this lockup always appears
>>>>> after 2-30 days, similar to the report in the original thread.
>>>>> The more load (for example, copying a lot of local files while
>>>>> serving 20Gbps traffic), the higher the chance that the bug will appear.
>>>>>
>>>>> I haven't been able to reproduce this during synthetic tests,
>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
>>>>> If anyone can provide a patch, I can test it on multiple machines
>>>>> over the next few days.
>>>> Could you please try this one which could be applied on 6.6 directly. Thank you!
>>> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@unisoc.com/
>>>
>>
>> Unfortunately, I am unable to cleanly apply this patch against the
>> latest 6.6.31
> Please try the one below, which works on my v6.6-based Android tree.
> Thank you in advance for your testing :D
> 
> mm/huge_memory.c | 22 ++++++++++++++--------
>   1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c

I have compiled 6.6.31 with this patch and will test it on multiple 
machines over the next 30 days. I will provide an update after 30 days 
if everything is fine or sooner if any of the hosts experience the same 
soft lockup again.
Marcin Wanat May 27, 2024, 8:22 a.m. UTC | #13
On 22.05.2024 12:13, Marcin Wanat wrote:
> [...]
> I have compiled 6.6.31 with this patch and will test it on multiple 
> machines over the next 30 days. I will provide an update after 30 days 
> if everything is fine or sooner if any of the hosts experience the same 
> soft lockup again.
> 

The first server running 6.6.31 with this patch hung today. The soft
lockup has changed to a hard lockup:

[26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
[26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit 
ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit 
nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables 
nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency 
intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200 
intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi 
intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si 
drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal 
ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr 
drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul 
crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio 
i40e libata megaraid_sas dca ghash_clmulni_intel wmi
[26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G 
      W          6.6.31.el9 #3
[26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS 
V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
[26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0 
a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3 
90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
[26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
[26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX: 
0000000000000000
[26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9ad6c6f67050
[26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09: 
0000000000000067
[26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000046
[26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15: 
ffffe1138aa98000
[26887.389707] FS:  0000000000000000(0000) GS:ffff9ade20340000(0000) 
knlGS:0000000000000000
[26887.389708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4: 
00000000007706e0
[26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[26887.389713] PKRU: 55555554
[26887.389714] Call Trace:
[26887.389717]  <NMI>
[26887.389720]  ? watchdog_hardlockup_check+0xac/0x150
[26887.389725]  ? __perf_event_overflow+0x102/0x1d0
[26887.389729]  ? handle_pmi_common+0x189/0x3e0
[26887.389735]  ? set_pte_vaddr_p4d+0x4a/0x60
[26887.389738]  ? flush_tlb_one_kernel+0xa/0x20
[26887.389742]  ? native_set_fixmap+0x65/0x80
[26887.389745]  ? ghes_copy_tofrom_phys+0x75/0x110
[26887.389751]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
[26887.389755]  ? intel_pmu_handle_irq+0x10b/0x230
[26887.389756]  ? perf_event_nmi_handler+0x28/0x50
[26887.389759]  ? nmi_handle+0x58/0x150
[26887.389764]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389768]  ? default_do_nmi+0x6b/0x170
[26887.389770]  ? exc_nmi+0x12c/0x1a0
[26887.389772]  ? end_repeat_nmi+0x16/0x1f
[26887.389777]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389780]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389784]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389787]  </NMI>
[26887.389788]  <TASK>
[26887.389789]  __raw_spin_lock_irqsave+0x3d/0x50
[26887.389793]  folio_lruvec_lock_irqsave+0x5e/0x90
[26887.389798]  __page_cache_release+0x68/0x230
[26887.389801]  ? remove_migration_ptes+0x5c/0x80
[26887.389807]  __folio_put+0x24/0x60
[26887.389808]  __split_huge_page+0x368/0x520
[26887.389812]  split_huge_page_to_list+0x4b3/0x570
[26887.389816]  deferred_split_scan+0x1c8/0x290
[26887.389819]  do_shrink_slab+0x12f/0x2d0
[26887.389824]  shrink_slab_memcg+0x133/0x1d0
[26887.389829]  shrink_node_memcgs+0x18e/0x1d0
[26887.389832]  shrink_node+0xa7/0x370
[26887.389836]  balance_pgdat+0x332/0x6f0
[26887.389842]  kswapd+0xf0/0x190
[26887.389845]  ? balance_pgdat+0x6f0/0x6f0
[26887.389848]  kthread+0xee/0x120
[26887.389851]  ? kthread_complete_and_exit+0x20/0x20
[26887.389853]  ret_from_fork+0x2d/0x50
[26887.389857]  ? kthread_complete_and_exit+0x20/0x20
[26887.389859]  ret_from_fork_asm+0x11/0x20
[26887.389864]  </TASK>
[26887.389865] Kernel panic - not syncing: Hard LOCKUP
[26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G 
      W          6.6.31.el9 #3
[26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS 
V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
[26887.389870] Call Trace:
[26887.389871]  <NMI>
[26887.389872]  dump_stack_lvl+0x44/0x60
[26887.389877]  panic+0x241/0x330
[26887.389881]  nmi_panic+0x2f/0x40
[26887.389883]  watchdog_hardlockup_check+0x119/0x150
[26887.389886]  __perf_event_overflow+0x102/0x1d0
[26887.389889]  handle_pmi_common+0x189/0x3e0
[26887.389893]  ? set_pte_vaddr_p4d+0x4a/0x60
[26887.389896]  ? flush_tlb_one_kernel+0xa/0x20
[26887.389899]  ? native_set_fixmap+0x65/0x80
[26887.389902]  ? ghes_copy_tofrom_phys+0x75/0x110
[26887.389906]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
[26887.389909]  intel_pmu_handle_irq+0x10b/0x230
[26887.389911]  perf_event_nmi_handler+0x28/0x50
[26887.389913]  nmi_handle+0x58/0x150
[26887.389916]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389920]  default_do_nmi+0x6b/0x170
[26887.389922]  exc_nmi+0x12c/0x1a0
[26887.389923]  end_repeat_nmi+0x16/0x1f
[26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0 
a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3 
90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
[26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
[26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX: 
0000000000000000
[26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
ffff9ad6c6f67050
[26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09: 
0000000000000067
[26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000046
[26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15: 
ffffe1138aa98000
[26887.389940]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389943]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
[26887.389946]  </NMI>
[26887.389947]  <TASK>
[26887.389947]  __raw_spin_lock_irqsave+0x3d/0x50
[26887.389950]  folio_lruvec_lock_irqsave+0x5e/0x90
[26887.389953]  __page_cache_release+0x68/0x230
[26887.389955]  ? remove_migration_ptes+0x5c/0x80
[26887.389958]  __folio_put+0x24/0x60
[26887.389960]  __split_huge_page+0x368/0x520
[26887.389963]  split_huge_page_to_list+0x4b3/0x570
[26887.389967]  deferred_split_scan+0x1c8/0x290
[26887.389971]  do_shrink_slab+0x12f/0x2d0
[26887.389974]  shrink_slab_memcg+0x133/0x1d0
[26887.389978]  shrink_node_memcgs+0x18e/0x1d0
[26887.389982]  shrink_node+0xa7/0x370
[26887.389985]  balance_pgdat+0x332/0x6f0
[26887.389991]  kswapd+0xf0/0x190
[26887.389994]  ? balance_pgdat+0x6f0/0x6f0
[26887.389997]  kthread+0xee/0x120
[26887.389998]  ? kthread_complete_and_exit+0x20/0x20
[26887.390000]  ret_from_fork+0x2d/0x50
[26887.390003]  ? kthread_complete_and_exit+0x20/0x20
[26887.390004]  ret_from_fork_asm+0x11/0x20
[26887.390009]  </TASK>
Zhaoyang Huang May 27, 2024, 8:53 a.m. UTC | #14
On Mon, May 27, 2024 at 4:22 PM Marcin Wanat <private@marcinwanat.pl> wrote:
> [...]
> The first server running 6.6.31 with this patch hung today. The soft
> lockup has changed to a hard lockup:
> [...]
> [26887.389960]  __split_huge_page+0x368/0x520
> [26887.389963]  split_huge_page_to_list+0x4b3/0x570
> [26887.389967]  deferred_split_scan+0x1c8/0x290
> [26887.389971]  do_shrink_slab+0x12f/0x2d0
> [26887.389974]  shrink_slab_memcg+0x133/0x1d0
> [26887.389978]  shrink_node_memcgs+0x18e/0x1d0
> [26887.389982]  shrink_node+0xa7/0x370
> [26887.389985]  balance_pgdat+0x332/0x6f0
> [26887.389991]  kswapd+0xf0/0x190
> [26887.389994]  ? balance_pgdat+0x6f0/0x6f0
> [26887.389997]  kthread+0xee/0x120
> [26887.389998]  ? kthread_complete_and_exit+0x20/0x20
> [26887.390000]  ret_from_fork+0x2d/0x50
> [26887.390003]  ? kthread_complete_and_exit+0x20/0x20
> [26887.390004]  ret_from_fork_asm+0x11/0x20
> [26887.390009]  </TASK>
>
OK, thanks for the information. That is likely caused by lock contention.
I will check the code and keep you posted.
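As an illustration of that suspicion, here is a minimal userspace sketch (the
lock and thread names are stand-ins, not the actual kernel paths): one thread
holds a spinlock across a long critical section, so a second thread spins on
the lock for the whole duration, which is the kind of wait a hardlockup
watchdog flags.

/* contention_sketch.c - hypothetical model of a long-held spinlock.
 * Build with: gcc -O2 -pthread contention_sketch.c -o contention_sketch
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_spinlock_t lru_lock;	/* stand-in for lruvec->lru_lock */

/* Models a path that holds the lock for a long time (e.g. a huge split). */
static void *holder(void *arg)
{
	pthread_spin_lock(&lru_lock);
	sleep(5);			/* long critical section */
	pthread_spin_unlock(&lru_lock);
	return NULL;
}

/* Models a path that only needs the lock briefly (e.g. a page release). */
static void *waiter(void *arg)
{
	time_t start = time(NULL);

	pthread_spin_lock(&lru_lock);	/* spins until the holder lets go */
	pthread_spin_unlock(&lru_lock);
	printf("waiter spun for ~%ld seconds\n", (long)(time(NULL) - start));
	return NULL;
}

int main(void)
{
	pthread_t h, w;

	pthread_spin_init(&lru_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&h, NULL, holder, NULL);
	usleep(100 * 1000);		/* let the holder take the lock first */
	pthread_create(&w, NULL, waiter, NULL);
	pthread_join(h, NULL);
	pthread_join(w, NULL);
	pthread_spin_destroy(&lru_lock);
	return 0;
}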
Yafang Shao May 30, 2024, 8:48 a.m. UTC | #15
On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
>
> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > looping in Dave, since he previously helped set up a reproducer in
> > > > https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > > > @Dave Chinner, I would like to ask for your kind help in verifying
> > > > this patch in your environment, if convenient. Thanks a lot.
> > >
> > > I don't have the test environment from 18 months ago available any
> > > more. Also, I haven't seen this problem since that specific test
> > > environment tripped over the issue. Hence I don't have any way of
> > > confirming that the problem is fixed, either, because first I'd have
> > > to reproduce it...
> >
> > Thanks for the information. I noticed that you reported another soft
> > lockup related to xas_load in Nov. 2023. This patch is supposed to be
> > helpful for that as well. With regard to the version timing, this
> > commit is actually a revert of <mm/thp: narrow lru locking>
> > b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before v5.15.
> >
> > To save your time, a brief description below. IMO, b6769834aa
> > introduces a potential stall between freezing the folio's refcnt and
> > storing it back to 2, which leaves the xas_load->folio_try_get_rcu
> > loop as a livelock if it stalls the lru_lock's holder.
> >
> > b6769834aa split_huge_page_to_list
> >     spin_lock(lru_lock)
> >     xas_split(&xas, folio, order)
> >     folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> >     spin_lock(lru_lock)
> >     xas_store(&xas, offset++, head+i)
> >     page_ref_add(head, 2)
> >     spin_unlock(lru_lock)
> >
> > Sorry in advance if the above doesn't make sense, I am just a
> > developer who is also suffering from this bug and trying to fix it.
> I am experiencing a similar error on dozens of hosts, with stack traces
> that are all similar:
>
> [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> [file_get:953301]
> [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> scsi_transport_sas i2c_algo_bit wmi
> [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> Tainted: G             L     6.6.30.el9 #2
> [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> 2.21.2 02/19/2024
> [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> 0000000000000020
> [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> ffffc90034a67990
> [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> 0000000000008820
> [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> ffffc90034a67a20
> [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> ffffc90034a67a18
> [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> knlGS:0000000000000000
> [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> 00000000007706e0
> [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [627163.727878] PKRU: 55555554
> [627163.727879] Call Trace:
> [627163.727882]  <IRQ>
> [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> [627163.727892]  ? softlockup_fn+0x70/0x70
> [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> [627163.727917]  </IRQ>
> [627163.727918]  <TASK>
> [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [627163.727927]  ? xas_descend+0x1b/0x70
> [627163.727930]  xas_load+0x2c/0x40
> [627163.727933]  xas_find+0x161/0x1a0
> [627163.727937]  find_get_entries+0x77/0x1d0
> [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> [627163.727950]  truncate_pagecache+0x44/0x60
> [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> [627163.728153]  notify_change+0x34f/0x4f0
> [627163.728158]  ? _raw_spin_lock+0x13/0x30
> [627163.728165]  ? do_truncate+0x80/0xd0
> [627163.728169]  do_truncate+0x80/0xd0
> [627163.728172]  do_open+0x2ce/0x400
> [627163.728177]  path_openat+0x10d/0x280
> [627163.728181]  do_filp_open+0xb2/0x150
> [627163.728186]  ? check_heap_object+0x34/0x190
> [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> [627163.728194]  do_sys_openat2+0x92/0xc0
> [627163.728197]  __x64_sys_openat+0x53/0x90
> [627163.728200]  do_syscall_64+0x35/0x80
> [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> [627163.728210] RIP: 0033:0x7fc5e493e7fb
> [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000101
> [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> 00007fc5e493e7fb
> [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> 00000000ffffff9c
> [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> 0000000000000001
> [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> 0000000000000241
> [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> 0000000000000000
> [627163.728227]  </TASK>
>
> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> However, with long-term kernels 6.1.XX and 6.6.XX,
> (tested at least 10 different versions), this lockup always appears
> after 2-30 days, similar to the report in the original thread.
> The more load (for example, copying a lot of local files while
> serving 20Gbps traffic), the higher the chance that the bug will appear.
>
> I haven't been able to reproduce this during synthetic tests,
> but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.

We encountered a similar issue several months ago. Some of our
production servers crashed within days after deploying the 6.1.y
stable kernel. The soft lockup info is as follows:

[282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
[container-execu:1572375]
[282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
overlay af_packet bonding intel_rapl_msr intel_rapl_common
64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
sd_mod t10_pi ahci libahci libata
[282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
loaded Tainted: G        W  O L     6.1.38-rc3 #rc3.pdd
[282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
[282879.612576] RIP: 0010:xas_descend+0x18/0x80
[282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
00 00
[282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
[282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
0000000000000006
[282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
ffffad700b247c68
[282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
fffffffffffffffe
[282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
ffffad700b247cf8
[282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
ffffdfcd2c778000
[282879.612594] FS:  00007f5f576fb740(0000) GS:ffff922df0840000(0000)
knlGS:0000000000000000
[282879.612596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
0000000000350ee0
[282879.612599] Call Trace:
[282879.612601]  <IRQ>
[282879.612605]  ? show_regs.cold+0x1a/0x1f
[282879.612610]  ? watchdog_timer_fn+0x1c4/0x220
[282879.612614]  ? softlockup_fn+0x30/0x30
[282879.612616]  ? __hrtimer_run_queues+0xa2/0x2b0
[282879.612620]  ? hrtimer_interrupt+0x109/0x220
[282879.612622]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
[282879.612625]  ? sysvec_apic_timer_interrupt+0x7b/0x90
[282879.612629]  </IRQ>
[282879.612630]  <TASK>
[282879.612631]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[282879.612640]  ? xas_descend+0x18/0x80
[282879.612641]  ? xas_load+0x35/0x40
[282879.612643]  xas_find+0x197/0x1d0
[282879.612645]  find_get_entries+0x6e/0x170
[282879.612649]  truncate_inode_pages_range+0x294/0x4c0
[282879.612655]  ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
[282879.612787]  ? kvfree+0x2c/0x40
[282879.612791]  ? trace_hardirqs_off+0x36/0xf0
[282879.612795]  truncate_inode_pages_final+0x44/0x50
[282879.612798]  evict+0x177/0x190
[282879.612802]  iput.part.0+0x183/0x1e0
[282879.612804]  iput+0x1c/0x30
[282879.612806]  do_unlinkat+0x1c7/0x2c0
[282879.612810]  __x64_sys_unlinkat+0x38/0x70
[282879.612812]  do_syscall_64+0x38/0x90
[282879.612815]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[282879.612818] RIP: 0033:0x7f5f56cf120d
[282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
89 02
[282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
0000000000000107
[282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
00007f5f56cf120d
[282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
0000000000000003
[282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
0000000001640403
[282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
0000000000000003
[282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
0000000000000000
[282879.612836]  </TASK>


Unfortunately, we couldn't reproduce the issue on our test servers. We
worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
production servers have been running smoothly for several months.

> If anyone can provide a patch, I can test it on multiple machines
> over the next few days.
>


--
Regards
Yafang
Zhaoyang Huang May 30, 2024, 8:57 a.m. UTC | #16
On Thu, May 30, 2024 at 4:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> >
> > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > > looping in Dave, since he previously helped set up a reproducer in
> > > > > https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > > > > @Dave Chinner, I would like to ask for your kind help in verifying
> > > > > this patch in your environment, if convenient. Thanks a lot.
> > > >
> > > > I don't have the test environment from 18 months ago available any
> > > > more. Also, I haven't seen this problem since that specific test
> > > > environment tripped over the issue. Hence I don't have any way of
> > > > confirming that the problem is fixed, either, because first I'd have
> > > > to reproduce it...
> > >
> > > Thanks for the information. I noticed that you reported another soft
> > > lockup related to xas_load in Nov. 2023. This patch is supposed to be
> > > helpful for that as well. With regard to the version timing, this
> > > commit is actually a revert of <mm/thp: narrow lru locking>
> > > b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before v5.15.
> > >
> > > To save your time, a brief description below. IMO, b6769834aa
> > > introduces a potential stall between freezing the folio's refcnt and
> > > storing it back to 2, which leaves the xas_load->folio_try_get_rcu
> > > loop as a livelock if it stalls the lru_lock's holder.
> > >
> > > b6769834aa split_huge_page_to_list
> > >     spin_lock(lru_lock)
> > >     xas_split(&xas, folio, order)
> > >     folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> > >     spin_lock(lru_lock)
> > >     xas_store(&xas, offset++, head+i)
> > >     page_ref_add(head, 2)
> > >     spin_unlock(lru_lock)
> > >
> > > Sorry in advance if the above doesn't make sense, I am just a
> > > developer who is also suffering from this bug and trying to fix it.
> > I am experiencing a similar error on dozens of hosts, with stack traces
> > that are all similar:
> >
> > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > [file_get:953301]
> > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > scsi_transport_sas i2c_algo_bit wmi
> > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > Tainted: G             L     6.6.30.el9 #2
> > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > 2.21.2 02/19/2024
> > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > 0000000000000020
> > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > ffffc90034a67990
> > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > 0000000000008820
> > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > ffffc90034a67a20
> > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > ffffc90034a67a18
> > [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > knlGS:0000000000000000
> > [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > 00000000007706e0
> > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [627163.727878] PKRU: 55555554
> > [627163.727879] Call Trace:
> > [627163.727882]  <IRQ>
> > [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> > [627163.727892]  ? softlockup_fn+0x70/0x70
> > [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> > [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> > [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> > [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > [627163.727917]  </IRQ>
> > [627163.727918]  <TASK>
> > [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [627163.727927]  ? xas_descend+0x1b/0x70
> > [627163.727930]  xas_load+0x2c/0x40
> > [627163.727933]  xas_find+0x161/0x1a0
> > [627163.727937]  find_get_entries+0x77/0x1d0
> > [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> > [627163.727950]  truncate_pagecache+0x44/0x60
> > [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> > [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> > [627163.728153]  notify_change+0x34f/0x4f0
> > [627163.728158]  ? _raw_spin_lock+0x13/0x30
> > [627163.728165]  ? do_truncate+0x80/0xd0
> > [627163.728169]  do_truncate+0x80/0xd0
> > [627163.728172]  do_open+0x2ce/0x400
> > [627163.728177]  path_openat+0x10d/0x280
> > [627163.728181]  do_filp_open+0xb2/0x150
> > [627163.728186]  ? check_heap_object+0x34/0x190
> > [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> > [627163.728194]  do_sys_openat2+0x92/0xc0
> > [627163.728197]  __x64_sys_openat+0x53/0x90
> > [627163.728200]  do_syscall_64+0x35/0x80
> > [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000101
> > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > 00007fc5e493e7fb
> > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > 00000000ffffff9c
> > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > 0000000000000001
> > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > 0000000000000241
> > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > 0000000000000000
> > [627163.728227]  </TASK>
> >
> > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > However, with long-term kernels 6.1.XX and 6.6.XX,
> > (tested at least 10 different versions), this lockup always appears
> > after 2-30 days, similar to the report in the original thread.
> > The more load (for example, copying a lot of local files while
> > serving 20Gbps traffic), the higher the chance that the bug will appear.
> >
> > I haven't been able to reproduce this during synthetic tests,
> > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
>
> We encountered a similar issue several months ago. Some of our
> production servers crashed within days after deploying the 6.1.y
> stable kernel. The soft lock info as follows,
>
> [282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
> [container-execu:1572375]
> [282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
> iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
> unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
> xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
> overlay af_packet bonding intel_rapl_msr intel_rapl_common
> 64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
> polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
> aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
> ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
> acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
> ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
> psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
> sd_mod t10_pi ahci libahci libata
> [282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
> loaded Tainted: G        W  O L     6.1.38-rc3 #rc3.pdd
> [282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
> UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
> [282879.612576] RIP: 0010:xas_descend+0x18/0x80
> [282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
> cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
> 44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
> 00 00
> [282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
> [282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
> 0000000000000006
> [282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
> ffffad700b247c68
> [282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
> fffffffffffffffe
> [282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
> ffffad700b247cf8
> [282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
> ffffdfcd2c778000
> [282879.612594] FS:  00007f5f576fb740(0000) GS:ffff922df0840000(0000)
> knlGS:0000000000000000
> [282879.612596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
> 0000000000350ee0
> [282879.612599] Call Trace:
> [282879.612601]  <IRQ>
> [282879.612605]  ? show_regs.cold+0x1a/0x1f
> [282879.612610]  ? watchdog_timer_fn+0x1c4/0x220
> [282879.612614]  ? softlockup_fn+0x30/0x30
> [282879.612616]  ? __hrtimer_run_queues+0xa2/0x2b0
> [282879.612620]  ? hrtimer_interrupt+0x109/0x220
> [282879.612622]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
> [282879.612625]  ? sysvec_apic_timer_interrupt+0x7b/0x90
> [282879.612629]  </IRQ>
> [282879.612630]  <TASK>
> [282879.612631]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [282879.612640]  ? xas_descend+0x18/0x80
> [282879.612641]  ? xas_load+0x35/0x40
> [282879.612643]  xas_find+0x197/0x1d0
> [282879.612645]  find_get_entries+0x6e/0x170
> [282879.612649]  truncate_inode_pages_range+0x294/0x4c0
> [282879.612655]  ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
> [282879.612787]  ? kvfree+0x2c/0x40
> [282879.612791]  ? trace_hardirqs_off+0x36/0xf0
> [282879.612795]  truncate_inode_pages_final+0x44/0x50
> [282879.612798]  evict+0x177/0x190
> [282879.612802]  iput.part.0+0x183/0x1e0
> [282879.612804]  iput+0x1c/0x30
> [282879.612806]  do_unlinkat+0x1c7/0x2c0
> [282879.612810]  __x64_sys_unlinkat+0x38/0x70
> [282879.612812]  do_syscall_64+0x38/0x90
> [282879.612815]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [282879.612818] RIP: 0033:0x7f5f56cf120d
> [282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
> 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
> 89 02
> [282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000107
> [282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> 00007f5f56cf120d
> [282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
> 0000000000000003
> [282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
> 0000000001640403
> [282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
> 0000000000000003
> [282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
> 0000000000000000
> [282879.612836]  </TASK>
>
>
> Unfortunately, we couldn't reproduce the issue on our test servers. We
> worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
> production servers have been running smoothly for several months.
>
> > If anyone can provide a patch, I can test it on multiple machines
> > over the next few days.
It would be highly appreciated if you could help try the patch below, which
works on my v6.6-based Android device. However, a hard lockup has been
reported on an ongoing regression test (it is not yet clear whether it is
caused by this patch). Thank you!

 mm/huge_memory.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..5899906c326a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 {
 	struct folio *folio = page_folio(page);
 	struct page *head = &folio->page;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = folio_lruvec(folio);
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
@@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}

-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);
-
 	ClearPageHasHWPoisoned(head);

 	for (i = nr - 1; i >= 1; i--) {
@@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}

 	ClearPageCompound(head);
-	unlock_page_lruvec(lruvec);
-	/* Caller disabled irqs, so they are still disabled here */
-
 	split_page_owner(head, nr);

 	/* See comment in __split_huge_page_tail() */
@@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-	local_irq_enable();

 	if (nr_dropped)
 		shmem_uncharge(head->mapping->host, nr_dropped);
@@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
+	struct lruvec *lruvec;

 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)

 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
+
+	/*
+	 * take lruvec's lock before freeze the folio to prevent the folio
+	 * remains in the page cache with refcnt == 0, which could lead to
+	 * find_get_entry enters livelock by iterating the xarray.
+	 */
+	lruvec = folio_lruvec_lock(folio);
+
 	if (mapping) {
 		/*
 		 * Check if the folio is present in page cache.
@@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		}

 		__split_huge_page(page, list, end);

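To make the comment in the last hunk concrete, here is a minimal userspace
sketch of the window it talks about. All names are illustrative stand-ins:
an atomic counter models the folio refcount, try_get() models
folio_try_get_rcu(), and the reader loop models find_get_entry() retrying
its xarray walk. While the count is frozen at 0 and the freezing side is
stalled, the lockless reader keeps finding the same entry and retries
without ever making progress:

/* livelock_sketch.c - hypothetical model of the zero-refcount window.
 * Build with: gcc -O2 -pthread -std=c11 livelock_sketch.c -o livelock_sketch
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int refcount = 2;		/* stand-in for folio->_refcount */
static atomic_bool unfrozen = false;

/* Like folio_try_get_rcu(): succeed only if the count is not zero. */
static bool try_get(void)
{
	int old = atomic_load(&refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&refcount, &old, old + 1))
			return true;
	}
	return false;			/* frozen: the caller retries the lookup */
}

/* Models find_get_entry(): retry the walk until the try-get succeeds. */
static void *reader(void *arg)
{
	unsigned long retries = 0;

	while (!try_get()) {
		retries++;		/* xas_reset() + xas_load() + retry */
		if (atomic_load(&unfrozen))
			break;
	}
	printf("reader retried %lu times while the count was frozen\n", retries);
	return NULL;
}

/* Models the split path: freeze the count, stall, then restore it. */
static void *splitter(void *arg)
{
	int expected = 2;

	atomic_compare_exchange_strong(&refcount, &expected, 0);	/* page_ref_freeze() */
	sleep(1);			/* stalled, e.g. waiting on a contended lock */
	atomic_store(&refcount, 2);	/* page_ref_add(head, 2) */
	atomic_store(&unfrozen, true);
	return NULL;
}

int main(void)
{
	pthread_t r, s;

	pthread_create(&s, NULL, splitter, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(s, NULL);
	pthread_join(r, NULL);
	return 0;
}

The patch above tries to close that window by taking lruvec->lru_lock before
the refcount is frozen, so the split path can no longer be left stalled on
that lock while the folio sits in the page cache with refcnt == 0.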
> >
>
>
> --
> Regards
> Yafang
Yafang Shao May 30, 2024, 9:24 a.m. UTC | #17
On Thu, May 30, 2024 at 4:57 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Thu, May 30, 2024 at 4:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> > >
> > > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > > > looping in Dave, since he previously helped set up a reproducer in
> > > > > > https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > > > > > @Dave Chinner, I would like to ask for your kind help in verifying
> > > > > > this patch in your environment, if convenient. Thanks a lot.
> > > > >
> > > > > I don't have the test environment from 18 months ago available any
> > > > > more. Also, I haven't seen this problem since that specific test
> > > > > environment tripped over the issue. Hence I don't have any way of
> > > > > confirming that the problem is fixed, either, because first I'd have
> > > > > to reproduce it...
> > > >
> > > > Thanks for the information. I noticed that you reported another soft
> > > > lockup related to xas_load in Nov. 2023. This patch is supposed to be
> > > > helpful for that as well. With regard to the version timing, this
> > > > commit is actually a revert of <mm/thp: narrow lru locking>
> > > > b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before v5.15.
> > > >
> > > > To save your time, a brief description below. IMO, b6769834aa
> > > > introduces a potential stall between freezing the folio's refcnt and
> > > > storing it back to 2, which leaves the xas_load->folio_try_get_rcu
> > > > loop as a livelock if it stalls the lru_lock's holder.
> > > >
> > > > b6769834aa split_huge_page_to_list
> > > >     spin_lock(lru_lock)
> > > >     xas_split(&xas, folio, order)
> > > >     folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> > > >     spin_lock(lru_lock)
> > > >     xas_store(&xas, offset++, head+i)
> > > >     page_ref_add(head, 2)
> > > >     spin_unlock(lru_lock)
> > > >
> > > > Sorry in advance if the above doesn't make sense, I am just a
> > > > developer who is also suffering from this bug and trying to fix it.
> > > I am experiencing a similar error on dozens of hosts, with stack traces
> > > that are all similar:
> > >
> > > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > > [file_get:953301]
> > > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > > scsi_transport_sas i2c_algo_bit wmi
> > > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > > Tainted: G             L     6.6.30.el9 #2
> > > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > > 2.21.2 02/19/2024
> > > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > > 0000000000000020
> > > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > > ffffc90034a67990
> > > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > > 0000000000008820
> > > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > > ffffc90034a67a20
> > > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > > ffffc90034a67a18
> > > [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > > knlGS:0000000000000000
> > > [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > > 00000000007706e0
> > > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > 0000000000000400
> > > [627163.727878] PKRU: 55555554
> > > [627163.727879] Call Trace:
> > > [627163.727882]  <IRQ>
> > > [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> > > [627163.727892]  ? softlockup_fn+0x70/0x70
> > > [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> > > [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> > > [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> > > [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > > [627163.727917]  </IRQ>
> > > [627163.727918]  <TASK>
> > > [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > [627163.727927]  ? xas_descend+0x1b/0x70
> > > [627163.727930]  xas_load+0x2c/0x40
> > > [627163.727933]  xas_find+0x161/0x1a0
> > > [627163.727937]  find_get_entries+0x77/0x1d0
> > > [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> > > [627163.727950]  truncate_pagecache+0x44/0x60
> > > [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> > > [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> > > [627163.728153]  notify_change+0x34f/0x4f0
> > > [627163.728158]  ? _raw_spin_lock+0x13/0x30
> > > [627163.728165]  ? do_truncate+0x80/0xd0
> > > [627163.728169]  do_truncate+0x80/0xd0
> > > [627163.728172]  do_open+0x2ce/0x400
> > > [627163.728177]  path_openat+0x10d/0x280
> > > [627163.728181]  do_filp_open+0xb2/0x150
> > > [627163.728186]  ? check_heap_object+0x34/0x190
> > > [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> > > [627163.728194]  do_sys_openat2+0x92/0xc0
> > > [627163.728197]  __x64_sys_openat+0x53/0x90
> > > [627163.728200]  do_syscall_64+0x35/0x80
> > > [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > > 0000000000000101
> > > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > > 00007fc5e493e7fb
> > > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > > 00000000ffffff9c
> > > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > > 0000000000000001
> > > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > > 0000000000000241
> > > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > > 0000000000000000
> > > [627163.728227]  </TASK>
> > >
> > > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > > However, with long-term kernels 6.1.XX and 6.6.XX,
> > > (tested at least 10 different versions), this lockup always appears
> > > after 2-30 days, similar to the report in the original thread.
> > > The more load (for example, copying a lot of local files while
> > > serving 20Gbps traffic), the higher the chance that the bug will appear.
> > >
> > > I haven't been able to reproduce this during synthetic tests,
> > > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> >
> > We encountered a similar issue several months ago. Some of our
> > production servers crashed within days after deploying the 6.1.y
> > stable kernel. The soft lock info as follows,
> >
> > [282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
> > [container-execu:1572375]
> > [282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
> > iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
> > unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
> > xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
> > nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
> > overlay af_packet bonding intel_rapl_msr intel_rapl_common
> > 64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
> > polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
> > aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
> > ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
> > acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
> > ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
> > psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
> > sd_mod t10_pi ahci libahci libata
> > [282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
> > loaded Tainted: G        W  O L     6.1.38-rc3 #rc3.pdd
> > [282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
> > UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
> > [282879.612576] RIP: 0010:xas_descend+0x18/0x80
> > [282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
> > cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
> > 44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
> > 00 00
> > [282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
> > [282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
> > 0000000000000006
> > [282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
> > ffffad700b247c68
> > [282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
> > fffffffffffffffe
> > [282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
> > ffffad700b247cf8
> > [282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
> > ffffdfcd2c778000
> > [282879.612594] FS:  00007f5f576fb740(0000) GS:ffff922df0840000(0000)
> > knlGS:0000000000000000
> > [282879.612596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
> > 0000000000350ee0
> > [282879.612599] Call Trace:
> > [282879.612601]  <IRQ>
> > [282879.612605]  ? show_regs.cold+0x1a/0x1f
> > [282879.612610]  ? watchdog_timer_fn+0x1c4/0x220
> > [282879.612614]  ? softlockup_fn+0x30/0x30
> > [282879.612616]  ? __hrtimer_run_queues+0xa2/0x2b0
> > [282879.612620]  ? hrtimer_interrupt+0x109/0x220
> > [282879.612622]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
> > [282879.612625]  ? sysvec_apic_timer_interrupt+0x7b/0x90
> > [282879.612629]  </IRQ>
> > [282879.612630]  <TASK>
> > [282879.612631]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> > [282879.612640]  ? xas_descend+0x18/0x80
> > [282879.612641]  ? xas_load+0x35/0x40
> > [282879.612643]  xas_find+0x197/0x1d0
> > [282879.612645]  find_get_entries+0x6e/0x170
> > [282879.612649]  truncate_inode_pages_range+0x294/0x4c0
> > [282879.612655]  ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
> > [282879.612787]  ? kvfree+0x2c/0x40
> > [282879.612791]  ? trace_hardirqs_off+0x36/0xf0
> > [282879.612795]  truncate_inode_pages_final+0x44/0x50
> > [282879.612798]  evict+0x177/0x190
> > [282879.612802]  iput.part.0+0x183/0x1e0
> > [282879.612804]  iput+0x1c/0x30
> > [282879.612806]  do_unlinkat+0x1c7/0x2c0
> > [282879.612810]  __x64_sys_unlinkat+0x38/0x70
> > [282879.612812]  do_syscall_64+0x38/0x90
> > [282879.612815]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [282879.612818] RIP: 0033:0x7f5f56cf120d
> > [282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
> > 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
> > 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
> > 89 02
> > [282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
> > 0000000000000107
> > [282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> > 00007f5f56cf120d
> > [282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
> > 0000000000000003
> > [282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
> > 0000000001640403
> > [282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
> > 0000000000000003
> > [282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
> > 0000000000000000
> > [282879.612836]  </TASK>
> >
> >
> > Unfortunately, we couldn't reproduce the issue on our test servers. We
> > worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
> > production servers have been running smoothly for several months.
> >
> > > If anyone can provide a patch, I can test it on multiple machines
> > > over the next few days.
> It is highly appreciated that you could help to try below one which
> works on my v6.6 based android. However, there is a hard lockup
> reported on an ongoing regression test(not sure if caused by this
> patch yet). Thank you!

I'm sorry to inform you that our users are unwilling to experiment
with these changes on our production servers again, and I am unable to
reproduce the issue on our test servers. I am reporting this issue to
highlight to the community that it is indeed a serious problem, and we
should consider it carefully.

>
> mm/huge_memory.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 064fbd90822b..5899906c326a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page,
> struct list_head *list,
>  {
>   struct folio *folio = page_folio(page);
>   struct page *head = &folio->page;
> - struct lruvec *lruvec;
> + struct lruvec *lruvec = folio_lruvec(folio);
>   struct address_space *swap_cache = NULL;
>   unsigned long offset = 0;
>   unsigned int nr = thp_nr_pages(head);
> @@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page,
> struct list_head *list,
>   xa_lock(&swap_cache->i_pages);
>   }
>
> - /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> - lruvec = folio_lruvec_lock(folio);
> -
>   ClearPageHasHWPoisoned(head);
>
>   for (i = nr - 1; i >= 1; i--) {
> @@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page,
> struct list_head *list,
>   }
>
>   ClearPageCompound(head);
> - unlock_page_lruvec(lruvec);
> - /* Caller disabled irqs, so they are still disabled here */
> -
>   split_page_owner(head, nr);
>
>   /* See comment in __split_huge_page_tail() */
> @@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page,
> struct list_head *list,
>   page_ref_add(head, 2);
>   xa_unlock(&head->mapping->i_pages);
>   }
> - local_irq_enable();
>
>   if (nr_dropped)
>   shmem_uncharge(head->mapping->host, nr_dropped);
> @@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page,
> struct list_head *list)
>   int extra_pins, ret;
>   pgoff_t end;
>   bool is_hzp;
> + struct lruvec *lruvec;
>
>   VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>   VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
> @@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page,
> struct list_head *list)
>
>   /* block interrupt reentry in xa_lock and spinlock */
>   local_irq_disable();
> +
> + /*
> + * take lruvec's lock before freeze the folio to prevent the folio
> + * remains in the page cache with refcnt == 0, which could lead to
> + * find_get_entry enters livelock by iterating the xarray.
> + */
> + lruvec = folio_lruvec_lock(folio);
> +
>   if (mapping) {
>   /*
>   * Check if the folio is present in page cache.
> @@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page,
> struct list_head *list)
>   }
>
>   __split_huge_page(page, list, end);
>
> > >
> >
> >
> > --
> > Regards
> > Yafang
Zhaoyang Huang May 31, 2024, 6:17 a.m. UTC | #18
On Thu, May 30, 2024 at 5:24 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, May 30, 2024 at 4:57 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >
> > On Thu, May 30, 2024 at 4:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@marcinwanat.pl> wrote:
> > > >
> > > > On 15.04.2024 03:50, Zhaoyang Huang wrote:
> > > > > On Mon, Apr 15, 2024 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > >
> > > > > > On Sat, Apr 13, 2024 at 10:01:27AM +0800, Zhaoyang Huang wrote:
> > > > > > > looping in Dave, since he previously helped set up a reproducer in
> > > > > > > https://lore.kernel.org/linux-mm/20221101071721.GV2703033@dread.disaster.area/
> > > > > > > @Dave Chinner, I would like to ask for your kind help in verifying
> > > > > > > this patch in your environment, if convenient. Thanks a lot.
> > > > > >
> > > > > > I don't have the test environment from 18 months ago available any
> > > > > > more. Also, I haven't seen this problem since that specific test
> > > > > > environment tripped over the issue. Hence I don't have any way of
> > > > > > confirming that the problem is fixed, either, because first I'd have
> > > > > > to reproduce it...
> > > > >
> > > > > Thanks for the information. I noticed that you reported another soft
> > > > > lockup related to xas_load in Nov. 2023. This patch is supposed to be
> > > > > helpful for that as well. With regard to the version timing, this
> > > > > commit is actually a revert of <mm/thp: narrow lru locking>
> > > > > b6769834aac1d467fa1c71277d15688efcbb4d76, which was merged before v5.15.
> > > > >
> > > > > To save your time, a brief description below. IMO, b6769834aa
> > > > > introduces a potential stall between freezing the folio's refcnt and
> > > > > storing it back to 2, which leaves the xas_load->folio_try_get_rcu
> > > > > loop as a livelock if it stalls the lru_lock's holder.
> > > > >
> > > > > b6769834aa split_huge_page_to_list
> > > > >     spin_lock(lru_lock)
> > > > >     xas_split(&xas, folio, order)
> > > > >     folio_refcnt_freeze(folio, 1 + folio_nr_pages(folio))
> > > > >     spin_lock(lru_lock)
> > > > >     xas_store(&xas, offset++, head+i)
> > > > >     page_ref_add(head, 2)
> > > > >     spin_unlock(lru_lock)
> > > > >
> > > > > Sorry in advance if the above doesn't make sense, I am just a
> > > > > developer who is also suffering from this bug and trying to fix it.
> > > > I am experiencing a similar error on dozens of hosts, with stack traces
> > > > that are all similar:
> > > >
> > > > [627163.727746] watchdog: BUG: soft lockup - CPU#77 stuck for 22s!
> > > > [file_get:953301]
> > > > [627163.727778] Modules linked in: xt_set ip_set_hash_net ip_set xt_CT
> > > > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat
> > > > nf_tables nfnetlink sr_mod cdrom rfkill vfat fat intel_rapl_msr
> > > > intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
> > > > isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
> > > > intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass mlx5_ib rapl
> > > > iTCO_wdt intel_cstate intel_pmc_bxt ib_uverbs iTCO_vendor_support
> > > > dell_smbios dcdbas i2c_i801 intel_uncore uas ses mei_me ib_core
> > > > dell_wmi_descriptor wmi_bmof pcspkr enclosure lpc_ich usb_storage
> > > > i2c_smbus acpi_ipmi mei intel_pch_thermal ipmi_si ipmi_devintf
> > > > ipmi_msghandler acpi_power_meter joydev tcp_bbr fuse xfs libcrc32c raid1
> > > > sd_mod sg mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel
> > > > polyval_clmulni mgag200 polyval_generic drm_kms_helper mlxfw
> > > > drm_shmem_helper ahci nvme mpt3sas tls libahci ghash_clmulni_intel
> > > > nvme_core psample drm igb t10_pi raid_class pci_hyperv_intf dca libata
> > > > scsi_transport_sas i2c_algo_bit wmi
> > > > [627163.727841] CPU: 77 PID: 953301 Comm: file_get Kdump: loaded
> > > > Tainted: G             L     6.6.30.el9 #2
> > > > [627163.727844] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS
> > > > 2.21.2 02/19/2024
> > > > [627163.727847] RIP: 0010:xas_descend+0x1b/0x70
> > > > [627163.727857] Code: 57 10 48 89 07 48 c1 e8 20 48 89 57 08 c3 cc 0f b6
> > > > 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 48 8b 44 c6
> > > > 08 <48> 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0a 88 4f 12 c3 48 83
> > > > [627163.727859] RSP: 0018:ffffc90034a67978 EFLAGS: 00000206
> > > > [627163.727861] RAX: ffff888e4f971242 RBX: ffffc90034a67a98 RCX:
> > > > 0000000000000020
> > > > [627163.727863] RDX: 0000000000000002 RSI: ffff88a454546d80 RDI:
> > > > ffffc90034a67990
> > > > [627163.727865] RBP: fffffffffffffffe R08: fffffffffffffffe R09:
> > > > 0000000000008820
> > > > [627163.727867] R10: 0000000000008820 R11: 0000000000000000 R12:
> > > > ffffc90034a67a20
> > > > [627163.727868] R13: ffffc90034a67a18 R14: ffffea00873e8000 R15:
> > > > ffffc90034a67a18
> > > > [627163.727870] FS:  00007fc5e503b740(0000) GS:ffff88bfefd80000(0000)
> > > > knlGS:0000000000000000
> > > > [627163.727871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [627163.727873] CR2: 000000005fb87b6e CR3: 00000022875e8006 CR4:
> > > > 00000000007706e0
> > > > [627163.727875] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [627163.727876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [627163.727878] PKRU: 55555554
> > > > [627163.727879] Call Trace:
> > > > [627163.727882]  <IRQ>
> > > > [627163.727886]  ? watchdog_timer_fn+0x22a/0x2a0
> > > > [627163.727892]  ? softlockup_fn+0x70/0x70
> > > > [627163.727895]  ? __hrtimer_run_queues+0x10f/0x2a0
> > > > [627163.727903]  ? hrtimer_interrupt+0x106/0x240
> > > > [627163.727906]  ? __sysvec_apic_timer_interrupt+0x68/0x170
> > > > [627163.727913]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
> > > > [627163.727917]  </IRQ>
> > > > [627163.727918]  <TASK>
> > > > [627163.727920]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [627163.727927]  ? xas_descend+0x1b/0x70
> > > > [627163.727930]  xas_load+0x2c/0x40
> > > > [627163.727933]  xas_find+0x161/0x1a0
> > > > [627163.727937]  find_get_entries+0x77/0x1d0
> > > > [627163.727944]  truncate_inode_pages_range+0x244/0x3f0
> > > > [627163.727950]  truncate_pagecache+0x44/0x60
> > > > [627163.727955]  xfs_setattr_size+0x168/0x490 [xfs]
> > > > [627163.728074]  xfs_vn_setattr+0x78/0x140 [xfs]
> > > > [627163.728153]  notify_change+0x34f/0x4f0
> > > > [627163.728158]  ? _raw_spin_lock+0x13/0x30
> > > > [627163.728165]  ? do_truncate+0x80/0xd0
> > > > [627163.728169]  do_truncate+0x80/0xd0
> > > > [627163.728172]  do_open+0x2ce/0x400
> > > > [627163.728177]  path_openat+0x10d/0x280
> > > > [627163.728181]  do_filp_open+0xb2/0x150
> > > > [627163.728186]  ? check_heap_object+0x34/0x190
> > > > [627163.728189]  ? __check_object_size.part.0+0x5a/0x130
> > > > [627163.728194]  do_sys_openat2+0x92/0xc0
> > > > [627163.728197]  __x64_sys_openat+0x53/0x90
> > > > [627163.728200]  do_syscall_64+0x35/0x80
> > > > [627163.728206]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
> > > > [627163.728210] RIP: 0033:0x7fc5e493e7fb
> > > > [627163.728213] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18
> > > > 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
> > > > 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
> > > > [627163.728215] RSP: 002b:00007ffdd4e300e0 EFLAGS: 00000246 ORIG_RAX:
> > > > 0000000000000101
> > > > [627163.728218] RAX: ffffffffffffffda RBX: 00007ffdd4e30180 RCX:
> > > > 00007fc5e493e7fb
> > > > [627163.728220] RDX: 0000000000000241 RSI: 00007ffdd4e30180 RDI:
> > > > 00000000ffffff9c
> > > > [627163.728221] RBP: 00007ffdd4e30180 R08: 00007fc5e4600040 R09:
> > > > 0000000000000001
> > > > [627163.728223] R10: 00000000000001b6 R11: 0000000000000246 R12:
> > > > 0000000000000241
> > > > [627163.728224] R13: 0000000000000000 R14: 00007fc5e4662fa8 R15:
> > > > 0000000000000000
> > > > [627163.728227]  </TASK>
> > > >
> > > > I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> > > > and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> > > > kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT affected.
> > > > However, with long-term kernels 6.1.XX and 6.6.XX,
> > > > (tested at least 10 different versions), this lockup always appears
> > > > after 2-30 days, similar to the report in the original thread.
> > > > The more load (for example, copying a lot of local files while
> > > > serving 20Gbps traffic), the higher the chance that the bug will appear.
> > > >
> > > > I haven't been able to reproduce this during synthetic tests,
> > > > but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
> > >
> > > We encountered a similar issue several months ago. Some of our
> > > production servers crashed within days after deploying the 6.1.y
> > > stable kernel. The soft lockup info is as follows:
> > >
> > > [282879.612238] watchdog: BUG: soft lockup - CPU#65 stuck for 101s!
> > > [container-execu:1572375]
> > > [282879.612513] Modules linked in: ebtable_filter ebtables xt_DSCP
> > > iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress raw_diag
> > > unix_diag tcp_diag udp_diag inet_diag iptable_filter bpfilter
> > > xt_conntrack nf_nat nf_conntrack_netlink nfnetlink nf_conntrack
> > > nf_defrag_ipv6 nf_defrag_ipv4 bpf_preload binfmt_misc cuse fuse
> > > overlay af_packet bonding intel_rapl_msr intel_rapl_common
> > > 64_edac kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
> > > polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3
> > > aesni_intel crypto_simd cryptd rapl pcspkr vfat fat xfs mlx5_ib(O)
> > > ib_uverbs(O) input_leds ib_core(O) sg ccp ptdma i2c_piix4 k10temp
> > > acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq ip_tables
> > > ext4 mbcache crc32c_intel jbd2 mlx5_core(O) mlxfw(O) pci_hyperv_intf
> > > psample mlxdevm(O) mlx_compat(O) tls nvme ptp pps_core nvme_core
> > > sd_mod t10_pi ahci libahci libata
> > > [282879.612571] CPU: 65 PID: 1572375 Comm: container-execu Kdump:
> > > loaded Tainted: G        W  O L     6.1.38-rc3 #rc3.pdd
> > > [282879.612574] Hardware name: New H3C Technologies Co., Ltd. H3C
> > > UniServer R4950 G5/RS45M2C9S, BIOS 5.30 06/30/2021
> > > [282879.612576] RIP: 0010:xas_descend+0x18/0x80
> > > [282879.612583] Code: b6 e8 ec de 05 00 cc cc cc cc cc cc cc cc cc cc
> > > cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b
> > > 44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00
> > > 00 00
> > > [282879.612586] RSP: 0018:ffffad700b247c40 EFLAGS: 00000202
> > > [282879.612588] RAX: ffff91d247a75d8a RBX: fffffffffffffffe RCX:
> > > 0000000000000006
> > > [282879.612589] RDX: 0000000000000026 RSI: ffff91d473cb7b30 RDI:
> > > ffffad700b247c68
> > > [282879.612591] RBP: ffffad700b247c48 R08: 0000000000000003 R09:
> > > fffffffffffffffe
> > > [282879.612592] R10: 0000000000001990 R11: 0000000000000003 R12:
> > > ffffad700b247cf8
> > > [282879.612593] R13: ffffad700b247d70 R14: ffffad700b247cf8 R15:
> > > ffffdfcd2c778000
> > > [282879.612594] FS:  00007f5f576fb740(0000) GS:ffff922df0840000(0000)
> > > knlGS:0000000000000000
> > > [282879.612596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [282879.612597] CR2: 00007fe797100600 CR3: 0000002b2468e000 CR4:
> > > 0000000000350ee0
> > > [282879.612599] Call Trace:
> > > [282879.612601]  <IRQ>
> > > [282879.612605]  ? show_regs.cold+0x1a/0x1f
> > > [282879.612610]  ? watchdog_timer_fn+0x1c4/0x220
> > > [282879.612614]  ? softlockup_fn+0x30/0x30
> > > [282879.612616]  ? __hrtimer_run_queues+0xa2/0x2b0
> > > [282879.612620]  ? hrtimer_interrupt+0x109/0x220
> > > [282879.612622]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
> > > [282879.612625]  ? sysvec_apic_timer_interrupt+0x7b/0x90
> > > [282879.612629]  </IRQ>
> > > [282879.612630]  <TASK>
> > > [282879.612631]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> > > [282879.612640]  ? xas_descend+0x18/0x80
> > > [282879.612641]  ? xas_load+0x35/0x40
> > > [282879.612643]  xas_find+0x197/0x1d0
> > > [282879.612645]  find_get_entries+0x6e/0x170
> > > [282879.612649]  truncate_inode_pages_range+0x294/0x4c0
> > > [282879.612655]  ? __xfs_trans_commit+0x13c/0x3e0 [xfs]
> > > [282879.612787]  ? kvfree+0x2c/0x40
> > > [282879.612791]  ? trace_hardirqs_off+0x36/0xf0
> > > [282879.612795]  truncate_inode_pages_final+0x44/0x50
> > > [282879.612798]  evict+0x177/0x190
> > > [282879.612802]  iput.part.0+0x183/0x1e0
> > > [282879.612804]  iput+0x1c/0x30
> > > [282879.612806]  do_unlinkat+0x1c7/0x2c0
> > > [282879.612810]  __x64_sys_unlinkat+0x38/0x70
> > > [282879.612812]  do_syscall_64+0x38/0x90
> > > [282879.612815]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [282879.612818] RIP: 0033:0x7f5f56cf120d
> > > [282879.612827] Code: 69 5c 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e
> > > 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 63 d2 48 63 ff b8 07 01 00
> > > 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 32 5c 2d 00 f7 d8 64
> > > 89 02
> > > [282879.612828] RSP: 002b:00007fff30375c48 EFLAGS: 00000206 ORIG_RAX:
> > > 0000000000000107
> > > [282879.612830] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> > > 00007f5f56cf120d
> > > [282879.612831] RDX: 0000000000000000 RSI: 0000000001640403 RDI:
> > > 0000000000000003
> > > [282879.612832] RBP: 0000000001640403 R08: 0000000000000000 R09:
> > > 0000000001640403
> > > [282879.612833] R10: 0000000000000100 R11: 0000000000000206 R12:
> > > 0000000000000003
> > > [282879.612834] R13: 000000000163c5c0 R14: 00007fff30375c80 R15:
> > > 0000000000000000
> > > [282879.612836]  </TASK>
> > >
> > >
> > > Unfortunately, we couldn't reproduce the issue on our test servers. We
> > > worked around it by disabling CONFIG_XARRAY_MULTI. Since then, these
> > > production servers have been running smoothly for several months.
> > >
> > > > If anyone can provide a patch, I can test it on multiple machines
> > > > over the next few days.
> > It would be highly appreciated if you could help to try the patch below,
> > which works on my v6.6-based Android. However, a hard lockup has been
> > reported in an ongoing regression test (not yet confirmed to be caused
> > by this patch). Thank you!
>
> I'm sorry to inform you that our users are unwilling to experiment
> with these changes on our production servers again, and I am unable to
> reproduce the issue on our test servers. I am reporting this issue to
> highlight to the community that it is indeed a serious problem, and we
> should consider it carefully.
OK. Based on what the investigation suggested, a possible way to reproduce
the timing sequence is to have multiple processes mmap and truncate the
same file simultaneously, while reserving a certain amount of CMA area via
dts so that contiguous-memory migration runs against those pages; that
should make the race easier to hit. A rough sketch of such a reproducer is
below.
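
A minimal userspace sketch of that idea (purely illustrative, not a
verified reproducer: the file path, size and worker count are arbitrary
placeholders, and the concurrent migration has to come from the platform
itself, e.g. a CMA-backed driver allocating from the dts-reserved area):

/* mmap/truncate stress on one file; pair with background CMA allocations */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define FILE_SZ	(64UL << 20)	/* arbitrary */
#define NPROC	8

static void mmap_worker(const char *path)
{
	for (;;) {
		int fd = open(path, O_RDWR);

		if (fd < 0)
			continue;
		char *p = mmap(NULL, FILE_SZ, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p != MAP_FAILED) {
			/* best-effort THP hint; ignored where unsupported */
			madvise(p, FILE_SZ, MADV_HUGEPAGE);
			/* fault pages in so they sit in the page cache */
			for (size_t off = 0; off < FILE_SZ; off += 4096)
				p[off] = (char)off;
			munmap(p, FILE_SZ);
		}
		close(fd);
	}
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "repro.dat";
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	if (fd < 0 || ftruncate(fd, FILE_SZ)) {
		perror("setup");
		return 1;
	}
	close(fd);

	for (int i = 0; i < NPROC; i++)
		if (fork() == 0)
			mmap_worker(path);

	for (;;) {
		/* shrink and regrow to keep truncating mapped page cache */
		truncate(path, 0);
		truncate(path, FILE_SZ);
		/* respawn workers killed by SIGBUS when they race past EOF */
		while (waitpid(-1, NULL, WNOHANG) > 0)
			if (fork() == 0)
				mmap_worker(path);
	}
}

Workers that get killed by SIGBUS after the file shrinks under them are
simply respawned, so the mmap/truncate race keeps running indefinitely.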
>
> >
> > mm/huge_memory.c | 22 ++++++++++++++--------
> >  1 file changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 064fbd90822b..5899906c326a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2498,7 +2498,7 @@ static void __split_huge_page(struct page *page,
> > struct list_head *list,
> >  {
> >   struct folio *folio = page_folio(page);
> >   struct page *head = &folio->page;
> > - struct lruvec *lruvec;
> > + struct lruvec *lruvec = folio_lruvec(folio);
> >   struct address_space *swap_cache = NULL;
> >   unsigned long offset = 0;
> >   unsigned int nr = thp_nr_pages(head);
> > @@ -2513,9 +2513,6 @@ static void __split_huge_page(struct page *page,
> > struct list_head *list,
> >   xa_lock(&swap_cache->i_pages);
> >   }
> >
> > - /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> > - lruvec = folio_lruvec_lock(folio);
> > -
> >   ClearPageHasHWPoisoned(head);
> >
> >   for (i = nr - 1; i >= 1; i--) {
> > @@ -2541,9 +2538,6 @@ static void __split_huge_page(struct page *page,
> > struct list_head *list,
> >   }
> >
> >   ClearPageCompound(head);
> > - unlock_page_lruvec(lruvec);
> > - /* Caller disabled irqs, so they are still disabled here */
> > -
> >   split_page_owner(head, nr);
> >
> >   /* See comment in __split_huge_page_tail() */
> > @@ -2560,7 +2554,6 @@ static void __split_huge_page(struct page *page,
> > struct list_head *list,
> >   page_ref_add(head, 2);
> >   xa_unlock(&head->mapping->i_pages);
> >   }
> > - local_irq_enable();
> >
> >   if (nr_dropped)
> >   shmem_uncharge(head->mapping->host, nr_dropped);
> > @@ -2631,6 +2624,7 @@ int split_huge_page_to_list(struct page *page,
> > struct list_head *list)
> >   int extra_pins, ret;
> >   pgoff_t end;
> >   bool is_hzp;
> > + struct lruvec *lruvec;
> >
> >   VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> >   VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
> > @@ -2714,6 +2708,14 @@ int split_huge_page_to_list(struct page *page,
> > struct list_head *list)
> >
> >   /* block interrupt reentry in xa_lock and spinlock */
> >   local_irq_disable();
> > +
> > + /*
> > + * take lruvec's lock before freeze the folio to prevent the folio
> > + * remains in the page cache with refcnt == 0, which could lead to
> > + * find_get_entry enters livelock by iterating the xarray.
> > + */
> > + lruvec = folio_lruvec_lock(folio);
> > +
> >   if (mapping) {
> >   /*
> >   * Check if the folio is present in page cache.
> > @@ -2748,12 +2750,16 @@ int split_huge_page_to_list(struct page *page,
> > struct list_head *list)
> >   }
> >
> >   __split_huge_page(page, list, end);
> >
> > > >
> > >
> > >
> > > --
> > > Regards
> > > Yafang
>
>
>
> --
> Regards
> Yafang
Zhaoyang Huang June 14, 2024, 3:31 a.m. UTC | #19
On Mon, May 27, 2024 at 4:22 PM Marcin Wanat <private@marcinwanat.pl> wrote:
>
> On 22.05.2024 12:13, Marcin Wanat wrote:
> > On 22.05.2024 07:37, Zhaoyang Huang wrote:
> >> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@marcinwanat.pl>
> >> wrote:
> >>>
> >>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
> >>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang
> >>>> <huangzhaoyang@gmail.com> wrote:
> >>>>>
> >>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat
> >>>>> <private@marcinwanat.pl> wrote:
> >>>>>>
> >>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> >>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> >>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> >>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT
> >>>>>> affected.
> >>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
> >>>>>> (tested at least 10 different versions), this lockup always appears
> >>>>>> after 2-30 days, similar to the report in the original thread.
> >>>>>> The more load (for example, copying a lot of local files while
> >>>>>> serving 20Gbps traffic), the higher the chance that the bug will
> >>>>>> appear.
> >>>>>>
> >>>>>> I haven't been able to reproduce this during synthetic tests,
> >>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30
> >>>>>> days.
> >>>>>> If anyone can provide a patch, I can test it on multiple machines
> >>>>>> over the next few days.
> >>>>> Could you please try this one which could be applied on 6.6
> >>>>> directly. Thank you!
> >>>> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-
> >>>> zhaoyang.huang@unisoc.com/
> >>>>
> >>>
> >>> Unfortunately, I am unable to cleanly apply this patch against the
> >>> latest 6.6.31
> >> Please try below one which works on my v6.6 based android. Thank you
> >> for your test in advance :D
> >>
> >> mm/huge_memory.c | 22 ++++++++++++++--------
> >>   1 file changed, 14 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >
> > I have compiled 6.6.31 with this patch and will test it on multiple
> > machines over the next 30 days. I will provide an update after 30 days
> > if everything is fine or sooner if any of the hosts experience the same
> > soft lockup again.
> >
>
> First server with 6.6.31 and this patch hung today. The soft lockup changed
> to a hard lockup:
>
> [26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
> [26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit
> ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit
> nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
> nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency
> intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200
> intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi
> intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si
> drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal
> ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr
> drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul
> crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio
> i40e libata megaraid_sas dca ghash_clmulni_intel wmi
> [26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
>       W          6.6.31.el9 #3
> [26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
> [26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389707] FS:  0000000000000000(0000) GS:ffff9ade20340000(0000)
> knlGS:0000000000000000
> [26887.389708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4:
> 00000000007706e0
> [26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [26887.389713] PKRU: 55555554
> [26887.389714] Call Trace:
> [26887.389717]  <NMI>
> [26887.389720]  ? watchdog_hardlockup_check+0xac/0x150
> [26887.389725]  ? __perf_event_overflow+0x102/0x1d0
> [26887.389729]  ? handle_pmi_common+0x189/0x3e0
> [26887.389735]  ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389738]  ? flush_tlb_one_kernel+0xa/0x20
> [26887.389742]  ? native_set_fixmap+0x65/0x80
> [26887.389745]  ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389751]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389755]  ? intel_pmu_handle_irq+0x10b/0x230
> [26887.389756]  ? perf_event_nmi_handler+0x28/0x50
> [26887.389759]  ? nmi_handle+0x58/0x150
> [26887.389764]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389768]  ? default_do_nmi+0x6b/0x170
> [26887.389770]  ? exc_nmi+0x12c/0x1a0
> [26887.389772]  ? end_repeat_nmi+0x16/0x1f
> [26887.389777]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389780]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389784]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389787]  </NMI>
> [26887.389788]  <TASK>
> [26887.389789]  __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389793]  folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389798]  __page_cache_release+0x68/0x230
> [26887.389801]  ? remove_migration_ptes+0x5c/0x80
> [26887.389807]  __folio_put+0x24/0x60
> [26887.389808]  __split_huge_page+0x368/0x520
> [26887.389812]  split_huge_page_to_list+0x4b3/0x570
> [26887.389816]  deferred_split_scan+0x1c8/0x290
> [26887.389819]  do_shrink_slab+0x12f/0x2d0
> [26887.389824]  shrink_slab_memcg+0x133/0x1d0
> [26887.389829]  shrink_node_memcgs+0x18e/0x1d0
> [26887.389832]  shrink_node+0xa7/0x370
> [26887.389836]  balance_pgdat+0x332/0x6f0
> [26887.389842]  kswapd+0xf0/0x190
> [26887.389845]  ? balance_pgdat+0x6f0/0x6f0
> [26887.389848]  kthread+0xee/0x120
> [26887.389851]  ? kthread_complete_and_exit+0x20/0x20
> [26887.389853]  ret_from_fork+0x2d/0x50
> [26887.389857]  ? kthread_complete_and_exit+0x20/0x20
> [26887.389859]  ret_from_fork_asm+0x11/0x20
> [26887.389864]  </TASK>
> [26887.389865] Kernel panic - not syncing: Hard LOCKUP
> [26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
>       W          6.6.31.el9 #3
> [26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x                    06/04/2018
> [26887.389870] Call Trace:
> [26887.389871]  <NMI>
> [26887.389872]  dump_stack_lvl+0x44/0x60
> [26887.389877]  panic+0x241/0x330
> [26887.389881]  nmi_panic+0x2f/0x40
> [26887.389883]  watchdog_hardlockup_check+0x119/0x150
> [26887.389886]  __perf_event_overflow+0x102/0x1d0
> [26887.389889]  handle_pmi_common+0x189/0x3e0
> [26887.389893]  ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389896]  ? flush_tlb_one_kernel+0xa/0x20
> [26887.389899]  ? native_set_fixmap+0x65/0x80
> [26887.389902]  ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389906]  ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389909]  intel_pmu_handle_irq+0x10b/0x230
> [26887.389911]  perf_event_nmi_handler+0x28/0x50
> [26887.389913]  nmi_handle+0x58/0x150
> [26887.389916]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389920]  default_do_nmi+0x6b/0x170
> [26887.389922]  exc_nmi+0x12c/0x1a0
> [26887.389923]  end_repeat_nmi+0x16/0x1f
> [26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389940]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389943]  ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389946]  </NMI>
> [26887.389947]  <TASK>
> [26887.389947]  __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389950]  folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389953]  __page_cache_release+0x68/0x230
> [26887.389955]  ? remove_migration_ptes+0x5c/0x80
> [26887.389958]  __folio_put+0x24/0x60
> [26887.389960]  __split_huge_page+0x368/0x520
> [26887.389963]  split_huge_page_to_list+0x4b3/0x570
> [26887.389967]  deferred_split_scan+0x1c8/0x290
> [26887.389971]  do_shrink_slab+0x12f/0x2d0
> [26887.389974]  shrink_slab_memcg+0x133/0x1d0
> [26887.389978]  shrink_node_memcgs+0x18e/0x1d0
> [26887.389982]  shrink_node+0xa7/0x370
> [26887.389985]  balance_pgdat+0x332/0x6f0
> [26887.389991]  kswapd+0xf0/0x190
> [26887.389994]  ? balance_pgdat+0x6f0/0x6f0
> [26887.389997]  kthread+0xee/0x120
> [26887.389998]  ? kthread_complete_and_exit+0x20/0x20
> [26887.390000]  ret_from_fork+0x2d/0x50
> [26887.390003]  ? kthread_complete_and_exit+0x20/0x20
> [26887.390004]  ret_from_fork_asm+0x11/0x20
> [26887.390009]  </TASK>
>
Hi Marcin. Sorry for the late reply. I think the hard lockup above is
caused by the recursive deadlock shown in [1], which has been fixed by
[2] on v6.8+. I would like to know whether your regression test is still
running. Thanks very much.

[1]
static void __split_huge_page(struct page *page, struct list_head *list,
                pgoff_t end, unsigned int new_order)
{
        /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
        lruvec = folio_lruvec_lock(folio);       // takes lruvec->lru_lock the 1st time

        for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
                __split_huge_page_tail(folio, i, lruvec, list, new_order);
                /* Some pages can be beyond EOF: drop them from page cache */
                if (head[i].index >= end) {
                        folio_put(tail);
                            __page_cache_release
                                  folio_lruvec_lock_irqsave
                                                  // hangs on the 2nd attempt
[2]
commit f1ee018baee9f4e724e08859c2559323be768be3
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 27 17:42:42 2024 +0000

    mm: use __page_cache_release() in folios_put()

    Pass a pointer to the lruvec so we can take advantage of the
    folio_lruvec_relock_irqsave().  Adjust the calling convention of
    folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
    wrapper.
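
For reference, the relock pattern that [2] enables can be sketched roughly
as below (illustrative only, not the actual v6.8 code: release_folios() and
the loop body are hypothetical, while folio_lruvec_relock_irqsave() and
unlock_page_lruvec_irqrestore() are the existing helpers the commit builds
on). The put path keeps whichever lruvec lock it already holds and only
re-takes the lock when the next folio belongs to a different lruvec, which
avoids the unconditional second lock acquisition shown in [1]:

static void release_folios(struct folio **folios, int nr)
{
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < nr; i++) {
		struct folio *folio = folios[i];

		/* no-op when folio's lruvec is the one already locked */
		lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
		/* ... delete the folio from its LRU list and free it ... */
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);
}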
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..418e8d03480a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2891,7 +2891,7 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 {
 	struct folio *folio = page_folio(page);
 	struct page *head = &folio->page;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = folio_lruvec(folio);
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	int i, nr_dropped = 0;
@@ -2908,8 +2908,6 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);
 
 	ClearPageHasHWPoisoned(head);
 
@@ -2942,7 +2940,6 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 
 		folio_set_order(new_folio, new_order);
 	}
-	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, order, new_order);
@@ -2961,7 +2958,6 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 		folio_ref_add(folio, 1 + new_nr);
 		xa_unlock(&folio->mapping->i_pages);
 	}
-	local_irq_enable();
 
 	if (nr_dropped)
 		shmem_uncharge(folio->mapping->host, nr_dropped);
@@ -3048,6 +3044,7 @@  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
+	struct lruvec *lruvec;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -3159,6 +3156,14 @@  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
+
+	/*
+	 * take lruvec's lock before freeze the folio to prevent the folio
+	 * remains in the page cache with refcnt == 0, which could lead to
+	 * find_get_entry enters livelock by iterating the xarray.
+	 */
+	lruvec = folio_lruvec_lock(folio);
+
 	if (mapping) {
 		/*
 		 * Check if the folio is present in page cache.
@@ -3203,12 +3208,16 @@  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		}
 
 		__split_huge_page(page, list, end, new_order);
+		unlock_page_lruvec(lruvec);
+		local_irq_enable();
 		ret = 0;
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
 		if (mapping)
 			xas_unlock(&xas);
+
+		unlock_page_lruvec(lruvec);
 		local_irq_enable();
 		remap_page(folio, folio_nr_pages(folio));
 		ret = -EAGAIN;