Message ID | 20240412025754.1897615-1-linmiaohe@huawei.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled | expand |
On 4/11/24 7:57 PM, Miaohe Lin wrote: > When I did hard offline test with hugetlb pages, below deadlock occurs: > > ====================================================== > WARNING: possible circular locking dependency detected > 6.8.0-11409-gf6cef5f8c37f #1 Not tainted > ------------------------------------------------------ > bash/46904 is trying to acquire lock: > ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60 > > but task is already holding lock: > ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 > > which lock already depends on the new lock. > > the existing dependency chain (in reverse order) is: > > -> #1 (pcp_batch_high_lock){+.+.}-{3:3}: > __mutex_lock+0x6c/0x770 > page_alloc_cpu_online+0x3c/0x70 > cpuhp_invoke_callback+0x397/0x5f0 > __cpuhp_invoke_callback_range+0x71/0xe0 > _cpu_up+0xeb/0x210 > cpu_up+0x91/0xe0 > cpuhp_bringup_mask+0x49/0xb0 > bringup_nonboot_cpus+0xb7/0xe0 > smp_init+0x25/0xa0 > kernel_init_freeable+0x15f/0x3e0 > kernel_init+0x15/0x1b0 > ret_from_fork+0x2f/0x50 > ret_from_fork_asm+0x1a/0x30 > > -> #0 (cpu_hotplug_lock){++++}-{0:0}: > __lock_acquire+0x1298/0x1cd0 > lock_acquire+0xc0/0x2b0 > cpus_read_lock+0x2a/0xc0 > static_key_slow_dec+0x16/0x60 > __hugetlb_vmemmap_restore_folio+0x1b9/0x200 > dissolve_free_huge_page+0x211/0x260 > __page_handle_poison+0x45/0xc0 > memory_failure+0x65e/0xc70 > hard_offline_page_store+0x55/0xa0 > kernfs_fop_write_iter+0x12c/0x1d0 > vfs_write+0x387/0x550 > ksys_write+0x64/0xe0 > do_syscall_64+0xca/0x1e0 > entry_SYSCALL_64_after_hwframe+0x6d/0x75 > > other info that might help us debug this: > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(pcp_batch_high_lock); > lock(cpu_hotplug_lock); > lock(pcp_batch_high_lock); > rlock(cpu_hotplug_lock); > > *** DEADLOCK *** > > 5 locks held by bash/46904: > #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0 > #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0 > #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0 > #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70 > #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 > > stack backtrace: > CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 > Call Trace: > <TASK> > dump_stack_lvl+0x68/0xa0 > check_noncircular+0x129/0x140 > __lock_acquire+0x1298/0x1cd0 > lock_acquire+0xc0/0x2b0 > cpus_read_lock+0x2a/0xc0 > static_key_slow_dec+0x16/0x60 > __hugetlb_vmemmap_restore_folio+0x1b9/0x200 > dissolve_free_huge_page+0x211/0x260 > __page_handle_poison+0x45/0xc0 > memory_failure+0x65e/0xc70 > hard_offline_page_store+0x55/0xa0 > kernfs_fop_write_iter+0x12c/0x1d0 > vfs_write+0x387/0x550 > ksys_write+0x64/0xe0 > do_syscall_64+0xca/0x1e0 > entry_SYSCALL_64_after_hwframe+0x6d/0x75 > RIP: 0033:0x7fc862314887 > Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 > RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887 > RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001 > RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff > R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c > R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00 > > In short, below scene breaks the lock dependency chain: > > memory_failure > __page_handle_poison > zone_pcp_disable -- lock(pcp_batch_high_lock) > dissolve_free_huge_page > __hugetlb_vmemmap_restore_folio > static_key_slow_dec > cpus_read_lock -- rlock(cpu_hotplug_lock) > > Fix this by calling drain_all_pages() instead. > > Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key") > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> > Acked-by: Oscar Salvador <osalvador@suse.de> > Cc: <stable@vger.kernel.org> > --- > v2: > collect Acked-by tag and extend comment per Oscar. Thanks. > --- > mm/memory-failure.c | 16 +++++++++++++--- > 1 file changed, 13 insertions(+), 3 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index edd6e114462f..c6750509d74c 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -153,11 +153,21 @@ static int __page_handle_poison(struct page *page) > { > int ret; > > - zone_pcp_disable(page_zone(page)); > + /* > + * zone_pcp_disable() can't be used here. It will hold pcp_batch_high_lock and > + * dissolve_free_huge_page() might hold cpu_hotplug_lock via static_key_slow_dec() > + * when hugetlb vmemmap optimization is enabled. This will break current lock > + * dependency chain and leads to deadlock. > + * Disabling pcp before dissolving the page was a deterministic approach because > + * we made sure that those pages cannot end up in any PCP list. Draining PCP lists > + * expels those pages to the buddy system, but nothing guarantees that those pages > + * do not get back to a PCP queue if we need to refill those. > + */ > ret = dissolve_free_huge_page(page); Hi Miaohe, I recently sent a patch[1] to convert dissolve_free_huge_page() to folios which changes the function name and the name referenced in the comment so this will conflict with my patch. It's in mm-unstable now, would you be able to rebase to that in a new version? Thanks, Sid [1] https://lore.kernel.org/linux-mm/20240411164756.261178-1-sidhartha.kumar@oracle.com/T/#u > - if (!ret) > + if (!ret) { > + drain_all_pages(page_zone(page)); > ret = take_page_off_buddy(page); > - zone_pcp_enable(page_zone(page)); > + } > > return ret; > }
On 2024/4/12 11:12, Sidhartha Kumar wrote: > On 4/11/24 7:57 PM, Miaohe Lin wrote: >> When I did hard offline test with hugetlb pages, below deadlock occurs: >> >> ====================================================== >> WARNING: possible circular locking dependency detected >> 6.8.0-11409-gf6cef5f8c37f #1 Not tainted >> ------------------------------------------------------ >> bash/46904 is trying to acquire lock: >> ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60 >> >> but task is already holding lock: >> ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 >> >> which lock already depends on the new lock. >> >> the existing dependency chain (in reverse order) is: >> >> -> #1 (pcp_batch_high_lock){+.+.}-{3:3}: >> __mutex_lock+0x6c/0x770 >> page_alloc_cpu_online+0x3c/0x70 >> cpuhp_invoke_callback+0x397/0x5f0 >> __cpuhp_invoke_callback_range+0x71/0xe0 >> _cpu_up+0xeb/0x210 >> cpu_up+0x91/0xe0 >> cpuhp_bringup_mask+0x49/0xb0 >> bringup_nonboot_cpus+0xb7/0xe0 >> smp_init+0x25/0xa0 >> kernel_init_freeable+0x15f/0x3e0 >> kernel_init+0x15/0x1b0 >> ret_from_fork+0x2f/0x50 >> ret_from_fork_asm+0x1a/0x30 >> >> -> #0 (cpu_hotplug_lock){++++}-{0:0}: >> __lock_acquire+0x1298/0x1cd0 >> lock_acquire+0xc0/0x2b0 >> cpus_read_lock+0x2a/0xc0 >> static_key_slow_dec+0x16/0x60 >> __hugetlb_vmemmap_restore_folio+0x1b9/0x200 >> dissolve_free_huge_page+0x211/0x260 >> __page_handle_poison+0x45/0xc0 >> memory_failure+0x65e/0xc70 >> hard_offline_page_store+0x55/0xa0 >> kernfs_fop_write_iter+0x12c/0x1d0 >> vfs_write+0x387/0x550 >> ksys_write+0x64/0xe0 >> do_syscall_64+0xca/0x1e0 >> entry_SYSCALL_64_after_hwframe+0x6d/0x75 >> >> other info that might help us debug this: >> >> Possible unsafe locking scenario: >> >> CPU0 CPU1 >> ---- ---- >> lock(pcp_batch_high_lock); >> lock(cpu_hotplug_lock); >> lock(pcp_batch_high_lock); >> rlock(cpu_hotplug_lock); >> >> *** DEADLOCK *** >> >> 5 locks held by bash/46904: >> #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0 >> #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0 >> #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0 >> #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70 >> #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 >> >> stack backtrace: >> CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 >> Call Trace: >> <TASK> >> dump_stack_lvl+0x68/0xa0 >> check_noncircular+0x129/0x140 >> __lock_acquire+0x1298/0x1cd0 >> lock_acquire+0xc0/0x2b0 >> cpus_read_lock+0x2a/0xc0 >> static_key_slow_dec+0x16/0x60 >> __hugetlb_vmemmap_restore_folio+0x1b9/0x200 >> dissolve_free_huge_page+0x211/0x260 >> __page_handle_poison+0x45/0xc0 >> memory_failure+0x65e/0xc70 >> hard_offline_page_store+0x55/0xa0 >> kernfs_fop_write_iter+0x12c/0x1d0 >> vfs_write+0x387/0x550 >> ksys_write+0x64/0xe0 >> do_syscall_64+0xca/0x1e0 >> entry_SYSCALL_64_after_hwframe+0x6d/0x75 >> RIP: 0033:0x7fc862314887 >> Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 >> RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 >> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887 >> RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001 >> RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff >> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c >> R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00 >> >> In short, below scene breaks the lock dependency chain: >> >> memory_failure >> __page_handle_poison >> zone_pcp_disable -- lock(pcp_batch_high_lock) >> dissolve_free_huge_page >> __hugetlb_vmemmap_restore_folio >> static_key_slow_dec >> cpus_read_lock -- rlock(cpu_hotplug_lock) >> >> Fix this by calling drain_all_pages() instead. >> >> Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key") >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> >> Acked-by: Oscar Salvador <osalvador@suse.de> >> Cc: <stable@vger.kernel.org> >> --- >> v2: >> collect Acked-by tag and extend comment per Oscar. Thanks. >> --- >> mm/memory-failure.c | 16 +++++++++++++--- >> 1 file changed, 13 insertions(+), 3 deletions(-) >> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >> index edd6e114462f..c6750509d74c 100644 >> --- a/mm/memory-failure.c >> +++ b/mm/memory-failure.c >> @@ -153,11 +153,21 @@ static int __page_handle_poison(struct page *page) >> { >> int ret; >> - zone_pcp_disable(page_zone(page)); >> + /* >> + * zone_pcp_disable() can't be used here. It will hold pcp_batch_high_lock and >> + * dissolve_free_huge_page() might hold cpu_hotplug_lock via static_key_slow_dec() >> + * when hugetlb vmemmap optimization is enabled. This will break current lock >> + * dependency chain and leads to deadlock. >> + * Disabling pcp before dissolving the page was a deterministic approach because >> + * we made sure that those pages cannot end up in any PCP list. Draining PCP lists >> + * expels those pages to the buddy system, but nothing guarantees that those pages >> + * do not get back to a PCP queue if we need to refill those. >> + */ >> ret = dissolve_free_huge_page(page); > > Hi Miaohe, > > I recently sent a patch[1] to convert dissolve_free_huge_page() to folios which changes the function name and the name referenced in the comment so this will conflict with my patch. It's in mm-unstable now, would you be able to rebase to that in a new version? > The version 1 of this patch is in mm-unstable too. So it might be better to send a separate patch to extend the comment. Thanks. . > Thanks, > Sid > > [1] https://lore.kernel.org/linux-mm/20240411164756.261178-1-sidhartha.kumar@oracle.com/T/#u > > >> - if (!ret) >> + if (!ret) { >> + drain_all_pages(page_zone(page)); >> ret = take_page_off_buddy(page); >> - zone_pcp_enable(page_zone(page)); >> + } >> return ret; >> } > > .
On Fri, 12 Apr 2024 16:11:52 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > > I recently sent a patch[1] to convert dissolve_free_huge_page() to folios which changes the function name and the name referenced in the comment so this will conflict with my patch. It's in mm-unstable now, would you be able to rebase to that in a new version? > > This patch is a hotfixes, cc:stable one so the mm-unstable material will be based on top of this change. I've queued this change up as a -fix against v1. And I've retained this changelog addition: : This issue won't occur until commit a6b40850c442 ("mm: hugetlb: replace : hugetlb_free_vmemmap_enabled with a static_key"). As it introduced : rlock(cpu_hotplug_lock) in dissolve_free_huge_page() code path while : lock(pcp_batch_high_lock) is already in the __page_handle_poison(). And I've queued another -fix to reflow that block comment to 80 columns. --- a/mm/memory-failure.c~mm-memory-failure-fix-deadlock-when-hugetlb_optimize_vmemmap-is-enabled-v2-fix +++ a/mm/memory-failure.c @@ -155,14 +155,16 @@ static int __page_handle_poison(struct p int ret; /* - * zone_pcp_disable() can't be used here. It will hold pcp_batch_high_lock and - * dissolve_free_huge_page() might hold cpu_hotplug_lock via static_key_slow_dec() - * when hugetlb vmemmap optimization is enabled. This will break current lock - * dependency chain and leads to deadlock. - * Disabling pcp before dissolving the page was a deterministic approach because - * we made sure that those pages cannot end up in any PCP list. Draining PCP lists - * expels those pages to the buddy system, but nothing guarantees that those pages - * do not get back to a PCP queue if we need to refill those. + * zone_pcp_disable() can't be used here. It will + * hold pcp_batch_high_lock and dissolve_free_huge_page() might hold + * cpu_hotplug_lock via static_key_slow_dec() when hugetlb vmemmap + * optimization is enabled. This will break current lock dependency + * chain and leads to deadlock. + * Disabling pcp before dissolving the page was a deterministic + * approach because we made sure that those pages cannot end up in any + * PCP list. Draining PCP lists expels those pages to the buddy system, + * but nothing guarantees that those pages do not get back to a PCP + * queue if we need to refill those. */ ret = dissolve_free_huge_page(page); if (!ret) {
On 2024/4/13 7:21, Andrew Morton wrote: > On Fri, 12 Apr 2024 16:11:52 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote: > >>> I recently sent a patch[1] to convert dissolve_free_huge_page() to folios which changes the function name and the name referenced in the comment so this will conflict with my patch. It's in mm-unstable now, would you be able to rebase to that in a new version? >>> > > This patch is a hotfixes, cc:stable one so the mm-unstable material will be > based on top of this change. > > I've queued this change up as a -fix against v1. And I've retained > this changelog addition: > > : This issue won't occur until commit a6b40850c442 ("mm: hugetlb: replace > : hugetlb_free_vmemmap_enabled with a static_key"). As it introduced > : rlock(cpu_hotplug_lock) in dissolve_free_huge_page() code path while > : lock(pcp_batch_high_lock) is already in the __page_handle_poison(). > > And I've queued another -fix to reflow that block comment to 80 columns. > > --- a/mm/memory-failure.c~mm-memory-failure-fix-deadlock-when-hugetlb_optimize_vmemmap-is-enabled-v2-fix > +++ a/mm/memory-failure.c > @@ -155,14 +155,16 @@ static int __page_handle_poison(struct p > int ret; > > /* > - * zone_pcp_disable() can't be used here. It will hold pcp_batch_high_lock and > - * dissolve_free_huge_page() might hold cpu_hotplug_lock via static_key_slow_dec() > - * when hugetlb vmemmap optimization is enabled. This will break current lock > - * dependency chain and leads to deadlock. > - * Disabling pcp before dissolving the page was a deterministic approach because > - * we made sure that those pages cannot end up in any PCP list. Draining PCP lists > - * expels those pages to the buddy system, but nothing guarantees that those pages > - * do not get back to a PCP queue if we need to refill those. > + * zone_pcp_disable() can't be used here. It will > + * hold pcp_batch_high_lock and dissolve_free_huge_page() might hold > + * cpu_hotplug_lock via static_key_slow_dec() when hugetlb vmemmap > + * optimization is enabled. This will break current lock dependency > + * chain and leads to deadlock. > + * Disabling pcp before dissolving the page was a deterministic > + * approach because we made sure that those pages cannot end up in any > + * PCP list. Draining PCP lists expels those pages to the buddy system, > + * but nothing guarantees that those pages do not get back to a PCP > + * queue if we need to refill those. > */ > ret = dissolve_free_huge_page(page); > if (!ret) { > _ > Many thanks for doing this. :) . > . >
On 4/11/2024 7:57 PM, Miaohe Lin wrote: > When I did hard offline test with hugetlb pages, below deadlock occurs: > > ====================================================== > WARNING: possible circular locking dependency detected > 6.8.0-11409-gf6cef5f8c37f #1 Not tainted > ------------------------------------------------------ > bash/46904 is trying to acquire lock: > ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60 > > but task is already holding lock: > ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 > > which lock already depends on the new lock. > > the existing dependency chain (in reverse order) is: > > -> #1 (pcp_batch_high_lock){+.+.}-{3:3}: > __mutex_lock+0x6c/0x770 > page_alloc_cpu_online+0x3c/0x70 > cpuhp_invoke_callback+0x397/0x5f0 > __cpuhp_invoke_callback_range+0x71/0xe0 > _cpu_up+0xeb/0x210 > cpu_up+0x91/0xe0 > cpuhp_bringup_mask+0x49/0xb0 > bringup_nonboot_cpus+0xb7/0xe0 > smp_init+0x25/0xa0 > kernel_init_freeable+0x15f/0x3e0 > kernel_init+0x15/0x1b0 > ret_from_fork+0x2f/0x50 > ret_from_fork_asm+0x1a/0x30 > > -> #0 (cpu_hotplug_lock){++++}-{0:0}: > __lock_acquire+0x1298/0x1cd0 > lock_acquire+0xc0/0x2b0 > cpus_read_lock+0x2a/0xc0 > static_key_slow_dec+0x16/0x60 > __hugetlb_vmemmap_restore_folio+0x1b9/0x200 > dissolve_free_huge_page+0x211/0x260 > __page_handle_poison+0x45/0xc0 > memory_failure+0x65e/0xc70 > hard_offline_page_store+0x55/0xa0 > kernfs_fop_write_iter+0x12c/0x1d0 > vfs_write+0x387/0x550 > ksys_write+0x64/0xe0 > do_syscall_64+0xca/0x1e0 > entry_SYSCALL_64_after_hwframe+0x6d/0x75 > > other info that might help us debug this: > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(pcp_batch_high_lock); > lock(cpu_hotplug_lock); > lock(pcp_batch_high_lock); > rlock(cpu_hotplug_lock); > > *** DEADLOCK *** > > 5 locks held by bash/46904: > #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0 > #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0 > #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0 > #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70 > #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 > > stack backtrace: > CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 > Call Trace: > <TASK> > dump_stack_lvl+0x68/0xa0 > check_noncircular+0x129/0x140 > __lock_acquire+0x1298/0x1cd0 > lock_acquire+0xc0/0x2b0 > cpus_read_lock+0x2a/0xc0 > static_key_slow_dec+0x16/0x60 > __hugetlb_vmemmap_restore_folio+0x1b9/0x200 > dissolve_free_huge_page+0x211/0x260 > __page_handle_poison+0x45/0xc0 > memory_failure+0x65e/0xc70 > hard_offline_page_store+0x55/0xa0 > kernfs_fop_write_iter+0x12c/0x1d0 > vfs_write+0x387/0x550 > ksys_write+0x64/0xe0 > do_syscall_64+0xca/0x1e0 > entry_SYSCALL_64_after_hwframe+0x6d/0x75 > RIP: 0033:0x7fc862314887 > Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 > RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887 > RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001 > RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff > R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c > R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00 > > In short, below scene breaks the lock dependency chain: > > memory_failure > __page_handle_poison > zone_pcp_disable -- lock(pcp_batch_high_lock) > dissolve_free_huge_page > __hugetlb_vmemmap_restore_folio > static_key_slow_dec > cpus_read_lock -- rlock(cpu_hotplug_lock) > > Fix this by calling drain_all_pages() instead. > > Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key") > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> > Acked-by: Oscar Salvador <osalvador@suse.de> > Cc: <stable@vger.kernel.org> > --- > v2: > collect Acked-by tag and extend comment per Oscar. Thanks. > --- > mm/memory-failure.c | 16 +++++++++++++--- > 1 file changed, 13 insertions(+), 3 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index edd6e114462f..c6750509d74c 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -153,11 +153,21 @@ static int __page_handle_poison(struct page *page) > { > int ret; > > - zone_pcp_disable(page_zone(page)); > + /* > + * zone_pcp_disable() can't be used here. It will hold pcp_batch_high_lock and > + * dissolve_free_huge_page() might hold cpu_hotplug_lock via static_key_slow_dec() > + * when hugetlb vmemmap optimization is enabled. This will break current lock > + * dependency chain and leads to deadlock. > + * Disabling pcp before dissolving the page was a deterministic approach because > + * we made sure that those pages cannot end up in any PCP list. Draining PCP lists > + * expels those pages to the buddy system, but nothing guarantees that those pages > + * do not get back to a PCP queue if we need to refill those. > + */ > ret = dissolve_free_huge_page(page); > - if (!ret) > + if (!ret) { > + drain_all_pages(page_zone(page)); > ret = take_page_off_buddy(page); > - zone_pcp_enable(page_zone(page)); > + } > > return ret; > } Looks good. Reviewed-by: Jane Chu <jane.chu@oracle.com> -jane
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index edd6e114462f..c6750509d74c 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -153,11 +153,21 @@ static int __page_handle_poison(struct page *page) { int ret; - zone_pcp_disable(page_zone(page)); + /* + * zone_pcp_disable() can't be used here. It will hold pcp_batch_high_lock and + * dissolve_free_huge_page() might hold cpu_hotplug_lock via static_key_slow_dec() + * when hugetlb vmemmap optimization is enabled. This will break current lock + * dependency chain and leads to deadlock. + * Disabling pcp before dissolving the page was a deterministic approach because + * we made sure that those pages cannot end up in any PCP list. Draining PCP lists + * expels those pages to the buddy system, but nothing guarantees that those pages + * do not get back to a PCP queue if we need to refill those. + */ ret = dissolve_free_huge_page(page); - if (!ret) + if (!ret) { + drain_all_pages(page_zone(page)); ret = take_page_off_buddy(page); - zone_pcp_enable(page_zone(page)); + } return ret; }