diff mbox series

mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

Message ID 20240407085456.2798193-1-linmiaohe@huawei.com (mailing list archive)
State New
Headers show
Series mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled | expand

Commit Message

Miaohe Lin April 7, 2024, 8:54 a.m. UTC
When I ran a hard offline test with hugetlb pages, the deadlock below occurred:

======================================================
WARNING: possible circular locking dependency detected
6.8.0-11409-gf6cef5f8c37f #1 Not tainted
------------------------------------------------------
bash/46904 is trying to acquire lock:
ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60

but task is already holding lock:
ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
       __mutex_lock+0x6c/0x770
       page_alloc_cpu_online+0x3c/0x70
       cpuhp_invoke_callback+0x397/0x5f0
       __cpuhp_invoke_callback_range+0x71/0xe0
       _cpu_up+0xeb/0x210
       cpu_up+0x91/0xe0
       cpuhp_bringup_mask+0x49/0xb0
       bringup_nonboot_cpus+0xb7/0xe0
       smp_init+0x25/0xa0
       kernel_init_freeable+0x15f/0x3e0
       kernel_init+0x15/0x1b0
       ret_from_fork+0x2f/0x50
       ret_from_fork_asm+0x1a/0x30

-> #0 (cpu_hotplug_lock){++++}-{0:0}:
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pcp_batch_high_lock);
                               lock(cpu_hotplug_lock);
                               lock(pcp_batch_high_lock);
  rlock(cpu_hotplug_lock);

 *** DEADLOCK ***

5 locks held by bash/46904:
 #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
 #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
 #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
 #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

stack backtrace:
CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0xa0
 check_noncircular+0x129/0x140
 __lock_acquire+0x1298/0x1cd0
 lock_acquire+0xc0/0x2b0
 cpus_read_lock+0x2a/0xc0
 static_key_slow_dec+0x16/0x60
 __hugetlb_vmemmap_restore_folio+0x1b9/0x200
 dissolve_free_huge_page+0x211/0x260
 __page_handle_poison+0x45/0xc0
 memory_failure+0x65e/0xc70
 hard_offline_page_store+0x55/0xa0
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x387/0x550
 ksys_write+0x64/0xe0
 do_syscall_64+0xca/0x1e0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7fc862314887
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00

In short, the call chain below inverts the established lock dependency order:

 memory_failure
  __page_handle_poison
   zone_pcp_disable -- lock(pcp_batch_high_lock)
   dissolve_free_huge_page
    __hugetlb_vmemmap_restore_folio
     static_key_slow_dec
      cpus_read_lock -- rlock(cpu_hotplug_lock)

Fix this by calling drain_all_pages() instead.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory-failure.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Comments

Andrew Morton April 8, 2024, 7:29 p.m. UTC | #1
On Sun, 7 Apr 2024 16:54:56 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:

> When I did hard offline test with hugetlb pages, below deadlock occurs:
> 
> ...
>
> Fix this by calling drain_all_pages() instead.

Thanks.  I propose

Fixes: 510d25c92ec4a ("mm/hwpoison: disable pcp for page_handle_poison()")
Cc: <stable@vger.kernel.org>
Miaohe Lin April 9, 2024, 1:55 a.m. UTC | #2
On 2024/4/9 3:29, Andrew Morton wrote:
> On Sun, 7 Apr 2024 16:54:56 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
>> When I did hard offline test with hugetlb pages, below deadlock occurs:
>>
>> ...
>>
>> Fix this by calling drain_all_pages() instead.
> 
> Thanks.  I propose
> 
> Fixes: 510d25c92ec4a ("mm/hwpoison: disable pcp for page_handle_poison()")

IMHO this issue couldn't occur until commit a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key"),
since that commit introduced rlock(cpu_hotplug_lock) in the dissolve_free_huge_page() code path while lock(pcp_batch_high_lock)
is already held in __page_handle_poison(). So it might be more appropriate to use:

Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")

> Cc: <stable@vger.kernel.org>

Thanks for doing this.

> 
> .
>
Oscar Salvador April 9, 2024, 2:10 p.m. UTC | #3
On Sun, Apr 07, 2024 at 04:54:56PM +0800, Miaohe Lin wrote:
> In short, below scene breaks the lock dependency chain:
> 
>  memory_failure
>   __page_handle_poison
>    zone_pcp_disable -- lock(pcp_batch_high_lock)
>    dissolve_free_huge_page
>     __hugetlb_vmemmap_restore_folio
>      static_key_slow_dec
>       cpus_read_lock -- rlock(cpu_hotplug_lock)
> 
> Fix this by calling drain_all_pages() instead.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

Thanks!
Oscar Salvador April 9, 2024, 4:10 p.m. UTC | #4
On Tue, Apr 09, 2024 at 04:10:22PM +0200, Oscar Salvador wrote:
> On Sun, Apr 07, 2024 at 04:54:56PM +0800, Miaohe Lin wrote:
> > In short, below scene breaks the lock dependency chain:
> > 
> >  memory_failure
> >   __page_handle_poison
> >    zone_pcp_disable -- lock(pcp_batch_high_lock)
> >    dissolve_free_huge_page
> >     __hugetlb_vmemmap_restore_folio
> >      static_key_slow_dec
> >       cpus_read_lock -- rlock(cpu_hotplug_lock)
> > 
> > Fix this by calling drain_all_pages() instead.
> > 
> > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> 
> Acked-by: Oscar Salvador <osalvador@suse.de>

On second thought,

disabling pcp via zone_pcp_disable() was a deterministic approach.
Now, with drain_all_pages() we drain PCP queues to buddy, but nothing
guarantees that those pages do not end up in a PCP queue again before
the call to take_page_off_buddy() if we need refilling, right?

I guess we can live with that because we will let the system know that we
failed to isolate that page.
Miaohe Lin April 10, 2024, 7:52 a.m. UTC | #5
On 2024/4/10 0:10, Oscar Salvador wrote:
> On Tue, Apr 09, 2024 at 04:10:22PM +0200, Oscar Salvador wrote:
>> On Sun, Apr 07, 2024 at 04:54:56PM +0800, Miaohe Lin wrote:
>>> In short, below scene breaks the lock dependency chain:
>>>
>>>  memory_failure
>>>   __page_handle_poison
>>>    zone_pcp_disable -- lock(pcp_batch_high_lock)
>>>    dissolve_free_huge_page
>>>     __hugetlb_vmemmap_restore_folio
>>>      static_key_slow_dec
>>>       cpus_read_lock -- rlock(cpu_hotplug_lock)
>>>
>>> Fix this by calling drain_all_pages() instead.
>>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>
>> Acked-by: Oscar Salvador <osalvador@suse.de>

Thanks.

> 
> On a second though,
> 
> disabling pcp via zone_pcp_disable() was a deterministic approach.
> Now, with drain_all_pages() we drain PCP queues to buddy, but nothing
> guarantees that those pages do not end up in a PCP queue again before we
> the call to take_page_off_budy() if we
> need refilling, right?

AFAICS, if the check_pages_enabled static key is enabled, then in hard offline mode
check_new_pages() will prevent those pages from ending up in a PCP queue again
when refilling the PCP list, because PageHWPoison pages are treated as 'bad' pages
and skipped during the refill.

> 
> I guess we can live with that because we will let the system know that we
> failed to isolate that page.

We're trying our best to isolate that page anyway. :)

Thanks for your thought.

> 
>
Oscar Salvador April 10, 2024, 8:52 a.m. UTC | #6
On Wed, Apr 10, 2024 at 03:52:14PM +0800, Miaohe Lin wrote:
> AFAICS, iff check_pages_enabled static key is enabled and in hard offline mode,
> check_new_pages() will prevent those pages from ending up in a PCP queue again
> when refilling PCP list. Because PageHWPoison pages will be taken as 'bad' pages
> and skipped when refill PCP list.

Yes, but check_pages_enabled static key is only enabled when
either CONFIG_DEBUG_PAGEALLOC or CONFIG_DEBUG_VM are set, which means
that under most of the systems that protection will not take place.

Which takes me to a problem we had in the past where we were handing
over hwpoisoned pages from PCP lists happily.
Now, for soft-offline mode, we worked hard to stop doing that
because soft-offline is a non-disruptive operation and no one should get 
killed.
hard-offline is another story, but still I think that extending the
comment to include the following would be a good idea:

"Disabling pcp before dissolving the page was a deterministic approach
 because we made sure that those pages cannot end up in any PCP list.
 Draining PCP lists expels those pages to the buddy system, but nothing
 guarantees that those pages do not get back to a PCP queue if we need
 to refill those."

 Just to remind ourselves of the dangers of a non-deterministic
 approach.


Thanks
Miaohe Lin April 11, 2024, 2:26 a.m. UTC | #7
On 2024/4/10 16:52, Oscar Salvador wrote:
> On Wed, Apr 10, 2024 at 03:52:14PM +0800, Miaohe Lin wrote:
>> AFAICS, iff check_pages_enabled static key is enabled and in hard offline mode,
>> check_new_pages() will prevent those pages from ending up in a PCP queue again
>> when refilling PCP list. Because PageHWPoison pages will be taken as 'bad' pages
>> and skipped when refill PCP list.
> 
> Yes, but check_pages_enabled static key is only enabled when
> either CONFIG_DEBUG_PAGEALLOC or CONFIG_DEBUG_VM are set, which means
> that under most of the systems that protection will not take place.
> 
> Which takes me to a problem we had in the past where we were handing
> over hwpoisoned pages from PCP lists happily.
> Now, with for soft-offline mode, we worked hard to stop doing that
> because soft-offline is a non-disruptive operation and no one should get 
> killed.
> hard-offline is another story, but still I think that extending the
> comment to include the following would be a good idea:
> 
> "Disabling pcp before dissolving the page was a deterministic approach
>  because we made sure that those pages cannot end up in any PCP list.
>  Draining PCP lists expels those pages to the buddy system, but nothing
>  guarantees that those pages do not get back to a PCP queue if we need
>  to refill those."

This really helps. Will add it in v2.
Thanks Oscar.

> 
>  Just to remind ourselves of the dangers of a non-deterministic
>  approach.
> 
> 
> Thanks
> 
>

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index edd6e114462f..88359a185c5f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -153,11 +153,17 @@  static int __page_handle_poison(struct page *page)
 {
 	int ret;
 
-	zone_pcp_disable(page_zone(page));
+	/*
+	 * zone_pcp_disable() can't be used here. It will hold pcp_batch_high_lock and
+	 * dissolve_free_huge_page() might hold cpu_hotplug_lock via static_key_slow_dec()
+	 * when hugetlb vmemmap optimization is enabled. This would invert the
+	 * established lock dependency chain and lead to deadlock.
+	 */
 	ret = dissolve_free_huge_page(page);
-	if (!ret)
+	if (!ret) {
+		drain_all_pages(page_zone(page));
 		ret = take_page_off_buddy(page);
-	zone_pcp_enable(page_zone(page));
+	}
 
 	return ret;
 }