Message ID | 8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock | expand |
On Tue 04-04-23 23:31:58, Tetsuo Handa wrote: > syzbot is reporting circular locking dependency which involves > zonelist_update_seq seqlock [1], for this lock is checked by memory > allocation requests which do not need to be retried. > > One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler. > > CPU0 > ---- > __build_all_zonelists() { > write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd > // e.g. timer interrupt handler runs at this moment > some_timer_func() { > kmalloc(GFP_ATOMIC) { > __alloc_pages_slowpath() { > read_seqbegin(&zonelist_update_seq) { > // spins forever because zonelist_update_seq.seqcount is odd > } > } > } > } > // e.g. timer interrupt handler finishes > write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even > } > > This deadlock scenario can be easily eliminated by not calling > read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation > requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation > requests. But Michal Hocko does not know whether we should go with this > approach. It would have been more useful to explain why that is not preferred or desirable. > Another deadlock scenario which syzbot is reporting is a race between > kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() > with port->lock held and printk() from __build_all_zonelists() with > zonelist_update_seq held. > > CPU0 CPU1 > ---- ---- > pty_write() { > tty_insert_flip_string_and_push_buffer() { > __build_all_zonelists() { > write_seqlock(&zonelist_update_seq); > build_zonelists() { > printk() { > vprintk() { > vprintk_default() { > vprintk_emit() { > console_unlock() { > console_flush_all() { > console_emit_next_record() { > con->write() = serial8250_console_write() { > spin_lock_irqsave(&port->lock, flags); > tty_insert_flip_string() { > tty_insert_flip_string_fixed_flag() { > __tty_buffer_request_room() { > tty_buffer_alloc() { > kmalloc(GFP_ATOMIC | __GFP_NOWARN) { > __alloc_pages_slowpath() { > zonelist_iter_begin() { > read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd > spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held > } > } > } > } > } > } > } > } > spin_unlock_irqrestore(&port->lock, flags); > // message is printed to console > spin_unlock_irqrestore(&port->lock, flags); > } > } > } > } > } > } > } > } > } > write_sequnlock(&zonelist_update_seq); > } > } > } > > This deadlock scenario can be eliminated by > > preventing interrupt context from calling kmalloc(GFP_ATOMIC) > > and > > preventing printk() from calling console_flush_all() > > while zonelist_update_seq.seqcount is odd. > > Since Petr Mladek thinks that __build_all_zonelists() can become a > candidate for deferring printk() [2], let's address this problem by > > disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC) > > and > > disabling synchronous printk() in order to avoid console_flush_all() > > . > > As a side effect of minimizing duration of zonelist_update_seq.seqcount > being odd by disabling synchronous printk(), latency at > read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and > __GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from > lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e. > do not record unnecessary locking dependency) from interrupt context is > still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC) inside > write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq) > section... I have really hard time to wrap my head around this changelog. I would rephrase as follows. The syzbot has noticed the following deadlock scenario[1] CPU0 CPU1 pty_write() { tty_insert_flip_string_and_push_buffer() { __build_all_zonelists() { write_seqlock(&zonelist_update_seq); (A) build_zonelists() { printk() { vprintk() { vprintk_default() { vprintk_emit() { console_unlock() { console_flush_all() { console_emit_next_record() { con->write() = serial8250_console_write() { spin_lock_irqsave(&port->lock, flags); (B) spin_lock_irqsave(&port->lock, flags); <<< spinning on (B) tty_insert_flip_string() { tty_insert_flip_string_fixed_flag() { __tty_buffer_request_room() { tty_buffer_alloc() { kmalloc(GFP_ATOMIC | __GFP_NOWARN) { __alloc_pages_slowpath() { zonelist_iter_begin() { read_seqbegin(&zonelist_update_seq); <<< spinning on (A) This can happen during memory hotplug operation. This means that write_seqlock on zonelist_update_seq is not allowed to call into synchronous printk code path. This can be avoided by using a deferred printk context. This is not the only problematic scenario though. Another one would be __build_all_zonelists() { write_seqlock(&zonelist_update_seq); <<< (A) <IRQ> kmalloc(GFP_ATOMIC) { __alloc_pages_slowpath() { read_seqbegin(&zonelist_update_seq) <<< spinning on (A) Allocations from (soft)IRQ contexts are quite common. This can be avoided by disabling interrupts for this path so we won't self livelock. > Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com> > Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1] > Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") > Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2] > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Petr Mladek <pmladek@suse.com> Anyway the patch is correct Acked-by: Michal Hocko <mhocko@suse.com> > --- > Changes in v2: > Update patch description and comment. > > mm/page_alloc.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 7136c36c5d01..e8b4f294d763 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data) > int nid; > int __maybe_unused cpu; > pg_data_t *self = data; > + unsigned long flags; > > + /* > + * Explicitly disable this CPU's interrupts before taking seqlock > + * to prevent any IRQ handler from calling into the page allocator > + * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock. > + */ > + local_irq_save(flags); > + /* > + * Explicitly disable this CPU's synchronous printk() before taking > + * seqlock to prevent any printk() from trying to hold port->lock, for > + * tty_insert_flip_string_and_push_buffer() on other CPU might be > + * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. > + */ > + printk_deferred_enter(); > write_seqlock(&zonelist_update_seq); > > #ifdef CONFIG_NUMA > @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data) > } > > write_sequnlock(&zonelist_update_seq); > + printk_deferred_exit(); > + local_irq_restore(flags); > } > > static noinline void __init > -- > 2.34.1
On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: > syzbot is reporting circular locking dependency which involves > zonelist_update_seq seqlock [1], for this lock is checked by memory > allocation requests which do not need to be retried. I queued this, along with a note that an updated changelog is likely. Do we feel that a -stable backport is warranted? I think so, from your earlier comments. Please add the cc:stable to the changelog in this situation.
On Tue 04-04-23 14:25:28, Andrew Morton wrote: > On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: > > > syzbot is reporting circular locking dependency which involves > > zonelist_update_seq seqlock [1], for this lock is checked by memory > > allocation requests which do not need to be retried. > > I queued this, along with a note that an updated changelog is likely. > > Do we feel that a -stable backport is warranted? I think so, from your > earlier comments. Please add the cc:stable to the changelog in this > situation. Memory hotplug is pretty rare event so the deadlock is quite unlikely. On the other hand the fix is pretty easy so it shouldn't hurt to have it in stable kernels.
On Wed 2023-04-05 10:28:07, Michal Hocko wrote: > On Tue 04-04-23 14:25:28, Andrew Morton wrote: > > On Tue, 4 Apr 2023 23:31:58 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: > > > > > syzbot is reporting circular locking dependency which involves > > > zonelist_update_seq seqlock [1], for this lock is checked by memory > > > allocation requests which do not need to be retried. > > > > I queued this, along with a note that an updated changelog is likely. > > > > Do we feel that a -stable backport is warranted? I think so, from your > > earlier comments. Please add the cc:stable to the changelog in this > > situation. > > Memory hotplug is pretty rare event so the deadlock is quite unlikely. > On the other hand the fix is pretty easy so it shouldn't hurt to have it > in stable kernels. Note that printk_deferred_enter()/exit() has been added in v5.15-rc1 by the commit 85e3e7fbbb720b9897 ("printk: remove NMI tracking"). The commit has non-trivial dependencies. Any backport for older stable kernel would need a custom patch just adding these two wrappers. Best Regards, Petr
On Tue, Apr 04, 2023 at 05:20:35PM +0200, Michal Hocko wrote: > On Tue 04-04-23 23:31:58, Tetsuo Handa wrote: > > syzbot is reporting circular locking dependency which involves > > zonelist_update_seq seqlock [1], for this lock is checked by memory > > allocation requests which do not need to be retried. > > > > One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler. > > > > CPU0 > > ---- > > __build_all_zonelists() { > > write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd > > // e.g. timer interrupt handler runs at this moment > > some_timer_func() { > > kmalloc(GFP_ATOMIC) { > > __alloc_pages_slowpath() { > > read_seqbegin(&zonelist_update_seq) { > > // spins forever because zonelist_update_seq.seqcount is odd > > } > > } > > } > > } > > // e.g. timer interrupt handler finishes > > write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even > > } > > > > This deadlock scenario can be easily eliminated by not calling > > read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation > > requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation > > requests. But Michal Hocko does not know whether we should go with this > > approach. > > It would have been more useful to explain why that is not preferred or > desirable. > It would have undesirable side-effects. !__GFP_DIRECT_RECLAIM does not mean that the caller is happening from interrupt context or is a re-entrant allocation request from IRQ context. Special casing __GFP_DIRECT_RECLAIM could allow a speculative allocation to simply prematurely fail due to memory hotplug. This deadlock could be mitigated by not calling read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation request but it has undesirable consequences. !GFP_DIRECT_RECLAIM applies to allocations other than atomic allocations from interrupt context such as GFP_NOWAIT allocations. These type of allocations could prematurely fail due to a memory hotplug race or cpuset update and while this is probably harmless and recoverable, it's undesirable and unnecessary to fix the deadlock. I imagine there could also be checks for callers from IRQ context but on some older chips, checking if IRQs are disabled is expensive so also is an undesirable fix. Maybe the same is true for newer CPUs, but I didn't check. The printk_deferred option seems the most harmless option for now and maybe printk_deferred will ultimately go away. Other than potential changelog updates Acked-by: Mel Gorman <mgorman@techsingularity.net>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7136c36c5d01..e8b4f294d763 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data) int nid; int __maybe_unused cpu; pg_data_t *self = data; + unsigned long flags; + /* + * Explicitly disable this CPU's interrupts before taking seqlock + * to prevent any IRQ handler from calling into the page allocator + * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock. + */ + local_irq_save(flags); + /* + * Explicitly disable this CPU's synchronous printk() before taking + * seqlock to prevent any printk() from trying to hold port->lock, for + * tty_insert_flip_string_and_push_buffer() on other CPU might be + * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. + */ + printk_deferred_enter(); write_seqlock(&zonelist_update_seq); #ifdef CONFIG_NUMA @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data) } write_sequnlock(&zonelist_update_seq); + printk_deferred_exit(); + local_irq_restore(flags); } static noinline void __init
syzbot is reporting circular locking dependency which involves zonelist_update_seq seqlock [1], for this lock is checked by memory allocation requests which do not need to be retried. One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler. CPU0 ---- __build_all_zonelists() { write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd // e.g. timer interrupt handler runs at this moment some_timer_func() { kmalloc(GFP_ATOMIC) { __alloc_pages_slowpath() { read_seqbegin(&zonelist_update_seq) { // spins forever because zonelist_update_seq.seqcount is odd } } } } // e.g. timer interrupt handler finishes write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even } This deadlock scenario can be easily eliminated by not calling read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation requests. But Michal Hocko does not know whether we should go with this approach. Another deadlock scenario which syzbot is reporting is a race between kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with port->lock held and printk() from __build_all_zonelists() with zonelist_update_seq held. CPU0 CPU1 ---- ---- pty_write() { tty_insert_flip_string_and_push_buffer() { __build_all_zonelists() { write_seqlock(&zonelist_update_seq); build_zonelists() { printk() { vprintk() { vprintk_default() { vprintk_emit() { console_unlock() { console_flush_all() { console_emit_next_record() { con->write() = serial8250_console_write() { spin_lock_irqsave(&port->lock, flags); tty_insert_flip_string() { tty_insert_flip_string_fixed_flag() { __tty_buffer_request_room() { tty_buffer_alloc() { kmalloc(GFP_ATOMIC | __GFP_NOWARN) { __alloc_pages_slowpath() { zonelist_iter_begin() { read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held } } } } } } } } spin_unlock_irqrestore(&port->lock, flags); // message is printed to console spin_unlock_irqrestore(&port->lock, flags); } } } } } } } } } write_sequnlock(&zonelist_update_seq); } } } This deadlock scenario can be eliminated by preventing interrupt context from calling kmalloc(GFP_ATOMIC) and preventing printk() from calling console_flush_all() while zonelist_update_seq.seqcount is odd. Since Petr Mladek thinks that __build_all_zonelists() can become a candidate for deferring printk() [2], let's address this problem by disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC) and disabling synchronous printk() in order to avoid console_flush_all() . As a side effect of minimizing duration of zonelist_update_seq.seqcount being odd by disabling synchronous printk(), latency at read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and __GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e. do not record unnecessary locking dependency) from interrupt context is still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC) inside write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq) section... Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1] Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Michal Hocko <mhocko@suse.com> Cc: Petr Mladek <pmladek@suse.com> --- Changes in v2: Update patch description and comment. mm/page_alloc.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)