Message ID | 20230621104034.HT6QnNkQ@linutronix.de (mailing list archive)
---|---
State | New
Series | mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
On Wed 21-06-23 12:40:34, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT because the inner spinlock_t is now acquired with disabled
> interrupts.

And the spinlock might sleep with PREEMPT_RT, so a deadlock, right? It
would be better to call that out explicitly.

> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
>
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks

> ---
>  mm/page_alloc.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..99b7e7d09c5c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
>  	unsigned long flags;
>
>  	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 * The zonelist_update_seq must be acquired with irqsave because the
> +	 * reader can be invoked from IRQ with GFP_ATOMIC.
>  	 */
> -	local_irq_save(flags);
> +	write_seqlock_irqsave(&zonelist_update_seq, flags);
>  	/*
>  	 * Explicitly disable this CPU's synchronous printk() before taking
>  	 * seqlock to prevent any printk() from trying to hold port->lock, for
> @@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
>  	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>  	 */
>  	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
>
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>
> -	write_sequnlock(&zonelist_update_seq);
>  	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
>  }
>
>  static noinline void __init
> --
> 2.40.1
On 21.06.23 12:40, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT because the inner spinlock_t is now acquired with disabled
> interrupts.
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
>
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  mm/page_alloc.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..99b7e7d09c5c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
>  	unsigned long flags;
>
>  	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 * The zonelist_update_seq must be acquired with irqsave because the
> +	 * reader can be invoked from IRQ with GFP_ATOMIC.
>  	 */
> -	local_irq_save(flags);
> +	write_seqlock_irqsave(&zonelist_update_seq, flags);
>  	/*
>  	 * Explicitly disable this CPU's synchronous printk() before taking
>  	 * seqlock to prevent any printk() from trying to hold port->lock, for
> @@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
>  	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>  	 */
>  	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
>
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>
> -	write_sequnlock(&zonelist_update_seq);
>  	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
>  }
>
>  static noinline void __init

Reviewed-by: David Hildenbrand <david@redhat.com>
On 2023-06-21 12:59:44 [+0200], Michal Hocko wrote:
> On Wed 21-06-23 12:40:34, Sebastian Andrzej Siewior wrote:
> > __build_all_zonelists() acquires zonelist_update_seq by first disabling
> > interrupts via local_irq_save() and then acquiring the seqlock with
> > write_seqlock(). This is troublesome and leads to problems on
> > PREEMPT_RT because the inner spinlock_t is now acquired with disabled
> > interrupts.
>
> And the spinlock might sleep with PREEMPT_RT, so a deadlock, right? It
> would be better to call that out explicitly.

No, no deadlock. Let me double check this in a VM with mem-hotplug later
on, but I don't expect an IRQ path. If there were one, there would be
more broken pieces…

What can happen on PREEMPT_RT is that the writer is preempted by a
high-priority reader which then deadlocks, because the reader spins while
waiting and the writer is blocked. For this issue we have lock + unlock
in the seq reader to PI-boost the seq writer so it can make progress.

Sebastian
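For reference, the PREEMPT_RT behaviour Sebastian describes can be sketched as follows. This is a simplified illustration, not the verbatim include/linux/seqlock.h source; the function name and the field access are abbreviated for the sketch:

	/*
	 * Sketch of the seqlock_t read side on PREEMPT_RT: instead of
	 * spinning while the sequence is odd, the reader briefly acquires
	 * the writer's rtmutex-based spinlock_t.  Blocking on it PI-boosts
	 * a preempted writer so the write section can finish and the
	 * sequence becomes even again.
	 */
	static unsigned rt_read_sequence(seqlock_t *sl)
	{
		unsigned seq = READ_ONCE(sl->seqcount.sequence);

		while (seq & 1) {		/* writer in progress */
			spin_lock(&sl->lock);	/* sleeps and boosts the writer */
			spin_unlock(&sl->lock);
			seq = READ_ONCE(sl->seqcount.sequence);
		}
		return seq;
	}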
On 2023/06/21 19:40, Sebastian Andrzej Siewior wrote:
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.

I can't follow. local_irq_save(flags); makes a non-migrate-able context
because sleeping is not allowed while IRQs are disabled, doesn't it?

> This is the case after zonelist_update_seq has been acquired.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.

What guarantees that write_seqlock_irqsave() never calls printk()
(e.g. a lockdep warning) before printk_deferred_enter() takes effect?
On Wed 21-06-23 13:16:36, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 12:59:44 [+0200], Michal Hocko wrote:
> > On Wed 21-06-23 12:40:34, Sebastian Andrzej Siewior wrote:
> > > __build_all_zonelists() acquires zonelist_update_seq by first disabling
> > > interrupts via local_irq_save() and then acquiring the seqlock with
> > > write_seqlock(). This is troublesome and leads to problems on
> > > PREEMPT_RT because the inner spinlock_t is now acquired with disabled
> > > interrupts.
> >
> > And the spinlock might sleep with PREEMPT_RT, so a deadlock, right? It
> > would be better to call that out explicitly.
>
> No, no deadlock. Let me double check this in a VM with mem-hotplug later
> on, but I don't expect an IRQ path. If there were one, there would be
> more broken pieces…
>
> What can happen on PREEMPT_RT is that the writer is preempted by a
> high-priority reader which then deadlocks, because the reader spins while
> waiting and the writer is blocked. For this issue we have lock + unlock
> in the seq reader to PI-boost the seq writer so it can make progress.

Please state the problem explicitly in the changelog. You are
marking this patch as a fix, so the underlying issue should be stated.

Thanks!
On Wed 2023-06-21 20:33:35, Tetsuo Handa wrote:
> On 2023/06/21 19:40, Sebastian Andrzej Siewior wrote:
> > printk_deferred_enter() has to be invoked in non-migrate-able context to
> > ensure that deferred printing is enabled and disabled on the same CPU.
>
> I can't follow. local_irq_save(flags); makes a non-migrate-able context
> because sleeping is not allowed while IRQs are disabled, doesn't it?
>
> > This is the case after zonelist_update_seq has been acquired.
> >
> > Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> > printk output.
>
> What guarantees that write_seqlock_irqsave() never calls printk()
> (e.g. a lockdep warning) before printk_deferred_enter() takes effect?

I think that we should explicitly disable preemption in
printk_deferred_enter(). The disabled interrupts are not
strictly necessary.

I am going to resurrect the patch
https://lore.kernel.org/r/20230419074210.17646-1-pmladek@suse.com
and add the preempt_disable()/enable() at the same time.

Is this going to work for you?

Best Regards,
Petr
On 2023-06-21 20:33:35 [+0900], Tetsuo Handa wrote:
> On 2023/06/21 19:40, Sebastian Andrzej Siewior wrote:
> > printk_deferred_enter() has to be invoked in non-migrate-able context to
> > ensure that deferred printing is enabled and disabled on the same CPU.
>
> I can't follow. local_irq_save(flags); makes a non-migrate-able context
> because sleeping is not allowed while IRQs are disabled, doesn't it?

That is correct. The problem is the local_irq_save() which remains on
PREEMPT_RT, and this is problematic during write_seqlock() which acquires
a spinlock_t, which becomes a sleeping lock. write_seqlock_irqsave()
substitutes everything properly, so on RT there is no local_irq_save().

> > This is the case after zonelist_update_seq has been acquired.
> >
> > Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> > printk output.
>
> What guarantees that write_seqlock_irqsave() never calls printk()
> (e.g. a lockdep warning) before printk_deferred_enter() takes effect?

Actually nothing, but this could be fixed by doing the lockdep annotation
first, followed by the actual increment in do_write_seqcount_begin_nested().
I don't know why it is the other way around here, since we usually do the
annotation followed by the locking in order to see the splat before the
deadlock happens. Let me do a patch for that since it looks like the right
thing to do.

If this "rare" issue is a blocker, I don't see a sane way around it. Be
aware that tty_insert_flip_string_and_push_buffer() has atomic/KASAN
annotations which _could_ detect something which then leads to pr_err()
invocations while tty_port::lock is acquired. Looking over
tty_port_close_start(), there are pr_warn() invocations with the lock
held, so again, possibilities.

Petr, is printing via the printk threads and direct printing via an
explicit driver callback (which would avoid this problem) on the agenda?

Sebastian
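The reordering Sebastian proposes can be sketched like this (the function body as quoted later in the thread; the comments are added here for illustration):

	static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
	{
		/* lockdep annotation first: a splat printed here runs while
		 * the sequence is still even, so no reader can spin against
		 * this writer during the printk() */
		seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
		/* only now does the sequence count become odd */
		do_raw_write_seqcount_begin(s);
	}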
On 2023-06-21 14:40:50 [+0200], Petr Mladek wrote:
> I think that we should explicitly disable preemption in
> printk_deferred_enter(). The disabled interrupts are not
> strictly necessary.
>
> I am going to resurrect the patch
> https://lore.kernel.org/r/20230419074210.17646-1-pmladek@suse.com
> and add the preempt_disable()/enable() at the same time.
>
> Is this going to work for you?

No. migrate_disable() is the minimal thing that is needed. However, it
is heavy-weight compared to what is required. Also, by disabling
preemption we still have the problem later on with the seqlock_t. Let
me swap the lockdep annotation and we should be good.

> Best Regards,
> Petr

Sebastian
On 2023-06-21 13:49:06 [+0200], Michal Hocko wrote:
> > What can happen on PREEMPT_RT is that the writer is preempted by a
> > high-priority reader which then deadlocks, because the reader spins while
> > waiting and the writer is blocked. For this issue we have lock + unlock
> > in the seq reader to PI-boost the seq writer so it can make progress.
>
> Please state the problem explicitly in the changelog. You are
> marking this patch as a fix, so the underlying issue should be stated.

The problem is the "local_irq_save()", which I believe I stated. The lock
+ unlock story was just a side story and is already covered. I really
need just the local_irq_save() invocation to be part of the seqlock API
so it can be substituted away.

> Thanks!

Sebastian
On Wed 21-06-23 15:11:25, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 13:49:06 [+0200], Michal Hocko wrote:
> > > What can happen on PREEMPT_RT is that the writer is preempted by a
> > > high-priority reader which then deadlocks, because the reader spins while
> > > waiting and the writer is blocked. For this issue we have lock + unlock
> > > in the seq reader to PI-boost the seq writer so it can make progress.
> >
> > Please state the problem explicitly in the changelog. You are
> > marking this patch as a fix, so the underlying issue should be stated.
>
> The problem is the "local_irq_save()", which I believe I stated. The lock
> + unlock story was just a side story and is already covered. I really
> need just the local_irq_save() invocation to be part of the seqlock API
> so it can be substituted away.

I really do not want to nitpick, but your changelog states:
"This is troublesome and leads to problems on PREEMPT_RT because the
inner spinlock_t is now acquired with disabled interrupts."

I believe it would be beneficial to state why this is troublesome,
because not everybody has insight into PREEMPT_RT and all the
consequences.

Thanks!
On 2023-06-21 15:22:29 [+0200], Michal Hocko wrote:
> > The problem is the "local_irq_save()", which I believe I stated. The lock
> > + unlock story was just a side story and is already covered. I really
> > need just the local_irq_save() invocation to be part of the seqlock API
> > so it can be substituted away.
>
> I really do not want to nitpick, but your changelog states:
> "This is troublesome and leads to problems on PREEMPT_RT because the
> inner spinlock_t is now acquired with disabled interrupts."
>
> I believe it would be beneficial to state why this is troublesome,
> because not everybody has insight into PREEMPT_RT and all the
> consequences.

Now after re-reading it I do understand what you mean.

> Thanks!

Sebastian
On 2023/06/21 22:06, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 20:33:35 [+0900], Tetsuo Handa wrote:
>> On 2023/06/21 19:40, Sebastian Andrzej Siewior wrote:
>>> printk_deferred_enter() has to be invoked in non-migrate-able context to
>>> ensure that deferred printing is enabled and disabled on the same CPU.
>>
>> I can't follow. local_irq_save(flags); makes a non-migrate-able context
>> because sleeping is not allowed while IRQs are disabled, doesn't it?
>
> That is correct. The problem is the local_irq_save() which remains on
> PREEMPT_RT, and this is problematic during write_seqlock() which acquires
> a spinlock_t, which becomes a sleeping lock. write_seqlock_irqsave()
> substitutes everything properly, so on RT there is no local_irq_save().

include/linux/seqlock.h says

static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl)
{
	unsigned long flags;

	spin_lock_irqsave(&sl->lock, flags);
	do_write_seqcount_begin(&sl->seqcount.seqcount);
	return flags;
}

/**
 * write_seqlock_irqsave() - start a non-interruptible seqlock_t write
 * section
 * @lock:  Pointer to seqlock_t
 * @flags: Stack-allocated storage for saving caller's local interrupt
 *         state, to be passed to write_sequnlock_irqrestore().
 *
 * _irqsave variant of write_seqlock(). Use it only if the read side
 * section, or other write sections, can be invoked from hardirq context.
 */
#define write_seqlock_irqsave(lock, flags) \
	do { flags = __write_seqlock_irqsave(lock); } while (0)

/**
 * write_seqlock() - start a seqlock_t write side critical section
 * @sl: Pointer to seqlock_t
 *
 * write_seqlock opens a write side critical section for the given
 * seqlock_t. It also implicitly acquires the spinlock_t embedded inside
 * that sequential lock. All seqlock_t write side sections are thus
 * automatically serialized and non-preemptible.
 *
 * Context: if the seqlock_t read section, or other write side critical
 * sections, can be invoked from hardirq or softirq contexts, use the
 * _irqsave or _bh variants of this function instead.
 */
static inline void write_seqlock(seqlock_t *sl)
{
	spin_lock(&sl->lock);
	do_write_seqcount_begin(&sl->seqcount.seqcount);
}

and, regarding include/linux/spinlock.h, since spin_lock_irqsave(lock, flags)
is equivalent to local_irq_save(flags) + spin_lock(lock), there is no
difference between

	local_irq_save(flags);
	printk_deferred_enter();
	write_seqlock(&zonelist_update_seq);

and

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();

(except a potential lockdep warning). However, regarding
include/linux/spinlock_rt.h, since spin_lock_irqsave(lock, flags) is
equivalent to spin_lock(lock), there is a difference between

	local_irq_save(flags);
	printk_deferred_enter();
	write_seqlock(&zonelist_update_seq);

and

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();

. Is the above understanding correct?

And you are trying to replace

	local_irq_save(flags);
	printk_deferred_enter();
	write_seqlock(&zonelist_update_seq);

with

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();

, aren't you? But include/linux/printk.h says

/*
 * The printk_deferred_enter/exit macros are available only as a hack for
 * some code paths that need to defer all printk console printing. Interrupts
 * must be disabled for the deferred duration.
 */
#define printk_deferred_enter __printk_safe_enter
#define printk_deferred_exit __printk_safe_exit

which means that local_irq_save() is _required_ before printk_deferred_enter().

If local_irq_save() is hidden by your patch, what guarantees that
printk_deferred_enter() and printk_deferred_exit() run on the same CPU?

Also, if local_irq_save() is hidden due to RT, what guarantees that

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	<<IRQ>>
	  some_timer_function() {
	    printk();
	  }
	<</IRQ>>
	printk_deferred_enter();

does not happen, given that write_seqlock_irqsave() does not disable IRQs?

Disabling IRQs before incrementing zonelist_update_seq is _required_ for both

  making printk_deferred_enter() safe

and

  making sure that printk_deferred_enter() takes effect

.
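The difference Tetsuo spells out above can be summarized as a hypothetical helper; this is a sketch only (sketch_write_seqlock_irqsave() is an illustrative name, not a kernel API, and the real code goes through several wrapper layers):

#ifndef CONFIG_PREEMPT_RT
/* !RT: spin_lock_irqsave() == local_irq_save() + spin_lock() */
static unsigned long sketch_write_seqlock_irqsave(seqlock_t *sl)
{
	unsigned long flags;

	local_irq_save(flags);		/* hard IRQs off on this CPU ... */
	spin_lock(&sl->lock);
	do_write_seqcount_begin(&sl->seqcount.seqcount); /* ... before the count goes odd */
	return flags;
}
#else
/* RT: spin_lock_irqsave() == spin_lock() on a sleeping lock; it
 * disables migration, not hard IRQs, and flags stay unused */
static unsigned long sketch_write_seqlock_irqsave(seqlock_t *sl)
{
	spin_lock(&sl->lock);
	do_write_seqcount_begin(&sl->seqcount.seqcount);
	return 0;
}
#endif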
On 2023-06-21 22:32:40 [+0900], Tetsuo Handa wrote:
> include/linux/seqlock.h says
…
> Is the above understanding correct?

That is correct.

> And you are trying to replace
>
> 	local_irq_save(flags);
> 	printk_deferred_enter();
> 	write_seqlock(&zonelist_update_seq);
>
> with
>
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> 	printk_deferred_enter();
>
> , aren't you?

Correct.

> But include/linux/printk.h says
>
> /*
>  * The printk_deferred_enter/exit macros are available only as a hack for
>  * some code paths that need to defer all printk console printing. Interrupts
>  * must be disabled for the deferred duration.
>  */
> #define printk_deferred_enter __printk_safe_enter
> #define printk_deferred_exit __printk_safe_exit
>
> which means that local_irq_save() is _required_ before printk_deferred_enter().

It says that, yes, but that documented requirement is heavier than
necessary. The actual requirement is that printk_deferred_enter() happens
on the same CPU as printk_deferred_exit(). This can be achieved by an
explicit local_irq_save(), yes, but also by something like spin_lock_irq()
which _ensures_ that the task does not migrate to another CPU while the
lock is acquired. This is the requirement of the current implementation.

> If local_irq_save() is hidden by your patch, what guarantees that
> printk_deferred_enter() and printk_deferred_exit() run on the same CPU?

Because spin_lock_irqsave() on CONFIG_PREEMPT_RT uses migrate_disable().
The function ensures that the scheduler does not migrate the task to
another CPU. The CPU is even blocked from going down (as in CPU-hotplug)
until the matching migrate_enable() occurs.

> Also, if local_irq_save() is hidden due to RT, what guarantees that
>
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> 	<<IRQ>>
> 	  some_timer_function() {
> 	    printk();
> 	  }
> 	<</IRQ>>
> 	printk_deferred_enter();
>
> does not happen, given that write_seqlock_irqsave() does not disable IRQs?

I don't see how zonelist_update_seq and printk are connected here
without the port lock or a memory allocation. But there are two things
that are different on RT which probably answer your question:

- If the reader observes an odd sequence number, then it acquires the
  lock of the sequence counter (it is held by the writer), which forces
  the writer to complete the write critical section, and then the reader
  can continue. There are _no_ memory allocations within hard IRQ
  context (as in the actual interrupt). The timers (hrtimer or
  timer_list timer) are served in task context and we have
  forced-threaded interrupts. Clearly this means that the seqlock_t (as
  used here) can only be used in task context and not in hard IRQ
  context.

- The printk implementation is slightly different and work is ongoing
  to merge it upstream. The two important differences here:
  - Printing happens by default in a dedicated printing thread.
  - In emergency cases like panic(), printing happens directly within
    the invocation of printk(). This requires a so-called atomic console
    which does not use the tty_port::lock.

> Disabling IRQs before incrementing zonelist_update_seq is _required_ for both
>
>   making printk_deferred_enter() safe
>
> and
>
>   making sure that printk_deferred_enter() takes effect
>
> .

Did I explain why it is sufficient to do

	write_seqlock_irqsave()
	printk_deferred_enter()

assuming we have

| static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
| {
| 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
| 	do_raw_write_seqcount_begin(s);
| }

?

Sebastian
On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
>> Also, if local_irq_save() is hidden due to RT, what guarantees that
>>
>> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
>> 	<<IRQ>>
>> 	  some_timer_function() {
>> 	    printk();
>> 	  }
>> 	<</IRQ>>
>> 	printk_deferred_enter();
>>
>> does not happen, given that write_seqlock_irqsave() does not disable IRQs?
>
> I don't see how zonelist_update_seq and printk are connected here
> without the port lock or a memory allocation. But there are two things
> that are different on RT which probably answer your question:

It is explained as the first deadlock scenario in commit 1007843a9190
("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
We have to disable IRQs before making zonelist_update_seq.seqcount odd.
On Wed 2023-06-21 16:34:21, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 22:32:40 [+0900], Tetsuo Handa wrote:
> > include/linux/seqlock.h says
> …
> > Is the above understanding correct?
>
> That is correct.
>
> > And you are trying to replace
> >
> > 	local_irq_save(flags);
> > 	printk_deferred_enter();
> > 	write_seqlock(&zonelist_update_seq);
> >
> > with
> >
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > 	printk_deferred_enter();
> >
> > , aren't you?
>
> Correct.
>
> > But include/linux/printk.h says
> >
> > /*
> >  * The printk_deferred_enter/exit macros are available only as a hack for
> >  * some code paths that need to defer all printk console printing. Interrupts
> >  * must be disabled for the deferred duration.
> >  */
> > #define printk_deferred_enter __printk_safe_enter
> > #define printk_deferred_exit __printk_safe_exit
> >
> > which means that local_irq_save() is _required_ before printk_deferred_enter().
>
> It says that, yes, but that documented requirement is heavier than
> necessary. The actual requirement is that printk_deferred_enter() happens
> on the same CPU as printk_deferred_exit().

True.

> This can be achieved by an
> explicit local_irq_save(), yes, but also by something like spin_lock_irq()
> which _ensures_ that the task does not migrate to another CPU while the
> lock is acquired. This is the requirement of the current implementation.

Good to know.

> > If local_irq_save() is hidden by your patch, what guarantees that
> > printk_deferred_enter() and printk_deferred_exit() run on the same CPU?
>
> Because spin_lock_irqsave() on CONFIG_PREEMPT_RT uses migrate_disable().
> The function ensures that the scheduler does not migrate the task to
> another CPU. The CPU is even blocked from going down (as in CPU-hotplug)
> until the matching migrate_enable() occurs.
>
> > Also, if local_irq_save() is hidden due to RT, what guarantees that
> >
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > 	<<IRQ>>
> > 	  some_timer_function() {
> > 	    printk();
> > 	  }
> > 	<</IRQ>>
> > 	printk_deferred_enter();
> >
> > does not happen, given that write_seqlock_irqsave() does not disable IRQs?
>
> I don't see how zonelist_update_seq and printk are connected here
> without the port lock or a memory allocation. But there are two things
> that are different on RT which probably answer your question:
>
> - If the reader observes an odd sequence number, then it acquires the
>   lock of the sequence counter (it is held by the writer), which forces
>   the writer to complete the write critical section, and then the reader
>   can continue. There are _no_ memory allocations within hard IRQ
>   context (as in the actual interrupt). The timers (hrtimer or
>   timer_list timer) are served in task context and we have
>   forced-threaded interrupts. Clearly this means that the seqlock_t (as
>   used here) can only be used in task context and not in hard IRQ
>   context.
>
> - The printk implementation is slightly different and work is ongoing
>   to merge it upstream. The two important differences here:
>   - Printing happens by default in a dedicated printing thread.
>   - In emergency cases like panic(), printing happens directly within
>     the invocation of printk(). This requires a so-called atomic console
>     which does not use the tty_port::lock.

If I get it correctly, RT solves both possible deadlocks by offloading
the nested operation into a kthread (irq and printk threads).
Plus, printk uses emergency_write() when the kthread is not usable.

If this is true, then printk_safe_enter() might be a nop on RT.
All possible deadlocks are prevented either by the kthread or
the console->emergency_write() call.

But wait. AFAIK, the printk kthread is implemented only for the serial
console. So, it won't be safe for other consoles, especially
the problematic tty_insert_flip_string_and_push_buffer() call.

Note that adding a printk thread for the graphical consoles will
be a real challenge. We have hard times even with the basic
UART 8250. There are still races possible in the implementation
in the RT tree...

OK, what about using migrate_disable() in printk_deferred_enter()?
Something like:

/*
 * The printk_deferred_enter/exit macros are available only as a hack.
 * They define a per-CPU context where all printk console printing
 * is deferred because it might cause a deadlock otherwise.
 *
 * It is highly recommended to use them only in a context with interrupts
 * disabled. Otherwise, other unrelated printk() calls might be deferred
 * when they interrupt/preempt the deferred code section.
 *
 * They should not be used to defer printing of many messages. It might
 * create a softlockup when they are flushed later.
 *
 * IMPORTANT: Any new use of these MUST be consulted with printk maintainers.
 * It might have unexpected side effects on the printk infrastructure.
 */
#ifdef CONFIG_PREEMPT_RT

#define printk_deferred_enter()			\
	do {					\
		migrate_disable();		\
		__printk_safe_enter();		\
	} while (0)

#define printk_deferred_exit()			\
	do {					\
		__printk_safe_exit();		\
		migrate_enable();		\
	} while (0)

#else /* CONFIG_PREEMPT_RT */

#define printk_deferred_enter()			\
	do {					\
		preempt_disable();		\
		__printk_safe_enter();		\
	} while (0)

#define printk_deferred_exit()			\
	do {					\
		__printk_safe_exit();		\
		preempt_enable();		\
	} while (0)

#endif /* CONFIG_PREEMPT_RT */

Note that I have used preempt_disable() on non-RT because it is much
cheaper. And IRQs will be disabled anyway on non-RT systems
in this code path.

> > Disabling IRQs before incrementing zonelist_update_seq is _required_ for both
> >
> >   making printk_deferred_enter() safe
> >
> > and
> >
> >   making sure that printk_deferred_enter() takes effect
> >
> > .
>
> Did I explain why it is sufficient to do
>
> 	write_seqlock_irqsave()
> 	printk_deferred_enter()
>
> assuming we have
>
> | static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
> | {
> | 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
> | 	do_raw_write_seqcount_begin(s);
> | }

Will this prevent any printk() called on the same CPU before
printk_deferred_enter() is called?

Best Regards,
Petr
On 2023/06/21 23:50, Tetsuo Handa wrote:
> On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
> >> Also, if local_irq_save() is hidden due to RT, what guarantees that
> >>
> >> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> >> 	<<IRQ>>
> >> 	  some_timer_function() {
> >> 	    printk();
> >> 	  }
> >> 	<</IRQ>>
> >> 	printk_deferred_enter();
> >>
> >> does not happen, given that write_seqlock_irqsave() does not disable IRQs?
> >
> > I don't see how zonelist_update_seq and printk are connected here
> > without the port lock or a memory allocation. But there are two things
> > that are different on RT which probably answer your question:
>
> It is explained as the first deadlock scenario in commit 1007843a9190
> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> We have to disable IRQs before making zonelist_update_seq.seqcount odd.

Since we must replace local_irq_save() + write_seqlock() with
write_seqlock_irqsave() for the CONFIG_PREEMPT_RT=y case, but must not
replace local_irq_save() + write_seqlock() with write_seqlock_irqsave()
for the CONFIG_PREEMPT_RT=n case, is the proper fix something like below?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..e3e9bd719dcc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5798,28 +5798,30 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 #define BOOT_PAGESET_HIGH	0
 #define BOOT_PAGESET_BATCH	1
 static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats);
 
 static void __build_all_zonelists(void *data)
 {
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+#ifndef CONFIG_PREEMPT_RT
 	unsigned long flags;
 
 	/*
 	 * Explicitly disable this CPU's interrupts before taking seqlock
 	 * to prevent any IRQ handler from calling into the page allocator
 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
 	 */
 	local_irq_save(flags);
+#endif
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
 	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -5852,21 +5854,23 @@ static void __build_all_zonelists(void *data)
 	 * secondary cpus' numa_mem as they come on-line. During
 	 * node/memory hotplug, we'll fixup all on-line cpus.
 	 */
 	for_each_online_cpu(cpu)
 		set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
 #endif
 	}
 
 	write_sequnlock(&zonelist_update_seq);
 	printk_deferred_exit();
+#ifndef CONFIG_PREEMPT_RT
 	local_irq_restore(flags);
+#endif
 }
 
 static noinline void __init
 build_all_zonelists_init(void)
 {
 	int cpu;
 
 	__build_all_zonelists(NULL);
 
 /*

By the way, given the

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	<<IRQ>>
	  some_timer_function() {
	    kmalloc(GFP_ATOMIC);
	  }
	<</IRQ>>
	printk_deferred_enter();

scenario: if the CONFIG_PREEMPT_RT=y case is handled by executing
some_timer_function() on a dedicated kernel thread for IRQs, what
guarantees that the kernel thread for IRQs gives up the CPU and the user
thread which called write_seqlock() gains the CPU until write_sequnlock()
is called? How can the kernel figure out that executing the user thread
needs higher priority than the kernel thread?
On Thu 22-06-23 08:24:30, Tetsuo Handa wrote:
> On 2023/06/21 23:50, Tetsuo Handa wrote:
> > On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
> >>> Also, if local_irq_save() is hidden due to RT, what guarantees that
> >>>
> >>> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> >>> 	<<IRQ>>
> >>> 	  some_timer_function() {
> >>> 	    printk();
> >>> 	  }
> >>> 	<</IRQ>>
> >>> 	printk_deferred_enter();
> >>>
> >>> does not happen, given that write_seqlock_irqsave() does not disable IRQs?
> >>
> >> I don't see how zonelist_update_seq and printk are connected here
> >> without the port lock or a memory allocation. But there are two things
> >> that are different on RT which probably answer your question:
> >
> > It is explained as the first deadlock scenario in commit 1007843a9190
> > ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> > We have to disable IRQs before making zonelist_update_seq.seqcount odd.
>
> Since we must replace local_irq_save() + write_seqlock() with
> write_seqlock_irqsave() for the CONFIG_PREEMPT_RT=y case, but must not
> replace local_irq_save() + write_seqlock() with write_seqlock_irqsave()
> for the CONFIG_PREEMPT_RT=n case, is the proper fix something like below?

Now I am confused. Why is write_seqlock_irqsave() not allowed for !RT?
Let me quote the changelog and the scenario 1:

	write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
	// e.g. timer interrupt handler runs at this moment
	some_timer_func() {
		kmalloc(GFP_ATOMIC) {
			__alloc_pages_slowpath() {
				read_seqbegin(&zonelist_update_seq) {
					// spins forever because zonelist_update_seq.seqcount is odd
				}
			}
		}
	}
	// e.g. timer interrupt handler finishes
	write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even

This is clearly impossible with write_seqlock_irqsave(), as interrupts
are disabled before the lock is taken.
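Michal's point in code form (illustrative only; the comments restate the scenario above):

	unsigned long flags;

	write_seqlock_irqsave(&zonelist_update_seq, flags); /* IRQs off, then seqcount odd */
	/* no IRQ handler can run on this CPU here, so no GFP_ATOMIC
	 * allocation can spin in read_seqbegin() against this writer */
	write_sequnlock_irqrestore(&zonelist_update_seq, flags); /* seqcount even, then IRQs on */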
On 2023/06/22 16:18, Michal Hocko wrote:
>>> It is explained as the first deadlock scenario in commit 1007843a9190
>>> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
>>> We have to disable IRQs before making zonelist_update_seq.seqcount odd.
>>
>> Since we must replace local_irq_save() + write_seqlock() with
>> write_seqlock_irqsave() for the CONFIG_PREEMPT_RT=y case, but must not
>> replace local_irq_save() + write_seqlock() with write_seqlock_irqsave()
>> for the CONFIG_PREEMPT_RT=n case, is the proper fix something like below?
>
> Now I am confused. Why is write_seqlock_irqsave() not allowed for !RT?
> Let me quote the changelog and the scenario 1:
>
> 	write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
> 	// e.g. timer interrupt handler runs at this moment
> 	some_timer_func() {
> 		kmalloc(GFP_ATOMIC) {
> 			__alloc_pages_slowpath() {
> 				read_seqbegin(&zonelist_update_seq) {
> 					// spins forever because zonelist_update_seq.seqcount is odd
> 				}
> 			}
> 		}
> 	}
> 	// e.g. timer interrupt handler finishes
> 	write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>
> This is clearly impossible with write_seqlock_irqsave(), as interrupts
> are disabled before the lock is taken.

Well, it seems it is "I don't want to replace" rather than "we must not
replace". I reread the thread but I couldn't find why nobody suggested
write_seqlock_irqsave(). The reason I proposed the

	local_irq_save() => printk_deferred_enter() => write_seqlock()

ordering implies a precaution in case write_seqlock() involves printk()
(e.g. lockdep, KCSAN, soft-lockup warning), in addition to the
"local_irq_save() before printk_deferred_enter()" requirement. Maybe
people in that thread were happy with preserving this precaution...

You commented

	There shouldn't be any other locks (apart from hotplug) taken in that path IIRC.

at https://lkml.kernel.org/ZCrYQj+2/uMtqNBm@dhcp22.suse.cz .

If __build_all_zonelists() is already serialized by the hotplug lock, we
don't need to call spin_lock(&zonelist_update_seq.lock) and we will be
able to replace write_seqlock(&zonelist_update_seq) with
write_seqcount_begin(&zonelist_update_seq.seqcount) like
cpuset_change_task_nodemask() does?
On Thu 22-06-23 19:58:33, Tetsuo Handa wrote:
> On 2023/06/22 16:18, Michal Hocko wrote:
> >>> It is explained as the first deadlock scenario in commit 1007843a9190
> >>> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> >>> We have to disable IRQs before making zonelist_update_seq.seqcount odd.
> >>
> >> Since we must replace local_irq_save() + write_seqlock() with
> >> write_seqlock_irqsave() for the CONFIG_PREEMPT_RT=y case, but must not
> >> replace local_irq_save() + write_seqlock() with write_seqlock_irqsave()
> >> for the CONFIG_PREEMPT_RT=n case, is the proper fix something like below?
> >
> > Now I am confused. Why is write_seqlock_irqsave() not allowed for !RT?
> > Let me quote the changelog and the scenario 1:
> >
> > 	write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
> > 	// e.g. timer interrupt handler runs at this moment
> > 	some_timer_func() {
> > 		kmalloc(GFP_ATOMIC) {
> > 			__alloc_pages_slowpath() {
> > 				read_seqbegin(&zonelist_update_seq) {
> > 					// spins forever because zonelist_update_seq.seqcount is odd
> > 				}
> > 			}
> > 		}
> > 	}
> > 	// e.g. timer interrupt handler finishes
> > 	write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
> >
> > This is clearly impossible with write_seqlock_irqsave(), as interrupts
> > are disabled before the lock is taken.
>
> Well, it seems it is "I don't want to replace" rather than "we must not
> replace".

OK, so this is an alternative fix rather than the proposed fix being
incorrect.

> I reread the thread but I couldn't find why nobody suggested write_seqlock_irqsave().
> The reason I proposed the
>
> 	local_irq_save() => printk_deferred_enter() => write_seqlock()
>
> ordering implies a precaution in case write_seqlock() involves printk()
> (e.g. lockdep, KCSAN, soft-lockup warning), in addition to the
> "local_irq_save() before printk_deferred_enter()" requirement. Maybe
> people in that thread were happy with preserving this precaution...

Precaution is a fair argument. I am not sure it is the strongest one to
justify the ugly RT special-casing though. I would propose to go with
Sebastian's patch as a clear fix, and if you really care about the
precaution then make sure you describe the potential problems.

> You commented
>
> 	There shouldn't be any other locks (apart from hotplug) taken in that path IIRC.
>
> at https://lkml.kernel.org/ZCrYQj+2/uMtqNBm@dhcp22.suse.cz .
>
> If __build_all_zonelists() is already serialized by the hotplug lock, we
> don't need to call spin_lock(&zonelist_update_seq.lock) and we will be
> able to replace write_seqlock(&zonelist_update_seq) with
> write_seqcount_begin(&zonelist_update_seq.seqcount) like
> cpuset_change_task_nodemask() does?

Maybe, I haven't really dived into this deeper. One way or the other,
RT requires special IRQ handling along with the seq lock, no?
On 2023/06/22 8:24, Tetsuo Handa wrote:
> By the way, given the
>
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> 	<<IRQ>>
> 	  some_timer_function() {
> 	    kmalloc(GFP_ATOMIC);
> 	  }
> 	<</IRQ>>
> 	printk_deferred_enter();
>
> scenario: if the CONFIG_PREEMPT_RT=y case is handled by executing
> some_timer_function() on a dedicated kernel thread for IRQs, what
> guarantees that the kernel thread for IRQs gives up the CPU and the user
> thread which called write_seqlock() gains the CPU until write_sequnlock()
> is called? How can the kernel figure out that executing the user thread
> needs higher priority than the kernel thread?

I haven't got a response to this question.

Several years ago, I demonstrated that a SCHED_IDLE priority userspace
thread holding oom_lock causes other concurrently allocating !SCHED_IDLE
priority threads to misunderstand that mutex_trylock(&oom_lock) failure
implies we are making forward progress (despite the SCHED_IDLE priority
userspace thread being unable to wake up for minutes).

If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is
preempted by some other !SCHED_IDLE priority threads (especially realtime
priority threads), and such a !SCHED_IDLE priority thread calls
kmalloc(GFP_ATOMIC) or printk(), a similar thing can happen
(misunderstanding that spinning on read_seqbegin() from
zonelist_iter_begin() can make forward progress despite the thread which
called write_seqlock_irqsave() being unable to make progress due to
preemption).

Question to Sebastian:
To make sure that such a thing cannot happen, we should make sure that
a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount)
from write_seqlock_irqsave(&zonelist_update_seq, flags) can continue using
the CPU until write_seqcount_end(&zonelist_update_seq.seqcount) from
write_sequnlock_irqrestore(&zonelist_update_seq, flags).
Does adding preempt_disable() before write_seqlock(&zonelist_update_seq, flags) help?

Question to Petr:
Even if local_irq_save(flags) disables IRQs, NMI context can enqueue a
message via printk(). When does a message enqueued from NMI context get
printed? Is there a possibility that a message enqueued from NMI context
gets printed between
"write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
between "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
If yes, we can't increment zonelist_update_seq.seqcount before
printk_deferred_enter()...
On Thu 2023-06-22 22:36:27, Tetsuo Handa wrote:
> On 2023/06/22 8:24, Tetsuo Handa wrote:
> > By the way, given the
> >
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > 	<<IRQ>>
> > 	  some_timer_function() {
> > 	    kmalloc(GFP_ATOMIC);
> > 	  }
> > 	<</IRQ>>
> > 	printk_deferred_enter();
> >
> > scenario: if the CONFIG_PREEMPT_RT=y case is handled by executing
> > some_timer_function() on a dedicated kernel thread for IRQs, what
> > guarantees that the kernel thread for IRQs gives up the CPU and the user
> > thread which called write_seqlock() gains the CPU until write_sequnlock()
> > is called? How can the kernel figure out that executing the user thread
> > needs higher priority than the kernel thread?
>
> I haven't got a response to this question.
>
> Several years ago, I demonstrated that a SCHED_IDLE priority userspace
> thread holding oom_lock causes other concurrently allocating !SCHED_IDLE
> priority threads to misunderstand that mutex_trylock(&oom_lock) failure
> implies we are making forward progress (despite the SCHED_IDLE priority
> userspace thread being unable to wake up for minutes).
>
> If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is
> preempted by some other !SCHED_IDLE priority threads (especially realtime
> priority threads), and such a !SCHED_IDLE priority thread calls
> kmalloc(GFP_ATOMIC) or printk(), a similar thing can happen
> (misunderstanding that spinning on read_seqbegin() from
> zonelist_iter_begin() can make forward progress despite the thread which
> called write_seqlock_irqsave() being unable to make progress due to
> preemption).
>
> Question to Sebastian:
> To make sure that such a thing cannot happen, we should make sure that
> a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount)
> from write_seqlock_irqsave(&zonelist_update_seq, flags) can continue using
> the CPU until write_seqcount_end(&zonelist_update_seq.seqcount) from
> write_sequnlock_irqrestore(&zonelist_update_seq, flags).
> Does adding preempt_disable() before write_seqlock(&zonelist_update_seq, flags) help?
>
> Question to Petr:
> Even if local_irq_save(flags) disables IRQs, NMI context can enqueue a
> message via printk(). When does a message enqueued from NMI context get
> printed?

They are flushed to the console either by irq_work or by another
printk(). The irq_work cannot proceed when IRQs are disabled, but
another non-deferred printk() would try to flush them immediately.

> Is there a possibility
> that a message enqueued from NMI context gets printed between
> "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
> between "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
> If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...

It might happen when a printk() is called in these holes.

Best Regards,
Petr
On 2023/06/22 23:11, Petr Mladek wrote:
>> Question to Petr:
>> Even if local_irq_save(flags) disables IRQs, NMI context can enqueue a
>> message via printk(). When does a message enqueued from NMI context get
>> printed?
>
> They are flushed to the console either by irq_work or by another
> printk(). The irq_work cannot proceed when IRQs are disabled,

Is that rule the same for both CONFIG_PREEMPT_RT=y and CONFIG_PREEMPT_RT=n?
If yes, then when Sebastian's patch is applied, IRQs will not be disabled
for CONFIG_PREEMPT_RT=y kernels, because write_seqlock_irqsave() is
equivalent to write_seqlock() there (spin_lock_irqsave(lock, flags) is
equivalent to spin_lock(lock)), and

> but another non-deferred printk() would try to flush them immediately.
>
>> Is there a possibility
>> that a message enqueued from NMI context gets printed between
>> "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
>> between "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
>> If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...
>
> It might happen when a printk() is called in these holes.

printk() can happen in these holes on CONFIG_PREEMPT_RT=y kernels. We
will need to call printk_deferred_enter() before incrementing
zonelist_update_seq.seqcount in order to close these holes.
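A minimal sketch of the call-site ordering Tetsuo argues for here, assuming printk_deferred_enter() is first made to pin the task to a CPU by itself (which the following messages discuss):

	printk_deferred_enter();	/* defer console printing first */
	write_seqlock_irqsave(&zonelist_update_seq, flags); /* only now seqcount goes odd */

	/* ... rebuild the zonelists ... */

	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
	printk_deferred_exit();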
On Thu 2023-06-22 16:11:41, Petr Mladek wrote:
> On Thu 2023-06-22 22:36:27, Tetsuo Handa wrote:
> > On 2023/06/22 8:24, Tetsuo Handa wrote:
> > > By the way, given the
> > >
> > > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > > 	<<IRQ>>
> > > 	  some_timer_function() {
> > > 	    kmalloc(GFP_ATOMIC);
> > > 	  }
> > > 	<</IRQ>>
> > > 	printk_deferred_enter();
> > >
> > > scenario: if the CONFIG_PREEMPT_RT=y case is handled by executing
> > > some_timer_function() on a dedicated kernel thread for IRQs, what
> > > guarantees that the kernel thread for IRQs gives up the CPU and the user
> > > thread which called write_seqlock() gains the CPU until write_sequnlock()
> > > is called? How can the kernel figure out that executing the user thread
> > > needs higher priority than the kernel thread?

My understanding is that this is achieved by
spin_lock_irqsave(&sl->lock, flags). When RT is enabled, rt_spin_lock(lock)
is used.

AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
The owner can schedule. The waiter can schedule as well, so that
they can be running on the same CPU. Also, the current owner gets
a higher priority when there is a waiter with a higher priority, to
avoid priority inversion.

> > I haven't got a response to this question.
> >
> > Several years ago, I demonstrated that a SCHED_IDLE priority userspace
> > thread holding oom_lock causes other concurrently allocating !SCHED_IDLE
> > priority threads to misunderstand that mutex_trylock(&oom_lock) failure
> > implies we are making forward progress (despite the SCHED_IDLE priority
> > userspace thread being unable to wake up for minutes).
> >
> > If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is
> > preempted by some other !SCHED_IDLE priority threads (especially realtime
> > priority threads), and such a !SCHED_IDLE priority thread calls
> > kmalloc(GFP_ATOMIC) or printk(), a similar thing can happen
> > (misunderstanding that spinning on read_seqbegin() from
> > zonelist_iter_begin() can make forward progress despite the thread which
> > called write_seqlock_irqsave() being unable to make progress due to
> > preemption).
> >
> > Question to Sebastian:
> > To make sure that such a thing cannot happen, we should make sure that
> > a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount)
> > from write_seqlock_irqsave(&zonelist_update_seq, flags) can continue using
> > the CPU until write_seqcount_end(&zonelist_update_seq.seqcount) from
> > write_sequnlock_irqrestore(&zonelist_update_seq, flags).
> > Does adding preempt_disable() before write_seqlock(&zonelist_update_seq, flags) help?
> >
> > Question to Petr:
> > Even if local_irq_save(flags) disables IRQs, NMI context can enqueue a
> > message via printk(). When does a message enqueued from NMI context get
> > printed?
>
> They are flushed to the console either by irq_work or by another
> printk(). The irq_work cannot proceed when IRQs are disabled, but
> another non-deferred printk() would try to flush them immediately.
>
> > Is there a possibility
> > that a message enqueued from NMI context gets printed between
> > "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
> > between "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
> > If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...
>
> It might happen when a printk() is called in these holes.

I believe that this hole is the only remaining problem.

IMHO, the only solution is to disable migration/preemption in
printk_deferred_enter(). It is not a big deal, really. __rt_spin_lock()
would call migrate_disable() anyway. And a nested migrate_disable() is
fast: it just increments a counter when it is not 0.

I would suggest doing this change in printk_deferred_enter() first. It
will allow calling it before write_seqlock_irqsave(), and we will not
need to change the ordering back and forth.

The result would look like:

in include/linux/printk.h:

static inline void printk_deferred_enter(void)
{
#ifndef CONFIG_PREEMPT_RT
	preempt_disable();
#else
	migrate_disable();
#endif
	__printk_safe_enter();
}

in mm/page_alloc.c:

	printk_deferred_enter();
	write_seqlock_irqsave(&zonelist_update_seq, flags);

Best Regards,
Petr
On 2023/06/23 0:04, Petr Mladek wrote:
> AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
> The owner can schedule. The waiter can schedule as well, so that
> they can be running on the same CPU. Also, the current owner gets
> a higher priority when there is a waiter with a higher priority, to
> avoid priority inversion.

Excuse me, but that is about multiple threads trying to acquire the same
lock, isn't it? Our case is that the one thread which makes
zonelist_update_seq.seqcount odd acquires zonelist_update_seq.lock, but
the threads spinning at read_seqbegin(zonelist_update_seq.seqcount) from
zonelist_iter_begin() do nothing but a cpu_relax() busy loop. There is no
way to teach these threads that they need to give the CPU to the thread
which made zonelist_update_seq.seqcount odd...

> The result would look like:
>
> in include/linux/printk.h:
>
> static inline void printk_deferred_enter(void)
> {
> #ifndef CONFIG_PREEMPT_RT
> 	preempt_disable();
> #else
> 	migrate_disable();
> #endif
> 	__printk_safe_enter();
> }
>
> in mm/page_alloc.c:
>
> 	printk_deferred_enter();
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);

OK. But for stable,

+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		migrate_disable();
 	/*
 	 * Explicitly disable this CPU's interrupts before taking seqlock
 	 * to prevent any IRQ handler from calling into the page allocator
 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
 	 */
 	local_irq_save(flags);
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
 	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);

will be easier to apply.
On 2023-06-21 23:50:02 [+0900], Tetsuo Handa wrote:
> On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
> >> Also, if local_irq_save() is hidden due to RT, what guarantees that
> >>
> >> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> >> 	<<IRQ>>
> >> 	  some_timer_function() {
> >> 	    printk();
> >> 	  }
> >> 	<</IRQ>>
> >> 	printk_deferred_enter();
> >>
> >> does not happen, given that write_seqlock_irqsave() does not disable IRQs?
> >
> > I don't see how zonelist_update_seq and printk are connected here
> > without the port lock or a memory allocation. But there are two things
> > that are different on RT which probably answer your question:
>
> It is explained as the first deadlock scenario in commit 1007843a9190
> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> We have to disable IRQs before making zonelist_update_seq.seqcount odd.

You did explain the relation zonelist_update_seq -> kmalloc(, GFP_ATOMIC)
and tty_insert_flip_string_and_push_buffer() -> kmalloc(, GFP_ATOMIC)
with printk() on another CPU. As far as PREEMPT_RT goes, it is not an
issue because the port lock is not used, or printing is deferred.

Sebastian
On 2023-06-21 17:38:03 [+0200], Petr Mladek wrote:
> If I get it correctly, RT solves both possible deadlocks by offloading
> the nested operation into a kthread (irq and printk threads).
> Plus, printk uses emergency_write() when the kthread is not usable.

Correct.

> If this is true, then printk_safe_enter() might be a nop on RT.

It is gone, actually.

> All possible deadlocks are prevented either by the kthread or
> the console->emergency_write() call.

Correct.

> But wait. AFAIK, the printk kthread is implemented only for the serial
> console. So, it won't be safe for other consoles, especially
> the problematic tty_insert_flip_string_and_push_buffer() call.

If the special printing is not implemented, then there is only printing
via the kthread, meaning no emergency printing when it is needed. Once
the interface is upstream, more drivers can be added, including GPU-based
ones.

> Note that adding a printk thread for the graphical consoles will
> be a real challenge. We have hard times even with the basic
> UART 8250. There are still races possible in the implementation
> in the RT tree...

John has been mumbling something about those but did not send anything.

> OK, what about using migrate_disable() in printk_deferred_enter()?
> Something like:
>
> /*
>  * The printk_deferred_enter/exit macros are available only as a hack.
>  * They define a per-CPU context where all printk console printing
>  * is deferred because it might cause a deadlock otherwise.
>  *
>  * It is highly recommended to use them only in a context with interrupts
>  * disabled. Otherwise, other unrelated printk() calls might be deferred
>  * when they interrupt/preempt the deferred code section.
>  *
>  * They should not be used to defer printing of many messages. It might
>  * create a softlockup when they are flushed later.
>  *
>  * IMPORTANT: Any new use of these MUST be consulted with printk maintainers.
>  * It might have unexpected side effects on the printk infrastructure.
>  */
> #ifdef CONFIG_PREEMPT_RT
>
> #define printk_deferred_enter()			\
> 	do {					\
> 		migrate_disable();		\
> 		__printk_safe_enter();		\
> 	} while (0)
>
> #define printk_deferred_exit()			\
> 	do {					\
> 		__printk_safe_exit();		\
> 		migrate_enable();		\
> 	} while (0)
>
> #else /* CONFIG_PREEMPT_RT */
>
> #define printk_deferred_enter()			\
> 	do {					\
> 		preempt_disable();		\
> 		__printk_safe_enter();		\
> 	} while (0)
>
> #define printk_deferred_exit()			\
> 	do {					\
> 		__printk_safe_exit();		\
> 		preempt_enable();		\
> 	} while (0)
>
> #endif /* CONFIG_PREEMPT_RT */
>
> Note that I have used preempt_disable() on non-RT because it is much
> cheaper. And IRQs will be disabled anyway on non-RT systems
> in this code path.

It would do, _but_ is it really needed? All users but one (the one in
__build_all_zonelists()) have interrupts already disabled. This isn't a
hot path and is not used very commonly. Does it justify the ifdef?

An unconditional migrate_disable() would do the job if you want it to be
safe, but I would rather suggest a lockdep annotation instead of
disabling either preemption or migration. If I read the comment on top
correctly, this API isn't open for the public anyway. A global variable
would probably be simpler, but I guess you want to have direct printing
on the other CPUs.

To be honest, I would suggest working towards always printing in a
thread, with an emergency console, rather than this deferred/safe duct
tape. Once tty_port::lock is acquired, every possible printk output, say
a BUG, KASAN, … is a possible death trap. This limitation here wasn't
obvious and syzbot had to figure it out. There might be other side
effects.

> > > Disabling IRQs before incrementing zonelist_update_seq is _required_ for both
> > >
> > >   making printk_deferred_enter() safe
> > >
> > > and
> > >
> > >   making sure that printk_deferred_enter() takes effect
> > >
> > > .
>
> > Did I explain why it is sufficient to do
> >
> > 	write_seqlock_irqsave()
> > 	printk_deferred_enter()
> >
> > assuming we have
> >
> > | static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
> > | {
> > | 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
> > | 	do_raw_write_seqcount_begin(s);
> > | }
>
> Will this prevent any printk() called on the same CPU before
> printk_deferred_enter() is called?

Yes. There is the _irqsave() which disables interrupts at the very
beginning. Then we have here (seqcount_acquire()) the last possible
origin of a lockdep splat. It ends with do_raw_write_seqcount_begin()
which increments the counter (the "locking"), and this is followed by
printk_deferred_enter().

> Best Regards,
> Petr

Sebastian
On Fri 23-06-23 10:12:50, Sebastian Andrzej Siewior wrote: [...] > It would do _but_ is it really needed? All users but one (the one in > __build_all_zonelists()) have interrupts already disabled. This isn't a > hot path and is not used very common. Does it justify the ifdef? This is not my call of course but let me just stress out that if all that is just because of __build_all_zonelists then it is likely not really worth it. We are talking about memory hoptlug here. I seriously doubt that this is something anybody would be using with RT expectations.
On 2023-06-22 22:36:27 [+0900], Tetsuo Handa wrote:
> On 2023/06/22 8:24, Tetsuo Handa wrote:
> > By the way, given
> >
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > 	<<IRQ>>
> > 	some_timer_function() {
> > 	  kmalloc(GFP_ATOMIC);
> > 	}
> > 	<</IRQ>>
> > 	printk_deferred_enter();
> >
> > scenario in CONFIG_PREEMPT_RT=y case is handled by executing some_timer_function()
> > on a dedicated kernel thread for IRQs, what guarantees that the kernel thread for
> > IRQs gives up the CPU and the user thread which called write_seqlock() gains the CPU
> > until write_sequnlock() is called? How can the kernel figure out that executing the
> > user thread needs higher priority than the kernel thread?
>
> I haven't got a response on this question.

I did explain in 20230621143421.BgHjJklo@linutronix.de that the printk
implementation is a different one on PREEMPT_RT. The reader _may_ boost
the writer if needed. Also for !PREEMPT_RT your only concern was a
lockdep splat within write_seqlock_irqsave() which I wanted to take
care of. Leaving that aside, I don't see any problem.

> Several years ago, I demonstrated that a SCHED_IDLE priority userspace thread holding
> oom_lock causes other concurrently allocating !SCHED_IDLE priority threads to
> misunderstand that mutex_trylock(&oom_lock) failure implies we are making forward
> progress (despite the SCHED_IDLE priority userspace thread being unable to wake up
> for minutes).

Repeated trylock without explicit forward progress is a general problem
on RT. We try to remove them where we find them.

> If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is preempted by
> some other !SCHED_IDLE priority threads (especially realtime priority threads), and
> such a !SCHED_IDLE priority thread calls kmalloc(GFP_ATOMIC) or printk(), a similar
> thing (misunderstanding that spinning on read_seqbegin() from zonelist_iter_begin()
> can make forward progress despite the thread which called write_seqlock_irqsave()
> being unable to make progress due to preemption) can happen.

It can't, because on PREEMPT_RT the read_seqbegin() _will_ block on the
lock that is held by write_seqlock_irqsave(). This ensures that the
writer will make progress and the reader does not loop several
iterations like on !PREEMPT_RT. This is PREEMPT_RT and happens
regardless of the priority of the tasks involved.

> Question to Sebastian:
> To make sure that such a thing cannot happen, we should make sure that
> a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount) from
> write_seqlock_irqsave(&zonelist_update_seq, flags) can continue using the CPU until
> write_seqcount_end(&zonelist_update_seq.seqcount) from
> write_seqlock_irqrestore(&zonelist_update_seq, flags).
> Does adding preempt_disable() before write_seqlock(&zonelist_update_seq, flags) help?

As explained before, this scenario does not happen and is already
accounted for by the underlying seqcount API. Adding preempt_disable()
to the mix makes things worse.

> Question to Petr:
> Even if local_irq_save(flags) disables IRQ, NMI context can enqueue a message via
> printk(). When does the message enqueued from NMI context get printed? Is there a
> possibility that the message enqueued from NMI context gets printed between
> "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
> "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
> If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...

There are no prints from NMI because you can't acquire a lock in NMI
context. All printk invocations are delayed.

Sebastian
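For reference, the redirection of NMI printk() to the deferred path can
be seen in kernel/printk/printk_safe.c; this is a condensed excerpt
(the kdb special case is elided):

	asmlinkage int vprintk(const char *fmt, va_list args)
	{
		/*
		 * Use the main logbuf even in NMI. But avoid calling console
		 * drivers that might have their own locks.
		 */
		if (this_cpu_read(printk_context) || in_nmi())
			return vprintk_deferred(fmt, args);

		/* No obstacles. */
		return vprintk_default(fmt, args);
	}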
On 2023-06-22 23:28:25 [+0900], Tetsuo Handa wrote:
> On 2023/06/22 23:11, Petr Mladek wrote:
> >> Question to Petr:
> >> Even if local_irq_save(flags) disables IRQ, NMI context can enqueue a message via
> >> printk(). When does the message enqueued from NMI context get printed?
> >
> > They are flushed to the console either by irq_work or by another
> > printk(). The irq_work could not be processed when IRQs are disabled.
>
> Is that rule the same for both CONFIG_PREEMPT_RT=y and CONFIG_PREEMPT_RT=n?
>
> If yes, when Sebastian's patch is applied, IRQs will not be disabled for
> CONFIG_PREEMPT_RT=y kernels because write_seqlock_irqsave() is equivalent
> to write_seqlock(), because spin_lock_irqsave(lock, flags) is equivalent
> to spin_lock(lock), and

So why do you accept that spin_lock_irqsave() does not disable
interrupts on PREEMPT_RT but not that printk() is different and has
dedicated printing threads?

Regarding printk:
- a dedicated printing thread is used.
- in emergency cases a different driver is used which does not rely on
  a lock held by the tty.

> > But another non-deferred printk() would try to flush them immediately.
> >
> >> Is there a possibility
> >> that the message enqueued from NMI context gets printed between
> >> "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
> >> "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
> >
> > It might happen when a printk() is called in these holes.
>
> printk() can happen between these holes for CONFIG_PREEMPT_RT=y kernels.
> We will need to call printk_deferred_enter() before incrementing
> zonelist_update_seq.seqcount in order to close these holes.

The only hole is the possible lockdep splat which I'm going to change…

Sebastian
On 2023-06-23 00:43:00 [+0900], Tetsuo Handa wrote:
> On 2023/06/23 0:04, Petr Mladek wrote:
> > AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
> > The owner could schedule. The waiter could schedule as well so that
> > they could be running on the same CPU. Also the current owner gets
> > higher priority when there is a waiter with a higher priority to
> > avoid the priority inversion.
>
> Excuse me, but that is about multiple threads trying to acquire the same lock, isn't it?
>
> Our case is that one thread which makes zonelist_update_seq.seqcount odd acquires
> zonelist_update_seq.lock but threads spinning at
> read_seqbegin(zonelist_update_seq.seqcount) from zonelist_iter_begin() do nothing but
> a cpu_relax() busy loop. There is no way to teach these threads that they need to give
> the CPU to the thread which made zonelist_update_seq.seqcount odd...

For !RT there is no spinning because interrupts are disabled.
For RT there is no spinning because the reader blocks on the lock owned
by the writer.

> > The result would look like:
> >
> > in include/linux/printk.h:
> >
> > static inline void printk_deferred_enter(void)
> > {
> > 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > 		preempt_disable();
> > 	else
> > 		migrate_disable();
> >
> > 	__printk_safe_enter();
> > }
> >
> > in mm/page_alloc.c
> >
> > 	printk_deferred_enter();
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
>
> OK. But for stable,
>
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		migrate_disable();
> 	/*
> 	 * Explicitly disable this CPU's interrupts before taking seqlock
> 	 * to prevent any IRQ handler from calling into the page allocator
> 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> 	 */
> 	local_irq_save(flags);
> 	/*
> 	 * Explicitly disable this CPU's synchronous printk() before taking
> 	 * seqlock to prevent any printk() from trying to hold port->lock, for
> 	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
> 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> 	 */
> 	printk_deferred_enter();
> 	write_seqlock(&zonelist_update_seq);
>
> will be easier to apply.

I would prefer not to worry about stable-RT but about upstream, and
then we backport what is needed into the stable-RT trees once we have
settled on something. This does not affect !RT.

Sebastian
On 2023/06/23 18:45, Sebastian Andrzej Siewior wrote:
> On 2023-06-23 00:43:00 [+0900], Tetsuo Handa wrote:
>> On 2023/06/23 0:04, Petr Mladek wrote:
>>> AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
>>> The owner could schedule. The waiter could schedule as well so that
>>> they could be running on the same CPU. Also the current owner gets
>>> higher priority when there is a waiter with a higher priority to
>>> avoid the priority inversion.
>>
>> Excuse me, but that is about multiple threads trying to acquire the same lock, isn't it?
>>
>> Our case is that one thread which makes zonelist_update_seq.seqcount odd acquires
>> zonelist_update_seq.lock but threads spinning at
>> read_seqbegin(zonelist_update_seq.seqcount) from zonelist_iter_begin() do nothing but
>> a cpu_relax() busy loop. There is no way to teach these threads that they need to give
>> the CPU to the thread which made zonelist_update_seq.seqcount odd...
>
> For !RT there is no spinning because interrupts are disabled.
> For RT there is no spinning because the reader blocks on the lock owned
> by the writer.

My understanding is that zonelist_iter_begin() accesses only
zonelist_update_seq.seqcount. From where is zonelist_update_seq.lock
accessed when zonelist_iter_begin() is called?
On 2023-06-23 11:21:01 [+0200], Michal Hocko wrote:
> On Fri 23-06-23 10:12:50, Sebastian Andrzej Siewior wrote:
> [...]
> > It would do _but_ is it really needed? All users but one (the one in
> > __build_all_zonelists()) have interrupts already disabled. This isn't a
> > hot path and is not used very commonly. Does it justify the ifdef?
>
> This is not my call of course but let me just stress that if all that
> is just because of __build_all_zonelists then it is likely not really
> worth it. We are talking about memory hotplug here. I seriously doubt
> that this is something anybody would be using with RT expectations.

My current plan is to repost this with a better explanation and with
printk deferred within the locked region. But first I'm going to use
hotplug-mem to see how it works. If it can be enabled on RT then it
should work even if nobody is using it. Is someone objecting, and if
so, why?

Sebastian
On 2023-06-23 18:51:24 [+0900], Tetsuo Handa wrote:
> My understanding is that zonelist_iter_begin() accesses only
> zonelist_update_seq.seqcount. From where is zonelist_update_seq.lock
> accessed when zonelist_iter_begin() is called?

zonelist_iter_begin()
-> read_seqbegin()
   -> read_seqcount_begin()
      -> raw_read_seqcount_begin()
         -> __read_seqcount_begin()

Now. The inner part looks like

	while ((__seq = seqprop_sequence(s)) & 1)
		cpu_relax();

but with RT enabled (in-tree, not out-of-tree, in-tree) seqprop_sequence()
goes from:

| __seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s)	\
| {									\
| 	unsigned seq = READ_ONCE(s->seqcount.sequence);			\
| 									\
| 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
| 		return seq;						\
| 									\
| 	if (preemptible && unlikely(seq & 1)) {				\
| 		__SEQ_LOCK(lockbase##_lock(s->lock));			\
| 		__SEQ_LOCK(lockbase##_unlock(s->lock));			\
| 									\
| 		/*							\
| 		 * Re-read the sequence counter since the (possibly	\
| 		 * preempted) writer made progress.			\
| 		 */							\
| 		seq = READ_ONCE(s->seqcount.sequence);			\
| 	}								\
| 									\
| 	return seq;							\

to

| unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
| {
| 	unsigned seq = READ_ONCE(s->seqcount.sequence);
|
| 	if (unlikely(seq & 1)) {
| 		spin_lock(s->lock);
| 		spin_unlock(s->lock);
| 		seq = READ_ONCE(s->seqcount.sequence);
| 	}
| 	return seq;
| }

tada \o/

Sebastian
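For context, the reader side that ends up in this code follows the
usual seqlock retry pattern; a simplified sketch of how
zonelist_iter_begin()/check_retry_zonelist() use it (not verbatim
kernel code, the page allocator spreads these calls across its slow
path):

	unsigned int seq;

	do {
		/*
		 * On !RT this spins (cpu_relax()) while the sequence is
		 * odd; on RT it blocks on zonelist_update_seq.lock
		 * instead, see __seqprop_spinlock_sequence() above.
		 */
		seq = read_seqbegin(&zonelist_update_seq);

		/* ... walk the zonelists and attempt the allocation ... */

		/* retry if a writer bumped the sequence meanwhile */
	} while (read_seqretry(&zonelist_update_seq, seq));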
On 2023/06/23 19:11, Sebastian Andrzej Siewior wrote:
> | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
> | {
> | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
> |
> | 	if (unlikely(seq & 1)) {
> | 		spin_lock(s->lock);
> | 		spin_unlock(s->lock);
> | 		seq = READ_ONCE(s->seqcount.sequence);
> | 	}
> | 	return seq;
> | }

OK. I understood that read_seqbegin() implies spin_lock()/spin_unlock()
if RT. What a deep macro. Thank you for the explanation.

Well,

	/*
	 * Zonelists may change due to hotplug during allocation. Detect when zonelists
	 * have been rebuilt so allocation retries. Reader side does not lock and
	 * retries the allocation if zonelist changes. Writer side is protected by the
	 * embedded spin_lock.
	 */

is not accurate. Something like below?

	If !RT, reader side does not lock and retries the allocation if zonelist changes.
	If RT, reader side grabs and releases the embedded spin_lock in order to wait
	for zonelist change operations to complete.

Hmm, I feel worried that kmalloc(GFP_ATOMIC) from hard IRQ context
might sleep if RT...
On Fri 2023-06-23 10:12:50, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 17:38:03 [+0200], Petr Mladek wrote:
> > But wait. AFAIK, the printk kthread is implemented only for the serial
> > console. So, it won't be safe for other consoles, especially
> > the problematic tty_insert_flip_string_and_push_buffer() call.
>
> > OK, what about using migrate_disable() in printk_deferred_enter()?
> > Something like:
> >
> > /*
> >  * The printk_deferred_enter/exit macros are available only as a hack.
> >  * They define a per-CPU context where all printk console printing
> >  * is deferred because it might cause a deadlock otherwise.
> >  *
> >  * It is highly recommended to use them only in a context with interrupts
> >  * disabled. Otherwise, other unrelated printk() calls might be deferred
> >  * when they interrupt/preempt the deferred code section.
> >  *
> >  * They should not be used to defer printing of many messages. It might
> >  * create a softlockup when they are flushed later.
> >  *
> >  * IMPORTANT: Any new use of these MUST be consulted with printk maintainers.
> >  * It might have unexpected side effects on the printk infrastructure.
> >  */
> > #ifdef CONFIG_PREEMPT_RT
> >
> > #define printk_deferred_enter()			\
> > 	do {					\
> > 		migrate_disable();		\
> > 		__printk_safe_enter();		\
> > 	} while (0)
> >
> > #define printk_deferred_exit()			\
> > 	do {					\
> > 		__printk_safe_exit();		\
> > 		migrate_enable();		\
> > 	} while (0)
> >
> > #else /* CONFIG_PREEMPT_RT */
> >
> > #define printk_deferred_enter()			\
> > 	do {					\
> > 		preempt_disable();		\
> > 		__printk_safe_enter();		\
> > 	} while (0)
> >
> > #define printk_deferred_exit()			\
> > 	do {					\
> > 		__printk_safe_exit();		\
> > 		preempt_enable();		\
> > 	} while (0)
> >
> > #endif /* CONFIG_PREEMPT_RT */
> >
> > Note that I have used preempt_disable() on non-RT because it is much
> > cheaper. And IRQs will be disabled anyway on non-RT system
> > in this code path.
>
> It would do _but_ is it really needed? All users but one (the one in
> __build_all_zonelists()) have interrupts already disabled. This isn't a
> hot path and is not used very commonly. Does it justify the ifdef?
>
> An unconditional migrate_disable() would do the job if you want it to
> be safe

Makes sense. The ifdef does not look worth it.

> but I would rather suggest a lockdep annotation instead of disabling
> either preemption or migration. If I read the comment on top right, this
> API isn't open for the public.

It might work when mm people agree to explicitly call
migrate_disable() in __build_all_zonelists(). I mean to call there:

	migrate_disable();
	printk_deferred_enter();
	write_seqlock_irqsave(&zonelist_update_seq, flags);

Plus it would require some comments.

For me, it is acceptable to hide at least the migrate_disable() part
in printk_deferred_enter().

> The global variable would probably be simpler but I guess you want
> to have direct printing on other CPUs.

Yes, the scope should be as limited as possible.

> To be honest, I would suggest to work towards always printing in a
> thread with an emergency console than this deferred/safe duct tape.

I have used this argument for many years. But it has not been very
convincing after 10 years. Note that John sent the 1st RFC 4 years ago
and we still do not have the kthreads :-/

Best Regards,
Petr
On Fri 23-06-23 11:58:04, Sebastian Andrzej Siewior wrote:
> On 2023-06-23 11:21:01 [+0200], Michal Hocko wrote:
> > On Fri 23-06-23 10:12:50, Sebastian Andrzej Siewior wrote:
> > [...]
> > > It would do _but_ is it really needed? All users but one (the one in
> > > __build_all_zonelists()) have interrupts already disabled. This isn't a
> > > hot path and is not used very commonly. Does it justify the ifdef?
> >
> > This is not my call of course but let me just stress that if all that
> > is just because of __build_all_zonelists then it is likely not really
> > worth it. We are talking about memory hotplug here. I seriously doubt
> > that this is something anybody would be using with RT expectations.
>
> My current plan is to repost this with a better explanation and with
> printk deferred within the locked region. But first I'm going to use
> hotplug-mem to see how it works. If it can be enabled on RT then it
> should work even if nobody is using it.

Agreed! I merely wanted to point out that this needs a basic fix much
more than trying to overengineer it.
On 2023-06-23 11:58:05 [+0200], To Michal Hocko wrote:
> But first I'm going to use
> hotplug-mem to see how it works.

I boot qemu with "-m 4G,slots=4,maxmem=16G".

| ~# free -h
|                total        used        free      shared  buff/cache   available
| Mem:           3,8Gi       221Mi       3,6Gi       520Ki        80Mi       3,6Gi
| Swap:             0B          0B          0B

Telling qemu to add memory:

| object_add memory-backend-ram,id=mem1,size=4G
| device_add pc-dimm,id=dimm1,memdev=mem1

I see in Linux:

| ~# free -h
|                total        used        free      shared  buff/cache   available
| Mem:           7,8Gi       345Mi       7,5Gi       516Ki        80Mi       7,5Gi
| Swap:             0B          0B          0B

memory arrived. Nothing in dmesg, and my printk in
__build_all_zonelists() did not trigger either. Same after I removed
the memory with:

| device_del dimm1
| object_del mem1

it is gone:

| ~# free -h
|                total        used        free      shared  buff/cache   available
| Mem:           3,8Gi       232Mi       3,6Gi       516Ki        80Mi       3,6Gi
| Swap:             0B          0B          0B

nothing in dmesg. What did I do wrong not to enter
__build_all_zonelists()?

Sebastian
On 2023-06-23 12:45:31 [+0200], To Michal Hocko wrote:
>
> nothing in dmesg. What did I do wrong not to enter
> __build_all_zonelists()?

However, looking at mem_hotplug_begin(), it gets triggered from
acpi_hotplug_work_fn() which is completely fine. No lockdep complaints.
So it looks good.

Sebastian
On Fri 2023-06-23 12:11:11, Sebastian Andrzej Siewior wrote:
> On 2023-06-23 18:51:24 [+0900], Tetsuo Handa wrote:
> > My understanding is that zonelist_iter_begin() accesses only
> > zonelist_update_seq.seqcount. From where is zonelist_update_seq.lock
> > accessed when zonelist_iter_begin() is called?
>
> zonelist_iter_begin()
> -> read_seqbegin()
>    -> read_seqcount_begin()
>       -> raw_read_seqcount_begin()
>          -> __read_seqcount_begin()
>
> Now. The inner part looks like
>
> 	while ((__seq = seqprop_sequence(s)) & 1)
> 		cpu_relax();
>
> but with RT enabled (in-tree, not out-of-tree, in-tree) seqprop_sequence()
> goes from:

BTW: I see a possible race in the below code.

> | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
> | {
> | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
> |
> | 	if (unlikely(seq & 1)) {
> | 		spin_lock(s->lock);
> | 		spin_unlock(s->lock);

This guarantees the reader waits for the writer. But does this
guarantee that another writer could not bump the sequence
before we read it below?

> | 		seq = READ_ONCE(s->seqcount.sequence);

IMHO, it should be:

	if (unlikely(seq & 1)) {
		spin_lock(s->lock);
		seq = READ_ONCE(s->seqcount.sequence);
		spin_unlock(s->lock);

IMHO, only this would guarantee that the returned seq couldn't be odd.

Best Regards,
Petr

> | 	}
> | 	return seq;
> | }
On 2023/06/23 19:53, Petr Mladek wrote:
> BTW: I see a possible race in the below code.
>
>> | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
>> | {
>> | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
>> |
>> | 	if (unlikely(seq & 1)) {
>> | 		spin_lock(s->lock);
>> | 		spin_unlock(s->lock);
>
> This guarantees the reader waits for the writer. But does this
> guarantee that another writer could not bump the sequence
> before we read it below?
>
>> | 		seq = READ_ONCE(s->seqcount.sequence);
>
> IMHO, it should be:
>
> 	if (unlikely(seq & 1)) {
> 		spin_lock(s->lock);
> 		seq = READ_ONCE(s->seqcount.sequence);
> 		spin_unlock(s->lock);
>
> IMHO, only this would guarantee that the returned seq couldn't be odd.

What is wrong with bumping the sequence again? If the returned seq is
odd, the caller just retries as needed. For example,
__read_seqcount_begin() will retry at

	while ((__seq = seqprop_sequence(s)) & 1)
		cpu_relax();

.

Lack of a data_race() annotation resulting in KCSAN messages might be
an unexpected source of printk(). :-)
On Fri 23-06-23 12:45:30, Sebastian Andrzej Siewior wrote:
[...]
> nothing in dmesg. What did I do wrong not to enter
> __build_all_zonelists()?

You would need to online memory into an unpopulated node/zone. E.g. by
booting a NUMA system with 2 numa nodes, one of them empty, and
onlining into that node.
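One possible qemu invocation for such a setup (an untested sketch; the
option spelling follows qemu's -numa/-object documentation, and
mem1/dimm1 reuse the names from the earlier test):

	qemu-system-x86_64 -m 4G,slots=4,maxmem=16G -smp 4 \
		-object memory-backend-ram,id=m0,size=4G \
		-numa node,nodeid=0,memdev=m0,cpus=0-3 \
		-numa node,nodeid=1

Then hot-add into the empty node from the monitor:

	(qemu) object_add memory-backend-ram,id=mem1,size=4G
	(qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=1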
On 2023-06-23 19:36:55 [+0900], Tetsuo Handa wrote:
> 	/*
> 	 * Zonelists may change due to hotplug during allocation. Detect when zonelists
> 	 * have been rebuilt so allocation retries. Reader side does not lock and
> 	 * retries the allocation if zonelist changes. Writer side is protected by the
> 	 * embedded spin_lock.
> 	 */
>
> is not accurate. Something like below?
>
> 	If !RT, reader side does not lock and retries the allocation if zonelist changes.
> 	If RT, reader side grabs and releases the embedded spin_lock in order to wait
> 	for zonelist change operations to complete.

I wouldn't sprinkle it around the code. It is implementation specific
for PREEMPT_RT and documentation-wise it would belong in
Documentation/locking/seqlock.rst. I don't think the extra knowledge
benefits the code. If this piece of information is needed I would
suggest including it in the original documentation for seqlock.

> Hmm, I feel worried that kmalloc(GFP_ATOMIC) from hard IRQ context
> might sleep if RT...

Relax and don't worry. PREEMPT_RT does not allow (or have, afaik)
memory allocations from hard IRQ context (or preempt-disable sections).

Sebastian
On Fri 23-06-23 14:44:16, Sebastian Andrzej Siewior wrote:
[...]
> > Hmm, I feel worried that kmalloc(GFP_ATOMIC) from hard IRQ context
> > might sleep if RT...
>
> Relax and don't worry. PREEMPT_RT does not allow (or have, afaik)
> memory allocations from hard IRQ context (or preempt-disable sections).

Yeah, that would require something like GFP_LOCKLESS or changing some
of the locking we have to raw spinlocks.
On 2023-06-23 12:40:24 [+0200], Petr Mladek wrote:
> > but I would rather suggest a lockdep annotation instead of disabling
> > either preemption or migration. If I read the comment on top right, this
> > API isn't open for the public.
>
> It might work when mm people agree to explicitly call
> migrate_disable() in __build_all_zonelists(). I mean to call there:
>
> 	migrate_disable();
> 	printk_deferred_enter();
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
>
> Plus it would require some comments.

Right, but I wanted to move it after write_seqlock_irqsave().

> > To be honest, I would suggest to work towards always printing in a
> > thread with an emergency console than this deferred/safe duct tape.
>
> I have used this argument for many years. But it has not been very
> convincing after 10 years. Note that John sent the 1st RFC 4 years ago
> and we still do not have the kthreads :-/

Yes, it makes me very sad. It feels longer than that. We have had it in
RT ever since…

> Best Regards,
> Petr

Sebastian
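Side by side, the two orderings under discussion (both are sketches of
proposals from this thread, not merged code):

	/* Petr: pin the task before entering the deferred section. */
	migrate_disable();
	printk_deferred_enter();
	write_seqlock_irqsave(&zonelist_update_seq, flags);
	/* ... rebuild the zonelists ... */
	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
	printk_deferred_exit();
	migrate_enable();

	/*
	 * Sebastian: take the seqlock first. After write_seqlock_irqsave()
	 * the task cannot change CPUs anyway (IRQs are off on !RT; the
	 * acquired lock disables migration on RT), so no extra
	 * migrate_disable() is needed.
	 */
	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();
	/* ... rebuild the zonelists ... */
	printk_deferred_exit();
	write_sequnlock_irqrestore(&zonelist_update_seq, flags);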
On 2023-06-23 12:53:57 [+0200], Petr Mladek wrote:
> BTW: I see a possible race in the below code.
>
> > | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
> > | {
> > | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
> > |
> > | 	if (unlikely(seq & 1)) {
> > | 		spin_lock(s->lock);
> > | 		spin_unlock(s->lock);
>
> This guarantees the reader waits for the writer. But does this
> guarantee that another writer could not bump the sequence
> before we read it below?

A second writer can bump the seq after the first is done, correct.

> > | 		seq = READ_ONCE(s->seqcount.sequence);
>
> IMHO, it should be:
>
> 	if (unlikely(seq & 1)) {
> 		spin_lock(s->lock);
> 		seq = READ_ONCE(s->seqcount.sequence);
> 		spin_unlock(s->lock);
>
> IMHO, only this would guarantee that the returned seq couldn't be odd.

This doesn't buy us anything. If the writer increments the counter
after the reader unlocks, then the reader observes the odd seq number
and blocks immediately (within read_seqbegin()) on the lock without
spending cycles within the read section (outside of read_seqbegin())
only to learn about the seq increment in read_seqretry().

> Best Regards,
> Petr

Sebastian
On Fri 2023-06-23 15:31:27, Sebastian Andrzej Siewior wrote:
> On 2023-06-23 12:53:57 [+0200], Petr Mladek wrote:
> > BTW: I see a possible race in the below code.
> >
> > > | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
> > > | {
> > > | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
> > > |
> > > | 	if (unlikely(seq & 1)) {
> > > | 		spin_lock(s->lock);
> > > | 		spin_unlock(s->lock);
> >
> > This guarantees the reader waits for the writer. But does this
> > guarantee that another writer could not bump the sequence
> > before we read it below?
>
> A second writer can bump the seq after the first is done, correct.
>
> > > | 		seq = READ_ONCE(s->seqcount.sequence);
> >
> > IMHO, it should be:
> >
> > 	if (unlikely(seq & 1)) {
> > 		spin_lock(s->lock);
> > 		seq = READ_ONCE(s->seqcount.sequence);
> > 		spin_unlock(s->lock);
> >
> > IMHO, only this would guarantee that the returned seq couldn't be odd.
>
> This doesn't buy us anything.

Sure?

> If the writer increments the counter
> after the reader unlocks, then the reader observes the odd seq number
> and blocks immediately (within read_seqbegin()) on the lock without
> spending cycles within the read section (outside of read_seqbegin())
> only to learn about the seq increment in read_seqretry().

I am afraid that it will not get caught:

	static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start)
	{
		kcsan_atomic_next(0);
		return unlikely(READ_ONCE(s->sequence) != start);
	}

It checks if the sequence is the same. It does not check if it is odd.
But it might be odd if you read "start" outside of the lock and do not
check if it is odd.

I do not claim to be an expert but code like:

	spin_lock(s->lock);
	spin_unlock(s->lock);
	s = locked_variable;

just triggers many bells in my head. It is like something from a book
of "Classic locking mistakes".

With the code:

	spin_lock(s->lock);
	seq = s->seqcount.sequence;
	spin_unlock(s->lock);

you will always get the right value without any extra cost. The code
took the lock anyway. And the single assignment is negligible.
And RT could schedule inside the spin_lock anyway.

Or do I miss anything?

Best Regards,
Petr
On 2023-06-23 17:38:45 [+0200], Petr Mladek wrote:
> > If the writer increments the counter
> > after the reader unlocks, then the reader observes the odd seq number
> > and blocks immediately (within read_seqbegin()) on the lock without
> > spending cycles within the read section (outside of read_seqbegin())
> > only to learn about the seq increment in read_seqretry().
>
> I am afraid that it will not get caught:
>
> 	static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start)
> 	{
> 		kcsan_atomic_next(0);
> 		return unlikely(READ_ONCE(s->sequence) != start);
> 	}
>
> It checks if the sequence is the same. It does not check if it is odd.

You do realise that __read_seqcount_begin() is

| 	while ((__seq = seqprop_sequence(s)) & 1)	\
| 		cpu_relax();				\

so the code you are trying to modify is within seqprop_sequence(), and
you _do_ loop until the resulting __seq is even. Only then do you
continue towards do___read_seqcount_retry().

> But it might be odd if you read "start" outside of the lock and do not
> check if it is odd.

No, see above.

> I do not claim to be an expert but code like:
>
> 	spin_lock(s->lock);
> 	spin_unlock(s->lock);
> 	s = locked_variable;
>
> just triggers many bells in my head. It is like something from a book
> of "Classic locking mistakes".

It is a counter, not a locked variable, and it is designed to be read
outside of the lock. The lock here is not used for "protecting"
anything in the usual locking sense but for making progress.

> With the code:
>
> 	spin_lock(s->lock);
> 	seq = s->seqcount.sequence;
> 	spin_unlock(s->lock);
>
> you will always get the right value without any extra cost. The code
> took the lock anyway. And the single assignment is negligible.
> And RT could schedule inside the spin_lock anyway.
>
> Or do I miss anything?

Yes. See above. If, after the unlock, the read sequence is odd again we
do notice it. Doing it from within the locked section, we don't.

> Best Regards,
> Petr

Sebastian
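Putting the thread's conclusion in one place, the RT reader path with
the reasoning as comments (the code is as quoted above; only the
annotations are added here for illustration):

	unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
	{
		unsigned seq = READ_ONCE(s->seqcount.sequence);

		if (unlikely(seq & 1)) {
			/*
			 * A writer is in progress: block on its lock. On
			 * PREEMPT_RT this sleeps and PI-boosts the writer
			 * instead of spinning, so the writer is guaranteed
			 * to make progress.
			 */
			spin_lock(s->lock);
			spin_unlock(s->lock);

			/*
			 * Re-read outside the lock. If a second writer has
			 * already made the sequence odd again, the caller's
			 *   while ((__seq = seqprop_sequence(s)) & 1)
			 *       cpu_relax();
			 * loop in __read_seqcount_begin() calls back in here
			 * and we block on the new writer's lock; an odd
			 * value never leaks past read_seqbegin().
			 */
			seq = READ_ONCE(s->seqcount.sequence);
		}
		return seq;
	}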