Message ID | 20230621104034.HT6QnNkQ@linutronix.de (mailing list archive)
---|---
State | New
Series | mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
On Wed 21-06-23 12:40:34, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT because the inner spinlock_t is now acquired with disabled
> interrupts.

And the spinlock might sleep with PREEMPT_RT, so a deadlock, right? It
would be better to call that out explicitly.

> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
>
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks

> ---
>  mm/page_alloc.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..99b7e7d09c5c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
>  	unsigned long flags;
>
>  	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 * The zonelist_update_seq must be acquired with irqsave because the
> +	 * reader can be invoked from IRQ with GFP_ATOMIC.
>  	 */
> -	local_irq_save(flags);
> +	write_seqlock_irqsave(&zonelist_update_seq, flags);
>  	/*
>  	 * Explicitly disable this CPU's synchronous printk() before taking
>  	 * seqlock to prevent any printk() from trying to hold port->lock, for
> @@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
>  	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>  	 */
>  	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
>
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>
> -	write_sequnlock(&zonelist_update_seq);
>  	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
>  }
>
>  static noinline void __init
> --
> 2.40.1
On 21.06.23 12:40, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT because the inner spinlock_t is now acquired with disabled
> interrupts.
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
>
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  mm/page_alloc.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..99b7e7d09c5c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
>  	unsigned long flags;
>
>  	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 * The zonelist_update_seq must be acquired with irqsave because the
> +	 * reader can be invoked from IRQ with GFP_ATOMIC.
>  	 */
> -	local_irq_save(flags);
> +	write_seqlock_irqsave(&zonelist_update_seq, flags);
>  	/*
>  	 * Explicitly disable this CPU's synchronous printk() before taking
>  	 * seqlock to prevent any printk() from trying to hold port->lock, for
> @@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
>  	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>  	 */
>  	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
>
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>
> -	write_sequnlock(&zonelist_update_seq);
>  	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
>  }
>
>  static noinline void __init

Reviewed-by: David Hildenbrand <david@redhat.com>
On 2023-06-21 12:59:44 [+0200], Michal Hocko wrote:
> On Wed 21-06-23 12:40:34, Sebastian Andrzej Siewior wrote:
> > __build_all_zonelists() acquires zonelist_update_seq by first disabling
> > interrupts via local_irq_save() and then acquiring the seqlock with
> > write_seqlock(). This is troublesome and leads to problems on
> > PREEMPT_RT because the inner spinlock_t is now acquired with disabled
> > interrupts.
>
> And the spinlock might sleep with PREEMPT_RT, so a deadlock, right? It
> would be better to call that out explicitly.

No, no deadlock. Let me double check this in a VM with mem-hotplug later
on, but I don't expect an IRQ path. If there were one, there would be
more broken pieces…

What can happen on PREEMPT_RT is that the writer is preempted by a
high-priority reader which then deadlocks, because the reader spins while
waiting and the writer is blocked. For this issue we have lock + unlock
in the seq reader to PI-boost the seq writer so it can make progress.

Sebastian
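For reference, the PREEMPT_RT behaviour Sebastian describes can be sketched as follows. This is a simplified illustration, not the verbatim include/linux/seqlock.h source; the function name and the field access are abbreviated for the sketch:

	/*
	 * Sketch of the seqlock_t read side on PREEMPT_RT: instead of
	 * spinning while the sequence is odd, the reader briefly acquires
	 * the writer's rtmutex-based spinlock_t.  Blocking on it PI-boosts
	 * a preempted writer so the write section can finish and the
	 * sequence becomes even again.
	 */
	static unsigned rt_read_sequence(seqlock_t *sl)
	{
		unsigned seq = READ_ONCE(sl->seqcount.sequence);

		while (seq & 1) {		/* writer in progress */
			spin_lock(&sl->lock);	/* sleeps and boosts the writer */
			spin_unlock(&sl->lock);
			seq = READ_ONCE(sl->seqcount.sequence);
		}
		return seq;
	}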
On 2023/06/21 19:40, Sebastian Andrzej Siewior wrote:
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.

I can't follow. local_irq_save(flags); makes a non-migrate-able context
because sleeping is not allowed while IRQs are disabled, doesn't it?

> This is the case after zonelist_update_seq has been acquired.
>
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.

What guarantees that write_seqlock_irqsave() never calls printk()
(e.g. a lockdep warning) before printk_deferred_enter() takes effect?
On Wed 21-06-23 13:16:36, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 12:59:44 [+0200], Michal Hocko wrote:
> > On Wed 21-06-23 12:40:34, Sebastian Andrzej Siewior wrote:
> > > __build_all_zonelists() acquires zonelist_update_seq by first disabling
> > > interrupts via local_irq_save() and then acquiring the seqlock with
> > > write_seqlock(). This is troublesome and leads to problems on
> > > PREEMPT_RT because the inner spinlock_t is now acquired with disabled
> > > interrupts.
> >
> > And the spinlock might sleep with PREEMPT_RT, so a deadlock, right? It
> > would be better to call that out explicitly.
>
> No, no deadlock. Let me double check this in a VM with mem-hotplug later
> on, but I don't expect an IRQ path. If there were one, there would be
> more broken pieces…
>
> What can happen on PREEMPT_RT is that the writer is preempted by a
> high-priority reader which then deadlocks, because the reader spins while
> waiting and the writer is blocked. For this issue we have lock + unlock
> in the seq reader to PI-boost the seq writer so it can make progress.

Please state the problem explicitly in the changelog. You are
marking this patch as a fix, so the underlying issue should be stated.

Thanks!
On Wed 2023-06-21 20:33:35, Tetsuo Handa wrote:
> On 2023/06/21 19:40, Sebastian Andrzej Siewior wrote:
> > printk_deferred_enter() has to be invoked in non-migrate-able context to
> > ensure that deferred printing is enabled and disabled on the same CPU.
>
> I can't follow. local_irq_save(flags); makes a non-migrate-able context
> because sleeping is not allowed while IRQs are disabled, doesn't it?
>
> > This is the case after zonelist_update_seq has been acquired.
> >
> > Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> > printk output.
>
> What guarantees that write_seqlock_irqsave() never calls printk()
> (e.g. a lockdep warning) before printk_deferred_enter() takes effect?

I think that we should explicitly disable preemption in
printk_deferred_enter(). The disabled interrupts are not
strictly necessary.

I am going to resurrect the patch
https://lore.kernel.org/r/20230419074210.17646-1-pmladek@suse.com
and add the preempt_disable()/enable() at the same time.

Is this going to work for you?

Best Regards,
Petr
On 2023-06-21 20:33:35 [+0900], Tetsuo Handa wrote:
> On 2023/06/21 19:40, Sebastian Andrzej Siewior wrote:
> > printk_deferred_enter() has to be invoked in non-migrate-able context to
> > ensure that deferred printing is enabled and disabled on the same CPU.
>
> I can't follow. local_irq_save(flags); makes a non-migrate-able context
> because sleeping is not allowed while IRQs are disabled, doesn't it?

That is correct. The problem is the local_irq_save() which remains on
PREEMPT_RT, and this is problematic during write_seqlock() which acquires
a spinlock_t, which becomes a sleeping lock. write_seqlock_irqsave()
substitutes everything properly, so on RT there is no local_irq_save().

> > This is the case after zonelist_update_seq has been acquired.
> >
> > Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> > printk output.
>
> What guarantees that write_seqlock_irqsave() never calls printk()
> (e.g. a lockdep warning) before printk_deferred_enter() takes effect?

Actually nothing, but this could be fixed by doing the lockdep annotation
first, followed by the actual increment in do_write_seqcount_begin_nested().
I don't know why it is the other way around here, since we usually do the
annotation followed by the locking in order to see the splat before the
deadlock happens. Let me do a patch for that since it looks like the right
thing to do.

If this "rare" issue is a blocker, I don't see a sane way around it. Be
aware that tty_insert_flip_string_and_push_buffer() has atomic/KASAN
annotations which _could_ detect something which then leads to pr_err()
invocations while tty_port::lock is acquired. Looking over
tty_port_close_start(), there are pr_warn() invocations with the lock
held, so again, possibilities.

Petr, is printing via the printk threads and direct printing via an
explicit driver callback (which would avoid this problem) on the agenda?

Sebastian
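The reordering Sebastian proposes can be sketched like this (the function body as quoted later in the thread; the comments are added here for illustration):

	static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
	{
		/* lockdep annotation first: a splat printed here runs while
		 * the sequence is still even, so no reader can spin against
		 * this writer during the printk() */
		seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
		/* only now does the sequence count become odd */
		do_raw_write_seqcount_begin(s);
	}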
On 2023-06-21 14:40:50 [+0200], Petr Mladek wrote:
> I think that we should explicitly disable preemption in
> printk_deferred_enter(). The disabled interrupts are not
> strictly necessary.
>
> I am going to resurrect the patch
> https://lore.kernel.org/r/20230419074210.17646-1-pmladek@suse.com
> and add the preempt_disable()/enable() at the same time.
>
> Is this going to work for you?

No. migrate_disable() is the minimal thing that is needed. However, it
is heavy-weight compared to what is required. Also, by disabling
preemption we still have the problem later on with the seqlock_t. Let
me swap the lockdep annotation and we should be good.

> Best Regards,
> Petr

Sebastian
On 2023-06-21 13:49:06 [+0200], Michal Hocko wrote:
> > What can happen on PREEMPT_RT is that the writer is preempted by a
> > high-priority reader which then deadlocks, because the reader spins while
> > waiting and the writer is blocked. For this issue we have lock + unlock
> > in the seq reader to PI-boost the seq writer so it can make progress.
>
> Please state the problem explicitly in the changelog. You are
> marking this patch as a fix, so the underlying issue should be stated.

The problem is the "local_irq_save()", which I believe I stated. The lock
+ unlock story was just a side story and is already covered. I really
need just the local_irq_save() invocation to be part of the seqlock API
so it can be substituted away.

> Thanks!

Sebastian
On Wed 21-06-23 15:11:25, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 13:49:06 [+0200], Michal Hocko wrote:
> > > What can happen on PREEMPT_RT is that the writer is preempted by a
> > > high-priority reader which then deadlocks, because the reader spins while
> > > waiting and the writer is blocked. For this issue we have lock + unlock
> > > in the seq reader to PI-boost the seq writer so it can make progress.
> >
> > Please state the problem explicitly in the changelog. You are
> > marking this patch as a fix, so the underlying issue should be stated.
>
> The problem is the "local_irq_save()", which I believe I stated. The lock
> + unlock story was just a side story and is already covered. I really
> need just the local_irq_save() invocation to be part of the seqlock API
> so it can be substituted away.

I really do not want to nitpick, but your changelog states:
"This is troublesome and leads to problems on PREEMPT_RT because the
inner spinlock_t is now acquired with disabled interrupts."

I believe it would be beneficial to state why this is troublesome,
because not everybody has insight into PREEMPT_RT and all the
consequences.

Thanks!
On 2023-06-21 15:22:29 [+0200], Michal Hocko wrote:
> > The problem is the "local_irq_save()", which I believe I stated. The lock
> > + unlock story was just a side story and is already covered. I really
> > need just the local_irq_save() invocation to be part of the seqlock API
> > so it can be substituted away.
>
> I really do not want to nitpick, but your changelog states:
> "This is troublesome and leads to problems on PREEMPT_RT because the
> inner spinlock_t is now acquired with disabled interrupts."
>
> I believe it would be beneficial to state why this is troublesome,
> because not everybody has insight into PREEMPT_RT and all the
> consequences.

Now after re-reading it I do understand what you mean.

> Thanks!

Sebastian
On 2023/06/21 22:06, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 20:33:35 [+0900], Tetsuo Handa wrote:
>> On 2023/06/21 19:40, Sebastian Andrzej Siewior wrote:
>>> printk_deferred_enter() has to be invoked in non-migrate-able context to
>>> ensure that deferred printing is enabled and disabled on the same CPU.
>>
>> I can't follow. local_irq_save(flags); makes a non-migrate-able context
>> because sleeping is not allowed while IRQs are disabled, doesn't it?
>
> That is correct. The problem is the local_irq_save() which remains on
> PREEMPT_RT, and this is problematic during write_seqlock() which acquires
> a spinlock_t, which becomes a sleeping lock. write_seqlock_irqsave()
> substitutes everything properly, so on RT there is no local_irq_save().

include/linux/seqlock.h says

static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl)
{
	unsigned long flags;

	spin_lock_irqsave(&sl->lock, flags);
	do_write_seqcount_begin(&sl->seqcount.seqcount);
	return flags;
}

/**
 * write_seqlock_irqsave() - start a non-interruptible seqlock_t write
 * section
 * @lock:  Pointer to seqlock_t
 * @flags: Stack-allocated storage for saving caller's local interrupt
 *         state, to be passed to write_sequnlock_irqrestore().
 *
 * _irqsave variant of write_seqlock(). Use it only if the read side
 * section, or other write sections, can be invoked from hardirq context.
 */
#define write_seqlock_irqsave(lock, flags) \
	do { flags = __write_seqlock_irqsave(lock); } while (0)

/**
 * write_seqlock() - start a seqlock_t write side critical section
 * @sl: Pointer to seqlock_t
 *
 * write_seqlock opens a write side critical section for the given
 * seqlock_t. It also implicitly acquires the spinlock_t embedded inside
 * that sequential lock. All seqlock_t write side sections are thus
 * automatically serialized and non-preemptible.
 *
 * Context: if the seqlock_t read section, or other write side critical
 * sections, can be invoked from hardirq or softirq contexts, use the
 * _irqsave or _bh variants of this function instead.
 */
static inline void write_seqlock(seqlock_t *sl)
{
	spin_lock(&sl->lock);
	do_write_seqcount_begin(&sl->seqcount.seqcount);
}

and, regarding include/linux/spinlock.h, since spin_lock_irqsave(lock, flags)
is equivalent to local_irq_save(flags) + spin_lock(lock), there is no
difference between

	local_irq_save(flags);
	printk_deferred_enter();
	write_seqlock(&zonelist_update_seq);

and

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();

(except a potential lockdep warning). However, regarding
include/linux/spinlock_rt.h, since spin_lock_irqsave(lock, flags) is
equivalent to spin_lock(lock), there is a difference between

	local_irq_save(flags);
	printk_deferred_enter();
	write_seqlock(&zonelist_update_seq);

and

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();

. Is the above understanding correct?

And you are trying to replace

	local_irq_save(flags);
	printk_deferred_enter();
	write_seqlock(&zonelist_update_seq);

with

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();

, aren't you? But include/linux/printk.h says

/*
 * The printk_deferred_enter/exit macros are available only as a hack for
 * some code paths that need to defer all printk console printing. Interrupts
 * must be disabled for the deferred duration.
 */
#define printk_deferred_enter __printk_safe_enter
#define printk_deferred_exit __printk_safe_exit

which means that local_irq_save() is _required_ before printk_deferred_enter().

If local_irq_save() is hidden by your patch, what guarantees that
printk_deferred_enter() and printk_deferred_exit() run on the same CPU?

Also, if local_irq_save() is hidden due to RT, what guarantees that

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	<<IRQ>>
	  some_timer_function() {
	    printk();
	  }
	<</IRQ>>
	printk_deferred_enter();

does not happen, given that write_seqlock_irqsave() does not disable IRQs?

Disabling IRQs before incrementing zonelist_update_seq is _required_ for both

  making printk_deferred_enter() safe

and

  making sure that printk_deferred_enter() takes effect

.
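The difference Tetsuo spells out above can be summarized as a hypothetical helper; this is a sketch only (sketch_write_seqlock_irqsave() is an illustrative name, not a kernel API, and the real code goes through several wrapper layers):

#ifndef CONFIG_PREEMPT_RT
/* !RT: spin_lock_irqsave() == local_irq_save() + spin_lock() */
static unsigned long sketch_write_seqlock_irqsave(seqlock_t *sl)
{
	unsigned long flags;

	local_irq_save(flags);		/* hard IRQs off on this CPU ... */
	spin_lock(&sl->lock);
	do_write_seqcount_begin(&sl->seqcount.seqcount); /* ... before the count goes odd */
	return flags;
}
#else
/* RT: spin_lock_irqsave() == spin_lock() on a sleeping lock; it
 * disables migration, not hard IRQs, and flags stay unused */
static unsigned long sketch_write_seqlock_irqsave(seqlock_t *sl)
{
	spin_lock(&sl->lock);
	do_write_seqcount_begin(&sl->seqcount.seqcount);
	return 0;
}
#endif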
On 2023-06-21 22:32:40 [+0900], Tetsuo Handa wrote:
> include/linux/seqlock.h says
…
> Is the above understanding correct?

That is correct.

> And you are trying to replace
>
> 	local_irq_save(flags);
> 	printk_deferred_enter();
> 	write_seqlock(&zonelist_update_seq);
>
> with
>
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> 	printk_deferred_enter();
>
> , aren't you?

Correct.

> But include/linux/printk.h says
>
> /*
>  * The printk_deferred_enter/exit macros are available only as a hack for
>  * some code paths that need to defer all printk console printing. Interrupts
>  * must be disabled for the deferred duration.
>  */
> #define printk_deferred_enter __printk_safe_enter
> #define printk_deferred_exit __printk_safe_exit
>
> which means that local_irq_save() is _required_ before printk_deferred_enter().

It says that, yes, but that documented requirement is heavier than
necessary. The actual requirement is that printk_deferred_enter() happens
on the same CPU as printk_deferred_exit(). This can be achieved by an
explicit local_irq_save(), yes, but also by something like spin_lock_irq()
which _ensures_ that the task does not migrate to another CPU while the
lock is acquired. This is the requirement of the current implementation.

> If local_irq_save() is hidden by your patch, what guarantees that
> printk_deferred_enter() and printk_deferred_exit() run on the same CPU?

Because spin_lock_irqsave() on CONFIG_PREEMPT_RT uses migrate_disable().
The function ensures that the scheduler does not migrate the task to
another CPU. The CPU is even blocked from going down (as in CPU-hotplug)
until the matching migrate_enable() occurs.

> Also, if local_irq_save() is hidden due to RT, what guarantees that
>
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> 	<<IRQ>>
> 	  some_timer_function() {
> 	    printk();
> 	  }
> 	<</IRQ>>
> 	printk_deferred_enter();
>
> does not happen, given that write_seqlock_irqsave() does not disable IRQs?

I don't see how zonelist_update_seq and printk are connected here
without the port lock or a memory allocation. But there are two things
that are different on RT which probably answer your question:

- If the reader observes an odd sequence number, then it acquires the
  lock of the sequence counter (it is held by the writer), which forces
  the writer to complete the write critical section, and then the reader
  can continue. There are _no_ memory allocations within hard IRQ
  context (as in the actual interrupt). The timers (hrtimer or
  timer_list timer) are served in task context and we have
  forced-threaded interrupts. Clearly this means that the seqlock_t (as
  used here) can only be used in task context and not in hard IRQ
  context.

- The printk implementation is slightly different and work is ongoing
  to merge it upstream. The two important differences here:
  - Printing happens by default in a dedicated printing thread.
  - In emergency cases like panic(), printing happens directly within
    the invocation of printk(). This requires a so-called atomic console
    which does not use the tty_port::lock.

> Disabling IRQs before incrementing zonelist_update_seq is _required_ for both
>
>   making printk_deferred_enter() safe
>
> and
>
>   making sure that printk_deferred_enter() takes effect
>
> .

Did I explain why it is sufficient to do

	write_seqlock_irqsave()
	printk_deferred_enter()

assuming we have

| static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
| {
| 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
| 	do_raw_write_seqcount_begin(s);
| }

?

Sebastian
On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
>> Also, if local_irq_save() is hidden due to RT, what guarantees that
>>
>> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
>> 	<<IRQ>>
>> 	  some_timer_function() {
>> 	    printk();
>> 	  }
>> 	<</IRQ>>
>> 	printk_deferred_enter();
>>
>> does not happen, given that write_seqlock_irqsave() does not disable IRQs?
>
> I don't see how zonelist_update_seq and printk are connected here
> without the port lock or a memory allocation. But there are two things
> that are different on RT which probably answer your question:

It is explained as the first deadlock scenario in commit 1007843a9190
("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
We have to disable IRQs before making zonelist_update_seq.seqcount odd.
On Wed 2023-06-21 16:34:21, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 22:32:40 [+0900], Tetsuo Handa wrote:
> > include/linux/seqlock.h says
> …
> > Is the above understanding correct?
>
> That is correct.
>
> > And you are trying to replace
> >
> > 	local_irq_save(flags);
> > 	printk_deferred_enter();
> > 	write_seqlock(&zonelist_update_seq);
> >
> > with
> >
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > 	printk_deferred_enter();
> >
> > , aren't you?
>
> Correct.
>
> > But include/linux/printk.h says
> >
> > /*
> >  * The printk_deferred_enter/exit macros are available only as a hack for
> >  * some code paths that need to defer all printk console printing. Interrupts
> >  * must be disabled for the deferred duration.
> >  */
> > #define printk_deferred_enter __printk_safe_enter
> > #define printk_deferred_exit __printk_safe_exit
> >
> > which means that local_irq_save() is _required_ before printk_deferred_enter().
>
> It says that, yes, but that documented requirement is heavier than
> necessary. The actual requirement is that printk_deferred_enter() happens
> on the same CPU as printk_deferred_exit().

True.

> This can be achieved by an
> explicit local_irq_save(), yes, but also by something like spin_lock_irq()
> which _ensures_ that the task does not migrate to another CPU while the
> lock is acquired. This is the requirement of the current implementation.

Good to know.

> > If local_irq_save() is hidden by your patch, what guarantees that
> > printk_deferred_enter() and printk_deferred_exit() run on the same CPU?
>
> Because spin_lock_irqsave() on CONFIG_PREEMPT_RT uses migrate_disable().
> The function ensures that the scheduler does not migrate the task to
> another CPU. The CPU is even blocked from going down (as in CPU-hotplug)
> until the matching migrate_enable() occurs.
>
> > Also, if local_irq_save() is hidden due to RT, what guarantees that
> >
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > 	<<IRQ>>
> > 	  some_timer_function() {
> > 	    printk();
> > 	  }
> > 	<</IRQ>>
> > 	printk_deferred_enter();
> >
> > does not happen, given that write_seqlock_irqsave() does not disable IRQs?
>
> I don't see how zonelist_update_seq and printk are connected here
> without the port lock or a memory allocation. But there are two things
> that are different on RT which probably answer your question:
>
> - If the reader observes an odd sequence number, then it acquires the
>   lock of the sequence counter (it is held by the writer), which forces
>   the writer to complete the write critical section, and then the reader
>   can continue. There are _no_ memory allocations within hard IRQ
>   context (as in the actual interrupt). The timers (hrtimer or
>   timer_list timer) are served in task context and we have
>   forced-threaded interrupts. Clearly this means that the seqlock_t (as
>   used here) can only be used in task context and not in hard IRQ
>   context.
>
> - The printk implementation is slightly different and work is ongoing
>   to merge it upstream. The two important differences here:
>   - Printing happens by default in a dedicated printing thread.
>   - In emergency cases like panic(), printing happens directly within
>     the invocation of printk(). This requires a so-called atomic console
>     which does not use the tty_port::lock.

If I get it correctly, RT solves both possible deadlocks by offloading
the nested operation into a kthread (irq and printk threads).
Plus, printk uses emergency_write() when the kthread is not usable.

If this is true, then printk_safe_enter() might be a nop on RT.
All possible deadlocks are prevented either by the kthread or
the console->emergency_write() call.

But wait. AFAIK, the printk kthread is implemented only for the serial
console. So, it won't be safe for other consoles, especially
the problematic tty_insert_flip_string_and_push_buffer() call.

Note that adding a printk thread for the graphical consoles will
be a real challenge. We have hard times even with the basic
UART 8250. There are still races possible in the implementation
in the RT tree...

OK, what about using migrate_disable() in printk_deferred_enter()?
Something like:

/*
 * The printk_deferred_enter/exit macros are available only as a hack.
 * They define a per-CPU context where all printk console printing
 * is deferred because it might cause a deadlock otherwise.
 *
 * It is highly recommended to use them only in a context with interrupts
 * disabled. Otherwise, other unrelated printk() calls might be deferred
 * when they interrupt/preempt the deferred code section.
 *
 * They should not be used to defer printing of many messages. It might
 * create a softlockup when they are flushed later.
 *
 * IMPORTANT: Any new use of these MUST be consulted with printk maintainers.
 * It might have unexpected side effects on the printk infrastructure.
 */
#ifdef CONFIG_PREEMPT_RT

#define printk_deferred_enter()			\
	do {					\
		migrate_disable();		\
		__printk_safe_enter();		\
	} while (0)

#define printk_deferred_exit()			\
	do {					\
		__printk_safe_exit();		\
		migrate_enable();		\
	} while (0)

#else /* CONFIG_PREEMPT_RT */

#define printk_deferred_enter()			\
	do {					\
		preempt_disable();		\
		__printk_safe_enter();		\
	} while (0)

#define printk_deferred_exit()			\
	do {					\
		__printk_safe_exit();		\
		preempt_enable();		\
	} while (0)

#endif /* CONFIG_PREEMPT_RT */

Note that I have used preempt_disable() on non-RT because it is much
cheaper. And IRQs will be disabled anyway on non-RT systems
in this code path.

> > Disabling IRQs before incrementing zonelist_update_seq is _required_ for both
> >
> >   making printk_deferred_enter() safe
> >
> > and
> >
> >   making sure that printk_deferred_enter() takes effect
> >
> > .
>
> Did I explain why it is sufficient to do
>
> 	write_seqlock_irqsave()
> 	printk_deferred_enter()
>
> assuming we have
>
> | static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
> | {
> | 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
> | 	do_raw_write_seqcount_begin(s);
> | }

Will this prevent any printk() called on the same CPU before
printk_deferred_enter() is called?

Best Regards,
Petr
On 2023/06/21 23:50, Tetsuo Handa wrote:
> On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
> >> Also, if local_irq_save() is hidden due to RT, what guarantees that
> >>
> >> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> >> 	<<IRQ>>
> >> 	  some_timer_function() {
> >> 	    printk();
> >> 	  }
> >> 	<</IRQ>>
> >> 	printk_deferred_enter();
> >>
> >> does not happen, given that write_seqlock_irqsave() does not disable IRQs?
> >
> > I don't see how zonelist_update_seq and printk are connected here
> > without the port lock or a memory allocation. But there are two things
> > that are different on RT which probably answer your question:
>
> It is explained as the first deadlock scenario in commit 1007843a9190
> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> We have to disable IRQs before making zonelist_update_seq.seqcount odd.

Since we must replace local_irq_save() + write_seqlock() with
write_seqlock_irqsave() for the CONFIG_PREEMPT_RT=y case, but must not
replace local_irq_save() + write_seqlock() with write_seqlock_irqsave()
for the CONFIG_PREEMPT_RT=n case, is the proper fix something like below?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..e3e9bd719dcc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5798,28 +5798,30 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 #define BOOT_PAGESET_HIGH	0
 #define BOOT_PAGESET_BATCH	1
 static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats);
 
 static void __build_all_zonelists(void *data)
 {
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+#ifndef CONFIG_PREEMPT_RT
 	unsigned long flags;
 
 	/*
 	 * Explicitly disable this CPU's interrupts before taking seqlock
 	 * to prevent any IRQ handler from calling into the page allocator
 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
 	 */
 	local_irq_save(flags);
+#endif
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
 	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -5852,21 +5854,23 @@ static void __build_all_zonelists(void *data)
 	 * secondary cpus' numa_mem as they come on-line. During
 	 * node/memory hotplug, we'll fixup all on-line cpus.
 	 */
 	for_each_online_cpu(cpu)
 		set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
 #endif
 	}
 
 	write_sequnlock(&zonelist_update_seq);
 	printk_deferred_exit();
+#ifndef CONFIG_PREEMPT_RT
 	local_irq_restore(flags);
+#endif
 }
 
 static noinline void __init
 build_all_zonelists_init(void)
 {
 	int cpu;
 
 	__build_all_zonelists(NULL);
 
 /*

By the way, given the

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	<<IRQ>>
	  some_timer_function() {
	    kmalloc(GFP_ATOMIC);
	  }
	<</IRQ>>
	printk_deferred_enter();

scenario: if the CONFIG_PREEMPT_RT=y case is handled by executing
some_timer_function() on a dedicated kernel thread for IRQs, what
guarantees that the kernel thread for IRQs gives up the CPU and the user
thread which called write_seqlock() gains the CPU until write_sequnlock()
is called? How can the kernel figure out that executing the user thread
needs higher priority than the kernel thread?
On Thu 22-06-23 08:24:30, Tetsuo Handa wrote:
> On 2023/06/21 23:50, Tetsuo Handa wrote:
> > On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
> >>> Also, if local_irq_save() is hidden due to RT, what guarantees that
> >>>
> >>> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> >>> 	<<IRQ>>
> >>> 	  some_timer_function() {
> >>> 	    printk();
> >>> 	  }
> >>> 	<</IRQ>>
> >>> 	printk_deferred_enter();
> >>>
> >>> does not happen, given that write_seqlock_irqsave() does not disable IRQs?
> >>
> >> I don't see how zonelist_update_seq and printk are connected here
> >> without the port lock or a memory allocation. But there are two things
> >> that are different on RT which probably answer your question:
> >
> > It is explained as the first deadlock scenario in commit 1007843a9190
> > ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> > We have to disable IRQs before making zonelist_update_seq.seqcount odd.
>
> Since we must replace local_irq_save() + write_seqlock() with
> write_seqlock_irqsave() for the CONFIG_PREEMPT_RT=y case, but must not
> replace local_irq_save() + write_seqlock() with write_seqlock_irqsave()
> for the CONFIG_PREEMPT_RT=n case, is the proper fix something like below?

Now I am confused. Why is write_seqlock_irqsave() not allowed for !RT?
Let me quote the changelog and the scenario 1:

	write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
	// e.g. timer interrupt handler runs at this moment
	some_timer_func() {
		kmalloc(GFP_ATOMIC) {
			__alloc_pages_slowpath() {
				read_seqbegin(&zonelist_update_seq) {
					// spins forever because zonelist_update_seq.seqcount is odd
				}
			}
		}
	}
	// e.g. timer interrupt handler finishes
	write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even

This is clearly impossible with write_seqlock_irqsave(), as interrupts
are disabled before the lock is taken.
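Michal's point in code form (illustrative only; the comments restate the scenario above):

	unsigned long flags;

	write_seqlock_irqsave(&zonelist_update_seq, flags); /* IRQs off, then seqcount odd */
	/* no IRQ handler can run on this CPU here, so no GFP_ATOMIC
	 * allocation can spin in read_seqbegin() against this writer */
	write_sequnlock_irqrestore(&zonelist_update_seq, flags); /* seqcount even, then IRQs on */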
On 2023/06/22 16:18, Michal Hocko wrote:
>>> It is explained as the first deadlock scenario in commit 1007843a9190
>>> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
>>> We have to disable IRQs before making zonelist_update_seq.seqcount odd.
>>
>> Since we must replace local_irq_save() + write_seqlock() with
>> write_seqlock_irqsave() for the CONFIG_PREEMPT_RT=y case, but must not
>> replace local_irq_save() + write_seqlock() with write_seqlock_irqsave()
>> for the CONFIG_PREEMPT_RT=n case, is the proper fix something like below?
>
> Now I am confused. Why is write_seqlock_irqsave() not allowed for !RT?
> Let me quote the changelog and the scenario 1:
>
> 	write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
> 	// e.g. timer interrupt handler runs at this moment
> 	some_timer_func() {
> 		kmalloc(GFP_ATOMIC) {
> 			__alloc_pages_slowpath() {
> 				read_seqbegin(&zonelist_update_seq) {
> 					// spins forever because zonelist_update_seq.seqcount is odd
> 				}
> 			}
> 		}
> 	}
> 	// e.g. timer interrupt handler finishes
> 	write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
>
> This is clearly impossible with write_seqlock_irqsave(), as interrupts
> are disabled before the lock is taken.

Well, it seems it is "I don't want to replace" rather than "we must not
replace". I reread the thread but I couldn't find why nobody suggested
write_seqlock_irqsave(). The reason I proposed the

	local_irq_save() => printk_deferred_enter() => write_seqlock()

ordering implies a precaution in case write_seqlock() involves printk()
(e.g. lockdep, KCSAN, soft-lockup warning), in addition to the
"local_irq_save() before printk_deferred_enter()" requirement. Maybe
people in that thread were happy with preserving this precaution...

You commented

	There shouldn't be any other locks (apart from hotplug) taken in that path IIRC.

at https://lkml.kernel.org/ZCrYQj+2/uMtqNBm@dhcp22.suse.cz .

If __build_all_zonelists() is already serialized by the hotplug lock, we
don't need to call spin_lock(&zonelist_update_seq.lock) and we will be
able to replace write_seqlock(&zonelist_update_seq) with
write_seqcount_begin(&zonelist_update_seq.seqcount) like
cpuset_change_task_nodemask() does?
On Thu 22-06-23 19:58:33, Tetsuo Handa wrote:
> On 2023/06/22 16:18, Michal Hocko wrote:
> >>> It is explained as the first deadlock scenario in commit 1007843a9190
> >>> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> >>> We have to disable IRQs before making zonelist_update_seq.seqcount odd.
> >>
> >> Since we must replace local_irq_save() + write_seqlock() with
> >> write_seqlock_irqsave() for the CONFIG_PREEMPT_RT=y case, but must not
> >> replace local_irq_save() + write_seqlock() with write_seqlock_irqsave()
> >> for the CONFIG_PREEMPT_RT=n case, is the proper fix something like below?
> >
> > Now I am confused. Why is write_seqlock_irqsave() not allowed for !RT?
> > Let me quote the changelog and the scenario 1:
> >
> > 	write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
> > 	// e.g. timer interrupt handler runs at this moment
> > 	some_timer_func() {
> > 		kmalloc(GFP_ATOMIC) {
> > 			__alloc_pages_slowpath() {
> > 				read_seqbegin(&zonelist_update_seq) {
> > 					// spins forever because zonelist_update_seq.seqcount is odd
> > 				}
> > 			}
> > 		}
> > 	}
> > 	// e.g. timer interrupt handler finishes
> > 	write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
> >
> > This is clearly impossible with write_seqlock_irqsave(), as interrupts
> > are disabled before the lock is taken.
>
> Well, it seems it is "I don't want to replace" rather than "we must not
> replace".

OK, so this is an alternative fix rather than the proposed fix being
incorrect.

> I reread the thread but I couldn't find why nobody suggested write_seqlock_irqsave().
> The reason I proposed the
>
> 	local_irq_save() => printk_deferred_enter() => write_seqlock()
>
> ordering implies a precaution in case write_seqlock() involves printk()
> (e.g. lockdep, KCSAN, soft-lockup warning), in addition to the
> "local_irq_save() before printk_deferred_enter()" requirement. Maybe
> people in that thread were happy with preserving this precaution...

Precaution is a fair argument. I am not sure it is the strongest one to
justify the ugly RT special-casing though. I would propose to go with
Sebastian's patch as a clear fix, and if you really care about the
precaution then make sure you describe the potential problems.

> You commented
>
> 	There shouldn't be any other locks (apart from hotplug) taken in that path IIRC.
>
> at https://lkml.kernel.org/ZCrYQj+2/uMtqNBm@dhcp22.suse.cz .
>
> If __build_all_zonelists() is already serialized by the hotplug lock, we
> don't need to call spin_lock(&zonelist_update_seq.lock) and we will be
> able to replace write_seqlock(&zonelist_update_seq) with
> write_seqcount_begin(&zonelist_update_seq.seqcount) like
> cpuset_change_task_nodemask() does?

Maybe, I haven't really dived into this deeper. One way or the other,
RT requires special IRQ handling along with the seq lock, no?
On 2023/06/22 8:24, Tetsuo Handa wrote:
> By the way, given the
>
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> 	<<IRQ>>
> 	  some_timer_function() {
> 	    kmalloc(GFP_ATOMIC);
> 	  }
> 	<</IRQ>>
> 	printk_deferred_enter();
>
> scenario: if the CONFIG_PREEMPT_RT=y case is handled by executing
> some_timer_function() on a dedicated kernel thread for IRQs, what
> guarantees that the kernel thread for IRQs gives up the CPU and the user
> thread which called write_seqlock() gains the CPU until write_sequnlock()
> is called? How can the kernel figure out that executing the user thread
> needs higher priority than the kernel thread?

I haven't got a response to this question.

Several years ago, I demonstrated that a SCHED_IDLE priority userspace
thread holding oom_lock causes other concurrently allocating !SCHED_IDLE
priority threads to misunderstand that mutex_trylock(&oom_lock) failure
implies we are making forward progress (despite the SCHED_IDLE priority
userspace thread being unable to wake up for minutes).

If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is
preempted by some other !SCHED_IDLE priority threads (especially realtime
priority threads), and such a !SCHED_IDLE priority thread calls
kmalloc(GFP_ATOMIC) or printk(), a similar thing can happen
(misunderstanding that spinning on read_seqbegin() from
zonelist_iter_begin() can make forward progress despite the thread which
called write_seqlock_irqsave() being unable to make progress due to
preemption).

Question to Sebastian:
To make sure that such a thing cannot happen, we should make sure that
a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount)
from write_seqlock_irqsave(&zonelist_update_seq, flags) can continue using
the CPU until write_seqcount_end(&zonelist_update_seq.seqcount) from
write_sequnlock_irqrestore(&zonelist_update_seq, flags).
Does adding preempt_disable() before write_seqlock(&zonelist_update_seq, flags) help?

Question to Petr:
Even if local_irq_save(flags) disables IRQs, NMI context can enqueue a
message via printk(). When does a message enqueued from NMI context get
printed? Is there a possibility that a message enqueued from NMI context
gets printed between
"write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
between "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
If yes, we can't increment zonelist_update_seq.seqcount before
printk_deferred_enter()...
On Thu 2023-06-22 22:36:27, Tetsuo Handa wrote:
> On 2023/06/22 8:24, Tetsuo Handa wrote:
> > By the way, given the
> >
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > 	<<IRQ>>
> > 	  some_timer_function() {
> > 	    kmalloc(GFP_ATOMIC);
> > 	  }
> > 	<</IRQ>>
> > 	printk_deferred_enter();
> >
> > scenario: if the CONFIG_PREEMPT_RT=y case is handled by executing
> > some_timer_function() on a dedicated kernel thread for IRQs, what
> > guarantees that the kernel thread for IRQs gives up the CPU and the user
> > thread which called write_seqlock() gains the CPU until write_sequnlock()
> > is called? How can the kernel figure out that executing the user thread
> > needs higher priority than the kernel thread?
>
> I haven't got a response to this question.
>
> Several years ago, I demonstrated that a SCHED_IDLE priority userspace
> thread holding oom_lock causes other concurrently allocating !SCHED_IDLE
> priority threads to misunderstand that mutex_trylock(&oom_lock) failure
> implies we are making forward progress (despite the SCHED_IDLE priority
> userspace thread being unable to wake up for minutes).
>
> If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is
> preempted by some other !SCHED_IDLE priority threads (especially realtime
> priority threads), and such a !SCHED_IDLE priority thread calls
> kmalloc(GFP_ATOMIC) or printk(), a similar thing can happen
> (misunderstanding that spinning on read_seqbegin() from
> zonelist_iter_begin() can make forward progress despite the thread which
> called write_seqlock_irqsave() being unable to make progress due to
> preemption).
>
> Question to Sebastian:
> To make sure that such a thing cannot happen, we should make sure that
> a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount)
> from write_seqlock_irqsave(&zonelist_update_seq, flags) can continue using
> the CPU until write_seqcount_end(&zonelist_update_seq.seqcount) from
> write_sequnlock_irqrestore(&zonelist_update_seq, flags).
> Does adding preempt_disable() before write_seqlock(&zonelist_update_seq, flags) help?
>
> Question to Petr:
> Even if local_irq_save(flags) disables IRQs, NMI context can enqueue a
> message via printk(). When does a message enqueued from NMI context get
> printed?

They are flushed to the console either by irq_work or by another
printk(). The irq_work cannot proceed when IRQs are disabled, but
another non-deferred printk() would try to flush them immediately.

> Is there a possibility
> that a message enqueued from NMI context gets printed between
> "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
> between "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
> If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...

It might happen when a printk() is called in these holes.

Best Regards,
Petr
On 2023/06/22 23:11, Petr Mladek wrote:
>> Question to Petr:
>> Even if local_irq_save(flags) disables IRQs, NMI context can enqueue a
>> message via printk(). When does a message enqueued from NMI context get
>> printed?
>
> They are flushed to the console either by irq_work or by another
> printk(). The irq_work cannot proceed when IRQs are disabled,

Is that rule the same for both CONFIG_PREEMPT_RT=y and CONFIG_PREEMPT_RT=n?
If yes, then when Sebastian's patch is applied, IRQs will not be disabled
for CONFIG_PREEMPT_RT=y kernels, because write_seqlock_irqsave() is
equivalent to write_seqlock() there (spin_lock_irqsave(lock, flags) is
equivalent to spin_lock(lock)), and

> but another non-deferred printk() would try to flush them immediately.
>
>> Is there a possibility
>> that a message enqueued from NMI context gets printed between
>> "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
>> between "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
>> If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...
>
> It might happen when a printk() is called in these holes.

printk() can happen in these holes on CONFIG_PREEMPT_RT=y kernels. We
will need to call printk_deferred_enter() before incrementing
zonelist_update_seq.seqcount in order to close these holes.
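A minimal sketch of the call-site ordering Tetsuo argues for here, assuming printk_deferred_enter() is first made to pin the task to a CPU by itself (which the following messages discuss):

	printk_deferred_enter();	/* defer console printing first */
	write_seqlock_irqsave(&zonelist_update_seq, flags); /* only now seqcount goes odd */

	/* ... rebuild the zonelists ... */

	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
	printk_deferred_exit();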
On Thu 2023-06-22 16:11:41, Petr Mladek wrote:
> On Thu 2023-06-22 22:36:27, Tetsuo Handa wrote:
> > On 2023/06/22 8:24, Tetsuo Handa wrote:
> > > By the way, given the
> > >
> > > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > > 	<<IRQ>>
> > > 	  some_timer_function() {
> > > 	    kmalloc(GFP_ATOMIC);
> > > 	  }
> > > 	<</IRQ>>
> > > 	printk_deferred_enter();
> > >
> > > scenario: if the CONFIG_PREEMPT_RT=y case is handled by executing
> > > some_timer_function() on a dedicated kernel thread for IRQs, what
> > > guarantees that the kernel thread for IRQs gives up the CPU and the user
> > > thread which called write_seqlock() gains the CPU until write_sequnlock()
> > > is called? How can the kernel figure out that executing the user thread
> > > needs higher priority than the kernel thread?

My understanding is that this is achieved by
spin_lock_irqsave(&sl->lock, flags). When RT is enabled, rt_spin_lock(lock)
is used.

AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
The owner can schedule. The waiter can schedule as well, so that
they can be running on the same CPU. Also, the current owner gets
a higher priority when there is a waiter with a higher priority, to
avoid priority inversion.

> > I haven't got a response to this question.
> >
> > Several years ago, I demonstrated that a SCHED_IDLE priority userspace
> > thread holding oom_lock causes other concurrently allocating !SCHED_IDLE
> > priority threads to misunderstand that mutex_trylock(&oom_lock) failure
> > implies we are making forward progress (despite the SCHED_IDLE priority
> > userspace thread being unable to wake up for minutes).
> >
> > If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is
> > preempted by some other !SCHED_IDLE priority threads (especially realtime
> > priority threads), and such a !SCHED_IDLE priority thread calls
> > kmalloc(GFP_ATOMIC) or printk(), a similar thing can happen
> > (misunderstanding that spinning on read_seqbegin() from
> > zonelist_iter_begin() can make forward progress despite the thread which
> > called write_seqlock_irqsave() being unable to make progress due to
> > preemption).
> >
> > Question to Sebastian:
> > To make sure that such a thing cannot happen, we should make sure that
> > a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount)
> > from write_seqlock_irqsave(&zonelist_update_seq, flags) can continue using
> > the CPU until write_seqcount_end(&zonelist_update_seq.seqcount) from
> > write_sequnlock_irqrestore(&zonelist_update_seq, flags).
> > Does adding preempt_disable() before write_seqlock(&zonelist_update_seq, flags) help?
> >
> > Question to Petr:
> > Even if local_irq_save(flags) disables IRQs, NMI context can enqueue a
> > message via printk(). When does a message enqueued from NMI context get
> > printed?
>
> They are flushed to the console either by irq_work or by another
> printk(). The irq_work cannot proceed when IRQs are disabled, but
> another non-deferred printk() would try to flush them immediately.
>
> > Is there a possibility
> > that a message enqueued from NMI context gets printed between
> > "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
> > between "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
> > If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...
>
> It might happen when a printk() is called in these holes.

I believe that this hole is the only remaining problem.

IMHO, the only solution is to disable migration/preemption in
printk_deferred_enter(). It is not a big deal, really. __rt_spin_lock()
would call migrate_disable() anyway. And a nested migrate_disable() is
fast: it just increments a counter when it is not 0.

I would suggest doing this change in printk_deferred_enter() first. It
will allow calling it before write_seqlock_irqsave(), and we will not
need to change the ordering back and forth.

The result would look like:

in include/linux/printk.h:

static inline void printk_deferred_enter(void)
{
#ifndef CONFIG_PREEMPT_RT
	preempt_disable();
#else
	migrate_disable();
#endif
	__printk_safe_enter();
}

in mm/page_alloc.c:

	printk_deferred_enter();
	write_seqlock_irqsave(&zonelist_update_seq, flags);

Best Regards,
Petr
On 2023/06/23 0:04, Petr Mladek wrote:
> AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
> The owner can schedule. The waiter can schedule as well, so that
> they can be running on the same CPU. Also, the current owner gets
> a higher priority when there is a waiter with a higher priority, to
> avoid priority inversion.

Excuse me, but that is about multiple threads trying to acquire the same
lock, isn't it? Our case is that the one thread which makes
zonelist_update_seq.seqcount odd acquires zonelist_update_seq.lock, but
the threads spinning at read_seqbegin(zonelist_update_seq.seqcount) from
zonelist_iter_begin() do nothing but a cpu_relax() busy loop. There is no
way to teach these threads that they need to give the CPU to the thread
which made zonelist_update_seq.seqcount odd...

> The result would look like:
>
> in include/linux/printk.h:
>
> static inline void printk_deferred_enter(void)
> {
> #ifndef CONFIG_PREEMPT_RT
> 	preempt_disable();
> #else
> 	migrate_disable();
> #endif
> 	__printk_safe_enter();
> }
>
> in mm/page_alloc.c:
>
> 	printk_deferred_enter();
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);

OK. But for stable,

+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		migrate_disable();
 	/*
 	 * Explicitly disable this CPU's interrupts before taking seqlock
 	 * to prevent any IRQ handler from calling into the page allocator
 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
 	 */
 	local_irq_save(flags);
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
 	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);

will be easier to apply.
On 2023-06-21 23:50:02 [+0900], Tetsuo Handa wrote:
> On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
> >> Also, if local_irq_save() is hidden due to RT, what guarantees that
> >>
> >> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> >> 	<<IRQ>>
> >> 	  some_timer_function() {
> >> 	    printk();
> >> 	  }
> >> 	<</IRQ>>
> >> 	printk_deferred_enter();
> >>
> >> does not happen, given that write_seqlock_irqsave() does not disable IRQs?
> >
> > I don't see how zonelist_update_seq and printk are connected here
> > without the port lock or a memory allocation. But there are two things
> > that are different on RT which probably answer your question:
>
> It is explained as the first deadlock scenario in commit 1007843a9190
> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> We have to disable IRQs before making zonelist_update_seq.seqcount odd.

You did explain the relation zonelist_update_seq -> kmalloc(, GFP_ATOMIC)
and tty_insert_flip_string_and_push_buffer() -> kmalloc(, GFP_ATOMIC)
with printk() on another CPU. As far as PREEMPT_RT goes, it is not an
issue because the port lock is not used, or printing is deferred.

Sebastian
On 2023-06-21 17:38:03 [+0200], Petr Mladek wrote:
> If I get it correctly, RT solves both possible deadlocks by offloading
> the nested operation into a kthread (irq and printk threads).
> Plus, printk uses emergency_write() when the kthread is not usable.

Correct.

> If this is true, then printk_safe_enter() might be a nop on RT.

It is gone, actually.

> All possible deadlocks are prevented either by the kthread or
> the console->emergency_write() call.

Correct.

> But wait. AFAIK, the printk kthread is implemented only for the serial
> console. So, it won't be safe for other consoles, especially
> the problematic tty_insert_flip_string_and_push_buffer() call.

If the special printing is not implemented, then there is only printing
via the kthread, meaning no emergency printing when it is needed. Once
the interface is upstream, more drivers can be added, including GPU-based
ones.

> Note that adding a printk thread for the graphical consoles will
> be a real challenge. We have hard times even with the basic
> UART 8250. There are still races possible in the implementation
> in the RT tree...

John has been mumbling something about those but did not send anything.

> OK, what about using migrate_disable() in printk_deferred_enter()?
> Something like:
>
> /*
>  * The printk_deferred_enter/exit macros are available only as a hack.
>  * They define a per-CPU context where all printk console printing
>  * is deferred because it might cause a deadlock otherwise.
>  *
>  * It is highly recommended to use them only in a context with interrupts
>  * disabled. Otherwise, other unrelated printk() calls might be deferred
>  * when they interrupt/preempt the deferred code section.
>  *
>  * They should not be used to defer printing of many messages. It might
>  * create a softlockup when they are flushed later.
>  *
>  * IMPORTANT: Any new use of these MUST be consulted with printk maintainers.
>  * It might have unexpected side effects on the printk infrastructure.
>  */
> #ifdef CONFIG_PREEMPT_RT
>
> #define printk_deferred_enter()			\
> 	do {					\
> 		migrate_disable();		\
> 		__printk_safe_enter();		\
> 	} while (0)
>
> #define printk_deferred_exit()			\
> 	do {					\
> 		__printk_safe_exit();		\
> 		migrate_enable();		\
> 	} while (0)
>
> #else /* CONFIG_PREEMPT_RT */
>
> #define printk_deferred_enter()			\
> 	do {					\
> 		preempt_disable();		\
> 		__printk_safe_enter();		\
> 	} while (0)
>
> #define printk_deferred_exit()			\
> 	do {					\
> 		__printk_safe_exit();		\
> 		preempt_enable();		\
> 	} while (0)
>
> #endif /* CONFIG_PREEMPT_RT */
>
> Note that I have used preempt_disable() on non-RT because it is much
> cheaper. And IRQs will be disabled anyway on non-RT systems
> in this code path.

It would do, _but_ is it really needed? All users but one (the one in
__build_all_zonelists()) have interrupts already disabled. This isn't a
hot path and is not used very commonly. Does it justify the ifdef?

An unconditional migrate_disable() would do the job if you want it to be
safe, but I would rather suggest a lockdep annotation instead of
disabling either preemption or migration. If I read the comment on top
correctly, this API isn't open for the public anyway. A global variable
would probably be simpler, but I guess you want to have direct printing
on the other CPUs.

To be honest, I would suggest working towards always printing in a
thread, with an emergency console, rather than this deferred/safe duct
tape. Once tty_port::lock is acquired, every possible printk output, say
a BUG, KASAN, … is a possible death trap. This limitation here wasn't
obvious and syzbot had to figure it out. There might be other side
effects.

> > > Disabling IRQs before incrementing zonelist_update_seq is _required_ for both
> > >
> > >   making printk_deferred_enter() safe
> > >
> > > and
> > >
> > >   making sure that printk_deferred_enter() takes effect
> > >
> > > .
>
> > Did I explain why it is sufficient to do
> >
> > 	write_seqlock_irqsave()
> > 	printk_deferred_enter()
> >
> > assuming we have
> >
> > | static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
> > | {
> > | 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
> > | 	do_raw_write_seqcount_begin(s);
> > | }
>
> Will this prevent any printk() called on the same CPU before
> printk_deferred_enter() is called?

Yes. There is the _irqsave() which disables interrupts at the very
beginning. Then we have here (seqcount_acquire()) the last possible
origin of a lockdep splat. It ends with do_raw_write_seqcount_begin()
which increments the counter (the "locking"), and this is followed by
printk_deferred_enter().

> Best Regards,
> Petr

Sebastian
On Fri 23-06-23 10:12:50, Sebastian Andrzej Siewior wrote: [...] > It would do _but_ is it really needed? All users but one (the one in > __build_all_zonelists()) have interrupts already disabled. This isn't a > hot path and is not used very common. Does it justify the ifdef? This is not my call of course but let me just stress out that if all that is just because of __build_all_zonelists then it is likely not really worth it. We are talking about memory hoptlug here. I seriously doubt that this is something anybody would be using with RT expectations.
On 2023-06-22 22:36:27 [+0900], Tetsuo Handa wrote:
> On 2023/06/22 8:24, Tetsuo Handa wrote:
> > By the way, given
> >
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
> > 	<<IRQ>>
> > 	some_timer_function() {
> > 	  kmalloc(GFP_ATOMIC);
> > 	}
> > 	<</IRQ>>
> > 	printk_deferred_enter();
> >
> > scenario in CONFIG_PREEMPT_RT=y case is handled by executing some_timer_function()
> > on a dedicated kernel thread for IRQs, what guarantees that the kernel thread for
> > IRQs gives up the CPU and the user thread which called write_seqlock() gains the CPU
> > until write_sequnlock() is called? How can the kernel figure out that executing the
> > user thread needs higher priority than the kernel thread?
>
> I haven't got a response on this question.

I did explain in 20230621143421.BgHjJklo@linutronix.de that the printk
implementation is a different one on PREEMPT_RT. The reader _may_ boost
the writer if needed. Also for !PREEMPT_RT your only concern was a
lockdep splat within write_seqlock_irqsave() which I wanted to take
care of. Leaving that aside, I don't see any problem.

> Several years ago, I demonstrated that a SCHED_IDLE priority userspace thread holding
> oom_lock causes other concurrently allocating !SCHED_IDLE priority threads to
> misunderstand that mutex_trylock(&oom_lock) failure implies we are making forward
> progress (despite the SCHED_IDLE priority userspace thread being unable to wake up
> for minutes).

Repeated trylock without explicit forward progress is a general problem
on RT. We try to remove them where we find them.

> If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is preempted by
> some other !SCHED_IDLE priority threads (especially realtime priority threads), and
> such a !SCHED_IDLE priority thread calls kmalloc(GFP_ATOMIC) or printk(), a similar
> thing (misunderstanding that spinning on read_seqbegin() from zonelist_iter_begin()
> can make forward progress despite the thread which called write_seqlock_irqsave()
> being unable to make progress due to preemption) can happen.

It can't, because on PREEMPT_RT the read_seqbegin() _will_ block on the
lock that is held by write_seqlock_irqsave(). This ensures that the
writer will make progress and the reader does not loop several
iterations like on !PREEMPT_RT. This is PREEMPT_RT and happens
regardless of the priority of the tasks involved.

> Question to Sebastian:
> To make sure that such a thing cannot happen, we should make sure that
> a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount) from
> write_seqlock_irqsave(&zonelist_update_seq, flags) can continue using the CPU until
> write_seqcount_end(&zonelist_update_seq.seqcount) from
> write_seqlock_irqrestore(&zonelist_update_seq, flags).
> Does adding preempt_disable() before write_seqlock(&zonelist_update_seq, flags) help?

As explained before, this scenario does not happen and is already
accounted for by the underlying seqcount API. Adding preempt_disable()
to the mix makes things worse.

> Question to Petr:
> Even if local_irq_save(flags) disables IRQ, NMI context can enqueue a message via
> printk(). When does the message enqueued from NMI context get printed? Is there a
> possibility that the message enqueued from NMI context gets printed between
> "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
> "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
> If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...

There are no prints from NMI because you can't acquire a lock in NMI
context. All printk invocations are delayed.

Sebastian
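For reference, the redirection of NMI printk() to the deferred path can
be seen in kernel/printk/printk_safe.c; this is a condensed excerpt
(the kdb special case is elided):

	asmlinkage int vprintk(const char *fmt, va_list args)
	{
		/*
		 * Use the main logbuf even in NMI. But avoid calling console
		 * drivers that might have their own locks.
		 */
		if (this_cpu_read(printk_context) || in_nmi())
			return vprintk_deferred(fmt, args);

		/* No obstacles. */
		return vprintk_default(fmt, args);
	}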
On 2023-06-22 23:28:25 [+0900], Tetsuo Handa wrote:
> On 2023/06/22 23:11, Petr Mladek wrote:
> >> Question to Petr:
> >> Even if local_irq_save(flags) disables IRQ, NMI context can enqueue a message via
> >> printk(). When does the message enqueued from NMI context get printed?
> >
> > They are flushed to the console either by irq_work or by another
> > printk(). The irq_work could not be processed when IRQs are disabled.
>
> Is that rule the same for both CONFIG_PREEMPT_RT=y and CONFIG_PREEMPT_RT=n?
>
> If yes, when Sebastian's patch is applied, IRQs will not be disabled for
> CONFIG_PREEMPT_RT=y kernels because write_seqlock_irqsave() is equivalent
> to write_seqlock(), because spin_lock_irqsave(lock, flags) is equivalent
> to spin_lock(lock), and

So why do you accept that spin_lock_irqsave() does not disable
interrupts on PREEMPT_RT but not that printk() is different and has
dedicated printing threads?

Regarding printk:
- a dedicated printing thread is used.
- in emergency cases a different driver is used which does not rely on
  a lock held by the tty.

> > But another non-deferred printk() would try to flush them immediately.
> >
> >> Is there a possibility
> >> that the message enqueued from NMI context gets printed between
> >> "write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()" or
> >> "printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
> >
> > It might happen when a printk() is called in these holes.
>
> printk() can happen between these holes for CONFIG_PREEMPT_RT=y kernels.
> We will need to call printk_deferred_enter() before incrementing
> zonelist_update_seq.seqcount in order to close these holes.

The only hole is the possible lockdep splat which I'm going to change…

Sebastian
On 2023-06-23 00:43:00 [+0900], Tetsuo Handa wrote:
> On 2023/06/23 0:04, Petr Mladek wrote:
> > AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
> > The owner could schedule. The waiter could schedule as well so that
> > they could be running on the same CPU. Also the current owner gets
> > higher priority when there is a waiter with a higher priority to
> > avoid the priority inversion.
>
> Excuse me, but that is about multiple threads trying to acquire the same lock, isn't it?
>
> Our case is that one thread which makes zonelist_update_seq.seqcount odd acquires
> zonelist_update_seq.lock but threads spinning at
> read_seqbegin(zonelist_update_seq.seqcount) from zonelist_iter_begin() do nothing but
> a cpu_relax() busy loop. There is no way to teach these threads that they need to give
> the CPU to the thread which made zonelist_update_seq.seqcount odd...

For !RT there is no spinning because interrupts are disabled.
For RT there is no spinning because the reader blocks on the lock owned
by the writer.

> > The result would look like:
> >
> > in include/linux/printk.h:
> >
> > static inline void printk_deferred_enter(void)
> > {
> > 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > 		preempt_disable();
> > 	else
> > 		migrate_disable();
> >
> > 	__printk_safe_enter();
> > }
> >
> > in mm/page_alloc.c
> >
> > 	printk_deferred_enter();
> > 	write_seqlock_irqsave(&zonelist_update_seq, flags);
>
> OK. But for stable,
>
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		migrate_disable();
> 	/*
> 	 * Explicitly disable this CPU's interrupts before taking seqlock
> 	 * to prevent any IRQ handler from calling into the page allocator
> 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> 	 */
> 	local_irq_save(flags);
> 	/*
> 	 * Explicitly disable this CPU's synchronous printk() before taking
> 	 * seqlock to prevent any printk() from trying to hold port->lock, for
> 	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
> 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
> 	 */
> 	printk_deferred_enter();
> 	write_seqlock(&zonelist_update_seq);
>
> will be easier to apply.

I would prefer not to worry about stable-RT but about upstream, and
then we backport what is needed into the stable-RT trees once we have
settled on something. This does not affect !RT.

Sebastian
On 2023/06/23 18:45, Sebastian Andrzej Siewior wrote:
> On 2023-06-23 00:43:00 [+0900], Tetsuo Handa wrote:
>> On 2023/06/23 0:04, Petr Mladek wrote:
>>> AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
>>> The owner could schedule. The waiter could schedule as well so that
>>> they could be running on the same CPU. Also the current owner gets
>>> higher priority when there is a waiter with a higher priority to
>>> avoid the priority inversion.
>>
>> Excuse me, but that is about multiple threads trying to acquire the same lock, isn't it?
>>
>> Our case is that one thread which makes zonelist_update_seq.seqcount odd acquires
>> zonelist_update_seq.lock but threads spinning at
>> read_seqbegin(zonelist_update_seq.seqcount) from zonelist_iter_begin() do nothing but
>> a cpu_relax() busy loop. There is no way to teach these threads that they need to give
>> the CPU to the thread which made zonelist_update_seq.seqcount odd...
>
> For !RT there is no spinning because interrupts are disabled.
> For RT there is no spinning because the reader blocks on the lock owned
> by the writer.

My understanding is that zonelist_iter_begin() accesses only
zonelist_update_seq.seqcount. From where is zonelist_update_seq.lock
accessed when zonelist_iter_begin() is called?
On 2023-06-23 11:21:01 [+0200], Michal Hocko wrote:
> On Fri 23-06-23 10:12:50, Sebastian Andrzej Siewior wrote:
> [...]
> > It would do _but_ is it really needed? All users but one (the one in
> > __build_all_zonelists()) have interrupts already disabled. This isn't a
> > hot path and is not used very commonly. Does it justify the ifdef?
>
> This is not my call of course but let me just stress that if all that
> is just because of __build_all_zonelists then it is likely not really
> worth it. We are talking about memory hotplug here. I seriously doubt
> that this is something anybody would be using with RT expectations.

My current plan is to repost this with a better explanation and with
printk deferred within the locked region. But first I'm going to use
hotplug-mem to see how it works. If it can be enabled on RT then it
should work even if nobody is using it. Is someone objecting, and if
so, why?

Sebastian
On 2023-06-23 18:51:24 [+0900], Tetsuo Handa wrote:
> My understanding is that zonelist_iter_begin() accesses only
> zonelist_update_seq.seqcount. From where is zonelist_update_seq.lock
> accessed when zonelist_iter_begin() is called?

zonelist_iter_begin()
-> read_seqbegin()
   -> read_seqcount_begin()
      -> raw_read_seqcount_begin()
         -> __read_seqcount_begin()

Now. The inner part looks like

	while ((__seq = seqprop_sequence(s)) & 1)
		cpu_relax();

but with RT enabled (in-tree, not out-of-tree, in-tree) seqprop_sequence()
goes from:

| __seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s)	\
| {									\
| 	unsigned seq = READ_ONCE(s->seqcount.sequence);			\
| 									\
| 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
| 		return seq;						\
| 									\
| 	if (preemptible && unlikely(seq & 1)) {				\
| 		__SEQ_LOCK(lockbase##_lock(s->lock));			\
| 		__SEQ_LOCK(lockbase##_unlock(s->lock));			\
| 									\
| 		/*							\
| 		 * Re-read the sequence counter since the (possibly	\
| 		 * preempted) writer made progress.			\
| 		 */							\
| 		seq = READ_ONCE(s->seqcount.sequence);			\
| 	}								\
| 									\
| 	return seq;							\

to

| unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
| {
| 	unsigned seq = READ_ONCE(s->seqcount.sequence);
|
| 	if (unlikely(seq & 1)) {
| 		spin_lock(s->lock);
| 		spin_unlock(s->lock);
| 		seq = READ_ONCE(s->seqcount.sequence);
| 	}
| 	return seq;
| }

tada \o/

Sebastian
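For context, the reader side that ends up in this code follows the
usual seqlock retry pattern; a simplified sketch of how
zonelist_iter_begin()/check_retry_zonelist() use it (not verbatim
kernel code, the page allocator spreads these calls across its slow
path):

	unsigned int seq;

	do {
		/*
		 * On !RT this spins (cpu_relax()) while the sequence is
		 * odd; on RT it blocks on zonelist_update_seq.lock
		 * instead, see __seqprop_spinlock_sequence() above.
		 */
		seq = read_seqbegin(&zonelist_update_seq);

		/* ... walk the zonelists and attempt the allocation ... */

		/* retry if a writer bumped the sequence meanwhile */
	} while (read_seqretry(&zonelist_update_seq, seq));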
On 2023/06/23 19:11, Sebastian Andrzej Siewior wrote:
> | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
> | {
> | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
> |
> | 	if (unlikely(seq & 1)) {
> | 		spin_lock(s->lock);
> | 		spin_unlock(s->lock);
> | 		seq = READ_ONCE(s->seqcount.sequence);
> | 	}
> | 	return seq;
> | }

OK. I understood that read_seqbegin() implies spin_lock()/spin_unlock()
if RT. What a deep macro. Thank you for the explanation.

Well,

	/*
	 * Zonelists may change due to hotplug during allocation. Detect when zonelists
	 * have been rebuilt so allocation retries. Reader side does not lock and
	 * retries the allocation if zonelist changes. Writer side is protected by the
	 * embedded spin_lock.
	 */

is not accurate. Something like below?

	If !RT, reader side does not lock and retries the allocation if zonelist changes.
	If RT, reader side grabs and releases the embedded spin_lock in order to wait
	for zonelist change operations to complete.

Hmm, I feel worried that kmalloc(GFP_ATOMIC) from hard IRQ context
might sleep if RT...
On Fri 2023-06-23 10:12:50, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 17:38:03 [+0200], Petr Mladek wrote:
> > But wait. AFAIK, the printk kthread is implemented only for the serial
> > console. So, it won't be safe for other consoles, especially
> > the problematic tty_insert_flip_string_and_push_buffer() call.
>
> > OK, what about using migrate_disable() in printk_deferred_enter()?
> > Something like:
> >
> > /*
> >  * The printk_deferred_enter/exit macros are available only as a hack.
> >  * They define a per-CPU context where all printk console printing
> >  * is deferred because it might cause a deadlock otherwise.
> >  *
> >  * It is highly recommended to use them only in a context with interrupts
> >  * disabled. Otherwise, other unrelated printk() calls might be deferred
> >  * when they interrupt/preempt the deferred code section.
> >  *
> >  * They should not be used to defer printing of many messages. It might
> >  * create a softlockup when they are flushed later.
> >  *
> >  * IMPORTANT: Any new use of these MUST be consulted with printk maintainers.
> >  * It might have unexpected side effects on the printk infrastructure.
> >  */
> > #ifdef CONFIG_PREEMPT_RT
> >
> > #define printk_deferred_enter()			\
> > 	do {					\
> > 		migrate_disable();		\
> > 		__printk_safe_enter();		\
> > 	} while (0)
> >
> > #define printk_deferred_exit()			\
> > 	do {					\
> > 		__printk_safe_exit();		\
> > 		migrate_enable();		\
> > 	} while (0)
> >
> > #else /* CONFIG_PREEMPT_RT */
> >
> > #define printk_deferred_enter()			\
> > 	do {					\
> > 		preempt_disable();		\
> > 		__printk_safe_enter();		\
> > 	} while (0)
> >
> > #define printk_deferred_exit()			\
> > 	do {					\
> > 		__printk_safe_exit();		\
> > 		preempt_enable();		\
> > 	} while (0)
> >
> > #endif /* CONFIG_PREEMPT_RT */
> >
> > Note that I have used preempt_disable() on non-RT because it is much
> > cheaper. And IRQs will be disabled anyway on non-RT system
> > in this code path.
>
> It would do _but_ is it really needed? All users but one (the one in
> __build_all_zonelists()) have interrupts already disabled. This isn't a
> hot path and is not used very commonly. Does it justify the ifdef?
>
> An unconditional migrate_disable() would do the job if you want it to
> be safe

Makes sense. The ifdef does not look worth it.

> but I would rather suggest a lockdep annotation instead of disabling
> either preemption or migration. If I read the comment on top right, this
> API isn't open for the public.

It might work when mm people agree to explicitly call
migrate_disable() in __build_all_zonelists(). I mean to call there:

	migrate_disable();
	printk_deferred_enter();
	write_seqlock_irqsave(&zonelist_update_seq, flags);

Plus it would require some comments.

For me, it is acceptable to hide at least the migrate_disable() part
in printk_deferred_enter().

> The global variable would probably be simpler but I guess you want
> to have direct printing on other CPUs.

Yes, the scope should be as limited as possible.

> To be honest, I would suggest to work towards always printing in a
> thread with an emergency console than this deferred/safe duct tape.

I have used this argument for many years. But it has not been very
convincing after 10 years. Note that John sent the 1st RFC 4 years ago
and we still do not have the kthreads :-/

Best Regards,
Petr
On Fri 23-06-23 11:58:04, Sebastian Andrzej Siewior wrote:
> On 2023-06-23 11:21:01 [+0200], Michal Hocko wrote:
> > On Fri 23-06-23 10:12:50, Sebastian Andrzej Siewior wrote:
> > [...]
> > > It would do _but_ is it really needed? All users but one (the one in
> > > __build_all_zonelists()) have interrupts already disabled. This isn't a
> > > hot path and is not used very commonly. Does it justify the ifdef?
> >
> > This is not my call of course but let me just stress that if all that
> > is just because of __build_all_zonelists then it is likely not really
> > worth it. We are talking about memory hotplug here. I seriously doubt
> > that this is something anybody would be using with RT expectations.
>
> My current plan is to repost this with a better explanation and with
> printk deferred within the locked region. But first I'm going to use
> hotplug-mem to see how it works. If it can be enabled on RT then it
> should work even if nobody is using it.

Agreed! I merely wanted to point out that this needs a basic fix much
more than trying to overengineer it.
On 2023-06-23 11:58:05 [+0200], To Michal Hocko wrote:
> But first I'm going to use
> hotplug-mem to see how it works.

I boot qemu with "-m 4G,slots=4,maxmem=16G".

| ~# free -h
|                total        used        free      shared  buff/cache   available
| Mem:           3,8Gi       221Mi       3,6Gi       520Ki        80Mi       3,6Gi
| Swap:             0B          0B          0B

Telling qemu to add memory:

| object_add memory-backend-ram,id=mem1,size=4G
| device_add pc-dimm,id=dimm1,memdev=mem1

I see in Linux:

| ~# free -h
|                total        used        free      shared  buff/cache   available
| Mem:           7,8Gi       345Mi       7,5Gi       516Ki        80Mi       7,5Gi
| Swap:             0B          0B          0B

memory arrived. Nothing in dmesg, and my printk in
__build_all_zonelists() did not trigger either. Same after I removed
the memory with:

| device_del dimm1
| object_del mem1

it is gone:

| ~# free -h
|                total        used        free      shared  buff/cache   available
| Mem:           3,8Gi       232Mi       3,6Gi       516Ki        80Mi       3,6Gi
| Swap:             0B          0B          0B

nothing in dmesg. What did I do wrong not to enter
__build_all_zonelists()?

Sebastian
On 2023-06-23 12:45:31 [+0200], To Michal Hocko wrote:
>
> nothing in dmesg. What did I do wrong not to enter
> __build_all_zonelists()?

However, looking at mem_hotplug_begin(), it gets triggered from
acpi_hotplug_work_fn() which is completely fine. No lockdep complaints.
So it looks good.

Sebastian
On Fri 2023-06-23 12:11:11, Sebastian Andrzej Siewior wrote:
> On 2023-06-23 18:51:24 [+0900], Tetsuo Handa wrote:
> > My understanding is that zonelist_iter_begin() accesses only
> > zonelist_update_seq.seqcount. From where is zonelist_update_seq.lock
> > accessed when zonelist_iter_begin() is called?
>
> zonelist_iter_begin()
> -> read_seqbegin()
>    -> read_seqcount_begin()
>       -> raw_read_seqcount_begin()
>          -> __read_seqcount_begin()
>
> Now. The inner part looks like
>
> 	while ((__seq = seqprop_sequence(s)) & 1)
> 		cpu_relax();
>
> but with RT enabled (in-tree, not out-of-tree, in-tree) seqprop_sequence()
> goes from:

BTW: I see a possible race in the below code.

> | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
> | {
> | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
> |
> | 	if (unlikely(seq & 1)) {
> | 		spin_lock(s->lock);
> | 		spin_unlock(s->lock);

This guarantees the reader waits for the writer. But does this
guarantee that another writer could not bump the sequence
before we read it below?

> | 		seq = READ_ONCE(s->seqcount.sequence);

IMHO, it should be:

	if (unlikely(seq & 1)) {
		spin_lock(s->lock);
		seq = READ_ONCE(s->seqcount.sequence);
		spin_unlock(s->lock);

IMHO, only this would guarantee that the returned seq couldn't be odd.

Best Regards,
Petr

> | 	}
> | 	return seq;
> | }
On 2023/06/23 19:53, Petr Mladek wrote:
> BTW: I see a possible race in the below code.
>
>> | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
>> | {
>> | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
>> |
>> | 	if (unlikely(seq & 1)) {
>> | 		spin_lock(s->lock);
>> | 		spin_unlock(s->lock);
>
> This guarantees the reader waits for the writer. But does this
> guarantee that another writer could not bump the sequence
> before we read it below?
>
>> | 		seq = READ_ONCE(s->seqcount.sequence);
>
> IMHO, it should be:
>
> 	if (unlikely(seq & 1)) {
> 		spin_lock(s->lock);
> 		seq = READ_ONCE(s->seqcount.sequence);
> 		spin_unlock(s->lock);
>
> IMHO, only this would guarantee that the returned seq couldn't be odd.

What is wrong with bumping the sequence again? If the returned seq is
odd, the caller just retries as needed. For example,
__read_seqcount_begin() will retry at

	while ((__seq = seqprop_sequence(s)) & 1)
		cpu_relax();

.

Lack of a data_race() annotation resulting in KCSAN messages might be
an unexpected source of printk(). :-)
On Fri 23-06-23 12:45:30, Sebastian Andrzej Siewior wrote:
[...]
> nothing in dmesg. What did I do wrong not to enter
> __build_all_zonelists()?

You would need to online memory into an unpopulated node/zone. E.g. by
booting a NUMA system with 2 numa nodes, one of them empty, and
onlining into that node.
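One possible qemu invocation for such a setup (an untested sketch; the
option spelling follows qemu's -numa/-object documentation, and
mem1/dimm1 reuse the names from the earlier test):

	qemu-system-x86_64 -m 4G,slots=4,maxmem=16G -smp 4 \
		-object memory-backend-ram,id=m0,size=4G \
		-numa node,nodeid=0,memdev=m0,cpus=0-3 \
		-numa node,nodeid=1

Then hot-add into the empty node from the monitor:

	(qemu) object_add memory-backend-ram,id=mem1,size=4G
	(qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=1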
On 2023-06-23 19:36:55 [+0900], Tetsuo Handa wrote:
> 	/*
> 	 * Zonelists may change due to hotplug during allocation. Detect when zonelists
> 	 * have been rebuilt so allocation retries. Reader side does not lock and
> 	 * retries the allocation if zonelist changes. Writer side is protected by the
> 	 * embedded spin_lock.
> 	 */
>
> is not accurate. Something like below?
>
> 	If !RT, reader side does not lock and retries the allocation if zonelist changes.
> 	If RT, reader side grabs and releases the embedded spin_lock in order to wait
> 	for zonelist change operations to complete.

I wouldn't sprinkle it around the code. It is implementation specific
for PREEMPT_RT and documentation-wise it would belong in
Documentation/locking/seqlock.rst. I don't think the extra knowledge
benefits the code. If this piece of information is needed I would
suggest including it in the original documentation for seqlock.

> Hmm, I feel worried that kmalloc(GFP_ATOMIC) from hard IRQ context
> might sleep if RT...

Relax and don't worry. PREEMPT_RT does not allow (or have, afaik)
memory allocations from hard IRQ context (or preempt-disable sections).

Sebastian
On Fri 23-06-23 14:44:16, Sebastian Andrzej Siewior wrote:
[...]
> > Hmm, I feel worried that kmalloc(GFP_ATOMIC) from hard IRQ context
> > might sleep if RT...
>
> Relax and don't worry. PREEMPT_RT does not allow (or have, afaik)
> memory allocations from hard IRQ context (or preempt-disable sections).

Yeah, that would require something like GFP_LOCKLESS or changing some
of the locking we have to raw spinlocks.
On 2023-06-23 12:40:24 [+0200], Petr Mladek wrote:
> > but I would rather suggest a lockdep annotation instead of disabling
> > either preemption or migration. If I read the comment on top right, this
> > API isn't open for the public.
>
> It might work when mm people agree to explicitly call
> migrate_disable() in __build_all_zonelists(). I mean to call there:
>
> 	migrate_disable();
> 	printk_deferred_enter();
> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
>
> Plus it would require some comments.

Right, but I wanted to move it after write_seqlock_irqsave().

> > To be honest, I would suggest to work towards always printing in a
> > thread with an emergency console than this deferred/safe duct tape.
>
> I have used this argument for many years. But it has not been very
> convincing after 10 years. Note that John sent the 1st RFC 4 years ago
> and we still do not have the kthreads :-/

Yes, it makes me very sad. It feels longer than that. We have had it in
RT ever since…

> Best Regards,
> Petr

Sebastian
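Side by side, the two orderings under discussion (both are sketches of
proposals from this thread, not merged code):

	/* Petr: pin the task before entering the deferred section. */
	migrate_disable();
	printk_deferred_enter();
	write_seqlock_irqsave(&zonelist_update_seq, flags);
	/* ... rebuild the zonelists ... */
	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
	printk_deferred_exit();
	migrate_enable();

	/*
	 * Sebastian: take the seqlock first. After write_seqlock_irqsave()
	 * the task cannot change CPUs anyway (IRQs are off on !RT; the
	 * acquired lock disables migration on RT), so no extra
	 * migrate_disable() is needed.
	 */
	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();
	/* ... rebuild the zonelists ... */
	printk_deferred_exit();
	write_sequnlock_irqrestore(&zonelist_update_seq, flags);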
On 2023-06-23 12:53:57 [+0200], Petr Mladek wrote:
> BTW: I see a possible race in the below code.
>
> > | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
> > | {
> > | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
> > |
> > | 	if (unlikely(seq & 1)) {
> > | 		spin_lock(s->lock);
> > | 		spin_unlock(s->lock);
>
> This guarantees the reader waits for the writer. But does this
> guarantee that another writer could not bump the sequence
> before we read it below?

A second writer can bump the seq after the first is done, correct.

> > | 		seq = READ_ONCE(s->seqcount.sequence);
>
> IMHO, it should be:
>
> 	if (unlikely(seq & 1)) {
> 		spin_lock(s->lock);
> 		seq = READ_ONCE(s->seqcount.sequence);
> 		spin_unlock(s->lock);
>
> IMHO, only this would guarantee that the returned seq couldn't be odd.

This doesn't buy us anything. If the writer increments the counter
after the reader unlocks, then the reader observes the odd seq number
and blocks immediately (within read_seqbegin()) on the lock without
spending cycles within the read section (outside of read_seqbegin())
only to learn about the seq increment in read_seqretry().

> Best Regards,
> Petr

Sebastian
On Fri 2023-06-23 15:31:27, Sebastian Andrzej Siewior wrote:
> On 2023-06-23 12:53:57 [+0200], Petr Mladek wrote:
> > BTW: I see a possible race in the below code.
> >
> > > | unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
> > > | {
> > > | 	unsigned seq = READ_ONCE(s->seqcount.sequence);
> > > |
> > > | 	if (unlikely(seq & 1)) {
> > > | 		spin_lock(s->lock);
> > > | 		spin_unlock(s->lock);
> >
> > This guarantees the reader waits for the writer. But does this
> > guarantee that another writer could not bump the sequence
> > before we read it below?
>
> A second writer can bump the seq after the first is done, correct.
>
> > > | 		seq = READ_ONCE(s->seqcount.sequence);
> >
> > IMHO, it should be:
> >
> > 	if (unlikely(seq & 1)) {
> > 		spin_lock(s->lock);
> > 		seq = READ_ONCE(s->seqcount.sequence);
> > 		spin_unlock(s->lock);
> >
> > IMHO, only this would guarantee that the returned seq couldn't be odd.
>
> This doesn't buy us anything.

Sure?

> If the writer increments the counter
> after the reader unlocks, then the reader observes the odd seq number
> and blocks immediately (within read_seqbegin()) on the lock without
> spending cycles within the read section (outside of read_seqbegin())
> only to learn about the seq increment in read_seqretry().

I am afraid that it will not get caught:

	static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start)
	{
		kcsan_atomic_next(0);
		return unlikely(READ_ONCE(s->sequence) != start);
	}

It checks if the sequence is the same. It does not check if it is odd.
But it might be odd if you read "start" outside of the lock and do not
check if it is odd.

I do not claim to be an expert but code like:

	spin_lock(s->lock);
	spin_unlock(s->lock);
	s = locked_variable;

just triggers many bells in my head. It is like something from a book
of "Classic locking mistakes".

With the code:

	spin_lock(s->lock);
	seq = s->seqcount.sequence;
	spin_unlock(s->lock);

you will always get the right value without any extra cost. The code
took the lock anyway. And the single assignment is negligible.
And RT could schedule inside the spin_lock anyway.

Or do I miss anything?

Best Regards,
Petr
On 2023-06-23 17:38:45 [+0200], Petr Mladek wrote:
> > If the writer increments the counter
> > after the reader unlocks, then the reader observes the odd seq number
> > and blocks immediately (within read_seqbegin()) on the lock without
> > spending cycles within the read section (outside of read_seqbegin())
> > only to learn about the seq increment in read_seqretry().
>
> I am afraid that it will not get caught:
>
> 	static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start)
> 	{
> 		kcsan_atomic_next(0);
> 		return unlikely(READ_ONCE(s->sequence) != start);
> 	}
>
> It checks if the sequence is the same. It does not check if it is odd.

You do realise that __read_seqcount_begin() is

| 	while ((__seq = seqprop_sequence(s)) & 1)	\
| 		cpu_relax();				\

so the code you are trying to modify is within seqprop_sequence(), and
you _do_ loop until the resulting __seq is even. Only then do you
continue towards do___read_seqcount_retry().

> But it might be odd if you read "start" outside of the lock and do not
> check if it is odd.

No, see above.

> I do not claim to be an expert but code like:
>
> 	spin_lock(s->lock);
> 	spin_unlock(s->lock);
> 	s = locked_variable;
>
> just triggers many bells in my head. It is like something from a book
> of "Classic locking mistakes".

It is a counter, not a locked variable, and it is designed to be read
outside of the lock. The lock here is not used for "protecting"
anything in the usual locking sense but for making progress.

> With the code:
>
> 	spin_lock(s->lock);
> 	seq = s->seqcount.sequence;
> 	spin_unlock(s->lock);
>
> you will always get the right value without any extra cost. The code
> took the lock anyway. And the single assignment is negligible.
> And RT could schedule inside the spin_lock anyway.
>
> Or do I miss anything?

Yes. See above. If, after the unlock, the read sequence is odd again we
do notice it. Doing it from within the locked section, we don't.

> Best Regards,
> Petr

Sebastian
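Putting the thread's conclusion in one place, the RT reader path with
the reasoning as comments (the code is as quoted above; only the
annotations are added here for illustration):

	unsigned __seqprop_spinlock_sequence(const seqcount_spinlock_t *s)
	{
		unsigned seq = READ_ONCE(s->seqcount.sequence);

		if (unlikely(seq & 1)) {
			/*
			 * A writer is in progress: block on its lock. On
			 * PREEMPT_RT this sleeps and PI-boosts the writer
			 * instead of spinning, so the writer is guaranteed
			 * to make progress.
			 */
			spin_lock(s->lock);
			spin_unlock(s->lock);

			/*
			 * Re-read outside the lock. If a second writer has
			 * already made the sequence odd again, the caller's
			 *   while ((__seq = seqprop_sequence(s)) & 1)
			 *       cpu_relax();
			 * loop in __read_seqcount_begin() calls back in here
			 * and we block on the new writer's lock; an odd
			 * value never leaks past read_seqbegin().
			 */
			seq = READ_ONCE(s->seqcount.sequence);
		}
		return seq;
	}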