
[RT,v2,0/3] riscv: add PREEMPT_RT support

Message ID: 20231031143521.441-1-jszhang@kernel.org

Message

Jisheng Zhang Oct. 31, 2023, 2:35 p.m. UTC
This series adds PREEMPT_RT support to riscv. Compared with last
year's attempt[1], there are two major changes:

1. riscv has been converted to Generic Entry, and the RT maintainers
have introduced PREEMPT_AUTO, so this series supports it for riscv as
well.

2. the three preparation patches (patches 1/2/3 in [1]) have been
merged into mainline.

Link: https://lore.kernel.org/linux-riscv/20220831175920.2806-1-jszhang@kernel.org/ [1]

Hi @Conor,

IIRC, you reported LOCKDEP (and DEBUG_ATOMIC_SLEEP?) warnings with the
previous version of the RT patch, but I tried on c906 and c910
platforms and could not reproduce them. Schaffner also tried and did
not see the warnings either. So could you please help try
v6.6-RT + this series again?

Thanks

Since v1:
 - rebase on v6.6-rt
 - support the new PREEMPT_AUTO instead of Lazy preempt

Jisheng Zhang (2):
  riscv: add PREEMPT_AUTO support
  riscv: allow to enable RT

 arch/riscv/Kconfig                   | 2 ++
 arch/riscv/include/asm/thread_info.h | 2 ++
 2 files changed, 4 insertions(+)
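For reference, the sketch below shows roughly what such PREEMPT_AUTO
wiring tends to look like. The Kconfig symbols, TIF bit value, and hunk
placement here are assumptions for illustration only, not copied from
the posted patches:

```diff
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ config RISCV
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
+	select HAVE_PREEMPT_AUTO
+	select ARCH_SUPPORTS_RT
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@
 #define TIF_NEED_RESCHED	3
+#define TIF_NEED_RESCHED_LAZY	4
+#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
```

The lazy TIF bit lets the scheduler request a reschedule that is only
honoured on return to user space, which is how PREEMPT_AUTO avoids
preempting throughput-sensitive kernel paths on RT.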

Comments

Sebastian Andrzej Siewior Oct. 31, 2023, 3:32 p.m. UTC | #1
On 2023-10-31 22:35:19 [+0800], Jisheng Zhang wrote:
> This series adds PREEMPT_RT support to riscv. Compared with last
> year's attempt[1], there are two major changes:
> 
> 1. riscv has been converted to Generic Entry, and the RT maintainers
> have introduced PREEMPT_AUTO, so this series supports it for riscv as
> well.
> 
> 2. the three preparation patches (patches 1/2/3 in [1]) have been
> merged into mainline.

There is no third patch in the series, right? If so, then the count in
the shortlog is consistent with what I received.

> Jisheng Zhang (2):
>   riscv: add PREEMPT_AUTO support
>   riscv: allow to enable RT
> 
>  arch/riscv/Kconfig                   | 2 ++
>  arch/riscv/include/asm/thread_info.h | 2 ++
>  2 files changed, 4 insertions(+)

Sebastian
Jisheng Zhang Oct. 31, 2023, 3:49 p.m. UTC | #2
On Tue, Oct 31, 2023 at 04:32:42PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-10-31 22:35:19 [+0800], Jisheng Zhang wrote:
> > This series adds PREEMPT_RT support to riscv. Compared with last
> > year's attempt[1], there are two major changes:
> > 
> > 1. riscv has been converted to Generic Entry, and the RT maintainers
> > have introduced PREEMPT_AUTO, so this series supports it for riscv
> > as well.
> > 
> > 2. the three preparation patches (patches 1/2/3 in [1]) have been
> > merged into mainline.
> 
> There is no third patch in the series, right? If so, then the count in
> the shortlog is consistent with what I received.

Yes, there's no third patch. I didn't use the correct number in patch
0's subject.

> 
> > Jisheng Zhang (2):
> >   riscv: add PREEMPT_AUTO support
> >   riscv: allow to enable RT
> > 
> >  arch/riscv/Kconfig                   | 2 ++
> >  arch/riscv/include/asm/thread_info.h | 2 ++
> >  2 files changed, 4 insertions(+)
> 
> Sebastian
Sebastian Andrzej Siewior Oct. 31, 2023, 4:44 p.m. UTC | #3
On 2023-10-31 23:49:29 [+0800], Jisheng Zhang wrote:
> Yes there's no third patch. I didn't use the correct number in patch0's
> subject.

So it looks fine. The warning was CPU-hotplug related
	https://lore.kernel.org/all/0abd0acf-70a1-d546-a517-19efe60042d1@microchip.com/

and it looks to be gone as of commit
	5944ce092b97c ("arch_topology: Build cacheinfo from primary CPU")

so that's good. Any double-checking is welcome of course ;)
JUMP_LABELs don't use stop_cpu. Check.
The timer is PERCPU. Check.
Can't find perf events. But the commit for threaded interrupts claims to
have them per-CPU. 
Has HAVE_POSIX_CPU_TIMERS_TASK_WORK with generic kvm. Check.

die() and die_lock. It looks like die_lock is acquired when the system
is done and requires medical assistance. This would qualify it for a
raw_spinlock_t. Also, should any of the bad things happen in a section
with disabled preemption or interrupts then a spinlock_t can not be
acquired. Unless die() is always invoked in a preemptible context…
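Concretely, the conversion being suggested would look roughly like the
sketch below. This is against my reading of the mainline
arch/riscv/kernel/traps.c of the time, not a tested patch:

```diff
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@
-static DEFINE_SPINLOCK(die_lock);
+static DEFINE_RAW_SPINLOCK(die_lock);
 
 void die(struct pt_regs *regs, const char *str)
 {
 	...
-	spin_lock_irqsave(&die_lock, flags);
+	raw_spin_lock_irqsave(&die_lock, flags);
 	...
-	spin_unlock_irqrestore(&die_lock, flags);
+	raw_spin_unlock_irqrestore(&die_lock, flags);
 }
```

A raw_spinlock_t remains a true spinning lock on PREEMPT_RT, so it can
be taken with preemption or interrupts disabled, which is exactly the
state the system may be in when die() runs.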

The other things are covered by the generic code. I think I didn't miss
anything…
I am going to have a new release by the end of the week at the latest
with these bits. Please look after the die_lock.

Sebastian
Jisheng Zhang Nov. 1, 2023, 11:41 a.m. UTC | #4
On Tue, Oct 31, 2023 at 05:44:11PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-10-31 23:49:29 [+0800], Jisheng Zhang wrote:
> > Yes there's no third patch. I didn't use the correct number in patch0's
> > subject.
> 
> So it looks fine. The warning was CPU-hotplug related
> 	https://lore.kernel.org/all/0abd0acf-70a1-d546-a517-19efe60042d1@microchip.com/
> 
> and it looks to be gone as of commit
> 	5944ce092b97c ("arch_topology: Build cacheinfo from primary CPU")
> 
> so that's good. Any double-checking is welcome of course ;)
> JUMP_LABELs don't use stop_cpu. Check.
> The timer is PERCPU. Check.
> Can't find perf events. But the commit for threaded interrupts claims to
> have them per-CPU. 
> Has HAVE_POSIX_CPU_TIMERS_TASK_WORK with generic kvm. Check.
> 
> die() and die_lock. It looks like die_lock is acquired when the system
> is done and requires medical assistance. This would qualify it for a
> raw_spinlock_t. Also, should any of the bad things happen in a section
> with disabled preemption or interrupts then a spinlock_t can not be
> acquired. Unless die() is always invoked in a preemptible context…
> 
> The other things are covered by the generic code. I think I didn't miss
> anything…
> I am going to have a new release by the end of the week at the latest
> with these bits. Please look after the die_lock.

Thank you so much, I will check.

Hi @Conor,

If you help to try this series, could you please also apply Evan's
misaligned access probe patch? Refer to [1] for details. NOTE: this is
not related to the RT patches, because I can reproduce the bug with
plain v6.6.

Link: https://lore.kernel.org/linux-riscv/ZUI3JKff9SgsA3Z%2F@xhacker/ [1]
> 
> Sebastian
Conor Dooley Nov. 2, 2023, 12:31 p.m. UTC | #5
On Wed, Nov 01, 2023 at 07:41:59PM +0800, Jisheng Zhang wrote:
> On Tue, Oct 31, 2023 at 05:44:11PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2023-10-31 23:49:29 [+0800], Jisheng Zhang wrote:
> > > Yes there's no third patch. I didn't use the correct number in patch0's
> > > subject.
> > 
> > So it looks fine. The warning was CPU-hotplug related
> > 	https://lore.kernel.org/all/0abd0acf-70a1-d546-a517-19efe60042d1@microchip.com/
> > 
> > and it looks to be gone as of commit
> > 	5944ce092b97c ("arch_topology: Build cacheinfo from primary CPU")
> > 
> > so that's good. Any double-checking is welcome of course ;)
> > JUMP_LABELs don't use stop_cpu. Check.
> > The timer is PERCPU. Check.
> > Can't find perf events. But the commit for threaded interrupts claims to
> > have them per-CPU. 
> > Has HAVE_POSIX_CPU_TIMERS_TASK_WORK with generic kvm. Check.
> > 
> > die() and die_lock. It looks like die_lock is acquired when the system
> > is done and requires medical assistance. This would qualify it for a
> > raw_spinlock_t. Also, should any of the bad things happen in a section
> > with disabled preemption or interrupts then a spinlock_t can not be
> > acquired. Unless die() is always invoked in a preemptible context…
> > 
> > The other things are covered by the generic code. I think I didn't miss
> > anything…
> > I am going to have a new release by the end of the week at the
> > latest with these bits. Please look after the die_lock.
> 
> Thank you so much, I will check.
> 
> Hi @Conor,
> 
> If you help to try this series, could you please also apply Evan's
> misaligned access probe patch? Refer to [1] for details. NOTE: this is
> not related to the RT patches, because I can reproduce the bug with
> plain v6.6.

Hmm, so I gave it a go with and without Evan's patch, but I am seeing
various issues with locking. For example, here's one with Evan's patch:

Starting kernel ...

[    0.000000] Linux version 6.6.0-rc6-rt10-00003-gd649dd498753 (conor@wendy) (ClangBuiltLinux clang version 16.0.2 (/
home/conor/stuff/dev/llvm/clang 18ddebe1a1a9bde349441631365f0472e9693520), ClangBuiltLinux LLD 16.0.2) #1 SMP PREEMPT_
RT @666
[    0.000000] Machine model: Microchip PolarFire-SoC Icicle Kit
[    0.000000] SBI specification v1.0 detected
[    0.000000] SBI implementation ID=0x8 Version=0x10002
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] earlycon: ns16550a0 at MMIO32 0x0000000020100000 (options '115200n8')
[    0.000000] printk: legacy bootconsole [ns16550a0] enabled
[    0.000000] printk: debug: skip boot console de-registration.
[    0.000000] efi: UEFI not found.
[    0.000000] OF: reserved mem: 0x00000000bfc00000..0x00000000bfffffff (4096 KiB) nomap non-reusable region@BFC00000
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000080000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x000000107fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x00000000bfbfffff]
[    0.000000]   node   0: [mem 0x00000000bfc00000-0x00000000bfffffff]
[    0.000000]   node   0: [mem 0x0000001040000000-0x000000107fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x000000107fffffff]
[    0.000000] SBI HSM extension detected
[    0.000000] CPU with hartid=0 is not available
[    0.000000] Falling back to deprecated "riscv,isa"
[    0.000000] riscv: base ISA extensions acdfim
[    0.000000] riscv: ELF capabilities acdfim
[    0.000000] percpu: Embedded 31 pages/cpu s88096 r8192 d30688 u126976
[    0.000000] Kernel command line: earlycon keep_bootcon riscv_isa_fallback
[    0.000000] Unknown kernel command line parameters "riscv_isa_fallback", will be passed to user space.
[    0.000000] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[    0.000000] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 517120
[    0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
[    0.000000] software IO TLB: area num 4.
[    0.000000] software IO TLB: mapped [mem 0x00000000bbc00000-0x00000000bfc00000] (64MB)
[    0.000000] Virtual kernel memory layout:
[    0.000000]       fixmap : 0xffffffc6fea00000 - 0xffffffc6ff000000   (6144 kB)
[    0.000000]       pci io : 0xffffffc6ff000000 - 0xffffffc700000000   (  16 MB)
[    0.000000]      vmemmap : 0xffffffc700000000 - 0xffffffc800000000   (4096 MB)
[    0.000000]      vmalloc : 0xffffffc800000000 - 0xffffffd800000000   (  64 GB)
[    0.000000]      modules : 0xffffffff02a26000 - 0xffffffff80000000   (2005 MB)
[    0.000000]       lowmem : 0xffffffd800000000 - 0xffffffe800000000   (  64 GB)
[    0.000000]       kernel : 0xffffffff80000000 - 0xffffffffffffffff   (2047 MB)
[    0.000000] Memory: 1914732K/2097152K available (12175K kernel code, 7264K rwdata, 6144K rodata, 6211K init, 11249K bss, 182420K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] trace event string verifier disabled
[    0.000000] Running RCU self tests
[    0.000000] Running RCU synchronous self tests
[    0.000000] rcu: Preemptible hierarchical RCU implementation.
[    0.000000] rcu:     RCU lockdep checking is enabled.
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=4.
[    0.000000] rcu:     RCU priority boosting: priority 1 delay 500 ms.
[    0.000000] rcu:     RCU_SOFTIRQ processing moved to rcuc kthreads.
[    0.000000] rcu:     RCU debug extended QS entry/exit.
[    0.000000]  No expedited grace period (rcu_normal_after_boot).
[    0.000000]  Trampoline variant of Tasks RCU enabled.
[    0.000000]  Tracing variant of Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[    0.000000] Running RCU synchronous self tests
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: unable to find hart id for /cpus/cpu@0/interrupt-controller
[    0.000000] riscv-intc: 64 local interrupts mapped
[    0.000000] plic: interrupt-controller@c000000: mapped 186 interrupts with 4 handlers for 9 contexts.
[    0.000000] riscv: providing IPIs using SBI IPI extension
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x1d854df40, max_idle_ns: 3526361616960 ns
[    0.000003] sched_clock: 64 bits at 1000kHz, resolution 1000ns, wraps every 2199023255500ns
[    0.003776] Console: colour dummy device 80x25
[    0.003879] printk: legacy console [tty0] enabled
[    0.004090] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[    0.004107] ... MAX_LOCKDEP_SUBCLASSES:  8
[    0.004121] ... MAX_LOCK_DEPTH:          48
[    0.004135] ... MAX_LOCKDEP_KEYS:        8192
[    0.004148] ... CLASSHASH_SIZE:          4096
[    0.004161] ... MAX_LOCKDEP_ENTRIES:     32768
[    0.004175] ... MAX_LOCKDEP_CHAINS:      65536
[    0.004188] ... CHAINHASH_SIZE:          32768
[    0.004200]  memory used by lock dependency info: 6493 kB
[    0.004215]  memory used for stack traces: 4224 kB
[    0.004228]  per task-struct memory footprint: 1920 bytes
[    0.005187] Calibrating delay loop (skipped), value calculated using timer frequency.. 2.00 BogoMIPS (lpj=4000)
[    0.005236] pid_max: default: 32768 minimum: 301
[    0.009319] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.009520] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.030982] Running RCU synchronous self tests
[    0.031032] Running RCU synchronous self tests
[    0.036925] CPU node for /cpus/cpu@0 exist but the possible cpu range is :0-3
[    0.055142] RCU Tasks: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
[    0.056482] RCU Tasks Trace: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
[    0.058104] Running RCU-tasks wait API self tests
[    0.064370] riscv: ELF compat mode unsupported
[    0.064417] ASID allocator disabled (0 bits)
[    0.071600] Callback from call_rcu_tasks_trace() invoked.
[    0.072487] rcu: Hierarchical SRCU implementation.
[    0.072504] rcu:     Max phase no-delay instances is 1000.
[    0.753395] EFI services will not be available.
[    0.765374] smp: Bringing up secondary CPUs ...
[    0.826425] smp: Brought up 1 node, 4 CPUs
[    0.859509] devtmpfs: initialized
[    1.033554] Running RCU synchronous self tests
[    1.033791] Running RCU synchronous self tests
[    1.044502] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    1.044965] futex hash table entries: 1024 (order: 5, 196608 bytes, linear)
[    1.049859] pinctrl core: initialized pinctrl subsystem
[    1.076426] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    1.093390] Callback from call_rcu_tasks() invoked.
[    1.099141] DMA: preallocated 256 KiB GFP_KERNEL pool for atomic allocations
[    1.099644] DMA: preallocated 256 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
[    1.123261] cpuidle: using governor menu
[    1.130384] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[    1.130418]
[    1.130414] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
[    1.130431] ================================
[    1.130440] WARNING: inconsistent lock state
[    1.130439] preempt_count: 10001, expected: 0
[    1.130454] RCU nest depth: 1, expected: 1
[    1.130452] 6.6.0-rc6-rt10-00003-gd649dd498753 #1 Not tainted
[    1.130468] INFO: lockdep is turned off.
[    1.130470] --------------------------------
[    1.130477] irq event stamp: 1026
[    1.130479] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[    1.130496] swapper/3/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
[    1.130487] hardirqs last  enabled at (1025): [<ffffffff80bd5308>] do_irq+0x7e/0xa6
[    1.130525] ffffffff81c42d38 (
[    1.130533] hardirqs last disabled at (1026): [<ffffffff80bd52a0>] do_irq+0x16/0xa6
[    1.130550] &zone->lock){?.+.}-{2:2}
[    1.130562] softirqs last  enabled at (0): [<ffffffff8001543c>] copy_process+0x548/0xdec
[    1.130577] , at: __rmqueue_pcplist+0x12a/0xbc0
[    1.130611] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    1.130626] {HARDIRQ-ON-W} state was registered at:
[    1.130638] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.6.0-rc6-rt10-00003-gd649dd498753 #1
[    1.130639]   __lock_acquire+0x80a/0xd52
[    1.130668] Hardware name: Microchip PolarFire-SoC Icicle Kit (DT)
[    1.130684] Call Trace:
[    1.130681]   lock_acquire+0x116/0x2b0
[    1.130697] [<ffffffff80006fca>] show_stack+0x2c/0x3c
[    1.130709]   rt_spin_lock+0x26/0xdc
[    1.130747]   __rmqueue_pcplist+0x12a/0xbc0
[    1.130740] [<ffffffff80bd4c38>] dump_stack_lvl+0x5e/0x84
[    1.130777]   get_page_from_freelist+0x37a/0x143e
[    1.130787] [<ffffffff80bd4c72>] dump_stack+0x14/0x1c
[    1.130806]   __alloc_pages+0xac/0x1be
[    1.130822] [<ffffffff8005ba84>] __might_resched+0x1b4/0x1c2
[    1.130835]   alloc_slab_page+0x1c/0xc0
[    1.130863]   new_slab+0x94/0x336
[    1.130861] [<ffffffff80bdfdd2>] rt_spin_lock+0x42/0xdc
[    1.130886]   ___slab_alloc+0x830/0xcda
[    1.130896] [<ffffffff80267236>] __rmqueue_pcplist+0x12a/0xbc0
[    1.130909]   __kmem_cache_alloc_node+0xc8/0x1c4
[    1.130934] [<ffffffff80268204>] get_page_from_freelist+0x37a/0x143e
[    1.130934]   kmalloc_trace+0x22/0x48
[    1.130971] [<ffffffff80267d78>] __alloc_pages+0xac/0x1be
[    1.130978]   alloc_workqueue+0x96/0x6de
[    1.131005] [<ffffffff8000412a>] check_unaligned_access+0x34/0x336
[    1.131010]   kmem_cache_init_late+0x1c/0x36
[    1.131038] [<ffffffff80004660>] check_unaligned_access_nonboot_cpu+0x12/0x1a
[    1.131046]   start_kernel+0x204/0x7e6
[    1.131082] irq event stamp: 822
[    1.131071] [<ffffffff800fcba0>] __flush_smp_call_function_queue+0x1de/0x790
[    1.131093] hardirqs last  enabled at (821): [<ffffffff80bd75b0>] default_idle_call+0xfa/0x152
[    1.131111] [<ffffffff800fd3a0>] generic_smp_call_function_single_interrupt+0xe/0x1a
[    1.131124] hardirqs last disabled at (822): [<ffffffff80bd52a0>] do_irq+0x16/0xa6
[    1.131146] [<ffffffff80009edc>] handle_IPI+0x54/0x13c
[    1.131153] softirqs last  enabled at (0): [<ffffffff8001543c>] copy_process+0x548/0xdec
[    1.131190] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    1.131184] [<ffffffff800a65e6>] handle_percpu_devid_irq+0xc8/0x1c0
[    1.131211]
[    1.131211] other info that might help us debug this:
[    1.131220]  Possible unsafe locking scenario:
[    1.131220]
[    1.131228]        CPU0
[    1.131235]        ----
[    1.131229] [<ffffffff8009fd96>] generic_handle_domain_irq+0x2c/0x40
[    1.131241]   lock(&zone->lock);
[    1.131266]   <Interrupt>
[    1.131272]     lock(
[    1.131261] [<ffffffff800ae39c>] ipi_mux_process+0xb0/0x102
[    1.131279] &zone->lock);
[    1.131294]
[    1.131294]  *** DEADLOCK ***
[    1.131294]
[    1.131302] 2 locks held by swapper/3/0:
[    1.131298] [<ffffffff8000c0ec>] sbi_ipi_handle+0x50/0x7c
[    1.131320]  #0: ffffffe7fefcf958
[    1.131328] [<ffffffff8009fd96>] generic_handle_domain_irq+0x2c/0x40
[    1.131341]  (&pcp->lock){+.+.}-{2:2}
[    1.131358] [<ffffffff805ce5dc>] riscv_intc_irq+0x28/0x44
[    1.131370] , at: get_page_from_freelist+0x338/0x143e
[    1.131402]  #1:
[    1.131394] [<ffffffff80bd5370>] handle_riscv_irq+0x40/0x64
[    1.131411] ffffffff81aa6960 (rcu_read_lock
[    1.131423] [<ffffffff80bd52ea>] do_irq+0x60/0xa6
[    1.131437] ){....}-{1:2}, at: rcu_lock_acquire+0x0/0x32
[    1.131478]
[    1.131478] stack backtrace:
[    1.131488] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G        W          6.6.0-rc6-rt10-00003-gd649dd498753 #1
[    1.131518] Hardware name: Microchip PolarFire-SoC Icicle Kit (DT)
[    1.131530] Call Trace:
[    1.131542] [<ffffffff80006fca>] show_stack+0x2c/0x3c
[    1.131576] [<ffffffff80bd4c38>] dump_stack_lvl+0x5e/0x84
[    1.131614] [<ffffffff80bd4c72>] dump_stack+0x14/0x1c
[    1.131648] [<ffffffff8008c66a>] print_usage_bug+0x430/0x4d2
[    1.131686] [<ffffffff8008bf6e>] mark_lock_irq+0x504/0x61c
[    1.131720] [<ffffffff8008b64e>] mark_lock+0x12e/0x194
[    1.131753] [<ffffffff80088d3a>] __lock_acquire+0x5d2/0xd52
[    1.131786] [<ffffffff800884fa>] lock_acquire+0x116/0x2b0
[    1.131819] [<ffffffff80bdfdb6>] rt_spin_lock+0x26/0xdc
[    1.131852] [<ffffffff80267236>] __rmqueue_pcplist+0x12a/0xbc0
[    1.131887] [<ffffffff80268204>] get_page_from_freelist+0x37a/0x143e
[    1.131922] [<ffffffff80267d78>] __alloc_pages+0xac/0x1be
[    1.131955] [<ffffffff8000412a>] check_unaligned_access+0x34/0x336
[    1.131986] [<ffffffff80004660>] check_unaligned_access_nonboot_cpu+0x12/0x1a
[    1.132018] [<ffffffff800fcba0>] __flush_smp_call_function_queue+0x1de/0x790
[    1.132052] [<ffffffff800fd3a0>] generic_smp_call_function_single_interrupt+0xe/0x1a
[    1.132085] [<ffffffff80009edc>] handle_IPI+0x54/0x13c
[    1.132119] [<ffffffff800a65e6>] handle_percpu_devid_irq+0xc8/0x1c0
[    1.132154] [<ffffffff8009fd96>] generic_handle_domain_irq+0x2c/0x40
[    1.132184] [<ffffffff800ae39c>] ipi_mux_process+0xb0/0x102
[    1.132216] [<ffffffff8000c0ec>] sbi_ipi_handle+0x50/0x7c
[    1.132244] [<ffffffff8009fd96>] generic_handle_domain_irq+0x2c/0x40
[    1.132273] [<ffffffff805ce5dc>] riscv_intc_irq+0x28/0x44
[    1.132303] [<ffffffff80bd5370>] handle_riscv_irq+0x40/0x64
[    1.132331] [<ffffffff80bd52ea>] do_irq+0x60/0xa6
[    1.156457] cpu1: Ratio of byte access time to unaligned word access is 0.01, unaligned accesses are slow
[    1.156464] cpu3: Ratio of byte access time to unaligned word access is 0.01, unaligned accesses are slow
[    1.156465] cpu2: Ratio of byte access time to unaligned word access is 0.01, unaligned accesses are slow
[    1.184777] cpu0: Ratio of byte access time to unaligned word access is 0.01, unaligned accesses are slow


The exact tree/.config are here:
https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux.git/log/?h=linux-6.6.y-rt

Cheers,
Conor.
Sebastian Andrzej Siewior Nov. 2, 2023, 3:54 p.m. UTC | #6
On 2023-11-02 12:31:15 [+0000], Conor Dooley wrote:
> [    1.130384] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> [    1.130418]
> [    1.130414] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> [    1.130971] [<ffffffff80267d78>] __alloc_pages+0xac/0x1be
> [    1.130978]   alloc_workqueue+0x96/0x6de
> [    1.131005] [<ffffffff8000412a>] check_unaligned_access+0x34/0x336
> [    1.131010]   kmem_cache_init_late+0x1c/0x36
> [    1.131038] [<ffffffff80004660>] check_unaligned_access_nonboot_cpu+0x12/0x1a
> [    1.131046]   start_kernel+0x204/0x7e6
> [    1.131082] irq event stamp: 822
> [    1.131071] [<ffffffff800fcba0>] __flush_smp_call_function_queue+0x1de/0x790
> [    1.131093] hardirqs last  enabled at (821): [<ffffffff80bd75b0>] default_idle_call+0xfa/0x152
> [    1.131111] [<ffffffff800fd3a0>] generic_smp_call_function_single_interrupt+0xe/0x1a
> [    1.131124] hardirqs last disabled at (822): [<ffffffff80bd52a0>] do_irq+0x16/0xa6

Without the patch, check_unaligned_access_boot_cpu() ->
check_unaligned_access() is okay. The SMP/IPI variant (the patch, imho)
is bad: there must be no memory allocation from an IRQ-off region - the
SMP-function call in this case.

Would it be okay if I delay it until this is solved? If you have
something, then I could apply the two patches plus the fixup for this.

Sebastian
Jisheng Zhang Nov. 2, 2023, 5:03 p.m. UTC | #7
On Thu, Nov 02, 2023 at 04:54:23PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-11-02 12:31:15 [+0000], Conor Dooley wrote:
> > [    1.130384] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> > [    1.130418]
> > [    1.130414] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> > [    1.130971] [<ffffffff80267d78>] __alloc_pages+0xac/0x1be
> > [    1.130978]   alloc_workqueue+0x96/0x6de
> > [    1.131005] [<ffffffff8000412a>] check_unaligned_access+0x34/0x336
> > [    1.131010]   kmem_cache_init_late+0x1c/0x36
> > [    1.131038] [<ffffffff80004660>] check_unaligned_access_nonboot_cpu+0x12/0x1a
> > [    1.131046]   start_kernel+0x204/0x7e6
> > [    1.131082] irq event stamp: 822
> > [    1.131071] [<ffffffff800fcba0>] __flush_smp_call_function_queue+0x1de/0x790
> > [    1.131093] hardirqs last  enabled at (821): [<ffffffff80bd75b0>] default_idle_call+0xfa/0x152
> > [    1.131111] [<ffffffff800fd3a0>] generic_smp_call_function_single_interrupt+0xe/0x1a
> > [    1.131124] hardirqs last disabled at (822): [<ffffffff80bd52a0>] do_irq+0x16/0xa6
> 
> Without the patch, check_unaligned_access_boot_cpu() ->
> check_unaligned_access() is okay. The SMP/IPI variant (the patch,
> imho) is bad: there must be no memory allocation from an IRQ-off
> region - the SMP-function call in this case.
> 
> Would it be okay if I delay it until this is solved? If you have
> something, then I could apply the two patches plus the fixup for this.
> 

Hi Sebastian,

you can delay merging the series. I will check it.

Thanks
Palmer Dabbelt Nov. 2, 2023, 9:37 p.m. UTC | #8
On Thu, 02 Nov 2023 08:54:23 PDT (-0700), bigeasy@linutronix.de wrote:
> On 2023-11-02 12:31:15 [+0000], Conor Dooley wrote:
>> [    1.130384] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
>> [    1.130418]
>> [    1.130414] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
>> [    1.130971] [<ffffffff80267d78>] __alloc_pages+0xac/0x1be
>> [    1.130978]   alloc_workqueue+0x96/0x6de
>> [    1.131005] [<ffffffff8000412a>] check_unaligned_access+0x34/0x336
>> [    1.131010]   kmem_cache_init_late+0x1c/0x36
>> [    1.131038] [<ffffffff80004660>] check_unaligned_access_nonboot_cpu+0x12/0x1a
>> [    1.131046]   start_kernel+0x204/0x7e6
>> [    1.131082] irq event stamp: 822
>> [    1.131071] [<ffffffff800fcba0>] __flush_smp_call_function_queue+0x1de/0x790
>> [    1.131093] hardirqs last  enabled at (821): [<ffffffff80bd75b0>] default_idle_call+0xfa/0x152
>> [    1.131111] [<ffffffff800fd3a0>] generic_smp_call_function_single_interrupt+0xe/0x1a
>> [    1.131124] hardirqs last disabled at (822): [<ffffffff80bd52a0>] do_irq+0x16/0xa6
>
> Without the patch, check_unaligned_access_boot_cpu() ->
> check_unaligned_access() is okay. The SMP/IPI variant (the patch,
> imho) is bad: there must be no memory allocation from an IRQ-off
> region - the

Ya, sorry I didn't catch it.  I'm just going to drop Evan's patch; it
had missed the batch I was going to send tomorrow anyway.

> SMP-function call in this case.
>
> Would it be okay if I delay it until this solved? If you have something,
> then I could apply the two patches plus the fixup for this.
>
> Sebastian
>
Conor Dooley Nov. 3, 2023, 8:14 a.m. UTC | #9
On Thu, Nov 02, 2023 at 04:54:23PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-11-02 12:31:15 [+0000], Conor Dooley wrote:
> > [    1.130384] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> > [    1.130418]
> > [    1.130414] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> > [    1.130971] [<ffffffff80267d78>] __alloc_pages+0xac/0x1be
> > [    1.130978]   alloc_workqueue+0x96/0x6de
> > [    1.131005] [<ffffffff8000412a>] check_unaligned_access+0x34/0x336
> > [    1.131010]   kmem_cache_init_late+0x1c/0x36
> > [    1.131038] [<ffffffff80004660>] check_unaligned_access_nonboot_cpu+0x12/0x1a
> > [    1.131046]   start_kernel+0x204/0x7e6
> > [    1.131082] irq event stamp: 822
> > [    1.131071] [<ffffffff800fcba0>] __flush_smp_call_function_queue+0x1de/0x790
> > [    1.131093] hardirqs last  enabled at (821): [<ffffffff80bd75b0>] default_idle_call+0xfa/0x152
> > [    1.131111] [<ffffffff800fd3a0>] generic_smp_call_function_single_interrupt+0xe/0x1a
> > [    1.131124] hardirqs last disabled at (822): [<ffffffff80bd52a0>] do_irq+0x16/0xa6
> 
> Without the patch, check_unaligned_access_boot_cpu() ->
> check_unaligned_access() is okay. The SMP/IPI variant (the patch,
> imho) is bad: there must be no memory allocation from an IRQ-off
> region - the SMP-function call in this case.
> 
> Would it be okay if I delay it until this is solved? If you have
> something, then I could apply the two patches plus the fixup for this.

I don't think the problems are restricted to a tree containing Evan's
patch for parallel checks. For example:

[    0.057684] rcu: Hierarchical SRCU implementation.
[    0.057708] rcu:     Max phase no-delay instances is 1000.
[    0.100085] EFI services will not be available.
[    0.102978] smp: Bringing up secondary CPUs ...
[    0.112532] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[    0.112580] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
[    0.112597] preempt_count: 1, expected: 0
[    0.112617] RCU nest depth: 1, expected: 1
[    0.112660] 3 locks held by swapper/1/0:
[    0.112679]  #0: ff60000ffef8b958 (&pcp->lock){+.+.}-{2:2}, at: get_page_from_freelist+0x338/0x143e
[    0.113123]  #1: ffffffff81aa6920 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire+0x0/0x32
[    0.113187]  #2: ffffffff81c42cf8 (&zone->lock){+.+.}-{2:2}, at: __rmqueue_pcplist+0x12a/0xbc0
[    0.113268] irq event stamp: 0
[    0.113281] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[    0.113349] hardirqs last disabled at (0): [<ffffffff8001536a>] copy_process+0x53a/0xdec
[    0.113368] softirqs last  enabled at (0): [<ffffffff80015378>] copy_process+0x548/0xdec
[    0.113384] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    0.113497] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.6.0-rc6-rt10-00111-g399ad8863993-dirty #1
[    0.113570] Hardware name: Microchip PolarFire-SoC Icicle Kit (DT)
[    0.113641] Call Trace:
[    0.113705] [<ffffffff80006f76>] show_stack+0x2c/0x3c
[    0.113723] [<ffffffff80bd4ba4>] dump_stack_lvl+0x5e/0x84
[    0.113738] [<ffffffff80bd4bde>] dump_stack+0x14/0x1c
[    0.113751] [<ffffffff8005b9c0>] __might_resched+0x1b4/0x1c2
[    0.113764] [<ffffffff80bdfd3e>] rt_spin_lock+0x42/0xdc
[    0.113779] [<ffffffff8026717a>] __rmqueue_pcplist+0x12a/0xbc0
[    0.113792] [<ffffffff80268148>] get_page_from_freelist+0x37a/0x143e
[    0.113805] [<ffffffff80267cbc>] __alloc_pages+0xac/0x1be
[    0.113816] [<ffffffff80004128>] check_unaligned_access+0x32/0x332
[    0.113827] [<ffffffff80009b16>] smp_callin+0x42/0x6c
[    0.138406] cpu1: Ratio of byte access time to unaligned word access is 4.00, unaligned accesses are fast
[    0.180249] cpu2: Ratio of byte access time to unaligned word access is 4.50, unaligned accesses are fast
[    0.213500] cpu3: Ratio of byte access time to unaligned word access is 4.00, unaligned accesses are fast
[    0.214860] smp: Brought up 1 node, 4 CPUs

(this is qemu, hence the fast unaligned access, despite OpenSBI
emulating them)
Sebastian Andrzej Siewior Nov. 3, 2023, 8:17 a.m. UTC | #10
On 2023-11-02 14:37:05 [-0700], Palmer Dabbelt wrote:
> On Thu, 02 Nov 2023 08:54:23 PDT (-0700), bigeasy@linutronix.de wrote:
> > On 2023-11-02 12:31:15 [+0000], Conor Dooley wrote:
> > > [    1.130384] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> > > [    1.130414] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> > > [    1.130971] [<ffffffff80267d78>] __alloc_pages+0xac/0x1be
> > > [    1.131005] [<ffffffff8000412a>] check_unaligned_access+0x34/0x336
> > > [    1.131038] [<ffffffff80004660>] check_unaligned_access_nonboot_cpu+0x12/0x1a
> > > [    1.131071] [<ffffffff800fcba0>] __flush_smp_call_function_queue+0x1de/0x790
> > > [    1.131111] [<ffffffff800fd3a0>] generic_smp_call_function_single_interrupt+0xe/0x1a
> > > [    1.131082] irq event stamp: 822
> > > [    1.131093] hardirqs last  enabled at (821): [<ffffffff80bd75b0>] default_idle_call+0xfa/0x152
> > > [    1.131124] hardirqs last disabled at (822): [<ffffffff80bd52a0>] do_irq+0x16/0xa6
> > >   alloc_workqueue+0x96/0x6de
> > >   kmem_cache_init_late+0x1c/0x36
> > >   start_kernel+0x204/0x7e6
> > 
> > Without the patch check_unaligned_access_boot_cpu() ->
> > check_unaligned_access() is okay. The SMP/IPI variant (the patch imho)
> > is bad. There must be no memory allocation from IRQ-off region - the
> 
> Ya, sorry I didn't catch it.  I'm just going to drop Evan's patch, it'd
> missed the batch I was going to send tomorrow anyway.

The bring-up of the secondary CPUs still has the memory allocation in the
IRQ-off path; only the primary CPU is fine.

Sebastian
Sebastian Andrzej Siewior Nov. 3, 2023, 8:24 a.m. UTC | #11
On 2023-11-03 08:14:01 [+0000], Conor Dooley wrote:
> I don't think the problems are restricted to a tree containing Evan's
> patch for parallel checks. For example:

I saw check_unaligned_access() being invoked from arch_initcall()/
check_unaligned_access_boot_cpu(), which is fine. The other invocation,
from smp_callin() (secondary bring-up), is not. Evan's patch makes
them all happen from an SMP function call, which always runs with
interrupts disabled.

Sebastian
Evan Green Nov. 3, 2023, 5:19 p.m. UTC | #12
On Fri, Nov 3, 2023 at 1:24 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-11-03 08:14:01 [+0000], Conor Dooley wrote:
> > I don't think the problems are restricted to a tree containing Evan's
> > patch for parallel checks. For example:
>
> I saw check_unaligned_access() being invoked from arch_initcall()/
> check_unaligned_access_boot_cpu(), which is fine. The other invocation,
> from smp_callin() (secondary bring-up), is not. Evan's patch makes
> them all happen from an SMP function call, which always runs with
> interrupts disabled.

Hi Sebastian,
Could you elaborate a little on the rule violation here? The
documentation for GFP_NOWAIT says:

> If the allocation is performed from an atomic context, e.g interrupt
> handler, use ``GFP_NOWAIT``.

which seems like it basically fits my situation. If I do a search for
GFP_NOWAIT to hunt for instances of allocating with interrupts
disabled, I see at least the following examples (more available upon
request):

[kzalloc] arch/sparc/kernel/adi_64.c: alloc_tag_store()
[kzalloc] block/blk-cgroup.c: blkg_create()
[idr_alloc] drivers/accel/drm_accel.c: accel_minor_alloc()
[kzalloc] drivers/block/loop.c: loop_queue_work()
[idr_alloc] drivers/block/ublk_drv.c: ublk_alloc_dev_number()

Finding the documentation that states this is illegal might help me
understand what I should be doing instead. For example, I'm fuzzy on
why something that's disallowed when interrupts are disabled would be
OK in smp_callin().

One option is to do all the allocations in
check_unaligned_access_all_cpus(), and pass them in, but until I can
find the rules it's hard to verify that's a correct fix. It's also a
little clunky, and wouldn't apply to the hotplug path, so I want to
make sure it's necessary and sufficient before embarking on that
journey.

-Evan

>
> Sebastian
Sebastian Andrzej Siewior Nov. 3, 2023, 5:39 p.m. UTC | #13
On 2023-11-03 10:19:49 [-0700], Evan Green wrote:
> Hi Sebastian,

Hi Evan,

> Could you elaborate a little on the rule violation here? The
> documentation for GFP_NOWAIT says:
> 
> > If the allocation is performed from an atomic context, e.g interrupt
> > handler, use ``GFP_NOWAIT``.
> 
> which seems like it basically fits my situation. If I do a search for
> GFP_NOWAIT to hunt for instances of allocating with interrupts
> disabled, I see at least the following examples (more available upon
> request):

We are talking about PREEMPT_RT here. Early CPU bring-up and SMP function
calls happen in hardirq context with interrupts disabled. Always.
The sequence
	spin_lock_irq();
	kmalloc(, GFP_ATOMIC);

is fine because on PREEMPT_RT spin_lock_irq() does not really disable
interrupts and the spin_lock_t becomes a sleeping lock. See
	https://docs.kernel.org/locking/locktypes.html

The problem is that sleeping locks must not be acquired in a context
where sleeping is not possible. And kmalloc() may need to acquire
sleeping locks. Therefore no memory allocations in IRQ-off sections.
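
To make the distinction concrete, here is a kernel-style sketch
(illustrative only, not code from this series; some_lock and obj are
placeholders):

```c
/* Fine on PREEMPT_RT: spin_lock_irq() on a spinlock_t is a sleeping
 * rtmutex and does not really disable interrupts, so it is harmless
 * if the allocator has to block here. */
spin_lock_irq(&some_lock);
obj = kmalloc(sizeof(*obj), GFP_ATOMIC);
spin_unlock_irq(&some_lock);

/* Not fine on PREEMPT_RT: a genuine IRQ-off region (hardirq context,
 * raw spinlocks, early SMP bring-up). kmalloc() may need to take a
 * sleeping lock, which produces exactly the "sleeping function called
 * from invalid context" splat quoted earlier in the thread. */
local_irq_disable();
obj = kmalloc(sizeof(*obj), GFP_NOWAIT);	/* may sleep on RT -> BUG */
local_irq_enable();
```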

> Finding the documentation that states this is illegal might help me
> understand what I should be doing instead. For example, I'm fuzzy on
> why something that's disallowed when interrupts are disabled would be
> OK in smp_callin().

smp_callin() is in the SMP bring-up path, so it is not okay. Having a
kworker do it would be okay, for instance; this however requires a fully
set-up scheduler. Having the memory allocated upfront and passed to
every CPU would work fine, too.

> One option is to do all the allocations in
> check_unaligned_access_all_cpus(), and pass them in, but until I can
> find the rules it's hard to verify that's a correct fix. It's also a
> little clunky, and wouldn't apply to the hotplug path, so I want to
> make sure it's necessary and sufficient before embarking on that
> journey.

I don't know what it does and how often it needs to be done. The
question is would it be okay to do once on system boot or does it need
to be done each time the CPU goes up? Is it required to be the first
thing it does or can it be delayed to a slightly later point in time?

> -Evan

Sebastian
Evan Green Nov. 3, 2023, 6:22 p.m. UTC | #14
On Fri, Nov 3, 2023 at 10:39 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-11-03 10:19:49 [-0700], Evan Green wrote:
> > Hi Sebastian,
>
> Hi Evan,
>
> > Could you elaborate a little on the rule violation here? The
> > documentation for GFP_NOWAIT says:
> >
> > > If the allocation is performed from an atomic context, e.g interrupt
> > > handler, use ``GFP_NOWAIT``.
> >
> > which seems like it basically fits my situation. If I do a search for
> > GFP_NOWAIT to hunt for instances of allocating with interrupts
> > disabled, I see at least the following examples (more available upon
> > request):
>
> We are talking about PREEMPT_RT here. Early CPU bring-up and SMP function
> calls happen in hardirq context with interrupts disabled. Always.
> The sequence
>         spin_lock_irq();
>         kmalloc(, GFP_ATOMIC);
>
> is fine because on PREEMPT_RT spin_lock_irq() does not really disable
> interrupts and the spin_lock_t becomes a sleeping lock. See
>         https://docs.kernel.org/locking/locktypes.html
>

Thanks, that page helps. PREEMPT_RT is wild :)

> The problem is that sleeping locks must not be acquired in a context
> where sleeping is not possible. And kmalloc() may need to acquire
> sleeping locks. Therefore no memory allocations in IRQ-off sections.
>
> > Finding the documentation that states this is illegal might help me
> > understand what I should be doing instead. For example, I'm fuzzy on
> > why something that's disallowed when interrupts are disabled would be
> > OK in smp_callin().
>
> smp_callin() is in the SMP bring-up path, so it is not okay. Having a
> kworker do it would be okay, for instance; this however requires a fully
> set-up scheduler. Having the memory allocated upfront and passed to
> every CPU would work fine, too.
>
> > One option is to do all the allocations in
> > check_unaligned_access_all_cpus(), and pass them in, but until I can
> > find the rules it's hard to verify that's a correct fix. It's also a
> > little clunky, and wouldn't apply to the hotplug path, so I want to
> > make sure it's necessary and sufficient before embarking on that
> > journey.
>
> I don't know what it does and how often it needs to be done. The
> question is would it be okay to do once on system boot or does it need
> to be done each time the CPU goes up? Is it required to be the first
> thing it does or can it be delayed to a slightly later point in time?

It's essentially a characterization we do on each CPU to figure out if
misaligned word accesses are fast or slow. This info is reported to
usermode, and potentially has kernel consumers as well (Charlie's ip
checksumming series wants to flip a static branch based on it).

In a sense I barely need the buffer, the buffer is just a runway I use
to do dummy memcpys across. I need to do it once on boot for each CPU,
and for any additional CPUs that show up via hotplug. It does not need
to be done across suspend/resume.

The patch in question was an attempt to move this work to be done in
parallel, since on systems with lots of CPUs, doing it serially in
smp_callin() was causing boot time regressions.

For boot, I think the plan I outlined should work: allocate all pages
in the initcall before calling on_each_cpu(). For hotplug, I'll have
to find a spot to allocate a buffer before the CPU comes up, so it can
use it in smp_callin().
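
That allocate-upfront shape could look roughly like this kernel-style
sketch (hypothetical names and structure, not the actual patch):

```c
/* One probe page per CPU, allocated where sleeping is allowed. */
static struct page *probe_pages[NR_CPUS];

static int __init check_unaligned_access_all_cpus(void)
{
	int cpu;

	/* Initcall context: interrupts are on, GFP_KERNEL is fine. */
	for_each_online_cpu(cpu)
		probe_pages[cpu] = alloc_pages(GFP_KERNEL, 0);

	/* The callback runs with IRQs off on every CPU, so it only
	 * touches its pre-allocated page and allocates nothing. */
	on_each_cpu(check_unaligned_access, NULL, 1);
	return 0;
}
arch_initcall(check_unaligned_access_all_cpus);
```

For hotplug, the analogous spot would be a cpuhp "prepare" callback,
which runs in sleepable context before the incoming CPU reaches
smp_callin().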

Palmer's dropped this patch for now, so I'll plan to spin with those
changes unless anyone has thoughts on this approach.
-Evan
Sebastian Andrzej Siewior Nov. 6, 2023, 8:53 a.m. UTC | #15
On 2023-11-03 11:22:40 [-0700], Evan Green wrote:
> > I don't know what it does and how often it needs to be done. The
> > question is would it be okay to do once on system boot or does it need
> > to be done each time the CPU goes up? Is it required to be the first
> > thing it does or can it be delayed to a slightly later point in time?
> 
> It's essentially a characterization we do on each CPU to figure out if
> misaligned word accesses are fast or slow. This info is reported to
> usermode, and potentially has kernel consumers as well (Charlie's ip
> checksumming series wants to flip a static branch based on it).

So you need to do this on each CPU? And there may be differences between
CPU0 and CPU1? On x86, for instance, all CPUs within one socket behave the
same, and I'm not aware that mixing different CPUs across sockets within a
system is supported.

> In a sense I barely need the buffer, the buffer is just a runway I use
> to do dummy memcpys across. I need to do it once on boot for each CPU,
> and for any additional CPUs that show up via hotplug. It does not need
> to be done across suspend/resume.
> 
> The patch in question was an attempt to move this work to be done in
> parallel, since on systems with lots of CPUs, doing it serially in
> smp_callin() was causing boot time regressions.
> 
> For boot, I think the plan I outlined should work: allocate all pages
> in the initcall before calling on_each_cpu(). For hotplug, I'll have
> to find a spot to allocate a buffer before the CPU comes up, so it can
> use it in smp_callin().

My question still remains: does this need to be done on each CPU in the
system, or would one CPU, the boot CPU, be sufficient? The static branch
you mentioned is not per-CPU; it is per system. If you hook
this check into CPU hotplug, then it will also happen, for instance, during
suspend/resume (or if triggered manually), and the CPU "feature" will
likely not change.

As naive as I am, I would assume that doing this on the boot CPU should
be enough.

> -Evan

Sebastian
Evan Green Nov. 6, 2023, 4:25 p.m. UTC | #16
On Mon, Nov 6, 2023 at 12:53 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-11-03 11:22:40 [-0700], Evan Green wrote:
> > > I don't know what it does and how often it needs to be done. The
> > > question is would it be okay to do once on system boot or does it need
> > > to be done each time the CPU goes up? Is it required to be the first
> > > thing it does or can it be delayed to a slightly later point in time?
> >
> > It's essentially a characterization we do on each CPU to figure out if
> > misaligned word accesses are fast or slow. This info is reported to
> > usermode, and potentially has kernel consumers as well (Charlie's ip
> > checksumming series wants to flip a static branch based on it).
>
> So you need to do this on each CPU? And there may be differences between
> CPU0 and CPU1? On x86, for instance, all CPUs within one socket behave the
> same, and I'm not aware that mixing different CPUs across sockets within a
> system is supported.
>
> > In a sense I barely need the buffer, the buffer is just a runway I use
> > to do dummy memcpys across. I need to do it once on boot for each CPU,
> > and for any additional CPUs that show up via hotplug. It does not need
> > to be done across suspend/resume.
> >
> > The patch in question was an attempt to move this work to be done in
> > parallel, since on systems with lots of CPUs, doing it serially in
> > smp_callin() was causing boot time regressions.
> >
> > For boot, I think the plan I outlined should work: allocate all pages
> > in the initcall before calling on_each_cpu(). For hotplug, I'll have
> > to find a spot to allocate a buffer before the CPU comes up, so it can
> > use it in smp_callin().
>
> My question still remains: does this need to be done on each CPU in the
> system, or would one CPU, the boot CPU, be sufficient? The static branch
> you mentioned is not per-CPU; it is per system. If you hook
> this check into CPU hotplug, then it will also happen, for instance, during
> suspend/resume (or if triggered manually), and the CPU "feature" will
> likely not change.
>
> As naive as I am, I would assume that doing this on the boot CPU should
> be enough.

I don't think it's safe to assume all cores will be the same. With the
proliferation of big.LITTLE-esque designs (even on Intel, I think),
and the likelihood of even greater variety on an architecture like
RISC-V, I think it makes sense to do the measurement on each core. You
are correct that in the ip checksum static branch, we can only take
the path optimized for misaligned loads if all cores support it.
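
A consumer like that could, as a sketch, flip the static branch only
when every CPU probed fast (hypothetical names, not Charlie's actual
series):

```c
static DEFINE_STATIC_KEY_FALSE(fast_misaligned_access);

static void update_misaligned_access_branch(void)
{
	int cpu;

	/* The fast path is per system, so enable it only if every
	 * online CPU measured fast misaligned accesses. */
	for_each_online_cpu(cpu)
		if (!per_cpu(cpu_misaligned_fast, cpu))
			return;

	static_branch_enable(&fast_misaligned_access);
}
```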

The goal is to measure all CPUs in parallel at boot, and then measure
any new hotplugged CPUs as they arrive. You're right that across
suspend/resume or other times when we don't expect the CPU to change
out from under us, we should avoid re-measuring. My parallel patch
aims for that, though there could still be issues.
-Evan
Palmer Dabbelt Nov. 6, 2023, 9:58 p.m. UTC | #17
On Tue, 31 Oct 2023 07:35:19 PDT (-0700), jszhang@kernel.org wrote:
> This series is to add PREEMPT_RT support to riscv. Compared with last
> year's try[1], there are two major changes:
>
> 1. riscv has been converted to Generic Entry. And the RT maintainers have
> introduced PREEMPT_AUTO, so this series supports it for riscv as well.
>
> 2. three preparation patches (patch 1/2/3 in [1]) have been merged in
> mainline.
>
> Link: https://lore.kernel.org/linux-riscv/20220831175920.2806-1-jszhang@kernel.org/ [1]
>
> Hi @Conor,
>
> IIRC, you reported warnings with LOCKDEP (and DEBUG_ATOMIC_SLEEP?) with
> a previous version of the RT patch, but I tried on c906 and c910
> platforms and can't reproduce the warnings. And Schaffner also tried
> them and didn't reproduce the warnings either. So could you please help
> try v6.6-RT + this series again?
>
> Thanks
>
> Since v1:
>  - rebase on v6.6-rt
>  - support the new PREEMPT_AUTO instead of Lazy preempt
>
> Jisheng Zhang (2):
>   riscv: add PREEMPT_AUTO support
>   riscv: allow to enable RT
>
>  arch/riscv/Kconfig                   | 2 ++
>  arch/riscv/include/asm/thread_info.h | 2 ++
>  2 files changed, 4 insertions(+)

Acked-by: Palmer Dabbelt <palmer@rivosinc.com>

In case you guys wanted to take this.  I'm also happy doing a shared tag 
or something, no rush on my end.
Sebastian Andrzej Siewior Nov. 10, 2023, 3:37 p.m. UTC | #18
On 2023-11-06 13:58:56 [-0800], Palmer Dabbelt wrote:
> In case you guys wanted to take this.  I'm also happy doing a shared tag or
> something, no rush on my end.

I just released v6.6-rt14 with this included. I backported that patch
from Evan Green and hope for the best ;)

#1 from the series can't go upstream until the lazy-bits are merged.
#2 from the series can't go upstream until PREEMPT_RT can be enabled
   globally.

I'm going to keep them for now. 

Sebastian