Message ID | BN0P110MB21487F77F8E578780A3FE44490DFA@BN0P110MB2148.NAMP110.PROD.OUTLOOK.COM (mailing list archive)
---|---
State | New
Series | Xen panic when shutting down ARINC653 cpupool
On 17.03.25 06:07, Choi, Anderson wrote:
> I'd like to report a Xen panic when shutting down an ARINC653 domain
> with the following setup. Note that this is only observed when
> CONFIG_DEBUG is enabled.
>
> [Test environment]
> Yocto release : 5.05
> Xen release : 4.19 (hash = 026c9fa29716b0ff0f8b7c687908e71ba29cf239)
> Target machine : QEMU ARM64
> Number of physical CPUs : 4
>
> [Xen config]
> CONFIG_DEBUG = y
>
> [CPU pool configuration file]
> cpupool_arinc0.cfg
> - name = "Pool-arinc0"
> - sched = "arinc653"
> - cpus = ["2"]
>
> [Domain configuration file]
> dom1.cfg
> - vcpus = 1
> - pool = "Pool-arinc0"
>
> $ xl cpupool-cpu-remove Pool-0 2
> $ xl cpupool-create -f cpupool_arinc0.cfg
> $ xl create dom1.cfg
> $ a653_sched -P Pool-arinc0 dom1:100
>
> ** Wait for DOM1 to complete boot. **
>
> $ xl shutdown dom1
>
> [Xen log]
> root@boeing-linux-ref:~# xl shutdown dom1
> Shutting down domain 1
> root@boeing-linux-ref:~# (XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:714
> (XEN) ----[ Xen-4.19.1-pre  arm64  debug=y  Tainted: I ]----
> (XEN) CPU:   2
> (XEN) PC:    00000a000022d2b0 xfree+0x130/0x1a4
> (XEN) LR:    00000a000022d2a4
> (XEN) SP:    00008000fff77b50
> (XEN) CPSR:  00000000200002c9 MODE:64-bit EL2h (Hypervisor, handler)
> ...
> (XEN) Xen call trace:
> (XEN)    [<00000a000022d2b0>] xfree+0x130/0x1a4 (PC)
> (XEN)    [<00000a000022d2a4>] xfree+0x124/0x1a4 (LR)
> (XEN)    [<00000a00002321f0>] arinc653.c#a653sched_free_udata+0x50/0xc4
> (XEN)    [<00000a0000241bc0>] core.c#sched_move_domain_cleanup+0x5c/0x80
> (XEN)    [<00000a0000245328>] sched_move_domain+0x69c/0x70c
> (XEN)    [<00000a000022f840>] cpupool.c#cpupool_move_domain_locked+0x38/0x70
> (XEN)    [<00000a0000230f20>] cpupool_move_domain+0x34/0x54
> (XEN)    [<00000a0000206c40>] domain_kill+0xc0/0x15c
> (XEN)    [<00000a000022e0d4>] do_domctl+0x904/0x12ec
> (XEN)    [<00000a0000277a1c>] traps.c#do_trap_hypercall+0x1f4/0x288
> (XEN)    [<00000a0000279018>] do_trap_guest_sync+0x448/0x63c
> (XEN)    [<00000a0000262c80>] entry.o#guest_sync_slowpath+0xa8/0xd8
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 2:
> (XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:714
> (XEN) ****************************************
>
> In commit 19049f8d ("sched: fix locking in a653sched_free_vdata()"),
> locking was introduced to prevent a race against the list manipulation,
> but it leads to an assertion failure when the ARINC 653 domain is shut
> down.
>
> I think this can be fixed by calling xfree() after
> spin_unlock_irqrestore() as shown below.
>
>  xen/common/sched/arinc653.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/xen/common/sched/arinc653.c b/xen/common/sched/arinc653.c
> index 7bf288264c..1615f1bc46 100644
> --- a/xen/common/sched/arinc653.c
> +++ b/xen/common/sched/arinc653.c
> @@ -463,10 +463,11 @@ a653sched_free_udata(const struct scheduler *ops, void *priv)
>      if ( !is_idle_unit(av->unit) )
>          list_del(&av->list);
>
> -    xfree(av);
>      update_schedule_units(ops);
>
>      spin_unlock_irqrestore(&sched_priv->lock, flags);
> +
> +    xfree(av);
>  }
>
> Can I hear your opinion on this?

Yes, this seems the right way to fix the issue.

Could you please send a proper patch (please have a look at [1] in case
you are unsure what a proper patch should look like)?


Juergen

[1] http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/process/sending-patches.pandoc
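For reference, the ordering the proposed diff establishes can be shown as a minimal, self-contained userspace sketch. This is not the Xen code: a pthread mutex stands in for spin_lock_irqsave()/spin_unlock_irqrestore(), free() stands in for xfree(), and the list handling is open-coded. The only point is the one enforced by the assertion at common/xmalloc_tlsf.c:714: unlink the element while holding the lock, but hand the memory back to the allocator only after the lock (and, in Xen, the saved interrupt state) has been released.

/*
 * Userspace sketch of the locking pattern in the proposed fix,
 * NOT the actual Xen code.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct unit {
    struct unit *prev, *next;   /* stand-in for the list_head in arinc653_unit_t */
    int id;
};

static struct unit head = { &head, &head, -1 };   /* circular list head */
static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;

static void unit_add(struct unit *u)
{
    pthread_mutex_lock(&sched_lock);
    u->next = head.next;
    u->prev = &head;
    head.next->prev = u;
    head.next = u;
    pthread_mutex_unlock(&sched_lock);
}

/* Mirrors the ordering of the fixed a653sched_free_udata(). */
static void unit_free(struct unit *u)
{
    if ( u == NULL )
        return;

    pthread_mutex_lock(&sched_lock);       /* spin_lock_irqsave()       */

    u->prev->next = u->next;               /* list_del(&av->list)       */
    u->next->prev = u->prev;

    /* ... update remaining scheduler state while still locked ...      */

    pthread_mutex_unlock(&sched_lock);     /* spin_unlock_irqrestore()  */

    free(u);                               /* xfree() only after unlock */
}

int main(void)
{
    struct unit *u = calloc(1, sizeof(*u));

    if ( u == NULL )
        return 1;

    u->id = 1;
    unit_add(u);
    unit_free(u);
    puts("unit unlinked under the lock, freed after it was dropped");
    return 0;
}

Built with "cc -pthread sketch.c", this just exercises the add/remove path once; unit_free() is the part that follows the same order of operations as the fixed a653sched_free_udata().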
Jürgen,

> On 17.03.25 06:07, Choi, Anderson wrote:
>> I'd like to report a Xen panic when shutting down an ARINC653 domain
>> with the following setup. Note that this is only observed when
>> CONFIG_DEBUG is enabled.
>>
>> [...]
>>
>> I think this can be fixed by calling xfree() after
>> spin_unlock_irqrestore() as shown below.
>>
>> [...]
>>
>> Can I hear your opinion on this?
>
> Yes, this seems the right way to fix the issue.
>
> Could you please send a proper patch (please have a look at [1] in case
> you are unsure what a proper patch should look like)?
>
> Juergen
>
> [1] http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/process/sending-patches.pandoc

Thanks for your opinion. Let me read through the link and submit the
patch.

Regards,
Anderson
On 17/03/2025 1:21 pm, Choi, Anderson wrote:
> Jürgen,
>
>> On 17.03.25 06:07, Choi, Anderson wrote:
>>> I'd like to report a Xen panic when shutting down an ARINC653 domain
>>> with the following setup. Note that this is only observed when
>>> CONFIG_DEBUG is enabled.
>>>
>>> [...]
>>>
>>> Can I hear your opinion on this?
>> Yes, this seems the right way to fix the issue.
>>
>> Could you please send a proper patch (please have a look at [1] in case
>> you are unsure what a proper patch should look like)?
>>
>> Juergen
>>
>> [1] http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/process/sending-patches.pandoc
> Thanks for your opinion. Let me read through the link and submit the
> patch.

Other good references are:

https://lore.kernel.org/xen-devel/20250313093157.30450-1-jgross@suse.com/
https://lore.kernel.org/xen-devel/d8c08c22-ee70-4c06-8fcd-ad44fc0dc58f@suse.com/

One you hopefully recognise, and the other is another bugfix to ARINC
noticed by the Coverity run over the weekend.

~Andrew
On 17.03.25 14:29, Andrew Cooper wrote:
> On 17/03/2025 1:21 pm, Choi, Anderson wrote:
>> Jürgen,
>>
>>> On 17.03.25 06:07, Choi, Anderson wrote:
>>>> I'd like to report a Xen panic when shutting down an ARINC653 domain
>>>> with the following setup. Note that this is only observed when
>>>> CONFIG_DEBUG is enabled.
>>>>
>>>> [...]
>>
>> Thanks for your opinion. Let me read through the link and submit the
>> patch.
>
> Other good references are:
>
> https://lore.kernel.org/xen-devel/20250313093157.30450-1-jgross@suse.com/
> https://lore.kernel.org/xen-devel/d8c08c22-ee70-4c06-8fcd-ad44fc0dc58f@suse.com/
>
> One you hopefully recognise, and the other is another bugfix to ARINC
> noticed by the Coverity run over the weekend.

Please note that the Coverity report is not about a real bug, but just a
latent one. As long as the arinc653 scheduler supports only a single
physical CPU, there is no real need for the lock when accessing
sched_priv->next_switch_time (the lock is meant to protect the list of
units/vcpus, not all the other fields of sched_priv).


Juergen
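The split Juergen describes can be summarised in a small illustrative struct; apart from next_switch_time, which is named in the discussion above, the types and field names here are assumed stand-ins rather than the real a653sched_private layout. The lock added by commit 19049f8d guards the unit list, which cleanup paths such as the sched_move_domain() chain in the trace above can reach, while next_switch_time is per-pCPU scheduling state touched only on the single pCPU an arinc653 cpupool runs on today, so the unlocked access Coverity flags cannot currently race.

#include <stdint.h>

/* Mock lock/list types so this sketch is self-contained (not Xen's). */
typedef struct { volatile int locked; } mock_spinlock_t;
typedef struct mock_list_head { struct mock_list_head *next, *prev; } mock_list_head_t;

typedef struct a653sched_private_sketch {
    /*
     * Guards the unit list below, which cross-CPU cleanup paths (such as
     * the sched_move_domain() chain in the trace above) can reach.  This
     * is what the locking added by commit 19049f8d is about.
     */
    mock_spinlock_t lock;
    mock_list_head_t unit_list;

    /*
     * Per-pCPU scheduling state: with only one pCPU per arinc653 cpupool
     * it is never accessed concurrently, so the unlocked access flagged
     * by Coverity is latent rather than an observable race.
     */
    uint64_t next_switch_time;
} a653sched_private_sketch_t;

int main(void)
{
    a653sched_private_sketch_t p = { .next_switch_time = 0 };

    return (int)p.next_switch_time;
}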