
KVM/x86: Do not clear SIPI while in SMM

Message ID 20240416204729.2541743-1-boris.ostrovsky@oracle.com (mailing list archive)
State New, archived
Headers show
Series KVM/x86: Do not clear SIPI while in SMM | expand

Commit Message

Boris Ostrovsky April 16, 2024, 8:47 p.m. UTC
When a processor is running in SMM and receives an INIT message, the interrupt
is left pending until SMM is exited. On the other hand, SIPI, which
typically follows INIT, is discarded. This presents a problem since the sender
has no way of knowing that its SIPI has been dropped, which results in the
processor failing to come up.

Keeping the SIPI pending avoids this scenario.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
I am not sure whether non-SMM cases should clear the bit.

 arch/x86/kvm/lapic.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Paolo Bonzini April 16, 2024, 8:53 p.m. UTC | #1
On 4/16/24 22:47, Boris Ostrovsky wrote:
> When a processor is running in SMM and receives INIT message the interrupt
> is left pending until SMM is exited. On the other hand, SIPI, which
> typically follows INIT, is discarded. This presents a problem since sender
> has no way of knowing that its SIPI has been dropped, which results in
> processor failing to come up.
> 
> Keeping the SIPI pending avoids this scenario.

This is incorrect - it's yet another ugly legacy facet of x86, but we 
have to live with it.  SIPI is discarded because the code is supposed to 
retry it if needed ("INIT-SIPI-SIPI").

The sender should set a flag as early as possible in the SIPI code so 
that it's clear that it was not received; and an extra SIPI is not a 
problem, it will be ignored anyway and will not cause trouble if there's 
a race.
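
For illustration, a minimal sender-side sketch (not the actual Linux code;
send_ipi(), ap_has_started(), delay_us() and trampoline_eip are placeholder
names): INIT can be fired once because it is latched, while the SIPI has to
be checked for and reissued by the sender.

#include <stdbool.h>

#define APIC_DM_INIT    0x500
#define APIC_DM_STARTUP 0x600

/* Placeholders for whatever the platform provides. */
extern void send_ipi(unsigned int apicid, unsigned int dm, unsigned int vector);
extern bool ap_has_started(unsigned int apicid); /* flag set early in the AP trampoline */
extern void delay_us(unsigned long usec);

static int start_ap(unsigned int apicid, unsigned long trampoline_eip)
{
	int tries;

	/* INIT is latched even if the target is in SMM, so send it once. */
	send_ipi(apicid, APIC_DM_INIT, 0);
	delay_us(10 * 1000);

	for (tries = 0; tries < 2; tries++) {
		/*
		 * SIPI is not latched: if the target is in SMM (or otherwise
		 * not in wait-for-SIPI) it is silently dropped, so the sender
		 * must notice that the AP never came up and reissue it.
		 */
		send_ipi(apicid, APIC_DM_STARTUP, trampoline_eip >> 12);
		delay_us(300);
		if (ap_has_started(apicid))
			return 0;
	}
	return -1;	/* AP never answered; the caller can retry or give up */
}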

What is the reproducer for this?

Paolo
Boris Ostrovsky April 16, 2024, 8:57 p.m. UTC | #2
On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> On 4/16/24 22:47, Boris Ostrovsky wrote:
>> When a processor is running in SMM and receives INIT message the 
>> interrupt
>> is left pending until SMM is exited. On the other hand, SIPI, which
>> typically follows INIT, is discarded. This presents a problem since 
>> sender
>> has no way of knowing that its SIPI has been dropped, which results in
>> processor failing to come up.
>>
>> Keeping the SIPI pending avoids this scenario.
>
> This is incorrect - it's yet another ugly legacy facet of x86, but we 
> have to live with it.  SIPI is discarded because the code is supposed 
> to retry it if needed ("INIT-SIPI-SIPI").


I couldn't find in the SDM/APM a definitive statement about whether SIPI 
is supposed to be dropped.


>
> The sender should set a flag as early as possible in the SIPI code so 
> that it's clear that it was not received; and an extra SIPI is not a 
> problem, it will be ignored anyway and will not cause trouble if 
> there's a race.
>
> What is the reproducer for this?
>

Hotplugging/unplugging cpus in a loop, especially if you oversubscribe 
the guest, will get you there in 10-15 minutes.

Typically (although I think not always) this happens when OVMF is 
trying to rendezvous and a processor is missing and is sent an extra SMI.


-boris
Paolo Bonzini April 16, 2024, 10:03 p.m. UTC | #3
On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> > On 4/16/24 22:47, Boris Ostrovsky wrote:
> >> Keeping the SIPI pending avoids this scenario.
> >
> > This is incorrect - it's yet another ugly legacy facet of x86, but we
> > have to live with it.  SIPI is discarded because the code is supposed
> > to retry it if needed ("INIT-SIPI-SIPI").
>
> I couldn't find in the SDM/APM a definitive statement about whether SIPI
> is supposed to be dropped.

I think the manual is pretty consistent that SIPIs are never latched,
they're only ever used in wait-for-SIPI state.

> > The sender should set a flag as early as possible in the SIPI code so
> > that it's clear that it was not received; and an extra SIPI is not a
> > problem, it will be ignored anyway and will not cause trouble if
> > there's a race.
> >
> > What is the reproducer for this?
>
> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> the guest, will get you there in 10-15 minutes.
>
> Typically (although I think not always) this is happening when OVMF if
> trying to rendezvous and a processor is missing and is sent an extra SMI.

Can you go into more detail? I wasn't even aware that OVMF's SMM
supported hotplug - on real hardware I think there's extra work from
the BMC to coordinate all SMIs across both existing and hotplugged
packages(*)

What should happen is that SMIs are blocked on the new CPUs, so that
only existing CPUs answer. These restore the 0x30000 segment to
prepare for the SMI on the new CPUs, and send an INIT-SIPI to start
the SMI on the new CPUs. Does OVMF do anything like that?

Paolo
Sean Christopherson April 16, 2024, 10:14 p.m. UTC | #4
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
> > On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> > > On 4/16/24 22:47, Boris Ostrovsky wrote:
> > >> Keeping the SIPI pending avoids this scenario.
> > >
> > > This is incorrect - it's yet another ugly legacy facet of x86, but we
> > > have to live with it.  SIPI is discarded because the code is supposed
> > > to retry it if needed ("INIT-SIPI-SIPI").
> >
> > I couldn't find in the SDM/APM a definitive statement about whether SIPI
> > is supposed to be dropped.
> 
> I think the manual is pretty consistent that SIPIs are never latched,
> they're only ever used in wait-for-SIPI state.

Ya, the "Interrupt Command Register (ICR)" section for "110 (Start-Up)" explicitly
says it's software's responsibility to detect whether or not the SIPI was delivered,
and to resend SIPI(s) if needed.

  IPIs sent with this delivery mode are not automatically retried if the source
  APIC is unable to deliver it. It is up to the software to determine if the
  SIPI was not successfully delivered and to reissue the SIPI if necessary.
Boris Ostrovsky April 16, 2024, 10:56 p.m. UTC | #5
(Sorry, need to resend)

On 4/16/24 6:03 PM, Paolo Bonzini wrote:
> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
>>> On 4/16/24 22:47, Boris Ostrovsky wrote:
>>>> Keeping the SIPI pending avoids this scenario.
>>>
>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
>>> have to live with it.  SIPI is discarded because the code is supposed
>>> to retry it if needed ("INIT-SIPI-SIPI").
>>
>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
>> is supposed to be dropped.
> 
> I think the manual is pretty consistent that SIPIs are never latched,
> they're only ever used in wait-for-SIPI state.
> 
>>> The sender should set a flag as early as possible in the SIPI code so
>>> that it's clear that it was not received; and an extra SIPI is not a
>>> problem, it will be ignored anyway and will not cause trouble if
>>> there's a race.
>>>
>>> What is the reproducer for this?
>>
>> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
>> the guest, will get you there in 10-15 minutes.
>>
>> Typically (although I think not always) this is happening when OVMF if
>> trying to rendezvous and a processor is missing and is sent an extra SMI.
> 
> Can you go into more detail? I wasn't even aware that OVMF's SMM
> supported hotplug - on real hardware I think there's extra work from
> the BMC to coordinate all SMIs across both existing and hotplugged
> packages(*)


It's been supported by OVMF for a couple of years (in fact, IIRC you 
were part of at least initial conversations about this, at least for the 
unplug part).

During hotplug QEMU gathers all cpus in OVMF from (I think) 
ich9_apm_ctrl_changed() and they are all waited for in 
SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen 
that the SMI from QEMU is not delivered to a processor that was *just* 
successfully hotplugged and so it is pinged again 
(https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304). 


At the same time this processor is now being brought up by the kernel and is 
being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after 
the SMI reaches the processor then that processor is not going to have a 
good day.


> 
> What should happen is that SMIs are blocked on the new CPUs, so that
> only existing CPUs answer. These restore the 0x30000 segment to
> prepare for the SMI on the new CPUs, and send an INIT-SIPI to start
> the SMI on the new CPUs. Does OVMF do anything like that?
You mean this: 
https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/OvmfPkg/CpuHotplugSmm/Smbase.c#L272 
?


-boris
Boris Ostrovsky April 16, 2024, 11:02 p.m. UTC | #6
On 4/16/24 6:14 PM, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
>> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
>>>> On 4/16/24 22:47, Boris Ostrovsky wrote:
>>>>> Keeping the SIPI pending avoids this scenario.
>>>>
>>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
>>>> have to live with it.  SIPI is discarded because the code is supposed
>>>> to retry it if needed ("INIT-SIPI-SIPI").
>>>
>>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
>>> is supposed to be dropped.
>>
>> I think the manual is pretty consistent that SIPIs are never latched,
>> they're only ever used in wait-for-SIPI state.
> 
> Ya, the "Interrupt Command Register (ICR)" section for "110 (Start-Up)" explicitly
> says it's software's responsibility to detect whether or not the SIPI was delivered,
> and to resend SIPI(s) if needed.
> 
>    IPIs sent with this delivery mode are not automatically retried if the source
>    APIC is unable to deliver it. It is up to the software to determine if the
>    SIPI was not successfully delivered and to reissue the SIPI if necessary.


Right, I saw that. I was hoping to see something about SIPI being 
dropped. IOW my question was what happens to a SIPI that was delivered 
to a processor in SMM and not what I should do if it wasn't.

-boris
Sean Christopherson April 16, 2024, 11:17 p.m. UTC | #7
On Tue, Apr 16, 2024, boris.ostrovsky@oracle.com wrote:
> (Sorry, need to resend)
> 
> On 4/16/24 6:03 PM, Paolo Bonzini wrote:
> > On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
> > > On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> > > > On 4/16/24 22:47, Boris Ostrovsky wrote:
> > > > > Keeping the SIPI pending avoids this scenario.
> > > > 
> > > > This is incorrect - it's yet another ugly legacy facet of x86, but we
> > > > have to live with it.  SIPI is discarded because the code is supposed
> > > > to retry it if needed ("INIT-SIPI-SIPI").
> > > 
> > > I couldn't find in the SDM/APM a definitive statement about whether SIPI
> > > is supposed to be dropped.
> > 
> > I think the manual is pretty consistent that SIPIs are never latched,
> > they're only ever used in wait-for-SIPI state.
> > 
> > > > The sender should set a flag as early as possible in the SIPI code so
> > > > that it's clear that it was not received; and an extra SIPI is not a
> > > > problem, it will be ignored anyway and will not cause trouble if
> > > > there's a race.
> > > > 
> > > > What is the reproducer for this?
> > > 
> > > Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> > > the guest, will get you there in 10-15 minutes.
> > > 
> > > Typically (although I think not always) this is happening when OVMF if
> > > trying to rendezvous and a processor is missing and is sent an extra SMI.
> > 
> > Can you go into more detail? I wasn't even aware that OVMF's SMM
> > supported hotplug - on real hardware I think there's extra work from
> > the BMC to coordinate all SMIs across both existing and hotplugged
> > packages(*)
> 
> 
> It's been supported by OVMF for a couple of years (in fact, IIRC you were
> part of at least initial conversations about this, at least for the unplug
> part).
> 
> During hotplug QEMU gathers all cpus in OVMF from (I think)
> ich9_apm_ctrl_changed() and they are all waited for in
> SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen
> that the SMI from QEMU is not delivered to a processor that was *just*
> successfully hotplugged and so it is pinged again (https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304).
> 
> 
> At the same time this processor is now being brought up by kernel and is
> being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after the
> SMI reaches the processor then that processor is not going to have a good
> day.

It's specifically SIPI that's problematic.  INIT is blocked by SMM, but latched,
and SMIs are blocked by WFS, but latched.  And AFAICT, KVM emulates all of those
combinations correctly.

Why is the SMI from QEMU not delivered?  That seems like the smoking gun.
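
Those rules are easy to lose track of, so here is a tiny self-contained model
of them (not KVM code; the state machine is deliberately simplified): SMI is
blocked but latched while in wait-for-SIPI, INIT is blocked but latched while
in SMM, and SIPI only acts in wait-for-SIPI and is otherwise discarded.

#include <stdbool.h>
#include <stdio.h>

enum mp_state { RUNNING, WAIT_FOR_SIPI };

struct cpu {
	enum mp_state state;
	bool in_smm;
	bool pending_init, pending_sipi, pending_smi;
};

static void accept_events(struct cpu *c)
{
	/* SMI: blocked while in wait-for-SIPI (or already in SMM), but latched. */
	if (c->pending_smi && c->state != WAIT_FOR_SIPI && !c->in_smm) {
		c->pending_smi = false;
		c->in_smm = true;
	}
	/* INIT: blocked while in SMM, but latched.  SIPI: never latched. */
	if (c->in_smm) {
		c->pending_sipi = false;
		return;
	}
	if (c->pending_init) {
		c->pending_init = false;
		c->state = WAIT_FOR_SIPI;
	}
	if (c->pending_sipi) {
		c->pending_sipi = false;
		if (c->state == WAIT_FOR_SIPI)
			c->state = RUNNING;	/* SIPI only acts in wait-for-SIPI */
	}
}

int main(void)
{
	/* The sequence from this thread: SMI lands first, then INIT-SIPI-SIPI. */
	struct cpu c = { .state = RUNNING };

	c.pending_smi  = true; accept_events(&c);	/* CPU enters SMM          */
	c.pending_init = true; accept_events(&c);	/* INIT latched, not taken */
	c.pending_sipi = true; accept_events(&c);	/* SIPI silently discarded */

	c.in_smm = false;      accept_events(&c);	/* RSM: latched INIT fires */

	/*
	 * The CPU is now parked in wait-for-SIPI with nothing pending; it only
	 * comes up if the sender notices and sends another SIPI.
	 */
	printf("state=%s pending_sipi=%d\n",
	       c.state == WAIT_FOR_SIPI ? "wait-for-SIPI" : "running",
	       c.pending_sipi);
	return 0;
}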
Boris Ostrovsky April 16, 2024, 11:37 p.m. UTC | #8
On 4/16/24 7:17 PM, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, boris.ostrovsky@oracle.com wrote:
>> (Sorry, need to resend)
>>
>> On 4/16/24 6:03 PM, Paolo Bonzini wrote:
>>> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>>>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
>>>>> On 4/16/24 22:47, Boris Ostrovsky wrote:
>>>>>> Keeping the SIPI pending avoids this scenario.
>>>>>
>>>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
>>>>> have to live with it.  SIPI is discarded because the code is supposed
>>>>> to retry it if needed ("INIT-SIPI-SIPI").
>>>>
>>>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
>>>> is supposed to be dropped.
>>>
>>> I think the manual is pretty consistent that SIPIs are never latched,
>>> they're only ever used in wait-for-SIPI state.
>>>
>>>>> The sender should set a flag as early as possible in the SIPI code so
>>>>> that it's clear that it was not received; and an extra SIPI is not a
>>>>> problem, it will be ignored anyway and will not cause trouble if
>>>>> there's a race.
>>>>>
>>>>> What is the reproducer for this?
>>>>
>>>> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
>>>> the guest, will get you there in 10-15 minutes.
>>>>
>>>> Typically (although I think not always) this is happening when OVMF if
>>>> trying to rendezvous and a processor is missing and is sent an extra SMI.
>>>
>>> Can you go into more detail? I wasn't even aware that OVMF's SMM
>>> supported hotplug - on real hardware I think there's extra work from
>>> the BMC to coordinate all SMIs across both existing and hotplugged
>>> packages(*)
>>
>>
>> It's been supported by OVMF for a couple of years (in fact, IIRC you were
>> part of at least initial conversations about this, at least for the unplug
>> part).
>>
>> During hotplug QEMU gathers all cpus in OVMF from (I think)
>> ich9_apm_ctrl_changed() and they are all waited for in
>> SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen
>> that the SMI from QEMU is not delivered to a processor that was *just*
>> successfully hotplugged and so it is pinged again (https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304).
>>
>>
>> At the same time this processor is now being brought up by kernel and is
>> being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after the
>> SMI reaches the processor then that processor is not going to have a good
>> day.
> 
> It's specifically SIPI that's problematic.  INIT is blocked by SMM, but latched,
> and SMIs are blocked by WFS, but latched.  And AFAICT, KVM emulates all of those
> combinations correctly.
> 
> Why is the SMI from QEMU not delivered?  That seems like the smoking gun.

I haven't actually traced this but it seems that what happens is that 
the newly-added processor is about to leave SMM and the count of in-SMM 
processors is decremented. At the same time, since the processor is 
still in SMM, QEMU's SMI is not taken.

And so when the count is looked at again in SmmWaitForApArrival() one 
processor is missing.


-boris
Igor Mammedov April 17, 2024, 12:40 p.m. UTC | #9
On Tue, 16 Apr 2024 19:37:09 -0400
boris.ostrovsky@oracle.com wrote:

> On 4/16/24 7:17 PM, Sean Christopherson wrote:
> > On Tue, Apr 16, 2024, boris.ostrovsky@oracle.com wrote:  
> >> (Sorry, need to resend)
> >>
> >> On 4/16/24 6:03 PM, Paolo Bonzini wrote:  
> >>> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:  
> >>>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:  
> >>>>> On 4/16/24 22:47, Boris Ostrovsky wrote:  
> >>>>>> Keeping the SIPI pending avoids this scenario.  
> >>>>>
> >>>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
> >>>>> have to live with it.  SIPI is discarded because the code is supposed
> >>>>> to retry it if needed ("INIT-SIPI-SIPI").  
> >>>>
> >>>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
> >>>> is supposed to be dropped.  
> >>>
> >>> I think the manual is pretty consistent that SIPIs are never latched,
> >>> they're only ever used in wait-for-SIPI state.
> >>>  
> >>>>> The sender should set a flag as early as possible in the SIPI code so
> >>>>> that it's clear that it was not received; and an extra SIPI is not a
> >>>>> problem, it will be ignored anyway and will not cause trouble if
> >>>>> there's a race.
> >>>>>
> >>>>> What is the reproducer for this?  
> >>>>
> >>>> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> >>>> the guest, will get you there in 10-15 minutes.
> >>>>
> >>>> Typically (although I think not always) this is happening when OVMF if
> >>>> trying to rendezvous and a processor is missing and is sent an extra SMI.  
> >>>
> >>> Can you go into more detail? I wasn't even aware that OVMF's SMM
> >>> supported hotplug - on real hardware I think there's extra work from
> >>> the BMC to coordinate all SMIs across both existing and hotplugged
> >>> packages(*)  
> >>
> >>
> >> It's been supported by OVMF for a couple of years (in fact, IIRC you were
> >> part of at least initial conversations about this, at least for the unplug
> >> part).
> >>
> >> During hotplug QEMU gathers all cpus in OVMF from (I think)
> >> ich9_apm_ctrl_changed() and they are all waited for in
> >> SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen
> >> that the SMI from QEMU is not delivered to a processor that was *just*
> >> successfully hotplugged and so it is pinged again (https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304).
> >>
> >>
> >> At the same time this processor is now being brought up by kernel and is
> >> being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after the
> >> SMI reaches the processor then that processor is not going to have a good
> >> day. 

Do you use a qemu/firmware combo that negotiated the ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT/
ICH9_LPC_SMI_F_CPU_HOT_UNPLUG_BIT features?

> > 
> > It's specifically SIPI that's problematic.  INIT is blocked by SMM, but latched,
> > and SMIs are blocked by WFS, but latched.  And AFAICT, KVM emulates all of those
> > combinations correctly.
> > 
> > Why is the SMI from QEMU not delivered?  That seems like the smoking gun.  
> 
> I haven't actually traced this but it seems that what happens is that
> the newly-added processor is about to leave SMM and the count of in-SMM 
> processors is decremented. At the same time, since the processor is 
> still in SMM the QEMU's SMM is not taken.
> 
> And so when the count is looked at again in SmmWaitForApArrival() one 
> processor is missing.

The current QEMU CPU hotplug workflow with SMM enabled should be as follows:

  1. OSPM gets list(N) of hotplugged cpus 
  2. OSPM hands over control to firmware (SMM callback leading to SMI broadcast)
  3. Firmware at this point shall initialize all new CPUs (incl. relocating SMBASE for new ones)
     it shall pull in all CPUs that are present at the moment
  4. Firmware returns control to OSPM
  5. OSPM sends Notify to the list(N) CPUs triggering INIT-SIPI-SIPI _only_ on
     those CPUs that it collected in step 1

The above steps repeat until all hotplugged CPUs are handled.

In a nutshell, INIT-SIPI-SIPI shall not be sent to a freshly hotplugged CPU
that OSPM hasn't yet seen in (1), _and_ firmware should already have initialized it in (3).

CPUs enumerated at (3) shall at least include the CPUs present at (1)
and may include newer CPUs that arrived in between (1-3).

CPUs collected at (1) shall all enter SMM; if that doesn't happen
then the hotplug workflow won't work as expected.
In that case we need to figure out why the SMI is not delivered
or why the firmware isn't waiting for the hotplugged CPU.

> 
> -boris
>
Boris Ostrovsky April 17, 2024, 1:58 p.m. UTC | #10
On 4/17/24 8:40 AM, Igor Mammedov wrote:
> On Tue, 16 Apr 2024 19:37:09 -0400
> boris.ostrovsky@oracle.com wrote:
> 
>> On 4/16/24 7:17 PM, Sean Christopherson wrote:
>>> On Tue, Apr 16, 2024, boris.ostrovsky@oracle.com wrote:
>>>> (Sorry, need to resend)
>>>>
>>>> On 4/16/24 6:03 PM, Paolo Bonzini wrote:
>>>>> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>>>>>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
>>>>>>> On 4/16/24 22:47, Boris Ostrovsky wrote:
>>>>>>>> Keeping the SIPI pending avoids this scenario.
>>>>>>>
>>>>>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
>>>>>>> have to live with it.  SIPI is discarded because the code is supposed
>>>>>>> to retry it if needed ("INIT-SIPI-SIPI").
>>>>>>
>>>>>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
>>>>>> is supposed to be dropped.
>>>>>
>>>>> I think the manual is pretty consistent that SIPIs are never latched,
>>>>> they're only ever used in wait-for-SIPI state.
>>>>>   
>>>>>>> The sender should set a flag as early as possible in the SIPI code so
>>>>>>> that it's clear that it was not received; and an extra SIPI is not a
>>>>>>> problem, it will be ignored anyway and will not cause trouble if
>>>>>>> there's a race.
>>>>>>>
>>>>>>> What is the reproducer for this?
>>>>>>
>>>>>> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
>>>>>> the guest, will get you there in 10-15 minutes.
>>>>>>
>>>>>> Typically (although I think not always) this is happening when OVMF if
>>>>>> trying to rendezvous and a processor is missing and is sent an extra SMI.
>>>>>
>>>>> Can you go into more detail? I wasn't even aware that OVMF's SMM
>>>>> supported hotplug - on real hardware I think there's extra work from
>>>>> the BMC to coordinate all SMIs across both existing and hotplugged
>>>>> packages(*)
>>>>
>>>>
>>>> It's been supported by OVMF for a couple of years (in fact, IIRC you were
>>>> part of at least initial conversations about this, at least for the unplug
>>>> part).
>>>>
>>>> During hotplug QEMU gathers all cpus in OVMF from (I think)
>>>> ich9_apm_ctrl_changed() and they are all waited for in
>>>> SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen
>>>> that the SMI from QEMU is not delivered to a processor that was *just*
>>>> successfully hotplugged and so it is pinged again (https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304).
>>>>
>>>>
>>>> At the same time this processor is now being brought up by kernel and is
>>>> being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after the
>>>> SMI reaches the processor then that processor is not going to have a good
>>>> day.
> 
> Do you use qemu/firmware combo that negotiated ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT/
> ICH9_LPC_SMI_F_CPU_HOT_UNPLUG_BIT features?

Yes.

> 
>>>
>>> It's specifically SIPI that's problematic.  INIT is blocked by SMM, but latched,
>>> and SMIs are blocked by WFS, but latched.  And AFAICT, KVM emulates all of those
>>> combinations correctly.
>>>
>>> Why is the SMI from QEMU not delivered?  That seems like the smoking gun.
>>
>> I haven't actually traced this but it seems that what happens is that
>> the newly-added processor is about to leave SMM and the count of in-SMM
>> processors is decremented. At the same time, since the processor is
>> still in SMM the QEMU's SMM is not taken.
>>
>> And so when the count is looked at again in SmmWaitForApArrival() one
>> processor is missing.
> 
> Current QEMU CPU hotplug workflow with SMM enabled, should be following:
> 
>    1. OSPM gets list(N) of hotplugged cpus
>    2. OSPM hands over control to firmware (SMM callback leading to SMI broadcast)
>    3. Firmware at this point shall initialize all new CPUs (incl. relocating SMBASE for new ones)
>       it shall pull in all CPUs that are present at the moment
>    4. Firmware returns control to OSPM
>    5. OSPM sends Notify to the list(N) CPUs triggering INIT-SIPI-SIPI _only_ on
>       those CPUs that it collected in step 1
> 
> above steps will repeat until all hotplugged CPUs are handled.
> 
> In nutshell INIT-SIPI-SIPI shall not be sent to a freshly hotplugged CPU
> that OSPM haven't seen (1) yet _and_ firmware should have initialized (3).
> 
> CPUs enumerated at (3) at least shall include CPUs present at (1)
> and may include newer CPU arrived in between (1-3).
> 
> CPUs collected at (1) shall all get SMM, if it doesn't happen
> then hotplug workflow won't work as expected.
> In which case we need to figure out why SMM is not delivered
> or why firmware isn't waiting for hotplugged CPU.

I noticed that I was using a few months old qemu bits and now I am 
having trouble reproducing this on the latest bits. Let me see if I can get 
this to fail with the latest bits first and then try to trace why the processor 
is in this unexpected state.

-boris
Boris Ostrovsky April 19, 2024, 4:17 p.m. UTC | #11
On 4/17/24 9:58 AM, boris.ostrovsky@oracle.com wrote:
> 
> I noticed that I was using a few months old qemu bits and now I am 
> having trouble reproducing this on latest bits. Let me see if I can get 
> this to fail with latest first and then try to trace why the processor 
> is in this unexpected state.

Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call 
under if (!dev) in qmp_device_add()" is what makes the test stop failing.

I need to understand whether the lack of failures is a side effect of timing 
changes that simply make hotplug failures less likely, or if this is an 
actual (but seemingly unintentional) fix.

-boris
Igor Mammedov Sept. 24, 2024, 9:40 a.m. UTC | #12
On Fri, 19 Apr 2024 12:17:01 -0400
boris.ostrovsky@oracle.com wrote:

> On 4/17/24 9:58 AM, boris.ostrovsky@oracle.com wrote:
> > 
> > I noticed that I was using a few months old qemu bits and now I am 
> > having trouble reproducing this on latest bits. Let me see if I can get 
> > this to fail with latest first and then try to trace why the processor 
> > is in this unexpected state.  
> 
> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call 
> under if (!dev) in qmp_device_add()" is what makes the test to stop failing.
>
> I need to understand whether lack of failures is a side effect of timing 
> changes that simply make hotplug fail less likely or if this is an 
> actual (but seemingly unintentional) fix.

Agreed, we should find the culprit of the problem.

PS:
also, if you are using an AMD host, there was a regression in OVMF
where a vCPU that OSPM was already onlining was yanked out
from under OSPM's feet by OVMF (which, depending on timing, could
manifest as a lost SIPI).

edk2 commit that should fix it is:
    https://github.com/tianocore/edk2/commit/1c19ccd5103b

Switching to an Intel host should rule that out at least
(or use the fixed edk2-ovmf-20240524-5.el10.noarch package from CentOS,
if you are forced to use an AMD host).

> -boris
>
Boris Ostrovsky Sept. 24, 2024, 9:59 p.m. UTC | #13
On 9/24/24 5:40 AM, Igor Mammedov wrote:
> On Fri, 19 Apr 2024 12:17:01 -0400
> boris.ostrovsky@oracle.com wrote:
> 
>> On 4/17/24 9:58 AM, boris.ostrovsky@oracle.com wrote:
>>>
>>> I noticed that I was using a few months old qemu bits and now I am
>>> having trouble reproducing this on latest bits. Let me see if I can get
>>> this to fail with latest first and then try to trace why the processor
>>> is in this unexpected state.
>>
>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
>> under if (!dev) in qmp_device_add()" is what makes the test to stop failing.
>>
>> I need to understand whether lack of failures is a side effect of timing
>> changes that simply make hotplug fail less likely or if this is an
>> actual (but seemingly unintentional) fix.
> 
> Agreed, we should find out culprit of the problem.


I haven't been able to spend much time on this unfortunately; Eric is 
now starting to look at this again.

One of my theories was that ich9_apm_ctrl_changed() sends SMIs to 
vcpus serially, while on HW my understanding is that this is done as a 
broadcast, so I thought this could cause a race. I had a quick test with 
pausing and resuming all vcpus around the loop, but that didn't help.
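
For reference, the relevant QEMU path is roughly the following (paraphrased
from memory and heavily trimmed; the APM/SMI_EN enable checks are omitted and
the field/macro names should be double-checked against the actual tree): the
"broadcast" is a plain loop that raises CPU_INTERRUPT_SMI on one vCPU after
another, which is what "serially" means above.

static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
{
    ICH9LPCState *lpc = arg;
    CPUState *cs;

    if (lpc->smi_negotiated_features &
        (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
        CPU_FOREACH(cs) {
            cpu_interrupt(cs, CPU_INTERRUPT_SMI);   /* one vCPU at a time */
        }
    } else {
        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
    }
}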


> 
> PS:
> also if you are using AMD host, there was a regression in OVMF
> where where vCPU that OSPM was already online-ing, was yanked
> from under OSMP feet by OVMF (which depending on timing could
> manifest as lost SIPI).
> 
> edk2 commit that should fix it is:
>      https://github.com/tianocore/edk2/commit/1c19ccd5103b
> 
> Switching to Intel host should rule that out at least.
> (or use fixed edk2-ovmf-20240524-5.el10.noarch package from centos,
> if you are forced to use AMD host)

I just tried with the latest bits that include this commit and was still 
able to reproduce the problem.


-boris
Eric Mackay Sept. 27, 2024, 1:22 a.m. UTC | #14
> On 9/24/24 5:40 AM, Igor Mammedov wrote:
>> On Fri, 19 Apr 2024 12:17:01 -0400
>> boris.ostrovsky@oracle.com wrote:
>> 
>>> On 4/17/24 9:58 AM, boris.ostrovsky@oracle.com wrote:
>>>>
>>>> I noticed that I was using a few months old qemu bits and now I am
>>>> having trouble reproducing this on latest bits. Let me see if I can get
>>>> this to fail with latest first and then try to trace why the processor
>>>> is in this unexpected state.
>>>
>>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
>>> under if (!dev) in qmp_device_add()" is what makes the test to stop failing.
>>>
>>> I need to understand whether lack of failures is a side effect of timing
>>> changes that simply make hotplug fail less likely or if this is an
>>> actual (but seemingly unintentional) fix.
>> 
>> Agreed, we should find out culprit of the problem.
>
>
> I haven't been able to spend much time on this unfortunately, Eric is 
> now starting to look at this again.
>
> One of my theories was that ich9_apm_ctrl_changed() is sending SMIs to 
> vcpus serially while on HW my understanding is that this is done as a 
> broadcast so I thought this could cause a race. I had a quick test with 
> pausing and resuming all vcpus around the loop but that didn't help.
>
>
>> 
>> PS:
>> also if you are using AMD host, there was a regression in OVMF
>> where where vCPU that OSPM was already online-ing, was yanked
>> from under OSMP feet by OVMF (which depending on timing could
>> manifest as lost SIPI).
>> 
>> edk2 commit that should fix it is:
>>      https://github.com/tianocore/edk2/commit/1c19ccd5103b
>> 
>> Switching to Intel host should rule that out at least.
>> (or use fixed edk2-ovmf-20240524-5.el10.noarch package from centos,
>> if you are forced to use AMD host)

I haven't been able to reproduce the issue on an Intel host thus far,
but it may not be an apples-to-apples comparison because my AMD hosts
have a much higher core count.

>
> I just tried with latest bits that include this commit and still was 
> able to reproduce the problem.
>
>
>-boris

The initial hotplug of each CPU appears to complete from the
perspective of OVMF and OSPM. SMBASE relocation succeeds, and the new
CPU reports back from the pen. It seems to be the later INIT-SIPI-SIPI
sequence sent from the guest that doesn't complete.

My working theory has been that some CPU/AP is lagging behind the others
when the BSP is waiting for all the APs to go into SMM, and the BSP just
gives up and moves on. Presumably the INIT-SIPI-SIPI is sent while that
CPU does finally go into SMM, and other CPUs are in normal mode.

I've been able to observe that the SMI handler for the problematic CPU will
sometimes start running when no BSP is elected. This means we have a
window of time where the CPU will ignore SIPI, and at least 1 CPU is in
normal mode (the BSP) which is capable of sending INIT-SIPI-SIPI from
the guest.
Igor Mammedov Sept. 27, 2024, 9:28 a.m. UTC | #15
On Thu, 26 Sep 2024 18:22:39 -0700
Eric Mackay <eric.mackay@oracle.com> wrote:

> > On 9/24/24 5:40 AM, Igor Mammedov wrote:  
> >> On Fri, 19 Apr 2024 12:17:01 -0400
> >> boris.ostrovsky@oracle.com wrote:
> >>   
> >>> On 4/17/24 9:58 AM, boris.ostrovsky@oracle.com wrote:  
> >>>>
> >>>> I noticed that I was using a few months old qemu bits and now I am
> >>>> having trouble reproducing this on latest bits. Let me see if I can get
> >>>> this to fail with latest first and then try to trace why the processor
> >>>> is in this unexpected state.  
> >>>
> >>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
> >>> under if (!dev) in qmp_device_add()" is what makes the test to stop failing.
> >>>
> >>> I need to understand whether lack of failures is a side effect of timing
> >>> changes that simply make hotplug fail less likely or if this is an
> >>> actual (but seemingly unintentional) fix.  
> >> 
> >> Agreed, we should find out culprit of the problem.  
> >
> >
> > I haven't been able to spend much time on this unfortunately, Eric is 
> > now starting to look at this again.
> >
> > One of my theories was that ich9_apm_ctrl_changed() is sending SMIs to 
> > vcpus serially while on HW my understanding is that this is done as a 
> > broadcast so I thought this could cause a race. I had a quick test with 
> > pausing and resuming all vcpus around the loop but that didn't help.
> >
> >  
> >> 
> >> PS:
> >> also if you are using AMD host, there was a regression in OVMF
> >> where where vCPU that OSPM was already online-ing, was yanked
> >> from under OSMP feet by OVMF (which depending on timing could
> >> manifest as lost SIPI).
> >> 
> >> edk2 commit that should fix it is:
> >>      https://github.com/tianocore/edk2/commit/1c19ccd5103b
> >> 
> >> Switching to Intel host should rule that out at least.
> >> (or use fixed edk2-ovmf-20240524-5.el10.noarch package from centos,
> >> if you are forced to use AMD host)  
> 
> I haven't been able to reproduce the issue on an Intel host thus far,
> but it may not be an apples-to-apples comparison because my AMD hosts
> have a much higher core count.
> 
> >
> > I just tried with latest bits that include this commit and still was 
> > able to reproduce the problem.
> >
> >
> >-boris  
> 
> The initial hotplug of each CPU appears to complete from the
> perspective of OVMF and OSPM. SMBASE relocation succeeds, and the new
> CPU reports back from the pen. It seems to be the later INIT-SIPI-SIPI
> sequence sent from the guest that doesn't complete.
> 
> My working theory has been that some CPU/AP is lagging behind the others
> when the BSP is waiting for all the APs to go into SMM, and the BSP just
> gives up and moves on. Presumably the INIT-SIPI-SIPI is sent while that
> CPU does finally go into SMM, and other CPUs are in normal mode.
> 
> I've been able to observe the SMI handler for the problematic CPU will
> sometimes start running when no BSP is elected. This means we have a
> window of time where the CPU will ignore SIPI, and least 1 CPU is in
> normal mode (the BSP) which is capable of sending INIT-SIPI-SIPI from
> the guest.

I've re-read the whole thread and noticed Boris was saying:
  > On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
  > > On 4/16/24 4:53 PM, Paolo Bonzini wrote:  
  ...
  > > >
  > > > What is the reproducer for this?  
  > >
  > > Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
  > > the guest, will get you there in 10-15 minutes.
  ...

So there was unplug involved as well, which has been broken since forever.

A recent patch
 https://patchew.org/QEMU/20230427211013.2994127-1-alxndr@bu.edu/20230427211013.2994127-2-alxndr@bu.edu/
has exposed an issue (unexpected plug/unplug flow) with the root cause in OVMF.
Firmware was letting uninvolved APs run wild in normal mode.
As a result, the AP that was calling _EJ0 and holding the ACPI lock
continued _EJ0 and released the ACPI lock while the BSP and the CPU being
removed were still in the SMM world. Any other plug/unplug op
could then grab the ACPI lock and trigger another SMI, which breaks
the hotplug flow expectations (aka exclusive access to the hotplug registers
during a plug/unplug op).
Perhaps that's what you are observing.

Please check if the following helps:
  https://github.com/kraxel/edk2/commit/738c09f6b5ab87be48d754e62deb72b767415158

So yes, SIPI can be lost (which should be expected, as others noted),
but that normally shouldn't be an issue as wakeup_secondary_cpu_via_init()
does resend SIPI.
However, if wakeup_secondary_cpu is set to another handler that doesn't
resend SIPI, it might be an issue.
Eric Mackay Sept. 30, 2024, 11:34 p.m. UTC | #16
> On Thu, 26 Sep 2024 18:22:39 -0700
> Eric Mackay <eric.mackay@oracle.com> wrote:
> > > On 9/24/24 5:40 AM, Igor Mammedov wrote:  
> > >> On Fri, 19 Apr 2024 12:17:01 -0400
> > >> boris.ostrovsky@oracle.com wrote:
> > >>   
> > >>> On 4/17/24 9:58 AM, boris.ostrovsky@oracle.com wrote:  
> > >>>>
> > >>>> I noticed that I was using a few months old qemu bits and now I am
> > >>>> having trouble reproducing this on latest bits. Let me see if I can get
> > >>>> this to fail with latest first and then try to trace why the processor
> > >>>> is in this unexpected state.  
> > >>>
> > >>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
> > >>> under if (!dev) in qmp_device_add()" is what makes the test to stop failing.
> > >>>
> > >>> I need to understand whether lack of failures is a side effect of timing
> > >>> changes that simply make hotplug fail less likely or if this is an
> > >>> actual (but seemingly unintentional) fix.  
> > >> 
> > >> Agreed, we should find out culprit of the problem.  
> > >
> > >
> > > I haven't been able to spend much time on this unfortunately, Eric is 
> > > now starting to look at this again.
> > >
> > > One of my theories was that ich9_apm_ctrl_changed() is sending SMIs to 
> > > vcpus serially while on HW my understanding is that this is done as a 
> > > broadcast so I thought this could cause a race. I had a quick test with 
> > > pausing and resuming all vcpus around the loop but that didn't help.
> > >
> > >  
> > >> 
> > >> PS:
> > >> also if you are using AMD host, there was a regression in OVMF
> > >> where where vCPU that OSPM was already online-ing, was yanked
> > >> from under OSMP feet by OVMF (which depending on timing could
> > >> manifest as lost SIPI).
> > >> 
> > >> edk2 commit that should fix it is:
> > >>      https://github.com/tianocore/edk2/commit/1c19ccd5103b
> > >> 
> > >> Switching to Intel host should rule that out at least.
> > >> (or use fixed edk2-ovmf-20240524-5.el10.noarch package from centos,
> > >> if you are forced to use AMD host)  
> > 
> > I haven't been able to reproduce the issue on an Intel host thus far,
> > but it may not be an apples-to-apples comparison because my AMD hosts
> > have a much higher core count.
> > 
> > >
> > > I just tried with latest bits that include this commit and still was 
> > > able to reproduce the problem.
> > >
> > >
> > >-boris  
> > 
> > The initial hotplug of each CPU appears to complete from the
> > perspective of OVMF and OSPM. SMBASE relocation succeeds, and the new
> > CPU reports back from the pen. It seems to be the later INIT-SIPI-SIPI
> > sequence sent from the guest that doesn't complete.
> > 
> > My working theory has been that some CPU/AP is lagging behind the others
> > when the BSP is waiting for all the APs to go into SMM, and the BSP just
> > gives up and moves on. Presumably the INIT-SIPI-SIPI is sent while that
> > CPU does finally go into SMM, and other CPUs are in normal mode.
> > 
> > I've been able to observe the SMI handler for the problematic CPU will
> > sometimes start running when no BSP is elected. This means we have a
> > window of time where the CPU will ignore SIPI, and least 1 CPU is in
> > normal mode (the BSP) which is capable of sending INIT-SIPI-SIPI from
> > the guest.
> 
> I've re-read whole thread and noticed Boris were saying:
>   > On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>   > > On 4/16/24 4:53 PM, Paolo Bonzini wrote:  
>   ...
>   > > >
>   > > > What is the reproducer for this?  
>   > >
>   > > Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
>   > > the guest, will get you there in 10-15 minutes.
>   ...
> 
> So there was unplug involved as well, which was broken since forever.
> 
> Recent patch
>  https://patchew.org/QEMU/20230427211013.2994127-1-alxndr@bu.edu/20230427211013.2994127-2-alxndr@bu.edu/
> has exposed issue (unexpected uplug/unplug flow) with root cause in OVMF.
> Firmware was letting non involved APs run wild in normal mode.
> As result AP that was calling _EJ0 and holding ACPI lock was
> continuing _EJ0 and releasing ACPI lock, while BSP and a being removed
> CPU were still in SMM world. And any other plug/unplug op
> were able to grab ACPI lock and trigger another SMI, which breaks
> hotplug flow expectations (aka exclusive access to hotplug registers
> during plug/unplug op)
> Perhaps that's what you are observing.
> 
> Please check if following helps:
>   https://github.com/kraxel/edk2/commit/738c09f6b5ab87be48d754e62deb72b767415158
> 

I haven't actually seen the guest crash during unplug, though certainly
there have been unplug failures. I haven't been keeping track of the
unplug failures as closely, but a test I ran over the weekend with this
patch added seemed to show fewer unplug failures.

I'm still getting hotplug failures that cause a guest crash though, so
that mystery remains.

> So yes, SIPI can be lost (which should be expected as others noted)
> but that normally shouldn't be an issue as wakeup_secondary_cpu_via_init()
> do resend SIPI.
> However if wakeup_secondary_cpu is set to another handler that doesn't
> resend SIPI, It might be an issue.

We're using wakeup_secondary_cpu_via_init(). acpi_wakeup_cpu() and
wakeup_cpu_via_vmgexit(), for example, are a bit opaque to me, so I'm
not sure if those code paths include a SIPI resend.
Igor Mammedov Oct. 1, 2024, 8:18 a.m. UTC | #17
On Mon, 30 Sep 2024 16:34:57 -0700
Eric Mackay <eric.mackay@oracle.com> wrote:

> > On Thu, 26 Sep 2024 18:22:39 -0700
> > Eric Mackay <eric.mackay@oracle.com> wrote:  
> > > > On 9/24/24 5:40 AM, Igor Mammedov wrote:    
> > > >> On Fri, 19 Apr 2024 12:17:01 -0400
> > > >> boris.ostrovsky@oracle.com wrote:
> > > >>     
> > > >>> On 4/17/24 9:58 AM, boris.ostrovsky@oracle.com wrote:    
> > > >>>>
> > > >>>> I noticed that I was using a few months old qemu bits and now I am
> > > >>>> having trouble reproducing this on latest bits. Let me see if I can get
> > > >>>> this to fail with latest first and then try to trace why the processor
> > > >>>> is in this unexpected state.    
> > > >>>
> > > >>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
> > > >>> under if (!dev) in qmp_device_add()" is what makes the test to stop failing.
> > > >>>
> > > >>> I need to understand whether lack of failures is a side effect of timing
> > > >>> changes that simply make hotplug fail less likely or if this is an
> > > >>> actual (but seemingly unintentional) fix.    
> > > >> 
> > > >> Agreed, we should find out culprit of the problem.    
> > > >
> > > >
> > > > I haven't been able to spend much time on this unfortunately, Eric is 
> > > > now starting to look at this again.
> > > >
> > > > One of my theories was that ich9_apm_ctrl_changed() is sending SMIs to 
> > > > vcpus serially while on HW my understanding is that this is done as a 
> > > > broadcast so I thought this could cause a race. I had a quick test with 
> > > > pausing and resuming all vcpus around the loop but that didn't help.
> > > >
> > > >    
> > > >> 
> > > >> PS:
> > > >> also if you are using AMD host, there was a regression in OVMF
> > > >> where where vCPU that OSPM was already online-ing, was yanked
> > > >> from under OSMP feet by OVMF (which depending on timing could
> > > >> manifest as lost SIPI).
> > > >> 
> > > >> edk2 commit that should fix it is:
> > > >>      https://github.com/tianocore/edk2/commit/1c19ccd5103b
> > > >> 
> > > >> Switching to Intel host should rule that out at least.
> > > >> (or use fixed edk2-ovmf-20240524-5.el10.noarch package from centos,
> > > >> if you are forced to use AMD host)    
> > > 
> > > I haven't been able to reproduce the issue on an Intel host thus far,
> > > but it may not be an apples-to-apples comparison because my AMD hosts
> > > have a much higher core count.
> > >   
> > > >
> > > > I just tried with latest bits that include this commit and still was 
> > > > able to reproduce the problem.
> > > >
> > > >
> > > >-boris    
> > > 
> > > The initial hotplug of each CPU appears to complete from the
> > > perspective of OVMF and OSPM. SMBASE relocation succeeds, and the new
> > > CPU reports back from the pen. It seems to be the later INIT-SIPI-SIPI
> > > sequence sent from the guest that doesn't complete.
> > > 
> > > My working theory has been that some CPU/AP is lagging behind the others
> > > when the BSP is waiting for all the APs to go into SMM, and the BSP just
> > > gives up and moves on. Presumably the INIT-SIPI-SIPI is sent while that
> > > CPU does finally go into SMM, and other CPUs are in normal mode.
> > > 
> > > I've been able to observe the SMI handler for the problematic CPU will
> > > sometimes start running when no BSP is elected. This means we have a
> > > window of time where the CPU will ignore SIPI, and least 1 CPU is in
> > > normal mode (the BSP) which is capable of sending INIT-SIPI-SIPI from
> > > the guest.  
> > 
> > I've re-read whole thread and noticed Boris were saying:  
> >   > On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:  
> >   > > On 4/16/24 4:53 PM, Paolo Bonzini wrote:    
> >   ...  
> >   > > >
> >   > > > What is the reproducer for this?    
> >   > >
> >   > > Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> >   > > the guest, will get you there in 10-15 minutes.  
> >   ...
> > 
> > So there was unplug involved as well, which was broken since forever.
> > 
> > Recent patch
> >  https://patchew.org/QEMU/20230427211013.2994127-1-alxndr@bu.edu/20230427211013.2994127-2-alxndr@bu.edu/
> > has exposed issue (unexpected uplug/unplug flow) with root cause in OVMF.
> > Firmware was letting non involved APs run wild in normal mode.
> > As result AP that was calling _EJ0 and holding ACPI lock was
> > continuing _EJ0 and releasing ACPI lock, while BSP and a being removed
> > CPU were still in SMM world. And any other plug/unplug op
> > were able to grab ACPI lock and trigger another SMI, which breaks
> > hotplug flow expectations (aka exclusive access to hotplug registers
> > during plug/unplug op)
> > Perhaps that's what you are observing.
> > 
> > Please check if following helps:
> >   https://github.com/kraxel/edk2/commit/738c09f6b5ab87be48d754e62deb72b767415158
> >   
> 
> I haven't actually seen the guest crash during unplug, though certainly
> there have been unplug failures. I haven't been keeping track of the
> unplug failures as closely, but a test I ran over the weekend with this
> patch added seemed to show less unplug failures.

It's not only about unplug, unfortunately.
QEMU that includes Alexander's patch essentially denies access to the hotplug
registers while an unplug is in progress. So if a hotplug is going on at the same
time, it may be broken by that denied access.
To exclude this issue, you need to test with the edk2 fix or use an older QEMU
without Alexander's patch.


> I'm still getting hotplug failures that cause a guest crash though, so
> that mystery remains.
> 
> > So yes, SIPI can be lost (which should be expected as others noted)
> > but that normally shouldn't be an issue as wakeup_secondary_cpu_via_init()
> > do resend SIPI.
> > However if wakeup_secondary_cpu is set to another handler that doesn't
> > resend SIPI, It might be an issue.  
> 
> We're using wakeup_secondary_cpu_via_init(). acpi_wakeup_cpu() and
> wakeup_cpu_via_vmgexit(), for example, are a bit opaque to me, so I'm
> not sure if those code paths include a SIPI resend.

wakeup_secondary_cpu_via_init() should re-send SIPI.
If you can reproduce with KVM tracing and guest kernel debug enabled,
I'd try to do that and check whether SIPIs are being re-sent or not.
That at least should give a hint as to whether we should look at the guest side or at KVM/QEMU.

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index cf37586f0466..4a57b69efc7f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3308,13 +3308,13 @@  int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
 	}
 
 	/*
-	 * INITs are blocked while CPU is in specific states (SMM, VMX root
-	 * mode, SVM with GIF=0), while SIPIs are dropped if the CPU isn't in
-	 * wait-for-SIPI (WFS).
+	 * INIT/SIPI are blocked while CPU is in specific states (SMM, VMX root
+	 * mode, SVM with GIF=0).
 	 */
 	if (!kvm_apic_init_sipi_allowed(vcpu)) {
 		WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
-		clear_bit(KVM_APIC_SIPI, &apic->pending_events);
+		if (!is_smm(vcpu))
+			clear_bit(KVM_APIC_SIPI, &apic->pending_events);
 		return 0;
 	}