x86/vpmu: fix race-condition in vpmu_load

Message ID	8294476a707d7f20799a40479cc0bf9a1cf07673.1663249988.git.tamas.lengyel@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> From: Tamas K Lengyel <tamas.lengyel@intel.com> To: xen-devel@lists.xenproject.org Cc: Tamas K Lengyel <tamas.lengyel@intel.com>, Jan Beulich <jbeulich@suse.com>, Andrew Cooper <andrew.cooper3@citrix.com>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, Wei Liu <wl@xen.org> Subject: [PATCH] x86/vpmu: fix race-condition in vpmu_load Date: Thu, 15 Sep 2022 10:01:59 -0400 Message-Id: <8294476a707d7f20799a40479cc0bf9a1cf07673.1663249988.git.tamas.lengyel@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	x86/vpmu: fix race-condition in vpmu_load \| expand x86/vpmu: fix race-condition in vpmu_load

Tamas K Lengyel Sept. 15, 2022, 2:01 p.m. UTC

While experimenting with the vPMU subsystem an ASSERT failure was
observed in vmx_find_msr because the vcpu_runnable state was true.

The root cause of the bug appears to be the fact that the vPMU subsystem
doesn't save its state on context_switch. The vpmu_load function will attempt
to gather the PMU state if its still loaded two different ways:
    1. if the current pcpu is not where the vcpu ran before doing a remote save
    2. if the current pcpu had another vcpu active before doing a local save

However, in case the prev vcpu is being rescheduled on another pcpu its state
has already changed and vcpu_runnable is returning true, thus #2 will trip the
ASSERT. The only way to avoid this race condition is to make sure the
prev vcpu is paused while being checked and its context saved. Once the prev
vcpu is resumed and does #1 it will find its state already saved.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
---
 xen/arch/x86/cpu/vpmu.c | 2 ++
 1 file changed, 2 insertions(+)

Jan Beulich Sept. 16, 2022, 12:52 p.m. UTC | #1

On 15.09.2022 16:01, Tamas K Lengyel wrote:
> While experimenting with the vPMU subsystem an ASSERT failure was
> observed in vmx_find_msr because the vcpu_runnable state was true.
> 
> The root cause of the bug appears to be the fact that the vPMU subsystem
> doesn't save its state on context_switch. The vpmu_load function will attempt
> to gather the PMU state if its still loaded two different ways:
>     1. if the current pcpu is not where the vcpu ran before doing a remote save
>     2. if the current pcpu had another vcpu active before doing a local save
> 
> However, in case the prev vcpu is being rescheduled on another pcpu its state
> has already changed and vcpu_runnable is returning true, thus #2 will trip the
> ASSERT. The only way to avoid this race condition is to make sure the
> prev vcpu is paused while being checked and its context saved. Once the prev
> vcpu is resumed and does #1 it will find its state already saved.

While I consider this explanation plausible, I'm worried:

> --- a/xen/arch/x86/cpu/vpmu.c
> +++ b/xen/arch/x86/cpu/vpmu.c
> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>          vpmu = vcpu_vpmu(prev);
>  
>          /* Someone ran here before us */
> +        vcpu_pause(prev);
>          vpmu_save_force(prev);
>          vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
> +        vcpu_unpause(prev);
>  
>          vpmu = vcpu_vpmu(v);
>      }

We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
to actually be de-scheduled. Even with IRQs on this is already a
relatively heavy operation (also including its impact on the remote
side). Additionally the function is called from context_switch(), and
I'm unsure of the usability of vcpu_pause() on such a path. In
particular: Is there a risk of two CPUs doing this mutually to one
another? If so, is deadlocking excluded?

Hence at the very least I think the description wants extending, to
discuss the safety of the change.

Boris - any chance you could comment here? Iirc that's code you did
introduce.

Jan

Boris Ostrovsky Sept. 16, 2022, 9:35 p.m. UTC | #2

On 9/16/22 8:52 AM, Jan Beulich wrote:
> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>> While experimenting with the vPMU subsystem an ASSERT failure was
>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>
>> The root cause of the bug appears to be the fact that the vPMU subsystem
>> doesn't save its state on context_switch. The vpmu_load function will attempt
>> to gather the PMU state if its still loaded two different ways:
>>      1. if the current pcpu is not where the vcpu ran before doing a remote save
>>      2. if the current pcpu had another vcpu active before doing a local save
>>
>> However, in case the prev vcpu is being rescheduled on another pcpu its state
>> has already changed and vcpu_runnable is returning true, thus #2 will trip the
>> ASSERT. The only way to avoid this race condition is to make sure the
>> prev vcpu is paused while being checked and its context saved. Once the prev
>> vcpu is resumed and does #1 it will find its state already saved.
> While I consider this explanation plausible, I'm worried:
>
>> --- a/xen/arch/x86/cpu/vpmu.c
>> +++ b/xen/arch/x86/cpu/vpmu.c
>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>>           vpmu = vcpu_vpmu(prev);
>>   
>>           /* Someone ran here before us */
>> +        vcpu_pause(prev);
>>           vpmu_save_force(prev);
>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>> +        vcpu_unpause(prev);
>>   
>>           vpmu = vcpu_vpmu(v);
>>       }
> We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
> to actually be de-scheduled. Even with IRQs on this is already a
> relatively heavy operation (also including its impact on the remote
> side). Additionally the function is called from context_switch(), and
> I'm unsure of the usability of vcpu_pause() on such a path. In
> particular: Is there a risk of two CPUs doing this mutually to one
> another? If so, is deadlocking excluded?
>
> Hence at the very least I think the description wants extending, to
> discuss the safety of the change.
>
> Boris - any chance you could comment here? Iirc that's code you did
> introduce.


Is the assertion in vmx_find_msr() really needs to be for runnable vcpu or can it be a check on whether vcpu is actually running (e.g. RUNSTATE_running)? I think call chain vpmu_load()->..->*_vpmu_save()->...->vmx_find_msr() is the only one where we are doing it for non-current vcpu. If we can guarantee that run state is set after vpmu_load() call (maybe it is already, I haven't checked) then testing the state may avoid the assertion.


-boris

Jan Beulich Sept. 19, 2022, 9:27 a.m. UTC | #3

On 16.09.2022 23:35, Boris Ostrovsky wrote:
> 
> On 9/16/22 8:52 AM, Jan Beulich wrote:
>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>
>>> The root cause of the bug appears to be the fact that the vPMU subsystem
>>> doesn't save its state on context_switch. The vpmu_load function will attempt
>>> to gather the PMU state if its still loaded two different ways:
>>>      1. if the current pcpu is not where the vcpu ran before doing a remote save
>>>      2. if the current pcpu had another vcpu active before doing a local save
>>>
>>> However, in case the prev vcpu is being rescheduled on another pcpu its state
>>> has already changed and vcpu_runnable is returning true, thus #2 will trip the
>>> ASSERT. The only way to avoid this race condition is to make sure the
>>> prev vcpu is paused while being checked and its context saved. Once the prev
>>> vcpu is resumed and does #1 it will find its state already saved.
>> While I consider this explanation plausible, I'm worried:
>>
>>> --- a/xen/arch/x86/cpu/vpmu.c
>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>>>           vpmu = vcpu_vpmu(prev);
>>>   
>>>           /* Someone ran here before us */
>>> +        vcpu_pause(prev);
>>>           vpmu_save_force(prev);
>>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>> +        vcpu_unpause(prev);
>>>   
>>>           vpmu = vcpu_vpmu(v);
>>>       }
>> We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
>> to actually be de-scheduled. Even with IRQs on this is already a
>> relatively heavy operation (also including its impact on the remote
>> side). Additionally the function is called from context_switch(), and
>> I'm unsure of the usability of vcpu_pause() on such a path. In
>> particular: Is there a risk of two CPUs doing this mutually to one
>> another? If so, is deadlocking excluded?
>>
>> Hence at the very least I think the description wants extending, to
>> discuss the safety of the change.
>>
>> Boris - any chance you could comment here? Iirc that's code you did
>> introduce.
> 
> 
> Is the assertion in vmx_find_msr() really needs to be for runnable vcpu or can it be a check on whether vcpu is actually running (e.g. RUNSTATE_running)?

You cannot safely check for "running", as "runnable" may transition
to/from "running" behind your back.

Jan

> I think call chain vpmu_load()->..->*_vpmu_save()->...->vmx_find_msr() is the only one where we are doing it for non-current vcpu. If we can guarantee that run state is set after vpmu_load() call (maybe it is already, I haven't checked) then testing the state may avoid the assertion.
> 
> 
> -boris
>

Tamas K Lengyel Sept. 19, 2022, 12:25 p.m. UTC | #4

On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 16.09.2022 23:35, Boris Ostrovsky wrote:
> >
> > On 9/16/22 8:52 AM, Jan Beulich wrote:
> >> On 15.09.2022 16:01, Tamas K Lengyel wrote:
> >>> While experimenting with the vPMU subsystem an ASSERT failure was
> >>> observed in vmx_find_msr because the vcpu_runnable state was true.
> >>>
> >>> The root cause of the bug appears to be the fact that the vPMU subsystem
> >>> doesn't save its state on context_switch. The vpmu_load function will attempt
> >>> to gather the PMU state if its still loaded two different ways:
> >>>      1. if the current pcpu is not where the vcpu ran before doing a remote save
> >>>      2. if the current pcpu had another vcpu active before doing a local save
> >>>
> >>> However, in case the prev vcpu is being rescheduled on another pcpu its state
> >>> has already changed and vcpu_runnable is returning true, thus #2 will trip the
> >>> ASSERT. The only way to avoid this race condition is to make sure the
> >>> prev vcpu is paused while being checked and its context saved. Once the prev
> >>> vcpu is resumed and does #1 it will find its state already saved.
> >> While I consider this explanation plausible, I'm worried:
> >>
> >>> --- a/xen/arch/x86/cpu/vpmu.c
> >>> +++ b/xen/arch/x86/cpu/vpmu.c
> >>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
> >>>           vpmu = vcpu_vpmu(prev);
> >>>
> >>>           /* Someone ran here before us */
> >>> +        vcpu_pause(prev);
> >>>           vpmu_save_force(prev);
> >>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
> >>> +        vcpu_unpause(prev);
> >>>
> >>>           vpmu = vcpu_vpmu(v);
> >>>       }
> >> We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
> >> to actually be de-scheduled. Even with IRQs on this is already a
> >> relatively heavy operation (also including its impact on the remote
> >> side). Additionally the function is called from context_switch(), and
> >> I'm unsure of the usability of vcpu_pause() on such a path. In
> >> particular: Is there a risk of two CPUs doing this mutually to one
> >> another? If so, is deadlocking excluded?
> >>
> >> Hence at the very least I think the description wants extending, to
> >> discuss the safety of the change.
> >>
> >> Boris - any chance you could comment here? Iirc that's code you did
> >> introduce.
> >
> >
> > Is the assertion in vmx_find_msr() really needs to be for runnable vcpu or can it be a check on whether vcpu is actually running (e.g. RUNSTATE_running)?
>
> You cannot safely check for "running", as "runnable" may transition
> to/from "running" behind your back.

The more I look at this the more I think the only sensible solution is
to have the vPMU state be saved on vmexit for all vCPUs. That way all
this having to figure out where and when a context needs saving during
scheduling goes away. Yes, it adds a bit of overhead for cases where
the vCPU will resume on the same pCPU and that context saved could
have been skipped, but it makes it so that the vCPU can be resumed on
any pCPU without having to have evidently fragile checks that may
potentially lead to deadlocks (TBH I don't know if that's a real
concern at the moment because the current setup is very hard to reason
about). We can still keep track if the context needs reloading from
the saved context and skip that if we know the state is still active.
Any objection to that change in light of these issues?

Tamas

Jan Beulich Sept. 19, 2022, 1:21 p.m. UTC | #5

On 19.09.2022 14:25, Tamas K Lengyel wrote:
> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>
>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>
>>>>> The root cause of the bug appears to be the fact that the vPMU subsystem
>>>>> doesn't save its state on context_switch.

For the further reply below - is this actually true? What is the
vpmu_switch_from() (resolving to vpmu_save()) doing then early
in context_switch()?

>>>>> The vpmu_load function will attempt
>>>>> to gather the PMU state if its still loaded two different ways:
>>>>>      1. if the current pcpu is not where the vcpu ran before doing a remote save
>>>>>      2. if the current pcpu had another vcpu active before doing a local save
>>>>>
>>>>> However, in case the prev vcpu is being rescheduled on another pcpu its state
>>>>> has already changed and vcpu_runnable is returning true, thus #2 will trip the
>>>>> ASSERT. The only way to avoid this race condition is to make sure the
>>>>> prev vcpu is paused while being checked and its context saved. Once the prev
>>>>> vcpu is resumed and does #1 it will find its state already saved.
>>>> While I consider this explanation plausible, I'm worried:
>>>>
>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>>>>>           vpmu = vcpu_vpmu(prev);
>>>>>
>>>>>           /* Someone ran here before us */
>>>>> +        vcpu_pause(prev);
>>>>>           vpmu_save_force(prev);
>>>>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>> +        vcpu_unpause(prev);
>>>>>
>>>>>           vpmu = vcpu_vpmu(v);
>>>>>       }
>>>> We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
>>>> to actually be de-scheduled. Even with IRQs on this is already a
>>>> relatively heavy operation (also including its impact on the remote
>>>> side). Additionally the function is called from context_switch(), and
>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
>>>> particular: Is there a risk of two CPUs doing this mutually to one
>>>> another? If so, is deadlocking excluded?
>>>>
>>>> Hence at the very least I think the description wants extending, to
>>>> discuss the safety of the change.
>>>>
>>>> Boris - any chance you could comment here? Iirc that's code you did
>>>> introduce.
>>>
>>>
>>> Is the assertion in vmx_find_msr() really needs to be for runnable vcpu or can it be a check on whether vcpu is actually running (e.g. RUNSTATE_running)?
>>
>> You cannot safely check for "running", as "runnable" may transition
>> to/from "running" behind your back.
> 
> The more I look at this the more I think the only sensible solution is
> to have the vPMU state be saved on vmexit for all vCPUs.

Do you really mean vmexit? It would suffice if state was reliably
saved during context-switch-out, wouldn't it? At that point the
vCPU can't be resumed on another pCPU, yet.

> That way all
> this having to figure out where and when a context needs saving during
> scheduling goes away. Yes, it adds a bit of overhead for cases where
> the vCPU will resume on the same pCPU and that context saved could
> have been skipped,

If you really mean vmexit, then I'm inclined to say that's more
than just "a bit of overhead". I'd agree if you really meant
context-switch-out, but as said further up it looks to me as if
that was already occurring. Apparently I'm overlooking something
crucial ...

Jan

> but it makes it so that the vCPU can be resumed on
> any pCPU without having to have evidently fragile checks that may
> potentially lead to deadlocks (TBH I don't know if that's a real
> concern at the moment because the current setup is very hard to reason
> about). We can still keep track if the context needs reloading from
> the saved context and skip that if we know the state is still active.
> Any objection to that change in light of these issues?
> 
> Tamas

Tamas K Lengyel Sept. 19, 2022, 1:24 p.m. UTC | #6

On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 19.09.2022 14:25, Tamas K Lengyel wrote:
> > On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 16.09.2022 23:35, Boris Ostrovsky wrote:
> >>>
> >>> On 9/16/22 8:52 AM, Jan Beulich wrote:
> >>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
> >>>>> While experimenting with the vPMU subsystem an ASSERT failure was
> >>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
> >>>>>
> >>>>> The root cause of the bug appears to be the fact that the vPMU subsystem
> >>>>> doesn't save its state on context_switch.
>
> For the further reply below - is this actually true? What is the
> vpmu_switch_from() (resolving to vpmu_save()) doing then early
> in context_switch()?
>
> >>>>> The vpmu_load function will attempt
> >>>>> to gather the PMU state if its still loaded two different ways:
> >>>>>      1. if the current pcpu is not where the vcpu ran before doing a remote save
> >>>>>      2. if the current pcpu had another vcpu active before doing a local save
> >>>>>
> >>>>> However, in case the prev vcpu is being rescheduled on another pcpu its state
> >>>>> has already changed and vcpu_runnable is returning true, thus #2 will trip the
> >>>>> ASSERT. The only way to avoid this race condition is to make sure the
> >>>>> prev vcpu is paused while being checked and its context saved. Once the prev
> >>>>> vcpu is resumed and does #1 it will find its state already saved.
> >>>> While I consider this explanation plausible, I'm worried:
> >>>>
> >>>>> --- a/xen/arch/x86/cpu/vpmu.c
> >>>>> +++ b/xen/arch/x86/cpu/vpmu.c
> >>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
> >>>>>           vpmu = vcpu_vpmu(prev);
> >>>>>
> >>>>>           /* Someone ran here before us */
> >>>>> +        vcpu_pause(prev);
> >>>>>           vpmu_save_force(prev);
> >>>>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
> >>>>> +        vcpu_unpause(prev);
> >>>>>
> >>>>>           vpmu = vcpu_vpmu(v);
> >>>>>       }
> >>>> We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
> >>>> to actually be de-scheduled. Even with IRQs on this is already a
> >>>> relatively heavy operation (also including its impact on the remote
> >>>> side). Additionally the function is called from context_switch(), and
> >>>> I'm unsure of the usability of vcpu_pause() on such a path. In
> >>>> particular: Is there a risk of two CPUs doing this mutually to one
> >>>> another? If so, is deadlocking excluded?
> >>>>
> >>>> Hence at the very least I think the description wants extending, to
> >>>> discuss the safety of the change.
> >>>>
> >>>> Boris - any chance you could comment here? Iirc that's code you did
> >>>> introduce.
> >>>
> >>>
> >>> Is the assertion in vmx_find_msr() really needs to be for runnable vcpu or can it be a check on whether vcpu is actually running (e.g. RUNSTATE_running)?
> >>
> >> You cannot safely check for "running", as "runnable" may transition
> >> to/from "running" behind your back.
> >
> > The more I look at this the more I think the only sensible solution is
> > to have the vPMU state be saved on vmexit for all vCPUs.
>
> Do you really mean vmexit? It would suffice if state was reliably
> saved during context-switch-out, wouldn't it? At that point the
> vCPU can't be resumed on another pCPU, yet.
>
> > That way all
> > this having to figure out where and when a context needs saving during
> > scheduling goes away. Yes, it adds a bit of overhead for cases where
> > the vCPU will resume on the same pCPU and that context saved could
> > have been skipped,
>
> If you really mean vmexit, then I'm inclined to say that's more
> than just "a bit of overhead". I'd agree if you really meant
> context-switch-out, but as said further up it looks to me as if
> that was already occurring. Apparently I'm overlooking something
> crucial ...

Yes, the current setup is doing exactly that, saving the vPMU context
on context-switch-out, and that's where the ASSERT failure occurs
because the vCPU it needs to save the context for is already runnable
on another pCPU.

Tamas

Jan Beulich Sept. 19, 2022, 1:58 p.m. UTC | #7

On 19.09.2022 15:24, Tamas K Lengyel wrote:
> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>
>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>
>>>>>>> The root cause of the bug appears to be the fact that the vPMU subsystem
>>>>>>> doesn't save its state on context_switch.
>>
>> For the further reply below - is this actually true? What is the
>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>> in context_switch()?
>>
>>>>>>> The vpmu_load function will attempt
>>>>>>> to gather the PMU state if its still loaded two different ways:
>>>>>>>      1. if the current pcpu is not where the vcpu ran before doing a remote save
>>>>>>>      2. if the current pcpu had another vcpu active before doing a local save
>>>>>>>
>>>>>>> However, in case the prev vcpu is being rescheduled on another pcpu its state
>>>>>>> has already changed and vcpu_runnable is returning true, thus #2 will trip the
>>>>>>> ASSERT. The only way to avoid this race condition is to make sure the
>>>>>>> prev vcpu is paused while being checked and its context saved. Once the prev
>>>>>>> vcpu is resumed and does #1 it will find its state already saved.
>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>
>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>>>>>>>           vpmu = vcpu_vpmu(prev);
>>>>>>>
>>>>>>>           /* Someone ran here before us */
>>>>>>> +        vcpu_pause(prev);
>>>>>>>           vpmu_save_force(prev);
>>>>>>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>> +        vcpu_unpause(prev);
>>>>>>>
>>>>>>>           vpmu = vcpu_vpmu(v);
>>>>>>>       }
>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
>>>>>> to actually be de-scheduled. Even with IRQs on this is already a
>>>>>> relatively heavy operation (also including its impact on the remote
>>>>>> side). Additionally the function is called from context_switch(), and
>>>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
>>>>>> particular: Is there a risk of two CPUs doing this mutually to one
>>>>>> another? If so, is deadlocking excluded?
>>>>>>
>>>>>> Hence at the very least I think the description wants extending, to
>>>>>> discuss the safety of the change.
>>>>>>
>>>>>> Boris - any chance you could comment here? Iirc that's code you did
>>>>>> introduce.
>>>>>
>>>>>
>>>>> Is the assertion in vmx_find_msr() really needs to be for runnable vcpu or can it be a check on whether vcpu is actually running (e.g. RUNSTATE_running)?
>>>>
>>>> You cannot safely check for "running", as "runnable" may transition
>>>> to/from "running" behind your back.
>>>
>>> The more I look at this the more I think the only sensible solution is
>>> to have the vPMU state be saved on vmexit for all vCPUs.
>>
>> Do you really mean vmexit? It would suffice if state was reliably
>> saved during context-switch-out, wouldn't it? At that point the
>> vCPU can't be resumed on another pCPU, yet.
>>
>>> That way all
>>> this having to figure out where and when a context needs saving during
>>> scheduling goes away. Yes, it adds a bit of overhead for cases where
>>> the vCPU will resume on the same pCPU and that context saved could
>>> have been skipped,
>>
>> If you really mean vmexit, then I'm inclined to say that's more
>> than just "a bit of overhead". I'd agree if you really meant
>> context-switch-out, but as said further up it looks to me as if
>> that was already occurring. Apparently I'm overlooking something
>> crucial ...
> 
> Yes, the current setup is doing exactly that, saving the vPMU context
> on context-switch-out, and that's where the ASSERT failure occurs
> because the vCPU it needs to save the context for is already runnable
> on another pCPU.

Well, if that's the scenario (sorry for not understanding it that
way earlier on), then the assertion is too strict: While in context
switch, the vCPU may be runnable, but certainly won't actually run
anywhere. Therefore I'd be inclined to suggest to relax the
condition just enough to cover this case, without actually going to
checking for "running".

Jan

Tamas K Lengyel Sept. 19, 2022, 2:11 p.m. UTC | #8

On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@suse.com> wrote:

> On 19.09.2022 15:24, Tamas K Lengyel wrote:
> > On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 19.09.2022 14:25, Tamas K Lengyel wrote:
> >>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>>
> >>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
> >>>>>
> >>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
> >>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
> >>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
> >>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
> >>>>>>>
> >>>>>>> The root cause of the bug appears to be the fact that the vPMU
> subsystem
> >>>>>>> doesn't save its state on context_switch.
> >>
> >> For the further reply below - is this actually true? What is the
> >> vpmu_switch_from() (resolving to vpmu_save()) doing then early
> >> in context_switch()?
> >>
> >>>>>>> The vpmu_load function will attempt
> >>>>>>> to gather the PMU state if its still loaded two different ways:
> >>>>>>>      1. if the current pcpu is not where the vcpu ran before doing
> a remote save
> >>>>>>>      2. if the current pcpu had another vcpu active before doing a
> local save
> >>>>>>>
> >>>>>>> However, in case the prev vcpu is being rescheduled on another
> pcpu its state
> >>>>>>> has already changed and vcpu_runnable is returning true, thus #2
> will trip the
> >>>>>>> ASSERT. The only way to avoid this race condition is to make sure
> the
> >>>>>>> prev vcpu is paused while being checked and its context saved.
> Once the prev
> >>>>>>> vcpu is resumed and does #1 it will find its state already saved.
> >>>>>> While I consider this explanation plausible, I'm worried:
> >>>>>>
> >>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
> >>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
> >>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t
> from_guest)
> >>>>>>>           vpmu = vcpu_vpmu(prev);
> >>>>>>>
> >>>>>>>           /* Someone ran here before us */
> >>>>>>> +        vcpu_pause(prev);
> >>>>>>>           vpmu_save_force(prev);
> >>>>>>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
> >>>>>>> +        vcpu_unpause(prev);
> >>>>>>>
> >>>>>>>           vpmu = vcpu_vpmu(v);
> >>>>>>>       }
> >>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the
> vcpu
> >>>>>> to actually be de-scheduled. Even with IRQs on this is already a
> >>>>>> relatively heavy operation (also including its impact on the remote
> >>>>>> side). Additionally the function is called from context_switch(),
> and
> >>>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
> >>>>>> particular: Is there a risk of two CPUs doing this mutually to one
> >>>>>> another? If so, is deadlocking excluded?
> >>>>>>
> >>>>>> Hence at the very least I think the description wants extending, to
> >>>>>> discuss the safety of the change.
> >>>>>>
> >>>>>> Boris - any chance you could comment here? Iirc that's code you did
> >>>>>> introduce.
> >>>>>
> >>>>>
> >>>>> Is the assertion in vmx_find_msr() really needs to be for runnable
> vcpu or can it be a check on whether vcpu is actually running (e.g.
> RUNSTATE_running)?
> >>>>
> >>>> You cannot safely check for "running", as "runnable" may transition
> >>>> to/from "running" behind your back.
> >>>
> >>> The more I look at this the more I think the only sensible solution is
> >>> to have the vPMU state be saved on vmexit for all vCPUs.
> >>
> >> Do you really mean vmexit? It would suffice if state was reliably
> >> saved during context-switch-out, wouldn't it? At that point the
> >> vCPU can't be resumed on another pCPU, yet.
> >>
> >>> That way all
> >>> this having to figure out where and when a context needs saving during
> >>> scheduling goes away. Yes, it adds a bit of overhead for cases where
> >>> the vCPU will resume on the same pCPU and that context saved could
> >>> have been skipped,
> >>
> >> If you really mean vmexit, then I'm inclined to say that's more
> >> than just "a bit of overhead". I'd agree if you really meant
> >> context-switch-out, but as said further up it looks to me as if
> >> that was already occurring. Apparently I'm overlooking something
> >> crucial ...
> >
> > Yes, the current setup is doing exactly that, saving the vPMU context
> > on context-switch-out, and that's where the ASSERT failure occurs
> > because the vCPU it needs to save the context for is already runnable
> > on another pCPU.
>
> Well, if that's the scenario (sorry for not understanding it that
> way earlier on), then the assertion is too strict: While in context
> switch, the vCPU may be runnable, but certainly won't actually run
> anywhere. Therefore I'd be inclined to suggest to relax the
> condition just enough to cover this case, without actually going to
> checking for "running".
>

What ensures the vCPU won't actually run anywhere if it's in the runnable
state? And how do I relax the condition just enough to cover just this case?

Tamas

Jan Beulich Sept. 19, 2022, 2:56 p.m. UTC | #9

On 19.09.2022 16:11, Tamas K Lengyel wrote:
> On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@suse.com> wrote:
> 
>> On 19.09.2022 15:24, Tamas K Lengyel wrote:
>>> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>
>>>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>>>
>>>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>>>
>>>>>>>>> The root cause of the bug appears to be the fact that the vPMU
>> subsystem
>>>>>>>>> doesn't save its state on context_switch.
>>>>
>>>> For the further reply below - is this actually true? What is the
>>>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>>>> in context_switch()?
>>>>
>>>>>>>>> The vpmu_load function will attempt
>>>>>>>>> to gather the PMU state if its still loaded two different ways:
>>>>>>>>>      1. if the current pcpu is not where the vcpu ran before doing
>> a remote save
>>>>>>>>>      2. if the current pcpu had another vcpu active before doing a
>> local save
>>>>>>>>>
>>>>>>>>> However, in case the prev vcpu is being rescheduled on another
>> pcpu its state
>>>>>>>>> has already changed and vcpu_runnable is returning true, thus #2
>> will trip the
>>>>>>>>> ASSERT. The only way to avoid this race condition is to make sure
>> the
>>>>>>>>> prev vcpu is paused while being checked and its context saved.
>> Once the prev
>>>>>>>>> vcpu is resumed and does #1 it will find its state already saved.
>>>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>>>
>>>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t
>> from_guest)
>>>>>>>>>           vpmu = vcpu_vpmu(prev);
>>>>>>>>>
>>>>>>>>>           /* Someone ran here before us */
>>>>>>>>> +        vcpu_pause(prev);
>>>>>>>>>           vpmu_save_force(prev);
>>>>>>>>>           vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>>>> +        vcpu_unpause(prev);
>>>>>>>>>
>>>>>>>>>           vpmu = vcpu_vpmu(v);
>>>>>>>>>       }
>>>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the
>> vcpu
>>>>>>>> to actually be de-scheduled. Even with IRQs on this is already a
>>>>>>>> relatively heavy operation (also including its impact on the remote
>>>>>>>> side). Additionally the function is called from context_switch(),
>> and
>>>>>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
>>>>>>>> particular: Is there a risk of two CPUs doing this mutually to one
>>>>>>>> another? If so, is deadlocking excluded?
>>>>>>>>
>>>>>>>> Hence at the very least I think the description wants extending, to
>>>>>>>> discuss the safety of the change.
>>>>>>>>
>>>>>>>> Boris - any chance you could comment here? Iirc that's code you did
>>>>>>>> introduce.
>>>>>>>
>>>>>>>
>>>>>>> Is the assertion in vmx_find_msr() really needs to be for runnable
>> vcpu or can it be a check on whether vcpu is actually running (e.g.
>> RUNSTATE_running)?
>>>>>>
>>>>>> You cannot safely check for "running", as "runnable" may transition
>>>>>> to/from "running" behind your back.
>>>>>
>>>>> The more I look at this the more I think the only sensible solution is
>>>>> to have the vPMU state be saved on vmexit for all vCPUs.
>>>>
>>>> Do you really mean vmexit? It would suffice if state was reliably
>>>> saved during context-switch-out, wouldn't it? At that point the
>>>> vCPU can't be resumed on another pCPU, yet.
>>>>
>>>>> That way all
>>>>> this having to figure out where and when a context needs saving during
>>>>> scheduling goes away. Yes, it adds a bit of overhead for cases where
>>>>> the vCPU will resume on the same pCPU and that context saved could
>>>>> have been skipped,
>>>>
>>>> If you really mean vmexit, then I'm inclined to say that's more
>>>> than just "a bit of overhead". I'd agree if you really meant
>>>> context-switch-out, but as said further up it looks to me as if
>>>> that was already occurring. Apparently I'm overlooking something
>>>> crucial ...
>>>
>>> Yes, the current setup is doing exactly that, saving the vPMU context
>>> on context-switch-out, and that's where the ASSERT failure occurs
>>> because the vCPU it needs to save the context for is already runnable
>>> on another pCPU.
>>
>> Well, if that's the scenario (sorry for not understanding it that
>> way earlier on), then the assertion is too strict: While in context
>> switch, the vCPU may be runnable, but certainly won't actually run
>> anywhere. Therefore I'd be inclined to suggest to relax the
>> condition just enough to cover this case, without actually going to
>> checking for "running".
>>
> 
> What ensures the vCPU won't actually run anywhere if it's in the runnable
> state?

The fact that the vCPU is the subject of context_switch().

> And how do I relax the condition just enough to cover just this case?

That's the more difficult question. The immediate solution, passing a
boolean or flag to vpmu_switch_from(), may not be practical, as it
would likely mean passing this through many layers.

Utilizing that I have Jürgen sitting next to me, I've discussed this
with him, and he suggested to simply check for v == current. And
indeed - set_current() in context_switch() happens a few lines after
vpmu_switch_from().

However, going back to vmx_find_msr() I find that the v == current
case is already included there. Which makes me wonder again - what
exactly is the scenario that you're observing the assertion
triggering? Would you mind spelling out the call chain, perhaps by
way of the call stack from the assertion?

Jan

Boris Ostrovsky Sept. 19, 2022, 10:42 p.m. UTC | #10

On 9/19/22 10:56 AM, Jan Beulich wrote:
> On 19.09.2022 16:11, Tamas K Lengyel wrote:
>> On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>>> On 19.09.2022 15:24, Tamas K Lengyel wrote:
>>>> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>
>>>>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>>>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>
>>>>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>>>>
>>>>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>>>>
>>>>>>>>>> The root cause of the bug appears to be the fact that the vPMU
>>> subsystem
>>>>>>>>>> doesn't save its state on context_switch.
>>>>>
>>>>> For the further reply below - is this actually true? What is the
>>>>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>>>>> in context_switch()?
>>>>>
>>>>>>>>>> The vpmu_load function will attempt
>>>>>>>>>> to gather the PMU state if its still loaded two different ways:
>>>>>>>>>>       1. if the current pcpu is not where the vcpu ran before doing
>>> a remote save
>>>>>>>>>>       2. if the current pcpu had another vcpu active before doing a
>>> local save
>>>>>>>>>>
>>>>>>>>>> However, in case the prev vcpu is being rescheduled on another
>>> pcpu its state
>>>>>>>>>> has already changed and vcpu_runnable is returning true, thus #2
>>> will trip the
>>>>>>>>>> ASSERT. The only way to avoid this race condition is to make sure
>>> the
>>>>>>>>>> prev vcpu is paused while being checked and its context saved.
>>> Once the prev
>>>>>>>>>> vcpu is resumed and does #1 it will find its state already saved.
>>>>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>>>>
>>>>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t
>>> from_guest)
>>>>>>>>>>            vpmu = vcpu_vpmu(prev);
>>>>>>>>>>
>>>>>>>>>>            /* Someone ran here before us */
>>>>>>>>>> +        vcpu_pause(prev);
>>>>>>>>>>            vpmu_save_force(prev);
>>>>>>>>>>            vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>>>>> +        vcpu_unpause(prev);
>>>>>>>>>>
>>>>>>>>>>            vpmu = vcpu_vpmu(v);
>>>>>>>>>>        }
>>>>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the
>>> vcpu
>>>>>>>>> to actually be de-scheduled. Even with IRQs on this is already a
>>>>>>>>> relatively heavy operation (also including its impact on the remote
>>>>>>>>> side). Additionally the function is called from context_switch(),
>>> and
>>>>>>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
>>>>>>>>> particular: Is there a risk of two CPUs doing this mutually to one
>>>>>>>>> another? If so, is deadlocking excluded?
>>>>>>>>>
>>>>>>>>> Hence at the very least I think the description wants extending, to
>>>>>>>>> discuss the safety of the change.
>>>>>>>>>
>>>>>>>>> Boris - any chance you could comment here? Iirc that's code you did
>>>>>>>>> introduce.
>>>>>>>>
>>>>>>>>
>>>>>>>> Is the assertion in vmx_find_msr() really needs to be for runnable
>>> vcpu or can it be a check on whether vcpu is actually running (e.g.
>>> RUNSTATE_running)?
>>>>>>>
>>>>>>> You cannot safely check for "running", as "runnable" may transition
>>>>>>> to/from "running" behind your back.
>>>>>>
>>>>>> The more I look at this the more I think the only sensible solution is
>>>>>> to have the vPMU state be saved on vmexit for all vCPUs.
>>>>>
>>>>> Do you really mean vmexit? It would suffice if state was reliably
>>>>> saved during context-switch-out, wouldn't it? At that point the
>>>>> vCPU can't be resumed on another pCPU, yet.
>>>>>
>>>>>> That way all
>>>>>> this having to figure out where and when a context needs saving during
>>>>>> scheduling goes away. Yes, it adds a bit of overhead for cases where
>>>>>> the vCPU will resume on the same pCPU and that context saved could
>>>>>> have been skipped,
>>>>>
>>>>> If you really mean vmexit, then I'm inclined to say that's more
>>>>> than just "a bit of overhead". I'd agree if you really meant
>>>>> context-switch-out, but as said further up it looks to me as if
>>>>> that was already occurring. Apparently I'm overlooking something
>>>>> crucial ...
>>>>
>>>> Yes, the current setup is doing exactly that, saving the vPMU context
>>>> on context-switch-out, and that's where the ASSERT failure occurs
>>>> because the vCPU it needs to save the context for is already runnable
>>>> on another pCPU.
>>>
>>> Well, if that's the scenario (sorry for not understanding it that
>>> way earlier on), then the assertion is too strict: While in context
>>> switch, the vCPU may be runnable, but certainly won't actually run
>>> anywhere. Therefore I'd be inclined to suggest to relax the
>>> condition just enough to cover this case, without actually going to
>>> checking for "running".
>>>
>>
>> What ensures the vCPU won't actually run anywhere if it's in the runnable
>> state?
> 
> The fact that the vCPU is the subject of context_switch().
> 
>> And how do I relax the condition just enough to cover just this case?
> 
> That's the more difficult question. The immediate solution, passing a
> boolean or flag to vpmu_switch_from(), may not be practical, as it
> would likely mean passing this through many layers.
> 
> Utilizing that I have Jürgen sitting next to me, I've discussed this
> with him, and he suggested to simply check for v == current. And
> indeed - set_current() in context_switch() happens a few lines after
> vpmu_switch_from().



It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v
in vmx_find_msr() is not @current:

      vpmu_load()
          ...
          prev = per_cpu(last_vcpu, pcpu);
          vpmu_save_force(prev)
              core2_vpmu_save()
                  __core2_vpmu_save()
                      vmx_read_guest_msr()
                          vmx_find_msr()


The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether
this call is needed when code path above is executed (i.e. when we are saving
remove vcpu)


-boris

> 
> However, going back to vmx_find_msr() I find that the v == current
> case is already included there. Which makes me wonder again - what
> exactly is the scenario that you're observing the assertion
> triggering? Would you mind spelling out the call chain, perhaps by
> way of the call stack from the assertion?
> 
> Jan

Jan Beulich Sept. 20, 2022, 8:01 a.m. UTC | #11

On 20.09.2022 00:42, Boris Ostrovsky wrote:
> 
> 
> On 9/19/22 10:56 AM, Jan Beulich wrote:
>> On 19.09.2022 16:11, Tamas K Lengyel wrote:
>>> On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>
>>>> On 19.09.2022 15:24, Tamas K Lengyel wrote:
>>>>> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>
>>>>>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>>>>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>>>
>>>>>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>>>>>
>>>>>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>>>>>
>>>>>>>>>>> The root cause of the bug appears to be the fact that the vPMU
>>>> subsystem
>>>>>>>>>>> doesn't save its state on context_switch.
>>>>>>
>>>>>> For the further reply below - is this actually true? What is the
>>>>>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>>>>>> in context_switch()?
>>>>>>
>>>>>>>>>>> The vpmu_load function will attempt
>>>>>>>>>>> to gather the PMU state if its still loaded two different ways:
>>>>>>>>>>>       1. if the current pcpu is not where the vcpu ran before doing
>>>> a remote save
>>>>>>>>>>>       2. if the current pcpu had another vcpu active before doing a
>>>> local save
>>>>>>>>>>>
>>>>>>>>>>> However, in case the prev vcpu is being rescheduled on another
>>>> pcpu its state
>>>>>>>>>>> has already changed and vcpu_runnable is returning true, thus #2
>>>> will trip the
>>>>>>>>>>> ASSERT. The only way to avoid this race condition is to make sure
>>>> the
>>>>>>>>>>> prev vcpu is paused while being checked and its context saved.
>>>> Once the prev
>>>>>>>>>>> vcpu is resumed and does #1 it will find its state already saved.
>>>>>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>>>>>
>>>>>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t
>>>> from_guest)
>>>>>>>>>>>            vpmu = vcpu_vpmu(prev);
>>>>>>>>>>>
>>>>>>>>>>>            /* Someone ran here before us */
>>>>>>>>>>> +        vcpu_pause(prev);
>>>>>>>>>>>            vpmu_save_force(prev);
>>>>>>>>>>>            vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>>>>>> +        vcpu_unpause(prev);
>>>>>>>>>>>
>>>>>>>>>>>            vpmu = vcpu_vpmu(v);
>>>>>>>>>>>        }
>>>>>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the
>>>> vcpu
>>>>>>>>>> to actually be de-scheduled. Even with IRQs on this is already a
>>>>>>>>>> relatively heavy operation (also including its impact on the remote
>>>>>>>>>> side). Additionally the function is called from context_switch(),
>>>> and
>>>>>>>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
>>>>>>>>>> particular: Is there a risk of two CPUs doing this mutually to one
>>>>>>>>>> another? If so, is deadlocking excluded?
>>>>>>>>>>
>>>>>>>>>> Hence at the very least I think the description wants extending, to
>>>>>>>>>> discuss the safety of the change.
>>>>>>>>>>
>>>>>>>>>> Boris - any chance you could comment here? Iirc that's code you did
>>>>>>>>>> introduce.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is the assertion in vmx_find_msr() really needs to be for runnable
>>>> vcpu or can it be a check on whether vcpu is actually running (e.g.
>>>> RUNSTATE_running)?
>>>>>>>>
>>>>>>>> You cannot safely check for "running", as "runnable" may transition
>>>>>>>> to/from "running" behind your back.
>>>>>>>
>>>>>>> The more I look at this the more I think the only sensible solution is
>>>>>>> to have the vPMU state be saved on vmexit for all vCPUs.
>>>>>>
>>>>>> Do you really mean vmexit? It would suffice if state was reliably
>>>>>> saved during context-switch-out, wouldn't it? At that point the
>>>>>> vCPU can't be resumed on another pCPU, yet.
>>>>>>
>>>>>>> That way all
>>>>>>> this having to figure out where and when a context needs saving during
>>>>>>> scheduling goes away. Yes, it adds a bit of overhead for cases where
>>>>>>> the vCPU will resume on the same pCPU and that context saved could
>>>>>>> have been skipped,
>>>>>>
>>>>>> If you really mean vmexit, then I'm inclined to say that's more
>>>>>> than just "a bit of overhead". I'd agree if you really meant
>>>>>> context-switch-out, but as said further up it looks to me as if
>>>>>> that was already occurring. Apparently I'm overlooking something
>>>>>> crucial ...
>>>>>
>>>>> Yes, the current setup is doing exactly that, saving the vPMU context
>>>>> on context-switch-out, and that's where the ASSERT failure occurs
>>>>> because the vCPU it needs to save the context for is already runnable
>>>>> on another pCPU.
>>>>
>>>> Well, if that's the scenario (sorry for not understanding it that
>>>> way earlier on), then the assertion is too strict: While in context
>>>> switch, the vCPU may be runnable, but certainly won't actually run
>>>> anywhere. Therefore I'd be inclined to suggest to relax the
>>>> condition just enough to cover this case, without actually going to
>>>> checking for "running".
>>>>
>>>
>>> What ensures the vCPU won't actually run anywhere if it's in the runnable
>>> state?
>>
>> The fact that the vCPU is the subject of context_switch().
>>
>>> And how do I relax the condition just enough to cover just this case?
>>
>> That's the more difficult question. The immediate solution, passing a
>> boolean or flag to vpmu_switch_from(), may not be practical, as it
>> would likely mean passing this through many layers.
>>
>> Utilizing that I have Jürgen sitting next to me, I've discussed this
>> with him, and he suggested to simply check for v == current. And
>> indeed - set_current() in context_switch() happens a few lines after
>> vpmu_switch_from().
> 
> 
> 
> It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v
> in vmx_find_msr() is not @current:
> 
>       vpmu_load()
>           ...
>           prev = per_cpu(last_vcpu, pcpu);
>           vpmu_save_force(prev)
>               core2_vpmu_save()
>                   __core2_vpmu_save()
>                       vmx_read_guest_msr()
>                           vmx_find_msr()
> 
> 
> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether
> this call is needed when code path above is executed (i.e. when we are saving
> remove vcpu)

How could it not be needed? We need to obtain the guest value. The
thing I don't understand is why this forced saving is necessary,
when context_switch() unconditionally calls vpmu_switch_from().

Jan

Boris Ostrovsky Sept. 20, 2022, 2:26 p.m. UTC | #12

On 9/20/22 4:01 AM, Jan Beulich wrote:
> On 20.09.2022 00:42, Boris Ostrovsky wrote:

>>
>> It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v
>> in vmx_find_msr() is not @current:
>>
>>        vpmu_load()
>>            ...
>>            prev = per_cpu(last_vcpu, pcpu);
>>            vpmu_save_force(prev)
>>                core2_vpmu_save()
>>                    __core2_vpmu_save()
>>                        vmx_read_guest_msr()
>>                            vmx_find_msr()
>>
>>
>> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether
>> this call is needed when code path above is executed (i.e. when we are saving
>> remove vcpu)
> 
> How could it not be needed? We need to obtain the guest value. The
> thing I don't understand is why this forced saving is necessary,
> when context_switch() unconditionally calls vpmu_switch_from().


IIRC the logic is:

1. vcpuA runs on pcpu0
2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not yet called vpmu_load() from pcpu1
3. vcpuB is ready to run on pcpu0, calls vpmu_load()
4. vcpuB discovers that pcpu0's MSRs are still holding values from vcpuA
5. vcpuB calls vpmu_force_save(vcpuA) which stashes pcpu0's MSRs into vcpuA's vpmu context.


-boris

Jan Beulich Sept. 20, 2022, 2:54 p.m. UTC | #13

On 20.09.2022 16:26, Boris Ostrovsky wrote:
> On 9/20/22 4:01 AM, Jan Beulich wrote:
>> On 20.09.2022 00:42, Boris Ostrovsky wrote:
>>> It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v
>>> in vmx_find_msr() is not @current:
>>>
>>>        vpmu_load()
>>>            ...
>>>            prev = per_cpu(last_vcpu, pcpu);
>>>            vpmu_save_force(prev)
>>>                core2_vpmu_save()
>>>                    __core2_vpmu_save()
>>>                        vmx_read_guest_msr()
>>>                            vmx_find_msr()
>>>
>>>
>>> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether
>>> this call is needed when code path above is executed (i.e. when we are saving
>>> remove vcpu)
>>
>> How could it not be needed? We need to obtain the guest value. The
>> thing I don't understand is why this forced saving is necessary,
>> when context_switch() unconditionally calls vpmu_switch_from().
> 
> 
> IIRC the logic is:
> 
> 1. vcpuA runs on pcpu0
> 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not yet called vpmu_load() from pcpu1

The calling of vpmu_load() shouldn't matter here. What does matter is
that vpmu_save() was necessarily called already. Therefore I'm having
trouble seeing why ...

> 3. vcpuB is ready to run on pcpu0, calls vpmu_load()
> 4. vcpuB discovers that pcpu0's MSRs are still holding values from vcpuA
> 5. vcpuB calls vpmu_force_save(vcpuA) which stashes pcpu0's MSRs into vcpuA's vpmu context.

... forced saving would be necessary here. What's necessary at this
point is only the loading of vcpuB's MSR values.

Jan

Boris Ostrovsky Sept. 20, 2022, 6:12 p.m. UTC | #14

On 9/20/22 10:54 AM, Jan Beulich wrote:
> On 20.09.2022 16:26, Boris Ostrovsky wrote:
>> On 9/20/22 4:01 AM, Jan Beulich wrote:
>>> On 20.09.2022 00:42, Boris Ostrovsky wrote:
>>>> It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v
>>>> in vmx_find_msr() is not @current:
>>>>
>>>>         vpmu_load()
>>>>             ...
>>>>             prev = per_cpu(last_vcpu, pcpu);
>>>>             vpmu_save_force(prev)
>>>>                 core2_vpmu_save()
>>>>                     __core2_vpmu_save()
>>>>                         vmx_read_guest_msr()
>>>>                             vmx_find_msr()
>>>>
>>>>
>>>> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether
>>>> this call is needed when code path above is executed (i.e. when we are saving
>>>> remove vcpu)
>>> How could it not be needed? We need to obtain the guest value. The
>>> thing I don't understand is why this forced saving is necessary,
>>> when context_switch() unconditionally calls vpmu_switch_from().
>>
>> IIRC the logic is:
>>
>> 1. vcpuA runs on pcpu0
>> 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not yet called vpmu_load() from pcpu1
> The calling of vpmu_load() shouldn't matter here. What does matter is
> that vpmu_save() was necessarily called already.


I thought we don't always save MSRs on context switch because we optimize for case when the same vcpu gets scheduled again. But I am not sure I see this now that I am looking at the code.


-boris



>   Therefore I'm having
> trouble seeing why ...
>
>> 3. vcpuB is ready to run on pcpu0, calls vpmu_load()
>> 4. vcpuB discovers that pcpu0's MSRs are still holding values from vcpuA
>> 5. vcpuB calls vpmu_force_save(vcpuA) which stashes pcpu0's MSRs into vcpuA's vpmu context.
> ... forced saving would be necessary here. What's necessary at this
> point is only the loading of vcpuB's MSR values.

Tamas K Lengyel Sept. 20, 2022, 6:27 p.m. UTC | #15

On Tue, Sep 20, 2022 at 2:12 PM Boris Ostrovsky <boris.ostrovsky@oracle.com>
wrote:

>
> On 9/20/22 10:54 AM, Jan Beulich wrote:
> > On 20.09.2022 16:26, Boris Ostrovsky wrote:
> >> On 9/20/22 4:01 AM, Jan Beulich wrote:
> >>> On 20.09.2022 00:42, Boris Ostrovsky wrote:
> >>>> It is saving vpmu data from current pcpu's MSRs for a remote vcpu so
> @v
> >>>> in vmx_find_msr() is not @current:
> >>>>
> >>>>         vpmu_load()
> >>>>             ...
> >>>>             prev = per_cpu(last_vcpu, pcpu);
> >>>>             vpmu_save_force(prev)
> >>>>                 core2_vpmu_save()
> >>>>                     __core2_vpmu_save()
> >>>>                         vmx_read_guest_msr()
> >>>>                             vmx_find_msr()
> >>>>
> >>>>
> >>>> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder
> though whether
> >>>> this call is needed when code path above is executed (i.e. when we
> are saving
> >>>> remove vcpu)
> >>> How could it not be needed? We need to obtain the guest value. The
> >>> thing I don't understand is why this forced saving is necessary,
> >>> when context_switch() unconditionally calls vpmu_switch_from().
> >>
> >> IIRC the logic is:
> >>
> >> 1. vcpuA runs on pcpu0
> >> 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not
> yet called vpmu_load() from pcpu1
> > The calling of vpmu_load() shouldn't matter here. What does matter is
> > that vpmu_save() was necessarily called already.
>
>
> I thought we don't always save MSRs on context switch because we optimize
> for case when the same vcpu gets scheduled again. But I am not sure I see
> this now that I am looking at the code.
>

I see context_switch calling vpmu_save_from before __context_switch, but if
that did actually save the vPMU state it would have reset
VPMU_CONTEXT_LOADED, so by the time vpmu_load calls vpmu_save_force it
would have just bailed before hitting the ASSERT failure in vmx_find_msrs.
The context must still be loaded at that point (I'm trying to verify right
now by adding some printks).

Tamas

Tamas K Lengyel Sept. 22, 2022, 1:35 p.m. UTC | #16

On Tue, Sep 20, 2022 at 2:27 PM Tamas K Lengyel <tamas.k.lengyel@gmail.com>
wrote:

>
>
> On Tue, Sep 20, 2022 at 2:12 PM Boris Ostrovsky <
> boris.ostrovsky@oracle.com> wrote:
>
>>
>> On 9/20/22 10:54 AM, Jan Beulich wrote:
>> > On 20.09.2022 16:26, Boris Ostrovsky wrote:
>> >> On 9/20/22 4:01 AM, Jan Beulich wrote:
>> >>> On 20.09.2022 00:42, Boris Ostrovsky wrote:
>> >>>> It is saving vpmu data from current pcpu's MSRs for a remote vcpu so
>> @v
>> >>>> in vmx_find_msr() is not @current:
>> >>>>
>> >>>>         vpmu_load()
>> >>>>             ...
>> >>>>             prev = per_cpu(last_vcpu, pcpu);
>> >>>>             vpmu_save_force(prev)
>> >>>>                 core2_vpmu_save()
>> >>>>                     __core2_vpmu_save()
>> >>>>                         vmx_read_guest_msr()
>> >>>>                             vmx_find_msr()
>> >>>>
>> >>>>
>> >>>> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder
>> though whether
>> >>>> this call is needed when code path above is executed (i.e. when we
>> are saving
>> >>>> remove vcpu)
>> >>> How could it not be needed? We need to obtain the guest value. The
>> >>> thing I don't understand is why this forced saving is necessary,
>> >>> when context_switch() unconditionally calls vpmu_switch_from().
>> >>
>> >> IIRC the logic is:
>> >>
>> >> 1. vcpuA runs on pcpu0
>> >> 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not
>> yet called vpmu_load() from pcpu1
>> > The calling of vpmu_load() shouldn't matter here. What does matter is
>> > that vpmu_save() was necessarily called already.
>>
>>
>> I thought we don't always save MSRs on context switch because we optimize
>> for case when the same vcpu gets scheduled again. But I am not sure I see
>> this now that I am looking at the code.
>>
>
> I see context_switch calling vpmu_save_from before __context_switch, but
> if that did actually save the vPMU state it would have reset
> VPMU_CONTEXT_LOADED, so by the time vpmu_load calls vpmu_save_force it
> would have just bailed before hitting the ASSERT failure in vmx_find_msrs.
> The context must still be loaded at that point (I'm trying to verify right
> now by adding some printks).
>

OK, Boris was correct above, MSRs are not saved on context switch
automatically because of that optimization. VPMU_CONTEXT_SAVE isn't set, so
the only thing vpmu_switch_from does is set global control to 0 in case it
was a PV vCPU (see core2_vpmu_save checks for both VPMU_CONTEXT_SAVE and
VPMU_CONTEXT_LOADED) and vpmu_switch_from doesn't set VPMU_CONTEXT_SAVE. So
for HVM vCPUs it does nothing, that's why we still see the context still
loaded when we get to vpmu_load.

It would be a simple enough change to make vpmu_switch_from always save the
context and then we could get rid of vpmu_load trying to do it later and
getting into that ASSERT failure. Thoughts?

Tamas

Jan Beulich Sept. 22, 2022, 7:04 p.m. UTC | #17

On 22.09.2022 15:35, Tamas K Lengyel wrote:
> On Tue, Sep 20, 2022 at 2:27 PM Tamas K Lengyel <tamas.k.lengyel@gmail.com>
> wrote:
> 
>>
>>
>> On Tue, Sep 20, 2022 at 2:12 PM Boris Ostrovsky <
>> boris.ostrovsky@oracle.com> wrote:
>>
>>>
>>> On 9/20/22 10:54 AM, Jan Beulich wrote:
>>>> On 20.09.2022 16:26, Boris Ostrovsky wrote:
>>>>> On 9/20/22 4:01 AM, Jan Beulich wrote:
>>>>>> On 20.09.2022 00:42, Boris Ostrovsky wrote:
>>>>>>> It is saving vpmu data from current pcpu's MSRs for a remote vcpu so
>>> @v
>>>>>>> in vmx_find_msr() is not @current:
>>>>>>>
>>>>>>>         vpmu_load()
>>>>>>>             ...
>>>>>>>             prev = per_cpu(last_vcpu, pcpu);
>>>>>>>             vpmu_save_force(prev)
>>>>>>>                 core2_vpmu_save()
>>>>>>>                     __core2_vpmu_save()
>>>>>>>                         vmx_read_guest_msr()
>>>>>>>                             vmx_find_msr()
>>>>>>>
>>>>>>>
>>>>>>> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder
>>> though whether
>>>>>>> this call is needed when code path above is executed (i.e. when we
>>> are saving
>>>>>>> remove vcpu)
>>>>>> How could it not be needed? We need to obtain the guest value. The
>>>>>> thing I don't understand is why this forced saving is necessary,
>>>>>> when context_switch() unconditionally calls vpmu_switch_from().
>>>>>
>>>>> IIRC the logic is:
>>>>>
>>>>> 1. vcpuA runs on pcpu0
>>>>> 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not
>>> yet called vpmu_load() from pcpu1
>>>> The calling of vpmu_load() shouldn't matter here. What does matter is
>>>> that vpmu_save() was necessarily called already.
>>>
>>>
>>> I thought we don't always save MSRs on context switch because we optimize
>>> for case when the same vcpu gets scheduled again. But I am not sure I see
>>> this now that I am looking at the code.
>>>
>>
>> I see context_switch calling vpmu_save_from before __context_switch, but
>> if that did actually save the vPMU state it would have reset
>> VPMU_CONTEXT_LOADED, so by the time vpmu_load calls vpmu_save_force it
>> would have just bailed before hitting the ASSERT failure in vmx_find_msrs.
>> The context must still be loaded at that point (I'm trying to verify right
>> now by adding some printks).
>>
> 
> OK, Boris was correct above, MSRs are not saved on context switch
> automatically because of that optimization. VPMU_CONTEXT_SAVE isn't set, so
> the only thing vpmu_switch_from does is set global control to 0 in case it
> was a PV vCPU (see core2_vpmu_save checks for both VPMU_CONTEXT_SAVE and
> VPMU_CONTEXT_LOADED) and vpmu_switch_from doesn't set VPMU_CONTEXT_SAVE. So
> for HVM vCPUs it does nothing, that's why we still see the context still
> loaded when we get to vpmu_load.
> 
> It would be a simple enough change to make vpmu_switch_from always save the
> context and then we could get rid of vpmu_load trying to do it later and
> getting into that ASSERT failure. Thoughts?

I'd much prefer that over e.g. the vCPU-pausing approach. I also think
vpmu_save() is misnamed if it might not really save anything.

Jan

x86/vpmu: fix race-condition in vpmu_load

Commit Message

Comments

Patch