Message ID | 20190501042249.1218-1-tamas@tklengyel.com (mailing list archive)
State      | New, archived
Series     | x86/vm_event: correctly gather gs_shadow value
On 5/1/19 7:22 AM, Tamas K Lengyel wrote:
> Currently the gs_shadow value is only cached when the vCPU is being scheduled
> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> since it doesn't represent the actual state of the vCPU at the time the event
> was recorded. This prevents vm_event subscribers from correctly finding kernel
> structures in the guest when it is trapped while in ring3.

Isn't the VCPU always scheduled out (since it is paused) with sync
vm_events? Is this an async fix only?

In any case,

Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>

Thanks,
Razvan
On 01/05/2019 08:17, Razvan Cojocaru wrote:
> On 5/1/19 7:22 AM, Tamas K Lengyel wrote:
>> Currently the gs_shadow value is only cached when the vCPU is being scheduled
>> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
>> since it doesn't represent the actual state of the vCPU at the time the event
>> was recorded. This prevents vm_event subscribers from correctly finding kernel
>> structures in the guest when it is trapped while in ring3.
> Isn't the VCPU always scheduled out (since it is paused) with sync
> vm_events? Is this an async fix only?

Paused isn't always scheduled out. It can be half-idle with state still
in hardware.

>
> In any case,
>
> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
On 01/05/2019 05:22, Tamas K Lengyel wrote:
> Currently the gs_shadow value is only cached when the vCPU is being scheduled
> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> since it doesn't represent the actual state of the vCPU at the time the event
> was recorded. This prevents vm_event subscribers from correctly finding kernel
> structures in the guest when it is trapped while in ring3.
>
> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Roger Pau Monne <roger.pau@citrix.com>
> ---
>  xen/arch/x86/vm_event.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
> index 51c3493b1d..4464940da7 100644
> --- a/xen/arch/x86/vm_event.c
> +++ b/xen/arch/x86/vm_event.c

Actually, come to think of it, the same is true for the SYSENTER
details, which by default are read/write to the guest while it is
scheduled. As a result, the details reported here will be from the last
vcpu context switch, and possibly stale.

It might be worth introducing a "sync state from hw" hook which collects
all the data we intend to pass to the introspection agent.

~Andrew
On Wed, May 1, 2019 at 1:50 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 01/05/2019 05:22, Tamas K Lengyel wrote:
> > Currently the gs_shadow value is only cached when the vCPU is being scheduled
> > out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> > since it doesn't represent the actual state of the vCPU at the time the event
> > was recorded. This prevents vm_event subscribers from correctly finding kernel
> > structures in the guest when it is trapped while in ring3.
> >
> > Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> > Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: Wei Liu <wei.liu2@citrix.com>
> > Cc: Roger Pau Monne <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/vm_event.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
> > index 51c3493b1d..4464940da7 100644
> > --- a/xen/arch/x86/vm_event.c
> > +++ b/xen/arch/x86/vm_event.c
>
> Actually, come to think of it, the same is true for the SYSENTER
> details, which by default are read/write to the guest while it is
> scheduled. As a result, the details reported here will be from the last
> vcpu context switch, and possibly stale.

I'll look into it.

> It might be worth introducing a "sync state from hw" hook which collects
> all the data we intend to pass to the introspection agent.

You mean adding another hvm hook?

Thanks,
Tamas
> > It might be worth introducing a "sync state from hw" hook which collects
> > all the data we intend to pass to the introspection agent.
>
> You mean adding another hvm hook?

Actually, instead of another hook I think what would make sense is to
just update vmx_save_vmcs_ctxt to automatically refresh the cached
register values when it's called with "v == current". Thoughts?

Tamas
On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
>>> It might be worth introducing a "sync state from hw" hook which collects
>>> all the data we intend to pass to the introspection agent.
>>
>> You mean adding another hvm hook?
>
> Actually, instead of another hook I think what would make sense is to
> just update vmx_save_vmcs_ctxt to automatically refresh the cached
> register values when it's called with "v == current". Thoughts?

That's probably the better way to go about it, since otherwise the
xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
(There are two ways of getting guest state: one is via the vm_event
cached values, the other is via the aforementioned hypercall.)

Thanks,
Razvan
On Wed, May 1, 2019 at 8:20 AM Razvan Cojocaru <rcojocaru@bitdefender.com> wrote:
>
> On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
> >>> It might be worth introducing a "sync state from hw" hook which collects
> >>> all the data we intend to pass to the introspection agent.
> >>
> >> You mean adding another hvm hook?
> >
> > Actually, instead of another hook I think what would make sense is to
> > just update vmx_save_vmcs_ctxt to automatically refresh the cached
> > register values when it's called with "v == current". Thoughts?
>
> That's probably the better way to go about it, since otherwise the
> xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
> (there are two ways of getting guest state: one is via the vm_event
> cached values, the other is via the aforementioned hypercall).

True, although issuing the hypercall in the vm_event callback is
actually fine - that's how I found the issue to begin with, since the
vCPU will be scheduled out with the cached registers refreshed and
thus be different from what the vm_event itself had. But other callers
of the hypercall can run into the problem if the guest/vcpu is not
paused.

Tamas
On Wed, May 1, 2019 at 7:45 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>
> On Wed, May 1, 2019 at 1:50 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >
> > On 01/05/2019 05:22, Tamas K Lengyel wrote:
> > > Currently the gs_shadow value is only cached when the vCPU is being scheduled
> > > out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> > > since it doesn't represent the actual state of the vCPU at the time the event
> > > was recorded. This prevents vm_event subscribers from correctly finding kernel
> > > structures in the guest when it is trapped while in ring3.
> > >
> > > Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> > > Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
> > > Cc: Jan Beulich <jbeulich@suse.com>
> > > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > > Cc: Wei Liu <wei.liu2@citrix.com>
> > > Cc: Roger Pau Monne <roger.pau@citrix.com>
> > > ---
> > >  xen/arch/x86/vm_event.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
> > > index 51c3493b1d..4464940da7 100644
> > > --- a/xen/arch/x86/vm_event.c
> > > +++ b/xen/arch/x86/vm_event.c
> >
> > Actually, come to think of it, the same is true for the SYSENTER
> > details, which by default are read/write to the guest while it is
> > scheduled. As a result, the details reported here will be from the last
> > vcpu context switch, and possibly stale.
>
> I'll look into it.

The sysenter values look fine to me because vmx_save_vmcs_ctxt calls
vmx_vmcs_save, which refreshes the values from the actual VMCS. It's
only shadow_gs that is not refreshed.

Tamas
On Wed, May 1, 2019 at 8:53 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>
> On Wed, May 1, 2019 at 8:20 AM Razvan Cojocaru
> <rcojocaru@bitdefender.com> wrote:
> >
> > On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
> > >>> It might be worth introducing a "sync state from hw" hook which collects
> > >>> all the data we intend to pass to the introspection agent.
> > >>
> > >> You mean adding another hvm hook?
> > >
> > > Actually, instead of another hook I think what would make sense is to
> > > just update vmx_save_vmcs_ctxt to automatically refresh the cached
> > > register values when it's called with "v == current". Thoughts?
> >
> > That's probably the better way to go about it, since otherwise the
> > xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
> > (there are two ways of getting guest state: one is via the vm_event
> > cached values, the other is via the aforementioned hypercall).
>
> True, although issuing the hypercall in the vm_event callback is
> actually fine - that's how I found the issue to begin with, since the
> vCPU will be scheduled out with the cached registers refreshed and
> thus be different from what the vm_event itself had. But other callers
> of the hypercall can run into the problem if the guest/vcpu is not
> paused.

Actually, doing the "v == current" check wouldn't really do anything
for the hypercall since it's not the domain issuing the hypercall on
itself. The only way to ensure that hypercall is returning correct
values would be to pause the vCPU.

Tamas
On 5/1/19 6:01 PM, Tamas K Lengyel wrote:
> On Wed, May 1, 2019 at 8:53 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>>
>> On Wed, May 1, 2019 at 8:20 AM Razvan Cojocaru
>> <rcojocaru@bitdefender.com> wrote:
>>>
>>> On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
>>>>>> It might be worth introducing a "sync state from hw" hook which collects
>>>>>> all the data we intend to pass to the introspection agent.
>>>>>
>>>>> You mean adding another hvm hook?
>>>>
>>>> Actually, instead of another hook I think what would make sense is to
>>>> just update vmx_save_vmcs_ctxt to automatically refresh the cached
>>>> register values when it's called with "v == current". Thoughts?
>>>
>>> That's probably the better way to go about it, since otherwise the
>>> xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
>>> (there are two ways of getting guest state: one is via the vm_event
>>> cached values, the other is via the aforementioned hypercall).
>>
>> True, although issuing the hypercall in the vm_event callback is
>> actually fine - that's how I found the issue to begin with, since the
>> vCPU will be scheduled out with the cached registers refreshed and
>> thus be different from what the vm_event itself had. But other callers
>> of the hypercall can run into the problem if the guest/vcpu is not
>> paused.
>
> Actually, doing the "v == current" check wouldn't really do anything
> for the hypercall since it's not the domain issuing the hypercall on
> itself. The only way to ensure that hypercall is returning correct
> values would be to pause the vCPU.

I've discussed this with Andrew in the meantime and he's kindly pointed
out that for the hypercall we pause from remote context, which forces a
de-schedule. So the hypercall _should_ be fine. But we write data to the
vm_event ring from the current context, where state might actually be in
hardware.

He'll probably chime in with additional suggestions / comments.

Thanks,
Razvan
On Wed, May 1, 2019 at 9:44 AM Razvan Cojocaru <rcojocaru@bitdefender.com> wrote:
>
> On 5/1/19 6:01 PM, Tamas K Lengyel wrote:
> > On Wed, May 1, 2019 at 8:53 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
> >>
> >> On Wed, May 1, 2019 at 8:20 AM Razvan Cojocaru
> >> <rcojocaru@bitdefender.com> wrote:
> >>>
> >>> On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
> >>>>>> It might be worth introducing a "sync state from hw" hook which collects
> >>>>>> all the data we intend to pass to the introspection agent.
> >>>>>
> >>>>> You mean adding another hvm hook?
> >>>>
> >>>> Actually, instead of another hook I think what would make sense is to
> >>>> just update vmx_save_vmcs_ctxt to automatically refresh the cached
> >>>> register values when it's called with "v == current". Thoughts?
> >>>
> >>> That's probably the better way to go about it, since otherwise the
> >>> xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
> >>> (there are two ways of getting guest state: one is via the vm_event
> >>> cached values, the other is via the aforementioned hypercall).
> >>
> >> True, although issuing the hypercall in the vm_event callback is
> >> actually fine - that's how I found the issue to begin with, since the
> >> vCPU will be scheduled out with the cached registers refreshed and
> >> thus be different from what the vm_event itself had. But other callers
> >> of the hypercall can run into the problem if the guest/vcpu is not
> >> paused.
> >
> > Actually, doing the "v == current" check wouldn't really do anything
> > for the hypercall since it's not the domain issuing the hypercall on
> > itself. The only way to ensure that hypercall is returning correct
> > values would be to pause the vCPU.
>
> I've discussed this with Andrew in the meantime and he's kindly pointed
> out that for the hypercall we pause from remote context, which forces a
> de-schedule. So the hypercall _should_ be fine. But we write data to the
> vm_event ring from the current context, where state might actually be in
> hardware.

Correct - I went through the hypercall code and it does pause the vCPU.
So right now I don't think we need anything else than what's already in
this patch.

Tamas
On 01/05/2019 15:58, Tamas K Lengyel wrote:
> On Wed, May 1, 2019 at 7:45 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>> On Wed, May 1, 2019 at 1:50 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> On 01/05/2019 05:22, Tamas K Lengyel wrote:
>>>> Currently the gs_shadow value is only cached when the vCPU is being scheduled
>>>> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
>>>> since it doesn't represent the actual state of the vCPU at the time the event
>>>> was recorded. This prevents vm_event subscribers from correctly finding kernel
>>>> structures in the guest when it is trapped while in ring3.
>>>>
>>>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
>>>> Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
>>>> Cc: Jan Beulich <jbeulich@suse.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Cc: Wei Liu <wei.liu2@citrix.com>
>>>> Cc: Roger Pau Monne <roger.pau@citrix.com>
>>>> ---
>>>>  xen/arch/x86/vm_event.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
>>>> index 51c3493b1d..4464940da7 100644
>>>> --- a/xen/arch/x86/vm_event.c
>>>> +++ b/xen/arch/x86/vm_event.c
>>> Actually, come to think of it, the same is true for the SYSENTER
>>> details, which by default are read/write to the guest while it is
>>> scheduled. As a result, the details reported here will be from the last
>>> vcpu context switch, and possibly stale.
>> I'll look into it.
> The sysenter values look fine to me because vmx_save_vmcs_ctxt calls
> vmx_vmcs_save, which refreshes the values from the actual VMCS. It's
> only shadow_gs that is not refreshed.

So, you are correct that we call into vmx_vmcs_save() which causes the
SYSENTER details to be correct.

However, the same path also calls vmx_save_cpu_state() which saves
shadow_gs, and therefore ought to DTRT for the previous use in vm_event.

The problem is that vmx_{save,load}_cpu_state() only function correctly
in remote context, which is why the stale shadow_gs persists.

Contrary to what I said earlier, I now think that a better fix would be
to make the above functions safe to use in current context (at least
for the save side - the restore side can probably just ASSERT() atm, as
it doesn't seem to have an equivalent use).

That way, the vm_event code doesn't need to contain any
context-sensitive code, which is better overall, IMO.

~Andrew
On Wed, May 1, 2019 at 11:03 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 01/05/2019 15:58, Tamas K Lengyel wrote:
> > On Wed, May 1, 2019 at 7:45 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
> >> On Wed, May 1, 2019 at 1:50 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>> On 01/05/2019 05:22, Tamas K Lengyel wrote:
> >>>> Currently the gs_shadow value is only cached when the vCPU is being scheduled
> >>>> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> >>>> since it doesn't represent the actual state of the vCPU at the time the event
> >>>> was recorded. This prevents vm_event subscribers from correctly finding kernel
> >>>> structures in the guest when it is trapped while in ring3.
> >>>>
> >>>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> >>>> Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
> >>>> Cc: Jan Beulich <jbeulich@suse.com>
> >>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >>>> Cc: Wei Liu <wei.liu2@citrix.com>
> >>>> Cc: Roger Pau Monne <roger.pau@citrix.com>
> >>>> ---
> >>>>  xen/arch/x86/vm_event.c | 2 +-
> >>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
> >>>> index 51c3493b1d..4464940da7 100644
> >>>> --- a/xen/arch/x86/vm_event.c
> >>>> +++ b/xen/arch/x86/vm_event.c
> >>> Actually, come to think of it, the same is true for the SYSENTER
> >>> details, which by default are read/write to the guest while it is
> >>> scheduled. As a result, the details reported here will be from the last
> >>> vcpu context switch, and possibly stale.
> >> I'll look into it.
> > The sysenter values look fine to me because vmx_save_vmcs_ctxt calls
> > vmx_vmcs_save, which refreshes the values from the actual VMCS. It's
> > only shadow_gs that is not refreshed.
>
> So, you are correct that we call into vmx_vmcs_save() which causes the
> SYSENTER details to be correct.
>
> However, the same path also calls vmx_save_cpu_state() which saves
> shadow_gs, and therefore ought to DTRT for the previous use in vm_event.
>
> The problem is that vmx_{save,load}_cpu_state() only function correctly
> in remote context, which is why the stale shadow_gs persists.
>
> Contrary to what I said earlier, I now think that a better fix would be
> to make the above functions safe to use in current context (at least
> for the save side - the restore side can probably just ASSERT() atm, as
> it doesn't seem to have an equivalent use).
>
> That way, the vm_event code doesn't need to contain any
> context-sensitive code, which is better overall, IMO.

Sounds good to me. So a "v == current" case to refresh the gs_shadow
value is all this will entail, correct?

Thanks,
Tamas
diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
index 51c3493b1d..4464940da7 100644
--- a/xen/arch/x86/vm_event.c
+++ b/xen/arch/x86/vm_event.c
@@ -239,7 +239,7 @@ void vm_event_fill_regs(vm_event_request_t *req)
     vm_event_pack_segment_register(x86_seg_ds, &req->data.regs.x86);
     vm_event_pack_segment_register(x86_seg_es, &req->data.regs.x86);
 
-    req->data.regs.x86.shadow_gs = ctxt.shadow_gs;
+    req->data.regs.x86.shadow_gs = rdgsshadow();
     req->data.regs.x86.dr6 = ctxt.dr6;
 #endif
 }
Currently the gs_shadow value is only cached when the vCPU is being scheduled
out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
since it doesn't represent the actual state of the vCPU at the time the event
was recorded. This prevents vm_event subscribers from correctly finding kernel
structures in the guest when it is trapped while in ring3.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Roger Pau Monne <roger.pau@citrix.com>
---
 xen/arch/x86/vm_event.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)