diff mbox

xc_hvm_inject_trap() races

Message ID 56bd0939-976b-ef45-4182-60e65c2f3ddc@bitdefender.com (mailing list archive)
State New, archived
Headers show

Commit Message

Razvan Cojocaru Nov. 1, 2016, 9:04 a.m. UTC
Hello,

We've stumbled across the following scenario: we're injecting a #PF to
try to bring a swapped page back, but Xen already have a pending
interrupt, and the two collide.

I've logged what happens in hvm_do_resume() at the point of injection,
and stumbled across this:

(XEN) [  252.878389] vector: 14, type: 3, error_code: 0,
VM_ENTRY_INTR_INFO: 0x00000000800000e1

VM_ENTRY_INTR_INFO does have INTR_INFO_VALID_MASK set here.

This obviously leads to all sorts of unpleasantness with the guest.
Ideally, the injected #PF should take precedence, and the other
interrupt should not be lost as well. Suggestions on how best to solve
this are welcome and appreciated.

This problem would also occur with reinjected breakpoint events (please
see xen-access.c), which also use xc_hvm_inject_trap().

The easiest approach would be to simply ignore the injected trap while
VM_ENTRY_INTR_INFO & INTR_INFO_VALID_MASK, like this:

However, this does not even guarantee delayed delivery of the trap,
since by the time the next hvm_do_resume() call happens where there's no
other pending interrupt we could have called xc_hvm_inject_trap() with
different values, and have overwritten v->arch.hvm_vcpu.inject_trap.
More importantly, the context might be different then and the injection
would fail (or mess things up) because of that.

So we'd be better off simply discarding the trap in this case (as has
been suggested by Andrew), but here again the xc_hvm_inject_trap()
caller has no clue whether the call succeeded or failed. This is
inefficient (and thus comes with a performance impact) since the guest
would be put in a loop until injection finally succeeds, and finally
makes client applications significantly more cumbersome to implement,
because they would need to guess when the injection failed (since
xc_hvm_inject_trap() won't tell us), and handle duplicates for these
corner cases.

Is there an elegant way to fix this that keeps both interrupts but makes
the injected one higher priority?


Thanks,
Razvan

Comments

Jan Beulich Nov. 1, 2016, 10:30 a.m. UTC | #1
>>> Razvan Cojocaru <rcojocaru@bitdefender.com> 11/01/16 10:04 AM >>>
>We've stumbled across the following scenario: we're injecting a #PF to
>try to bring a swapped page back, but Xen already have a pending
>interrupt, and the two collide.
>
>I've logged what happens in hvm_do_resume() at the point of injection,
>and stumbled across this:
>
>(XEN) [  252.878389] vector: 14, type: 3, error_code: 0,
>VM_ENTRY_INTR_INFO: 0x00000000800000e1
>
>VM_ENTRY_INTR_INFO does have INTR_INFO_VALID_MASK set here.

So a first question I have is this: What are the criteria that made your
application decide it needs to inject a trap? Obviously there must have
been some kind of event in the guest that triggered this. Hence the
question is whether this same event wouldn't re-trigger at the end of the
hardware interrupt (or could be made re-trigger reasonably easily).
Because in the end what you're trying to do here is something that's
architecturally impossible: There can't be a (non-nested) exception once
an external interrupt has been accepted (i.e. any subsequent exception
can only be related to the delivery of that interrupt vector, not to the code
which was running when the interrupt was signaled).

Jan
Razvan Cojocaru Nov. 1, 2016, 10:53 a.m. UTC | #2
On 11/01/2016 12:30 PM, Jan Beulich wrote:
>>>> Razvan Cojocaru <rcojocaru@bitdefender.com> 11/01/16 10:04 AM >>>
>> We've stumbled across the following scenario: we're injecting a #PF to
>> try to bring a swapped page back, but Xen already have a pending
>> interrupt, and the two collide.
>>
>> I've logged what happens in hvm_do_resume() at the point of injection,
>> and stumbled across this:
>>
>> (XEN) [  252.878389] vector: 14, type: 3, error_code: 0,
>> VM_ENTRY_INTR_INFO: 0x00000000800000e1
>>
>> VM_ENTRY_INTR_INFO does have INTR_INFO_VALID_MASK set here.
> 
> So a first question I have is this: What are the criteria that made your
> application decide it needs to inject a trap? Obviously there must have
> been some kind of event in the guest that triggered this. Hence the
> question is whether this same event wouldn't re-trigger at the end of the
> hardware interrupt (or could be made re-trigger reasonably easily).
> Because in the end what you're trying to do here is something that's
> architecturally impossible: There can't be a (non-nested) exception once
> an external interrupt has been accepted (i.e. any subsequent exception
> can only be related to the delivery of that interrupt vector, not to the code
> which was running when the interrupt was signaled).

Unfortunately there are two main reasons why relying on the conditions
for injecting the page fault repeating is problematic:

1. We'd need to be able differentiate between a failed run (where
injection doesn't work) and a succesful run, with no real possibility to
know the difference for sure. So we don't know how to adapt the
application's internal state based on some events: is the event the
"final" one, or just a duplicate? xc_hvm_inject_trap() does not tell us
(as indeed it cannot know) if the injection succeeded, and there's no
other way to know.

2. More importantly (although working around 1. is far from trivial),
the event may not be repeatable. As an example, #PF injection can occur
as part of handling an EPT event, but during handling the event the
introspection engine can decide to lift the restrictions on said page
after injecting the #PF. So the application relied on the #PF being
delivered, and with the restrictions lifted from the page there will be
no further EPT events for that page, therefore the main condition for
triggering the #PF is lost forever.


Thanks,
Razvan
Jan Beulich Nov. 1, 2016, 3:53 p.m. UTC | #3
>>> Razvan Cojocaru <rcojocaru@bitdefender.com> 11/01/16 11:53 AM >>>
>On 11/01/2016 12:30 PM, Jan Beulich wrote:
>>>>> Razvan Cojocaru <rcojocaru@bitdefender.com> 11/01/16 10:04 AM >>>
>>> We've stumbled across the following scenario: we're injecting a #PF to
>>> try to bring a swapped page back, but Xen already have a pending
>>> interrupt, and the two collide.
>>>
>>> I've logged what happens in hvm_do_resume() at the point of injection,
>>> and stumbled across this:
>>>
>>> (XEN) [  252.878389] vector: 14, type: 3, error_code: 0,
>>> VM_ENTRY_INTR_INFO: 0x00000000800000e1
>>>
>>> VM_ENTRY_INTR_INFO does have INTR_INFO_VALID_MASK set here.
>> 
>> So a first question I have is this: What are the criteria that made your
>> application decide it needs to inject a trap? Obviously there must have
>> been some kind of event in the guest that triggered this. Hence the
>> question is whether this same event wouldn't re-trigger at the end of the
>> hardware interrupt (or could be made re-trigger reasonably easily).
>> Because in the end what you're trying to do here is something that's
>> architecturally impossible: There can't be a (non-nested) exception once
>> an external interrupt has been accepted (i.e. any subsequent exception
>> can only be related to the delivery of that interrupt vector, not to the code
>> which was running when the interrupt was signaled).
>
>Unfortunately there are two main reasons why relying on the conditions
>for injecting the page fault repeating is problematic:
>
>1. We'd need to be able differentiate between a failed run (where
>injection doesn't work) and a succesful run, with no real possibility to
>know the difference for sure. So we don't know how to adapt the
>application's internal state based on some events: is the event the
>"final" one, or just a duplicate? xc_hvm_inject_trap() does not tell us
>(as indeed it cannot know) if the injection succeeded, and there's no
>other way to know.
>
>2. More importantly (although working around 1. is far from trivial),
>the event may not be repeatable. As an example, #PF injection can occur
>as part of handling an EPT event, but during handling the event the
>introspection engine can decide to lift the restrictions on said page
>after injecting the #PF. So the application relied on the #PF being
>delivered, and with the restrictions lifted from the page there will be
>no further EPT events for that page, therefore the main condition for
>triggering the #PF is lost forever.

Isn't this a problem you also have under other circumstances, e.g.
multiple faults occurring for a single instruction? Which gets us to the
fact that you didn't answer at all the initial question I did raise. Apart
from that I'm also not really understanding the model you describe:
What good does injecting #PF alongside lifting restrictions? I'd normally
expect just one of the two to occur to deal with whatever caused the
original event.

Jan
Andrei Vlad LUTAS Nov. 1, 2016, 4:13 p.m. UTC | #4
Hello,

First of all, to answer your original question: the injection decision is made when the introspection logic needs to inspect a page that is not present in the physical memory. We don't really care if the current instruction triggers multiple faults or not (and here I'm not sure what you mean by that - multiple exceptions, or multiple EPT violations - but the answer is still the same), and removing the page restrictions after the #PF injection is introspection specific logic - the address for which we inject the #PF doesn't have to be related in any way to the current instruction. Assuming that we wouldn't remove the restrictions and we would rely on re-generating the event - that is not acceptable: first of all because the instruction would normally be emulated anyway before re-entering the guest, and secondly because that is not a normal CPU behavior (and it would break internal introspection logic) - once an instruction triggered an exit, it should be emulated or single-stepped.

Best regards,
Andrei.
 
-----Original Message-----
From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Jan Beulich
Sent: 1 November, 2016 17:54
To: rcojocaru@bitdefender.com
Cc: andrew.cooper3@citrix.com; tamas@tklengyel.com; xen-devel@lists.xenproject.org
Subject: Re: [Xen-devel] xc_hvm_inject_trap() races

>>> Razvan Cojocaru <rcojocaru@bitdefender.com> 11/01/16 11:53 AM >>>
>On 11/01/2016 12:30 PM, Jan Beulich wrote:
>>>>> Razvan Cojocaru <rcojocaru@bitdefender.com> 11/01/16 10:04 AM >>>
>>> We've stumbled across the following scenario: we're injecting a #PF 
>>> to try to bring a swapped page back, but Xen already have a pending 
>>> interrupt, and the two collide.
>>>
>>> I've logged what happens in hvm_do_resume() at the point of 
>>> injection, and stumbled across this:
>>>
>>> (XEN) [  252.878389] vector: 14, type: 3, error_code: 0,
>>> VM_ENTRY_INTR_INFO: 0x00000000800000e1
>>>
>>> VM_ENTRY_INTR_INFO does have INTR_INFO_VALID_MASK set here.
>> 
>> So a first question I have is this: What are the criteria that made 
>> your application decide it needs to inject a trap? Obviously there 
>> must have been some kind of event in the guest that triggered this. 
>> Hence the question is whether this same event wouldn't re-trigger at 
>> the end of the hardware interrupt (or could be made re-trigger reasonably easily).
>> Because in the end what you're trying to do here is something that's 
>> architecturally impossible: There can't be a (non-nested) exception 
>> once an external interrupt has been accepted (i.e. any subsequent 
>> exception can only be related to the delivery of that interrupt 
>> vector, not to the code which was running when the interrupt was signaled).
>
>Unfortunately there are two main reasons why relying on the conditions 
>for injecting the page fault repeating is problematic:
>
>1. We'd need to be able differentiate between a failed run (where 
>injection doesn't work) and a succesful run, with no real possibility 
>to know the difference for sure. So we don't know how to adapt the 
>application's internal state based on some events: is the event the 
>"final" one, or just a duplicate? xc_hvm_inject_trap() does not tell us 
>(as indeed it cannot know) if the injection succeeded, and there's no 
>other way to know.
>
>2. More importantly (although working around 1. is far from trivial), 
>the event may not be repeatable. As an example, #PF injection can occur 
>as part of handling an EPT event, but during handling the event the 
>introspection engine can decide to lift the restrictions on said page 
>after injecting the #PF. So the application relied on the #PF being 
>delivered, and with the restrictions lifted from the page there will be 
>no further EPT events for that page, therefore the main condition for 
>triggering the #PF is lost forever.

Isn't this a problem you also have under other circumstances, e.g.
multiple faults occurring for a single instruction? Which gets us to the fact that you didn't answer at all the initial question I did raise. Apart from that I'm also not really understanding the model you describe:
What good does injecting #PF alongside lifting restrictions? I'd normally expect just one of the two to occur to deal with whatever caused the original event.

Jan
Jan Beulich Nov. 1, 2016, 4:40 p.m. UTC | #5
>>> Andrei Vlad LUTAS <vlutas@bitdefender.com> 11/01/16 5:13 PM >>>

First of all, please don't top post.

>First of all, to answer your original question: the injection decision is made
>when the introspection logic needs to inspect a page that is not present in
>the physical memory. We don't really care if the current instruction triggers
>multiple faults or not (and here I'm not sure what you mean by that - multiple
>exceptions, or multiple EPT violations - but the answer is still the same),
>and removing the page restrictions after the #PF injection is introspection
>specific logic - the address for which we inject the #PF doesn't have to be
>related in any way to the current instruction.

Ah, that's this no-architectural behavior again. What if the OS doesn't fully
carry out the page-in, relying on the #PF to retrigger once the insn for which
it got reported has been restarted? Or what if the page gets paged out again
before the insn actually gets to execute (e.g. because a re-schedule
happened inside the guest on the way out of the #PF handler)? All of this
suggests that you really can't lift any restrictions _before_ seeing what you
need to see.

>Assuming that we wouldn't remove the restrictions and we would rely on
>re-generating the event - that is not acceptable: first of all because the
>instruction would normally be emulated anyway before re-entering the guest,

How would that be a problem?

>and secondly because that is not a normal CPU behavior

This really is the main problem here, afaict.

Jan
Andrei Vlad LUTAS Nov. 1, 2016, 10:17 p.m. UTC | #6
-----Original Message-----
From: Jan Beulich [mailto:jbeulich@suse.com]
Sent: 1 November, 2016 18:40
To: rcojocaru@bitdefender.com; Andrei Vlad LUTAS <vlutas@bitdefender.com>
Cc: andrew.cooper3@citrix.com; xen-devel@lists.xenproject.org; tamas@tklengyel.com
Subject: Re: RE: [Xen-devel] xc_hvm_inject_trap() races

>>> Andrei Vlad LUTAS <vlutas@bitdefender.com> 11/01/16 5:13 PM >>>

>First of all, please don't top post.

>>First of all, to answer your original question: the injection decision
>>is made when the introspection logic needs to inspect a page that is
>>not present in the physical memory. We don't really care if the current
>>instruction triggers multiple faults or not (and here I'm not sure what
>>you mean by that - multiple exceptions, or multiple EPT violations -
>>but the answer is still the same), and removing the page restrictions
>>after the #PF injection is introspection specific logic - the address
>>for which we inject the #PF doesn't have to be related in any way to the current instruction.

>Ah, that's this no-architectural behavior again.

I don't think the HVI #PF injection internals or how the #PF is handled by the OS are relevant here. We are using an existing API that seems to not work quite correct under certain circumstances and we were curious if any of you can shed some light in this regard, and maybe point us to the right direction for cooking up a fix.

>What if the OS doesn't fully carry out the page-in, relying on the #PF to retrigger once the insn for which it got reported has been restarted?

Can you be more specific?

> Or what if the page gets paged out again before the insn actually gets to execute (e.g. because a re-schedule happened inside the guest on the way out of the #PF handler)? All of this suggests that you really can't lift >any restrictions _before_ seeing what you need to see.

We don't really care when and how the #PF is handled. We don't care if the page is paged out at some random point. What we do know is that at a certain point in the future, the page will be swapped in; how do we know when? The OS will write the guest page tables, at which point we can inspect the physical page itself (so you can see here why we don't care about the page being swapped out sometime in the future). So we really _can_ lift any restriction we want at that point.

>>Assuming that we wouldn't remove the restrictions and we would rely on
>>re-generating the event - that is not acceptable: first of all because
>>the instruction would normally be emulated anyway before re-entering
>>the guest,

>How would that be a problem?

I thought it was obvious without further clarification: how can we expect the exact same event to be generated, if the instruction that triggered it in the first place was emulated or single stepped?

>>and secondly because that is not a normal CPU behavior

>This really is the main problem here, afaict.

Best regards,
Andrei.
Andrew Cooper Nov. 1, 2016, 10:55 p.m. UTC | #7
On 01/11/2016 22:17, Andrei Vlad LUTAS wrote:
>>> First of all, to answer your original question: the injection decision
>>> is made when the introspection logic needs to inspect a page that is
>>> not present in the physical memory. We don't really care if the current
>>> instruction triggers multiple faults or not (and here I'm not sure what
>>> you mean by that - multiple exceptions, or multiple EPT violations -
>>> but the answer is still the same), and removing the page restrictions
>>> after the #PF injection is introspection specific logic - the address
>>> for which we inject the #PF doesn't have to be related in any way to the current instruction.
>> Ah, that's this no-architectural behavior again.
> I don't think the HVI #PF injection internals or how the #PF is handled by the OS are relevant here. We are using an existing API that seems to not work quite correct under certain circumstances and we were curious if any of you can shed some light in this regard, and maybe point us to the right direction for cooking up a fix.

Just because there is an API like this, doesn't necessarily mean it ever
worked.  This one clearly doesn't, and it got introduced before we as a
community took a rather harder stance towards code review.

Architecturally speaking, faults are always raised as a direct
consequence of the current state.  Therefore, having in introspection
agent interposing on the instruction stream and causing faults as a side
effect of EPT permissions/etc, is quite natural and in line with
architectural expectation.

You also have a second usecase for this API, which is to trick Windows
into paging in a frame you care about looking at.  Overall, using this
#PF method to get Windows to page content back in clearly the rational
way of making that happen, but it is very definitely a non-architectural
usecase; if windows were to double check the instruction stream on
getting this pagefault, it would get very confused, as the pagefault it
received has nothing to do with the code pointed at in the exception frame.

It is quite likely that these different usecases might have different
solutions.  IMO, the former should probably be controlled by responses
in the vm_event ring, but this latter issue probably shouldn't.

When it comes to injecting exceptions, there are some restrictions which
limit what can legitimately be done.  We can only inject a single thing
at once.  Stacking a #PF on top of a plain interrupt could be resolved
by leaving the interrupt pending in the vLAPIC and injecting the #PF
instead.  Stacking a #PF on top of a different fault is going to cause
hvm_combine_exceptions() to turn it into something more nasty.  OTOH,
deferring the #PF by even a single instruction could result in it being
sent in an unsafe context, which should also be avoided.

What hard propertied do you need for this usecase, and are there any
properties can afford to be flexible?

~Andrew
Andrei Vlad LUTAS Nov. 2, 2016, 7:29 a.m. UTC | #8
> -----Original Message-----
> From: Andrew Cooper [mailto:amc96@hermes.cam.ac.uk] On Behalf Of
> Andrew Cooper
> Sent: 2 November, 2016 00:55
> To: Andrei Vlad LUTAS <vlutas@bitdefender.com>; Jan Beulich
> <jbeulich@suse.com>; rcojocaru@bitdefender.com
> Cc: xen-devel@lists.xenproject.org; tamas@tklengyel.com
> Subject: Re: [Xen-devel] xc_hvm_inject_trap() races
>
> On 01/11/2016 22:17, Andrei Vlad LUTAS wrote:
> >>> First of all, to answer your original question: the injection
> >>> decision is made when the introspection logic needs to inspect a
> >>> page that is not present in the physical memory. We don't really
> >>> care if the current instruction triggers multiple faults or not (and
> >>> here I'm not sure what you mean by that - multiple exceptions, or
> >>> multiple EPT violations - but the answer is still the same), and
> >>> removing the page restrictions after the #PF injection is
> >>> introspection specific logic - the address for which we inject the #PF
> doesn't have to be related in any way to the current instruction.
> >> Ah, that's this no-architectural behavior again.
> > I don't think the HVI #PF injection internals or how the #PF is handled by
> the OS are relevant here. We are using an existing API that seems to not
> work quite correct under certain circumstances and we were curious if any of
> you can shed some light in this regard, and maybe point us to the right
> direction for cooking up a fix.
>
> Just because there is an API like this, doesn't necessarily mean it ever
> worked.  This one clearly doesn't, and it got introduced before we as a
> community took a rather harder stance towards code review.

I totally understand that.

>
> Architecturally speaking, faults are always raised as a direct consequence of
> the current state.  Therefore, having in introspection agent interposing on
> the instruction stream and causing faults as a side effect of EPT
> permissions/etc, is quite natural and in line with architectural expectation.
>
> You also have a second usecase for this API, which is to trick Windows into
> paging in a frame you care about looking at.  Overall, using this #PF method
> to get Windows to page content back in clearly the rational way of making
> that happen, but it is very definitely a non-architectural usecase; if windows
> were to double check the instruction stream on getting this pagefault, it
> would get very confused, as the pagefault it received has nothing to do with
> the code pointed at in the exception frame.

This is indeed OS specific. After all, the entire concept of "VM introspection" deals with OS specifics and internals. Before adding support to a new build/version of the kernel, we make sure everything works fine. If something would not work properly, we would adjust our logic. As of now, every operating system that we've tested with has a consistent behavior regarding handling the injected #PF: it doesn't care what instruction triggered it, it doesn't inspect it, etc., it just swaps back in the page pointed by CR2.

>
> It is quite likely that these different usecases might have different solutions.
> IMO, the former should probably be controlled by responses in the
> vm_event ring, but this latter issue probably shouldn't.
>
> When it comes to injecting exceptions, there are some restrictions which
> limit what can legitimately be done.  We can only inject a single thing at once.
> Stacking a #PF on top of a plain interrupt could be resolved by leaving the
> interrupt pending in the vLAPIC and injecting the #PF instead.  Stacking a #PF
> on top of a different fault is going to cause
> hvm_combine_exceptions() to turn it into something more nasty.  OTOH,
> deferring the #PF by even a single instruction could result in it being sent in
> an unsafe context, which should also be avoided.

If I understand (at least some portions of) the Xen code well, the real issue appears when the VM exit handler sees that the IDT_VECTORING_INFO is valid and copies it back in the VM_ENTRY_INTR_INFO in order to re-inject the event. We may then overwrite that with the #PF event. I'm not sure what other code paths cause event injections, but this one seemed obvious. In our test hypervisor, the #PF injection takes precedence - we leave all interrupts pending in the virtual LAPIC and we postpone any event re-injection.

>
> What hard propertied do you need for this usecase, and are there any
> properties can afford to be flexible?

Ideally, knowing that once we called that API, the #PF will get injected. At the minimum, I think that knowing whether the #PF was injected or not is mandatory, in order to know what to do next.

>
> ~Andrew
>
> ________________________
> This email was scanned by Bitdefender

Best regards,
Andrei.
Jan Beulich Nov. 2, 2016, 8:49 a.m. UTC | #9
>>> On 01.11.16 at 23:17, <vlutas@bitdefender.com> wrote:
> From: Jan Beulich [mailto:jbeulich@suse.com]
> Sent: 1 November, 2016 18:40
>>>> Andrei Vlad LUTAS <vlutas@bitdefender.com> 11/01/16 5:13 PM >>>
>>>First of all, to answer your original question: the injection decision
>>>is made when the introspection logic needs to inspect a page that is
>>>not present in the physical memory. We don't really care if the current
>>>instruction triggers multiple faults or not (and here I'm not sure what
>>>you mean by that - multiple exceptions, or multiple EPT violations -
>>>but the answer is still the same), and removing the page restrictions
>>>after the #PF injection is introspection specific logic - the address
>>>for which we inject the #PF doesn't have to be related in any way to the 
> current instruction.
> 
>>Ah, that's this no-architectural behavior again.
> 
> I don't think the HVI #PF injection internals or how the #PF is handled by 
> the OS are relevant here. We are using an existing API that seems to not work 
> quite correct under certain circumstances and we were curious if any of you 
> can shed some light in this regard, and maybe point us to the right direction 
> for cooking up a fix.
> 
>>What if the OS doesn't fully carry out the page-in, relying on the #PF to 
> retrigger once the insn for which it got reported has been restarted?
> 
> Can you be more specific?

Well, perhaps with the answer you gave further down that's not that
relevant anymore, but consider a #PF handler which handles just the
top most not-present page table level each time it gets invoked. I.e.
for a not-present L4 entry it would take 4 re-invocations of the same
original instruction to resolve all 4 levels.

>> Or what if the page gets paged out again before the insn actually gets to 
> execute (e.g. because a re-schedule happened inside the guest on the way out 
> of the #PF handler)? All of this suggests that you really can't lift >any 
> restrictions _before_ seeing what you need to see.
> 
> We don't really care when and how the #PF is handled. We don't care if the 
> page is paged out at some random point. What we do know is that at a certain 
> point in the future, the page will be swapped in; how do we know when? The OS 
> will write the guest page tables, at which point we can inspect the physical 
> page itself (so you can see here why we don't care about the page being 
> swapped out sometime in the future). So we really _can_ lift any restriction 
> we want at that point.

Hmm, I'm having difficulty seeing the supposedly broken flow of
events here: Earlier it was said that #PF injection would be a result
of EPT event processing. Here you say that the lifting of the
restrictions would be a result of seeing the guest modify its page
tables (which would in turn be a result of the #PF actually having
arrived in the guest). So if (with this, and as you say above) you
don't care when the #PF gets handled, where's the original problem?

>>>Assuming that we wouldn't remove the restrictions and we would rely on
>>>re-generating the event - that is not acceptable: first of all because
>>>the instruction would normally be emulated anyway before re-entering
>>>the guest,
> 
>>How would that be a problem?
> 
> I thought it was obvious without further clarification: how can we expect 
> the exact same event to be generated, if the instruction that triggered it in 
> the first place was emulated or single stepped?

Neither emulation nor single stepping should result in architectural
events (exceptions) to be missed (or else there's a bug somewhere).
Non-architectural #PF like you're using of course can't (currently) be
guaranteed to arrive at any particular point in time.

The fact that {vmx,svm}_inject_trap() combine the new exception
with an already injected one (and blindly discard events other than
hw exceptions), otoh, looks like indeed wants to be controllable by
the caller: When the event comes from the outside (the hypercall),
it would clearly seem better to simply tell the caller that no injection
happened and the event needs to be kept pending. The main
question then is how to make certain injection gets retried at the
right point in time (read: once the other interrupt handler IRETs
back to original context).

Jan
Razvan Cojocaru Nov. 2, 2016, 8:57 a.m. UTC | #10
On 11/02/2016 10:49 AM, Jan Beulich wrote:
> The fact that {vmx,svm}_inject_trap() combine the new exception
> with an already injected one (and blindly discard events other than
> hw exceptions), otoh, looks like indeed wants to be controllable by
> the caller: When the event comes from the outside (the hypercall),
> it would clearly seem better to simply tell the caller that no injection
> happened and the event needs to be kept pending.

However this is not possible with the current design, since all
xc_hvm_inject_trap() really does is set the info to be used at
hvm_do_resume() time. So at the time xc_hvm_inject_trap() returns, it's
not yet possible to know if the injection will succeed or not (assuming
we discard it when it would collide with another).


Thanks,
Razvan
Jan Beulich Nov. 2, 2016, 9:05 a.m. UTC | #11
>>> On 02.11.16 at 09:57, <rcojocaru@bitdefender.com> wrote:
> On 11/02/2016 10:49 AM, Jan Beulich wrote:
>> The fact that {vmx,svm}_inject_trap() combine the new exception
>> with an already injected one (and blindly discard events other than
>> hw exceptions), otoh, looks like indeed wants to be controllable by
>> the caller: When the event comes from the outside (the hypercall),
>> it would clearly seem better to simply tell the caller that no injection
>> happened and the event needs to be kept pending.
> 
> However this is not possible with the current design, since all
> xc_hvm_inject_trap() really does is set the info to be used at
> hvm_do_resume() time. So at the time xc_hvm_inject_trap() returns, it's
> not yet possible to know if the injection will succeed or not (assuming
> we discard it when it would collide with another).

That's my point - it shouldn't get discarded, but remain latched
for a future invocation of hvm_do_resume(). Making
hvm_inject_trap() have a suitable parameter (and a return value)
would be the easy part of the change here. The difficult part would
be to make sure the injection context is the right one.

Jan
Tamas K Lengyel Nov. 2, 2016, 9:05 a.m. UTC | #12
On Nov 2, 2016 09:57, "Razvan Cojocaru" <rcojocaru@bitdefender.com> wrote:
>
> On 11/02/2016 10:49 AM, Jan Beulich wrote:
> > The fact that {vmx,svm}_inject_trap() combine the new exception
> > with an already injected one (and blindly discard events other than
> > hw exceptions), otoh, looks like indeed wants to be controllable by
> > the caller: When the event comes from the outside (the hypercall),
> > it would clearly seem better to simply tell the caller that no injection
> > happened and the event needs to be kept pending.
>
> However this is not possible with the current design, since all
> xc_hvm_inject_trap() really does is set the info to be used at
> hvm_do_resume() time. So at the time xc_hvm_inject_trap() returns, it's
> not yet possible to know if the injection will succeed or not (assuming
> we discard it when it would collide with another).
>

You can always introduce a new (asynchronous?) vm_event for signaling a
failed interrupt injection on the vCPU.

Tamas
Razvan Cojocaru Nov. 2, 2016, 9:11 a.m. UTC | #13
On 11/02/2016 11:05 AM, Jan Beulich wrote:
>>>> On 02.11.16 at 09:57, <rcojocaru@bitdefender.com> wrote:
>> On 11/02/2016 10:49 AM, Jan Beulich wrote:
>>> The fact that {vmx,svm}_inject_trap() combine the new exception
>>> with an already injected one (and blindly discard events other than
>>> hw exceptions), otoh, looks like indeed wants to be controllable by
>>> the caller: When the event comes from the outside (the hypercall),
>>> it would clearly seem better to simply tell the caller that no injection
>>> happened and the event needs to be kept pending.
>>
>> However this is not possible with the current design, since all
>> xc_hvm_inject_trap() really does is set the info to be used at
>> hvm_do_resume() time. So at the time xc_hvm_inject_trap() returns, it's
>> not yet possible to know if the injection will succeed or not (assuming
>> we discard it when it would collide with another).
> 
> That's my point - it shouldn't get discarded, but remain latched
> for a future invocation of hvm_do_resume(). Making
> hvm_inject_trap() have a suitable parameter (and a return value)
> would be the easy part of the change here. The difficult part would
> be to make sure the injection context is the right one.

Should I then bring this patch back?

https://lists.xen.org/archives/html/xen-devel/2014-07/msg02927.html

It has been rejected at the time on the grounds that
xc_hvm_inject_trap() is sufficient.


Thanks,
Razvan
Andrei Vlad LUTAS Nov. 2, 2016, 9:13 a.m. UTC | #14
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 2 November, 2016 10:50
> To: rcojocaru@bitdefender.com; Andrei Vlad LUTAS
> <vlutas@bitdefender.com>
> Cc: andrew.cooper3@citrix.com; xen-devel@lists.xenproject.org;
> tamas@tklengyel.com
> Subject: RE: RE: [Xen-devel] xc_hvm_inject_trap() races
>
> >>> On 01.11.16 at 23:17, <vlutas@bitdefender.com> wrote:
> > From: Jan Beulich [mailto:jbeulich@suse.com]
> > Sent: 1 November, 2016 18:40
> >>>> Andrei Vlad LUTAS <vlutas@bitdefender.com> 11/01/16 5:13 PM >>>
> >>>First of all, to answer your original question: the injection
> >>>decision is made when the introspection logic needs to inspect a page
> >>>that is not present in the physical memory. We don't really care if
> >>>the current instruction triggers multiple faults or not (and here I'm
> >>>not sure what you mean by that - multiple exceptions, or multiple EPT
> >>>violations - but the answer is still the same), and removing the page
> >>>restrictions after the #PF injection is introspection specific logic
> >>>- the address for which we inject the #PF doesn't have to be related
> >>>in any way to the
> > current instruction.
> >
> >>Ah, that's this no-architectural behavior again.
> >
> > I don't think the HVI #PF injection internals or how the #PF is
> > handled by the OS are relevant here. We are using an existing API that
> > seems to not work quite correct under certain circumstances and we
> > were curious if any of you can shed some light in this regard, and
> > maybe point us to the right direction for cooking up a fix.
> >
> >>What if the OS doesn't fully carry out the page-in, relying on the #PF
> >>to
> > retrigger once the insn for which it got reported has been restarted?
> >
> > Can you be more specific?
>
> Well, perhaps with the answer you gave further down that's not that
> relevant anymore, but consider a #PF handler which handles just the top
> most not-present page table level each time it gets invoked. I.e.
> for a not-present L4 entry it would take 4 re-invocations of the same original
> instruction to resolve all 4 levels.

I see what you're referring to. As I explained to Andrew in a previous mail - the #PF injection logic is indeed OS specific, and in our particular case (since VM introspection already has to handle a lot of OS specific stuff), we don't have to deal with such a behavior on the supported operating systems. Anyway, the example you provided would involve significant added performance penalty and I don't see why an OS would do that (nor have I heard of any doing it), but I understand your concern.

>
> >> Or what if the page gets paged out again before the insn actually
> >> gets to
> > execute (e.g. because a re-schedule happened inside the guest on the
> > way out of the #PF handler)? All of this suggests that you really
> > can't lift >any restrictions _before_ seeing what you need to see.
> >
> > We don't really care when and how the #PF is handled. We don't care if
> > the page is paged out at some random point. What we do know is that at
> > a certain point in the future, the page will be swapped in; how do we
> > know when? The OS will write the guest page tables, at which point we
> > can inspect the physical page itself (so you can see here why we don't
> > care about the page being swapped out sometime in the future). So we
> > really _can_ lift any restriction we want at that point.
>
> Hmm, I'm having difficulty seeing the supposedly broken flow of events
> here: Earlier it was said that #PF injection would be a result of EPT event
> processing. Here you say that the lifting of the restrictions would be a result
> of seeing the guest modify its page tables (which would in turn be a result of
> the #PF actually having arrived in the guest). So if (with this, and as you say
> above) you don't care when the #PF gets handled, where's the original
> problem?

That's not what I wanted to say, sorry if it was unclear. What I'm trying to say is that the decision to inject a #PF can be made when handling an EPT violation - the accessed page needs not be related in any way with the page for which we decide to inject the #PF. For example, we intercept writes in a list that describes the loaded module. Whenever a new module is loaded, an entry would be inserted into that list, and that would generate an EPT write violation. Now, the introspection logic will be able to analyze what module was loaded and where, and it may find out that the module headers (which are needed by the protection logic) are not present in memory - therefore, it would inject a #PF in order to force the OS to swap in said headers. On the other hand, the HVI logic may also decide that it doesn't need to watch for modules loading anymore (for example, all the interesting modules were loaded), so it will remove the write hook from the list of loaded modules. These two (injection of the #PF and the removal of the EPT write protection) would be done in the same event handler, so we can't rely on the event being re-generated in this case. Hopefully this example makes it more clear.

>
> >>>Assuming that we wouldn't remove the restrictions and we would rely
> >>>on re-generating the event - that is not acceptable: first of all
> >>>because the instruction would normally be emulated anyway before
> >>>re-entering the guest,
> >
> >>How would that be a problem?
> >
> > I thought it was obvious without further clarification: how can we
> > expect the exact same event to be generated, if the instruction that
> > triggered it in the first place was emulated or single stepped?
>
> Neither emulation nor single stepping should result in architectural events
> (exceptions) to be missed (or else there's a bug somewhere).
> Non-architectural #PF like you're using of course can't (currently) be
> guaranteed to arrive at any particular point in time.
>
> The fact that {vmx,svm}_inject_trap() combine the new exception with an
> already injected one (and blindly discard events other than hw exceptions),
> otoh, looks like indeed wants to be controllable by the caller: When the
> event comes from the outside (the hypercall), it would clearly seem better to
> simply tell the caller that no injection happened and the event needs to be
> kept pending. The main question then is how to make certain injection gets
> retried at the right point in time (read: once the other interrupt handler IRETs
> back to original context).

Yes, this is basically our problem. Right now, the #PF would overwrite other interrupts, which is very bad. On the other hand, it can't return an error (if I understand the code correctly), since it can't know if another event will be scheduled for injection. As I told Andrew, at least returning an error that would indicate the #PF cannot be injected may help us a lot here (I'm sure making the injected trap take precedence over other events would not be acceptable).

>
> Jan
>
> ________________________
> This email was scanned by Bitdefender

Best regards,
Andrei.
Jan Beulich Nov. 2, 2016, 9:30 a.m. UTC | #15
>>> On 02.11.16 at 10:11, <rcojocaru@bitdefender.com> wrote:
> On 11/02/2016 11:05 AM, Jan Beulich wrote:
>>>>> On 02.11.16 at 09:57, <rcojocaru@bitdefender.com> wrote:
>>> On 11/02/2016 10:49 AM, Jan Beulich wrote:
>>>> The fact that {vmx,svm}_inject_trap() combine the new exception
>>>> with an already injected one (and blindly discard events other than
>>>> hw exceptions), otoh, looks like indeed wants to be controllable by
>>>> the caller: When the event comes from the outside (the hypercall),
>>>> it would clearly seem better to simply tell the caller that no injection
>>>> happened and the event needs to be kept pending.
>>>
>>> However this is not possible with the current design, since all
>>> xc_hvm_inject_trap() really does is set the info to be used at
>>> hvm_do_resume() time. So at the time xc_hvm_inject_trap() returns, it's
>>> not yet possible to know if the injection will succeed or not (assuming
>>> we discard it when it would collide with another).
>> 
>> That's my point - it shouldn't get discarded, but remain latched
>> for a future invocation of hvm_do_resume(). Making
>> hvm_inject_trap() have a suitable parameter (and a return value)
>> would be the easy part of the change here. The difficult part would
>> be to make sure the injection context is the right one.
> 
> Should I then bring this patch back?
> 
> https://lists.xen.org/archives/html/xen-devel/2014-07/msg02927.html 
> 
> It has been rejected at the time on the grounds that
> xc_hvm_inject_trap() is sufficient.

I don't think it would deal with all possible situations, the more
that it's (already by its title) #PF specific. I think the named
difficult part would need to be solved in the hypervisor alone,
without further external information.

Jan
Jan Beulich Nov. 2, 2016, 9:37 a.m. UTC | #16
>>> On 02.11.16 at 10:13, <vlutas@bitdefender.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 2 November, 2016 10:50
>> >>> On 01.11.16 at 23:17, <vlutas@bitdefender.com> wrote:
>> > We don't really care when and how the #PF is handled. We don't care if
>> > the page is paged out at some random point. What we do know is that at
>> > a certain point in the future, the page will be swapped in; how do we
>> > know when? The OS will write the guest page tables, at which point we
>> > can inspect the physical page itself (so you can see here why we don't
>> > care about the page being swapped out sometime in the future). So we
>> > really _can_ lift any restriction we want at that point.
>>
>> Hmm, I'm having difficulty seeing the supposedly broken flow of events
>> here: Earlier it was said that #PF injection would be a result of EPT event
>> processing. Here you say that the lifting of the restrictions would be a result
>> of seeing the guest modify its page tables (which would in turn be a result of
>> the #PF actually having arrived in the guest). So if (with this, and as you say
>> above) you don't care when the #PF gets handled, where's the original
>> problem?
> 
> That's not what I wanted to say, sorry if it was unclear. What I'm trying to 
> say is that the decision to inject a #PF can be made when handling an EPT 
> violation - the accessed page needs not be related in any way with the page 
> for which we decide to inject the #PF. For example, we intercept writes in a 
> list that describes the loaded module. Whenever a new module is loaded, an 
> entry would be inserted into that list, and that would generate an EPT write 
> violation. Now, the introspection logic will be able to analyze what module 
> was loaded and where, and it may find out that the module headers (which are 
> needed by the protection logic) are not present in memory - therefore, it 
> would inject a #PF in order to force the OS to swap in said headers. On the 
> other hand, the HVI logic may also decide that it doesn't need to watch for 
> modules loading anymore (for example, all the interesting modules were 
> loaded), so it will remove the write hook from the list of loaded modules. 
> These two (injection of the #PF and the removal of the EPT write protection) 
> would be done in the same event handler, so we can't rely on the event being 
> re-generated in this case. Hopefully this example makes it more clear.

If the decision whether further events are needed depends on data in
a page not present in memory, how can that decision be taken before
the injected #PF has actually been handled? I'm still not seeing a flow
of events where there is a problem. Furthermore, I don't think it would
do much harm if you kept the watch in place slightly longer?

>> The fact that {vmx,svm}_inject_trap() combine the new exception with an
>> already injected one (and blindly discard events other than hw exceptions),
>> otoh, looks like indeed wants to be controllable by the caller: When the
>> event comes from the outside (the hypercall), it would clearly seem better to
>> simply tell the caller that no injection happened and the event needs to be
>> kept pending. The main question then is how to make certain injection gets
>> retried at the right point in time (read: once the other interrupt handler IRETs
>> back to original context).
> 
> Yes, this is basically our problem. Right now, the #PF would overwrite other 
> interrupts, which is very bad. On the other hand, it can't return an error 
> (if I understand the code correctly), since it can't know if another event 
> will be scheduled for injection. As I told Andrew, at least returning an 
> error that would indicate the #PF cannot be injected may help us a lot here 

But an event that prevents the injected one to make it may get
generated only _after_ the inject hypercall was completed. Once
again - the problem needs to be solved elsewhere.

> (I'm sure making the injected trap take precedence over other events would 
> not be acceptable).

Indeed.

Jan
Andrei Vlad LUTAS Nov. 2, 2016, 9:38 a.m. UTC | #17
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 2 November, 2016 11:31
> To: Razvan Cojocaru <rcojocaru@bitdefender.com>
> Cc: Andrei Vlad LUTAS <vlutas@bitdefender.com>;
> andrew.cooper3@citrix.com; xen-devel@lists.xenproject.org;
> tamas@tklengyel.com
> Subject: Re: [Xen-devel] xc_hvm_inject_trap() races
>
> >>> On 02.11.16 at 10:11, <rcojocaru@bitdefender.com> wrote:
> > On 11/02/2016 11:05 AM, Jan Beulich wrote:
> >>>>> On 02.11.16 at 09:57, <rcojocaru@bitdefender.com> wrote:
> >>> On 11/02/2016 10:49 AM, Jan Beulich wrote:
> >>>> The fact that {vmx,svm}_inject_trap() combine the new exception
> >>>> with an already injected one (and blindly discard events other than
> >>>> hw exceptions), otoh, looks like indeed wants to be controllable by
> >>>> the caller: When the event comes from the outside (the hypercall),
> >>>> it would clearly seem better to simply tell the caller that no
> >>>> injection happened and the event needs to be kept pending.
> >>>
> >>> However this is not possible with the current design, since all
> >>> xc_hvm_inject_trap() really does is set the info to be used at
> >>> hvm_do_resume() time. So at the time xc_hvm_inject_trap() returns,
> >>> it's not yet possible to know if the injection will succeed or not
> >>> (assuming we discard it when it would collide with another).
> >>
> >> That's my point - it shouldn't get discarded, but remain latched for
> >> a future invocation of hvm_do_resume(). Making
> >> hvm_inject_trap() have a suitable parameter (and a return value)
> >> would be the easy part of the change here. The difficult part would
> >> be to make sure the injection context is the right one.
> >
> > Should I then bring this patch back?
> >
> > https://lists.xen.org/archives/html/xen-devel/2014-07/msg02927.html
> >
> > It has been rejected at the time on the grounds that
> > xc_hvm_inject_trap() is sufficient.
>
> I don't think it would deal with all possible situations, the more that it's
> (already by its title) #PF specific. I think the named difficult part would need
> to be solved in the hypervisor alone, without further external information.
>
> Jan

With some HVI re-engineering I think I may be able to make everything work if the existing API would return an error in case it cannot inject the #PF. Would that be possible?

>
>
> ________________________
> This email was scanned by Bitdefender

Best regards,
Andrei.
Andrei Vlad LUTAS Nov. 2, 2016, 9:53 a.m. UTC | #18
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 2 November, 2016 11:38
> To: rcojocaru@bitdefender.com; Andrei Vlad LUTAS
> <vlutas@bitdefender.com>
> Cc: andrew.cooper3@citrix.com; xen-devel@lists.xenproject.org;
> tamas@tklengyel.com
> Subject: RE: RE: [Xen-devel] xc_hvm_inject_trap() races
> 
> >>> On 02.11.16 at 10:13, <vlutas@bitdefender.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: 2 November, 2016 10:50
> >> >>> On 01.11.16 at 23:17, <vlutas@bitdefender.com> wrote:
> >> > We don't really care when and how the #PF is handled. We don't care
> >> > if the page is paged out at some random point. What we do know is
> >> > that at a certain point in the future, the page will be swapped in;
> >> > how do we know when? The OS will write the guest page tables, at
> >> > which point we can inspect the physical page itself (so you can see
> >> > here why we don't care about the page being swapped out sometime in
> >> > the future). So we really _can_ lift any restriction we want at that point.
> >>
> >> Hmm, I'm having difficulty seeing the supposedly broken flow of
> >> events
> >> here: Earlier it was said that #PF injection would be a result of EPT
> >> event processing. Here you say that the lifting of the restrictions
> >> would be a result of seeing the guest modify its page tables (which
> >> would in turn be a result of the #PF actually having arrived in the
> >> guest). So if (with this, and as you say
> >> above) you don't care when the #PF gets handled, where's the original
> >> problem?
> >
> > That's not what I wanted to say, sorry if it was unclear. What I'm
> > trying to say is that the decision to inject a #PF can be made when
> > handling an EPT violation - the accessed page needs not be related in
> > any way with the page for which we decide to inject the #PF. For
> > example, we intercept writes in a list that describes the loaded
> > module. Whenever a new module is loaded, an entry would be inserted
> > into that list, and that would generate an EPT write violation. Now,
> > the introspection logic will be able to analyze what module was loaded
> > and where, and it may find out that the module headers (which are
> > needed by the protection logic) are not present in memory - therefore,
> > it would inject a #PF in order to force the OS to swap in said
> > headers. On the other hand, the HVI logic may also decide that it
> > doesn't need to watch for modules loading anymore (for example, all the
> interesting modules were loaded), so it will remove the write hook from the
> list of loaded modules.
> > These two (injection of the #PF and the removal of the EPT write
> > protection) would be done in the same event handler, so we can't rely
> > on the event being re-generated in this case. Hopefully this example
> makes it more clear.
> 
> If the decision whether further events are needed depends on data in a
> page not present in memory, how can that decision be taken before the
> injected #PF has actually been handled? I'm still not seeing a flow of events
> where there is a problem. Furthermore, I don't think it would do much harm
> if you kept the watch in place slightly longer?

The decision whether further events are needed or not is NOT made based on the contents of the missing page. Let us assume we have the MODULE structure, that contains a "name" and an "address". The MODULE is inserted in the modules list via a write, which triggers an EPT violation, which is handled by HVI. The HVI sees that "name" is the module it was waiting for (for example, ntdll, kernel32, or whatever), and decides that it doesn't want to intercept other modules being loaded, so it removes the write hook from the list. Furthermore, it sees that "address" points to a swapped-out page, so it injects a #PF, to swap it in. 

> 
> >> The fact that {vmx,svm}_inject_trap() combine the new exception with
> >> an already injected one (and blindly discard events other than hw
> >> exceptions), otoh, looks like indeed wants to be controllable by the
> >> caller: When the event comes from the outside (the hypercall), it
> >> would clearly seem better to simply tell the caller that no injection
> >> happened and the event needs to be kept pending. The main question
> >> then is how to make certain injection gets retried at the right point
> >> in time (read: once the other interrupt handler IRETs back to original
> context).
> >
> > Yes, this is basically our problem. Right now, the #PF would overwrite
> > other interrupts, which is very bad. On the other hand, it can't
> > return an error (if I understand the code correctly), since it can't
> > know if another event will be scheduled for injection. As I told
> > Andrew, at least returning an error that would indicate the #PF cannot
> > be injected may help us a lot here
> 
> But an event that prevents the injected one to make it may get generated
> only _after_ the inject hypercall was completed. Once again - the problem
> needs to be solved elsewhere.
> 
> > (I'm sure making the injected trap take precedence over other events
> > would not be acceptable).
> 
> Indeed.
> 
> Jan
> 
> ________________________
> This email was scanned by Bitdefender
Jan Beulich Nov. 2, 2016, 10:02 a.m. UTC | #19
>>> On 02.11.16 at 10:38, <vlutas@bitdefender.com> wrote:
> With some HVI re-engineering I think I may be able to make everything work if 
> the existing API would return an error in case it cannot inject the #PF. 
> Would that be possible?

As said in another reply - at the time of the hypercall it may still be
unknown whether by the time hvm_do_resume() inspects the field
there is another event preventing the injection.

Jan
Jan Beulich Nov. 2, 2016, 10:05 a.m. UTC | #20
>>> On 02.11.16 at 10:53, <vlutas@bitdefender.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 2 November, 2016 11:38
>> >>> On 02.11.16 at 10:13, <vlutas@bitdefender.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: 2 November, 2016 10:50
>> >> >>> On 01.11.16 at 23:17, <vlutas@bitdefender.com> wrote:
>> >> > We don't really care when and how the #PF is handled. We don't care
>> >> > if the page is paged out at some random point. What we do know is
>> >> > that at a certain point in the future, the page will be swapped in;
>> >> > how do we know when? The OS will write the guest page tables, at
>> >> > which point we can inspect the physical page itself (so you can see
>> >> > here why we don't care about the page being swapped out sometime in
>> >> > the future). So we really _can_ lift any restriction we want at that point.
>> >>
>> >> Hmm, I'm having difficulty seeing the supposedly broken flow of
>> >> events
>> >> here: Earlier it was said that #PF injection would be a result of EPT
>> >> event processing. Here you say that the lifting of the restrictions
>> >> would be a result of seeing the guest modify its page tables (which
>> >> would in turn be a result of the #PF actually having arrived in the
>> >> guest). So if (with this, and as you say
>> >> above) you don't care when the #PF gets handled, where's the original
>> >> problem?
>> >
>> > That's not what I wanted to say, sorry if it was unclear. What I'm
>> > trying to say is that the decision to inject a #PF can be made when
>> > handling an EPT violation - the accessed page needs not be related in
>> > any way with the page for which we decide to inject the #PF. For
>> > example, we intercept writes in a list that describes the loaded
>> > module. Whenever a new module is loaded, an entry would be inserted
>> > into that list, and that would generate an EPT write violation. Now,
>> > the introspection logic will be able to analyze what module was loaded
>> > and where, and it may find out that the module headers (which are
>> > needed by the protection logic) are not present in memory - therefore,
>> > it would inject a #PF in order to force the OS to swap in said
>> > headers. On the other hand, the HVI logic may also decide that it
>> > doesn't need to watch for modules loading anymore (for example, all the
>> interesting modules were loaded), so it will remove the write hook from the
>> list of loaded modules.
>> > These two (injection of the #PF and the removal of the EPT write
>> > protection) would be done in the same event handler, so we can't rely
>> > on the event being re-generated in this case. Hopefully this example
>> makes it more clear.
>> 
>> If the decision whether further events are needed depends on data in a
>> page not present in memory, how can that decision be taken before the
>> injected #PF has actually been handled? I'm still not seeing a flow of events
>> where there is a problem. Furthermore, I don't think it would do much harm
>> if you kept the watch in place slightly longer?
> 
> The decision whether further events are needed or not is NOT made based on 
> the contents of the missing page. Let us assume we have the MODULE structure, 
> that contains a "name" and an "address". The MODULE is inserted in the 
> modules list via a write, which triggers an EPT violation, which is handled 
> by HVI. The HVI sees that "name" is the module it was waiting for (for 
> example, ntdll, kernel32, or whatever), and decides that it doesn't want to 
> intercept other modules being loaded, so it removes the write hook from the 
> list. Furthermore, it sees that "address" points to a swapped-out page, so it 
> injects a #PF, to swap it in. 

So what's the #PF needed for then, if the introspection engine doesn't
need looking at the page? Once again - I think it would have helped
quite a bit if a _complete_ picture had been drawn from the very
beginning of this thread.

Jan
Razvan Cojocaru Nov. 2, 2016, 10:09 a.m. UTC | #21
On 11/02/2016 11:05 AM, Tamas K Lengyel wrote:
> On Nov 2, 2016 09:57, "Razvan Cojocaru" <rcojocaru@bitdefender.com
> <mailto:rcojocaru@bitdefender.com>> wrote:
>>
>> On 11/02/2016 10:49 AM, Jan Beulich wrote:
>> > The fact that {vmx,svm}_inject_trap() combine the new exception
>> > with an already injected one (and blindly discard events other than
>> > hw exceptions), otoh, looks like indeed wants to be controllable by
>> > the caller: When the event comes from the outside (the hypercall),
>> > it would clearly seem better to simply tell the caller that no injection
>> > happened and the event needs to be kept pending.
>>
>> However this is not possible with the current design, since all
>> xc_hvm_inject_trap() really does is set the info to be used at
>> hvm_do_resume() time. So at the time xc_hvm_inject_trap() returns, it's
>> not yet possible to know if the injection will succeed or not (assuming
>> we discard it when it would collide with another).
>>
> 
> You can always introduce a new (asynchronous?) vm_event for signaling a
> failed interrupt injection on the vCPU.

That might work, though I'm not sure it's an acceptable solution for the
problem - at the very least, we can't assume or require that the
xc_hvm_inject_trap() caller should register to, and process, vm_events.


Thanks,
Razvan
Andrei Vlad LUTAS Nov. 2, 2016, 10:22 a.m. UTC | #22
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 2 November, 2016 12:05
> To: Andrei Vlad LUTAS <vlutas@bitdefender.com>
> Cc: rcojocaru@bitdefender.com; andrew.cooper3@citrix.com; xen-
> devel@lists.xenproject.org; tamas@tklengyel.com
> Subject: RE: RE: [Xen-devel] xc_hvm_inject_trap() races
> 
> >>> On 02.11.16 at 10:53, <vlutas@bitdefender.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: 2 November, 2016 11:38
> >> >>> On 02.11.16 at 10:13, <vlutas@bitdefender.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: 2 November, 2016 10:50
> >> >> >>> On 01.11.16 at 23:17, <vlutas@bitdefender.com> wrote:
> >> >> > We don't really care when and how the #PF is handled. We don't
> >> >> > care if the page is paged out at some random point. What we do
> >> >> > know is that at a certain point in the future, the page will be
> >> >> > swapped in; how do we know when? The OS will write the guest
> >> >> > page tables, at which point we can inspect the physical page
> >> >> > itself (so you can see here why we don't care about the page
> >> >> > being swapped out sometime in the future). So we really _can_ lift
> any restriction we want at that point.
> >> >>
> >> >> Hmm, I'm having difficulty seeing the supposedly broken flow of
> >> >> events
> >> >> here: Earlier it was said that #PF injection would be a result of
> >> >> EPT event processing. Here you say that the lifting of the
> >> >> restrictions would be a result of seeing the guest modify its page
> >> >> tables (which would in turn be a result of the #PF actually having
> >> >> arrived in the guest). So if (with this, and as you say
> >> >> above) you don't care when the #PF gets handled, where's the
> >> >> original problem?
> >> >
> >> > That's not what I wanted to say, sorry if it was unclear. What I'm
> >> > trying to say is that the decision to inject a #PF can be made when
> >> > handling an EPT violation - the accessed page needs not be related
> >> > in any way with the page for which we decide to inject the #PF. For
> >> > example, we intercept writes in a list that describes the loaded
> >> > module. Whenever a new module is loaded, an entry would be inserted
> >> > into that list, and that would generate an EPT write violation.
> >> > Now, the introspection logic will be able to analyze what module
> >> > was loaded and where, and it may find out that the module headers
> >> > (which are needed by the protection logic) are not present in
> >> > memory - therefore, it would inject a #PF in order to force the OS
> >> > to swap in said headers. On the other hand, the HVI logic may also
> >> > decide that it doesn't need to watch for modules loading anymore
> >> > (for example, all the
> >> interesting modules were loaded), so it will remove the write hook
> >> from the list of loaded modules.
> >> > These two (injection of the #PF and the removal of the EPT write
> >> > protection) would be done in the same event handler, so we can't
> >> > rely on the event being re-generated in this case. Hopefully this
> >> > example
> >> makes it more clear.
> >>
> >> If the decision whether further events are needed depends on data in
> >> a page not present in memory, how can that decision be taken before
> >> the injected #PF has actually been handled? I'm still not seeing a
> >> flow of events where there is a problem. Furthermore, I don't think
> >> it would do much harm if you kept the watch in place slightly longer?
> >
> > The decision whether further events are needed or not is NOT made
> > based on the contents of the missing page. Let us assume we have the
> > MODULE structure, that contains a "name" and an "address". The MODULE
> > is inserted in the modules list via a write, which triggers an EPT
> > violation, which is handled by HVI. The HVI sees that "name" is the
> > module it was waiting for (for example, ntdll, kernel32, or whatever),
> > and decides that it doesn't want to intercept other modules being
> > loaded, so it removes the write hook from the list. Furthermore, it
> > sees that "address" points to a swapped-out page, so it injects a #PF, to
> swap it in.
> 
> So what's the #PF needed for then, if the introspection engine doesn't need
> looking at the page? Once again - I think it would have helped quite a bit if a
> _complete_ picture had been drawn from the very beginning of this thread.

Who said the introspection logic doesn't need to inspect the page? That's why we inject the #PF. Because we need to further inspect the page. But the decision to inspect the missing page or the decision that further module events are relevant or not is not related in any way to the contents of the missing page. The contents of the missing page need to be inspected for other reasons.
In my opinion, the complete picture _was_ drawn from the beginning, it just seems that we need to zoom in more. You also have to understand that the HVI itself is closed-source and there's a point beyond which we simply can't give any more info. 

> 
> Jan
> 
> 
> ________________________
> This email was scanned by Bitdefender
Jan Beulich Nov. 2, 2016, 10:46 a.m. UTC | #23
>>> On 02.11.16 at 11:22, <vlutas@bitdefender.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 2 November, 2016 12:05
>> >>> On 02.11.16 at 10:53, <vlutas@bitdefender.com> wrote:
>> > The decision whether further events are needed or not is NOT made
>> > based on the contents of the missing page. Let us assume we have the
>> > MODULE structure, that contains a "name" and an "address". The MODULE
>> > is inserted in the modules list via a write, which triggers an EPT
>> > violation, which is handled by HVI. The HVI sees that "name" is the
>> > module it was waiting for (for example, ntdll, kernel32, or whatever),
>> > and decides that it doesn't want to intercept other modules being
>> > loaded, so it removes the write hook from the list. Furthermore, it
>> > sees that "address" points to a swapped-out page, so it injects a #PF, to
>> swap it in.
>> 
>> So what's the #PF needed for then, if the introspection engine doesn't need
>> looking at the page? Once again - I think it would have helped quite a bit if a
>> _complete_ picture had been drawn from the very beginning of this thread.
> 
> Who said the introspection logic doesn't need to inspect the page? That's 
> why we inject the #PF. Because we need to further inspect the page.

Looks like I've drawn a wrong conclusion then - sorry.

> But the 
> decision to inspect the missing page or the decision that further module 
> events are relevant or not is not related in any way to the contents of the 
> missing page. The contents of the missing page need to be inspected for other reasons.

And the disabling of (in your example) module load monitoring could
then be done at that point, rather than right away?

> In my opinion, the complete picture _was_ drawn from the beginning, it just 
> seems that we need to zoom in more. You also have to understand that the HVI 
> itself is closed-source and there's a point beyond which we simply can't give 
> any more info. 

I'm sure you understand that with partial / insufficient info it then
may be impossible for anyone here to give advice which is actually
helpful to you.

Jan
Andrei Vlad LUTAS Nov. 2, 2016, 11:05 a.m. UTC | #24
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 2 November, 2016 12:46
> To: Andrei Vlad LUTAS <vlutas@bitdefender.com>
> Cc: rcojocaru@bitdefender.com; andrew.cooper3@citrix.com; xen-
> devel@lists.xenproject.org; tamas@tklengyel.com
> Subject: RE: RE: [Xen-devel] xc_hvm_inject_trap() races
>
> >>> On 02.11.16 at 11:22, <vlutas@bitdefender.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: 2 November, 2016 12:05
> >> >>> On 02.11.16 at 10:53, <vlutas@bitdefender.com> wrote:
> >> > The decision whether further events are needed or not is NOT made
> >> > based on the contents of the missing page. Let us assume we have
> >> > the MODULE structure, that contains a "name" and an "address". The
> >> > MODULE is inserted in the modules list via a write, which triggers
> >> > an EPT violation, which is handled by HVI. The HVI sees that "name"
> >> > is the module it was waiting for (for example, ntdll, kernel32, or
> >> > whatever), and decides that it doesn't want to intercept other
> >> > modules being loaded, so it removes the write hook from the list.
> >> > Furthermore, it sees that "address" points to a swapped-out page,
> >> > so it injects a #PF, to
> >> swap it in.
> >>
> >> So what's the #PF needed for then, if the introspection engine
> >> doesn't need looking at the page? Once again - I think it would have
> >> helped quite a bit if a _complete_ picture had been drawn from the very
> beginning of this thread.
> >
> > Who said the introspection logic doesn't need to inspect the page?
> > That's why we inject the #PF. Because we need to further inspect the
> page.
>
> Looks like I've drawn a wrong conclusion then - sorry.
>
> > But the
> > decision to inspect the missing page or the decision that further
> > module events are relevant or not is not related in any way to the
> > contents of the missing page. The contents of the missing page need to be
> inspected for other reasons.
>
> And the disabling of (in your example) module load monitoring could then be
> done at that point, rather than right away?

We could theoretically do even better than that - for example, inject an INT3 (0xCC) instruction at that point, and make sure the VCPU doesn't advance until we get to inject our #PF. But even this requires some modifications, because right now, we cannot know what and if the injection will succeed.

>
> > In my opinion, the complete picture _was_ drawn from the beginning, it
> > just seems that we need to zoom in more. You also have to understand
> > that the HVI itself is closed-source and there's a point beyond which
> > we simply can't give any more info.
>
> I'm sure you understand that with partial / insufficient info it then may be
> impossible for anyone here to give advice which is actually helpful to you.

Absolutely. I didn't mean to sound obtrusive or anything, it's just that we simply cannot disclose some of the details, as much as I would want (intellectual property, etc.). We already provided and we will continue providing as much info as we can. After all, we, above all, want this fixed, as it impacts our product.

>
> Jan
>
>
> ________________________
> This email was scanned by Bitdefender

Best regards,
Andrei.
Razvan Cojocaru Nov. 3, 2016, 9:59 a.m. UTC | #25
On 11/02/2016 11:30 AM, Jan Beulich wrote:
>>>> On 02.11.16 at 10:11, <rcojocaru@bitdefender.com> wrote:
>> On 11/02/2016 11:05 AM, Jan Beulich wrote:
>>>>>> On 02.11.16 at 09:57, <rcojocaru@bitdefender.com> wrote:
>>>> On 11/02/2016 10:49 AM, Jan Beulich wrote:
>>>>> The fact that {vmx,svm}_inject_trap() combine the new exception
>>>>> with an already injected one (and blindly discard events other than
>>>>> hw exceptions), otoh, looks like indeed wants to be controllable by
>>>>> the caller: When the event comes from the outside (the hypercall),
>>>>> it would clearly seem better to simply tell the caller that no injection
>>>>> happened and the event needs to be kept pending.
>>>>
>>>> However this is not possible with the current design, since all
>>>> xc_hvm_inject_trap() really does is set the info to be used at
>>>> hvm_do_resume() time. So at the time xc_hvm_inject_trap() returns, it's
>>>> not yet possible to know if the injection will succeed or not (assuming
>>>> we discard it when it would collide with another).
>>>
>>> That's my point - it shouldn't get discarded, but remain latched
>>> for a future invocation of hvm_do_resume(). Making
>>> hvm_inject_trap() have a suitable parameter (and a return value)
>>> would be the easy part of the change here. The difficult part would
>>> be to make sure the injection context is the right one.
>>
>> Should I then bring this patch back?
>>
>> https://lists.xen.org/archives/html/xen-devel/2014-07/msg02927.html 
>>
>> It has been rejected at the time on the grounds that
>> xc_hvm_inject_trap() is sufficient.
> 
> I don't think it would deal with all possible situations, the more
> that it's (already by its title) #PF specific. I think the named
> difficult part would need to be solved in the hypervisor alone,
> without further external information.

Found an earlier version of the patch, proposing to extend
HVMOP_inject_trap instead of a new libxc hypercall, which apparently got
so far as to have an "Acked-by" and a "Reviewed-by":

https://lists.xenproject.org/archives/html/xen-devel/2014-09/msg01244.html

Is this one worth bringing back with the new information?


Thanks,
Razvan
Jan Beulich Nov. 3, 2016, 10:07 a.m. UTC | #26
>>> On 03.11.16 at 10:59, <rcojocaru@bitdefender.com> wrote:
> Found an earlier version of the patch, proposing to extend
> HVMOP_inject_trap instead of a new libxc hypercall, which apparently got
> so far as to have an "Acked-by" and a "Reviewed-by":
> 
> https://lists.xenproject.org/archives/html/xen-devel/2014-09/msg01244.html 
> 
> Is this one worth bringing back with the new information?

Considering the discussion we had here, I'm not convinced. As said
before, I think the issue wants addressing in the hypervisor alone.

Jan
Razvan Cojocaru Nov. 5, 2016, 5:22 p.m. UTC | #27
On 11/02/2016 12:55 AM, Andrew Cooper wrote:
> On 01/11/2016 22:17, Andrei Vlad LUTAS wrote:
>>>> First of all, to answer your original question: the injection decision
>>>> is made when the introspection logic needs to inspect a page that is
>>>> not present in the physical memory. We don't really care if the current
>>>> instruction triggers multiple faults or not (and here I'm not sure what
>>>> you mean by that - multiple exceptions, or multiple EPT violations -
>>>> but the answer is still the same), and removing the page restrictions
>>>> after the #PF injection is introspection specific logic - the address
>>>> for which we inject the #PF doesn't have to be related in any way to the current instruction.
>>> Ah, that's this no-architectural behavior again.
>> I don't think the HVI #PF injection internals or how the #PF is handled by the OS are relevant here. We are using an existing API that seems to not work quite correct under certain circumstances and we were curious if any of you can shed some light in this regard, and maybe point us to the right direction for cooking up a fix.
> 
> Just because there is an API like this, doesn't necessarily mean it ever
> worked.  This one clearly doesn't, and it got introduced before we as a
> community took a rather harder stance towards code review.
> 
> Architecturally speaking, faults are always raised as a direct
> consequence of the current state.  Therefore, having in introspection
> agent interposing on the instruction stream and causing faults as a side
> effect of EPT permissions/etc, is quite natural and in line with
> architectural expectation.
> 
> You also have a second usecase for this API, which is to trick Windows
> into paging in a frame you care about looking at.  Overall, using this
> #PF method to get Windows to page content back in clearly the rational
> way of making that happen, but it is very definitely a non-architectural
> usecase; if windows were to double check the instruction stream on
> getting this pagefault, it would get very confused, as the pagefault it
> received has nothing to do with the code pointed at in the exception frame.
> 
> It is quite likely that these different usecases might have different
> solutions.  IMO, the former should probably be controlled by responses
> in the vm_event ring, but this latter issue probably shouldn't.
> 
> When it comes to injecting exceptions, there are some restrictions which
> limit what can legitimately be done.  We can only inject a single thing
> at once.  Stacking a #PF on top of a plain interrupt could be resolved
> by leaving the interrupt pending in the vLAPIC and injecting the #PF
> instead.  Stacking a #PF on top of a different fault is going to cause
> hvm_combine_exceptions() to turn it into something more nasty.  OTOH,
> deferring the #PF by even a single instruction could result in it being
> sent in an unsafe context, which should also be avoided.

I wonder if it would be acceptable to simply add "&&
v->arch.hvm_vcpu.inject_trap.vector != -1" to the early return condition
for vmx_intr_assist()?

Event injection is already blocked there if v->arch.hvm_vcpu.single_step
is set, so at least in that case it is acceptable behaviour. We could
also block it if there's a pending user-requested injection.


Thanks,
Razvan
Jan Beulich Nov. 7, 2016, 8:45 a.m. UTC | #28
>>> On 05.11.16 at 18:22, <rcojocaru@bitdefender.com> wrote:
> On 11/02/2016 12:55 AM, Andrew Cooper wrote:
>> On 01/11/2016 22:17, Andrei Vlad LUTAS wrote:
>>>>> First of all, to answer your original question: the injection decision
>>>>> is made when the introspection logic needs to inspect a page that is
>>>>> not present in the physical memory. We don't really care if the current
>>>>> instruction triggers multiple faults or not (and here I'm not sure what
>>>>> you mean by that - multiple exceptions, or multiple EPT violations -
>>>>> but the answer is still the same), and removing the page restrictions
>>>>> after the #PF injection is introspection specific logic - the address
>>>>> for which we inject the #PF doesn't have to be related in any way to the current instruction.
>>>> Ah, that's this no-architectural behavior again.
>>> I don't think the HVI #PF injection internals or how the #PF is handled by 
> the OS are relevant here. We are using an existing API that seems to not work 
> quite correct under certain circumstances and we were curious if any of you 
> can shed some light in this regard, and maybe point us to the right direction 
> for cooking up a fix.
>> 
>> Just because there is an API like this, doesn't necessarily mean it ever
>> worked.  This one clearly doesn't, and it got introduced before we as a
>> community took a rather harder stance towards code review.
>> 
>> Architecturally speaking, faults are always raised as a direct
>> consequence of the current state.  Therefore, having in introspection
>> agent interposing on the instruction stream and causing faults as a side
>> effect of EPT permissions/etc, is quite natural and in line with
>> architectural expectation.
>> 
>> You also have a second usecase for this API, which is to trick Windows
>> into paging in a frame you care about looking at.  Overall, using this
>> #PF method to get Windows to page content back in clearly the rational
>> way of making that happen, but it is very definitely a non-architectural
>> usecase; if windows were to double check the instruction stream on
>> getting this pagefault, it would get very confused, as the pagefault it
>> received has nothing to do with the code pointed at in the exception frame.
>> 
>> It is quite likely that these different usecases might have different
>> solutions.  IMO, the former should probably be controlled by responses
>> in the vm_event ring, but this latter issue probably shouldn't.
>> 
>> When it comes to injecting exceptions, there are some restrictions which
>> limit what can legitimately be done.  We can only inject a single thing
>> at once.  Stacking a #PF on top of a plain interrupt could be resolved
>> by leaving the interrupt pending in the vLAPIC and injecting the #PF
>> instead.  Stacking a #PF on top of a different fault is going to cause
>> hvm_combine_exceptions() to turn it into something more nasty.  OTOH,
>> deferring the #PF by even a single instruction could result in it being
>> sent in an unsafe context, which should also be avoided.
> 
> I wonder if it would be acceptable to simply add "&&
> v->arch.hvm_vcpu.inject_trap.vector != -1" to the early return condition
> for vmx_intr_assist()?
> 
> Event injection is already blocked there if v->arch.hvm_vcpu.single_step
> is set, so at least in that case it is acceptable behaviour. We could
> also block it if there's a pending user-requested injection.

While that might be an option, how would that help? Didn't you say
that your problem was with the injected vector being ignored
because there already was an event injected? That to me implies
that hvm_do_resume() (doing the injection you care about) runs
_after_ that other injection, yet it certainly runs before
vmx_intr_assist(). Am I overlooking or mis-remembering something?

Jan
Razvan Cojocaru Nov. 7, 2016, 9:37 a.m. UTC | #29
On 11/07/2016 10:45 AM, Jan Beulich wrote:
>>>> On 05.11.16 at 18:22, <rcojocaru@bitdefender.com> wrote:
>> On 11/02/2016 12:55 AM, Andrew Cooper wrote:
>>> On 01/11/2016 22:17, Andrei Vlad LUTAS wrote:
>>>>>> First of all, to answer your original question: the injection decision
>>>>>> is made when the introspection logic needs to inspect a page that is
>>>>>> not present in the physical memory. We don't really care if the current
>>>>>> instruction triggers multiple faults or not (and here I'm not sure what
>>>>>> you mean by that - multiple exceptions, or multiple EPT violations -
>>>>>> but the answer is still the same), and removing the page restrictions
>>>>>> after the #PF injection is introspection specific logic - the address
>>>>>> for which we inject the #PF doesn't have to be related in any way to the current instruction.
>>>>> Ah, that's this no-architectural behavior again.
>>>> I don't think the HVI #PF injection internals or how the #PF is handled by 
>> the OS are relevant here. We are using an existing API that seems to not work 
>> quite correct under certain circumstances and we were curious if any of you 
>> can shed some light in this regard, and maybe point us to the right direction 
>> for cooking up a fix.
>>>
>>> Just because there is an API like this, doesn't necessarily mean it ever
>>> worked.  This one clearly doesn't, and it got introduced before we as a
>>> community took a rather harder stance towards code review.
>>>
>>> Architecturally speaking, faults are always raised as a direct
>>> consequence of the current state.  Therefore, having in introspection
>>> agent interposing on the instruction stream and causing faults as a side
>>> effect of EPT permissions/etc, is quite natural and in line with
>>> architectural expectation.
>>>
>>> You also have a second usecase for this API, which is to trick Windows
>>> into paging in a frame you care about looking at.  Overall, using this
>>> #PF method to get Windows to page content back in clearly the rational
>>> way of making that happen, but it is very definitely a non-architectural
>>> usecase; if windows were to double check the instruction stream on
>>> getting this pagefault, it would get very confused, as the pagefault it
>>> received has nothing to do with the code pointed at in the exception frame.
>>>
>>> It is quite likely that these different usecases might have different
>>> solutions.  IMO, the former should probably be controlled by responses
>>> in the vm_event ring, but this latter issue probably shouldn't.
>>>
>>> When it comes to injecting exceptions, there are some restrictions which
>>> limit what can legitimately be done.  We can only inject a single thing
>>> at once.  Stacking a #PF on top of a plain interrupt could be resolved
>>> by leaving the interrupt pending in the vLAPIC and injecting the #PF
>>> instead.  Stacking a #PF on top of a different fault is going to cause
>>> hvm_combine_exceptions() to turn it into something more nasty.  OTOH,
>>> deferring the #PF by even a single instruction could result in it being
>>> sent in an unsafe context, which should also be avoided.
>>
>> I wonder if it would be acceptable to simply add "&&
>> v->arch.hvm_vcpu.inject_trap.vector != -1" to the early return condition
>> for vmx_intr_assist()?
>>
>> Event injection is already blocked there if v->arch.hvm_vcpu.single_step
>> is set, so at least in that case it is acceptable behaviour. We could
>> also block it if there's a pending user-requested injection.
> 
> While that might be an option, how would that help? Didn't you say
> that your problem was with the injected vector being ignored
> because there already was an event injected? That to me implies
> that hvm_do_resume() (doing the injection you care about) runs
> _after_ that other injection, yet it certainly runs before
> vmx_intr_assist(). Am I overlooking or mis-remembering something?

The xc_hvm_inject_trap() injected vector is not ignored (since we're
doing a #PF), our injection does succeed. The main problem is that this
can override pending interrupts - specifically reinjected ones.

You're right, blocking interrupts in the style of the singlestep code
only solves the problem partially.

The interrupts we're worried about (even with the singlestep approach)
are the ones reinjected by vmx_idtv_reinject() in vmx_vmexit_handler().

The problem there is that vmx_idtv_reinject() is a corner case: it
writes VM_ENTRY_INTR_INFO directly, and this can happen _before_ the
guest exits with, let's say, an EPT vm_event.

In that case, the guest has exited, there's already an interrupt
pending, and while the VCPU is paused waiting for the vm_event reply we
request an injection that effectively obliterates the pending interrupt.

Going the singlestep way is satisfactory for us for most cases, but it
still leaves the described corner case. The only fix proposal we could
think of is to, instead of vmx_idtv_reinject() doing the work, simply
set some flags, and have a later function do the actual work, in
vmx_intr_assist() style.

I hope I've been able to make this clearer (and I haven't misunderstood
something in the process :) ).


Thanks,
Razvan
Jan Beulich Nov. 7, 2016, 9:57 a.m. UTC | #30
>>> On 07.11.16 at 10:37, <rcojocaru@bitdefender.com> wrote:
> The problem there is that vmx_idtv_reinject() is a corner case: it
> writes VM_ENTRY_INTR_INFO directly, and this can happen _before_ the
> guest exits with, let's say, an EPT vm_event.
> 
> In that case, the guest has exited, there's already an interrupt
> pending, and while the VCPU is paused waiting for the vm_event reply we
> request an injection that effectively obliterates the pending interrupt.
> 
> Going the singlestep way is satisfactory for us for most cases, but it
> still leaves the described corner case. The only fix proposal we could
> think of is to, instead of vmx_idtv_reinject() doing the work, simply
> set some flags, and have a later function do the actual work, in
> vmx_intr_assist() style.
> 
> I hope I've been able to make this clearer (and I haven't misunderstood
> something in the process :) ).

You did, thanks. Yet I continue to remain on my earlier position: It's
the non-architectural (injected) event which should get deferred,
not architectural ones (either the variant you describe above, or
any which hypervisor processing found necessary to deliver). The
single step case can be viewed as an exception here, albeit ideally
it also wouldn't need to be special cased.

Jan
Razvan Cojocaru Nov. 7, 2016, 11:49 a.m. UTC | #31
On 11/07/2016 11:57 AM, Jan Beulich wrote:
>>>> On 07.11.16 at 10:37, <rcojocaru@bitdefender.com> wrote:
>> The problem there is that vmx_idtv_reinject() is a corner case: it
>> writes VM_ENTRY_INTR_INFO directly, and this can happen _before_ the
>> guest exits with, let's say, an EPT vm_event.
>>
>> In that case, the guest has exited, there's already an interrupt
>> pending, and while the VCPU is paused waiting for the vm_event reply we
>> request an injection that effectively obliterates the pending interrupt.
>>
>> Going the singlestep way is satisfactory for us for most cases, but it
>> still leaves the described corner case. The only fix proposal we could
>> think of is to, instead of vmx_idtv_reinject() doing the work, simply
>> set some flags, and have a later function do the actual work, in
>> vmx_intr_assist() style.
>>
>> I hope I've been able to make this clearer (and I haven't misunderstood
>> something in the process :) ).
> 
> You did, thanks. Yet I continue to remain on my earlier position: It's
> the non-architectural (injected) event which should get deferred,
> not architectural ones (either the variant you describe above, or
> any which hypervisor processing found necessary to deliver). The
> single step case can be viewed as an exception here, albeit ideally
> it also wouldn't need to be special cased.

But the problem is that if we defer, say a #PF, by the time there's a VM
entry with no pending interrupt we may get the wrong context (wrong
address space, guest mode, etc.).

The simple solution to this would be to save the context (for a #PF this
would additionally mean CR3 as well), and only inject at the first
opportunity where the context matches.

However, this gets ugly quickly: for one, the right context for a #PF
may occur on a different VCPU, so right off the bat we can't hold this
information per-VCPU anymore (it would need to be per-domain).

Second, it is conceivable that there will be several injection requests
not yet delivered, and in that case we need to save the context for all
of them (so an array or some sort of container of contexts becomes
necessary).

And third, there's no guarantee that the guest will ever reach the
proper context for injecting any of the deferred interrupts for the
duration of it's life, which brings us back to the problem we were
trying to solve: we still can't guarantee trap delivery, but now in a
much more complicated manner. :)

We haven't even discussed what to do if a new request comes when the
context buffer is full (we'd need to lose an undelivered trap), or that
different trap types may require different contexts and handling logic.

At this point it looks like the best solution is to simply discard the
non-architectural event if there's a pending architectural one, and add
a new vm_event, as suggested by Tamas, that notifies the application
that a trap has failed, or succeeded, and let it do the best it can with
that information.


Thanks,
Razvan
Jan Beulich Nov. 7, 2016, 12:23 p.m. UTC | #32
>>> On 07.11.16 at 12:49, <rcojocaru@bitdefender.com> wrote:
> On 11/07/2016 11:57 AM, Jan Beulich wrote:
>>>>> On 07.11.16 at 10:37, <rcojocaru@bitdefender.com> wrote:
>>> The problem there is that vmx_idtv_reinject() is a corner case: it
>>> writes VM_ENTRY_INTR_INFO directly, and this can happen _before_ the
>>> guest exits with, let's say, an EPT vm_event.
>>>
>>> In that case, the guest has exited, there's already an interrupt
>>> pending, and while the VCPU is paused waiting for the vm_event reply we
>>> request an injection that effectively obliterates the pending interrupt.
>>>
>>> Going the singlestep way is satisfactory for us for most cases, but it
>>> still leaves the described corner case. The only fix proposal we could
>>> think of is to, instead of vmx_idtv_reinject() doing the work, simply
>>> set some flags, and have a later function do the actual work, in
>>> vmx_intr_assist() style.
>>>
>>> I hope I've been able to make this clearer (and I haven't misunderstood
>>> something in the process :) ).
>> 
>> You did, thanks. Yet I continue to remain on my earlier position: It's
>> the non-architectural (injected) event which should get deferred,
>> not architectural ones (either the variant you describe above, or
>> any which hypervisor processing found necessary to deliver). The
>> single step case can be viewed as an exception here, albeit ideally
>> it also wouldn't need to be special cased.
> 
> But the problem is that if we defer, say a #PF, by the time there's a VM
> entry with no pending interrupt we may get the wrong context (wrong
> address space, guest mode, etc.).
> 
> The simple solution to this would be to save the context (for a #PF this
> would additionally mean CR3 as well), and only inject at the first
> opportunity where the context matches.
> 
> However, this gets ugly quickly: for one, the right context for a #PF
> may occur on a different VCPU, so right off the bat we can't hold this
> information per-VCPU anymore (it would need to be per-domain).
> 
> Second, it is conceivable that there will be several injection requests
> not yet delivered, and in that case we need to save the context for all
> of them (so an array or some sort of container of contexts becomes
> necessary).
> 
> And third, there's no guarantee that the guest will ever reach the
> proper context for injecting any of the deferred interrupts for the
> duration of it's life, which brings us back to the problem we were
> trying to solve: we still can't guarantee trap delivery, but now in a
> much more complicated manner. :)
> 
> We haven't even discussed what to do if a new request comes when the
> context buffer is full (we'd need to lose an undelivered trap), or that
> different trap types may require different contexts and handling logic.

All agreed, yet all are issues for you to solve in order to be able
to half way sanely inject a non-architectural fault. (And btw., CR3
as context may not suffice - you may also need e.g. the altp2m
index in use.)

> At this point it looks like the best solution is to simply discard the
> non-architectural event if there's a pending architectural one, and add
> a new vm_event, as suggested by Tamas, that notifies the application
> that a trap has failed, or succeeded, and let it do the best it can with
> that information.

Well, if the two of you think this is something which fits into the VM
event model, then I guess that's an option. I, for one, am not
convinced: It simply seems too special purpose. If this was a more
generic event ("interruption delivered", perhaps with a type or
vector qualifier) that can be subscribed to, and the app did that
only for such transient periods of time, this would look more
reasonable to me.

Jan
Razvan Cojocaru Nov. 7, 2016, 12:27 p.m. UTC | #33
On 11/07/2016 02:23 PM, Jan Beulich wrote:
>> At this point it looks like the best solution is to simply discard the
>> > non-architectural event if there's a pending architectural one, and add
>> > a new vm_event, as suggested by Tamas, that notifies the application
>> > that a trap has failed, or succeeded, and let it do the best it can with
>> > that information.
> Well, if the two of you think this is something which fits into the VM
> event model, then I guess that's an option. I, for one, am not
> convinced: It simply seems too special purpose. If this was a more
> generic event ("interruption delivered", perhaps with a type or
> vector qualifier) that can be subscribed to, and the app did that
> only for such transient periods of time, this would look more
> reasonable to me.

Indeed, that's the way I think of it: the user is free to subscribe to
the event or not, and the event is:

struct vm_event_injection_result {
    uint32_t vector;
    uint32_t type;
    uint32_t error_code;
    uint64_t cr2;
    uint32_t injected;
};

with injected 0 for failure and 1 for success. It's as generic as possible.


Thanks,
Razvan
Jan Beulich Nov. 7, 2016, 1:20 p.m. UTC | #34
>>> On 07.11.16 at 13:27, <rcojocaru@bitdefender.com> wrote:
> On 11/07/2016 02:23 PM, Jan Beulich wrote:
>>> At this point it looks like the best solution is to simply discard the
>>> > non-architectural event if there's a pending architectural one, and add
>>> > a new vm_event, as suggested by Tamas, that notifies the application
>>> > that a trap has failed, or succeeded, and let it do the best it can with
>>> > that information.
>> Well, if the two of you think this is something which fits into the VM
>> event model, then I guess that's an option. I, for one, am not
>> convinced: It simply seems too special purpose. If this was a more
>> generic event ("interruption delivered", perhaps with a type or
>> vector qualifier) that can be subscribed to, and the app did that
>> only for such transient periods of time, this would look more
>> reasonable to me.
> 
> Indeed, that's the way I think of it: the user is free to subscribe to
> the event or not, and the event is:
> 
> struct vm_event_injection_result {
>     uint32_t vector;
>     uint32_t type;
>     uint32_t error_code;
>     uint64_t cr2;
>     uint32_t injected;
> };
> 
> with injected 0 for failure and 1 for success. It's as generic as possible.

Wait, no, that's not what I did describe. I'm not talking about the
"result" of an injection (through hypercall), but about delivering of
events (of any origin). Hence there can't be any "failure" here.
IOW what I'm proposing is a "VM is about to see this interruption"
event.

Jan
Razvan Cojocaru Nov. 7, 2016, 1:44 p.m. UTC | #35
On 11/07/2016 03:20 PM, Jan Beulich wrote:
>>>> On 07.11.16 at 13:27, <rcojocaru@bitdefender.com> wrote:
>> On 11/07/2016 02:23 PM, Jan Beulich wrote:
>>>> At this point it looks like the best solution is to simply discard the
>>>>> non-architectural event if there's a pending architectural one, and add
>>>>> a new vm_event, as suggested by Tamas, that notifies the application
>>>>> that a trap has failed, or succeeded, and let it do the best it can with
>>>>> that information.
>>> Well, if the two of you think this is something which fits into the VM
>>> event model, then I guess that's an option. I, for one, am not
>>> convinced: It simply seems too special purpose. If this was a more
>>> generic event ("interruption delivered", perhaps with a type or
>>> vector qualifier) that can be subscribed to, and the app did that
>>> only for such transient periods of time, this would look more
>>> reasonable to me.
>>
>> Indeed, that's the way I think of it: the user is free to subscribe to
>> the event or not, and the event is:
>>
>> struct vm_event_injection_result {
>>     uint32_t vector;
>>     uint32_t type;
>>     uint32_t error_code;
>>     uint64_t cr2;
>>     uint32_t injected;
>> };
>>
>> with injected 0 for failure and 1 for success. It's as generic as possible.
> 
> Wait, no, that's not what I did describe. I'm not talking about the
> "result" of an injection (through hypercall), but about delivering of
> events (of any origin). Hence there can't be any "failure" here.
> IOW what I'm proposing is a "VM is about to see this interruption"
> event.

But a a success-only event is hard to interpret with regards to failure,
which is what we're really interested in (specifically, failure to
inject via hypercall). Do we count it as a failure if we don't receive a
"VM is about to see this interruption" event immediately after the
vm_event that caused the injection, on the same processor, with the same
trap vector? That's a tricky commitment to make.

That would also decrease performance, likely noticeably, for a
vm_event-based application, as we'd get many such events (most of which
we're not interested in at all) which would require treatment while the
respective VCPU is paused.

Tamas, any thoughts on this?


Thanks,
Razvan
Jan Beulich Nov. 7, 2016, 1:53 p.m. UTC | #36
>>> On 07.11.16 at 14:44, <rcojocaru@bitdefender.com> wrote:
> On 11/07/2016 03:20 PM, Jan Beulich wrote:
>>>>> On 07.11.16 at 13:27, <rcojocaru@bitdefender.com> wrote:
>>> On 11/07/2016 02:23 PM, Jan Beulich wrote:
>>>>> At this point it looks like the best solution is to simply discard the
>>>>>> non-architectural event if there's a pending architectural one, and add
>>>>>> a new vm_event, as suggested by Tamas, that notifies the application
>>>>>> that a trap has failed, or succeeded, and let it do the best it can with
>>>>>> that information.
>>>> Well, if the two of you think this is something which fits into the VM
>>>> event model, then I guess that's an option. I, for one, am not
>>>> convinced: It simply seems too special purpose. If this was a more
>>>> generic event ("interruption delivered", perhaps with a type or
>>>> vector qualifier) that can be subscribed to, and the app did that
>>>> only for such transient periods of time, this would look more
>>>> reasonable to me.
>>>
>>> Indeed, that's the way I think of it: the user is free to subscribe to
>>> the event or not, and the event is:
>>>
>>> struct vm_event_injection_result {
>>>     uint32_t vector;
>>>     uint32_t type;
>>>     uint32_t error_code;
>>>     uint64_t cr2;
>>>     uint32_t injected;
>>> };
>>>
>>> with injected 0 for failure and 1 for success. It's as generic as possible.
>> 
>> Wait, no, that's not what I did describe. I'm not talking about the
>> "result" of an injection (through hypercall), but about delivering of
>> events (of any origin). Hence there can't be any "failure" here.
>> IOW what I'm proposing is a "VM is about to see this interruption"
>> event.
> 
> But a a success-only event is hard to interpret with regards to failure,
> which is what we're really interested in (specifically, failure to
> inject via hypercall). Do we count it as a failure if we don't receive a
> "VM is about to see this interruption" event immediately after the
> vm_event that caused the injection, on the same processor, with the same
> trap vector? That's a tricky commitment to make.

If you subscribed to all interruptions, you'd simply see one next
which is different from the one you've tried to inject.

> That would also decrease performance, likely noticeably, for a
> vm_event-based application, as we'd get many such events (most of which
> we're not interested in at all) which would require treatment while the
> respective VCPU is paused.

You should limit the periods of time when you enable the
subscription (e.g. only from when you inject an event until
you did see it arrive). Perhaps such a subscription should then
be per-vCPU ...

Jan
Razvan Cojocaru Nov. 7, 2016, 2:05 p.m. UTC | #37
On 11/07/2016 03:53 PM, Jan Beulich wrote:
>>>> On 07.11.16 at 14:44, <rcojocaru@bitdefender.com> wrote:
>> On 11/07/2016 03:20 PM, Jan Beulich wrote:
>>>>>> On 07.11.16 at 13:27, <rcojocaru@bitdefender.com> wrote:
>>>> On 11/07/2016 02:23 PM, Jan Beulich wrote:
>>>>>> At this point it looks like the best solution is to simply discard the
>>>>>>> non-architectural event if there's a pending architectural one, and add
>>>>>>> a new vm_event, as suggested by Tamas, that notifies the application
>>>>>>> that a trap has failed, or succeeded, and let it do the best it can with
>>>>>>> that information.
>>>>> Well, if the two of you think this is something which fits into the VM
>>>>> event model, then I guess that's an option. I, for one, am not
>>>>> convinced: It simply seems too special purpose. If this was a more
>>>>> generic event ("interruption delivered", perhaps with a type or
>>>>> vector qualifier) that can be subscribed to, and the app did that
>>>>> only for such transient periods of time, this would look more
>>>>> reasonable to me.
>>>>
>>>> Indeed, that's the way I think of it: the user is free to subscribe to
>>>> the event or not, and the event is:
>>>>
>>>> struct vm_event_injection_result {
>>>>     uint32_t vector;
>>>>     uint32_t type;
>>>>     uint32_t error_code;
>>>>     uint64_t cr2;
>>>>     uint32_t injected;
>>>> };
>>>>
>>>> with injected 0 for failure and 1 for success. It's as generic as possible.
>>>
>>> Wait, no, that's not what I did describe. I'm not talking about the
>>> "result" of an injection (through hypercall), but about delivering of
>>> events (of any origin). Hence there can't be any "failure" here.
>>> IOW what I'm proposing is a "VM is about to see this interruption"
>>> event.
>>
>> But a a success-only event is hard to interpret with regards to failure,
>> which is what we're really interested in (specifically, failure to
>> inject via hypercall). Do we count it as a failure if we don't receive a
>> "VM is about to see this interruption" event immediately after the
>> vm_event that caused the injection, on the same processor, with the same
>> trap vector? That's a tricky commitment to make.
> 
> If you subscribed to all interruptions, you'd simply see one next
> which is different from the one you've tried to inject.
> 
>> That would also decrease performance, likely noticeably, for a
>> vm_event-based application, as we'd get many such events (most of which
>> we're not interested in at all) which would require treatment while the
>> respective VCPU is paused.
> 
> You should limit the periods of time when you enable the
> subscription (e.g. only from when you inject an event until
> you did see it arrive). Perhaps such a subscription should then
> be per-vCPU ...

Obviously the most generic solution, of benefit most users, should be
sought. That said, the only consumers for this are likely to be Tamas
and us for the foreseeable future, and the ammount of logic and permance
impact needed to handle the failure case (which is the motivator of the
changes) is definitely non-trivial.

Maybe Tamas has a different opinion, or other current (or future) users
can share other viewpoints.

However this turns out, we really appreciate your input and the time
you've taken to consider this.


Thanks,
Razvan
diff mbox

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 2d78e45..cfe04b4 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -532,8 +532,16 @@  void hvm_do_resume(struct vcpu *v)
     /* Inject pending hw/sw trap */
     if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
     {
-        hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
-        v->arch.hvm_vcpu.inject_trap.vector = -1;
+        unsigned long ev;
+
+        __vmread(VM_ENTRY_INTR_INFO, &ev);
+
+        /* Check for already pending interrupts (races). */
+        if ( !(ev & INTR_INFO_VALID_MASK) )
+        {
+            hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
+            v->arch.hvm_vcpu.inject_trap.vector = -1;
+        }
     }
 }