
[v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS

Message ID ed71d0967113a35f670a9625a058b8e6e0b2f104.1583547991.git.luto@kernel.org (mailing list archive)
State New, archived
Series [v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS

Commit Message

Andy Lutomirski March 7, 2020, 2:26 a.m. UTC
The ABI is broken and we cannot support it properly.  Turn it off.

If this causes a meaningful performance regression for someone, KVM
can introduce an improved ABI that is supportable.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/kvm.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

Comments

Andy Lutomirski March 7, 2020, 3:03 p.m. UTC | #1
On Fri, Mar 6, 2020 at 6:26 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> The ABI is broken and we cannot support it properly.  Turn it off.
>
> If this causes a meaningful performance regression for someone, KVM
> can introduce an improved ABI that is supportable.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/kernel/kvm.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 93ab0cbd304e..e6f2aefa298b 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -318,11 +318,26 @@ static void kvm_guest_cpu_init(void)
>
>                 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
>
> -#ifdef CONFIG_PREEMPTION
> -               pa |= KVM_ASYNC_PF_SEND_ALWAYS;
> -#endif
>                 pa |= KVM_ASYNC_PF_ENABLED;
>
> +               /*
> +                * We do not set KVM_ASYNC_PF_SEND_ALWAYS.  With the current
> +                * KVM paravirt ABI, the following scenario is possible:
> +                *
> +                * #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT)
> +                *  NMI before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT
> +                *   NMI accesses user memory, e.g. due to perf
> +                *    #PF: normal page fault
> +                *     #PF reads CR2 and apf_reason -- apf_reason should be 0
> +                *
> +                *  outer #PF reads CR2 and apf_reason -- apf_reason should be
> +                *  KVM_PV_REASON_PAGE_NOT_PRESENT
> +                *
> +                * There is no possible way that both reads of CR2 and
> +                * apf_reason get the correct values.  Fixing this would
> +                * require paravirt ABI changes.
> +                */
> +

Upon re-reading my own comment, I think the problem is real, but I
don't think my patch fixes it.  The outer #PF could just as easily
have come from user mode.  We may actually need the NMI code (and
perhaps MCE and maybe #DB too) to save, clear, and restore apf_reason.
If we do this, then maybe CPL0 async PFs are actually okay, but the
semantics are so poorly defined that I'm not very confident about
that.
Thomas Gleixner March 7, 2020, 3:47 p.m. UTC | #2
Andy Lutomirski <luto@kernel.org> writes:
> On Fri, Mar 6, 2020 at 6:26 PM Andy Lutomirski <luto@kernel.org> wrote:
>> +               /*
>> +                * We do not set KVM_ASYNC_PF_SEND_ALWAYS.  With the current
>> +                * KVM paravirt ABI, the following scenario is possible:
>> +                *
>> +                * #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT)
>> +                *  NMI before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT
>> +                *   NMI accesses user memory, e.g. due to perf
>> +                *    #PF: normal page fault
>> +                *     #PF reads CR2 and apf_reason -- apf_reason should be 0
>> +                *
>> +                *  outer #PF reads CR2 and apf_reason -- apf_reason should be
>> +                *  KVM_PV_REASON_PAGE_NOT_PRESENT
>> +                *
>> +                * There is no possible way that both reads of CR2 and
>> +                * apf_reason get the correct values.  Fixing this would
>> +                * require paravirt ABI changes.
>> +                */
>> +
>
> Upon re-reading my own comment, I think the problem is real, but I
> don't think my patch fixes it.  The outer #PF could just as easily
> have come from user mode.  We may actually need the NMI code (and
> perhaps MCE and maybe #DB too) to save, clear, and restore apf_reason.
> If we do this, then maybe CPL0 async PFs are actually okay, but the
> semantics are so poorly defined that I'm not very confident about
> that.

I think even with the current mode this is fixable on the host side when
it keeps track of the state.

The host knows exactly when it injects an async PF and it can store the CR2
and reason of that async PF in flight.

On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0
then it knows that the guest has read CR2 and apf_reason. All good,
nothing to worry about.

If not it needs to be careful.

As long as the apf_reason of the last async #PF is not cleared by the
guest, no new async #PF can be injected. That's already correct because
in that case IF==0, which prevents a nested async #PF.

If an MCE or NMI triggers a real page fault, then the #PF injection needs to
clear apf_reason and set the correct CR2. When that #PF returns, the
old CR2 and apf_reason need to be restored.
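
A rough host-side sketch of that bookkeeping (hypothetical struct and
helper names, purely to illustrate the idea, not existing KVM code):

    /* Hypothetical per-vCPU record of the async PF in flight. */
    struct apf_inflight {
            u64 cr2;
            u32 reason;     /* 0 means nothing in flight */
    };

    /* On VMEXIT: a cleared apf_reason means the guest consumed the event. */
    static void apf_check_consumed(struct kvm_vcpu *vcpu,
                                   struct apf_inflight *in)
    {
            if (guest_read_apf_reason(vcpu) == 0)
                    in->reason = 0;
    }

    /* Injecting a real #PF while an async PF is still in flight. */
    static void apf_inject_real_pf(struct kvm_vcpu *vcpu,
                                   struct apf_inflight *in, u64 fault_cr2)
    {
            if (in->reason)
                    guest_write_apf_reason(vcpu, 0);
            guest_write_cr2(vcpu, fault_cr2);
            /*
             * in->cr2 and in->reason still have to be restored once this
             * #PF has been handled, which is exactly the tricky part.
             */
    }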

I tried to figure out whether any of this logic exists in the KVM code,
but I got completely lost in that code. Maybe I'll try again later today.

Thanks,

	tglx
Andy Lutomirski March 7, 2020, 3:59 p.m. UTC | #3
On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Fri, Mar 6, 2020 at 6:26 PM Andy Lutomirski <luto@kernel.org> wrote:
> >> +               /*
> >> +                * We do not set KVM_ASYNC_PF_SEND_ALWAYS.  With the current
> >> +                * KVM paravirt ABI, the following scenario is possible:
> >> +                *
> >> +                * #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT)
> >> +                *  NMI before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT
> >> +                *   NMI accesses user memory, e.g. due to perf
> >> +                *    #PF: normal page fault
> >> +                *     #PF reads CR2 and apf_reason -- apf_reason should be 0
> >> +                *
> >> +                *  outer #PF reads CR2 and apf_reason -- apf_reason should be
> >> +                *  KVM_PV_REASON_PAGE_NOT_PRESENT
> >> +                *
> >> +                * There is no possible way that both reads of CR2 and
> >> +                * apf_reason get the correct values.  Fixing this would
> >> +                * require paravirt ABI changes.
> >> +                */
> >> +
> >
> > Upon re-reading my own comment, I think the problem is real, but I
> > don't think my patch fixes it.  The outer #PF could just as easily
> > have come from user mode.  We may actually need the NMI code (and
> > perhaps MCE and maybe #DB too) to save, clear, and restore apf_reason.
> > If we do this, then maybe CPL0 async PFs are actually okay, but the
> > semantics are so poorly defined that I'm not very confident about
> > that.
>
> I think even with the current mode this is fixable on the host side when
> it keeps track of the state.
>
> The host knows exactly when it injects a async PF and it can store CR2
> and reason of that async PF in flight.
>
> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0
> then it knows that the guest has read CR2 and apf_reason. All good
> nothing to worry about.
>
> If not it needs to be careful.
>
> As long as the apf_reason of the last async #PF is not cleared by the
> guest no new async #PF can be injected. That's already correct because
> in that case IF==0 which prevents a nested async #PF.
>
> If MCE, NMI trigger a real pagefault then the #PF injection needs to
> clear apf_reason and set the correct CR2. When that #PF returns then the
> old CR2 and apf_reason need to be restored.

How is the host supposed to know when the #PF returns?  Intercepting
IRET sounds like a bad idea and, in any case, is not actually a
reliable indication that #PF returned.
Thomas Gleixner March 7, 2020, 7:01 p.m. UTC | #4
Andy Lutomirski <luto@kernel.org> writes:
> On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The host knows exactly when it injects a async PF and it can store CR2
>> and reason of that async PF in flight.
>>
>> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0
>> then it knows that the guest has read CR2 and apf_reason. All good
>> nothing to worry about.
>>
>> If not it needs to be careful.
>>
>> As long as the apf_reason of the last async #PF is not cleared by the
>> guest no new async #PF can be injected. That's already correct because
>> in that case IF==0 which prevents a nested async #PF.
>>
>> If MCE, NMI trigger a real pagefault then the #PF injection needs to
>> clear apf_reason and set the correct CR2. When that #PF returns then the
>> old CR2 and apf_reason need to be restored.
>
> How is the host supposed to know when the #PF returns?  Intercepting
> IRET sounds like a bad idea and, in any case, is not actually a
> reliable indication that #PF returned.

The host does not care about the IRET. It solely has to check whether
apf_reason is 0 or not. That way it knows that the guest has read CR2
and apf_reason.

Thanks,

        tglx
Andy Lutomirski March 7, 2020, 7:34 p.m. UTC | #5
On Sat, Mar 7, 2020 at 11:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> The host knows exactly when it injects a async PF and it can store CR2
> >> and reason of that async PF in flight.
> >>
> >> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0
> >> then it knows that the guest has read CR2 and apf_reason. All good
> >> nothing to worry about.
> >>
> >> If not it needs to be careful.
> >>
> >> As long as the apf_reason of the last async #PF is not cleared by the
> >> guest no new async #PF can be injected. That's already correct because
> >> in that case IF==0 which prevents a nested async #PF.
> >>
> >> If MCE, NMI trigger a real pagefault then the #PF injection needs to
> >> clear apf_reason and set the correct CR2. When that #PF returns then the
> >> old CR2 and apf_reason need to be restored.
> >
> > How is the host supposed to know when the #PF returns?  Intercepting
> > IRET sounds like a bad idea and, in any case, is not actually a
> > reliable indication that #PF returned.
>
> The host does not care about the IRET. It solely has to check whether
> apf_reason is 0 or not. That way it knows that the guest has read CR2
> and apf_reason.

/me needs actual details

Suppose the host delivers an async #PF.  apf_reason != 0 and CR2
contains something meaningful.  Host resumes the guest.

The guest does whatever (gets an NMI and does perf stuff, for example).
The guest gets a normal #PF.  Somehow the host needs to do:

if (apf_reason != 0) {
  prev_apf_reason = apf_reason;
  prev_cr2 = cr2;
  apf_reason = 0;
  cr2 = actual fault address;
}

resume guest;

Obviously this can only happen if the host intercepts #PF.  Let's
pretend for now that this is even possible on SEV-ES (it may well be,
but I would also believe that it's not.  SEV-ES intercepts are weird
and I don't have the whole manual in my head.  I'm not sure the host
has any way to read CR2 for a SEV-ES guest.)  So now the guest runs
some more and finishes handling the inner #PF.  Some time between
doing that and running the outer #PF code that reads apf_reason, the
host needs to do:

apf_reason = prev_apf_reason;
cr2 = prev_cr2;
prev_apf_reason = 0;

How is the host supposed to know when to do that?

--Andy
Thomas Gleixner March 8, 2020, 7:23 a.m. UTC | #6
Thomas Gleixner <tglx@linutronix.de> writes:
> Andy Lutomirski <luto@kernel.org> writes:
>> On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> The host knows exactly when it injects a async PF and it can store CR2
>>> and reason of that async PF in flight.
>>>
>>> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0
>>> then it knows that the guest has read CR2 and apf_reason. All good
>>> nothing to worry about.
>>>
>>> If not it needs to be careful.
>>>
>>> As long as the apf_reason of the last async #PF is not cleared by the
>>> guest no new async #PF can be injected. That's already correct because
>>> in that case IF==0 which prevents a nested async #PF.
>>>
>>> If MCE, NMI trigger a real pagefault then the #PF injection needs to
>>> clear apf_reason and set the correct CR2. When that #PF returns then the
>>> old CR2 and apf_reason need to be restored.
>>
>> How is the host supposed to know when the #PF returns?  Intercepting
>> IRET sounds like a bad idea and, in any case, is not actually a
>> reliable indication that #PF returned.
>
> The host does not care about the IRET. It solely has to check whether
> apf_reason is 0 or not. That way it knows that the guest has read CR2
> and apf_reason.

Bah. I'm a moron. Of course it needs to trap the IRET of the #PF in
order to restore CR2 and apf_reason. Alternatively it could trap the CR2
read of #PF, but yes that's all nasty.

Thanks,

        tglx
Thomas Gleixner March 9, 2020, 6:57 a.m. UTC | #7
Thomas Gleixner <tglx@linutronix.de> writes:

> Thomas Gleixner <tglx@linutronix.de> writes:
>> Andy Lutomirski <luto@kernel.org> writes:
>>> On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> The host knows exactly when it injects a async PF and it can store CR2
>>>> and reason of that async PF in flight.
>>>>
>>>> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0
>>>> then it knows that the guest has read CR2 and apf_reason. All good
>>>> nothing to worry about.
>>>>
>>>> If not it needs to be careful.
>>>>
>>>> As long as the apf_reason of the last async #PF is not cleared by the
>>>> guest no new async #PF can be injected. That's already correct because
>>>> in that case IF==0 which prevents a nested async #PF.
>>>>
>>>> If MCE, NMI trigger a real pagefault then the #PF injection needs to
>>>> clear apf_reason and set the correct CR2. When that #PF returns then the
>>>> old CR2 and apf_reason need to be restored.
>>>
>>> How is the host supposed to know when the #PF returns?  Intercepting
>>> IRET sounds like a bad idea and, in any case, is not actually a
>>> reliable indication that #PF returned.
>>
>> The host does not care about the IRET. It solely has to check whether
>> apf_reason is 0 or not. That way it knows that the guest has read CR2
>> and apf_reason.
>
> Bah. I'm a moron. Of course it needs to trap the IRET of the #PF in
> order to restore CR2 and apf_reason. Alternatively it could trap the CR2
> read of #PF, but yes that's all nasty.

Some hours of sleep and not staring at this mess later, and while
reading the leaves of my morning tea:

guest side:

   nmi()/mce() ...
   
        stash_crs();

+       stash_and_clear_apf_reason();

        ....

+       restore_apf_reason();

	restore_cr2();

Too obvious, isn't it?
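
In C, a minimal sketch of those two helpers (reusing the existing per-CPU
apf_reason structure from arch/x86/kernel/kvm.c; the per-CPU stash variable
and the exact hook points are made up):

    static DEFINE_PER_CPU(u32, nmi_apf_reason);

    /* Early in nmi()/mce(), right after stashing CR2. */
    static void stash_and_clear_apf_reason(void)
    {
            this_cpu_write(nmi_apf_reason, __this_cpu_read(apf_reason.reason));
            __this_cpu_write(apf_reason.reason, 0);
    }

    /* On the way out, right before restoring CR2. */
    static void restore_apf_reason(void)
    {
            __this_cpu_write(apf_reason.reason, this_cpu_read(nmi_apf_reason));
    }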

Thanks,

        tglx
Paolo Bonzini March 9, 2020, 8:40 a.m. UTC | #8
On 09/03/20 07:57, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>>> Andy Lutomirski <luto@kernel.org> writes:
>>>> On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>> If MCE, NMI trigger a real pagefault then the #PF injection needs to
>>>>> clear apf_reason and set the correct CR2. When that #PF returns then the
>>>>> old CR2 and apf_reason need to be restored.
>>>>
>>> The host does not care about the IRET. It solely has to check whether
>>> apf_reason is 0 or not. That way it knows that the guest has read CR2
>>> and apf_reason.
> 
> Some hours or sleep and not staring at this meess later and while
> reading the leaves of my morning tea:
> 
> guest side:
> 
>    nmi()/mce() ...
>    
>         stash_crs();
> 
> +       stash_and_clear_apf_reason();
> 
>         ....
> 
> +       restore_apf_reason();
> 
> 	restore_cr2();
> 
> Too obvious, isn't it?

Yes, this works but Andy was not happy about adding more
save-and-restore to NMIs.  If you do not want to do that, I'm okay with
disabling async page fault support for now.

Storing the page fault reason in memory was not a good idea.  Better
options would be to co-opt the page fault error code (e.g. store the
reason in bits 31:16, mark bits 15:0 with the invalid error code
RSVD=1/P=0), or to use the virtualization exception area.
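
As an illustration of the error-code idea (a purely hypothetical encoding,
not an existing ABI):

    /*
     * Hypothetical: fold the async PF reason into the #PF error code.
     * Bits 15:0 carry an architecturally invalid combination (RSVD=1,
     * P=0) so the guest can tell this apart from any real #PF.
     */
    #define APF_ERR_MARKER          X86_PF_RSVD     /* RSVD=1, P=0 */
    #define APF_ERR_REASON(ec)      ((u32)((ec) >> 16))

    static bool is_apf_error_code(unsigned long error_code)
    {
            return (error_code & 0xffff) == APF_ERR_MARKER;
    }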

Paolo
Thomas Gleixner March 9, 2020, 9:09 a.m. UTC | #9
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 09/03/20 07:57, Thomas Gleixner wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>>
>> guest side:
>> 
>>    nmi()/mce() ...
>>    
>>         stash_crs();
>> 
>> +       stash_and_clear_apf_reason();
>> 
>>         ....
>> 
>> +       restore_apf_reason();
>> 
>> 	restore_cr2();
>> 
>> Too obvious, isn't it?
>
> Yes, this works but Andy was not happy about adding more
> save-and-restore to NMIs.  If you do not want to do that, I'm okay with
> disabling async page fault support for now.

I'm fine with doing that save/restore dance, but I have no strong
opinion either way.

> Storing the page fault reason in memory was not a good idea.  Better
> options would be to co-opt the page fault error code (e.g. store the
> reason in bits 31:16, mark bits 15:0 with the invalid error code
> RSVD=1/P=0), or to use the virtualization exception area.

Memory store is not the problem. The real problem is hijacking #PF.

If you had just used a separate VECTOR_ASYNC_PF, then none of these
problems would exist at all.

Thanks,

        tglx
Andy Lutomirski March 9, 2020, 6:14 p.m. UTC | #10
On Mon, Mar 9, 2020 at 2:09 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Paolo Bonzini <pbonzini@redhat.com> writes:
> > On 09/03/20 07:57, Thomas Gleixner wrote:
> >> Thomas Gleixner <tglx@linutronix.de> writes:
> >>
> >> guest side:
> >>
> >>    nmi()/mce() ...
> >>
> >>         stash_crs();
> >>
> >> +       stash_and_clear_apf_reason();
> >>
> >>         ....
> >>
> >> +       restore_apf_reason();
> >>
> >>      restore_cr2();
> >>
> >> Too obvious, isn't it?
> >
> > Yes, this works but Andy was not happy about adding more
> > save-and-restore to NMIs.  If you do not want to do that, I'm okay with
> > disabling async page fault support for now.
>
> I'm fine with doing that save/restore dance, but I have no strong
> opinion either.
>
> > Storing the page fault reason in memory was not a good idea.  Better
> > options would be to co-opt the page fault error code (e.g. store the
> > reason in bits 31:16, mark bits 15:0 with the invalid error code
> > RSVD=1/P=0), or to use the virtualization exception area.
>
> Memory store is not the problem. The real problem is hijacking #PF.
>
> If you'd have just used a separate VECTOR_ASYNC_PF then none of these
> problems would exist at all.
>

I'm okay with the save/restore dance, I guess.  It's just yet more
entry crud to deal with architecture nastiness, except that this
nastiness is 100% software and isn't Intel/AMD's fault.

At least until we get an async page fault due to apf_reason being paged out...
Thomas Gleixner March 9, 2020, 7:05 p.m. UTC | #11
Andy Lutomirski <luto@kernel.org> writes:
> On Mon, Mar 9, 2020 at 2:09 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>> > Yes, this works but Andy was not happy about adding more
>> > save-and-restore to NMIs.  If you do not want to do that, I'm okay with
>> > disabling async page fault support for now.
>>
>> I'm fine with doing that save/restore dance, but I have no strong
>> opinion either.
>>
>> > Storing the page fault reason in memory was not a good idea.  Better
>> > options would be to co-opt the page fault error code (e.g. store the
>> > reason in bits 31:16, mark bits 15:0 with the invalid error code
>> > RSVD=1/P=0), or to use the virtualization exception area.
>>
>> Memory store is not the problem. The real problem is hijacking #PF.
>>
>> If you'd have just used a separate VECTOR_ASYNC_PF then none of these
>> problems would exist at all.
>>
>
> I'm okay with the save/restore dance, I guess.  It's just yet more
> entry crud to deal with architecture nastiness, except that this
> nastiness is 100% software and isn't Intel/AMD's fault.

And we can do it in C and don't have to fiddle with it in the ASM
maze.

Thanks,

        tglx
Peter Zijlstra March 9, 2020, 8:22 p.m. UTC | #12
On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote:
> Andy Lutomirski <luto@kernel.org> writes:

> > I'm okay with the save/restore dance, I guess.  It's just yet more
> > entry crud to deal with architecture nastiness, except that this
> > nastiness is 100% software and isn't Intel/AMD's fault.
> 
> And we can do it in C and don't have to fiddle with it in the ASM
> maze.

Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if
we do the save/restore in do_nmi(). That is some wild brain melt. Also,
AFAIK none of the distros are actually shipping a PREEMPT=y kernel
anyway, so killing it shouldn't matter much.

If people want to recover that, I'd suggest they sit down and create a
sane paravirt interface for this.
Vivek Goyal April 6, 2020, 7:09 p.m. UTC | #13
On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote:
> > Andy Lutomirski <luto@kernel.org> writes:
> 
> > > I'm okay with the save/restore dance, I guess.  It's just yet more
> > > entry crud to deal with architecture nastiness, except that this
> > > nastiness is 100% software and isn't Intel/AMD's fault.
> > 
> > And we can do it in C and don't have to fiddle with it in the ASM
> > maze.
> 
> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if
> we do the save/restore in do_nmi(). That is some wild brain melt. Also,
> AFAIK none of the distros are actually shipping a PREEMPT=y kernel
> anyway, so killing it shouldn't matter much.

It would be nice if we could retain KVM_ASYNC_PF_SEND_ALWAYS. I have another
use case outside CONFIG_PREEMPT.

I am trying to extend the async pf interface to also report page fault errors
to the guest.

https://lore.kernel.org/kvm/20200331194011.24834-1-vgoyal@redhat.com/

Right now the async page fault interface assumes that the host will always be
able to successfully resolve the page fault and that sooner or later a
PAGE_READY event will be sent to the guest. There is no mechanism to report
errors back to the guest.

I am trying to enhance virtiofs to directly map the host page cache in the guest.

https://lore.kernel.org/linux-fsdevel/20200304165845.3081-1-vgoyal@redhat.com/

There it is possible that a file page on the host is mapped in the guest, the
file got truncated, and the page is not there anymore. The guest tries to
access it, and that generates an async page fault. On the host we will get
-EFAULT and I need to propagate it back to the guest so that the guest can
either send SIGBUS to the process which caused this, or, if the kernel was
trying to do memcpy(), use the exception table error handling and return
with an error (memcpy_mcflush()).

For the second case to work, I will need async pf events to come in
even if the guest is in the kernel and CONFIG_PREEMPT=n.
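
Just to illustrate the guest-side pattern the second case relies on (all
names in this sketch are made up for the example):

    /*
     * Illustrative only: dax_kaddr points into the DAX window that maps
     * the host page cache.  copy_to_user() is covered by exception table
     * fixups, so a real #PF injected on the truncated page makes the copy
     * bail out with an error instead of crashing the guest kernel.
     */
    static ssize_t read_from_dax_window(void __user *ubuf,
                                        const void *dax_kaddr, size_t len)
    {
            if (copy_to_user(ubuf, dax_kaddr, len))
                    return -EFAULT;
            return len;
    }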

So it would be nice if we could keep KVM_ASYNC_PF_SEND_ALWAYS around.

Thanks
Vivek
Peter Zijlstra April 6, 2020, 8:25 p.m. UTC | #14
On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote:
> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote:
> > On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote:
> > > Andy Lutomirski <luto@kernel.org> writes:
> > 
> > > > I'm okay with the save/restore dance, I guess.  It's just yet more
> > > > entry crud to deal with architecture nastiness, except that this
> > > > nastiness is 100% software and isn't Intel/AMD's fault.
> > > 
> > > And we can do it in C and don't have to fiddle with it in the ASM
> > > maze.
> > 
> > Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if
> > we do the save/restore in do_nmi(). That is some wild brain melt. Also,
> > AFAIK none of the distros are actually shipping a PREEMPT=y kernel
> > anyway, so killing it shouldn't matter much.
> 
> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another
> use case outside CONFIG_PREEMPT.
> 
> I am trying to extend async pf interface to also report page fault errors
> to the guest.

Then please start over and design a sane ParaVirt Fault interface. The
current one is utter crap.
Andy Lutomirski April 6, 2020, 8:32 p.m. UTC | #15
> On Apr 6, 2020, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote:
>>> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote:
>>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote:
>>>> Andy Lutomirski <luto@kernel.org> writes:
>>> 
>>>>> I'm okay with the save/restore dance, I guess.  It's just yet more
>>>>> entry crud to deal with architecture nastiness, except that this
>>>>> nastiness is 100% software and isn't Intel/AMD's fault.
>>>> 
>>>> And we can do it in C and don't have to fiddle with it in the ASM
>>>> maze.
>>> 
>>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if
>>> we do the save/restore in do_nmi(). That is some wild brain melt. Also,
>>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel
>>> anyway, so killing it shouldn't matter much.
>> 
>> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another
>> use case outside CONFIG_PREEMPT.
>> 
>> I am trying to extend async pf interface to also report page fault errors
>> to the guest.
> 
> Then please start over and design a sane ParaVirt Fault interface. The
> current one is utter crap.

Agreed. Don’t extend the current mechanism. Replace it.

I would be happy to review a replacement. I’m not really excited to review an extension of the current mess.  The current thing is barely, if at all, correct.
Andy Lutomirski April 6, 2020, 8:42 p.m. UTC | #16
> On Apr 6, 2020, at 1:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
>> On Apr 6, 2020, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote:
>>>> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote:
>>>>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote:
>>>>>> Andy Lutomirski <luto@kernel.org> writes:
>>>>> 
>>>>>>> I'm okay with the save/restore dance, I guess.  It's just yet more
>>>>>>> entry crud to deal with architecture nastiness, except that this
>>>>>>> nastiness is 100% software and isn't Intel/AMD's fault.
>>>>>> 
>>>>>> And we can do it in C and don't have to fiddle with it in the ASM
>>>>>> maze.
>>>>> 
>>>>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if
>>>>> we do the save/restore in do_nmi(). That is some wild brain melt. Also,
>>>>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel
>>>>> anyway, so killing it shouldn't matter much.
>>> 
>>> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another
>>> use case outside CONFIG_PREEMPT.
>>> 
>>> I am trying to extend async pf interface to also report page fault errors
>>> to the guest.
>> 
>> Then please start over and design a sane ParaVirt Fault interface. The
>> current one is utter crap.
> 
> Agreed. Don’t extend the current mechanism. Replace it.
> 
> I would be happy to review a replacement. I’m not really excited to review an extension of the current mess.  The current thing is barely, if at all, correct.

I read your patch. It cannot possibly be correct.  You need to decide what happens if you get a memory failure when guest interrupts are off. If this happens, you can’t send #PF, but you also can’t just swallow the error. The existing APF code is so messy that it’s not at all obvious what your code ends up doing, but I’m pretty sure it doesn’t do anything sensible, especially since the ABI doesn’t have a sensible option.

I think you should inject MCE and coordinate with Tony Luck to make it sane. And, in the special case that the new improved async PF mechanism is enabled *and* interrupts are on, you can skip the MCE and instead inject a new improved APF.

But, as it stands, I will NAK any guest code that tries to make #PF handle memory failure. Sorry, it’s just too messy to actually analyze all the cases.
Thomas Gleixner April 6, 2020, 9:32 p.m. UTC | #17
Vivek Goyal <vgoyal@redhat.com> writes:
> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote:
>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote:
>> > Andy Lutomirski <luto@kernel.org> writes:
>> 
>> > > I'm okay with the save/restore dance, I guess.  It's just yet more
>> > > entry crud to deal with architecture nastiness, except that this
>> > > nastiness is 100% software and isn't Intel/AMD's fault.
>> > 
>> > And we can do it in C and don't have to fiddle with it in the ASM
>> > maze.
>> 
>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if
>> we do the save/restore in do_nmi(). That is some wild brain melt. Also,
>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel
>> anyway, so killing it shouldn't matter much.
>
> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another
> use case outside CONFIG_PREEMPT.
>
> I am trying to extend async pf interface to also report page fault errors
> to the guest.
>
> https://lore.kernel.org/kvm/20200331194011.24834-1-vgoyal@redhat.com/
>
> Right now async page fault interface assumes that host will always be
> able to successfully resolve the page fault and sooner or later PAGE_READY
> event will be sent to guest. And there is no mechnaism to report the
> errors back to guest.
>
> I am trying to add enhance virtiofs to directly map host page cache in guest.
>
> https://lore.kernel.org/linux-fsdevel/20200304165845.3081-1-vgoyal@redhat.com/
>
> There it is possible that a file page on host is mapped in guest and file
> got truncated and page is not there anymore. Guest tries to access it,
> and it generates async page fault. On host we will get -EFAULT and I 
> need to propagate it back to guest so that guest can either send SIGBUS
> to process which caused this. Or if kernel was trying to do memcpy(),
> then be able to use execpetion table error handling and be able to
> return with error.  (memcpy_mcflush()).
>
> For the second case to work, I will need async pf events to come in
> even if guest is in kernel and CONFIG_PREEMPT=n.

What?

> So it would be nice if we can keep KVM_ASYNC_PF_SEND_ALWAYS around.

No. If you want this stuff to be actually useful and correct, then
please redesign it from scratch w/o abusing #PF. It wants to be a
separate vector, and then the page fault resulting from your example above
becomes a real #PF without any bells and whistles.

Thanks,

        tglx
Vivek Goyal April 7, 2020, 5:21 p.m. UTC | #18
On Mon, Apr 06, 2020 at 01:42:28PM -0700, Andy Lutomirski wrote:
> 
> > On Apr 6, 2020, at 1:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > 
> > 
> >> On Apr 6, 2020, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> 
> >> On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote:
> >>>> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote:
> >>>>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote:
> >>>>>> Andy Lutomirski <luto@kernel.org> writes:
> >>>>> 
> >>>>>>> I'm okay with the save/restore dance, I guess.  It's just yet more
> >>>>>>> entry crud to deal with architecture nastiness, except that this
> >>>>>>> nastiness is 100% software and isn't Intel/AMD's fault.
> >>>>>> 
> >>>>>> And we can do it in C and don't have to fiddle with it in the ASM
> >>>>>> maze.
> >>>>> 
> >>>>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if
> >>>>> we do the save/restore in do_nmi(). That is some wild brain melt. Also,
> >>>>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel
> >>>>> anyway, so killing it shouldn't matter much.
> >>> 
> >>> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another
> >>> use case outside CONFIG_PREEMPT.
> >>> 
> >>> I am trying to extend async pf interface to also report page fault errors
> >>> to the guest.
> >> 
> >> Then please start over and design a sane ParaVirt Fault interface. The
> >> current one is utter crap.
> > 
> > Agreed. Don’t extend the current mechanism. Replace it.
> > 
> > I would be happy to review a replacement. I’m not really excited to review an extension of the current mess.  The current thing is barely, if at all, correct.
> 
> I read your patch. It cannot possibly be correct.  You need to decide what happens if you get a memory failure when guest interrupts are off. If this happens, you can’t send #PF, but you also can’t just swallow the error. The existing APF code is so messy that it’s not at all obvious what your code ends up doing, but I’m pretty sure it doesn’t do anything sensible, especially since the ABI doesn’t have a sensible option.

Hi Andy,

I am not familiar with this KVM code and am trying to understand it. I think
the error exception gets queued and delivered at some point, even if
interrupts are disabled at the time of the exception, most likely at the
next VM entry.

The check for whether interrupts are enabled only happens before we decide
whether the async pf protocol should be followed. Once we decide to
send PAGE_NOT_PRESENT, the later PAGE_READY notification does not check
whether interrupts are enabled. And that kind of makes sense; otherwise the
guest process would wait infinitely to receive PAGE_READY.

I modified the code a bit to disable interrupts and wait 10 seconds (after
getting the PAGE_NOT_PRESENT message). And I noticed that the error async pf
got delivered 10 seconds later, after interrupts were enabled again. So the
error async pf was not lost even though interrupts were disabled.
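
A sketch of that kind of test (illustrative only, not the actual
modification):

    /*
     * Illustrative only: after the PAGE_NOT_PRESENT handler has run, keep
     * interrupts off for 10 seconds before letting PAGE_READY in.
     */
    local_irq_disable();
    mdelay(10 * 1000);      /* busy wait; fine for a throwaway test */
    local_irq_enable();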

Having said that, I thought disabling interrupts does not mask exceptions,
so the page fault exception should have been delivered even with interrupts
disabled. Is that correct? Maybe there was no VM exit/entry during
those 10 seconds and that's why.

Thanks
Vivek
Andy Lutomirski April 7, 2020, 5:38 p.m. UTC | #19
> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> On Mon, Apr 06, 2020 at 01:42:28PM -0700, Andy Lutomirski wrote:
>> 
>>>> On Apr 6, 2020, at 1:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>> 
>>>> On Apr 6, 2020, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> 
>>>> On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote:
>>>>>> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote:
>>>>>>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote:
>>>>>>>> Andy Lutomirski <luto@kernel.org> writes:
>>>>>>> 
>>>>>>>>> I'm okay with the save/restore dance, I guess.  It's just yet more
>>>>>>>>> entry crud to deal with architecture nastiness, except that this
>>>>>>>>> nastiness is 100% software and isn't Intel/AMD's fault.
>>>>>>>> 
>>>>>>>> And we can do it in C and don't have to fiddle with it in the ASM
>>>>>>>> maze.
>>>>>>> 
>>>>>>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if
>>>>>>> we do the save/restore in do_nmi(). That is some wild brain melt. Also,
>>>>>>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel
>>>>>>> anyway, so killing it shouldn't matter much.
>>>>> 
>>>>> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another
>>>>> use case outside CONFIG_PREEMPT.
>>>>> 
>>>>> I am trying to extend async pf interface to also report page fault errors
>>>>> to the guest.
>>>> 
>>>> Then please start over and design a sane ParaVirt Fault interface. The
>>>> current one is utter crap.
>>> 
>>> Agreed. Don’t extend the current mechanism. Replace it.
>>> 
>>> I would be happy to review a replacement. I’m not really excited to review an extension of the current mess.  The current thing is barely, if at all, correct.
>> 
>> I read your patch. It cannot possibly be correct.  You need to decide what happens if you get a memory failure when guest interrupts are off. If this happens, you can’t send #PF, but you also can’t just swallow the error. The existing APF code is so messy that it’s not at all obvious what your code ends up doing, but I’m pretty sure it doesn’t do anything sensible, especially since the ABI doesn’t have a sensible option.
> 
> Hi Andy,
> 
> I am not familiar with this KVM code and trying to understand it. I think
> error exception gets queued and gets delivered at some point of time, even
> if interrupts are disabled at the time of exception. Most likely at the time
> of next VM entry.

I’ve read the code three or four times and I barely understand it. I’m not convinced the author understood it.  It’s spaghetti.

> 
> Whether interrupts are enabled or not check only happens before we decide
> if async pf protocol should be followed or not. Once we decide to
> send PAGE_NOT_PRESENT, later notification PAGE_READY does not check
> if interrupts are enabled or not. And it kind of makes sense otherwise
> guest process will wait infinitely to receive PAGE_READY.
> 
> I modified the code a bit to disable interrupt and wait 10 seconds (after
> getting PAGE_NOT_PRESENT message). And I noticed that error async pf
> got delivered after 10 seconds after enabling interrupts. So error
> async pf was not lost because interrupts were disabled.
> 
> Havind said that, I thought disabling interrupts does not mask exceptions.
> So page fault exception should have been delivered even with interrupts
> disabled. Is that correct? May be there was no vm exit/entry during
> those 10 seconds and that's why.

My point is that the entire async pf is nonsense. There are two types of events right now:

“Page not ready”:  normally this isn’t even visible to the guest — the guest just waits. With async pf, the idea is to try to tell the guest that a particular instruction would block and the guest should do something else instead. Sending a normal exception is a poor design, though: the guest may not expect this instruction to cause an exception. I think KVM should try to deliver an *interrupt* and, if it can’t, then just block the guest.

“Page ready”: this is a regular asynchronous notification just like, say, a virtio completion. It should be an ordinary interrupt.  Some in memory data structure should indicate which pages are ready.

“Page is malfunctioning” is tricky because you *must* deliver the event. x86’s #MC is not exactly a masterpiece, but it does kind of work.
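
One way to picture that split (hypothetical names, just to make the
taxonomy concrete):

    /*
     * Hypothetical sketch of the three event classes and how each would
     * be delivered under the scheme described above.
     */
    enum pv_fault_event {
            PV_PAGE_NOT_READY,      /* deliver as an interrupt if possible,
                                       otherwise just block the vCPU */
            PV_PAGE_READY,          /* ordinary interrupt; ready pages are
                                       flagged in a shared memory structure */
            PV_PAGE_POISONED,       /* must always be delivered, #MC-style */
    };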

> 
> Thanks
> Vivek
>
Thomas Gleixner April 7, 2020, 8:20 p.m. UTC | #20
Andy Lutomirski <luto@amacapital.net> writes:
>> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> Whether interrupts are enabled or not check only happens before we decide
>> if async pf protocol should be followed or not. Once we decide to
>> send PAGE_NOT_PRESENT, later notification PAGE_READY does not check
>> if interrupts are enabled or not. And it kind of makes sense otherwise
>> guest process will wait infinitely to receive PAGE_READY.
>> 
>> I modified the code a bit to disable interrupt and wait 10 seconds (after
>> getting PAGE_NOT_PRESENT message). And I noticed that error async pf
>> got delivered after 10 seconds after enabling interrupts. So error
>> async pf was not lost because interrupts were disabled.

Async PF is not the same as a real #PF. It just hijacked the #PF vector
because someone thought this was a brilliant idea.

>> Havind said that, I thought disabling interrupts does not mask exceptions.
>> So page fault exception should have been delivered even with interrupts
>> disabled. Is that correct? May be there was no vm exit/entry during
>> those 10 seconds and that's why.

No. Async PF is not a real exception. It has interrupt semantics and it
can only be injected when the guest has interrupts enabled. It's bad
design.

> My point is that the entire async pf is nonsense. There are two types of events right now:
>
> “Page not ready”: normally this isn’t even visible to the guest — the
> guest just waits. With async pf, the idea is to try to tell the guest
> that a particular instruction would block and the guest should do
> something else instead. Sending a normal exception is a poor design,
> though: the guest may not expect this instruction to cause an
> exception. I think KVM should try to deliver an *interrupt* and, if it
> can’t, then just block the guest.

That's pretty much what it does, just that it runs this through #PF and
has the checks for 'interrupts disabled, i.e. can't right now' around
that. If it can't, then KVM schedules the guest out until the situation
has been resolved.

> “Page ready”: this is a regular asynchronous notification just like,
> say, a virtio completion. It should be an ordinary interrupt.  Some in
> memory data structure should indicate which pages are ready.
>
> “Page is malfunctioning” is tricky because you *must* deliver the
> event. x86’s #MC is not exactly a masterpiece, but it does kind of
> work.

Nooooo. This does not need #MC at all. Don't even think about it.

The point is that the access to such a page is either happening in user
space or in kernel space with a proper exception table fixup.

That means a real #PF is perfectly fine. That can be injected any time
and does not have the interrupt semantics of async PF.

So now let's assume we disentangled async PF from #PF and made it a regular
interrupt; then the following situation still needs to be dealt with:

   guest -> access faults

host -> injects async fault

   guest -> handles and blocks the task

host figures out that the page does not exist anymore and now needs to
fixup the situation.

host -> injects async wakeup

   guest -> returns from async PF interrupt and retries the instruction
            which faults again.

host -> knows by now that this is a real fault and injects a proper #PF

   guest -> #PF runs and either sends signal to user space or runs
            the exception table fixup for a kernel fault.

Thanks,

        tglx
Andy Lutomirski April 7, 2020, 9:41 p.m. UTC | #21
> On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Andy Lutomirski <luto@amacapital.net> writes:
>>>> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>>> Whether interrupts are enabled or not check only happens before we decide
>>> if async pf protocol should be followed or not. Once we decide to
>>> send PAGE_NOT_PRESENT, later notification PAGE_READY does not check
>>> if interrupts are enabled or not. And it kind of makes sense otherwise
>>> guest process will wait infinitely to receive PAGE_READY.
>>> 
>>> I modified the code a bit to disable interrupt and wait 10 seconds (after
>>> getting PAGE_NOT_PRESENT message). And I noticed that error async pf
>>> got delivered after 10 seconds after enabling interrupts. So error
>>> async pf was not lost because interrupts were disabled.
> 
> Async PF is not the same as a real #PF. It just hijacked the #PF vector
> because someone thought this is a brilliant idea.
> 
>>> Havind said that, I thought disabling interrupts does not mask exceptions.
>>> So page fault exception should have been delivered even with interrupts
>>> disabled. Is that correct? May be there was no vm exit/entry during
>>> those 10 seconds and that's why.
> 
> No. Async PF is not a real exception. It has interrupt semantics and it
> can only be injected when the guest has interrupts enabled. It's bad
> design.
> 
>> My point is that the entire async pf is nonsense. There are two types of events right now:
>> 
>> “Page not ready”: normally this isn’t even visible to the guest — the
>> guest just waits. With async pf, the idea is to try to tell the guest
>> that a particular instruction would block and the guest should do
>> something else instead. Sending a normal exception is a poor design,
>> though: the guest may not expect this instruction to cause an
>> exception. I think KVM should try to deliver an *interrupt* and, if it
>> can’t, then just block the guest.
> 
> That's pretty much what it does, just that it runs this through #PF and
> has the checks for interrupts disabled - i.e can't right now' around
> that. If it can't then KVM schedules the guest out until the situation
> has been resolved.
> 
>> “Page ready”: this is a regular asynchronous notification just like,
>> say, a virtio completion. It should be an ordinary interrupt.  Some in
>> memory data structure should indicate which pages are ready.
>> 
>> “Page is malfunctioning” is tricky because you *must* deliver the
>> event. x86’s #MC is not exactly a masterpiece, but it does kind of
>> work.
> 
> Nooooo. This does not need #MC at all. Don't even think about it.

Yessssssssssss.  Please do think about it. :)

> 
> The point is that the access to such a page is either happening in user
> space or in kernel space with a proper exception table fixup.
> 
> That means a real #PF is perfectly fine. That can be injected any time
> and does not have the interrupt semantics of async PF.

The hypervisor has no way to distinguish between MOV-and-has-valid-stack-and-extable-entry and MOV-definitely-can’t-fault-here.  Or, for that matter, MOV-in-do_page_fault()-will-recurse-if-it-faults.

> 
> So now lets assume we distangled async PF from #PF and made it a regular
> interrupt, then the following situation still needs to be dealt with:
> 
>   guest -> access faults
> 
> host -> injects async fault
> 
>   guest -> handles and blocks the task
> 
> host figures out that the page does not exist anymore and now needs to
> fixup the situation.
> 
> host -> injects async wakeup
> 
>   guest -> returns from aysnc PF interrupt and retries the instruction
>            which faults again.
> 
> host -> knows by now that this is a real fault and injects a proper #PF
> 
>   guest -> #PF runs and either sends signal to user space or runs
>            the exception table fixup for a kernel fault.

Or guest blows up because the fault could not be recovered using #PF.

I can see two somewhat sane ways to make this work.

1. Access to bad memory results in an async-page-not-present, except that, if it’s not deliverable, the guest is killed. Either that async-page-not-present has a special flag saying “memory failure” or the eventual wakeup says “memory failure”.

2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case.

I think I like #2 much better. It has another nice effect: a good implementation will serve as a way to exercise the #MC code without needing to muck with EINJ or with whatever magic Tony uses. The average kernel developer does not have access to a box with testable memory failure reporting.

> 
> Thanks,
> 
>        tglx
> 
> 
> 
>
Paolo Bonzini April 7, 2020, 10:04 p.m. UTC | #22
On 07/04/20 22:20, Thomas Gleixner wrote:
>>> Havind said that, I thought disabling interrupts does not mask exceptions.
>>> So page fault exception should have been delivered even with interrupts
>>> disabled. Is that correct? May be there was no vm exit/entry during
>>> those 10 seconds and that's why.
> No. Async PF is not a real exception. It has interrupt semantics and it
> can only be injected when the guest has interrupts enabled. It's bad
> design.

Page-ready async PF has interrupt semantics.

Page-not-present async PF, however, does not have interrupt semantics; it
has to be injected immediately or not at all (falling back to a host page
fault in the latter case).  So page-not-present async PF definitely
needs to be an exception, and this is independent of whether it can be
injected when IF=0.

Hypervisors do not have any reserved exception vector, and must use
vectors up to 31, which is why I believe #PF was used in the first place
(though that predates my involvement in KVM by a few years).  These
days, #VE would be a much better exception to use instead (and it also
has a defined mechanism to avoid reentrancy).

Paolo
Paolo Bonzini April 7, 2020, 10:07 p.m. UTC | #23
On 07/04/20 23:41, Andy Lutomirski wrote:
> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but
> it’s an *architectural* turd. By all means, have a nice simple PV
> mechanism to tell the #MC code exactly what went wrong, but keep the
> overall flow the same as in the native case.
> 
> I think I like #2 much better. It has another nice effect: a good
> implementation will serve as a way to exercise the #MC code without
> needing to muck with EINJ or with whatever magic Tony uses. The
> average kernel developer does not have access to a box with testable
> memory failure reporting.

I prefer #VE, but I can see how #MC has some appeal.  However, #VE has a
mechanism to avoid reentrancy, unlike #MC.  How would that be better
than the current mess with an NMI happening in the first few
instructions of the #PF handler?

Paolo
Andy Lutomirski April 7, 2020, 10:29 p.m. UTC | #24
> On Apr 7, 2020, at 3:07 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 07/04/20 23:41, Andy Lutomirski wrote:
>> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but
>> it’s an *architectural* turd. By all means, have a nice simple PV
>> mechanism to tell the #MC code exactly what went wrong, but keep the
>> overall flow the same as in the native case.
>> 
>> I think I like #2 much better. It has another nice effect: a good
>> implementation will serve as a way to exercise the #MC code without
>> needing to muck with EINJ or with whatever magic Tony uses. The
>> average kernel developer does not have access to a box with testable
>> memory failure reporting.
> 
> I prefer #VE, but I can see how #MC has some appeal.  However, #VE has a
> mechanism to avoid reentrancy, unlike #MC.  How would that be better
> than the current mess with an NMI happening in the first few
> instructions of the #PF handler?
> 
> 

It has to be an IST vector due to the possibility of hitting a memory failure right after SYSCALL.  I imagine that making #VE use IST would be unfortunate.

I think #MC has a mechanism to prevent reentrancy to a limited extent. How does #VE avoid reentrancy?
Thomas Gleixner April 7, 2020, 10:48 p.m. UTC | #25
Andy Lutomirski <luto@amacapital.net> writes:
>> On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>>> “Page is malfunctioning” is tricky because you *must* deliver the
>>> event. x86’s #MC is not exactly a masterpiece, but it does kind of
>>> work.
>> 
>> Nooooo. This does not need #MC at all. Don't even think about it.
>
> Yessssssssssss.  Please do think about it. :)

I stared so much into that code recently that even thinking about it
hurts. :)

>> The point is that the access to such a page is either happening in user
>> space or in kernel space with a proper exception table fixup.
>> 
>> That means a real #PF is perfectly fine. That can be injected any time
>> and does not have the interrupt semantics of async PF.
>
> The hypervisor has no way to distinguish between
> MOV-and-has-valid-stack-and-extable-entry and
> MOV-definitely-can’t-fault-here.  Or, for that matter,
> MOV-in-do_page_fault()-will-recurve-if-it-faults.

The mechanism which Vivek wants to support has a well defined usage
scenario, i.e. either user space or kernel-valid-stack+extable-entry.

So why do you want to route that through #MC? 

>> So now lets assume we distangled async PF from #PF and made it a regular
>> interrupt, then the following situation still needs to be dealt with:
>> 
>>   guest -> access faults
>> 
>> host -> injects async fault
>> 
>>   guest -> handles and blocks the task
>> 
>> host figures out that the page does not exist anymore and now needs to
>> fixup the situation.
>> 
>> host -> injects async wakeup
>> 
>>   guest -> returns from aysnc PF interrupt and retries the instruction
>>            which faults again.
>> 
>> host -> knows by now that this is a real fault and injects a proper #PF
>> 
>>   guest -> #PF runs and either sends signal to user space or runs
>>            the exception table fixup for a kernel fault.
>
> Or guest blows up because the fault could not be recovered using #PF.

Not for the use case at hand. And for that you really want to use
regular #PF.

The scenario I showed above is perfectly legit:

   guest:
        copy_to_user()          <- Has extable
           -> FAULT

host:
   Oh, page is not there, give me some time to figure it out.

   inject async fault

   guest:
        handles async fault interrupt, enables interrupts, blocks

host:
   Situation resolved, shared file was truncated. Tell guest
  
   Inject #MC

   guest:
        handles #MC completely out of context because the faulting
        task is scheduled out.

        Rummage through wait lists (how do you protect them?) and find
        the waiter.

        Schedule irq_work to actually mark the waiter as failed and wake
        it up.

Sorry, but that's not really an improvement. That's worse. 

> I can see two somewhat sane ways to make this work.
>
> 1. Access to bad memory results in an async-page-not-present, except
> that, it’s not deliverable, the guest is killed.

That's incorrect. The proper reaction is a real #PF. Simply because this
is part of the contract of sharing some file backed stuff between host
and guest in a well defined "virtio" scenario and not a random access to
memory which might be there or not.

Look at it from the point where async whatever does not exist at all:

   guest:
        copy_to_user()          <- Has extable
           -> FAULT

host:
        suspend guest and resolve situation

        if (page swapped in)
           resume_guest();
        else
           inject_pf();

And this inject_pf() does not care whether it kills the guest or makes
it double/triple fault or whatever.

The 'tell the guest to do something else while host tries to sort it'
opportunistic thingy turns this into:

   guest:
        copy_to_user()          <- Has extable
           -> FAULT

host:
        tell guest to do something else, i.e. guest suspends task

        if (page swapped in)
           tell guest to resume suspended task
        else
           tell guest to resume suspended task

   guest resumes and faults again

host:
           inject_pf();

which is pretty much equivalent.

> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s
> an *architectural* turd. By all means, have a nice simple PV mechanism
> to tell the #MC code exactly what went wrong, but keep the overall
> flow the same as in the native case.

It's a completely different flow as you evaluate PV turd instead of
analysing the MCE banks and the other error reporting facilities.

> I think I like #2 much better. It has another nice effect: a good
> implementation will serve as a way to exercise the #MC code without
> needing to muck with EINJ or with whatever magic Tony uses. The
> average kernel developer does not have access to a box with testable
> memory failure reporting.

Yes, because CPU dudes decided that documenting the testability
mechanisms is a bad thing. They surely exist and for Nehalem at least
the ECC error injection mechanism was documented and usable.

But to actually make something useful you need to emulate the bank
interface on the hypervisor and expose the errors to the guest. The MCE
injection mechanism has some awful way to "emulate" MSR reads for
that. See mce_rdmsr(). *SHUDDER*

Thanks,

        tglx
Vivek Goyal April 7, 2020, 10:49 p.m. UTC | #26
On Tue, Apr 07, 2020 at 02:41:02PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > Andy Lutomirski <luto@amacapital.net> writes:
> >>>> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >>> Whether interrupts are enabled or not check only happens before we decide
> >>> if async pf protocol should be followed or not. Once we decide to
> >>> send PAGE_NOT_PRESENT, later notification PAGE_READY does not check
> >>> if interrupts are enabled or not. And it kind of makes sense otherwise
> >>> guest process will wait infinitely to receive PAGE_READY.
> >>> 
> >>> I modified the code a bit to disable interrupt and wait 10 seconds (after
> >>> getting PAGE_NOT_PRESENT message). And I noticed that error async pf
> >>> got delivered after 10 seconds after enabling interrupts. So error
> >>> async pf was not lost because interrupts were disabled.
> > 
> > Async PF is not the same as a real #PF. It just hijacked the #PF vector
> > because someone thought this is a brilliant idea.
> > 
> >>> Havind said that, I thought disabling interrupts does not mask exceptions.
> >>> So page fault exception should have been delivered even with interrupts
> >>> disabled. Is that correct? May be there was no vm exit/entry during
> >>> those 10 seconds and that's why.
> > 
> > No. Async PF is not a real exception. It has interrupt semantics and it
> > can only be injected when the guest has interrupts enabled. It's bad
> > design.
> > 
> >> My point is that the entire async pf is nonsense. There are two types of events right now:
> >> 
> >> “Page not ready”: normally this isn’t even visible to the guest — the
> >> guest just waits. With async pf, the idea is to try to tell the guest
> >> that a particular instruction would block and the guest should do
> >> something else instead. Sending a normal exception is a poor design,
> >> though: the guest may not expect this instruction to cause an
> >> exception. I think KVM should try to deliver an *interrupt* and, if it
> >> can’t, then just block the guest.
> > 
> > That's pretty much what it does, just that it runs this through #PF and
> > has the checks for interrupts disabled - i.e. 'can't right now' around
> > that. If it can't then KVM schedules the guest out until the situation
> > has been resolved.
> > 
> >> “Page ready”: this is a regular asynchronous notification just like,
> >> say, a virtio completion. It should be an ordinary interrupt.  Some in
> >> memory data structure should indicate which pages are ready.
> >> 
> >> “Page is malfunctioning” is tricky because you *must* deliver the
> >> event. x86’s #MC is not exactly a masterpiece, but it does kind of
> >> work.
> > 
> > Nooooo. This does not need #MC at all. Don't even think about it.
> 
> Yessssssssssss.  Please do think about it. :)
> 
> > 
> > The point is that the access to such a page is either happening in user
> > space or in kernel space with a proper exception table fixup.
> > 
> > That means a real #PF is perfectly fine. That can be injected any time
> > and does not have the interrupt semantics of async PF.
> 
> The hypervisor has no way to distinguish between MOV-and-has-valid-stack-and-extable-entry and MOV-definitely-can’t-fault-here.  Or, for that matter, MOV-in-do_page_fault()-will-recurse-if-it-faults.
> 
> > 
> > So now let's assume we disentangled async PF from #PF and made it a regular
> > interrupt, then the following situation still needs to be dealt with:
> > 
> >   guest -> access faults
> > 
> > host -> injects async fault
> > 
> >   guest -> handles and blocks the task
> > 
> > host figures out that the page does not exist anymore and now needs to
> > fixup the situation.
> > 
> > host -> injects async wakeup
> > 
> >   guest -> returns from async PF interrupt and retries the instruction
> >            which faults again.
> > 
> > host -> knows by now that this is a real fault and injects a proper #PF
> > 
> >   guest -> #PF runs and either sends signal to user space or runs
> >            the exception table fixup for a kernel fault.
> 
> Or guest blows up because the fault could not be recovered using #PF.
> 
> I can see two somewhat sane ways to make this work.
> 
> 1. Access to bad memory results in an async-page-not-present, except that, if it’s not deliverable, the guest is killed. Either that async-page-not-present has a special flag saying “memory failure” or the eventual wakeup says “memory failure”.
> 
> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case.
> 

Should we differentiate between two error cases. In this case memory is
not bad. Just that page being asked for can't be faulted in anymore. And
another error case is real memory failure. So for the first case it
could be injected into guest using #PF or #VE or something paravirt
specific and real hardware failures take #MC path (if they are not
already doing so).

Thanks
Vivek
Thomas Gleixner April 7, 2020, 11:21 p.m. UTC | #27
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 07/04/20 22:20, Thomas Gleixner wrote:
>>>> Havind said that, I thought disabling interrupts does not mask exceptions.
>>>> So page fault exception should have been delivered even with interrupts
>>>> disabled. Is that correct? May be there was no vm exit/entry during
>>>> those 10 seconds and that's why.
>> No. Async PF is not a real exception. It has interrupt semantics and it
>> can only be injected when the guest has interrupts enabled. It's bad
>> design.
>
> Page-ready async PF has interrupt semantics.
>
> Page-not-present async PF however does not have interrupt semantics, it
> has to be injected immediately or not at all (falling back to host page
> fault in the latter case).

If interrupts are disabled in the guest then it is NOT injected and the
guest is suspended. So it HAS interrupt semantics. Conditional ones,
i.e. if interrupts are disabled, bail, if not then inject it.

But that does not make it an exception by any means.

It never should have been hooked to #PF in the first place and it never
should have been named that way. The functionality is to opportunistically
tell the guest to do some other stuff.

So the proper name for this separate interrupt vector would be:

   VECTOR_OMG_DOS - Opportunistically Make Guest Do Other Stuff

and the counter part

   VECTOR_STOP_DOS - Stop Doing Other Stuff 

> So page-not-present async PF definitely needs to be an exception, this
> is independent of whether it can be injected when IF=0.

That wants to be a straight #PF. See my reply to Andy.

> Hypervisors do not have any reserved exception vector, and must use
> vectors up to 31, which is why I believe #PF was used in the first place
> (though that predates my involvement in KVM by a few years).

No. That was just bad taste or something worse. It has nothing to do
with exceptions, see above. Stop proliferating the confusion.

> These days, #VE would be a much better exception to use instead (and
> it also has a defined mechanism to avoid reentrancy).

#VE is not going to solve anything.

The idea of OMG_DOS is to (opportunistically) avoid having the guest (and
perhaps host) sit idle waiting for I/O until the fault has been
resolved. That makes sense as there might be enough other stuff to do
which does not depend on that particular page. If not then fine, the
guest will go idle.

Thanks,

        tglx
Paolo Bonzini April 8, 2020, 12:30 a.m. UTC | #28
On 08/04/20 00:29, Andy Lutomirski wrote:
>> I prefer #VE, but I can see how #MC has some appeal.  However, #VE has a
>> mechanism to avoid reentrancy, unlike #MC.  How would that be better
>> than the current mess with an NMI happening in the first few
>> instructions of the #PF handler?
>>
>>
> It has to be an IST vector due to the possibility of hitting a memory failure right after SYSCALL.

Not if syscall clears IF, right?

> I think #MC has a mechanism to prevent reentrancy to a limited extent. How does #VE avoid reentrancy?

In hardware, it has a flag word and delivers a vmexit instead of #VE if
that word is not zero (see towards the end of 25.5.6.1 Convertible EPT
Violations).  Here it would be the same except it would just do the page
fault synchronously in the host.
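
For reference, a minimal guest-side sketch of that handshake.  The
layout follows the #VE information area as described in the SDM
(25.5.6); the struct and function names here are purely illustrative,
not an existing API:

struct ve_info_area {
        u32 exit_reason;        /* 48 == EPT violation */
        u32 busy;               /* set to 0xffffffff by the CPU on delivery */
        u64 exit_qualification;
        u64 guest_linear_address;
        u64 guest_physical_address;
        u16 eptp_index;
};

static void handle_ve(struct ve_info_area *ve)
{
        /* ... consume exit_qualification / GLA / GPA here ... */

        /*
         * Re-arm #VE delivery.  Until this store the CPU delivers an
         * EPT-violation vmexit instead of nesting another #VE.
         */
        WRITE_ONCE(ve->busy, 0);
}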

Paolo
Andy Lutomirski April 8, 2020, 4:48 a.m. UTC | #29
> On Apr 7, 2020, at 3:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Andy Lutomirski <luto@amacapital.net> writes:
>>>> On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Andy Lutomirski <luto@amacapital.net> writes:
>>>> “Page is malfunctioning” is tricky because you *must* deliver the
>>>> event. x86’s #MC is not exactly a masterpiece, but it does kind of
>>>> work.
>>> 
>>> Nooooo. This does not need #MC at all. Don't even think about it.
>> 
>> Yessssssssssss.  Please do think about it. :)
> 
> I stared too much into that code recently that even thinking about it
> hurts. :)
> 
>>> The point is that the access to such a page is either happening in user
>>> space or in kernel space with a proper exception table fixup.
>>> 
>>> That means a real #PF is perfectly fine. That can be injected any time
>>> and does not have the interrupt semantics of async PF.
>> 
>> The hypervisor has no way to distinguish between
>> MOV-and-has-valid-stack-and-extable-entry and
>> MOV-definitely-can’t-fault-here.  Or, for that matter,
>> MOV-in-do_page_fault()-will-recurse-if-it-faults.
> 
> The mechanism which Vivek wants to support has a well defined usage
> scenario, i.e. either user space or kernel-valid-stack+extable-entry.
> 
> So why do you want to route that through #MC? 

To be clear, I hate #MC as much as the next person.  But I think it needs to be an IST vector, and the #MC *vector* isn’t so terrible.  (The fact that we can’t atomically return from #MC and re-enable CR4.MCE in a single instruction is problematic but not the end of the world.)  But see below — I don’t think my suggestion should work quite the way you interpreted it.

>>> 
>>> 
>>>  guest -> #PF runs and either sends signal to user space or runs
>>>           the exception table fixup for a kernel fault.
>> 
>> Or guest blows up because the fault could not be recovered using #PF.
> 
> Not for the use case at hand. And for that you really want to use
> regular #PF.
> 
> The scenario I showed above is perfectly legit:
> 
>   guest:
>        copy_to_user()          <- Has extable
>           -> FAULT
> 
> host:
>   Oh, page is not there, give me some time to figure it out.
> 
>   inject async fault
> 
>   guest:
>        handles async fault interrupt, enables interrupts, blocks
> 
> host:
>   Situation resolved, shared file was truncated. Tell guest

All good so far.

> 
>   Inject #MC

No, not what I meant. Host has two sane choices here IMO:

1. Tell the guest that the page is gone as part of the wakeup. No #PF or #MC.

2. Tell guest that it’s resolved and inject #MC when the guest retries.  The #MC is a real fault, RIP points to the right place, etc.

> 
>   
>> 
>> 
>> 1. Access to bad memory results in an async-page-not-present, except
>> that, if it’s not deliverable, the guest is killed.
> 
> That's incorrect. The proper reaction is a real #PF. Simply because this
> is part of the contract of sharing some file backed stuff between host
> and guest in a well defined "virtio" scenario and not a random access to
> memory which might be there or not.

The problem is that the host doesn’t know when #PF is safe. It’s sort of the same problem that async pf has now.  The guest kernel could access the problematic page in the middle of an NMI, under pagefault_disable(), etc — getting #PF as a result of CPL0 access to a page with a valid guest PTE is simply not part of the x86 architecture.

> 
> Look at it from the point where async whatever does not exist at all:
> 
>   guest:
>        copy_to_user()          <- Has extable
>           -> FAULT
> 
> host:
>        suspend guest and resolve situation
> 
>        if (page swapped in)
>           resume_guest();
>        else
>           inject_pf();
> 
> And this inject_pf() does not care whether it kills the guest or makes
> it double/triple fault or whatever.
> 
> The 'tell the guest to do something else while host tries to sort it'
> opportunistic thingy turns this into:
> 
>   guest:
>        copy_to_user()          <- Has extable
>           -> FAULT
> 
> host:
>        tell guest to do something else, i.e. guest suspends task
> 
>        if (page swapped in)
>           tell guest to resume suspended task
>        else
>           tell guest to resume suspended task
> 
>   guest resumes and faults again
> 
> host:
>           inject_pf();
> 
> which is pretty much equivalent.

Replace copy_to_user() with some access to a gup-ed mapping with no extable handler and it doesn’t look so good any more.

Of course, the guest will oops if this happens, but the guest needs to be able to oops cleanly. #PF is too fragile for this because it’s not IST, and #PF is the wrong thing anyway — #PF is all about guest-virtual-to-guest-physical mappings.  Heck, what would CR2 be?  The host might not even know the guest virtual address.

> 
>> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s
>> an *architectural* turd. By all means, have a nice simple PV mechanism
>> to tell the #MC code exactly what went wrong, but keep the overall
>> flow the same as in the native case.
> 
> It's a completely different flow as you evaluate PV turd instead of
> analysing the MCE banks and the other error reporting facilities.

I’m fine with the flow being different. do_machine_check() could have entirely different logic to decide the error in PV.  But I think we should reuse the overall flow: kernel gets #MC with RIP pointing to the offending instruction. If there’s an extable entry that can handle memory failure, handle it. If it’s a user access, handle it.  If it’s an unrecoverable error because it was a non-extable kernel access, oops or panic.

The actual PV part could be extremely simple: the host just needs to tell the guest “this #MC is due to memory failure at this guest physical address”.  No banks, no DIMM slot, no rendezvous crap (LMCE), no other nonsense.  It would be nifty if the host also told the guest what the guest virtual address was if the host knows it.
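
As a sketch of what such a per-vCPU record could look like (purely
hypothetical -- no such ABI exists today; the names below are made up):

#define PV_MCE_GVA_VALID        (1ULL << 0)

/*
 * Hypothetical shared record: the host fills it before injecting #MC,
 * and the guest's #MC handler reads it instead of decoding MCE banks.
 */
struct pv_mce_info {
        u64 flags;      /* PV_MCE_GVA_VALID if 'gva' is meaningful */
        u64 gpa;        /* failed guest physical address */
        u64 gva;        /* guest virtual address, if the host knows it */
};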
Paolo Bonzini April 8, 2020, 8:23 a.m. UTC | #30
On 08/04/20 01:21, Thomas Gleixner wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>> On 07/04/20 22:20, Thomas Gleixner wrote:
>>>>> Havind said that, I thought disabling interrupts does not mask exceptions.
>>>>> So page fault exception should have been delivered even with interrupts
>>>>> disabled. Is that correct? May be there was no vm exit/entry during
>>>>> those 10 seconds and that's why.
>>> No. Async PF is not a real exception. It has interrupt semantics and it
>>> can only be injected when the guest has interrupts enabled. It's bad
>>> design.
>>
>> Page-ready async PF has interrupt semantics.
>>
>> Page-not-present async PF however does not have interrupt semantics, it
>> has to be injected immediately or not at all (falling back to host page
>> fault in the latter case).
> 
> If interrupts are disabled in the guest then it is NOT injected and the
> guest is suspended. So it HAS interrupt semantics. Conditional ones,
> i.e. if interrupts are disabled, bail, if not then inject it.

Interrupts can be delayed by TPR or STI/MOV SS interrupt window, async
page faults cannot (again, not the page-ready kind).  Page-not-present
async page faults are almost a perfect match for the hardware use of #VE
(and it might even be possible to let the processor deliver the
exceptions).  There are other advantages:

- the only real problem with using #PF (with or without
KVM_ASYNC_PF_SEND_ALWAYS) seems to be the NMI reentrancy issue, which
would not be there for #VE.

- #VE are combined the right way with other exceptions (the
benign/contributory/pagefault stuff)

- adjusting KVM and Linux to use #VE instead of #PF would be less than
100 lines of code.

Paolo

> But that does not make it an exception by any means.
> 
> It never should have been hooked to #PF in the first place and it never
> should have been named that way. The functionality is to opportunisticly
> tell the guest to do some other stuff.
> 
> So the proper name for this seperate interrupt vector would be:
> 
>    VECTOR_OMG_DOS - Opportunisticly Make Guest Do Other Stuff
> 
> and the counter part
> 
>    VECTOR_STOP_DOS - Stop Doing Other Stuff 
> 
>> So page-not-present async PF definitely needs to be an exception, this
>> is independent of whether it can be injected when IF=0.
> 
> That wants to be a straight #PF. See my reply to Andy.
> 
>> Hypervisors do not have any reserved exception vector, and must use
>> vectors up to 31, which is why I believe #PF was used in the first place
>> (though that predates my involvement in KVM by a few years).
> 
> No. That was just bad taste or something worse. It has nothing to do
> with exceptions, see above. Stop proliferating the confusion.
> 
>> These days, #VE would be a much better exception to use instead (and
>> it also has a defined mechanism to avoid reentrancy).
> 
> #VE is not going to solve anything.
> 
> The idea of OMG_DOS is to (opportunisticly) avoid that the guest (and
> perhaps host) sit idle waiting for I/O until the fault has been
> resolved. That makes sense as there might be enough other stuff to do
> which does not depend on that particular page. If not then fine, the
> guest will go idle.
> 
> Thanks,
> 
>         tglx
>
Borislav Petkov April 8, 2020, 9:32 a.m. UTC | #31
On Tue, Apr 07, 2020 at 09:48:02PM -0700, Andy Lutomirski wrote:
> I’m fine with the flow being different. do_machine_check() could
> have entirely different logic to decide the error in PV.

Nope, do_machine_check() is already as ugly as it gets. I don't want any
more crap in it.

> But I think we should reuse the overall flow: kernel gets #MC with
> RIP pointing to the offending instruction. If there’s an extable
> entry that can handle memory failure, handle it. If it’s a user
> access, handle it. If it’s an unrecoverable error because it was a
> non-extable kernel access, oops or panic.
>
> The actual PV part could be extremely simple: the host just needs to
> tell the guest “this #MC is due to memory failure at this guest
> physical address”. No banks, no DIMM slot, no rendezvous crap
> (LMCE), no other nonsense. It would be nifty if the host also told the
> guest what the guest virtual address was if the host knows it.

It better be a whole different path and a whole different vector. If you
wanna keep it simple and apart from all of the other nonsense, then you
can just as well use a completely different vector.

Thx.
Borislav Petkov April 8, 2020, 10:01 a.m. UTC | #32
On Tue, Apr 07, 2020 at 06:49:03PM -0400, Vivek Goyal wrote:
> > 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case.
> > 
> 
> Should we differentiate between two error cases. In this case memory is
> not bad. Just that page being asked for can't be faulted in anymore. And
> another error case is real memory failure. So for the first case it
> could be injected into guest using #PF or #VE or something paravirt
> specific and real hardware failures take #MC path (if they are not
> already doing so).

For handling memory failures with guests there's this whole thing
described in Documentation/vm/hwpoison.rst, and KVM has support for that.
Thomas Gleixner April 8, 2020, 10:12 a.m. UTC | #33
Andy Lutomirski <luto@amacapital.net> writes:
>> On Apr 7, 2020, at 3:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>   Inject #MC
>
> No, not what I meant. Host has two sane choices here IMO:
>
> 1. Tell the guest that the page is gone as part of the wakeup. No #PF or #MC.
>
> 2. Tell guest that it’s resolved and inject #MC when the guest
> retries.  The #MC is a real fault, RIP points to the right place, etc.

Ok, that makes sense.

>>> 1. Access to bad memory results in an async-page-not-present, except
>>> that, it’s not deliverable, the guest is killed.
>> 
>> That's incorrect. The proper reaction is a real #PF. Simply because this
>> is part of the contract of sharing some file backed stuff between host
>> and guest in a well defined "virtio" scenario and not a random access to
>> memory which might be there or not.
>
> The problem is that the host doesn’t know when #PF is safe. It’s sort
> of the same problem that async pf has now.  The guest kernel could
> access the problematic page in the middle of an NMI, under
> pagefault_disable(), etc — getting #PF as a result of CPL0 access to a
> page with a valid guest PTE is simply not part of the x86
> architecture.

Fair enough. 

> Replace copy_to_user() with some access to a gup-ed mapping with no
> extable handler and it doesn’t look so good any more.

In this case the guest needs to die.

> Of course, the guest will oops if this happens, but the guest needs to
> be able to oops cleanly. #PF is too fragile for this because it’s not
> IST, and #PF is the wrong thing anyway — #PF is all about
> guest-virtual-to-guest-physical mappings.  Heck, what would CR2 be?
> The host might not even know the guest virtual address.

It knows, but I can see your point.

>>> 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but it’s
>>> an *architectural* turd. By all means, have a nice simple PV mechanism
>>> to tell the #MC code exactly what went wrong, but keep the overall
>>> flow the same as in the native case.
>> 
>> It's a completely different flow as you evaluate PV turd instead of
>> analysing the MCE banks and the other error reporting facilities.
>
> I’m fine with the flow being different. do_machine_check() could have
> entirely different logic to decide the error in PV.  But I think we
> should reuse the overall flow: kernel gets #MC with RIP pointing to
> the offending instruction. If there’s an extable entry that can handle
> memory failure, handle it. If it’s a user access, handle it.  If it’s
> an unrecoverable error because it was a non-extable kernel access,
> oops or panic.
>
> The actual PV part could be extremely simple: the host just needs to
> tell the guest “this #MC is due to memory failure at this guest
> physical address”.  No banks, no DIMM slot, no rendezvous crap (LMCE),
> no other nonsense.  It would be nifty if the host also told the guest
> what the guest virtual address was if the host knows it.

It does. The EPT violations store:

  - guest-linear address
  - guest-physical address

That's also part of the #VE exception to which Paolo was referring.

Thanks,

        tglx
Thomas Gleixner April 8, 2020, 1:01 p.m. UTC | #34
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 08/04/20 01:21, Thomas Gleixner wrote:
>>>> No. Async PF is not a real exception. It has interrupt semantics and it
>>>> can only be injected when the guest has interrupts enabled. It's bad
>>>> design.
>>>
>>> Page-ready async PF has interrupt semantics.
>>>
>>> Page-not-present async PF however does not have interrupt semantics, it
>>> has to be injected immediately or not at all (falling back to host page
>>> fault in the latter case).
>> 
>> If interrupts are disabled in the guest then it is NOT injected and the
>> guest is suspended. So it HAS interrupt semantics. Conditional ones,
>> i.e. if interrupts are disabled, bail, if not then inject it.
>
> Interrupts can be delayed by TPR or STI/MOV SS interrupt window, async
> page faults cannot (again, not the page-ready kind).

Can we pretty please stop using the term async page fault? It's just
wrong and causes more confusion than anything else.

What this does is really what I called Opportunistic Make Guest Do Other
Stuff. And it has neither true exception nor true interrupt semantics.

It's a software event which is injected into the guest to let the guest
do something else than waiting for the actual #PF cause to be
resolved. It's part of a software protocol between host and guest.

And it comes with restrictions:

    The Do Other Stuff event can only be delivered when guest IF=1.

    If guest IF=0 then the host has to suspend the guest until the
    situation is resolved.

    The 'Situation resolved' event must also wait for a guest IF=1 slot.

> Page-not-present async page faults are almost a perfect match for the
> hardware use of #VE (and it might even be possible to let the
> processor deliver the exceptions).  There are other advantages:
>
> - the only real problem with using #PF (with or without
> KVM_ASYNC_PF_SEND_ALWAYS) seems to be the NMI reentrancy issue, which
> would not be there for #VE.
>
> - #VE are combined the right way with other exceptions (the
> benign/contributory/pagefault stuff)
>
> - adjusting KVM and Linux to use #VE instead of #PF would be less than
> 100 lines of code.

If you just want to solve Vivek's problem, then it's good enough. I.e. the
file truncation turns the EPT entries into #VE convertible entries and
the guest #VE handler can figure it out. This one can be injected
directly by the hardware, i.e. you don't need a VMEXIT.

If you want the opportunistic do other stuff mechanism, then #VE has
exactly the same problems as the existing async "PF". It's not magically
making that go away.

One possible solution might be to make all recoverable EPT entries
convertible and let the HW inject #VE for those.

So the #VE handler in the guest would have to do:

        if (!recoverable()) {
                if (user_mode)
                        send_signal();
                else if (!fixup_exception())
                        die_hard();
                goto done;
        }

        store_ve_info_in_pv_page();

        if (!user_mode(regs) || !preemptible()) {
                hypercall_resolve_ept(can_continue = false);
        } else {
                init_completion();
                hypercall_resolve_ept(can_continue = true);
                wait_for_completion();
        }

or something like that.

The hypercall to resolve the EPT fail on the host acts on the
can_continue argument.

If false, it suspends the guest vCPU and only returns when done.

If true it kicks the resolve process and returns to the guest which
suspends the task and tries to do something else.

The wakeup side needs to be a regular interrupt and cannot go through
#VE.

Thanks,

        tglx
Sean Christopherson April 8, 2020, 3:34 p.m. UTC | #35
On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote:
> Page-not-present async page faults are almost a perfect match for the
> hardware use of #VE (and it might even be possible to let the processor
> deliver the exceptions).

My "async" page fault knowledge is limited, but if the desired behavior is
to reflect a fault into the guest for select EPT Violations, then yes,
enabling EPT Violation #VEs in hardware is doable.  The big gotcha is that
KVM needs to set the suppress #VE bit for all EPTEs when allocating a new
MMU page, otherwise not-present faults on zero-initialized EPTEs will get
reflected.

Attached a patch that does the prep work in the MMU.  The VMX usage would be:

	kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT);

when EPT Violation #VEs are enabled.  It's 64-bit only as it uses stosq to
initialize EPTEs.  32-bit could also be supported by doing memcpy() from
a static page.
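
A rough sketch of what that initialization could look like (hedged: the
setter name comes from the description above, everything else is an
assumption and not the attached patch):

#include <linux/string.h>       /* memset64() */

static u64 shadow_init_value;   /* 0, or VMX_EPT_SUPPRESS_VE_BIT */

void kvm_mmu_set_spte_init_value(u64 init_value)
{
        shadow_init_value = init_value;
}

/* Fill a freshly allocated 4K page-table page: 512 eight-byte entries. */
static void mmu_init_shadow_page(u64 *sptep)
{
        memset64(sptep, shadow_init_value, 512);
}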
Peter Zijlstra April 8, 2020, 3:38 p.m. UTC | #36
On Wed, Apr 08, 2020 at 03:01:58PM +0200, Thomas Gleixner wrote:
> And it comes with restrictions:
> 
>     The Do Other Stuff event can only be delivered when guest IF=1.
> 
>     If guest IF=0 then the host has to suspend the guest until the
>     situation is resolved.
> 
>     The 'Situation resolved' event must also wait for a guest IF=1 slot.

Moo, can we pretty please already kill that ALWAYS and IF nonsense? That
results in that terrifyingly crap HLT loop. That needs to die with
extreme prejudice.

So the host only injects these OMFG_DOS things when the guest is in
luserspace -- which it can see in the VMCS state IIRC. And then using
#VE for the make-it-go signal is far preferred over the current #PF
abuse.

> > Page-not-present async page faults are almost a perfect match for the
> > hardware use of #VE (and it might even be possible to let the
> > processor deliver the exceptions).  There are other advantages:
> >
> > - the only real problem with using #PF (with or without
> > KVM_ASYNC_PF_SEND_ALWAYS) seems to be the NMI reentrancy issue, which
> > would not be there for #VE.
> >
> > - #VE are combined the right way with other exceptions (the
> > benign/contributory/pagefault stuff)
> >
> > - adjusting KVM and Linux to use #VE instead of #PF would be less than
> > 100 lines of code.
> 
> If you just want to solve Viveks problem, then its good enough. I.e. the
> file truncation turns the EPT entries into #VE convertible entries and
> the guest #VE handler can figure it out. This one can be injected
> directly by the hardware, i.e. you don't need a VMEXIT.

That sounds like something that doesn't actually need the whole
'async'/do-something-else-for-a-while crap, right? It's a #PF trap from
kernel space where we need to report fail.

> If you want the opportunistic do other stuff mechanism, then #VE has
> exactly the same problems as the existing async "PF". It's not magicaly
> making that go away.

We need to somehow have the guest teach the host how to tell if it can
inject that OMFG_DOS thing or not. Injecting it only to then instantly
exit again is stupid and expensive.

Clearly we don't want to expose preempt_count and make that ABI, but is
there some way we can push a snippet of code to the host that instructs
the host how to determine if it can sleep or not? I realize that pushing
actual x86 .text is a giant security problem, so perhaps a snippet of
BPF that the host can verify, which it can run on the guest image?

Make it a hard error (guest cpu dies) to inject the OMFG_DOS signal on a
context that cannot put the task to sleep.
Thomas Gleixner April 8, 2020, 4:41 p.m. UTC | #37
Peter Zijlstra <peterz@infradead.org> writes:
> On Wed, Apr 08, 2020 at 03:01:58PM +0200, Thomas Gleixner wrote:
>> And it comes with restrictions:
>> 
>>     The Do Other Stuff event can only be delivered when guest IF=1.
>> 
>>     If guest IF=0 then the host has to suspend the guest until the
>>     situation is resolved.
>> 
>>     The 'Situation resolved' event must also wait for a guest IF=1 slot.
>
> Moo, can we pretty please already kill that ALWAYS and IF nonsense? That
> results in that terrifyingly crap HLT loop. That needs to die with
> extreme prejudice.
>
> So the host only inject these OMFG_DOS things when the guest is in
> luserspace -- which it can see in the VMCS state IIRC. And then using
> #VE for the make-it-go signal is far preferred over the currentl #PF
> abuse.

Yes, but this requires software based injection.

>> If you just want to solve Viveks problem, then its good enough. I.e. the
>> file truncation turns the EPT entries into #VE convertible entries and
>> the guest #VE handler can figure it out. This one can be injected
>> directly by the hardware, i.e. you don't need a VMEXIT.
>
> That sounds like something that doesn't actually need the whole
> 'async'/do-something-else-for-a-while crap, right? It's a #PF trap from
> kernel space where we need to report fail.

Fail or fixup via extable.

>> If you want the opportunistic do other stuff mechanism, then #VE has
>> exactly the same problems as the existing async "PF". It's not magicaly
>> making that go away.
>
> We need to somehow have the guest teach the host how to tell if it can
> inject that OMFG_DOS thing or not. Injecting it only to then instantly
> exit again is stupid and expensive.

Not if the injection is actually done by the hardware. Then the guest
handles #VE and tells the host what to do.

> Clearly we don't want to expose preempt_count and make that ABI, but is
> there some way we can push a snippet of code to the host that instructs
> the host how to determine if it can sleep or not? I realize that pushing
> actual x86 .text is a giant security problem, so perhaps a snipped of
> BPF that the host can verify, which it can run on the guest image ?

*SHUDDER*

> Make it a hard error (guest cpu dies) to inject the OMFG_DOS signal on a
> context that cannot put the task to sleep.

With the hardware based #VE and a hypercall which tells the host how to
handle the EPT fixup (suspend the vcpu or let it continue and do the
completion later) you don't have to play any games on the host. If the
guest tells the host the wrong thing, then the guest will have to mop up
the pieces.

Thanks,

        tglx
Paolo Bonzini April 8, 2020, 4:50 p.m. UTC | #38
On 08/04/20 17:34, Sean Christopherson wrote:
> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote:
>> Page-not-present async page faults are almost a perfect match for the
>> hardware use of #VE (and it might even be possible to let the processor
>> deliver the exceptions).
> 
> My "async" page fault knowledge is limited, but if the desired behavior is
> to reflect a fault into the guest for select EPT Violations, then yes,
> enabling EPT Violation #VEs in hardware is doable.  The big gotcha is that
> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new
> MMU page, otherwise not-present faults on zero-initialized EPTEs will get
> reflected.
> 
> Attached a patch that does the prep work in the MMU.  The VMX usage would be:
> 
> 	kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT);
> 
> when EPT Violation #VEs are enabled.  It's 64-bit only as it uses stosq to
> initialize EPTEs.  32-bit could also be supported by doing memcpy() from
> a static page.

The complication is that (at least according to the current ABI) we
would not want #VE to kick in if the guest currently has IF=0 (and possibly
CPL=0).  But the ABI is not set in stone, and anyway the #VE protocol is
a decent one and worth using as a base for whatever PV protocol we design.

Paolo
Thomas Gleixner April 8, 2020, 6:01 p.m. UTC | #39
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 08/04/20 17:34, Sean Christopherson wrote:
>> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote:
>>> Page-not-present async page faults are almost a perfect match for the
>>> hardware use of #VE (and it might even be possible to let the processor
>>> deliver the exceptions).
>> 
>> My "async" page fault knowledge is limited, but if the desired behavior is
>> to reflect a fault into the guest for select EPT Violations, then yes,
>> enabling EPT Violation #VEs in hardware is doable.  The big gotcha is that
>> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new
>> MMU page, otherwise not-present faults on zero-initialized EPTEs will get
>> reflected.
>> 
>> Attached a patch that does the prep work in the MMU.  The VMX usage would be:
>> 
>> 	kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT);
>> 
>> when EPT Violation #VEs are enabled.  It's 64-bit only as it uses stosq to
>> initialize EPTEs.  32-bit could also be supported by doing memcpy() from
>> a static page.
>
> The complication is that (at least according to the current ABI) we
> would not want #VE to kick if the guest currently has IF=0 (and possibly
> CPL=0).  But the ABI is not set in stone, and anyway the #VE protocol is
> a decent one and worth using as a base for whatever PV protocol we design.

Forget the current async PF semantics (or the lack thereof). You really want
to start from scratch and ignore the whole thing.

The charm of #VE is that the hardware can inject it and it's not nesting
until the guest cleared the second word in the VE information area. If
that word is not 0 then you get a regular vmexit where you suspend the
vcpu until the nested problem is solved.

So you really don't worry about the guest CPU state at all. The guest
side #VE handler has to decide what it wants from the host depending on
it's internal state:

     - Suspend me and resume once the EPT fail is solved

     - Let me park the failing task and tell me once you resolved the
       problem.

That's pretty straight forward and avoids the whole nonsense which the
current mess contains. It completely avoids the allocation stuff as well
as you need to use a PV page where the guest copies the VE information
to.

The notification that a problem has been resolved needs to go through a
separate vector which still has the IF=1 requirement obviously.

Thanks,

        tglx
Vivek Goyal April 8, 2020, 6:23 p.m. UTC | #40
On Tue, Apr 07, 2020 at 09:48:02PM -0700, Andy Lutomirski wrote:

[..]
> It would be nifty if the host also told the guest what the guest virtual address was if the host knows it.

It would be good to send the guest virtual address as well. When
sending SIGBUS to guest user space, information about which access
triggered the SIGBUS will be useful.

I thought GUEST_LINEAR_ADDRESS provides the guest virtual address if
the EPT_VIOLATION_GLA_VALID bit is set, and it seems to work for my
simple test case. But when I try to read the Intel SDM, section 27.2
(VM exits, EPT violations), I am not so sure.

Somebody who understands this better, can you please help me understand
what exactly GUEST_LINEAR_ADDRESS is supposed to contain during
an EPT violation? I assumed it is the guest virtual address and added a
patch in my RFC patch series.

https://lore.kernel.org/kvm/20200331194011.24834-3-vgoyal@redhat.com/

But I might have misunderstood it.

Vivek
Vivek Goyal April 8, 2020, 8:34 p.m. UTC | #41
On Wed, Apr 08, 2020 at 08:01:36PM +0200, Thomas Gleixner wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> > On 08/04/20 17:34, Sean Christopherson wrote:
> >> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote:
> >>> Page-not-present async page faults are almost a perfect match for the
> >>> hardware use of #VE (and it might even be possible to let the processor
> >>> deliver the exceptions).
> >> 
> >> My "async" page fault knowledge is limited, but if the desired behavior is
> >> to reflect a fault into the guest for select EPT Violations, then yes,
> >> enabling EPT Violation #VEs in hardware is doable.  The big gotcha is that
> >> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new
> >> MMU page, otherwise not-present faults on zero-initialized EPTEs will get
> >> reflected.
> >> 
> >> Attached a patch that does the prep work in the MMU.  The VMX usage would be:
> >> 
> >> 	kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT);
> >> 
> >> when EPT Violation #VEs are enabled.  It's 64-bit only as it uses stosq to
> >> initialize EPTEs.  32-bit could also be supported by doing memcpy() from
> >> a static page.
> >
> > The complication is that (at least according to the current ABI) we
> > would not want #VE to kick if the guest currently has IF=0 (and possibly
> > CPL=0).  But the ABI is not set in stone, and anyway the #VE protocol is
> > a decent one and worth using as a base for whatever PV protocol we design.
> 
> Forget the current pf async semantics (or the lack of). You really want
> to start from scratch and igore the whole thing.
> 
> The charm of #VE is that the hardware can inject it and it's not nesting
> until the guest cleared the second word in the VE information area. If
> that word is not 0 then you get a regular vmexit where you suspend the
> vcpu until the nested problem is solved.

So IIUC, only one process on a vcpu could afford to relinquish the cpu
to another task. If the next task also triggers an EPT violation, that
will result in a VM exit (as the previous #VE is not complete yet) and
the vcpu will be halted.

> 
> So you really don't worry about the guest CPU state at all. The guest
> side #VE handler has to decide what it wants from the host depending on
> it's internal state:
> 
>      - Suspend me and resume once the EPT fail is solved
> 
>      - Let me park the failing task and tell me once you resolved the
>        problem.
> 
> That's pretty straight forward and avoids the whole nonsense which the
> current mess contains. It completely avoids the allocation stuff as well
> as you need to use a PV page where the guest copies the VE information
> to.
> 
> The notification that a problem has been resolved needs to go through a
> separate vector which still has the IF=1 requirement obviously.

How is this vector decided between guest and host? Will failure to
fault in the page be communicated through the same vector?

Thanks
Vivek
Thomas Gleixner April 8, 2020, 11:06 p.m. UTC | #42
Vivek Goyal <vgoyal@redhat.com> writes:
> On Wed, Apr 08, 2020 at 08:01:36PM +0200, Thomas Gleixner wrote:
>> Forget the current pf async semantics (or the lack of). You really want
>> to start from scratch and igore the whole thing.
>> 
>> The charm of #VE is that the hardware can inject it and it's not nesting
>> until the guest cleared the second word in the VE information area. If
>> that word is not 0 then you get a regular vmexit where you suspend the
>> vcpu until the nested problem is solved.
>
> So IIUC, only one process on a vcpu could affort to relinquish cpu to
> another task. If next task also triggers EPT violation, that will result
> in VM exit (as previous #VE is not complete yet) and vcpu will be
> halted.

No. The guest #VE handler reads VE information, stores it into a PV page
which is used to communicate with the host, invokes the hypercall and
clears word1 so a consecutive #VE can be handled.

If the hypercall is telling the host to suspend the guest (e.g. because
the exception hit an interrupt or preemption disabled region where
scheduling is not possible) then the host suspends the vCPU until the
EPT issue is fixed. Before that hypercall returns the state of the
recursion prevention word is irrelevant as the vCPU is not running, so
it's fine to clear it afterwards.

If the hypercall is telling the host that it can schedule and do
something else, then the word is cleared after the hypercall
returns. This makes sure that the host has gathered the information
before another VE can be injected.

TBH, I really was positively surprised that this HW mechanism actually
makes sense. It prevents the following situation:

  1) Guest triggers a EPT fail

  2) HW injects #VE sets VE info

  3) Guest handles #VE and before being able to gather the VE info data
     it triggers another EPT fail

  4) HW injects #VE sets VE info -> FAIL

So if we clear the reentrancy protection after saving the info in the
PV page and after the hypercall returns, it's guaranteed that the above
results in:

  1) Guest triggers a EPT fail

  2) HW injects #VE sets VE info

  3) Guest handles #VE and before being able to gather the VE info data
     and clearing the reentrancy protection it triggers another EPT fail

  4) VMEXIT which needs to be handled by suspending the vCPU, solving
     the issue and resuming it, which allows it to handle the original
     fail #1

>> So you really don't worry about the guest CPU state at all. The guest
>> side #VE handler has to decide what it wants from the host depending on
>> it's internal state:
>> 
>>      - Suspend me and resume once the EPT fail is solved
>> 
>>      - Let me park the failing task and tell me once you resolved the
>>        problem.
>> 
>> That's pretty straight forward and avoids the whole nonsense which the
>> current mess contains. It completely avoids the allocation stuff as well
>> as you need to use a PV page where the guest copies the VE information
>> to.
>> 
>> The notification that a problem has been resolved needs to go through a
>> separate vector which still has the IF=1 requirement obviously.
>
> How is this vector decided between guest and host.

That's either a fixed vector which then becomes ABI or the guest tells
the host via some PV/hypercall interface that it can handle #VE and that
the vector for notification is number X.

> Failure to fault in page will be communicated through same
> vector?

The PV page which communicates the current and eventually pending EPT
fails to the host is also the communication mechanism for the outcome.

Let's assume that the PV page contains an array of guest/host
communication data structures:

struct ve_info {
	struct ve_hw_info	hw_info;
        unsigned long		state;
        struct rcuwait          wait;
};

The PV page looks like this:

struct ve_page {
	struct ve_info		info[N_ENTRIES];
        unsigned int		guest_current;
        unsigned int		host_current;
        unsigned long		active[BITS_TO_LONGS(N_ENTRIES)];
};

The #VE handler does:

        struct ve_page *vp = this_cpu_ptr(&ve_page);
        struct ve_info *info;
        bool can_continue;

        idx = find_first_zero_bit(vp->active, N_ENTRIES);
        BUG_ON(idx >= N_ENTRIES);
        set_bit(idx, vp->active);
        info = vp->info + idx;

        copy_ve_info(&info->hw_info, ve_hwinfo);
        vp->guest_current = idx;

        if (test_and_set_thread_flag(TIF_IN_VE) || bitmap_full(vp->active, N_ENTRIES))
                can_continue = false;
        else
                can_continue = user_mode(regs) || preemptible();

        if (can_continue) {
                info->state = CAN_CONTINUE;
                hypercall_ve();
                ve_hwinfo.reentrancy_protect = 0;
                rcuwait_wait_event(&info->wait, info->state != CAN_CONTINUE, TASK_IDLE);
        } else {
                info->state = SUSPEND_ME;
                hypercall_ve();
                ve_hwinfo.reentrancy_protect = 0;
        }

        state = info->state;
        info->state = NONE;
        clear_bit(idx, vp->active);

        switch (state) {
        case RESOLVED:
                break;

        case FAIL:
                if (user_mode(regs))
                        send_signal(.....);
                else if (!fixup_exception())
                        panic("I'm toast");
                break;

        default:
                panic("Confused");
        }
        
        clear_thread_flag(TIF_IN_VE);
        
Now the host completion does:

        struct ve_page *vp = vcpu->ve_page;
        struct ve_info *info = vp->info + idx;

        state = info->state;
        info->state = resolved_state;

        switch (state) {
        case SUSPEND_ME:
        	resume_vcpu(vcpu);
                break;
        case CAN_CONTINUE:
        	queue_completion(vcpu, idx);
                break;
        default:
                kill_guest();
        }

and the host completion injection which handles the queued completion
when guest IF=0 does:

        struct ve_page *vp = vcpu->ve_page;

        vp->host_current = idx;
        inject_ve_complete(vcpu);

The guest completion does:

        struct ve_page *vp = this_cpu_ptr(&ve_page);
        struct ve_info *info;

        info = vp->info + vp->host_current;
        rcuwait_wake_up(&info->wait);

There are a bunch of life time issues to think about (not rocket
science), but you get the idea.

Thanks,

        tglx
Thomas Gleixner April 8, 2020, 11:14 p.m. UTC | #43
Thomas Gleixner <tglx@linutronix.de> writes:
> Vivek Goyal <vgoyal@redhat.com> writes:
>
> and the host completion injection which handles the queued completion
> when guest IF=0 does:

obviously IF=1 ...

>
>         struct ve_page *vp = vcpu->ve_page;
>
>         vp->host_current = idx;
>         inject_ve_complete(vcpu);
>
> The guest completion does:
>
>         struct ve_page *vp = this_cpu_ptr(&ve_page);
>         struct ve_info *info;
>
>         info = vp->info + vp->host_current;
>         rcuwait_wake_up(&info->wait);
>
> There are a bunch of life time issues to think about (not rocket
> science), but you get the idea.
>
> Thanks,
>
>         tglx
Andy Lutomirski April 9, 2020, 4:50 a.m. UTC | #44
On Wed, Apr 8, 2020 at 11:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Paolo Bonzini <pbonzini@redhat.com> writes:
> > On 08/04/20 17:34, Sean Christopherson wrote:
> >> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote:
> >>> Page-not-present async page faults are almost a perfect match for the
> >>> hardware use of #VE (and it might even be possible to let the processor
> >>> deliver the exceptions).
> >>
> >> My "async" page fault knowledge is limited, but if the desired behavior is
> >> to reflect a fault into the guest for select EPT Violations, then yes,
> >> enabling EPT Violation #VEs in hardware is doable.  The big gotcha is that
> >> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new
> >> MMU page, otherwise not-present faults on zero-initialized EPTEs will get
> >> reflected.
> >>
> >> Attached a patch that does the prep work in the MMU.  The VMX usage would be:
> >>
> >>      kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT);
> >>
> >> when EPT Violation #VEs are enabled.  It's 64-bit only as it uses stosq to
> >> initialize EPTEs.  32-bit could also be supported by doing memcpy() from
> >> a static page.
> >
> > The complication is that (at least according to the current ABI) we
> > would not want #VE to kick if the guest currently has IF=0 (and possibly
> > CPL=0).  But the ABI is not set in stone, and anyway the #VE protocol is
> > a decent one and worth using as a base for whatever PV protocol we design.
>
> Forget the current pf async semantics (or the lack of). You really want
> to start from scratch and igore the whole thing.
>
> The charm of #VE is that the hardware can inject it and it's not nesting
> until the guest cleared the second word in the VE information area. If
> that word is not 0 then you get a regular vmexit where you suspend the
> vcpu until the nested problem is solved.

Can you point me at where the SDM says this?

Anyway, I see two problems with #VE, one big and one small.  The small
(or maybe not-so-small) one is that any fancy protocol where the guest
returns from an exception by doing, logically:

Hey I'm done;  /* MOV somewhere, hypercall, MOV to CR4, whatever */
IRET;

is fundamentally racy.  After we say we're done and before IRET, we
can be recursively reentered.  Hi, NMI!

The big problem is that #VE doesn't exist on AMD, and I really think
that any fancy protocol we design should work on AMD.  I have no
problem with #VE being a nifty optimization to the protocol on Intel,
but it should *work* without #VE.

>
> So you really don't worry about the guest CPU state at all. The guest
> side #VE handler has to decide what it wants from the host depending on
> it's internal state:
>
>      - Suspend me and resume once the EPT fail is solved

I'm not entirely convinced this is better than the HLT loop.  It's
*prettier*, but the HLT loop avoids an extra hypercall and has the
potentially useful advantage that the guest can continue to process
interrupts even if it is unable to make progress otherwise.

Anyway, the whole thing can be made to work reasonably well without
#VE, #MC or any other additional special exception, like this:

First, when the guest accesses a page that is not immediately
available (paged out or failed), the host attempts to deliver the
"page not present -- try to do other stuff" event.  This event has an
associated per-vCPU data structure along these lines:

struct page_not_present_data {
  u32 inuse;  /* 1: the structure is in use.  0: free */
  u32 token;
  u64 gpa;
  u64 gva;  /* if known, and there should be a way to indicate that
it's not known. */
};

Only the host ever changes inuse from 0 to 1 and only the guest ever
changes inuse from 1 to 0.

The "page not present -- try to do other stuff" event has interrupt
semantics -- it is only delivered if the vCPU can currently receive an
interrupt.  This means IF = 1 and STI and MOV SS shadows are not
active.  Arguably TPR should be checked too.  It is also only
delivered if page_not_present_data.inuse == 0 and if there are tokens
available -- see below.  If the event can be delivered, then
page_not_present_data is filled out and the event is delivered.  If
the event is not delivered, then one of three things happens:

a) If the page is not currently known to be failed (e.g. paged out or the
host simply does not know yet until it does some IO), then the vCPU
goes to sleep until the host is ready.
b) If the page is known to be failed, and no fancy #VE / #MC is
available, then the guest is killed and the host logs an error.
c) If some fancy recovery mechanism is implemented, which is optional,
then the guest gets an appropriate fault.

If a page_not_present event is delivered, then the host promises to
eventually resolve it.  Resolving it looks like this:

struct page_not_present_resolution {
  u32 result;  /* 0: guest should try again.  1: page is failed */
  u32 token;
};

struct page_not_present_resolutions {
  struct page_not_present_resolution events[N];
  u32 head, tail;
};

Only N page-not-presents can be outstanding and unresolved at a time.
It is entirely legal for the host to write the resolution to the
resolution list before delivering page-not-present.

When a page-not-present is resolved, the host writes the outcome to
the page_not_present_resolutions ring.  If there is no space, this
means that either the host or guest messed up (the host will not
allocate more tokens than can fit in the ring) and the guest is
killed.  The host also sends the guest an interrupt.  This is a
totally normal interrupt.

If the guest gets a "page is failed" resolution, the page is failed.
If the guest accesses the failed page again, then the host will try to
send a page-not-present event again.  If there is no space in the
ring, then the rules above are followed.

This will allow the sensible cases of memory failure to be recovered
by the guest without the introduction of any super-atomic faults. :)


Obviously there is more than one way to rig up the descriptor ring.
My proposal above is just one way to do it, not necessarily the best
way.
Paolo Bonzini April 9, 2020, 9:03 a.m. UTC | #45
On 08/04/20 15:01, Thomas Gleixner wrote:
> 
> And it comes with restrictions:
> 
>     The Do Other Stuff event can only be delivered when guest IF=1.
> 
>     If guest IF=0 then the host has to suspend the guest until the
>     situation is resolved.
> 
>     The 'Situation resolved' event must also wait for a guest IF=1 slot.

Additionally:

- the do other stuff event must be delivered to the same CPU that is
causing the host-side page fault

- the do other stuff event provides a token that identifies the cause
and the situation resolved event provides a matching token

This stuff is why I think the do other stuff event looks very much like
a #VE.  But I think we're in violent agreement after all.

> If you just want to solve Viveks problem, then its good enough. I.e. the
> file truncation turns the EPT entries into #VE convertible entries and
> the guest #VE handler can figure it out. This one can be injected
> directly by the hardware, i.e. you don't need a VMEXIT.
> 
> If you want the opportunistic do other stuff mechanism, then #VE has
> exactly the same problems as the existing async "PF". It's not magicaly
> making that go away.

You can inject #VE from the hypervisor too, with PV magic to distinguish
the two.  However that's not necessarily a good idea because it makes it
harder to switch to hardware delivery in the future.

> One possible solution might be to make all recoverable EPT entries
> convertible and let the HW inject #VE for those.
> 
> So the #VE handler in the guest would have to do:
> 
>        if (!recoverable()) {
>        	if (user_mode)
>                 	send_signal();
>                 else if (!fixup_exception())
>                 	die_hard();
>                 goto done;  
>        }                 
> 
>        store_ve_info_in_pv_page();
> 
>        if (!user_mode(regs) || !preemptible()) {
>        	hypercall_resolve_ept(can_continue = false);
>        } else {
>               init_completion();
>        	hypercall_resolve_ept(can_continue = true);
>               wait_for_completion();
>        }
> 
> or something like that.

Yes, pretty much.  The VE info can also be passed down to the hypercall
as arguments.

Paolo

> The hypercall to resolve the EPT fail on the host acts on the
> can_continue argument.
> 
> If false, it suspends the guest vCPU and only returns when done.
> 
> If true it kicks the resolve process and returns to the guest which
> suspends the task and tries to do something else.
> 
> The wakeup side needs to be a regular interrupt and cannot go through
> #VE.
Paolo Bonzini April 9, 2020, 9:43 a.m. UTC | #46
On 09/04/20 06:50, Andy Lutomirski wrote:
> The big problem is that #VE doesn't exist on AMD, and I really think
> that any fancy protocol we design should work on AMD.  I have no
> problem with #VE being a nifty optimization to the protocol on Intel,
> but it should *work* without #VE.

Yes and unfortunately AMD does not like to inject a non-existing
exception.  Intel only requires the vector to be <=31, but AMD wants the
vector to correspond to an exception.

However, software injection is always possible and AMD even suggests
that you use software injection for ICEBP and, on older processors, the
INT instruction.

Paolo
Andrew Cooper April 9, 2020, 11:36 a.m. UTC | #47
On 09/04/2020 05:50, Andy Lutomirski wrote:
> On Wed, Apr 8, 2020 at 11:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>>> On 08/04/20 17:34, Sean Christopherson wrote:
>>>> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote:
>>>>> Page-not-present async page faults are almost a perfect match for the
>>>>> hardware use of #VE (and it might even be possible to let the processor
>>>>> deliver the exceptions).
>>>> My "async" page fault knowledge is limited, but if the desired behavior is
>>>> to reflect a fault into the guest for select EPT Violations, then yes,
>>>> enabling EPT Violation #VEs in hardware is doable.  The big gotcha is that
>>>> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new
>>>> MMU page, otherwise not-present faults on zero-initialized EPTEs will get
>>>> reflected.
>>>>
>>>> Attached a patch that does the prep work in the MMU.  The VMX usage would be:
>>>>
>>>>      kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT);
>>>>
>>>> when EPT Violation #VEs are enabled.  It's 64-bit only as it uses stosq to
>>>> initialize EPTEs.  32-bit could also be supported by doing memcpy() from
>>>> a static page.
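
(For context, a minimal sketch of the kind of prep work being described
here -- a settable SPTE init value that is applied when a new MMU page is
filled.  This is not the attached patch; the function bodies below are
assumptions.)

/* Keep a global SPTE "init value" so that not-present EPTEs can carry the
 * suppress-#VE bit instead of being all-zero. */
static u64 shadow_init_value;

void kvm_mmu_set_spte_init_value(u64 init_value)
{
	shadow_init_value = init_value;
}

static void mmu_init_shadow_page(u64 *sptes)
{
	/* 512 8-byte entries per 4K page; memset64() (<linux/string.h>)
	 * compiles to rep stosq on x86-64 */
	memset64(sptes, shadow_init_value, PAGE_SIZE / sizeof(u64));
}

With EPT Violation #VEs enabled, the VMX code would then call
kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT) as described above.
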
>>> The complication is that (at least according to the current ABI) we
>>> would not want #VE to kick in if the guest currently has IF=0 (and possibly
>>> CPL=0).  But the ABI is not set in stone, and anyway the #VE protocol is
>>> a decent one and worth using as a base for whatever PV protocol we design.
>> Forget the current async pf semantics (or the lack thereof). You really
>> want to start from scratch and ignore the whole thing.
>>
>> The charm of #VE is that the hardware can inject it and it does not nest
>> until the guest clears the second word in the VE information area. If
>> that word is not 0, then you get a regular vmexit where you suspend the
>> vcpu until the nested problem is solved.
> Can you point me at where the SDM says this?

Vol3 25.5.6.1 Convertible EPT Violations

> Anyway, I see two problems with #VE, one big and one small.  The small
> (or maybe small) one is that any fancy protocol where the guest
> returns from an exception by doing, logically:
>
> Hey I'm done;  /* MOV somewhere, hypercall, MOV to CR4, whatever */
> IRET;
>
> is fundamentally racy.  After we say we're done and before IRET, we
> can be recursively reentered.  Hi, NMI!

Correct.  There is no way to atomically end the #VE handler.  (This
causes "fun" even when using #VE for its intended purpose.)

~Andrew
Paolo Bonzini April 9, 2020, 12:47 p.m. UTC | #48
On 09/04/20 06:50, Andy Lutomirski wrote:
> The small
> (or maybe small) one is that any fancy protocol where the guest
> returns from an exception by doing, logically:
> 
> Hey I'm done;  /* MOV somewhere, hypercall, MOV to CR4, whatever */
> IRET;
> 
> is fundamentally racy.  After we say we're done and before IRET, we
> can be recursively reentered.  Hi, NMI!

That's possible in theory.  In practice there would be only two levels
of nesting, one for the original page being loaded and one for the tail
of the #VE handler.  The nested #VE would see IF=0, resolve the EPT
violation synchronously and both handlers would finish.  For the tail
page to be swapped out again, leading to more nesting, the host's LRU
must be seriously messed up.

With IST it would be much messier, and I haven't quite understood why
you believe the #VE handler should have an IST.

Anyhow, apart from the above "small" issue, we have these separate parts:

1) deliver page-ready notifications via interrupt

2) page-in hypercall + deliver page-not-found notifications via #VE

3) propagation of host-side SIGBUS

all of which have both a host and a guest part, and all of which make
(more or less) sense independent of the other.

Paolo
Andrew Cooper April 9, 2020, 2:13 p.m. UTC | #49
On 09/04/2020 13:47, Paolo Bonzini wrote:
> On 09/04/20 06:50, Andy Lutomirski wrote:
>> The small
>> (or maybe small) one is that any fancy protocol where the guest
>> returns from an exception by doing, logically:
>>
>> Hey I'm done;  /* MOV somewhere, hypercall, MOV to CR4, whatever */
>> IRET;
>>
>> is fundamentally racy.  After we say we're done and before IRET, we
>> can be recursively reentered.  Hi, NMI!
> That's possible in theory.  In practice there would be only two levels
> of nesting, one for the original page being loaded and one for the tail
> of the #VE handler.  The nested #VE would see IF=0, resolve the EPT
> violation synchronously and both handlers would finish.  For the tail
> page to be swapped out again, leading to more nesting, the host's LRU
> must be seriously messed up.
>
> With IST it would be much messier, and I haven't quite understood why
> you believe the #VE handler should have an IST.

Any interrupt/exception which can possibly occur between a SYSCALL and
re-establishing a kernel stack (several instructions), must be IST to
avoid taking said exception on a user stack and being a trivial
privilege escalation.

In terms of using #VE in its architecturally-expected way, this can
occur in general before the kernel stack is established, so must be IST
for safety.

Therefore, it doesn't really matter if KVM's paravirt use of #VE does
respect the interrupt flag.  It is not sensible to build a paravirt
interface using #VE whose safety depends on never turning on
hardware-induced #VEs.

~Andrew
Paolo Bonzini April 9, 2020, 2:32 p.m. UTC | #50
On 09/04/20 16:13, Andrew Cooper wrote:
> On 09/04/2020 13:47, Paolo Bonzini wrote:
>> On 09/04/20 06:50, Andy Lutomirski wrote:
>>> The small
>>> (or maybe small) one is that any fancy protocol where the guest
>>> returns from an exception by doing, logically:
>>>
>>> Hey I'm done;  /* MOV somewhere, hypercall, MOV to CR4, whatever */
>>> IRET;
>>>
>>> is fundamentally racy.  After we say we're done and before IRET, we
>>> can be recursively reentered.  Hi, NMI!
>> That's possible in theory.  In practice there would be only two levels
>> of nesting, one for the original page being loaded and one for the tail
>> of the #VE handler.  The nested #VE would see IF=0, resolve the EPT
>> violation synchronously and both handlers would finish.  For the tail
>> page to be swapped out again, leading to more nesting, the host's LRU
>> must be seriously messed up.
>>
>> With IST it would be much messier, and I haven't quite understood why
>> you believe the #VE handler should have an IST.
> 
> Any interrupt/exception which can possibly occur between a SYSCALL and
> re-establishing a kernel stack (several instructions), must be IST to
> avoid taking said exception on a user stack and being a trivial
> privilege escalation.

Doh, of course.  I always confuse SYSCALL and SYSENTER.

> Therefore, it doesn't really matter if KVM's paravirt use of #VE does
> respect the interrupt flag.  It is not sensible to build a paravirt
> interface using #VE whose safety depends on never turning on
> hardware-induced #VEs.

No, I think we wouldn't use a paravirt #VE at this point, we would use
the real thing if available.

It would still be possible to switch from the IST to the main kernel
stack before writing 0 to the reentrancy word.

Paolo
Andy Lutomirski April 9, 2020, 3:03 p.m. UTC | #51
> On Apr 9, 2020, at 7:32 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 09/04/20 16:13, Andrew Cooper wrote:
>>> On 09/04/2020 13:47, Paolo Bonzini wrote:
>>> On 09/04/20 06:50, Andy Lutomirski wrote:
>>>> The small
>>>> (or maybe small) one is that any fancy protocol where the guest
>>>> returns from an exception by doing, logically:
>>>> 
>>>> Hey I'm done;  /* MOV somewhere, hypercall, MOV to CR4, whatever */
>>>> IRET;
>>>> 
>>>> is fundamentally racy.  After we say we're done and before IRET, we
>>>> can be recursively reentered.  Hi, NMI!
>>> That's possible in theory.  In practice there would be only two levels
>>> of nesting, one for the original page being loaded and one for the tail
>>> of the #VE handler.  The nested #VE would see IF=0, resolve the EPT
>>> violation synchronously and both handlers would finish.  For the tail
>>> page to be swapped out again, leading to more nesting, the host's LRU
>>> must be seriously messed up.
>>> 
>>> With IST it would be much messier, and I haven't quite understood why
>>> you believe the #VE handler should have an IST.
>> 
>> Any interrupt/exception which can possibly occur between a SYSCALL and
>> re-establishing a kernel stack (several instructions), must be IST to
>> avoid taking said exception on a user stack and being a trivial
>> privilege escalation.
> 
> Doh, of course.  I always confuse SYSCALL and SYSENTER.
> 
>> Therefore, it doesn't really matter if KVM's paravirt use of #VE does
>> respect the interrupt flag.  It is not sensible to build a paravirt
>> interface using #VE whose safety depends on never turning on
>> hardware-induced #VEs.
> 
> No, I think we wouldn't use a paravirt #VE at this point, we would use
> the real thing if available.
> 
> It would still be possible to switch from the IST to the main kernel
> stack before writing 0 to the reentrancy word.
> 
> 

Almost but not quite. We do this for NMI-from-usermode, and it’s ugly. But we can’t do this for NMI-from-kernel or #VE-from-kernel because there might not be a kernel stack.  Trying to hack around this won’t be pretty.

Frankly, I think that we shouldn’t even try to report memory failure to the guest if it happens with interrupts off. Just kill the guest cleanly and keep it simple. Or inject an intentionally unrecoverable IST exception.
Paolo Bonzini April 9, 2020, 3:17 p.m. UTC | #52
On 09/04/20 17:03, Andy Lutomirski wrote:
>> No, I think we wouldn't use a paravirt #VE at this point, we would
>> use the real thing if available.
>> 
>> It would still be possible to switch from the IST to the main
>> kernel stack before writing 0 to the reentrancy word.
> 
> Almost but not quite. We do this for NMI-from-usermode, and it’s
> ugly. But we can’t do this for NMI-from-kernel or #VE-from-kernel
> because there might not be a kernel stack.  Trying to hack around
> this won’t be pretty.
> 
> Frankly, I think that we shouldn’t even try to report memory failure
> to the guest if it happens with interrupts off. Just kill the guest
> cleanly and keep it simple. Or inject an intentionally unrecoverable
> IST exception.

But it would be nice to use #VE for all host-side page faults, not just
for memory failure.

So the solution would be the same as for NMIs, duplicating the stack
frame and patching the outer handler's stack from the recursive #VE
(https://lwn.net/Articles/484932/).  It's ugly but it's a known ugliness.

Paolo
Andy Lutomirski April 9, 2020, 5:32 p.m. UTC | #53
Hi, Tony.  I'm adding you because, despite the fact that everyone in
this thread is allergic to #MC, this seems to be one of your favorite
topics :)

> On Apr 9, 2020, at 8:17 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 09/04/20 17:03, Andy Lutomirski wrote:
>>> No, I think we wouldn't use a paravirt #VE at this point, we would
>>> use the real thing if available.
>>>
>>> It would still be possible to switch from the IST to the main
>>> kernel stack before writing 0 to the reentrancy word.
>>
>> Almost but not quite. We do this for NMI-from-usermode, and it’s
>> ugly. But we can’t do this for NMI-from-kernel or #VE-from-kernel
>> because there might not be a kernel stack.  Trying to hack around
>> this won’t be pretty.
>>
>> Frankly, I think that we shouldn’t even try to report memory failure
>> to the guest if it happens with interrupts off. Just kill the guest
>> cleanly and keep it simple. Or inject an intentionally unrecoverable
>> IST exception.
>
> But it would be nice to use #VE for all host-side page faults, not just
> for memory failure.
>
> So the solution would be the same as for NMIs, duplicating the stack
> frame and patching the outer handler's stack from the recursive #VE
> (https://lwn.net/Articles/484932/).  It's ugly but it's a known ugliness.
>
>

Believe me, I know all about how ugly it is, since I’m the one who
fixed most of the bugs in the first few implementations.  And, before
I wrote or ack any such thing for #VE, I want to know why.  What,
exactly, is a sufficiently strong motivation for using #VE *at all*
that Linux should implement a #VE handler?

As I see it, #VE has several downsides:

1. Intel only.

2. Depending on precisely what it's used for, literally any memory
access in the kernel can trigger it as a fault.  This means that it
joins NMI and #MC (and, to a limited extent, #DB) in the horrible
super-atomic-happens-in-bad-contexts camp.  IST is mandatory, and IST
is not so great.

3. Just like NMI and MCE, it comes with a fundamentally broken
don't-recurse-me mechanism.

If we do support #VE, I would suggest we do it roughly like this.  The
#VE handler is a regular IST entry -- there's a single IST stack, and
#VE from userspace stack-switches to the regular kernel stack.  The C
handler (do_virtualization_exception?) is permitted to panic if
something is horribly wrong, but is *not* permitted to clear the word
at byte 4 to re-enable #VE.  Instead, it does something to trigger a
deferred re-enable.  For example, it sends IPI to self and the IPI
handler clears the magic re-enable flag.
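
(A sketch of what such a deferred re-enable could look like, using
irq_work -- which on x86 is raised via a self-IPI -- to clear the busy
word from a sane context.  The handler name comes from the paragraph
above; ve_handle_fault() and the boot-time setup of the per-CPU data are
assumptions.)

#include <linux/irq_work.h>
#include <linux/percpu.h>
#include <asm/traps.h>

struct ve_info_page {
	u32 exit_reason;
	u32 busy;	/* word at byte 4: hardware delivers #VE only if 0 */
	/* ... remainder of the SDM-defined #VE information area ... */
};

bool ve_handle_fault(struct pt_regs *regs, struct ve_info_page *ve); /* hypothetical */

/* Each CPU's ve_info pointer and ve_reenable_work are assumed to be set up
 * at boot, e.g. init_irq_work(this_cpu_ptr(&ve_reenable_work), ve_reenable). */
static DEFINE_PER_CPU(struct ve_info_page *, ve_info);
static DEFINE_PER_CPU(struct irq_work, ve_reenable_work);

static void ve_reenable(struct irq_work *work)
{
	/* Ordinary interrupt context: safe to accept the next #VE now. */
	WRITE_ONCE(__this_cpu_read(ve_info)->busy, 0);
}

dotraplinkage void do_virtualization_exception(struct pt_regs *regs,
					       long error_code)
{
	struct ve_info_page *ve = __this_cpu_read(ve_info);

	if (!ve_handle_fault(regs, ve))
		panic("unhandled #VE");

	/*
	 * Do NOT clear ve->busy here: that would re-open the recursion
	 * window before IRET.  Queue the re-enable instead; ve_reenable()
	 * runs once we are back in a normal interrupt context.
	 */
	irq_work_queue(this_cpu_ptr(&ve_reenable_work));
}

Until ve_reenable() runs, further convertible EPT violations simply take
the VM-exit path instead of nesting, which is the fallback described
earlier in the thread.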

There are definitely horrible corner cases here.  For example, suppose
user code mmaps() some kind of failable memory (due to NVDIMM hardware
failures, truncation, whatever).  Then user code points RBP at it and
we get a perf NMI.  Now the perf code tries to copy_from_user_nmi()
the user stack and hits the failure.  It gets #MC or #VE or some
paravirt thing.  Now we're in a situation where we got an IST
exception in the middle of NMI processing and we're expected to do
something intelligent about it.  Sure, we can invoke the extable
handler, but it's not really clear how to recover if we somehow hit
the same type of failure again in the same NMI.

A model that could actually work, perhaps for #VE and #MC, is to have
the extable code do the re-enabling.  So ex_handler_uaccess() and
ex_handler_mcsafe() will call something like rearm_memory_failure(),
and that will do whatever is needed to re-arm the specific memory
failure reporting mechanism in use.
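
(Concretely, for the uaccess case that might look roughly like the
following; the function signature mirrors arch/x86/mm/extable.c of that
era, and rearm_memory_failure() is the hypothetical hook named above.)

void rearm_memory_failure(void);	/* hypothetical */

__visible bool ex_handler_uaccess(const struct exception_table_entry *fixup,
				  struct pt_regs *regs, int trapnr,
				  unsigned long error_code,
				  unsigned long fault_addr)
{
	WARN_ONCE(trapnr == X86_TRAP_GP,
		  "General protection fault in user access. Non-canonical address?");

	/*
	 * The failed access has been consumed and fixed up; let the
	 * reporting mechanism (#VE, #MC or a PV variant) re-arm so the
	 * next failure is delivered too.
	 */
	rearm_memory_failure();

	regs->ip = ex_fixup_addr(fixup);
	return true;
}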

But, before I touch that with a ten-foot pole, I want to know *why*.
What's the benefit?  Why do we want convertible EPT violations?
Vivek Goyal May 21, 2020, 3:55 p.m. UTC | #54
On Wed, Apr 08, 2020 at 12:07:22AM +0200, Paolo Bonzini wrote:
> On 07/04/20 23:41, Andy Lutomirski wrote:
> > 2. Access to bad memory results in #MC.  Sure, #MC is a turd, but
> > it’s an *architectural* turd. By all means, have a nice simple PV
> > mechanism to tell the #MC code exactly what went wrong, but keep the
> > overall flow the same as in the native case.
> > 
> > I think I like #2 much better. It has another nice effect: a good
> > implementation will serve as a way to exercise the #MC code without
> > needing to muck with EINJ or with whatever magic Tony uses. The
> > average kernel developer does not have access to a box with testable
> > memory failure reporting.
> 
> I prefer #VE, but I can see how #MC has some appeal. 

I have spent some time looking at #MC and trying to figure out if we
can use it. I have encountered a couple of issues.

- Uncorrected Action Required machine checks are generated when poison
  is consumed.  So typically all kernel code and exception handling
  assumes an MCE can be encountered synchronously only on a load, not on
  a store; stores don't generate MCEs (at least not AR ones, IIUC).  If
  we were to use #MC, we would need to generate it on stores as well,
  which requires changing the kernel's assumption that stores can't
  generate #MC (changing all copy_to_user()/copy_from_user() and
  friends).

- A machine check is generated for poisoned memory, and in this case it
  is not exactly poisoning; it feels more as if the memory has gone
  missing.  The failure might also be temporary: if the file is truncated
  and then extended again, the next load/store to the same memory
  location will work just fine.  My understanding is that sending #MC
  will mark that page poisoned and it will effectively become a permanent
  failure.

I am less concerned about point 2, but not sure how to get past the
first issue.

Thanks
Vivek
diff mbox series

Patch

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 93ab0cbd304e..e6f2aefa298b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -318,11 +318,26 @@  static void kvm_guest_cpu_init(void)
 
 		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
 
-#ifdef CONFIG_PREEMPTION
-		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
-#endif
 		pa |= KVM_ASYNC_PF_ENABLED;
 
+		/*
+		 * We do not set KVM_ASYNC_PF_SEND_ALWAYS.  With the current
+		 * KVM paravirt ABI, the following scenario is possible:
+		 *
+		 * #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT)
+		 *  NMI before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT
+		 *   NMI accesses user memory, e.g. due to perf
+		 *    #PF: normal page fault
+		 *     #PF reads CR2 and apf_reason -- apf_reason should be 0
+		 *
+		 *  outer #PF reads CR2 and apf_reason -- apf_reason should be
+		 *  KVM_PV_REASON_PAGE_NOT_PRESENT
+		 *
+		 * There is no possible way that both reads of CR2 and
+		 * apf_reason get the correct values.  Fixing this would
+		 * require paravirt ABI changes.
+		 */
+
 		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))
 			pa |= KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;