Message ID: ed71d0967113a35f670a9625a058b8e6e0b2f104.1583547991.git.luto@kernel.org
State: New, archived
Series: [v2] x86/kvm: Disable KVM_ASYNC_PF_SEND_ALWAYS
On Fri, Mar 6, 2020 at 6:26 PM Andy Lutomirski <luto@kernel.org> wrote: > > The ABI is broken and we cannot support it properly. Turn it off. > > If this causes a meaningful performance regression for someone, KVM > can introduce an improved ABI that is supportable. > > Cc: stable@vger.kernel.org > Signed-off-by: Andy Lutomirski <luto@kernel.org> > --- > arch/x86/kernel/kvm.c | 21 ++++++++++++++++++--- > 1 file changed, 18 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > index 93ab0cbd304e..e6f2aefa298b 100644 > --- a/arch/x86/kernel/kvm.c > +++ b/arch/x86/kernel/kvm.c > @@ -318,11 +318,26 @@ static void kvm_guest_cpu_init(void) > > pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason)); > > -#ifdef CONFIG_PREEMPTION > - pa |= KVM_ASYNC_PF_SEND_ALWAYS; > -#endif > pa |= KVM_ASYNC_PF_ENABLED; > > + /* > + * We do not set KVM_ASYNC_PF_SEND_ALWAYS. With the current > + * KVM paravirt ABI, the following scenario is possible: > + * > + * #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT) > + * NMI before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT > + * NMI accesses user memory, e.g. due to perf > + * #PF: normal page fault > + * #PF reads CR2 and apf_reason -- apf_reason should be 0 > + * > + * outer #PF reads CR2 and apf_reason -- apf_reason should be > + * KVM_PV_REASON_PAGE_NOT_PRESENT > + * > + * There is no possible way that both reads of CR2 and > + * apf_reason get the correct values. Fixing this would > + * require paravirt ABI changes. > + */ > + Upon re-reading my own comment, I think the problem is real, but I don't think my patch fixes it. The outer #PF could just as easily have come from user mode. We may actually need the NMI code (and perhaps MCE and maybe #DB too) to save, clear, and restore apf_reason. If we do this, then maybe CPL0 async PFs are actually okay, but the semantics are so poorly defined that I'm not very confident about that.
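For context, the guest-side #PF hijack that the comment's scenario races against looked roughly like this at the time (paraphrased and lightly simplified from arch/x86/kernel/kvm.c of that era; treat it as a sketch rather than the exact source):

```c
/*
 * Paraphrased sketch of the ~v5.6 guest code that async PF piggybacks
 * on: the #PF vector is hijacked and the reason is read-and-cleared
 * from a shared per-CPU slot -- exactly the state that a nested #PF
 * taken from an NMI can consume behind the outer handler's back.
 */
u32 kvm_read_and_reset_pf_reason(void)
{
	u32 reason = 0;

	if (__this_cpu_read(apf_reason.enabled)) {
		reason = __this_cpu_read(apf_reason.reason);
		__this_cpu_write(apf_reason.reason, 0);
	}
	return reason;
}

dotraplinkage void
do_async_page_fault(struct pt_regs *regs, unsigned long error_code,
		    unsigned long address)
{
	switch (kvm_read_and_reset_pf_reason()) {
	default:
		/* Not an async PF: handle it as a normal page fault. */
		do_page_fault(regs, error_code, address);
		break;
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		/* Host is still faulting the page in; park this task. */
		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
		break;
	case KVM_PV_REASON_PAGE_READY:
		/* Host finished; wake whoever waits on this token. */
		kvm_async_pf_task_wake((u32)address);
		break;
	}
}
```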
Andy Lutomirski <luto@kernel.org> writes: > On Fri, Mar 6, 2020 at 6:26 PM Andy Lutomirski <luto@kernel.org> wrote: >> + /* >> + * We do not set KVM_ASYNC_PF_SEND_ALWAYS. With the current >> + * KVM paravirt ABI, the following scenario is possible: >> + * >> + * #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT) >> + * NMI before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT >> + * NMI accesses user memory, e.g. due to perf >> + * #PF: normal page fault >> + * #PF reads CR2 and apf_reason -- apf_reason should be 0 >> + * >> + * outer #PF reads CR2 and apf_reason -- apf_reason should be >> + * KVM_PV_REASON_PAGE_NOT_PRESENT >> + * >> + * There is no possible way that both reads of CR2 and >> + * apf_reason get the correct values. Fixing this would >> + * require paravirt ABI changes. >> + */ >> + > > Upon re-reading my own comment, I think the problem is real, but I > don't think my patch fixes it. The outer #PF could just as easily > have come from user mode. We may actually need the NMI code (and > perhaps MCE and maybe #DB too) to save, clear, and restore apf_reason. > If we do this, then maybe CPL0 async PFs are actually okay, but the > semantics are so poorly defined that I'm not very confident about > that. I think even with the current mode this is fixable on the host side when it keeps track of the state. The host knows exactly when it injects a async PF and it can store CR2 and reason of that async PF in flight. On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0 then it knows that the guest has read CR2 and apf_reason. All good nothing to worry about. If not it needs to be careful. As long as the apf_reason of the last async #PF is not cleared by the guest no new async #PF can be injected. That's already correct because in that case IF==0 which prevents a nested async #PF. If MCE, NMI trigger a real pagefault then the #PF injection needs to clear apf_reason and set the correct CR2. When that #PF returns then the old CR2 and apf_reason need to be restored. I tried to figure out whether any of this logic exists in the KVM code, but I got completely lost in that code. Maybe I try later today again. Thanks, tglx
On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Andy Lutomirski <luto@kernel.org> writes: > > On Fri, Mar 6, 2020 at 6:26 PM Andy Lutomirski <luto@kernel.org> wrote: > >> + /* > >> + * We do not set KVM_ASYNC_PF_SEND_ALWAYS. With the current > >> + * KVM paravirt ABI, the following scenario is possible: > >> + * > >> + * #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT) > >> + * NMI before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT > >> + * NMI accesses user memory, e.g. due to perf > >> + * #PF: normal page fault > >> + * #PF reads CR2 and apf_reason -- apf_reason should be 0 > >> + * > >> + * outer #PF reads CR2 and apf_reason -- apf_reason should be > >> + * KVM_PV_REASON_PAGE_NOT_PRESENT > >> + * > >> + * There is no possible way that both reads of CR2 and > >> + * apf_reason get the correct values. Fixing this would > >> + * require paravirt ABI changes. > >> + */ > >> + > > > > Upon re-reading my own comment, I think the problem is real, but I > > don't think my patch fixes it. The outer #PF could just as easily > > have come from user mode. We may actually need the NMI code (and > > perhaps MCE and maybe #DB too) to save, clear, and restore apf_reason. > > If we do this, then maybe CPL0 async PFs are actually okay, but the > > semantics are so poorly defined that I'm not very confident about > > that. > > I think even with the current mode this is fixable on the host side when > it keeps track of the state. > > The host knows exactly when it injects a async PF and it can store CR2 > and reason of that async PF in flight. > > On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0 > then it knows that the guest has read CR2 and apf_reason. All good > nothing to worry about. > > If not it needs to be careful. > > As long as the apf_reason of the last async #PF is not cleared by the > guest no new async #PF can be injected. That's already correct because > in that case IF==0 which prevents a nested async #PF. > > If MCE, NMI trigger a real pagefault then the #PF injection needs to > clear apf_reason and set the correct CR2. When that #PF returns then the > old CR2 and apf_reason need to be restored. How is the host supposed to know when the #PF returns? Intercepting IRET sounds like a bad idea and, in any case, is not actually a reliable indication that #PF returned.
Andy Lutomirski <luto@kernel.org> writes: > On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> The host knows exactly when it injects a async PF and it can store CR2 >> and reason of that async PF in flight. >> >> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0 >> then it knows that the guest has read CR2 and apf_reason. All good >> nothing to worry about. >> >> If not it needs to be careful. >> >> As long as the apf_reason of the last async #PF is not cleared by the >> guest no new async #PF can be injected. That's already correct because >> in that case IF==0 which prevents a nested async #PF. >> >> If MCE, NMI trigger a real pagefault then the #PF injection needs to >> clear apf_reason and set the correct CR2. When that #PF returns then the >> old CR2 and apf_reason need to be restored. > > How is the host supposed to know when the #PF returns? Intercepting > IRET sounds like a bad idea and, in any case, is not actually a > reliable indication that #PF returned. The host does not care about the IRET. It solely has to check whether apf_reason is 0 or not. That way it knows that the guest has read CR2 and apf_reason. Thanks, tglx
On Sat, Mar 7, 2020 at 11:01 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Andy Lutomirski <luto@kernel.org> writes: > > On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: > >> The host knows exactly when it injects a async PF and it can store CR2 > >> and reason of that async PF in flight. > >> > >> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0 > >> then it knows that the guest has read CR2 and apf_reason. All good > >> nothing to worry about. > >> > >> If not it needs to be careful. > >> > >> As long as the apf_reason of the last async #PF is not cleared by the > >> guest no new async #PF can be injected. That's already correct because > >> in that case IF==0 which prevents a nested async #PF. > >> > >> If MCE, NMI trigger a real pagefault then the #PF injection needs to > >> clear apf_reason and set the correct CR2. When that #PF returns then the > >> old CR2 and apf_reason need to be restored. > > > > How is the host supposed to know when the #PF returns? Intercepting > > IRET sounds like a bad idea and, in any case, is not actually a > > reliable indication that #PF returned. > > The host does not care about the IRET. It solely has to check whether > apf_reason is 0 or not. That way it knows that the guest has read CR2 > and apf_reason. /me needs actual details Suppose the host delivers an async #PF. apf_reason != 0 and CR2 contains something meaningful. Host resumes the guest. The guest does whatever (gets NMI, and does perf stuff, for example). The guest gets a normal #PF. Somehow the host needs to do: if (apf_reason != 0) { prev_apf_reason = apf_reason; prev_cr2 = cr2; apf_reason = 0; cr2 = actual fault address; } resume guest; Obviously this can only happen if the host intercepts #PF. Let's pretend for now that this is even possible on SEV-ES (it may well be, but I would also believe that it's not. SEV-ES intercepts are weird and I don't have the whole manual in my head. I'm not sure the host has any way to read CR2 for a SEV-ES guest.) So now the guest runs some more and finishes handling the inner #PF. Some time between doing that and running the outer #PF code that reads apf_reason, the host needs to do: apf_reason = prev_apf_reason; cr2 = prev_cr2; prev_apf_reason = 0; How is the host supposed to know when to do that? --Andy
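Pulling Andy's inline pseudocode together, the host-side bookkeeping under discussion would look something like the sketch below. This is purely illustrative: `struct apf_shadow`, the field names, and the hook points are hypothetical and do not exist in KVM.

```c
/* Hypothetical per-vCPU shadow state; restates the mails above. */
struct apf_shadow {
	u32 prev_apf_reason;	/* reason of the async PF still "in flight" */
	u64 prev_cr2;		/* CR2 value that went with it */
};

/* Injecting a real #PF while the guest still owes us a read: */
static void inject_real_pf(struct apf_shadow *s, u32 *apf_reason_slot,
			   u64 *guest_cr2, u64 fault_addr)
{
	if (*apf_reason_slot != 0) {
		/* Guest has not consumed the async PF yet: stash it. */
		s->prev_apf_reason = *apf_reason_slot;
		s->prev_cr2 = *guest_cr2;
		*apf_reason_slot = 0;
	}
	*guest_cr2 = fault_addr;
	/* ...queue the #PF exception for the next VM entry... */
}

/*
 * And somewhere -- this is exactly the open question in the thread --
 * the host must notice that the nested #PF has been handled and put
 * the stashed values back:
 */
static void restore_async_pf(struct apf_shadow *s, u32 *apf_reason_slot,
			     u64 *guest_cr2)
{
	if (s->prev_apf_reason) {
		*apf_reason_slot = s->prev_apf_reason;
		*guest_cr2 = s->prev_cr2;
		s->prev_apf_reason = 0;
	}
}
```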
Thomas Gleixner <tglx@linutronix.de> writes: > Andy Lutomirski <luto@kernel.org> writes: >> On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: >>> The host knows exactly when it injects a async PF and it can store CR2 >>> and reason of that async PF in flight. >>> >>> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0 >>> then it knows that the guest has read CR2 and apf_reason. All good >>> nothing to worry about. >>> >>> If not it needs to be careful. >>> >>> As long as the apf_reason of the last async #PF is not cleared by the >>> guest no new async #PF can be injected. That's already correct because >>> in that case IF==0 which prevents a nested async #PF. >>> >>> If MCE, NMI trigger a real pagefault then the #PF injection needs to >>> clear apf_reason and set the correct CR2. When that #PF returns then the >>> old CR2 and apf_reason need to be restored. >> >> How is the host supposed to know when the #PF returns? Intercepting >> IRET sounds like a bad idea and, in any case, is not actually a >> reliable indication that #PF returned. > > The host does not care about the IRET. It solely has to check whether > apf_reason is 0 or not. That way it knows that the guest has read CR2 > and apf_reason. Bah. I'm a moron. Of course it needs to trap the IRET of the #PF in order to restore CR2 and apf_reason. Alternatively it could trap the CR2 read of #PF, but yes that's all nasty. Thanks, tglx
Thomas Gleixner <tglx@linutronix.de> writes: > Thomas Gleixner <tglx@linutronix.de> writes: >> Andy Lutomirski <luto@kernel.org> writes: >>> On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: >>>> The host knows exactly when it injects a async PF and it can store CR2 >>>> and reason of that async PF in flight. >>>> >>>> On the next VMEXIT it checks whether apf_reason is 0. If apf_reason is 0 >>>> then it knows that the guest has read CR2 and apf_reason. All good >>>> nothing to worry about. >>>> >>>> If not it needs to be careful. >>>> >>>> As long as the apf_reason of the last async #PF is not cleared by the >>>> guest no new async #PF can be injected. That's already correct because >>>> in that case IF==0 which prevents a nested async #PF. >>>> >>>> If MCE, NMI trigger a real pagefault then the #PF injection needs to >>>> clear apf_reason and set the correct CR2. When that #PF returns then the >>>> old CR2 and apf_reason need to be restored. >>> >>> How is the host supposed to know when the #PF returns? Intercepting >>> IRET sounds like a bad idea and, in any case, is not actually a >>> reliable indication that #PF returned. >> >> The host does not care about the IRET. It solely has to check whether >> apf_reason is 0 or not. That way it knows that the guest has read CR2 >> and apf_reason. > > Bah. I'm a moron. Of course it needs to trap the IRET of the #PF in > order to restore CR2 and apf_reason. Alternatively it could trap the CR2 > read of #PF, but yes that's all nasty. Some hours of sleep and not staring at this mess later and while reading the leaves of my morning tea: guest side: nmi()/mce() ... stash_crs(); + stash_and_clear_apf_reason(); .... + restore_apf_reason(); restore_cr2(); Too obvious, isn't it? Thanks, tglx
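In C, the guest-side dance Thomas sketches could look roughly like this. It is a minimal sketch assuming the ~v5.6 per-CPU apf_reason layout; the helper and function names are made up for illustration.

```c
/*
 * Sketch only: save, clear and later restore the per-CPU async PF
 * reason around NMI/MCE handling, mirroring the existing CR2
 * save/restore. Helper names are hypothetical.
 */
static u32 stash_and_clear_apf_reason(void)
{
	u32 reason = 0;

	if (__this_cpu_read(apf_reason.enabled)) {
		reason = __this_cpu_read(apf_reason.reason);
		__this_cpu_write(apf_reason.reason, 0);
	}
	return reason;
}

static void restore_apf_reason(u32 reason)
{
	if (reason)
		__this_cpu_write(apf_reason.reason, reason);
}

/* Usage, in the NMI path (likewise for MCE); name is hypothetical: */
void do_nmi_sketch(struct pt_regs *regs)
{
	unsigned long cr2 = read_cr2();		/* existing CR2 stashing */
	u32 apf = stash_and_clear_apf_reason();

	/* ... actual NMI work, which may take a nested #PF ... */

	restore_apf_reason(apf);
	write_cr2(cr2);
}
```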
On 09/03/20 07:57, Thomas Gleixner wrote: > Thomas Gleixner <tglx@linutronix.de> writes: >> Thomas Gleixner <tglx@linutronix.de> writes: >>> Andy Lutomirski <luto@kernel.org> writes: >>>> On Sat, Mar 7, 2020 at 7:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: >>>>> If MCE, NMI trigger a real pagefault then the #PF injection needs to >>>>> clear apf_reason and set the correct CR2. When that #PF returns then the >>>>> old CR2 and apf_reason need to be restored. >>>> >>> The host does not care about the IRET. It solely has to check whether >>> apf_reason is 0 or not. That way it knows that the guest has read CR2 >>> and apf_reason. > > Some hours or sleep and not staring at this meess later and while > reading the leaves of my morning tea: > > guest side: > > nmi()/mce() ... > > stash_crs(); > > + stash_and_clear_apf_reason(); > > .... > > + restore_apf_reason(); > > restore_cr2(); > > Too obvious, isn't it? Yes, this works but Andy was not happy about adding more save-and-restore to NMIs. If you do not want to do that, I'm okay with disabling async page fault support for now. Storing the page fault reason in memory was not a good idea. Better options would be to co-opt the page fault error code (e.g. store the reason in bits 31:16, mark bits 15:0 with the invalid error code RSVD=1/P=0), or to use the virtualization exception area. Paolo
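As an illustration of Paolo's first alternative (nothing like this exists; the bit layout is just the one proposed in the mail), a guest #PF handler could recover the reason from the error code alone, without any shared memory slot to race on:

```c
/*
 * Hypothetical decoding of the proposal: async PF reason carried in
 * error_code bits 31:16, with bits 15:0 set to an architecturally
 * invalid combination (RSVD=1, P=0) that real hardware never produces.
 */
#define APF_INVALID_MARKER	X86_PF_RSVD	/* RSVD=1, P=0 */
#define APF_REASON_SHIFT	16

static bool is_async_pf(unsigned long error_code, u32 *reason)
{
	if ((error_code & 0xffff) != APF_INVALID_MARKER)
		return false;			/* ordinary #PF */

	*reason = error_code >> APF_REASON_SHIFT;
	return true;
}
```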
Paolo Bonzini <pbonzini@redhat.com> writes: > On 09/03/20 07:57, Thomas Gleixner wrote: >> Thomas Gleixner <tglx@linutronix.de> writes: >> >> guest side: >> >> nmi()/mce() ... >> >> stash_crs(); >> >> + stash_and_clear_apf_reason(); >> >> .... >> >> + restore_apf_reason(); >> >> restore_cr2(); >> >> Too obvious, isn't it? > > Yes, this works but Andy was not happy about adding more > save-and-restore to NMIs. If you do not want to do that, I'm okay with > disabling async page fault support for now. I'm fine with doing that save/restore dance, but I have no strong opinion either. > Storing the page fault reason in memory was not a good idea. Better > options would be to co-opt the page fault error code (e.g. store the > reason in bits 31:16, mark bits 15:0 with the invalid error code > RSVD=1/P=0), or to use the virtualization exception area. Memory store is not the problem. The real problem is hijacking #PF. If you'd have just used a separate VECTOR_ASYNC_PF then none of these problems would exist at all. Thanks, tglx
On Mon, Mar 9, 2020 at 2:09 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Paolo Bonzini <pbonzini@redhat.com> writes: > > On 09/03/20 07:57, Thomas Gleixner wrote: > >> Thomas Gleixner <tglx@linutronix.de> writes: > >> > >> guest side: > >> > >> nmi()/mce() ... > >> > >> stash_crs(); > >> > >> + stash_and_clear_apf_reason(); > >> > >> .... > >> > >> + restore_apf_reason(); > >> > >> restore_cr2(); > >> > >> Too obvious, isn't it? > > > > Yes, this works but Andy was not happy about adding more > > save-and-restore to NMIs. If you do not want to do that, I'm okay with > > disabling async page fault support for now. > > I'm fine with doing that save/restore dance, but I have no strong > opinion either. > > > Storing the page fault reason in memory was not a good idea. Better > > options would be to co-opt the page fault error code (e.g. store the > > reason in bits 31:16, mark bits 15:0 with the invalid error code > > RSVD=1/P=0), or to use the virtualization exception area. > > Memory store is not the problem. The real problem is hijacking #PF. > > If you'd have just used a separate VECTOR_ASYNC_PF then none of these > problems would exist at all. > I'm okay with the save/restore dance, I guess. It's just yet more entry crud to deal with architecture nastiness, except that this nastiness is 100% software and isn't Intel/AMD's fault. At least until we get an async page fault due to apf_reason being paged out...
Andy Lutomirski <luto@kernel.org> writes: > On Mon, Mar 9, 2020 at 2:09 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> Paolo Bonzini <pbonzini@redhat.com> writes: >> > Yes, this works but Andy was not happy about adding more >> > save-and-restore to NMIs. If you do not want to do that, I'm okay with >> > disabling async page fault support for now. >> >> I'm fine with doing that save/restore dance, but I have no strong >> opinion either. >> >> > Storing the page fault reason in memory was not a good idea. Better >> > options would be to co-opt the page fault error code (e.g. store the >> > reason in bits 31:16, mark bits 15:0 with the invalid error code >> > RSVD=1/P=0), or to use the virtualization exception area. >> >> Memory store is not the problem. The real problem is hijacking #PF. >> >> If you'd have just used a separate VECTOR_ASYNC_PF then none of these >> problems would exist at all. >> > > I'm okay with the save/restore dance, I guess. It's just yet more > entry crud to deal with architecture nastiness, except that this > nastiness is 100% software and isn't Intel/AMD's fault. And we can do it in C and don't have to fiddle with it in the ASM maze. Thanks, tglx
On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote: > Andy Lutomirski <luto@kernel.org> writes: > > I'm okay with the save/restore dance, I guess. It's just yet more > > entry crud to deal with architecture nastiness, except that this > > nastiness is 100% software and isn't Intel/AMD's fault. > > And we can do it in C and don't have to fiddle with it in the ASM > maze. Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if we do the save/restore in do_nmi(). That is some wild brain melt. Also, AFAIK none of the distros are actually shipping a PREEMPT=y kernel anyway, so killing it shouldn't matter much. If people want to recover that, I'd suggest they sit down and create a sane paravirt interface for this.
On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote: > On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote: > > Andy Lutomirski <luto@kernel.org> writes: > > > > I'm okay with the save/restore dance, I guess. It's just yet more > > > entry crud to deal with architecture nastiness, except that this > > > nastiness is 100% software and isn't Intel/AMD's fault. > > > > And we can do it in C and don't have to fiddle with it in the ASM > > maze. > > Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if > we do the save/restore in do_nmi(). That is some wild brain melt. Also, > AFAIK none of the distros are actually shipping a PREEMPT=y kernel > anyway, so killing it shouldn't matter much. It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another use case outside CONFIG_PREEMPT. I am trying to extend async pf interface to also report page fault errors to the guest. https://lore.kernel.org/kvm/20200331194011.24834-1-vgoyal@redhat.com/ Right now async page fault interface assumes that host will always be able to successfully resolve the page fault and sooner or later PAGE_READY event will be sent to guest. And there is no mechnaism to report the errors back to guest. I am trying to add enhance virtiofs to directly map host page cache in guest. https://lore.kernel.org/linux-fsdevel/20200304165845.3081-1-vgoyal@redhat.com/ There it is possible that a file page on host is mapped in guest and file got truncated and page is not there anymore. Guest tries to access it, and it generates async page fault. On host we will get -EFAULT and I need to propagate it back to guest so that guest can either send SIGBUS to process which caused this. Or if kernel was trying to do memcpy(), then be able to use execpetion table error handling and be able to return with error. (memcpy_mcflush()). For the second case to work, I will need async pf events to come in even if guest is in kernel and CONFIG_PREEMPT=n. So it would be nice if we can keep KVM_ASYNC_PF_SEND_ALWAYS around. Thanks Vivek
On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote: > On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote: > > On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote: > > > Andy Lutomirski <luto@kernel.org> writes: > > > > > > I'm okay with the save/restore dance, I guess. It's just yet more > > > > entry crud to deal with architecture nastiness, except that this > > > > nastiness is 100% software and isn't Intel/AMD's fault. > > > > > > And we can do it in C and don't have to fiddle with it in the ASM > > > maze. > > > > Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if > > we do the save/restore in do_nmi(). That is some wild brain melt. Also, > > AFAIK none of the distros are actually shipping a PREEMPT=y kernel > > anyway, so killing it shouldn't matter much. > > It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another > use case outside CONFIG_PREEMPT. > > I am trying to extend async pf interface to also report page fault errors > to the guest. Then please start over and design a sane ParaVirt Fault interface. The current one is utter crap.
> On Apr 6, 2020, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote: >>> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote: >>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote: >>>> Andy Lutomirski <luto@kernel.org> writes: >>> >>>>> I'm okay with the save/restore dance, I guess. It's just yet more >>>>> entry crud to deal with architecture nastiness, except that this >>>>> nastiness is 100% software and isn't Intel/AMD's fault. >>>> >>>> And we can do it in C and don't have to fiddle with it in the ASM >>>> maze. >>> >>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if >>> we do the save/restore in do_nmi(). That is some wild brain melt. Also, >>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel >>> anyway, so killing it shouldn't matter much. >> >> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another >> use case outside CONFIG_PREEMPT. >> >> I am trying to extend async pf interface to also report page fault errors >> to the guest. > > Then please start over and design a sane ParaVirt Fault interface. The > current one is utter crap. Agreed. Don’t extend the current mechanism. Replace it. I would be happy to review a replacement. I’m not really excited to review an extension of the current mess. The current thing is barely, if at all, correct.
> On Apr 6, 2020, at 1:32 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > >> On Apr 6, 2020, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> >> On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote: >>>> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote: >>>>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote: >>>>>> Andy Lutomirski <luto@kernel.org> writes: >>>>> >>>>>>> I'm okay with the save/restore dance, I guess. It's just yet more >>>>>>> entry crud to deal with architecture nastiness, except that this >>>>>>> nastiness is 100% software and isn't Intel/AMD's fault. >>>>>> >>>>>> And we can do it in C and don't have to fiddle with it in the ASM >>>>>> maze. >>>>> >>>>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if >>>>> we do the save/restore in do_nmi(). That is some wild brain melt. Also, >>>>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel >>>>> anyway, so killing it shouldn't matter much. >>> >>> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another >>> use case outside CONFIG_PREEMPT. >>> >>> I am trying to extend async pf interface to also report page fault errors >>> to the guest. >> >> Then please start over and design a sane ParaVirt Fault interface. The >> current one is utter crap. > > Agreed. Don’t extend the current mechanism. Replace it. > > I would be happy to review a replacement. I’m not really excited to review an extension of the current mess. The current thing is barely, if at all, correct. I read your patch. It cannot possibly be correct. You need to decide what happens if you get a memory failure when guest interrupts are off. If this happens, you can’t send #PF, but you also can’t just swallow the error. The existing APF code is so messy that it’s not at all obvious what your code ends up doing, but I’m pretty sure it doesn’t do anything sensible, especially since the ABI doesn’t have a sensible option. I think you should inject MCE and coordinate with Tony Luck to make it sane. And, in the special case that the new improved async PF mechanism is enabled *and* interrupts are on, you can skip the MCE and instead inject a new improved APF. But, as it stands, I will NAK any guest code that tries to make #PF handle memory failure. Sorry, it’s just too messy to actually analyze all the cases.
Vivek Goyal <vgoyal@redhat.com> writes: > On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote: >> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote: >> > Andy Lutomirski <luto@kernel.org> writes: >> >> > > I'm okay with the save/restore dance, I guess. It's just yet more >> > > entry crud to deal with architecture nastiness, except that this >> > > nastiness is 100% software and isn't Intel/AMD's fault. >> > >> > And we can do it in C and don't have to fiddle with it in the ASM >> > maze. >> >> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if >> we do the save/restore in do_nmi(). That is some wild brain melt. Also, >> AFAIK none of the distros are actually shipping a PREEMPT=y kernel >> anyway, so killing it shouldn't matter much. > > It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another > use case outside CONFIG_PREEMPT. > > I am trying to extend async pf interface to also report page fault errors > to the guest. > > https://lore.kernel.org/kvm/20200331194011.24834-1-vgoyal@redhat.com/ > > Right now async page fault interface assumes that host will always be > able to successfully resolve the page fault and sooner or later PAGE_READY > event will be sent to guest. And there is no mechnaism to report the > errors back to guest. > > I am trying to add enhance virtiofs to directly map host page cache in guest. > > https://lore.kernel.org/linux-fsdevel/20200304165845.3081-1-vgoyal@redhat.com/ > > There it is possible that a file page on host is mapped in guest and file > got truncated and page is not there anymore. Guest tries to access it, > and it generates async page fault. On host we will get -EFAULT and I > need to propagate it back to guest so that guest can either send SIGBUS > to process which caused this. Or if kernel was trying to do memcpy(), > then be able to use execpetion table error handling and be able to > return with error. (memcpy_mcflush()). > > For the second case to work, I will need async pf events to come in > even if guest is in kernel and CONFIG_PREEMPT=n. What? > So it would be nice if we can keep KVM_ASYNC_PF_SEND_ALWAYS around. No. If you want this stuff to be actually useful and correct, then please redesign it from scratch w/o abusing #PF. It want's to be a separate vector and then the pagefault resulting from your example above becomes a real #PF without any bells and whistels. Thanks, tglx
On Mon, Apr 06, 2020 at 01:42:28PM -0700, Andy Lutomirski wrote: > > > On Apr 6, 2020, at 1:32 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > >> On Apr 6, 2020, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > >> On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote: > >>>> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote: > >>>>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote: > >>>>>> Andy Lutomirski <luto@kernel.org> writes: > >>>>> > >>>>>>> I'm okay with the save/restore dance, I guess. It's just yet more > >>>>>>> entry crud to deal with architecture nastiness, except that this > >>>>>>> nastiness is 100% software and isn't Intel/AMD's fault. > >>>>>> > >>>>>> And we can do it in C and don't have to fiddle with it in the ASM > >>>>>> maze. > >>>>> > >>>>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if > >>>>> we do the save/restore in do_nmi(). That is some wild brain melt. Also, > >>>>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel > >>>>> anyway, so killing it shouldn't matter much. > >>> > >>> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another > >>> use case outside CONFIG_PREEMPT. > >>> > >>> I am trying to extend async pf interface to also report page fault errors > >>> to the guest. > >> > >> Then please start over and design a sane ParaVirt Fault interface. The > >> current one is utter crap. > > > > Agreed. Don’t extend the current mechanism. Replace it. > > > > I would be happy to review a replacement. I’m not really excited to review an extension of the current mess. The current thing is barely, if at all, correct. > > I read your patch. It cannot possibly be correct. You need to decide what happens if you get a memory failure when guest interrupts are off. If this happens, you can’t send #PF, but you also can’t just swallow the error. The existing APF code is so messy that it’s not at all obvious what your code ends up doing, but I’m pretty sure it doesn’t do anything sensible, especially since the ABI doesn’t have a sensible option. Hi Andy, I am not familiar with this KVM code and trying to understand it. I think error exception gets queued and gets delivered at some point of time, even if interrupts are disabled at the time of exception. Most likely at the time of next VM entry. Whether interrupts are enabled or not check only happens before we decide if async pf protocol should be followed or not. Once we decide to send PAGE_NOT_PRESENT, later notification PAGE_READY does not check if interrupts are enabled or not. And it kind of makes sense otherwise guest process will wait infinitely to receive PAGE_READY. I modified the code a bit to disable interrupt and wait 10 seconds (after getting PAGE_NOT_PRESENT message). And I noticed that error async pf got delivered after 10 seconds after enabling interrupts. So error async pf was not lost because interrupts were disabled. Havind said that, I thought disabling interrupts does not mask exceptions. So page fault exception should have been delivered even with interrupts disabled. Is that correct? May be there was no vm exit/entry during those 10 seconds and that's why. Thanks Vivek
> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Mon, Apr 06, 2020 at 01:42:28PM -0700, Andy Lutomirski wrote: >> >>>> On Apr 6, 2020, at 1:32 PM, Andy Lutomirski <luto@amacapital.net> wrote: >>> >>> >>>> On Apr 6, 2020, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote: >>>> >>>> On Mon, Apr 06, 2020 at 03:09:51PM -0400, Vivek Goyal wrote: >>>>>> On Mon, Mar 09, 2020 at 09:22:15PM +0100, Peter Zijlstra wrote: >>>>>>> On Mon, Mar 09, 2020 at 08:05:18PM +0100, Thomas Gleixner wrote: >>>>>>>> Andy Lutomirski <luto@kernel.org> writes: >>>>>>> >>>>>>>>> I'm okay with the save/restore dance, I guess. It's just yet more >>>>>>>>> entry crud to deal with architecture nastiness, except that this >>>>>>>>> nastiness is 100% software and isn't Intel/AMD's fault. >>>>>>>> >>>>>>>> And we can do it in C and don't have to fiddle with it in the ASM >>>>>>>> maze. >>>>>>> >>>>>>> Right; I'd still love to kill KVM_ASYNC_PF_SEND_ALWAYS though, even if >>>>>>> we do the save/restore in do_nmi(). That is some wild brain melt. Also, >>>>>>> AFAIK none of the distros are actually shipping a PREEMPT=y kernel >>>>>>> anyway, so killing it shouldn't matter much. >>>>> >>>>> It will be nice if we can retain KVM_ASYNC_PF_SEND_ALWAYS. I have another >>>>> use case outside CONFIG_PREEMPT. >>>>> >>>>> I am trying to extend async pf interface to also report page fault errors >>>>> to the guest. >>>> >>>> Then please start over and design a sane ParaVirt Fault interface. The >>>> current one is utter crap. >>> >>> Agreed. Don’t extend the current mechanism. Replace it. >>> >>> I would be happy to review a replacement. I’m not really excited to review an extension of the current mess. The current thing is barely, if at all, correct. >> >> I read your patch. It cannot possibly be correct. You need to decide what happens if you get a memory failure when guest interrupts are off. If this happens, you can’t send #PF, but you also can’t just swallow the error. The existing APF code is so messy that it’s not at all obvious what your code ends up doing, but I’m pretty sure it doesn’t do anything sensible, especially since the ABI doesn’t have a sensible option. > > Hi Andy, > > I am not familiar with this KVM code and trying to understand it. I think > error exception gets queued and gets delivered at some point of time, even > if interrupts are disabled at the time of exception. Most likely at the time > of next VM entry. I’ve read the code three or four times and I barely understand it. I’m not convinced the author understood it. It’s spaghetti. > > Whether interrupts are enabled or not check only happens before we decide > if async pf protocol should be followed or not. Once we decide to > send PAGE_NOT_PRESENT, later notification PAGE_READY does not check > if interrupts are enabled or not. And it kind of makes sense otherwise > guest process will wait infinitely to receive PAGE_READY. > > I modified the code a bit to disable interrupt and wait 10 seconds (after > getting PAGE_NOT_PRESENT message). And I noticed that error async pf > got delivered after 10 seconds after enabling interrupts. So error > async pf was not lost because interrupts were disabled. > > Havind said that, I thought disabling interrupts does not mask exceptions. > So page fault exception should have been delivered even with interrupts > disabled. Is that correct? May be there was no vm exit/entry during > those 10 seconds and that's why. My point is that the entire async pf is nonsense. 
There are two types of events right now: “Page not ready”: normally this isn’t even visible to the guest — the guest just waits. With async pf, the idea is to try to tell the guest that a particular instruction would block and the guest should do something else instead. Sending a normal exception is a poor design, though: the guest may not expect this instruction to cause an exception. I think KVM should try to deliver an *interrupt* and, if it can’t, then just block the guest. “Page ready”: this is a regular asynchronous notification just like, say, a virtio completion. It should be an ordinary interrupt. Some in memory data structure should indicate which pages are ready. “Page is malfunctioning” is tricky because you *must* deliver the event. x86’s #MC is not exactly a masterpiece, but it does kind of work. > > Thanks > Vivek >
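A minimal sketch of the "ordinary interrupt plus in-memory structure" idea for page-ready follows. Everything here is hypothetical and for illustration only: the ring layout, names, and sizes are invented, and nothing in it is an existing KVM ABI; only kvm_async_pf_task_wake() is an existing guest helper.

```c
/*
 * Hypothetical shared structure for "page ready" notifications:
 * delivery is an ordinary interrupt vector, and the tokens of completed
 * faults sit in guest memory written by the host.
 */
struct ready_ring {
	u32 head;		/* written by host */
	u32 tail;		/* written by guest */
	u32 tokens[62];		/* tokens of faults that completed */
};

static DEFINE_PER_CPU(struct ready_ring, apf_ready_ring);

/* Ordinary interrupt handler: wake every task whose token shows up. */
static void apf_ready_interrupt(void)
{
	struct ready_ring *r = this_cpu_ptr(&apf_ready_ring);

	while (r->tail != READ_ONCE(r->head)) {
		u32 token = r->tokens[r->tail % ARRAY_SIZE(r->tokens)];

		kvm_async_pf_task_wake(token);	/* existing wakeup helper */
		WRITE_ONCE(r->tail, r->tail + 1);
	}
}
```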
Andy Lutomirski <luto@amacapital.net> writes: >> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@redhat.com> wrote: >> Whether interrupts are enabled or not check only happens before we decide >> if async pf protocol should be followed or not. Once we decide to >> send PAGE_NOT_PRESENT, later notification PAGE_READY does not check >> if interrupts are enabled or not. And it kind of makes sense otherwise >> guest process will wait infinitely to receive PAGE_READY. >> >> I modified the code a bit to disable interrupt and wait 10 seconds (after >> getting PAGE_NOT_PRESENT message). And I noticed that error async pf >> got delivered after 10 seconds after enabling interrupts. So error >> async pf was not lost because interrupts were disabled. Async PF is not the same as a real #PF. It just hijacked the #PF vector because someone thought this is a brilliant idea. >> Havind said that, I thought disabling interrupts does not mask exceptions. >> So page fault exception should have been delivered even with interrupts >> disabled. Is that correct? May be there was no vm exit/entry during >> those 10 seconds and that's why. No. Async PF is not a real exception. It has interrupt semantics and it can only be injected when the guest has interrupts enabled. It's bad design. > My point is that the entire async pf is nonsense. There are two types of events right now: > > “Page not ready”: normally this isn’t even visible to the guest — the > guest just waits. With async pf, the idea is to try to tell the guest > that a particular instruction would block and the guest should do > something else instead. Sending a normal exception is a poor design, > though: the guest may not expect this instruction to cause an > exception. I think KVM should try to deliver an *interrupt* and, if it > can’t, then just block the guest. That's pretty much what it does, just that it runs this through #PF and has the 'interrupts disabled, i.e. can't right now' checks around that. If it can't then KVM schedules the guest out until the situation has been resolved. > “Page ready”: this is a regular asynchronous notification just like, > say, a virtio completion. It should be an ordinary interrupt. Some in > memory data structure should indicate which pages are ready. > > “Page is malfunctioning” is tricky because you *must* deliver the > event. x86’s #MC is not exactly a masterpiece, but it does kind of > work. Nooooo. This does not need #MC at all. Don't even think about it. The point is that the access to such a page is either happening in user space or in kernel space with a proper exception table fixup. That means a real #PF is perfectly fine. That can be injected any time and does not have the interrupt semantics of async PF. So now let's assume we disentangled async PF from #PF and made it a regular interrupt, then the following situation still needs to be dealt with: guest -> access faults host -> injects async fault guest -> handles and blocks the task host figures out that the page does not exist anymore and now needs to fixup the situation. host -> injects async wakeup guest -> returns from async PF interrupt and retries the instruction which faults again. host -> knows by now that this is a real fault and injects a proper #PF guest -> #PF runs and either sends signal to user space or runs the exception table fixup for a kernel fault. Thanks, tglx
> On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > Andy Lutomirski <luto@amacapital.net> writes: >>>> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@redhat.com> wrote: >>> Whether interrupts are enabled or not check only happens before we decide >>> if async pf protocol should be followed or not. Once we decide to >>> send PAGE_NOT_PRESENT, later notification PAGE_READY does not check >>> if interrupts are enabled or not. And it kind of makes sense otherwise >>> guest process will wait infinitely to receive PAGE_READY. >>> >>> I modified the code a bit to disable interrupt and wait 10 seconds (after >>> getting PAGE_NOT_PRESENT message). And I noticed that error async pf >>> got delivered after 10 seconds after enabling interrupts. So error >>> async pf was not lost because interrupts were disabled. > > Async PF is not the same as a real #PF. It just hijacked the #PF vector > because someone thought this is a brilliant idea. > >>> Havind said that, I thought disabling interrupts does not mask exceptions. >>> So page fault exception should have been delivered even with interrupts >>> disabled. Is that correct? May be there was no vm exit/entry during >>> those 10 seconds and that's why. > > No. Async PF is not a real exception. It has interrupt semantics and it > can only be injected when the guest has interrupts enabled. It's bad > design. > >> My point is that the entire async pf is nonsense. There are two types of events right now: >> >> “Page not ready”: normally this isn’t even visible to the guest — the >> guest just waits. With async pf, the idea is to try to tell the guest >> that a particular instruction would block and the guest should do >> something else instead. Sending a normal exception is a poor design, >> though: the guest may not expect this instruction to cause an >> exception. I think KVM should try to deliver an *interrupt* and, if it >> can’t, then just block the guest. > > That's pretty much what it does, just that it runs this through #PF and > has the checks for interrupts disabled - i.e can't right now' around > that. If it can't then KVM schedules the guest out until the situation > has been resolved. > >> “Page ready”: this is a regular asynchronous notification just like, >> say, a virtio completion. It should be an ordinary interrupt. Some in >> memory data structure should indicate which pages are ready. >> >> “Page is malfunctioning” is tricky because you *must* deliver the >> event. x86’s #MC is not exactly a masterpiece, but it does kind of >> work. > > Nooooo. This does not need #MC at all. Don't even think about it. Yessssssssssss. Please do think about it. :) > > The point is that the access to such a page is either happening in user > space or in kernel space with a proper exception table fixup. > > That means a real #PF is perfectly fine. That can be injected any time > and does not have the interrupt semantics of async PF. The hypervisor has no way to distinguish between MOV-and-has-valid-stack-and-extable-entry and MOV-definitely-can’t-fault-here. Or, for that matter, MOV-in-do_page_fault()-will-recurve-if-it-faults. > > So now lets assume we distangled async PF from #PF and made it a regular > interrupt, then the following situation still needs to be dealt with: > > guest -> access faults > > host -> injects async fault > > guest -> handles and blocks the task > > host figures out that the page does not exist anymore and now needs to > fixup the situation. 
> > host -> injects async wakeup > > guest -> returns from aysnc PF interrupt and retries the instruction > which faults again. > > host -> knows by now that this is a real fault and injects a proper #PF > > guest -> #PF runs and either sends signal to user space or runs > the exception table fixup for a kernel fault. Or guest blows up because the fault could not be recovered using #PF. I can see two somewhat sane ways to make this work. 1. Access to bad memory results in an async-page-not-present, except that, it’s not deliverable, the guest is killed. Either that async-page-not-present has a special flag saying “memory failure” or the eventual wakeup says “memory failure”. 2. Access to bad memory results in #MC. Sure, #MC is a turd, but it’s an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case. I think I like #2 much better. It has another nice effect: a good implementation will serve as a way to exercise the #MC code without needing to muck with EINJ or with whatever magic Tony uses. The average kernel developer does not have access to a box with testable memory failure reporting. > > Thanks, > > tglx > > > >
On 07/04/20 22:20, Thomas Gleixner wrote: >>> Havind said that, I thought disabling interrupts does not mask exceptions. >>> So page fault exception should have been delivered even with interrupts >>> disabled. Is that correct? May be there was no vm exit/entry during >>> those 10 seconds and that's why. > No. Async PF is not a real exception. It has interrupt semantics and it > can only be injected when the guest has interrupts enabled. It's bad > design. Page-ready async PF has interrupt semantics. Page-not-present async PF however does not have interrupt semantics, it has to be injected immediately or not at all (falling back to host page fault in the latter case). So page-not-present async PF definitely needs to be an exception, this is independent of whether it can be injected when IF=0. Hypervisors do not have any reserved exception vector, and must use vectors up to 31, which is why I believe #PF was used in the first place (though that predates my involvement in KVM by a few years). These days, #VE would be a much better exception to use instead (and it also has a defined mechanism to avoid reentrancy). Paolo
On 07/04/20 23:41, Andy Lutomirski wrote: > 2. Access to bad memory results in #MC. Sure, #MC is a turd, but > it’s an *architectural* turd. By all means, have a nice simple PV > mechanism to tell the #MC code exactly what went wrong, but keep the > overall flow the same as in the native case. > > I think I like #2 much better. It has another nice effect: a good > implementation will serve as a way to exercise the #MC code without > needing to muck with EINJ or with whatever magic Tony uses. The > average kernel developer does not have access to a box with testable > memory failure reporting. I prefer #VE, but I can see how #MC has some appeal. However, #VE has a mechanism to avoid reentrancy, unlike #MC. How would that be better than the current mess with an NMI happening in the first few instructions of the #PF handler? Paolo
> On Apr 7, 2020, at 3:07 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 07/04/20 23:41, Andy Lutomirski wrote: >> 2. Access to bad memory results in #MC. Sure, #MC is a turd, but >> it’s an *architectural* turd. By all means, have a nice simple PV >> mechanism to tell the #MC code exactly what went wrong, but keep the >> overall flow the same as in the native case. >> >> I think I like #2 much better. It has another nice effect: a good >> implementation will serve as a way to exercise the #MC code without >> needing to muck with EINJ or with whatever magic Tony uses. The >> average kernel developer does not have access to a box with testable >> memory failure reporting. > > I prefer #VE, but I can see how #MC has some appeal. However, #VE has a > mechanism to avoid reentrancy, unlike #MC. How would that be better > than the current mess with an NMI happening in the first few > instructions of the #PF handler? > > It has to be an IST vector due to the possibility of hitting a memory failure right after SYSCALL. I imagine that making #VE use IST would be unfortunate. I think #MC has a mechanism to prevent reentrancy to a limited extent. How does #VE avoid reentrancy?
Andy Lutomirski <luto@amacapital.net> writes: >> On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote: >> Andy Lutomirski <luto@amacapital.net> writes: >>> “Page is malfunctioning” is tricky because you *must* deliver the >>> event. x86’s #MC is not exactly a masterpiece, but it does kind of >>> work. >> >> Nooooo. This does not need #MC at all. Don't even think about it. > > Yessssssssssss. Please do think about it. :) I stared too much into that code recently that even thinking about it hurts. :) >> The point is that the access to such a page is either happening in user >> space or in kernel space with a proper exception table fixup. >> >> That means a real #PF is perfectly fine. That can be injected any time >> and does not have the interrupt semantics of async PF. > > The hypervisor has no way to distinguish between > MOV-and-has-valid-stack-and-extable-entry and > MOV-definitely-can’t-fault-here. Or, for that matter, > MOV-in-do_page_fault()-will-recurve-if-it-faults. The mechanism which Vivek wants to support has a well defined usage scenario, i.e. either user space or kernel-valid-stack+extable-entry. So why do you want to route that through #MC? >> So now lets assume we distangled async PF from #PF and made it a regular >> interrupt, then the following situation still needs to be dealt with: >> >> guest -> access faults >> >> host -> injects async fault >> >> guest -> handles and blocks the task >> >> host figures out that the page does not exist anymore and now needs to >> fixup the situation. >> >> host -> injects async wakeup >> >> guest -> returns from aysnc PF interrupt and retries the instruction >> which faults again. >> >> host -> knows by now that this is a real fault and injects a proper #PF >> >> guest -> #PF runs and either sends signal to user space or runs >> the exception table fixup for a kernel fault. > > Or guest blows up because the fault could not be recovered using #PF. Not for the use case at hand. And for that you really want to use regular #PF. The scenario I showed above is perfectly legit: guest: copy_to_user() <- Has extable -> FAULT host: Oh, page is not there, give me some time to figure it out. inject async fault guest: handles async fault interrupt, enables interrupts, blocks host: Situation resolved, shared file was truncated. Tell guest Inject #MC guest: handles #MC completely out of context because the faulting task is scheduled out. Rumage through wait lists (how do you protect them?) and find the waiter. Schedule irq_work to actually mark the waiter as failed and wake it up. Sorry, but that's not really an improvement. That's worse. > I can see two somewhat sane ways to make this work. > > 1. Access to bad memory results in an async-page-not-present, except > that, it’s not deliverable, the guest is killed. That's incorrect. The proper reaction is a real #PF. Simply because this is part of the contract of sharing some file backed stuff between host and guest in a well defined "virtio" scenario and not a random access to memory which might be there or not. Look at it from the point where async whatever does not exist at all: guest: copy_to_user() <- Has extable -> FAULT host: suspend guest and resolve situation if (page swapped in) resume_guest(); else inject_pf(); And this inject_pf() does not care whether it kills the guest or makes it double/triple fault or whatever. 
The 'tell the guest to do something else while host tries to sort it' opportunistic thingy turns this into: guest: copy_to_user() <- Has extable -> FAULT host: tell guest to do something else, i.e. guest suspends task if (page swapped in) tell guest to resume suspended task else tell guest to resume suspended task guest resumes and faults again host: inject_pf(); which is pretty much equivalent. > 2. Access to bad memory results in #MC. Sure, #MC is a turd, but it’s > an *architectural* turd. By all means, have a nice simple PV mechanism > to tell the #MC code exactly what went wrong, but keep the overall > flow the same as in the native case. It's a completely different flow as you evaluate PV turd instead of analysing the MCE banks and the other error reporting facilities. > I think I like #2 much better. It has another nice effect: a good > implementation will serve as a way to exercise the #MC code without > needing to muck with EINJ or with whatever magic Tony uses. The > average kernel developer does not have access to a box with testable > memory failure reporting. Yes, because CPU dudes decided that documenting the testability mechanisms is a bad thing. They surely exist and for Nehalem at least the ECC error injection mechanism was documented and usable. But to actually make something useful you need to emulate the bank interface on the hypervisor and expose the errors to the guest. The MCE injection mechanism has some awful way to "emulate" MSR reads for that. See mce_rdmsr(). *SHUDDER* Thanks, tglx
On Tue, Apr 07, 2020 at 02:41:02PM -0700, Andy Lutomirski wrote: > > > > On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > Andy Lutomirski <luto@amacapital.net> writes: > >>>> On Apr 7, 2020, at 10:21 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > >>> Whether interrupts are enabled or not check only happens before we decide > >>> if async pf protocol should be followed or not. Once we decide to > >>> send PAGE_NOT_PRESENT, later notification PAGE_READY does not check > >>> if interrupts are enabled or not. And it kind of makes sense otherwise > >>> guest process will wait infinitely to receive PAGE_READY. > >>> > >>> I modified the code a bit to disable interrupt and wait 10 seconds (after > >>> getting PAGE_NOT_PRESENT message). And I noticed that error async pf > >>> got delivered after 10 seconds after enabling interrupts. So error > >>> async pf was not lost because interrupts were disabled. > > > > Async PF is not the same as a real #PF. It just hijacked the #PF vector > > because someone thought this is a brilliant idea. > > > >>> Havind said that, I thought disabling interrupts does not mask exceptions. > >>> So page fault exception should have been delivered even with interrupts > >>> disabled. Is that correct? May be there was no vm exit/entry during > >>> those 10 seconds and that's why. > > > > No. Async PF is not a real exception. It has interrupt semantics and it > > can only be injected when the guest has interrupts enabled. It's bad > > design. > > > >> My point is that the entire async pf is nonsense. There are two types of events right now: > >> > >> “Page not ready”: normally this isn’t even visible to the guest — the > >> guest just waits. With async pf, the idea is to try to tell the guest > >> that a particular instruction would block and the guest should do > >> something else instead. Sending a normal exception is a poor design, > >> though: the guest may not expect this instruction to cause an > >> exception. I think KVM should try to deliver an *interrupt* and, if it > >> can’t, then just block the guest. > > > > That's pretty much what it does, just that it runs this through #PF and > > has the checks for interrupts disabled - i.e can't right now' around > > that. If it can't then KVM schedules the guest out until the situation > > has been resolved. > > > >> “Page ready”: this is a regular asynchronous notification just like, > >> say, a virtio completion. It should be an ordinary interrupt. Some in > >> memory data structure should indicate which pages are ready. > >> > >> “Page is malfunctioning” is tricky because you *must* deliver the > >> event. x86’s #MC is not exactly a masterpiece, but it does kind of > >> work. > > > > Nooooo. This does not need #MC at all. Don't even think about it. > > Yessssssssssss. Please do think about it. :) > > > > > The point is that the access to such a page is either happening in user > > space or in kernel space with a proper exception table fixup. > > > > That means a real #PF is perfectly fine. That can be injected any time > > and does not have the interrupt semantics of async PF. > > The hypervisor has no way to distinguish between MOV-and-has-valid-stack-and-extable-entry and MOV-definitely-can’t-fault-here. Or, for that matter, MOV-in-do_page_fault()-will-recurve-if-it-faults. 
> > > > > So now lets assume we distangled async PF from #PF and made it a regular > > interrupt, then the following situation still needs to be dealt with: > > > > guest -> access faults > > > > host -> injects async fault > > > > guest -> handles and blocks the task > > > > host figures out that the page does not exist anymore and now needs to > > fixup the situation. > > > > host -> injects async wakeup > > > > guest -> returns from aysnc PF interrupt and retries the instruction > > which faults again. > > > > host -> knows by now that this is a real fault and injects a proper #PF > > > > guest -> #PF runs and either sends signal to user space or runs > > the exception table fixup for a kernel fault. > > Or guest blows up because the fault could not be recovered using #PF. > > I can see two somewhat sane ways to make this work. > > 1. Access to bad memory results in an async-page-not-present, except that, it’s not deliverable, the guest is killed. Either that async-page-not-present has a special flag saying “memory failure” or the eventual wakeup says “memory failure”. > > 2. Access to bad memory results in #MC. Sure, #MC is a turd, but it’s an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case. > Should we differentiate between two error cases. In this case memory is not bad. Just that page being asked for can't be faulted in anymore. And another error case is real memory failure. So for the first case it could be injected into guest using #PF or #VE or something paravirt specific and real hardware failures take #MC path (if they are not already doing so). Thanks Vivek
Paolo Bonzini <pbonzini@redhat.com> writes: > On 07/04/20 22:20, Thomas Gleixner wrote: >>>> Havind said that, I thought disabling interrupts does not mask exceptions. >>>> So page fault exception should have been delivered even with interrupts >>>> disabled. Is that correct? May be there was no vm exit/entry during >>>> those 10 seconds and that's why. >> No. Async PF is not a real exception. It has interrupt semantics and it >> can only be injected when the guest has interrupts enabled. It's bad >> design. > > Page-ready async PF has interrupt semantics. > > Page-not-present async PF however does not have interrupt semantics, it > has to be injected immediately or not at all (falling back to host page > fault in the latter case). If interrupts are disabled in the guest then it is NOT injected and the guest is suspended. So it HAS interrupt semantics. Conditional ones, i.e. if interrupts are disabled, bail, if not then inject it. But that does not make it an exception by any means. It never should have been hooked to #PF in the first place and it never should have been named that way. The functionality is to opportunisticly tell the guest to do some other stuff. So the proper name for this seperate interrupt vector would be: VECTOR_OMG_DOS - Opportunisticly Make Guest Do Other Stuff and the counter part VECTOR_STOP_DOS - Stop Doing Other Stuff > So page-not-present async PF definitely needs to be an exception, this > is independent of whether it can be injected when IF=0. That wants to be a straight #PF. See my reply to Andy. > Hypervisors do not have any reserved exception vector, and must use > vectors up to 31, which is why I believe #PF was used in the first place > (though that predates my involvement in KVM by a few years). No. That was just bad taste or something worse. It has nothing to do with exceptions, see above. Stop proliferating the confusion. > These days, #VE would be a much better exception to use instead (and > it also has a defined mechanism to avoid reentrancy). #VE is not going to solve anything. The idea of OMG_DOS is to (opportunisticly) avoid that the guest (and perhaps host) sit idle waiting for I/O until the fault has been resolved. That makes sense as there might be enough other stuff to do which does not depend on that particular page. If not then fine, the guest will go idle. Thanks, tglx
On 08/04/20 00:29, Andy Lutomirski wrote: >> I prefer #VE, but I can see how #MC has some appeal. However, #VE has a >> mechanism to avoid reentrancy, unlike #MC. How would that be better >> than the current mess with an NMI happening in the first few >> instructions of the #PF handler? >> >> > It has to be an IST vector due to the possibility of hitting a memory failure right after SYSCALL. Not if syscall clears IF, right? > I think #MC has a mechanism to prevent reentrancy to a limited extent. How does #VE avoid reentrancy? In hardware, it has a flag word and delivers a vmexit instead of #VE if that word is not zero (see towards the end of 25.5.6.1 Convertible EPT Violations). Here it would be the same except it would just do the page fault synchronously in the host. Paolo
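For reference, the "flag word" Paolo mentions is the 32-bit value at byte offset 4 of the per-CPU virtualization-exception information area. A sketch of that area's layout, following the SDM's description of Virtualization Exceptions (the struct and field names are illustrative, not taken from any existing header):

    struct ve_except_info {
            u32 exit_reason;   /* bytes 0-3: would-be exit reason (EPT violation)   */
            u32 busy;          /* bytes 4-7: set to 0xFFFFFFFF when a #VE is
                                * delivered; while it is non-zero the CPU raises
                                * a VM exit instead of another #VE.  The guest
                                * writes 0 here to re-arm #VE delivery.             */
            u64 exit_qual;     /* bytes 8-15: exit qualification                    */
            u64 gla;           /* bytes 16-23: guest-linear address                 */
            u64 gpa;           /* bytes 24-31: guest-physical address               */
            u16 eptp_index;    /* bytes 32-33: current EPTP index                   */
    };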
> On Apr 7, 2020, at 3:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > Andy Lutomirski <luto@amacapital.net> writes: >>>> On Apr 7, 2020, at 1:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote: >>> Andy Lutomirski <luto@amacapital.net> writes: >>>> “Page is malfunctioning” is tricky because you *must* deliver the >>>> event. x86’s #MC is not exactly a masterpiece, but it does kind of >>>> work. >>> >>> Nooooo. This does not need #MC at all. Don't even think about it. >> >> Yessssssssssss. Please do think about it. :) > > I stared too much into that code recently that even thinking about it > hurts. :) > >>> The point is that the access to such a page is either happening in user >>> space or in kernel space with a proper exception table fixup. >>> >>> That means a real #PF is perfectly fine. That can be injected any time >>> and does not have the interrupt semantics of async PF. >> >> The hypervisor has no way to distinguish between >> MOV-and-has-valid-stack-and-extable-entry and >> MOV-definitely-can’t-fault-here. Or, for that matter, >> MOV-in-do_page_fault()-will-recurve-if-it-faults. > > The mechanism which Vivek wants to support has a well defined usage > scenario, i.e. either user space or kernel-valid-stack+extable-entry. > > So why do you want to route that through #MC? To be clear, I hate #MC as much as the next person. But I think it needs to be an IST vector, and the #MC *vector* isn’t so terrible. (The fact that we can’t atomically return from #MC and re-enable CR4.MCE in a single instruction is problematic but not the end of the world.). But see below — I don’t think my suggestion should work quite the way you interpreted it. >>> >>> >>> guest -> #PF runs and either sends signal to user space or runs >>> the exception table fixup for a kernel fault. >> >> Or guest blows up because the fault could not be recovered using #PF. > > Not for the use case at hand. And for that you really want to use > regular #PF. > > The scenario I showed above is perfectly legit: > > guest: > copy_to_user() <- Has extable > -> FAULT > > host: > Oh, page is not there, give me some time to figure it out. > > inject async fault > > guest: > handles async fault interrupt, enables interrupts, blocks > > host: > Situation resolved, shared file was truncated. Tell guest All good so far. > > Inject #MC No, not what I meant. Host has two sane choices here IMO: 1. Tell the guest that the page is gone as part of the wakeup. No #PF or #MC. 2. Tell guest that it’s resolved and inject #MC when the guest retries. The #MC is a real fault, RIP points to the right place, etc. > > >> >> >> 1. Access to bad memory results in an async-page-not-present, except >> that, it’s not deliverable, the guest is killed. > > That's incorrect. The proper reaction is a real #PF. Simply because this > is part of the contract of sharing some file backed stuff between host > and guest in a well defined "virtio" scenario and not a random access to > memory which might be there or not. The problem is that the host doesn’t know when #PF is safe. It’s sort of the same problem that async pf has now. The guest kernel could access the problematic page in the middle of an NMI, under pagefault_disable(), etc — getting #PF as a result of CPL0 access to a page with a valid guest PTE is simply not part of the x86 architecture. 
> > Look at it from the point where async whatever does not exist at all: > > guest: > copy_to_user() <- Has extable > -> FAULT > > host: > suspend guest and resolve situation > > if (page swapped in) > resume_guest(); > else > inject_pf(); > > And this inject_pf() does not care whether it kills the guest or makes > it double/triple fault or whatever. > > The 'tell the guest to do something else while host tries to sort it' > opportunistic thingy turns this into: > > guest: > copy_to_user() <- Has extable > -> FAULT > > host: > tell guest to do something else, i.e. guest suspends task > > if (page swapped in) > tell guest to resume suspended task > else > tell guest to resume suspended task > > guest resumes and faults again > > host: > inject_pf(); > > which is pretty much equivalent. Replace copy_to_user() with some access to a gup-ed mapping with no extable handler and it doesn’t look so good any more. Of course, the guest will oops if this happens, but the guest needs to be able to oops cleanly. #PF is too fragile for this because it’s not IST, and #PF is the wrong thing anyway — #PF is all about guest-virtual-to-guest-physical mappings. Heck, what would CR2 be? The host might not even know the guest virtual address. > >> 2. Access to bad memory results in #MC. Sure, #MC is a turd, but it’s >> an *architectural* turd. By all means, have a nice simple PV mechanism >> to tell the #MC code exactly what went wrong, but keep the overall >> flow the same as in the native case. > > It's a completely different flow as you evaluate PV turd instead of > analysing the MCE banks and the other error reporting facilities. I’m fine with the flow being different. do_machine_check() could have entirely different logic to decide the error in PV. But I think we should reuse the overall flow: kernel gets #MC with RIP pointing to the offending instruction. If there’s an extable entry that can handle memory failure, handle it. If it’s a user access, handle it. If it’s an unrecoverable error because it was a non-extable kernel access, oops or panic. The actual PV part could be extremely simple: the host just needs to tell the guest “this #MC is due to memory failure at this guest physical address”. No banks, no DIMM slot, no rendezvous crap (LMCE), no other nonsense. It would be nifty if the host also told the guest what the guest virtual address was if the host knows it.
On 08/04/20 01:21, Thomas Gleixner wrote: > Paolo Bonzini <pbonzini@redhat.com> writes: > >> On 07/04/20 22:20, Thomas Gleixner wrote: >>>>> Havind said that, I thought disabling interrupts does not mask exceptions. >>>>> So page fault exception should have been delivered even with interrupts >>>>> disabled. Is that correct? May be there was no vm exit/entry during >>>>> those 10 seconds and that's why. >>> No. Async PF is not a real exception. It has interrupt semantics and it >>> can only be injected when the guest has interrupts enabled. It's bad >>> design. >> >> Page-ready async PF has interrupt semantics. >> >> Page-not-present async PF however does not have interrupt semantics, it >> has to be injected immediately or not at all (falling back to host page >> fault in the latter case). > > If interrupts are disabled in the guest then it is NOT injected and the > guest is suspended. So it HAS interrupt semantics. Conditional ones, > i.e. if interrupts are disabled, bail, if not then inject it. Interrupts can be delayed by TPR or STI/MOV SS interrupt window, async page faults cannot (again, not the page-ready kind). Page-not-present async page faults are almost a perfect match for the hardware use of #VE (and it might even be possible to let the processor deliver the exceptions). There are other advantages: - the only real problem with using #PF (with or without KVM_ASYNC_PF_SEND_ALWAYS) seems to be the NMI reentrancy issue, which would not be there for #VE. - #VE are combined the right way with other exceptions (the benign/contributory/pagefault stuff) - adjusting KVM and Linux to use #VE instead of #PF would be less than 100 lines of code. Paolo > But that does not make it an exception by any means. > > It never should have been hooked to #PF in the first place and it never > should have been named that way. The functionality is to opportunisticly > tell the guest to do some other stuff. > > So the proper name for this seperate interrupt vector would be: > > VECTOR_OMG_DOS - Opportunisticly Make Guest Do Other Stuff > > and the counter part > > VECTOR_STOP_DOS - Stop Doing Other Stuff > >> So page-not-present async PF definitely needs to be an exception, this >> is independent of whether it can be injected when IF=0. > > That wants to be a straight #PF. See my reply to Andy. > >> Hypervisors do not have any reserved exception vector, and must use >> vectors up to 31, which is why I believe #PF was used in the first place >> (though that predates my involvement in KVM by a few years). > > No. That was just bad taste or something worse. It has nothing to do > with exceptions, see above. Stop proliferating the confusion. > >> These days, #VE would be a much better exception to use instead (and >> it also has a defined mechanism to avoid reentrancy). > > #VE is not going to solve anything. > > The idea of OMG_DOS is to (opportunisticly) avoid that the guest (and > perhaps host) sit idle waiting for I/O until the fault has been > resolved. That makes sense as there might be enough other stuff to do > which does not depend on that particular page. If not then fine, the > guest will go idle. > > Thanks, > > tglx >
On Tue, Apr 07, 2020 at 09:48:02PM -0700, Andy Lutomirski wrote: > I’m fine with the flow being different. do_machine_check() could > have entirely different logic to decide the error in PV. Nope, do_machine_check() is already as ugly as it gets. I don't want any more crap in it. > But I think we should reuse the overall flow: kernel gets #MC with > RIP pointing to the offending instruction. If there’s an extable > entry that can handle memory failure, handle it. If it’s a user > access, handle it. If it’s an unrecoverable error because it was a > non-extable kernel access, oops or panic. > > The actual PV part could be extremely simple: the host just needs to > tell the guest “this #MC is due to memory failure at this guest > physical address”. No banks, no DIMM slot, no rendezvous crap > (LMCE), no other nonsense. It would be nifty if the host also told the > guest what the guest virtual address was if the host knows it. It better be a whole different path and a whole different vector. If you wanna keep it simple and apart from all of the other nonsense, then you can just as well use a completely different vector. Thx.
On Tue, Apr 07, 2020 at 06:49:03PM -0400, Vivek Goyal wrote: > > 2. Access to bad memory results in #MC. Sure, #MC is a turd, but it’s an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case. > > > > Should we differentiate between two error cases. In this case memory is > not bad. Just that page being asked for can't be faulted in anymore. And > another error case is real memory failure. So for the first case it > could be injected into guest using #PF or #VE or something paravirt > specific and real hardware failures take #MC path (if they are not > already doing so). For handling memory failures with guests there's this whole thing described Documentation/vm/hwpoison.rst and KVM has support for that.
Andy Lutomirski <luto@amacapital.net> writes: >> On Apr 7, 2020, at 3:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote: >> Inject #MC > > No, not what I meant. Host has two sane choices here IMO: > > 1. Tell the guest that the page is gone as part of the wakeup. No #PF or #MC. > > 2. Tell guest that it’s resolved and inject #MC when the guest > retries. The #MC is a real fault, RIP points to the right place, etc. Ok, that makes sense. >>> 1. Access to bad memory results in an async-page-not-present, except >>> that, it’s not deliverable, the guest is killed. >> >> That's incorrect. The proper reaction is a real #PF. Simply because this >> is part of the contract of sharing some file backed stuff between host >> and guest in a well defined "virtio" scenario and not a random access to >> memory which might be there or not. > > The problem is that the host doesn’t know when #PF is safe. It’s sort > of the same problem that async pf has now. The guest kernel could > access the problematic page in the middle of an NMI, under > pagefault_disable(), etc — getting #PF as a result of CPL0 access to a > page with a valid guest PTE is simply not part of the x86 > architecture. Fair enough. > Replace copy_to_user() with some access to a gup-ed mapping with no > extable handler and it doesn’t look so good any more. In this case the guest needs to die. > Of course, the guest will oops if this happens, but the guest needs to > be able to oops cleanly. #PF is too fragile for this because it’s not > IST, and #PF is the wrong thing anyway — #PF is all about > guest-virtual-to-guest-physical mappings. Heck, what would CR2 be? > The host might not even know the guest virtual address. It knows, but I can see your point. >>> 2. Access to bad memory results in #MC. Sure, #MC is a turd, but it’s >>> an *architectural* turd. By all means, have a nice simple PV mechanism >>> to tell the #MC code exactly what went wrong, but keep the overall >>> flow the same as in the native case. >> >> It's a completely different flow as you evaluate PV turd instead of >> analysing the MCE banks and the other error reporting facilities. > > I’m fine with the flow being different. do_machine_check() could have > entirely different logic to decide the error in PV. But I think we > should reuse the overall flow: kernel gets #MC with RIP pointing to > the offending instruction. If there’s an extable entry that can handle > memory failure, handle it. If it’s a user access, handle it. If it’s > an unrecoverable error because it was a non-extable kernel access, > oops or panic. > > The actual PV part could be extremely simple: the host just needs to > tell the guest “this #MC is due to memory failure at this guest > physical address”. No banks, no DIMM slot, no rendezvous crap (LMCE), > no other nonsense. It would be nifty if the host also told the guest > what the guest virtual address was if the host knows it. It does. The EPT violations store: - guest-linear address - guest-physical address That's also part of the #VE exception to which Paolo was referring. Thanks, tglx
Paolo Bonzini <pbonzini@redhat.com> writes: > On 08/04/20 01:21, Thomas Gleixner wrote: >>>> No. Async PF is not a real exception. It has interrupt semantics and it >>>> can only be injected when the guest has interrupts enabled. It's bad >>>> design. >>> >>> Page-ready async PF has interrupt semantics. >>> >>> Page-not-present async PF however does not have interrupt semantics, it >>> has to be injected immediately or not at all (falling back to host page >>> fault in the latter case). >> >> If interrupts are disabled in the guest then it is NOT injected and the >> guest is suspended. So it HAS interrupt semantics. Conditional ones, >> i.e. if interrupts are disabled, bail, if not then inject it. > > Interrupts can be delayed by TPR or STI/MOV SS interrupt window, async > page faults cannot (again, not the page-ready kind). Can we pretty please stop using the term async page fault? It's just wrong and causes more confusion than anything else. What this does is really what I called Opportunistic Make Guest Do Other Stuff. And it has neither true exception nor true interrupt semantics. It's a software event which is injected into the guest to let the guest do something else than waiting for the actual #PF cause to be resolved. It's part of a software protocol between host and guest. And it comes with restrictions: The Do Other Stuff event can only be delivered when guest IF=1. If guest IF=0 then the host has to suspend the guest until the situation is resolved. The 'Situation resolved' event must also wait for a guest IF=1 slot. > Page-not-present async page faults are almost a perfect match for the > hardware use of #VE (and it might even be possible to let the > processor deliver the exceptions). There are other advantages: > > - the only real problem with using #PF (with or without > KVM_ASYNC_PF_SEND_ALWAYS) seems to be the NMI reentrancy issue, which > would not be there for #VE. > > - #VE are combined the right way with other exceptions (the > benign/contributory/pagefault stuff) > > - adjusting KVM and Linux to use #VE instead of #PF would be less than > 100 lines of code. If you just want to solve Viveks problem, then its good enough. I.e. the file truncation turns the EPT entries into #VE convertible entries and the guest #VE handler can figure it out. This one can be injected directly by the hardware, i.e. you don't need a VMEXIT. If you want the opportunistic do other stuff mechanism, then #VE has exactly the same problems as the existing async "PF". It's not magicaly making that go away. One possible solution might be to make all recoverable EPT entries convertible and let the HW inject #VE for those. So the #VE handler in the guest would have to do: if (!recoverable()) { if (user_mode) send_signal(); else if (!fixup_exception()) die_hard(); goto done; } store_ve_info_in_pv_page(); if (!user_mode(regs) || !preemptible()) { hypercall_resolve_ept(can_continue = false); } else { init_completion(); hypercall_resolve_ept(can_continue = true); wait_for_completion(); } or something like that. The hypercall to resolve the EPT fail on the host acts on the can_continue argument. If false, it suspends the guest vCPU and only returns when done. If true it kicks the resolve process and returns to the guest which suspends the task and tries to do something else. The wakeup side needs to be a regular interrupt and cannot go through #VE. Thanks, tglx
On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote: > Page-not-present async page faults are almost a perfect match for the > hardware use of #VE (and it might even be possible to let the processor > deliver the exceptions). My "async" page fault knowledge is limited, but if the desired behavior is to reflect a fault into the guest for select EPT Violations, then yes, enabling EPT Violation #VEs in hardware is doable. The big gotcha is that KVM needs to set the suppress #VE bit for all EPTEs when allocating a new MMU page, otherwise not-present faults on zero-initialized EPTEs will get reflected. Attached a patch that does the prep work in the MMU. The VMX usage would be: kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT); when EPT Violation #VEs are enabled. It's 64-bit only as it uses stosq to initialize EPTEs. 32-bit could also be supported by doing memcpy() from a static page.
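A minimal sketch of what the prep work boils down to, assuming the kvm_mmu_set_spte_init_value() name Sean mentions; shadow_init_value and mmu_init_shadow_page() are illustrative here, not the attached patch itself:

    static u64 shadow_init_value;  /* 0 by default, VMX_EPT_SUPPRESS_VE_BIT with #VE on */

    void kvm_mmu_set_spte_init_value(u64 init_value)
    {
            shadow_init_value = init_value;
    }

    static void mmu_init_shadow_page(void *page)
    {
            /*
             * Fill the 512 EPTEs of a new MMU page with the configured
             * pattern instead of zero.  memset64() compiles to rep stosq
             * on x86-64, hence the 64-bit-only restriction; a 32-bit
             * variant could memcpy() from a pre-initialized static page.
             */
            memset64(page, shadow_init_value, PAGE_SIZE / sizeof(u64));
    }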
On Wed, Apr 08, 2020 at 03:01:58PM +0200, Thomas Gleixner wrote: > And it comes with restrictions: > > The Do Other Stuff event can only be delivered when guest IF=1. > > If guest IF=0 then the host has to suspend the guest until the > situation is resolved. > > The 'Situation resolved' event must also wait for a guest IF=1 slot. Moo, can we pretty please already kill that ALWAYS and IF nonsense? That results in that terrifyingly crap HLT loop. That needs to die with extreme prejudice. So the host only inject these OMFG_DOS things when the guest is in luserspace -- which it can see in the VMCS state IIRC. And then using #VE for the make-it-go signal is far preferred over the currentl #PF abuse. > > Page-not-present async page faults are almost a perfect match for the > > hardware use of #VE (and it might even be possible to let the > > processor deliver the exceptions). There are other advantages: > > > > - the only real problem with using #PF (with or without > > KVM_ASYNC_PF_SEND_ALWAYS) seems to be the NMI reentrancy issue, which > > would not be there for #VE. > > > > - #VE are combined the right way with other exceptions (the > > benign/contributory/pagefault stuff) > > > > - adjusting KVM and Linux to use #VE instead of #PF would be less than > > 100 lines of code. > > If you just want to solve Viveks problem, then its good enough. I.e. the > file truncation turns the EPT entries into #VE convertible entries and > the guest #VE handler can figure it out. This one can be injected > directly by the hardware, i.e. you don't need a VMEXIT. That sounds like something that doesn't actually need the whole 'async'/do-something-else-for-a-while crap, right? It's a #PF trap from kernel space where we need to report fail. > If you want the opportunistic do other stuff mechanism, then #VE has > exactly the same problems as the existing async "PF". It's not magicaly > making that go away. We need to somehow have the guest teach the host how to tell if it can inject that OMFG_DOS thing or not. Injecting it only to then instantly exit again is stupid and expensive. Clearly we don't want to expose preempt_count and make that ABI, but is there some way we can push a snippet of code to the host that instructs the host how to determine if it can sleep or not? I realize that pushing actual x86 .text is a giant security problem, so perhaps a snipped of BPF that the host can verify, which it can run on the guest image ? Make it a hard error (guest cpu dies) to inject the OMFG_DOS signal on a context that cannot put the task to sleep.
Peter Zijlstra <peterz@infradead.org> writes: > On Wed, Apr 08, 2020 at 03:01:58PM +0200, Thomas Gleixner wrote: >> And it comes with restrictions: >> >> The Do Other Stuff event can only be delivered when guest IF=1. >> >> If guest IF=0 then the host has to suspend the guest until the >> situation is resolved. >> >> The 'Situation resolved' event must also wait for a guest IF=1 slot. > > Moo, can we pretty please already kill that ALWAYS and IF nonsense? That > results in that terrifyingly crap HLT loop. That needs to die with > extreme prejudice. > > So the host only inject these OMFG_DOS things when the guest is in > luserspace -- which it can see in the VMCS state IIRC. And then using > #VE for the make-it-go signal is far preferred over the currentl #PF > abuse. Yes, but this requires software based injection. >> If you just want to solve Viveks problem, then its good enough. I.e. the >> file truncation turns the EPT entries into #VE convertible entries and >> the guest #VE handler can figure it out. This one can be injected >> directly by the hardware, i.e. you don't need a VMEXIT. > > That sounds like something that doesn't actually need the whole > 'async'/do-something-else-for-a-while crap, right? It's a #PF trap from > kernel space where we need to report fail. Fail or fixup via extable. >> If you want the opportunistic do other stuff mechanism, then #VE has >> exactly the same problems as the existing async "PF". It's not magicaly >> making that go away. > > We need to somehow have the guest teach the host how to tell if it can > inject that OMFG_DOS thing or not. Injecting it only to then instantly > exit again is stupid and expensive. Not if the injection is actually done by the hardware. Then the guest handles #VE and tells the host what to do. > Clearly we don't want to expose preempt_count and make that ABI, but is > there some way we can push a snippet of code to the host that instructs > the host how to determine if it can sleep or not? I realize that pushing > actual x86 .text is a giant security problem, so perhaps a snipped of > BPF that the host can verify, which it can run on the guest image ? *SHUDDER* > Make it a hard error (guest cpu dies) to inject the OMFG_DOS signal on a > context that cannot put the task to sleep. With the hardware based #VE and a hypercall which tells the host how to handle the EPT fixup (suspend the vcpu or let it continue and do the completion later) you don't have to play any games on the host. If the guest tells the host the wrong thing, then the guest will have to mop up the pieces. Thanks, tglx
On 08/04/20 17:34, Sean Christopherson wrote: > On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote: >> Page-not-present async page faults are almost a perfect match for the >> hardware use of #VE (and it might even be possible to let the processor >> deliver the exceptions). > > My "async" page fault knowledge is limited, but if the desired behavior is > to reflect a fault into the guest for select EPT Violations, then yes, > enabling EPT Violation #VEs in hardware is doable. The big gotcha is that > KVM needs to set the suppress #VE bit for all EPTEs when allocating a new > MMU page, otherwise not-present faults on zero-initialized EPTEs will get > reflected. > > Attached a patch that does the prep work in the MMU. The VMX usage would be: > > kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT); > > when EPT Violation #VEs are enabled. It's 64-bit only as it uses stosq to > initialize EPTEs. 32-bit could also be supported by doing memcpy() from > a static page. The complication is that (at least according to the current ABI) we would not want #VE to kick if the guest currently has IF=0 (and possibly CPL=0). But the ABI is not set in stone, and anyway the #VE protocol is a decent one and worth using as a base for whatever PV protocol we design. Paolo
Paolo Bonzini <pbonzini@redhat.com> writes: > On 08/04/20 17:34, Sean Christopherson wrote: >> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote: >>> Page-not-present async page faults are almost a perfect match for the >>> hardware use of #VE (and it might even be possible to let the processor >>> deliver the exceptions). >> >> My "async" page fault knowledge is limited, but if the desired behavior is >> to reflect a fault into the guest for select EPT Violations, then yes, >> enabling EPT Violation #VEs in hardware is doable. The big gotcha is that >> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new >> MMU page, otherwise not-present faults on zero-initialized EPTEs will get >> reflected. >> >> Attached a patch that does the prep work in the MMU. The VMX usage would be: >> >> kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT); >> >> when EPT Violation #VEs are enabled. It's 64-bit only as it uses stosq to >> initialize EPTEs. 32-bit could also be supported by doing memcpy() from >> a static page. > > The complication is that (at least according to the current ABI) we > would not want #VE to kick if the guest currently has IF=0 (and possibly > CPL=0). But the ABI is not set in stone, and anyway the #VE protocol is > a decent one and worth using as a base for whatever PV protocol we design. Forget the current pf async semantics (or the lack of). You really want to start from scratch and igore the whole thing. The charm of #VE is that the hardware can inject it and it's not nesting until the guest cleared the second word in the VE information area. If that word is not 0 then you get a regular vmexit where you suspend the vcpu until the nested problem is solved. So you really don't worry about the guest CPU state at all. The guest side #VE handler has to decide what it wants from the host depending on it's internal state: - Suspend me and resume once the EPT fail is solved - Let me park the failing task and tell me once you resolved the problem. That's pretty straight forward and avoids the whole nonsense which the current mess contains. It completely avoids the allocation stuff as well as you need to use a PV page where the guest copies the VE information to. The notification that a problem has been resolved needs to go through a separate vector which still has the IF=1 requirement obviously. Thanks, tglx
On Tue, Apr 07, 2020 at 09:48:02PM -0700, Andy Lutomirski wrote:
[..]
> It would be nifty if the host also told the guest what the guest virtual address was if the host knows it.
It would be good to know and send the guest virtual address as well. While
sending SIGBUS to guest user space, information about which access
triggered the SIGBUS will be useful.
I thought GUEST_LINEAR_ADDRESS provides the guest virtual address if the
EPT_VIOLATION_GLA_VALID bit is set, and it seems to work for my
simple test case. But when I try to read the Intel SDM, section 27.2 (VM
exits, EPT violations), I am not so sure.
Can somebody who understands this better please help me understand
what exactly GUEST_LINEAR_ADDRESS is supposed to contain during an
EPT violation? I assumed it is the guest virtual address and added a
patch to my RFC patch series.
https://lore.kernel.org/kvm/20200331194011.24834-3-vgoyal@redhat.com/
But I might have misunderstood it.
Vivek
On Wed, Apr 08, 2020 at 08:01:36PM +0200, Thomas Gleixner wrote: > Paolo Bonzini <pbonzini@redhat.com> writes: > > On 08/04/20 17:34, Sean Christopherson wrote: > >> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote: > >>> Page-not-present async page faults are almost a perfect match for the > >>> hardware use of #VE (and it might even be possible to let the processor > >>> deliver the exceptions). > >> > >> My "async" page fault knowledge is limited, but if the desired behavior is > >> to reflect a fault into the guest for select EPT Violations, then yes, > >> enabling EPT Violation #VEs in hardware is doable. The big gotcha is that > >> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new > >> MMU page, otherwise not-present faults on zero-initialized EPTEs will get > >> reflected. > >> > >> Attached a patch that does the prep work in the MMU. The VMX usage would be: > >> > >> kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT); > >> > >> when EPT Violation #VEs are enabled. It's 64-bit only as it uses stosq to > >> initialize EPTEs. 32-bit could also be supported by doing memcpy() from > >> a static page. > > > > The complication is that (at least according to the current ABI) we > > would not want #VE to kick if the guest currently has IF=0 (and possibly > > CPL=0). But the ABI is not set in stone, and anyway the #VE protocol is > > a decent one and worth using as a base for whatever PV protocol we design. > > Forget the current pf async semantics (or the lack of). You really want > to start from scratch and igore the whole thing. > > The charm of #VE is that the hardware can inject it and it's not nesting > until the guest cleared the second word in the VE information area. If > that word is not 0 then you get a regular vmexit where you suspend the > vcpu until the nested problem is solved. So IIUC, only one process on a vcpu could affort to relinquish cpu to another task. If next task also triggers EPT violation, that will result in VM exit (as previous #VE is not complete yet) and vcpu will be halted. > > So you really don't worry about the guest CPU state at all. The guest > side #VE handler has to decide what it wants from the host depending on > it's internal state: > > - Suspend me and resume once the EPT fail is solved > > - Let me park the failing task and tell me once you resolved the > problem. > > That's pretty straight forward and avoids the whole nonsense which the > current mess contains. It completely avoids the allocation stuff as well > as you need to use a PV page where the guest copies the VE information > to. > > The notification that a problem has been resolved needs to go through a > separate vector which still has the IF=1 requirement obviously. How is this vector decided between guest and host. Failure to fault in page will be communicated through same vector? Thanks Vivek
Vivek Goyal <vgoyal@redhat.com> writes: > On Wed, Apr 08, 2020 at 08:01:36PM +0200, Thomas Gleixner wrote: >> Forget the current pf async semantics (or the lack of). You really want >> to start from scratch and igore the whole thing. >> >> The charm of #VE is that the hardware can inject it and it's not nesting >> until the guest cleared the second word in the VE information area. If >> that word is not 0 then you get a regular vmexit where you suspend the >> vcpu until the nested problem is solved. > > So IIUC, only one process on a vcpu could affort to relinquish cpu to > another task. If next task also triggers EPT violation, that will result > in VM exit (as previous #VE is not complete yet) and vcpu will be > halted. No. The guest #VE handler reads VE information, stores it into a PV page which is used to communicate with the host, invokes the hypercall and clears word1 so a consecutive #VE can be handled. If the hypercall is telling the host to suspend the guest (e.g. because the exception hit an interrupt or preemption disabled region where scheduling is not possible) then the host suspends the vCPU until the EPT issue is fixed. Before that hypercall returns the state of the recursion prevention word is irrelevant as the vCPU is not running, so it's fine to clear it afterwards. If the hypercall is telling the host that it can schedule and do something else, then the word is cleared after the hypercall returns. This makes sure that the host has gathered the information before another VE can be injected. TBH, I really was positively surprised that this HW mechanism actually makes sense. It prevents the following situation: 1) Guest triggers a EPT fail 2) HW injects #VE sets VE info 3) Guest handles #VE and before being able to gather the VE info data it triggers another EPT fail 4) HW injects #VE sets VE info -> FAIL So if we clear the reentrancy protection after saving the info in the PV page and after the hypercall returns, it's guaranteed that the above results in: 1) Guest triggers a EPT fail 2) HW injects #VE sets VE info 3) Guest handles #VE and before being able to gather the VE info data and clearing the reentrancy protection it triggers another EPT fail 4) VMEXIT which needs to be handled by suspending the vCPU, solving the issue and resuming it, which allows it to handle the original fail #1 >> So you really don't worry about the guest CPU state at all. The guest >> side #VE handler has to decide what it wants from the host depending on >> it's internal state: >> >> - Suspend me and resume once the EPT fail is solved >> >> - Let me park the failing task and tell me once you resolved the >> problem. >> >> That's pretty straight forward and avoids the whole nonsense which the >> current mess contains. It completely avoids the allocation stuff as well >> as you need to use a PV page where the guest copies the VE information >> to. >> >> The notification that a problem has been resolved needs to go through a >> separate vector which still has the IF=1 requirement obviously. > > How is this vector decided between guest and host. That's either a fixed vector which then becomes ABI or the guest tells the host via some PV/hypercall interface that it can handle #VE and that the vector for notification is number X. > Failure to fault in page will be communicated through same > vector? The PV page which communicates the current and eventually pending EPT fails to the host is also the communication mechanism for the outcome. 
Lets assume that the PV page contains an array of guest/host communication data structures: struct ve_info { struct ve_hw_info hw_info; unsigned long state; struct rcu_wait wait; ); The PV page looks like this: struct ve_page { struct ve_info info[N_ENTRIES]; unsigned int guest_current; unsigned int host_current; unsigned long active[BITS_TO_LONGS(N_ENTRIES)]; }; The #VE handler does: struct ve_page *vp = this_cpu_ptr(&ve_page); struct ve_info *info; bool can_continue; idx = find_first_zero_bit(vp->active, N_ENTRIES); BUG_ON(idx >= N_ENTRIES); set_bit(idx, vp->active); info = vp->info + idx; copy_ve_info(&info->hw_info, ve_hwinfo); vp->guest_current = idx; if (test_and_set_thread_flag(TIF_IN_VE) || bitmap_full(vp->active, N_ENTRIES)) can_continue = false; else can_continue = user_mode(regs) || preemptible(); if (can_continue) { info->state = CAN_CONTINUE; hypercall_ve(); ve_hwinfo.reentrancy_protect = 0; rcuwait_wait_event(&info->wait, info->state != CAN_CONTINUE, TASK_IDLE); } else { info->state = SUSPEND_ME; hypercall_ve(); ve_hwinfo.reentrancy_protect = 0; } state = info->state; info->state = NONE; clear_bit(idx, vp->active); switch (state) { case RESOLVED: break; case FAIL: if (user_mode(regs)) send_signal(.....); else if (!fixup_exception()) panic("I'm toast"); break; default: panic("Confused"); } clear_thread_flag(TIF_IN_VE); Now the host completion does: struct ve_page *vp = vcpu->ve_page; struct ve_info *info = vp->info + idx; state = info->state; info->state = resolved_state; switch (state) { case SUSPEND_ME: resume_vcpu(vcpu); break; case CAN_CONTINUE: queue_completion(vcpu, idx); break; default: kill_guest(); } and the host completion injection which handles the queued completion when guest IF=0 does: struct ve_page *vp = vcpu->ve_page; vp->host_current = idx; inject_ve_complete(vcpu); The guest completion does: struct ve_page *vp = this_cpu_ptr(&ve_page); struct ve_info *info; info = vp->info + vp->host_current; rcuwait_wake_up(&info->wait); There are a bunch of life time issues to think about (not rocket science), but you get the idea. Thanks, tglx
Thomas Gleixner <tglx@linutronix.de> writes: > Vivek Goyal <vgoyal@redhat.com> writes: > > and the host completion injection which handles the queued completion > when guest IF=0 does: obviously IF=1 ... > > struct ve_page *vp = vcpu->ve_page; > > vp->host_current = idx; > inject_ve_complete(vcpu); > > The guest completion does: > > struct ve_page *vp = this_cpu_ptr(&ve_page); > struct ve_info *info; > > info = vp->info + vp->host_current; > rcuwait_wake_up(&info->wait); > > There are a bunch of life time issues to think about (not rocket > science), but you get the idea. > > Thanks, > > tglx
On Wed, Apr 8, 2020 at 11:01 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Paolo Bonzini <pbonzini@redhat.com> writes: > > On 08/04/20 17:34, Sean Christopherson wrote: > >> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote: > >>> Page-not-present async page faults are almost a perfect match for the > >>> hardware use of #VE (and it might even be possible to let the processor > >>> deliver the exceptions). > >> > >> My "async" page fault knowledge is limited, but if the desired behavior is > >> to reflect a fault into the guest for select EPT Violations, then yes, > >> enabling EPT Violation #VEs in hardware is doable. The big gotcha is that > >> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new > >> MMU page, otherwise not-present faults on zero-initialized EPTEs will get > >> reflected. > >> > >> Attached a patch that does the prep work in the MMU. The VMX usage would be: > >> > >> kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT); > >> > >> when EPT Violation #VEs are enabled. It's 64-bit only as it uses stosq to > >> initialize EPTEs. 32-bit could also be supported by doing memcpy() from > >> a static page. > > > > The complication is that (at least according to the current ABI) we > > would not want #VE to kick if the guest currently has IF=0 (and possibly > > CPL=0). But the ABI is not set in stone, and anyway the #VE protocol is > > a decent one and worth using as a base for whatever PV protocol we design. > > Forget the current pf async semantics (or the lack of). You really want > to start from scratch and igore the whole thing. > > The charm of #VE is that the hardware can inject it and it's not nesting > until the guest cleared the second word in the VE information area. If > that word is not 0 then you get a regular vmexit where you suspend the > vcpu until the nested problem is solved. Can you point me at where the SDM says this? Anyway, I see two problems with #VE, one big and one small. The small (or maybe small) one is that any fancy protocol where the guest returns from an exception by doing, logically: Hey I'm done; /* MOV somewhere, hypercall, MOV to CR4, whatever */ IRET; is fundamentally racy. After we say we're done and before IRET, we can be recursively reentered. Hi, NMI! The big problem is that #VE doesn't exist on AMD, and I really think that any fancy protocol we design should work on AMD. I have no problem with #VE being a nifty optimization to the protocol on Intel, but it should *work* without #VE. > > So you really don't worry about the guest CPU state at all. The guest > side #VE handler has to decide what it wants from the host depending on > it's internal state: > > - Suspend me and resume once the EPT fail is solved I'm not entirely convinced this is better than the HLT loop. It's *prettier*, but the HLT loop avoids an extra hypercall and has the potentially useful advantage that the guest can continue to process interrupts even if it is unable to make progress otherwise. Anyway, the whole thing can be made to work reasonably well without #VE, #MC or any other additional special exception, like this: First, when the guest accesses a page that is not immediately available (paged out or failed), the host attempts to deliver the "page not present -- try to do other stuff" event. This event has an associated per-vCPU data structure along these lines: struct page_not_present_data { u32 inuse; /* 1: the structure is in use. 
0: free */ u32 token; u64 gpa; u64 gva; /* if known, and there should be a way to indicate that it's not known. */ }; Only the host ever changes inuse from 0 to 1 and only the guest ever changes inuse from 1 to 0. The "page not present -- try to do other stuff" event has interrupt semantics -- it is only delivered if the vCPU can currently receive an interrupt. This means IF = 1 and STI and MOV SS shadows are not active. Arguably TPR should be checked too. It is also only delivered if page_not_present_data.inuse == 0 and if there are tokens available -- see below. If the event can be delivered, then page_not_present_data is filled out and the event is delivered. If the event is not delivered, then one of three things happens: a) If the page not currently known to be failed (e.g. paged out or the host simply does not know yet until it does some IO), then the vCPU goes to sleep until the host is ready. b) If the page is known to be failed, and no fancy #VE / #MC is available, then the guest is killed and the host logs an error. c) If some fancy recovery mechanism is implemented, which is optional, then the guest gets an appropriate fault. If a page_not_present event is delivered, then the host promises to eventually resolve it. Resolving it looks like this: struct page_not_present_resolution { u32 result; /* 0: guest should try again. 1: page is failed */ u32 token; }; struct page_not_present_resolutions { struct page_not_present_resolution events[N]; u32 head, tail; }; Only N page-not-presents can be outstanding and unresolved at a time. it is entirely legal for the host to write the resolution to the resolution list before delivering page-not-present. When a page-not-present is resolved, the host writes the outcome to the page_not_present_resolutions ring. If there is no space, this means that either the host or guest messed up (the host will not allocate more tokens than can fit in the ring) and the guest is killed. The host also sends the guest an interrupt. This is a totally normal interrupt. If the guest gets a "page is failed" resolution, the page is failed. If the guest accesses the failed page again, then the host will try to send a page-not-present event again. If there is no space in the ring, then the rules above are followed. This will allow the sensible cases of memory failure to be recovered by the guest without the introduction of any super-atomic faults. :) Obviously there is more than one way to rig up the descriptor ring. My proposal above is just one way to do it, not necessarily the best way.
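A sketch of the guest side of that resolution ring, built on the structures defined above (pnp_wake_token() is a placeholder for whatever mechanism finds and wakes the task sleeping on a token; N is the ring size from the struct definition):

    /* Regular interrupt handler for the "page-not-present resolved" vector. */
    static void pnp_resolution_interrupt(struct page_not_present_resolutions *r)
    {
            while (r->tail != READ_ONCE(r->head)) {
                    struct page_not_present_resolution *e = &r->events[r->tail % N];

                    /* result: 0 = retry the access, 1 = page is failed */
                    pnp_wake_token(e->token, e->result);

                    /* Publish the new tail so the host can reuse the slot. */
                    smp_store_release(&r->tail, r->tail + 1);
            }
    }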
On 08/04/20 15:01, Thomas Gleixner wrote: > > And it comes with restrictions: > > The Do Other Stuff event can only be delivered when guest IF=1. > > If guest IF=0 then the host has to suspend the guest until the > situation is resolved. > > The 'Situation resolved' event must also wait for a guest IF=1 slot. Additionally: - the do other stuff event must be delivered to the same CPU that is causing the host-side page fault - the do other stuff event provides a token that identifies the cause and the situation resolved event provides a matching token This stuff is why I think the do other stuff event looks very much like a #VE. But I think we're in violent agreement after all. > If you just want to solve Viveks problem, then its good enough. I.e. the > file truncation turns the EPT entries into #VE convertible entries and > the guest #VE handler can figure it out. This one can be injected > directly by the hardware, i.e. you don't need a VMEXIT. > > If you want the opportunistic do other stuff mechanism, then #VE has > exactly the same problems as the existing async "PF". It's not magicaly > making that go away. You can inject #VE from the hypervisor too, with PV magic to distinguish the two. However that's not necessarily a good idea because it makes it harder to switch to hardware delivery in the future. > One possible solution might be to make all recoverable EPT entries > convertible and let the HW inject #VE for those. > > So the #VE handler in the guest would have to do: > > if (!recoverable()) { > if (user_mode) > send_signal(); > else if (!fixup_exception()) > die_hard(); > goto done; > } > > store_ve_info_in_pv_page(); > > if (!user_mode(regs) || !preemptible()) { > hypercall_resolve_ept(can_continue = false); > } else { > init_completion(); > hypercall_resolve_ept(can_continue = true); > wait_for_completion(); > } > > or something like that. Yes, pretty much. The VE info can also be passed down to the hypercall as arguments. Paolo > The hypercall to resolve the EPT fail on the host acts on the > can_continue argument. > > If false, it suspends the guest vCPU and only returns when done. > > If true it kicks the resolve process and returns to the guest which > suspends the task and tries to do something else. > > The wakeup side needs to be a regular interrupt and cannot go through > #VE.
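If the VE info is handed to the host as hypercall arguments, the guest side could be as small as this sketch (KVM_HC_RESOLVE_EPT is a made-up hypercall number; kvm_hypercall3() is the existing guest-side helper):

    /* Tell the host which access faulted and whether this context can park
     * the task (can_continue = true) or the vCPU must be suspended. */
    static long ve_resolve_ept(u64 gpa, u64 gla, bool can_continue)
    {
            return kvm_hypercall3(KVM_HC_RESOLVE_EPT, gpa, gla, can_continue);
    }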
On 09/04/20 06:50, Andy Lutomirski wrote: > The big problem is that #VE doesn't exist on AMD, and I really think > that any fancy protocol we design should work on AMD. I have no > problem with #VE being a nifty optimization to the protocol on Intel, > but it should *work* without #VE. Yes and unfortunately AMD does not like to inject a non-existing exception. Intel only requires the vector to be <=31, but AMD wants the vector to correspond to an exception. However, software injection is always possible and AMD even suggests that you use software injection for ICEBP and, on older processors, the INT instruction. Paolo
On 09/04/2020 05:50, Andy Lutomirski wrote: > On Wed, Apr 8, 2020 at 11:01 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> Paolo Bonzini <pbonzini@redhat.com> writes: >>> On 08/04/20 17:34, Sean Christopherson wrote: >>>> On Wed, Apr 08, 2020 at 10:23:58AM +0200, Paolo Bonzini wrote: >>>>> Page-not-present async page faults are almost a perfect match for the >>>>> hardware use of #VE (and it might even be possible to let the processor >>>>> deliver the exceptions). >>>> My "async" page fault knowledge is limited, but if the desired behavior is >>>> to reflect a fault into the guest for select EPT Violations, then yes, >>>> enabling EPT Violation #VEs in hardware is doable. The big gotcha is that >>>> KVM needs to set the suppress #VE bit for all EPTEs when allocating a new >>>> MMU page, otherwise not-present faults on zero-initialized EPTEs will get >>>> reflected. >>>> >>>> Attached a patch that does the prep work in the MMU. The VMX usage would be: >>>> >>>> kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT); >>>> >>>> when EPT Violation #VEs are enabled. It's 64-bit only as it uses stosq to >>>> initialize EPTEs. 32-bit could also be supported by doing memcpy() from >>>> a static page. >>> The complication is that (at least according to the current ABI) we >>> would not want #VE to kick if the guest currently has IF=0 (and possibly >>> CPL=0). But the ABI is not set in stone, and anyway the #VE protocol is >>> a decent one and worth using as a base for whatever PV protocol we design. >> Forget the current pf async semantics (or the lack of). You really want >> to start from scratch and igore the whole thing. >> >> The charm of #VE is that the hardware can inject it and it's not nesting >> until the guest cleared the second word in the VE information area. If >> that word is not 0 then you get a regular vmexit where you suspend the >> vcpu until the nested problem is solved. > Can you point me at where the SDM says this? Vol3 25.5.6.1 Convertible EPT Violations > Anyway, I see two problems with #VE, one big and one small. The small > (or maybe small) one is that any fancy protocol where the guest > returns from an exception by doing, logically: > > Hey I'm done; /* MOV somewhere, hypercall, MOV to CR4, whatever */ > IRET; > > is fundamentally racy. After we say we're done and before IRET, we > can be recursively reentered. Hi, NMI! Correct. There is no way to atomically end the #VE handler. (This causes "fun" even when using #VE for its intended purpose.) ~Andrew
On 09/04/20 06:50, Andy Lutomirski wrote: > The small > (or maybe small) one is that any fancy protocol where the guest > returns from an exception by doing, logically: > > Hey I'm done; /* MOV somewhere, hypercall, MOV to CR4, whatever */ > IRET; > > is fundamentally racy. After we say we're done and before IRET, we > can be recursively reentered. Hi, NMI! That's possible in theory. In practice there would be only two levels of nesting, one for the original page being loaded and one for the tail of the #VE handler. The nested #VE would see IF=0, resolve the EPT violation synchronously and both handlers would finish. For the tail page to be swapped out again, leading to more nesting, the host's LRU must be seriously messed up. With IST it would be much messier, and I haven't quite understood why you believe the #VE handler should have an IST. Anyhow, apart from the above "small" issue, we have these separate parts: 1) deliver page-ready notifications via interrupt 2) page-in hypercall + deliver page-not-found notifications via #VE 3) propagation of host-side SIGBUS all of which have both a host and a guest part, and all of which make (more or less) sense independent of the other. Paolo
On 09/04/2020 13:47, Paolo Bonzini wrote: > On 09/04/20 06:50, Andy Lutomirski wrote: >> The small >> (or maybe small) one is that any fancy protocol where the guest >> returns from an exception by doing, logically: >> >> Hey I'm done; /* MOV somewhere, hypercall, MOV to CR4, whatever */ >> IRET; >> >> is fundamentally racy. After we say we're done and before IRET, we >> can be recursively reentered. Hi, NMI! > That's possible in theory. In practice there would be only two levels > of nesting, one for the original page being loaded and one for the tail > of the #VE handler. The nested #VE would see IF=0, resolve the EPT > violation synchronously and both handlers would finish. For the tail > page to be swapped out again, leading to more nesting, the host's LRU > must be seriously messed up. > > With IST it would be much messier, and I haven't quite understood why > you believe the #VE handler should have an IST. Any interrupt/exception which can possibly occur between a SYSCALL and re-establishing a kernel stack (several instructions), must be IST to avoid taking said exception on a user stack and being a trivial privilege escalation. In terms of using #VE in its architecturally-expected way, this can occur in general before the kernel stack is established, so must be IST for safety. Therefore, it doesn't really matter if KVM's paravirt use of #VE does respect the interrupt flag. It is not sensible to build a paravirt interface using #VE who's safety depends on never turning on hardware-induced #VE's. ~Andrew
On 09/04/20 16:13, Andrew Cooper wrote: > On 09/04/2020 13:47, Paolo Bonzini wrote: >> On 09/04/20 06:50, Andy Lutomirski wrote: >>> The small >>> (or maybe small) one is that any fancy protocol where the guest >>> returns from an exception by doing, logically: >>> >>> Hey I'm done; /* MOV somewhere, hypercall, MOV to CR4, whatever */ >>> IRET; >>> >>> is fundamentally racy. After we say we're done and before IRET, we >>> can be recursively reentered. Hi, NMI! >> That's possible in theory. In practice there would be only two levels >> of nesting, one for the original page being loaded and one for the tail >> of the #VE handler. The nested #VE would see IF=0, resolve the EPT >> violation synchronously and both handlers would finish. For the tail >> page to be swapped out again, leading to more nesting, the host's LRU >> must be seriously messed up. >> >> With IST it would be much messier, and I haven't quite understood why >> you believe the #VE handler should have an IST. > > Any interrupt/exception which can possibly occur between a SYSCALL and > re-establishing a kernel stack (several instructions), must be IST to > avoid taking said exception on a user stack and being a trivial > privilege escalation. Doh, of course. I always confuse SYSCALL and SYSENTER. > Therefore, it doesn't really matter if KVM's paravirt use of #VE does > respect the interrupt flag. It is not sensible to build a paravirt > interface using #VE who's safety depends on never turning on > hardware-induced #VE's. No, I think we wouldn't use a paravirt #VE at this point, we would use the real thing if available. It would still be possible to switch from the IST to the main kernel stack before writing 0 to the reentrancy word. Paolo
> On Apr 9, 2020, at 7:32 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 09/04/20 16:13, Andrew Cooper wrote: >>> On 09/04/2020 13:47, Paolo Bonzini wrote: >>> On 09/04/20 06:50, Andy Lutomirski wrote: >>>> The small >>>> (or maybe small) one is that any fancy protocol where the guest >>>> returns from an exception by doing, logically: >>>> >>>> Hey I'm done; /* MOV somewhere, hypercall, MOV to CR4, whatever */ >>>> IRET; >>>> >>>> is fundamentally racy. After we say we're done and before IRET, we >>>> can be recursively reentered. Hi, NMI! >>> That's possible in theory. In practice there would be only two levels >>> of nesting, one for the original page being loaded and one for the tail >>> of the #VE handler. The nested #VE would see IF=0, resolve the EPT >>> violation synchronously and both handlers would finish. For the tail >>> page to be swapped out again, leading to more nesting, the host's LRU >>> must be seriously messed up. >>> >>> With IST it would be much messier, and I haven't quite understood why >>> you believe the #VE handler should have an IST. >> >> Any interrupt/exception which can possibly occur between a SYSCALL and >> re-establishing a kernel stack (several instructions), must be IST to >> avoid taking said exception on a user stack and being a trivial >> privilege escalation. > > Doh, of course. I always confuse SYSCALL and SYSENTER. > >> Therefore, it doesn't really matter if KVM's paravirt use of #VE does >> respect the interrupt flag. It is not sensible to build a paravirt >> interface using #VE who's safety depends on never turning on >> hardware-induced #VE's. > > No, I think we wouldn't use a paravirt #VE at this point, we would use > the real thing if available. > > It would still be possible to switch from the IST to the main kernel > stack before writing 0 to the reentrancy word. > > Almost but not quite. We do this for NMI-from-usermode, and it’s ugly. But we can’t do this for NMI-from-kernel or #VE-from-kernel because there might not be a kernel stack. Trying to hack around this won’t be pretty. Frankly, I think that we shouldn’t even try to report memory failure to the guest if it happens with interrupts off. Just kill the guest cleanly and keep it simple. Or inject an intentionally unrecoverable IST exception.
On 09/04/20 17:03, Andy Lutomirski wrote:
>> No, I think we wouldn't use a paravirt #VE at this point, we would
>> use the real thing if available.
>>
>> It would still be possible to switch from the IST to the main
>> kernel stack before writing 0 to the reentrancy word.
>
> Almost but not quite. We do this for NMI-from-usermode, and it’s
> ugly. But we can’t do this for NMI-from-kernel or #VE-from-kernel
> because there might not be a kernel stack. Trying to hack around
> this won’t be pretty.
>
> Frankly, I think that we shouldn’t even try to report memory failure
> to the guest if it happens with interrupts off. Just kill the guest
> cleanly and keep it simple. Or inject an intentionally unrecoverable
> IST exception.

But it would be nice to use #VE for all host-side page faults, not just
for memory failure.

So the solution would be the same as for NMIs, duplicating the stack
frame and patching the outer handler's stack from the recursive #VE
(https://lwn.net/Articles/484932/). It's ugly but it's a known
ugliness.

Paolo
Hi, Tony. I'm adding you because, despite the fact that everyone in
this thread is allergic to #MC, this seems to be one of your favorite
topics :)

> On Apr 9, 2020, at 8:17 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 09/04/20 17:03, Andy Lutomirski wrote:
>>> No, I think we wouldn't use a paravirt #VE at this point, we would
>>> use the real thing if available.
>>>
>>> It would still be possible to switch from the IST to the main
>>> kernel stack before writing 0 to the reentrancy word.
>>
>> Almost but not quite. We do this for NMI-from-usermode, and it’s
>> ugly. But we can’t do this for NMI-from-kernel or #VE-from-kernel
>> because there might not be a kernel stack. Trying to hack around
>> this won’t be pretty.
>>
>> Frankly, I think that we shouldn’t even try to report memory failure
>> to the guest if it happens with interrupts off. Just kill the guest
>> cleanly and keep it simple. Or inject an intentionally unrecoverable
>> IST exception.
>
> But it would be nice to use #VE for all host-side page faults, not just
> for memory failure.
>
> So the solution would be the same as for NMIs, duplicating the stack
> frame and patching the outer handler's stack from the recursive #VE
> (https://lwn.net/Articles/484932/). It's ugly but it's a known
> ugliness.
>

Believe me, I know all about how ugly it is, since I’m the one who
fixed most of the bugs in the first few implementations.

And, before I write or ack any such thing for #VE, I want to know why.
What, exactly, is a sufficiently strong motivation for using #VE *at
all* that Linux should implement a #VE handler? As I see it, #VE has
several downsides:

1. Intel only.

2. Depending on precisely what it’s used for, literally any memory
access in the kernel can trigger it as a fault. This means that it
joins NMI and #MC (and, to a limited extent, #DB) in the horrible
super-atomic-happens-in-bad-contexts camp. IST is mandatory, and IST
is not so great.

3. Just like NMI and MCE, it comes with a fundamentally broken
don’t-recurse-me mechanism.

If we do support #VE, I would suggest we do it roughly like this. The
#VE handler is a regular IST entry -- there’s a single IST stack, and
#VE from userspace stack-switches to the regular kernel stack. The C
handler (do_virtualization_exception?) is permitted to panic if
something is horribly wrong, but is *not* permitted to clear the word
at byte 4 to re-enable #VE. Instead, it does something to trigger a
deferred re-enable. For example, it sends an IPI to self and the IPI
handler clears the magic re-enable flag.

There are definitely horrible corner cases here. For example, suppose
user code mmap()s some kind of failable memory (due to NVDIMM hardware
failures, truncation, whatever). Then user code points RBP at it and
we get a perf NMI. Now the perf code tries to copy_from_user_nmi() the
user stack and hits the failure. It gets #MC or #VE or some paravirt
thing. Now we’re in a situation where we got an IST exception in the
middle of NMI processing and we’re expected to do something intelligent
about it. Sure, we can invoke the extable handler, but it’s not really
clear how to recover if we somehow hit the same type of failure again
in the same NMI.

A model that could actually work, perhaps for #VE and #MC, is to have
the extable code do the re-enabling. So ex_handler_uaccess() and
ex_handler_mcsafe() will call something like rearm_memory_failure(),
and that will do whatever is needed to re-arm the specific memory
failure reporting mechanism in use.
But, before I touch that with a ten-foot pole, I want to know *why*. What's the benefit? Why do we want convertible EPT violations?
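As a rough sketch of the deferred re-enable idea described above -- entirely
hypothetical: the ve_info layout, the handle_ve_reason() helper and the
entry-point name are illustrative assumptions, and irq_work stands in for
the "IPI to self" -- the ordering constraint could look something like this:

/*
 * Hypothetical sketch, not a real ABI: the IST handler never clears the
 * reentrancy word itself.  It queues irq_work (which raises a self-IPI),
 * and the word is cleared only once we are back in a normal interrupt
 * context, off the IST stack.
 */
#include <linux/irq_work.h>
#include <linux/percpu.h>
#include <linux/compiler.h>
#include <linux/ptrace.h>
#include <linux/types.h>

struct ve_info {
	u32 exit_reason;
	u32 busy;		/* word at byte 4: nonzero blocks new #VE */
	u64 exit_qualification;
	u64 gva;
	u64 gpa;
};

static DEFINE_PER_CPU(struct ve_info, ve_info);
static DEFINE_PER_CPU(struct irq_work, ve_rearm_work);

/* Assumed to exist elsewhere; may panic if the situation is hopeless. */
void handle_ve_reason(struct pt_regs *regs, struct ve_info *ve);

static void ve_rearm(struct irq_work *work)
{
	/* Self-IPI context: it is now safe to accept the next #VE. */
	WRITE_ONCE(this_cpu_ptr(&ve_info)->busy, 0);
}

void ve_cpu_init(void)
{
	/* Hypothetical per-CPU bring-up hook. */
	init_irq_work(this_cpu_ptr(&ve_rearm_work), ve_rearm);
}

void do_virtualization_exception(struct pt_regs *regs)
{
	struct ve_info *ve = this_cpu_ptr(&ve_info);

	handle_ve_reason(regs, ve);

	/*
	 * Deliberately do NOT write 0 to ve->busy here: that would allow
	 * a nested #VE while we are still on the IST stack.  Defer it.
	 */
	irq_work_queue(this_cpu_ptr(&ve_rearm_work));
}

The extable variant would be analogous: ex_handler_uaccess() /
ex_handler_mcsafe() would trigger the same kind of deferred re-arm (the
rearm_memory_failure() mentioned above) instead of the exception handler
re-enabling the mechanism directly.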
On Wed, Apr 08, 2020 at 12:07:22AM +0200, Paolo Bonzini wrote:
> On 07/04/20 23:41, Andy Lutomirski wrote:
> > 2. Access to bad memory results in #MC. Sure, #MC is a turd, but
> > it’s an *architectural* turd. By all means, have a nice simple PV
> > mechanism to tell the #MC code exactly what went wrong, but keep the
> > overall flow the same as in the native case.
> >
> > I think I like #2 much better. It has another nice effect: a good
> > implementation will serve as a way to exercise the #MC code without
> > needing to muck with EINJ or with whatever magic Tony uses. The
> > average kernel developer does not have access to a box with testable
> > memory failure reporting.
>
> I prefer #VE, but I can see how #MC has some appeal.

I have spent some time looking at #MC and trying to figure out if we
can use it. I have encountered a couple of issues.

- Uncorrected Action Required machine checks are generated when poison
  is consumed. So kernel code and exception handling typically assume
  that an MCE can be encountered synchronously only on a load, not a
  store; stores don't generate an MCE (at least not an AR one, IIUC).
  If we were to use #MC, we would need to generate it on stores as
  well, and that requires changing the kernel's assumption that stores
  can't generate #MC (i.e. changing copy_to_user()/copy_from_user()
  and friends).

- A machine check is generated for poisoned memory, and this case is
  not exactly poisoning. It feels as if the memory has gone missing,
  and the failure might be temporary: if the file is truncated and
  then extended again, the next load/store to the same memory location
  will work just fine. My understanding is that sending #MC will mark
  that page poisoned, turning it into a sort of permanent failure.

I am less concerned about point 2, but I am not sure how to get past
the first issue.

Thanks
Vivek
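For reference, the load-side recovery that exists today follows roughly the
pattern below (a simplified sketch in the spirit of how pmem-style drivers
consume it, not actual driver code): memcpy_mcsafe() returns the number of
bytes left uncopied when poison is consumed from the *source*, and there is
no store-side counterpart, which is the first issue above.

#include <linux/string.h>
#include <linux/blk_types.h>

/*
 * Simplified sketch of the existing load-side #MC recovery pattern.
 * Recovery is only defined for a poisoned source of the copy; a faulting
 * store has no equivalent path.
 */
static blk_status_t read_possibly_poisoned(void *dst, const void *src,
					   size_t len)
{
	unsigned long uncopied;

	uncopied = memcpy_mcsafe(dst, src, len);	/* #MC-recoverable loads */
	if (uncopied)
		return BLK_STS_IOERR;			/* poison hit mid-copy */

	return BLK_STS_OK;
}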
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 93ab0cbd304e..e6f2aefa298b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -318,11 +318,26 @@ static void kvm_guest_cpu_init(void)
 
 	pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
 
-#ifdef CONFIG_PREEMPTION
-	pa |= KVM_ASYNC_PF_SEND_ALWAYS;
-#endif
 	pa |= KVM_ASYNC_PF_ENABLED;
 
+	/*
+	 * We do not set KVM_ASYNC_PF_SEND_ALWAYS. With the current
+	 * KVM paravirt ABI, the following scenario is possible:
+	 *
+	 * #PF: async page fault (KVM_PV_REASON_PAGE_NOT_PRESENT)
+	 *  NMI before CR2 or KVM_PF_REASON_PAGE_NOT_PRESENT
+	 *   NMI accesses user memory, e.g. due to perf
+	 *    #PF: normal page fault
+	 *     #PF reads CR2 and apf_reason -- apf_reason should be 0
+	 *
+	 * outer #PF reads CR2 and apf_reason -- apf_reason should be
+	 * KVM_PV_REASON_PAGE_NOT_PRESENT
+	 *
+	 * There is no possible way that both reads of CR2 and
+	 * apf_reason get the correct values. Fixing this would
+	 * require paravirt ABI changes.
+	 */
+
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))
 		pa |= KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
The ABI is broken and we cannot support it properly. Turn it off.

If this causes a meaningful performance regression for someone, KVM
can introduce an improved ABI that is supportable.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/kvm.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)