Message ID | 20110512154228.GA7943@fermat.math.technion.ac.il (mailing list archive)
---|---
State | New, archived
On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote: > So I guess my question is, and Avi and Gleb I'd love your comments about this > question: Is it really beneficial that I rewrite the "ugly" nested-VMX > injection code to be somewhat-ugly in exactly the same way that nested-SVM > injection code? Won't it be more beneficial to rewrite *both* codes to > be cleaner? This would probably mean changes to the common x86.c, that both > will use. For example, x86.c's injection code could check the nested case > itself, perhaps calling a special x86_op to handle the nested injection (exit, > set interrupt window, etc.) instead of calling the regular > interrupt_allowed/enable_irq_window and forcing those to be modified in > mysterious ways. > That is exactly what should be done and what I have in mind when I am asking to change VMX code to be SVM like. To achieve what you outlined above gradually we need to move common VMX and SVM logic into x86.c and then change the logic to be more nested friendly. If VMX will have different interrupt handling logic we will have to have additional step: making SVM and VMX code similar (so it will be possible to move it into x86.c). All I am asking is to make this step now, before merge, while the code is still actively developed. > Now that there's a is_guest_mode(vcpu) function, more nested-related code > can be moved to x86.c, to make both nested VMX and nested SVM code cleaner. > > Waiting to hear your opinions and suggestions, > -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/12/2011 06:57 PM, Gleb Natapov wrote: > On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote: > > So I guess my question is, and Avi and Gleb I'd love your comments about this > > question: Is it really beneficial that I rewrite the "ugly" nested-VMX > > injection code to be somewhat-ugly in exactly the same way that nested-SVM > > injection code? Won't it be more beneficial to rewrite *both* codes to > > be cleaner? This would probably mean changes to the common x86.c, that both > > will use. For example, x86.c's injection code could check the nested case > > itself, perhaps calling a special x86_op to handle the nested injection (exit, > > set interrupt window, etc.) instead of calling the regular > > interrupt_allowed/enable_irq_window and forcing those to be modified in > > mysterious ways. > > > That is exactly what should be done and what I have in mind when I am > asking to change VMX code to be SVM like. To achieve what you outlined > above gradually we need to move common VMX and SVM logic into x86.c > and then change the logic to be more nested friendly. If VMX will have > different interrupt handling logic we will have to have additional step: > making SVM and VMX code similar (so it will be possible to move it > into x86.c). All I am asking is to make this step now, before merge, > while the code is still actively developed. > I don't think it's fair to ask Nadav to do a unification right now. Or productive - there's a limit to the size of a patchset that can be carried outside. Also it needs to be done in consideration with future changes to interrupt injection, like using the svm interrupt queue to avoid an interrupt window exit. Are there vmx-only changes that you think can help?
On Thu, May 12, 2011 at 07:08:59PM +0300, Avi Kivity wrote: > On 05/12/2011 06:57 PM, Gleb Natapov wrote: > >On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote: > >> So I guess my question is, and Avi and Gleb I'd love your comments about this > >> question: Is it really beneficial that I rewrite the "ugly" nested-VMX > >> injection code to be somewhat-ugly in exactly the same way that nested-SVM > >> injection code? Won't it be more beneficial to rewrite *both* codes to > >> be cleaner? This would probably mean changes to the common x86.c, that both > >> will use. For example, x86.c's injection code could check the nested case > >> itself, perhaps calling a special x86_op to handle the nested injection (exit, > >> set interrupt window, etc.) instead of calling the regular > >> interrupt_allowed/enable_irq_window and forcing those to be modified in > >> mysterious ways. > >> > >That is exactly what should be done and what I have in mind when I am > >asking to change VMX code to be SVM like. To achieve what you outlined > >above gradually we need to move common VMX and SVM logic into x86.c > >and then change the logic to be more nested friendly. If VMX will have > >different interrupt handling logic we will have to have additional step: > >making SVM and VMX code similar (so it will be possible to move it > >into x86.c). All I am asking is to make this step now, before merge, > >while the code is still actively developed. > > > > I don't think it's fair to ask Nadav to do a unification right now. Definitely. And I am not asking for it! > Or productive - there's a limit to the size of a patchset that can > be carried outside. Also it needs to be done in consideration with > future changes to interrupt injection, like using the svm interrupt > queue to avoid an interrupt window exit. > > Are there vmx-only changes that you think can help? > I am asking for vmx-only change actually. To make interrupt handling logic the same as SVM. This will allow me or you or someone else to handle unification part later without rewriting VMX. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/12/2011 06:42 PM, Nadav Har'El wrote: > Our second conclusion (and I hope that I'm not offending anyone here) > is that the changes for L2 interrupt injection in both SVM and VMX are both > ugly - they are just ugly in different ways. Both modified the non-nested > code in strange places in strange and unexpected ways, and tried to circumvent > the usual code path in x86.c without touching x86.c. They just did this in > two slightly different ways, neither (I think) is inherently uglier than the > other: > > For accurate emulation (as I explain in the patch below), both codes need to > cause x86.c to change its normal behavior: It checks for interrupt_allowed() > and then (discovering that it isn't) enable_irq_window(). We want it to > instead exit to L1, and then enable the irq window on that. In the SVM code, > interrupt_allowed() is modified to always return false if nested, and > enable_irq_window() is modified to flag for an exit to L1 (which is performed > later) and turn on the interrupt window. In VMX, we modify the same places > but differently: In interrupt_allowed() we exit to L1 immediately (it's a > short operation, we didn't mind to do it in atomic context), and > enable_irq_window() doesn't need to be changed (it already runs in L1). I think that interrupt_allowed() should return true in L2 (if L1 has configured external interrupts to be trapped), and interrupt injection modified to cause an exit instead of queueing an interrupt. Note that on vmx, intercepted interrupt injection can take two different paths depending on whether the L1 wants interrupts acked or not. > Continuing to survey the difference between nested VMX and and SVM, there > were other different choices made besides the ones mentioned above. nested SVM > uses an additional trick, of skipping one round of running the guest, when > it discovered the need for an exit in the "wrong" place, so it can get to > the "right" place again. Nested VMX solved the same problems with other > mechanisms, like a separate piece of code for handling IDT_VECTORING_INFO, > and nested_run_pending. Some differences can also be explained by the different > design of (non-nested) vmx.c vs svm.c - e.g., svm_complete_interrupts() is > called during the handle_exit(), while vmx_complete_interrupts() is called > after handle_exit() has completed (in atomic context) - this is one of the > reasons the nested IDT_VECTORING_INFO path is different. > > I think that both solutions are far from being beautiful or easy to understand. > Nested SVM is perhaps slightly less ugly but also has a small performance cost > (with the extra vcpu_run iteration doing nothing) - and I think neither is > inherently better than the other. > > So I guess my question is, and Avi and Gleb I'd love your comments about this > question: Is it really beneficial that I rewrite the "ugly" nested-VMX > injection code to be somewhat-ugly in exactly the same way that nested-SVM > injection code? Won't it be more beneficial to rewrite *both* codes to > be cleaner? This would probably mean changes to the common x86.c, that both > will use. For example, x86.c's injection code could check the nested case > itself, perhaps calling a special x86_op to handle the nested injection (exit, > set interrupt window, etc.) instead of calling the regular > interrupt_allowed/enable_irq_window and forcing those to be modified in > mysterious ways. 
> > Now that there's a is_guest_mode(vcpu) function, more nested-related code > can be moved to x86.c, to make both nested VMX and nested SVM code cleaner. I am fine with committing as is. Later we can modify both vmx and svm to do the right thing (whatever that is), and later merge them into x86.c.
Hi, On Thu, May 12, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > That is exactly what should be done and what I have in mind when I am > asking to change VMX code to be SVM like. To achieve what you outlined > above gradually we need to move common VMX and SVM logic into x86.c > and then change the logic to be more nested friendly. If VMX will have > different interrupt handling logic we will have to have additional step: > making SVM and VMX code similar (so it will be possible to move it > into x86.c). But if my interpretation of the code is correct, SVM isn't much closer than VMX to the goal of moving this logic to x86.c. When some logic is moved there, both SVM and VMX code will need to change - perhaps even considerably. So how will it be helpful to make VMX behave exactly like SVM does now, when the latter will also need to change considerably? It sounds to me that working to move some nested-interrupt-injection related logic to x86.c is a worthy effort (and I'd be happy to start some discussion on how to best design it), but working to duplicate the exact idiosyncrasies of the current SVM implementation in the VMX code is not as productive. But as usual, I'm open to arguments (or dictums ;-)) that I'm wrong here. By the way, I hope that I'm being fair to the nested SVM implementation when I call some of the code there, after only a short review, idiosyncrasies. Basically I am working under the assumption that some of the modifications there (I gave examples in my previous post) were done in the way they were just to fit the mold of x86.c, and that it would have been possible to alter x86.c in a way that could make the nested SVM code simpler - and quite different (in the area of interrupt injection). > All I am asking is to make this step now, before merge, > while the code is still actively developed. The code will continue to be actively developed even after the merge :-) Nadav.
On Thu, May 12, 2011 at 07:31:15PM +0300, Nadav Har'El wrote: > Hi, > > On Thu, May 12, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > > That is exactly what should be done and what I have in mind when I am > > asking to change VMX code to be SVM like. To achieve what you outlined > > above gradually we need to move common VMX and SVM logic into x86.c > > and then change the logic to be more nested friendly. If VMX will have > > different interrupt handling logic we will have to have additional step: > > making SVM and VMX code similar (so it will be possible to move it > > into x86.c). > > But if my interpretation of the code is correct, SVM isn't much closer > than VMX to the goal of moving this logic to x86.c. When some logic is > moved there, both SVM and VMX code will need to change - perhaps even > considerably. So how will it be helpful to make VMX behave exactly like > SVM does now, when the latter will also need to change considerably? > SVM design is much close to the goal of moving the logic into x86.c because IIRC it does not bypass parsing of IDT vectoring info into arch independent structure. VMX code uses vmx->idt_vectoring_info directly. SVM is much close to working migration with nested guests for the same reason. > It sounds to me that working to move some nested-interrupt-injection related > logic to x86.c is a worthy effort (and I'd be happy to start some discussion > on how to best design it), but working to duplicate the exact idiosyncrasies > of the current SVM implementation in the VMX code is not as productive. > But as usual, I'm open to arguments (or dictums ;-)) that I'm wrong here. > > By the way, I hope that I'm being fair to the nested SVM implementation when > I call some of the code there, after only a short review, idiosyncrasies. > Basically I am working under the assumption that some of the modifications > there (I gave examples in my previous post) were done in the way they were > just to fit the mold of x86.c, and that it would have been possible to alter > x86.c in a way that could make the nested SVM code simpler - and quite > different (in the area of interrupt injection). I think (haven't looked at the code for a long time) it can benefit from additional x86 ops callbacks, but it would be silly to add one set of callbacks to support SVM way of doing things and another set for VMX, not because archs are so different that unification is impossible (they are not, they are very close in this area in fact), but because two implementations are different. > > > All I am asking is to make this step now, before merge, > > while the code is still actively developed. > > The code will continue to be actively developed even after the merge :-) > Amen! :) -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/12/2011 07:51 PM, Gleb Natapov wrote: > > > > But if my interpretation of the code is correct, SVM isn't much closer > > than VMX to the goal of moving this logic to x86.c. When some logic is > > moved there, both SVM and VMX code will need to change - perhaps even > > considerably. So how will it be helpful to make VMX behave exactly like > > SVM does now, when the latter will also need to change considerably? > > > SVM design is much close to the goal of moving the logic into x86.c > because IIRC it does not bypass parsing of IDT vectoring info into arch > independent structure. VMX code uses vmx->idt_vectoring_info directly. > SVM is much close to working migration with nested guests for the same > reason. Ah, yes. For live migration to work, all vmcb state must be accessible via vendor-independent accessors once an exit is completely handled. For example, GPRs are accessible via kvm_register_read(), and without nesting, interrupt state is stowed in the interrupt queue, but if you keep IDT_VECTORING_INFO live between exit and entry, you can lose it if you migrate at this point.
On Thu, May 12, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > Ah, yes. For live migration to work, all vmcb state must be accessible > via vendor-independent accessors once an exit is completely handled. > For example, GPRs are accessible via kvm_register_read(), and without > nesting, interrupt state is stowed in the interrupt queue, but if you > keep IDT_VECTORING_INFO live between exit and entry, you can lose it if > you migrate at this point. Hi, I can quite easily save this state in a different place which is saved - The easiest will just be to use vmcs12, which has place for exactly the fields we want to save (and they are rewritten anyway when we exit to L1). Avi, would you you like me use this sort of solution to avoid the extra state? Of course, considering that anyway, live migration with nested VMX probably still doesn't work for a dozen other reasons :( Or do you consider this not enough, and rather that it is necessary that nested VMX should use exactly the same logic as nested SVM does - namely, use tricks like SVM's "exit_required" instead of our different tricks? Nadav.
On Mon, May 16, 2011 at 02:11:40AM +0300, Nadav Har'El wrote: > On Thu, May 12, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > > Ah, yes. For live migration to work, all vmcb state must be accessible > > via vendor-independent accessors once an exit is completely handled. > > For example, GPRs are accessible via kvm_register_read(), and without > > nesting, interrupt state is stowed in the interrupt queue, but if you > > keep IDT_VECTORING_INFO live between exit and entry, you can lose it if > > you migrate at this point. > > Hi, I can quite easily save this state in a different place which is saved - > The easiest will just be to use vmcs12, which has place for exactly the fields > we want to save (and they are rewritten anyway when we exit to L1). > This will not address the problem that the state will not be visible to generic logic in x86.c. > Avi, would you you like me use this sort of solution to avoid the extra > state? Of course, considering that anyway, live migration with nested VMX > probably still doesn't work for a dozen other reasons :( > > Or do you consider this not enough, and rather that it is necessary that > nested VMX should use exactly the same logic as nested SVM does - namely, > use tricks like SVM's "exit_required" instead of our different tricks? > Given two solutions I prefer SVM one. Yes, I know that you asked Avi :) -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, May 16, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > > Hi, I can quite easily save this state in a different place which is saved - > > The easiest will just be to use vmcs12, which has place for exactly the fields > > we want to save (and they are rewritten anyway when we exit to L1). > > > This will not address the problem that the state will not be visible to > generic logic in x86.c. Maybe I misunderstood your intention, but given that vmcs12 is in guest memory, which is migrated as well, isn't that enough (for the live migration issue)?
On Mon, May 16, 2011 at 10:44:28AM +0300, Nadav Har'El wrote:
> On Mon, May 16, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> > > Hi, I can quite easily save this state in a different place which is saved -
> > > The easiest will just be to use vmcs12, which has place for exactly the fields
> > > we want to save (and they are rewritten anyway when we exit to L1).
> > >
> > This will not address the problem that the state will not be visible to
> > generic logic in x86.c.
>
> Maybe I misunderstood your intention, but given that vmcs12 is in guest
> memory, which is migrated as well, isn't that enough (for the live migration
> issue)?
>
I pointed out two issues. Migration was the second and minor one, since there
is a long way to go before migration will work with nested guests anyway. The
first one was much more important, so let me repeat it again. To move nested
event handling into generic code, IDT vectoring info has to be parsed into the
data structure that the event injection code in x86.c actually works with. And
that code does not manipulate vmx->idt_vectoring_info or its SVM analog
directly; it works with the event queue instead. SVM does this right, and
there is nothing I can see that prevents moving the SVM logic into x86.c. I
don't see how your VMX logic can be moved into x86.c as is, since it works on
internal VMX fields directly.

--
			Gleb.
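For readers following along, "parsing IDT vectoring info into the data
structure that x86.c works with" means roughly the following. This is a
simplified sketch modeled on svm_complete_interrupts()/__vmx_complete_interrupts(),
not the actual code: the function name is invented and the soft-interrupt and
NMI-unblocking cases are omitted.

/* Sketch: decode raw IDT-vectoring info into KVM's arch-independent event
 * queue so that the generic injection code (and userspace, for migration)
 * sees the pending event instead of a vendor-private field. */
static void decode_idt_vectoring_info(struct kvm_vcpu *vcpu,
				      u32 idt_vectoring_info, u32 error_code)
{
	u8 vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
	u32 type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;

	kvm_clear_exception_queue(vcpu);
	kvm_clear_interrupt_queue(vcpu);

	if (!(idt_vectoring_info & VECTORING_INFO_VALID_MASK))
		return;

	switch (type) {
	case INTR_TYPE_NMI_INTR:
		/* NMI delivery was interrupted; remember to re-inject it. */
		vcpu->arch.nmi_injected = true;
		break;
	case INTR_TYPE_HARD_EXCEPTION:
		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
			kvm_requeue_exception_e(vcpu, vector, error_code);
		else
			kvm_requeue_exception(vcpu, vector);
		break;
	case INTR_TYPE_EXT_INTR:
		kvm_queue_interrupt(vcpu, vector, false);
		break;
	}
}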
On 05/16/2011 02:11 AM, Nadav Har'El wrote: > On Thu, May 12, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > > Ah, yes. For live migration to work, all vmcb state must be accessible > > via vendor-independent accessors once an exit is completely handled. > > For example, GPRs are accessible via kvm_register_read(), and without > > nesting, interrupt state is stowed in the interrupt queue, but if you > > keep IDT_VECTORING_INFO live between exit and entry, you can lose it if > > you migrate at this point. > > Hi, I can quite easily save this state in a different place which is saved - > The easiest will just be to use vmcs12, which has place for exactly the fields > we want to save (and they are rewritten anyway when we exit to L1). You would still need ->valid_idt_vectoring_info to know you need special handling, no? > Avi, would you you like me use this sort of solution to avoid the extra > state? Of course, considering that anyway, live migration with nested VMX > probably still doesn't work for a dozen other reasons :( > > Or do you consider this not enough, and rather that it is necessary that > nested VMX should use exactly the same logic as nested SVM does - namely, > use tricks like SVM's "exit_required" instead of our different tricks? I think svm is rather simple here using svm_complete_interrupts() to decode exit_int_info into the arch independent structures. I don't think ->exit_required is a hack - it could probably be improved but I think it does the right thing essentially. For example svm_nmi_allowed() will return true if in guest mode and NMI interception is enabled.
On 05/16/2011 12:50 PM, Avi Kivity wrote: >> Or do you consider this not enough, and rather that it is necessary that >> nested VMX should use exactly the same logic as nested SVM does - >> namely, >> use tricks like SVM's "exit_required" instead of our different tricks? > > > I think svm is rather simple here using svm_complete_interrupts() to > decode exit_int_info into the arch independent structures. I don't > think ->exit_required is a hack - it could probably be improved but I > think it does the right thing essentially. For example > svm_nmi_allowed() will return true if in guest mode and NMI > interception is enabled. > It would be better if it just returned true and let svm_inject_nmi() do the vmexit.
On Thu, May 12, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> > But if my interpretation of the code is correct, SVM isn't much closer
> > than VMX to the goal of moving this logic to x86.c. When some logic is
> > moved there, both SVM and VMX code will need to change - perhaps even
> > considerably. So how will it be helpful to make VMX behave exactly like
> > SVM does now, when the latter will also need to change considerably?
> >
> SVM design is much close to the goal of moving the logic into x86.c
> because IIRC it does not bypass parsing of IDT vectoring info into arch
> independent structure. VMX code uses vmx->idt_vectoring_info directly.

At the risk of sounding blasphemous, I'd like to make the case that perhaps
the current nested-VMX design - regarding the IDT-vectoring-info-field
handling - is actually closer than nested-SVM to the goal of moving clean
nested-supporting logic into x86.c, instead of having ad-hoc, unnatural,
workarounds.

Let me explain, and see if you agree with my logic:

We discover at exit time whether the virtualization hardware (VMX or SVM)
exited while *delivering* an interrupt or exception to the current guest.
This is known as "idt-vectoring-information" in VMX.

What do we need to do with this idt-vectoring-information? In regular
(non-nested) guests, the answer is simple: On the next entry, we need to
inject this event again into the guest, so it can resume the delivery of the
same event it was trying to deliver. This is why the nested-unaware code has
a vmx_complete_interrupts which basically adds this idt-vectoring-info into
KVM's event queue, which on the next entry will be injected similarly to the
way virtual interrupts from userspace are injected, and so on.

But with nested virtualization, this is *not* what is supposed to happen - we
do not *always* need to inject the event to the guest. We will only need to
inject the event if the next entry will be again to the same guest, i.e., L1
after L1, or L2 after L2. If the idt-vectoring-info came from L2, but our
next entry will be into L1 (i.e., a nested exit), we *shouldn't* inject the
event as usual, but should rather pass this idt-vectoring-info field as the
exit information that L1 gets (in nested vmx terminology, in vmcs12).

However, at the time of exit, we cannot know for sure whether L2 will
actually run next, because it is still possible that an injection from user
space, before the next entry, will cause us to decide to exit to L1.

Therefore, I believe that the clean solution isn't to leave the original
non-nested logic that always queues the idt-vectoring-info assuming it will
be injected, and then if it shouldn't (because we want to exit during entry)
we need to skip the entry once as a "trick" to avoid this wrong injection.

Rather, a clean solution is, I think, to recognize that in nested
virtualization, idt-vectoring-info is a different kind of beast than regular
injected events, and it needs to be saved at exit time in a different field
(which will of course be common to SVM and VMX). Only at entry time, after
the regular injection code (which may cause a nested exit), we can call a
x86_op to handle this special injection.

The benefit of this approach, which is closer to the current vmx code, is, I
think, that x86.c will contain clear, self-explanatory nested logic, instead
of relying on vmx.c or svm.c circumventing various x86.c functions and
mechanisms to do something different from what they were meant to do.

What do you think?
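To illustrate the proposal (and only to illustrate it - the per-vcpu field
and callback names below are invented and do not exist in any posted patch),
the entry-time side in x86.c could look roughly like this:

/* Sketch of a nested-aware injection step in x86.c under this proposal. */
static void inject_pending_event(struct kvm_vcpu *vcpu)
{
	/* Normal injection first; for an interrupt that L1 intercepts, the
	 * vendor code would emulate the nested exit (L2 -> L1) here. */
	if (kvm_cpu_has_interrupt(vcpu) &&
	    kvm_x86_ops->interrupt_allowed(vcpu)) {
		kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false);
		kvm_x86_ops->set_irq(vcpu);
	}

	/*
	 * Only now do we know which guest the next entry enters.  If it is
	 * still L2, the saved idt-vectoring info is re-delivered to L2; if
	 * we switched to L1, it is folded into the exit information (vmcs12
	 * or the nested vmcb) instead of being injected.
	 * (nested_vectoring_info and ->complete_nested_injection are
	 * hypothetical names, used only for this sketch.)
	 */
	if (vcpu->arch.nested_vectoring_info.valid)
		kvm_x86_ops->complete_nested_injection(vcpu);
}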
On Sun, May 22, 2011 at 10:32:39PM +0300, Nadav Har'El wrote:
> At the risk of sounding blasphemous, I'd like to make the case that perhaps
> the current nested-VMX design - regarding the IDT-vectoring-info-field
> handling - is actually closer than nested-SVM to the goal of moving clean
> nested-supporting logic into x86.c, instead of having ad-hoc, unnatural,
> workarounds.

Well, the nested SVM implementation is certainly not perfect in this regard :)

> Therefore, I believe that the clean solution isn't to leave the original
> non-nested logic that always queues the idt-vectoring-info assuming it will
> be injected, and then if it shouldn't (because we want to exit during entry)
> we need to skip the entry once as a "trick" to avoid this wrong injection.
>
> Rather, a clean solution is, I think, to recognize that in nested
> virtualization, idt-vectoring-info is a different kind of beast than regular
> injected events, and it needs to be saved at exit time in a different field
> (which will of course be common to SVM and VMX). Only at entry time, after
> the regular injection code (which may cause a nested exit), we can call a
> x86_op to handle this special injection.

Things are complicated either way. If you keep the vectoring-info separate
from the kvm exception queue, you need special logic to combine the
vectoring-info and the queue. For example, imagine something is pending in
idt-vectoring info and the intercept causes another exception for the guest.
KVM then needs to turn this into a #DF. When we just queue the vectoring-info
into the exception queue we get this implicitly, without extra code. This is
the cleaner way imho.

On the other hand, when using the exception queue we need to keep extra
information for nesting in the queue, because an event which is just
re-injected into L2 must not cause a nested vmexit, even if the exception
vector is intercepted by L1. But this is the same for SVM and VMX, so we can
do it in generic x86 code. This is not the case when keeping track of
idt-vectoring info separately in architecture code.

Regards,

	Joerg
On 05/22/2011 10:32 PM, Nadav Har'El wrote: > On Thu, May 12, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > > > But if my interpretation of the code is correct, SVM isn't much closer > > > than VMX to the goal of moving this logic to x86.c. When some logic is > > > moved there, both SVM and VMX code will need to change - perhaps even > > > considerably. So how will it be helpful to make VMX behave exactly like > > > SVM does now, when the latter will also need to change considerably? > > > > > SVM design is much close to the goal of moving the logic into x86.c > > because IIRC it does not bypass parsing of IDT vectoring info into arch > > independent structure. VMX code uses vmx->idt_vectoring_info directly. > > At the risk of sounding blasphemous, I'd like to make the case that perhaps > the current nested-VMX design - regarding the IDT-vectoring-info-field > handling - is actually closer than nested-SVM to the goal of moving clean > nested-supporting logic into x86.c, instead of having ad-hoc, unnatural, > workarounds. > > Let me explain, and see if you agree with my logic: > > We discover at exit time whether the virtualization hardware (VMX or SVM) > exited while *delivering* an interrupt or exception to the current guest. > This is known as "idt-vectoring-information" in VMX. > > What do we need to do with this idt-vectoring-information? In regular (non- > nested) guests, the answer is simple: On the next entry, we need to inject > this event again into the guest, so it can resume the delivery of the > same event it was trying to deliver. This is why the nested-unaware code > has a vmx_complete_interrupts which basically adds this idt-vectoring-info > into KVM's event queue, which on the next entry will be injected similarly > to the way virtual interrupts from userspace are injected, and so on. The other thing we may need to do, is to expose it to userspace in case we're live migrating at exactly this point in time. > But with nested virtualization, this is *not* what is supposed to happen - > we do not *always* need to inject the event to the guest. We will only need > to inject the event if the next entry will be again to the same guest, i.e., > L1 after L1, or L2 after L2. If the idt-vectoring-info came from L2, but > our next entry will be into L1 (i.e., a nested exit), we *shouldn't* inject > the event as usual, but should rather pass this idt-vectoring-info field > as the exit information that L1 gets (in nested vmx terminology, in vmcs12). > > However, at the time of exit, we cannot know for sure whether L2 will actually > run next, because it is still possible that an injection from user space, > before the next entry, will cause us to decide to exit to L1. > > Therefore, I believe that the clean solution isn't to leave the original > non-nested logic that always queues the idt-vectoring-info assuming it will > be injected, and then if it shouldn't (because we want to exit during entry) > we need to skip the entry once as a "trick" to avoid this wrong injection. > > Rather, a clean solution is, I think, to recognize that in nested > virtualization, idt-vectoring-info is a different kind of beast than regular > injected events, and it needs to be saved at exit time in a different field > (which will of course be common to SVM and VMX). Only at entry time, after > the regular injection code (which may cause a nested exit), we can call a > x86_op to handle this special injection. 
>
> The benefit of this approach, which is closer to the current vmx code,
> is, I think, that x86.c will contain clear, self-explanatory nested logic,
> instead of relying on vmx.c or svm.c circumventing various x86.c functions
> and mechanisms to do something different from what they were meant to do.

IMO this will cause confusion, especially with the user interfaces used to
read/write pending events. I think what we need to do is:

1. change ->interrupt_allowed() to return true if the interrupt flag is
   unmasked OR if in a nested guest, and we're intercepting interrupts
2. change ->set_irq() to cause a nested vmexit if in a nested guest and
   we're intercepting interrupts
3. change ->nmi_allowed() and ->set_nmi() in a similar way
4. add a .injected flag to the interrupt queue which overrides the nested
   vmexit for VM_ENTRY_INTR_INFO_FIELD and the svm equivalent; if present
   normal injection takes place (or an error vmexit if the interrupt flag
   is clear and we cannot inject)
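A rough, untested sketch of what items (1) and (2) could look like on the vmx
side, reusing the nested_exit_on_intr() helper from the nVMX patches; item
(4)'s .injected override and the ACK_INTR_ON_EXIT case are deliberately left
out:

static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
{
	/* (1) An interrupt that L1 intercepts is always deliverable while L2
	 * runs: it does not depend on L2's RFLAGS.IF, it causes a vmexit. */
	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
		return 1;

	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
		  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}

static void vmx_inject_irq(struct kvm_vcpu *vcpu)
{
	/* (2) Instead of queueing the interrupt into L2, emulate the
	 * external-interrupt exit so that L1 gets to handle it. */
	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
		nested_vmx_vmexit(vcpu);
		get_vmcs12(vcpu)->vm_exit_reason =
			EXIT_REASON_EXTERNAL_INTERRUPT;
		return;
	}

	/* ... existing VM_ENTRY_INTR_INFO_FIELD injection path ... */
}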
On Mon, May 23, 2011 at 12:52:50PM +0300, Avi Kivity wrote: > On 05/22/2011 10:32 PM, Nadav Har'El wrote: >> What do we need to do with this idt-vectoring-information? In regular (non- >> nested) guests, the answer is simple: On the next entry, we need to inject >> this event again into the guest, so it can resume the delivery of the >> same event it was trying to deliver. This is why the nested-unaware code >> has a vmx_complete_interrupts which basically adds this idt-vectoring-info >> into KVM's event queue, which on the next entry will be injected similarly >> to the way virtual interrupts from userspace are injected, and so on. > > The other thing we may need to do, is to expose it to userspace in case > we're live migrating at exactly this point in time. About live-migration with nesting, we had discussed the idea of just doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate. The problem was that the hypervisor may not expect an INTR intercept. How about doing an implicit VMEXIT in this case and an implicit VMRUN after the vcpu is migrated? The nested hypervisor will not see the vmexit and the vcpu will be in a state where it is safe to migrate. This should work for nested-vmx too if the guest-state is written back to guest memory on VMEXIT. Is this the case? Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/23/2011 04:02 PM, Joerg Roedel wrote: > On Mon, May 23, 2011 at 12:52:50PM +0300, Avi Kivity wrote: > > On 05/22/2011 10:32 PM, Nadav Har'El wrote: > >> What do we need to do with this idt-vectoring-information? In regular (non- > >> nested) guests, the answer is simple: On the next entry, we need to inject > >> this event again into the guest, so it can resume the delivery of the > >> same event it was trying to deliver. This is why the nested-unaware code > >> has a vmx_complete_interrupts which basically adds this idt-vectoring-info > >> into KVM's event queue, which on the next entry will be injected similarly > >> to the way virtual interrupts from userspace are injected, and so on. > > > > The other thing we may need to do, is to expose it to userspace in case > > we're live migrating at exactly this point in time. > > About live-migration with nesting, we had discussed the idea of just > doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate. > The problem was that the hypervisor may not expect an INTR intercept. > > How about doing an implicit VMEXIT in this case and an implicit VMRUN > after the vcpu is migrated? What if there's something in EXIT_INT_INFO? > The nested hypervisor will not see the > vmexit and the vcpu will be in a state where it is safe to migrate. This > should work for nested-vmx too if the guest-state is written back to > guest memory on VMEXIT. Is this the case? It is the case with the current implementation, and we can/should make it so in future implementations, just before exit to userspace. Or at least provide an ABI to sync memory. But I don't see why we shouldn't just migrate all the hidden state (in guest mode flag, svm host paging mode, svm host interrupt state, vmcb address/vmptr, etc.). It's more state, but no thinking is involved, so it's clearly superior.
On Mon, May 23, 2011, Joerg Roedel wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> About live-migration with nesting, we had discussed the idea of just
> doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
> The problem was that the hypervisor may not expect an INTR intercept.
>
> How about doing an implicit VMEXIT in this case and an implicit VMRUN
> after the vcpu is migrated? The nested hypervisor will not see the
> vmexit and the vcpu will be in a state where it is safe to migrate. This
> should work for nested-vmx too if the guest-state is written back to
> guest memory on VMEXIT. Is this the case?

Indeed, on nested exit (L2 to L1), the L2 guest state is written back to
vmcs12 (in guest memory). In theory, at that point, the vmcs02 (the vmcs used
by L0 to actually run L2) can be discarded without risking losing anything.

The receiving hypervisor will need to remember to do that implicit VMRUN when
it starts the guest; it also needs to know what the current L2 guest is - in
VMX this would be vmx->nested.current_vmptr, which needs to be migrated as
well (on the other hand, other variables like vmx->nested.current_vmcs12 will
need to be recalculated by the receiver, and not migrated as-is).

I haven't started considering how to wrap up all these pieces into a complete
working solution - it is one of the things on my TODO list after the basic
nested VMX is merged.
On Mon, May 23, 2011 at 04:08:00PM +0300, Avi Kivity wrote: > On 05/23/2011 04:02 PM, Joerg Roedel wrote: >> About live-migration with nesting, we had discussed the idea of just >> doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate. >> The problem was that the hypervisor may not expect an INTR intercept. >> >> How about doing an implicit VMEXIT in this case and an implicit VMRUN >> after the vcpu is migrated? > > What if there's something in EXIT_INT_INFO? On real SVM hardware EXIT_INT_INFO should only contain something for exception and npt intercepts. These are all handled in the kernel and do not cause an exit to user-space so that no valid EXIT_INT_INFO should be around when we actually go back to user-space (so that migration can happen). The exception might be the #PF/NPT intercept when the guest is doing very obscure things like putting an exception/interrupt handler on mmio memory, but that isn't really supported by KVM anyway so I doubt we should care. Unless I miss something here we should be safe by just not looking at EXIT_INT_INFO while migrating. >> The nested hypervisor will not see the >> vmexit and the vcpu will be in a state where it is safe to migrate. This >> should work for nested-vmx too if the guest-state is written back to >> guest memory on VMEXIT. Is this the case? > > It is the case with the current implementation, and we can/should make > it so in future implementations, just before exit to userspace. Or at > least provide an ABI to sync memory. > > But I don't see why we shouldn't just migrate all the hidden state (in > guest mode flag, svm host paging mode, svm host interrupt state, vmcb > address/vmptr, etc.). It's more state, but no thinking is involved, so > it's clearly superior. An issue is that there is different state to migrate for Intel and AMD hosts. If we keep all that information in guest memory the kvm kernel module can handle those details and all KVM needs to migrate is the in-guest-mode flag and the gpa of the vmcb/vmcs which is currently executed. This state should be enough for Intel and AMD nesting. The next benefit is that it works seemlessly even if the state that needs to be transfered is extended (e.g. by emulating a new virtualization hardware feature). This support can be implemented in the kernel module and no changes to qemu are required. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/23/2011 04:40 PM, Joerg Roedel wrote: > On Mon, May 23, 2011 at 04:08:00PM +0300, Avi Kivity wrote: > > On 05/23/2011 04:02 PM, Joerg Roedel wrote: > > >> About live-migration with nesting, we had discussed the idea of just > >> doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate. > >> The problem was that the hypervisor may not expect an INTR intercept. > >> > >> How about doing an implicit VMEXIT in this case and an implicit VMRUN > >> after the vcpu is migrated? > > > > What if there's something in EXIT_INT_INFO? > > On real SVM hardware EXIT_INT_INFO should only contain something for > exception and npt intercepts. These are all handled in the kernel and do > not cause an exit to user-space so that no valid EXIT_INT_INFO should be > around when we actually go back to user-space (so that migration can > happen). > > The exception might be the #PF/NPT intercept when the guest is doing > very obscure things like putting an exception/interrupt handler on mmio > memory, but that isn't really supported by KVM anyway so I doubt we > should care. > > Unless I miss something here we should be safe by just not looking at > EXIT_INT_INFO while migrating. Agree. > >> The nested hypervisor will not see the > >> vmexit and the vcpu will be in a state where it is safe to migrate. This > >> should work for nested-vmx too if the guest-state is written back to > >> guest memory on VMEXIT. Is this the case? > > > > It is the case with the current implementation, and we can/should make > > it so in future implementations, just before exit to userspace. Or at > > least provide an ABI to sync memory. > > > > But I don't see why we shouldn't just migrate all the hidden state (in > > guest mode flag, svm host paging mode, svm host interrupt state, vmcb > > address/vmptr, etc.). It's more state, but no thinking is involved, so > > it's clearly superior. > > An issue is that there is different state to migrate for Intel and AMD > hosts. If we keep all that information in guest memory the kvm kernel > module can handle those details and all KVM needs to migrate is the > in-guest-mode flag and the gpa of the vmcb/vmcs which is currently > executed. This state should be enough for Intel and AMD nesting. I think for Intel there is no hidden state apart from in-guest-mode (there is the VMPTR, but it is an actual register accessible via instructions). For svm we can keep the hidden state in the host state-save area (including the vmcb pointer). The only risk is that svm will gain hardware support for nesting, and will choose a different format than ours. An alternative is a fake MSR for storing this data, or just another get/set ioctl pair. We'll have a flags field that says which fields are filled in. > The next benefit is that it works seemlessly even if the state that > needs to be transfered is extended (e.g. by emulating a new > virtualization hardware feature). This support can be implemented in the > kernel module and no changes to qemu are required. I agree it's a benefit. But I don't like making the fake vmexit part of live migration, if it turns out the wrong choice it's hard to undo it.
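For the record, a purely hypothetical layout of such a get/set pair's
payload - none of these names are part of the KVM ABI; the point is only the
flags field saying which pieces of hidden state were filled in:

/* Illustrative only: a userspace-visible container for nested state. */
struct kvm_nested_state {
	__u32	flags;
#define KVM_NESTED_STATE_GUEST_MODE	(1 << 0)  /* vcpu was running L2 */
#define KVM_NESTED_STATE_RUN_PENDING	(1 << 1)  /* VMRUN/VMLAUNCH emulated, L2 not yet entered */
	__u32	format;		/* 0 = vmx-style payload, 1 = svm-style */
	__u64	current_vmptr;	/* gpa of the current vmcs12 / nested vmcb */
	__u64	host_save_area;	/* svm only: where host state was stowed */
	__u64	reserved[8];	/* room for future hardware/implementation needs */
};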
On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > I think for Intel there is no hidden state apart from in-guest-mode > (there is the VMPTR, but it is an actual register accessible via > instructions). is_guest_mode(vcpu), vmx->nested.vmxon, vmx->nested.current_vmptr are the only three things I can think of. Vmxon is actually more than a boolean (there's also a vmxon pointer). What do you mean by the current_vmptr being available through an instruction? It is (VMPTRST), but this would be an instruction run on L1 (emulated by L0). How would L0's user space use that instruction? > I agree it's a benefit. But I don't like making the fake vmexit part of > live migration, if it turns out the wrong choice it's hard to undo it. If you don't do this "fake vmexit", you'll need to migrate both vmcs01 and the current vmcs02 - the fact that vmcs12 is in guest memory will not be enough, because vmcs02 isn't copied back to vmcs12 until the nested exit.
On Mon, May 23, 2011 at 04:52:47PM +0300, Avi Kivity wrote: > On 05/23/2011 04:40 PM, Joerg Roedel wrote: >> The next benefit is that it works seemlessly even if the state that >> needs to be transfered is extended (e.g. by emulating a new >> virtualization hardware feature). This support can be implemented in the >> kernel module and no changes to qemu are required. > > I agree it's a benefit. But I don't like making the fake vmexit part of > live migration, if it turns out the wrong choice it's hard to undo it. Well, saving the state to the host-save-area and doing a fake-vmexit is logically the same, only the memory where the information is stored differs. To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which does all the necessary things. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/23/2011 05:10 PM, Nadav Har'El wrote: > On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > > I think for Intel there is no hidden state apart from in-guest-mode > > (there is the VMPTR, but it is an actual register accessible via > > instructions). > > is_guest_mode(vcpu), vmx->nested.vmxon, vmx->nested.current_vmptr are the > only three things I can think of. Vmxon is actually more than a boolean > (there's also a vmxon pointer). > > What do you mean by the current_vmptr being available through an instruction? > It is (VMPTRST), but this would be an instruction run on L1 (emulated by L0). > How would L0's user space use that instruction? I mean that it is an architectural register rather than "hidden state". It doesn't mean that L0 user space can use it. > > I agree it's a benefit. But I don't like making the fake vmexit part of > > live migration, if it turns out the wrong choice it's hard to undo it. > > If you don't do this "fake vmexit", you'll need to migrate both vmcs01 and > the current vmcs02 - the fact that vmcs12 is in guest memory will not be > enough, because vmcs02 isn't copied back to vmcs12 until the nested exit. > vmcs01 and vmcs02 will both be generated from vmcs12.
On 05/23/2011 05:28 PM, Joerg Roedel wrote: > On Mon, May 23, 2011 at 04:52:47PM +0300, Avi Kivity wrote: > > On 05/23/2011 04:40 PM, Joerg Roedel wrote: > > >> The next benefit is that it works seemlessly even if the state that > >> needs to be transfered is extended (e.g. by emulating a new > >> virtualization hardware feature). This support can be implemented in the > >> kernel module and no changes to qemu are required. > > > > I agree it's a benefit. But I don't like making the fake vmexit part of > > live migration, if it turns out the wrong choice it's hard to undo it. > > Well, saving the state to the host-save-area and doing a fake-vmexit is > logically the same, only the memory where the information is stored > differs. Right. I guess the main difference is "info registers" after a stop. > To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which > does all the necessary things. > Or we can automatically flush things on any exit to userspace. They should be very rare in guest mode.
On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> vmcs01 and vmcs02 will both be generated from vmcs12.
If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated
from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
non-trapped bits of guest_cr0), and these modifications are not copied back
to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
this task).
If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then
indeed vmcs02 can be thrown away and regenerated.
Nadav.
On Mon, May 23, 2011 at 05:34:20PM +0300, Avi Kivity wrote: > On 05/23/2011 05:28 PM, Joerg Roedel wrote: >> To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which >> does all the necessary things. > > Or we can automatically flush things on any exit to userspace. They > should be very rare in guest mode. This would make nesting mostly transparent to migration, so it sounds good in this regard. I do not completly agree that user-space exits in guest-mode are rare, this depends on the hypervisor in the L1. In Hyper-V for example the root-domain uses hardware virtualization too and has direct access to devices (at least to some degree). IOIO is not intercepted in the root-domain, for example. Not sure about the MMIO regions. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/23/2011 05:58 PM, Joerg Roedel wrote: > On Mon, May 23, 2011 at 05:34:20PM +0300, Avi Kivity wrote: > > On 05/23/2011 05:28 PM, Joerg Roedel wrote: > > >> To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which > >> does all the necessary things. > > > > Or we can automatically flush things on any exit to userspace. They > > should be very rare in guest mode. > > This would make nesting mostly transparent to migration, so it sounds > good in this regard. > > I do not completly agree that user-space exits in guest-mode are rare, > this depends on the hypervisor in the L1. In Hyper-V for example the > root-domain uses hardware virtualization too and has direct access to > devices (at least to some degree). IOIO is not intercepted in the > root-domain, for example. Not sure about the MMIO regions. Good point. We were also talking about passing through virtio (or even host) devices to the guest. So an ioctl to flush volatile state to memory would be a good idea.
On 05/23/2011 05:44 PM, Nadav Har'El wrote: > On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > > vmcs01 and vmcs02 will both be generated from vmcs12. > > If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated > from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g., > non-trapped bits of guest_cr0), and these modifications are not copied back > to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform > this task). > > If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then > indeed vmcs02 can be thrown away and regenerated. You would flush this state back to the vmcs. But that just confirms Joerg's statement that a fake vmexit/vmrun is more or less equivalent. The question is whether %rip points to the VMRUN/VMLAUNCH instruction, HOST_RIP (or the next instruction for svm), or to guest code. But the actual things we need to do are all very similar subsets of a vmexit.
On 23.05.2011, at 17:23, Avi Kivity wrote: > On 05/23/2011 05:44 PM, Nadav Har'El wrote: >> On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": >> > vmcs01 and vmcs02 will both be generated from vmcs12. >> >> If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated >> from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g., >> non-trapped bits of guest_cr0), and these modifications are not copied back >> to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform >> this task). >> >> If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then >> indeed vmcs02 can be thrown away and regenerated. > > You would flush this state back to the vmcs. But that just confirms Joerg's statement that a fake vmexit/vmrun is more or less equivalent. > > The question is whether %rip points to the VMRUN/VMLAUNCH instruction, HOST_RIP (or the next instruction for svm), or to guest code. But the actual things we need to do are all very similar subsets of a vmexit. %rip should certainly point to VMRUN. That way there is no need to save any information whatsoever, as the VMCB is already in sane state and nothing needs to be special cased, as the next VCPU_RUN would simply go back into guest mode - which is exactly what we want. The only tricky part is how we distinguish between "I need to live migrate" and "info registers". In the former case, %rip should be on VMRUN. In the latter, on the guest rip. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/23/2011 09:06 PM, Alexander Graf wrote: > On 23.05.2011, at 17:23, Avi Kivity wrote: > > > On 05/23/2011 05:44 PM, Nadav Har'El wrote: > >> On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": > >> > vmcs01 and vmcs02 will both be generated from vmcs12. > >> > >> If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated > >> from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g., > >> non-trapped bits of guest_cr0), and these modifications are not copied back > >> to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform > >> this task). > >> > >> If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then > >> indeed vmcs02 can be thrown away and regenerated. > > > > You would flush this state back to the vmcs. But that just confirms Joerg's statement that a fake vmexit/vmrun is more or less equivalent. > > > > The question is whether %rip points to the VMRUN/VMLAUNCH instruction, HOST_RIP (or the next instruction for svm), or to guest code. But the actual things we need to do are all very similar subsets of a vmexit. > > %rip should certainly point to VMRUN. That way there is no need to save any information whatsoever, as the VMCB is already in sane state and nothing needs to be special cased, as the next VCPU_RUN would simply go back into guest mode - which is exactly what we want. > > The only tricky part is how we distinguish between "I need to live migrate" and "info registers". In the former case, %rip should be on VMRUN. In the latter, on the guest rip. We can split vmrun emulation into "save host state, load guest state" and "prepare nested vmcb". Then, when we load registers, if we see that we're in guest mode, we do just the "prepare nested vmcb" bit. This way register state is always nested guest state.
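A sketch of that split for svm.c - the helper names here are invented and do
not match the current nested_svm_vmrun() code; this only illustrates where
the two phases would be called from:

/* Phase 1: architectural side effects of VMRUN (save host state to the host
 * save area, switch to the nested guest's register state).
 * Phase 2: build the vmcb that KVM actually runs for L2. */
static int nested_svm_vmrun(struct vcpu_svm *svm)
{
	nested_svm_save_host_load_guest(svm);	/* phase 1 (hypothetical name) */
	return nested_svm_prepare_vmcb02(svm);	/* phase 2 (hypothetical name) */
}

/* Hypothetical hook run after userspace restores registers, e.g. on the
 * migration target: phase 1 already happened on the source, so guest-mode
 * state only needs phase 2 to be rebuilt. */
static int svm_set_regs_postload(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	if (is_guest_mode(vcpu))
		return nested_svm_prepare_vmcb02(svm);
	return 0;
}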
On Tue, May 24, 2011 at 02:09:00PM +0300, Avi Kivity wrote: > On 05/23/2011 09:06 PM, Alexander Graf wrote: >> On 23.05.2011, at 17:23, Avi Kivity wrote: >> >> > On 05/23/2011 05:44 PM, Nadav Har'El wrote: >> >> On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9": >> >> > vmcs01 and vmcs02 will both be generated from vmcs12. >> >> >> >> If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated >> >> from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g., >> >> non-trapped bits of guest_cr0), and these modifications are not copied back >> >> to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform >> >> this task). >> >> >> >> If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then >> >> indeed vmcs02 can be thrown away and regenerated. >> > >> > You would flush this state back to the vmcs. But that just confirms Joerg's statement that a fake vmexit/vmrun is more or less equivalent. >> > >> > The question is whether %rip points to the VMRUN/VMLAUNCH instruction, HOST_RIP (or the next instruction for svm), or to guest code. But the actual things we need to do are all very similar subsets of a vmexit. >> >> %rip should certainly point to VMRUN. That way there is no need to save any information whatsoever, as the VMCB is already in sane state and nothing needs to be special cased, as the next VCPU_RUN would simply go back into guest mode - which is exactly what we want. >> >> The only tricky part is how we distinguish between "I need to live migrate" and "info registers". In the former case, %rip should be on VMRUN. In the latter, on the guest rip. > > We can split vmrun emulation into "save host state, load guest state" > and "prepare nested vmcb". Then, when we load registers, if we see that > we're in guest mode, we do just the "prepare nested vmcb" bit. Or we just emulate a VMEXIT in the VCPU_FREEZE ioctl and set the %rip back to the VMRUN that entered the L2 guest. For 'info registers' the VCPU_FREEZE ioctl will not be issued and the guest registers be displayed. That way we don't need to migrate any additional state for SVM. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
=============================================================
Subject: [PATCH 23/31] nVMX: Correct handling of interrupt injection

The code in this patch correctly emulates external-interrupt injection
while a nested guest L2 is running.

Because of this code's relative oddity and un-obviousness, I include here
a longer-than-usual justification for what it does - longer than the code
itself ;-)

To understand how to correctly emulate interrupt injection while L2 is
running, let's look first at what we need to emulate: How would things look
like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
an interrupt we have hardware delivering an interrupt?

Now L1 runs on bare metal, with a guest L2 and the hardware generates an
interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and
VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what
happens now is this: The processor exits from L2 to L1, with an
external-interrupt exit reason but without an interrupt vector. L1 runs, with
interrupts disabled, and it doesn't yet know what the interrupt was. Soon
after, it enables interrupts and only at that moment, it gets the interrupt
from the processor. when L1 is KVM, Linux handles this interrupt.

Now we need exactly the same thing to happen when that L1->L2 system runs on
top of L0, instead of real hardware. This is how we do this:

When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
external-interrupt exit reason (without an interrupt vector), and run L1.
Just like in the bare metal case, it likely can't deliver the interrupt to L1
now because it is running with interrupts disabled, in which case it turns on
the interrupt window when running L1 after the exit. L1 will soon enable
interrupts, and at that point L0 will gain control again and inject the
interrupt to L1.

Finally, there is an extra complication in the code: when nested_run_pending,
we cannot return to L1 now, and must launch L2. We need to remember the
interrupt we wanted to inject (and not clear it now), and do it on the next
exit.

The above explanation shows that the relative strangeness of the nested
interrupt injection code in this patch, and the extra interrupt-window exit
incurred, are in fact necessary for accurate emulation, and are not just an
unoptimized implementation.

Let's revisit now the two assumptions made above:

If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know does, by
the way), things are simple: L0 may inject the interrupt directly to the L2
guest - using the normal code path that injects to any guest. We support this
case in the code below.

If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
does), things look very different from the description above: L1 expects to
see an exit from L2 with the interrupt vector already filled in the exit
information, and does not expect to be interrupted again with this interrupt.
The current code does not (yet) support this case, so we do not allow the
VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-12 17:39:33.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-12 17:39:33.000000000 +0300
@@ -3703,9 +3703,25 @@ out:
 	return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3828,6 +3844,15 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu);
+		get_vmcs12(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5515,6 +5540,14 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	/*
+	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+	 * we did not inject a still-pending event to L1 now because of
+	 * nested_run_pending, we need to re-enable this bit.
+	 */
+	if (vmx->nested.nested_run_pending)
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	if (exit_reason == EXIT_REASON_VMLAUNCH ||
 	    exit_reason == EXIT_REASON_VMRESUME)
 		vmx->nested.nested_run_pending = 1;
@@ -5712,6 +5745,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (is_guest_mode(&vmx->vcpu))
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -5719,6 +5754,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu))
+		return;
 	__vmx_complete_interrupts(to_vmx(vcpu),
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,