Message ID | 201108021255.p72CtNjD002153@rice.haifa.ibm.com (mailing list archive) |
---|---
State | New, archived |
Caution: this requires more care.

Pretty sure this breaks userspace suspend at the cost of supporting a
not-so-reasonable hardware feature.

On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> read without exit), we need to return L2's notion of the TSC, not L1's.
>
> The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> used in x86.c where this was assumed, but now that these places call the new
> svm_read_l1_tsc(), the MSR read can be fixed.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/svm.c |    4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> --- .before/arch/x86/kvm/svm.c	2011-08-02 15:51:02.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c	2011-08-02 15:51:02.000000000 +0300
> @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
>
>  	switch (ecx) {
>  	case MSR_IA32_TSC: {
> -		struct vmcb *vmcb = get_host_vmcb(svm);
> -
> -		*data = vmcb->control.tsc_offset +
> +		*data = svm->vmcb->control.tsc_offset +
>  			svm_scale_tsc(vcpu, native_read_tsc());
>
>  		break;

This is correct. Now you properly return the correct MSR value for the
TSC to the guest in all cases.

However, there is a BIG PROBLEM (yes, it is that bad). Sorry I did not
think of it before.

When the guest gets suspended and frozen by userspace, to be restarted
later, what qemu is going to do is come along and read all of the MSRs
as part of the saved state. One of those happens to be the TSC MSR.

Since you can't guarantee suspend will take place when only L1 is
running, you may have mixed L1/L2 TSC MSRs now being returned to
userspace. Now, when you resume this guest, you will have mixed L1/L2
TSC values written into the MSRs.

Those will almost certainly fail to be matched by the TSC offset
matching code, and thus, with multiple VCPUs, you will end up with
slightly unsynchronized TSCs, and with that, all the problems
associated with unstable TSC come back to haunt you again. Basically,
all bets for stability are off.

Apart from recording the L1 and L2 TSC through separate MSRs, there
isn't a good way to solve this. Walking through all the steps again,
I'm pretty sure this is why I thought it would be better for L0
kvm_get_msr() to return the L1 values even if L2 was running.

In the end, it may not be worth the hassle to support this mode of
operation, which to our current knowledge is in fact unused. I would
much prefer to see a bad virtualization of a known insecure and unused
guest mode than to have a well-used feature such as L1 guest
suspend/resume continue to cause clock de-synchronization.

Zach

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
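The desynchronization Zach describes follows directly from how TSC offsetting composes under nesting: the guest-visible TSC is the host TSC plus the active VMCB's offset, and while L2 runs that offset effectively carries L1's offset plus L2's. A minimal standalone sketch (illustrative names and layout, not KVM code) shows how two vcpus saved in different modes report values no single offset can reconcile:

```c
#include <stdint.h>

/* Toy model: guest_tsc = host_tsc + active_offset.  While L2 runs,
 * the active offset is (in this simplified model) L1's offset plus
 * L2's, so an MSR read taken in L2 mode differs from one taken in
 * L1 mode by exactly l2_offset. */
static uint64_t read_guest_tsc(uint64_t host_tsc, int64_t l1_offset,
                               int64_t l2_offset, int in_l2)
{
    int64_t active = in_l2 ? l1_offset + l2_offset : l1_offset;
    return host_tsc + (uint64_t)active;
}
```

Saving one vcpu while it is in L1 and another while it is in L2 yields saved MSR values that differ by `l2_offset`, which the offset-matching code on restore cannot distinguish from genuinely drifted TSCs.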
Hi,

On Wed, Aug 03, 2011, Zachary Amsden wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> Pretty sure this breaks userspace suspend at the cost of supporting a
> not-so-reasonable hardware feature.
>...
> This is correct. Now you properly return the correct MSR value for
> the TSC to the guest in all cases.
>
> However, there is a BIG PROBLEM (yes, it is that bad). Sorry I did
> not think of it before.
>
> When the guest gets suspended and frozen by userspace, to be restarted
> later, what qemu is going to do is come along and read all of the MSRs
> as part of the saved state. One of those happens to be the TSC MSR.

And just when I thought we were done with this bug :(

Does live migration (or suspend) actually work with nested SVM in the
current code? I certainly don't expect it to work correctly with nested
VMX.

Also, I vaguely remember a discussion a while back on this mailing list
about the topic of live migration and nested virtualization, and one of
the ideas raised was that before we can migrate L1, we should force an
exit from L2 to L1, either really (with some real exit reason) or
artificially (call the exit emulation function directly). Then, during
the migration, we will be sure to have all the L1 MSRs, in-memory
structures, and so on, updated, and, importantly, we will also be sure
to have vmcs12 (the vmcs that L1 keeps for L2) updated in L1's memory -
so that we don't need to send even more internal KVM state (like vmcs02)
during the live migration.

> In the end, it may not be worth the hassle to support this mode of
> operation that to our current knowledge, is in fact, unused. I would

I do agree that this doesn't sound like a useful mode of operation - but
I also don't like having deliberate mistakes in the code because they
have useful side-effects.

I guess that if we can't figure out a way around this new problem, what
I can do is create a patch that:

1. Always returns L1's TSC for the MSR (as in the original SVM code).

2. Puts a big comment above this function, about it being architecturally
   *wrong*, but still useful (and explains why).

3. Checks for the case where users might expect the
   architecturally-correct version, not the current "wrong" version.
   I.e., check if L1 allows L2 exit-less reads from the TSC, using the
   MSR bitmap; if it does, kill the guest, or find a way to prevent this
   setting.

Thanks,
Nadav.
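Step 3's bitmap check could look roughly like the following. Per the AMD APM, the SVM MSR permission map uses two bits per MSR (read intercept, then write intercept), with MSRs 0x0000-0x1fff occupying the first 2K of the map, and a clear read bit means the access is not intercepted. This is an illustrative sketch against that documented layout, not proposed kernel code:

```c
#include <stdint.h>

/* Sketch: does L1's MSR permission map let L2 read IA32_TSC (MSR 0x10)
 * without a VMEXIT?  Two bits per MSR, read bit first; MSRs in the
 * 0x0000-0x1fff range start at offset 0 of the map.  A clear bit
 * means "no intercept", i.e. the read goes straight to hardware. */
static int l1_allows_exitless_tsc_read(const uint8_t *msrpm)
{
    unsigned int msr = 0x10;        /* MSR_IA32_TSC */
    unsigned int bit = msr * 2;     /* read-intercept bit for this MSR */

    return !(msrpm[bit / 8] & (1u << (bit % 8)));
}
```

If this returns true for L1's map, the guest actually depends on the architecturally-correct behavior, and the "always return L1's TSC" shortcut would be visibly wrong to it.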
On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> Caution: this requires more care.
>
> Pretty sure this breaks userspace suspend at the cost of supporting a
> not-so-reasonable hardware feature.
>
> On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> > When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> > read without exit), we need to return L2's notion of the TSC, not L1's.
> >
> > The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> > used in x86.c where this was assumed, but now that these places call the new
> > svm_read_l1_tsc(), the MSR read can be fixed.
[...]
> This is correct. Now you properly return the correct MSR value for
> the TSC to the guest in all cases.
>
> However, there is a BIG PROBLEM (yes, it is that bad). Sorry I did
> not think of it before.
>
> When the guest gets suspended and frozen by userspace, to be restarted
> later, what qemu is going to do is come along and read all of the MSRs
> as part of the saved state. One of those happens to be the TSC MSR.
>
> Since you can't guarantee suspend will take place when only L1 is
> running, you may have mixed L1/L2 TSC MSRs now being returned to
> userspace. Now, when you resume this guest, you will have mixed L1/L2
> TSC values written into the MSRs.
>
> Those will almost certainly fail to be matched by the TSC offset
> matching code, and thus, with multiple VCPUs, you will end up with
> slightly unsynchronized TSCs, and with that, all the problems
> associated with unstable TSC come back to haunt you again. Basically,
> all bets for stability are off.

TSC synchronization is the least of your problems if you attempt to
save/restore the state of a guest while any vcpu is in L2 mode.

So to keep consistency between the remaining MSRs, I agree with Nadav's
patch. Apparently SVM's original patch to return the L1 TSC was aimed at
fixing the problem VMX is facing now, which is fixed by the introduction
of the read_l1_tsc helpers.

> Apart from recording the L1 and L2 TSC through separate MSRs, there
> isn't a good way to solve this. Walking through all the steps again,
> I'm pretty sure this is why I thought it would be better for L0
> kvm_get_msr() to return the L1 values even if L2 was running.
>
> In the end, it may not be worth the hassle to support this mode of
> operation that to our current knowledge, is in fact, unused. I would
> much prefer to see a bad virtualization of a known insecure and unused
> guest mode than to have a well used feature such as L1 guest suspend /
> resume continue to cause clock de-synchronization.
>
> Zach
On Wed, Aug 03, 2011 at 06:00:52PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> > Caution: this requires more care.
> >
> > Pretty sure this breaks userspace suspend at the cost of supporting a
> > not-so-reasonable hardware feature.
[...]
> TSC synchronization is the least of your problems if you attempt to
> save/restore state a guest while any vcpu is in L2 mode.
>
> So to keep consistency between the remaining MSRs, i agree with Nadav's
> patch. Apparently SVM's original patches to return L1 TSC were aimed at
> fixing the problem VMX is facing now, which is fixed by introduction
> read_l1_tsc helpers.

Joerg, can you review and ack Nadav's SVM patch? TIA
On Fri, Aug 05, 2011 at 12:32:24PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 03, 2011 at 06:00:52PM -0300, Marcelo Tosatti wrote:
> > On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> > > Caution: this requires more care.
[...]
> > TSC synchronization is the least of your problems if you attempt to
> > save/restore state a guest while any vcpu is in L2 mode.
> >
> > So to keep consistency between the remaining MSRs, i agree with Nadav's
> > patch. Apparently SVM's original patches to return L1 TSC were aimed at
> > fixing the problem VMX is facing now, which is fixed by introduction
> > read_l1_tsc helpers.
>
> Joerg, can you review and ack Nadav's SVM patch? TIA

Yes, sorry for the delay. I'll give it a review and test today.

	Joerg
On Fri, Aug 05, 2011 at 12:32:24PM -0300, Marcelo Tosatti wrote:
> > So to keep consistency between the remaining MSRs, i agree with Nadav's
> > patch. Apparently SVM's original patches to return L1 TSC were aimed at
> > fixing the problem VMX is facing now, which is fixed by introduction
> > read_l1_tsc helpers.
>
> Joerg, can you review and ack Nadav's SVM patch? TIA

Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>

Reviewed and tested (didn't apply cleanly, but it was easy to fix that
up). Looks all good so far.

Regards,

	Joerg
On Sun, Aug 07, 2011, Joerg Roedel wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> > Joerg, can you review and ack Nadav's SVM patch? TIA
>
> Tested-by: Joerg Roedel <joerg.roedel@amd.com>
> Acked-by: Joerg Roedel <joerg.roedel@amd.com>
>
> Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
> Looks all good so far.

Hi guys, I'm a bit confused about how we want to proceed from here.

The patches which I sent appear to fix the original bug (as confirmed by
the two people who reported it), but Zachary warned that they would break
migration of nested SVM while L2 is running. I asked whether migration
works at all while L2 is running (without exiting to L1 first) - and
Marcelo suggested that it doesn't.

If the problem Zachary pointed to is considered serious, I proposed a
second option - to leave the code to *wrongly* return the L1 TSC MSR
always, and check (and warn) in situations where this wrong value is
returned to the guest; this would let qemu always read the L1 TSC MSR,
even when L2 is running. While I proposed this as a second option, I
don't think I can recommend it.

So what's the verdict? :-)

Thanks,
Nadav.
On Wed, Aug 10, 2011 at 03:35:28PM +0300, Nadav Har'El wrote:
> Hi guys, I'm a bit confused how we want to proceed from here.
>
> The patches which I sent appear to fix the original bug (as confirmed by
> the two people who reported it), but Zachary warned that it would break
> migration of nested SVM while L2 is running. I asked whether migration
> works at all while L2 is running (without exiting to L1 first) - and
> Marcelo suggested that it doesn't.

Migration doesn't work today when L2 is running, so don't worry about it
for now.

Regards,

	Joerg
--- .before/arch/x86/kvm/svm.c	2011-08-02 15:51:02.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2011-08-02 15:51:02.000000000 +0300
@@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *

 	switch (ecx) {
 	case MSR_IA32_TSC: {
-		struct vmcb *vmcb = get_host_vmcb(svm);
-
-		*data = vmcb->control.tsc_offset +
+		*data = svm->vmcb->control.tsc_offset +
 			svm_scale_tsc(vcpu, native_read_tsc());

 		break;
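The split this patch relies on can be modeled outside the kernel like so: the guest-visible MSR read uses the *active* vmcb (which carries L2's offset while a nested guest runs), while an svm_read_l1_tsc()-style helper keeps using L1's vmcb for host-side timekeeping. Struct and field names only mirror the patch; this is a standalone illustration, not kernel code:

```c
#include <stdint.h>

struct vmcb_control { int64_t tsc_offset; };
struct vmcb { struct vmcb_control control; };

struct vcpu_svm {
    struct vmcb *vmcb;       /* active vmcb: L2's while the nested guest runs */
    struct vmcb *host_vmcb;  /* L1's vmcb (what get_host_vmcb() returns) */
};

/* Guest-visible TSC MSR read, as changed by the patch: uses the
 * active vmcb, so an L2 guest sees L2's notion of the TSC. */
static uint64_t svm_get_msr_tsc(const struct vcpu_svm *svm, uint64_t host_tsc)
{
    return host_tsc + (uint64_t)svm->vmcb->control.tsc_offset;
}

/* L1's notion of the TSC, for host-side (x86.c) timekeeping,
 * regardless of which guest level is currently running. */
static uint64_t svm_read_l1_tsc(const struct vcpu_svm *svm, uint64_t host_tsc)
{
    return host_tsc + (uint64_t)svm->host_vmcb->control.tsc_offset;
}
```

With L2 active, the two functions return different values from the same host TSC reading, which is exactly the distinction the commit message describes.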
When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
read without exit), we need to return L2's notion of the TSC, not L1's.

The current code incorrectly returned L1 TSC, because svm_get_msr() was
also used in x86.c where this was assumed, but now that these places
call the new svm_read_l1_tsc(), the MSR read can be fixed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/svm.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)