Message ID | 20200921103805.9102-2-mlevitsk@redhat.com (mailing list archive)
---|---
State | New, archived
Series | KVM: correctly restore the TSC value on nested migration
On 21/09/20 18:23, Sean Christopherson wrote:
> Avoid "should" in code comments and describe what the code is doing, not what
> it should be doing. The only exception for this is when the code has a known
> flaw/gap, e.g. "KVM should do X, but because of Y, KVM actually does Z".
>
>> + * return it's real L1 value so that its restore will be correct.
>
> s/it's/its
>
> Perhaps add "unconditionally" somewhere, since arch.tsc_offset can also contain
> the L1 value. E.g.
>
>	 * Unconditionally return L1's TSC offset on userspace reads
>	 * so that userspace reads and writes always operate on L1's
>	 * offset, e.g. to ensure deterministic behavior for migration.
>	 */

Technically the host need not restore MSR_IA32_TSC at all. This follows
the idea of the discussion with Oliver Upton about transmitting the
state of the kvmclock heuristics to userspace, which includes a (TSC,
CLOCK_MONOTONIC) pair to transmit the offset to the destination. All
that needs to be an L1 value is then the TSC value in that pair.

I'm a bit torn over this patch. On one hand it's an easy solution, on
the other hand it's... just wrong if KVM_GET_MSR is used for e.g.
debugging the guest.

I'll talk to Maxim and see if he can work on the kvmclock migration stuff.

Paolo
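The (TSC, CLOCK_MONOTONIC) pair idea above can be sketched in a few lines. This is a hypothetical illustration, not KVM or QEMU code: the function name, parameter layout, and fixed-point arithmetic are all invented for the example, and it assumes the destination can translate the source's clock sample into its own timebase.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch (not KVM/QEMU code): the source transmits a
 * matched (L1 TSC, clock) pair, and the destination advances the TSC
 * by the time elapsed since the pair was sampled, instead of restoring
 * MSR_IA32_TSC verbatim.  Overflow handling is omitted.
 */
uint64_t dest_l1_tsc(uint64_t src_tsc,      /* L1 TSC sampled on the source   */
                     uint64_t src_clock_ns, /* clock sampled at the same time */
                     uint64_t dst_clock_ns, /* same clock, read on the dest   */
                     uint64_t tsc_khz)      /* guest TSC frequency in kHz     */
{
	uint64_t elapsed_ns = dst_clock_ns - src_clock_ns;

	/* ticks = kHz * ns / 1e6 (one kHz tick per millisecond) */
	return src_tsc + elapsed_ns * tsc_khz / 1000000u;
}
```

With a 2 GHz guest TSC (tsc_khz = 2000000), one second of elapsed time advances the value by 2e9 ticks, so the guest clock keeps running across the migration instead of jumping.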
On Tue, 2020-09-22 at 14:50 +0200, Paolo Bonzini wrote:
> On 21/09/20 18:23, Sean Christopherson wrote:
> > Avoid "should" in code comments and describe what the code is doing, not what
> > it should be doing. The only exception for this is when the code has a known
> > flaw/gap, e.g. "KVM should do X, but because of Y, KVM actually does Z".
> >
> > > + * return it's real L1 value so that its restore will be correct.
> >
> > s/it's/its
> >
> > Perhaps add "unconditionally" somewhere, since arch.tsc_offset can also contain
> > the L1 value. E.g.
> >
> >	 * Unconditionally return L1's TSC offset on userspace reads
> >	 * so that userspace reads and writes always operate on L1's
> >	 * offset, e.g. to ensure deterministic behavior for migration.
> >	 */
>
> Technically the host need not restore MSR_IA32_TSC at all. This follows
> the idea of the discussion with Oliver Upton about transmitting the
> state of the kvmclock heuristics to userspace, which includes a (TSC,
> CLOCK_MONOTONIC) pair to transmit the offset to the destination. All
> that needs to be an L1 value is then the TSC value in that pair.
>
> I'm a bit torn over this patch. On one hand it's an easy solution, on
> the other hand it's... just wrong if KVM_GET_MSR is used for e.g.
> debugging the guest.

Could you explain why though? After my patch, KVM_GET_MSR will consistently
read the L1 TSC, just like all other MSRs, as I explained. I guess for
debugging this should work?

The fact that the TSC reads with the guest offset is a nice exception made
for guests that insist on reading this MSR without interception instead of
using rdtsc.

Best regards,
	Maxim Levitsky

> I'll talk to Maxim and see if he can work on the kvmclock migration stuff.
>
> Paolo
On Tue, 2020-09-22 at 17:50 +0300, Maxim Levitsky wrote:
> On Tue, 2020-09-22 at 14:50 +0200, Paolo Bonzini wrote:
> > On 21/09/20 18:23, Sean Christopherson wrote:
> > > Avoid "should" in code comments and describe what the code is doing, not what
> > > it should be doing. The only exception for this is when the code has a known
> > > flaw/gap, e.g. "KVM should do X, but because of Y, KVM actually does Z".
> > >
> > > > + * return it's real L1 value so that its restore will be correct.
> > >
> > > s/it's/its
> > >
> > > Perhaps add "unconditionally" somewhere, since arch.tsc_offset can also contain
> > > the L1 value. E.g.
> > >
> > >	 * Unconditionally return L1's TSC offset on userspace reads
> > >	 * so that userspace reads and writes always operate on L1's
> > >	 * offset, e.g. to ensure deterministic behavior for migration.
> > >	 */
> >
> > Technically the host need not restore MSR_IA32_TSC at all. This follows
> > the idea of the discussion with Oliver Upton about transmitting the
> > state of the kvmclock heuristics to userspace, which includes a (TSC,
> > CLOCK_MONOTONIC) pair to transmit the offset to the destination. All
> > that needs to be an L1 value is then the TSC value in that pair.
> >
> > I'm a bit torn over this patch. On one hand it's an easy solution, on
> > the other hand it's... just wrong if KVM_GET_MSR is used for e.g.
> > debugging the guest.
>
> Could you explain why though? After my patch, KVM_GET_MSR will consistently
> read the L1 TSC, just like all other MSRs, as I explained. I guess for
> debugging this should work?
>
> The fact that the TSC reads with the guest offset is a nice exception made
> for guests that insist on reading this MSR without interception instead of
> using rdtsc.
>
> Best regards,
> 	Maxim Levitsky
>
> > I'll talk to Maxim and see if he can work on the kvmclock migration stuff.

We talked about this on IRC and now I am also convinced that we should
implement proper TSC migration instead, so I guess I'll drop this patch
and implement that.

Over the last few weeks I have been digging through all the timing code,
and I mostly understand it, so it shouldn't take me much time to implement.

There is hope that this will make nested migration fully stable, since with
this patch it still sometimes hangs. While on my AMD machine it takes about
half a day of migration cycles to reproduce this, on my Intel laptop I can
hang the nested guest after 10-20 cycles even with this patch. The symptoms
look very similar to the issue this patch tried to fix.

Maybe we should keep the *comment* I added to document this funny TSC read
behavior. When I implement the whole thing, maybe I'll add a comment-only
version of this patch for that.

Best regards,
	Maxim Levitsky

> >
> > Paolo
On 22/09/20 17:39, Maxim Levitsky wrote:
> > > I'll talk to Maxim and see if he can work on the kvmclock migration stuff.
>
> We talked about this on IRC and now I am also convinced that we should
> implement proper TSC migration instead, so I guess I'll drop this patch
> and implement that.
>
> Over the last few weeks I have been digging through all the timing code,
> and I mostly understand it, so it shouldn't take me much time to implement.
>
> There is hope that this will make nested migration fully stable, since with
> this patch it still sometimes hangs. While on my AMD machine it takes about
> half a day of migration cycles to reproduce this, on my Intel laptop I can
> hang the nested guest after 10-20 cycles even with this patch. The symptoms
> look very similar to the issue this patch tried to fix.
>
> Maybe we should keep the *comment* I added to document this funny TSC read
> behavior. When I implement the whole thing, maybe I'll add a comment-only
> version of this patch for that.

Sure, that's a good idea.

Paolo
On 21/09/20 12:38, Maxim Levitsky wrote:
> MSR reads/writes should always access the L1 state, since the (nested)
> hypervisor should intercept all the msrs it wants to adjust, and these
> that it doesn't should be read by the guest as if the host had read it.
>
> However IA32_TSC is an exception. Even when not intercepted, guest still
> reads the value + TSC offset.
> The write however does not take any TSC offset into account.
>
> This is documented in Intel's SDM and seems also to happen on AMD as well.
>
> This creates a problem when userspace wants to read the IA32_TSC value and then
> write it. (e.g for migration)
>
> In this case it reads L2 value but write is interpreted as an L1 value.
> To fix this make the userspace initiated reads of IA32_TSC return L1 value
> as well.
>
> Huge thanks to Dave Gilbert for helping me understand this very confusing
> semantic of MSR writes.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 17f4995e80a7e..ed4314641360e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3219,9 +3219,21 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_POWER_CTL:
>  		msr_info->data = vcpu->arch.msr_ia32_power_ctl;
>  		break;
> -	case MSR_IA32_TSC:
> -		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + vcpu->arch.tsc_offset;
> +	case MSR_IA32_TSC: {
> +		/*
> +		 * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
> +		 * even when not intercepted. AMD manual doesn't explicitly
> +		 * state this but appears to behave the same.
> +		 *
> +		 * However when userspace wants to read this MSR, we should
> +		 * return it's real L1 value so that its restore will be correct.
> +		 */
> +		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
> +							    vcpu->arch.tsc_offset;
> +
> +		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
>  		break;
> +	}
>  	case MSR_MTRRcap:
>  	case 0x200 ... 0x2ff:
>  		return kvm_mtrr_get_msr(vcpu, msr_info->index, &msr_info->data);

Applied the patch as it is doing the sanest possible thing for the
current semantics of host-initiated accesses.

Paolo
On Thu, 2020-09-24 at 19:33 +0200, Paolo Bonzini wrote:
> On 21/09/20 12:38, Maxim Levitsky wrote:
> > MSR reads/writes should always access the L1 state, since the (nested)
> > hypervisor should intercept all the msrs it wants to adjust, and these
> > that it doesn't should be read by the guest as if the host had read it.
> >
> > However IA32_TSC is an exception. Even when not intercepted, guest still
> > reads the value + TSC offset.
> > The write however does not take any TSC offset into account.
> >
> > This is documented in Intel's SDM and seems also to happen on AMD as well.
> >
> > This creates a problem when userspace wants to read the IA32_TSC value and then
> > write it. (e.g for migration)
> >
> > In this case it reads L2 value but write is interpreted as an L1 value.
> > To fix this make the userspace initiated reads of IA32_TSC return L1 value
> > as well.
> >
> > Huge thanks to Dave Gilbert for helping me understand this very confusing
> > semantic of MSR writes.
> >
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/x86.c | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 17f4995e80a7e..ed4314641360e 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3219,9 +3219,21 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  	case MSR_IA32_POWER_CTL:
> >  		msr_info->data = vcpu->arch.msr_ia32_power_ctl;
> >  		break;
> > -	case MSR_IA32_TSC:
> > -		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + vcpu->arch.tsc_offset;
> > +	case MSR_IA32_TSC: {
> > +		/*
> > +		 * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
> > +		 * even when not intercepted. AMD manual doesn't explicitly
> > +		 * state this but appears to behave the same.
> > +		 *
> > +		 * However when userspace wants to read this MSR, we should
> > +		 * return it's real L1 value so that its restore will be correct.
> > +		 */
> > +		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
> > +							    vcpu->arch.tsc_offset;
> > +
> > +		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
> >  		break;
> > +	}
> >  	case MSR_MTRRcap:
> >  	case 0x200 ... 0x2ff:
> >  		return kvm_mtrr_get_msr(vcpu, msr_info->index, &msr_info->data);
>
> Applied the patch as it is doing the sanest possible thing for the
> current semantics of host-initiated accesses.
>
> Paolo

Thanks!

Best regards,
	Maxim Levitsky
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 17f4995e80a7e..ed4314641360e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3219,9 +3219,21 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_POWER_CTL:
 		msr_info->data = vcpu->arch.msr_ia32_power_ctl;
 		break;
-	case MSR_IA32_TSC:
-		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + vcpu->arch.tsc_offset;
+	case MSR_IA32_TSC: {
+		/*
+		 * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
+		 * even when not intercepted. AMD manual doesn't explicitly
+		 * state this but appears to behave the same.
+		 *
+		 * However when userspace wants to read this MSR, we should
+		 * return it's real L1 value so that its restore will be correct.
+		 */
+		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
+							    vcpu->arch.tsc_offset;
+
+		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
 		break;
+	}
 	case MSR_MTRRcap:
 	case 0x200 ... 0x2ff:
 		return kvm_mtrr_get_msr(vcpu, msr_info->index, &msr_info->data);
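The host_initiated branch in the diff can be modeled in a few lines of plain C. This is a toy model for illustration only — the struct and function names are invented, and TSC scaling (kvm_scale_tsc) is left out:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the MSR_IA32_TSC read path (invented names, no scaling).
 * While L2 runs, tsc_offset holds the combined L1+L2 offset, whereas
 * l1_tsc_offset holds L1's offset alone; a host-initiated (userspace)
 * read picks the latter, so userspace always sees the L1 value.
 */
struct tsc_state {
	uint64_t tsc_offset;     /* offset currently applied to the running guest */
	uint64_t l1_tsc_offset;  /* offset belonging to L1 alone */
};

uint64_t read_tsc_msr(const struct tsc_state *s, uint64_t host_tsc,
		      int host_initiated)
{
	uint64_t off = host_initiated ? s->l1_tsc_offset : s->tsc_offset;
	return host_tsc + off;
}
```

While L2 runs, the two reads differ by exactly the L2 portion of the offset, which is the discrepancy the patch removes from the userspace view.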
MSR reads/writes should always access the L1 state, since the (nested)
hypervisor should intercept all the MSRs it wants to adjust, and those
it doesn't should be read by the guest as if the host had read them.

However, IA32_TSC is an exception. Even when not intercepted, the guest
still reads the value + TSC offset. The write, however, does not take
any TSC offset into account.

This is documented in Intel's SDM and appears to happen on AMD as well.

This creates a problem when userspace wants to read the IA32_TSC value
and then write it (e.g. for migration): it reads the L2 value, but the
write is interpreted as an L1 value.

To fix this, make userspace-initiated reads of IA32_TSC return the L1
value as well.

Huge thanks to Dave Gilbert for helping me understand this very confusing
semantic of MSR writes.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/x86.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)
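The read/write asymmetry described above can be made concrete with a small model. This is an illustration only, not KVM code: names are invented, the host TSC and offsets are frozen, and elapsed migration time is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* What a running L2 guest observes when it reads the TSC. */
uint64_t l2_read(uint64_t host_tsc, uint64_t l1_off, uint64_t l2_off)
{
	return host_tsc + l1_off + l2_off;
}

/*
 * Simulate a save/restore cycle: userspace reads MSR_IA32_TSC and
 * writes the value back.  The write is always interpreted as an L1
 * value (it sets the L1 offset), so if the read returned the L2 value
 * (read_l1 == 0, pre-patch behavior) the L2 offset gets applied twice.
 */
uint64_t migrate_l2_view(uint64_t host_tsc, uint64_t l1_off,
			 uint64_t l2_off, int read_l1)
{
	uint64_t saved = read_l1 ? host_tsc + l1_off               /* fixed */
				 : l2_read(host_tsc, l1_off, l2_off); /* old */
	uint64_t new_l1_off = saved - host_tsc;  /* effect of the L1 write */

	return l2_read(host_tsc, new_l1_off, l2_off);
}
```

With host_tsc = 10000, l1_off = 100, l2_off = 2000, the pre-patch path makes L2 jump forward by its own 2000-tick offset after restore, while the fixed path returns exactly the value L2 saw before migration.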