From patchwork Mon Sep 21 10:38:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Maxim Levitsky X-Patchwork-Id: 11789243 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A57EA139A for ; Mon, 21 Sep 2020 10:38:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 892F0216C4 for ; Mon, 21 Sep 2020 10:38:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="AueoSnjJ" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726513AbgIUKiS (ORCPT ); Mon, 21 Sep 2020 06:38:18 -0400 Received: from us-smtp-delivery-1.mimecast.com ([207.211.31.120]:24161 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726436AbgIUKiS (ORCPT ); Mon, 21 Sep 2020 06:38:18 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1600684697; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=k37fRmP6IslqoM8RNLxcb8OzcjtWZpkq3yLymvakBU4=; b=AueoSnjJkfxW7RSb0Z5Hyb1MxvQTgtSHZt5IAWWE7T4OJHOeWVa5CjQTaRFTOhg6DNYqXz gI/N6uzv0ZRG7C/brcvbJBNvm5ZnfHWIPWE0NR48mqCvhMROgiRvNMTXmmg8koV/lSR5I1 rA5JCKVdjyJ+/TE7YOGistZ19H8pJkE= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-395-YFqMEzMEPRqvidZyzhPt7g-1; Mon, 21 Sep 2020 06:38:15 -0400 X-MC-Unique: YFqMEzMEPRqvidZyzhPt7g-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 40029801AC3; Mon, 21 Sep 2020 10:38:13 +0000 (UTC) Received: from localhost.localdomain (unknown [10.35.206.238]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7513E68D64; Mon, 21 Sep 2020 10:38:06 +0000 (UTC) From: Maxim Levitsky To: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Jim Mattson , Wanpeng Li , Ingo Molnar , Sean Christopherson , Borislav Petkov , Paolo Bonzini , Thomas Gleixner , Vitaly Kuznetsov , x86@kernel.org (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)), Joerg Roedel , "H. Peter Anvin" , Maxim Levitsky Subject: [PATCH v2 0/1] KVM: correctly restore the TSC value on nested migration Date: Mon, 21 Sep 2020 13:38:04 +0300 Message-Id: <20200921103805.9102-1-mlevitsk@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This patch is a result of a long investigation I made to understand why the nested migration more often than not makes the nested guest hang. Sometimes the nested guest recovers and sometimes it hangs forever. The root cause of this is that reading MSR_IA32_TSC while nested guest is running returns its TSC value, that is (assuming no tsc scaling) host tsc + L1 tsc offset + L2 tsc offset. This is correct but it is a result of a nice curiosity of X86 VMX (and apparently SVM too, according to my tests) implementation: As a rule MSR reads done by the guest should either trap to host, or just return host value, and therefore kvm_get_msr and friends, should basically always return the L1 value of any msr. Well, MSR_IA32_TSC is an exception. Intel's PRM states that when you disable its interception, then in guest mode the host adds the TSC offset to the read value. I haven't found anything like that in AMD's PRM but according to the few tests I made, it behaves the same. However, there is no such exception when writing MSR_IA32_TSC, and this poses a problem for nested migration. When MSR_IA32_TSC is read, we read L2 value (smaller since L2 is started after L1), and when we restore it after migration, the value is interpreted as L1 value, thus resulting in huge TSC jump backward which the guest usually really doesn't like, especially on AMD with APIC deadline timer, which usually just doesn't fire afterward sending the guest into endless wait for it. The proposed patch fixes this by making read of MSR_IA32_TSC depend on 'msr_info->host_initiated' If guest reads the MSR, we add the TSC offset, but when host's qemu reads the msr we skip that silly emulation of TSC offset, and return the real value for the L1 guest which is host tsc + L1 offset. This patch was tested on both SVM and VMX, and on both it fixes hangs. On VMX since it uses VMX preemption timer for APIC deadline, the guest seems to recover after a while without that patch. To make sure that the nested migration happens I usually used -overcommit cpu_pm=on but I reproduced this with just running an endless loop in L2. This is tested both with and without -invtsc,tsc-frequency=... The migration was done by saving the migration stream to a file, and then loading the qemu with '-incoming' V2: incorporated feedback from Sean Christopherson (thanks!) Maxim Levitsky (1): KVM: x86: fix MSR_IA32_TSC read for nested migration arch/x86/kvm/x86.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)