From patchwork Wed Feb 15 10:27:13 2017
X-Patchwork-Submitter: Sergey Dyasli
X-Patchwork-Id: 9573787
From: Sergey Dyasli
To: "JBeulich@suse.com", "xen-devel@lists.xenproject.org"
Date: Wed, 15 Feb 2017 10:27:13 +0000
Message-ID: <1487154433.3588.1.camel@citrix.com>
References: <58A2E8B70200007800139946@prv-mh.provo.novell.com>
	 <58A2E9F9020000780013995D@prv-mh.provo.novell.com>
In-Reply-To: <58A2E9F9020000780013995D@prv-mh.provo.novell.com>
Cc: Sergey Dyasli, Kevin Tian, "Kevin.Mayer@gdata.de", Andrew Cooper,
	Anshul Makkar, "jun.nakajima@intel.com"
Subject: Re: [Xen-devel] [PATCH 1/2] VMX: fix VMCS race on context-switch paths

On Tue, 2017-02-14 at 03:28 -0700, Jan Beulich wrote:
> When __context_switch() is being bypassed during original context
> switch handling, the vCPU "owning" the VMCS partially loses control of
> it: It will appear non-running to remote CPUs, and hence their attempt
> to pause the owning vCPU will have no effect on it (as it already
> looks to be paused). At the same time the "owning" CPU will re-enable
> interrupts eventually (at the latest when entering the idle loop) and
> hence becomes subject to IPIs from other CPUs requesting access to the
> VMCS. As a result, when __context_switch() finally gets run, the CPU
> may no longer have the VMCS loaded, and hence any accesses to it would
> fail. Hence we may need to re-load the VMCS in vmx_ctxt_switch_from().
>
> Similarly, when __context_switch() is being bypassed also on the second
> (switch-in) path, VMCS ownership may have been lost and hence needs
> re-establishing. Since there's no existing hook to put this in, add a
> new one.
>
> Reported-by: Kevin Mayer
> Reported-by: Anshul Makkar
> Signed-off-by: Jan Beulich
>
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -2098,11 +2098,14 @@ void context_switch(struct vcpu *prev, s
>
>      set_current(next);
>
> -    if ( (per_cpu(curr_vcpu, cpu) == next) ||
> -         (is_idle_domain(nextd) && cpu_online(cpu)) )
> +    if ( (per_cpu(curr_vcpu, cpu) == next) )
>      {
> +        if ( next->arch.ctxt_switch_same )
> +            next->arch.ctxt_switch_same(next);
>          local_irq_enable();
>      }
> +    else if ( is_idle_domain(nextd) && cpu_online(cpu) )
> +        local_irq_enable();
>      else
>      {
>          __context_switch();
> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> @@ -552,6 +552,27 @@ static void vmx_load_vmcs(struct vcpu *v
>      local_irq_restore(flags);
>  }
>
> +void vmx_vmcs_reload(struct vcpu *v)
> +{
> +    /*
> +     * As we're running with interrupts disabled, we can't acquire
> +     * v->arch.hvm_vmx.vmcs_lock here. However, with interrupts disabled
> +     * the VMCS can't be taken away from us anymore if we still own it.
> +     */
> +    ASSERT(!local_irq_is_enabled());
> +    if ( v->arch.hvm_vmx.vmcs_pa == this_cpu(current_vmcs) )
> +        return;
> +    ASSERT(!this_cpu(current_vmcs));
> +
> +    /*
> +     * Wait for the remote side to be done with the VMCS before loading
> +     * it here.
> +     */
> +    while ( v->arch.hvm_vmx.active_cpu != -1 )
> +        cpu_relax();
> +    vmx_load_vmcs(v);
> +}
> +
>  int vmx_cpu_up_prepare(unsigned int cpu)
>  {
>      /*
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -298,6 +298,7 @@ static int vmx_vcpu_initialise(struct vc
>      v->arch.schedule_tail = vmx_do_resume;
>      v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
>      v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
> +    v->arch.ctxt_switch_same = vmx_vmcs_reload;
>
>      if ( (rc = vmx_create_vmcs(v)) != 0 )
>      {
> @@ -936,6 +937,18 @@ static void vmx_ctxt_switch_from(struct
>      if ( unlikely(!this_cpu(vmxon)) )
>          return;
>
> +    if ( !v->is_running )
> +    {
> +        /*
> +         * When this vCPU isn't marked as running anymore, a remote pCPU's
> +         * attempt to pause us (from vmx_vmcs_enter()) won't have a reason
> +         * to spin in vcpu_sleep_sync(), and hence that pCPU might have taken
> +         * away the VMCS from us. As we're running with interrupts disabled,
> +         * we also can't call vmx_vmcs_enter().
> +         */
> +        vmx_vmcs_reload(v);
> +    }
> +
>      vmx_fpu_leave(v);
>      vmx_save_guest_msrs(v);
>      vmx_restore_host_msrs();
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -514,6 +514,7 @@ struct arch_vcpu
>
>      void (*ctxt_switch_from) (struct vcpu *);
>      void (*ctxt_switch_to) (struct vcpu *);
> +    void (*ctxt_switch_same) (struct vcpu *);
>
>      struct vpmu_struct vpmu;
>
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -174,6 +174,7 @@ void vmx_destroy_vmcs(struct vcpu *v);
>  void vmx_vmcs_enter(struct vcpu *v);
>  bool_t __must_check vmx_vmcs_try_enter(struct vcpu *v);
>  void vmx_vmcs_exit(struct vcpu *v);
> +void vmx_vmcs_reload(struct vcpu *v);
>
>  #define CPU_BASED_VIRTUAL_INTR_PENDING        0x00000004
>  #define CPU_BASED_USE_TSC_OFFSETING           0x00000008
>
>
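To restate the race being fixed, here is my own condensed sketch of the
interleaving described above (pCPU numbers are illustrative, not taken from
the patch):

/*
 * pCPU1 (has v's VMCS loaded)          pCPU2
 * ---------------------------          -----
 * context_switch(prev = v, next = idle)
 *   set_current(idle);
 *   __context_switch() is bypassed;
 *   v no longer looks "running",
 *   but its VMCS stays loaded here
 *   local_irq_enable();
 *                                      vmx_vmcs_enter(v):
 *                                        v appears paused, so no waiting;
 *                                        IPI makes pCPU1 flush v's VMCS,
 *                                        then pCPU2 loads it
 * __context_switch() eventually runs:
 *   vmx_ctxt_switch_from(v)
 *     VMCS is gone -> VMREAD/VMWRITE fail
 *     (hence the vmx_vmcs_reload() call added above)
 */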
Using the above patch with the following change for Xen-4.6.1:

-    if ( v->arch.hvm_vmx.vmcs_pa == this_cpu(current_vmcs) )
+    if ( v->arch.hvm_vmx.vmcs == this_cpu(current_vmcs) )

This is what I'm getting during the original test case (32 VMs reboot):

(XEN) [ 1407.789329] Watchdog timer detects that CPU12 is stuck!
(XEN) [ 1407.795726] ----[ Xen-4.6.1-xs-local  x86_64  debug=n  Not tainted ]----
(XEN) [ 1407.803774] CPU:    12
(XEN) [ 1407.806975] RIP:    e008:[] vmx_vmcs_reload+0x32/0x50
(XEN) [ 1407.814926] RFLAGS: 0000000000000013   CONTEXT: hypervisor (d230v0)
(XEN) [ 1407.822486] rax: 0000000000000000   rbx: ffff830079ee7000   rcx: 0000000000000000
(XEN) [ 1407.831407] rdx: 0000006f8f72ce00   rsi: ffff8329b3efbfe8   rdi: ffff830079ee7000
(XEN) [ 1407.840326] rbp: ffff83007bab7000   rsp: ffff83400fab7dc8   r8:  000001468e9e3ccc
(XEN) [ 1407.849246] r9:  ffff83403ffe7000   r10: 00000146c91c1737   r11: ffff833a9558c310
(XEN) [ 1407.858166] r12: ffff833a9558c000   r13: 000000000000000c   r14: ffff83403ffe7000
(XEN) [ 1407.867085] r15: ffff82d080364640   cr0: 0000000080050033   cr4: 00000000003526e0
(XEN) [ 1407.876005] cr3: 000000294b074000   cr2: 000007fefd7ce150
(XEN) [ 1407.882599] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) [ 1407.890938] Xen code around (vmx_vmcs_reload+0x32/0x50):
(XEN) [ 1407.899277]  84 00 00 00 00 00 f3 90 <83> bf e8 05 00 00 ff 75 f5 e9 a0 fa ff ff f3 c3
(XEN) [ 1407.908679] Xen stack trace from rsp=ffff83400fab7dc8:
(XEN) [ 1407.914982]    ffff82d08016c58d 0000000000001000 0000000000000000 0000000000000000
(XEN) [ 1407.923998]    0000000000000206 0000000000000086 0000000000000286 000000000000000c
(XEN) [ 1407.933017]    ffff83007bab7058 ffff82d080364640 ffff83007bab7000 00000146a7f26495
(XEN) [ 1407.942032]    ffff830079ee7000 ffff833a9558cf84 ffff833a9558c000 ffff82d080364640
(XEN) [ 1407.951048]    ffff82d08012fb8e ffff83400fabda98 ffff83400faba148 ffff83403ffe7000
(XEN) [ 1407.960067]    ffff83400faba160 ffff83400fabda40 ffff82d080164305 000000000000000c
(XEN) [ 1407.969083]    ffff830079ee7000 0000000001c9c380 ffff82d080136400 000000440000011d
(XEN) [ 1407.978101]    00000000ffffffff ffffffffffffffff ffff83400fab0000 ffff82d080348d00
(XEN) [ 1407.987116]    ffff833a9558c000 ffff82d080364640 ffff82d08013311c ffff830079ee7000
(XEN) [ 1407.996134]    ffff83400fab0000 ffff830079ee7000 ffff83403ffe7000 00000000ffffffff
(XEN) [ 1408.005151]    ffff82d080167d35 ffff83007bab7000 0000000000000001 fffffa80077f9700
(XEN) [ 1408.014167]    fffffa80075bf900 fffffa80077f9820 0000000000000000 0000000000000000
(XEN) [ 1408.023184]    fffffa8008889c00 0000000002fa1e78 0000003b6ed18d78 0000000000000000
(XEN) [ 1408.032202]    00000000068e7780 fffffa80075ba790 fffffa80077f9848 fffff800027f9e80
(XEN) [ 1408.041220]    0000000000000001 000000fc00000000 fffff880042499c2 0000000000000000
(XEN) [ 1408.050235]    0000000000000246 fffff80000b9cb58 0000000000000000 80248e00e008e1f0
(XEN) [ 1408.059253]    00000000ffff82d0 80248e00e008e200 00000000ffff82d0 80248e000000000c
(XEN) [ 1408.068268]    ffff830079ee7000 0000006f8f72ce00 00000000ffff82d0
(XEN) [ 1408.075638] Xen call trace:
(XEN) [ 1408.079322]    [] vmx_vmcs_reload+0x32/0x50
(XEN) [ 1408.086303]    [] context_switch+0x85d/0xeb0
(XEN) [ 1408.093380]    [] schedule.c#schedule+0x46e/0x7d0
(XEN) [ 1408.100942]    [] reprogram_timer+0x75/0xe0
(XEN) [ 1408.107925]    [] timer.c#timer_softirq_action+0x90/0x210
(XEN) [ 1408.116263]    [] softirq.c#__do_softirq+0x5c/0x90
(XEN) [ 1408.123921]    [] domain.c#idle_loop+0x25/0x60
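The faulting RIP above is inside vmx_vmcs_reload(), and the code bytes in the
trace (f3 90 = pause; 83 bf e8 05 00 00 ff = cmpl $-1,0x5e8(%rdi); 75 f5 = jne)
appear to correspond to the wait loop added by the patch:

    while ( v->arch.hvm_vmx.active_cpu != -1 )
        cpu_relax();

i.e. CPU12 seems to be spinning here because active_cpu never becomes -1.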
Currently I'm testing the following patch that fixes at least one possible
scenario:

commit f76dc832c2de4950872fc27962c56c609cff1160
Author: Sergey Dyasli
Date:   Tue Feb 14 15:23:54 2017 +0000

    x86/vmx: use curr_vcpu in vmx_vmcs_exit()

    During context_switch() from an HVM vCPU to the idle vCPU, current is
    updated but curr_vcpu retains the HVM vCPU's value.  If after that,
    for some reason, a vmx_vmcs_enter() + vmx_vmcs_exit() pair is
    executed, the test for has_hvm_container_vcpu(current) will fail and
    the VMCS for curr_vcpu will not be loaded.

    This will lead to a BUG() during the next __context_switch(), when
    vmx_ctxt_switch_from() is executed and __vmwrite() fails.

    Fix this by testing has_hvm_container_vcpu() against curr_vcpu.

    Signed-off-by: Sergey Dyasli

So far no crashes observed but testing continues.

---
Thanks,
Sergey

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 88db7ee..f0d71ae 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -742,6 +742,8 @@ void vmx_vmcs_enter(struct vcpu *v)
 void vmx_vmcs_exit(struct vcpu *v)
 {
     struct foreign_vmcs *fv;
+    unsigned int cpu = smp_processor_id();
+    struct vcpu *p = per_cpu(curr_vcpu, cpu);
 
     if ( likely(v == current) )
         return;
@@ -754,8 +756,8 @@ void vmx_vmcs_exit(struct vcpu *v)
     {
         /* Don't confuse vmx_do_resume (for @v or @current!) */
         vmx_clear_vmcs(v);
-        if ( has_hvm_container_vcpu(current) )
-            vmx_load_vmcs(current);
+        if ( has_hvm_container_vcpu(p) )
+            vmx_load_vmcs(p);
 
         spin_unlock(&v->arch.hvm_vmx.vmcs_lock);
         vcpu_unpause(v);
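Spelling out the sequence from the commit message above as a step-by-step
sketch (my own condensed reading, not part of the patch):

/*
 * 1. context_switch(prev = HVM vCPU v, next = idle):
 *        set_current(idle);            // current   = idle vCPU
 *        __context_switch() bypassed;  // curr_vcpu = v, v's VMCS stays loaded
 *
 * 2. vmx_vmcs_enter(w) + vmx_vmcs_exit(w) for some foreign vCPU w:
 *        vmx_clear_vmcs(w);
 *        has_hvm_container_vcpu(current) is false (current == idle),
 *        so curr_vcpu's VMCS is not re-loaded   // the hunk above tests
 *                                               // p (curr_vcpu) instead
 *
 * 3. The next real __context_switch():
 *        vmx_ctxt_switch_from(v) runs with v's VMCS not loaded,
 *        so __vmwrite() fails -> BUG()
 */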