From patchwork Sat Feb 17 11:26:59 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 13561349 Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6AD546A00F for ; Sat, 17 Feb 2024 11:40:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.50.34 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170029; cv=none; b=dZzt9JbBD14iw4Ib4hXzB62vH2rohXnbMubJUhJZhesNNbgPiEogH2AEq4t2jATg0Gs6rWLd1TDynqjDNsvk4autiXqOfr+mgf8VawcFq6yEHg+Eb8ff1UdDUgpF73xl7QATAqovftgz6kcAvZtyVdNH/O8Ru0HRp0ef0heE6Gg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170029; c=relaxed/simple; bh=xUPK2MjmEnmzNQbdmWvWcsyWK1EMbJbl553VIGGzx7c=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=pOH+zqY4QPQfv6n9pmNqffbMcfwOLto9AqI7EMZVJktQql5y4AsN/vEi+QyyCpAtfOKzrgN99qzSTSDGkn6+wV9X2I9h4Mfax8HXaGDHewfhZg4Y+cSzYsUTdMsvBk9LVfUqZCMp/p57vDlTPGqf5Cq+u3pP0JFrD9/SSBDWoWk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=casper.srs.infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=p3nX9SJ5; arc=none smtp.client-ip=90.155.50.34 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=casper.srs.infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="p3nX9SJ5" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:MIME-Version:References:In-Reply-To:Message-ID:Date:Subject:Cc: To:From:Reply-To:Content-ID:Content-Description; bh=0RPl/ITPhq1odqGaPpEWntC4wW6tfm4NKhJsQNOd87k=; b=p3nX9SJ5uKcQtzBrQtP8Mt6o7h RkuS7SvFAGrQiiVGAEXB4XghL5+LpVtBooOuTE8CBTLMCBdQQdHbVg1SKIQ8gFCB5ppvnNr6u9xal SK8VdCZneSryVzCxwcF1TIn8mWqBIpVsp9mR/tQmDvD4v1HsvQuUYbJ7ua3B5Bp3hZIsOtAAhXPeA iSYNtJg7mzaj1in3eLY0mikNR2BHOkjLowhpzd74HHsqi7M6j6HGb2tLMr0zDwLTukqXcsmmmS0Aw 4T3FJgVyaNjH3m47aGFptOPnkAxSSynMQkZH6VMK+YNRqGbYTq54M/gTYHrdEkkFdmqGhp/dVbVoD 0uGWhQ8Q==; Received: from [2001:8b0:10b:1::ebe] (helo=i7.infradead.org) by casper.infradead.org with esmtpsa (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3L-00000007JdV-0BjN; Sat, 17 Feb 2024 11:40:25 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3L-00000000346-38gF; Sat, 17 Feb 2024 11:40:19 +0000 From: David Woodhouse To: kvm@vger.kernel.org Cc: Sean Christopherson , Paul Durrant , Paolo Bonzini , Michal Luczaj , David Woodhouse Subject: [PATCH 1/6] KVM: x86/xen: improve accuracy of Xen timers Date: Sat, 17 Feb 2024 11:26:59 +0000 Message-ID: <20240217114017.11551-2-dwmw2@infradead.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240217114017.11551-1-dwmw2@infradead.org> References: <20240217114017.11551-1-dwmw2@infradead.org> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by casper.infradead.org. See http://www.infradead.org/rpr.html From: David Woodhouse A test program such as http://david.woodhou.se/timerlat.c confirms user reports that timers are increasingly inaccurate as the lifetime of a guest increases. Reporting the actual delay observed when asking for 100µs of sleep, it starts off OK on a newly-launched guest but gets worse over time, giving incorrect sleep times: root@ip-10-0-193-21:~# ./timerlat -c -n 5 00000000 latency 103243/100000 (3.2430%) 00000001 latency 103243/100000 (3.2430%) 00000002 latency 103242/100000 (3.2420%) 00000003 latency 103245/100000 (3.2450%) 00000004 latency 103245/100000 (3.2450%) The biggest problem is that get_kvmclock_ns() returns inaccurate values when the guest TSC is scaled. The guest sees a TSC value scaled from the host TSC by a mul/shift conversion (hopefully done in hardware). The guest then converts that guest TSC value into nanoseconds using the mul/shift conversion given to it by the KVM pvclock information. But get_kvmclock_ns() performs only a single conversion directly from host TSC to nanoseconds, giving a different result. A test program at http://david.woodhou.se/tsdrift.c demonstrates the cumulative error over a day. It's non-trivial to fix get_kvmclock_ns(), although I'll come back to that. The actual guest hv_clock is per-CPU, and *theoretically* each vCPU could be running at a *different* frequency. But this patch is needed anyway because... The other issue with Xen timers was that the code would snapshot the host CLOCK_MONOTONIC at some point in time, and then... after a few interrupts may have occurred, some preemption perhaps... would also read the guest's kvmclock. Then it would proceed under the false assumption that those two happened at the *same* time. Any time which *actually* elapsed between reading the two clocks was introduced as inaccuracies in the time at which the timer fired. Fix it to use a variant of kvm_get_time_and_clockread(), which reads the host TSC just *once*, then use the returned TSC value to calculate the kvmclock (making sure to do that the way the guest would instead of making the same mistake get_kvmclock_ns() does). Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen timers still have to use CLOCK_MONOTONIC. In practice the difference between the two won't matter over the timescales involved, as the *absolute* values don't matter; just the delta. This does mean a new variant of kvm_get_time_and_clockread() is needed; called kvm_get_monotonic_and_clockread() because that's what it does. Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode") Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant --- arch/x86/kvm/x86.c | 61 +++++++++++++++++++++-- arch/x86/kvm/x86.h | 1 + arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++----------- 3 files changed, 149 insertions(+), 34 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9f98c5075629..75f46fa3737c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2853,7 +2853,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp, return v * clock->mult; } -static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp) +/* + * As with get_kvmclock_base_ns(), this counts from boot time, at the + * frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot). + */ +static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp) { struct pvclock_gtod_data *gtod = &pvclock_gtod_data; unsigned long seq; @@ -2872,6 +2876,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp) return mode; } +/* + * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with + * no boot time offset. + */ +static int do_monotonic(s64 *t, u64 *tsc_timestamp) +{ + struct pvclock_gtod_data *gtod = &pvclock_gtod_data; + unsigned long seq; + int mode; + u64 ns; + + do { + seq = read_seqcount_begin(>od->seq); + ns = gtod->clock.base_cycles; + ns += vgettsc(>od->clock, tsc_timestamp, &mode); + ns >>= gtod->clock.shift; + ns += ktime_to_ns(gtod->clock.offset); + } while (unlikely(read_seqcount_retry(>od->seq, seq))); + *t = ns; + + return mode; +} + static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp) { struct pvclock_gtod_data *gtod = &pvclock_gtod_data; @@ -2893,18 +2920,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp) return mode; } -/* returns true if host is using TSC based clocksource */ +/* + * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and + * reports the TSC value from which it do so. Returns true if host is + * using TSC based clocksource. + */ static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp) { /* checked again under seqlock below */ if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode)) return false; - return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns, - tsc_timestamp)); + return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns, + tsc_timestamp)); } -/* returns true if host is using TSC based clocksource */ +/* + * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did + * so. Returns true if host is using TSC based clocksource. + */ +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp) +{ + /* checked again under seqlock below */ + if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode)) + return false; + + return gtod_is_based_on_tsc(do_monotonic(kernel_ns, + tsc_timestamp)); +} + +/* + * Calculates CLOCK_REALTIME and reports the TSC value from which it did + * so. Returns true if host is using TSC based clocksource. + * + * DO NOT USE this for anything related to migration. You want CLOCK_TAI + * for that. + */ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts, u64 *tsc_timestamp) { diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 2f7e19166658..56b7a78f45bf 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -294,6 +294,7 @@ void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip); u64 get_kvmclock_ns(struct kvm *kvm); uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm); +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp); int kvm_read_guest_virt(struct kvm_vcpu *vcpu, gva_t addr, void *val, unsigned int bytes, diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 11ab62ca011d..4a24899bbcfa 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -24,6 +24,7 @@ #include #include +#include #include "cpuid.h" #include "trace.h" @@ -149,8 +150,93 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer) return HRTIMER_NORESTART; } -static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns) +static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, + bool linux_wa) { + uint64_t guest_now; + int64_t kernel_now, delta; + + /* + * The guest provides the requested timeout in absolute nanoseconds + * of the KVM clock — as *it* sees it, based on the scaled TSC and + * the pvclock information provided by KVM. + * + * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW + * so use CLOCK_MONOTONIC. In the timescales covered by timers, the + * difference won't matter much as there is no cumulative effect. + * + * Calculate the time for some arbitrary point in time around "now" + * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the + * delta between the kvmclock "now" value and the guest's requested + * timeout, apply the "Linux workaround" described below, and add + * the resulting delta to the CLOCK_MONOTONIC "now" value, to get + * the absolute CLOCK_MONOTONIC time at which the timer should + * fire. + */ + if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock && + static_cpu_has(X86_FEATURE_CONSTANT_TSC)) { + uint64_t host_tsc, guest_tsc; + + if (!IS_ENABLED(CONFIG_64BIT) || + !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) { + /* + * Don't fall back to get_kvmclock_ns() because it's + * broken; it has a systemic error in its results + * because it scales directly from host TSC to + * nanoseconds, and doesn't scale first to guest TSC + * and then* to nanoseconds as the guest does. + * + * There is a small error introduced here because time + * continues to elapse between the ktime_get() and the + * subsequent rdtsc(). But not the systemic drift due + * to get_kvmclock_ns(). + */ + kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */ + host_tsc = rdtsc(); + } + + /* Calculate the guest kvmclock as the guest would do it. */ + guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc); + guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, + guest_tsc); + } else { + /* + * Without CONSTANT_TSC, get_kvmclock_ns() is the only option. + * + * Also if the guest PV clock hasn't been set up yet, as is + * likely to be the case during migration when the vCPU has + * not been run yet. It would be possible to calculate the + * scaling factors properly in that case but there's not much + * point in doing so. The get_kvmclock_ns() drift accumulates + * over time, so it's OK to use it at startup. Besides, on + * migration there's going to be a little bit of skew in the + * precise moment at which timers fire anyway. Often they'll + * be in the "past" by the time the VM is running again after + * migration. + */ + guest_now = get_kvmclock_ns(vcpu->kvm); + kernel_now = ktime_get(); + } + + delta = guest_abs - guest_now; + + /* Xen has a 'Linux workaround' in do_set_timer_op() which + * checks for negative absolute timeout values (caused by + * integer overflow), and for values about 13 days in the + * future (2^50ns) which would be caused by jiffies + * overflow. For those cases, it sets the timeout 100ms in + * the future (not *too* soon, since if a guest really did + * set a long timeout on purpose we don't want to keep + * churning CPU time by waking it up). + */ + if (linux_wa) { + if ((unlikely((int64_t)guest_abs < 0 || + (delta > 0 && (uint32_t) (delta >> 50) != 0)))) { + delta = 100 * NSEC_PER_MSEC; + guest_abs = guest_now + delta; + } + } + /* * Avoid races with the old timer firing. Checking timer_expires * to avoid calling hrtimer_cancel() will only have false positives @@ -162,12 +248,11 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ atomic_set(&vcpu->arch.xen.timer_pending, 0); vcpu->arch.xen.timer_expires = guest_abs; - if (delta_ns <= 0) { + if (delta <= 0) { xen_timer_callback(&vcpu->arch.xen.timer); } else { - ktime_t ktime_now = ktime_get(); hrtimer_start(&vcpu->arch.xen.timer, - ktime_add_ns(ktime_now, delta_ns), + ktime_add_ns(kernel_now, delta), HRTIMER_MODE_ABS_HARD); } } @@ -1000,8 +1085,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data) /* Start the timer if the new value has a valid vector+expiry. */ if (data->u.timer.port && data->u.timer.expires_ns) kvm_xen_start_timer(vcpu, data->u.timer.expires_ns, - data->u.timer.expires_ns - - get_kvmclock_ns(vcpu->kvm)); + false); r = 0; break; @@ -1475,7 +1559,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd, { struct vcpu_set_singleshot_timer oneshot; struct x86_exception e; - s64 delta; if (!kvm_xen_timer_enabled(vcpu)) return false; @@ -1509,9 +1592,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd, return true; } - /* A delta <= 0 results in an immediate callback, which is what we want */ - delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm); - kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta); + kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false); *r = 0; return true; @@ -1535,25 +1616,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout, return false; if (timeout) { - uint64_t guest_now = get_kvmclock_ns(vcpu->kvm); - int64_t delta = timeout - guest_now; - - /* Xen has a 'Linux workaround' in do_set_timer_op() which - * checks for negative absolute timeout values (caused by - * integer overflow), and for values about 13 days in the - * future (2^50ns) which would be caused by jiffies - * overflow. For those cases, it sets the timeout 100ms in - * the future (not *too* soon, since if a guest really did - * set a long timeout on purpose we don't want to keep - * churning CPU time by waking it up). - */ - if (unlikely((int64_t)timeout < 0 || - (delta > 0 && (uint32_t) (delta >> 50) != 0))) { - delta = 100 * NSEC_PER_MSEC; - timeout = guest_now + delta; - } - - kvm_xen_start_timer(vcpu, timeout, delta); + kvm_xen_start_timer(vcpu, timeout, true); } else { kvm_xen_stop_timer(vcpu); } From patchwork Sat Feb 17 11:27:00 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 13561347 Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3123769D3A; Sat, 17 Feb 2024 11:40:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.92.199 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170027; cv=none; b=dijuAqeNO2EuO5PHOCHQgt6ugMuYCXpHy6LIaFYobG5y71b6ya0L3hJU/i2gw501oY7EelYsPNOu5Di/hWkhqeseGchf/YIGZL+1cg5gV5hxRQ6L5h5YyOBTPi7rS0/Nv8dZSdDRK/aULjcMRT5/wrwowMaBb7nO1peBFh2UHAE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170027; c=relaxed/simple; bh=uBcNSjsif+abeDAk+chPtDqj8oWsJNCe+AeGbLMi68c=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=c1nzno2ktMg/inoP6LDJKASvcBdpm+7iX05iZ6HA0kAFYn3jb4awggNhljTGvf0860FgESFaVRi+sKsHAy6xXC0A2qNOrc18IhAlGuYd0+ebyhIgszYf12ROzmDTmwidTP2CUzsRjqGK4uk/SBNHC/V1Qty+v7qKDkjXktzTsb4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=desiato.srs.infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=dR7C8TGE; arc=none smtp.client-ip=90.155.92.199 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=desiato.srs.infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="dR7C8TGE" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=hu5J5fx2kJgZnYgwmuVlEzWXhnpobPtK6APDWPWVI+E=; b=dR7C8TGE0kl8P9C0+2Xa4UHkVX YAaX/41fHoK2nAYKHjNS+n4EWx5dCYAJqGT1P978pVwJwKE0Y1VUINi5cWvoiULl6z50cvfcolDXd 9d5okQnkN0jc9SnalBgkpZFIA4DT0mIlPS+xsfzQj/7X2C/uVZ0+vOUx9Jmv5rfiE2AxUiGyoYP6T zc7as2IzVWqUKD/CXnY48w1DJJYIbqmb+f1TaLYA9Q1PM/Pe/vm2OzGisfqeg0MwksHF7jOMds8+4 afy5SQNuin3iQ3YDOc4QtZ4yMVraRNDKruVKW6BoxmpQ6tFG8WEm8rFRUwhXhUa55n1vgvVb89Tj2 NtOYjmDg==; Received: from [2001:8b0:10b:1::ebe] (helo=i7.infradead.org) by desiato.infradead.org with esmtpsa (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3M-00000000Kvi-41ni; Sat, 17 Feb 2024 11:40:21 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3L-00000000349-3LIv; Sat, 17 Feb 2024 11:40:19 +0000 From: David Woodhouse To: kvm@vger.kernel.org Cc: Sean Christopherson , Paul Durrant , Paolo Bonzini , Michal Luczaj , David Woodhouse , stable@vger.kernel.org Subject: [PATCH 2/6] KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled Date: Sat, 17 Feb 2024 11:27:00 +0000 Message-ID: <20240217114017.11551-3-dwmw2@infradead.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240217114017.11551-1-dwmw2@infradead.org> References: <20240217114017.11551-1-dwmw2@infradead.org> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html From: David Woodhouse Linux guests since commit b1c3497e604d ("x86/xen: Add support for HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU upcall vector when it's advertised in the Xen CPUID leaves. This upcall is injected through the guest's local APIC as an MSI, unlike the older system vector which was merely injected by the hypervisor any time the CPU was able to receive an interrupt and the upcall_pending flags is set in its vcpu_info. Effectively, that makes the per-CPU upcall edge triggered instead of level triggered, which results in the upcall being lost if the MSI is delivered when the local APIC is *disabled*. Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC for a vCPU is software enabled (in fact, on any write to the SPIV register which doesn't disable the APIC). Do the same in KVM since KVM doesn't provide a way for userspace to intervene and trap accesses to the SPIV register of a local APIC emulated by KVM. Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant Cc: stable@vger.kernel.org --- arch/x86/kvm/lapic.c | 5 ++++- arch/x86/kvm/xen.c | 2 +- arch/x86/kvm/xen.h | 18 ++++++++++++++++++ 3 files changed, 23 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 3242f3da2457..75bc7d3f0022 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -41,6 +41,7 @@ #include "ioapic.h" #include "trace.h" #include "x86.h" +#include "xen.h" #include "cpuid.h" #include "hyperv.h" #include "smm.h" @@ -499,8 +500,10 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val) } /* Check if there are APF page ready requests pending */ - if (enabled) + if (enabled) { kvm_make_request(KVM_REQ_APF_READY, apic->vcpu); + kvm_xen_sw_enable_lapic(apic->vcpu); + } } static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id) diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 4a24899bbcfa..847d9e75df6d 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -568,7 +568,7 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state) kvm_xen_update_runstate_guest(v, state == RUNSTATE_runnable); } -static void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) +void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) { struct kvm_lapic_irq irq = { }; int r; diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h index f8f1fe22d090..f5841d9000ae 100644 --- a/arch/x86/kvm/xen.h +++ b/arch/x86/kvm/xen.h @@ -18,6 +18,7 @@ extern struct static_key_false_deferred kvm_xen_enabled; int __kvm_xen_has_interrupt(struct kvm_vcpu *vcpu); void kvm_xen_inject_pending_events(struct kvm_vcpu *vcpu); +void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *vcpu); int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data); int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data); int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data); @@ -36,6 +37,19 @@ int kvm_xen_setup_evtchn(struct kvm *kvm, const struct kvm_irq_routing_entry *ue); void kvm_xen_update_tsc_info(struct kvm_vcpu *vcpu); +static inline void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu) +{ + /* + * The local APIC is being enabled. If the per-vCPU upcall vector is + * set and the vCPU's evtchn_upcall_pending flag is set, inject the + * interrupt. + */ + if (static_branch_unlikely(&kvm_xen_enabled.key) && + vcpu->arch.xen.vcpu_info_cache.active && + vcpu->arch.xen.upcall_vector && __kvm_xen_has_interrupt(vcpu)) + kvm_xen_inject_vcpu_vector(vcpu); +} + static inline bool kvm_xen_msr_enabled(struct kvm *kvm) { return static_branch_unlikely(&kvm_xen_enabled.key) && @@ -101,6 +115,10 @@ static inline void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu) { } +static inline void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu) +{ +} + static inline bool kvm_xen_msr_enabled(struct kvm *kvm) { return false; From patchwork Sat Feb 17 11:27:01 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 13561345 Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 311C06519E for ; Sat, 17 Feb 2024 11:40:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.92.199 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170026; cv=none; b=igycFk+lAHCJEVHO8U9iL8l7ja1QYwkULRXOX7oh3wy+c3Zmkywyv/OuEG/2NapOUFnMjlrpFBSqSiK6b2fqkWmzH9U6sfJaVkHQOL6b37lKC94pZrVvNjmNkr3BkMI9pXcSvcBHX7fzpuZvkTcwGB63UPbm9g6HW3EH6yPqigs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170026; c=relaxed/simple; bh=LvY0Ry8TCWIIsbpj0ROhUqljLsZm4vsg9Oo8hsMWXdk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=en0cHFyRPCQ0B0999xJscWmyi1SCCANMHKaciFSlFQJsRi6Se1FMyFM2EoJbISGjfnqDoXF7wBoaPj9X0sThKTGvR51KpQpwxki+fc4JiSrPTBVW024rD8ZU2X6euVUXTD1RjgeVOZieU0K9Tfye2Dkv9BVsCqzZiB4gbs8Voso= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=desiato.srs.infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=Bp8fTIv8; arc=none smtp.client-ip=90.155.92.199 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=desiato.srs.infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="Bp8fTIv8" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=8XejwVT26HDeY7wuOb6N63+FEk7ZLmyz+REELGIhboI=; b=Bp8fTIv8UYtnqU9Fpkt6g2DH1Y +nsPxn2zr3KGC+s80PSrUBf/tQb0rjy/BXmq7J+Ma2b87dXBL7aiMuqUl88uYhx5McfmAW0wctsQA 3cy/5MZ3Ky3rdXptbtHoftfwAKuY/hw7/zqDTkJ/mqa+6MzwfoOYG/Bx5CCmM0n4cVktNQKsDEC8k 3ZHwKScm9jp5FTybbRmuhNXYN0Zehiz/YtumX50sPv19zWDDJKa1TF3qvc7csdlMvcopAhpXEE1cE 9CO9zd2fug0yg+lJBziYLti+QXaeZaS0ej2HpgtrqopAyMEG9fAGGr4M/aHt43KRAr2nNoTOaR9XN kkC/Ss2A==; Received: from [2001:8b0:10b:1::ebe] (helo=i7.infradead.org) by desiato.infradead.org with esmtpsa (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3M-00000000Kvj-42np; Sat, 17 Feb 2024 11:40:21 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3L-0000000034E-3ct4; Sat, 17 Feb 2024 11:40:19 +0000 From: David Woodhouse To: kvm@vger.kernel.org Cc: Sean Christopherson , Paul Durrant , Paolo Bonzini , Michal Luczaj , David Woodhouse Subject: [PATCH 3/6] KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery Date: Sat, 17 Feb 2024 11:27:01 +0000 Message-ID: <20240217114017.11551-4-dwmw2@infradead.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240217114017.11551-1-dwmw2@infradead.org> References: <20240217114017.11551-1-dwmw2@infradead.org> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html From: David Woodhouse The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast version will always work for physical unicast", justifying its use of kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails. In fact that assumption isn't true if X2APIC isn't in use by the guest and there is (8-bit x)APIC ID aliasing. A single "unicast" destination APIC ID *may* then be delivered to multiple vCPUs. Remove the warning, and in fact it might as well just call kvm_irq_delivery_to_apic(). Reported-by: Michal Luczaj Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant --- arch/x86/kvm/xen.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 847d9e75df6d..e6c7353e5315 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -10,7 +10,7 @@ #include "x86.h" #include "xen.h" #include "hyperv.h" -#include "lapic.h" +#include "irq.h" #include #include @@ -571,7 +571,6 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state) void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) { struct kvm_lapic_irq irq = { }; - int r; irq.dest_id = v->vcpu_id; irq.vector = v->arch.xen.upcall_vector; @@ -580,8 +579,7 @@ void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v) irq.delivery_mode = APIC_DM_FIXED; irq.level = 1; - /* The fast version will always work for physical unicast */ - WARN_ON_ONCE(!kvm_irq_delivery_to_apic_fast(v->kvm, NULL, &irq, &r, NULL)); + kvm_irq_delivery_to_apic(v->kvm, NULL, &irq, NULL); } /* From patchwork Sat Feb 17 11:27:02 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 13561348 Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6AD8B6A010 for ; Sat, 17 Feb 2024 11:40:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.50.34 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170029; cv=none; b=EDEd2+Qvyl+XzAPO8Qr3YRkscOVM/zkkwzJw8jqRkjQ7qJGVPa5emM8lABCqr4MBZv7cJmizeusbxZ5L/mxLDl0XX7cE6Vsjt3BV2DnGoInzeMmgxj3kB7Rl/q7lcu1KQAxT6I1GNrixZJiMmGY4CfEyoCiqusj7iC0ZBEDS0H8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170029; c=relaxed/simple; bh=k3WkaHKlWoTNaL4mad5xlJc9eqtK33g9A32lSW7k+hk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=UwaaP6U2/Xr/jJ56JLCv0prrbf496IzCRFC0z7l//yT3kIFD7RT7bo7FaPZu7mO2w+9Xz57nevZFVpiD3LoLgMU0bgPF+aULlnntwzulEiCB5ijk0c7BADimBwUfdqpFVP3kq1iK6fMB53yU7rxdpB+B6niPLGyPqCES+1SKFU8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=casper.srs.infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=l4x31Fi1; arc=none smtp.client-ip=90.155.50.34 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=casper.srs.infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="l4x31Fi1" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:MIME-Version:References:In-Reply-To:Message-ID:Date:Subject:Cc: To:From:Reply-To:Content-ID:Content-Description; bh=WGLpW6FT++ZTSc9z/kA2ucScxMGZ3vZ8SGGZsO+TlS4=; b=l4x31Fi1Q9YqtFH2G31OnJU0gF fdZXFrD9ZkHYSeK8o/hrNb5HVWW1iHItY2JOTkxp6CJ41dR83pbgtFC9/B8MXaE2z+0Pwj6ksSTHd 2A1xWgPGn064VaimVVP+xowl8HnaL9M++tFFvYyZpO61LAJfBOBwSlQPLx+vSJoaM38C66Ndwsk3c gZagrQ8VurMNKGVfn28ZqF/ClEioyXQWCnvmtDY/cJBrQftNlIZ8u8U2GholSin5a+fkvCtdkw3f2 +Ku6xa1EPuK5eo/ZQoVZmSBzHNumtNeu5uw4vsxs0P579yt8XJjNZb4dxl17BiilmDRODkv7/h3zG Pqv6No2w==; Received: from [2001:8b0:10b:1::ebe] (helo=i7.infradead.org) by casper.infradead.org with esmtpsa (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3K-00000007JdW-3ULh; Sat, 17 Feb 2024 11:40:25 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3L-0000000034I-3oWM; Sat, 17 Feb 2024 11:40:19 +0000 From: David Woodhouse To: kvm@vger.kernel.org Cc: Sean Christopherson , Paul Durrant , Paolo Bonzini , Michal Luczaj , David Woodhouse Subject: [PATCH 4/6] KVM: pfncache: simplify locking and make more self-contained Date: Sat, 17 Feb 2024 11:27:02 +0000 Message-ID: <20240217114017.11551-5-dwmw2@infradead.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240217114017.11551-1-dwmw2@infradead.org> References: <20240217114017.11551-1-dwmw2@infradead.org> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by casper.infradead.org. See http://www.infradead.org/rpr.html From: David Woodhouse The locking on the gfn_to_pfn_cache is... interesting. And awful. There is a rwlock in ->lock which readers take to ensure protection against concurrent changes. But __kvm_gpc_refresh() makes assumptions that certain fields will not change even while it drops the write lock and performs MM operations to revalidate the target PFN and kernel mapping. Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") partly addressed that — not by fixing it, but by adding a new mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh() calls on a given gfn_to_pfn_cache, but is still only a partial solution. There is still a theoretical race where __kvm_gpc_refresh() runs in parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has dropped the write lock, kvm_gpc_deactivate() clears the ->active flag and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous ->pfn and ->khva are still valid, and reinstalls those values into the structure. This leaves the gfn_to_pfn_cache with the ->valid bit set, but ->active clear. And a ->khva which looks like a reasonable kernel address but is actually unmapped. All it takes is a subsequent reactivation to cause that ->khva to be dereferenced. This would theoretically cause an oops which would look something like this: [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0 [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0 I say "theoretically" because theoretically, that oops that was seen in production cannot happen. The code which uses the gfn_to_pfn_cache is supposed to have its *own* locking, to further paper over the fact that the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own rwlock abuse is not sufficient. For the Xen vcpu_info that external lock is the vcpu->mutex, and for the shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all but the cases where the vcpu or kvm object is being *destroyed*, in which case the subsequent reactivation should never happen. Theoretically. Nevertheless, this locking abuse is awful and should be fixed, even if no clear explanation can be found for how the oops happened. So expand the use of the ->refresh_lock mutex to ensure serialization of activate/deactivate vs. refresh and make the pfncache locking entirely self-sufficient. This means that a future commit can simplify the locking in the callers, such as the Xen emulation code which has an outstanding problem with recursive locking of kvm->arch.xen.xen_lock, which will no longer be necessary. The rwlock abuse described above is still not best practice, although it's harmless now that the ->refresh_lock is held for the entire duration while the offending code drops the write lock, does some other stuff, then takes the write lock again and assumes nothing changed. That can also be fixed^W cleaned up in a subsequent commit, but this commit is a simpler basis for the Xen deadlock fix mentioned above. Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant --- virt/kvm/pfncache.c | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index a60d8f906896..79a3ef7c6d04 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -255,12 +255,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l if (page_offset + len > PAGE_SIZE) return -EINVAL; - /* - * If another task is refreshing the cache, wait for it to complete. - * There is no guarantee that concurrent refreshes will see the same - * gpa, memslots generation, etc..., so they must be fully serialized. - */ - mutex_lock(&gpc->refresh_lock); + lockdep_assert_held(&gpc->refresh_lock); write_lock_irq(&gpc->lock); @@ -346,8 +341,6 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l out_unlock: write_unlock_irq(&gpc->lock); - mutex_unlock(&gpc->refresh_lock); - if (unmap_old) gpc_unmap(old_pfn, old_khva); @@ -356,16 +349,24 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len) { - unsigned long uhva = gpc->uhva; + unsigned long uhva; + int ret; + + mutex_lock(&gpc->refresh_lock); /* * If the GPA is valid then invalidate the HVA, otherwise * __kvm_gpc_refresh() will fail its strict either/or address check. */ + uhva = gpc->uhva; if (!kvm_is_error_gpa(gpc->gpa)) uhva = KVM_HVA_ERR_BAD; - return __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len); + ret = __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len); + + mutex_unlock(&gpc->refresh_lock); + + return ret; } void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm) @@ -377,12 +378,16 @@ void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm) gpc->pfn = KVM_PFN_ERR_FAULT; gpc->gpa = INVALID_GPA; gpc->uhva = KVM_HVA_ERR_BAD; + gpc->active = gpc->valid = false; } static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva, unsigned long len) { struct kvm *kvm = gpc->kvm; + int ret; + + mutex_lock(&gpc->refresh_lock); if (!gpc->active) { if (KVM_BUG_ON(gpc->valid, kvm)) @@ -401,7 +406,10 @@ static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned gpc->active = true; write_unlock_irq(&gpc->lock); } - return __kvm_gpc_refresh(gpc, gpa, uhva, len); + ret = __kvm_gpc_refresh(gpc, gpa, uhva, len); + mutex_unlock(&gpc->refresh_lock); + + return ret; } int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len) @@ -420,6 +428,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc) kvm_pfn_t old_pfn; void *old_khva; + mutex_lock(&gpc->refresh_lock); if (gpc->active) { /* * Deactivate the cache before removing it from the list, KVM @@ -449,4 +458,5 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc) gpc_unmap(old_pfn, old_khva); } + mutex_unlock(&gpc->refresh_lock); } From patchwork Sat Feb 17 11:27:03 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 13561343 Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3127669DE1 for ; Sat, 17 Feb 2024 11:40:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.92.199 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170025; cv=none; b=B51CIRPeVZpCMpBxV7qSJkYlHe9AB5Ad24kMy+qCGnJCHo2cbUqCAJ0uY1GBGvRaqMQaJRi0Z3ro4C01Z3Q1zPnS0wRFM2jUOkRju2rni8LI9TlTi6ETRFlJUGeJYtwoQo+Up6m6OwHcgrVCfHrU0A4eh0EhPsSM4fCWG6yy96I= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170025; c=relaxed/simple; bh=By8tQ0MAT5qiJwmECK527g73NxX9Vi2/ED5npWSn874=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=lThwZuhxvvM1mSLffPF8dF7JuiWVUza+XFwaeCwpR42/P+l7emszIiXfFCvhgsOKl4A8FZsW1dPRJ4of19jHn+6A9armFhSmvEBW2fqF7mt08OCTyFDCqs7fmNEXDtPRwEPqo1HN4bMODT99dUEaxKZEst1qWSWkb5Ocu6yBlro= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=desiato.srs.infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=KBUyDBF2; arc=none smtp.client-ip=90.155.92.199 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=desiato.srs.infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="KBUyDBF2" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=E2u87X6rFncDNIqD+E/MeHzgPyotp0c53t/P40qA5ik=; b=KBUyDBF2Y6+/vgStL1GenZryVR 1lUKJjVPRPHTsMcxYlTU9lrv6bDIpaZrw6/gqTE8gH4ZHhwM2mkFr6D8rj6eoxhNt1EdvFnmCroNF sgpIbJfr0YiS64szESPkJqW5kxr7dw8fRJW6renaA5xfZOE6HSNBzNu83xdlhESWiVNPd21Ob4JGD HJFNpR/kybpR7lK4Zf4CX5XMcAQW3QXjAMWdlvNsrvulbkvA1bCe9+6Ekfp0GbYvqb/1eq7Sj8Irh yBJdCd3EPfGsAYFCz7ahxf8962NXqfVKdxlPjrpxp9+C0SCpZB7w104hAPwKkLJqGK97n/1UKOAGg aAagxhRQ==; Received: from [2001:8b0:10b:1::ebe] (helo=i7.infradead.org) by desiato.infradead.org with esmtpsa (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3M-00000000Kvk-42MU; Sat, 17 Feb 2024 11:40:21 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3L-0000000034M-402n; Sat, 17 Feb 2024 11:40:19 +0000 From: David Woodhouse To: kvm@vger.kernel.org Cc: Sean Christopherson , Paul Durrant , Paolo Bonzini , Michal Luczaj , David Woodhouse Subject: [PATCH 5/6] KVM: x86/xen: fix recursive deadlock in timer injection Date: Sat, 17 Feb 2024 11:27:03 +0000 Message-ID: <20240217114017.11551-6-dwmw2@infradead.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240217114017.11551-1-dwmw2@infradead.org> References: <20240217114017.11551-1-dwmw2@infradead.org> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html From: David Woodhouse The fast-path timer delivery introduced a recursive locking deadlock when userspace configures a timer which has already expired and is delivered immediately. The call to kvm_xen_inject_timer_irqs() can call to kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock, which is already held in kvm_xen_vcpu_get_attr(). ============================================ WARNING: possible recursive locking detected 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G O -------------------------------------------- xen_shinfo_test/250013 is trying to acquire lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm] but task is already holding lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm] Now that the gfn_to_pfn_cache has its own self-sufficient locking, its callers no longer need to ensure serialization, so just stop taking kvm->arch.xen.xen_lock from kvm_xen_set_evtchn(). Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery") Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant --- arch/x86/kvm/xen.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index e6c7353e5315..706af2935277 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -1903,8 +1903,6 @@ static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm) mm_borrowed = true; } - mutex_lock(&kvm->arch.xen.xen_lock); - /* * It is theoretically possible for the page to be unmapped * and the MMU notifier to invalidate the shared_info before @@ -1932,8 +1930,6 @@ static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm) srcu_read_unlock(&kvm->srcu, idx); } while(!rc); - mutex_unlock(&kvm->arch.xen.xen_lock); - if (mm_borrowed) kthread_unuse_mm(kvm->mm); From patchwork Sat Feb 17 11:27:04 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 13561344 Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 312A769DE6 for ; Sat, 17 Feb 2024 11:40:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.92.199 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170026; cv=none; b=Q3MEArTnPHZEp3XOsXkUuk8PoNgJrCRyZ0cN6TZnd2xFThx8gclY5FGI5EAVU7T/FPVqghnusWHYDJ6u4YhFy7nLh8hGhHPMGN2euWUyrG4Aj0715DBH5RY7DSG2kv+88vAv2BeP38qIvQHWKtOvSRwKkkvmA7a9LU8G+t3Isq4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708170026; c=relaxed/simple; bh=PGbaxLat7ipodFfTPPjVNTl62n8Kiuiju+GUcCKpmbM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=pZJW5XKgGaUQNx3Sk96Psm27XWG0RmYY3PUnvRbMTgW0NnqD+haJ9BC33M/GqKnhG554hV23O14XMHv/vPeC7GQlJt+/9S5VINXLRkk/dRaXmgsasrIqfK+73MkIil24dMSG8UY3s+HVdbP7xgExd9ILPnQb50yhmXNNv0iwmVU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=desiato.srs.infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=k/unMedm; arc=none smtp.client-ip=90.155.92.199 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=desiato.srs.infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="k/unMedm" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: Content-Type:MIME-Version:References:In-Reply-To:Message-ID:Date:Subject:Cc: To:From:Reply-To:Content-ID:Content-Description; bh=HnQTcMgMaLN8qiN7jeg9wIX94DPrXC/qk4QW6y2IuF4=; b=k/unMedmhr9HWE+qTARQUq3gvP XQIUU76UTNYrGHsPsMlSnL6q0f8IwJT1MXckepfZ1/d+WFT46kLM4wL8PJVGpzHlIredU8dy7qCAi eoJKlVbRwsgt8i21rxxNs+wK7DOJZU7ssHxHPn0BsTAL1bFGuNGGWv4NyYnkRo9Fbau8apDK5Ma2x 0bvts8/LLSXI7ul9+PLzGyZyhJiuFyQ3tExnhOn6P6luZozKVYVW5rOyhNsGXI4Hd0ohe75bkPiE0 bCEI4M9QfN5K/jzJaD619YNSDCkS2DUMnDctUQc4MTZh4v/HvDPxqBN9LEw++ZfF8jSqF5Xc2U9XO pAqB6U1w==; Received: from [2001:8b0:10b:1::ebe] (helo=i7.infradead.org) by desiato.infradead.org with esmtpsa (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3M-00000000Kvl-40V3; Sat, 17 Feb 2024 11:40:21 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.97.1 #2 (Red Hat Linux)) id 1rbJ3M-0000000034Q-0B6t; Sat, 17 Feb 2024 11:40:20 +0000 From: David Woodhouse To: kvm@vger.kernel.org Cc: Sean Christopherson , Paul Durrant , Paolo Bonzini , Michal Luczaj , David Woodhouse , Paul Durrant Subject: [PATCH 6/6] KVM: pfncache: clean up rwlock abuse Date: Sat, 17 Feb 2024 11:27:04 +0000 Message-ID: <20240217114017.11551-7-dwmw2@infradead.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240217114017.11551-1-dwmw2@infradead.org> References: <20240217114017.11551-1-dwmw2@infradead.org> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html From: David Woodhouse There is a rwlock in ->lock which readers take to ensure protection against concurrent changes. But __kvm_gpc_refresh() makes assumptions that certain fields will not change even while it drops the write lock and performs MM operations to revalidate the target PFN and kernel mapping. Those assumptions *are* valid, because a previous commit expanded the coverage of the ->refresh_lock mutex to ensure serialization and that nothing else can change those fields while __kvm_gpc_refresh() drops the rwlock. But this is not good practice. Clean up the semantics of hva_to_pfn_retry() so that it no longer does any locking gymnastics because it no longer operates on the gpc object at all. It is now called with a uhva and simply returns the corresponding pfn (pinned), and a mapped khva for it. Its caller __kvm_gpc_refresh() now sets gpc->uhva and clears gpc->valid before dropping ->lock, calling hva_to_pfn_retry() and retaking ->lock for write. If hva_to_pfn_retry() fails, *or* if the ->uhva or ->active fields in the gpc changed while the lock was dropped, the new mapping is discarded and the gpc is not modified. On success with an unchanged gpc, the new mapping is installed and the current ->pfn and ->uhva are taken into the local old_pfn and old_khva variables to be unmapped once the locks are all released. This simplification means that ->refresh_lock is no longer needed for correctness, but it does still provide a minor optimisation because it will prevent two concurrent __kvm_gpc_refresh() calls from mapping a given PFN, only for one of them to lose the race and discard its mapping. The optimisation in hva_to_pfn_retry() where it attempts to use the old mapping if the pfn doesn't change is dropped, since it makes the pinning more complex. It's a pointless optimisation anyway, since the odds of the pfn ending up the same when the uhva has changed (i.e. the odds of the two userspace addresses both pointing to the same underlying physical page) are negligible, The 'hva_changed' local variable in __kvm_gpc_refresh() is also removed, since it's simpler just to clear gpc->valid if the uhva changed. Likewise the unmap_old variable is dropped because it's just as easy to check the old_pfn variable for KVM_PFN_ERR_FAULT. Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant --- Cc: Sean Christopherson Cc: Paolo Bonzini --- virt/kvm/pfncache.c | 179 +++++++++++++++++++++----------------------- 1 file changed, 85 insertions(+), 94 deletions(-) diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index 79a3ef7c6d04..11b66f63af83 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -139,108 +139,65 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s return kvm->mmu_invalidate_seq != mmu_seq; } -static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc) +/* + * Given a user virtual address, obtain a pinned host PFN and kernel mapping + * for it. The caller will release the PFN after installing it into the GPC + * so that the MMU notifier invalidation mechanism is active. + */ +static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva, + kvm_pfn_t *pfn, void **khva) { /* Note, the new page offset may be different than the old! */ - void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva); kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT; void *new_khva = NULL; unsigned long mmu_seq; - lockdep_assert_held(&gpc->refresh_lock); - - lockdep_assert_held_write(&gpc->lock); - - /* - * Invalidate the cache prior to dropping gpc->lock, the gpa=>uhva - * assets have already been updated and so a concurrent check() from a - * different task may not fail the gpa/uhva/generation checks. - */ - gpc->valid = false; - - do { - mmu_seq = gpc->kvm->mmu_invalidate_seq; + for (;;) { + mmu_seq = kvm->mmu_invalidate_seq; smp_rmb(); - write_unlock_irq(&gpc->lock); - - /* - * If the previous iteration "failed" due to an mmu_notifier - * event, release the pfn and unmap the kernel virtual address - * from the previous attempt. Unmapping might sleep, so this - * needs to be done after dropping the lock. Opportunistically - * check for resched while the lock isn't held. - */ - if (new_pfn != KVM_PFN_ERR_FAULT) { - /* - * Keep the mapping if the previous iteration reused - * the existing mapping and didn't create a new one. - */ - if (new_khva != old_khva) - gpc_unmap(new_pfn, new_khva); - - kvm_release_pfn_clean(new_pfn); - - cond_resched(); - } - /* We always request a writeable mapping */ - new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL); + new_pfn = hva_to_pfn(uhva, false, false, NULL, true, NULL); if (is_error_noslot_pfn(new_pfn)) - goto out_error; + return -EFAULT; /* - * Obtain a new kernel mapping if KVM itself will access the - * pfn. Note, kmap() and memremap() can both sleep, so this - * too must be done outside of gpc->lock! + * Always obtain a new kernel mapping. Trying to reuse an + * existing one is more complex than it's worth. */ - if (new_pfn == gpc->pfn) - new_khva = old_khva; - else - new_khva = gpc_map(new_pfn); - + new_khva = gpc_map(new_pfn); if (!new_khva) { kvm_release_pfn_clean(new_pfn); - goto out_error; + return -EFAULT; } - write_lock_irq(&gpc->lock); + if (!mmu_notifier_retry_cache(kvm, mmu_seq)) + break; /* - * Other tasks must wait for _this_ refresh to complete before - * attempting to refresh. + * If this iteration "failed" due to an mmu_notifier event, + * release the pfn and unmap the kernel virtual address, and + * loop around again. */ - WARN_ON_ONCE(gpc->valid); - } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq)); - - gpc->valid = true; - gpc->pfn = new_pfn; - gpc->khva = new_khva + offset_in_page(gpc->uhva); + if (new_pfn != KVM_PFN_ERR_FAULT) { + gpc_unmap(new_pfn, new_khva); + kvm_release_pfn_clean(new_pfn); + } + } - /* - * Put the reference to the _new_ pfn. The pfn is now tracked by the - * cache and can be safely migrated, swapped, etc... as the cache will - * invalidate any mappings in response to relevant mmu_notifier events. - */ - kvm_release_pfn_clean(new_pfn); + *pfn = new_pfn; + *khva = new_khva; return 0; - -out_error: - write_lock_irq(&gpc->lock); - - return -EFAULT; } -static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva, - unsigned long len) +static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, + unsigned long uhva, unsigned long len) { unsigned long page_offset = kvm_is_error_gpa(gpa) ? offset_in_page(uhva) : offset_in_page(gpa); - bool unmap_old = false; unsigned long old_uhva; - kvm_pfn_t old_pfn; - bool hva_change = false; + kvm_pfn_t old_pfn = KVM_PFN_ERR_FAULT; void *old_khva; int ret; @@ -274,7 +231,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l gpc->uhva = PAGE_ALIGN_DOWN(uhva); if (gpc->uhva != old_uhva) - hva_change = true; + gpc->valid = false; } else { struct kvm_memslots *slots = kvm_memslots(gpc->kvm); @@ -289,7 +246,11 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l if (kvm_is_error_hva(gpc->uhva)) { ret = -EFAULT; - goto out; + + gpc->valid = false; + gpc->pfn = KVM_PFN_ERR_FAULT; + gpc->khva = NULL; + goto out_unlock; } /* @@ -297,7 +258,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l * HVA may still be the same. */ if (gpc->uhva != old_uhva) - hva_change = true; + gpc->valid = false; } else { gpc->uhva = old_uhva; } @@ -310,9 +271,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l * If the userspace HVA changed or the PFN was already invalid, * drop the lock and do the HVA to PFN lookup again. */ - if (!gpc->valid || hva_change) { - ret = hva_to_pfn_retry(gpc); - } else { + if (gpc->valid) { /* * If the HVA→PFN mapping was already valid, don't unmap it. * But do update gpc->khva because the offset within the page @@ -320,28 +279,60 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l */ gpc->khva = old_khva + page_offset; ret = 0; - goto out_unlock; - } - out: - /* - * Invalidate the cache and purge the pfn/khva if the refresh failed. - * Some/all of the uhva, gpa, and memslot generation info may still be - * valid, leave it as is. - */ - if (ret) { + /* old_pfn must not be unmapped because it was reused. */ + old_pfn = KVM_PFN_ERR_FAULT; + } else { + kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT; + unsigned long new_uhva = gpc->uhva; + void *new_khva = NULL; + + /* + * Invalidate the cache prior to dropping gpc->lock; the + * gpa=>uhva assets have already been updated and so a + * concurrent check() from a different task may not fail + * the gpa/uhva/generation checks as it should. + */ gpc->valid = false; - gpc->pfn = KVM_PFN_ERR_FAULT; - gpc->khva = NULL; - } - /* Detect a pfn change before dropping the lock! */ - unmap_old = (old_pfn != gpc->pfn); + write_unlock_irq(&gpc->lock); + + ret = hva_to_pfn_retry(gpc->kvm, new_uhva, &new_pfn, &new_khva); + + write_lock_irq(&gpc->lock); + + WARN_ON_ONCE(gpc->valid); + + if (ret || !gpc->active || gpc->uhva != new_uhva) { + /* + * On failure or if another change occurred while the + * lock was dropped, just purge the new mapping. + */ + old_pfn = new_pfn; + old_khva = new_khva; + } else { + old_pfn = gpc->pfn; + old_khva = gpc->khva; + + gpc->pfn = new_pfn; + gpc->khva = new_khva + offset_in_page(gpc->uhva); + gpc->valid = true; + } + + /* + * Put the reference to the _new_ pfn. On success, the + * pfn is now tracked by the cache and can safely be + * migrated, swapped, etc. as the cache will invalidate + * any mappings in response to relevant mmu_notifier + * events. + */ + kvm_release_pfn_clean(new_pfn); + } out_unlock: write_unlock_irq(&gpc->lock); - if (unmap_old) + if (old_pfn != KVM_PFN_ERR_FAULT) gpc_unmap(old_pfn, old_khva); return ret;