[v2,01/10] KVM: x86: Enforce x2APIC's must-be-zero reserved ICR bits

Message ID 20240719235107.3023592-2-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86: Fix ICR handling when x2AVIC is active

Commit Message

Sean Christopherson July 19, 2024, 11:50 p.m. UTC
Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that
are must-be-zero on both Intel and AMD, i.e. any reserved bits other than
the BUSY bit, which Intel ignores and basically says is undefined.

KVM's xapic_state_test selftest has been fudging the bug since commit
4b88b1a518b3 ("KVM: selftests: Enhance handling WRMSR ICR register in
x2APIC mode"), which essentially removed the testcase instead of fixing
the bug.

WARN if the nodecode path triggers a #GP, as the CPU is supposed to check
reserved bits for ICR when it's partially virtualized.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
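
For readers without a kernel tree handy, here is a minimal user-space sketch of
the reserved-bits check that this patch adds.  It is illustrative only, not the
kernel code: BIT_ULL() and GENMASK_ULL() are local stand-ins for the kernel
macros of the same names, x2apic_icr_check() is a hypothetical helper, and
APIC_ICR_BUSY is assumed to be bit 12 as in the kernel's apicdef.h.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's BIT_ULL()/GENMASK_ULL() helpers. */
#define BIT_ULL(n)        (1ULL << (n))
#define GENMASK_ULL(h, l) ((~0ULL >> (63 - (h))) & (~0ULL << (l)))

/* Assumption: BUSY (delivery status) is bit 12 of the ICR. */
#define APIC_ICR_BUSY     BIT_ULL(12)

/* Same bit ranges as the mask added by the patch: 31:20, 17:16 and 13. */
#define X2APIC_ICR_RESERVED_BITS \
	(GENMASK_ULL(31, 20) | GENMASK_ULL(17, 16) | BIT_ULL(13))

/* Compile-time sanity check of the mask value derived from those ranges. */
static_assert(X2APIC_ICR_RESERVED_BITS == 0xFFF32000ULL,
	      "unexpected x2APIC ICR reserved-bits mask");

/*
 * Hypothetical helper mirroring the patched flow: return 1 (the caller would
 * inject #GP) if any must-be-zero bit is set, otherwise store the value with
 * BUSY cleared in *sanitized and return 0.  *sanitized is left untouched on
 * failure.
 */
static int x2apic_icr_check(uint64_t data, uint64_t *sanitized)
{
	if (data & X2APIC_ICR_RESERVED_BITS)
		return 1;

	*sanitized = data & ~APIC_ICR_BUSY;
	return 0;
}
```

In this sketch a write with only BUSY set on top of otherwise valid bits is
accepted and BUSY is silently dropped, whereas a write that sets bit 13, bits
17:16, or any of bits 31:20 is rejected.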

Comments

Sean Christopherson Nov. 1, 2024, 6:34 p.m. UTC | #1
On Fri, Jul 19, 2024, Sean Christopherson wrote:
> Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that
> are must-be-zero on both Intel and AMD, i.e. any reserved bits other than
> the BUSY bit, which Intel ignores and basically says is undefined.
> 
> KVM's xapic_state_test selftest has been fudging the bug since commit
> 4b88b1a518b3 ("KVM: selftests: Enhance handling WRMSR ICR register in
> x2APIC mode"), which essentially removed the testcase instead of fixing
> the bug.
> 
> WARN if the nodecode path triggers a #GP, as the CPU is supposed to check
> reserved bits for ICR when it's partially virtualized.

Apparently this isn't accurate, as I've now hit the WARN twice with x2AVIC.  I
haven't debugged in depth, but it's either INVALID_TARGET or INVALID_INT_TYPE.
Which is odd, because the WARN only happens rarely, i.e. it appears to be a race
of some form.  But I wouldn't expect those checks to be subject to races.

Ah, but maybe this one is referring to the VALID bit?

  address is not present in the physical or logical ID tables

If that's the case, then (a) ucode is buggy (IMO) and is doing table lookups
*before* reserved bits checks, and (b) I don't see a better option than simply
deleting the WARN.

  ------------[ cut here ]------------
  WARNING: CPU: 146 PID: 274555 at arch/x86/kvm/lapic.c:2521 kvm_apic_write_nodecode+0x7a/0x90 [kvm]
  Modules linked in: kvm_amd kvm ... [last unloaded: kvm]
  CPU: 146 UID: 0 PID: 274555 Comm: qemu Not tainted 6.12.0-smp--41585e8a34cb-sink #458
  Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
  RIP: 0010:kvm_apic_write_nodecode+0x7a/0x90 [kvm]
  RSP: 0018:ff51c04b4d133be8 EFLAGS: 00010202
  RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000000cffff
  RDX: 0000000087fd0e00 RSI: 00000000000cffff RDI: ff42132c9e336f00
  RBP: ff51c04b4d133e50 R08: 0000000000000000 R09: 0000000000060000
  R10: ffffffffc067428f R11: ffffffffc080aa20 R12: 00000000000cffff
  R13: 0000000000000000 R14: ff42132d09e7c2c0 R15: 0000000000000000
  FS:  00007fc1af0006c0(0000) GS:ff42138a08500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000006267e52001 CR4: 0000000000771ef0
  PKRU: 00000000
  Call Trace:
   <TASK>
   avic_incomplete_ipi_interception+0x24a/0x4c0 [kvm_amd]
   kvm_arch_vcpu_ioctl_run+0x1e11/0x2720 [kvm]
   kvm_vcpu_ioctl+0x54f/0x630 [kvm]
   __se_sys_ioctl+0x6b/0xc0
   do_syscall_64+0x83/0x160
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fc1b584624b
   </TASK>
  ---[ end trace 0000000000000000 ]---
Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a7172ba59ad2..35c4567567a2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2472,7 +2472,7 @@  void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 	 * maybe-unecessary write, and both are in the noise anyways.
 	 */
 	if (apic_x2apic_mode(apic) && offset == APIC_ICR)
-		kvm_x2apic_icr_write(apic, kvm_lapic_get_reg64(apic, APIC_ICR));
+		WARN_ON_ONCE(kvm_x2apic_icr_write(apic, kvm_lapic_get_reg64(apic, APIC_ICR)));
 	else
 		kvm_lapic_reg_write(apic, offset, kvm_lapic_get_reg(apic, offset));
 }
@@ -3186,8 +3186,21 @@  int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
 	return 0;
 }
 
+#define X2APIC_ICR_RESERVED_BITS (GENMASK_ULL(31, 20) | GENMASK_ULL(17, 16) | BIT(13))
+
 int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data)
 {
+	if (data & X2APIC_ICR_RESERVED_BITS)
+		return 1;
+
+	/*
+	 * The BUSY bit is reserved on both Intel and AMD in x2APIC mode, but
+	 * only AMD requires it to be zero, Intel essentially just ignores the
+	 * bit.  And if IPI virtualization (Intel) or x2AVIC (AMD) is enabled,
+	 * the CPU performs the reserved bits checks, i.e. the underlying CPU
+	 * behavior will "win".  Arbitrarily clear the BUSY bit, as there is no
+	 * sane way to provide consistent behavior with respect to hardware.
+	 */
 	data &= ~APIC_ICR_BUSY;
 
 	kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32));