Message ID | 20240719235107.3023592-2-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: x86: Fix ICR handling when x2AVIC is active |
On Fri, Jul 19, 2024, Sean Christopherson wrote:
> Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that
> are must-be-zero on both Intel and AMD, i.e. any reserved bits other than
> the BUSY bit, which Intel ignores and basically says is undefined.
>
> KVM's xapic_state_test selftest has been fudging the bug since commit
> 4b88b1a518b3 ("KVM: selftests: Enhance handling WRMSR ICR register in
> x2APIC mode"), which essentially removed the testcase instead of fixing
> the bug.
>
> WARN if the nodecode path triggers a #GP, as the CPU is supposed to check
> reserved bits for ICR when it's partially virtualized.

Apparently this isn't accurate, as I've now hit the WARN twice with x2AVIC.
I haven't debugged in depth, but it's either INVALID_TARGET or
INVALID_INT_TYPE. Which is odd, because the WARN only happens rarely, e.g.
appears to be a race of some form. But I wouldn't expect those checks to be
subject to races.

Ah, but maybe this one is referring to the VALID bit?

  address is not present in the physical or logical ID tables

If that's the case, then (a) ucode is buggy (IMO) and is doing table lookups
*before* reserved bits checks, and (b) I don't see a better option than
simply deleting the WARN.

  ------------[ cut here ]------------
  WARNING: CPU: 146 PID: 274555 at arch/x86/kvm/lapic.c:2521 kvm_apic_write_nodecode+0x7a/0x90 [kvm]
  Modules linked in: kvm_amd kvm ... [last unloaded: kvm]
  CPU: 146 UID: 0 PID: 274555 Comm: qemu Not tainted 6.12.0-smp--41585e8a34cb-sink #458
  Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
  RIP: 0010:kvm_apic_write_nodecode+0x7a/0x90 [kvm]
  RSP: 0018:ff51c04b4d133be8 EFLAGS: 00010202
  RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000000cffff
  RDX: 0000000087fd0e00 RSI: 00000000000cffff RDI: ff42132c9e336f00
  RBP: ff51c04b4d133e50 R08: 0000000000000000 R09: 0000000000060000
  R10: ffffffffc067428f R11: ffffffffc080aa20 R12: 00000000000cffff
  R13: 0000000000000000 R14: ff42132d09e7c2c0 R15: 0000000000000000
  FS:  00007fc1af0006c0(0000) GS:ff42138a08500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000006267e52001 CR4: 0000000000771ef0
  PKRU: 00000000
  Call Trace:
   <TASK>
   avic_incomplete_ipi_interception+0x24a/0x4c0 [kvm_amd]
   kvm_arch_vcpu_ioctl_run+0x1e11/0x2720 [kvm]
   kvm_vcpu_ioctl+0x54f/0x630 [kvm]
   __se_sys_ioctl+0x6b/0xc0
   do_syscall_64+0x83/0x160
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fc1b584624b
   </TASK>
  ---[ end trace 0000000000000000 ]---
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a7172ba59ad2..35c4567567a2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2472,7 +2472,7 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 	 * maybe-unecessary write, and both are in the noise anyways.
 	 */
 	if (apic_x2apic_mode(apic) && offset == APIC_ICR)
-		kvm_x2apic_icr_write(apic, kvm_lapic_get_reg64(apic, APIC_ICR));
+		WARN_ON_ONCE(kvm_x2apic_icr_write(apic, kvm_lapic_get_reg64(apic, APIC_ICR)));
 	else
 		kvm_lapic_reg_write(apic, offset, kvm_lapic_get_reg(apic, offset));
 }
@@ -3186,8 +3186,21 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
 	return 0;
 }
 
+#define X2APIC_ICR_RESERVED_BITS (GENMASK_ULL(31, 20) | GENMASK_ULL(17, 16) | BIT(13))
+
 int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data)
 {
+	if (data & X2APIC_ICR_RESERVED_BITS)
+		return 1;
+
+	/*
+	 * The BUSY bit is reserved on both Intel and AMD in x2APIC mode, but
+	 * only AMD requires it to be zero, Intel essentially just ignores the
+	 * bit.  And if IPI virtualization (Intel) or x2AVIC (AMD) is enabled,
+	 * the CPU performs the reserved bits checks, i.e. the underlying CPU
+	 * behavior will "win".  Arbitrarily clear the BUSY bit, as there is no
+	 * sane way to provide consistent behavior with respect to hardware.
+	 */
 	data &= ~APIC_ICR_BUSY;
 
 	kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32));
Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that
are must-be-zero on both Intel and AMD, i.e. any reserved bits other than
the BUSY bit, which Intel ignores and basically says is undefined.

KVM's xapic_state_test selftest has been fudging the bug since commit
4b88b1a518b3 ("KVM: selftests: Enhance handling WRMSR ICR register in
x2APIC mode"), which essentially removed the testcase instead of fixing
the bug.

WARN if the nodecode path triggers a #GP, as the CPU is supposed to check
reserved bits for ICR when it's partially virtualized.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
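
For readers who want to poke at the reserved-bit check outside the kernel, here is a minimal user-space sketch of the logic the patch adds to kvm_x2apic_icr_write(). It is not the kernel code: the SKETCH_* macros are stand-ins for the kernel's GENMASK_ULL(), BIT() and APIC_ICR_BUSY (bit 12), and sketch_x2apic_icr_write() is a hypothetical name that returns the sanitized value through a pointer instead of sending an IPI.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's GENMASK_ULL() and BIT() helpers. */
#define SKETCH_GENMASK_ULL(h, l) ((~0ULL << (l)) & (~0ULL >> (63 - (h))))
#define SKETCH_BIT(n)            (1ULL << (n))

/* Same bit ranges as X2APIC_ICR_RESERVED_BITS in the patch. */
#define SKETCH_X2APIC_ICR_RESERVED_BITS \
	(SKETCH_GENMASK_ULL(31, 20) | SKETCH_GENMASK_ULL(17, 16) | SKETCH_BIT(13))

/* Stand-in for APIC_ICR_BUSY (bit 12, the delivery status bit). */
#define SKETCH_APIC_ICR_BUSY     SKETCH_BIT(12)

/*
 * Hypothetical model of the patched check: return 1 (the caller would
 * inject #GP) if any must-be-zero reserved bit is set, otherwise clear
 * BUSY, store the result in *sanitized and return 0.
 */
int sketch_x2apic_icr_write(uint64_t data, uint64_t *sanitized)
{
	if (data & SKETCH_X2APIC_ICR_RESERVED_BITS)
		return 1;

	/* BUSY is deliberately not in the mask; it is dropped, not rejected. */
	*sanitized = data & ~SKETCH_APIC_ICR_BUSY;
	return 0;
}

/* Small self-check that can be called from a test harness or main(). */
void sketch_selfcheck(void)
{
	uint64_t out = 0;

	/* Bit 13 is reserved, so the write is rejected. */
	assert(sketch_x2apic_icr_write(0x2030ULL, &out) == 1);

	/* BUSY (bit 12) alone is accepted and cleared. */
	assert(sketch_x2apic_icr_write(0x5030ULL, &out) == 0);
	assert(out == 0x4030ULL);
}
```

Note that the mask only covers the low 32 bits; the destination field in bits 63:32 is passed through untouched, matching the (u32)(data >> 32) argument in the diff.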