Message ID | 20200728004446.932-1-graf@amazon.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: x86: Deflect unknown MSR accesses to user space |
Alexander Graf <graf@amazon.com> writes:

> MSRs are weird. Some of them are normal control registers, such as EFER.
> Some however are registers that really are model specific, not very
> interesting to virtualization workloads, and not performance critical.
> Others again are really just windows into package configuration.
>
> Out of these MSRs, only the first category is necessary to implement in
> kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
> certain CPU models and MSRs that contain information on the package level
> are much better suited for user space to process. However, over time we have
> accumulated a lot of MSRs that are not the first category, but still handled
> by in-kernel KVM code.
>
> This patch adds a generic interface to handle WRMSR and RDMSR from user
> space. With this, any future MSR that is part of the latter categories can
> be handled in user space.
>
> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
> something that applies per-VM rather than on the full system. That way you
> can run productive VMs in parallel to experimental ones where you don't care
> about proper MSR handling.
>

In theory, we can go further: userspace will give KVM the list of MSRs
it is interested in. This list may even contain MSRs which are normally
handled by KVM, in this case userspace gets an option to mangle KVM's
reply (RDMSR) or do something extra (WRMSR). I'm not sure if there is a
real need behind this, just an idea.

The problem with this approach is: if currently some MSR is not
implemented in KVM you will get an exit.
When later someone comes with a patch to implement this MSR your userspace handling will immediately get broken so the list of not implemented MSRs effectively becomes an API :-) > Signed-off-by: Alexander Graf <graf@amazon.com> > > --- > > As a quick example to show what this does, I implemented handling for MSR 0x35 > (MSR_CORE_THREAD_COUNT) in QEMU on top of this patch set: > > https://github.com/agraf/qemu/commits/user-space-msr > --- > Documentation/virt/kvm/api.rst | 60 ++++++++++++++++++++++++++++++ > arch/x86/include/asm/kvm_host.h | 6 +++ > arch/x86/kvm/emulate.c | 18 +++++++-- > arch/x86/kvm/x86.c | 65 ++++++++++++++++++++++++++++++++- > include/trace/events/kvm.h | 2 +- > include/uapi/linux/kvm.h | 11 ++++++ > 6 files changed, 155 insertions(+), 7 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index 320788f81a05..7dfcc8e09dad 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -5155,6 +5155,34 @@ Note that KVM does not skip the faulting instruction as it does for > KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state > if it decides to decode and emulate the instruction. > > +:: > + > + /* KVM_EXIT_RDMSR / KVM_EXIT_WRMSR */ > + struct { > + __u8 reply; > + __u8 error; > + __u8 pad[2]; > + __u32 index; > + __u64 data; > + } msr; (Personal taste most likely) This layout is perfect but it makes my brain explode :-) Naturally, I expect index and data to be the most significant members and I expect them to be the first two members, something like struct { __u32 index; __u32 pad32; __u64 data; __u8 reply; __u8 error; __u8 pad8[6]; } msr; > + > +Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is > +enabled, MSR accesses to registers that are not known by KVM kernel code will > +trigger a KVM_EXIT_RDMSR exit for reads and KVM_EXIT_WRMSR exit for writes. 
> + > +For KVM_EXIT_RDMSR, the "index" field tells user space which MSR the guest > +wants to read. To respond to this request with a successful read, user space > +writes a 1 into the "reply" field and the respective data into the "data" field. > + > +If the RDMSR request was unsuccessful, user space indicates that with a "1" > +in the "reply" field and a "1" in the "error" field. This will inject a #GP > +into the guest when the VCPU is executed again. > + > +For KVM_EXIT_WRMSR, the "index" field tells user space which MSR the guest > +wants to write. Once finished processing the event, user space sets the "reply" > +field to "1". If the MSR write was unsuccessful, user space also sets the > +"error" field to "1". > + > :: > > /* Fix the size of the union. */ > @@ -5844,6 +5872,27 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows > the maximum halt time to specified on a per-VM basis, effectively overriding > the module parameter for the target VM. > > +7.21 KVM_CAP_X86_USER_SPACE_MSR > +---------------------- > + > +:Architectures: x86 > +:Target: VM > +:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise > +:Returns: 0 on success; -1 on error > + > +This capability enabled trapping of unhandled RDMSR and WRMSR instructions > +into user space. > + > +When a guest requests to read or write an MSR, KVM may not implement all MSRs > +that are relevant to a respective system. It also does not differentiate by > +CPU type. > + > +To allow more fine grained control over MSR handling, user space may enable > +this capability. With it enabled, MSR accesses that are not handled by KVM > +will trigger KVM_EXIT_RDMSR and KVM_EXIT_WRMSR exit notifications which > +user space can then handle to implement model specific MSR handling and/or > +user notifications to inform a user that an MSR was not handled. > + > 8. Other capabilities. > ====================== > > @@ -6151,3 +6200,14 @@ KVM can therefore start protected VMs. 
> This capability governs the KVM_S390_PV_COMMAND ioctl and the > KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected > guests when the state change is invalid. > + > +8.24 KVM_CAP_X86_USER_SPACE_MSR > +---------------------------- > + > +:Architectures: x86 > + > +This capability indicates that KVM supports deflection of MSR reads and > +writes to user space. It can be enabled on a VM level. If enabled, MSR > +accesses that are not handled by KVM and would thus usually trigger a > +#GP into the guest will instead get bounced to user space through the > +KVM_EXIT_RDMSR and KVM_EXIT_WRMSR exit notifications. > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..c4218e05d8b8 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -1002,6 +1002,9 @@ struct kvm_arch { > bool guest_can_read_msr_platform_info; > bool exception_payload_enabled; > > + /* Deflect RDMSR and WRMSR to user space if not handled in kernel */ > + bool user_space_msr_enabled; > + > struct kvm_pmu_event_filter *pmu_event_filter; > struct task_struct *nx_lpage_recovery_thread; > }; > @@ -1437,6 +1440,9 @@ int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type); > int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu, > void *insn, int insn_len); > > +/* Indicate that an MSR operation should be handled by user space */ > +#define ETRAP_TO_USER_SPACE EREMOTE What if we just use ENOENT in kvm_set_msr_user_space()/kvm_get_msr_user_space()? Or, maybe, we can just notice that KVM_EXIT_RDMSR/KVM_EXIT_WRMSR was set, this way we don't need a specific exit code. 
> + > void kvm_enable_efer_bits(u64); > bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer); > int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated); > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c > index d0e2825ae617..b08000e3b2fe 100644 > --- a/arch/x86/kvm/emulate.c > +++ b/arch/x86/kvm/emulate.c > @@ -3693,18 +3693,28 @@ static int em_wrmsr(struct x86_emulate_ctxt *ctxt) > > msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX) > | ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32); > - if (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data)) > + switch (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data)) { > + case 0: > + return X86EMUL_CONTINUE; > + case -ETRAP_TO_USER_SPACE: > + return X86EMUL_IO_NEEDED; > + default: > return emulate_gp(ctxt, 0); > - > - return X86EMUL_CONTINUE; > + } > } > > static int em_rdmsr(struct x86_emulate_ctxt *ctxt) > { > u64 msr_data; > > - if (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data)) > + switch (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data)) { > + case 0: > + break; > + case -ETRAP_TO_USER_SPACE: > + return X86EMUL_IO_NEEDED; > + default: > return emulate_gp(ctxt, 0); > + } > > *reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data; > *reg_write(ctxt, VCPU_REGS_RDX) = msr_data >> 32; > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 88c593f83b28..530729e7ca4b 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -1554,7 +1554,13 @@ int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu) > u32 ecx = kvm_rcx_read(vcpu); > u64 data; > > - if (kvm_get_msr(vcpu, ecx, &data)) { > + switch (kvm_get_msr(vcpu, ecx, &data)) { > + case 0: > + break; > + case -ETRAP_TO_USER_SPACE: > + trace_kvm_msr_read(ecx, data); > + return 0; > + default: > trace_kvm_msr_read_ex(ecx); > kvm_inject_gp(vcpu, 0); > return 1; > @@ -1573,7 +1579,13 @@ int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu) > u32 ecx = kvm_rcx_read(vcpu); > u64 data = 
kvm_read_edx_eax(vcpu); > > - if (kvm_set_msr(vcpu, ecx, data)) { > + switch (kvm_set_msr(vcpu, ecx, data)) { > + case 0: > + break; > + case -ETRAP_TO_USER_SPACE: > + trace_kvm_msr_write(ecx, data); > + return 0; > + default: > trace_kvm_msr_write_ex(ecx, data); > kvm_inject_gp(vcpu, 0); > return 1; > @@ -2797,6 +2809,26 @@ static void record_steal_time(struct kvm_vcpu *vcpu) > kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false); > } > > +static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > +{ > + if (vcpu->run->exit_reason == KVM_EXIT_WRMSR && vcpu->run->msr.reply) { > + vcpu->run->msr.reply = 0; > + > + if (vcpu->run->msr.error) > + return 1; > + > + return 0; > + } > + > + vcpu->run->exit_reason = KVM_EXIT_WRMSR; > + vcpu->run->msr.reply = 0; > + vcpu->run->msr.error = 0; > + vcpu->run->msr.index = msr_info->index; > + vcpu->run->msr.data = msr_info->data; > + > + return -ETRAP_TO_USER_SPACE; > +} > + > int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > { > bool pr = false; > @@ -3066,6 +3098,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > return xen_hvm_config(vcpu, data); > if (kvm_pmu_is_valid_msr(vcpu, msr)) > return kvm_pmu_set_msr(vcpu, msr_info); > + if (vcpu->kvm->arch.user_space_msr_enabled && !msr_info->host_initiated) > + return kvm_set_msr_user_space(vcpu, msr_info); > if (!ignore_msrs) { > vcpu_debug_ratelimited(vcpu, "unhandled wrmsr: 0x%x data 0x%llx\n", > msr, data); > @@ -3120,6 +3154,26 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host) > return 0; > } > > +static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > +{ > + if (vcpu->run->exit_reason == KVM_EXIT_RDMSR && vcpu->run->msr.reply) { > + vcpu->run->msr.reply = 0; > + > + if (vcpu->run->msr.error) > + return 1; > + > + msr_info->data = vcpu->run->msr.data; > + return 0; > + } > + > + vcpu->run->exit_reason = KVM_EXIT_RDMSR; > + 
vcpu->run->msr.reply = 0; > + vcpu->run->msr.error = 0; > + vcpu->run->msr.index = msr_info->index; > + > + return -ETRAP_TO_USER_SPACE; > +} > + > int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > { > switch (msr_info->index) { > @@ -3331,6 +3385,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > default: > if (kvm_pmu_is_valid_msr(vcpu, msr_info->index)) > return kvm_pmu_get_msr(vcpu, msr_info); > + if (vcpu->kvm->arch.user_space_msr_enabled && !msr_info->host_initiated) > + return kvm_get_msr_user_space(vcpu, msr_info); > if (!ignore_msrs) { > vcpu_debug_ratelimited(vcpu, "unhandled rdmsr: 0x%x\n", > msr_info->index); > @@ -3476,6 +3532,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > case KVM_CAP_MSR_PLATFORM_INFO: > case KVM_CAP_EXCEPTION_PAYLOAD: > case KVM_CAP_SET_GUEST_DEBUG: > + case KVM_CAP_X86_USER_SPACE_MSR: > r = 1; > break; > case KVM_CAP_SYNC_REGS: > @@ -4990,6 +5047,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > kvm->arch.exception_payload_enabled = cap->args[0]; > r = 0; > break; > + case KVM_CAP_X86_USER_SPACE_MSR: > + kvm->arch.user_space_msr_enabled = cap->args[0]; > + r = 0; > + break; > default: > r = -EINVAL; > break; > diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h > index 2c735a3e6613..09509dee4968 100644 > --- a/include/trace/events/kvm.h > +++ b/include/trace/events/kvm.h > @@ -17,7 +17,7 @@ > ERSN(NMI), ERSN(INTERNAL_ERROR), ERSN(OSI), ERSN(PAPR_HCALL), \ > ERSN(S390_UCONTROL), ERSN(WATCHDOG), ERSN(S390_TSCH), ERSN(EPR),\ > ERSN(SYSTEM_EVENT), ERSN(S390_STSI), ERSN(IOAPIC_EOI), \ > - ERSN(HYPERV) > + ERSN(HYPERV), ERSN(ARM_NISV), ERSN(RDMSR), ERSN(WRMSR) > > TRACE_EVENT(kvm_userspace_exit, > TP_PROTO(__u32 reason, int errno), > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index 4fdf30316582..df237bf2bdc2 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -248,6 +248,8 @@ struct 
kvm_hyperv_exit { > #define KVM_EXIT_IOAPIC_EOI 26 > #define KVM_EXIT_HYPERV 27 > #define KVM_EXIT_ARM_NISV 28 > +#define KVM_EXIT_RDMSR 29 > +#define KVM_EXIT_WRMSR 30 > > /* For KVM_EXIT_INTERNAL_ERROR */ > /* Emulate instruction failed. */ > @@ -412,6 +414,14 @@ struct kvm_run { > __u64 esr_iss; > __u64 fault_ipa; > } arm_nisv; > + /* KVM_EXIT_RDMSR / KVM_EXIT_WRMSR */ > + struct { > + __u8 reply; > + __u8 error; > + __u8 pad[2]; > + __u32 index; > + __u64 data; > + } msr; > /* Fix the size of the union. */ > char padding[256]; > }; > @@ -1031,6 +1041,7 @@ struct kvm_ppc_resize_hpt { > #define KVM_CAP_PPC_SECURE_GUEST 181 > #define KVM_CAP_HALT_POLL 182 > #define KVM_CAP_ASYNC_PF_INT 183 > +#define KVM_CAP_X86_USER_SPACE_MSR 184 > > #ifdef KVM_CAP_IRQ_ROUTING
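For readers following the thread, the reply protocol documented in the patch above can be sketched from the user space side as follows. This is only an illustration under stated assumptions: `struct msr_exit` is a local mirror of the proposed `msr` member rather than the kernel header, `handle_msr_exit()` and its parameters are hypothetical, and the bit layout used for MSR 0x35 (thread count in bits 15:0, core count in bits 31:16) is the one commonly documented for MSR_CORE_THREAD_COUNT.

```c
#include <stdint.h>

/* Local mirror of the "msr" member proposed in the patch (not the uapi header). */
struct msr_exit {
	uint8_t  reply;
	uint8_t  error;
	uint8_t  pad[2];
	uint32_t index;
	uint64_t data;
};

/* Exit reason numbers as proposed in this patch. */
enum { EXIT_RDMSR = 29, EXIT_WRMSR = 30 };

#define MSR_CORE_THREAD_COUNT 0x35

/*
 * Hypothetical VMM helper: fill in the reply for one deflected MSR access.
 * nr_threads and nr_cores describe the virtual package.
 */
void handle_msr_exit(uint32_t exit_reason, struct msr_exit *msr,
		     uint32_t nr_threads, uint32_t nr_cores)
{
	msr->reply = 1;		/* tell KVM that the request was processed */
	msr->error = 0;

	if (exit_reason == EXIT_RDMSR && msr->index == MSR_CORE_THREAD_COUNT) {
		/* Assumed layout: bits 15:0 thread count, bits 31:16 core count. */
		msr->data = ((uint64_t)(nr_cores & 0xffff) << 16) |
			    (nr_threads & 0xffff);
		return;
	}

	/* Everything else is unknown to this VMM: ask KVM to inject a #GP. */
	msr->error = 1;
}
```

With this protocol, the VMM would call such a helper on the `msr` member of `struct kvm_run` and then re-enter the VCPU; KVM picks up `reply` and `error` on the next run.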
On 28.07.20 10:15, Vitaly Kuznetsov wrote:
>
> Alexander Graf <graf@amazon.com> writes:
>
>> MSRs are weird. Some of them are normal control registers, such as EFER.
>> Some however are registers that really are model specific, not very
>> interesting to virtualization workloads, and not performance critical.
>> Others again are really just windows into package configuration.
>>
>> Out of these MSRs, only the first category is necessary to implement in
>> kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
>> certain CPU models and MSRs that contain information on the package level
>> are much better suited for user space to process. However, over time we have
>> accumulated a lot of MSRs that are not the first category, but still handled
>> by in-kernel KVM code.
>>
>> This patch adds a generic interface to handle WRMSR and RDMSR from user
>> space. With this, any future MSR that is part of the latter categories can
>> be handled in user space.
>>
>> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
>> something that applies per-VM rather than on the full system. That way you
>> can run productive VMs in parallel to experimental ones where you don't care
>> about proper MSR handling.
>>
>
> In theory, we can go further: userspace will give KVM the list of MSRs
> it is interested in. This list may even contain MSRs which are normally
> handled by KVM, in this case userspace gets an option to mangle KVM's
> reply (RDMSR) or do something extra (WRMSR). I'm not sure if there is a
> real need behind this, just an idea.
>
> The problem with this approach is: if currently some MSR is not
> implemented in KVM you will get an exit. When later someone comes with a
> patch to implement this MSR your userspace handling will immediately get
> broken so the list of not implemented MSRs effectively becomes an API :-)

Yeah, I'm not quite sure how to do this without bloating the kernel's
memory footprint too much though.

One option would be to create a shared bitmap with user space. But that
would need to be sparse and quite big to be able to address all of
today's possible MSR indexes. From a quick glimpse at Linux's MSR
defines, there are:

   0x00000000 - 0x00001000 (Intel)
   0x00001000 - 0x00002000 (VIA)
   0x40000000 - 0x50000000 (PV)
   0xc0000000 - 0xc0003000 (AMD)
   0xc0010000 - 0xc0012000 (AMD)
   0x80860000 - 0x80870000 (Transmeta)

Another idea would be to turn the logic around and implement an
allowlist in KVM with all of the MSRs that KVM should handle. In that
API we could ask for an array of KVM supported MSRs into user space.
User space could then bounce that array back to KVM to have all in-KVM
supported MSRs handled. Or it could remove entries that it wants to
handle on its own.

KVM internally could then save the list as a dense bitmap, translating
every list entry into its corresponding bit.

While it does feel a bit overengineered, it would solve the problem that
we're turning in-KVM handled MSRs into an ABI.

> >> Signed-off-by: Alexander Graf <graf@amazon.com> >> >> --- >> >> As a quick example to show what this does, I implemented handling for MSR 0x35 >> (MSR_CORE_THREAD_COUNT) in QEMU on top of this patch set: >> >> https://github.com/agraf/qemu/commits/user-space-msr >> --- >> Documentation/virt/kvm/api.rst | 60 ++++++++++++++++++++++++++++++ >> arch/x86/include/asm/kvm_host.h | 6 +++ >> arch/x86/kvm/emulate.c | 18 +++++++-- >> arch/x86/kvm/x86.c | 65 ++++++++++++++++++++++++++++++++- >> include/trace/events/kvm.h | 2 +- >> include/uapi/linux/kvm.h | 11 ++++++ >> 6 files changed, 155 insertions(+), 7 deletions(-) >> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst >> index 320788f81a05..7dfcc8e09dad 100644 >> --- a/Documentation/virt/kvm/api.rst >> +++ b/Documentation/virt/kvm/api.rst >> @@ -5155,6 +5155,34 @@ Note that KVM does not skip the faulting instruction as it does for >> KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state >> if it decides to decode and emulate the instruction. >> >> +:: >> + >> + /* KVM_EXIT_RDMSR / KVM_EXIT_WRMSR */ >> + struct { >> + __u8 reply; >> + __u8 error; >> + __u8 pad[2]; >> + __u32 index; >> + __u64 data; >> + } msr; > > (Personal taste most likely) > > This layout is perfect but it makes my brain explode :-) Naturally, I > expect index and data to be the most significant members and I expect > them to be the first two members, something like > > struct { > __u32 index; > __u32 pad32; > __u64 data; > __u8 reply; > __u8 error; > __u8 pad8[6]; > } msr; The layout I chose mimics the io one and does feel pretty natural to me (flags first, index next, data last). Let's shrug it off as taste? :) > >> + >> +Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is >> +enabled, MSR accesses to registers that are not known by KVM kernel code will >> +trigger a KVM_EXIT_RDMSR exit for reads and KVM_EXIT_WRMSR exit for writes. 
>> + >> +For KVM_EXIT_RDMSR, the "index" field tells user space which MSR the guest >> +wants to read. To respond to this request with a successful read, user space >> +writes a 1 into the "reply" field and the respective data into the "data" field. >> + >> +If the RDMSR request was unsuccessful, user space indicates that with a "1" >> +in the "reply" field and a "1" in the "error" field. This will inject a #GP >> +into the guest when the VCPU is executed again. >> + >> +For KVM_EXIT_WRMSR, the "index" field tells user space which MSR the guest >> +wants to write. Once finished processing the event, user space sets the "reply" >> +field to "1". If the MSR write was unsuccessful, user space also sets the >> +"error" field to "1". >> + >> :: >> >> /* Fix the size of the union. */ >> @@ -5844,6 +5872,27 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows >> the maximum halt time to specified on a per-VM basis, effectively overriding >> the module parameter for the target VM. >> >> +7.21 KVM_CAP_X86_USER_SPACE_MSR >> +---------------------- >> + >> +:Architectures: x86 >> +:Target: VM >> +:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise >> +:Returns: 0 on success; -1 on error >> + >> +This capability enabled trapping of unhandled RDMSR and WRMSR instructions >> +into user space. >> + >> +When a guest requests to read or write an MSR, KVM may not implement all MSRs >> +that are relevant to a respective system. It also does not differentiate by >> +CPU type. >> + >> +To allow more fine grained control over MSR handling, user space may enable >> +this capability. With it enabled, MSR accesses that are not handled by KVM >> +will trigger KVM_EXIT_RDMSR and KVM_EXIT_WRMSR exit notifications which >> +user space can then handle to implement model specific MSR handling and/or >> +user notifications to inform a user that an MSR was not handled. >> + >> 8. Other capabilities. 
>> ====================== >> >> @@ -6151,3 +6200,14 @@ KVM can therefore start protected VMs. >> This capability governs the KVM_S390_PV_COMMAND ioctl and the >> KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected >> guests when the state change is invalid. >> + >> +8.24 KVM_CAP_X86_USER_SPACE_MSR >> +---------------------------- >> + >> +:Architectures: x86 >> + >> +This capability indicates that KVM supports deflection of MSR reads and >> +writes to user space. It can be enabled on a VM level. If enabled, MSR >> +accesses that are not handled by KVM and would thus usually trigger a >> +#GP into the guest will instead get bounced to user space through the >> +KVM_EXIT_RDMSR and KVM_EXIT_WRMSR exit notifications. >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h >> index be5363b21540..c4218e05d8b8 100644 >> --- a/arch/x86/include/asm/kvm_host.h >> +++ b/arch/x86/include/asm/kvm_host.h >> @@ -1002,6 +1002,9 @@ struct kvm_arch { >> bool guest_can_read_msr_platform_info; >> bool exception_payload_enabled; >> >> + /* Deflect RDMSR and WRMSR to user space if not handled in kernel */ >> + bool user_space_msr_enabled; >> + >> struct kvm_pmu_event_filter *pmu_event_filter; >> struct task_struct *nx_lpage_recovery_thread; >> }; >> @@ -1437,6 +1440,9 @@ int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type); >> int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu, >> void *insn, int insn_len); >> >> +/* Indicate that an MSR operation should be handled by user space */ >> +#define ETRAP_TO_USER_SPACE EREMOTE > > What if we just use ENOENT in > kvm_set_msr_user_space()/kvm_get_msr_user_space()? Or, maybe, we can > just notice that KVM_EXIT_RDMSR/KVM_EXIT_WRMSR was set, this way we > don't need a specific exit code. Yeah, ENOENT is definitely a better option. 

Checking for the exit_reason in the rdmsr/wrmsr code paths is tricky, as
we don't provide any guarantees over the value of
vcpu->run->exit_reason unless we are in the user space return path. So
if you trap to user space for one MSR, handle that, continue and the
next MSR access is an in-kvm handled one that triggers a #GP, we have no
way to differentiate whether the exit_reason is just stale from the
previous run.

We could avoid that by setting exit_reason to unknown on every vcpu_run,
but it really only creates yet another magical API. Explicitly saying
"go back to user space" from {g,s}et_msr() is much more explicit and
readable IMHO.

Alex

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
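Apart from taste, the two `msr` layouts discussed in this exchange differ in one measurable way, which the following compile-time check makes visible. It is a sketch only: the struct names are invented here, and `uint8_t`/`uint32_t`/`uint64_t` stand in for the kernel's `__u8`/`__u32`/`__u64`.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout from the patch: flags first, index next, data last. */
struct msr_flags_first {
	uint8_t  reply;
	uint8_t  error;
	uint8_t  pad[2];
	uint32_t index;
	uint64_t data;
};

/* Layout suggested in review: index and data first, flags last. */
struct msr_index_first {
	uint32_t index;
	uint32_t pad32;
	uint64_t data;
	uint8_t  reply;
	uint8_t  error;
	uint8_t  pad8[6];
};

/* Both keep "data" naturally aligned at offset 8 without implicit padding. */
_Static_assert(offsetof(struct msr_flags_first, data) == 8, "data offset");
_Static_assert(offsetof(struct msr_index_first, data) == 8, "data offset");

/* The patch layout packs into 16 bytes; the suggested one needs 24. */
_Static_assert(sizeof(struct msr_flags_first) == 16, "flags-first size");
_Static_assert(sizeof(struct msr_index_first) == 24, "index-first size");
```

Either size fits comfortably into the 256-byte union of `struct kvm_run`, so the size difference does not decide the question.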
On Tue, Jul 28, 2020 at 5:41 AM Alexander Graf <graf@amazon.com> wrote:
>
>
>
> On 28.07.20 10:15, Vitaly Kuznetsov wrote:
> >
> > Alexander Graf <graf@amazon.com> writes:
> >
> >> MSRs are weird. Some of them are normal control registers, such as EFER.
> >> Some however are registers that really are model specific, not very
> >> interesting to virtualization workloads, and not performance critical.
> >> Others again are really just windows into package configuration.
> >>
> >> Out of these MSRs, only the first category is necessary to implement in
> >> kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
> >> certain CPU models and MSRs that contain information on the package level
> >> are much better suited for user space to process. However, over time we have
> >> accumulated a lot of MSRs that are not the first category, but still handled
> >> by in-kernel KVM code.
> >>
> >> This patch adds a generic interface to handle WRMSR and RDMSR from user
> >> space. With this, any future MSR that is part of the latter categories can
> >> be handled in user space.

This sounds similar to Peter Hornyack's RFC from 5 years ago:
https://www.mail-archive.com/kvm@vger.kernel.org/msg124448.html.

> >> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
> >> something that applies per-VM rather than on the full system. That way you
> >> can run productive VMs in parallel to experimental ones where you don't care
> >> about proper MSR handling.
> >>
> >
> > In theory, we can go further: userspace will give KVM the list of MSRs
> > it is interested in. This list may even contain MSRs which are normally
> > handled by KVM, in this case userspace gets an option to mangle KVM's
> > reply (RDMSR) or do something extra (WRMSR). I'm not sure if there is a
> > real need behind this, just an idea.
> >
> > The problem with this approach is: if currently some MSR is not
> > implemented in KVM you will get an exit. When later someone comes with a
> > patch to implement this MSR your userspace handling will immediately get
> > broken so the list of not implemented MSRs effectively becomes an API :-)

Indeed. This is a legitimate concern. At Google, we have experienced
this problem already, using Peter Hornyack's approach. We ended up
commenting out some MSRs from kvm, which is less than ideal.

> Yeah, I'm not quite sure how to do this without bloating the kernel's
> memory footprint too much though.
>
> One option would be to create a shared bitmap with user space. But that
> would need to be sparse and quite big to be able to address all of
> today's possible MSR indexes. From a quick glimpse at Linux's MSR
> defines, there are:
>
>    0x00000000 - 0x00001000 (Intel)
>    0x00001000 - 0x00002000 (VIA)
>    0x40000000 - 0x50000000 (PV)
>    0xc0000000 - 0xc0003000 (AMD)
>    0xc0010000 - 0xc0012000 (AMD)
>    0x80860000 - 0x80870000 (Transmeta)
>
> Another idea would be to turn the logic around and implement an
> allowlist in KVM with all of the MSRs that KVM should handle. In that
> API we could ask for an array of KVM supported MSRs into user space.
> User space could then bounce that array back to KVM to have all in-KVM
> supported MSRs handled. Or it could remove entries that it wants to
> handle on its own.
>
> KVM internally could then save the list as a dense bitmap, translating
> every list entry into its corresponding bit.
>
> While it does feel a bit overengineered, it would solve the problem that
> we're turning in-KVM handled MSRs into an ABI.

It seems unlikely that userspace is going to know what to do with a
large number of MSRs. I suspect that a small enumerated list will
suffice. In fact, +Aaron Lewis is working on upstreaming a local
Google patch set that does just that.
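The allowlist idea quoted above (an array of KVM-supported MSRs that user space bounces back, stored internally as a dense bitmap) might look roughly like the sketch below. Nothing here is from the patch: `struct msr_allowlist` and all function names are hypothetical, and the sketch assumes the list of in-kernel MSRs is sorted so that bit i of the bitmap can stand for entry i of the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for the "bounce the list back" idea. */
struct msr_allowlist {
	const uint32_t *msrs;	/* MSRs implemented in KVM, sorted ascending */
	size_t nr;		/* number of entries in msrs[] */
	uint64_t *bitmap;	/* bit i set: msrs[i] stays handled in KVM */
};

/* Binary search: position of msr in msrs[], or -1 if KVM does not know it. */
long msr_slot(const struct msr_allowlist *al, uint32_t msr)
{
	size_t lo = 0, hi = al->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (al->msrs[mid] == msr)
			return (long)mid;
		if (al->msrs[mid] < msr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;
}

/* Translate the list user space handed back into the dense bitmap. */
void msr_allowlist_apply(struct msr_allowlist *al,
			 const uint32_t *user, size_t nr_user)
{
	for (size_t i = 0; i < (al->nr + 63) / 64; i++)
		al->bitmap[i] = 0;

	for (size_t i = 0; i < nr_user; i++) {
		long slot = msr_slot(al, user[i]);

		if (slot >= 0)	/* entries unknown to KVM are skipped */
			al->bitmap[slot / 64] |= 1ULL << (slot % 64);
	}
}

/* Nonzero if the access should be handled in KVM rather than deflected. */
int msr_handled_in_kernel(const struct msr_allowlist *al, uint32_t msr)
{
	long slot = msr_slot(al, msr);

	return slot >= 0 && ((al->bitmap[slot / 64] >> (slot % 64)) & 1);
}
```

The bitmap needs one bit per MSR that KVM implements, not one bit per possible index, which is what keeps it small compared to a bitmap over the sparse index ranges listed in the quoted mail.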
Jim Mattson <jmattson@google.com> writes:

> On Tue, Jul 28, 2020 at 5:41 AM Alexander Graf <graf@amazon.com> wrote:
>>

...

>> While it does feel a bit overengineered, it would solve the problem that
>> we're turning in-KVM handled MSRs into an ABI.
>
> It seems unlikely that userspace is going to know what to do with a
> large number of MSRs. I suspect that a small enumerated list will
> suffice.

The list can also be 'wildcarded', i.e.
{
    u32 index;
    u32 mask;
    ...
}

to make it really short.
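The mail leaves the meaning of `mask` open. One possible reading, sketched below purely as an assumption, is that `mask` selects the bits of the MSR index that must equal the corresponding bits of `index`; `struct msr_wildcard` and `msr_matches()` are illustrative names, not part of any proposed API.

```c
#include <stddef.h>
#include <stdint.h>

/* One wildcard entry; the "..." members from the mail are left out. */
struct msr_wildcard {
	uint32_t index;
	uint32_t mask;	/* assumed: set bits must match, clear bits are "don't care" */
};

/* Nonzero if msr is covered by any entry of the list. */
int msr_matches(const struct msr_wildcard *list, size_t nr, uint32_t msr)
{
	for (size_t i = 0; i < nr; i++)
		if ((msr & list[i].mask) == (list[i].index & list[i].mask))
			return 1;
	return 0;
}
```

Under this reading, a single entry `{ 0x40000000, 0xf0000000 }` would cover the whole 0x40000000 - 0x50000000 paravirtual range mentioned earlier in the thread, while `{ 0x35, 0xffffffff }` would name exactly one MSR.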
On 28.07.20 19:13, Jim Mattson wrote:
>
> On Tue, Jul 28, 2020 at 5:41 AM Alexander Graf <graf@amazon.com> wrote:
>>
>>
>>
>> On 28.07.20 10:15, Vitaly Kuznetsov wrote:
>>>
>>> Alexander Graf <graf@amazon.com> writes:
>>>
>>>> MSRs are weird. Some of them are normal control registers, such as EFER.
>>>> Some however are registers that really are model specific, not very
>>>> interesting to virtualization workloads, and not performance critical.
>>>> Others again are really just windows into package configuration.
>>>>
>>>> Out of these MSRs, only the first category is necessary to implement in
>>>> kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
>>>> certain CPU models and MSRs that contain information on the package level
>>>> are much better suited for user space to process. However, over time we have
>>>> accumulated a lot of MSRs that are not the first category, but still handled
>>>> by in-kernel KVM code.
>>>>
>>>> This patch adds a generic interface to handle WRMSR and RDMSR from user
>>>> space. With this, any future MSR that is part of the latter categories can
>>>> be handled in user space.
>
> This sounds similar to Peter Hornyack's RFC from 5 years ago:
> https://www.mail-archive.com/kvm@vger.kernel.org/msg124448.html.

Yeah, looks very similar. Do you know the history why it never got
merged? I couldn't spot a non-RFC version of this on the ML.

>
>>>> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
>>>> something that applies per-VM rather than on the full system. That way you
>>>> can run productive VMs in parallel to experimental ones where you don't care
>>>> about proper MSR handling.
>>>>
>>>
>>> In theory, we can go further: userspace will give KVM the list of MSRs
>>> it is interested in. This list may even contain MSRs which are normally
>>> handled by KVM, in this case userspace gets an option to mangle KVM's
>>> reply (RDMSR) or do something extra (WRMSR). I'm not sure if there is a
>>> real need behind this, just an idea.
>>>
>>> The problem with this approach is: if currently some MSR is not
>>> implemented in KVM you will get an exit. When later someone comes with a
>>> patch to implement this MSR your userspace handling will immediately get
>>> broken so the list of not implemented MSRs effectively becomes an API :-)
>
> Indeed. This is a legitimate concern. At Google, we have experienced
> this problem already, using Peter Hornyack's approach. We ended up
> commenting out some MSRs from kvm, which is less than ideal.

Yeah :(.

>
>> Yeah, I'm not quite sure how to do this without bloating the kernel's
>> memory footprint too much though.
>>
>> One option would be to create a shared bitmap with user space. But that
>> would need to be sparse and quite big to be able to address all of
>> today's possible MSR indexes. From a quick glimpse at Linux's MSR
>> defines, there are:
>>
>>    0x00000000 - 0x00001000 (Intel)
>>    0x00001000 - 0x00002000 (VIA)
>>    0x40000000 - 0x50000000 (PV)
>>    0xc0000000 - 0xc0003000 (AMD)
>>    0xc0010000 - 0xc0012000 (AMD)
>>    0x80860000 - 0x80870000 (Transmeta)
>>
>> Another idea would be to turn the logic around and implement an
>> allowlist in KVM with all of the MSRs that KVM should handle. In that
>> API we could ask for an array of KVM supported MSRs into user space.
>> User space could then bounce that array back to KVM to have all in-KVM
>> supported MSRs handled. Or it could remove entries that it wants to
>> handle on its own.
>>
>> KVM internally could then save the list as a dense bitmap, translating
>> every list entry into its corresponding bit.
>>
>> While it does feel a bit overengineered, it would solve the problem that
>> we're turning in-KVM handled MSRs into an ABI.
>
> It seems unlikely that userspace is going to know what to do with a
> large number of MSRs. I suspect that a small enumerated list will
> suffice. In fact, +Aaron Lewis is working on upstreaming a local
> Google patch set that does just that.

I tend to disagree on that sentiment. One of the motivations behind this
patch is to propagate invalid MSR accesses into user space, to move logic
like "ignore_msrs"[1] into user space. This is not very useful for the
cloud use case, but it does come in handy when you want to have VMs that
can handle unimplemented MSRs in parallel to ones that do not.

So whatever we implement, I would ideally want a mechanism at the end of
the day that allows me to "trap the rest" into user space.

Alex

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/x86.c#n114
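To make the "trap the rest" use case concrete: once every unhandled access reaches user space, a VMM could apply an ignore_msrs-style policy per VM instead of per host. The sketch below assumes the reply protocol from the patch (set `reply`, optionally `error`); the struct is a local mirror of the proposed `msr` member, and `enum msr_policy` and `reply_unhandled_msr()` are hypothetical names.

```c
#include <stdint.h>

/* Local mirror of the "msr" member proposed in the patch. */
struct msr_exit {
	uint8_t  reply;
	uint8_t  error;
	uint8_t  pad[2];
	uint32_t index;
	uint64_t data;
};

/* Hypothetical per-VM setting kept by the VMM. */
enum msr_policy {
	MSR_POLICY_STRICT,	/* unknown MSR access raises #GP in the guest */
	MSR_POLICY_IGNORE,	/* reads return 0, writes are silently dropped */
};

/* Answer an MSR exit that the VMM has no specific handler for. */
void reply_unhandled_msr(struct msr_exit *msr, int is_read,
			 enum msr_policy policy)
{
	msr->reply = 1;

	if (policy == MSR_POLICY_IGNORE) {
		msr->error = 0;
		if (is_read)
			msr->data = 0;
		return;
	}

	msr->error = 1;	/* KVM injects #GP when the VCPU runs again */
}
```

An experimental VM would run with `MSR_POLICY_IGNORE` while a production VM on the same host keeps `MSR_POLICY_STRICT`, which is the per-VM split the patch description asks for.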
On 29.07.20 10:23, Vitaly Kuznetsov wrote:
> Jim Mattson <jmattson@google.com> writes:
>
>> On Tue, Jul 28, 2020 at 5:41 AM Alexander Graf <graf@amazon.com> wrote:
>
> ...
>
>>> While it does feel a bit overengineered, it would solve the problem that
>>> we're turning in-KVM handled MSRs into an ABI.
>>
>> It seems unlikely that userspace is going to know what to do with a
>> large number of MSRs. I suspect that a small enumerated list will
>> suffice.
>
> The list can also be 'wildcarded', i.e.
> {
>   u32 index;
>   u32 mask;
>   ...
> }
>
> to make it really short.

I like the idea of wildcards, but I can't quite wrap my head around how
we would implement ignore_msrs in user space with them?

Alex
Alexander Graf <graf@amazon.com> writes:

> On 29.07.20 10:23, Vitaly Kuznetsov wrote:
>> Jim Mattson <jmattson@google.com> writes:
>>
>>> On Tue, Jul 28, 2020 at 5:41 AM Alexander Graf <graf@amazon.com> wrote:
>>
>> ...
>>
>>>> While it does feel a bit overengineered, it would solve the problem that
>>>> we're turning in-KVM handled MSRs into an ABI.
>>>
>>> It seems unlikely that userspace is going to know what to do with a
>>> large number of MSRs. I suspect that a small enumerated list will
>>> suffice.
>>
>> The list can also be 'wildcarded', i.e.
>> {
>>   u32 index;
>>   u32 mask;
>>   ...
>> }
>>
>> to make it really short.
>
> I like the idea of wildcards, but I can't quite wrap my head around how
> we would implement ignore_msrs in user space with them?
>

For that I think we can still deflect all unknown MSR accesses to
userspace (when the CAP is enabled of course) but MSRs which are on the
list will *have to be deflected*, i.e. KVM can't handle them internally
without consulting with userspace.

We can make it tunable through a parameter for CAP enablement if needed.
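A rough sketch of how such a wildcarded "must deflect" list could be matched. The thread only names the two fields, so the matching rule used here (an access matches when it agrees with `index` on every bit set in `mask`) is an assumption, and `msr_wildcard`, `msr_matches` and `msr_must_deflect` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list entry; the semantics of 'mask' are assumed, see above. */
struct msr_wildcard {
	uint32_t index;
	uint32_t mask;
};

/* An MSR matches if it equals 'index' on all bits selected by 'mask'. */
static bool msr_matches(const struct msr_wildcard *w, uint32_t msr)
{
	return (msr & w->mask) == (w->index & w->mask);
}

/* True if any entry on the list claims this MSR for user space. */
static bool msr_must_deflect(const struct msr_wildcard *list, size_t n,
			     uint32_t msr)
{
	for (size_t i = 0; i < n; i++)
		if (msr_matches(&list[i], msr))
			return true;
	return false;
}
```

Under this rule an entry { 0x40000000, 0xf0000000 } would cover the whole 0x4xxxxxxx window with one element, while a mask of 0xffffffff degenerates to an exact match on a single MSR.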
On 29.07.20 11:22, Vitaly Kuznetsov wrote:
> Alexander Graf <graf@amazon.com> writes:
>
>> On 29.07.20 10:23, Vitaly Kuznetsov wrote:
>>> Jim Mattson <jmattson@google.com> writes:
>>>
>>>> On Tue, Jul 28, 2020 at 5:41 AM Alexander Graf <graf@amazon.com> wrote:
>>>
>>> ...
>>>
>>>>> While it does feel a bit overengineered, it would solve the problem that
>>>>> we're turning in-KVM handled MSRs into an ABI.
>>>>
>>>> It seems unlikely that userspace is going to know what to do with a
>>>> large number of MSRs. I suspect that a small enumerated list will
>>>> suffice.
>>>
>>> The list can also be 'wildcarded', i.e.
>>> {
>>>   u32 index;
>>>   u32 mask;
>>>   ...
>>> }
>>>
>>> to make it really short.
>>
>> I like the idea of wildcards, but I can't quite wrap my head around how
>> we would implement ignore_msrs in user space with them?
>>
>
> For that I think we can still deflect all unknown MSR accesses to
> userspace (when the CAP is enabled of course ) but MSRs which are on the
> list will *have to be deflected*, i.e. KVM can't handle them internally
> without consulting with userspace.
>
> We can make it tunable through a parameter for CAP enablement if needed.

That would still make the set of MSRs implemented in KVM a de-facto ABI,
no?

Another thing that might be worth bringing up here is that we have an
in-house mechanism to set up an allowlist for KVM handling MSR accesses.
What if we combine the two?

int kvm_rdmsr(...)
{
    switch (msr) {
    [...]
    default:
        return -ENOENT;
    }
}

int rdmsr(...)
{
    if (!has_allowlist || msr_read_is_allowed(msr))
        return kvm_rdmsr();

    return -ENOENT;
}

int handle_rdmsr(...)
{
    switch (rdmsr(msr)) {
    case 0:
        return 1;
    case 1:
        inject_gp();
        return 1;
    case -ENOENT:
        if (cap_msr_exit) {
            run->exit_reason = MSR;
            return 0;
        } else {
            inject_gp();
            return 1;
        }
    }
}

That way user space can either say "I don't care what you implement,
just tell me all the MSRs you could not handle" or it says "I want you
to handle this exact subset of MSRs, tell me any time there's an out of
bounds access".

That would give us the best of both worlds, right?

Alex
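The pseudocode above can be turned into a small stand-alone model to check the two configurations it is meant to support. This is only a sketch: `vm_model`, the `*_model` functions, the `outcome` enum and `allow_none` are hypothetical stand-ins, and the in-kernel handler is reduced to knowing a single MSR (EFER).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* What the exit handler decides to do next (hypothetical names). */
enum outcome { RESUME_GUEST, INJECT_GP, EXIT_TO_USER };

struct vm_model {
	bool has_allowlist;                        /* user space installed a list */
	bool cap_msr_exit;                         /* user space wants MSR exits */
	bool (*msr_read_is_allowed)(uint32_t msr); /* only used with a list */
};

/* Example policy for a VM whose allowlist is empty. */
static bool allow_none(uint32_t msr)
{
	(void)msr;
	return false;
}

/* Stand-in for the in-kernel handlers: 0 = ok, 1 = fault, -ENOENT = unknown. */
static int kvm_rdmsr_model(uint32_t msr, uint64_t *data)
{
	switch (msr) {
	case 0xc0000080: /* EFER */
		*data = 0;
		return 0;
	default:
		return -ENOENT;
	}
}

/* The allowlist hides MSRs from KVM even if KVM could handle them. */
static int rdmsr_model(const struct vm_model *vm, uint32_t msr, uint64_t *data)
{
	if (!vm->has_allowlist || vm->msr_read_is_allowed(msr))
		return kvm_rdmsr_model(msr, data);
	return -ENOENT;
}

static enum outcome handle_rdmsr_model(const struct vm_model *vm, uint32_t msr,
				       uint64_t *data)
{
	switch (rdmsr_model(vm, msr, data)) {
	case 0:
		return RESUME_GUEST;
	case -ENOENT:
		return vm->cap_msr_exit ? EXIT_TO_USER : INJECT_GP;
	default:
		return INJECT_GP;
	}
}
```

A VM with no allowlist and `cap_msr_exit` set behaves as "tell me everything you could not handle"; a VM with an allowlist gets an exit even for MSRs KVM implements, which is what keeps the set of in-kernel MSRs from leaking into the ABI.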
On Wed, Jul 29, 2020 at 2:06 AM Alexander Graf <graf@amazon.com> wrote:
>
> On 28.07.20 19:13, Jim Mattson wrote:

> > This sounds similar to Peter Hornyack's RFC from 5 years ago:
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg124448.html.
>
> Yeah, looks very similar. Do you know the history why it never got
> merged? I couldn't spot a non-RFC version of this on the ML.

I believe Peter got frustrated with all of the pushback he was
getting, and he moved on to other things. While Google still uses that
code, Aaron's new approach should give us equivalent functionality
without having to comment out the MSRs that kvm previously didn't know
about, and which we still want redirected to userspace.

> > It seems unlikely that userspace is going to know what to do with a
> > large number of MSRs. I suspect that a small enumerated list will
> > suffice. In fact, +Aaron Lewis is working on upstreaming a local
> > Google patch set that does just that.
>
> I tend to disagree on that sentiment. One of the motivations behind this
> patch is to populate invalid MSR accesses into user space, to move logic
> like "ignore_msrs"[1] into user space. This is not very useful for the
> cloud use case, but it does come in handy when you want to have VMs that
> can handle unimplemented MSRs in parallel to ones that do not.
>
> So whatever we implement, I would ideally want a mechanism at the end of
> the day that allows me to "trap the rest" into user space.

I do think "the rest" should be explicitly specified, so that
userspace doesn't get surprises when kvm evolves. Maybe this can be
done using the allow-list you refer to later, along with a specified
action for disallowed MSRs: (1) raise #GP, (2) ignore, or (3) exit to
userspace. This actually seems orthogonal to what Aaron is working on,
which is to request that specific MSR accesses exit to userspace. But,
at least the plumbing for {RD,WR}MSR completion when coming back from
userspace can be leveraged by both.
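One way to picture the "allow-list plus a specified action" idea is a per-VM policy with an explicit fallback. The sketch below is an illustration only: `msr_action`, `msr_policy` and `msr_policy_lookup` are hypothetical names, and a linear scan stands in for whatever lookup structure a real implementation would use.

```c
#include <stddef.h>
#include <stdint.h>

/* The in-KVM case plus the three actions listed for disallowed MSRs. */
enum msr_action {
	MSR_HANDLE_IN_KVM,
	MSR_RAISE_GP,
	MSR_IGNORE,
	MSR_EXIT_TO_USER,
};

/* Hypothetical per-VM policy supplied by user space. */
struct msr_policy {
	const uint32_t *allowed;  /* MSRs KVM may handle itself */
	size_t nr_allowed;
	enum msr_action fallback; /* what to do with everything else */
};

/* Decide how one guest MSR access is treated under the policy. */
static enum msr_action msr_policy_lookup(const struct msr_policy *p,
					 uint32_t msr)
{
	for (size_t i = 0; i < p->nr_allowed; i++)
		if (p->allowed[i] == msr)
			return MSR_HANDLE_IN_KVM;
	return p->fallback;
}
```

Because the allowed set is spelled out by user space, a later kernel that learns a new MSR does not change the outcome for that MSR until user space adds it to the list.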
On 29.07.20 20:27, Jim Mattson wrote:
> On Wed, Jul 29, 2020 at 2:06 AM Alexander Graf <graf@amazon.com> wrote:
>>
>> On 28.07.20 19:13, Jim Mattson wrote:
>
>>> This sounds similar to Peter Hornyack's RFC from 5 years ago:
>>> https://www.mail-archive.com/kvm@vger.kernel.org/msg124448.html.
>>
>> Yeah, looks very similar. Do you know the history why it never got
>> merged? I couldn't spot a non-RFC version of this on the ML.
>
> I believe Peter got frustrated with all of the pushback he was
> getting, and he moved on to other things. While Google still uses that
> code, Aaron's new approach should give us equivalent functionality
> without having to comment out the MSRs that kvm previously didn't know
> about, and which we still want redirected to userspace.
>
>>> It seems unlikely that userspace is going to know what to do with a
>>> large number of MSRs. I suspect that a small enumerated list will
>>> suffice. In fact, +Aaron Lewis is working on upstreaming a local
>>> Google patch set that does just that.
>>
>> I tend to disagree on that sentiment. One of the motivations behind this
>> patch is to populate invalid MSR accesses into user space, to move logic
>> like "ignore_msrs"[1] into user space. This is not very useful for the
>> cloud use case, but it does come in handy when you want to have VMs that
>> can handle unimplemented MSRs in parallel to ones that do not.
>>
>> So whatever we implement, I would ideally want a mechanism at the end of
>> the day that allows me to "trap the rest" into user space.
>
> I do think "the rest" should be explicitly specified, so that
> userspace doesn't get surprises when kvm evolves. Maybe this can be
> done using the allow-list you refer to later, along with a specified
> action for disallowed MSRs: (1) raise #GP, (2) ignore, or (3) exit to
> userspace. This actually seems orthogonal to what Aaron is working on,
> which is to request that specific MSR accesses exit to userspace. But,
> at least the plumbing for {RD,WR}MSR completion when coming back from
> userspace can be leveraged by both.

Thinking about this for a while, I am quite confident that we don't need
to complexify this all that much. The #GP path is never performance
critical and thus can easily be handled in user space.

There are a few niche cases where exiting to user space is "too
complicated" (think nVMX MSR restore path). But they are niche and just
bailing out for the user space exit path on them is fine.

So I think a patch that allows us to allow list MSRs that should be
handled in KVM and another patch that allows us to deflect any MSR
inflicted #GPs into user space is all it takes to make this a flexible
and stable ABI.

The great thing is that by untangling the two bits, we can also support
the "user space wants to leave it all to KVM, but be able to implement
ignore_msrs itself" use case easily. User space would just not set an
allow list.

Meanwhile, I have cleaned up Karim's old patch to add allow listing to
KVM and would post it if Aaron doesn't beat me to it :).

Alex
On Wed, Jul 29, 2020 at 1:29 PM Alexander Graf <graf@amazon.com> wrote:

> Meanwhile, I have cleaned up Karim's old patch to add allow listing to
> KVM and would post it if Aaron doesn't beat me to it :).

Ideally, this becomes a collaboration rather than a race to the
finish. I'd like to see both proposals, so that we can take the best
parts of each!
On 29.07.20 22:37, Jim Mattson wrote:
> On Wed, Jul 29, 2020 at 1:29 PM Alexander Graf <graf@amazon.com> wrote:
>
>> Meanwhile, I have cleaned up Karim's old patch to add allow listing to
>> KVM and would post it if Aaron doesn't beat me to it :).
>
> Ideally, this becomes a collaboration rather than a race to the
> finish. I'd like to see both proposals, so that we can take the best
> parts of each!

Oh, definitely! I'm not really married to Karim's patch here, it was
just simply there and is dead simple.

Do you have a rough ETA for Aaron's patch set yet? :)

Alex
On Wed, Jul 29, 2020 at 1:46 PM Alexander Graf <graf@amazon.com> wrote:
> Do you have a rough ETA for Aaron's patch set yet? :)
Rough ETA: Friday (31 July 2020).
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 320788f81a05..7dfcc8e09dad 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5155,6 +5155,34 @@ Note that KVM does not skip the faulting instruction as it does for
 KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
 if it decides to decode and emulate the instruction.
 
+::
+
+		/* KVM_EXIT_RDMSR / KVM_EXIT_WRMSR */
+		struct {
+			__u8 reply;
+			__u8 error;
+			__u8 pad[2];
+			__u32 index;
+			__u64 data;
+		} msr;
+
+Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
+enabled, MSR accesses to registers that are not known by KVM kernel code will
+trigger a KVM_EXIT_RDMSR exit for reads and KVM_EXIT_WRMSR exit for writes.
+
+For KVM_EXIT_RDMSR, the "index" field tells user space which MSR the guest
+wants to read. To respond to this request with a successful read, user space
+writes a 1 into the "reply" field and the respective data into the "data" field.
+
+If the RDMSR request was unsuccessful, user space indicates that with a "1"
+in the "reply" field and a "1" in the "error" field. This will inject a #GP
+into the guest when the VCPU is executed again.
+
+For KVM_EXIT_WRMSR, the "index" field tells user space which MSR the guest
+wants to write. Once finished processing the event, user space sets the "reply"
+field to "1". If the MSR write was unsuccessful, user space also sets the
+"error" field to "1".
+
 ::
 
 		/* Fix the size of the union. */
@@ -5844,6 +5872,27 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
 the maximum halt time to specified on a per-VM basis, effectively overriding
 the module parameter for the target VM.
 
+7.21 KVM_CAP_X86_USER_SPACE_MSR
+----------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise
+:Returns: 0 on success; -1 on error
+
+This capability enabled trapping of unhandled RDMSR and WRMSR instructions
+into user space.
+
+When a guest requests to read or write an MSR, KVM may not implement all MSRs
+that are relevant to a respective system. It also does not differentiate by
+CPU type.
+
+To allow more fine grained control over MSR handling, user space may enable
+this capability. With it enabled, MSR accesses that are not handled by KVM
+will trigger KVM_EXIT_RDMSR and KVM_EXIT_WRMSR exit notifications which
+user space can then handle to implement model specific MSR handling and/or
+user notifications to inform a user that an MSR was not handled.
+
 8. Other capabilities.
 ======================
 
@@ -6151,3 +6200,14 @@ KVM can therefore start protected VMs.
 This capability governs the KVM_S390_PV_COMMAND ioctl and the
 KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
 guests when the state change is invalid.
+
+8.24 KVM_CAP_X86_USER_SPACE_MSR
+----------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports deflection of MSR reads and
+writes to user space. It can be enabled on a VM level. If enabled, MSR
+accesses that are not handled by KVM and would thus usually trigger a
+#GP into the guest will instead get bounced to user space through the
+KVM_EXIT_RDMSR and KVM_EXIT_WRMSR exit notifications.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..c4218e05d8b8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1002,6 +1002,9 @@ struct kvm_arch {
 	bool guest_can_read_msr_platform_info;
 	bool exception_payload_enabled;
 
+	/* Deflect RDMSR and WRMSR to user space if not handled in kernel */
+	bool user_space_msr_enabled;
+
 	struct kvm_pmu_event_filter *pmu_event_filter;
 	struct task_struct *nx_lpage_recovery_thread;
 };
@@ -1437,6 +1440,9 @@ int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
 int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
 					void *insn, int insn_len);
 
+/* Indicate that an MSR operation should be handled by user space */
+#define ETRAP_TO_USER_SPACE EREMOTE
+
 void kvm_enable_efer_bits(u64);
 bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
 int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d0e2825ae617..b08000e3b2fe 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3693,18 +3693,28 @@ static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
 
 	msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
 		| ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
-	if (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data))
+	switch (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data)) {
+	case 0:
+		return X86EMUL_CONTINUE;
+	case -ETRAP_TO_USER_SPACE:
+		return X86EMUL_IO_NEEDED;
+	default:
 		return emulate_gp(ctxt, 0);
-
-	return X86EMUL_CONTINUE;
+	}
 }
 
 static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
 {
 	u64 msr_data;
 
-	if (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data))
+	switch (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data)) {
+	case 0:
+		break;
+	case -ETRAP_TO_USER_SPACE:
+		return X86EMUL_IO_NEEDED;
+	default:
 		return emulate_gp(ctxt, 0);
+	}
 
 	*reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
 	*reg_write(ctxt, VCPU_REGS_RDX) = msr_data >> 32;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..530729e7ca4b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1554,7 +1554,13 @@ int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu)
 	u32 ecx = kvm_rcx_read(vcpu);
 	u64 data;
 
-	if (kvm_get_msr(vcpu, ecx, &data)) {
+	switch (kvm_get_msr(vcpu, ecx, &data)) {
+	case 0:
+		break;
+	case -ETRAP_TO_USER_SPACE:
+		trace_kvm_msr_read(ecx, data);
+		return 0;
+	default:
 		trace_kvm_msr_read_ex(ecx);
 		kvm_inject_gp(vcpu, 0);
 		return 1;
@@ -1573,7 +1579,13 @@ int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu)
 	u32 ecx = kvm_rcx_read(vcpu);
 	u64 data = kvm_read_edx_eax(vcpu);
 
-	if (kvm_set_msr(vcpu, ecx, data)) {
+	switch (kvm_set_msr(vcpu, ecx, data)) {
+	case 0:
+		break;
+	case -ETRAP_TO_USER_SPACE:
+		trace_kvm_msr_write(ecx, data);
+		return 0;
+	default:
 		trace_kvm_msr_write_ex(ecx, data);
 		kvm_inject_gp(vcpu, 0);
 		return 1;
@@ -2797,6 +2809,26 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false);
 }
 
+static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	if (vcpu->run->exit_reason == KVM_EXIT_WRMSR && vcpu->run->msr.reply) {
+		vcpu->run->msr.reply = 0;
+
+		if (vcpu->run->msr.error)
+			return 1;
+
+		return 0;
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_WRMSR;
+	vcpu->run->msr.reply = 0;
+	vcpu->run->msr.error = 0;
+	vcpu->run->msr.index = msr_info->index;
+	vcpu->run->msr.data = msr_info->data;
+
+	return -ETRAP_TO_USER_SPACE;
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	bool pr = false;
@@ -3066,6 +3098,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return xen_hvm_config(vcpu, data);
 		if (kvm_pmu_is_valid_msr(vcpu, msr))
 			return kvm_pmu_set_msr(vcpu, msr_info);
+		if (vcpu->kvm->arch.user_space_msr_enabled && !msr_info->host_initiated)
+			return kvm_set_msr_user_space(vcpu, msr_info);
 		if (!ignore_msrs) {
 			vcpu_debug_ratelimited(vcpu, "unhandled wrmsr: 0x%x data 0x%llx\n",
 				    msr, data);
@@ -3120,6 +3154,26 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
 	return 0;
 }
 
+static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	if (vcpu->run->exit_reason == KVM_EXIT_RDMSR && vcpu->run->msr.reply) {
+		vcpu->run->msr.reply = 0;
+
+		if (vcpu->run->msr.error)
+			return 1;
+
+		msr_info->data = vcpu->run->msr.data;
+		return 0;
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_RDMSR;
+	vcpu->run->msr.reply = 0;
+	vcpu->run->msr.error = 0;
+	vcpu->run->msr.index = msr_info->index;
+
+	return -ETRAP_TO_USER_SPACE;
+}
+
 int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	switch (msr_info->index) {
@@ -3331,6 +3385,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
 			return kvm_pmu_get_msr(vcpu, msr_info);
+		if (vcpu->kvm->arch.user_space_msr_enabled && !msr_info->host_initiated)
+			return kvm_get_msr_user_space(vcpu, msr_info);
 		if (!ignore_msrs) {
 			vcpu_debug_ratelimited(vcpu, "unhandled rdmsr: 0x%x\n",
 					       msr_info->index);
@@ -3476,6 +3532,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_MSR_PLATFORM_INFO:
 	case KVM_CAP_EXCEPTION_PAYLOAD:
 	case KVM_CAP_SET_GUEST_DEBUG:
+	case KVM_CAP_X86_USER_SPACE_MSR:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4990,6 +5047,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		kvm->arch.exception_payload_enabled = cap->args[0];
 		r = 0;
 		break;
+	case KVM_CAP_X86_USER_SPACE_MSR:
+		kvm->arch.user_space_msr_enabled = cap->args[0];
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 2c735a3e6613..09509dee4968 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -17,7 +17,7 @@
 	ERSN(NMI), ERSN(INTERNAL_ERROR), ERSN(OSI), ERSN(PAPR_HCALL), \
 	ERSN(S390_UCONTROL), ERSN(WATCHDOG), ERSN(S390_TSCH), ERSN(EPR),\
 	ERSN(SYSTEM_EVENT), ERSN(S390_STSI), ERSN(IOAPIC_EOI), \
-	ERSN(HYPERV)
+	ERSN(HYPERV), ERSN(ARM_NISV), ERSN(RDMSR), ERSN(WRMSR)
 
 TRACE_EVENT(kvm_userspace_exit,
 	    TP_PROTO(__u32 reason, int errno),
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..df237bf2bdc2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -248,6 +248,8 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI 26
 #define KVM_EXIT_HYPERV 27
 #define KVM_EXIT_ARM_NISV 28
+#define KVM_EXIT_RDMSR 29
+#define KVM_EXIT_WRMSR 30
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -412,6 +414,14 @@ struct kvm_run {
 			__u64 esr_iss;
 			__u64 fault_ipa;
 		} arm_nisv;
+		/* KVM_EXIT_RDMSR / KVM_EXIT_WRMSR */
+		struct {
+			__u8 reply;
+			__u8 error;
+			__u8 pad[2];
+			__u32 index;
+			__u64 data;
+		} msr;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1031,6 +1041,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_X86_USER_SPACE_MSR 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
MSRs are weird. Some of them are normal control registers, such as EFER.
Some however are registers that really are model specific, not very
interesting to virtualization workloads, and not performance critical.
Others again are really just windows into package configuration.

Out of these MSRs, only the first category is necessary to implement in
kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
certain CPU models and MSRs that contain information on the package level
are much better suited for user space to process. However, over time we have
accumulated a lot of MSRs that are not in the first category, but are still
handled by in-kernel KVM code.

This patch adds a generic interface to handle WRMSR and RDMSR from user
space. With this, any future MSR that is part of the latter categories can
be handled in user space.

Furthermore, it allows us to replace the existing "ignore_msrs" logic with
something that applies per-VM rather than on the full system. That way you
can run productive VMs in parallel to experimental ones where you don't care
about proper MSR handling.

Signed-off-by: Alexander Graf <graf@amazon.com>

---

As a quick example to show what this does, I implemented handling for MSR 0x35
(MSR_CORE_THREAD_COUNT) in QEMU on top of this patch set:

  https://github.com/agraf/qemu/commits/user-space-msr
---
 Documentation/virt/kvm/api.rst  | 60 ++++++++++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h |  6 +++
 arch/x86/kvm/emulate.c          | 18 +++++++--
 arch/x86/kvm/x86.c              | 65 ++++++++++++++++++++++++++++++++-
 include/trace/events/kvm.h      |  2 +-
 include/uapi/linux/kvm.h        | 11 ++++++
 6 files changed, 155 insertions(+), 7 deletions(-)
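To make the reply/error handshake described in the api.rst hunk concrete, here is a minimal user-space sketch. It deliberately does not include <linux/kvm.h>: the struct layout and the exit numbers 29/30 are copied from this RFC patch and are redefined locally under the hypothetical names `msr_exit`, `EXIT_RDMSR` and `EXIT_WRMSR`. The MSR 0x35 encoding (thread count in bits 15:0, core count in bits 31:16) is my understanding of the register and should be checked against the SDM; `handle_msr_exit` is not the QEMU implementation linked above.

```c
#include <stdint.h>

/* Values copied from this patch, not from any released uapi header. */
#define EXIT_RDMSR 29
#define EXIT_WRMSR 30
#define MSR_CORE_THREAD_COUNT 0x35

/* Local mirror of the proposed 'msr' member of struct kvm_run. */
struct msr_exit {
	uint8_t reply;
	uint8_t error;
	uint8_t pad[2];
	uint32_t index;
	uint64_t data;
};

/*
 * Fill in the reply for one MSR exit. The caller would re-enter KVM_RUN
 * afterwards so the kernel can complete the instruction or inject #GP.
 */
static void handle_msr_exit(uint32_t exit_reason, struct msr_exit *msr,
			    uint32_t threads, uint32_t cores)
{
	msr->reply = 1; /* always answer, success or not */
	msr->error = 0;

	if (exit_reason == EXIT_RDMSR && msr->index == MSR_CORE_THREAD_COUNT) {
		/* Assumed encoding: cores in 31:16, threads in 15:0. */
		msr->data = ((uint64_t)cores << 16) | threads;
		return;
	}

	/* Anything else is unknown here: request a #GP in the guest. */
	msr->error = 1;
}
```

A per-VM replacement for "ignore_msrs" would differ only in the fallback branch: leave `error` at 0 and, for reads, return 0 in `data`.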