Message ID | 20200818211533.849501-5-aaronlewis@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Allow userspace to manage MSRs | expand |
On 18.08.20 23:15, Aaron Lewis wrote: > > Add KVM_SET_EXIT_MSRS ioctl to allow userspace to pass in a list of MSRs > that force an exit to userspace when rdmsr or wrmsr are used by the > guest. > > KVM_SET_EXIT_MSRS will need to be called before any vCPUs are > created to protect the 'user_exit_msrs' list from being mutated while > vCPUs are running. > > Add KVM_CAP_SET_MSR_EXITS to identify the feature exists. > > Signed-off-by: Aaron Lewis <aaronlewis@google.com> > Reviewed-by: Oliver Upton <oupton@google.com> Why would we still need this with the allow list and user space #GP deflection logic in place? Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On Wed, Aug 19, 2020 at 2:00 AM Alexander Graf <graf@amazon.com> wrote: > Why would we still need this with the allow list and user space #GP > deflection logic in place? Conversion to an allow list is cumbersome when you have a short deny list. Suppose that I want to implement the following deny list: {IA32_ARCH_CAPABILITIES, HV_X64_MSR_REFERENCE_TSC, MSR_GOOGLE_TRUE_TIME, MSR_GOOGLE_FDR_TRACE, MSR_GOOGLE_HBI}. What would the corresponding deny list look like? Given your current implementation, I don't think the corresponding allow list can actually be constructed. I want to allow 2^32-5 MSRs, but I can allow at most 122880, if I've done the math correctly. (10 ranges, each spanning at most 0x600 bytes worth of bitmap.) Perhaps we should adopt allow/deny rules similar to those accepted by most firewalls. Instead of ports, we have MSR indices. Instead of protocols, we have READ, WRITE, or READ/WRITE. Suppose that we supported up to <n> rules of the form: {start index, end index, access modes, allow or deny}? Rules would be processed in the order given, and the first rule that matched a given access would take precedence. Finally, userspace could specify the default behavior (either allow or deny) for any MSR access that didn't match any of the rules. Thoughts?
Hi Jim, On 20.08.20 19:30, Jim Mattson wrote: > > > On Wed, Aug 19, 2020 at 2:00 AM Alexander Graf <graf@amazon.com> wrote: > >> Why would we still need this with the allow list and user space #GP >> deflection logic in place? > > Conversion to an allow list is cumbersome when you have a short deny > list. Suppose that I want to implement the following deny list: > {IA32_ARCH_CAPABILITIES, HV_X64_MSR_REFERENCE_TSC, > MSR_GOOGLE_TRUE_TIME, MSR_GOOGLE_FDR_TRACE, MSR_GOOGLE_HBI}. What > would the corresponding deny list look like? Given your current > implementation, I don't think the corresponding allow list can > actually be constructed. I want to allow 2^32-5 MSRs, but I can allow > at most 122880, if I've done the math correctly. (10 ranges, each > spanning at most 0x600 bytes worth of bitmap.) There are only very few MSR ranges that actually data. So in your case, to allow all MSRs that Linux knows about in msr-index.h, you would need [0x00000000 - 0x00002000] [0x40000000 - 0x40000200] [0x4b564d00 - 0x4b564e00] [0x80868000 - 0x80868020] [0xc0000000 - 0xc0000200] [0xc0010000 - 0xc0012000] [0xc0020000 - 0xc0020010] which are 7 regions. For good measure, you can probably pad every one of them to the full 0x3000 MSRs they can span. For MSRs that KVM actually handles in-kernel (others don't need to be allowed), the list shrinks to 5: [0x00000000 - 0x00001000] [0x40000000 - 0x40000200] [0x4b564d00 - 0x4b564e00] [0xc0000000 - 0xc0000200] [0xc0010000 - 0xc0012000] Let's extend them a bit to make reasoning easier: [0x00000000 - 0x00003000] [0x40000000 - 0x40003000] [0x4b564d00 - 0x4b567000] [0xc0000000 - 0xc0003000] [0xc0010000 - 0xc0013000] What are the odds that you will want to implicitly (without a new CAP, that would need user space adjustments anyway) have a random new MSR handled in-kernel with an identifier that is outside of those ranges? I'm fairly confident that trends towards 0. The only real downside I can see is that we just wasted ~8kb of RAM. Nothing I would really get hung up on though. > Perhaps we should adopt allow/deny rules similar to those accepted by > most firewalls. Instead of ports, we have MSR indices. Instead of > protocols, we have READ, WRITE, or READ/WRITE. Suppose that we > supported up to <n> rules of the form: {start index, end index, access > modes, allow or deny}? Rules would be processed in the order given, > and the first rule that matched a given access would take precedence. > Finally, userspace could specify the default behavior (either allow or > deny) for any MSR access that didn't match any of the rules. > > Thoughts? That wouldn't scale well if you want to allow all architecturally useful MSRs in a purely allow list fashion. You'd have to create hundreds of rules - or at least a few dozen if you combine contiguous ranges. If you really desperately believe a deny list is a better fit for your use case, we could redesign the interface differently: struct msr_set_accesslist { #define MSR_ACCESSLIST_DEFAULT_ALLOW 0 #define MSR_ACCESSLIST_DEFAULT_DENY 1 u32 flags; struct { u32 flags; u32 nmsrs; /* MSRs in bitmap */ u32 base; /* first MSR address to bitmap */ void *bitmap; /* pointer to bitmap, 1 means allow, 0 deny */ } lists[10]; }; which means in your use case, you can do u64 deny = 0; struct msr_set_accesslist access = { .flags = MSR_ACCESSLIST_DEFAULT_ALLOW, .lists = { { .nmsrs = 1, .base = IA32_ARCH_CAPABILITIES, .bitmap = &deny, }, { { .nmsrs = 1, .base = HV_X64_MSR_REFERENCE_TSC, .bitmap = &deny, }, { { .nmsrs = 1, /* can probably be combined with the ones below? */ .base = MSR_GOOGLE_TRUE_TIME, .bitmap = &deny, }, { { .nmsrs = 1, .base = MSR_GOOGLE_FDR_TRACE, .bitmap = &deny, }, { { .nmsrs = 1, .base = MSR_GOOGLE_HBI, .bitmap = &deny, }, } }; msr_set_accesslist(kvm_fd, &access); while I can do the same dance as before, but with a single call rather than multiple ones. What do you think? Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On Thu, Aug 20, 2020 at 2:49 PM Alexander Graf <graf@amazon.com> wrote: > The only real downside I can see is that we just wasted ~8kb of RAM. > Nothing I would really get hung up on though. I also suspect that the MSR permission bitmap modifications are going to be a bit more expensive with 4kb (6kb on AMD) of pertinent allow-bitmaps than they would be with a few bytes of pertinent deny-bitmaps. > If you really desperately believe a deny list is a better fit for your > use case, we could redesign the interface differently: > > struct msr_set_accesslist { > #define MSR_ACCESSLIST_DEFAULT_ALLOW 0 > #define MSR_ACCESSLIST_DEFAULT_DENY 1 > u32 flags; > struct { > u32 flags; > u32 nmsrs; /* MSRs in bitmap */ > u32 base; /* first MSR address to bitmap */ > void *bitmap; /* pointer to bitmap, 1 means allow, 0 deny */ > } lists[10]; > }; > > which means in your use case, you can do > > u64 deny = 0; > struct msr_set_accesslist access = { > .flags = MSR_ACCESSLIST_DEFAULT_ALLOW, > .lists = { > { > .nmsrs = 1, > .base = IA32_ARCH_CAPABILITIES, > .bitmap = &deny, > }, { > { > .nmsrs = 1, > .base = HV_X64_MSR_REFERENCE_TSC, > .bitmap = &deny, > }, { > { > .nmsrs = 1, > /* can probably be combined with the ones below? */ > .base = MSR_GOOGLE_TRUE_TIME, > .bitmap = &deny, > }, { > { > .nmsrs = 1, > .base = MSR_GOOGLE_FDR_TRACE, > .bitmap = &deny, > }, { > { > .nmsrs = 1, > .base = MSR_GOOGLE_HBI, > .bitmap = &deny, > }, > } > }; > > msr_set_accesslist(kvm_fd, &access); > > while I can do the same dance as before, but with a single call rather > than multiple ones. > > What do you think? I like it. I think this suits our use case well.
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 91ce3e4b5b2e..e3cf1e971d0f 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1010,6 +1010,30 @@ such as migration. :Parameters: struct kvm_vcpu_event (out) :Returns: 0 on success, -1 on error +4.32 KVM_SET_EXIT_MSRS +------------------ + +:Capability: KVM_CAP_SET_MSR_EXITS +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_msr_list (in) +:Returns: 0 on success, -1 on error + +Sets the userspace MSR list which is used to track which MSRs KVM should send +to userspace to be serviced when the guest executes rdmsr or wrmsr. + +This ioctl needs to be called before vCPUs are setup otherwise the list of MSRs +will not be accepted and an EINVAL error will be returned. Also, if a list of +MSRs has already been supplied, and this ioctl is called again an EEXIST error +will be returned. + +:: + + struct kvm_msr_list { + __u32 nmsrs; + __u32 indices[0]; +}; + X86: ^^^^ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1ee8468c913c..6c4c5b972395 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -976,6 +976,8 @@ struct kvm_arch { struct kvm_pmu_event_filter *pmu_event_filter; struct task_struct *nx_lpage_recovery_thread; + + struct kvm_msr_list *user_exit_msrs; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c46a709be532..e349d51d5d65 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3570,6 +3570,42 @@ static inline bool kvm_can_mwait_in_guest(void) boot_cpu_has(X86_FEATURE_ARAT); } +static int kvm_vm_ioctl_set_exit_msrs(struct kvm *kvm, + struct kvm_msr_list __user *user_msr_list) +{ + struct kvm_msr_list *msr_list, hdr; + size_t indices_size; + + if (kvm->arch.user_exit_msrs != NULL) + return -EEXIST; + + if (kvm->created_vcpus) + return -EINVAL; + + if (copy_from_user(&hdr, user_msr_list, sizeof(hdr))) + return -EFAULT; + + if (hdr.nmsrs >= MAX_IO_MSRS) + return -E2BIG; + + indices_size = sizeof(hdr.indices[0]) * hdr.nmsrs; + msr_list = kvzalloc(sizeof(struct kvm_msr_list) + indices_size, + GFP_KERNEL_ACCOUNT); + if (!msr_list) + return -ENOMEM; + msr_list->nmsrs = hdr.nmsrs; + + if (copy_from_user(msr_list->indices, user_msr_list->indices, + indices_size)) { + kvfree(msr_list); + return -EFAULT; + } + + kvm->arch.user_exit_msrs = msr_list; + + return 0; +} + int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) { int r = 0; @@ -3630,6 +3666,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_LAST_CPU: case KVM_CAP_X86_USER_SPACE_MSR: case KVM_CAP_X86_MSR_ALLOWLIST: + case KVM_CAP_SET_MSR_EXITS: r = 1; break; case KVM_CAP_SYNC_REGS: @@ -5532,6 +5569,10 @@ long kvm_arch_vm_ioctl(struct file *filp, r = 0; break; } + case KVM_SET_EXIT_MSRS: { + r = kvm_vm_ioctl_set_exit_msrs(kvm, argp); + break; + } case KVM_MEMORY_ENCRYPT_OP: { r = -ENOTTY; if (kvm_x86_ops.mem_enc_op) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 374021dc4e61..7d47d518a5d4 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1046,6 +1046,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_S390_DIAG318 186 #define KVM_CAP_X86_USER_SPACE_MSR 187 #define KVM_CAP_X86_MSR_ALLOWLIST 188 +#define KVM_CAP_SET_MSR_EXITS 189 #ifdef KVM_CAP_IRQ_ROUTING @@ -1386,6 +1387,7 @@ struct kvm_s390_ucas_mapping { /* Available with KVM_CAP_PMU_EVENT_FILTER */ #define KVM_SET_PMU_EVENT_FILTER _IOW(KVMIO, 0xb2, struct kvm_pmu_event_filter) #define KVM_PPC_SVM_OFF _IO(KVMIO, 0xb3) +#define KVM_SET_EXIT_MSRS _IOW(KVMIO, 0xb4, struct kvm_msr_list) /* ioctl for vm fd */ #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)