Message ID | 20230914015531.1419405-8-seanjc@google.com (mailing list archive) |
---|---|
State | Superseded |
Series | KVM: guest_memfd() and per-page attributes |
Context | Check | Description |
---|---|---|
conchuod/cover_letter | success | Series has a cover letter |
conchuod/tree_selection | success | Guessed tree name to be for-next at HEAD 0bb80ecc33a8 |
conchuod/fixes_present | success | Fixes tag not required for -next series |
conchuod/maintainers_pattern | success | MAINTAINERS pattern errors before the patch: 5 and now 5 |
conchuod/verify_signedoff | success | Signed-off-by tag matches author and committer |
conchuod/kdoc | success | Errors and warnings before: 1 this patch: 1 |
conchuod/build_rv64_clang_allmodconfig | success | Errors and warnings before: 9 this patch: 9 |
conchuod/module_param | success | Was 0 now: 0 |
conchuod/build_rv64_gcc_allmodconfig | success | Errors and warnings before: 9 this patch: 9 |
conchuod/build_rv32_defconfig | success | Build OK |
conchuod/dtb_warn_rv64 | success | Errors and warnings before: 25 this patch: 25 |
conchuod/header_inline | success | No static functions without inline keyword in header files |
conchuod/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 85 lines checked |
conchuod/build_rv64_nommu_k210_defconfig | success | Build OK |
conchuod/verify_fixes | success | No Fixes tag |
conchuod/build_rv64_nommu_virt_defconfig | success | Build OK |
On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
>
> Add a new KVM exit type to allow userspace to handle memory faults that
> KVM cannot resolve, but that userspace *may* be able to handle (without
> terminating the guest).
>
> KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> conversions between private and shared memory. With guest private memory,
> there will be two kinds of memory conversions:
>
>   - explicit conversion: happens when the guest explicitly calls into KVM
>     to map a range (as private or shared)
>
>   - implicit conversion: happens when the guest attempts to access a gfn
>     that is configured in the "wrong" state (private vs. shared)
>
> On x86 (first architecture to support guest private memory), explicit
> conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,

Side topic.

Do we expect to integrate TDVMCALL(MAPGPA) of TDX into KVM_HC_MAP_GPA_RANGE?

> but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> as there is (obviously) no hypercall, and there is no guarantee that the
> guest actually intends to convert between private and shared, i.e. what
> KVM thinks is an implicit conversion "request" could actually be the
> result of a guest code bug.
>
> KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> be implicit conversions.
>
> Place "struct memory_fault" in a second anonymous union so that filling
> memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
> and to provide additional information if KVM does NOT ultimately exit to
> userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
> loses) the exit, as KVM often suppresses exits for memory failures that
> occur when accessing paravirt data structures. The initial usage for
> private memory will be all-or-nothing, but other features such as the
> proposed "userfault on missing mappings" support will use
> KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
> will run afoul of KVM's various quirks.

So when the exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field
in the first union is valid?

When the exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know that the
info in the second union, run.memory, is valid without a run.memory.valid
field?

> Use bit 3 for flagging private memory so that KVM can use bits 0-2 for
> capturing RWX behavior if/when userspace needs such information.
>
> Note! To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'! Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.
>
> Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
> Cc: Anish Moorthy <amoorthy@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 24 ++++++++++++++++++++++++
>  include/linux/kvm_host.h       | 15 +++++++++++++++
>  include/uapi/linux/kvm.h       | 24 ++++++++++++++++++++++++
>  3 files changed, 63 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 21a7578142a1..e28a13439a95 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6702,6 +6702,30 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
> +could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
> +guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
> +describes properties of the faulting access that are likely pertinent:
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
> +   on a private memory access. When clear, indicates the fault occurred on a
> +   shared access.
> +
> +Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> +accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
> +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> +kvm_run.exit_reason is stale/undefined for all other error numbers.
> +

Initially, this section was a copy of struct kvm_run, with a comment for each
field. Unfortunately, that consistency has not been well maintained as new
fields were added.

Do we expect to fix it?

>  ::
>
>  		/* KVM_EXIT_NOTIFY */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4e741ff27af3..d8c6ce6c8211 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size,
> +						 bool is_write, bool is_exec,
> +						 bool is_private)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* RWX flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +	if (is_private)
> +		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bd1abe067f28..d2d913acf0df 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -274,6 +274,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI 35
>  #define KVM_EXIT_RISCV_CSR 36
>  #define KVM_EXIT_NOTIFY 37
> +#define KVM_EXIT_MEMORY_FAULT 38
>
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -541,6 +542,29 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	/*
> +	 * This second exit union holds structs for exit types which may be
> +	 * triggered after KVM has already initiated a different exit, or which
> +	 * may be ultimately dropped by KVM.
> +	 *
> +	 * For example, because of limitations in KVM's uAPI, KVM x86 can
> +	 * generate a memory fault exit an MMIO exit is initiated (exit_reason
> +	 * and kvm_run.mmio are filled). And conversely, KVM often disables
> +	 * paravirt features if a memory fault occurs when accessing paravirt
> +	 * data instead of reporting the error to userspace.
> +	 */
> +	union {
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
> +		/* Fix the size of the union. */
> +		char padding2[256];
> +	};
>  };
>
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
On Fri, Sep 22, 2023, Xiaoyao Li wrote:
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> >
> > Add a new KVM exit type to allow userspace to handle memory faults that
> > KVM cannot resolve, but that userspace *may* be able to handle (without
> > terminating the guest).
> >
> > KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> > conversions between private and shared memory. With guest private memory,
> > there will be two kinds of memory conversions:
> >
> >   - explicit conversion: happens when the guest explicitly calls into KVM
> >     to map a range (as private or shared)
> >
> >   - implicit conversion: happens when the guest attempts to access a gfn
> >     that is configured in the "wrong" state (private vs. shared)
> >
> > On x86 (first architecture to support guest private memory), explicit
> > conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
>
> Side topic.
>
> Do we expect to integrate TDVMCALL(MAPGPA) of TDX into KVM_HC_MAP_GPA_RANGE?

Yes, that's my expectation.

> > but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> > as there is (obviously) no hypercall, and there is no guarantee that the
> > guest actually intends to convert between private and shared, i.e. what
> > KVM thinks is an implicit conversion "request" could actually be the
> > result of a guest code bug.
> >
> > KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> > be implicit conversions.
> >
> > Place "struct memory_fault" in a second anonymous union so that filling
> > memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
> > and to provide additional information if KVM does NOT ultimately exit to
> > userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
> > loses) the exit, as KVM often suppresses exits for memory failures that
> > occur when accessing paravirt data structures. The initial usage for
> > private memory will be all-or-nothing, but other features such as the
> > proposed "userfault on missing mappings" support will use
> > KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
> > will run afoul of KVM's various quirks.
>
> So when the exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field
> in the first union is valid?
>
> When the exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know that the
> info in the second union, run.memory, is valid without a run.memory.valid
> field?

I'll respond to this separately with a trimmed Cc list. I suspect this will
be a rather lengthy conversation, and it has almost nothing to do with
guest_memfd.

> > +Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> > +accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
> > +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> > +kvm_run.exit_reason is stale/undefined for all other error numbers.
> > +
>
> Initially, this section was a copy of struct kvm_run, with a comment for each
> field. Unfortunately, that consistency has not been well maintained as new
> fields were added.
>
> Do we expect to fix it?

AFAIK, no one is working on cleaning up this section of the docs, but as
always, patches are welcome :-)
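The layout property claimed in the commit message (filling memory_fault does not clobber a pending exit's payload in the first union) can be seen with a small stand-alone mock. This is only a sketch under stated assumptions: `struct demo_run`, the `demo_*` helpers and the `DEMO_*` constants are hypothetical stand-ins for a heavily trimmed `struct kvm_run`, and `demo_prepare_memory_fault()` merely imitates what `kvm_prepare_memory_fault_exit()` does in the patch. It does not answer the open question of how userspace learns which union is valid.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; real code would use <linux/kvm.h>. */
#define DEMO_EXIT_MMIO		6
#define DEMO_EXIT_MEMORY_FAULT	38
#define DEMO_FLAG_PRIVATE	(1ULL << 3)

/* Heavily trimmed mock of struct kvm_run with two anonymous unions. */
struct demo_run {
	uint32_t exit_reason;
	union {				/* first exit union */
		struct {
			uint64_t phys_addr;
			uint8_t  data[8];
			uint32_t len;
			uint8_t  is_write;
		} mmio;
		char padding[256];
	};
	union {				/* second exit union */
		struct {
			uint64_t flags;
			uint64_t gpa;
			uint64_t size;
		} memory_fault;
		char padding2[256];
	};
};

/* The second union starts after the first one's 256 bytes. */
_Static_assert(offsetof(struct demo_run, memory_fault) >=
	       offsetof(struct demo_run, mmio) + 256,
	       "second union must not overlap the first");

/* Imitates KVM starting an MMIO exit: exit_reason and mmio are filled. */
static void demo_begin_mmio(struct demo_run *run, uint64_t addr,
			    uint32_t len, int is_write)
{
	run->exit_reason = DEMO_EXIT_MMIO;
	run->mmio.phys_addr = addr;
	run->mmio.len = len;
	run->mmio.is_write = (uint8_t)is_write;
	memset(run->mmio.data, 0, sizeof(run->mmio.data));
}

/* Imitates kvm_prepare_memory_fault_exit() on the mock structure. */
static void demo_prepare_memory_fault(struct demo_run *run, uint64_t gpa,
				      uint64_t size, int is_private)
{
	run->exit_reason = DEMO_EXIT_MEMORY_FAULT;
	run->memory_fault.gpa = gpa;
	run->memory_fault.size = size;
	run->memory_fault.flags = 0;
	if (is_private)
		run->memory_fault.flags |= DEMO_FLAG_PRIVATE;
}
```

Calling `demo_begin_mmio()` and then `demo_prepare_memory_fault()` on the same `struct demo_run` overwrites `exit_reason` but leaves the `mmio` payload intact, which is the situation the questions above are about.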
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..e28a13439a95 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6702,6 +6702,30 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access. When clear, indicates the fault occurred on a
+   shared access.
+
+Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
 		/* KVM_EXIT_NOTIFY */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4e741ff27af3..d8c6ce6c8211 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2327,4 +2327,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define KVM_DIRTY_RING_MAX_ENTRIES 65536
 
+static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						 gpa_t gpa, gpa_t size,
+						 bool is_write, bool is_exec,
+						 bool is_private)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.size = size;
+
+	/* RWX flags are not (yet) defined or communicated to userspace. */
+	vcpu->run->memory_fault.flags = 0;
+	if (is_private)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bd1abe067f28..d2d913acf0df 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -274,6 +274,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI 35
 #define KVM_EXIT_RISCV_CSR 36
 #define KVM_EXIT_NOTIFY 37
+#define KVM_EXIT_MEMORY_FAULT 38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -541,6 +542,29 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	/*
+	 * This second exit union holds structs for exit types which may be
+	 * triggered after KVM has already initiated a different exit, or which
+	 * may be ultimately dropped by KVM.
+	 *
+	 * For example, because of limitations in KVM's uAPI, KVM x86 can
+	 * generate a memory fault exit an MMIO exit is initiated (exit_reason
+	 * and kvm_run.mmio are filled). And conversely, KVM often disables
+	 * paravirt features if a memory fault occurs when accessing paravirt
+	 * data instead of reporting the error to userspace.
+	 */
+	union {
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
+		/* Fix the size of the union. */
+		char padding2[256];
+	};
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
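
From userspace, the contract described in the api.rst hunk (KVM_RUN returns -1, errno is EFAULT or EHWPOISON, and only then is exit_reason meaningful for this exit) could be checked roughly as follows. This is a minimal sketch, not code from the series: the `sketch_*` names, the trimmed `struct sketch_run` and the result enum are hypothetical stand-ins, and a real VMM would use `struct kvm_run` and the constants from `<linux/kvm.h>`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins mirroring the values added by the patch (assumption: a real
 * VMM takes these from <linux/kvm.h>). */
#define SKETCH_EXIT_MEMORY_FAULT	38
#define SKETCH_FLAG_PRIVATE		(1ULL << 3)

#ifndef EHWPOISON
#define EHWPOISON 133	/* asm-generic value, only for non-Linux builds */
#endif

/* Trimmed, hypothetical view of the kvm_run fields used here. */
struct sketch_run {
	uint32_t exit_reason;
	struct {
		uint64_t flags;
		uint64_t gpa;
		uint64_t size;
	} memory_fault;
};

enum sketch_result {
	SKETCH_NORMAL_EXIT,	/* ret == 0: dispatch on exit_reason as usual */
	SKETCH_FAULT_PRIVATE,	/* memory fault on a private access */
	SKETCH_FAULT_SHARED,	/* memory fault on a shared access */
	SKETCH_OTHER_ERROR,	/* exit_reason must be treated as stale */
};

/*
 * Classify the outcome of one KVM_RUN call. 'ret' is the ioctl return
 * value and 'err' the errno captured immediately afterwards.
 */
static enum sketch_result sketch_classify(int ret, int err,
					  const struct sketch_run *run)
{
	assert(run != NULL);

	if (ret == 0)
		return SKETCH_NORMAL_EXIT;

	/* For any other errno, exit_reason is stale/undefined. */
	if (err != EFAULT && err != EHWPOISON)
		return SKETCH_OTHER_ERROR;

	if (run->exit_reason != SKETCH_EXIT_MEMORY_FAULT)
		return SKETCH_OTHER_ERROR;

	if (run->memory_fault.flags & SKETCH_FLAG_PRIVATE)
		return SKETCH_FAULT_PRIVATE;
	return SKETCH_FAULT_SHARED;
}
```

A run loop would call `sketch_classify()` right after KVM_RUN and, for the two fault results, look at the range [gpa, gpa + size). Whether such a fault is treated as an implicit conversion request or as a guest bug remains the VMM's policy decision, as the commit message notes.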