Message ID: 20240417153450.3608097-2-pbonzini@redhat.com
Series: KVM: Guest Memory Pre-Population API
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> +4.143 KVM_MAP_MEMORY
> +------------------------
> +
> +:Capability: KVM_CAP_MAP_MEMORY
> +:Architectures: none
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_map_memory (in/out)
> +:Returns: 0 on success, < 0 on error
> +
> +Errors:
> +
> +  ========== ===============================================================
> +  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
> +             page aligned or outside the defined memory slots).

"outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
that can _never_ succeed.

> +  EAGAIN     The ioctl should be invoked again and no page was processed.
> +  EINTR      An unmasked signal is pending and no page was processed.

I'm guessing we'll want to handle large ranges, at which point we'll likely end
up with EAGAIN and/or EINTR after processing at least one page.

> +  EFAULT     The parameter address was invalid.
> +  EOPNOTSUPP The architecture does not support this operation, or the
> +             guest state does not allow it.

I would phrase this as something like:

  Mapping memory given for a GPA is unsupported by the
  architecture, and/or for the current vCPU state/mode.

It's not that the guest state doesn't "allow" it, it's that it's explicitly
unsupported because it's nonsensical without a GVA (or L2 GPA).

> +  ========== ===============================================================
> +
> +::
> +
> +  struct kvm_map_memory {
> +	/* in/out */
> +	__u64 base_address;

I think we should commit to this being limited to gpa mappings, e.g. go with
"gpa", or "guest_physical_address" if we want to be verbose (I vote for "gpa").

> +	__u64 size;
> +	/* in */
> +	__u64 flags;
> +	__u64 padding[5];
> +  };
> +
> +KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.

I think we should word this very carefully and explicitly so that KVM doesn't
commit to behavior that can't be guaranteed. We might even want to use a name
that explicitly captures the semantics, e.g. KVM_PRE_FAULT_MEMORY?

Also, this doesn't populate guest _memory_, and "in the page tables of a vCPU"
could be interpreted as the _guest's_ page tables.

Something like:

  KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
  for the current vCPU state. KVM maps memory as if the vCPU generated a
  stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
  CoW. However, KVM does not mark any newly created stage-2 PTE as Accessed.

> +When the ioctl returns, the input values are updated to point to the
> +remaining range. If `size` > 0 on return, the caller can just issue
> +the ioctl again with the same `struct kvm_map_memory` argument.

This is likely misleading. Unless KVM explicitly zeros size on *every* failure,
a pedantic reading of this would suggest that userspace can retry and it should
eventually succeed.

> +In some cases, multiple vCPUs might share the page tables. In this
> +case, if this ioctl is called in parallel for multiple vCPUs the
> +ioctl might return with `size` > 0.

Why? If there's already a valid mapping, mission accomplished. I don't see any
reason to return an error. If x86's page fault path returns RET_PF_RETRY, then I
think it makes sense to retry in KVM, not punt this to userspace.

> +The ioctl may not be supported for all VMs, and may just return
> +an `EOPNOTSUPP` error if a VM does not support it. You may use
> +`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
> +supported.

Why per-VM? I don't think there's any per-VM state that would change the behavior.
The TDP MMU being enabled is KVM wide, and the guest state modifiers that cause
problems are per-vCPU, not per-VM.

Adding support for KVM_CHECK_EXTENSION on vCPU FDs is probably overkill, e.g. I
don't think it would add much value beyond returning EOPNOTSUPP for the ioctl()
itself.

> +Also, shadow page tables cannot support this ioctl because they
> +are indexed by virtual address or nested guest physical address.
> +Calling this ioctl when the guest is using shadow page tables (for
> +example because it is running a nested guest) will also fail.

Running a nested guest using TDP.

> +
> +`flags` must currently be zero.
> +
> +
>  5. The kvm_run structure
>  ========================
>
> --
> 2.43.0
>
>
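For illustration of the in/out contract being discussed, here is a minimal
userspace sketch. It assumes struct kvm_map_memory and the KVM_MAP_MEMORY
ioctl number come from a <linux/kvm.h> built against this series; per the
review both names may change (e.g. to KVM_PRE_FAULT_MEMORY).

/*
 * Illustrative only: pre-populate [base_address, base_address + size)
 * for one vCPU, per the semantics proposed in this thread.
 */
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int map_guest_range(int vcpu_fd, uint64_t base_address, uint64_t size)
{
	struct kvm_map_memory map;

	memset(&map, 0, sizeof(map));	/* flags must currently be zero */
	map.base_address = base_address;
	map.size = size;

	while (map.size) {
		/*
		 * A return of 0 means at least one page was processed and
		 * the in/out fields now describe the remaining range;
		 * size == 0 means the whole range is mapped.
		 */
		if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &map) == 0)
			continue;
		/* EINTR/EAGAIN: no page was processed, safe to retry. */
		if (errno != EINTR && errno != EAGAIN)
			return -errno;	/* hard error; size not zeroed */
	}
	return 0;
}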
On Wed, Apr 17, 2024 at 10:28 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > +4.143 KVM_MAP_MEMORY
> > +------------------------
> > +
> > +:Capability: KVM_CAP_MAP_MEMORY
> > +:Architectures: none
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_map_memory (in/out)
> > +:Returns: 0 on success, < 0 on error
> > +
> > +Errors:
> > +
> > +  ========== ===============================================================
> > +  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
> > +             page aligned or outside the defined memory slots).
>
> "outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
> that can _never_ succeed.
>
> > +  EAGAIN     The ioctl should be invoked again and no page was processed.
> > +  EINTR      An unmasked signal is pending and no page was processed.
>
> I'm guessing we'll want to handle large ranges, at which point we'll likely end
> up with EAGAIN and/or EINTR after processing at least one page.

Yes, in that case you get a success (return value of 0), just like read().

> > +  EFAULT     The parameter address was invalid.
> > +  EOPNOTSUPP The architecture does not support this operation, or the
> > +             guest state does not allow it.
>
> I would phrase this as something like:
>
>   Mapping memory given for a GPA is unsupported by the
>   architecture, and/or for the current vCPU state/mode.

Better.

> > +  struct kvm_map_memory {
> > +	/* in/out */
> > +	__u64 base_address;
>
> I think we should commit to this being limited to gpa mappings, e.g. go with
> "gpa", or "guest_physical_address" if we want to be verbose (I vote for "gpa").
>
> > +	__u64 size;
> > +	/* in */
> > +	__u64 flags;
> > +	__u64 padding[5];
> > +  };
> > +
> > +KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
>
> I think we should word this very carefully and explicitly so that KVM doesn't
> commit to behavior that can't be guaranteed. We might even want to use a name
> that explicitly captures the semantics, e.g. KVM_PRE_FAULT_MEMORY?
>
> Also, this doesn't populate guest _memory_, and "in the page tables of a vCPU"
> could be interpreted as the _guest's_ page tables.
>
> Something like:
>
>   KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
>   for the current vCPU state. KVM maps memory as if the vCPU generated a
>   stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
>   CoW. However, KVM does not mark any newly created stage-2 PTE as Accessed.
>
> > +When the ioctl returns, the input values are updated to point to the
> > +remaining range. If `size` > 0 on return, the caller can just issue
> > +the ioctl again with the same `struct kvm_map_memory` argument.
>
> This is likely misleading. Unless KVM explicitly zeros size on *every* failure,
> a pedantic reading of this would suggest that userspace can retry and it should
> eventually succeed.

Gotcha... KVM explicitly zeros size on every success, but never zeros
size on a failure.

> > +In some cases, multiple vCPUs might share the page tables. In this
> > +case, if this ioctl is called in parallel for multiple vCPUs the
> > +ioctl might return with `size` > 0.
>
> Why? If there's already a valid mapping, mission accomplished. I don't see any
> reason to return an error. If x86's page fault path returns RET_PF_RETRY, then I
> think it makes sense to retry in KVM, not punt this to userspace.

Considering that vcpu_mutex critical sections are killable I think I
tend to agree.

> > +The ioctl may not be supported for all VMs, and may just return
> > +an `EOPNOTSUPP` error if a VM does not support it. You may use
> > +`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
> > +supported.
>
> Why per-VM? I don't think there's any per-VM state that would change the behavior.

Perhaps it may depend on the VM type? I'm trying to avoid having to
invent a different API later. But yeah, I can drop this sentence and
the related code.

> The TDP MMU being enabled is KVM wide, and the guest state modifiers that cause
> problems are per-vCPU, not per-VM.
>
> Adding support for KVM_CHECK_EXTENSION on vCPU FDs is probably overkill, e.g. I
> don't think it would add much value beyond returning EOPNOTSUPP for the ioctl()
> itself.

Yes, I agree.

Paolo
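Had the per-VM check survived, the userspace probe would be the standard
KVM_CHECK_EXTENSION pattern; a sketch follows. KVM_CHECK_EXTENSION is an
existing ioctl, but KVM_CAP_MAP_MEMORY exists only in this series, and the
per-VM variant may be dropped per the review above.

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int vm_supports_map_memory(int vm_fd)
{
	/*
	 * A return value > 0 means the capability is present; querying
	 * the VM fd (rather than /dev/kvm) lets the answer depend on
	 * per-VM state such as the VM type, which is the point being
	 * debated here.
	 */
	return ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAP_MEMORY) > 0;
}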
On Wed, Apr 17, 2024 at 10:37:00PM +0200, Paolo Bonzini wrote:
> On Wed, Apr 17, 2024 at 10:28 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > > +4.143 KVM_MAP_MEMORY
> > > +------------------------
> > > +
> > > +:Capability: KVM_CAP_MAP_MEMORY
> > > +:Architectures: none
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_map_memory (in/out)
> > > +:Returns: 0 on success, < 0 on error

The definition of *success* here doesn't align with the comments below.
Maybe replace "success" with a clearer definition, e.g.:

  0 when all or part of the pages are processed.
  < 0 on error, when no page is processed.

> > > +
> > > +Errors:
> > > +
> > > +  ========== ===============================================================
> > > +  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
> > > +             page aligned or outside the defined memory slots).
> >
> > "outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
> > that can _never_ succeed.
> >
> > > +  EAGAIN     The ioctl should be invoked again and no page was processed.
> > > +  EINTR      An unmasked signal is pending and no page was processed.
> >
> > I'm guessing we'll want to handle large ranges, at which point we'll likely end
> > up with EAGAIN and/or EINTR after processing at least one page.
>
> Yes, in that case you get a success (return value of 0), just like read().

[...]

> > > +When the ioctl returns, the input values are updated to point to the
> > > +remaining range. If `size` > 0 on return, the caller can just issue
> > > +the ioctl again with the same `struct kvm_map_memory` argument.
> >
> > This is likely misleading. Unless KVM explicitly zeros size on *every* failure,
> > a pedantic reading of this would suggest that userspace can retry and it should
> > eventually succeed.
>
> Gotcha... KVM explicitly zeros size on every success, but never zeros
> size on a failure.

Thanks,
Yilun
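Condensing the return convention settled on in this sub-thread into a
hypothetical caller-side helper (names again assume the unmerged series):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Returns 1 when the whole range has been processed, 0 when the caller
 * should simply call again (partial progress, like a short read(), or
 * a benign EINTR/EAGAIN), and a negative errno on a hard failure.
 */
static int map_memory_step(int vcpu_fd, struct kvm_map_memory *map)
{
	if (ioctl(vcpu_fd, KVM_MAP_MEMORY, map) == 0)
		return map->size == 0;	/* size is zeroed on full success */
	if (errno == EINTR || errno == EAGAIN)
		return 0;		/* no page was processed */
	return -errno;			/* size is never zeroed on failure */
}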
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..c16906a42db1 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,60 @@ a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+4.143 KVM_MAP_MEMORY
+------------------------
+
+:Capability: KVM_CAP_MAP_MEMORY
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_map_memory (in/out)
+:Returns: 0 on success, < 0 on error
+
+Errors:
+
+  ========== ===============================================================
+  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
+             page aligned or outside the defined memory slots).
+  EAGAIN     The ioctl should be invoked again and no page was processed.
+  EINTR      An unmasked signal is pending and no page was processed.
+  EFAULT     The parameter address was invalid.
+  EOPNOTSUPP The architecture does not support this operation, or the
+             guest state does not allow it.
+  ========== ===============================================================
+
+::
+
+  struct kvm_map_memory {
+	/* in/out */
+	__u64 base_address;
+	__u64 size;
+	/* in */
+	__u64 flags;
+	__u64 padding[5];
+  };
+
+KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
+When the ioctl returns, the input values are updated to point to the
+remaining range. If `size` > 0 on return, the caller can just issue
+the ioctl again with the same `struct kvm_map_memory` argument.
+
+In some cases, multiple vCPUs might share the page tables. In this
+case, if this ioctl is called in parallel for multiple vCPUs the
+ioctl might return with `size` > 0.
+
+The ioctl may not be supported for all VMs, and may just return
+an `EOPNOTSUPP` error if a VM does not support it. You may use
+`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
+supported.
+
+Also, shadow page tables cannot support this ioctl because they
+are indexed by virtual address or nested guest physical address.
+Calling this ioctl when the guest is using shadow page tables (for
+example because it is running a nested guest) will also fail.
+
+`flags` must currently be zero.
+
+
 5. The kvm_run structure
 ========================