[1/7] KVM: Document KVM_MAP_MEMORY ioctl

Message ID: 20240417153450.3608097-2-pbonzini@redhat.com (mailing list archive)
State: New, archived
Series: KVM: Guest Memory Pre-Population API

Commit Message

Paolo Bonzini April 17, 2024, 3:34 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Add documentation for the KVM_MAP_MEMORY ioctl. [1]

The ioctl populates guest memory.  It does not perform any extra,
technology-specific initialization of the underlying memory [2].  For
example, CoCo-related operations won't be performed.  Concretely for
TDX, this API won't invoke TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND();
vendor-specific APIs are required for such operations.

The key design point is to make this a vCPU ioctl rather than a VM
ioctl.  First, populating guest memory requires a vCPU; with a VM
ioctl, KVM would have to pick one vCPU somehow.  Second, a vCPU ioctl
allows each vCPU to invoke the ioctl in parallel, which helps scale to
large guest memory sizes, e.g., hundreds of GB (see the usage sketch
after the diffstat below).

[1] https://lore.kernel.org/kvm/Zbrj5WKVgMsUFDtb@google.com/
[2] https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <9a060293c9ad9a78f1d8994cfe1311e818e99257.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/virt/kvm/api.rst | 54 ++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)
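
As a concrete picture of the parallel invocation described in the
commit message, here is a hypothetical userspace sketch that carves
the guest physical range into per-vCPU slices, one thread per vCPU.
KVM_MAP_MEMORY and struct kvm_map_memory are assumed to be as proposed
in this series; struct map_job, map_worker() and prepopulate() are
invented names:

  #include <errno.h>
  #include <pthread.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  struct map_job {
          int vcpu_fd;
          struct kvm_map_memory map;
  };

  static void *map_worker(void *arg)
  {
          struct map_job *job = arg;

          /* Retry until this vCPU's slice is fully populated. */
          while (job->map.size) {
                  if (ioctl(job->vcpu_fd, KVM_MAP_MEMORY, &job->map) &&
                      errno != EAGAIN && errno != EINTR)
                          break;
          }
          return NULL;
  }

  /* One thread per vCPU, each pre-populating `slice` bytes. */
  static void prepopulate(int *vcpu_fds, int nr_vcpus,
                          __u64 gpa, __u64 slice)
  {
          pthread_t threads[nr_vcpus];
          struct map_job jobs[nr_vcpus];
          int i;

          for (i = 0; i < nr_vcpus; i++) {
                  jobs[i].vcpu_fd = vcpu_fds[i];
                  jobs[i].map = (struct kvm_map_memory) {
                          .base_address = gpa + i * slice,
                          .size = slice,
                  };
                  pthread_create(&threads[i], NULL, map_worker, &jobs[i]);
          }
          for (i = 0; i < nr_vcpus; i++)
                  pthread_join(threads[i], NULL);
  }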

Comments

Sean Christopherson April 17, 2024, 8:28 p.m. UTC | #1
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> +4.143 KVM_MAP_MEMORY
> +------------------------
> +
> +:Capability: KVM_CAP_MAP_MEMORY
> +:Architectures: none
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_map_memory (in/out)
> +:Returns: 0 on success, < 0 on error
> +
> +Errors:
> +
> +  ========== ===============================================================
> +  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
> +             page aligned or outside the defined memory slots).

"outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
that can _never_ succeed.
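
E.g., a hypothetical sketch of that split in the ioctl's argument
validation (PAGE_ALIGNED(), gpa_to_gfn() and kvm_is_visible_gfn() are
existing kernel helpers; the surrounding code is invented):

        if (!PAGE_ALIGNED(map->base_address) || !PAGE_ALIGNED(map->size))
                return -EINVAL;         /* can never succeed */

        if (!kvm_is_visible_gfn(vcpu->kvm, gpa_to_gfn(map->base_address)))
                return -EFAULT;         /* no memslot backs this GPA (yet) */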

> +  EAGAIN     The ioctl should be invoked again and no page was processed.
> +  EINTR      An unmasked signal is pending and no page was processed.

I'm guessing we'll want to handle large ranges, at which point we'll likely end
up with EAGAIN and/or EINTR after processing at least one page.

> +  EFAULT     The parameter address was invalid.
> +  EOPNOTSUPP The architecture does not support this operation, or the
> +             guest state does not allow it.

I would phrase this as something like:

                Mapping memory for a GPA is unsupported by the
                architecture, and/or for the current vCPU state/mode.

It's not that the guest state doesn't "allow" it, it's that it's explicitly
unsupported because it's nonsensical without a GVA (or L2 GPA).

> +  ========== ===============================================================
> +
> +::
> +
> +  struct kvm_map_memory {
> +	/* in/out */
> +	__u64 base_address;

I think we should commit to this being limited to gpa mappings, e.g. go with
"gpa", or "guest_physical_address" if we want to be verbose (I vote for "gpa").

> +	__u64 size;
> +	/* in */
> +	__u64 flags;
> +	__u64 padding[5];
> +  };
> +
> +KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.

I think we should word this very carefully and explicitly so that KVM doesn't
commit to behavior that can't be guaranteed.  We might even want to use a name
that explicitly captures the semantics, e.g. KVM_PRE_FAULT_MEMORY?

Also, this doesn't populate guest _memory_, and "in the page tables of a vCPU"
could be interpreted as the _guest's_ page tables.

Something like:

  KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
  for the current vCPU state.  KVM maps memory as if the vCPU generated a
  stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
  CoW.  However, KVM does not mark any newly created stage-2 PTE as Accessed.

> +When the ioctl returns, the input values are updated to point to the
> +remaining range.  If `size` > 0 on return, the caller can just issue
> +the ioctl again with the same `struct kvm_map_memory` argument.

This is likely misleading.  Unless KVM explicitly zeros size on *every* failure,
a pedantic reading of this would suggest that userspace can retry and it should
eventually succeed.

> +In some cases, multiple vCPUs might share the page tables.  In this
> +case, if this ioctl is called in parallel for multiple vCPUs the
> +ioctl might return with `size` > 0.

Why?  If there's already a valid mapping, mission accomplished.  I don't see any
reason to return an error.  If x86's page fault path returns RET_PF_RETRY, then I
think it makes sense to retry in KVM, not punt this to userspace.
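
Something like the following fragment on the arch side, i.e. absorb
RET_PF_RETRY inside KVM while staying interruptible (RET_PF_RETRY and
signal_pending() exist in the kernel today; kvm_map_one_page() is an
invented stand-in for the arch fault hook):

        do {
                r = kvm_map_one_page(vcpu, gpa);
        } while (r == RET_PF_RETRY && !signal_pending(current));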

> +The ioctl may not be supported for all VMs, and may just return
> +an `EOPNOTSUPP` error if a VM does not support it.  You may use
> +`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
> +supported.

Why per-VM?  I don't think there's any per-VM state that would change the behavior.
The TDP MMU being enabled is KVM wide, and the guest state modifiers that cause
problems are per-vCPU, not per-VM.

Adding support for KVM_CHECK_EXTENSION on vCPU FDs is probably overkill, e.g. I
don't think it would add much value beyond returning EOPNOTSUPP for the ioctl()
itself.
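
For comparison, the two hypothetical userspace probes being weighed
here: the per-VM capability check as documented, versus simply
attempting the ioctl and treating EOPNOTSUPP as "unsupported" (gpa and
psize stand for any valid page-aligned range):

        /* As documented: per-VM capability check. */
        int supported = ioctl(vm_fd, KVM_CHECK_EXTENSION,
                              KVM_CAP_MAP_MEMORY) > 0;

        /* Alternative: just try it and look for EOPNOTSUPP. */
        struct kvm_map_memory map = { .base_address = gpa, .size = psize };
        if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &map) < 0 && errno == EOPNOTSUPP)
                /* unsupported for this arch or vCPU state/mode */;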

> +Also, shadow page tables cannot support this ioctl because they
> +are indexed by virtual address or nested guest physical address.
> +Calling this ioctl when the guest is using shadow page tables (for
> +example because it is running a nested guest) will also fail.

Running a nested guest using TDP.

> +
> +`flags` must currently be zero.
> +
> +
>  5. The kvm_run structure
>  ========================
>  
> -- 
> 2.43.0
> 
>
Paolo Bonzini April 17, 2024, 8:37 p.m. UTC | #2
On Wed, Apr 17, 2024 at 10:28 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > +4.143 KVM_MAP_MEMORY
> > +------------------------
> > +
> > +:Capability: KVM_CAP_MAP_MEMORY
> > +:Architectures: none
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_map_memory (in/out)
> > +:Returns: 0 on success, < 0 on error
> > +
> > +Errors:
> > +
> > +  ========== ===============================================================
> > +  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
> > +             page aligned or outside the defined memory slots).
>
> "outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
> that can _never_ succeed.
>
> > +  EAGAIN     The ioctl should be invoked again and no page was processed.
> > +  EINTR      An unmasked signal is pending and no page was processed.
>
> I'm guessing we'll want to handle large ranges, at which point we'll likely end
> up with EAGAIN and/or EINTR after processing at least one page.

Yes, in that case you get a success (return value of 0), just like read().
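
I.e. the usual read()-style loop on the userspace side; a hypothetical
sketch (gpa and len are the caller's range; on a return of 0 the
kernel has updated `size`, so full and partial success fall out of the
same check):

  #include <err.h>
  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static void map_range(int vcpu_fd, __u64 gpa, __u64 len)
  {
          struct kvm_map_memory map = { .base_address = gpa, .size = len };

          while (map.size) {
                  if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &map) == 0)
                          continue;       /* progress made, size updated */
                  if (errno != EAGAIN && errno != EINTR)
                          err(1, "KVM_MAP_MEMORY");
          }
  }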

> > +  EFAULT     The parameter address was invalid.
> > +  EOPNOTSUPP The architecture does not support this operation, or the
> > +             guest state does not allow it.
>
> I would phrase this as something like:
>
>                 Mapping memory for a GPA is unsupported by the
>                 architecture, and/or for the current vCPU state/mode.

Better.

> > +  struct kvm_map_memory {
> > +     /* in/out */
> > +     __u64 base_address;
>
> I think we should commit to this being limited to gpa mappings, e.g. go with
> "gpa", or "guest_physical_address" if we want to be verbose (I vote for "gpa").
>
> > +     __u64 size;
> > +     /* in */
> > +     __u64 flags;
> > +     __u64 padding[5];
> > +  };
> > +
> > +KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
>
> I think we should word this very carefully and explicitly so that KVM doesn't
> commit to behavior that can't be guaranteed.  We might even want to use a name
> that explicitly captures the semantics, e.g. KVM_PRE_FAULT_MEMORY?
>
> Also, this doesn't populate guest _memory_, and "in the page tables of a vCPU"
> could be interpreted as the _guest's_ page tables.
>
> Something like:
>
>   KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
>   for the current vCPU state.  KVM maps memory as if the vCPU generated a
>   stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
>   CoW.  However, KVM does not mark any newly created stage-2 PTE as Accessed.
>
> > +When the ioctl returns, the input values are updated to point to the
> > +remaining range.  If `size` > 0 on return, the caller can just issue
> > +the ioctl again with the same `struct kvm_map_memory` argument.
>
> This is likely misleading.  Unless KVM explicitly zeros size on *every* failure,
> a pedantic reading of this would suggest that userspace can retry and it should
> eventually succeed.

Gotcha... KVM explicitly zeros size on every success, but never zeros
size on a failure.

> > +In some cases, multiple vCPUs might share the page tables.  In this
> > +case, if this ioctl is called in parallel for multiple vCPUs the
> > +ioctl might return with `size` > 0.
>
> Why?  If there's already a valid mapping, mission accomplished.  I don't see any
> reason to return an error.  If x86's page fault path returns RET_PF_RETRY, then I
> think it makes sense to retry in KVM, not punt this to userspace.

Considering that vcpu_mutex critical sections are killable, I think I
tend to agree.

> > +The ioctl may not be supported for all VMs, and may just return
> > +an `EOPNOTSUPP` error if a VM does not support it.  You may use
> > +`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
> > +supported.
>
> Why per-VM?  I don't think there's any per-VM state that would change the behavior.

Perhaps it may depend on the VM type? I'm trying to avoid having to
invent a different API later. But yeah, I can drop this sentence and
the related code.

> The TDP MMU being enabled is KVM wide, and the guest state modifiers that cause
> problems are per-vCPU, not per-VM.
>
> Adding support for KVM_CHECK_EXTENSION on vCPU FDs is probably overkill, e.g. I
> don't think it would add much value beyond returning EOPNOTSUPP for the ioctl()
> itself.

Yes, I agree.

Paolo
Xu Yilun April 19, 2024, 1:57 p.m. UTC | #3
On Wed, Apr 17, 2024 at 10:37:00PM +0200, Paolo Bonzini wrote:
> On Wed, Apr 17, 2024 at 10:28 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > > +4.143 KVM_MAP_MEMORY
> > > +------------------------
> > > +
> > > +:Capability: KVM_CAP_MAP_MEMORY
> > > +:Architectures: none
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_map_memory (in/out)
> > > +:Returns: 0 on success, < 0 on error

The definition of *success* here doesn't align with the comments below.
Maybe replace "success" with a clearer definition, e.g. 0 when all or
part of the pages are processed, and < 0 on error when no page is
processed.

> > > +
> > > +Errors:
> > > +
> > > +  ========== ===============================================================
> > > +  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
> > > +             page aligned or outside the defined memory slots).
> >
> > "outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
> > that can _never_ succeed.
> >
> > > +  EAGAIN     The ioctl should be invoked again and no page was processed.
> > > +  EINTR      An unmasked signal is pending and no page was processed.
> >
> > I'm guessing we'll want to handle large ranges, at which point we'll likely end
> > up with EAGAIN and/or EINTR after processing at least one page.
> 
> Yes, in that case you get a success (return value of 0), just like read().

[...]

> >
> > > +When the ioctl returns, the input values are updated to point to the
> > > +remaining range.  If `size` > 0 on return, the caller can just issue
> > > +the ioctl again with the same `struct kvm_map_memory` argument.
> >
> > This is likely misleading.  Unless KVM explicitly zeros size on *every* failure,
> > a pedantic reading of this would suggest that userspace can retry and it should
> > eventually succeed.
> 
> Gotcha... KVM explicitly zeros size on every success, but never zeros
> size on a failure.

Thanks,
Yilun

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..c16906a42db1 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,60 @@  a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+4.143 KVM_MAP_MEMORY
+------------------------
+
+:Capability: KVM_CAP_MAP_MEMORY
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_map_memory (in/out)
+:Returns: 0 on success, < 0 on error
+
+Errors:
+
+  ========== ===============================================================
+  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
+             page aligned or outside the defined memory slots).
+  EAGAIN     The ioctl should be invoked again and no page was processed.
+  EINTR      An unmasked signal is pending and no page was processed.
+  EFAULT     The parameter address was invalid.
+  EOPNOTSUPP The architecture does not support this operation, or the
+             guest state does not allow it.
+  ========== ===============================================================
+
+::
+
+  struct kvm_map_memory {
+	/* in/out */
+	__u64 base_address;
+	__u64 size;
+	/* in */
+	__u64 flags;
+	__u64 padding[5];
+  };
+
+KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
+When the ioctl returns, the input values are updated to point to the
+remaining range.  If `size` > 0 on return, the caller can just issue
+the ioctl again with the same `struct kvm_map_memory` argument.
+
+In some cases, multiple vCPUs might share the page tables.  In this
+case, if this ioctl is called in parallel for multiple vCPUs the
+ioctl might return with `size` > 0.
+
+The ioctl may not be supported for all VMs, and may just return
+an `EOPNOTSUPP` error if a VM does not support it.  You may use
+`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
+supported.
+
+Also, shadow page tables cannot support this ioctl because they
+are indexed by virtual address or nested guest physical address.
+Calling this ioctl when the guest is using shadow page tables (for
+example because it is running a nested guest) will also fail.
+
+`flags` must currently be zero.
+
+
 5. The kvm_run structure
 ========================