
[v5,05/13] KVM: Extend the memslot to support fd-based private memory

Message ID 20220310140911.50924-6-chao.p.peng@linux.intel.com (mailing list archive)
State New
Series KVM: mm: fd-based approach for supporting KVM guest private memory

Commit Message

Chao Peng March 10, 2022, 2:09 p.m. UTC
Extend the memslot definition to provide fd-based private memory support
by adding two new fields (private_fd/private_offset). A single memslot can
then maintain memory for both shared and private pages: shared pages are
provided through the existing userspace_addr (hva) field and private pages
are provided through the new private_fd/private_offset fields.

Since there is no 'hva' concept for private memory, we cannot rely on
get_user_pages() to get a pfn; instead we use the newly added
memfile_notifier to do the same job.

This new extension is indicated by a new flag KVM_MEM_PRIVATE.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
 include/linux/kvm_host.h       |  7 +++++++
 include/uapi/linux/kvm.h       |  8 ++++++++
 3 files changed, 45 insertions(+), 7 deletions(-)

Comments

Sean Christopherson March 28, 2022, 9:27 p.m. UTC | #1
On Thu, Mar 10, 2022, Chao Peng wrote:
> Extend the memslot definition to provide fd-based private memory support
> by adding two new fields (private_fd/private_offset). The memslot then
> can maintain memory for both shared pages and private pages in a single
> memslot. Shared pages are provided by existing userspace_addr(hva) field
> and private pages are provided through the new private_fd/private_offset
> fields.
> 
> Since there is no 'hva' concept anymore for private memory so we cannot
> rely on get_user_pages() to get a pfn, instead we use the newly added
> memfile_notifier to complete the same job.
> 
> This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>

Needs a Co-developed-by: for Yu, or a From: if Yu is the sole author.

> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
>  include/linux/kvm_host.h       |  7 +++++++
>  include/uapi/linux/kvm.h       |  8 ++++++++
>  3 files changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 3acbf4d263a5..f76ac598606c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>  
>  ::
> @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>  
> +  struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;
> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 padding[5];

Uber nit, I'd prefer we pad u32 for private_fd separate from padding the size of
the structure for future expansion.

Regarding future expansion, any reason not to go crazy and pad like 128+ bytes?
It'd be rather embarrassing if the next memslot extension needs 3 u64s and we end
up with region_ext2 :-)

> +};
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_PRIVATE		(1UL << 2)
>  
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value

...

> +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)

I 100% think we should usurp the name "private" for these memslots, but as prep
work this series should first rename KVM_PRIVATE_MEM_SLOTS to avoid confusion.
Maybe KVM_INTERNAL_MEM_SLOTS?
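
For reference, the prep rename being asked for would presumably be a mechanical
change along these lines (an illustrative sketch; the existing constant and its
value are assumed from the then-current x86 KVM code, not taken from this
series), plus updating its users:

	/* arch/x86/include/asm/kvm_host.h (illustrative) */
	-#define KVM_PRIVATE_MEM_SLOTS		3
	+#define KVM_INTERNAL_MEM_SLOTS		3
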
Sean Christopherson March 28, 2022, 9:56 p.m. UTC | #2
On Thu, Mar 10, 2022, Chao Peng wrote:
> Extend the memslot definition to provide fd-based private memory support
> by adding two new fields (private_fd/private_offset). The memslot then
> can maintain memory for both shared pages and private pages in a single
> memslot. Shared pages are provided by existing userspace_addr(hva) field
> and private pages are provided through the new private_fd/private_offset
> fields.
> 
> Since there is no 'hva' concept anymore for private memory so we cannot
> rely on get_user_pages() to get a pfn, instead we use the newly added
> memfile_notifier to complete the same job.
> 
> This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
>  include/linux/kvm_host.h       |  7 +++++++
>  include/uapi/linux/kvm.h       |  8 ++++++++
>  3 files changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 3acbf4d263a5..f76ac598606c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>  
>  ::
> @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>  
> +  struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;

Peeking ahead, the partial switch to the _ext variant is rather gross.  I would
prefer that KVM use an entirely different, but binary compatible, struct internally.
And once the kernel supports C11[*], I'm pretty sure we can make the "region" in
_ext an anonymous struct, and make KVM's internal struct a #define of _ext.  That
should minimize the churn (no need to get the embedded "region" field), reduce
line lengths, and avoid confusion due to some flows taking the _ext but others
dealing with only the "base" struct.

Maybe kvm_user_memory_region or kvm_user_mem_region?  Though it's tempting to be
evil and usurp the old kvm_memory_region :-)

E.g. pre-C11 do

struct kvm_userspace_memory_region_ext {
	struct kvm_userspace_memory_region region;
	__u64 private_offset;
	__u32 private_fd;
	__u32 padding[5];
};

#ifdef __KERNEL__
struct kvm_user_mem_region {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size; /* bytes */
	__u64 userspace_addr; /* start of the userspace allocated memory */
	__u64 private_offset;
	__u32 private_fd;
	__u32 padding[5];
};
#endif

and then post-C11 do

struct kvm_userspace_memory_region_ext {
#ifdef __KERNEL__
	struct kvm_userspace_memory_region region;
#else
	struct kvm_userspace_memory_region;
#endif
	__u64 private_offset;
	__u32 private_fd;
	__u32 padding[5];
};

#ifdef __KERNEL__
#define kvm_user_mem_region kvm_userspace_memory_region_ext
#endif

[*] https://lore.kernel.org/all/20220301145233.3689119-1-arnd@kernel.org

> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 padding[5];
> +};
Chao Peng April 8, 2022, 1:21 p.m. UTC | #3
On Mon, Mar 28, 2022 at 09:27:32PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Extend the memslot definition to provide fd-based private memory support
> > by adding two new fields (private_fd/private_offset). The memslot then
> > can maintain memory for both shared pages and private pages in a single
> > memslot. Shared pages are provided by existing userspace_addr(hva) field
> > and private pages are provided through the new private_fd/private_offset
> > fields.
> > 
> > Since there is no 'hva' concept anymore for private memory so we cannot
> > rely on get_user_pages() to get a pfn, instead we use the newly added
> > memfile_notifier to complete the same job.
> > 
> > This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> 
> Needs a Co-developed-by: for Yu, or a From: if Yu is the sole author.

Yes, a Co-developed-by for Yu is needed, for all the patches throughout the series.

> 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
> >  include/linux/kvm_host.h       |  7 +++++++
> >  include/uapi/linux/kvm.h       |  8 ++++++++
> >  3 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 3acbf4d263a5..f76ac598606c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >  
> >  ::
> > @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >    };
> >  
> > +  struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 padding[5];
> 
> Uber nit, I'd prefer we pad u32 for private_fd separate from padding the size of
> the structure for future expansion.
> 
> Regarding future expansion, any reason not to go crazy and pad like 128+ bytes?
> It'd be rather embarassing if the next memslot extension needs 3 u64s and we end
> up with region_ext2 :-)

OK, so maybe:
	__u64 private_offset;
	__u32 private_fd;
	__u32 pad1;
	__u32 pad2[28];
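
Spelled out, that layout would make the extended struct look roughly like this
(a sketch of the proposal above, not code from the posted patch):

	struct kvm_userspace_memory_region_ext {
		struct kvm_userspace_memory_region region;
		__u64 private_offset;
		__u32 private_fd;
		__u32 pad1;	/* pads private_fd up to a u64 boundary */
		__u32 pad2[28];	/* ~112 bytes of room for future extensions */
	};
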
> 
> > +};
> > +
> >    /* for kvm_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> > +  #define KVM_MEM_PRIVATE		(1UL << 2)
> >  
> >  This ioctl allows the user to create, modify or delete a guest physical
> >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> 
> ...
> 
> > +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> 
> I 100% think we should usurp the name "private" for these memslots, but as prep
> work this series should first rename KVM_PRIVATE_MEM_SLOTS to avoid confusion.
> Maybe KVM_INTERNAL_MEM_SLOTS?

Oh, I didn't realize 'PRIVATE' is already taken.  KVM_INTERNAL_MEM_SLOTS
sounds good.

Thanks,
Chao
Chao Peng April 8, 2022, 1:46 p.m. UTC | #4
On Mon, Mar 28, 2022 at 09:56:33PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Extend the memslot definition to provide fd-based private memory support
> > by adding two new fields (private_fd/private_offset). The memslot then
> > can maintain memory for both shared pages and private pages in a single
> > memslot. Shared pages are provided by existing userspace_addr(hva) field
> > and private pages are provided through the new private_fd/private_offset
> > fields.
> > 
> > Since there is no 'hva' concept anymore for private memory so we cannot
> > rely on get_user_pages() to get a pfn, instead we use the newly added
> > memfile_notifier to complete the same job.
> > 
> > This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
> >  include/linux/kvm_host.h       |  7 +++++++
> >  include/uapi/linux/kvm.h       |  8 ++++++++
> >  3 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 3acbf4d263a5..f76ac598606c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >  
> >  ::
> > @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >    };
> >  
> > +  struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> 
> Peeking ahead, the partial switch to the _ext variant is rather gross.  I would
> prefer that KVM use an entirely different, but binary compatible, struct internally.
> And once the kernel supports C11[*], I'm pretty sure we can make the "region" in
> _ext an anonymous struct, and make KVM's internal struct a #define of _ext.  That
> should minimize the churn (no need to get the embedded "region" field), reduce
> line lengths, and avoid confusion due to some flows taking the _ext but others
> dealing with only the "base" struct.

Will try that.

> 
> Maybe kvm_user_memory_region or kvm_user_mem_region?  Though it's tempting to be
> evil and usurp the old kvm_memory_region :-)
> 
> E.g. pre-C11 do
> 
> struct kvm_userspace_memory_region_ext {
> 	struct kvm_userspace_memory_region region;
> 	__u64 private_offset;
> 	__u32 private_fd;
> 	__u32 padding[5];
> };
> 
> #ifdef __KERNEL__
> struct kvm_user_mem_region {
> 	__u32 slot;
> 	__u32 flags;
> 	__u64 guest_phys_addr;
> 	__u64 memory_size; /* bytes */
> 	__u64 userspace_addr; /* start of the userspace allocated memory */
> 	__u64 private_offset;
> 	__u32 private_fd;
> 	__u32 padding[5];
> };
> #endif
> 
> and then post-C11 do
> 
> struct kvm_userspace_memory_region_ext {
> #ifdef __KERNEL__

Is this #ifndef? As I think anonymous struct is only for kernel?

Thanks,
Chao

> 	struct kvm_userspace_memory_region region;
> #else
> 	struct kvm_userspace_memory_region;
> #endif
> 	__u64 private_offset;
> 	__u32 private_fd;
> 	__u32 padding[5];
> };
> 
> #ifdef __KERNEL__
> #define kvm_user_mem_region kvm_userspace_memory_region_ext
> #endif
> 
> [*] https://lore.kernel.org/all/20220301145233.3689119-1-arnd@kernel.org
> 
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 padding[5];
> > +};
Sean Christopherson April 8, 2022, 5:45 p.m. UTC | #5
On Fri, Apr 08, 2022, Chao Peng wrote:
> On Mon, Mar 28, 2022 at 09:56:33PM +0000, Sean Christopherson wrote:
> > struct kvm_userspace_memory_region_ext {
> > #ifdef __KERNEL__
> 
> Is this #ifndef? As I think anonymous struct is only for kernel?

Doh, yes, I inverted that.

> Thanks,
> Chao
> 
> > 	struct kvm_userspace_memory_region region;
> > #else
> > 	struct kvm_userspace_memory_region;
> > #endif
> > 	__u64 private_offset;
> > 	__u32 private_fd;
> > 	__u32 padding[5];
> > };
> > 
> > #ifdef __KERNEL__
> > #define kvm_user_mem_region kvm_userspace_memory_region_ext
> > #endif
> > 
> > [*] https://lore.kernel.org/all/20220301145233.3689119-1-arnd@kernel.org
> > 
> > > +	__u64 private_offset;
> > > +	__u32 private_fd;
> > > +	__u32 padding[5];
> > > +};
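
Putting that correction together with the earlier sketch, the post-C11 variant
would presumably read (a sketch, not code posted in the thread):

struct kvm_userspace_memory_region_ext {
#ifdef __KERNEL__
	/* anonymous struct member; only usable on the kernel side */
	struct kvm_userspace_memory_region;
#else
	struct kvm_userspace_memory_region region;
#endif
	__u64 private_offset;
	__u32 private_fd;
	__u32 padding[5];
};

#ifdef __KERNEL__
#define kvm_user_mem_region kvm_userspace_memory_region_ext
#endif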

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3acbf4d263a5..f76ac598606c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1307,7 +1307,7 @@  yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1320,9 +1320,17 @@  yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 padding[5];
+};
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1353,12 +1361,27 @@  It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields, plus additional fields for specific features; see the description of
+the flags field below for more information. New userspace code is recommended
+to use kvm_userspace_memory_region_ext.
+
+The flags field supports the following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
+  make a new slot read-only.  In this case, writes to this memory will be posted
+  to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed
+  by a file descriptor (fd) whose content is invisible to userspace. In this
+  case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  the guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns in multiple memslots; failure to
+  do so may result in undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9536ffa0473b..3be8116079d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -563,8 +563,15 @@  struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *private_file;
+	loff_t private_offset;
 };
 
+static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 91a6fe4e02c0..a523d834efc8 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,13 @@  struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 padding[5];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +117,7 @@  struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
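
To make the new fields concrete, below is a rough sketch of how a VMM might
register such a slot once this series is applied. It simply follows the api.rst
text above; set_private_slot() is a hypothetical helper, and the fd is assumed
to come from the series' fd-based backing store (e.g. a memfd wired up to
memfile_notifier):

/* Sketch only: assumes this series is applied so that KVM_MEM_PRIVATE and
 * struct kvm_userspace_memory_region_ext are available in <linux/kvm.h>.
 */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_private_slot(int vm_fd, int private_fd, __u64 gpa, __u64 size,
			    __u64 shared_hva)
{
	struct kvm_userspace_memory_region_ext ext;

	memset(&ext, 0, sizeof(ext));
	ext.region.slot            = 0;			/* slot id */
	ext.region.flags           = KVM_MEM_PRIVATE;
	ext.region.guest_phys_addr = gpa;
	ext.region.memory_size     = size;
	ext.region.userspace_addr  = shared_hva;	/* backs shared pages */
	ext.private_fd             = private_fd;	/* backs private pages */
	ext.private_offset         = 0;

	/* The existing ioctl takes the extended struct when KVM_MEM_PRIVATE
	 * is set, per the documentation change above.
	 */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}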