Message ID: 20220915142913.2213336-3-chao.p.peng@linux.intel.com (mailing list archive)
State: New
Series: KVM: mm: fd-based approach for supporting KVM
On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index abd7c32126ce..c1fac1e9f820 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry. > :Capability: KVM_CAP_USER_MEMORY > :Architectures: all > :Type: vm ioctl > -:Parameters: struct kvm_userspace_memory_region (in) > +:Parameters: struct kvm_userspace_memory_region(_ext) (in) > :Returns: 0 on success, -1 on error > > :: > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry. > __u64 userspace_addr; /* start of the userspace allocated memory */ > }; > > + struct kvm_userspace_memory_region_ext { > + struct kvm_userspace_memory_region region; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > + }; > + > /* for kvm_memory_region::flags */ > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > #define KVM_MEM_READONLY (1UL << 1) > + #define KVM_MEM_PRIVATE (1UL << 2) > > This ioctl allows the user to create, modify or delete a guest physical > memory slot. Bits 0-15 of "slot" specify the slot id and this value > @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr > be identical. This allows large pages in the guest to be backed by large > pages in the host. > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, > -to make a new slot read-only. In this case, writes to this memory will be > -posted to userspace as KVM_EXIT_MMIO exits. > +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region > +fields. It also includes additional fields for some specific features. See > +below description of flags field for more information. It's recommended to use > +kvm_userspace_memory_region_ext in new userspace code. Better say "kvm_userspace_memory_region_ext includes all fields of kvm_userspace_memory_region struct, while also adds additional fields ..." > + > +The flags field supports below flags: s/below/following/ > + > +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to > + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. > + Better say "... For more details, see KVM_GET_DIRTY_LOG." > +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to > + make a new slot read-only. In this case, writes to this memory will be posted > + to userspace as KVM_EXIT_MMIO exits. > + Better say "if KVM_CAP_READONLY_MEM allows, KVM_MEM_READONLY makes a new slot read-only ..." > +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by > + a file descirptor(fd) and the content of the private memory is invisible to > + userspace. In this case, userspace should use private_fd/private_offset in > + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to > + guest. Userspace should guarantee not to map the same pfn indicated by > + private_fd/private_offset to different gfns with multiple memslots. Failed to > + do this may result undefined behavior. > For the lists above, s/can be set/ Thanks.
On Fri, Sep 16, 2022 at 04:14:29PM +0700, Bagas Sanjaya wrote: > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > > index abd7c32126ce..c1fac1e9f820 100644 > > --- a/Documentation/virt/kvm/api.rst > > +++ b/Documentation/virt/kvm/api.rst > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry. > > :Capability: KVM_CAP_USER_MEMORY > > :Architectures: all > > :Type: vm ioctl > > -:Parameters: struct kvm_userspace_memory_region (in) > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in) > > :Returns: 0 on success, -1 on error > > > > :: > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry. > > __u64 userspace_addr; /* start of the userspace allocated memory */ > > }; > > > > + struct kvm_userspace_memory_region_ext { > > + struct kvm_userspace_memory_region region; > > + __u64 private_offset; > > + __u32 private_fd; > > + __u32 pad1; > > + __u64 pad2[14]; > > + }; > > + > > /* for kvm_memory_region::flags */ > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > > #define KVM_MEM_READONLY (1UL << 1) > > + #define KVM_MEM_PRIVATE (1UL << 2) > > > > This ioctl allows the user to create, modify or delete a guest physical > > memory slot. Bits 0-15 of "slot" specify the slot id and this value > > @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr > > be identical. This allows large pages in the guest to be backed by large > > pages in the host. > > > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and > > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of > > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to > > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, > > -to make a new slot read-only. In this case, writes to this memory will be > > -posted to userspace as KVM_EXIT_MMIO exits. > > +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region > > +fields. It also includes additional fields for some specific features. See > > +below description of flags field for more information. It's recommended to use > > +kvm_userspace_memory_region_ext in new userspace code. > > Better say "kvm_userspace_memory_region_ext includes all fields of > kvm_userspace_memory_region struct, while also adds additional fields ..." > > > + > > +The flags field supports below flags: > > s/below/following/ > > > + > > +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to > > + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. > > + > > Better say "... For more details, see KVM_GET_DIRTY_LOG." > > > +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to > > + make a new slot read-only. In this case, writes to this memory will be posted > > + to userspace as KVM_EXIT_MMIO exits. > > + > > Better say "if KVM_CAP_READONLY_MEM allows, KVM_MEM_READONLY makes a new > slot read-only ..." > > > +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by > > + a file descirptor(fd) and the content of the private memory is invisible to > > + userspace. In this case, userspace should use private_fd/private_offset in > > + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to > > + guest. Userspace should guarantee not to map the same pfn indicated by > > + private_fd/private_offset to different gfns with multiple memslots. 
Failed to > > + do this may result undefined behavior. > > > > For the lists above, > s/can be set/ It all looks good, thanks! > > Thanks. > > -- > An old man doll... just what I always wanted! - Clara
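To make the flag semantics being reviewed above concrete, here is a minimal userspace sketch of registering a slot with private memory through the extended struct. It assumes a kernel with this series applied; vm_fd, shared_va and private_fd are assumed to be set up elsewhere, and the addresses and sizes are arbitrary examples:

    /*
     * Illustrative sketch only: kvm_userspace_memory_region_ext and
     * KVM_MEM_PRIVATE exist only with this series applied.
     */
    #include <sys/ioctl.h>
    #include <linux/kvm.h>
    #include <err.h>

    static void set_private_slot(int vm_fd, void *shared_va, int private_fd)
    {
            struct kvm_userspace_memory_region_ext region_ext = {
                    .region = {
                            .slot            = 0,
                            .flags           = KVM_MEM_PRIVATE,
                            .guest_phys_addr = 0x100000000ULL,   /* example GPA */
                            .memory_size     = 2ULL << 20,       /* 2 MiB */
                            /* userspace_addr still backs the shared part */
                            .userspace_addr  = (__u64)(unsigned long)shared_va,
                    },
                    .private_fd     = private_fd, /* fd providing private memory */
                    .private_offset = 0,          /* GPA range starts at fd offset 0 */
            };

            if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext) < 0)
                    err(1, "KVM_SET_USER_MEMORY_REGION");
    }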
Hi Chao, On Thu, Sep 15, 2022 at 3:35 PM Chao Peng <chao.p.peng@linux.intel.com> wrote: > > In memory encryption usage, guest memory may be encrypted with special > key and can be accessed only by the VM itself. We call such memory > private memory. It's valueless and sometimes can cause problem to allow > userspace to access guest private memory. This patch extends the KVM > memslot definition so that guest private memory can be provided though > an inaccessible_notifier enlightened file descriptor (fd), without being > mmaped into userspace. > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > additional KVM memslot fields private_fd/private_offset to allow > userspace to specify that guest private memory provided from the > private_fd and guest_phys_addr mapped at the private_offset of the > private_fd, spanning a range of memory_size. > > The extended memslot can still have the userspace_addr(hva). When use, a > single memslot can maintain both private memory through private > fd(private_fd/private_offset) and shared memory through > hva(userspace_addr). Whether the private or shared part is visible to > guest is maintained by other KVM code. > > Since there is no userspace mapping for private fd so we cannot > get_user_pages() to get the pfn in KVM, instead we add a new > inaccessible_notifier in the internal memslot structure and rely on it > to get pfn by interacting with the memory file systems. > > Together with the change, a new config HAVE_KVM_PRIVATE_MEM is added and > right now it is selected on X86_64 for Intel TDX usage. > > To make code maintenance easy, internally we use a binary compatible > alias struct kvm_user_mem_region to handle both the normal and the > '_ext' variants. > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > --- > Documentation/virt/kvm/api.rst | 38 +++++++++++++++++++++----- > arch/x86/kvm/Kconfig | 1 + > arch/x86/kvm/x86.c | 2 +- > include/linux/kvm_host.h | 13 +++++++-- > include/uapi/linux/kvm.h | 28 +++++++++++++++++++ > virt/kvm/Kconfig | 3 +++ > virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------ > 7 files changed, 116 insertions(+), 18 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index abd7c32126ce..c1fac1e9f820 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry. > :Capability: KVM_CAP_USER_MEMORY > :Architectures: all > :Type: vm ioctl > -:Parameters: struct kvm_userspace_memory_region (in) > +:Parameters: struct kvm_userspace_memory_region(_ext) (in) > :Returns: 0 on success, -1 on error > > :: > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry. > __u64 userspace_addr; /* start of the userspace allocated memory */ > }; > > + struct kvm_userspace_memory_region_ext { > + struct kvm_userspace_memory_region region; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > + }; > + > /* for kvm_memory_region::flags */ > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > #define KVM_MEM_READONLY (1UL << 1) > + #define KVM_MEM_PRIVATE (1UL << 2) > > This ioctl allows the user to create, modify or delete a guest physical > memory slot. Bits 0-15 of "slot" specify the slot id and this value > @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr > be identical. 
This allows large pages in the guest to be backed by large > pages in the host. > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, > -to make a new slot read-only. In this case, writes to this memory will be > -posted to userspace as KVM_EXIT_MMIO exits. > +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region > +fields. It also includes additional fields for some specific features. See > +below description of flags field for more information. It's recommended to use > +kvm_userspace_memory_region_ext in new userspace code. > + > +The flags field supports below flags: > + > +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to > + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. > + > +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to > + make a new slot read-only. In this case, writes to this memory will be posted > + to userspace as KVM_EXIT_MMIO exits. > + > +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by > + a file descirptor(fd) and the content of the private memory is invisible to s/descirptor/descriptor > + userspace. In this case, userspace should use private_fd/private_offset in > + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to > + guest. Userspace should guarantee not to map the same pfn indicated by > + private_fd/private_offset to different gfns with multiple memslots. Failed to > + do this may result undefined behavior. > > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of > the memory region are automatically reflected into the guest. For example, an > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index e3cbd7706136..31db64ec0b33 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -48,6 +48,7 @@ config KVM > select SRCU > select INTERVAL_TREE > select HAVE_KVM_PM_NOTIFIER if PM > + select HAVE_KVM_PRIVATE_MEM if X86_64 > help > Support hosting fully virtualized guest machines using hardware > virtualization extensions. 
You will need a fairly recent > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index d7374d768296..081f62ccc9a1 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -12183,7 +12183,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, > } > > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > - struct kvm_userspace_memory_region m; > + struct kvm_user_mem_region m; > > m.slot = id | (i << 16); > m.flags = 0; > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index f4519d3689e1..eac1787b899b 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -44,6 +44,7 @@ > > #include <asm/kvm_host.h> > #include <linux/kvm_dirty_ring.h> > +#include <linux/memfd.h> > > #ifndef KVM_MAX_VCPU_IDS > #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS > @@ -576,8 +577,16 @@ struct kvm_memory_slot { > u32 flags; > short id; > u16 as_id; > + struct file *private_file; > + loff_t private_offset; > + struct inaccessible_notifier notifier; > }; > > +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) > +{ > + return slot && (slot->flags & KVM_MEM_PRIVATE); > +} > + > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) > { > return slot->flags & KVM_MEM_LOG_DIRTY_PAGES; > @@ -1104,9 +1113,9 @@ enum kvm_mr_change { > }; > > int kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem); > + const struct kvm_user_mem_region *mem); > int __kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem); > + const struct kvm_user_mem_region *mem); > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); > int kvm_arch_prepare_memory_region(struct kvm *kvm, > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index eed0315a77a6..3ef462fb3b2a 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region { > __u64 userspace_addr; /* start of the userspace allocated memory */ > }; > > +struct kvm_userspace_memory_region_ext { > + struct kvm_userspace_memory_region region; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > +}; > + > +#ifdef __KERNEL__ > +/* > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access > + * all fields from the top-level "extended" region. 
> + */ > +struct kvm_user_mem_region { > + __u32 slot; > + __u32 flags; > + __u64 guest_phys_addr; > + __u64 memory_size; > + __u64 userspace_addr; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > +}; > +#endif > + > /* > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, > * other bits are reserved for kvm internal use which are defined in > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region { > */ > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > #define KVM_MEM_READONLY (1UL << 1) > +#define KVM_MEM_PRIVATE (1UL << 2) > > /* for KVM_IRQ_LINE */ > struct kvm_irq_level { > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig > index a8c5c9f06b3c..ccaff13cc5b8 100644 > --- a/virt/kvm/Kconfig > +++ b/virt/kvm/Kconfig > @@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK > > config HAVE_KVM_PM_NOTIFIER > bool > + > +config HAVE_KVM_PRIVATE_MEM > + bool > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 584a5bab3af3..12dc0dc57b06 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm, > } > } > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem) > { > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; > > @@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, > * Must be called holding kvm->slots_lock for write. > */ > int __kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem) > + const struct kvm_user_mem_region *mem) > { > struct kvm_memory_slot *old, *new; > struct kvm_memslots *slots; > @@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm, > EXPORT_SYMBOL_GPL(__kvm_set_memory_region); > > int kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem) > + const struct kvm_user_mem_region *mem) > { > int r; > > @@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm, > EXPORT_SYMBOL_GPL(kvm_set_memory_region); > > static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, > - struct kvm_userspace_memory_region *mem) > + struct kvm_user_mem_region *mem) > { > if ((u16)mem->slot >= KVM_USER_MEM_SLOTS) > return -EINVAL; > @@ -4622,6 +4622,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm) > return fd; > } > > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \ > +do { \ > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ > + offsetof(struct kvm_userspace_memory_region, field)); \ > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ > + sizeof_field(struct kvm_userspace_memory_region, field)); \ > +} while (0) > + > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \ > +do { \ > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ > + offsetof(struct kvm_userspace_memory_region_ext, field)); \ > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \ > +} while (0) > + > +static void kvm_sanity_check_user_mem_region_alias(void) > +{ > + SANITY_CHECK_MEM_REGION_FIELD(slot); > + SANITY_CHECK_MEM_REGION_FIELD(flags); > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr); > + SANITY_CHECK_MEM_REGION_FIELD(memory_size); > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr); > + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset); > + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd); > +} > + > static long 
kvm_vm_ioctl(struct file *filp, > unsigned int ioctl, unsigned long arg) > { > @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp, > break; > } > case KVM_SET_USER_MEMORY_REGION: { > - struct kvm_userspace_memory_region kvm_userspace_mem; > + struct kvm_user_mem_region mem; > + unsigned long size = sizeof(struct kvm_userspace_memory_region); nit: should this be sizeof(struct mem)? That's more similar to the existing code and makes it dependent on the size of mem regardless of possible changes to its type in the future. > + > + kvm_sanity_check_user_mem_region_alias(); > > r = -EFAULT; > - if (copy_from_user(&kvm_userspace_mem, argp, > - sizeof(kvm_userspace_mem))) > + if (copy_from_user(&mem, argp, size); It gets fixed in a future patch, but the ; should be a ). Cheers, /fuad > + goto out; > + > + r = -EINVAL; > + if (mem.flags & KVM_MEM_PRIVATE) > goto out; > > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > break; > } > case KVM_GET_DIRTY_LOG: { > -- > 2.25.1 >
On Mon, Sep 26, 2022 at 11:26:45AM +0100, Fuad Tabba wrote: ... > > + > > +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by > > + a file descirptor(fd) and the content of the private memory is invisible to > > s/descirptor/descriptor Thanks. ... > static long kvm_vm_ioctl(struct file *filp, > > unsigned int ioctl, unsigned long arg) > > { > > @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp, > > break; > > } > > case KVM_SET_USER_MEMORY_REGION: { > > - struct kvm_userspace_memory_region kvm_userspace_mem; > > + struct kvm_user_mem_region mem; > > + unsigned long size = sizeof(struct kvm_userspace_memory_region); > > nit: should this be sizeof(struct mem)? That's more similar to the > existing code and makes it dependent on the size of mem regardless of > possible changes to its type in the future. Unluckily no, the size we need copy_from_user() depends on the flags, e.g. without KVM_MEM_PRIVATE, we can't safely copy that big size since the 'extended' part may not even exist. > > > + > > + kvm_sanity_check_user_mem_region_alias(); > > > > r = -EFAULT; > > - if (copy_from_user(&kvm_userspace_mem, argp, > > - sizeof(kvm_userspace_mem))) > > + if (copy_from_user(&mem, argp, size); > > It gets fixed in a future patch, but the ; should be a ). Good catch, thanks! Chao > > Cheers, > /fuad
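One way the flag-dependent copy Chao describes could be structured is sketched below; this is illustrative and the helper name is made up, not code from a later revision. The base struct is copied unconditionally, and the '_ext' tail is copied only when KVM_MEM_PRIVATE indicates it is present:

    /*
     * Sketch: copy the base kvm_userspace_memory_region unconditionally,
     * then copy the extended tail only when the flags say it exists.
     */
    static int kvm_copy_user_mem_region(struct kvm_user_mem_region *mem,
                                        const void __user *argp)
    {
            unsigned long size = sizeof(struct kvm_userspace_memory_region);

            if (copy_from_user(mem, argp, size))
                    return -EFAULT;

            if ((mem->flags & KVM_MEM_PRIVATE) &&
                copy_from_user((void *)mem + size, argp + size,
                               sizeof(*mem) - size))
                    return -EFAULT;

            return 0;
    }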
On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng <chao.p.peng@linux.intel.com> wrote: ... > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 584a5bab3af3..12dc0dc57b06 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c ... > @@ -4622,6 +4622,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm) > return fd; > } > > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \ > +do { \ > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ > + offsetof(struct kvm_userspace_memory_region, field)); \ > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ > + sizeof_field(struct kvm_userspace_memory_region, field)); \ > +} while (0) > + > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \ > +do { \ > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ > + offsetof(struct kvm_userspace_memory_region_ext, field)); \ > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \ > +} while (0) > + > +static void kvm_sanity_check_user_mem_region_alias(void) > +{ > + SANITY_CHECK_MEM_REGION_FIELD(slot); > + SANITY_CHECK_MEM_REGION_FIELD(flags); > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr); > + SANITY_CHECK_MEM_REGION_FIELD(memory_size); > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr); > + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset); > + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd); > +} > + > static long kvm_vm_ioctl(struct file *filp, > unsigned int ioctl, unsigned long arg) > { > @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp, > break; > } > case KVM_SET_USER_MEMORY_REGION: { > - struct kvm_userspace_memory_region kvm_userspace_mem; > + struct kvm_user_mem_region mem; > + unsigned long size = sizeof(struct kvm_userspace_memory_region); > + > + kvm_sanity_check_user_mem_region_alias(); > > r = -EFAULT; > - if (copy_from_user(&kvm_userspace_mem, argp, > - sizeof(kvm_userspace_mem))) > + if (copy_from_user(&mem, argp, size); > + goto out; > + > + r = -EINVAL; > + if (mem.flags & KVM_MEM_PRIVATE) > goto out; Nit: It's better to check if padding is zero. Maybe rename it to reserved. + if (mem.pad1 || memchr_inv(mem.pad2, 0, sizeof(mem.pad2))) + goto out;
On Thu, Sep 29, 2022, Isaku Yamahata wrote: > On Thu, Sep 15, 2022 at 10:29:07PM +0800, > Chao Peng <chao.p.peng@linux.intel.com> wrote: > > @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp, > > break; > > } > > case KVM_SET_USER_MEMORY_REGION: { > > - struct kvm_userspace_memory_region kvm_userspace_mem; > > + struct kvm_user_mem_region mem; > > + unsigned long size = sizeof(struct kvm_userspace_memory_region); > > + > > + kvm_sanity_check_user_mem_region_alias(); > > > > r = -EFAULT; > > - if (copy_from_user(&kvm_userspace_mem, argp, > > - sizeof(kvm_userspace_mem))) > > + if (copy_from_user(&mem, argp, size); > > + goto out; > > + > > + r = -EINVAL; > > + if (mem.flags & KVM_MEM_PRIVATE) > > goto out; > > Nit: It's better to check if padding is zero. Maybe rename it to reserved. > > + if (mem.pad1 || memchr_inv(mem.pad2, 0, sizeof(mem.pad2))) > + goto out; No need, KVM has more or less settled on using flags instead "reserving" bytes. E.g. if/when another fancy feature comes along, we'll add another KVM_MEM_XYZ and only consume the relevant fields when the flag is set. Reserving bytes doesn't work very well because it assumes that '0' is an invalid value, e.g. if the future expansion is for a non-private file descriptor, then we'd need a new flag even if KVM reserved bytes since fd=0 is valid. The only reason to bother with pad2[14] at this time is to avoid having to define yet another struct if/when the struct needs to expand again. The struct definition will still need to be changed, but at least we won't end up with struct kvm_userspace_memory_region_really_extended.
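A short sketch of the flag-gated model Sean describes, showing why gating on flags works where reserving bytes does not: fd 0 is a valid file descriptor, so the flag rather than the field's value decides whether the field is consumed. KVM_MEM_XYZ, the xyz_fd field and both helper names are hypothetical placeholders, not anything defined by this series:

    /* Flag-gated consumption of extension fields; everything other than
     * KVM_MEM_PRIVATE and its two fields below is hypothetical. */
    if (mem->flags & KVM_MEM_PRIVATE) {
            /* private_fd == 0 is a legal fd, so only the flag gates it */
            r = kvm_bind_private(kvm, slot, mem->private_fd,
                                 mem->private_offset);
            if (r)
                    return r;
    }

    if (mem->flags & KVM_MEM_XYZ)                     /* hypothetical future flag */
            r = kvm_bind_xyz(kvm, slot, mem->xyz_fd); /* hypothetical field/helper */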
On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > In memory encryption usage, guest memory may be encrypted with special > key and can be accessed only by the VM itself. We call such memory > private memory. It's valueless and sometimes can cause problem to allow > userspace to access guest private memory. This patch extends the KVM > memslot definition so that guest private memory can be provided though > an inaccessible_notifier enlightened file descriptor (fd), without being > mmaped into userspace. > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > additional KVM memslot fields private_fd/private_offset to allow > userspace to specify that guest private memory provided from the > private_fd and guest_phys_addr mapped at the private_offset of the > private_fd, spanning a range of memory_size. > > The extended memslot can still have the userspace_addr(hva). When use, a > single memslot can maintain both private memory through private > fd(private_fd/private_offset) and shared memory through > hva(userspace_addr). Whether the private or shared part is visible to > guest is maintained by other KVM code. > > Since there is no userspace mapping for private fd so we cannot > get_user_pages() to get the pfn in KVM, instead we add a new > inaccessible_notifier in the internal memslot structure and rely on it > to get pfn by interacting with the memory file systems. > > Together with the change, a new config HAVE_KVM_PRIVATE_MEM is added and > right now it is selected on X86_64 for Intel TDX usage. > > To make code maintenance easy, internally we use a binary compatible > alias struct kvm_user_mem_region to handle both the normal and the > '_ext' variants. > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> What if userspace_addr would contain address of an extension structure, if the flag is set, instead of shared address? I.e. interpret that field differently (could be turned into union too ofc). That idea could be at least re-used, if there's ever any new KVM_MEM_* flags that would need an extension. E.g. have struct kvm_userspace_memory_private, which contains shared address, fd and the offset. BR, Jarkko
On Wed, Oct 05, 2022 at 04:04:05PM +0300, Jarkko Sakkinen wrote: > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > In memory encryption usage, guest memory may be encrypted with special > > key and can be accessed only by the VM itself. We call such memory > > private memory. It's valueless and sometimes can cause problem to allow > > userspace to access guest private memory. This patch extends the KVM > > memslot definition so that guest private memory can be provided though > > an inaccessible_notifier enlightened file descriptor (fd), without being > > mmaped into userspace. > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > additional KVM memslot fields private_fd/private_offset to allow > > userspace to specify that guest private memory provided from the > > private_fd and guest_phys_addr mapped at the private_offset of the > > private_fd, spanning a range of memory_size. > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > single memslot can maintain both private memory through private > > fd(private_fd/private_offset) and shared memory through > > hva(userspace_addr). Whether the private or shared part is visible to > > guest is maintained by other KVM code. > > > > Since there is no userspace mapping for private fd so we cannot > > get_user_pages() to get the pfn in KVM, instead we add a new > > inaccessible_notifier in the internal memslot structure and rely on it > > to get pfn by interacting with the memory file systems. > > > > Together with the change, a new config HAVE_KVM_PRIVATE_MEM is added and > > right now it is selected on X86_64 for Intel TDX usage. > > > > To make code maintenance easy, internally we use a binary compatible > > alias struct kvm_user_mem_region to handle both the normal and the > > '_ext' variants. > > > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > > What if userspace_addr would contain address of an extension structure, > if the flag is set, instead of shared address? I.e. interpret that field > differently (could be turned into union too ofc). > > That idea could be at least re-used, if there's ever any new KVM_MEM_* > flags that would need an extension. > > E.g. have struct kvm_userspace_memory_private, which contains shared > address, fd and the offset. Or add a new ioctl number instead of messing with the existing parameter structure, e.g. KVM_SET_USER_MEMORY_REGION_PRIVATE. With this alternative and the current approach in the patch, it would be better just to redefine the struct fields that are common. It actually would reduce redundancy because then there is no need to create that somewhat confusing kernel version of the same struct, right? You don't save any redundancy with this "embedded struct" approach. BR, Jarkko
Hi, I'm not sure if this patch or the last one might be the best place for it, but I think it would be useful to have a KVM_CAP associated with this. I am working on getting kvmtool to work with this, and I haven't found a clean way of getting it to discover whether mem_private is supported. Thanks., /fuad On Thu, Sep 15, 2022 at 3:35 PM Chao Peng <chao.p.peng@linux.intel.com> wrote: > > In memory encryption usage, guest memory may be encrypted with special > key and can be accessed only by the VM itself. We call such memory > private memory. It's valueless and sometimes can cause problem to allow > userspace to access guest private memory. This patch extends the KVM > memslot definition so that guest private memory can be provided though > an inaccessible_notifier enlightened file descriptor (fd), without being > mmaped into userspace. > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > additional KVM memslot fields private_fd/private_offset to allow > userspace to specify that guest private memory provided from the > private_fd and guest_phys_addr mapped at the private_offset of the > private_fd, spanning a range of memory_size. > > The extended memslot can still have the userspace_addr(hva). When use, a > single memslot can maintain both private memory through private > fd(private_fd/private_offset) and shared memory through > hva(userspace_addr). Whether the private or shared part is visible to > guest is maintained by other KVM code. > > Since there is no userspace mapping for private fd so we cannot > get_user_pages() to get the pfn in KVM, instead we add a new > inaccessible_notifier in the internal memslot structure and rely on it > to get pfn by interacting with the memory file systems. > > Together with the change, a new config HAVE_KVM_PRIVATE_MEM is added and > right now it is selected on X86_64 for Intel TDX usage. > > To make code maintenance easy, internally we use a binary compatible > alias struct kvm_user_mem_region to handle both the normal and the > '_ext' variants. > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > --- > Documentation/virt/kvm/api.rst | 38 +++++++++++++++++++++----- > arch/x86/kvm/Kconfig | 1 + > arch/x86/kvm/x86.c | 2 +- > include/linux/kvm_host.h | 13 +++++++-- > include/uapi/linux/kvm.h | 28 +++++++++++++++++++ > virt/kvm/Kconfig | 3 +++ > virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------ > 7 files changed, 116 insertions(+), 18 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index abd7c32126ce..c1fac1e9f820 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry. > :Capability: KVM_CAP_USER_MEMORY > :Architectures: all > :Type: vm ioctl > -:Parameters: struct kvm_userspace_memory_region (in) > +:Parameters: struct kvm_userspace_memory_region(_ext) (in) > :Returns: 0 on success, -1 on error > > :: > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry. 
> __u64 userspace_addr; /* start of the userspace allocated memory */ > }; > > + struct kvm_userspace_memory_region_ext { > + struct kvm_userspace_memory_region region; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > + }; > + > /* for kvm_memory_region::flags */ > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > #define KVM_MEM_READONLY (1UL << 1) > + #define KVM_MEM_PRIVATE (1UL << 2) > > This ioctl allows the user to create, modify or delete a guest physical > memory slot. Bits 0-15 of "slot" specify the slot id and this value > @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr > be identical. This allows large pages in the guest to be backed by large > pages in the host. > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, > -to make a new slot read-only. In this case, writes to this memory will be > -posted to userspace as KVM_EXIT_MMIO exits. > +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region > +fields. It also includes additional fields for some specific features. See > +below description of flags field for more information. It's recommended to use > +kvm_userspace_memory_region_ext in new userspace code. > + > +The flags field supports below flags: > + > +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to > + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. > + > +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to > + make a new slot read-only. In this case, writes to this memory will be posted > + to userspace as KVM_EXIT_MMIO exits. > + > +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by > + a file descirptor(fd) and the content of the private memory is invisible to > + userspace. In this case, userspace should use private_fd/private_offset in > + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to > + guest. Userspace should guarantee not to map the same pfn indicated by > + private_fd/private_offset to different gfns with multiple memslots. Failed to > + do this may result undefined behavior. > > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of > the memory region are automatically reflected into the guest. For example, an > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index e3cbd7706136..31db64ec0b33 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -48,6 +48,7 @@ config KVM > select SRCU > select INTERVAL_TREE > select HAVE_KVM_PM_NOTIFIER if PM > + select HAVE_KVM_PRIVATE_MEM if X86_64 > help > Support hosting fully virtualized guest machines using hardware > virtualization extensions. 
You will need a fairly recent > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index d7374d768296..081f62ccc9a1 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -12183,7 +12183,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, > } > > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > - struct kvm_userspace_memory_region m; > + struct kvm_user_mem_region m; > > m.slot = id | (i << 16); > m.flags = 0; > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index f4519d3689e1..eac1787b899b 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -44,6 +44,7 @@ > > #include <asm/kvm_host.h> > #include <linux/kvm_dirty_ring.h> > +#include <linux/memfd.h> > > #ifndef KVM_MAX_VCPU_IDS > #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS > @@ -576,8 +577,16 @@ struct kvm_memory_slot { > u32 flags; > short id; > u16 as_id; > + struct file *private_file; > + loff_t private_offset; > + struct inaccessible_notifier notifier; > }; > > +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) > +{ > + return slot && (slot->flags & KVM_MEM_PRIVATE); > +} > + > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) > { > return slot->flags & KVM_MEM_LOG_DIRTY_PAGES; > @@ -1104,9 +1113,9 @@ enum kvm_mr_change { > }; > > int kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem); > + const struct kvm_user_mem_region *mem); > int __kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem); > + const struct kvm_user_mem_region *mem); > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); > int kvm_arch_prepare_memory_region(struct kvm *kvm, > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index eed0315a77a6..3ef462fb3b2a 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region { > __u64 userspace_addr; /* start of the userspace allocated memory */ > }; > > +struct kvm_userspace_memory_region_ext { > + struct kvm_userspace_memory_region region; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > +}; > + > +#ifdef __KERNEL__ > +/* > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access > + * all fields from the top-level "extended" region. 
> + */ > +struct kvm_user_mem_region { > + __u32 slot; > + __u32 flags; > + __u64 guest_phys_addr; > + __u64 memory_size; > + __u64 userspace_addr; > + __u64 private_offset; > + __u32 private_fd; > + __u32 pad1; > + __u64 pad2[14]; > +}; > +#endif > + > /* > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, > * other bits are reserved for kvm internal use which are defined in > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region { > */ > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) > #define KVM_MEM_READONLY (1UL << 1) > +#define KVM_MEM_PRIVATE (1UL << 2) > > /* for KVM_IRQ_LINE */ > struct kvm_irq_level { > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig > index a8c5c9f06b3c..ccaff13cc5b8 100644 > --- a/virt/kvm/Kconfig > +++ b/virt/kvm/Kconfig > @@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK > > config HAVE_KVM_PM_NOTIFIER > bool > + > +config HAVE_KVM_PRIVATE_MEM > + bool > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 584a5bab3af3..12dc0dc57b06 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm, > } > } > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem) > { > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; > > @@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, > * Must be called holding kvm->slots_lock for write. > */ > int __kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem) > + const struct kvm_user_mem_region *mem) > { > struct kvm_memory_slot *old, *new; > struct kvm_memslots *slots; > @@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm, > EXPORT_SYMBOL_GPL(__kvm_set_memory_region); > > int kvm_set_memory_region(struct kvm *kvm, > - const struct kvm_userspace_memory_region *mem) > + const struct kvm_user_mem_region *mem) > { > int r; > > @@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm, > EXPORT_SYMBOL_GPL(kvm_set_memory_region); > > static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, > - struct kvm_userspace_memory_region *mem) > + struct kvm_user_mem_region *mem) > { > if ((u16)mem->slot >= KVM_USER_MEM_SLOTS) > return -EINVAL; > @@ -4622,6 +4622,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm) > return fd; > } > > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \ > +do { \ > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ > + offsetof(struct kvm_userspace_memory_region, field)); \ > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ > + sizeof_field(struct kvm_userspace_memory_region, field)); \ > +} while (0) > + > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \ > +do { \ > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ > + offsetof(struct kvm_userspace_memory_region_ext, field)); \ > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \ > +} while (0) > + > +static void kvm_sanity_check_user_mem_region_alias(void) > +{ > + SANITY_CHECK_MEM_REGION_FIELD(slot); > + SANITY_CHECK_MEM_REGION_FIELD(flags); > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr); > + SANITY_CHECK_MEM_REGION_FIELD(memory_size); > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr); > + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset); > + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd); > +} > + > static long 
kvm_vm_ioctl(struct file *filp, > unsigned int ioctl, unsigned long arg) > { > @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp, > break; > } > case KVM_SET_USER_MEMORY_REGION: { > - struct kvm_userspace_memory_region kvm_userspace_mem; > + struct kvm_user_mem_region mem; > + unsigned long size = sizeof(struct kvm_userspace_memory_region); > + > + kvm_sanity_check_user_mem_region_alias(); > > r = -EFAULT; > - if (copy_from_user(&kvm_userspace_mem, argp, > - sizeof(kvm_userspace_mem))) > + if (copy_from_user(&mem, argp, size); > + goto out; > + > + r = -EINVAL; > + if (mem.flags & KVM_MEM_PRIVATE) > goto out; > > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > break; > } > case KVM_GET_DIRTY_LOG: { > -- > 2.25.1 >
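The discovery Fuad is asking about would normally be done with KVM_CHECK_EXTENSION on the /dev/kvm fd. No capability number for private memory is defined by this series, so the sketch below takes the capability as a parameter rather than inventing one:

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Returns the (positive) capability value if supported, 0 otherwise. */
    static int kvm_has_cap(int kvm_fd, long cap)
    {
            int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, cap);

            return r > 0 ? r : 0;
    }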
On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > additional KVM memslot fields private_fd/private_offset to allow > userspace to specify that guest private memory provided from the > private_fd and guest_phys_addr mapped at the private_offset of the > private_fd, spanning a range of memory_size. > > The extended memslot can still have the userspace_addr(hva). When use, a > single memslot can maintain both private memory through private > fd(private_fd/private_offset) and shared memory through > hva(userspace_addr). Whether the private or shared part is visible to > guest is maintained by other KVM code. What is anyway the appeal of private_offset field, instead of having just 1:1 association between regions and files, i.e. one memfd per region? If this was the case, then an extended struct would not be needed in the first place. A simple union inside the existing struct would do: union { __u64 userspace_addr, __u64 private_fd, }; BR, Jarkko
On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > additional KVM memslot fields private_fd/private_offset to allow > > userspace to specify that guest private memory provided from the > > private_fd and guest_phys_addr mapped at the private_offset of the > > private_fd, spanning a range of memory_size. > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > single memslot can maintain both private memory through private > > fd(private_fd/private_offset) and shared memory through > > hva(userspace_addr). Whether the private or shared part is visible to > > guest is maintained by other KVM code. > > What is anyway the appeal of private_offset field, instead of having just > 1:1 association between regions and files, i.e. one memfd per region? > > If this was the case, then an extended struct would not be needed in the > first place. A simple union inside the existing struct would do: > > union { > __u64 userspace_addr, > __u64 private_fd, > }; Also, why is this mechanism just for fd's with MFD_INACCESSIBLE flag? I'd consider instead having KVM_MEM_FD flag. For generic KVM (if memfd does not have MFD_INACCESSIBLE set), KVM could just use the memory as it is using mapped memory. This would simplify user space code, as you can the use the same thing for both cases. BR, Jarkko
On Thu, Oct 06, 2022, Jarkko Sakkinen wrote: > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > > additional KVM memslot fields private_fd/private_offset to allow > > > userspace to specify that guest private memory provided from the > > > private_fd and guest_phys_addr mapped at the private_offset of the > > > private_fd, spanning a range of memory_size. > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > > single memslot can maintain both private memory through private > > > fd(private_fd/private_offset) and shared memory through > > > hva(userspace_addr). Whether the private or shared part is visible to > > > guest is maintained by other KVM code. > > > > What is anyway the appeal of private_offset field, instead of having just > > 1:1 association between regions and files, i.e. one memfd per region? Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM). E.g. if a vCPU converts a single page, it will be forced to wait until all other vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in memory. KVM's memslot updates also hold a mutex for the entire duration of the update, i.e. conversions on different vCPUs would be fully serialized, exacerbating the SRCU problem. KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any memslot is deleted. Taking both a private_fd and a shared userspace address allows userspace to convert between private and shared without having to manipulate memslots. Paolo's original idea (was sent off-list): : The problem is that KVM_SET_USER_MEMORY_REGION and memslots in general : are designed around (S)RCU. It is way too slow (in both QEMU and KVM) : to be called on every private<->shared conversion with 4K granularity, : and it tends naturally to have quadratic behavior (though, at least for : KVM, the in-progress "fast memslots" series would avoid that). : : Since private PTEs are persistent, and userspace cannot access the memfd : in any other way, userspace could use fallocate() to map/unmap an : address range as private, and KVM can treat everything that userspace : hasn't mapped as shared. : : This would be a new entry in struct guest_ops, called by fallocate(), : and the callback can take the mmu_lock for write to avoid racing with : page faults. This doesn't add any more contention than : KVM_SET_USER_MEMORY_REGION, since the latter takes slots_lock. If : there's something I'm missing then the mapping operation can use a : ioctl, while the unmapping can keep using FALLOC_FL_PUNCH_HOLE. : : Then: : : - for simplicity, mapping a private memslot fails if there are any : mappings (similar to the handling when F_SEAL_GUEST is set). : : - for TDX, accessing a nonexistent private PTE will cause a userspace : exit for a shared->private conversion request. For SNP, the guest will : do a page state change VMGEXIT to request an RMPUPDATE, which can cause : a userspace exit too; the consequent fallocate() on the private fd : invokes RMPUPDATE. : : - trying to map a shared PTE where there's already a private PTE causes : a userspace exit for a private->shared conversion request. : kvm_faultin_pfn or handle_abnormal_pfn can query this in the private-fd : inode, which is essentially a single pagecache_get_page call. 
: : - if userspace asks to map a private PTE where there's already a shared : PTE (which it can check because it has the mmu_lock taken for write), : KVM unmaps the shared PTE. > > > > If this was the case, then an extended struct would not be needed in the > > first place. A simple union inside the existing struct would do: > > > > union { > > __u64 userspace_addr, > > __u64 private_fd, > > }; > > Also, why is this mechanism just for fd's with MFD_INACCESSIBLE flag? I'd > consider instead having KVM_MEM_FD flag. For generic KVM (if memfd does not > have MFD_INACCESSIBLE set), KVM could just use the memory as it is using > mapped memory. This would simplify user space code, as you can the use the > same thing for both cases. I explored this idea too[*]. Because we want to support specifying both the private and shared backing stores in a single memslot, then we need two file descriptors so that shared memory can also use fd-based memory. [*] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
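A userspace-side sketch of the conversion flow Paolo outlines above, under the assumption that allocating a range in the private fd marks it private and punching a hole makes it shared again; the offsets here are offsets into the private fd, not guest physical addresses:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <err.h>

    /* Assumed semantics: allocate => private, punch hole => shared. */
    static void convert_to_private(int private_fd, off_t offset, off_t len)
    {
            if (fallocate(private_fd, 0, offset, len))
                    err(1, "fallocate(private)");
    }

    static void convert_to_shared(int private_fd, off_t offset, off_t len)
    {
            if (fallocate(private_fd,
                          FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          offset, len))
                    err(1, "fallocate(PUNCH_HOLE)");
    }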
On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote: > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote: > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > > > additional KVM memslot fields private_fd/private_offset to allow > > > > userspace to specify that guest private memory provided from the > > > > private_fd and guest_phys_addr mapped at the private_offset of the > > > > private_fd, spanning a range of memory_size. > > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > > > single memslot can maintain both private memory through private > > > > fd(private_fd/private_offset) and shared memory through > > > > hva(userspace_addr). Whether the private or shared part is visible to > > > > guest is maintained by other KVM code. > > > > > > What is anyway the appeal of private_offset field, instead of having just > > > 1:1 association between regions and files, i.e. one memfd per region? > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM). > E.g. if a vCPU converts a single page, it will be forced to wait until all other > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in > memory. KVM's memslot updates also hold a mutex for the entire duration of the > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating > the SRCU problem. > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any > memslot is deleted. > > Taking both a private_fd and a shared userspace address allows userspace to convert > between private and shared without having to manipulate memslots. Right, this was really good explanation, thank you. Still wondering could this possibly work (or not): 1. Union userspace_addr and private_fd. 2. Instead of introducing private_offset, use guest_phys_addr as the offset. BR, Jarkko
On Fri, Oct 07, 2022, Jarkko Sakkinen wrote: > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote: > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote: > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > > > > additional KVM memslot fields private_fd/private_offset to allow > > > > > userspace to specify that guest private memory provided from the > > > > > private_fd and guest_phys_addr mapped at the private_offset of the > > > > > private_fd, spanning a range of memory_size. > > > > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > > > > single memslot can maintain both private memory through private > > > > > fd(private_fd/private_offset) and shared memory through > > > > > hva(userspace_addr). Whether the private or shared part is visible to > > > > > guest is maintained by other KVM code. > > > > > > > > What is anyway the appeal of private_offset field, instead of having just > > > > 1:1 association between regions and files, i.e. one memfd per region? > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM). > > E.g. if a vCPU converts a single page, it will be forced to wait until all other > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in > > memory. KVM's memslot updates also hold a mutex for the entire duration of the > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating > > the SRCU problem. > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any > > memslot is deleted. > > > > Taking both a private_fd and a shared userspace address allows userspace to convert > > between private and shared without having to manipulate memslots. > > Right, this was really good explanation, thank you. > > Still wondering could this possibly work (or not): > > 1. Union userspace_addr and private_fd. No, because userspace needs to be able to provide both userspace_addr (shared memory) and private_fd (private memory) for a single memslot. > 2. Instead of introducing private_offset, use guest_phys_addr as the > offset. No, because that would force userspace to use a single private_fd for all of guest memory since it effectively means private_offset=0. And userspace couldn't skip over holes in guest memory, i.e. the size of the memfd would need to follow the max guest gpa. In other words, dropping private_offset could work, but it'd be quite kludgy and not worth saving 8 bytes.
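To spell out the arithmetic behind Sean's point: the commit message says guest_phys_addr is mapped at private_offset within private_fd, so resolving a GPA inside a KVM_MEM_PRIVATE slot to an fd offset would look roughly like this (a sketch; the helper name is made up):

    static loff_t gpa_to_private_offset(const struct kvm_user_mem_region *mem,
                                        __u64 gpa)
    {
            /* caller has checked gpa lies in [guest_phys_addr, +memory_size) */
            return mem->private_offset + (gpa - mem->guest_phys_addr);
    }

With a dedicated private_offset, the fd only has to cover the slot's own range rather than mirror the whole guest physical address space.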
On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote: > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote: > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote: > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote: > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > > > > > additional KVM memslot fields private_fd/private_offset to allow > > > > > > userspace to specify that guest private memory provided from the > > > > > > private_fd and guest_phys_addr mapped at the private_offset of the > > > > > > private_fd, spanning a range of memory_size. > > > > > > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > > > > > single memslot can maintain both private memory through private > > > > > > fd(private_fd/private_offset) and shared memory through > > > > > > hva(userspace_addr). Whether the private or shared part is visible to > > > > > > guest is maintained by other KVM code. > > > > > > > > > > What is anyway the appeal of private_offset field, instead of having just > > > > > 1:1 association between regions and files, i.e. one memfd per region? > > > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM). > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in > > > memory. KVM's memslot updates also hold a mutex for the entire duration of the > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating > > > the SRCU problem. > > > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any > > > memslot is deleted. > > > > > > Taking both a private_fd and a shared userspace address allows userspace to convert > > > between private and shared without having to manipulate memslots. > > > > Right, this was really good explanation, thank you. > > > > Still wondering could this possibly work (or not): > > > > 1. Union userspace_addr and private_fd. > > No, because userspace needs to be able to provide both userspace_addr (shared > memory) and private_fd (private memory) for a single memslot. Got it, thanks for clearing my misunderstandings on this topic, and it is quite obviously visible in 5/8 and 7/8. I.e. if I got it right, memblock can be partially private, and you dig the shared holes with KVM_MEMORY_ENCRYPT_UNREG_REGION. We have (in Enarx) ATM have memblock per host mmap, I was looking into this dilated by that mindset but makes definitely sense to support that. BR, Jarkko
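For reference, the "shared hole" conversion mentioned above would be driven from userspace roughly as below, assuming the series' reuse of KVM_MEMORY_ENCRYPT_UNREG_REGION with guest physical addresses (the exact semantics live in patches 5/8 and 7/8, so treat this as illustrative only):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>
    #include <err.h>

    /* Assumption: gpa/size describe a guest physical range inside a private slot. */
    static void make_range_shared(int vm_fd, __u64 gpa, __u64 size)
    {
            struct kvm_enc_region range = {
                    .addr = gpa,
                    .size = size,
            };

            if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &range))
                    err(1, "KVM_MEMORY_ENCRYPT_UNREG_REGION");
    }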
On Sat, Oct 08, 2022 at 12:54:32AM +0300, Jarkko Sakkinen wrote: > On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote: > > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote: > > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote: > > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote: > > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > > > > > > additional KVM memslot fields private_fd/private_offset to allow > > > > > > > userspace to specify that guest private memory provided from the > > > > > > > private_fd and guest_phys_addr mapped at the private_offset of the > > > > > > > private_fd, spanning a range of memory_size. > > > > > > > > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > > > > > > single memslot can maintain both private memory through private > > > > > > > fd(private_fd/private_offset) and shared memory through > > > > > > > hva(userspace_addr). Whether the private or shared part is visible to > > > > > > > guest is maintained by other KVM code. > > > > > > > > > > > > What is anyway the appeal of private_offset field, instead of having just > > > > > > 1:1 association between regions and files, i.e. one memfd per region? > > > > > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM). > > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other > > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in > > > > memory. KVM's memslot updates also hold a mutex for the entire duration of the > > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating > > > > the SRCU problem. > > > > > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any > > > > memslot is deleted. > > > > > > > > Taking both a private_fd and a shared userspace address allows userspace to convert > > > > between private and shared without having to manipulate memslots. > > > > > > Right, this was really good explanation, thank you. > > > > > > Still wondering could this possibly work (or not): > > > > > > 1. Union userspace_addr and private_fd. > > > > No, because userspace needs to be able to provide both userspace_addr (shared > > memory) and private_fd (private memory) for a single memslot. > > Got it, thanks for clearing my misunderstandings on this topic, and it > is quite obviously visible in 5/8 and 7/8. I.e. if I got it right, > memblock can be partially private, and you dig the shared holes with > KVM_MEMORY_ENCRYPT_UNREG_REGION. We have (in Enarx) ATM have memblock > per host mmap, I was looking into this dilated by that mindset but makes > definitely sense to support that. For me the most useful reference with this feature is kvm_set_phys_mem() implementation in privmem-v8 branch. Took while to find it because I did not have much experience with QEMU code base. I'd even recommend to mention that function in the cover letter because it is really good reference on how this feature is supposed to be used. BR, Jarkko
On Sat, Oct 08, 2022 at 07:15:17PM +0300, Jarkko Sakkinen wrote: > On Sat, Oct 08, 2022 at 12:54:32AM +0300, Jarkko Sakkinen wrote: > > On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote: > > > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote: > > > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote: > > > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote: > > > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > > > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > > > > > > > additional KVM memslot fields private_fd/private_offset to allow > > > > > > > > userspace to specify that guest private memory provided from the > > > > > > > > private_fd and guest_phys_addr mapped at the private_offset of the > > > > > > > > private_fd, spanning a range of memory_size. > > > > > > > > > > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > > > > > > > single memslot can maintain both private memory through private > > > > > > > > fd(private_fd/private_offset) and shared memory through > > > > > > > > hva(userspace_addr). Whether the private or shared part is visible to > > > > > > > > guest is maintained by other KVM code. > > > > > > > > > > > > > > What is anyway the appeal of private_offset field, instead of having just > > > > > > > 1:1 association between regions and files, i.e. one memfd per region? > > > > > > > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM). > > > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other > > > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in > > > > > memory. KVM's memslot updates also hold a mutex for the entire duration of the > > > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating > > > > > the SRCU problem. > > > > > > > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any > > > > > memslot is deleted. > > > > > > > > > > Taking both a private_fd and a shared userspace address allows userspace to convert > > > > > between private and shared without having to manipulate memslots. > > > > > > > > Right, this was really good explanation, thank you. > > > > > > > > Still wondering could this possibly work (or not): > > > > > > > > 1. Union userspace_addr and private_fd. > > > > > > No, because userspace needs to be able to provide both userspace_addr (shared > > > memory) and private_fd (private memory) for a single memslot. > > > > Got it, thanks for clearing my misunderstandings on this topic, and it > > is quite obviously visible in 5/8 and 7/8. I.e. if I got it right, > > memblock can be partially private, and you dig the shared holes with > > KVM_MEMORY_ENCRYPT_UNREG_REGION. We have (in Enarx) ATM have memblock > > per host mmap, I was looking into this dilated by that mindset but makes > > definitely sense to support that. > > For me the most useful reference with this feature is kvm_set_phys_mem() > implementation in privmem-v8 branch. Took while to find it because I did > not have much experience with QEMU code base. I'd even recommend to mention > that function in the cover letter because it is really good reference on > how this feature is supposed to be used. 
While learning the QEMU code, I also noticed a bunch of comparisons like this: if (slot->flags | KVM_MEM_PRIVATE) I guess those could just be replaced with unconditional fills, as it does no harm if KVM_MEM_PRIVATE is not set. BR, Jarkko
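For what it's worth, the reason that comparison is taken for every slot is that OR-ing in a non-zero flag can never yield zero; testing the bit needs a bitwise AND. A tiny stand-alone illustration (the function names are placeholders, not QEMU code):

#define KVM_MEM_PRIVATE (1UL << 2)	/* same value as in the patch */

/* Buggy form: flags | KVM_MEM_PRIVATE is non-zero for every slot. */
static int looks_private_always(unsigned int flags)
{
	return (flags | KVM_MEM_PRIVATE) != 0;	/* always 1 */
}

/* Intended test: non-zero only when the KVM_MEM_PRIVATE bit is set. */
static int is_private(unsigned int flags)
{
	return (flags & KVM_MEM_PRIVATE) != 0;
}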
On Sat, Oct 08, 2022 at 08:35:47PM +0300, Jarkko Sakkinen wrote: > On Sat, Oct 08, 2022 at 07:15:17PM +0300, Jarkko Sakkinen wrote: > > On Sat, Oct 08, 2022 at 12:54:32AM +0300, Jarkko Sakkinen wrote: > > > On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote: > > > > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote: > > > > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote: > > > > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote: > > > > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > > > > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > > > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > > > > > > > > additional KVM memslot fields private_fd/private_offset to allow > > > > > > > > > userspace to specify that guest private memory provided from the > > > > > > > > > private_fd and guest_phys_addr mapped at the private_offset of the > > > > > > > > > private_fd, spanning a range of memory_size. > > > > > > > > > > > > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > > > > > > > > single memslot can maintain both private memory through private > > > > > > > > > fd(private_fd/private_offset) and shared memory through > > > > > > > > > hva(userspace_addr). Whether the private or shared part is visible to > > > > > > > > > guest is maintained by other KVM code. > > > > > > > > > > > > > > > > What is anyway the appeal of private_offset field, instead of having just > > > > > > > > 1:1 association between regions and files, i.e. one memfd per region? > > > > > > > > > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM). > > > > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other > > > > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in > > > > > > memory. KVM's memslot updates also hold a mutex for the entire duration of the > > > > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating > > > > > > the SRCU problem. > > > > > > > > > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any > > > > > > memslot is deleted. > > > > > > > > > > > > Taking both a private_fd and a shared userspace address allows userspace to convert > > > > > > between private and shared without having to manipulate memslots. > > > > > > > > > > Right, this was really good explanation, thank you. > > > > > > > > > > Still wondering could this possibly work (or not): > > > > > > > > > > 1. Union userspace_addr and private_fd. > > > > > > > > No, because userspace needs to be able to provide both userspace_addr (shared > > > > memory) and private_fd (private memory) for a single memslot. > > > > > > Got it, thanks for clearing my misunderstandings on this topic, and it > > > is quite obviously visible in 5/8 and 7/8. I.e. if I got it right, > > > memblock can be partially private, and you dig the shared holes with > > > KVM_MEMORY_ENCRYPT_UNREG_REGION. We have (in Enarx) ATM have memblock > > > per host mmap, I was looking into this dilated by that mindset but makes > > > definitely sense to support that. > > > > For me the most useful reference with this feature is kvm_set_phys_mem() > > implementation in privmem-v8 branch. Took while to find it because I did > > not have much experience with QEMU code base. 
I'd even recommend to mention > > that function in the cover letter because it is really good reference on > > how this feature is supposed to be used. That's a good point, I can mention that if people find it useful. > > While learning QEMU code, I also noticed bunch of comparison like this: > > if (slot->flags | KVM_MEM_PRIVATE) > > I guess those could be just replaced with unconditional fills as it does > not do any harm, if KVM_MEM_PRIVATE is not set. Makes sense, thanks. Chao > > BR, Jarkko
On Mon, Oct 10, 2022 at 04:25:07PM +0800, Chao Peng wrote: > On Sat, Oct 08, 2022 at 08:35:47PM +0300, Jarkko Sakkinen wrote: > > On Sat, Oct 08, 2022 at 07:15:17PM +0300, Jarkko Sakkinen wrote: > > > On Sat, Oct 08, 2022 at 12:54:32AM +0300, Jarkko Sakkinen wrote: > > > > On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote: > > > > > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote: > > > > > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote: > > > > > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote: > > > > > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote: > > > > > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote: > > > > > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two > > > > > > > > > > additional KVM memslot fields private_fd/private_offset to allow > > > > > > > > > > userspace to specify that guest private memory provided from the > > > > > > > > > > private_fd and guest_phys_addr mapped at the private_offset of the > > > > > > > > > > private_fd, spanning a range of memory_size. > > > > > > > > > > > > > > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a > > > > > > > > > > single memslot can maintain both private memory through private > > > > > > > > > > fd(private_fd/private_offset) and shared memory through > > > > > > > > > > hva(userspace_addr). Whether the private or shared part is visible to > > > > > > > > > > guest is maintained by other KVM code. > > > > > > > > > > > > > > > > > > What is anyway the appeal of private_offset field, instead of having just > > > > > > > > > 1:1 association between regions and files, i.e. one memfd per region? > > > > > > > > > > > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM). > > > > > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other > > > > > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in > > > > > > > memory. KVM's memslot updates also hold a mutex for the entire duration of the > > > > > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating > > > > > > > the SRCU problem. > > > > > > > > > > > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any > > > > > > > memslot is deleted. > > > > > > > > > > > > > > Taking both a private_fd and a shared userspace address allows userspace to convert > > > > > > > between private and shared without having to manipulate memslots. > > > > > > > > > > > > Right, this was really good explanation, thank you. > > > > > > > > > > > > Still wondering could this possibly work (or not): > > > > > > > > > > > > 1. Union userspace_addr and private_fd. > > > > > > > > > > No, because userspace needs to be able to provide both userspace_addr (shared > > > > > memory) and private_fd (private memory) for a single memslot. > > > > > > > > Got it, thanks for clearing my misunderstandings on this topic, and it > > > > is quite obviously visible in 5/8 and 7/8. I.e. if I got it right, > > > > memblock can be partially private, and you dig the shared holes with > > > > KVM_MEMORY_ENCRYPT_UNREG_REGION. We have (in Enarx) ATM have memblock > > > > per host mmap, I was looking into this dilated by that mindset but makes > > > > definitely sense to support that. > > > > > > For me the most useful reference with this feature is kvm_set_phys_mem() > > > implementation in privmem-v8 branch. 
Took while to find it because > > > not have much experience with QEMU code base. I'd even recommend to mention > > that function in the cover letter because it is really good reference on > > how this feature is supposed to be used. > > That's a good point, I can mention that if people find useful. Yeah, I did an implementation for Enarx (https://www.enarx.dev/) using just that part as a reference. It has all the essentials you need to consider when you are already using the KVM API and want to add private regions. BR, Jarkko
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..c1fac1e9f820 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
 
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE	(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields. It also includes additional fields for some specific features. See
+below description of flags field for more information. It's recommended to use
+kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports below flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
+  make a new slot read-only. In this case, writes to this memory will be posted
+  to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
+  a file descirptor(fd) and the content of the private memory is invisible to
+  userspace. In this case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns with multiple memslots. Failed to
+  do this may result undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..31db64ec0b33 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,7 @@ config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select HAVE_KVM_PRIVATE_MEM if X86_64
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7374d768296..081f62ccc9a1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12183,7 +12183,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_user_mem_region m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f4519d3689e1..eac1787b899b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/memfd.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -576,8 +577,16 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *private_file;
+	loff_t private_offset;
+	struct inaccessible_notifier notifier;
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -1104,9 +1113,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_user_mem_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_user_mem_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index eed0315a77a6..3ef462fb3b2a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/*
+ * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
+ * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
+ * all fields from the top-level "extended" region.
+ */
+struct kvm_user_mem_region {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+#endif
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index a8c5c9f06b3c..ccaff13cc5b8 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK
 
 config HAVE_KVM_PM_NOTIFIER
 	bool
+
+config HAVE_KVM_PRIVATE_MEM
+	bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 584a5bab3af3..12dc0dc57b06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	int r;
 
@@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_user_mem_region *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4622,6 +4622,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=	\
+		     sizeof_field(struct kvm_userspace_memory_region, field));	\
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)				\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region_ext, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=	\
+		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+	SANITY_CHECK_MEM_REGION_FIELD(slot);
+	SANITY_CHECK_MEM_REGION_FIELD(flags);
+	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_user_mem_region mem;
+		unsigned long size = sizeof(struct kvm_userspace_memory_region);
+
+		kvm_sanity_check_user_mem_region_alias();
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-				   sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&mem, argp, size))
+			goto out;
+
+		r = -EINVAL;
+		if (mem.flags & KVM_MEM_PRIVATE)
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
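One usage note on the ioctl plumbing above: with only this patch applied, kvm_vm_ioctl() copies just sizeof(struct kvm_userspace_memory_region) from userspace and rejects KVM_MEM_PRIVATE with -EINVAL, so userspace keeps passing the base struct until later patches in the series accept the extended fields. A hedged sketch of that baseline call (values are illustrative, not from the patch):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int add_shared_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size, __u64 hva)
{
	struct kvm_userspace_memory_region region;

	memset(&region, 0, sizeof(region));
	region.slot            = slot;
	region.flags           = 0;	/* KVM_MEM_PRIVATE is still -EINVAL at this point */
	region.guest_phys_addr = gpa;
	region.memory_size     = size;
	region.userspace_addr  = hva;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}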