[kvmtool,v1,08/17] Use memfd for all guest ram allocations

Message ID	20221115111549.2784927-9-tabba@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> Date: Tue, 15 Nov 2022 11:15:40 +0000 In-Reply-To: <20221115111549.2784927-1-tabba@google.com> Mime-Version: 1.0 References: <20221115111549.2784927-1-tabba@google.com> Message-ID: <20221115111549.2784927-9-tabba@google.com> Subject: [PATCH kvmtool v1 08/17] Use memfd for all guest ram allocations From: Fuad Tabba <tabba@google.com> To: kvm@vger.kernel.org Cc: julien.thierry.kdev@gmail.com, andre.przywara@arm.com, alexandru.elisei@arm.com, will@kernel.org, tabba@google.com Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	Use memfd for guest vm memory allocation \| expand [kvmtool,v1,00/17] Use memfd for guest vm memory allocation [kvmtool,v1,01/17] Initialize the return value in kvm__for_each_mem_bank() [kvmtool,v1,02/17] Make mmap_hugetlbfs() static [kvmtool,v1,03/17] Rename parameter in mmap_anon_or_hugetlbfs() [kvmtool,v1,04/17] Add hostmem va to debug print [kvmtool,v1,05/17] Factor out getting the hugetlb block size [kvmtool,v1,06/17] Use memfd for hugetlbfs when allocating guest ram [kvmtool,v1,07/17] Make blk_size a parameter and pass it to mmap_hugetlbfs() [kvmtool,v1,08/17] Use memfd for all guest ram allocations [kvmtool,v1,09/17] Allocate pvtime memory with memfd [kvmtool,v1,10/17] Allocate vesa memory with memfd [kvmtool,v1,11/17] Add a function that allocates aligned memory if specified [kvmtool,v1,12/17] Use new function to align memory [kvmtool,v1,13/17] Remove struct fields and code used for alignment [kvmtool,v1,14/17] Replace kvm_arch_delete_ram with kvm_delete_ram [kvmtool,v1,15/17] Remove no-longer unused macro [kvmtool,v1,16/17] Factor out set_user_memory_region code [kvmtool,v1,17/17] Pass the memory file descriptor and offset when registering ram

Fuad Tabba Nov. 15, 2022, 11:15 a.m. UTC

Allocate all guest ram backed by memfd/ftruncate instead of
anonymous mmap. This will make it easier to use kvm with fd-based
kvm guest memory proposals [*]. It also would make it easier to
use ipc memory sharing should that be needed in the future.

Signed-off-by: Fuad Tabba <tabba@google.com>

[*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
---
 include/kvm/kvm.h  |  1 +
 include/kvm/util.h |  3 +++
 kvm.c              |  4 ++++
 util/util.c        | 33 ++++++++++++++++++++-------------
 4 files changed, 28 insertions(+), 13 deletions(-)

Alexandru Elisei Nov. 24, 2022, 11:01 a.m. UTC | #1

Hi,

On Tue, Nov 15, 2022 at 11:15:40AM +0000, Fuad Tabba wrote:
> Allocate all guest ram backed by memfd/ftruncate instead of
> anonymous mmap. This will make it easier to use kvm with fd-based
> kvm guest memory proposals [*]. It also would make it easier to
> use ipc memory sharing should that be needed in the future.

Today, there are two memory allocation paths:

- One using hugetlbfs when --hugetlbfs is specified on the command line, which
  creates a file.

- One using mmap, when --hugetlbfs is not specified.

How would support in kvmtool for the secret memfd look like? I assume there
would need to be some kind of command line parameter to kvmtool to instruct it
to use the secret memfd, right? What I'm trying to figure out is why is
necessary to make everything a memfd file instead of adding a third memory
allocation path just for that particular usecase (or merging the hugetlbfs path
if they are similar enough). Right now, the anonymous memory path is a
mmap(MAP_ANONYMOUS) call, simple and straightforward, and I would like some more
convincing that this change is needed.

Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
that? If more work is needed for it, then wouldn't it make more sense to do
all the changes at once? This change might look sensible right now, but it
might turn out that it was the wrong way to go about it when someone
actually starts implementing memory sharing.

Thanks,
Alex

> 
> Signed-off-by: Fuad Tabba <tabba@google.com>
> 
> [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> ---
>  include/kvm/kvm.h  |  1 +
>  include/kvm/util.h |  3 +++
>  kvm.c              |  4 ++++
>  util/util.c        | 33 ++++++++++++++++++++-------------
>  4 files changed, 28 insertions(+), 13 deletions(-)
> 
> diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> index 3872dc6..d0d519b 100644
> --- a/include/kvm/kvm.h
> +++ b/include/kvm/kvm.h
> @@ -87,6 +87,7 @@ struct kvm {
>  	struct kvm_config	cfg;
>  	int			sys_fd;		/* For system ioctls(), i.e. /dev/kvm */
>  	int			vm_fd;		/* For VM ioctls() */
> +	int			ram_fd;		/* For guest memory. */
>  	timer_t			timerid;	/* Posix timer for interrupts */
>  
>  	int			nrcpus;		/* Number of cpus to run */
> diff --git a/include/kvm/util.h b/include/kvm/util.h
> index 61a205b..369603b 100644
> --- a/include/kvm/util.h
> +++ b/include/kvm/util.h
> @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
>  }
>  
>  struct kvm;
> +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> +				   u64 size, u64 align);
>  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
>  
>  #endif /* KVM__UTIL_H */
> diff --git a/kvm.c b/kvm.c
> index 78bc0d8..ed29d68 100644
> --- a/kvm.c
> +++ b/kvm.c
> @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
>  	mutex_init(&kvm->mem_banks_lock);
>  	kvm->sys_fd = -1;
>  	kvm->vm_fd = -1;
> +	kvm->ram_fd = -1;
>  
>  #ifdef KVM_BRLOCK_DEBUG
>  	kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
>  
>  	kvm__arch_delete_ram(kvm);
>  
> +	if (kvm->ram_fd >= 0)
> +		close(kvm->ram_fd);
> +
>  	list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
>  		list_del(&bank->list);
>  		free(bank);
> diff --git a/util/util.c b/util/util.c
> index d3483d8..278bcc2 100644
> --- a/util/util.c
> +++ b/util/util.c
> @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
>  	return sfs.f_bsize;
>  }
>  
> -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
>  {
>  	const char *name = "kvmtool";
>  	unsigned int flags = 0;
>  	int fd;
> -	void *addr;
> -	int htsize = __builtin_ctzl(blk_size);
>  
> -	if ((1ULL << htsize) != blk_size)
> -		die("Hugepage size must be a power of 2.\n");
> +	if (hugetlb) {
> +		int htsize = __builtin_ctzl(blk_size);
>  
> -	flags |= MFD_HUGETLB;
> -	flags |= htsize << MFD_HUGE_SHIFT;
> +		if ((1ULL << htsize) != blk_size)
> +			die("Hugepage size must be a power of 2.\n");
> +
> +		flags |= MFD_HUGETLB;
> +		flags |= htsize << MFD_HUGE_SHIFT;
> +	}
>  
>  	fd = memfd_create(name, flags);
>  	if (fd < 0)
> -		die("Can't memfd_create for hugetlbfs map\n");
> +		die("Can't memfd_create for memory map\n");
> +
>  	if (ftruncate(fd, size) < 0)
>  		die("Can't ftruncate for mem mapping size %lld\n",
>  			(unsigned long long)size);
> -	addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> -	close(fd);
>  
> -	return addr;
> +	return fd;
>  }
>  
>  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
>  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
>  {
>  	u64 blk_size = 0;
> +	int fd;
>  
>  	/*
>  	 * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
>  		}
>  
>  		kvm->ram_pagesize = blk_size;
> -		return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
>  	} else {
>  		kvm->ram_pagesize = getpagesize();
> -		return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
>  	}
> +
> +	fd = memfd_alloc(size, htlbfs_path, blk_size);
> +	if (fd < 0)
> +		return MAP_FAILED;
> +
> +	kvm->ram_fd = fd;
> +	return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
>  }
> -- 
> 2.38.1.431.g37b22c650d-goog
>

Fuad Tabba Nov. 24, 2022, 3:19 p.m. UTC | #2

Hi,

On Thu, Nov 24, 2022 at 11:01 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi,
>
> On Tue, Nov 15, 2022 at 11:15:40AM +0000, Fuad Tabba wrote:
> > Allocate all guest ram backed by memfd/ftruncate instead of
> > anonymous mmap. This will make it easier to use kvm with fd-based
> > kvm guest memory proposals [*]. It also would make it easier to
> > use ipc memory sharing should that be needed in the future.
>
> Today, there are two memory allocation paths:
>
> - One using hugetlbfs when --hugetlbfs is specified on the command line, which
>   creates a file.
>
> - One using mmap, when --hugetlbfs is not specified.
>
> How would support in kvmtool for the secret memfd look like? I assume there
> would need to be some kind of command line parameter to kvmtool to instruct it
> to use the secret memfd, right? What I'm trying to figure out is why is
> necessary to make everything a memfd file instead of adding a third memory
> allocation path just for that particular usecase (or merging the hugetlbfs path
> if they are similar enough). Right now, the anonymous memory path is a
> mmap(MAP_ANONYMOUS) call, simple and straightforward, and I would like some more
> convincing that this change is needed.

This isn't about secret mem, but about the unified proposal for guest
private memory [1].  This proposal requires the use of a file
descriptor (fd) as the canonical reference to guest memory in the host
(i.e., VMM) and does not provide an alternative using a
straightforward anonymous mmap(). The idea is that guest memory
shouldn’t have mapping in the host by default, but unless explicitly
shared and needed.

Moreover, using an fd would be more generic and flexible, which allows
for other use cases (such as IPC), or to map that memory in userspace
when appropriate. It also allows us to use the same interface for
hugetlb. Considering that other VMMs (e.g., qemu [2], crosvm [3])
already back guest memory with memfd, and looking at how private
memory would work [4], it seemed to me that the best way to unify all
of these needs is to have the backend of guest memory be fd-based.

It would be possible to have that as a separate kvmtool option, where
fd-backed memory would be only for guests that use the new private
memory extensions. However, that would mean more code to maintain that
is essentially doing the same thing (allocating and mapping memory).

I thought that it would be worth having these patches in kvmtool now
rather than wait until the guest private memory has made it into kvm.
These patches simplify the code as an end result, make it easier to
allocate and map aligned memory without overallocating, and bring
kvmtool closer to a more consistent way of allocating guest memory, in
a similar manner to other VMMs.

Moreover, with the private memory proposal [1], whether the fd-based
support available can be queried by a KVM capability. If it's
available kvmtool would use the fd, if it's not available, it would
use the host-mapped address. Therefore, there isn’t a need for a
command line option, unless for testing.

I have implemented this all the way to support the private memory
proposal in kvmtool [5], but I haven’t posted these since the private
memory proposal itself is still in flux. If you’re interested you
could have a look on how I would go ahead building on these patches
for full support of private memory backed by an fd.

> Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> that? If more work is needed for it, then wouldn't it make more sense to do
> all the changes at once? This change might look sensible right now, but it
> might turn out that it was the wrong way to go about it when someone
> actually starts implementing memory sharing.

I don’t plan on supporting IPC memory sharing. I just mentioned that
as yet another use case that would benefit from guest memory being
fd-based, should kvmtool decide to support it in the future.

Cheers,
/fuad

[1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
[2] https://github.com/qemu/qemu
[3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
[4] https://github.com/chao-p/qemu/tree/privmem-v9
[5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core



>
> Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> that? If more work is needed for it, then wouldn't it make more sense to do
> all the changes at once? This change might look sensible right now, but it
> might turn out that it was the wrong way to go about it when someone
> actually starts implementing memory sharing.
>
> Thanks,
> Alex
>
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> >
> > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > ---
> >  include/kvm/kvm.h  |  1 +
> >  include/kvm/util.h |  3 +++
> >  kvm.c              |  4 ++++
> >  util/util.c        | 33 ++++++++++++++++++++-------------
> >  4 files changed, 28 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > index 3872dc6..d0d519b 100644
> > --- a/include/kvm/kvm.h
> > +++ b/include/kvm/kvm.h
> > @@ -87,6 +87,7 @@ struct kvm {
> >       struct kvm_config       cfg;
> >       int                     sys_fd;         /* For system ioctls(), i.e. /dev/kvm */
> >       int                     vm_fd;          /* For VM ioctls() */
> > +     int                     ram_fd;         /* For guest memory. */
> >       timer_t                 timerid;        /* Posix timer for interrupts */
> >
> >       int                     nrcpus;         /* Number of cpus to run */
> > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > index 61a205b..369603b 100644
> > --- a/include/kvm/util.h
> > +++ b/include/kvm/util.h
> > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> >  }
> >
> >  struct kvm;
> > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > +                                u64 size, u64 align);
> >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> >
> >  #endif /* KVM__UTIL_H */
> > diff --git a/kvm.c b/kvm.c
> > index 78bc0d8..ed29d68 100644
> > --- a/kvm.c
> > +++ b/kvm.c
> > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> >       mutex_init(&kvm->mem_banks_lock);
> >       kvm->sys_fd = -1;
> >       kvm->vm_fd = -1;
> > +     kvm->ram_fd = -1;
> >
> >  #ifdef KVM_BRLOCK_DEBUG
> >       kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> >
> >       kvm__arch_delete_ram(kvm);
> >
> > +     if (kvm->ram_fd >= 0)
> > +             close(kvm->ram_fd);
> > +
> >       list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> >               list_del(&bank->list);
> >               free(bank);
> > diff --git a/util/util.c b/util/util.c
> > index d3483d8..278bcc2 100644
> > --- a/util/util.c
> > +++ b/util/util.c
> > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> >       return sfs.f_bsize;
> >  }
> >
> > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> >  {
> >       const char *name = "kvmtool";
> >       unsigned int flags = 0;
> >       int fd;
> > -     void *addr;
> > -     int htsize = __builtin_ctzl(blk_size);
> >
> > -     if ((1ULL << htsize) != blk_size)
> > -             die("Hugepage size must be a power of 2.\n");
> > +     if (hugetlb) {
> > +             int htsize = __builtin_ctzl(blk_size);
> >
> > -     flags |= MFD_HUGETLB;
> > -     flags |= htsize << MFD_HUGE_SHIFT;
> > +             if ((1ULL << htsize) != blk_size)
> > +                     die("Hugepage size must be a power of 2.\n");
> > +
> > +             flags |= MFD_HUGETLB;
> > +             flags |= htsize << MFD_HUGE_SHIFT;
> > +     }
> >
> >       fd = memfd_create(name, flags);
> >       if (fd < 0)
> > -             die("Can't memfd_create for hugetlbfs map\n");
> > +             die("Can't memfd_create for memory map\n");
> > +
> >       if (ftruncate(fd, size) < 0)
> >               die("Can't ftruncate for mem mapping size %lld\n",
> >                       (unsigned long long)size);
> > -     addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > -     close(fd);
> >
> > -     return addr;
> > +     return fd;
> >  }
> >
> >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> >  {
> >       u64 blk_size = 0;
> > +     int fd;
> >
> >       /*
> >        * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> >               }
> >
> >               kvm->ram_pagesize = blk_size;
> > -             return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> >       } else {
> >               kvm->ram_pagesize = getpagesize();
> > -             return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> >       }
> > +
> > +     fd = memfd_alloc(size, htlbfs_path, blk_size);
> > +     if (fd < 0)
> > +             return MAP_FAILED;
> > +
> > +     kvm->ram_fd = fd;
> > +     return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> >  }
> > --
> > 2.38.1.431.g37b22c650d-goog
> >

Alexandru Elisei Nov. 24, 2022, 5:14 p.m. UTC | #3

Hi,

On Thu, Nov 24, 2022 at 03:19:34PM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Thu, Nov 24, 2022 at 11:01 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Hi,
> >
> > On Tue, Nov 15, 2022 at 11:15:40AM +0000, Fuad Tabba wrote:
> > > Allocate all guest ram backed by memfd/ftruncate instead of
> > > anonymous mmap. This will make it easier to use kvm with fd-based
> > > kvm guest memory proposals [*]. It also would make it easier to
> > > use ipc memory sharing should that be needed in the future.
> >
> > Today, there are two memory allocation paths:
> >
> > - One using hugetlbfs when --hugetlbfs is specified on the command line, which
> >   creates a file.
> >
> > - One using mmap, when --hugetlbfs is not specified.
> >
> > How would support in kvmtool for the secret memfd look like? I assume there
> > would need to be some kind of command line parameter to kvmtool to instruct it
> > to use the secret memfd, right? What I'm trying to figure out is why is
> > necessary to make everything a memfd file instead of adding a third memory
> > allocation path just for that particular usecase (or merging the hugetlbfs path
> > if they are similar enough). Right now, the anonymous memory path is a
> > mmap(MAP_ANONYMOUS) call, simple and straightforward, and I would like some more
> > convincing that this change is needed.
> 
> This isn't about secret mem, but about the unified proposal for guest
> private memory [1].  This proposal requires the use of a file
> descriptor (fd) as the canonical reference to guest memory in the host
> (i.e., VMM) and does not provide an alternative using a
> straightforward anonymous mmap(). The idea is that guest memory
> shouldn’t have mapping in the host by default, but unless explicitly
> shared and needed.

I think you misunderstood me. I wasn't asking why kvmtool should get
support for guest private memory, I was asking why kvmtool should allocate
**all types of memory** using memfd. Your series allocates **all** memory
using memfd. I never said that kvmtool should or should not get support for
private memory.

> 
> Moreover, using an fd would be more generic and flexible, which allows
> for other use cases (such as IPC), or to map that memory in userspace
> when appropriate. It also allows us to use the same interface for
> hugetlb. Considering that other VMMs (e.g., qemu [2], crosvm [3])
> already back guest memory with memfd, and looking at how private
> memory would work [4], it seemed to me that the best way to unify all
> of these needs is to have the backend of guest memory be fd-based.
> 
> It would be possible to have that as a separate kvmtool option, where
> fd-backed memory would be only for guests that use the new private
> memory extensions. However, that would mean more code to maintain that
> is essentially doing the same thing (allocating and mapping memory).
> 
> I thought that it would be worth having these patches in kvmtool now
> rather than wait until the guest private memory has made it into kvm.
> These patches simplify the code as an end result, make it easier to

In the non-hugetlbfs case, before:

        kvm->arch.ram_alloc_size = kvm->ram_size + SZ_2M;
        kvm->arch.ram_alloc_start = mmap_anon_or_hugetlbfs(kvm, kvm->cfg.hugetlbfs_path, kvm->arch.ram_alloc_size);

	/*
	 * mmap_anon_or_hugetlbfs expands to:
	 * getpagesize()
	 * mmap()
	 */

        kvm->ram_start = (void *)ALIGN((unsigned long)kvm->arch.ram_alloc_start, SZ_2M);

After:
	/* mmap_anon_or_hugetlbfs: */
	getpagesize();
	mmap(NULL, total_map, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memfd_alloc(size, htlbfs_path, blk_size);

	/*
	 * memfd_alloc() expands to:
	 * memfd_create()
	 * ftruncate
	 */

	addr_align = (void *)ALIGN((u64)addr_map, align_sz);
	mmap(addr_align, size, PROT_RW, MAP_PRIVATE | MAP_FIXED, fd, 0);

I'm counting one extra mmap(), one memfd_create() and one ftruncate() that
this series adds (not to mention all the boiler plate code to check for
errors).

Let's use another metric, let's count the number of lines of code. Before:
9 lines of code, after: -3 lines removed from arm/kvm.c and 86 lines of
code for memfd_alloc() and mmap_anon_or_hugetlbfs_align().

I'm struggling to find a metric by which the resulting code is simpler, as
you suggest.


> allocate and map aligned memory without overallocating, and bring

If your goal is to remove the overallocting of memory, you can just munmap
the extra memory after alignment is performed. To do that you don't need to
allocate everything using a memfd.

> kvmtool closer to a more consistent way of allocating guest memory, in
> a similar manner to other VMMs.

I would really appreciate pointing me to where qemu allocates memory using
memfd when invoked with -m <size>. I was able to follow the hostmem-ram
backend allocation function until g_malloc0(), but I couldn't find the
implementation for that.

> 
> Moreover, with the private memory proposal [1], whether the fd-based
> support available can be queried by a KVM capability. If it's
> available kvmtool would use the fd, if it's not available, it would
> use the host-mapped address. Therefore, there isn’t a need for a
> command line option, unless for testing.

Why would anyone want to use private memory by default for a
non-confidential VM?

> 
> I have implemented this all the way to support the private memory
> proposal in kvmtool [5], but I haven’t posted these since the private
> memory proposal itself is still in flux. If you’re interested you

Are you saying that you are not really sure how the userspace API will end
up looking? If that's the case, wouldn't it make more sense to wait for the
API to stabilize and then send support for it as one nice series?

Thanks,
Alex

> could have a look on how I would go ahead building on these patches
> for full support of private memory backed by an fd.
> 
> > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > that? If more work is needed for it, then wouldn't it make more sense to do
> > all the changes at once? This change might look sensible right now, but it
> > might turn out that it was the wrong way to go about it when someone
> > actually starts implementing memory sharing.
> 
> I don’t plan on supporting IPC memory sharing. I just mentioned that
> as yet another use case that would benefit from guest memory being
> fd-based, should kvmtool decide to support it in the future.
> 
> Cheers,
> /fuad
> 
> [1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> [2] https://github.com/qemu/qemu
> [3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
> [4] https://github.com/chao-p/qemu/tree/privmem-v9
> [5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core
> 
> 
> 
> >
> > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > that? If more work is needed for it, then wouldn't it make more sense to do
> > all the changes at once? This change might look sensible right now, but it
> > might turn out that it was the wrong way to go about it when someone
> > actually starts implementing memory sharing.
> >
> > Thanks,
> > Alex
> >
> > >
> > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > >
> > > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > ---
> > >  include/kvm/kvm.h  |  1 +
> > >  include/kvm/util.h |  3 +++
> > >  kvm.c              |  4 ++++
> > >  util/util.c        | 33 ++++++++++++++++++++-------------
> > >  4 files changed, 28 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > > index 3872dc6..d0d519b 100644
> > > --- a/include/kvm/kvm.h
> > > +++ b/include/kvm/kvm.h
> > > @@ -87,6 +87,7 @@ struct kvm {
> > >       struct kvm_config       cfg;
> > >       int                     sys_fd;         /* For system ioctls(), i.e. /dev/kvm */
> > >       int                     vm_fd;          /* For VM ioctls() */
> > > +     int                     ram_fd;         /* For guest memory. */
> > >       timer_t                 timerid;        /* Posix timer for interrupts */
> > >
> > >       int                     nrcpus;         /* Number of cpus to run */
> > > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > > index 61a205b..369603b 100644
> > > --- a/include/kvm/util.h
> > > +++ b/include/kvm/util.h
> > > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> > >  }
> > >
> > >  struct kvm;
> > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > > +                                u64 size, u64 align);
> > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> > >
> > >  #endif /* KVM__UTIL_H */
> > > diff --git a/kvm.c b/kvm.c
> > > index 78bc0d8..ed29d68 100644
> > > --- a/kvm.c
> > > +++ b/kvm.c
> > > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> > >       mutex_init(&kvm->mem_banks_lock);
> > >       kvm->sys_fd = -1;
> > >       kvm->vm_fd = -1;
> > > +     kvm->ram_fd = -1;
> > >
> > >  #ifdef KVM_BRLOCK_DEBUG
> > >       kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> > >
> > >       kvm__arch_delete_ram(kvm);
> > >
> > > +     if (kvm->ram_fd >= 0)
> > > +             close(kvm->ram_fd);
> > > +
> > >       list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> > >               list_del(&bank->list);
> > >               free(bank);
> > > diff --git a/util/util.c b/util/util.c
> > > index d3483d8..278bcc2 100644
> > > --- a/util/util.c
> > > +++ b/util/util.c
> > > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> > >       return sfs.f_bsize;
> > >  }
> > >
> > > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> > >  {
> > >       const char *name = "kvmtool";
> > >       unsigned int flags = 0;
> > >       int fd;
> > > -     void *addr;
> > > -     int htsize = __builtin_ctzl(blk_size);
> > >
> > > -     if ((1ULL << htsize) != blk_size)
> > > -             die("Hugepage size must be a power of 2.\n");
> > > +     if (hugetlb) {
> > > +             int htsize = __builtin_ctzl(blk_size);
> > >
> > > -     flags |= MFD_HUGETLB;
> > > -     flags |= htsize << MFD_HUGE_SHIFT;
> > > +             if ((1ULL << htsize) != blk_size)
> > > +                     die("Hugepage size must be a power of 2.\n");
> > > +
> > > +             flags |= MFD_HUGETLB;
> > > +             flags |= htsize << MFD_HUGE_SHIFT;
> > > +     }
> > >
> > >       fd = memfd_create(name, flags);
> > >       if (fd < 0)
> > > -             die("Can't memfd_create for hugetlbfs map\n");
> > > +             die("Can't memfd_create for memory map\n");
> > > +
> > >       if (ftruncate(fd, size) < 0)
> > >               die("Can't ftruncate for mem mapping size %lld\n",
> > >                       (unsigned long long)size);
> > > -     addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > > -     close(fd);
> > >
> > > -     return addr;
> > > +     return fd;
> > >  }
> > >
> > >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > >  {
> > >       u64 blk_size = 0;
> > > +     int fd;
> > >
> > >       /*
> > >        * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > >               }
> > >
> > >               kvm->ram_pagesize = blk_size;
> > > -             return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> > >       } else {
> > >               kvm->ram_pagesize = getpagesize();
> > > -             return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> > >       }
> > > +
> > > +     fd = memfd_alloc(size, htlbfs_path, blk_size);
> > > +     if (fd < 0)
> > > +             return MAP_FAILED;
> > > +
> > > +     kvm->ram_fd = fd;
> > > +     return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> > >  }
> > > --
> > > 2.38.1.431.g37b22c650d-goog
> > >

Alexandru Elisei Nov. 25, 2022, 10:43 a.m. UTC | #4

Hi,

Did some digging, correction(s) below.

On Thu, Nov 24, 2022 at 05:14:33PM +0000, Alexandru Elisei wrote:
> Hi,
> 
> On Thu, Nov 24, 2022 at 03:19:34PM +0000, Fuad Tabba wrote:
> > [..]
> > kvmtool closer to a more consistent way of allocating guest memory, in
> > a similar manner to other VMMs.
> 
> I would really appreciate pointing me to where qemu allocates memory using
> memfd when invoked with -m <size>. I was able to follow the hostmem-ram
> backend allocation function until g_malloc0(), but I couldn't find the
> implementation for that.

As far as I can tell, qemu allocates memory without backing storage (so by
specifying only -m on the command line) like this:

main -> qemu_init -> qmp_x_exit_preconfig -> qemu_init_board ->
create_default_memdev, which creates a TYPE_MEMORY_BACKEND_RAM object.

When creating the VM ram, the object's alloc function is called in:

create_default_memdev -> user_creatable_complete ->
host_memory_backend_complete) -> ram_backend_memory_alloc ->
memory_region_init_ram_flags_nomigrate -> qemu_ram_alloc ->
qemu_ram_alloc_internal -> ram_block_add -> qemu_anon_ram_alloc ->
qemu_ram_mmap(fd=-1,..) -> mmap_activate(..,fd=-1,..) ->
mmap(..,MAP_ANONYMOUS,fd=-1,..)

Unless I'm mistaken with the above (it was quite convoluted to unwrap all
of this), qemu doesn't allocate RAM for the VM using a backing file, unless
specifically requested by the user.

On the other hand. for crosvm:

main -> crosvm_main -> run_vm -> run_config (from src/scrovm/sys/unix.rs),
creates the memory layout in GuestMemory::new ->
MemoryMappingBuilder::new -> from_shared_memory -> offset -> build.

I couldn't find the implementation for MemoryMappingBuilder::build, if it's
anything like build_fixed, then indeed it looks like it uses the memfd
created with GuestMemory::create_shm and passed to
MemoryMappingBuild::from_shared_offset.

Thanks,
Alex

Fuad Tabba Nov. 25, 2022, 10:44 a.m. UTC | #5

Hi,

On Thu, Nov 24, 2022 at 5:14 PM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi,
>
> On Thu, Nov 24, 2022 at 03:19:34PM +0000, Fuad Tabba wrote:
> > Hi,
> >
> > On Thu, Nov 24, 2022 at 11:01 AM Alexandru Elisei
> > <alexandru.elisei@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Nov 15, 2022 at 11:15:40AM +0000, Fuad Tabba wrote:
> > > > Allocate all guest ram backed by memfd/ftruncate instead of
> > > > anonymous mmap. This will make it easier to use kvm with fd-based
> > > > kvm guest memory proposals [*]. It also would make it easier to
> > > > use ipc memory sharing should that be needed in the future.
> > >
> > > Today, there are two memory allocation paths:
> > >
> > > - One using hugetlbfs when --hugetlbfs is specified on the command line, which
> > >   creates a file.
> > >
> > > - One using mmap, when --hugetlbfs is not specified.
> > >
> > > How would support in kvmtool for the secret memfd look like? I assume there
> > > would need to be some kind of command line parameter to kvmtool to instruct it
> > > to use the secret memfd, right? What I'm trying to figure out is why is
> > > necessary to make everything a memfd file instead of adding a third memory
> > > allocation path just for that particular usecase (or merging the hugetlbfs path
> > > if they are similar enough). Right now, the anonymous memory path is a
> > > mmap(MAP_ANONYMOUS) call, simple and straightforward, and I would like some more
> > > convincing that this change is needed.
> >
> > This isn't about secret mem, but about the unified proposal for guest
> > private memory [1].  This proposal requires the use of a file
> > descriptor (fd) as the canonical reference to guest memory in the host
> > (i.e., VMM) and does not provide an alternative using a
> > straightforward anonymous mmap(). The idea is that guest memory
> > shouldn’t have mapping in the host by default, but unless explicitly
> > shared and needed.
>
> I think you misunderstood me. I wasn't asking why kvmtool should get
> support for guest private memory, I was asking why kvmtool should allocate
> **all types of memory** using memfd. Your series allocates **all** memory
> using memfd. I never said that kvmtool should or should not get support for
> private memory.

My reasoning for allocating all memory with memfd is that it's one
ring to rule them all :) By that I mean, with memfd, we can allocate
normal memory, hugetlb memory, in the future guest private memory, and
easily expand it to support things like IPC memory sharing in the
future.


> >
> > Moreover, using an fd would be more generic and flexible, which allows
> > for other use cases (such as IPC), or to map that memory in userspace
> > when appropriate. It also allows us to use the same interface for
> > hugetlb. Considering that other VMMs (e.g., qemu [2], crosvm [3])
> > already back guest memory with memfd, and looking at how private
> > memory would work [4], it seemed to me that the best way to unify all
> > of these needs is to have the backend of guest memory be fd-based.
> >
> > It would be possible to have that as a separate kvmtool option, where
> > fd-backed memory would be only for guests that use the new private
> > memory extensions. However, that would mean more code to maintain that
> > is essentially doing the same thing (allocating and mapping memory).
> >
> > I thought that it would be worth having these patches in kvmtool now
> > rather than wait until the guest private memory has made it into kvm.
> > These patches simplify the code as an end result, make it easier to
>
> In the non-hugetlbfs case, before:
>
>         kvm->arch.ram_alloc_size = kvm->ram_size + SZ_2M;
>         kvm->arch.ram_alloc_start = mmap_anon_or_hugetlbfs(kvm, kvm->cfg.hugetlbfs_path, kvm->arch.ram_alloc_size);
>
>         /*
>          * mmap_anon_or_hugetlbfs expands to:
>          * getpagesize()
>          * mmap()
>          */
>
>         kvm->ram_start = (void *)ALIGN((unsigned long)kvm->arch.ram_alloc_start, SZ_2M);
>
> After:
>         /* mmap_anon_or_hugetlbfs: */
>         getpagesize();
>         mmap(NULL, total_map, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         memfd_alloc(size, htlbfs_path, blk_size);
>
>         /*
>          * memfd_alloc() expands to:
>          * memfd_create()
>          * ftruncate
>          */
>
>         addr_align = (void *)ALIGN((u64)addr_map, align_sz);
>         mmap(addr_align, size, PROT_RW, MAP_PRIVATE | MAP_FIXED, fd, 0);
>
> I'm counting one extra mmap(), one memfd_create() and one ftruncate() that
> this series adds (not to mention all the boiler plate code to check for
> errors).
>
> Let's use another metric, let's count the number of lines of code. Before:
> 9 lines of code, after: -3 lines removed from arm/kvm.c and 86 lines of
> code for memfd_alloc() and mmap_anon_or_hugetlbfs_align().
>
> I'm struggling to find a metric by which the resulting code is simpler, as
> you suggest.

With simpler I didn't mean fewer lines of code, rather that it's
easier to reason about, more shared code. With this series, hugetlb
and normal memory creation follow the same path, and with the
refactoring a lot of arch-specific code is gone.

>
> > allocate and map aligned memory without overallocating, and bring
>
> If your goal is to remove the overallocting of memory, you can just munmap
> the extra memory after alignment is performed. To do that you don't need to
> allocate everything using a memfd.
>
> > kvmtool closer to a more consistent way of allocating guest memory, in
> > a similar manner to other VMMs.
>
> I would really appreciate pointing me to where qemu allocates memory using
> memfd when invoked with -m <size>. I was able to follow the hostmem-ram
> backend allocation function until g_malloc0(), but I couldn't find the
> implementation for that.

You're right. I thought that the memfd backend was the default, but
looking again it's not.

> >
> > Moreover, with the private memory proposal [1], whether the fd-based
> > support available can be queried by a KVM capability. If it's
> > available kvmtool would use the fd, if it's not available, it would
> > use the host-mapped address. Therefore, there isn’t a need for a
> > command line option, unless for testing.
>
> Why would anyone want to use private memory by default for a
> non-confidential VM?

The idea is that, at least when pKVM is enabled, we would use the
proposed extensions for all guests, i.e., memory via a file
descriptor, regardless whether the guest is protected (thus the memory
would be private), or not.


> >
> > I have implemented this all the way to support the private memory
> > proposal in kvmtool [5], but I haven’t posted these since the private
> > memory proposal itself is still in flux. If you’re interested you
>
> Are you saying that you are not really sure how the userspace API will end
> up looking? If that's the case, wouldn't it make more sense to wait for the
> API to stabilize and then send support for it as one nice series?

Yes, I'm not sure how it will end up looking. We know that it will be
fd-based though, which is why I thought it might be good to start with
that.

Cheers,
/fuad



> Thanks,
> Alex
>
> > could have a look on how I would go ahead building on these patches
> > for full support of private memory backed by an fd.
> >
> > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > all the changes at once? This change might look sensible right now, but it
> > > might turn out that it was the wrong way to go about it when someone
> > > actually starts implementing memory sharing.
> >
> > I don’t plan on supporting IPC memory sharing. I just mentioned that
> > as yet another use case that would benefit from guest memory being
> > fd-based, should kvmtool decide to support it in the future.
> >
> > Cheers,
> > /fuad
> >
> > [1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > [2] https://github.com/qemu/qemu
> > [3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
> > [4] https://github.com/chao-p/qemu/tree/privmem-v9
> > [5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core
> >
> >
> >
> > >
> > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > all the changes at once? This change might look sensible right now, but it
> > > might turn out that it was the wrong way to go about it when someone
> > > actually starts implementing memory sharing.
> > >
> > > Thanks,
> > > Alex
> > >
> > > >
> > > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > >
> > > > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > > ---
> > > >  include/kvm/kvm.h  |  1 +
> > > >  include/kvm/util.h |  3 +++
> > > >  kvm.c              |  4 ++++
> > > >  util/util.c        | 33 ++++++++++++++++++++-------------
> > > >  4 files changed, 28 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > > > index 3872dc6..d0d519b 100644
> > > > --- a/include/kvm/kvm.h
> > > > +++ b/include/kvm/kvm.h
> > > > @@ -87,6 +87,7 @@ struct kvm {
> > > >       struct kvm_config       cfg;
> > > >       int                     sys_fd;         /* For system ioctls(), i.e. /dev/kvm */
> > > >       int                     vm_fd;          /* For VM ioctls() */
> > > > +     int                     ram_fd;         /* For guest memory. */
> > > >       timer_t                 timerid;        /* Posix timer for interrupts */
> > > >
> > > >       int                     nrcpus;         /* Number of cpus to run */
> > > > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > > > index 61a205b..369603b 100644
> > > > --- a/include/kvm/util.h
> > > > +++ b/include/kvm/util.h
> > > > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> > > >  }
> > > >
> > > >  struct kvm;
> > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > > > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > > > +                                u64 size, u64 align);
> > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> > > >
> > > >  #endif /* KVM__UTIL_H */
> > > > diff --git a/kvm.c b/kvm.c
> > > > index 78bc0d8..ed29d68 100644
> > > > --- a/kvm.c
> > > > +++ b/kvm.c
> > > > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> > > >       mutex_init(&kvm->mem_banks_lock);
> > > >       kvm->sys_fd = -1;
> > > >       kvm->vm_fd = -1;
> > > > +     kvm->ram_fd = -1;
> > > >
> > > >  #ifdef KVM_BRLOCK_DEBUG
> > > >       kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > > > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> > > >
> > > >       kvm__arch_delete_ram(kvm);
> > > >
> > > > +     if (kvm->ram_fd >= 0)
> > > > +             close(kvm->ram_fd);
> > > > +
> > > >       list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> > > >               list_del(&bank->list);
> > > >               free(bank);
> > > > diff --git a/util/util.c b/util/util.c
> > > > index d3483d8..278bcc2 100644
> > > > --- a/util/util.c
> > > > +++ b/util/util.c
> > > > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> > > >       return sfs.f_bsize;
> > > >  }
> > > >
> > > > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> > > >  {
> > > >       const char *name = "kvmtool";
> > > >       unsigned int flags = 0;
> > > >       int fd;
> > > > -     void *addr;
> > > > -     int htsize = __builtin_ctzl(blk_size);
> > > >
> > > > -     if ((1ULL << htsize) != blk_size)
> > > > -             die("Hugepage size must be a power of 2.\n");
> > > > +     if (hugetlb) {
> > > > +             int htsize = __builtin_ctzl(blk_size);
> > > >
> > > > -     flags |= MFD_HUGETLB;
> > > > -     flags |= htsize << MFD_HUGE_SHIFT;
> > > > +             if ((1ULL << htsize) != blk_size)
> > > > +                     die("Hugepage size must be a power of 2.\n");
> > > > +
> > > > +             flags |= MFD_HUGETLB;
> > > > +             flags |= htsize << MFD_HUGE_SHIFT;
> > > > +     }
> > > >
> > > >       fd = memfd_create(name, flags);
> > > >       if (fd < 0)
> > > > -             die("Can't memfd_create for hugetlbfs map\n");
> > > > +             die("Can't memfd_create for memory map\n");
> > > > +
> > > >       if (ftruncate(fd, size) < 0)
> > > >               die("Can't ftruncate for mem mapping size %lld\n",
> > > >                       (unsigned long long)size);
> > > > -     addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > > > -     close(fd);
> > > >
> > > > -     return addr;
> > > > +     return fd;
> > > >  }
> > > >
> > > >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > >  {
> > > >       u64 blk_size = 0;
> > > > +     int fd;
> > > >
> > > >       /*
> > > >        * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > > > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > >               }
> > > >
> > > >               kvm->ram_pagesize = blk_size;
> > > > -             return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> > > >       } else {
> > > >               kvm->ram_pagesize = getpagesize();
> > > > -             return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> > > >       }
> > > > +
> > > > +     fd = memfd_alloc(size, htlbfs_path, blk_size);
> > > > +     if (fd < 0)
> > > > +             return MAP_FAILED;
> > > > +
> > > > +     kvm->ram_fd = fd;
> > > > +     return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> > > >  }
> > > > --
> > > > 2.38.1.431.g37b22c650d-goog
> > > >

Fuad Tabba Nov. 25, 2022, 10:58 a.m. UTC | #6

Hi,

On Fri, Nov 25, 2022 at 10:43 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi,
>
> Did some digging, correction(s) below.
>
> On Thu, Nov 24, 2022 at 05:14:33PM +0000, Alexandru Elisei wrote:
> > Hi,
> >
> > On Thu, Nov 24, 2022 at 03:19:34PM +0000, Fuad Tabba wrote:
> > > [..]
> > > kvmtool closer to a more consistent way of allocating guest memory, in
> > > a similar manner to other VMMs.
> >
> > I would really appreciate pointing me to where qemu allocates memory using
> > memfd when invoked with -m <size>. I was able to follow the hostmem-ram
> > backend allocation function until g_malloc0(), but I couldn't find the
> > implementation for that.
>
> As far as I can tell, qemu allocates memory without backing storage (so by
> specifying only -m on the command line) like this:
>
> main -> qemu_init -> qmp_x_exit_preconfig -> qemu_init_board ->
> create_default_memdev, which creates a TYPE_MEMORY_BACKEND_RAM object.
>
> When creating the VM ram, the object's alloc function is called in:
>
> create_default_memdev -> user_creatable_complete ->
> host_memory_backend_complete) -> ram_backend_memory_alloc ->
> memory_region_init_ram_flags_nomigrate -> qemu_ram_alloc ->
> qemu_ram_alloc_internal -> ram_block_add -> qemu_anon_ram_alloc ->
> qemu_ram_mmap(fd=-1,..) -> mmap_activate(..,fd=-1,..) ->
> mmap(..,MAP_ANONYMOUS,fd=-1,..)
>
> Unless I'm mistaken with the above (it was quite convoluted to unwrap all
> of this), qemu doesn't allocate RAM for the VM using a backing file, unless
> specifically requested by the user.

Thanks and sorry that you had to do some digging because of my mistake
(thinking that the memfd backend was the default one).

Cheers,
/fuad

> On the other hand. for crosvm:
>
> main -> crosvm_main -> run_vm -> run_config (from src/scrovm/sys/unix.rs),
> creates the memory layout in GuestMemory::new ->
> MemoryMappingBuilder::new -> from_shared_memory -> offset -> build.
>
> I couldn't find the implementation for MemoryMappingBuilder::build, if it's
> anything like build_fixed, then indeed it looks like it uses the memfd
> created with GuestMemory::create_shm and passed to
> MemoryMappingBuild::from_shared_offset.
>
> Thanks,
> Alex

Alexandru Elisei Nov. 25, 2022, 11:31 a.m. UTC | #7

Hi,

On Fri, Nov 25, 2022 at 10:44:39AM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Thu, Nov 24, 2022 at 5:14 PM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > Hi,
> >
> > On Thu, Nov 24, 2022 at 03:19:34PM +0000, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Thu, Nov 24, 2022 at 11:01 AM Alexandru Elisei
> > > <alexandru.elisei@arm.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Tue, Nov 15, 2022 at 11:15:40AM +0000, Fuad Tabba wrote:
> > > > > Allocate all guest ram backed by memfd/ftruncate instead of
> > > > > anonymous mmap. This will make it easier to use kvm with fd-based
> > > > > kvm guest memory proposals [*]. It also would make it easier to
> > > > > use ipc memory sharing should that be needed in the future.
> > > >
> > > > Today, there are two memory allocation paths:
> > > >
> > > > - One using hugetlbfs when --hugetlbfs is specified on the command line, which
> > > >   creates a file.
> > > >
> > > > - One using mmap, when --hugetlbfs is not specified.
> > > >
> > > > How would support in kvmtool for the secret memfd look like? I assume there
> > > > would need to be some kind of command line parameter to kvmtool to instruct it
> > > > to use the secret memfd, right? What I'm trying to figure out is why is
> > > > necessary to make everything a memfd file instead of adding a third memory
> > > > allocation path just for that particular usecase (or merging the hugetlbfs path
> > > > if they are similar enough). Right now, the anonymous memory path is a
> > > > mmap(MAP_ANONYMOUS) call, simple and straightforward, and I would like some more
> > > > convincing that this change is needed.
> > >
> > > This isn't about secret mem, but about the unified proposal for guest
> > > private memory [1].  This proposal requires the use of a file
> > > descriptor (fd) as the canonical reference to guest memory in the host
> > > (i.e., VMM) and does not provide an alternative using a
> > > straightforward anonymous mmap(). The idea is that guest memory
> > > shouldn’t have mapping in the host by default, but unless explicitly
> > > shared and needed.
> >
> > I think you misunderstood me. I wasn't asking why kvmtool should get
> > support for guest private memory, I was asking why kvmtool should allocate
> > **all types of memory** using memfd. Your series allocates **all** memory
> > using memfd. I never said that kvmtool should or should not get support for
> > private memory.
> 
> My reasoning for allocating all memory with memfd is that it's one
> ring to rule them all :) By that I mean, with memfd, we can allocate
> normal memory, hugetlb memory, in the future guest private memory, and
> easily expand it to support things like IPC memory sharing in the
> future.

Allocating anonymous memory is more complex now. And I could argue than the
hugetlbfs case is also more complex because there are now two branches that
do different things based whether it's hugetlbfs or not, instead of one.

As I stands right now, my opinion is that using memfd for anonymous RAM
only adds complexity for zero benefits.

> 
> 
> > >
> > > Moreover, using an fd would be more generic and flexible, which allows
> > > for other use cases (such as IPC), or to map that memory in userspace
> > > when appropriate. It also allows us to use the same interface for
> > > hugetlb. Considering that other VMMs (e.g., qemu [2], crosvm [3])
> > > already back guest memory with memfd, and looking at how private
> > > memory would work [4], it seemed to me that the best way to unify all
> > > of these needs is to have the backend of guest memory be fd-based.
> > >
> > > It would be possible to have that as a separate kvmtool option, where
> > > fd-backed memory would be only for guests that use the new private
> > > memory extensions. However, that would mean more code to maintain that
> > > is essentially doing the same thing (allocating and mapping memory).
> > >
> > > I thought that it would be worth having these patches in kvmtool now
> > > rather than wait until the guest private memory has made it into kvm.
> > > These patches simplify the code as an end result, make it easier to
> >
> > In the non-hugetlbfs case, before:
> >
> >         kvm->arch.ram_alloc_size = kvm->ram_size + SZ_2M;
> >         kvm->arch.ram_alloc_start = mmap_anon_or_hugetlbfs(kvm, kvm->cfg.hugetlbfs_path, kvm->arch.ram_alloc_size);
> >
> >         /*
> >          * mmap_anon_or_hugetlbfs expands to:
> >          * getpagesize()
> >          * mmap()
> >          */
> >
> >         kvm->ram_start = (void *)ALIGN((unsigned long)kvm->arch.ram_alloc_start, SZ_2M);
> >
> > After:
> >         /* mmap_anon_or_hugetlbfs: */
> >         getpagesize();
> >         mmap(NULL, total_map, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >         memfd_alloc(size, htlbfs_path, blk_size);
> >
> >         /*
> >          * memfd_alloc() expands to:
> >          * memfd_create()
> >          * ftruncate
> >          */
> >
> >         addr_align = (void *)ALIGN((u64)addr_map, align_sz);
> >         mmap(addr_align, size, PROT_RW, MAP_PRIVATE | MAP_FIXED, fd, 0);
> >
> > I'm counting one extra mmap(), one memfd_create() and one ftruncate() that
> > this series adds (not to mention all the boiler plate code to check for
> > errors).
> >
> > Let's use another metric, let's count the number of lines of code. Before:
> > 9 lines of code, after: -3 lines removed from arm/kvm.c and 86 lines of
> > code for memfd_alloc() and mmap_anon_or_hugetlbfs_align().
> >
> > I'm struggling to find a metric by which the resulting code is simpler, as
> > you suggest.
> 
> With simpler I didn't mean fewer lines of code, rather that it's
> easier to reason about, more shared code. With this series, hugetlb

How is all of the code that has been added easier to reason about than one
single mmap call?

> and normal memory creation follow the same path, and with the
> refactoring a lot of arch-specific code is gone.

Can you point me to the arch-specific code that this series removes? As far
as I can tell, the only arch specfic change is replacing
kvm_arch_delete_ram with kvm_delete_ram, which can be done independently of
this series. If it's only that function, I wouldn't call that "a lot" of
arch-specific code.

> 
> >
> > > allocate and map aligned memory without overallocating, and bring
> >
> > If your goal is to remove the overallocting of memory, you can just munmap
> > the extra memory after alignment is performed. To do that you don't need to
> > allocate everything using a memfd.
> >
> > > kvmtool closer to a more consistent way of allocating guest memory, in
> > > a similar manner to other VMMs.
> >
> > I would really appreciate pointing me to where qemu allocates memory using
> > memfd when invoked with -m <size>. I was able to follow the hostmem-ram
> > backend allocation function until g_malloc0(), but I couldn't find the
> > implementation for that.
> 
> You're right. I thought that the memfd backend was the default, but
> looking again it's not.
> 
> > >
> > > Moreover, with the private memory proposal [1], whether the fd-based
> > > support available can be queried by a KVM capability. If it's
> > > available kvmtool would use the fd, if it's not available, it would
> > > use the host-mapped address. Therefore, there isn’t a need for a
> > > command line option, unless for testing.
> >
> > Why would anyone want to use private memory by default for a
> > non-confidential VM?
> 
> The idea is that, at least when pKVM is enabled, we would use the
> proposed extensions for all guests, i.e., memory via a file
> descriptor, regardless whether the guest is protected (thus the memory
> would be private), or not.

kvmtool can be used to run virtual machines when pKVM is not enabled. In
fact, it has been used that way for way longer than pKVM has existed. What
about those users?

> 
> 
> > >
> > > I have implemented this all the way to support the private memory
> > > proposal in kvmtool [5], but I haven’t posted these since the private
> > > memory proposal itself is still in flux. If you’re interested you
> >
> > Are you saying that you are not really sure how the userspace API will end
> > up looking? If that's the case, wouldn't it make more sense to wait for the
> > API to stabilize and then send support for it as one nice series?
> 
> Yes, I'm not sure how it will end up looking. We know that it will be
> fd-based though, which is why I thought it might be good to start with
> that.

If you're not sure how it will end up looking, then why change kvmtool now?

Thanks,
Alex

> 
> Cheers,
> /fuad
> 
> 
> 
> > Thanks,
> > Alex
> >
> > > could have a look on how I would go ahead building on these patches
> > > for full support of private memory backed by an fd.
> > >
> > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > all the changes at once? This change might look sensible right now, but it
> > > > might turn out that it was the wrong way to go about it when someone
> > > > actually starts implementing memory sharing.
> > >
> > > I don’t plan on supporting IPC memory sharing. I just mentioned that
> > > as yet another use case that would benefit from guest memory being
> > > fd-based, should kvmtool decide to support it in the future.
> > >
> > > Cheers,
> > > /fuad
> > >
> > > [1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > [2] https://github.com/qemu/qemu
> > > [3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
> > > [4] https://github.com/chao-p/qemu/tree/privmem-v9
> > > [5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core
> > >
> > >
> > >
> > > >
> > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > all the changes at once? This change might look sensible right now, but it
> > > > might turn out that it was the wrong way to go about it when someone
> > > > actually starts implementing memory sharing.
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > > > >
> > > > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > >
> > > > > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > > > ---
> > > > >  include/kvm/kvm.h  |  1 +
> > > > >  include/kvm/util.h |  3 +++
> > > > >  kvm.c              |  4 ++++
> > > > >  util/util.c        | 33 ++++++++++++++++++++-------------
> > > > >  4 files changed, 28 insertions(+), 13 deletions(-)
> > > > >
> > > > > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > > > > index 3872dc6..d0d519b 100644
> > > > > --- a/include/kvm/kvm.h
> > > > > +++ b/include/kvm/kvm.h
> > > > > @@ -87,6 +87,7 @@ struct kvm {
> > > > >       struct kvm_config       cfg;
> > > > >       int                     sys_fd;         /* For system ioctls(), i.e. /dev/kvm */
> > > > >       int                     vm_fd;          /* For VM ioctls() */
> > > > > +     int                     ram_fd;         /* For guest memory. */
> > > > >       timer_t                 timerid;        /* Posix timer for interrupts */
> > > > >
> > > > >       int                     nrcpus;         /* Number of cpus to run */
> > > > > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > > > > index 61a205b..369603b 100644
> > > > > --- a/include/kvm/util.h
> > > > > +++ b/include/kvm/util.h
> > > > > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> > > > >  }
> > > > >
> > > > >  struct kvm;
> > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > > > > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > > > > +                                u64 size, u64 align);
> > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> > > > >
> > > > >  #endif /* KVM__UTIL_H */
> > > > > diff --git a/kvm.c b/kvm.c
> > > > > index 78bc0d8..ed29d68 100644
> > > > > --- a/kvm.c
> > > > > +++ b/kvm.c
> > > > > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> > > > >       mutex_init(&kvm->mem_banks_lock);
> > > > >       kvm->sys_fd = -1;
> > > > >       kvm->vm_fd = -1;
> > > > > +     kvm->ram_fd = -1;
> > > > >
> > > > >  #ifdef KVM_BRLOCK_DEBUG
> > > > >       kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > > > > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> > > > >
> > > > >       kvm__arch_delete_ram(kvm);
> > > > >
> > > > > +     if (kvm->ram_fd >= 0)
> > > > > +             close(kvm->ram_fd);
> > > > > +
> > > > >       list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> > > > >               list_del(&bank->list);
> > > > >               free(bank);
> > > > > diff --git a/util/util.c b/util/util.c
> > > > > index d3483d8..278bcc2 100644
> > > > > --- a/util/util.c
> > > > > +++ b/util/util.c
> > > > > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> > > > >       return sfs.f_bsize;
> > > > >  }
> > > > >
> > > > > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> > > > >  {
> > > > >       const char *name = "kvmtool";
> > > > >       unsigned int flags = 0;
> > > > >       int fd;
> > > > > -     void *addr;
> > > > > -     int htsize = __builtin_ctzl(blk_size);
> > > > >
> > > > > -     if ((1ULL << htsize) != blk_size)
> > > > > -             die("Hugepage size must be a power of 2.\n");
> > > > > +     if (hugetlb) {
> > > > > +             int htsize = __builtin_ctzl(blk_size);
> > > > >
> > > > > -     flags |= MFD_HUGETLB;
> > > > > -     flags |= htsize << MFD_HUGE_SHIFT;
> > > > > +             if ((1ULL << htsize) != blk_size)
> > > > > +                     die("Hugepage size must be a power of 2.\n");
> > > > > +
> > > > > +             flags |= MFD_HUGETLB;
> > > > > +             flags |= htsize << MFD_HUGE_SHIFT;
> > > > > +     }
> > > > >
> > > > >       fd = memfd_create(name, flags);
> > > > >       if (fd < 0)
> > > > > -             die("Can't memfd_create for hugetlbfs map\n");
> > > > > +             die("Can't memfd_create for memory map\n");
> > > > > +
> > > > >       if (ftruncate(fd, size) < 0)
> > > > >               die("Can't ftruncate for mem mapping size %lld\n",
> > > > >                       (unsigned long long)size);
> > > > > -     addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > > > > -     close(fd);
> > > > >
> > > > > -     return addr;
> > > > > +     return fd;
> > > > >  }
> > > > >
> > > > >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > >  {
> > > > >       u64 blk_size = 0;
> > > > > +     int fd;
> > > > >
> > > > >       /*
> > > > >        * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > > > > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > >               }
> > > > >
> > > > >               kvm->ram_pagesize = blk_size;
> > > > > -             return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> > > > >       } else {
> > > > >               kvm->ram_pagesize = getpagesize();
> > > > > -             return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> > > > >       }
> > > > > +
> > > > > +     fd = memfd_alloc(size, htlbfs_path, blk_size);
> > > > > +     if (fd < 0)
> > > > > +             return MAP_FAILED;
> > > > > +
> > > > > +     kvm->ram_fd = fd;
> > > > > +     return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> > > > >  }
> > > > > --
> > > > > 2.38.1.431.g37b22c650d-goog
> > > > >

Fuad Tabba Nov. 28, 2022, 8:49 a.m. UTC | #8

Hi,

First I want to mention that I really appreciate your feedback, which
has already been quite helpful. I would like you to please consider
this to be an RFC, and let's use these patches as a basis for
discussion and how they can be improved when I respin them, even if
that means waiting until the kvm fd-based proposal is finalized.

Now to answer your question...

<snip>

> > My reasoning for allocating all memory with memfd is that it's one
> > ring to rule them all :) By that I mean, with memfd, we can allocate
> > normal memory, hugetlb memory, in the future guest private memory, and
> > easily expand it to support things like IPC memory sharing in the
> > future.
>
> Allocating anonymous memory is more complex now. And I could argue than the
> hugetlbfs case is also more complex because there are now two branches that
> do different things based whether it's hugetlbfs or not, instead of one.

The additional complexity now comes not from using memfd, but from
unmapping and aligning code, which I think does benefit kvmtool in
general.

Hugetlbfd already had a different path before, now at least the other
path it has just has to do with setting flags for memfd_create(),
rather than allocating memory differently.


> As I stands right now, my opinion is that using memfd for anonymous RAM
> only adds complexity for zero benefits.
> >
> >
> > > >
> > > > Moreover, using an fd would be more generic and flexible, which allows
> > > > for other use cases (such as IPC), or to map that memory in userspace
> > > > when appropriate. It also allows us to use the same interface for
> > > > hugetlb. Considering that other VMMs (e.g., qemu [2], crosvm [3])
> > > > already back guest memory with memfd, and looking at how private
> > > > memory would work [4], it seemed to me that the best way to unify all
> > > > of these needs is to have the backend of guest memory be fd-based.
> > > >
> > > > It would be possible to have that as a separate kvmtool option, where
> > > > fd-backed memory would be only for guests that use the new private
> > > > memory extensions. However, that would mean more code to maintain that
> > > > is essentially doing the same thing (allocating and mapping memory).
> > > >
> > > > I thought that it would be worth having these patches in kvmtool now
> > > > rather than wait until the guest private memory has made it into kvm.
> > > > These patches simplify the code as an end result, make it easier to
> > >
> > > In the non-hugetlbfs case, before:
> > >
> > >         kvm->arch.ram_alloc_size = kvm->ram_size + SZ_2M;
> > >         kvm->arch.ram_alloc_start = mmap_anon_or_hugetlbfs(kvm, kvm->cfg.hugetlbfs_path, kvm->arch.ram_alloc_size);
> > >
> > >         /*
> > >          * mmap_anon_or_hugetlbfs expands to:
> > >          * getpagesize()
> > >          * mmap()
> > >          */
> > >
> > >         kvm->ram_start = (void *)ALIGN((unsigned long)kvm->arch.ram_alloc_start, SZ_2M);
> > >
> > > After:
> > >         /* mmap_anon_or_hugetlbfs: */
> > >         getpagesize();
> > >         mmap(NULL, total_map, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >         memfd_alloc(size, htlbfs_path, blk_size);
> > >
> > >         /*
> > >          * memfd_alloc() expands to:
> > >          * memfd_create()
> > >          * ftruncate
> > >          */
> > >
> > >         addr_align = (void *)ALIGN((u64)addr_map, align_sz);
> > >         mmap(addr_align, size, PROT_RW, MAP_PRIVATE | MAP_FIXED, fd, 0);
> > >
> > > I'm counting one extra mmap(), one memfd_create() and one ftruncate() that
> > > this series adds (not to mention all the boiler plate code to check for
> > > errors).
> > >
> > > Let's use another metric, let's count the number of lines of code. Before:
> > > 9 lines of code, after: -3 lines removed from arm/kvm.c and 86 lines of
> > > code for memfd_alloc() and mmap_anon_or_hugetlbfs_align().
> > >
> > > I'm struggling to find a metric by which the resulting code is simpler, as
> > > you suggest.
> >
> > With simpler I didn't mean fewer lines of code, rather that it's
> > easier to reason about, more shared code. With this series, hugetlb
>
> How is all of the code that has been added easier to reason about than one
> single mmap call?
>
> > and normal memory creation follow the same path, and with the
> > refactoring a lot of arch-specific code is gone.
>
> Can you point me to the arch-specific code that this series removes? As far
> as I can tell, the only arch specfic change is replacing
> kvm_arch_delete_ram with kvm_delete_ram, which can be done independently of
> this series. If it's only that function, I wouldn't call that "a lot" of
> arch-specific code.

kvmtool is an old and well established project. So I think that being
able to remove the memory-alignment code from the arm and riscv kvm.c,
two fields from the arm and riscv struct kvm_arch, as well as
kvm__arch_delete_ram from all architectures, is not that little for a
mature project such as this one. I agree that this could have been
done without using memfd, but at least for me, as a person who has
just posted their first contribution to kvmtool, it was easier to make
these changes when the tracking of the memory is based on an fd rather
than a userspace address (makes alignment and unmapping unused memory
much easier).

>
> >
> > >
> > > > allocate and map aligned memory without overallocating, and bring
> > >
> > > If your goal is to remove the overallocting of memory, you can just munmap
> > > the extra memory after alignment is performed. To do that you don't need to
> > > allocate everything using a memfd.
> > >
> > > > kvmtool closer to a more consistent way of allocating guest memory, in
> > > > a similar manner to other VMMs.
> > >
> > > I would really appreciate pointing me to where qemu allocates memory using
> > > memfd when invoked with -m <size>. I was able to follow the hostmem-ram
> > > backend allocation function until g_malloc0(), but I couldn't find the
> > > implementation for that.
> >
> > You're right. I thought that the memfd backend was the default, but
> > looking again it's not.
> >
> > > >
> > > > Moreover, with the private memory proposal [1], whether the fd-based
> > > > support available can be queried by a KVM capability. If it's
> > > > available kvmtool would use the fd, if it's not available, it would
> > > > use the host-mapped address. Therefore, there isn’t a need for a
> > > > command line option, unless for testing.
> > >
> > > Why would anyone want to use private memory by default for a
> > > non-confidential VM?
> >
> > The idea is that, at least when pKVM is enabled, we would use the
> > proposed extensions for all guests, i.e., memory via a file
> > descriptor, regardless whether the guest is protected (thus the memory
> > would be private), or not.
>
> kvmtool can be used to run virtual machines when pKVM is not enabled. In
> fact, it has been used that way for way longer than pKVM has existed. What
> about those users?

This does not affect these users, which is the point of these patches.
This allows new uses as well as maintaining the existing one, and
enables potentially new ones in the future.

> >
> >
> > > >
> > > > I have implemented this all the way to support the private memory
> > > > proposal in kvmtool [5], but I haven’t posted these since the private
> > > > memory proposal itself is still in flux. If you’re interested you
> > >
> > > Are you saying that you are not really sure how the userspace API will end
> > > up looking? If that's the case, wouldn't it make more sense to wait for the
> > > API to stabilize and then send support for it as one nice series?
> >
> > Yes, I'm not sure how it will end up looking. We know that it will be
> > fd-based though, which is why I thought it might be good to start with
> > that.
>
> If you're not sure how it will end up looking, then why change kvmtool now?

Because we are sure that it will be fd-based, and because I thought
that getting a head start to set the scene would be helpful. The part
that is uncertain is the kvm capabilities, flags, and names of the new
memory region extensions, none of which I address in these patches.

Cheers,
/fuad

> Thanks,
> Alex
>
> >
> > Cheers,
> > /fuad
> >
> >
> >
> > > Thanks,
> > > Alex
> > >
> > > > could have a look on how I would go ahead building on these patches
> > > > for full support of private memory backed by an fd.
> > > >
> > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > all the changes at once? This change might look sensible right now, but it
> > > > > might turn out that it was the wrong way to go about it when someone
> > > > > actually starts implementing memory sharing.
> > > >
> > > > I don’t plan on supporting IPC memory sharing. I just mentioned that
> > > > as yet another use case that would benefit from guest memory being
> > > > fd-based, should kvmtool decide to support it in the future.
> > > >
> > > > Cheers,
> > > > /fuad
> > > >
> > > > [1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > > [2] https://github.com/qemu/qemu
> > > > [3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
> > > > [4] https://github.com/chao-p/qemu/tree/privmem-v9
> > > > [5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core
> > > >
> > > >
> > > >
> > > > >
> > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > all the changes at once? This change might look sensible right now, but it
> > > > > might turn out that it was the wrong way to go about it when someone
> > > > > actually starts implementing memory sharing.
> > > > >
> > > > > Thanks,
> > > > > Alex
> > > > >
> > > > > >
> > > > > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > > >
> > > > > > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > > > > ---
> > > > > >  include/kvm/kvm.h  |  1 +
> > > > > >  include/kvm/util.h |  3 +++
> > > > > >  kvm.c              |  4 ++++
> > > > > >  util/util.c        | 33 ++++++++++++++++++++-------------
> > > > > >  4 files changed, 28 insertions(+), 13 deletions(-)
> > > > > >
> > > > > > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > > > > > index 3872dc6..d0d519b 100644
> > > > > > --- a/include/kvm/kvm.h
> > > > > > +++ b/include/kvm/kvm.h
> > > > > > @@ -87,6 +87,7 @@ struct kvm {
> > > > > >       struct kvm_config       cfg;
> > > > > >       int                     sys_fd;         /* For system ioctls(), i.e. /dev/kvm */
> > > > > >       int                     vm_fd;          /* For VM ioctls() */
> > > > > > +     int                     ram_fd;         /* For guest memory. */
> > > > > >       timer_t                 timerid;        /* Posix timer for interrupts */
> > > > > >
> > > > > >       int                     nrcpus;         /* Number of cpus to run */
> > > > > > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > > > > > index 61a205b..369603b 100644
> > > > > > --- a/include/kvm/util.h
> > > > > > +++ b/include/kvm/util.h
> > > > > > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> > > > > >  }
> > > > > >
> > > > > >  struct kvm;
> > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > > > > > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > > > > > +                                u64 size, u64 align);
> > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> > > > > >
> > > > > >  #endif /* KVM__UTIL_H */
> > > > > > diff --git a/kvm.c b/kvm.c
> > > > > > index 78bc0d8..ed29d68 100644
> > > > > > --- a/kvm.c
> > > > > > +++ b/kvm.c
> > > > > > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> > > > > >       mutex_init(&kvm->mem_banks_lock);
> > > > > >       kvm->sys_fd = -1;
> > > > > >       kvm->vm_fd = -1;
> > > > > > +     kvm->ram_fd = -1;
> > > > > >
> > > > > >  #ifdef KVM_BRLOCK_DEBUG
> > > > > >       kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > > > > > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> > > > > >
> > > > > >       kvm__arch_delete_ram(kvm);
> > > > > >
> > > > > > +     if (kvm->ram_fd >= 0)
> > > > > > +             close(kvm->ram_fd);
> > > > > > +
> > > > > >       list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> > > > > >               list_del(&bank->list);
> > > > > >               free(bank);
> > > > > > diff --git a/util/util.c b/util/util.c
> > > > > > index d3483d8..278bcc2 100644
> > > > > > --- a/util/util.c
> > > > > > +++ b/util/util.c
> > > > > > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> > > > > >       return sfs.f_bsize;
> > > > > >  }
> > > > > >
> > > > > > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> > > > > >  {
> > > > > >       const char *name = "kvmtool";
> > > > > >       unsigned int flags = 0;
> > > > > >       int fd;
> > > > > > -     void *addr;
> > > > > > -     int htsize = __builtin_ctzl(blk_size);
> > > > > >
> > > > > > -     if ((1ULL << htsize) != blk_size)
> > > > > > -             die("Hugepage size must be a power of 2.\n");
> > > > > > +     if (hugetlb) {
> > > > > > +             int htsize = __builtin_ctzl(blk_size);
> > > > > >
> > > > > > -     flags |= MFD_HUGETLB;
> > > > > > -     flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > +             if ((1ULL << htsize) != blk_size)
> > > > > > +                     die("Hugepage size must be a power of 2.\n");
> > > > > > +
> > > > > > +             flags |= MFD_HUGETLB;
> > > > > > +             flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > +     }
> > > > > >
> > > > > >       fd = memfd_create(name, flags);
> > > > > >       if (fd < 0)
> > > > > > -             die("Can't memfd_create for hugetlbfs map\n");
> > > > > > +             die("Can't memfd_create for memory map\n");
> > > > > > +
> > > > > >       if (ftruncate(fd, size) < 0)
> > > > > >               die("Can't ftruncate for mem mapping size %lld\n",
> > > > > >                       (unsigned long long)size);
> > > > > > -     addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > > > > > -     close(fd);
> > > > > >
> > > > > > -     return addr;
> > > > > > +     return fd;
> > > > > >  }
> > > > > >
> > > > > >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > >  {
> > > > > >       u64 blk_size = 0;
> > > > > > +     int fd;
> > > > > >
> > > > > >       /*
> > > > > >        * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > > > > > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > >               }
> > > > > >
> > > > > >               kvm->ram_pagesize = blk_size;
> > > > > > -             return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> > > > > >       } else {
> > > > > >               kvm->ram_pagesize = getpagesize();
> > > > > > -             return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> > > > > >       }
> > > > > > +
> > > > > > +     fd = memfd_alloc(size, htlbfs_path, blk_size);
> > > > > > +     if (fd < 0)
> > > > > > +             return MAP_FAILED;
> > > > > > +
> > > > > > +     kvm->ram_fd = fd;
> > > > > > +     return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> > > > > >  }
> > > > > > --
> > > > > > 2.38.1.431.g37b22c650d-goog
> > > > > >

Alexandru Elisei Nov. 29, 2022, 6:09 p.m. UTC | #9

Hi,

On Mon, Nov 28, 2022 at 08:49:29AM +0000, Fuad Tabba wrote:
> Hi,
> 
> First I want to mention that I really appreciate your feedback, which
> has already been quite helpful. I would like you to please consider
> this to be an RFC, and let's use these patches as a basis for
> discussion and how they can be improved when I respin them, even if
> that means waiting until the kvm fd-based proposal is finalized.

For that it's probably best if you add RFC to the subject prefix. That's
very helpful to let the reviewers know what to focus on, more on the
approach than on the finer details.

> 
> Now to answer your question...
> 
> <snip>
> 
> > > My reasoning for allocating all memory with memfd is that it's one
> > > ring to rule them all :) By that I mean, with memfd, we can allocate
> > > normal memory, hugetlb memory, in the future guest private memory, and
> > > easily expand it to support things like IPC memory sharing in the
> > > future.
> >
> > Allocating anonymous memory is more complex now. And I could argue than the
> > hugetlbfs case is also more complex because there are now two branches that
> > do different things based whether it's hugetlbfs or not, instead of one.
> 
> The additional complexity now comes not from using memfd, but from
> unmapping and aligning code, which I think does benefit kvmtool in
> general.

I wasn't referring to the unmapping/aligning part because that can be
implemented without using memfd.

> 
> Hugetlbfd already had a different path before, now at least the other
> path it has just has to do with setting flags for memfd_create(),
> rather than allocating memory differently.

Conceptually, both allocate memory the same way, by creating a temporary
file. I do agree though that using memfd is easier than fidling with the
temporary file name.

> 
> 
> > As I stands right now, my opinion is that using memfd for anonymous RAM
> > only adds complexity for zero benefits.
> > >
> > >
> > > > >
> > > > > Moreover, using an fd would be more generic and flexible, which allows
> > > > > for other use cases (such as IPC), or to map that memory in userspace
> > > > > when appropriate. It also allows us to use the same interface for
> > > > > hugetlb. Considering that other VMMs (e.g., qemu [2], crosvm [3])
> > > > > already back guest memory with memfd, and looking at how private
> > > > > memory would work [4], it seemed to me that the best way to unify all
> > > > > of these needs is to have the backend of guest memory be fd-based.
> > > > >
> > > > > It would be possible to have that as a separate kvmtool option, where
> > > > > fd-backed memory would be only for guests that use the new private
> > > > > memory extensions. However, that would mean more code to maintain that
> > > > > is essentially doing the same thing (allocating and mapping memory).
> > > > >
> > > > > I thought that it would be worth having these patches in kvmtool now
> > > > > rather than wait until the guest private memory has made it into kvm.
> > > > > These patches simplify the code as an end result, make it easier to
> > > >
> > > > In the non-hugetlbfs case, before:
> > > >
> > > >         kvm->arch.ram_alloc_size = kvm->ram_size + SZ_2M;
> > > >         kvm->arch.ram_alloc_start = mmap_anon_or_hugetlbfs(kvm, kvm->cfg.hugetlbfs_path, kvm->arch.ram_alloc_size);
> > > >
> > > >         /*
> > > >          * mmap_anon_or_hugetlbfs expands to:
> > > >          * getpagesize()
> > > >          * mmap()
> > > >          */
> > > >
> > > >         kvm->ram_start = (void *)ALIGN((unsigned long)kvm->arch.ram_alloc_start, SZ_2M);
> > > >
> > > > After:
> > > >         /* mmap_anon_or_hugetlbfs: */
> > > >         getpagesize();
> > > >         mmap(NULL, total_map, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > >         memfd_alloc(size, htlbfs_path, blk_size);
> > > >
> > > >         /*
> > > >          * memfd_alloc() expands to:
> > > >          * memfd_create()
> > > >          * ftruncate
> > > >          */
> > > >
> > > >         addr_align = (void *)ALIGN((u64)addr_map, align_sz);
> > > >         mmap(addr_align, size, PROT_RW, MAP_PRIVATE | MAP_FIXED, fd, 0);
> > > >
> > > > I'm counting one extra mmap(), one memfd_create() and one ftruncate() that
> > > > this series adds (not to mention all the boiler plate code to check for
> > > > errors).
> > > >
> > > > Let's use another metric, let's count the number of lines of code. Before:
> > > > 9 lines of code, after: -3 lines removed from arm/kvm.c and 86 lines of
> > > > code for memfd_alloc() and mmap_anon_or_hugetlbfs_align().
> > > >
> > > > I'm struggling to find a metric by which the resulting code is simpler, as
> > > > you suggest.
> > >
> > > With simpler I didn't mean fewer lines of code, rather that it's
> > > easier to reason about, more shared code. With this series, hugetlb
> >
> > How is all of the code that has been added easier to reason about than one
> > single mmap call?

Would be nice if this would be answered.

> >
> > > and normal memory creation follow the same path, and with the
> > > refactoring a lot of arch-specific code is gone.
> >
> > Can you point me to the arch-specific code that this series removes? As far
> > as I can tell, the only arch specfic change is replacing
> > kvm_arch_delete_ram with kvm_delete_ram, which can be done independently of
> > this series. If it's only that function, I wouldn't call that "a lot" of
> > arch-specific code.
> 
> kvmtool is an old and well established project. So I think that being
> able to remove the memory-alignment code from the arm and riscv kvm.c,
> two fields from the arm and riscv struct kvm_arch, as well as
> kvm__arch_delete_ram from all architectures, is not that little for a
> mature project such as this one. I agree that this could have been
> done without using memfd, but at least for me, as a person who has
  ^^^^^^^^^^^^^^^^^^^^^^^^
Good to see we're in agreement.

> just posted their first contribution to kvmtool, it was easier to make
> these changes when the tracking of the memory is based on an fd rather
> than a userspace address (makes alignment and unmapping unused memory
> much easier).

How does this look:

diff --git a/arm/kvm.c b/arm/kvm.c
index d51cc15d8b1c..13b0d10c9cd1 100644
--- a/arm/kvm.c
+++ b/arm/kvm.c
@@ -27,6 +27,7 @@ bool kvm__arch_cpu_supports_vm(void)
 void kvm__init_ram(struct kvm *kvm)
 {
        u64 phys_start, phys_size;
+       unsigned long extra_memory;
        void *host_mem;
        int err;

@@ -48,12 +49,16 @@ void kvm__init_ram(struct kvm *kvm)

        kvm->ram_start = (void *)ALIGN((unsigned long)kvm->arch.ram_alloc_start,
                                        SZ_2M);
+       extra_memory = (unsigned long)kvm->ram_start - (unsigned long)kvm->arch.ram_alloc_start;
+       if (extra_memory) {
+               munmap(kvm->arch.ram_alloc_start,  extra_memory);
+               /* Only here for kvm__arch_delete_ram, the fields should be removed */
+               kvm->arch.ram_alloc_start = kvm->ram_start;
+               kvm->arch.ram_alloc_size = kvm->ram_size;
+       }

-       madvise(kvm->arch.ram_alloc_start, kvm->arch.ram_alloc_size,
-               MADV_MERGEABLE);
-
-       madvise(kvm->arch.ram_alloc_start, kvm->arch.ram_alloc_size,
-               MADV_HUGEPAGE);
+       madvise(kvm->ram_start, kvm->ram_size, MADV_MERGEABLE);
+       madvise(kvm->ram_start, kvm->ram_size, MADV_HUGEPAGE);

        phys_start      = kvm->cfg.ram_addr;
        phys_size       = kvm->ram_size;

Warning, untested code.

Then you can fold this into mmap_anon_or_hugetlbfs_aligned(). You can do
whatever you want with the code without giving me any credit.

> 
> >
> > >
> > > >
> > > > > allocate and map aligned memory without overallocating, and bring
> > > >
> > > > If your goal is to remove the overallocting of memory, you can just munmap
> > > > the extra memory after alignment is performed. To do that you don't need to
> > > > allocate everything using a memfd.
> > > >
> > > > > kvmtool closer to a more consistent way of allocating guest memory, in
> > > > > a similar manner to other VMMs.
> > > >
> > > > I would really appreciate pointing me to where qemu allocates memory using
> > > > memfd when invoked with -m <size>. I was able to follow the hostmem-ram
> > > > backend allocation function until g_malloc0(), but I couldn't find the
> > > > implementation for that.
> > >
> > > You're right. I thought that the memfd backend was the default, but
> > > looking again it's not.
> > >
> > > > >
> > > > > Moreover, with the private memory proposal [1], whether the fd-based
> > > > > support available can be queried by a KVM capability. If it's
> > > > > available kvmtool would use the fd, if it's not available, it would
> > > > > use the host-mapped address. Therefore, there isn’t a need for a
> > > > > command line option, unless for testing.
> > > >
> > > > Why would anyone want to use private memory by default for a
> > > > non-confidential VM?
> > >
> > > The idea is that, at least when pKVM is enabled, we would use the
> > > proposed extensions for all guests, i.e., memory via a file
> > > descriptor, regardless whether the guest is protected (thus the memory
> > > would be private), or not.
> >
> > kvmtool can be used to run virtual machines when pKVM is not enabled. In
> > fact, it has been used that way for way longer than pKVM has existed. What
> > about those users?
> 
> This does not affect these users, which is the point of these patches.
> This allows new uses as well as maintaining the existing one, and
> enables potentially new ones in the future.

On the contrary, this affects people that don't use pKVM, because you are
changing how allocating anonymous memory works.

> 
> > >
> > >
> > > > >
> > > > > I have implemented this all the way to support the private memory
> > > > > proposal in kvmtool [5], but I haven’t posted these since the private
> > > > > memory proposal itself is still in flux. If you’re interested you
> > > >
> > > > Are you saying that you are not really sure how the userspace API will end
> > > > up looking? If that's the case, wouldn't it make more sense to wait for the
> > > > API to stabilize and then send support for it as one nice series?
> > >
> > > Yes, I'm not sure how it will end up looking. We know that it will be
> > > fd-based though, which is why I thought it might be good to start with
> > > that.
> >
> > If you're not sure how it will end up looking, then why change kvmtool now?
> 
> Because we are sure that it will be fd-based, and because I thought
> that getting a head start to set the scene would be helpful. The part
> that is uncertain is the kvm capabilities, flags, and names of the new
> memory region extensions, none of which I address in these patches.

I see, that makes sense. My feedback so far is that you haven't provided a
good reason why this change to anonymous memory makes sense right now.

Thanks,
Alex

> 
> Cheers,
> /fuad
> 
> > Thanks,
> > Alex
> >
> > >
> > > Cheers,
> > > /fuad
> > >
> > >
> > >
> > > > Thanks,
> > > > Alex
> > > >
> > > > > could have a look on how I would go ahead building on these patches
> > > > > for full support of private memory backed by an fd.
> > > > >
> > > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > > all the changes at once? This change might look sensible right now, but it
> > > > > > might turn out that it was the wrong way to go about it when someone
> > > > > > actually starts implementing memory sharing.
> > > > >
> > > > > I don’t plan on supporting IPC memory sharing. I just mentioned that
> > > > > as yet another use case that would benefit from guest memory being
> > > > > fd-based, should kvmtool decide to support it in the future.
> > > > >
> > > > > Cheers,
> > > > > /fuad
> > > > >
> > > > > [1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > > > [2] https://github.com/qemu/qemu
> > > > > [3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
> > > > > [4] https://github.com/chao-p/qemu/tree/privmem-v9
> > > > > [5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > > all the changes at once? This change might look sensible right now, but it
> > > > > > might turn out that it was the wrong way to go about it when someone
> > > > > > actually starts implementing memory sharing.
> > > > > >
> > > > > > Thanks,
> > > > > > Alex
> > > > > >
> > > > > > >
> > > > > > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > > > >
> > > > > > > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > > > > > ---
> > > > > > >  include/kvm/kvm.h  |  1 +
> > > > > > >  include/kvm/util.h |  3 +++
> > > > > > >  kvm.c              |  4 ++++
> > > > > > >  util/util.c        | 33 ++++++++++++++++++++-------------
> > > > > > >  4 files changed, 28 insertions(+), 13 deletions(-)
> > > > > > >
> > > > > > > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > > > > > > index 3872dc6..d0d519b 100644
> > > > > > > --- a/include/kvm/kvm.h
> > > > > > > +++ b/include/kvm/kvm.h
> > > > > > > @@ -87,6 +87,7 @@ struct kvm {
> > > > > > >       struct kvm_config       cfg;
> > > > > > >       int                     sys_fd;         /* For system ioctls(), i.e. /dev/kvm */
> > > > > > >       int                     vm_fd;          /* For VM ioctls() */
> > > > > > > +     int                     ram_fd;         /* For guest memory. */
> > > > > > >       timer_t                 timerid;        /* Posix timer for interrupts */
> > > > > > >
> > > > > > >       int                     nrcpus;         /* Number of cpus to run */
> > > > > > > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > > > > > > index 61a205b..369603b 100644
> > > > > > > --- a/include/kvm/util.h
> > > > > > > +++ b/include/kvm/util.h
> > > > > > > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> > > > > > >  }
> > > > > > >
> > > > > > >  struct kvm;
> > > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > > > > > > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > > > > > > +                                u64 size, u64 align);
> > > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> > > > > > >
> > > > > > >  #endif /* KVM__UTIL_H */
> > > > > > > diff --git a/kvm.c b/kvm.c
> > > > > > > index 78bc0d8..ed29d68 100644
> > > > > > > --- a/kvm.c
> > > > > > > +++ b/kvm.c
> > > > > > > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> > > > > > >       mutex_init(&kvm->mem_banks_lock);
> > > > > > >       kvm->sys_fd = -1;
> > > > > > >       kvm->vm_fd = -1;
> > > > > > > +     kvm->ram_fd = -1;
> > > > > > >
> > > > > > >  #ifdef KVM_BRLOCK_DEBUG
> > > > > > >       kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > > > > > > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> > > > > > >
> > > > > > >       kvm__arch_delete_ram(kvm);
> > > > > > >
> > > > > > > +     if (kvm->ram_fd >= 0)
> > > > > > > +             close(kvm->ram_fd);
> > > > > > > +
> > > > > > >       list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> > > > > > >               list_del(&bank->list);
> > > > > > >               free(bank);
> > > > > > > diff --git a/util/util.c b/util/util.c
> > > > > > > index d3483d8..278bcc2 100644
> > > > > > > --- a/util/util.c
> > > > > > > +++ b/util/util.c
> > > > > > > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> > > > > > >       return sfs.f_bsize;
> > > > > > >  }
> > > > > > >
> > > > > > > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> > > > > > >  {
> > > > > > >       const char *name = "kvmtool";
> > > > > > >       unsigned int flags = 0;
> > > > > > >       int fd;
> > > > > > > -     void *addr;
> > > > > > > -     int htsize = __builtin_ctzl(blk_size);
> > > > > > >
> > > > > > > -     if ((1ULL << htsize) != blk_size)
> > > > > > > -             die("Hugepage size must be a power of 2.\n");
> > > > > > > +     if (hugetlb) {
> > > > > > > +             int htsize = __builtin_ctzl(blk_size);
> > > > > > >
> > > > > > > -     flags |= MFD_HUGETLB;
> > > > > > > -     flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > > +             if ((1ULL << htsize) != blk_size)
> > > > > > > +                     die("Hugepage size must be a power of 2.\n");
> > > > > > > +
> > > > > > > +             flags |= MFD_HUGETLB;
> > > > > > > +             flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > > +     }
> > > > > > >
> > > > > > >       fd = memfd_create(name, flags);
> > > > > > >       if (fd < 0)
> > > > > > > -             die("Can't memfd_create for hugetlbfs map\n");
> > > > > > > +             die("Can't memfd_create for memory map\n");
> > > > > > > +
> > > > > > >       if (ftruncate(fd, size) < 0)
> > > > > > >               die("Can't ftruncate for mem mapping size %lld\n",
> > > > > > >                       (unsigned long long)size);
> > > > > > > -     addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > > > > > > -     close(fd);
> > > > > > >
> > > > > > > -     return addr;
> > > > > > > +     return fd;
> > > > > > >  }
> > > > > > >
> > > > > > >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> > > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > > >  {
> > > > > > >       u64 blk_size = 0;
> > > > > > > +     int fd;
> > > > > > >
> > > > > > >       /*
> > > > > > >        * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > > > > > > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > > >               }
> > > > > > >
> > > > > > >               kvm->ram_pagesize = blk_size;
> > > > > > > -             return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> > > > > > >       } else {
> > > > > > >               kvm->ram_pagesize = getpagesize();
> > > > > > > -             return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> > > > > > >       }
> > > > > > > +
> > > > > > > +     fd = memfd_alloc(size, htlbfs_path, blk_size);
> > > > > > > +     if (fd < 0)
> > > > > > > +             return MAP_FAILED;
> > > > > > > +
> > > > > > > +     kvm->ram_fd = fd;
> > > > > > > +     return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> > > > > > >  }
> > > > > > > --
> > > > > > > 2.38.1.431.g37b22c650d-goog
> > > > > > >

Fuad Tabba Nov. 30, 2022, 5:54 p.m. UTC | #10

Hi,

On Tue, Nov 29, 2022 at 6:10 PM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> Hi,
>
> On Mon, Nov 28, 2022 at 08:49:29AM +0000, Fuad Tabba wrote:
> > Hi,
> >
> > First I want to mention that I really appreciate your feedback, which
> > has already been quite helpful. I would like you to please consider
> > this to be an RFC, and let's use these patches as a basis for
> > discussion and how they can be improved when I respin them, even if
> > that means waiting until the kvm fd-based proposal is finalized.
>
> For that it's probably best if you add RFC to the subject prefix. That's
> very helpful to let the reviewers know what to focus on, more on the
> approach than on the finer details.

I will respin this as an RFC, and I will include the patches that I
have that support the restricted memory proposal [*] for pKVM as it
stands now. I hope that would help see where I was thinking this would
be heading.

[*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/

<snip>

> > > > With simpler I didn't mean fewer lines of code, rather that it's
> > > > easier to reason about, more shared code. With this series, hugetlb
> > >
> > > How is all of the code that has been added easier to reason about than one
> > > single mmap call?
>
> Would be nice if this would be answered.

As I said in a reply to a different comment, for me personally, as a first time
kvmtool contributor, it was easier for me to reason about the memory
when the canonical reference to the memory is a file descriptor that
would not change, rather than a userspace memory address which could
change as it is aligned and trimmed.

I use the word simpler subjectively, that is, in my opinion.

<snip>

> >
> > Because we are sure that it will be fd-based, and because I thought
> > that getting a head start to set the scene would be helpful. The part
> > that is uncertain is the kvm capabilities, flags, and names of the new
> > memory region extensions, none of which I address in these patches.
>
> I see, that makes sense. My feedback so far is that you haven't provided a
> good reason why this change to anonymous memory makes sense right now.

I appreciate your feedback, and I hope we can continue this discussion
when I respin this as an RFC.



Cheers,
/fuad

> Thanks,
> Alex
>
> >
> > Cheers,
> > /fuad
> >
> > > Thanks,
> > > Alex
> > >
> > > >
> > > > Cheers,
> > > > /fuad
> > > >
> > > >
> > > >
> > > > > Thanks,
> > > > > Alex
> > > > >
> > > > > > could have a look on how I would go ahead building on these patches
> > > > > > for full support of private memory backed by an fd.
> > > > > >
> > > > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > > > all the changes at once? This change might look sensible right now, but it
> > > > > > > might turn out that it was the wrong way to go about it when someone
> > > > > > > actually starts implementing memory sharing.
> > > > > >
> > > > > > I don’t plan on supporting IPC memory sharing. I just mentioned that
> > > > > > as yet another use case that would benefit from guest memory being
> > > > > > fd-based, should kvmtool decide to support it in the future.
> > > > > >
> > > > > > Cheers,
> > > > > > /fuad
> > > > > >
> > > > > > [1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > > > > [2] https://github.com/qemu/qemu
> > > > > > [3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
> > > > > > [4] https://github.com/chao-p/qemu/tree/privmem-v9
> > > > > > [5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > > > all the changes at once? This change might look sensible right now, but it
> > > > > > > might turn out that it was the wrong way to go about it when someone
> > > > > > > actually starts implementing memory sharing.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Alex
> > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > > > > >
> > > > > > > > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com/
> > > > > > > > ---
> > > > > > > >  include/kvm/kvm.h  |  1 +
> > > > > > > >  include/kvm/util.h |  3 +++
> > > > > > > >  kvm.c              |  4 ++++
> > > > > > > >  util/util.c        | 33 ++++++++++++++++++++-------------
> > > > > > > >  4 files changed, 28 insertions(+), 13 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > > > > > > > index 3872dc6..d0d519b 100644
> > > > > > > > --- a/include/kvm/kvm.h
> > > > > > > > +++ b/include/kvm/kvm.h
> > > > > > > > @@ -87,6 +87,7 @@ struct kvm {
> > > > > > > >       struct kvm_config       cfg;
> > > > > > > >       int                     sys_fd;         /* For system ioctls(), i.e. /dev/kvm */
> > > > > > > >       int                     vm_fd;          /* For VM ioctls() */
> > > > > > > > +     int                     ram_fd;         /* For guest memory. */
> > > > > > > >       timer_t                 timerid;        /* Posix timer for interrupts */
> > > > > > > >
> > > > > > > >       int                     nrcpus;         /* Number of cpus to run */
> > > > > > > > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > > > > > > > index 61a205b..369603b 100644
> > > > > > > > --- a/include/kvm/util.h
> > > > > > > > +++ b/include/kvm/util.h
> > > > > > > > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  struct kvm;
> > > > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > > > > > > > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > > > > > > > +                                u64 size, u64 align);
> > > > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> > > > > > > >
> > > > > > > >  #endif /* KVM__UTIL_H */
> > > > > > > > diff --git a/kvm.c b/kvm.c
> > > > > > > > index 78bc0d8..ed29d68 100644
> > > > > > > > --- a/kvm.c
> > > > > > > > +++ b/kvm.c
> > > > > > > > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> > > > > > > >       mutex_init(&kvm->mem_banks_lock);
> > > > > > > >       kvm->sys_fd = -1;
> > > > > > > >       kvm->vm_fd = -1;
> > > > > > > > +     kvm->ram_fd = -1;
> > > > > > > >
> > > > > > > >  #ifdef KVM_BRLOCK_DEBUG
> > > > > > > >       kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > > > > > > > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> > > > > > > >
> > > > > > > >       kvm__arch_delete_ram(kvm);
> > > > > > > >
> > > > > > > > +     if (kvm->ram_fd >= 0)
> > > > > > > > +             close(kvm->ram_fd);
> > > > > > > > +
> > > > > > > >       list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> > > > > > > >               list_del(&bank->list);
> > > > > > > >               free(bank);
> > > > > > > > diff --git a/util/util.c b/util/util.c
> > > > > > > > index d3483d8..278bcc2 100644
> > > > > > > > --- a/util/util.c
> > > > > > > > +++ b/util/util.c
> > > > > > > > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> > > > > > > >       return sfs.f_bsize;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > > > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> > > > > > > >  {
> > > > > > > >       const char *name = "kvmtool";
> > > > > > > >       unsigned int flags = 0;
> > > > > > > >       int fd;
> > > > > > > > -     void *addr;
> > > > > > > > -     int htsize = __builtin_ctzl(blk_size);
> > > > > > > >
> > > > > > > > -     if ((1ULL << htsize) != blk_size)
> > > > > > > > -             die("Hugepage size must be a power of 2.\n");
> > > > > > > > +     if (hugetlb) {
> > > > > > > > +             int htsize = __builtin_ctzl(blk_size);
> > > > > > > >
> > > > > > > > -     flags |= MFD_HUGETLB;
> > > > > > > > -     flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > > > +             if ((1ULL << htsize) != blk_size)
> > > > > > > > +                     die("Hugepage size must be a power of 2.\n");
> > > > > > > > +
> > > > > > > > +             flags |= MFD_HUGETLB;
> > > > > > > > +             flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > > > +     }
> > > > > > > >
> > > > > > > >       fd = memfd_create(name, flags);
> > > > > > > >       if (fd < 0)
> > > > > > > > -             die("Can't memfd_create for hugetlbfs map\n");
> > > > > > > > +             die("Can't memfd_create for memory map\n");
> > > > > > > > +
> > > > > > > >       if (ftruncate(fd, size) < 0)
> > > > > > > >               die("Can't ftruncate for mem mapping size %lld\n",
> > > > > > > >                       (unsigned long long)size);
> > > > > > > > -     addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > > > > > > > -     close(fd);
> > > > > > > >
> > > > > > > > -     return addr;
> > > > > > > > +     return fd;
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> > > > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > > > >  {
> > > > > > > >       u64 blk_size = 0;
> > > > > > > > +     int fd;
> > > > > > > >
> > > > > > > >       /*
> > > > > > > >        * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > > > > > > > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > > > >               }
> > > > > > > >
> > > > > > > >               kvm->ram_pagesize = blk_size;
> > > > > > > > -             return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> > > > > > > >       } else {
> > > > > > > >               kvm->ram_pagesize = getpagesize();
> > > > > > > > -             return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> > > > > > > >       }
> > > > > > > > +
> > > > > > > > +     fd = memfd_alloc(size, htlbfs_path, blk_size);
> > > > > > > > +     if (fd < 0)
> > > > > > > > +             return MAP_FAILED;
> > > > > > > > +
> > > > > > > > +     kvm->ram_fd = fd;
> > > > > > > > +     return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> > > > > > > >  }
> > > > > > > > --
> > > > > > > > 2.38.1.431.g37b22c650d-goog
> > > > > > > >

[kvmtool,v1,08/17] Use memfd for all guest ram allocations

Commit Message

Comments

Patch