
[RFC,v1,03/26] KVM: Add restricted support for mapping guestmem by the host

Message ID 20240222161047.402609-4-tabba@google.com (mailing list archive)
State New, archived
Series KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support

Commit Message

Fuad Tabba Feb. 22, 2024, 4:10 p.m. UTC
Allow guestmem-backed memory to be mapped by the host if the new
configuration option (KVM_GENERIC_PRIVATE_MEM_MAPPABLE) is enabled
and the (newly added) KVM memory attribute
KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE isn't set.

This new attribute is a kernel attribute, which cannot be
modified by userspace. This will be used in future patches so
that KVM can decide whether to allow the host to map guest memory
based on certain criteria, e.g., that the memory is shared with
the host.

This attribute has negative polarity (i.e., it is a DENY attribute
rather than an ALLOW attribute) to simplify the code, since it will
usually apply to memory marked as KVM_MEMORY_ATTRIBUTE_PRIVATE
(i.e., memory that already has an entry in the xarray). Its absence
implies that there are no restrictions on the host mapping that
memory (a short sketch of this default follows the diffstat below).

An invariant that future patches will maintain is that memory that
is private to the guest, regardless of whether it's marked with the
PRIVATE attribute, must always be NOT_MAPPABLE.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 13 ++++++++
 include/uapi/linux/kvm.h |  1 +
 virt/kvm/Kconfig         |  4 +++
 virt/kvm/guest_memfd.c   | 68 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 86 insertions(+)
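
In code, the polarity above works out as follows: a gfn with no entry
in the attributes xarray has attributes 0, so it is host-mappable by
default, and only a gfn carrying the kernel-only bit is excluded. A
minimal sketch (kvm_get_memory_attributes() and the attribute bit are
from this series; the caller example_host_may_map() is illustrative
only, not part of the patch):

#include <linux/kvm_host.h>

/*
 * Illustrative only: with no entry in the memory-attributes xarray,
 * kvm_get_memory_attributes() returns 0, the NOT_MAPPABLE test is
 * false, and the gfn is considered host-mappable by default.
 */
static bool example_host_may_map(struct kvm *kvm, gfn_t gfn)
{
	unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);

	/* Kernel-only bit: userspace can neither set nor clear it. */
	return !(attrs & KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE);
}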

Comments

David Hildenbrand Feb. 22, 2024, 4:28 p.m. UTC | #1
> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> +{
> +	struct folio *folio;
> +
> +	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> +	if (!folio)
> +		return VM_FAULT_SIGBUS;
> +
> +	/*
> +	 * Check if the page is allowed to be faulted to the host, with the
> +	 * folio lock held to ensure that the check and incrementing the page
> +	 * count are protected by the same folio lock.
> +	 */
> +	if (!kvm_gmem_isfaultable(vmf)) {
> +		folio_unlock(folio);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	vmf->page = folio_file_page(folio, vmf->pgoff);

We won't currently get hugetlb (or even THP) here. It mimics what shmem 
would do.

finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(), 
getting the rmap involved.

Do we have some tests in place that make sure that 
fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap 
the page again (IOW, that the rmap does indeed work)?
Fuad Tabba Feb. 26, 2024, 8:58 a.m. UTC | #2
Hi David,

On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> > +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > +{
> > +     struct folio *folio;
> > +
> > +     folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> > +     if (!folio)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     /*
> > +      * Check if the page is allowed to be faulted to the host, with the
> > +      * folio lock held to ensure that the check and incrementing the page
> > +      * count are protected by the same folio lock.
> > +      */
> > +     if (!kvm_gmem_isfaultable(vmf)) {
> > +             folio_unlock(folio);
> > +             return VM_FAULT_SIGBUS;
> > +     }
> > +
> > +     vmf->page = folio_file_page(folio, vmf->pgoff);
>
> We won't currently get hugetlb (or even THP) here. It mimics what shmem
> would do.

At the moment there isn't hugetlb support in guest_memfd(), nor in
pKVM, although we do plan on supporting it.

> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
> getting the rmap involved.
>
> Do we have some tests in place that make sure that
> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
> the page again (IOW, that the rmap does indeed work)?

I'm not sure if you mean kernel tests, or if I've tested it. There are
guest_memfd() tests for
fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE), which I have
run. I've also tested it manually with sample programs, and it behaves
as expected.

Otherwise, for Gunyah, Elliot has used folio_mmapped() [1], but Matthew
doesn't think that it would do what we'd like it to do, i.e., ensure
that _no one_ can fault in the page [2].

I would appreciate any ideas, comments, or suggestions regarding this.

Thanks!
/fuad

[1] https://lore.kernel.org/all/20240222141602976-0800.eberman@hu-eberman-lv.qualcomm.com/

[2] https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/

> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Feb. 26, 2024, 9:57 a.m. UTC | #3
On 26.02.24 09:58, Fuad Tabba wrote:
> Hi David,
> 
> On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>> +{
>>> +     struct folio *folio;
>>> +
>>> +     folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
>>> +     if (!folio)
>>> +             return VM_FAULT_SIGBUS;
>>> +
>>> +     /*
>>> +      * Check if the page is allowed to be faulted to the host, with the
>>> +      * folio lock held to ensure that the check and incrementing the page
>>> +      * count are protected by the same folio lock.
>>> +      */
>>> +     if (!kvm_gmem_isfaultable(vmf)) {
>>> +             folio_unlock(folio);
>>> +             return VM_FAULT_SIGBUS;
>>> +     }
>>> +
>>> +     vmf->page = folio_file_page(folio, vmf->pgoff);
>>
>> We won't currently get hugetlb (or even THP) here. It mimics what shmem
>> would do.
> 
> At the moment there isn't hugetlb support in guest_memfd(), nor in
> pKVM, although we do plan on supporting it.
> 
>> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
>> getting the rmap involved.
>>
>> Do we have some tests in place that make sure that
>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
>> the page again (IOW, that the rmap does indeed work)?
> 
> I'm not sure if you mean kernel tests, or if I've tested it. There are
> guest_memfd() tests for
> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE), which I have
> run. I've also tested it manually with sample programs, and it behaves
> as expected.

Can you point me at the existing tests? I'm interested in 
mmap()-specific guest_memfd tests.

A test that would make sense to me:

1) Create guest_memfd() and size it to contain at least one page.

2) mmap() it

3) Write some pattern (e.g., all 1's) to the first page using the mmap

4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page

5) Make sure reading from the first page using the mmap reads all 0's

IOW, during 4) we properly unmapped (via rmap) and discarded the page, 
such that 5) will populate a fresh page in the page cache filled with 
0's and map that one.
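
A standalone sketch of that test, to make the steps concrete. It
assumes the mmap() support added by this series, the
KVM_CREATE_GUEST_MEMFD ioctl and struct kvm_create_guest_memfd from
this series' <linux/kvm.h>, and a KVM_CREATE_VM type that permits
guest_memfd (all assumptions; this is not one of the existing
selftests):

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int kvm = open("/dev/kvm", O_RDWR);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);	/* type 0: may vary per arch */
	struct kvm_create_guest_memfd gmem = { .size = page };
	int fd = ioctl(vm, KVM_CREATE_GUEST_MEMFD, &gmem);	/* 1) */
	char *p;
	long i;

	assert(kvm >= 0 && vm >= 0 && fd >= 0);

	p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); /* 2) */
	assert(p != MAP_FAILED);

	memset(p, 1, page);	/* 3) write all 1's via the mapping */

	/* 4) punch the page: must unmap it (via rmap) and discard it */
	assert(fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 0, page) == 0);

	for (i = 0; i < page; i++)	/* 5) re-fault: fresh zeroed page */
		assert(p[i] == 0);

	return 0;
}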
Fuad Tabba Feb. 26, 2024, 5:30 p.m. UTC | #4
Hi David,

Thank you very much for reviewing this.

On Mon, Feb 26, 2024 at 9:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.02.24 09:58, Fuad Tabba wrote:
> > Hi David,
> >
> > On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> >>> +{
> >>> +     struct folio *folio;
> >>> +
> >>> +     folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> >>> +     if (!folio)
> >>> +             return VM_FAULT_SIGBUS;
> >>> +
> >>> +     /*
> >>> +      * Check if the page is allowed to be faulted to the host, with the
> >>> +      * folio lock held to ensure that the check and incrementing the page
> >>> +      * count are protected by the same folio lock.
> >>> +      */
> >>> +     if (!kvm_gmem_isfaultable(vmf)) {
> >>> +             folio_unlock(folio);
> >>> +             return VM_FAULT_SIGBUS;
> >>> +     }
> >>> +
> >>> +     vmf->page = folio_file_page(folio, vmf->pgoff);
> >>
> >> We won't currently get hugetlb (or even THP) here. It mimics what shmem
> >> would do.
> >
> > At the moment there isn't hugetlb support in guest_memfd(), nor in
> > pKVM, although we do plan on supporting it.
> >
> >> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
> >> getting the rmap involved.
> >>
> >> Do we have some tests in place that make sure that
> >> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
> >> the page again (IOW, that the rmap does indeed work)?
> >
> > I'm not sure if you mean kernel tests, or if I've tested it. There are
> > guest_memfd() tests for
> > fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE), which I have
> > run. I've also tested it manually with sample programs, and it behaves
> > as expected.
>
> Can you point me at the existing tests? I'm interested in
> mmap()-specific guest_memfd tests.
>
> A test that would make sense to me:
>
> 1) Create guest_memfd() and size it to contain at least one page.
>
> 2) mmap() it
>
> 3) Write some pattern (e.g., all 1's) to the first page using the mmap
>
> 4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page
>
> 5) Make sure reading from the first page using the mmap reads all 0's
>
> IOW, during 4) we properly unmapped (via rmap) and discarded the page,
> such that 5) will populate a fresh page in the page cache filled with
> 0's and map that one.

The existing tests don't test mmap (or rather, they test the inability
to mmap). They do test FALLOC_FL_PUNCH_HOLE. [1]

The tests for mmap() are ones that I wrote myself. I will write a test
like the one you mentioned, and send it with V2, or earlier if you
prefer.

Thanks again,
/fuad


> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Feb. 27, 2024, 7:40 a.m. UTC | #5
On 26.02.24 18:30, Fuad Tabba wrote:
> Hi David,
> 
> Thank you very much for reviewing this.
> 
> On Mon, Feb 26, 2024 at 9:58 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.02.24 09:58, Fuad Tabba wrote:
>>> Hi David,
>>>
>>> On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>>>> +{
>>>>> +     struct folio *folio;
>>>>> +
>>>>> +     folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
>>>>> +     if (!folio)
>>>>> +             return VM_FAULT_SIGBUS;
>>>>> +
>>>>> +     /*
>>>>> +      * Check if the page is allowed to be faulted to the host, with the
>>>>> +      * folio lock held to ensure that the check and incrementing the page
>>>>> +      * count are protected by the same folio lock.
>>>>> +      */
>>>>> +     if (!kvm_gmem_isfaultable(vmf)) {
>>>>> +             folio_unlock(folio);
>>>>> +             return VM_FAULT_SIGBUS;
>>>>> +     }
>>>>> +
>>>>> +     vmf->page = folio_file_page(folio, vmf->pgoff);
>>>>
>>>> We won't currently get hugetlb (or even THP) here. It mimics what shmem
>>>> would do.
>>>
>>> At the moment there isn't hugetlb support in guest_memfd(), nor in
>>> pKVM, although we do plan on supporting it.
>>>
>>>> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
>>>> getting the rmap involved.
>>>>
>>>> Do we have some tests in place that make sure that
>>>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
>>>> the page again (IOW, that the rmap does indeed work)?
>>>
>>> I'm not sure if you mean kernel tests, or if I've tested it. There are
>>> guest_memfd() tests for
>>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE), which I have
>>> run. I've also tested it manually with sample programs, and it behaves
>>> as expected.
>>
>> Can you point me at the existing tests? I'm interested in
>> mmap()-specific guest_memfd tests.
>>
>> A test that would make sense to me:
>>
>> 1) Create guest_memfd() and size it to contain at least one page.
>>
>> 2) mmap() it
>>
>> 3) Write some pattern (e.g., all 1's) to the first page using the mmap
>>
>> 4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page
>>
>> 5) Make sure reading from the first page using the mmap reads all 0's
>>
>> IOW, during 4) we properly unmapped (via rmap) and discarded the page,
>> such that 5) will populate a fresh page in the page cache filled with
>> 0's and map that one.
> 
> The existing tests don't test mmap (or rather, they test the inability
> to mmap). They do test FALLOC_FL_PUNCH_HOLE. [1]
> 
> The tests for mmap() are ones that I wrote myself. I will write a test
> like the one you mentioned, and send it with V2, or earlier if you
> prefer.

Thanks, no need to rush. As long as we have some simple test for that 
scenario at some point, all good!

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b96abeeb2b65..fad296baa84e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2402,4 +2402,17 @@  static inline int kvm_gmem_get_pfn_locked(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
+{
+	return !(kvm_get_memory_attributes(kvm, gfn) &
+		 KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE);
+}
+#else
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
+{
+	return false;
+}
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0862d6cc3e66..b8db8fb88bbe 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2227,6 +2227,7 @@  struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT     (16)
 #define KVM_MEMORY_ATTRIBUTES_KERNEL_MASK      GENMASK(63, KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT)
+#define KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE      (1ULL << KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 184dab4ee871..457019de9e6d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -108,3 +108,7 @@  config KVM_GENERIC_PRIVATE_MEM
        select KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_PRIVATE_MEM
        bool
+
+config KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+       bool
+       depends on KVM_GENERIC_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7e3ea7a3f086..ca3a5d8b1fa7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,7 +248,75 @@  static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+static bool kvm_gmem_isfaultable(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct kvm_gmem *gmem = vma->vm_file->private_data;
+	pgoff_t pgoff = vmf->pgoff;
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+
+	xa_for_each_range(&gmem->bindings, index, slot, pgoff, pgoff) {
+		pgoff_t base_gfn = slot->base_gfn;
+		pgoff_t gfn_pgoff = slot->gmem.pgoff;
+		pgoff_t gfn = base_gfn + max(gfn_pgoff, pgoff) - gfn_pgoff;
+
+		if (!kvm_gmem_is_mappable(kvm, gfn))
+			return false;
+	}
+
+	return true;
+}
+
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+	struct folio *folio;
+
+	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
+	if (!folio)
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * Check if the page is allowed to be faulted to the host, with the
+	 * folio lock held to ensure that the check and incrementing the page
+	 * count are protected by the same folio lock.
+	 */
+	if (!kvm_gmem_isfaultable(vmf)) {
+		folio_unlock(folio);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+	.fault = kvm_gmem_fault,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/* No support for private mappings to avoid COW.  */
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+	    (VM_SHARED | VM_MAYSHARE)) {
+		return -EINVAL;
+	}
+
+	file_accessed(file);
+	vm_flags_set(vma, VM_DONTDUMP);
+	vma->vm_ops = &kvm_gmem_vm_ops;
+
+	return 0;
+}
+#else
+#define kvm_gmem_mmap NULL
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
+
 static struct file_operations kvm_gmem_fops = {
+	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
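
As a usage note on the mmap handler above: a MAP_PRIVATE attempt fails
at mmap() time, while a page that is NOT_MAPPABLE only fails later, at
fault time. A sketch of the userspace-visible behaviour under this
series (fd is assumed to be a guest_memfd; this is an expectation
drawn from the handlers above, not an existing test):

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Sketch: what kvm_gmem_mmap()/kvm_gmem_fault() imply for userspace. */
static void *map_gmem(int fd, size_t len)
{
	/* Rejected in kvm_gmem_mmap(): the vma lacks VM_SHARED|VM_MAYSHARE,
	 * so mmap() fails with EINVAL and guest pages can never be COWed. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	/* here: p == MAP_FAILED && errno == EINVAL */
	(void)p;

	/* Accepted; but touching a NOT_MAPPABLE page later makes
	 * kvm_gmem_fault() return VM_FAULT_SIGBUS, i.e. the process gets
	 * SIGBUS at access time, not at mmap() time. */
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}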