Message ID | 20240222161047.402609-4-tabba@google.com (mailing list archive)
---|---
State | New, archived
Series | KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> +{
> +	struct folio *folio;
> +
> +	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> +	if (!folio)
> +		return VM_FAULT_SIGBUS;
> +
> +	/*
> +	 * Check if the page is allowed to be faulted to the host, with the
> +	 * folio lock held to ensure that the check and incrementing the page
> +	 * count are protected by the same folio lock.
> +	 */
> +	if (!kvm_gmem_isfaultable(vmf)) {
> +		folio_unlock(folio);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	vmf->page = folio_file_page(folio, vmf->pgoff);

We won't currently get hugetlb (or even THP) here. It mimics what shmem
would do.

finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
getting the rmap involved.

Do we have some tests in place that make sure that
fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
the page again (IOW, that the rmap does indeed work)?
Hi David,

On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> > +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > +{
> > +	struct folio *folio;
> > +
> > +	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> > +	if (!folio)
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	/*
> > +	 * Check if the page is allowed to be faulted to the host, with the
> > +	 * folio lock held to ensure that the check and incrementing the page
> > +	 * count are protected by the same folio lock.
> > +	 */
> > +	if (!kvm_gmem_isfaultable(vmf)) {
> > +		folio_unlock(folio);
> > +		return VM_FAULT_SIGBUS;
> > +	}
> > +
> > +	vmf->page = folio_file_page(folio, vmf->pgoff);
>
> We won't currently get hugetlb (or even THP) here. It mimics what shmem
> would do.

At the moment there isn't hugetlb support in guest_memfd(), nor in
pKVM, although we do plan on supporting it.

> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
> getting the rmap involved.
>
> Do we have some tests in place that make sure that
> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
> the page again (IOW, that the rmap does indeed work)?

I'm not sure if you mean kernel tests, or if I've tested it. There are
guest_memfd() tests for
fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE), which I have run.
I've also tested it manually with sample programs, and it behaves as
expected.

Otherwise, for Gunyah, Elliot has used folio_mmapped() [1], but Matthew
doesn't think that it would do what we'd like it to do, i.e., ensure
that _no one_ can fault in the page [2].

I would appreciate any ideas, comments, or suggestions regarding this.

Thanks!
/fuad

[1] https://lore.kernel.org/all/20240222141602976-0800.eberman@hu-eberman-lv.qualcomm.com/
[2] https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/

> --
> Cheers,
>
> David / dhildenb
>
On 26.02.24 09:58, Fuad Tabba wrote:
> Hi David,
>
> On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>> +{
>>> +	struct folio *folio;
>>> +
>>> +	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
>>> +	if (!folio)
>>> +		return VM_FAULT_SIGBUS;
>>> +
>>> +	/*
>>> +	 * Check if the page is allowed to be faulted to the host, with the
>>> +	 * folio lock held to ensure that the check and incrementing the page
>>> +	 * count are protected by the same folio lock.
>>> +	 */
>>> +	if (!kvm_gmem_isfaultable(vmf)) {
>>> +		folio_unlock(folio);
>>> +		return VM_FAULT_SIGBUS;
>>> +	}
>>> +
>>> +	vmf->page = folio_file_page(folio, vmf->pgoff);
>>
>> We won't currently get hugetlb (or even THP) here. It mimics what shmem
>> would do.
>
> At the moment there isn't hugetlb support in guest_memfd(), nor in
> pKVM, although we do plan on supporting it.
>
>> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
>> getting the rmap involved.
>>
>> Do we have some tests in place that make sure that
>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
>> the page again (IOW, that the rmap does indeed work)?
>
> I'm not sure if you mean kernel tests, or if I've tested it. There are
> guest_memfd() tests for
> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE), which I have run.
> I've also tested it manually with sample programs, and it behaves as
> expected.

Can you point me at the existing tests? I'm interested in
mmap()-specific guest_memfd tests.

A test that would make sense to me:

1) Create guest_memfd() and size it to contain at least one page.

2) mmap() it

3) Write some pattern (e.g., all 1's) to the first page using the mmap

4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page

5) Make sure reading from the first page using the mmap reads all 0's

IOW, during 4) we properly unmapped (via rmap) and discarded the page,
such that 5) will populate a fresh page in the page cache filled with
0's and map that one.
Hi David,

Thank you very much for reviewing this.

On Mon, Feb 26, 2024 at 9:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.02.24 09:58, Fuad Tabba wrote:
> > Hi David,
> >
> > On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> >>> +{
> >>> +	struct folio *folio;
> >>> +
> >>> +	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
> >>> +	if (!folio)
> >>> +		return VM_FAULT_SIGBUS;
> >>> +
> >>> +	/*
> >>> +	 * Check if the page is allowed to be faulted to the host, with the
> >>> +	 * folio lock held to ensure that the check and incrementing the page
> >>> +	 * count are protected by the same folio lock.
> >>> +	 */
> >>> +	if (!kvm_gmem_isfaultable(vmf)) {
> >>> +		folio_unlock(folio);
> >>> +		return VM_FAULT_SIGBUS;
> >>> +	}
> >>> +
> >>> +	vmf->page = folio_file_page(folio, vmf->pgoff);
> >>
> >> We won't currently get hugetlb (or even THP) here. It mimics what shmem
> >> would do.
> >
> > At the moment there isn't hugetlb support in guest_memfd(), nor in
> > pKVM, although we do plan on supporting it.
> >
> >> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
> >> getting the rmap involved.
> >>
> >> Do we have some tests in place that make sure that
> >> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
> >> the page again (IOW, that the rmap does indeed work)?
> >
> > I'm not sure if you mean kernel tests, or if I've tested it. There are
> > guest_memfd() tests for
> > fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE), which I have run.
> > I've also tested it manually with sample programs, and it behaves as
> > expected.
>
> Can you point me at the existing tests? I'm interested in
> mmap()-specific guest_memfd tests.
>
> A test that would make sense to me:
>
> 1) Create guest_memfd() and size it to contain at least one page.
>
> 2) mmap() it
>
> 3) Write some pattern (e.g., all 1's) to the first page using the mmap
>
> 4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page
>
> 5) Make sure reading from the first page using the mmap reads all 0's
>
> IOW, during 4) we properly unmapped (via rmap) and discarded the page,
> such that 5) will populate a fresh page in the page cache filled with
> 0's and map that one.

The existing tests don't test mmap (or rather, they test the inability
to mmap). They do test FALLOC_FL_PUNCH_HOLE [1].

The tests for mmap() are ones that I wrote myself. I will write a test
like the one you mentioned, and send it with V2, or earlier if you
prefer.

Thanks again,
/fuad

> --
> Cheers,
>
> David / dhildenb
>
On 26.02.24 18:30, Fuad Tabba wrote:
> Hi David,
>
> Thank you very much for reviewing this.
>
> On Mon, Feb 26, 2024 at 9:58 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.02.24 09:58, Fuad Tabba wrote:
>>> Hi David,
>>>
>>> On Thu, Feb 22, 2024 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>>>> +{
>>>>> +	struct folio *folio;
>>>>> +
>>>>> +	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
>>>>> +	if (!folio)
>>>>> +		return VM_FAULT_SIGBUS;
>>>>> +
>>>>> +	/*
>>>>> +	 * Check if the page is allowed to be faulted to the host, with the
>>>>> +	 * folio lock held to ensure that the check and incrementing the page
>>>>> +	 * count are protected by the same folio lock.
>>>>> +	 */
>>>>> +	if (!kvm_gmem_isfaultable(vmf)) {
>>>>> +		folio_unlock(folio);
>>>>> +		return VM_FAULT_SIGBUS;
>>>>> +	}
>>>>> +
>>>>> +	vmf->page = folio_file_page(folio, vmf->pgoff);
>>>>
>>>> We won't currently get hugetlb (or even THP) here. It mimics what shmem
>>>> would do.
>>>
>>> At the moment there isn't hugetlb support in guest_memfd(), nor in
>>> pKVM, although we do plan on supporting it.
>>>
>>>> finish_fault->set_pte_range() will call folio_add_file_rmap_ptes(),
>>>> getting the rmap involved.
>>>>
>>>> Do we have some tests in place that make sure that
>>>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) will properly unmap
>>>> the page again (IOW, that the rmap does indeed work)?
>>>
>>> I'm not sure if you mean kernel tests, or if I've tested it. There are
>>> guest_memfd() tests for
>>> fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE), which I have run.
>>> I've also tested it manually with sample programs, and it behaves as
>>> expected.
>>
>> Can you point me at the existing tests? I'm interested in
>> mmap()-specific guest_memfd tests.
>>
>> A test that would make sense to me:
>>
>> 1) Create guest_memfd() and size it to contain at least one page.
>>
>> 2) mmap() it
>>
>> 3) Write some pattern (e.g., all 1's) to the first page using the mmap
>>
>> 4) fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) the first page
>>
>> 5) Make sure reading from the first page using the mmap reads all 0's
>>
>> IOW, during 4) we properly unmapped (via rmap) and discarded the page,
>> such that 5) will populate a fresh page in the page cache filled with
>> 0's and map that one.
>
> The existing tests don't test mmap (or rather, they test the inability
> to mmap). They do test FALLOC_FL_PUNCH_HOLE [1].
>
> The tests for mmap() are ones that I wrote myself. I will write a test
> like the one you mentioned, and send it with V2, or earlier if you
> prefer.

Thanks, no need to rush. As long as we have some simple test for that
scenario at some point, all good!
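For concreteness, below is a minimal sketch of the test scenario David
outlines above, assuming the mmap() support from this series is applied
and that the fd's pages are mappable (no NOT_MAPPABLE attribute set).
KVM_CREATE_GUEST_MEMFD and struct kvm_create_guest_memfd are existing
uAPI; the flow and error handling here are illustrative, not the actual
selftest:

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int main(void)
{
	size_t size = sysconf(_SC_PAGESIZE);
	struct kvm_create_guest_memfd gmem = { .size = size };
	int kvm_fd, vm_fd, gmem_fd, ret;
	char *mem;

	kvm_fd = open("/dev/kvm", O_RDWR);
	assert(kvm_fd >= 0);
	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
	assert(vm_fd >= 0);

	/* 1) Create a guest_memfd spanning one page. */
	gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	assert(gmem_fd >= 0);

	/* 2) mmap() it; MAP_SHARED is required by this series. */
	mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   gmem_fd, 0);
	assert(mem != MAP_FAILED);

	/* 3) Write a pattern (all 1's) through the mapping. */
	memset(mem, 0xff, size);

	/* 4) Punch a hole over the page, keeping the file size. */
	ret = fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			0, size);
	assert(ret == 0);

	/* 5) The old page should have been unmapped via the rmap and
	 *    discarded, so this read faults in a fresh zeroed page. */
	for (size_t i = 0; i < size; i++)
		assert(mem[i] == 0);

	return 0;
}

Step 5 only reads 0's if the punch-hole path really unmapped the old
page through the rmap; if a stale PTE survived, the loop would still
see the 0xff pattern, which is exactly the failure mode the test is
meant to catch.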
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b96abeeb2b65..fad296baa84e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2402,4 +2402,17 @@ static inline int kvm_gmem_get_pfn_locked(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
+{
+	return !(kvm_get_memory_attributes(kvm, gfn) &
+		 KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE);
+}
+#else
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn)
+{
+	return false;
+}
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0862d6cc3e66..b8db8fb88bbe 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2227,6 +2227,7 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT	(16)
 #define KVM_MEMORY_ATTRIBUTES_KERNEL_MASK	GENMASK(63, KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT)
+#define KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE	(1ULL << KVM_MEMORY_ATTRIBUTES_KERNEL_SHIFT)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 184dab4ee871..457019de9e6d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -108,3 +108,7 @@ config KVM_GENERIC_PRIVATE_MEM
 	select KVM_GENERIC_MEMORY_ATTRIBUTES
 	select KVM_PRIVATE_MEM
 	bool
+
+config KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+	bool
+	depends on KVM_GENERIC_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7e3ea7a3f086..ca3a5d8b1fa7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,7 +248,75 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE
+static bool kvm_gmem_isfaultable(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct kvm_gmem *gmem = vma->vm_file->private_data;
+	pgoff_t pgoff = vmf->pgoff;
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+
+	xa_for_each_range(&gmem->bindings, index, slot, pgoff, pgoff) {
+		pgoff_t base_gfn = slot->base_gfn;
+		pgoff_t gfn_pgoff = slot->gmem.pgoff;
+		pgoff_t gfn = base_gfn + max(gfn_pgoff, pgoff) - gfn_pgoff;
+
+		if (!kvm_gmem_is_mappable(kvm, gfn))
+			return false;
+	}
+
+	return true;
+}
+
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+	struct folio *folio;
+
+	folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff);
+	if (!folio)
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * Check if the page is allowed to be faulted to the host, with the
+	 * folio lock held to ensure that the check and incrementing the page
+	 * count are protected by the same folio lock.
+	 */
+	if (!kvm_gmem_isfaultable(vmf)) {
+		folio_unlock(folio);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+	.fault = kvm_gmem_fault,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/* No support for private mappings to avoid COW. */
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+	    (VM_SHARED | VM_MAYSHARE)) {
+		return -EINVAL;
+	}
+
+	file_accessed(file);
+	vm_flags_set(vma, VM_DONTDUMP);
+	vma->vm_ops = &kvm_gmem_vm_ops;
+
+	return 0;
+}
+#else
+#define kvm_gmem_mmap NULL
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM_MAPPABLE */
+
 static struct file_operations kvm_gmem_fops = {
+	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
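As a userspace-facing summary of the kvm_gmem_mmap() check above, here
is a hedged sketch (the helper is illustrative and not part of the
patch; the EINVAL and SIGBUS outcomes follow from the mmap and fault
handlers in this patch):

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

/* Illustrative helper: gmem_fd is assumed to be a guest_memfd created
 * via KVM_CREATE_GUEST_MEMFD with this series applied. */
static void check_gmem_mmap_contract(int gmem_fd, size_t size)
{
	void *p;

	/* MAP_PRIVATE leaves VM_SHARED/VM_MAYSHARE unset, so
	 * kvm_gmem_mmap() rejects it to avoid COW semantics. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
		 gmem_fd, 0);
	if (p == MAP_FAILED && errno == EINVAL)
		printf("MAP_PRIVATE rejected, as expected\n");

	/* MAP_SHARED succeeds; however, faulting a page whose gfn is
	 * marked KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE makes
	 * kvm_gmem_fault() raise SIGBUS instead of mapping it. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		 gmem_fd, 0);
	(void)p;
}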
Allow guestmem-backed memory to be mapped by the host if the
configuration option is enabled and the (newly added) KVM memory
attribute KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE isn't set.

This new attribute is a kernel attribute, which cannot be modified by
userspace. It will be used in future patches so that KVM can decide
whether to allow the host to map guest memory based on certain
criteria, e.g., that the memory is shared with the host.

This attribute has negative polarity (i.e., as opposed to being an
ALLOW attribute) to simplify the code, since it will usually apply to
memory marked as KVM_MEMORY_ATTRIBUTE_PRIVATE (i.e., memory that
already has an entry in the xarray). Its absence implies that there
are no restrictions on the host mapping that memory.

An invariant that future patches will maintain is that memory that is
private to the guest, regardless of whether it's marked with the
PRIVATE attribute, must always be NOT_MAPPABLE.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 13 ++++++++
 include/uapi/linux/kvm.h |  1 +
 virt/kvm/Kconfig         |  4 +++
 virt/kvm/guest_memfd.c   | 68 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 86 insertions(+)
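To illustrate the commit message's point that NOT_MAPPABLE is a kernel
attribute, a hedged sketch of a userspace attempt to set it. struct
kvm_memory_attributes and KVM_SET_MEMORY_ATTRIBUTES are existing uAPI;
the -EINVAL outcome is an assumption based on KVM rejecting attribute
bits it does not advertise to userspace:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative, not part of the patch: since NOT_MAPPABLE lives in the
 * kernel-reserved attribute range, setting it from userspace should be
 * rejected (the expected -EINVAL is an assumption, grounded in the
 * commit message's statement that userspace cannot modify it). */
static int try_set_not_mappable(int vm_fd, __u64 gpa, __u64 size)
{
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_NOT_MAPPABLE,
	};

	/* Expected: returns -1 with errno == EINVAL. */
	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}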