Message ID | 20240809160909.1023470-11-peterx@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm: Support huge pfnmaps |
On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of
> the work is already done on the other side (host_pfn_mapping_level()).

I don't think it has to be done in this series, but a future
optimization to consider is having follow_pfnmap just tell the caller
about the mapping level directly. It already found this information as
part of its walk. I think there's a possibility to simplify KVM /
avoid it having to do its own walk again later.
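A minimal sketch of what that suggestion could look like, assuming the
follow_pfnmap_args layout introduced earlier in this series; the pgsize
field is hypothetical and is not part of the posted code:

	/*
	 * Hypothetical sketch only: extend the pfnmap API's argument struct
	 * so follow_pfnmap_start() can report the size of the leaf entry it
	 * found. The "pgsize" field is an invented name; the other fields
	 * are paraphrased from the struct added earlier in this series.
	 */
	struct follow_pfnmap_args {
		/* Inputs */
		struct vm_area_struct *vma;
		unsigned long address;
		/* Outputs, valid between follow_pfnmap_start() and _end() */
		unsigned long pfn;
		pgprot_t pgprot;
		bool writable;
		bool special;
		unsigned long pgsize;	/* hypothetical: PAGE_SIZE, PMD_SIZE or PUD_SIZE */
	};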
On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of
> > the work is already done on the other side (host_pfn_mapping_level()).
>
> I don't think it has to be done in this series, but a future
> optimization to consider is having follow_pfnmap just tell the caller
> about the mapping level directly. It already found this information as
> part of its walk. I think there's a possibility to simplify KVM /
> avoid it having to do its own walk again later.

AFAIU pfnmap isn't special in this case, as we apply the "walk pgtable
twice" idea to generic pages here too, so it's probably not directly
relevant to this patch alone.

But I agree with you; it sounds like something we can consider trying. I
would be curious whether the perf difference would be measurable in this
specific case, though. I mean, the first walk will heat everything up,
so I'd expect the 2nd walk (which is lockless) to be pretty fast
normally.

Thanks,

--
Peter Xu
On Mon, Aug 12, 2024 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> > On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of
> > > the work is already done on the other side (host_pfn_mapping_level()).
> >
> > I don't think it has to be done in this series, but a future
> > optimization to consider is having follow_pfnmap just tell the caller
> > about the mapping level directly. It already found this information as
> > part of its walk. I think there's a possibility to simplify KVM /
> > avoid it having to do its own walk again later.
>
> AFAIU pfnmap isn't special in this case, as we apply the "walk pgtable
> twice" idea to generic pages here too, so it's probably not directly
> relevant to this patch alone.
>
> But I agree with you; it sounds like something we can consider trying. I
> would be curious whether the perf difference would be measurable in this
> specific case, though. I mean, the first walk will heat everything up,
> so I'd expect the 2nd walk (which is lockless) to be pretty fast
> normally.

Agreed, the main benefit is probably just code simplification.

> Thanks,
>
> --
> Peter Xu
On Mon, Aug 12, 2024, Axel Rasmussen wrote:
> On Mon, Aug 12, 2024 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> > > On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of
> > > > the work is already done on the other side (host_pfn_mapping_level()).
> > >
> > > I don't think it has to be done in this series, but a future
> > > optimization to consider is having follow_pfnmap just tell the caller
> > > about the mapping level directly. It already found this information as
> > > part of its walk. I think there's a possibility to simplify KVM /
> > > avoid it having to do its own walk again later.
> >
> > AFAIU pfnmap isn't special in this case, as we apply the "walk pgtable
> > twice" idea to generic pages here too, so it's probably not directly
> > relevant to this patch alone.

Ya. My original hope was that KVM could simply walk the host page tables and
get whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped
versus regular pages. That might be feasible after dropping all of KVM's
refcounting shenanigans[*]? Not sure, haven't thought too much about it,
precisely because I too think it won't provide any meaningful performance
boost.

> > But I agree with you; it sounds like something we can consider trying. I
> > would be curious whether the perf difference would be measurable in this
> > specific case, though. I mean, the first walk will heat everything up,
> > so I'd expect the 2nd walk (which is lockless) to be pretty fast
> > normally.
>
> Agreed, the main benefit is probably just code simplification.

+1. I wouldn't spend much time, if any, trying to plumb the size back out.
Unless we can convert regular pages as well, it'd probably be more confusing
to have separate ways of getting the mapping size.
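For context, the refcounting in question is the kvm_try_get_pfn() call
visible in the patch below; roughly, paraphrased from the virt/kvm/kvm_main.c
of this era (details may differ):

	/*
	 * KVM opportunistically grabs a reference on the PFN it resolved,
	 * even though a VM_PFNMAP/VM_IO pfn may have no refcounted struct
	 * page behind it at all.
	 */
	static int kvm_try_get_pfn(kvm_pfn_t pfn)
	{
		struct page *page = kvm_pfn_to_refcounted_page(pfn);

		if (!page)
			return 1;	/* no struct page: nothing to pin */

		return get_page_unless_zero(page);
	}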
On Mon, Aug 12, 2024 at 04:44:40PM -0700, Sean Christopherson wrote:
> > > > I don't think it has to be done in this series, but a future
> > > > optimization to consider is having follow_pfnmap just tell the caller
> > > > about the mapping level directly. It already found this information as
> > > > part of its walk. I think there's a possibility to simplify KVM /
> > > > avoid it having to do its own walk again later.
> > >
> > > AFAIU pfnmap isn't special in this case, as we apply the "walk pgtable
> > > twice" idea to generic pages here too, so it's probably not directly
> > > relevant to this patch alone.
>
> Ya. My original hope was that KVM could simply walk the host page tables and
> get whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped
> versus regular pages. That might be feasible after dropping all of KVM's
> refcounting shenanigans[*]? Not sure, haven't thought too much about it,
> precisely because I too think it won't provide any meaningful performance
> boost.

The main thing, from my perspective, is that KVM reliably creates 1G
mappings in its table if the VMA has 1G mappings, across all arches
and scenarios. For normal memory and PFNMAP equally.

Not returning the size here makes me wonder if that actually happens?
Does KVM have another way to know what size entry to create?

Jason
On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 04:44:40PM -0700, Sean Christopherson wrote:
> > > > > I don't think it has to be done in this series, but a future
> > > > > optimization to consider is having follow_pfnmap just tell the caller
> > > > > about the mapping level directly. It already found this information as
> > > > > part of its walk. I think there's a possibility to simplify KVM /
> > > > > avoid it having to do its own walk again later.
> > > >
> > > > AFAIU pfnmap isn't special in this case, as we apply the "walk pgtable
> > > > twice" idea to generic pages here too, so it's probably not directly
> > > > relevant to this patch alone.
> >
> > Ya. My original hope was that KVM could simply walk the host page tables and
> > get whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped
> > versus regular pages. That might be feasible after dropping all of KVM's
> > refcounting shenanigans[*]? Not sure, haven't thought too much about it,
> > precisely because I too think it won't provide any meaningful performance
> > boost.
>
> The main thing, from my perspective, is that KVM reliably creates 1G
> mappings in its table if the VMA has 1G mappings, across all arches
> and scenarios. For normal memory and PFNMAP equally.

Yes, KVM walks the host page tables for the user virtual address and uses
whatever page size it finds, regardless of the mapping type.

> Not returning the size here makes me wonder if that actually happens?

It does happen; the idea here was purely to avoid the second page table walk.

> Does KVM have another way to know what size entry to create?
>
> Jason
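For reference, the walk Sean describes is host_pfn_mapping_level(). A
condensed sketch of the x86 version from this era follows (paraphrased from
arch/x86/kvm/mmu/mmu.c, comments abridged; details may differ slightly):

	/* Condensed sketch; upstream has fuller comments and checks. */
	static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
					  const struct kvm_memory_slot *slot)
	{
		int level = PG_LEVEL_4K;
		unsigned long hva, flags;
		pgd_t pgd;
		p4d_t p4d;
		pud_t pud;
		pmd_t pmd;

		hva = __gfn_to_hva_memslot(slot, gfn);

		/* Lockless walk: disabling IRQs blocks a concurrent teardown. */
		local_irq_save(flags);

		/* Read each entry exactly once so a huge leaf can't change underfoot. */
		pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
		if (pgd_none(pgd))
			goto out;

		p4d = READ_ONCE(*p4d_offset(&pgd, hva));
		if (p4d_none(p4d) || !p4d_present(p4d))
			goto out;

		pud = READ_ONCE(*pud_offset(&p4d, hva));
		if (pud_none(pud) || !pud_present(pud))
			goto out;

		if (pud_leaf(pud)) {
			level = PG_LEVEL_1G;	/* 1G host mapping -> 1G stage-2 page */
			goto out;
		}

		pmd = READ_ONCE(*pmd_offset(&pud, hva));
		if (pmd_none(pmd) || !pmd_present(pmd))
			goto out;

		if (pmd_leaf(pmd))
			level = PG_LEVEL_2M;

	out:
		local_irq_restore(flags);
		return level;
	}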
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..9fb1c527a8e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2862,13 +2862,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool write_fault,
 			       bool *writable, kvm_pfn_t *p_pfn)
 {
+	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
 	kvm_pfn_t pfn;
-	pte_t *ptep;
-	pte_t pte;
-	spinlock_t *ptl;
 	int r;
 
-	r = follow_pte(vma, addr, &ptep, &ptl);
+	r = follow_pfnmap_start(&args);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
@@ -2883,21 +2881,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pte(vma, addr, &ptep, &ptl);
+		r = follow_pfnmap_start(&args);
 		if (r)
 			return r;
 	}
 
-	pte = ptep_get(ptep);
-
-	if (write_fault && !pte_write(pte)) {
+	if (write_fault && !args.writable) {
 		pfn = KVM_PFN_ERR_RO_FAULT;
 		goto out;
 	}
 
 	if (writable)
-		*writable = pte_write(pte);
-	pfn = pte_pfn(pte);
+		*writable = args.writable;
+	pfn = args.pfn;
 
 	/*
 	 * Get a reference here because callers of *hva_to_pfn* and
@@ -2918,9 +2914,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 */
 	if (!kvm_try_get_pfn(pfn))
 		r = -EFAULT;
-
 out:
-	pte_unmap_unlock(ptep, ptl);
+	follow_pfnmap_end(&args);
 	*p_pfn = pfn;
 
 	return r;
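A minimal caller sketch of the new API's contract as implied by the patch
above; the peek_pfnmap() helper is invented for illustration. On success,
follow_pfnmap_start() leaves the relevant page-table lock held, so the
outputs in args are only stable until the matching follow_pfnmap_end():

	/* Illustrative only; "peek_pfnmap" is not a real kernel function. */
	static int peek_pfnmap(struct vm_area_struct *vma, unsigned long addr,
			       unsigned long *pfn, bool *writable)
	{
		struct follow_pfnmap_args args = { .vma = vma, .address = addr };
		int r;

		r = follow_pfnmap_start(&args);	/* fails if nothing is mapped at addr */
		if (r)
			return r;

		/* Outputs are only valid while the walk is still "open". */
		*pfn = args.pfn;
		*writable = args.writable;

		follow_pfnmap_end(&args);	/* releases the lock taken by _start() */
		return 0;
	}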
Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of
the work is already done on the other side (host_pfn_mapping_level()).

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)