Message ID | 20191211213207.215936-3-brho@google.com (mailing list archive)
---|---
State | New, archived
Series | kvm: Use huge pages for DAX-backed files
On 11/12/19 22:32, Barret Rhoden wrote: > + /* > + * Our caller grabbed the KVM mmu_lock with a successful > + * mmu_notifier_retry, so we're safe to walk the page table. > + */ > + switch (dev_pagemap_mapping_shift(hva, current->mm)) { > + case PMD_SHIFT: > + case PUD_SIZE: > + return true; > + } > + return false; Should this simply be "> PAGE_SHIFT"? Paolo
On 11.12.19 22:32, Barret Rhoden wrote: > This change allows KVM to map DAX-backed files made of huge pages with > huge mappings in the EPT/TDP. > > DAX pages are not PageTransCompound. The existing check is trying to > determine if the mapping for the pfn is a huge mapping or not. For > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. > For DAX, we can check the page table itself. > > Note that KVM already faulted in the page (or huge page) in the host's > page table, and we hold the KVM mmu spinlock. We grabbed that lock in > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq. > > Signed-off-by: Barret Rhoden <brho@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++---- > 1 file changed, 32 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6f92b40d798c..cd07bc4e595f 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) > return -EFAULT; > } > > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) > +{ > + struct page *page = pfn_to_page(pfn); > + unsigned long hva; > + > + if (!is_zone_device_page(page)) > + return PageTransCompoundMap(page); > + > + /* > + * DAX pages do not use compound pages. The page should have already > + * been mapped into the host-side page table during try_async_pf(), so > + * we can check the page tables directly. > + */ > + hva = gfn_to_hva(kvm, gfn); > + if (kvm_is_error_hva(hva)) > + return false; > + > + /* > + * Our caller grabbed the KVM mmu_lock with a successful > + * mmu_notifier_retry, so we're safe to walk the page table. > + */ > + switch (dev_pagemap_mapping_shift(hva, current->mm)) { > + case PMD_SHIFT: > + case PUD_SIZE: Shouldn't this be PUD_SHIFT? But I agree with Paolo, that this is simply return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT; > + return true; > + } > + return false; > +} > + > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > gfn_t gfn, kvm_pfn_t *pfnp, > int *levelp) > @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > * here. > */ > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL && > - PageTransCompoundMap(pfn_to_page(pfn)) && > + level == PT_PAGE_TABLE_LEVEL && > + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) && > !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) { > unsigned long mask; > /* > @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, > * mapping if the indirect sp has level = 1. > */ > if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && > - !kvm_is_zone_device_pfn(pfn) && > - PageTransCompoundMap(pfn_to_page(pfn))) { > + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) { > pte_list_remove(rmap_head, sptep); > > if (kvm_available_flush_tlb_with_range()) > Patch itself looks good to me (especially, cleans up these two places a bit). I am not an expert on the locking part, so I can't give my RB.
> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote: > > This change allows KVM to map DAX-backed files made of huge pages with > huge mappings in the EPT/TDP. > > DAX pages are not PageTransCompound. The existing check is trying to > determine if the mapping for the pfn is a huge mapping or not. For > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. > For DAX, we can check the page table itself. For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize() will return the page-size of the hugetlbfs without the need to parse the page-tables. See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize(). Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables. Therefore, it seems more logical to me that: (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables. (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages. > > Note that KVM already faulted in the page (or huge page) in the host's > page table, and we hold the KVM mmu spinlock. We grabbed that lock in > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq. > > Signed-off-by: Barret Rhoden <brho@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++---- > 1 file changed, 32 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6f92b40d798c..cd07bc4e595f 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) > return -EFAULT; > } > > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) > +{ > + struct page *page = pfn_to_page(pfn); > + unsigned long hva; > + > + if (!is_zone_device_page(page)) > + return PageTransCompoundMap(page); > + > + /* > + * DAX pages do not use compound pages. The page should have already > + * been mapped into the host-side page table during try_async_pf(), so > + * we can check the page tables directly. > + */ > + hva = gfn_to_hva(kvm, gfn); > + if (kvm_is_error_hva(hva)) > + return false; > + > + /* > + * Our caller grabbed the KVM mmu_lock with a successful > + * mmu_notifier_retry, so we're safe to walk the page table. > + */ > + switch (dev_pagemap_mapping_shift(hva, current->mm)) { Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter? Was this changed by a commit I missed? -Liran > + case PMD_SHIFT: > + case PUD_SIZE: > + return true; > + } > + return false; > +} > + > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > gfn_t gfn, kvm_pfn_t *pfnp, > int *levelp) > @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > * here. 
> */ > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL && > - PageTransCompoundMap(pfn_to_page(pfn)) && > + level == PT_PAGE_TABLE_LEVEL && > + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) && > !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) { > unsigned long mask; > /* > @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, > * mapping if the indirect sp has level = 1. > */ > if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && > - !kvm_is_zone_device_pfn(pfn) && > - PageTransCompoundMap(pfn_to_page(pfn))) { > + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) { > pte_list_remove(rmap_head, sptep); > > if (kvm_available_flush_tlb_with_range()) > -- > 2.24.0.525.g8f36a354ae-goog >
On 12/12/19 7:22 AM, David Hildenbrand wrote: >> + /* >> + * Our caller grabbed the KVM mmu_lock with a successful >> + * mmu_notifier_retry, so we're safe to walk the page table. >> + */ >> + switch (dev_pagemap_mapping_shift(hva, current->mm)) { >> + case PMD_SHIFT: >> + case PUD_SIZE: > > Shouldn't this be PUD_SHIFT? > > But I agree with Paolo, that this is simply > > return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT; Yes, good call. I'll fix that in the next version.
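Folding both review comments together, the DAX branch of the helper collapses to a single comparison. A minimal sketch of how the revised helper might look in the next version, assuming the function name stays the same and dev_pagemap_mapping_shift() keeps the (hva, mm) form used in this series (this is an illustration of the suggested change, not code that was actually posted):

static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
        struct page *page = pfn_to_page(pfn);
        unsigned long hva;

        if (!is_zone_device_page(page))
                return PageTransCompoundMap(page);

        /*
         * DAX pages are not compound; the host page table is the only
         * record of the mapping size.
         */
        hva = gfn_to_hva(kvm, gfn);
        if (kvm_is_error_hva(hva))
                return false;

        /*
         * Any mapping level above PAGE_SHIFT (PMD or PUD) counts as huge,
         * which also avoids the PUD_SIZE/PUD_SHIFT mixup in the posted
         * version.
         */
        return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;
}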
On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote: > > > > > On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote: > > > > This change allows KVM to map DAX-backed files made of huge pages with > > huge mappings in the EPT/TDP. > > > > DAX pages are not PageTransCompound. The existing check is trying to > > determine if the mapping for the pfn is a huge mapping or not. For > > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. > > For DAX, we can check the page table itself. > > For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize() > will return the page-size of the hugetlbfs without the need to parse the page-tables. > See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize(). > > Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables. > > Therefore, it seems more logical to me that: > (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables. A given dax-mapped vma may have mixed page sizes so ->page_size() can't be used reliably to enumerating the mapping size. > (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages. DAX pages do not participate in THP and do not have the PageTransCompound accounting. The only mechanism that records the mapping size for dax is the page tables themselves. > > > > > Note that KVM already faulted in the page (or huge page) in the host's > > page table, and we hold the KVM mmu spinlock. We grabbed that lock in > > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq. > > > > Signed-off-by: Barret Rhoden <brho@google.com> > > --- > > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++---- > > 1 file changed, 32 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index 6f92b40d798c..cd07bc4e595f 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) > > return -EFAULT; > > } > > > > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) > > +{ > > + struct page *page = pfn_to_page(pfn); > > + unsigned long hva; > > + > > + if (!is_zone_device_page(page)) > > + return PageTransCompoundMap(page); > > + > > + /* > > + * DAX pages do not use compound pages. The page should have already > > + * been mapped into the host-side page table during try_async_pf(), so > > + * we can check the page tables directly. > > + */ > > + hva = gfn_to_hva(kvm, gfn); > > + if (kvm_is_error_hva(hva)) > > + return false; > > + > > + /* > > + * Our caller grabbed the KVM mmu_lock with a successful > > + * mmu_notifier_retry, so we're safe to walk the page table. 
> > + */ > > + switch (dev_pagemap_mapping_shift(hva, current->mm)) { > > Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter? > Was this changed by a commit I missed? > > -Liran > > > + case PMD_SHIFT: > > + case PUD_SIZE: > > + return true; > > + } > > + return false; > > +} > > + > > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > gfn_t gfn, kvm_pfn_t *pfnp, > > int *levelp) > > @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > * here. > > */ > > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > > - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL && > > - PageTransCompoundMap(pfn_to_page(pfn)) && > > + level == PT_PAGE_TABLE_LEVEL && > > + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) && > > !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) { > > unsigned long mask; > > /* > > @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, > > * mapping if the indirect sp has level = 1. > > */ > > if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && > > - !kvm_is_zone_device_pfn(pfn) && > > - PageTransCompoundMap(pfn_to_page(pfn))) { > > + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) { > > pte_list_remove(rmap_head, sptep); > > > > if (kvm_available_flush_tlb_with_range()) > > -- > > 2.24.0.525.g8f36a354ae-goog > > >
On 12/12/19 7:33 AM, Liran Alon wrote: >> + /* >> + * Our caller grabbed the KVM mmu_lock with a successful >> + * mmu_notifier_retry, so we're safe to walk the page table. >> + */ >> + switch (dev_pagemap_mapping_shift(hva, current->mm)) { > Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter? > Was this changed by a commit I missed? I changed this in Patch 1. The place I call it in KVM has the address and mm available, which is the only thing dev_pagemap_mapping_shift() really needs. (The first thing it did was convert page to address). I'll add some more text to patch 1's commit message about that. Thanks, Barret
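Patch 1 of the series is not reproduced in this archive. For context, a rough sketch of the walker being described, modeled on the existing dev_pagemap_mapping_shift() in mm/memory-failure.c with the page-to-address conversion moved to the caller; treat it as an approximation of that patch rather than its exact contents:

static unsigned long dev_pagemap_mapping_shift(unsigned long address,
                                               struct mm_struct *mm)
{
        pgd_t *pgd;
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;

        /* Walk the host page table and report the level mapping address. */
        pgd = pgd_offset(mm, address);
        if (!pgd_present(*pgd))
                return 0;
        p4d = p4d_offset(pgd, address);
        if (!p4d_present(*p4d))
                return 0;
        pud = pud_offset(p4d, address);
        if (!pud_present(*pud))
                return 0;
        if (pud_devmap(*pud))
                return PUD_SHIFT;
        pmd = pmd_offset(pud, address);
        if (!pmd_present(*pmd))
                return 0;
        if (pmd_devmap(*pmd))
                return PMD_SHIFT;
        /* Anything smaller is treated as a base page in this sketch. */
        return PAGE_SHIFT;
}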
On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote: > This change allows KVM to map DAX-backed files made of huge pages with > huge mappings in the EPT/TDP. > > DAX pages are not PageTransCompound. The existing check is trying to > determine if the mapping for the pfn is a huge mapping or not. For > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. > For DAX, we can check the page table itself. > > Note that KVM already faulted in the page (or huge page) in the host's > page table, and we hold the KVM mmu spinlock. We grabbed that lock in > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq. > > Signed-off-by: Barret Rhoden <brho@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++---- > 1 file changed, 32 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6f92b40d798c..cd07bc4e595f 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) > return -EFAULT; > } > > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) > +{ > + struct page *page = pfn_to_page(pfn); > + unsigned long hva; > + > + if (!is_zone_device_page(page)) > + return PageTransCompoundMap(page); > + > + /* > + * DAX pages do not use compound pages. The page should have already > + * been mapped into the host-side page table during try_async_pf(), so > + * we can check the page tables directly. > + */ > + hva = gfn_to_hva(kvm, gfn); > + if (kvm_is_error_hva(hva)) > + return false; > + > + /* > + * Our caller grabbed the KVM mmu_lock with a successful > + * mmu_notifier_retry, so we're safe to walk the page table. > + */ > + switch (dev_pagemap_mapping_shift(hva, current->mm)) { > + case PMD_SHIFT: > + case PUD_SIZE: I assume this means DAX can have 1GB pages? I ask because KVM's THP logic has historically relied on THP only supporting 2MB. I cleaned this up in a recent series[*], which is in kvm/queue, but I obviously didn't actually test whether or not KVM would correctly handle 1GB non-hugetlbfs pages. The easiest thing is probably to rebase on kvm/queue. You'll need to do that anyways, and I suspect doing so will help shake out any hiccups. [*] https://lkml.kernel.org/r/20191206235729.29263-1-sean.j.christopherson@intel.com > + return true; > + } > + return false; > +} > + > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > gfn_t gfn, kvm_pfn_t *pfnp, > int *levelp) > @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > * here. > */ > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL && > - PageTransCompoundMap(pfn_to_page(pfn)) && > + level == PT_PAGE_TABLE_LEVEL && > + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) && > !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) { > unsigned long mask; > /* > @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, > * mapping if the indirect sp has level = 1. > */ > if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && > - !kvm_is_zone_device_pfn(pfn) && > - PageTransCompoundMap(pfn_to_page(pfn))) { > + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) { > pte_list_remove(rmap_head, sptep); > > if (kvm_available_flush_tlb_with_range()) > -- > 2.24.0.525.g8f36a354ae-goog >
On Thu, Dec 12, 2019 at 9:34 AM Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote: > > This change allows KVM to map DAX-backed files made of huge pages with > > huge mappings in the EPT/TDP. > > > > DAX pages are not PageTransCompound. The existing check is trying to > > determine if the mapping for the pfn is a huge mapping or not. For > > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. > > For DAX, we can check the page table itself. > > > > Note that KVM already faulted in the page (or huge page) in the host's > > page table, and we hold the KVM mmu spinlock. We grabbed that lock in > > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq. > > > > Signed-off-by: Barret Rhoden <brho@google.com> > > --- > > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++---- > > 1 file changed, 32 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index 6f92b40d798c..cd07bc4e595f 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) > > return -EFAULT; > > } > > > > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) > > +{ > > + struct page *page = pfn_to_page(pfn); > > + unsigned long hva; > > + > > + if (!is_zone_device_page(page)) > > + return PageTransCompoundMap(page); > > + > > + /* > > + * DAX pages do not use compound pages. The page should have already > > + * been mapped into the host-side page table during try_async_pf(), so > > + * we can check the page tables directly. > > + */ > > + hva = gfn_to_hva(kvm, gfn); > > + if (kvm_is_error_hva(hva)) > > + return false; > > + > > + /* > > + * Our caller grabbed the KVM mmu_lock with a successful > > + * mmu_notifier_retry, so we're safe to walk the page table. > > + */ > > + switch (dev_pagemap_mapping_shift(hva, current->mm)) { > > + case PMD_SHIFT: > > + case PUD_SIZE: > > I assume this means DAX can have 1GB pages? Correct, it can. Not in the filesystem-dax case, but device-dax supports 1GB pages. > I ask because KVM's THP logic > has historically relied on THP only supporting 2MB. I cleaned this up in > a recent series[*], which is in kvm/queue, but I obviously didn't actually > test whether or not KVM would correctly handle 1GB non-hugetlbfs pages. Yeah, since device-dax is the only path to support longterm page pinning for vfio device assignment, testing with device-dax + 1GB pages would be a useful sanity check.
> On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote: > > On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote: >> >> >> >>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote: >>> >>> This change allows KVM to map DAX-backed files made of huge pages with >>> huge mappings in the EPT/TDP. >>> >>> DAX pages are not PageTransCompound. The existing check is trying to >>> determine if the mapping for the pfn is a huge mapping or not. For >>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. >>> For DAX, we can check the page table itself. >> >> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize() >> will return the page-size of the hugetlbfs without the need to parse the page-tables. >> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize(). >> >> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables. >> >> Therefore, it seems more logical to me that: >> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables. > > A given dax-mapped vma may have mixed page sizes so ->page_size() > can't be used reliably to enumerating the mapping size. Naive question: Why don’t split the VMA in this case to multiple VMAs with different results for ->page_size()? What you are describing sounds like DAX is breaking this callback semantics in an unpredictable manner. > >> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages. > > DAX pages do not participate in THP and do not have the > PageTransCompound accounting. The only mechanism that records the > mapping size for dax is the page tables themselves. What is the rational behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page? -Liran > > >> >>> >>> Note that KVM already faulted in the page (or huge page) in the host's >>> page table, and we hold the KVM mmu spinlock. We grabbed that lock in >>> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq. 
>>> >>> Signed-off-by: Barret Rhoden <brho@google.com> >>> --- >>> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++---- >>> 1 file changed, 32 insertions(+), 4 deletions(-) >>> >>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c >>> index 6f92b40d798c..cd07bc4e595f 100644 >>> --- a/arch/x86/kvm/mmu/mmu.c >>> +++ b/arch/x86/kvm/mmu/mmu.c >>> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) >>> return -EFAULT; >>> } >>> >>> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) >>> +{ >>> + struct page *page = pfn_to_page(pfn); >>> + unsigned long hva; >>> + >>> + if (!is_zone_device_page(page)) >>> + return PageTransCompoundMap(page); >>> + >>> + /* >>> + * DAX pages do not use compound pages. The page should have already >>> + * been mapped into the host-side page table during try_async_pf(), so >>> + * we can check the page tables directly. >>> + */ >>> + hva = gfn_to_hva(kvm, gfn); >>> + if (kvm_is_error_hva(hva)) >>> + return false; >>> + >>> + /* >>> + * Our caller grabbed the KVM mmu_lock with a successful >>> + * mmu_notifier_retry, so we're safe to walk the page table. >>> + */ >>> + switch (dev_pagemap_mapping_shift(hva, current->mm)) { >> >> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter? >> Was this changed by a commit I missed? >> >> -Liran >> >>> + case PMD_SHIFT: >>> + case PUD_SIZE: >>> + return true; >>> + } >>> + return false; >>> +} >>> + >>> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, >>> gfn_t gfn, kvm_pfn_t *pfnp, >>> int *levelp) >>> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, >>> * here. >>> */ >>> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && >>> - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL && >>> - PageTransCompoundMap(pfn_to_page(pfn)) && >>> + level == PT_PAGE_TABLE_LEVEL && >>> + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) && >>> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) { >>> unsigned long mask; >>> /* >>> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, >>> * mapping if the indirect sp has level = 1. >>> */ >>> if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && >>> - !kvm_is_zone_device_pfn(pfn) && >>> - PageTransCompoundMap(pfn_to_page(pfn))) { >>> + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) { >>> pte_list_remove(rmap_head, sptep); >>> >>> if (kvm_available_flush_tlb_with_range()) >>> -- >>> 2.24.0.525.g8f36a354ae-goog >>> >>
> On 12 Dec 2019, at 19:34, Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote: >> This change allows KVM to map DAX-backed files made of huge pages with >> huge mappings in the EPT/TDP. >> >> DAX pages are not PageTransCompound. The existing check is trying to >> determine if the mapping for the pfn is a huge mapping or not. For >> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. >> For DAX, we can check the page table itself. >> >> Note that KVM already faulted in the page (or huge page) in the host's >> page table, and we hold the KVM mmu spinlock. We grabbed that lock in >> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq. >> >> Signed-off-by: Barret Rhoden <brho@google.com> >> --- >> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++---- >> 1 file changed, 32 insertions(+), 4 deletions(-) >> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c >> index 6f92b40d798c..cd07bc4e595f 100644 >> --- a/arch/x86/kvm/mmu/mmu.c >> +++ b/arch/x86/kvm/mmu/mmu.c >> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) >> return -EFAULT; >> } >> >> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) >> +{ >> + struct page *page = pfn_to_page(pfn); >> + unsigned long hva; >> + >> + if (!is_zone_device_page(page)) >> + return PageTransCompoundMap(page); >> + >> + /* >> + * DAX pages do not use compound pages. The page should have already >> + * been mapped into the host-side page table during try_async_pf(), so >> + * we can check the page tables directly. >> + */ >> + hva = gfn_to_hva(kvm, gfn); >> + if (kvm_is_error_hva(hva)) >> + return false; >> + >> + /* >> + * Our caller grabbed the KVM mmu_lock with a successful >> + * mmu_notifier_retry, so we're safe to walk the page table. >> + */ >> + switch (dev_pagemap_mapping_shift(hva, current->mm)) { >> + case PMD_SHIFT: >> + case PUD_SIZE: > > I assume this means DAX can have 1GB pages? I ask because KVM's THP logic > has historically relied on THP only supporting 2MB. I cleaned this up in > a recent series[*], which is in kvm/queue, but I obviously didn't actually > test whether or not KVM would correctly handle 1GB non-hugetlbfs pages. KVM doesn’t handle 1GB correctly for all types of non-hugetlbfs pages. One example we have noticed internally but haven’t submitted an upstream patch yet is for pages without “struct page”. As in this case, hva_to_pfn() will notice vma->vm_flags have VM_PFNMAP set and call hva_to_pfn_remapped() -> follow_pfn(). However, follow_pfn() currently just calls follow_pte() which uses __follow_pte_pmd() that doesn’t handle a huge PUD entry. > > The easiest thing is probably to rebase on kvm/queue. You'll need to do > that anyways, and I suspect doing so will help shake out any hiccups. > > [*] https://lkml.kernel.org/r/20191206235729.29263-1-sean.j.christopherson@intel.com > >> + return true; >> + } >> + return false; >> +} >> + >> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, >> gfn_t gfn, kvm_pfn_t *pfnp, >> int *levelp) >> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, >> * here.
>> */ >> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && >> - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL && >> - PageTransCompoundMap(pfn_to_page(pfn)) && >> + level == PT_PAGE_TABLE_LEVEL && >> + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) && >> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) { >> unsigned long mask; >> /* >> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, >> * mapping if the indirect sp has level = 1. >> */ >> if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && >> - !kvm_is_zone_device_pfn(pfn) && >> - PageTransCompoundMap(pfn_to_page(pfn))) { >> + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) { >> pte_list_remove(rmap_head, sptep); >> >> if (kvm_available_flush_tlb_with_range()) >> -- >> 2.24.0.525.g8f36a354ae-goog >>
On Thu, Dec 12, 2019 at 9:39 AM Liran Alon <liran.alon@oracle.com> wrote: > > > > > On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote: > > > > On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote: > >> > >> > >> > >>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote: > >>> > >>> This change allows KVM to map DAX-backed files made of huge pages with > >>> huge mappings in the EPT/TDP. > >>> > >>> DAX pages are not PageTransCompound. The existing check is trying to > >>> determine if the mapping for the pfn is a huge mapping or not. For > >>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. > >>> For DAX, we can check the page table itself. > >> > >> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize() > >> will return the page-size of the hugetlbfs without the need to parse the page-tables. > >> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize(). > >> > >> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables. > >> > >> Therefore, it seems more logical to me that: > >> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables. > > > > A given dax-mapped vma may have mixed page sizes so ->page_size() > > can't be used reliably to enumerating the mapping size. > > Naive question: Why don’t split the VMA in this case to multiple VMAs with different results for ->page_size()? Filesystems traditionally have not populated ->pagesize() in their vm_operations, there was no compelling reason to go add it and the complexity seems prohibitive. > What you are describing sounds like DAX is breaking this callback semantics in an unpredictable manner. It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE. Huge pages in the page cache has a similar issue. > >> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages. > > > > DAX pages do not participate in THP and do not have the > > PageTransCompound accounting. The only mechanism that records the > > mapping size for dax is the page tables themselves. > > What is the rational behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page? THP accounting is a function of the page allocator. ZONE_DEVICE pages are excluded from the page allocator. ZONE_DEVICE is just enough infrastructure to support pfn_to_page(), page_address(), and get_user_pages(). Other page allocator services beyond that are not present.
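The ->pagesize() callback being discussed is only wired up by hugetlbfs; every other vma falls through to the generic helper, which is roughly (paraphrasing mm/hugetlb.c):

unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
{
        /* Only hugetlbfs vmas supply ->pagesize(); everyone else reports PAGE_SIZE. */
        if (vma->vm_ops && vma->vm_ops->pagesize)
                return vma->vm_ops->pagesize(vma);
        return PAGE_SIZE;
}

Since DAX vmas do not provide ->pagesize(), kvm_host_page_size() reports PAGE_SIZE for them regardless of how the range is actually mapped, which is why this series reads the host page tables instead.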
> On 12 Dec 2019, at 19:59, Dan Williams <dan.j.williams@intel.com> wrote: > > On Thu, Dec 12, 2019 at 9:39 AM Liran Alon <liran.alon@oracle.com> wrote: >> >> >> >>> On 12 Dec 2019, at 18:54, Dan Williams <dan.j.williams@intel.com> wrote: >>> >>> On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <liran.alon@oracle.com> wrote: >>>> >>>> >>>> >>>>> On 11 Dec 2019, at 23:32, Barret Rhoden <brho@google.com> wrote: >>>>> >>>>> This change allows KVM to map DAX-backed files made of huge pages with >>>>> huge mappings in the EPT/TDP. >>>>> >>>>> DAX pages are not PageTransCompound. The existing check is trying to >>>>> determine if the mapping for the pfn is a huge mapping or not. For >>>>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. >>>>> For DAX, we can check the page table itself. >>>> >>>> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize() >>>> will return the page-size of the hugetlbfs without the need to parse the page-tables. >>>> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize(). >>>> >>>> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables. >>>> >>>> Therefore, it seems more logical to me that: >>>> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables. >>> >>> A given dax-mapped vma may have mixed page sizes so ->page_size() >>> can't be used reliably to enumerating the mapping size. >> >> Naive question: Why don’t split the VMA in this case to multiple VMAs with different results for ->page_size()? > > Filesystems traditionally have not populated ->pagesize() in their > vm_operations, there was no compelling reason to go add it and the > complexity seems prohibitive. I understand. Though this is technical debt that breaks ->page_size() semantics which might cause a complex bug some day... > >> What you are describing sounds like DAX is breaking this callback semantics in an unpredictable manner. > > It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE. Of course. :) I meant it may be unexpected by the caller. > Huge > pages in the page cache has a similar issue. Ok. I haven’t known that. Thanks for the explanation. > >>>> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages. >>> >>> DAX pages do not participate in THP and do not have the >>> PageTransCompound accounting. The only mechanism that records the >>> mapping size for dax is the page tables themselves. >> >> What is the rational behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page? > > THP accounting is a function of the page allocator. ZONE_DEVICE pages > are excluded from the page allocator. ZONE_DEVICE is just enough > infrastructure to support pfn_to_page(), page_address(), and > get_user_pages(). 
> Other page allocator services beyond that are not > present. Ok.
On 12/12/19 12:37 PM, Dan Williams wrote: > Yeah, since device-dax is the only path to support longterm page > pinning for vfio device assignment, testing with device-dax + 1GB > pages would be a useful sanity check. What are the issues with fs-dax and page pinning? Is that limitation something that is permanent and unfixable (by me or anyone)? I'd like to put a lot more in a DAX/pmem region than just a guest's memory, and having a mountable filesystem would be extremely convenient. Thanks, Barret
On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <brho@google.com> wrote: > > On 12/12/19 12:37 PM, Dan Williams wrote: > > Yeah, since device-dax is the only path to support longterm page > > pinning for vfio device assignment, testing with device-dax + 1GB > > pages would be a useful sanity check. > > What are the issues with fs-dax and page pinning? Is that limitation > something that is permanent and unfixable (by me or anyone)? It's a surprisingly painful point of contention... File backed DAX pages cannot be truncated while the page is pinned because the pin may indicate that DMA is ongoing to the file block / DAX page. When that pin is from RDMA or VFIO that creates a situation where filesystem operations are blocked indefinitely. More details here: 94db151dc892 "vfio: disable filesystem-dax page pinning". Currently, to prevent the deadlock, RDMA, VFIO, and IO_URING memory registration is blocked if the mapping is filesystem-dax backed (see the FOLL_LONGTERM flag to get_user_pages). One of the proposals to break the impasse was to allow the filesystem to forcibly revoke the mapping. I.e. to use the IOMMU to forcibly kick the RDMA device out of its registration. That was rejected by RDMA folks because RDMA applications are not prepared for this revocation to happen and the application that performed the registration may not be the application that uses the registration. There was an attempt to use a file lease to indicate the presence of a file / memory-registration that is blocking file-system operations, but that was still less palatable to filesystem folks than just keeping the status quo of blocking longterm pinning. That said, the VFIO use case seems a different situation than RDMA. There's often a 1:1 relationship between the application performing the memory registration and the application consuming it, the VMM, and there is always an IOMMU present that could revoke access and kill the guest if the mapping got truncated. It seems in theory that VFIO could tolerate a "revoke pin on truncate" mechanism where RDMA could not. > I'd like to put a lot more in a DAX/pmem region than just a guest's > memory, and having a mountable filesystem would be extremely convenient. Why would page pinning be involved in allowing the guest to mount a filesystem on guest-pmem? That already works today, it's just the device-passthrough that causes guest memory to be pinned indefinitely.
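For reference, the gate being described sits in the FOLL_LONGTERM path of get_user_pages(); a simplified paraphrase, with the helper name and surrounding error handling condensed here (see mm/gup.c and commit 94db151dc892 for the real code):

/*
 * Long-term pins (RDMA, VFIO, io_uring registrations) are refused when
 * any vma in the range is fs-dax backed, because the filesystem could
 * otherwise be blocked from truncating those blocks indefinitely.
 */
static bool range_has_fsdax_vma(struct vm_area_struct **vmas, long nr_pages)
{
        long i;

        for (i = 0; i < nr_pages; i++)
                if (vma_is_fsdax(vmas[i]))
                        return true;
        return false;
}

When the check fires, the pins taken so far are dropped and the caller sees an error (-EOPNOTSUPP in the code this paraphrases), which is why only device-dax can currently back VFIO device assignment.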
On 12/12/19 2:48 PM, Dan Williams wrote: > On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <brho@google.com> wrote: >> >> On 12/12/19 12:37 PM, Dan Williams wrote: >>> Yeah, since device-dax is the only path to support longterm page >>> pinning for vfio device assignment, testing with device-dax + 1GB >>> pages would be a useful sanity check. >> >> What are the issues with fs-dax and page pinning? Is that limitation >> something that is permanent and unfixable (by me or anyone)? > > It's a surprisingly painful point of contention... Thanks for the info; I'll check out those threads. [snip] >> I'd like to put a lot more in a DAX/pmem region than just a guest's >> memory, and having a mountable filesystem would be extremely convenient. > > Why would page pinning be involved in allowing the guest to mount a > filesystem on guest-pmem? That already works today, it's just the > device-passthrough that causes guest memory to be pinned indefinitely. I'd like to mount the pmem filesystem on the *host* and use its files for the guest's memory. So far I've just been making an ext4 FS on /dev/pmem0 and creating a bunch of files in the FS. Some of the files are the guest memory: one file for each VM. Other files are just metadata that the host uses. That all works right now, but I'd also like to use VFIO with the guests. Thanks, Barret
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f92b40d798c..cd07bc4e595f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
         return -EFAULT;
 }
 
+static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+        struct page *page = pfn_to_page(pfn);
+        unsigned long hva;
+
+        if (!is_zone_device_page(page))
+                return PageTransCompoundMap(page);
+
+        /*
+         * DAX pages do not use compound pages. The page should have already
+         * been mapped into the host-side page table during try_async_pf(), so
+         * we can check the page tables directly.
+         */
+        hva = gfn_to_hva(kvm, gfn);
+        if (kvm_is_error_hva(hva))
+                return false;
+
+        /*
+         * Our caller grabbed the KVM mmu_lock with a successful
+         * mmu_notifier_retry, so we're safe to walk the page table.
+         */
+        switch (dev_pagemap_mapping_shift(hva, current->mm)) {
+        case PMD_SHIFT:
+        case PUD_SIZE:
+                return true;
+        }
+        return false;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
                                         gfn_t gfn, kvm_pfn_t *pfnp,
                                         int *levelp)
@@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
          * here.
          */
         if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-            !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-            PageTransCompoundMap(pfn_to_page(pfn)) &&
+            level == PT_PAGE_TABLE_LEVEL &&
+            pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
             !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
                 unsigned long mask;
                 /*
@@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
                  * mapping if the indirect sp has level = 1.
                  */
                 if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
-                    !kvm_is_zone_device_pfn(pfn) &&
-                    PageTransCompoundMap(pfn_to_page(pfn))) {
+                    pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
                         pte_list_remove(rmap_head, sptep);
 
                         if (kvm_available_flush_tlb_with_range())
This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

DAX pages are not PageTransCompound. The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not. For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
For DAX, we can check the page table itself.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock. We grabbed that lock in
kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)