Message ID | 20181029210716.212159-1-brho@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] kvm: Use huge pages for DAX-backed files |
On Mon, Oct 29, 2018 at 2:07 PM Barret Rhoden <brho@google.com> wrote: > > This change allows KVM to map DAX-backed files made of huge pages with > huge mappings in the EPT/TDP. > > DAX pages are not PageTransCompound. The existing check is trying to > determine if the mapping for the pfn is a huge mapping or not. For > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. > > For DAX, we can check the page table itself. Actually, we might always > be able to walk the page table, even for PageTransCompound pages, but > it's probably a little slower. > > Note that KVM already faulted in the page (or huge page) in the host's > page table, and we hold the KVM mmu spinlock (grabbed before checking > the mmu seq). Based on the other comments about not worrying about a > pmd split, we might be able to safely walk the page table without > holding the mm sem. > > This patch relies on kvm_is_reserved_pfn() being false for DAX pages, > which I've hacked up for testing this code. That change should > eventually happen: > > https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/ > > Another issue is that kvm_mmu_zap_collapsible_spte() also uses > PageTransCompoundMap() to detect huge pages, but we don't have a way to > get the HVA easily. Can we just aggressively zap DAX pages there? > > Alternatively, is there a better way to track at the struct page level > whether or not a page is huge-mapped? Maybe the DAX huge pages mark > themselves as TransCompound or something similar, and we don't need to > special case DAX/ZONE_DEVICE pages. > > Signed-off-by: Barret Rhoden <brho@google.com> > --- > arch/x86/kvm/mmu.c | 71 +++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 70 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index cf5f572f2305..9f3e0f83a2dd 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) > return -EFAULT; > } > > +static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr) > +{ > + pgd_t *pgd; > + p4d_t *p4d; > + pud_t *pud; > + pmd_t *pmd; > + pte_t *pte; > + > + pgd = pgd_offset(mm, addr); > + if (!pgd_present(*pgd)) > + return 0; > + > + p4d = p4d_offset(pgd, addr); > + if (!p4d_present(*p4d)) > + return 0; > + if (p4d_huge(*p4d)) > + return P4D_SIZE; > + > + pud = pud_offset(p4d, addr); > + if (!pud_present(*pud)) > + return 0; > + if (pud_huge(*pud)) > + return PUD_SIZE; > + > + pmd = pmd_offset(pud, addr); > + if (!pmd_present(*pmd)) > + return 0; > + if (pmd_huge(*pmd)) > + return PMD_SIZE; > + > + pte = pte_offset_map(pmd, addr); > + if (!pte_present(*pte)) > + return 0; > + return PAGE_SIZE; > +} > + > +static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) > +{ > + struct page *page = pfn_to_page(pfn); > + unsigned long hva, map_sz; > + > + if (!is_zone_device_page(page)) > + return PageTransCompoundMap(page); > + > + /* > + * DAX pages do not use compound pages. The page should have already > + * been mapped into the host-side page table during try_async_pf(), so > + * we can check the page tables directly. > + */ > + hva = gfn_to_hva(kvm, gfn); > + if (kvm_is_error_hva(hva)) > + return false; > + > + /* > + * Our caller grabbed the KVM mmu_lock with a successful > + * mmu_notifier_retry, so we're safe to walk the page table. 
> + */ > + map_sz = pgd_mapping_size(current->mm, hva); > + switch (map_sz) { > + case PMD_SIZE: > + return true; > + case P4D_SIZE: > + case PUD_SIZE: > + printk_once(KERN_INFO "KVM THP promo found a very large page"); Why not allow PUD_SIZE? The device-dax interface supports PUD mappings. > + return false; > + } > + return false; > +} The above 2 functions are similar to what we need to do for determining the blast radius of a memory error, see dev_pagemap_mapping_shift() and its usage in add_to_kill(). > + > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > gfn_t *gfnp, kvm_pfn_t *pfnp, > int *levelp) > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > */ > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > level == PT_PAGE_TABLE_LEVEL && > - PageTransCompoundMap(pfn_to_page(pfn)) && > + pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) && I'm wondering if we're adding an explicit is_zone_device_page() check in this path to determine the page mapping size if that can be a replacement for the kvm_is_reserved_pfn() check. In other words, the goal of fixing up PageReserved() was to preclude the need for DAX-page special casing in KVM, but if we already need add some special casing for page size determination, might as well bypass the kvm_is_reserved_pfn() dependency as well.
On 2018-10-29 at 15:25 Dan Williams <dan.j.williams@intel.com> wrote: > > + /* > > + * Our caller grabbed the KVM mmu_lock with a successful > > + * mmu_notifier_retry, so we're safe to walk the page table. > > + */ > > + map_sz = pgd_mapping_size(current->mm, hva); > > + switch (map_sz) { > > + case PMD_SIZE: > > + return true; > > + case P4D_SIZE: > > + case PUD_SIZE: > > + printk_once(KERN_INFO "KVM THP promo found a very large page"); > > Why not allow PUD_SIZE? The device-dax interface supports PUD mappings. The place where I use that helper seemed to care about PMDs (compared to huge pages larger than PUDs), I think due to THP. Though it also checks "level == PT_PAGE_TABLE_LEVEL", so it's probably a moot point. I can change it from pfn_is_pmd_mapped -> pfn_is_huge_mapped and allow any huge mapping that is appropriate: so PUD or PMD for DAX, PMD for non-DAX, IIUC. > > > + return false; > > + } > > + return false; > > +} > > The above 2 functions are similar to what we need to do for > determining the blast radius of a memory error, see > dev_pagemap_mapping_shift() and its usage in add_to_kill(). Great. I don't know if I have access in the KVM code to the VMA to use those functions directly, but I can extract the guts of dev_pagemap_mapping_shift() or something and put it in mm/util.c. > > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > gfn_t *gfnp, kvm_pfn_t *pfnp, > > int *levelp) > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > */ > > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > > level == PT_PAGE_TABLE_LEVEL && > > - PageTransCompoundMap(pfn_to_page(pfn)) && > > + pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) && > > I'm wondering if we're adding an explicit is_zone_device_page() check > in this path to determine the page mapping size if that can be a > replacement for the kvm_is_reserved_pfn() check. In other words, the > goal of fixing up PageReserved() was to preclude the need for DAX-page > special casing in KVM, but if we already need add some special casing > for page size determination, might as well bypass the > kvm_is_reserved_pfn() dependency as well. kvm_is_reserved_pfn() is used in some other places, like kvm_set_pfn_dirty()and kvm_set_pfn_accessed(). Maybe the way those treat DAX pages matters on a case-by-case basis? There are other callers of kvm_is_reserved_pfn() such as kvm_pfn_to_page() and gfn_to_page(). I'm not familiar (yet) with how struct pages and DAX work together, and whether or not the callers of those pfn_to_page() functions have expectations about the 'type' of struct page they get back. It looks like another time that this popped up was kvm_is_mmio_pfn(), though that wasn't exactly checking kvm_is_reserved_pfn(), and it special cased based on the memory type / PAT business. Thanks, Barret
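For illustration only, a shared helper along the lines Barret floats here (pulling the guts of dev_pagemap_mapping_shift() into mm/util.c so both KVM and the memory-failure code could use it) might look roughly like the sketch below. It is just the posted patch's pgd_mapping_size() walk, taking an mm and returning a shift; the name mapping_size_shift() and its placement in mm/util.c are assumptions made for this sketch, not existing kernel API.

/*
 * Sketch of a possible mm/util.c helper: walk the host page table for
 * @addr in @mm and report the mapping size as a shift, so callers that
 * only have an mm (not a VMA) can use it.  Mirrors the patch's
 * pgd_mapping_size(); returns 0 if nothing is mapped.
 */
unsigned int mapping_size_shift(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	pgd = pgd_offset(mm, addr);
	if (!pgd_present(*pgd))
		return 0;

	p4d = p4d_offset(pgd, addr);
	if (!p4d_present(*p4d))
		return 0;
	if (p4d_huge(*p4d))
		return P4D_SHIFT;

	pud = pud_offset(p4d, addr);
	if (!pud_present(*pud))
		return 0;
	if (pud_huge(*pud))
		return PUD_SHIFT;

	pmd = pmd_offset(pud, addr);
	if (!pmd_present(*pmd))
		return 0;
	if (pmd_huge(*pmd))
		return PMD_SHIFT;

	pte = pte_offset_map(pmd, addr);
	if (!pte_present(*pte))
		return 0;
	return PAGE_SHIFT;
}

KVM could then derive the mapping size as 1UL << shift, and the blast-radius calculation in add_to_kill() could presumably call the same walk with vma->vm_mm; how exactly that would replace dev_pagemap_mapping_shift() is left open here.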
On Mon, Oct 29, 2018 at 5:29 PM Barret Rhoden <brho@google.com> wrote: > > On 2018-10-29 at 15:25 Dan Williams <dan.j.williams@intel.com> wrote: > > > + /* > > > + * Our caller grabbed the KVM mmu_lock with a successful > > > + * mmu_notifier_retry, so we're safe to walk the page table. > > > + */ > > > + map_sz = pgd_mapping_size(current->mm, hva); > > > + switch (map_sz) { > > > + case PMD_SIZE: > > > + return true; > > > + case P4D_SIZE: > > > + case PUD_SIZE: > > > + printk_once(KERN_INFO "KVM THP promo found a very large page"); > > > > Why not allow PUD_SIZE? The device-dax interface supports PUD mappings. > > The place where I use that helper seemed to care about PMDs (compared > to huge pages larger than PUDs), I think due to THP. Though it also > checks "level == PT_PAGE_TABLE_LEVEL", so it's probably a moot point. > > I can change it from pfn_is_pmd_mapped -> pfn_is_huge_mapped and allow > any huge mapping that is appropriate: so PUD or PMD for DAX, PMD for > non-DAX, IIUC. Yes, THP stops at PMDs, but DAX and hugetlbfs support PUD level mappings. > > > + return false; > > > + } > > > + return false; > > > +} > > > > The above 2 functions are similar to what we need to do for > > determining the blast radius of a memory error, see > > dev_pagemap_mapping_shift() and its usage in add_to_kill(). > > Great. I don't know if I have access in the KVM code to the VMA to use > those functions directly, but I can extract the guts of > dev_pagemap_mapping_shift() or something and put it in mm/util.c. Sounds good. > > > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > > gfn_t *gfnp, kvm_pfn_t *pfnp, > > > int *levelp) > > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > > */ > > > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > > > level == PT_PAGE_TABLE_LEVEL && > > > - PageTransCompoundMap(pfn_to_page(pfn)) && > > > + pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) && > > > > I'm wondering if we're adding an explicit is_zone_device_page() check > > in this path to determine the page mapping size if that can be a > > replacement for the kvm_is_reserved_pfn() check. In other words, the > > goal of fixing up PageReserved() was to preclude the need for DAX-page > > special casing in KVM, but if we already need add some special casing > > for page size determination, might as well bypass the > > kvm_is_reserved_pfn() dependency as well. > > kvm_is_reserved_pfn() is used in some other places, like > kvm_set_pfn_dirty()and kvm_set_pfn_accessed(). Maybe the way those > treat DAX pages matters on a case-by-case basis? > > There are other callers of kvm_is_reserved_pfn() such as > kvm_pfn_to_page() and gfn_to_page(). I'm not familiar (yet) with how > struct pages and DAX work together, and whether or not the callers of > those pfn_to_page() functions have expectations about the 'type' of > struct page they get back. > The property of DAX pages that requires special coordination is the fact that the device hosting the pages can be disabled at will. The get_dev_pagemap() api is the interface to pin a device-pfn so that you can safely perform a pfn_to_page() operation. Have the pages that kvm uses in this path already been pinned by vfio?
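A sketch of the pfn_is_pmd_mapped() -> pfn_is_huge_mapped() rename discussed above might look like the following. This is not the posted patch: the helper is hypothetical, reuses the patch's pgd_mapping_size(), and treats DAX pages as huge when the host maps them with a PMD or a PUD, while non-DAX pages keep the PMD-level PageTransCompoundMap() test.

static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
	struct page *page = pfn_to_page(pfn);
	unsigned long hva;

	/* Non-DAX: THP only goes up to PMD size. */
	if (!is_zone_device_page(page))
		return PageTransCompoundMap(page);

	/* DAX: consult the host page table, as in the posted patch. */
	hva = gfn_to_hva(kvm, gfn);
	if (kvm_is_error_hva(hva))
		return false;

	/* PMD_SIZE, PUD_SIZE and P4D_SIZE all count as "huge" here. */
	return pgd_mapping_size(current->mm, hva) >= PMD_SIZE;
}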
On 2018-10-29 at 20:10 Dan Williams <dan.j.williams@intel.com> wrote: > > > > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > > > gfn_t *gfnp, kvm_pfn_t *pfnp, > > > > int *levelp) > > > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > > > */ > > > > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > > > > level == PT_PAGE_TABLE_LEVEL && > > > > - PageTransCompoundMap(pfn_to_page(pfn)) && > > > > + pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) && > > > > > > I'm wondering if we're adding an explicit is_zone_device_page() check > > > in this path to determine the page mapping size if that can be a > > > replacement for the kvm_is_reserved_pfn() check. In other words, the > > > goal of fixing up PageReserved() was to preclude the need for DAX-page > > > special casing in KVM, but if we already need add some special casing > > > for page size determination, might as well bypass the > > > kvm_is_reserved_pfn() dependency as well. > > > > kvm_is_reserved_pfn() is used in some other places, like > > kvm_set_pfn_dirty()and kvm_set_pfn_accessed(). Maybe the way those > > treat DAX pages matters on a case-by-case basis? > > > > There are other callers of kvm_is_reserved_pfn() such as > > kvm_pfn_to_page() and gfn_to_page(). I'm not familiar (yet) with how > > struct pages and DAX work together, and whether or not the callers of > > those pfn_to_page() functions have expectations about the 'type' of > > struct page they get back. > > > > The property of DAX pages that requires special coordination is the > fact that the device hosting the pages can be disabled at will. The > get_dev_pagemap() api is the interface to pin a device-pfn so that you > can safely perform a pfn_to_page() operation. > > Have the pages that kvm uses in this path already been pinned by vfio? I'm not aware of any explicit pinning, but it might be happening under the hood. These pages are just generic guest RAM, but they are present in a host-side mapping. I ran into this when looking at EPT fault handling. In the code I changed, a physical page was faulted in to the task's page table, then while the kvm->mmu_lock is held, KVM makes an EPT mapping to the same physical page. That mmu_lock seems to prevent any concurrent host-side unmappings; though I'm not familiar with the mm notifier stuff. One usage of kvm_is_reserved_pfn() in KVM code is like this: static struct page *kvm_pfn_to_page(kvm_pfn_t pfn) { if (is_error_noslot_pfn(pfn)) return KVM_ERR_PTR_BAD_PAGE; if (kvm_is_reserved_pfn(pfn)) { WARN_ON(1); return KVM_ERR_PTR_BAD_PAGE; } return pfn_to_page(pfn); } I think there's no guarantee the kvm->mmu_lock is held in the generic case. Here's one case where it wasn't (from walking through the code): handle_exception -handle_ud --kvm_emulate_instruction ---x86_emulate_instruction ----x86_emulate_insn -----writeback ------segmented_cmpxchg -------emulator_cmpxchg_emulated --------kvm_vcpu_gfn_to_page ---------kvm_pfn_to_page There are probably other rules related to gfn_to_page that keep the page alive, maybe just during interrupt/vmexit context? Whatever keeps those pages alive for normal memory might grab that devmap reference under the hood for DAX mappings. Thanks, Barret
On Mon, Oct 29, 2018 at 08:10:52PM -0700, Dan Williams wrote: > On Mon, Oct 29, 2018 at 5:29 PM Barret Rhoden <brho@google.com> wrote: > > > > On 2018-10-29 at 15:25 Dan Williams <dan.j.williams@intel.com> wrote: > > > > + /* > > > > + * Our caller grabbed the KVM mmu_lock with a successful > > > > + * mmu_notifier_retry, so we're safe to walk the page table. > > > > + */ > > > > + map_sz = pgd_mapping_size(current->mm, hva); > > > > + switch (map_sz) { > > > > + case PMD_SIZE: > > > > + return true; > > > > + case P4D_SIZE: > > > > + case PUD_SIZE: > > > > + printk_once(KERN_INFO "KVM THP promo found a very large page"); > > > > > > Why not allow PUD_SIZE? The device-dax interface supports PUD mappings. > > > > The place where I use that helper seemed to care about PMDs (compared > > to huge pages larger than PUDs), I think due to THP. Though it also > > checks "level == PT_PAGE_TABLE_LEVEL", so it's probably a moot point. > > > > I can change it from pfn_is_pmd_mapped -> pfn_is_huge_mapped and allow > > any huge mapping that is appropriate: so PUD or PMD for DAX, PMD for > > non-DAX, IIUC. > > Yes, THP stops at PMDs, but DAX and hugetlbfs support PUD level mappings. > > > > > + return false; > > > > + } > > > > + return false; > > > > +} > > > > > > The above 2 functions are similar to what we need to do for > > > determining the blast radius of a memory error, see > > > dev_pagemap_mapping_shift() and its usage in add_to_kill(). > > > > Great. I don't know if I have access in the KVM code to the VMA to use > > those functions directly, but I can extract the guts of > > dev_pagemap_mapping_shift() or something and put it in mm/util.c. > > Sounds good. > > > > > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > > > gfn_t *gfnp, kvm_pfn_t *pfnp, > > > > int *levelp) > > > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, > > > > */ > > > > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && > > > > level == PT_PAGE_TABLE_LEVEL && > > > > - PageTransCompoundMap(pfn_to_page(pfn)) && > > > > + pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) && > > > > > > I'm wondering if we're adding an explicit is_zone_device_page() check > > > in this path to determine the page mapping size if that can be a > > > replacement for the kvm_is_reserved_pfn() check. In other words, the > > > goal of fixing up PageReserved() was to preclude the need for DAX-page > > > special casing in KVM, but if we already need add some special casing > > > for page size determination, might as well bypass the > > > kvm_is_reserved_pfn() dependency as well. > > > > kvm_is_reserved_pfn() is used in some other places, like > > kvm_set_pfn_dirty()and kvm_set_pfn_accessed(). Maybe the way those > > treat DAX pages matters on a case-by-case basis? > > > > There are other callers of kvm_is_reserved_pfn() such as > > kvm_pfn_to_page() and gfn_to_page(). I'm not familiar (yet) with how > > struct pages and DAX work together, and whether or not the callers of > > those pfn_to_page() functions have expectations about the 'type' of > > struct page they get back. > > > > The property of DAX pages that requires special coordination is the > fact that the device hosting the pages can be disabled at will. The > get_dev_pagemap() api is the interface to pin a device-pfn so that you > can safely perform a pfn_to_page() operation. > > Have the pages that kvm uses in this path already been pinned by vfio? My understanding is, it could be - if there's any device assigned to the VM. 
Otherwise, they will not. B.R. Yu
On 30/10/2018 20:45, Barret Rhoden wrote: > On 2018-10-29 at 20:10 Dan Williams <dan.j.williams@intel.com> wrote: >> The property of DAX pages that requires special coordination is the >> fact that the device hosting the pages can be disabled at will. The >> get_dev_pagemap() api is the interface to pin a device-pfn so that you >> can safely perform a pfn_to_page() operation. >> >> Have the pages that kvm uses in this path already been pinned by vfio? No, VFIO is not involved here. The pages that KVM uses are never pinned. Soon after we grab them and we build KVM's page table, we do put_page in mmu_set_spte (via kvm_release_pfn_clean). From that point on the MMU notifier will take care of invalidating SPT before the page disappears from the mm's page table. > One usage of kvm_is_reserved_pfn() in KVM code is like this: > > static struct page *kvm_pfn_to_page(kvm_pfn_t pfn) > { > if (is_error_noslot_pfn(pfn)) > return KVM_ERR_PTR_BAD_PAGE; > > if (kvm_is_reserved_pfn(pfn)) { > WARN_ON(1); > return KVM_ERR_PTR_BAD_PAGE; > } > > return pfn_to_page(pfn); > } > > I think there's no guarantee the kvm->mmu_lock is held in the generic > case. Indeed, it's not. > There are probably other rules related to gfn_to_page that keep the > page alive, maybe just during interrupt/vmexit context? Whatever keeps > those pages alive for normal memory might grab that devmap reference > under the hood for DAX mappings. Nothing keeps the page alive except for the MMU notifier (which of course cannot run in atomic context, since its callers take the mmap_sem). Paolo
On 29/10/2018 23:25, Dan Williams wrote: > I'm wondering if we're adding an explicit is_zone_device_page() check > in this path to determine the page mapping size if that can be a > replacement for the kvm_is_reserved_pfn() check. In other words, the > goal of fixing up PageReserved() was to preclude the need for DAX-page > special casing in KVM, but if we already need add some special casing > for page size determination, might as well bypass the > kvm_is_reserved_pfn() dependency as well. No, please don't. The kvm_is_reserved_pfn() check is for correctness, the page-size check is for optimization. In theory you could have a ZONE_DEVICE area that is smaller than 2MB and thus does not use huge pages. Paolo
On Wed, Oct 31, 2018 at 1:52 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 29/10/2018 23:25, Dan Williams wrote: > > I'm wondering if we're adding an explicit is_zone_device_page() check > > in this path to determine the page mapping size if that can be a > > replacement for the kvm_is_reserved_pfn() check. In other words, the > > goal of fixing up PageReserved() was to preclude the need for DAX-page > > special casing in KVM, but if we already need add some special casing > > for page size determination, might as well bypass the > > kvm_is_reserved_pfn() dependency as well. > > No, please don't. The kvm_is_reserved_pfn() check is for correctness, > the page-size check is for optimization. In theory you could have a > ZONE_DEVICE area that is smaller than 2MB and thus does not use huge pages. To be clear, I was not suggesting that a ZONE_DEVICE check would be sufficient to determine a 2MB page. I was suggesting that given there is a debate about removing the "reserved" designation for dax pages that debate is moot if kvm is going to add interrogation code to figure out the size of dax mappings.
On 2018-10-31 at 09:49 Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 2018-10-29 at 20:10 Dan Williams <dan.j.williams@intel.com> wrote: > >> The property of DAX pages that requires special coordination is the > >> fact that the device hosting the pages can be disabled at will. The > >> get_dev_pagemap() api is the interface to pin a device-pfn so that you > >> can safely perform a pfn_to_page() operation. > >> > >> Have the pages that kvm uses in this path already been pinned by vfio? > > No, VFIO is not involved here. > > The pages that KVM uses are never pinned. Soon after we grab them and > we build KVM's page table, we do put_page in mmu_set_spte (via > kvm_release_pfn_clean). From that point on the MMU notifier will take > care of invalidating SPT before the page disappears from the mm's page > table. I looked into this a little, and I think it's safe in terms of DAX's get_dev_pagemap() refcounts to have kvm_is_reserved_pfn() treat DAX pages as not reserved. kvm_is_reserved_pfn() is used before a pfn_to_page() call, so the concern was that the pages didn't have a DAX refcount. Many of the times that KVM looks at the PFN, it's holding the KVM mmu_lock, which keeps the pages in the host-side VMA (IIUC). Functions like kvm_set_pfn_dirty() fall into this category. The other times, such as the vmexit path I mentioned before, go through a gfn_to_pfn call. Under the hood, those call one of the get_user_pages() functions during hva_to_pfn, and those do the right thing w.r.t. get_dev_pagemap(). One of the other things I noticed was some places in KVM make a distinction between kvm_is_reserved_pfn and PageReserved: void kvm_set_pfn_dirty(kvm_pfn_t pfn) { if (!kvm_is_reserved_pfn(pfn)) { struct page *page = pfn_to_page(pfn); if (!PageReserved(page)) SetPageDirty(page); } } I think we want to SetPageDirty for DAX, so making PageReserved be true for DAX seems like the way to go, or we'll need more KVM-specific changes. Apologies if this was discussed in the previous thread on this topic and is redundant. Thanks, Barret
On 02/11/2018 21:32, Barret Rhoden wrote: > One of the other things I noticed was some places in KVM make a > distinction between kvm_is_reserved_pfn and PageReserved: > > void kvm_set_pfn_dirty(kvm_pfn_t pfn) > { > if (!kvm_is_reserved_pfn(pfn)) { > struct page *page = pfn_to_page(pfn); > > if (!PageReserved(page)) > SetPageDirty(page); > } > } > > I think we want to SetPageDirty for DAX, so making PageReserved be true > for DAX seems like the way to go, or we'll need more KVM-specific > changes. Apologies if this was discussed in the previous thread on this > topic and is redundant. Isn't it the opposite? We want SetPageDirty, so PageReserved must _not_ be true. Paolo
On 31/10/2018 22:16, Dan Williams wrote: >> No, please don't. The kvm_is_reserved_pfn() check is for correctness, >> the page-size check is for optimization. In theory you could have a >> ZONE_DEVICE area that is smaller than 2MB and thus does not use huge pages. > To be clear, I was not suggesting that a ZONE_DEVICE check would be > sufficient to determine a 2MB page. I was suggesting that given there > is a debate about removing the "reserved" designation for dax pages > that debate is moot if kvm is going to add interrogation code to > figure out the size of dax mappings. Oh indeed. And in general it's okay for me to add more ZONE_DEVICE checks to complement the existing PageReserved checks, if DAX pages are not going to be reserved anymore. In some cases, such as the "if (!PageReserved(page)) SetPageDirty(page)" that Barret noted, it may even be more correct for KVM if DAX pages stop being reserved. All this given my very limited knowledge of MM though... Paolo
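As a concrete illustration of "adding more ZONE_DEVICE checks to complement the existing PageReserved checks", KVM could grow a small helper along the lines sketched below. The name kvm_is_zone_device_pfn() and its use are assumptions for illustration, not part of the posted series.

/*
 * Sketch only: identify DAX/ZONE_DEVICE pfns so call sites such as
 * kvm_set_pfn_dirty() can treat them differently from truly reserved
 * pfns once DAX pages stop being PageReserved.
 */
static bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
{
	return pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn));
}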
On 2018-11-06 at 11:19 Paolo Bonzini <pbonzini@redhat.com> wrote: > > void kvm_set_pfn_dirty(kvm_pfn_t pfn) > > { > > if (!kvm_is_reserved_pfn(pfn)) { > > struct page *page = pfn_to_page(pfn); > > > > if (!PageReserved(page)) > > SetPageDirty(page); > > } > > } > > > > I think we want to SetPageDirty for DAX, so making PageReserved be true > > for DAX seems like the way to go, or we'll need more KVM-specific > > changes. Apologies if this was discussed in the previous thread on this > > topic and is redundant. > > Isn't it the opposite? We want SetPageDirty, so PageReserved must _not_ > be true. You're right on that, I had it backwards. The other DAX work is making it so that DAX pages are not reserved, so the only extra '!' was in my head. Thanks, Barret
On 2018-10-29 at 17:07 Barret Rhoden <brho@google.com> wrote: > Another issue is that kvm_mmu_zap_collapsible_spte() also uses > PageTransCompoundMap() to detect huge pages, but we don't have a way to > get the HVA easily. Can we just aggressively zap DAX pages there? Any thoughts about this? Is there a way to determine the HVA or GFN in this function: static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, struct kvm_rmap_head *rmap_head) { u64 *sptep; struct rmap_iterator iter; int need_tlb_flush = 0; kvm_pfn_t pfn; struct kvm_mmu_page *sp; restart: for_each_rmap_spte(rmap_head, &iter, sptep) { sp = page_header(__pa(sptep)); pfn = spte_to_pfn(*sptep); /* * We cannot do huge page mapping for indirect shadow pages, * which are found on the last rmap (level = 1) when not using * tdp; such shadow pages are synced with the page table in * the guest, and the guest page table is using 4K page size * mapping if the indirect sp has level = 1. */ if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && PageTransCompoundMap(pfn_to_page(pfn))) { pte_list_remove(rmap_head, sptep); need_tlb_flush = 1; goto restart; } } return need_tlb_flush; } If not, I was thinking of changing that loop to always remove PTEs for DAX mappings, with the understanding that they'll get faulted back in later. Ideally, we'd like to check if the page is huge, but DAX can't use the PageTransCompoundMap check. Thanks, Barret
On 06/11/2018 22:05, Barret Rhoden wrote: > On 2018-10-29 at 17:07 Barret Rhoden <brho@google.com> wrote: >> Another issue is that kvm_mmu_zap_collapsible_spte() also uses >> PageTransCompoundMap() to detect huge pages, but we don't have a way to >> get the HVA easily. Can we just aggressively zap DAX pages there? > > Any thoughts about this? Is there a way to determine the HVA or GFN in > this function: Yes, iter.gfn is the gfn inside the loop and iter.level is the level (1=PTE, 2=PDE, ...). iter.level of course is unusable here, similar to *levelp in transparent_hugepage_adjust, but you can use iter.gfn and gfn_to_hva. Paolo > static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, > struct kvm_rmap_head *rmap_head) > { > u64 *sptep; > struct rmap_iterator iter; > int need_tlb_flush = 0; > kvm_pfn_t pfn; > struct kvm_mmu_page *sp; > > restart: > for_each_rmap_spte(rmap_head, &iter, sptep) { > sp = page_header(__pa(sptep)); > pfn = spte_to_pfn(*sptep); > > /* > * We cannot do huge page mapping for indirect shadow pages, > * which are found on the last rmap (level = 1) when not using > * tdp; such shadow pages are synced with the page table in > * the guest, and the guest page table is using 4K page size > * mapping if the indirect sp has level = 1. > */ > if (sp->role.direct && > !kvm_is_reserved_pfn(pfn) && > PageTransCompoundMap(pfn_to_page(pfn))) { > pte_list_remove(rmap_head, sptep); > need_tlb_flush = 1; > goto restart; > } > } > > return need_tlb_flush; > } > > If not, I was thinking of changing that loop to always remove PTEs for > DAX mappings, with the understanding that they'll get faulted back in > later. Ideally, we'd like to check if the page is huge, but DAX can't > use the PageTransCompoundMap check. > > Thanks, > > Barret > > >
On 2018-11-06 at 22:16 Paolo Bonzini <pbonzini@redhat.com> wrote: > Yes, iter.gfn is the gfn inside the loop and iter.level is the level > (1=PTE, 2=PDE, ...). iter.level of course is unusable here, similar to > *levelp in transparent_hugepage_adjust, but you can use iter.gfn and > gfn_to_hva. Great, thanks!
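Pulling Paolo's pointer together with the earlier pfn_is_huge_mapped() idea, the zap path might end up looking roughly like the sketch below. This is not code from the posted series: the gfn parameter is assumed to be plumbed in from the slot rmap walk (per the iter.gfn note above), and pfn_is_huge_mapped() is the hypothetical helper sketched earlier, which for DAX pages consults the host page table via gfn_to_hva().

static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, gfn_t gfn,
					 struct kvm_rmap_head *rmap_head)
{
	u64 *sptep;
	struct rmap_iterator iter;
	int need_tlb_flush = 0;
	kvm_pfn_t pfn;
	struct kvm_mmu_page *sp;

restart:
	for_each_rmap_spte(rmap_head, &iter, sptep) {
		sp = page_header(__pa(sptep));
		pfn = spte_to_pfn(*sptep);

		/*
		 * As before, indirect shadow pages (level = 1 without tdp)
		 * cannot use huge mappings.  The only change is using the
		 * host mapping size instead of PageTransCompoundMap(), so
		 * huge-mapped DAX pages are also zapped here and can be
		 * re-faulted with a large SPTE.
		 */
		if (sp->role.direct &&
		    !kvm_is_reserved_pfn(pfn) &&
		    pfn_is_huge_mapped(kvm, gfn, pfn)) {
			pte_list_remove(rmap_head, sptep);
			need_tlb_flush = 1;
			goto restart;
		}
	}

	return need_tlb_flush;
}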
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cf5f572f2305..9f3e0f83a2dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 	return -EFAULT;
 }
 
+static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		return 0;
+
+	p4d = p4d_offset(pgd, addr);
+	if (!p4d_present(*p4d))
+		return 0;
+	if (p4d_huge(*p4d))
+		return P4D_SIZE;
+
+	pud = pud_offset(p4d, addr);
+	if (!pud_present(*pud))
+		return 0;
+	if (pud_huge(*pud))
+		return PUD_SIZE;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		return 0;
+	if (pmd_huge(*pmd))
+		return PMD_SIZE;
+
+	pte = pte_offset_map(pmd, addr);
+	if (!pte_present(*pte))
+		return 0;
+	return PAGE_SIZE;
+}
+
+static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	unsigned long hva, map_sz;
+
+	if (!is_zone_device_page(page))
+		return PageTransCompoundMap(page);
+
+	/*
+	 * DAX pages do not use compound pages. The page should have already
+	 * been mapped into the host-side page table during try_async_pf(), so
+	 * we can check the page tables directly.
+	 */
+	hva = gfn_to_hva(kvm, gfn);
+	if (kvm_is_error_hva(hva))
+		return false;
+
+	/*
+	 * Our caller grabbed the KVM mmu_lock with a successful
+	 * mmu_notifier_retry, so we're safe to walk the page table.
+	 */
+	map_sz = pgd_mapping_size(current->mm, hva);
+	switch (map_sz) {
+	case PMD_SIZE:
+		return true;
+	case P4D_SIZE:
+	case PUD_SIZE:
+		printk_once(KERN_INFO "KVM THP promo found a very large page");
+		return false;
+	}
+	return false;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 					gfn_t *gfnp, kvm_pfn_t *pfnp,
 					int *levelp)
@@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
 	    level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn)) &&
+	    pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&
 	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
 		unsigned long mask;
 		/*
This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

DAX pages are not PageTransCompound. The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not. For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.

For DAX, we can check the page table itself. Actually, we might always
be able to walk the page table, even for PageTransCompound pages, but
it's probably a little slower.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock (grabbed before checking
the mmu seq). Based on the other comments about not worrying about a
pmd split, we might be able to safely walk the page table without
holding the mm sem.

This patch relies on kvm_is_reserved_pfn() being false for DAX pages,
which I've hacked up for testing this code. That change should
eventually happen:

https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/

Another issue is that kvm_mmu_zap_collapsible_spte() also uses
PageTransCompoundMap() to detect huge pages, but we don't have a way to
get the HVA easily. Can we just aggressively zap DAX pages there?

Alternatively, is there a better way to track at the struct page level
whether or not a page is huge-mapped? Maybe the DAX huge pages mark
themselves as TransCompound or something similar, and we don't need to
special case DAX/ZONE_DEVICE pages.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 arch/x86/kvm/mmu.c | 71 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)