Message ID | 20191024120938.11237-4-david@redhat.com (mailing list archive)
State | New, archived
Series | mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote: > > Right now, ZONE_DEVICE memory is always set PG_reserved. We want to > change that. > > KVM has this weird use case that you can map anything from /dev/mem > into the guest. pfn_valid() is not a reliable check whether the memmap > was initialized and can be touched. pfn_to_online_page() makes sure > that we have an initialized memmap (and don't have ZONE_DEVICE memory). > > Rewrite kvm_is_reserved_pfn() to make sure the function produces the > same result once we stop setting ZONE_DEVICE pages PG_reserved. > > Cc: Paolo Bonzini <pbonzini@redhat.com> > Cc: "Radim Krčmář" <rkrcmar@redhat.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: KarimAllah Ahmed <karahmed@amazon.de> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > virt/kvm/kvm_main.c | 10 ++++++++-- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index e9eb666eb6e8..9d18cc67d124 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, > > bool kvm_is_reserved_pfn(kvm_pfn_t pfn) > { > - if (pfn_valid(pfn)) > - return PageReserved(pfn_to_page(pfn)); > + struct page *page = pfn_to_online_page(pfn); > > + /* > + * We treat any pages that are not online (not managed by the buddy) > + * as reserved - this includes ZONE_DEVICE pages and pages without > + * a memmap (e.g., mapped via /dev/mem). > + */ > + if (page) > + return PageReserved(page); > return true; > } So after this all the pfn_valid() usage in kvm_main.c is replaced with pfn_to_online_page()? Looks correct to me. However, I'm worried that kvm is taking a reference on ZONE_DEVICE pages through some other path resulting in this: https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/ I'll see if this patch set modulates or maintains that failure mode.
On 05.11.19 05:38, Dan Williams wrote: > On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote: >> >> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to >> change that. >> >> KVM has this weird use case that you can map anything from /dev/mem >> into the guest. pfn_valid() is not a reliable check whether the memmap >> was initialized and can be touched. pfn_to_online_page() makes sure >> that we have an initialized memmap (and don't have ZONE_DEVICE memory). >> >> Rewrite kvm_is_reserved_pfn() to make sure the function produces the >> same result once we stop setting ZONE_DEVICE pages PG_reserved. >> >> Cc: Paolo Bonzini <pbonzini@redhat.com> >> Cc: "Radim Krčmář" <rkrcmar@redhat.com> >> Cc: Michal Hocko <mhocko@kernel.org> >> Cc: Dan Williams <dan.j.williams@intel.com> >> Cc: KarimAllah Ahmed <karahmed@amazon.de> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> virt/kvm/kvm_main.c | 10 ++++++++-- >> 1 file changed, 8 insertions(+), 2 deletions(-) >> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >> index e9eb666eb6e8..9d18cc67d124 100644 >> --- a/virt/kvm/kvm_main.c >> +++ b/virt/kvm/kvm_main.c >> @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, >> >> bool kvm_is_reserved_pfn(kvm_pfn_t pfn) >> { >> - if (pfn_valid(pfn)) >> - return PageReserved(pfn_to_page(pfn)); >> + struct page *page = pfn_to_online_page(pfn); >> >> + /* >> + * We treat any pages that are not online (not managed by the buddy) >> + * as reserved - this includes ZONE_DEVICE pages and pages without >> + * a memmap (e.g., mapped via /dev/mem). >> + */ >> + if (page) >> + return PageReserved(page); >> return true; >> } > > So after this all the pfn_valid() usage in kvm_main.c is replaced with > pfn_to_online_page()? Looks correct to me. > > However, I'm worried that kvm is taking reference on ZONE_DEVICE pages > through some other path resulting in this: > > https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/ > > I'll see if this patch set modulates or maintains that failure mode. > I'd assume that the behavior is unchanged. I think we get a reference to these ZONE_DEVICE pages via __get_user_pages_fast() and friends in hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c.
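For reference, a simplified sketch of the fast path David refers to, abridged from virt/kvm/kvm_main.c of this period (illustrative; error handling and the slow path are omitted, and details may differ from the tree under discussion). The gup call takes a page reference on success, including for ZONE_DEVICE pages:

static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
                            bool *writable, kvm_pfn_t *pfn)
{
        struct page *page[1];

        /* __get_user_pages_fast() elevates the refcount of any page it returns */
        if (__get_user_pages_fast(addr, 1, 1, page) == 1) {
                *pfn = page_to_pfn(page[0]);
                if (writable)
                        *writable = true;
                return true;
        }
        return false;
}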
On 05.11.19 10:17, David Hildenbrand wrote: > On 05.11.19 05:38, Dan Williams wrote: >> On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to >>> change that. >>> >>> KVM has this weird use case that you can map anything from /dev/mem >>> into the guest. pfn_valid() is not a reliable check whether the memmap >>> was initialized and can be touched. pfn_to_online_page() makes sure >>> that we have an initialized memmap (and don't have ZONE_DEVICE memory). >>> >>> Rewrite kvm_is_reserved_pfn() to make sure the function produces the >>> same result once we stop setting ZONE_DEVICE pages PG_reserved. >>> >>> Cc: Paolo Bonzini <pbonzini@redhat.com> >>> Cc: "Radim Krčmář" <rkrcmar@redhat.com> >>> Cc: Michal Hocko <mhocko@kernel.org> >>> Cc: Dan Williams <dan.j.williams@intel.com> >>> Cc: KarimAllah Ahmed <karahmed@amazon.de> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> --- >>> virt/kvm/kvm_main.c | 10 ++++++++-- >>> 1 file changed, 8 insertions(+), 2 deletions(-) >>> >>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >>> index e9eb666eb6e8..9d18cc67d124 100644 >>> --- a/virt/kvm/kvm_main.c >>> +++ b/virt/kvm/kvm_main.c >>> @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, >>> >>> bool kvm_is_reserved_pfn(kvm_pfn_t pfn) >>> { >>> - if (pfn_valid(pfn)) >>> - return PageReserved(pfn_to_page(pfn)); >>> + struct page *page = pfn_to_online_page(pfn); >>> >>> + /* >>> + * We treat any pages that are not online (not managed by the buddy) >>> + * as reserved - this includes ZONE_DEVICE pages and pages without >>> + * a memmap (e.g., mapped via /dev/mem). >>> + */ >>> + if (page) >>> + return PageReserved(page); >>> return true; >>> } >> >> So after this all the pfn_valid() usage in kvm_main.c is replaced with >> pfn_to_online_page()? Looks correct to me. >> >> However, I'm worried that kvm is taking reference on ZONE_DEVICE pages >> through some other path resulting in this: >> >> https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/ >> >> I'll see if this patch set modulates or maintains that failure mode. >> > > I'd assume that the behavior is unchanged. Ithink we get a reference to > these ZONE_DEVICE pages via __get_user_pages_fast() and friends in > hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c > I think I know what's going wrong: Pages that are pinned via gfn_to_pfn() and friends take a reference; however, they are often released via kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()... E.g., in arch/x86/kvm/x86.c:reexecute_instruction() ... pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); ... kvm_release_pfn_clean(pfn); void kvm_release_pfn_clean(kvm_pfn_t pfn) { if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) put_page(pfn_to_page(pfn)); } This function makes perfect sense as the counterpart for kvm_get_pfn(): void kvm_get_pfn(kvm_pfn_t pfn) { if (!kvm_is_reserved_pfn(pfn)) get_page(pfn_to_page(pfn)); } As all ZONE_DEVICE pages are currently reserved, pages pinned via gfn_to_pfn() and friends will often not see a put_page() AFAIKS. Now, my patch does not change that, the result of kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would probably be a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and friends, after you successfully pinned the pages.
(not sure if that's the right thing to do but you're the expert) b) To not use kvm_release_pfn_clean() and friends on pages that were definitely pinned.
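To make the imbalance concrete, the sequence described above plays out as follows for a ZONE_DEVICE pfn (an illustrative walkthrough restating the flow above, not a literal kernel excerpt):

pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));  /* gup(): page refcount++ */
/* ... KVM uses the page ... */
kvm_release_pfn_clean(pfn);  /*
                              * kvm_is_reserved_pfn() returns true for
                              * ZONE_DEVICE, so put_page() is skipped and
                              * the gup() reference is leaked.
                              */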
On 05.11.19 10:49, David Hildenbrand wrote: > On 05.11.19 10:17, David Hildenbrand wrote: >> On 05.11.19 05:38, Dan Williams wrote: >>> On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to >>>> change that. >>>> >>>> KVM has this weird use case that you can map anything from /dev/mem >>>> into the guest. pfn_valid() is not a reliable check whether the memmap >>>> was initialized and can be touched. pfn_to_online_page() makes sure >>>> that we have an initialized memmap (and don't have ZONE_DEVICE memory). >>>> >>>> Rewrite kvm_is_reserved_pfn() to make sure the function produces the >>>> same result once we stop setting ZONE_DEVICE pages PG_reserved. >>>> >>>> Cc: Paolo Bonzini <pbonzini@redhat.com> >>>> Cc: "Radim Krčmář" <rkrcmar@redhat.com> >>>> Cc: Michal Hocko <mhocko@kernel.org> >>>> Cc: Dan Williams <dan.j.williams@intel.com> >>>> Cc: KarimAllah Ahmed <karahmed@amazon.de> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>> --- >>>> virt/kvm/kvm_main.c | 10 ++++++++-- >>>> 1 file changed, 8 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >>>> index e9eb666eb6e8..9d18cc67d124 100644 >>>> --- a/virt/kvm/kvm_main.c >>>> +++ b/virt/kvm/kvm_main.c >>>> @@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, >>>> >>>> bool kvm_is_reserved_pfn(kvm_pfn_t pfn) >>>> { >>>> - if (pfn_valid(pfn)) >>>> - return PageReserved(pfn_to_page(pfn)); >>>> + struct page *page = pfn_to_online_page(pfn); >>>> >>>> + /* >>>> + * We treat any pages that are not online (not managed by the buddy) >>>> + * as reserved - this includes ZONE_DEVICE pages and pages without >>>> + * a memmap (e.g., mapped via /dev/mem). >>>> + */ >>>> + if (page) >>>> + return PageReserved(page); >>>> return true; >>>> } >>> >>> So after this all the pfn_valid() usage in kvm_main.c is replaced with >>> pfn_to_online_page()? Looks correct to me. >>> >>> However, I'm worried that kvm is taking reference on ZONE_DEVICE pages >>> through some other path resulting in this: >>> >>> https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/ >>> >>> I'll see if this patch set modulates or maintains that failure mode. >>> >> >> I'd assume that the behavior is unchanged. Ithink we get a reference to >> these ZONE_DEVICE pages via __get_user_pages_fast() and friends in >> hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c >> > > I think I know what's going wrong: > > Pages that are pinned via gfn_to_pfn() and friends take a references, > however are often released via > kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()... > > > E.g., in arch/x86/kvm/x86.c:reexecute_instruction() > > ... > pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); > ... > kvm_release_pfn_clean(pfn); > > > > void kvm_release_pfn_clean(kvm_pfn_t pfn) > { > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) > put_page(pfn_to_page(pfn)); > } > > This function makes perfect sense as the counterpart for kvm_get_pfn(): > > void kvm_get_pfn(kvm_pfn_t pfn) > { > if (!kvm_is_reserved_pfn(pfn)) > get_page(pfn_to_page(pfn)); > } > > > As all ZONE_DEVICE pages are currently reserved, pages pinned via > gfn_to_pfn() and friends will often not see a put_page() AFAIKS. > > Now, my patch does not change that, the result of > kvm_is_reserved_pfn(pfn) will be unchanged. 
> A proper fix for that would > probably be > > a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and > friends, after you successfully pinned the pages. (not sure if that's > the right thing to do but you're the expert) > > b) To not use kvm_release_pfn_clean() and friends on pages that were > definitely pinned. > (talking to myself, sorry) Thinking again, dropping this patch from this series could effectively also fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on ZONE_DEVICE pages. But it would have side effects that might not be desired. E.g.,: 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the right thing to do). 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be okay)
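For context on side effect 1: kvm_pfn_to_page() currently rejects reserved pfns (sketch abridged from virt/kvm/kvm_main.c of this period; details may differ), so ZONE_DEVICE pfns would start passing through once they are no longer treated as reserved:

static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)
{
        if (is_error_noslot_pfn(pfn))
                return KVM_ERR_PTR_BAD_PAGE;

        /* today this also rejects ZONE_DEVICE pfns, as they are PG_reserved */
        if (kvm_is_reserved_pfn(pfn)) {
                WARN_ON(1);
                return KVM_ERR_PTR_BAD_PAGE;
        }

        return pfn_to_page(pfn);
}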
On Tue, Nov 05, 2019 at 11:02:46AM +0100, David Hildenbrand wrote: > On 05.11.19 10:49, David Hildenbrand wrote: > >On 05.11.19 10:17, David Hildenbrand wrote: > >>On 05.11.19 05:38, Dan Williams wrote: > >>>On Thu, Oct 24, 2019 at 5:11 AM David Hildenbrand <david@redhat.com> wrote: > >>>> > >>>>Right now, ZONE_DEVICE memory is always set PG_reserved. We want to > >>>>change that. > >>>> > >>>>KVM has this weird use case that you can map anything from /dev/mem > >>>>into the guest. pfn_valid() is not a reliable check whether the memmap > >>>>was initialized and can be touched. pfn_to_online_page() makes sure > >>>>that we have an initialized memmap (and don't have ZONE_DEVICE memory). > >>>> > >>>>Rewrite kvm_is_reserved_pfn() to make sure the function produces the > >>>>same result once we stop setting ZONE_DEVICE pages PG_reserved. > >>>> > >>>>Cc: Paolo Bonzini <pbonzini@redhat.com> > >>>>Cc: "Radim Krčmář" <rkrcmar@redhat.com> > >>>>Cc: Michal Hocko <mhocko@kernel.org> > >>>>Cc: Dan Williams <dan.j.williams@intel.com> > >>>>Cc: KarimAllah Ahmed <karahmed@amazon.de> > >>>>Signed-off-by: David Hildenbrand <david@redhat.com> > >>>>--- > >>>> virt/kvm/kvm_main.c | 10 ++++++++-- > >>>> 1 file changed, 8 insertions(+), 2 deletions(-) > >>>> > >>>>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > >>>>index e9eb666eb6e8..9d18cc67d124 100644 > >>>>--- a/virt/kvm/kvm_main.c > >>>>+++ b/virt/kvm/kvm_main.c > >>>>@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, > >>>> > >>>> bool kvm_is_reserved_pfn(kvm_pfn_t pfn) > >>>> { > >>>>- if (pfn_valid(pfn)) > >>>>- return PageReserved(pfn_to_page(pfn)); > >>>>+ struct page *page = pfn_to_online_page(pfn); > >>>> > >>>>+ /* > >>>>+ * We treat any pages that are not online (not managed by the buddy) > >>>>+ * as reserved - this includes ZONE_DEVICE pages and pages without > >>>>+ * a memmap (e.g., mapped via /dev/mem). > >>>>+ */ > >>>>+ if (page) > >>>>+ return PageReserved(page); > >>>> return true; > >>>> } > >>> > >>>So after this all the pfn_valid() usage in kvm_main.c is replaced with > >>>pfn_to_online_page()? Looks correct to me. > >>> > >>>However, I'm worried that kvm is taking reference on ZONE_DEVICE pages > >>>through some other path resulting in this: > >>> > >>> https://lore.kernel.org/linux-nvdimm/20190919154708.GA24650@angband.pl/ > >>> > >>>I'll see if this patch set modulates or maintains that failure mode. > >>> > >> > >>I'd assume that the behavior is unchanged. Ithink we get a reference to > >>these ZONE_DEVICE pages via __get_user_pages_fast() and friends in > >>hva_to_pfn_fast() and friends in virt/kvm/kvm_main.c > >> > > > >I think I know what's going wrong: > > > >Pages that are pinned via gfn_to_pfn() and friends take a references, > >however are often released via > >kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()... > > > > > >E.g., in arch/x86/kvm/x86.c:reexecute_instruction() > > > >... > >pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); > >... 
> >kvm_release_pfn_clean(pfn); > > > > > > > >void kvm_release_pfn_clean(kvm_pfn_t pfn) > >{ > > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) > > put_page(pfn_to_page(pfn)); > >} > > > >This function makes perfect sense as the counterpart for kvm_get_pfn(): > > > >void kvm_get_pfn(kvm_pfn_t pfn) > >{ > > if (!kvm_is_reserved_pfn(pfn)) > > get_page(pfn_to_page(pfn)); > >} > > > > > >As all ZONE_DEVICE pages are currently reserved, pages pinned via > >gfn_to_pfn() and friends will often not see a put_page() AFAIKS. Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a KVM bug. > >Now, my patch does not change that, the result of > >kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would > >probably be > > > >a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and > >friends, after you successfully pinned the pages. (not sure if that's > >the right thing to do but you're the expert) > > > >b) To not use kvm_release_pfn_clean() and friends on pages that were > >definitely pinned. This is already KVM's intent, i.e. the purpose of the PageReserved() check is simply to avoid putting a non-existent reference. The problem is that KVM assumes pages with PG_reserved set are never pinned, which AFAICT was true when the code was first added. > (talking to myself, sorry) > > Thinking again, dropping this patch from this series could effectively also > fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a > put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on > ZONDE_DEVICE pages. Yeah, this appears to be the correct fix. > But it would have side effects that might not be desired. E.g.,: > > 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the > right thing to do). This should be ok, at least on x86. There are only three users of kvm_pfn_to_page(). Two of those are on allocations that are controlled by KVM and are guaranteed to be vanilla MAP_ANONYMOUS. The third is on guest memory when running a nested guest, and in that case supporting ZONE_DEVICE memory is desirable, i.e. KVM should play nice with a guest that is backed by ZONE_DEVICE memory. > 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be > okay) This is ok from a KVM perspective. The scarier code (for me) is transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte(), as I don't at all understand the interaction between THP and _PAGE_DEVMAP.
>>> I think I know what's going wrong: >>> >>> Pages that are pinned via gfn_to_pfn() and friends take a references, >>> however are often released via >>> kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()... >>> >>> >>> E.g., in arch/x86/kvm/x86.c:reexecute_instruction() >>> >>> ... >>> pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); >>> ... >>> kvm_release_pfn_clean(pfn); >>> >>> >>> >>> void kvm_release_pfn_clean(kvm_pfn_t pfn) >>> { >>> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) >>> put_page(pfn_to_page(pfn)); >>> } >>> >>> This function makes perfect sense as the counterpart for kvm_get_pfn(): >>> >>> void kvm_get_pfn(kvm_pfn_t pfn) >>> { >>> if (!kvm_is_reserved_pfn(pfn)) >>> get_page(pfn_to_page(pfn)); >>> } >>> >>> >>> As all ZONE_DEVICE pages are currently reserved, pages pinned via >>> gfn_to_pfn() and friends will often not see a put_page() AFAIKS. > > Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a > KVM bug. Yes, it does take a reference AFAIKs. E.g., mm/gup.c:gup_pte_range(): ... if (pte_devmap(pte)) { if (unlikely(flags & FOLL_LONGTERM)) goto pte_unmap; pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); if (unlikely(!pgmap)) { undo_dev_pagemap(nr, nr_start, pages); goto pte_unmap; } } else if (pte_special(pte)) goto pte_unmap; VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); head = try_get_compound_head(page, 1); try_get_compound_head() will increment the reference count. > >>> Now, my patch does not change that, the result of >>> kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would >>> probably be >>> >>> a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and >>> friends, after you successfully pinned the pages. (not sure if that's >>> the right thing to do but you're the expert) >>> >>> b) To not use kvm_release_pfn_clean() and friends on pages that were >>> definitely pinned. > > This is already KVM's intent, i.e. the purpose of the PageReserved() check > is simply to avoid putting a non-existent reference. The problem is that > KVM assumes pages with PG_reserved set are never pinned, which AFAICT was > true when the code was first added. > >> (talking to myself, sorry) >> >> Thinking again, dropping this patch from this series could effectively also >> fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a >> put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on >> ZONDE_DEVICE pages. > > Yeah, this appears to be the correct fix. > >> But it would have side effects that might not be desired. E.g.,: >> >> 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the >> right thing to do). > > This should be ok, at least on x86. There are only three users of > kvm_pfn_to_page(). Two of those are on allocations that are controlled by > KVM and are guaranteed to be vanilla MAP_ANONYMOUS. The third is on guest > memory when running a nested guest, and in that case supporting ZONE_DEVICE > memory is desirable, i.e. KVM should play nice with a guest that is backed > by ZONE_DEVICE memory. > >> 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be >> okay) > > This is ok from a KVM perspective. What about void kvm_get_pfn(kvm_pfn_t pfn) { if (!kvm_is_reserved_pfn(pfn)) get_page(pfn_to_page(pfn)); } Is a pure get_page() sufficient in case of ZONE_DEVICE? 
(asking because the live references obtained via get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() somewhat confuse me :) ) > > The scarier code (for me) is transparent_hugepage_adjust() and > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > interaction between THP and _PAGE_DEVMAP. The x86 KVM MMU code is one of the ugliest code I know (sorry, but it had to be said :/ ). Luckily, this should be independent of the PG_reserved thingy AFAIKs.
On Tue, Nov 05, 2019 at 09:30:53PM +0100, David Hildenbrand wrote: > >>>I think I know what's going wrong: > >>> > >>>Pages that are pinned via gfn_to_pfn() and friends take a references, > >>>however are often released via > >>>kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()... > >>> > >>> > >>>E.g., in arch/x86/kvm/x86.c:reexecute_instruction() > >>> > >>>... > >>>pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); > >>>... > >>>kvm_release_pfn_clean(pfn); > >>> > >>> > >>> > >>>void kvm_release_pfn_clean(kvm_pfn_t pfn) > >>>{ > >>> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) > >>> put_page(pfn_to_page(pfn)); > >>>} > >>> > >>>This function makes perfect sense as the counterpart for kvm_get_pfn(): > >>> > >>>void kvm_get_pfn(kvm_pfn_t pfn) > >>>{ > >>> if (!kvm_is_reserved_pfn(pfn)) > >>> get_page(pfn_to_page(pfn)); > >>>} > >>> > >>> > >>>As all ZONE_DEVICE pages are currently reserved, pages pinned via > >>>gfn_to_pfn() and friends will often not see a put_page() AFAIKS. > > > >Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a > >KVM bug. > > Yes, it does take a reference AFAIKs. E.g., > > mm/gup.c:gup_pte_range(): > ... > if (pte_devmap(pte)) { > if (unlikely(flags & FOLL_LONGTERM)) > goto pte_unmap; > > pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); > if (unlikely(!pgmap)) { > undo_dev_pagemap(nr, nr_start, pages); > goto pte_unmap; > } > } else if (pte_special(pte)) > goto pte_unmap; > > VM_BUG_ON(!pfn_valid(pte_pfn(pte))); > page = pte_page(pte); > > head = try_get_compound_head(page, 1); > > try_get_compound_head() will increment the reference count. Doh, I looked right at that code and somehow didn't connect the dots. Thanks! > >>>Now, my patch does not change that, the result of > >>>kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would > >>>probably be > >>> > >>>a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and > >>>friends, after you successfully pinned the pages. (not sure if that's > >>>the right thing to do but you're the expert) > >>> > >>>b) To not use kvm_release_pfn_clean() and friends on pages that were > >>>definitely pinned. > > > >This is already KVM's intent, i.e. the purpose of the PageReserved() check > >is simply to avoid putting a non-existent reference. The problem is that > >KVM assumes pages with PG_reserved set are never pinned, which AFAICT was > >true when the code was first added. > > > >>(talking to myself, sorry) > >> > >>Thinking again, dropping this patch from this series could effectively also > >>fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a > >>put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on > >>ZONDE_DEVICE pages. > > > >Yeah, this appears to be the correct fix. > > > >>But it would have side effects that might not be desired. E.g.,: > >> > >>1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the > >>right thing to do). > > > >This should be ok, at least on x86. There are only three users of > >kvm_pfn_to_page(). Two of those are on allocations that are controlled by > >KVM and are guaranteed to be vanilla MAP_ANONYMOUS. The third is on guest > >memory when running a nested guest, and in that case supporting ZONE_DEVICE > >memory is desirable, i.e. KVM should play nice with a guest that is backed > >by ZONE_DEVICE memory. > > > >>2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be > >>okay) > > > >This is ok from a KVM perspective. 
> > What about > > void kvm_get_pfn(kvm_pfn_t pfn) > { > if (!kvm_is_reserved_pfn(pfn)) > get_page(pfn_to_page(pfn)); > } > > Is a pure get_page() sufficient in case of ZONE_DEVICE? > (asking because of the live references obtained via > get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() somewhat > confuse me :) ) This ties into my concern with thp_adjust(). On x86, kvm_get_pfn() is only used in two flows, to manually get a ref for VM_IO/VM_PFNMAP pages and to switch the ref when mapping a non-hugetlbfs compound page, i.e. a THP. I assume VM_IO and PFNMAP can't apply to ZONE_DEVICE pages. In the thp_adjust() case, when a THP is encountered and the original PFN is for a non-PG_head page, KVM transfers the reference to the associated PG_head page[*] and maps the associated 2mb chunk/page. This is where KVM uses kvm_get_pfn() and could run afoul of the get_dev_pagemap() refcounts. [*] Technically I don't think it's guaranteed to be a PG_head, e.g. if the THP is a 1gb page, as KVM currently only maps THP as 2mb pages. But the idea is the same, transfer the refcount to the PFN that's actually going into KVM's page tables. > > > >The scarier code (for me) is transparent_hugepage_adjust() and > >kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > >interaction between THP and _PAGE_DEVMAP. > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it had to > be said :/ ). Luckily, this should be independent of the > PG_reserved thingy AFAIKs.
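An abridged sketch of the reference transfer Sean describes, based on arch/x86/kvm/mmu.c:transparent_hugepage_adjust() of this period (illustrative; details may differ): the reference on the originally faulted pfn is dropped and a plain get_page() is taken on the hugepage-aligned pfn, with no get_dev_pagemap() involvement:

if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
    PageTransCompoundMap(pfn_to_page(pfn))) {
        mask = KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL) - 1;
        if (pfn & mask) {
                kvm_release_pfn_clean(pfn);     /* drop the ref gup() took */
                pfn &= ~mask;                   /* move to the aligned pfn */
                kvm_get_pfn(pfn);               /* plain get_page() only */
                *pfnp = pfn;
        }
}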
On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote: > > >>> I think I know what's going wrong: > >>> > >>> Pages that are pinned via gfn_to_pfn() and friends take a references, > >>> however are often released via > >>> kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()... > >>> > >>> > >>> E.g., in arch/x86/kvm/x86.c:reexecute_instruction() > >>> > >>> ... > >>> pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); > >>> ... > >>> kvm_release_pfn_clean(pfn); > >>> > >>> > >>> > >>> void kvm_release_pfn_clean(kvm_pfn_t pfn) > >>> { > >>> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) > >>> put_page(pfn_to_page(pfn)); > >>> } > >>> > >>> This function makes perfect sense as the counterpart for kvm_get_pfn(): > >>> > >>> void kvm_get_pfn(kvm_pfn_t pfn) > >>> { > >>> if (!kvm_is_reserved_pfn(pfn)) > >>> get_page(pfn_to_page(pfn)); > >>> } > >>> > >>> > >>> As all ZONE_DEVICE pages are currently reserved, pages pinned via > >>> gfn_to_pfn() and friends will often not see a put_page() AFAIKS. > > > > Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a > > KVM bug. > > Yes, it does take a reference AFAIKs. E.g., > > mm/gup.c:gup_pte_range(): > ... > if (pte_devmap(pte)) { > if (unlikely(flags & FOLL_LONGTERM)) > goto pte_unmap; > > pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); > if (unlikely(!pgmap)) { > undo_dev_pagemap(nr, nr_start, pages); > goto pte_unmap; > } > } else if (pte_special(pte)) > goto pte_unmap; > > VM_BUG_ON(!pfn_valid(pte_pfn(pte))); > page = pte_page(pte); > > head = try_get_compound_head(page, 1); > > try_get_compound_head() will increment the reference count. > > > > >>> Now, my patch does not change that, the result of > >>> kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would > >>> probably be > >>> > >>> a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and > >>> friends, after you successfully pinned the pages. (not sure if that's > >>> the right thing to do but you're the expert) > >>> > >>> b) To not use kvm_release_pfn_clean() and friends on pages that were > >>> definitely pinned. > > > > This is already KVM's intent, i.e. the purpose of the PageReserved() check > > is simply to avoid putting a non-existent reference. The problem is that > > KVM assumes pages with PG_reserved set are never pinned, which AFAICT was > > true when the code was first added. > > > >> (talking to myself, sorry) > >> > >> Thinking again, dropping this patch from this series could effectively also > >> fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a > >> put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on > >> ZONDE_DEVICE pages. > > > > Yeah, this appears to be the correct fix. > > > >> But it would have side effects that might not be desired. E.g.,: > >> > >> 1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the > >> right thing to do). > > > > This should be ok, at least on x86. There are only three users of > > kvm_pfn_to_page(). Two of those are on allocations that are controlled by > > KVM and are guaranteed to be vanilla MAP_ANONYMOUS. The third is on guest > > memory when running a nested guest, and in that case supporting ZONE_DEVICE > > memory is desirable, i.e. KVM should play nice with a guest that is backed > > by ZONE_DEVICE memory. > > > >> 2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be > >> okay) > > > > This is ok from a KVM perspective. 
> > What about > > void kvm_get_pfn(kvm_pfn_t pfn) > { > if (!kvm_is_reserved_pfn(pfn)) > get_page(pfn_to_page(pfn)); > } > > Is a pure get_page() sufficient in case of ZONE_DEVICE? > (asking because of the live references obtained via > get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() > somewhat confuse me :) ) It is not sufficient. PTE_DEVMAP is there to tell the gup path "be careful, this pfn has device-lifetime, make sure the device is pinned and not actively in the process of dying before pinning any pages associated with this device". However, if kvm_get_pfn() is honoring kvm_is_reserved_pfn() that returns true for ZONE_DEVICE, I'm missing how it is messing up the reference counts. > > The scarier code (for me) is transparent_hugepage_adjust() and > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > interaction between THP and _PAGE_DEVMAP. > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > had to be said :/ ). Luckily, this should be independent of the > PG_reserved thingy AFAIKs. Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() are honoring kvm_is_reserved_pfn(), so again I'm missing where the page count gets mismanaged and leads to the reported hang.
On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote: > > > The scarier code (for me) is transparent_hugepage_adjust() and > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > > interaction between THP and _PAGE_DEVMAP. > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > > had to be said :/ ). Luckily, this should be independent of the > > PG_reserved thingy AFAIKs. > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > are honoring kvm_is_reserved_pfn(), so again I'm missing where the > page count gets mismanaged and leads to the reported hang. When mapping pages into the guest, KVM gets the page via gup(), which increments the page count for ZONE_DEVICE pages. But KVM puts the page using kvm_release_pfn_clean(), which skips put_page() if PageReserved() and so never puts its reference to ZONE_DEVICE pages. My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() comments were for a post-patch/series scenario wherein PageReserved() is no longer true for ZONE_DEVICE pages.
On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote: > > > > The scarier code (for me) is transparent_hugepage_adjust() and > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > > > interaction between THP and _PAGE_DEVMAP. > > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > > > had to be said :/ ). Luckily, this should be independent of the > > > PG_reserved thingy AFAIKs. > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the > > page count gets mismanaged and leads to the reported hang. > > When mapping pages into the guest, KVM gets the page via gup(), which > increments the page count for ZONE_DEVICE pages. But KVM puts the page > using kvm_release_pfn_clean(), which skips put_page() if PageReserved() > and so never puts its reference to ZONE_DEVICE pages. Oh, yeah, that's busted. > My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > comments were for a post-patch/series scenario wheren PageReserved() is > no longer true for ZONE_DEVICE pages. Ah, ok, for that David is preserving kvm_is_reserved_pfn() returning true for ZONE_DEVICE because pfn_to_online_page() will fail for ZONE_DEVICE.
On Tue, Nov 05, 2019 at 03:30:00PM -0800, Dan Williams wrote: > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson > <sean.j.christopherson@intel.com> wrote: > > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote: > > > > > The scarier code (for me) is transparent_hugepage_adjust() and > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > > > > interaction between THP and _PAGE_DEVMAP. > > > > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > > > > had to be said :/ ). Luckily, this should be independent of the > > > > PG_reserved thingy AFAIKs. > > > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the > > > page count gets mismanaged and leads to the reported hang. > > > > When mapping pages into the guest, KVM gets the page via gup(), which > > increments the page count for ZONE_DEVICE pages. But KVM puts the page > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved() > > and so never puts its reference to ZONE_DEVICE pages. > > Oh, yeah, that's busted. > > > My transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > > comments were for a post-patch/series scenario wheren PageReserved() is > > no longer true for ZONE_DEVICE pages. > > Ah, ok, for that David is preserving kvm_is_reserved_pfn() returning > true for ZONE_DEVICE because pfn_to_online_page() will fail for > ZONE_DEVICE. But David's proposed fix for the above refcount bug is to omit the patch so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems like the right thing to do, including for thp_adjust(), e.g. it would naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is mapped with a huge page (2mb or above) in the host. The only hiccup is figuring out how to correctly transfer the reference.
On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote: > > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson > <sean.j.christopherson@intel.com> wrote: > > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote: > > > > > The scarier code (for me) is transparent_hugepage_adjust() and > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > > > > interaction between THP and _PAGE_DEVMAP. > > > > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > > > > had to be said :/ ). Luckily, this should be independent of the > > > > PG_reserved thingy AFAIKs. > > > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the > > > page count gets mismanaged and leads to the reported hang. > > > > When mapping pages into the guest, KVM gets the page via gup(), which > > increments the page count for ZONE_DEVICE pages. But KVM puts the page > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved() > > and so never puts its reference to ZONE_DEVICE pages. > > Oh, yeah, that's busted. Ugh, it's extra busted because every other gup user in the kernel tracks the pages resulting from gup and puts them (put_page()) when they are done. KVM wants to forget about whether it did a gup to get the page and optionally trigger put_page() based purely on the pfn. Outside of VFIO device assignment that needs pages pinned for DMA, why does KVM itself need to pin pages? If pages are pinned over a return to userspace that needs to be a FOLL_LONGTERM gup.
On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote: > On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson > > <sean.j.christopherson@intel.com> wrote: > > > > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: > > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote: > > > > > > The scarier code (for me) is transparent_hugepage_adjust() and > > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > > > > > interaction between THP and _PAGE_DEVMAP. > > > > > > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > > > > > had to be said :/ ). Luckily, this should be independent of the > > > > > PG_reserved thingy AFAIKs. > > > > > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the > > > > page count gets mismanaged and leads to the reported hang. > > > > > > When mapping pages into the guest, KVM gets the page via gup(), which > > > increments the page count for ZONE_DEVICE pages. But KVM puts the page > > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved() > > > and so never puts its reference to ZONE_DEVICE pages. > > > > Oh, yeah, that's busted. > > Ugh, it's extra busted because every other gup user in the kernel > tracks the pages resulting from gup and puts them (put_page()) when > they are done. KVM wants to forget about whether it did a gup to get > the page and optionally trigger put_page() based purely on the pfn. > Outside of VFIO device assignment that needs pages pinned for DMA, why > does KVM itself need to pin pages? If pages are pinned over a return > to userspace that needs to be a FOLL_LONGTERM gup. Short answer, KVM pins the page to ensure correctness with respect to the primary MMU invalidating the associated host virtual address, e.g. when the page is being migrated or unmapped from host userspace. The main use of gup() is to handle guest page faults and map pages into the guest, i.e. into KVM's secondary MMU. KVM uses gup() to both get the PFN and to temporarily pin the page. The pin is held just long enough to guarantee that any invalidation via the mmu_notifier will be stalled until after KVM finishes installing the page into the secondary MMU, i.e. the pin is short-term and not held across a return to userspace or entry into the guest. When a subsequent mmu_notifier invalidation occurs, KVM pulls the PFN from the secondary MMU and uses that to update accessed and dirty bits in the host. There are a few other KVM flows that eventually call into gup(), but those are "traditional" short-term pins and use put_page() directly.
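Sketched out, the short-term pin pattern described above looks roughly like this (names approximate the x86 page fault path of this period; this is not a literal excerpt):

mmu_seq = kvm->mmu_notifier_seq;
smp_rmb();
pfn = gfn_to_pfn(kvm, gfn);             /* gup(): short-term pin */
spin_lock(&kvm->mmu_lock);
if (!mmu_notifier_retry(kvm, mmu_seq)) {
        /* ... install pfn into the secondary MMU (sptes) ... */
}
spin_unlock(&kvm->mmu_lock);
kvm_release_pfn_clean(pfn);             /* drop the short-term pin */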
On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote: > > On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > > > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson > > > <sean.j.christopherson@intel.com> wrote: > > > > > > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: > > > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote: > > > > > > > The scarier code (for me) is transparent_hugepage_adjust() and > > > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > > > > > > interaction between THP and _PAGE_DEVMAP. > > > > > > > > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > > > > > > had to be said :/ ). Luckily, this should be independent of the > > > > > > PG_reserved thingy AFAIKs. > > > > > > > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > > > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the > > > > > page count gets mismanaged and leads to the reported hang. > > > > > > > > When mapping pages into the guest, KVM gets the page via gup(), which > > > > increments the page count for ZONE_DEVICE pages. But KVM puts the page > > > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved() > > > > and so never puts its reference to ZONE_DEVICE pages. > > > > > > Oh, yeah, that's busted. > > > > Ugh, it's extra busted because every other gup user in the kernel > > tracks the pages resulting from gup and puts them (put_page()) when > > they are done. KVM wants to forget about whether it did a gup to get > > the page and optionally trigger put_page() based purely on the pfn. > > Outside of VFIO device assignment that needs pages pinned for DMA, why > > does KVM itself need to pin pages? If pages are pinned over a return > > to userspace that needs to be a FOLL_LONGTERM gup. > > Short answer, KVM pins the page to ensure correctness with respect to the > primary MMU invalidating the associated host virtual address, e.g. when > the page is being migrated or unmapped from host userspace. > > The main use of gup() is to handle guest page faults and map pages into > the guest, i.e. into KVM's secondary MMU. KVM uses gup() to both get the > PFN and to temporarily pin the page. The pin is held just long enough to > guaranteed that any invalidation via the mmu_notifier will be stalled > until after KVM finishes installing the page into the secondary MMU, i.e. > the pin is short-term and not held across a return to userspace or entry > into the guest. When a subsequent mmu_notifier invalidation occurs, KVM > pulls the PFN from the secondary MMU and uses that to update accessed > and dirty bits in the host. > > There are a few other KVM flows that eventually call into gup(), but those > are "traditional" short-term pins and use put_page() directly. Ok, I was misinterpreting the effect of the bug with what KVM is using the reference to do. To your other point: > But David's proposed fix for the above refcount bug is to omit the patch > so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems > like the right thing to do, including for thp_adjust(), e.g. it would > naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is > mapped with a huge page (2mb or above) in the host. The only hiccup is > figuring out how to correctly transfer the reference. 
That might not be the only hiccup. There's currently no such thing as huge pages for ZONE_DEVICE; there are huge *mappings* (pmd and pud), but the result of pfn_to_page() on such a mapping does not yield a huge 'struct page'. It seems there are other paths in KVM that assume that more typical page machinery is active, like SetPageDirty() based on kvm_is_reserved_pfn(). While I told David that I did not want to see more usage of is_zone_device_page(), this patch below (untested) seems a cleaner path with fewer surprises:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4df0aa6b8e5c..fbea17c1810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
+	if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
+	    (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
 		put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

This is safe because the reference that KVM took earlier protects the is_zone_device_page() lookup from racing device teardown. Otherwise, if KVM does not have a reference, it's unsafe, but that's already even more broken because KVM would be releasing a page that it never referenced. Every other KVM operation that assumes page allocator pages would continue to honor kvm_is_reserved_pfn().
On 06.11.19 01:08, Dan Williams wrote: > On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson > <sean.j.christopherson@intel.com> wrote: >> >> On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote: >>> On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@intel.com> wrote: >>>> >>>> On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson >>>> <sean.j.christopherson@intel.com> wrote: >>>>> >>>>> On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: >>>>>> On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@redhat.com> wrote: >>>>>>>> The scarier code (for me) is transparent_hugepage_adjust() and >>>>>>>> kvm_mmu_zap_collapsible_spte(), as I don't at all understand the >>>>>>>> interaction between THP and _PAGE_DEVMAP. >>>>>>> >>>>>>> The x86 KVM MMU code is one of the ugliest code I know (sorry, but it >>>>>>> had to be said :/ ). Luckily, this should be independent of the >>>>>>> PG_reserved thingy AFAIKs. >>>>>> >>>>>> Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() >>>>>> are honoring kvm_is_reserved_pfn(), so again I'm missing where the >>>>>> page count gets mismanaged and leads to the reported hang. >>>>> >>>>> When mapping pages into the guest, KVM gets the page via gup(), which >>>>> increments the page count for ZONE_DEVICE pages. But KVM puts the page >>>>> using kvm_release_pfn_clean(), which skips put_page() if PageReserved() >>>>> and so never puts its reference to ZONE_DEVICE pages. >>>> >>>> Oh, yeah, that's busted. >>> >>> Ugh, it's extra busted because every other gup user in the kernel >>> tracks the pages resulting from gup and puts them (put_page()) when >>> they are done. KVM wants to forget about whether it did a gup to get >>> the page and optionally trigger put_page() based purely on the pfn. >>> Outside of VFIO device assignment that needs pages pinned for DMA, why >>> does KVM itself need to pin pages? If pages are pinned over a return >>> to userspace that needs to be a FOLL_LONGTERM gup. >> >> Short answer, KVM pins the page to ensure correctness with respect to the >> primary MMU invalidating the associated host virtual address, e.g. when >> the page is being migrated or unmapped from host userspace. >> >> The main use of gup() is to handle guest page faults and map pages into >> the guest, i.e. into KVM's secondary MMU. KVM uses gup() to both get the >> PFN and to temporarily pin the page. The pin is held just long enough to >> guaranteed that any invalidation via the mmu_notifier will be stalled >> until after KVM finishes installing the page into the secondary MMU, i.e. >> the pin is short-term and not held across a return to userspace or entry >> into the guest. When a subsequent mmu_notifier invalidation occurs, KVM >> pulls the PFN from the secondary MMU and uses that to update accessed >> and dirty bits in the host. >> >> There are a few other KVM flows that eventually call into gup(), but those >> are "traditional" short-term pins and use put_page() directly. > > Ok, I was misinterpreting the effect of the bug with what KVM is using > the reference to do. > > To your other point: > >> But David's proposed fix for the above refcount bug is to omit the patch >> so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems >> like the right thing to do, including for thp_adjust(), e.g. it would >> naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is >> mapped with a huge page (2mb or above) in the host. The only hiccup is >> figuring out how to correctly transfer the reference. 
> > That might not be the only hiccup. There's currently no such thing as > huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud), > but the result of pfn_to_page() on such a mapping does not yield a > huge 'struct page'. It seems there are other paths in KVM that assume > that more typical page machinery is active like SetPageDirty() based > on kvm_is_reserved_pfn(). While I told David that I did not want to > see more usage of is_zone_device_page(), this patch below (untested) > seems a cleaner path with less surprises: > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 4df0aa6b8e5c..fbea17c1810c 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); > > void kvm_release_pfn_clean(kvm_pfn_t pfn) > { > - if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) > + if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) || > + (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn)))) > put_page(pfn_to_page(pfn)); > } > EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); I had the same thought, but I do wonder about the kvm_get_pfn() users, e.g.,: hva_to_pfn_remapped(): r = follow_pfn(vma, addr, &pfn); ... kvm_get_pfn(pfn); ... We would not take a reference for ZONE_DEVICE, but later drop one reference via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really* dangerous to use. I can't tell if this can happen right now. We do have 3 users of kvm_get_pfn() that we have to audit before this change. Also, we should add a comment to kvm_get_pfn() that it should never be used with possible ZONE_DEVICE pages. Also, we should add a comment to kvm_release_pfn_clean(), describing why we treat ZONE_DEVICE in a special way here. We can then progress like this: 1. Get this fix upstream, it's somewhat unrelated to this series. 2. This patch here remains as is in this series (+/- documentation update) 3. Long term, rework KVM to not have to treat ZONE_DEVICE like reserved pages. E.g., get rid of kvm_get_pfn(). Then, this special zone check can go.
On Wed, Nov 06, 2019 at 07:56:34AM +0100, David Hildenbrand wrote: > On 06.11.19 01:08, Dan Williams wrote: > >On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson > >>But David's proposed fix for the above refcount bug is to omit the patch > >>so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems > >>like the right thing to do, including for thp_adjust(), e.g. it would > >>naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is > >>mapped with a huge page (2mb or above) in the host. The only hiccup is > >>figuring out how to correctly transfer the reference. > > > >That might not be the only hiccup. There's currently no such thing as > >huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud), > >but the result of pfn_to_page() on such a mapping does not yield a > >huge 'struct page'. It seems there are other paths in KVM that assume > >that more typical page machinery is active like SetPageDirty() based > >on kvm_is_reserved_pfn(). While I told David that I did not want to > >see more usage of is_zone_device_page(), this patch below (untested) > >seems a cleaner path with less surprises: > > > >diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > >index 4df0aa6b8e5c..fbea17c1810c 100644 > >--- a/virt/kvm/kvm_main.c > >+++ b/virt/kvm/kvm_main.c > >@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); > > > > void kvm_release_pfn_clean(kvm_pfn_t pfn) > > { > >- if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) > >+ if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) || The is_error_noslot_pfn() check shouldn't be overridden by zone_device. > >+ (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn)))) But rather than special case kvm_release_pfn_clean(), I'd rather KVM explicitly handle ZONE_DEVICE pages; there are other flows where KVM really should be aware of ZONE_DEVICE pages, e.g. for sanity checks and whatnot. There are surprisingly few callers of kvm_is_reserved_pfn(), so it's actually not too big of a change. > > put_page(pfn_to_page(pfn)); > > } > > EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); > > I had the same thought, but I do wonder about the kvm_get_pfn() users, > e.g.,: > > hva_to_pfn_remapped(): > r = follow_pfn(vma, addr, &pfn); > ... > kvm_get_pfn(pfn); > ... > > We would not take a reference for ZONE_DEVICE, but later drop one reference > via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really* dangerous to > use. I can't tell if this can happen right now. > > We do have 3 users of kvm_get_pfn() that we have to audit before this > change. Also, we should add a comment to kvm_get_pfn() that it should never > be used with possible ZONE_DEVICE pages. > > Also, we should add a comment to kvm_release_pfn_clean(), describing why we > treat ZONE_DEVICE in a special way here. > > > We can then progress like this > > 1. Get this fix upstream, it's somewhat unrelated to this series. > 2. This patch here remains as is in this series (+/- documentation update) > 3. Long term, rework KVM to not have to not treat ZONE_DEVICE like reserved > pages. E.g., get rid of kvm_get_pfn(). Then, this special zone check can go. Dropping kvm_get_pfn() is less than ideal, and at this point unnecessary. I'm 99% sure the existing call sites for kvm_get_pfn() can never be reached with ZONE_DEVICE pages. I think we can do: 1. Get a fix upstream to have KVM stop treating ZONE_DEVICE pages as reserved PFNs, i.e. exempt them in kvm_is_reserved_pfn() and change the callers of kvm_is_reserved_pfn() to handle ZONE_DEVICE pages. 2.
Drop this patch from the series, and instead remove the special treatment of ZONE_DEVICE pages from kvm_is_reserved_pfn(). Give me a few minutes to prep a patch.
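The direction Sean describes would look something like the following sketch (illustrative; not necessarily the patch he ended up sending): exempt ZONE_DEVICE pages in kvm_is_reserved_pfn() so that the gup() reference is dropped again on release:

bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
{
        /*
         * The caller is expected to already hold a reference, which
         * protects the lookup against racing device teardown (see
         * Dan's note above about get_dev_pagemap()).
         */
        if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
                return false;

        return is_zone_device_page(pfn_to_page(pfn));
}

bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
{
        if (pfn_valid(pfn))
                return PageReserved(pfn_to_page(pfn)) &&
                       !kvm_is_zone_device_pfn(pfn);

        return true;
}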
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.

KVM has this weird use case that you can map anything from /dev/mem into the guest. pfn_valid() is not a reliable check whether the memmap was initialized and can be touched. pfn_to_online_page() makes sure that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite kvm_is_reserved_pfn() to make sure the function produces the same result once we stop setting ZONE_DEVICE pages PG_reserved.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 virt/kvm/kvm_main.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e9eb666eb6e8..9d18cc67d124 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
 
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }