Message ID | 20240809160909.1023470-7-peterx@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm: Support huge pfnmaps | expand |
On 09.08.24 18:08, Peter Xu wrote: > Pfnmaps can always be identified with special bits in the ptes/pmds/puds. > However that's unnecessary if the vma is stable, and when it's mapped under > VM_PFNMAP | VM_IO. > > Instead of adding similar checks in all the levels for huge pfnmaps, let > folio_walk_start() fail even earlier for these mappings. It's also > something gup-slow already does, so make them match. > > Cc: David Hildenbrand <david@redhat.com> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/pagewalk.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > index cd79fb3b89e5..fd3965efe773 100644 > --- a/mm/pagewalk.c > +++ b/mm/pagewalk.c > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw, > p4d_t *p4dp; > > mmap_assert_locked(vma->vm_mm); > + > + /* It has no folio backing the mappings at all.. */ > + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) > + return NULL; > + That is in general not what we want, and we still have some places that wrongly hard-code that behavior. In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, vm_normal_page_pud()] should be able to identify PFN maps and reject them, no?
On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote: > On 09.08.24 18:08, Peter Xu wrote: > > Pfnmaps can always be identified with special bits in the ptes/pmds/puds. > > However that's unnecessary if the vma is stable, and when it's mapped under > > VM_PFNMAP | VM_IO. > > > > Instead of adding similar checks in all the levels for huge pfnmaps, let > > folio_walk_start() fail even earlier for these mappings. It's also > > something gup-slow already does, so make them match. > > > > Cc: David Hildenbrand <david@redhat.com> > > Signed-off-by: Peter Xu <peterx@redhat.com> > > --- > > mm/pagewalk.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > > index cd79fb3b89e5..fd3965efe773 100644 > > --- a/mm/pagewalk.c > > +++ b/mm/pagewalk.c > > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw, > > p4d_t *p4dp; > > mmap_assert_locked(vma->vm_mm); > > + > > + /* It has no folio backing the mappings at all.. */ > > + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) > > + return NULL; > > + > > That is in general not what we want, and we still have some places that > wrongly hard-code that behavior. > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > no? Yep, I think we can also rely on special bit. When I was working on this whole series I must confess I am already confused on the real users of MAP_PRIVATE pfnmaps. E.g. we probably don't need either PFNMAP for either mprotect/fork/... at least for our use case, then VM_PRIVATE is even one step further. Here I chose to follow gup-slow, and I suppose you meant that's also wrong? If so, would it make sense we keep them aligned as of now, and change them altogether? Or do you think we should just rely on the special bits? And, just curious: is there any use case you're aware of that can benefit from caring PRIVATE pfnmaps yet so far, especially in this path? As far as I read, none of folio_walk_start() users so far should even stumble on top of a pfnmap, share or private. But that's a fairly quick glimps only. IOW, I was wondering whether I'm just over cautious here. Thanks,
On 09.08.24 18:54, Peter Xu wrote: > On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote: >> On 09.08.24 18:08, Peter Xu wrote: >>> Pfnmaps can always be identified with special bits in the ptes/pmds/puds. >>> However that's unnecessary if the vma is stable, and when it's mapped under >>> VM_PFNMAP | VM_IO. >>> >>> Instead of adding similar checks in all the levels for huge pfnmaps, let >>> folio_walk_start() fail even earlier for these mappings. It's also >>> something gup-slow already does, so make them match. >>> >>> Cc: David Hildenbrand <david@redhat.com> >>> Signed-off-by: Peter Xu <peterx@redhat.com> >>> --- >>> mm/pagewalk.c | 5 +++++ >>> 1 file changed, 5 insertions(+) >>> >>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c >>> index cd79fb3b89e5..fd3965efe773 100644 >>> --- a/mm/pagewalk.c >>> +++ b/mm/pagewalk.c >>> @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw, >>> p4d_t *p4dp; >>> mmap_assert_locked(vma->vm_mm); >>> + >>> + /* It has no folio backing the mappings at all.. */ >>> + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) >>> + return NULL; >>> + >> >> That is in general not what we want, and we still have some places that >> wrongly hard-code that behavior. >> >> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. >> >> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, >> vm_normal_page_pud()] should be able to identify PFN maps and reject them, >> no? > > Yep, I think we can also rely on special bit. > > When I was working on this whole series I must confess I am already > confused on the real users of MAP_PRIVATE pfnmaps. E.g. we probably don't > need either PFNMAP for either mprotect/fork/... at least for our use case, > then VM_PRIVATE is even one step further. Yes, it's rather a corner case indeed. > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong? I assume just nobody really noticed, just like nobody noticed that walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). Your process memory stats will likely miss anon folios on COW PFNMAP mappings ... in the rare cases where they exist (e.g., mmap() of /dev/mem). > If so, would it make sense we keep them aligned as of now, and change them > altogether? Or do you think we should just rely on the special bits? GUP already refuses to work on a lot of other stuff, so likely not a good use of time unless somebody complains. But yes, long-term we should make all code either respect that it could happen (and bury less awkward checks in page table walkers) or rip support for MAP_PRIVATE PFNMAP out completely. > > And, just curious: is there any use case you're aware of that can benefit > from caring PRIVATE pfnmaps yet so far, especially in this path? In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. There was a discussion (in VM_PAT) some time ago whether we could remove MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW mappings on /dev/mem, although not many (and they might not actually write to these areas). I'm happy if someone wants to try ripping that out, I'm not brave enough :) [1] https://lkml.kernel.org/r/1f2a8ed4-aaff-4be7-b3b6-63d2841a2908@redhat.com > > As far as I read, none of folio_walk_start() users so far should even > stumble on top of a pfnmap, share or private. But that's a fairly quick > glimps only. do_pages_stat()->do_pages_stat_array() should be able to trigger it, if you pass "nodes=NULL" to move_pages(). Maybe s390x could be tricked into it, but likely as you say, most code shouldn't trigger it. The function itself should be handling it correctly as of today, though.
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > On 09.08.24 18:54, Peter Xu wrote: > > On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote: > > > On 09.08.24 18:08, Peter Xu wrote: > > > > Pfnmaps can always be identified with special bits in the ptes/pmds/puds. > > > > However that's unnecessary if the vma is stable, and when it's mapped under > > > > VM_PFNMAP | VM_IO. > > > > > > > > Instead of adding similar checks in all the levels for huge pfnmaps, let > > > > folio_walk_start() fail even earlier for these mappings. It's also > > > > something gup-slow already does, so make them match. > > > > > > > > Cc: David Hildenbrand <david@redhat.com> > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > > > --- > > > > mm/pagewalk.c | 5 +++++ > > > > 1 file changed, 5 insertions(+) > > > > > > > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > > > > index cd79fb3b89e5..fd3965efe773 100644 > > > > --- a/mm/pagewalk.c > > > > +++ b/mm/pagewalk.c > > > > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw, > > > > p4d_t *p4dp; > > > > mmap_assert_locked(vma->vm_mm); > > > > + > > > > + /* It has no folio backing the mappings at all.. */ > > > > + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) > > > > + return NULL; > > > > + > > > > > > That is in general not what we want, and we still have some places that > > > wrongly hard-code that behavior. > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > no? > > > > Yep, I think we can also rely on special bit. > > > > When I was working on this whole series I must confess I am already > > confused on the real users of MAP_PRIVATE pfnmaps. E.g. we probably don't > > need either PFNMAP for either mprotect/fork/... at least for our use case, > > then VM_PRIVATE is even one step further. > > Yes, it's rather a corner case indeed. > > > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong? > > I assume just nobody really noticed, just like nobody noticed that > walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). I noticed it, and that's one of the reasons why this series can be small, as walk page callers are intact. > > Your process memory stats will likely miss anon folios on COW PFNMAP > mappings ... in the rare cases where they exist (e.g., mmap() of /dev/mem). Do you mean /proc/$PID/status? I thought that (aka, mm counters) should be fine with anon pages CoWed on top of private pfnmaps, but possibly I misunderstood what you meant. > > > If so, would it make sense we keep them aligned as of now, and change them > > altogether? Or do you think we should just rely on the special bits? > > GUP already refuses to work on a lot of other stuff, so likely not a good > use of time unless somebody complains. > > But yes, long-term we should make all code either respect that it could > happen (and bury less awkward checks in page table walkers) or rip support > for MAP_PRIVATE PFNMAP out completely. > > > > > And, just curious: is there any use case you're aware of that can benefit > > from caring PRIVATE pfnmaps yet so far, especially in this path? > > In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. > > There was a discussion (in VM_PAT) some time ago whether we could remove > MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW > mappings on /dev/mem, although not many (and they might not actually write > to these areas). Ah, looks like the private map on /dev/mem is the only thing we know. > > I'm happy if someone wants to try ripping that out, I'm not brave enough :) > > [1] > https://lkml.kernel.org/r/1f2a8ed4-aaff-4be7-b3b6-63d2841a2908@redhat.com > > > > > As far as I read, none of folio_walk_start() users so far should even > > stumble on top of a pfnmap, share or private. But that's a fairly quick > > glimps only. > > do_pages_stat()->do_pages_stat_array() should be able to trigger it, if you > pass "nodes=NULL" to move_pages(). .. so assume this is also about private mapping over /dev/mem, then: someone tries to write some pages there to some MMIO regions, then tries to use move_pages() to fetch which node those pages locate? Hmm.. OK :) > > Maybe s390x could be tricked into it, but likely as you say, most code > shouldn't trigger it. The function itself should be handling it correctly as > of today, though. So indeed I cannot justify it won't be used, and it's not a huge deal indeed if we stick with special bits. Let me go with that in the next version for folio_walk_start(). Thanks,
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > > > That is in general not what we want, and we still have some places that > > > wrongly hard-code that behavior. > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > no? > > > > Yep, I think we can also rely on special bit. It is more than just relying on the special bit.. VM_PFNMAP/VM_MIXEDMAP should really only be used inside vm_normal_page() because thay are, effectively, support for a limited emulation of the special bit on arches that don't have them. There are a bunch of weird rules that are used to try and make that work properly that have to be followed. On arches with the sepcial bit they should possibly never be checked since the special bit does everything you need. Arguably any place reading those flags out side of vm_normal_page/etc is suspect. > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong? > > I assume just nobody really noticed, just like nobody noticed that > walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). Like here.. > > And, just curious: is there any use case you're aware of that can benefit > > from caring PRIVATE pfnmaps yet so far, especially in this path? > > In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. > > There was a discussion (in VM_PAT) some time ago whether we could remove > MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW > mappings on /dev/mem, although not many (and they might not actually write > to these areas). I've squashed many bugs where kernel drivers don't demand userspace use MAP_SHARED when asking for a PFNMAP, and of course userspace has gained the wrong flags too. I don't know if anyone needs this, but it has crept wrongly into the API. Maybe an interesting place to start is a warning printk about using an obsolete feature and see where things go from there?? Jason
On 14.08.24 15:05, Jason Gunthorpe wrote: > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > >>>> That is in general not what we want, and we still have some places that >>>> wrongly hard-code that behavior. >>>> >>>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. >>>> >>>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, >>>> vm_normal_page_pud()] should be able to identify PFN maps and reject them, >>>> no? >>> >>> Yep, I think we can also rely on special bit. > > It is more than just relying on the special bit.. > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside > vm_normal_page() because thay are, effectively, support for a limited > emulation of the special bit on arches that don't have them. There are > a bunch of weird rules that are used to try and make that work > properly that have to be followed. > > On arches with the sepcial bit they should possibly never be checked > since the special bit does everything you need. > > Arguably any place reading those flags out side of vm_normal_page/etc > is suspect. IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, GUP-fast is special (one of the reason for "pte_special()" and friends after all). > >>> Here I chose to follow gup-slow, and I suppose you meant that's also wrong? >> >> I assume just nobody really noticed, just like nobody noticed that >> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). > > Like here.. > >>> And, just curious: is there any use case you're aware of that can benefit >>> from caring PRIVATE pfnmaps yet so far, especially in this path? >> >> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. >> >> There was a discussion (in VM_PAT) some time ago whether we could remove >> MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW >> mappings on /dev/mem, although not many (and they might not actually write >> to these areas). > > I've squashed many bugs where kernel drivers don't demand userspace > use MAP_SHARED when asking for a PFNMAP, and of course userspace has > gained the wrong flags too. I don't know if anyone needs this, but it > has crept wrongly into the API. > > Maybe an interesting place to start is a warning printk about using an > obsolete feature and see where things go from there?? Maybe we should start with some way to pr_warn_ONCE() whenever we get a COW/unshare-fault in such a MAP_PRIVATE mapping, and essentially populate the fresh anon folio. Then we don't only know who mmaps() something like that, but who actually relies on getting anon folios in there.
On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote: > On 14.08.24 15:05, Jason Gunthorpe wrote: > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > > > > > > > That is in general not what we want, and we still have some places that > > > > > wrongly hard-code that behavior. > > > > > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > > > no? > > > > > > > > Yep, I think we can also rely on special bit. > > > > It is more than just relying on the special bit.. > > > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside > > vm_normal_page() because thay are, effectively, support for a limited > > emulation of the special bit on arches that don't have them. There are > > a bunch of weird rules that are used to try and make that work > > properly that have to be followed. > > > > On arches with the sepcial bit they should possibly never be checked > > since the special bit does everything you need. > > > > Arguably any place reading those flags out side of vm_normal_page/etc > > is suspect. > > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, > GUP-fast is special (one of the reason for "pte_special()" and friends after > all). The issue is at least GUP currently doesn't work with pfnmaps, while there're potentially users who wants to be able to work on both page + !page use cases. Besides access_process_vm(), KVM also uses similar thing, and maybe more; these all seem to be valid use case of reference the vma flags for PFNMAP and such, so they can identify "it's pfnmap" or more generic issues like "permission check error on pgtable". The whole private mapping thing definitely made it complicated. > > > > > > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong? > > > > > > I assume just nobody really noticed, just like nobody noticed that > > > walk_page_test() skips VM_PFNMAP (but not VM_IO :) ). > > > > Like here.. > > > > > > And, just curious: is there any use case you're aware of that can benefit > > > > from caring PRIVATE pfnmaps yet so far, especially in this path? > > > > > > In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO. > > > > > > There was a discussion (in VM_PAT) some time ago whether we could remove > > > MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW > > > mappings on /dev/mem, although not many (and they might not actually write > > > to these areas). > > > > I've squashed many bugs where kernel drivers don't demand userspace > > use MAP_SHARED when asking for a PFNMAP, and of course userspace has > > gained the wrong flags too. I don't know if anyone needs this, but it > > has crept wrongly into the API. > > > > Maybe an interesting place to start is a warning printk about using an > > obsolete feature and see where things go from there?? > > Maybe we should start with some way to pr_warn_ONCE() whenever we get a > COW/unshare-fault in such a MAP_PRIVATE mapping, and essentially populate > the fresh anon folio. > > Then we don't only know who mmaps() something like that, but who actually > relies on getting anon folios in there. Sounds useful to me, if nobody yet has solid understanding of those private mappings while we'd want to collect some info. My gut feeling is we'll see some valid use of them, but I hope I'm wrong.. I hope we can still leave that as a separate thing so we focus on large mappings in this series. And yes, I'll stick with special bits here to not add one more flag reference. Thanks,
On Fri, Aug 16, 2024 at 10:21:17AM -0400, Peter Xu wrote: > On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote: > > On 14.08.24 15:05, Jason Gunthorpe wrote: > > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > > > > > > > > > That is in general not what we want, and we still have some places that > > > > > > wrongly hard-code that behavior. > > > > > > > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > > > > no? > > > > > > > > > > Yep, I think we can also rely on special bit. > > > > > > It is more than just relying on the special bit.. > > > > > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside > > > vm_normal_page() because thay are, effectively, support for a limited > > > emulation of the special bit on arches that don't have them. There are > > > a bunch of weird rules that are used to try and make that work > > > properly that have to be followed. > > > > > > On arches with the sepcial bit they should possibly never be checked > > > since the special bit does everything you need. > > > > > > Arguably any place reading those flags out side of vm_normal_page/etc > > > is suspect. > > > > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... > > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, > > GUP-fast is special (one of the reason for "pte_special()" and friends after > > all). > > The issue is at least GUP currently doesn't work with pfnmaps, while > there're potentially users who wants to be able to work on both page + > !page use cases. Besides access_process_vm(), KVM also uses similar thing, > and maybe more; these all seem to be valid use case of reference the vma > flags for PFNMAP and such, so they can identify "it's pfnmap" or more > generic issues like "permission check error on pgtable". Why are those valid compared with calling vm_normal_page() per-page instead? What reason is there to not do something based only on the PFNMAP flag? Jason
On 16.08.24 16:21, Peter Xu wrote: > On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote: >> On 14.08.24 15:05, Jason Gunthorpe wrote: >>> On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: >>> >>>>>> That is in general not what we want, and we still have some places that >>>>>> wrongly hard-code that behavior. >>>>>> >>>>>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. >>>>>> >>>>>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, >>>>>> vm_normal_page_pud()] should be able to identify PFN maps and reject them, >>>>>> no? >>>>> >>>>> Yep, I think we can also rely on special bit. >>> >>> It is more than just relying on the special bit.. >>> >>> VM_PFNMAP/VM_MIXEDMAP should really only be used inside >>> vm_normal_page() because thay are, effectively, support for a limited >>> emulation of the special bit on arches that don't have them. There are >>> a bunch of weird rules that are used to try and make that work >>> properly that have to be followed. >>> >>> On arches with the sepcial bit they should possibly never be checked >>> since the special bit does everything you need. >>> >>> Arguably any place reading those flags out side of vm_normal_page/etc >>> is suspect. >> >> IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... >> usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, >> GUP-fast is special (one of the reason for "pte_special()" and friends after >> all). > > The issue is at least GUP currently doesn't work with pfnmaps, while > there're potentially users who wants to be able to work on both page + > !page use cases. Besides access_process_vm(), KVM also uses similar thing, > and maybe more; these all seem to be valid use case of reference the vma > flags for PFNMAP and such, so they can identify "it's pfnmap" or more > generic issues like "permission check error on pgtable". What at least VFIO does is first try GUP, and if that fails, try follow_fault_pfn()->follow_pte(). There is a VM_PFNMAP check in there, yes. Ideally, follow_pte() would never return refcounted/normal pages, then the PFNMAP check might only be a performance improvement (maybe). > > The whole private mapping thing definitely made it complicated. Yes, and follow_pte() for now could even return ordinary anon pages. I spotted that when I was working on that VM_PAT stuff, but I was to unsure what to do (see below that KVM with MAP_PRIVATE /dev/mem might just work, no idea if there are use cases?). Fortunately, vfio calls is_invalid_reserved_pfn() and refuses anything that has a struct page. I think KVM does something nasty: if it something with a "struct page", and it's not PageReserved, it would take a reference (if I get kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal" page -- it essentially ignores the vm_normal_page() information in the page tables ... So anon pages in pivate mappings from follow_pte() might currently work with KVM ... because of the way KVM uses follow_pte(). I did not play with it, so I'm not sure if I am missing some detail.
On Fri, Aug 16, 2024 at 07:56:30PM +0200, David Hildenbrand wrote: > I think KVM does something nasty: if it something with a "struct page", and > it's not PageReserved, it would take a reference (if I get > kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal" > page -- it essentially ignores the vm_normal_page() information in the page > tables ... Oh that's nasty. Nothing should be upgrading the output of the follow functions to refcounted. That's what GUP is for. And PFNMAP pages, even if they have struct pages for some reason, should *NEVER* be refcounted because they are in a PFNMAP VMA. That is completely against the whole point :\ If they could be safely refcounted then it would be a MIXEDMAP. Jason
On Mon, Aug 19, 2024, Jason Gunthorpe wrote: > On Fri, Aug 16, 2024 at 07:56:30PM +0200, David Hildenbrand wrote: > > > I think KVM does something nasty: if it something with a "struct page", and > > it's not PageReserved, it would take a reference (if I get > > kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal" > > page -- it essentially ignores the vm_normal_page() information in the page > > tables ... > > Oh that's nasty. Nothing should be upgrading the output of the follow > functions to refcounted. That's what GUP is for. > > And PFNMAP pages, even if they have struct pages for some reason, > should *NEVER* be refcounted because they are in a PFNMAP VMA. That is > completely against the whole point :\ If they could be safely > refcounted then it would be a MIXEDMAP. Yeah yeah, I'm working on it. https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
On Fri, Aug 16, 2024 at 02:38:36PM -0300, Jason Gunthorpe wrote: > On Fri, Aug 16, 2024 at 10:21:17AM -0400, Peter Xu wrote: > > On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote: > > > On 14.08.24 15:05, Jason Gunthorpe wrote: > > > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote: > > > > > > > > > > > That is in general not what we want, and we still have some places that > > > > > > > wrongly hard-code that behavior. > > > > > > > > > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk. > > > > > > > > > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, > > > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them, > > > > > > > no? > > > > > > > > > > > > Yep, I think we can also rely on special bit. > > > > > > > > It is more than just relying on the special bit.. > > > > > > > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside > > > > vm_normal_page() because thay are, effectively, support for a limited > > > > emulation of the special bit on arches that don't have them. There are > > > > a bunch of weird rules that are used to try and make that work > > > > properly that have to be followed. > > > > > > > > On arches with the sepcial bit they should possibly never be checked > > > > since the special bit does everything you need. > > > > > > > > Arguably any place reading those flags out side of vm_normal_page/etc > > > > is suspect. > > > > > > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/... > > > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course, > > > GUP-fast is special (one of the reason for "pte_special()" and friends after > > > all). > > > > The issue is at least GUP currently doesn't work with pfnmaps, while > > there're potentially users who wants to be able to work on both page + > > !page use cases. Besides access_process_vm(), KVM also uses similar thing, > > and maybe more; these all seem to be valid use case of reference the vma > > flags for PFNMAP and such, so they can identify "it's pfnmap" or more > > generic issues like "permission check error on pgtable". > > Why are those valid compared with calling vm_normal_page() per-page > instead? > > What reason is there to not do something based only on the PFNMAP > flag? My comment was for answering "why VM_PFNMAP flags is needed outside vm_normal_page()", because GUP lacks supports of it. Are you suggesting we should support VM_PFNMAP in GUP, perhaps? Thanks,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c index cd79fb3b89e5..fd3965efe773 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw, p4d_t *p4dp; mmap_assert_locked(vma->vm_mm); + + /* It has no folio backing the mappings at all.. */ + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) + return NULL; + vma_pgtable_walk_begin(vma); if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
Pfnmaps can always be identified with special bits in the ptes/pmds/puds. However that's unnecessary if the vma is stable, and when it's mapped under VM_PFNMAP | VM_IO. Instead of adding similar checks in all the levels for huge pfnmaps, let folio_walk_start() fail even earlier for these mappings. It's also something gup-slow already does, so make them match. Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> --- mm/pagewalk.c | 5 +++++ 1 file changed, 5 insertions(+)