diff mbox series

[06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()

Message ID 20240809160909.1023470-7-peterx@redhat.com (mailing list archive)
State New, archived
Headers show
Series mm: Support huge pfnmaps | expand

Commit Message

Peter Xu Aug. 9, 2024, 4:08 p.m. UTC
Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
However that's unnecessary if the vma is stable, and when it's mapped under
VM_PFNMAP | VM_IO.

Instead of adding similar checks in all the levels for huge pfnmaps, let
folio_walk_start() fail even earlier for these mappings.  It's also
something gup-slow already does, so make them match.

Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/pagewalk.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

David Hildenbrand Aug. 9, 2024, 4:20 p.m. UTC | #1
On 09.08.24 18:08, Peter Xu wrote:
> Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
> However that's unnecessary if the vma is stable, and when it's mapped under
> VM_PFNMAP | VM_IO.
> 
> Instead of adding similar checks in all the levels for huge pfnmaps, let
> folio_walk_start() fail even earlier for these mappings.  It's also
> something gup-slow already does, so make them match.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/pagewalk.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index cd79fb3b89e5..fd3965efe773 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>   	p4d_t *p4dp;
>   
>   	mmap_assert_locked(vma->vm_mm);
> +
> +	/* It has no folio backing the mappings at all.. */
> +	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> +		return NULL;
> +

That is in general not what we want, and we still have some places that 
wrongly hard-code that behavior.

In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.

vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO, 
vm_normal_page_pud()] should be able to identify PFN maps and reject 
them, no?
Peter Xu Aug. 9, 2024, 4:54 p.m. UTC | #2
On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:08, Peter Xu wrote:
> > Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
> > However that's unnecessary if the vma is stable, and when it's mapped under
> > VM_PFNMAP | VM_IO.
> > 
> > Instead of adding similar checks in all the levels for huge pfnmaps, let
> > folio_walk_start() fail even earlier for these mappings.  It's also
> > something gup-slow already does, so make them match.
> > 
> > Cc: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/pagewalk.c | 5 +++++
> >   1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index cd79fb3b89e5..fd3965efe773 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> >   	p4d_t *p4dp;
> >   	mmap_assert_locked(vma->vm_mm);
> > +
> > +	/* It has no folio backing the mappings at all.. */
> > +	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> > +		return NULL;
> > +
> 
> That is in general not what we want, and we still have some places that
> wrongly hard-code that behavior.
> 
> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> 
> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> no?

Yep, I think we can also rely on special bit.

When I was working on this whole series I must confess I am already
confused on the real users of MAP_PRIVATE pfnmaps.  E.g. we probably don't
need either PFNMAP for either mprotect/fork/... at least for our use case,
then VM_PRIVATE is even one step further.

Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
If so, would it make sense we keep them aligned as of now, and change them
altogether?  Or do you think we should just rely on the special bits?

And, just curious: is there any use case you're aware of that can benefit
from caring PRIVATE pfnmaps yet so far, especially in this path?

As far as I read, none of folio_walk_start() users so far should even
stumble on top of a pfnmap, share or private.  But that's a fairly quick
glimps only.  IOW, I was wondering whether I'm just over cautious here.

Thanks,
David Hildenbrand Aug. 9, 2024, 5:25 p.m. UTC | #3
On 09.08.24 18:54, Peter Xu wrote:
> On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote:
>> On 09.08.24 18:08, Peter Xu wrote:
>>> Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
>>> However that's unnecessary if the vma is stable, and when it's mapped under
>>> VM_PFNMAP | VM_IO.
>>>
>>> Instead of adding similar checks in all the levels for huge pfnmaps, let
>>> folio_walk_start() fail even earlier for these mappings.  It's also
>>> something gup-slow already does, so make them match.
>>>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    mm/pagewalk.c | 5 +++++
>>>    1 file changed, 5 insertions(+)
>>>
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index cd79fb3b89e5..fd3965efe773 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>>>    	p4d_t *p4dp;
>>>    	mmap_assert_locked(vma->vm_mm);
>>> +
>>> +	/* It has no folio backing the mappings at all.. */
>>> +	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
>>> +		return NULL;
>>> +
>>
>> That is in general not what we want, and we still have some places that
>> wrongly hard-code that behavior.
>>
>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
>>
>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
>> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
>> no?
> 
> Yep, I think we can also rely on special bit.
> 
> When I was working on this whole series I must confess I am already
> confused on the real users of MAP_PRIVATE pfnmaps.  E.g. we probably don't
> need either PFNMAP for either mprotect/fork/... at least for our use case,
> then VM_PRIVATE is even one step further.

Yes, it's rather a corner case indeed.
> 
> Here I chose to follow gup-slow, and I suppose you meant that's also wrong?

I assume just nobody really noticed, just like nobody noticed that 
walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).

Your process memory stats will likely miss anon folios on COW PFNMAP 
mappings ... in the rare cases where they exist (e.g., mmap() of /dev/mem).

> If so, would it make sense we keep them aligned as of now, and change them
> altogether?  Or do you think we should just rely on the special bits?

GUP already refuses to work on a lot of other stuff, so likely not a 
good use of time unless somebody complains.

But yes, long-term we should make all code either respect that it could 
happen (and bury less awkward checks in page table walkers) or rip 
support for MAP_PRIVATE PFNMAP out completely.

> 
> And, just curious: is there any use case you're aware of that can benefit
> from caring PRIVATE pfnmaps yet so far, especially in this path?

In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.

There was a discussion (in VM_PAT) some time ago whether we could remove 
MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW 
mappings on /dev/mem, although not many (and they might not actually 
write to these areas).

I'm happy if someone wants to try ripping that out, I'm not brave enough :)

[1] 
https://lkml.kernel.org/r/1f2a8ed4-aaff-4be7-b3b6-63d2841a2908@redhat.com

> 
> As far as I read, none of folio_walk_start() users so far should even
> stumble on top of a pfnmap, share or private.  But that's a fairly quick
> glimps only.

do_pages_stat()->do_pages_stat_array() should be able to trigger it, if 
you pass "nodes=NULL" to move_pages().

Maybe s390x could be tricked into it, but likely as you say, most code 
shouldn't trigger it. The function itself should be handling it 
correctly as of today, though.
Peter Xu Aug. 9, 2024, 9:37 p.m. UTC | #4
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:54, Peter Xu wrote:
> > On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote:
> > > On 09.08.24 18:08, Peter Xu wrote:
> > > > Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
> > > > However that's unnecessary if the vma is stable, and when it's mapped under
> > > > VM_PFNMAP | VM_IO.
> > > > 
> > > > Instead of adding similar checks in all the levels for huge pfnmaps, let
> > > > folio_walk_start() fail even earlier for these mappings.  It's also
> > > > something gup-slow already does, so make them match.
> > > > 
> > > > Cc: David Hildenbrand <david@redhat.com>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >    mm/pagewalk.c | 5 +++++
> > > >    1 file changed, 5 insertions(+)
> > > > 
> > > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > > > index cd79fb3b89e5..fd3965efe773 100644
> > > > --- a/mm/pagewalk.c
> > > > +++ b/mm/pagewalk.c
> > > > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> > > >    	p4d_t *p4dp;
> > > >    	mmap_assert_locked(vma->vm_mm);
> > > > +
> > > > +	/* It has no folio backing the mappings at all.. */
> > > > +	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> > > > +		return NULL;
> > > > +
> > > 
> > > That is in general not what we want, and we still have some places that
> > > wrongly hard-code that behavior.
> > > 
> > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > > 
> > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > no?
> > 
> > Yep, I think we can also rely on special bit.
> > 
> > When I was working on this whole series I must confess I am already
> > confused on the real users of MAP_PRIVATE pfnmaps.  E.g. we probably don't
> > need either PFNMAP for either mprotect/fork/... at least for our use case,
> > then VM_PRIVATE is even one step further.
> 
> Yes, it's rather a corner case indeed.
> > 
> > Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
> 
> I assume just nobody really noticed, just like nobody noticed that
> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).

I noticed it, and that's one of the reasons why this series can be small,
as walk page callers are intact.

> 
> Your process memory stats will likely miss anon folios on COW PFNMAP
> mappings ... in the rare cases where they exist (e.g., mmap() of /dev/mem).

Do you mean /proc/$PID/status?  I thought that (aka, mm counters) should be
fine with anon pages CoWed on top of private pfnmaps, but possibly I
misunderstood what you meant.

> 
> > If so, would it make sense we keep them aligned as of now, and change them
> > altogether?  Or do you think we should just rely on the special bits?
> 
> GUP already refuses to work on a lot of other stuff, so likely not a good
> use of time unless somebody complains.
> 
> But yes, long-term we should make all code either respect that it could
> happen (and bury less awkward checks in page table walkers) or rip support
> for MAP_PRIVATE PFNMAP out completely.
> 
> > 
> > And, just curious: is there any use case you're aware of that can benefit
> > from caring PRIVATE pfnmaps yet so far, especially in this path?
> 
> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
> 
> There was a discussion (in VM_PAT) some time ago whether we could remove
> MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
> mappings on /dev/mem, although not many (and they might not actually write
> to these areas).

Ah, looks like the private map on /dev/mem is the only thing we know.

> 
> I'm happy if someone wants to try ripping that out, I'm not brave enough :)
> 
> [1]
> https://lkml.kernel.org/r/1f2a8ed4-aaff-4be7-b3b6-63d2841a2908@redhat.com
> 
> > 
> > As far as I read, none of folio_walk_start() users so far should even
> > stumble on top of a pfnmap, share or private.  But that's a fairly quick
> > glimps only.
> 
> do_pages_stat()->do_pages_stat_array() should be able to trigger it, if you
> pass "nodes=NULL" to move_pages().

.. so assume this is also about private mapping over /dev/mem, then:
someone tries to write some pages there to some MMIO regions, then tries to
use move_pages() to fetch which node those pages locate?  Hmm.. OK :)

> 
> Maybe s390x could be tricked into it, but likely as you say, most code
> shouldn't trigger it. The function itself should be handling it correctly as
> of today, though.

So indeed I cannot justify it won't be used, and it's not a huge deal
indeed if we stick with special bits.  Let me go with that in the next
version for folio_walk_start().

Thanks,
Jason Gunthorpe Aug. 14, 2024, 1:05 p.m. UTC | #5
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:

> > > That is in general not what we want, and we still have some places that
> > > wrongly hard-code that behavior.
> > > 
> > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > > 
> > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > no?
> > 
> > Yep, I think we can also rely on special bit.

It is more than just relying on the special bit..

VM_PFNMAP/VM_MIXEDMAP should really only be used inside
vm_normal_page() because thay are, effectively, support for a limited
emulation of the special bit on arches that don't have them. There are
a bunch of weird rules that are used to try and make that work
properly that have to be followed.

On arches with the sepcial bit they should possibly never be checked
since the special bit does everything you need.

Arguably any place reading those flags out side of vm_normal_page/etc
is suspect.

> > Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
> 
> I assume just nobody really noticed, just like nobody noticed that
> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).

Like here..

> > And, just curious: is there any use case you're aware of that can benefit
> > from caring PRIVATE pfnmaps yet so far, especially in this path?
> 
> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
> 
> There was a discussion (in VM_PAT) some time ago whether we could remove
> MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
> mappings on /dev/mem, although not many (and they might not actually write
> to these areas).

I've squashed many bugs where kernel drivers don't demand userspace
use MAP_SHARED when asking for a PFNMAP, and of course userspace has
gained the wrong flags too. I don't know if anyone needs this, but it
has crept wrongly into the API.

Maybe an interesting place to start is a warning printk about using an
obsolete feature and see where things go from there??

Jason
David Hildenbrand Aug. 16, 2024, 9:30 a.m. UTC | #6
On 14.08.24 15:05, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> 
>>>> That is in general not what we want, and we still have some places that
>>>> wrongly hard-code that behavior.
>>>>
>>>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
>>>>
>>>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
>>>> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
>>>> no?
>>>
>>> Yep, I think we can also rely on special bit.
> 
> It is more than just relying on the special bit..
> 
> VM_PFNMAP/VM_MIXEDMAP should really only be used inside
> vm_normal_page() because thay are, effectively, support for a limited
> emulation of the special bit on arches that don't have them. There are
> a bunch of weird rules that are used to try and make that work
> properly that have to be followed.
> 
> On arches with the sepcial bit they should possibly never be checked
> since the special bit does everything you need.
> 
> Arguably any place reading those flags out side of vm_normal_page/etc
> is suspect.

IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and 
pte_special()/... usage should be limited to 
vm_normal_page/vm_normal_page_pmd/ ... of course, GUP-fast is special 
(one of the reason for "pte_special()" and friends after all).

> 
>>> Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
>>
>> I assume just nobody really noticed, just like nobody noticed that
>> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).
> 
> Like here..
> 
>>> And, just curious: is there any use case you're aware of that can benefit
>>> from caring PRIVATE pfnmaps yet so far, especially in this path?
>>
>> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
>>
>> There was a discussion (in VM_PAT) some time ago whether we could remove
>> MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
>> mappings on /dev/mem, although not many (and they might not actually write
>> to these areas).
> 
> I've squashed many bugs where kernel drivers don't demand userspace
> use MAP_SHARED when asking for a PFNMAP, and of course userspace has
> gained the wrong flags too. I don't know if anyone needs this, but it
> has crept wrongly into the API.
> 
> Maybe an interesting place to start is a warning printk about using an
> obsolete feature and see where things go from there??

Maybe we should start with some way to pr_warn_ONCE() whenever we get a 
COW/unshare-fault in such a MAP_PRIVATE mapping, and essentially 
populate the fresh anon folio.

Then we don't only know who mmaps() something like that, but who 
actually relies on getting anon folios in there.
Peter Xu Aug. 16, 2024, 2:21 p.m. UTC | #7
On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
> On 14.08.24 15:05, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> > 
> > > > > That is in general not what we want, and we still have some places that
> > > > > wrongly hard-code that behavior.
> > > > > 
> > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > > > > 
> > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > > > no?
> > > > 
> > > > Yep, I think we can also rely on special bit.
> > 
> > It is more than just relying on the special bit..
> > 
> > VM_PFNMAP/VM_MIXEDMAP should really only be used inside
> > vm_normal_page() because thay are, effectively, support for a limited
> > emulation of the special bit on arches that don't have them. There are
> > a bunch of weird rules that are used to try and make that work
> > properly that have to be followed.
> > 
> > On arches with the sepcial bit they should possibly never be checked
> > since the special bit does everything you need.
> > 
> > Arguably any place reading those flags out side of vm_normal_page/etc
> > is suspect.
> 
> IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
> usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
> GUP-fast is special (one of the reason for "pte_special()" and friends after
> all).

The issue is at least GUP currently doesn't work with pfnmaps, while
there're potentially users who wants to be able to work on both page +
!page use cases.  Besides access_process_vm(), KVM also uses similar thing,
and maybe more; these all seem to be valid use case of reference the vma
flags for PFNMAP and such, so they can identify "it's pfnmap" or more
generic issues like "permission check error on pgtable".

The whole private mapping thing definitely made it complicated.

> 
> > 
> > > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
> > > 
> > > I assume just nobody really noticed, just like nobody noticed that
> > > walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).
> > 
> > Like here..
> > 
> > > > And, just curious: is there any use case you're aware of that can benefit
> > > > from caring PRIVATE pfnmaps yet so far, especially in this path?
> > > 
> > > In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
> > > 
> > > There was a discussion (in VM_PAT) some time ago whether we could remove
> > > MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
> > > mappings on /dev/mem, although not many (and they might not actually write
> > > to these areas).
> > 
> > I've squashed many bugs where kernel drivers don't demand userspace
> > use MAP_SHARED when asking for a PFNMAP, and of course userspace has
> > gained the wrong flags too. I don't know if anyone needs this, but it
> > has crept wrongly into the API.
> > 
> > Maybe an interesting place to start is a warning printk about using an
> > obsolete feature and see where things go from there??
> 
> Maybe we should start with some way to pr_warn_ONCE() whenever we get a
> COW/unshare-fault in such a MAP_PRIVATE mapping, and essentially populate
> the fresh anon folio.
> 
> Then we don't only know who mmaps() something like that, but who actually
> relies on getting anon folios in there.

Sounds useful to me, if nobody yet has solid understanding of those private
mappings while we'd want to collect some info.  My gut feeling is we'll see
some valid use of them, but I hope I'm wrong..  I hope we can still leave
that as a separate thing so we focus on large mappings in this series.  And
yes, I'll stick with special bits here to not add one more flag reference.

Thanks,
Jason Gunthorpe Aug. 16, 2024, 5:38 p.m. UTC | #8
On Fri, Aug 16, 2024 at 10:21:17AM -0400, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
> > On 14.08.24 15:05, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> > > 
> > > > > > That is in general not what we want, and we still have some places that
> > > > > > wrongly hard-code that behavior.
> > > > > > 
> > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > > > > > 
> > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > > > > no?
> > > > > 
> > > > > Yep, I think we can also rely on special bit.
> > > 
> > > It is more than just relying on the special bit..
> > > 
> > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside
> > > vm_normal_page() because thay are, effectively, support for a limited
> > > emulation of the special bit on arches that don't have them. There are
> > > a bunch of weird rules that are used to try and make that work
> > > properly that have to be followed.
> > > 
> > > On arches with the sepcial bit they should possibly never be checked
> > > since the special bit does everything you need.
> > > 
> > > Arguably any place reading those flags out side of vm_normal_page/etc
> > > is suspect.
> > 
> > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
> > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
> > GUP-fast is special (one of the reason for "pte_special()" and friends after
> > all).
> 
> The issue is at least GUP currently doesn't work with pfnmaps, while
> there're potentially users who wants to be able to work on both page +
> !page use cases.  Besides access_process_vm(), KVM also uses similar thing,
> and maybe more; these all seem to be valid use case of reference the vma
> flags for PFNMAP and such, so they can identify "it's pfnmap" or more
> generic issues like "permission check error on pgtable".

Why are those valid compared with calling vm_normal_page() per-page
instead?

What reason is there to not do something based only on the PFNMAP
flag?

Jason
David Hildenbrand Aug. 16, 2024, 5:56 p.m. UTC | #9
On 16.08.24 16:21, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
>> On 14.08.24 15:05, Jason Gunthorpe wrote:
>>> On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
>>>
>>>>>> That is in general not what we want, and we still have some places that
>>>>>> wrongly hard-code that behavior.
>>>>>>
>>>>>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
>>>>>>
>>>>>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
>>>>>> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
>>>>>> no?
>>>>>
>>>>> Yep, I think we can also rely on special bit.
>>>
>>> It is more than just relying on the special bit..
>>>
>>> VM_PFNMAP/VM_MIXEDMAP should really only be used inside
>>> vm_normal_page() because thay are, effectively, support for a limited
>>> emulation of the special bit on arches that don't have them. There are
>>> a bunch of weird rules that are used to try and make that work
>>> properly that have to be followed.
>>>
>>> On arches with the sepcial bit they should possibly never be checked
>>> since the special bit does everything you need.
>>>
>>> Arguably any place reading those flags out side of vm_normal_page/etc
>>> is suspect.
>>
>> IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
>> usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
>> GUP-fast is special (one of the reason for "pte_special()" and friends after
>> all).
> 
> The issue is at least GUP currently doesn't work with pfnmaps, while
> there're potentially users who wants to be able to work on both page +
> !page use cases.  Besides access_process_vm(), KVM also uses similar thing,
> and maybe more; these all seem to be valid use case of reference the vma
> flags for PFNMAP and such, so they can identify "it's pfnmap" or more
> generic issues like "permission check error on pgtable".

What at least VFIO does is first try GUP, and if that fails, try 
follow_fault_pfn()->follow_pte(). There is a VM_PFNMAP check in there, yes.

Ideally, follow_pte() would never return refcounted/normal pages, then 
the PFNMAP check might only be a performance improvement (maybe).

> 
> The whole private mapping thing definitely made it complicated.

Yes, and follow_pte() for now could even return ordinary anon pages. I 
spotted that when I was working on that VM_PAT stuff, but I was to 
unsure what to do (see below that KVM with MAP_PRIVATE /dev/mem might 
just work, no idea if there are use cases?).

Fortunately, vfio calls is_invalid_reserved_pfn() and refuses anything 
that has a struct page.

I think KVM does something nasty: if it something with a "struct page", 
and it's not PageReserved, it would take a reference (if I get 
kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not 
normal" page -- it essentially ignores the vm_normal_page() information 
in the page tables ...

So anon pages in pivate mappings from follow_pte() might currently work 
with KVM ... because of the way KVM uses follow_pte().

I did not play with it, so I'm not sure if I am missing some detail.
Jason Gunthorpe Aug. 19, 2024, 12:19 p.m. UTC | #10
On Fri, Aug 16, 2024 at 07:56:30PM +0200, David Hildenbrand wrote:

> I think KVM does something nasty: if it something with a "struct page", and
> it's not PageReserved, it would take a reference (if I get
> kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal"
> page -- it essentially ignores the vm_normal_page() information in the page
> tables ...

Oh that's nasty. Nothing should be upgrading the output of the follow
functions to refcounted. That's what GUP is for.

And PFNMAP pages, even if they have struct pages for some reason,
should *NEVER* be refcounted because they are in a PFNMAP VMA. That is
completely against the whole point :\ If they could be safely
refcounted then it would be a MIXEDMAP.

Jason
Sean Christopherson Aug. 19, 2024, 2:19 p.m. UTC | #11
On Mon, Aug 19, 2024, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 07:56:30PM +0200, David Hildenbrand wrote:
> 
> > I think KVM does something nasty: if it something with a "struct page", and
> > it's not PageReserved, it would take a reference (if I get
> > kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal"
> > page -- it essentially ignores the vm_normal_page() information in the page
> > tables ...
> 
> Oh that's nasty. Nothing should be upgrading the output of the follow
> functions to refcounted. That's what GUP is for.
> 
> And PFNMAP pages, even if they have struct pages for some reason,
> should *NEVER* be refcounted because they are in a PFNMAP VMA. That is
> completely against the whole point :\ If they could be safely
> refcounted then it would be a MIXEDMAP.

Yeah yeah, I'm working on it.

https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
Peter Xu Aug. 21, 2024, 6:42 p.m. UTC | #12
On Fri, Aug 16, 2024 at 02:38:36PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 10:21:17AM -0400, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
> > > On 14.08.24 15:05, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> > > > 
> > > > > > > That is in general not what we want, and we still have some places that
> > > > > > > wrongly hard-code that behavior.
> > > > > > > 
> > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > > > > > > 
> > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > > > > > no?
> > > > > > 
> > > > > > Yep, I think we can also rely on special bit.
> > > > 
> > > > It is more than just relying on the special bit..
> > > > 
> > > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside
> > > > vm_normal_page() because thay are, effectively, support for a limited
> > > > emulation of the special bit on arches that don't have them. There are
> > > > a bunch of weird rules that are used to try and make that work
> > > > properly that have to be followed.
> > > > 
> > > > On arches with the sepcial bit they should possibly never be checked
> > > > since the special bit does everything you need.
> > > > 
> > > > Arguably any place reading those flags out side of vm_normal_page/etc
> > > > is suspect.
> > > 
> > > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
> > > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
> > > GUP-fast is special (one of the reason for "pte_special()" and friends after
> > > all).
> > 
> > The issue is at least GUP currently doesn't work with pfnmaps, while
> > there're potentially users who wants to be able to work on both page +
> > !page use cases.  Besides access_process_vm(), KVM also uses similar thing,
> > and maybe more; these all seem to be valid use case of reference the vma
> > flags for PFNMAP and such, so they can identify "it's pfnmap" or more
> > generic issues like "permission check error on pgtable".
> 
> Why are those valid compared with calling vm_normal_page() per-page
> instead?
> 
> What reason is there to not do something based only on the PFNMAP
> flag?

My comment was for answering "why VM_PFNMAP flags is needed outside
vm_normal_page()", because GUP lacks supports of it.

Are you suggesting we should support VM_PFNMAP in GUP, perhaps?

Thanks,
diff mbox series

Patch

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cd79fb3b89e5..fd3965efe773 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -727,6 +727,11 @@  struct folio *folio_walk_start(struct folio_walk *fw,
 	p4d_t *p4dp;
 
 	mmap_assert_locked(vma->vm_mm);
+
+	/* It has no folio backing the mappings at all.. */
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+		return NULL;
+
 	vma_pgtable_walk_begin(vma);
 
 	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))