diff mbox series

[RFC,06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

Message ID 20231116012908.392077-7-peterx@redhat.com (mailing list archive)
State New
Headers show
Series mm/gup: Unify hugetlb, part 2 | expand

Commit Message

Peter Xu Nov. 16, 2023, 1:29 a.m. UTC
Hugepd format is only used in PowerPC with hugetlbfs.  In commit
a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings"), we added a check to fail gup-fast if there's
potential risk of violating GUP over writeback file systems.  That should
never apply to hugepd.

Drop that check, not only because it'll never be true for hugepd, but also
it paves way for reusing the function outside fast-gup.

Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 5 -----
 1 file changed, 5 deletions(-)

Comments

Christoph Hellwig Nov. 20, 2023, 8:26 a.m. UTC | #1
On Wed, Nov 15, 2023 at 08:29:02PM -0500, Peter Xu wrote:
> Hugepd format is only used in PowerPC with hugetlbfs.  In commit
> a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings"), we added a check to fail gup-fast if there's
> potential risk of violating GUP over writeback file systems.  That should
> never apply to hugepd.
>
> Drop that check, not only because it'll never be true for hugepd, but also
> it paves way for reusing the function outside fast-gup.

What prevents us from ever using hugepd with file mappings?  I think
it would naturally fit in with how large folios for the pagecache work.

So keeping this check and generalizing it seems like the better idea to
me.
Peter Xu Nov. 21, 2023, 3:59 p.m. UTC | #2
On Mon, Nov 20, 2023 at 12:26:24AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 15, 2023 at 08:29:02PM -0500, Peter Xu wrote:
> > Hugepd format is only used in PowerPC with hugetlbfs.  In commit
> > a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> > file-backed mappings"), we added a check to fail gup-fast if there's
> > potential risk of violating GUP over writeback file systems.  That should
> > never apply to hugepd.
> >
> > Drop that check, not only because it'll never be true for hugepd, but also
> > it paves way for reusing the function outside fast-gup.
> 
> What prevents us from ever using hugepd with file mappings?  I think
> it would naturally fit in with how large folios for the pagecache work.
> 
> So keeping this check and generalizing it seems like the better idea to
> me.

But then it means we're still keeping that dead code for fast-gup even if
we know that fact..  Or do we have a plan to add that support very soon, so
this code will be destined to add back?

The other option is I can always add a comment above gup_huge_pd()
explaining this special bit, so that when someone is adding hugepd support
to file large folios we'll hopefully not forget it?  But then that
generalization work will only happen when the code will be needed.

Thanks,
Christoph Hellwig Nov. 22, 2023, 8 a.m. UTC | #3
On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
> > What prevents us from ever using hugepd with file mappings?  I think
> > it would naturally fit in with how large folios for the pagecache work.
> > 
> > So keeping this check and generalizing it seems like the better idea to
> > me.
> 
> But then it means we're still keeping that dead code for fast-gup even if
> we know that fact..  Or do we have a plan to add that support very soon, so
> this code will be destined to add back?

The question wasn't mean retorical - we support arbitrary power of two
sized folios for the pagepage, what prevents us from using hugepd with
them right now?

> The other option is I can always add a comment above gup_huge_pd()
> explaining this special bit, so that when someone is adding hugepd support
> to file large folios we'll hopefully not forget it?  But then that
> generalization work will only happen when the code will be needed.

If dropping the check is the right thing for now (and I think the ppc
maintainers and willy as the large folio guy might have a more useful
opinions than I do), leaving a comment in would be very useful.
Peter Xu Nov. 22, 2023, 3:22 p.m. UTC | #4
On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
> > > What prevents us from ever using hugepd with file mappings?  I think
> > > it would naturally fit in with how large folios for the pagecache work.
> > > 
> > > So keeping this check and generalizing it seems like the better idea to
> > > me.
> > 
> > But then it means we're still keeping that dead code for fast-gup even if
> > we know that fact..  Or do we have a plan to add that support very soon, so
> > this code will be destined to add back?
> 
> The question wasn't mean retorical - we support arbitrary power of two
> sized folios for the pagepage, what prevents us from using hugepd with
> them right now?

Ah, didn't catch that point previously.  Hugepd is just not used outside
hugetlb right now, afaiu.

For example, __hugepte_alloc() (and that's the only one calls
hugepd_populate()) should be the function to allocate a hugepd (ppc only),
and it's only called in huge_pte_alloc(), which is part of the current
arch-specific hugetlb api.

And generic mm paths don't normally have hugepd handling, afaics.  For
example, page_vma_mapped_walk() doesn't handle hugepd at all unless in
hugetlb specific path.

There're actually (only) two generic mm paths that can handle hugepd,
namely:

  - fast-gup
  - walk_page_*() apis (aka, __walk_page_range())

For fast-gup I think the hugepd code is in use, however for walk_page_*
apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
a hugepd can be dead code to me (but note that this "dead code" is good
stuff to me, if one would like to merge hugetlb instead into generic mm).

This series tries to add slow gup into that list too, so the 3rd one to
support it.  I plan to look more into this area (e.g., __walk_page_range()
can be another good candidate soon).  I'm not sure whether we should teach
the whole mm to understand hugepd yet, but slow gup and __walk_page_range()
does look like good candidates to already remove the hugetlb specific code
paths - slow-gup has average ~add/~del LOCs (which this series does), and
__walk_page_range() can remove some code logically, no harm I yet see.

Indeed above are based on only my code observations, so I'll be more than
happy to be corrected otherwise, as early as possible.

> 
> > The other option is I can always add a comment above gup_huge_pd()
> > explaining this special bit, so that when someone is adding hugepd support
> > to file large folios we'll hopefully not forget it?  But then that
> > generalization work will only happen when the code will be needed.
> 
> If dropping the check is the right thing for now (and I think the ppc
> maintainers and willy as the large folio guy might have a more useful
> opinions than I do), leaving a comment in would be very useful.

Willy is in the loop, and I just notice I didn't really copy ppc list, even
I planned to..  I am adding the list (linuxppc-dev@lists.ozlabs.org) into
this reply.  I'll remember to do so as long as there's a new version.

The other reason I feel like hugepd may or may not be further developed for
new features like large folio is that I saw Power9 started to shift to
radix pgtables, and afaics hugepd is only supported in hash tables
(hugepd_ok()).  But again, I confess I know nothing about Power at all.

Thanks,
Christoph Hellwig Nov. 23, 2023, 7:21 a.m. UTC | #5
On Wed, Nov 22, 2023 at 10:22:11AM -0500, Peter Xu wrote:
> The other reason I feel like hugepd may or may not be further developed for
> new features like large folio is that I saw Power9 started to shift to
> radix pgtables, and afaics hugepd is only supported in hash tables
> (hugepd_ok()).  But again, I confess I know nothing about Power at all.

That alone sounds like a good reason to not bother.  So unless more
qualified people have a different opinion I'm fine with this patch
as long as you leave a comment in place, and ammend the commit message
with some of the very useful information from your mail.
Matthew Wilcox Nov. 23, 2023, 3:47 p.m. UTC | #6
On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
> > The other option is I can always add a comment above gup_huge_pd()
> > explaining this special bit, so that when someone is adding hugepd support
> > to file large folios we'll hopefully not forget it?  But then that
> > generalization work will only happen when the code will be needed.
> 
> If dropping the check is the right thing for now (and I think the ppc
> maintainers and willy as the large folio guy might have a more useful
> opinions than I do), leaving a comment in would be very useful.

It looks like ARM (in the person of Ryan) are going to add support for
something equivalent to hugepd.  Insofar as I understand hugepd, anyway.
I've done my best to set up the generic code so that the arch code can
use whatever size TLB entries it supports.  I haven't been looking to the
hugetlb code as a reference for it, since it can assume natural alignment
and generic THP/large folio must be able to handle arbitrary alignment.

If powerpc want to join in on the fun, they're more than welcome, but I
get the feeling that investment in Linux-on-PPC is somewhat smaller than
Linux-on-ARM these days.  Even if we restrict that to the server space.
Peter Xu Nov. 23, 2023, 4:10 p.m. UTC | #7
On Wed, Nov 22, 2023 at 11:21:50PM -0800, Christoph Hellwig wrote:
> That alone sounds like a good reason to not bother.  So unless more
> qualified people have a different opinion I'm fine with this patch
> as long as you leave a comment in place, and ammend the commit message
> with some of the very useful information from your mail.

Will do, thanks.

This is what I will squash into the same patch in the new version, as the
current plan:

+/*
+ * NOTE: currently hugepd is only used in hugetlbfs file systems on Power,
+ * which does not have issue with folio writeback against GUP updates.
+ * When hugepd will be extended to support non-hugetlbfs or even anonymous
+ * memory, we need to do extra check as what we do with most of the other
+ * folios. See writable_file_mapping_allowed() and folio_fast_pin_allowed()
+ * for more information.
+ */
static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
                unsigned int pdshift, unsigned long end, unsigned int flags,
                struct page **pages, int *nr)
Peter Xu Nov. 23, 2023, 5:22 p.m. UTC | #8
On Thu, Nov 23, 2023 at 03:47:49PM +0000, Matthew Wilcox wrote:
> It looks like ARM (in the person of Ryan) are going to add support for
> something equivalent to hugepd.

If it's about arm's cont_pte, then it looks ideal because this series
didn't yet touch cont_pte, assuming it'll just work.  From that aspect, his
work may help mine, and no immediately collapsing either.

There can be a slight performance difference which I need to measure for
arm's cont_pte already for hugetlb, but I didn't worry much on that;
quotting my commit message in the last patch:

    There may be a slight difference of how the loops run when processing
    GUP over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV
    (mostly its Svnapot extension on 64K huge pages): each loop of
    __get_user_pages() will resolve one pgtable entry with the patch
    applied, rather than relying on the size of hugetlb hstate, the latter
    may cover multiple entries in one loop.
    
    However, the performance difference should hopefully not be a major
    concern, considering that GUP just yet got 57edfcfd3419 ("mm/gup:
    accelerate thp gup even for "pages != NULL""), and that's not part of a
    performance analysis but a side dish.  If the performance will be a
    concern, we can consider handle CONT_PTE in follow_page(), for example.

So IMHO it can be slightly different comparing to e.g. page fault, because
each fault is still pretty slow as a whole if one fault for each small pte
(of a large folio / cont_pte), while the loop in GUP is still relatively
tight and short, comparing to a fault.  I'd boldly guess more low hanging
fruits out there for large folio outside GUP areas.

In all cases, it'll be interesting to know if Ryan has worked on cont_pte
support for gup on large folios, and whether there's any performance number
to share.  It's definitely good news to me because it means Ryan's work can
also then benefit hugetlb if this series will be merged, I just don't know
how much difference there will be.

Thanks,
Christophe Leroy Nov. 23, 2023, 6:22 p.m. UTC | #9
Le 22/11/2023 à 16:22, Peter Xu a écrit :
> On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
>> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
>>>> What prevents us from ever using hugepd with file mappings?  I think
>>>> it would naturally fit in with how large folios for the pagecache work.
>>>>
>>>> So keeping this check and generalizing it seems like the better idea to
>>>> me.
>>>
>>> But then it means we're still keeping that dead code for fast-gup even if
>>> we know that fact..  Or do we have a plan to add that support very soon, so
>>> this code will be destined to add back?
>>
>> The question wasn't mean retorical - we support arbitrary power of two
>> sized folios for the pagepage, what prevents us from using hugepd with
>> them right now?
> 
> Ah, didn't catch that point previously.  Hugepd is just not used outside
> hugetlb right now, afaiu.
> 
> For example, __hugepte_alloc() (and that's the only one calls
> hugepd_populate()) should be the function to allocate a hugepd (ppc only),
> and it's only called in huge_pte_alloc(), which is part of the current
> arch-specific hugetlb api.
> 
> And generic mm paths don't normally have hugepd handling, afaics.  For
> example, page_vma_mapped_walk() doesn't handle hugepd at all unless in
> hugetlb specific path.
> 
> There're actually (only) two generic mm paths that can handle hugepd,
> namely:
> 
>    - fast-gup
>    - walk_page_*() apis (aka, __walk_page_range())
> 
> For fast-gup I think the hugepd code is in use, however for walk_page_*
> apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
> handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
> a hugepd can be dead code to me (but note that this "dead code" is good
> stuff to me, if one would like to merge hugetlb instead into generic mm).

Not sure what you mean here. What do you mean by "dead code" ?
A hugepage directory can be plugged at any page level, from PGD to PMD.
So the following bit in walk_pgd_range() is valid and not dead:

		if (is_hugepd(__hugepd(pgd_val(*pgd))))
			err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);


> 
> This series tries to add slow gup into that list too, so the 3rd one to
> support it.  I plan to look more into this area (e.g., __walk_page_range()
> can be another good candidate soon).  I'm not sure whether we should teach
> the whole mm to understand hugepd yet, but slow gup and __walk_page_range()
> does look like good candidates to already remove the hugetlb specific code
> paths - slow-gup has average ~add/~del LOCs (which this series does), and
> __walk_page_range() can remove some code logically, no harm I yet see.
> 
> Indeed above are based on only my code observations, so I'll be more than
> happy to be corrected otherwise, as early as possible.
> 
>>
>>> The other option is I can always add a comment above gup_huge_pd()
>>> explaining this special bit, so that when someone is adding hugepd support
>>> to file large folios we'll hopefully not forget it?  But then that
>>> generalization work will only happen when the code will be needed.
>>
>> If dropping the check is the right thing for now (and I think the ppc
>> maintainers and willy as the large folio guy might have a more useful
>> opinions than I do), leaving a comment in would be very useful.
> 
> Willy is in the loop, and I just notice I didn't really copy ppc list, even
> I planned to..  I am adding the list (linuxppc-dev@lists.ozlabs.org) into
> this reply.  I'll remember to do so as long as there's a new version.
> 
> The other reason I feel like hugepd may or may not be further developed for
> new features like large folio is that I saw Power9 started to shift to
> radix pgtables, and afaics hugepd is only supported in hash tables
> (hugepd_ok()).  But again, I confess I know nothing about Power at all.
> 
> Thanks,
>
Ryan Roberts Nov. 23, 2023, 7:11 p.m. UTC | #10
On 23/11/2023 17:22, Peter Xu wrote:
> On Thu, Nov 23, 2023 at 03:47:49PM +0000, Matthew Wilcox wrote:
>> It looks like ARM (in the person of Ryan) are going to add support for
>> something equivalent to hugepd.
> 
> If it's about arm's cont_pte, then it looks ideal because this series
> didn't yet touch cont_pte, assuming it'll just work.  From that aspect, his
> work may help mine, and no immediately collapsing either.

Hi,

I'm not sure I've 100% understood the crossover between this series and my work
to support arm64's contpte mappings generally for anonymous and file-backed memory.

My approach is to transparently use contpte mappings when core-mm request pte
mappings that meet the requirements; and its all based around intercepting the
normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is
no semantic change to the core-mm. See [1]. It relies on 1) the page cache using
large folios and 2) my "small-sized THP" series which starts using arbitrary
sized large folios for anonymous memory [2].

If I've understood this conversation correctly there is an object called hugepd,
which today is only supported by powerpc, but which could allow the core-mm to
control the mapping granularity? I can see some value in exposing that control
to core-mm in the (very) long term.

[1] https://lore.kernel.org/all/20231115163018.1303287-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20231115132734.931023-1-ryan.roberts@arm.com/

Thanks,
Ryan


> 
> There can be a slight performance difference which I need to measure for
> arm's cont_pte already for hugetlb, but I didn't worry much on that;
> quotting my commit message in the last patch:
> 
>     There may be a slight difference of how the loops run when processing
>     GUP over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV
>     (mostly its Svnapot extension on 64K huge pages): each loop of
>     __get_user_pages() will resolve one pgtable entry with the patch
>     applied, rather than relying on the size of hugetlb hstate, the latter
>     may cover multiple entries in one loop.
>     
>     However, the performance difference should hopefully not be a major
>     concern, considering that GUP just yet got 57edfcfd3419 ("mm/gup:
>     accelerate thp gup even for "pages != NULL""), and that's not part of a
>     performance analysis but a side dish.  If the performance will be a
>     concern, we can consider handle CONT_PTE in follow_page(), for example.
> 
> So IMHO it can be slightly different comparing to e.g. page fault, because
> each fault is still pretty slow as a whole if one fault for each small pte
> (of a large folio / cont_pte), while the loop in GUP is still relatively
> tight and short, comparing to a fault.  I'd boldly guess more low hanging
> fruits out there for large folio outside GUP areas.
> 
> In all cases, it'll be interesting to know if Ryan has worked on cont_pte
> support for gup on large folios, and whether there's any performance number
> to share.  It's definitely good news to me because it means Ryan's work can
> also then benefit hugetlb if this series will be merged, I just don't know
> how much difference there will be.
> 
> Thanks,
>
Peter Xu Nov. 23, 2023, 7:37 p.m. UTC | #11
On Thu, Nov 23, 2023 at 06:22:33PM +0000, Christophe Leroy wrote:
> > For fast-gup I think the hugepd code is in use, however for walk_page_*
> > apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
> > handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
> > a hugepd can be dead code to me (but note that this "dead code" is good
> > stuff to me, if one would like to merge hugetlb instead into generic mm).
> 
> Not sure what you mean here. What do you mean by "dead code" ?
> A hugepage directory can be plugged at any page level, from PGD to PMD.
> So the following bit in walk_pgd_range() is valid and not dead:
> 
> 		if (is_hugepd(__hugepd(pgd_val(*pgd))))
> 			err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);

IMHO it boils down to the question on whether hugepd is only used in
hugetlbfs.  I think I already mentioned that above, but I can be more
explicit; what I see is that from higher stack in __walk_page_range():

	if (is_vm_hugetlb_page(vma)) {
		if (ops->hugetlb_entry)
			err = walk_hugetlb_range(start, end, walk);
	} else
		err = walk_pgd_range(start, end, walk);

It means to me as long as the vma is hugetlb, it'll not trigger any code in
walk_pgd_range(), but only walk_hugetlb_range().  Do you perhaps mean
hugepd is used outside hugetlbfs?

Thanks,
Peter Xu Nov. 23, 2023, 7:46 p.m. UTC | #12
On Thu, Nov 23, 2023 at 07:11:19PM +0000, Ryan Roberts wrote:
> Hi,
> 
> I'm not sure I've 100% understood the crossover between this series and my work
> to support arm64's contpte mappings generally for anonymous and file-backed memory.

No worry, there's no confliction.  If you worked on that it's only be
something nice on top.  Also, I'm curious if you have performance numbers,
because I'm going to do some test for hugetlb cont_ptes (which is only the
current plan), and if you got those it'll be a great baseline for me,
because it should be similar in you case even though the goal is slightly
different.

> 
> My approach is to transparently use contpte mappings when core-mm request pte
> mappings that meet the requirements; and its all based around intercepting the
> normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is
> no semantic change to the core-mm. See [1]. It relies on 1) the page cache using
> large folios and 2) my "small-sized THP" series which starts using arbitrary
> sized large folios for anonymous memory [2].
> 
> If I've understood this conversation correctly there is an object called hugepd,
> which today is only supported by powerpc, but which could allow the core-mm to
> control the mapping granularity? I can see some value in exposing that control
> to core-mm in the (very) long term.

For me it's needed immediately, because hugetlb_follow_page_mask() will be
gone after the last patch.

> 
> [1] https://lore.kernel.org/all/20231115163018.1303287-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20231115132734.931023-1-ryan.roberts@arm.com/

AFAICT you haven't yet worked on gup then, after I glimpsed the above
series.

It's a matter of whether one follow_page_mask() call can fetch more than
one page* for a cont_pte entry on aarch64 for a large non-hugetlb folio
(and if this series lands, it'll be the same to hugetlb or non-hugetlb).
Now the current code can only fetch one page I think.

Thanks,
Michael Ellerman Nov. 24, 2023, 1:06 a.m. UTC | #13
Peter Xu <peterx@redhat.com> writes:
> On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
>> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
...
>> 
>> If dropping the check is the right thing for now (and I think the ppc
>> maintainers and willy as the large folio guy might have a more useful
>> opinions than I do), leaving a comment in would be very useful.
>
> Willy is in the loop, and I just notice I didn't really copy ppc list, even
> I planned to..  I am adding the list (linuxppc-dev@lists.ozlabs.org) into
> this reply.  I'll remember to do so as long as there's a new version.

Thanks.

> The other reason I feel like hugepd may or may not be further developed for
> new features like large folio is that I saw Power9 started to shift to
> radix pgtables, and afaics hugepd is only supported in hash tables
> (hugepd_ok()).

Because it's powerpc it's not quite that simple :}

Power9 uses the Radix MMU by default, but the hash page table MMU is
still supported.

However although hugepd is used with the hash page table MMU, that's
only when PAGE_SIZE=4K. These days none of the major distros build with
4K pages.

But some of the non-server CPU platforms also use hugepd. 32-bit 8xx
does, which is actively maintained by Christophe.

And I believe Freescale e6500 can use it, but that is basically
orphaned, and although I boot test it I don't run any hugetlb tests.
(I guess I should do that).

cheers
Aneesh Kumar K.V Nov. 24, 2023, 5:28 a.m. UTC | #14
Peter Xu <peterx@redhat.com> writes:

> On Thu, Nov 23, 2023 at 06:22:33PM +0000, Christophe Leroy wrote:
>> > For fast-gup I think the hugepd code is in use, however for walk_page_*
>> > apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
>> > handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
>> > a hugepd can be dead code to me (but note that this "dead code" is good
>> > stuff to me, if one would like to merge hugetlb instead into generic mm).
>> 
>> Not sure what you mean here. What do you mean by "dead code" ?
>> A hugepage directory can be plugged at any page level, from PGD to PMD.
>> So the following bit in walk_pgd_range() is valid and not dead:
>> 
>> 		if (is_hugepd(__hugepd(pgd_val(*pgd))))
>> 			err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);
>
> IMHO it boils down to the question on whether hugepd is only used in
> hugetlbfs.  I think I already mentioned that above, but I can be more
> explicit; what I see is that from higher stack in __walk_page_range():
>
> 	if (is_vm_hugetlb_page(vma)) {
> 		if (ops->hugetlb_entry)
> 			err = walk_hugetlb_range(start, end, walk);
> 	} else
> 		err = walk_pgd_range(start, end, walk);
>
> It means to me as long as the vma is hugetlb, it'll not trigger any code in
> walk_pgd_range(), but only walk_hugetlb_range().  Do you perhaps mean
> hugepd is used outside hugetlbfs?
>

walk_pgd_range also get called from walk_page_range_novma().
IIRC commit e17eae2b839937817d771e2f5d2b30e5e2b81bb7 added the hugepd
details to pagewalk code to handle ptdump.

There is also a desire to use hugepd format in vmap mappings. 
https://lore.kernel.org/linuxppc-dev/cover.1620795204.git.christophe.leroy@csgroup.eu

-aneesh
Christophe Leroy Nov. 24, 2023, 7:03 a.m. UTC | #15
Le 23/11/2023 à 20:37, Peter Xu a écrit :
> On Thu, Nov 23, 2023 at 06:22:33PM +0000, Christophe Leroy wrote:
>>> For fast-gup I think the hugepd code is in use, however for walk_page_*
>>> apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
>>> handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
>>> a hugepd can be dead code to me (but note that this "dead code" is good
>>> stuff to me, if one would like to merge hugetlb instead into generic mm).
>>
>> Not sure what you mean here. What do you mean by "dead code" ?
>> A hugepage directory can be plugged at any page level, from PGD to PMD.
>> So the following bit in walk_pgd_range() is valid and not dead:
>>
>> 		if (is_hugepd(__hugepd(pgd_val(*pgd))))
>> 			err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);
> 
> IMHO it boils down to the question on whether hugepd is only used in
> hugetlbfs.  I think I already mentioned that above, but I can be more
> explicit; what I see is that from higher stack in __walk_page_range():
> 
> 	if (is_vm_hugetlb_page(vma)) {
> 		if (ops->hugetlb_entry)
> 			err = walk_hugetlb_range(start, end, walk);
> 	} else
> 		err = walk_pgd_range(start, end, walk);
> 
> It means to me as long as the vma is hugetlb, it'll not trigger any code in
> walk_pgd_range(), but only walk_hugetlb_range().  Do you perhaps mean
> hugepd is used outside hugetlbfs?

I added that code with commit e17eae2b8399 ("mm: pagewalk: fix walk for 
hugepage tables") because I was getting crazy displays when dumping 
/sys/kernel/debug/pagetables

Huge pages can be used for many thing.

On powerpc 8xx, there are 4 possible page size: 4k, 16k, 512k and 8M.
Each PGD entry addresses 4M areas, so hugepd is used for anything using 
8M pages. Could have used regular page tables instead, but it is not 
worth allocating a 4k table when the HW will only read first entry.

At the time being, linear memory mapping is performed with 8M pages, so 
ptdump_walk_pgd() will walk into huge page directories.

Also, huge pages can be used in vmalloc() and in vmap(). At the time 
being we support 512k pages there on the 8xx. 8M pages will be supported 
once vmalloc() and vmap() support hugepd, as explained in commit 
a6a8f7c4aa7e ("powerpc/8xx: add support for huge pages on VMAP and VMALLOC")

So yes as a conclusion hugepd is used outside hugetlbfs, hope it 
clarifies things.

Christophe
Ryan Roberts Nov. 24, 2023, 9:06 a.m. UTC | #16
On 23/11/2023 19:46, Peter Xu wrote:
> On Thu, Nov 23, 2023 at 07:11:19PM +0000, Ryan Roberts wrote:
>> Hi,
>>
>> I'm not sure I've 100% understood the crossover between this series and my work
>> to support arm64's contpte mappings generally for anonymous and file-backed memory.
> 
> No worry, there's no confliction.  If you worked on that it's only be
> something nice on top.  Also, I'm curious if you have performance numbers,

I have perf numbers for high level use cases (kernel compilation and Speedometer
Java Script benchmarks) at
https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/

I don't have any micro-benchmarks for GUP though, if that's your question. Is
there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.

> because I'm going to do some test for hugetlb cont_ptes (which is only the
> current plan), and if you got those it'll be a great baseline for me,
> because it should be similar in you case even though the goal is slightly
> different.
> 
>>
>> My approach is to transparently use contpte mappings when core-mm request pte
>> mappings that meet the requirements; and its all based around intercepting the
>> normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is
>> no semantic change to the core-mm. See [1]. It relies on 1) the page cache using
>> large folios and 2) my "small-sized THP" series which starts using arbitrary
>> sized large folios for anonymous memory [2].
>>
>> If I've understood this conversation correctly there is an object called hugepd,
>> which today is only supported by powerpc, but which could allow the core-mm to
>> control the mapping granularity? I can see some value in exposing that control
>> to core-mm in the (very) long term.
> 
> For me it's needed immediately, because hugetlb_follow_page_mask() will be
> gone after the last patch.
> 
>>
>> [1] https://lore.kernel.org/all/20231115163018.1303287-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/20231115132734.931023-1-ryan.roberts@arm.com/
> 
> AFAICT you haven't yet worked on gup then, after I glimpsed the above
> series.

No, I haven't touched GUP at all. The approach is fully inside the arm64 arch
code (except 1 patch to core-mm which enables an optimization). So as far as GUP
and the rest of the core-mm is concerned, there are still only page-sized ptes
and they can all be iterated over and accessed as normal.

> 
> It's a matter of whether one follow_page_mask() call can fetch more than
> one page* for a cont_pte entry on aarch64 for a large non-hugetlb folio
> (and if this series lands, it'll be the same to hugetlb or non-hugetlb).
> Now the current code can only fetch one page I think.
> 
> Thanks,
>
Peter Xu Nov. 24, 2023, 4:07 p.m. UTC | #17
On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
> I don't have any micro-benchmarks for GUP though, if that's your question. Is
> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.

Thanks Ryan.  Then nothing is needed to be tested if gup is not yet touched
from your side, afaict.  I'll see whether I can provide some rough numbers
instead in the next post (I'll probably only be able to test it in a VM,
though, but hopefully that should still reflect mostly the truth).
Peter Xu Nov. 30, 2023, 9:30 p.m. UTC | #18
On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
> > I don't have any micro-benchmarks for GUP though, if that's your question. Is
> > there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
> 
> Thanks Ryan.  Then nothing is needed to be tested if gup is not yet touched
> from your side, afaict.  I'll see whether I can provide some rough numbers
> instead in the next post (I'll probably only be able to test it in a VM,
> though, but hopefully that should still reflect mostly the truth).

An update: I finished a round of 64K cont_pte test, in the slow gup micro
benchmark I see ~15% perf degrade with this patchset applied on a VM on top
of Apple M1.

Frankly that's even less than I expected, considering not only how slow gup
THP used to be, but also on the fact that that's a tight loop over slow
gup, which in normal cases shouldn't happen: "present" ptes normally goes
to fast-gup, while !present goes into a fault following it.  I assume
that's why nobody cared slow gup for THP before.  I think adding cont_pte
support shouldn't be very hard, but that will include making cont_pte idea
global just for arm64 and riscv Svnapot.

The current plan is I'll add that performance number into my commit message
only, as I don't ever expect any real workload will regress with it.  Maybe
a global cont_pte api will be needed at some point, but perhaps not yet
feel strongly for this use case.

Please feel free to raise any concerns otherwise.

Thanks,
Christophe Leroy Dec. 3, 2023, 1:33 p.m. UTC | #19
Le 30/11/2023 à 22:30, Peter Xu a écrit :
> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>
>> Thanks Ryan.  Then nothing is needed to be tested if gup is not yet touched
>> from your side, afaict.  I'll see whether I can provide some rough numbers
>> instead in the next post (I'll probably only be able to test it in a VM,
>> though, but hopefully that should still reflect mostly the truth).
> 
> An update: I finished a round of 64K cont_pte test, in the slow gup micro
> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
> of Apple M1.
> 
> Frankly that's even less than I expected, considering not only how slow gup
> THP used to be, but also on the fact that that's a tight loop over slow
> gup, which in normal cases shouldn't happen: "present" ptes normally goes
> to fast-gup, while !present goes into a fault following it.  I assume
> that's why nobody cared slow gup for THP before.  I think adding cont_pte
> support shouldn't be very hard, but that will include making cont_pte idea
> global just for arm64 and riscv Svnapot.

Is there any documentation on what cont_pte is ? I have always wondered 
if it could also fit powerpc 8xx need ?

On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4 
PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB 
misses the HW needs one entrie for each 4k fragment.

There is also a similar approach for 512k pages, we have 128 contiguous 
identical PTEs for them.

And whatever PAGE_SIZE is (either 4k or 16k), the HW needs one 'unsigned 
long' pte for each 4k fragment. So at the time being when we define 
PAGE_SIZE as 16k, we need a special pte_t which is a table of 4x 
unsigned long.

Wondering if the cont_pte concept is similar and whether it could help.

Thanks
Christophe
Ryan Roberts Dec. 4, 2023, 11:11 a.m. UTC | #20
On 03/12/2023 13:33, Christophe Leroy wrote:
> 
> 
> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>
>>> Thanks Ryan.  Then nothing is needed to be tested if gup is not yet touched
>>> from your side, afaict.  I'll see whether I can provide some rough numbers
>>> instead in the next post (I'll probably only be able to test it in a VM,
>>> though, but hopefully that should still reflect mostly the truth).
>>
>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>> of Apple M1.
>>
>> Frankly that's even less than I expected, considering not only how slow gup
>> THP used to be, but also on the fact that that's a tight loop over slow
>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>> to fast-gup, while !present goes into a fault following it.  I assume
>> that's why nobody cared slow gup for THP before.  I think adding cont_pte
>> support shouldn't be very hard, but that will include making cont_pte idea
>> global just for arm64 and riscv Svnapot.
> 
> Is there any documentation on what cont_pte is ? I have always wondered 
> if it could also fit powerpc 8xx need ?

pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
"contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
(AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
are mapping a physically contiguous and naturally aligned piece of memory. The
HW can use this to coalesce entries in the TLB. When using 4K base pages, the
contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
base pages, its 2M (32 PTEs).

> 
> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4 
> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB 
> misses the HW needs one entrie for each 4k fragment.

From that description, it sounds like the SPS bit might be similar to arm64
contiguous bit? Although sounds like you are currently using it in a slightly
different way - telling kernel that the base page is 16K but mapping each 16K
page with 4x 4K entries (plus the SPS bit set)?

> 
> There is also a similar approach for 512k pages, we have 128 contiguous 
> identical PTEs for them.
> 
> And whatever PAGE_SIZE is (either 4k or 16k), the HW needs one 'unsigned 
> long' pte for each 4k fragment. So at the time being when we define 
> PAGE_SIZE as 16k, we need a special pte_t which is a table of 4x 
> unsigned long.
> 
> Wondering if the cont_pte concept is similar and whether it could help.

To be honest, while I understand pte_cont() and friends, I don't understand
their relevance (or at least potential future relevance) to GUP?

> 
> Thanks
> Christophe
Christophe Leroy Dec. 4, 2023, 11:25 a.m. UTC | #21
Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
> On 03/12/2023 13:33, Christophe Leroy wrote:
>>
>>
>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>
>>>> Thanks Ryan.  Then nothing is needed to be tested if gup is not yet touched
>>>> from your side, afaict.  I'll see whether I can provide some rough numbers
>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>> though, but hopefully that should still reflect mostly the truth).
>>>
>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>> of Apple M1.
>>>
>>> Frankly that's even less than I expected, considering not only how slow gup
>>> THP used to be, but also on the fact that that's a tight loop over slow
>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>> to fast-gup, while !present goes into a fault following it.  I assume
>>> that's why nobody cared slow gup for THP before.  I think adding cont_pte
>>> support shouldn't be very hard, but that will include making cont_pte idea
>>> global just for arm64 and riscv Svnapot.
>>
>> Is there any documentation on what cont_pte is ? I have always wondered
>> if it could also fit powerpc 8xx need ?
> 
> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
> are mapping a physically contiguous and naturally aligned piece of memory. The
> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
> base pages, its 2M (32 PTEs).
> 
>>
>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>> misses the HW needs one entrie for each 4k fragment.
> 
>  From that description, it sounds like the SPS bit might be similar to arm64
> contiguous bit? Although sounds like you are currently using it in a slightly
> different way - telling kernel that the base page is 16K but mapping each 16K
> page with 4x 4K entries (plus the SPS bit set)?

Yes it's both.

When the base page is 16k, there are 4x 4k entries (with SPS bit set) in 
the page table, and pte_t is a table of 4 'unsigned long'

When the base page is 4k, there is a 16k hugepage size, which is the 
same 4x 4k entries with SPS bit set.

So it looks similar to the contiguous bit.


And by extension, the same principle is used for 512k hugepages, the bit 
_PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS, 
PS being as follows:
- 00 Small (4 Kbyte or 16 Kbyte)
- 01 512 Kbyte
- 10 Reserved
- 11 8 Mbyte

So as PMD size is 4M, 512k pages are 128 identical consecutive entries 
in the page table.

I which I could have THP with 16k or 512k pages.

Christophe
Ryan Roberts Dec. 4, 2023, 11:46 a.m. UTC | #22
On 04/12/2023 11:25, Christophe Leroy wrote:
> 
> 
> Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
>> On 03/12/2023 13:33, Christophe Leroy wrote:
>>>
>>>
>>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>>
>>>>> Thanks Ryan.  Then nothing is needed to be tested if gup is not yet touched
>>>>> from your side, afaict.  I'll see whether I can provide some rough numbers
>>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>>> though, but hopefully that should still reflect mostly the truth).
>>>>
>>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>>> of Apple M1.
>>>>
>>>> Frankly that's even less than I expected, considering not only how slow gup
>>>> THP used to be, but also on the fact that that's a tight loop over slow
>>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>>> to fast-gup, while !present goes into a fault following it.  I assume
>>>> that's why nobody cared slow gup for THP before.  I think adding cont_pte
>>>> support shouldn't be very hard, but that will include making cont_pte idea
>>>> global just for arm64 and riscv Svnapot.
>>>
>>> Is there any documentation on what cont_pte is ? I have always wondered
>>> if it could also fit powerpc 8xx need ?
>>
>> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
>> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
>> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
>> are mapping a physically contiguous and naturally aligned piece of memory. The
>> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
>> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
>> base pages, its 2M (32 PTEs).
>>
>>>
>>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>>> misses the HW needs one entrie for each 4k fragment.
>>
>>  From that description, it sounds like the SPS bit might be similar to arm64
>> contiguous bit? Although sounds like you are currently using it in a slightly
>> different way - telling kernel that the base page is 16K but mapping each 16K
>> page with 4x 4K entries (plus the SPS bit set)?
> 
> Yes it's both.
> 
> When the base page is 16k, there are 4x 4k entries (with SPS bit set) in 
> the page table, and pte_t is a table of 4 'unsigned long'
> 
> When the base page is 4k, there is a 16k hugepage size, which is the 
> same 4x 4k entries with SPS bit set.
> 
> So it looks similar to the contiguous bit.
> 
> 
> And by extension, the same principle is used for 512k hugepages, the bit 
> _PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS, 
> PS being as follows:
> - 00 Small (4 Kbyte or 16 Kbyte)
> - 01 512 Kbyte
> - 10 Reserved
> - 11 8 Mbyte
> 
> So as PMD size is 4M, 512k pages are 128 identical consecutive entries 
> in the page table.
> 
> I which I could have THP with 16k or 512k pages.

Then you have come to the right place! :)

https://lore.kernel.org/linux-mm/20231204102027.57185-1-ryan.roberts@arm.com/


> 
> Christophe
Christophe Leroy Dec. 4, 2023, 11:57 a.m. UTC | #23
Le 04/12/2023 à 12:46, Ryan Roberts a écrit :
> On 04/12/2023 11:25, Christophe Leroy wrote:
>>
>>
>> Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
>>> On 03/12/2023 13:33, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>>>
>>>>>> Thanks Ryan.  Then nothing is needed to be tested if gup is not yet touched
>>>>>> from your side, afaict.  I'll see whether I can provide some rough numbers
>>>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>>>> though, but hopefully that should still reflect mostly the truth).
>>>>>
>>>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>>>> of Apple M1.
>>>>>
>>>>> Frankly that's even less than I expected, considering not only how slow gup
>>>>> THP used to be, but also on the fact that that's a tight loop over slow
>>>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>>>> to fast-gup, while !present goes into a fault following it.  I assume
>>>>> that's why nobody cared slow gup for THP before.  I think adding cont_pte
>>>>> support shouldn't be very hard, but that will include making cont_pte idea
>>>>> global just for arm64 and riscv Svnapot.
>>>>
>>>> Is there any documentation on what cont_pte is ? I have always wondered
>>>> if it could also fit powerpc 8xx need ?
>>>
>>> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
>>> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
>>> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
>>> are mapping a physically contiguous and naturally aligned piece of memory. The
>>> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
>>> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
>>> base pages, its 2M (32 PTEs).
>>>
>>>>
>>>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>>>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>>>> misses the HW needs one entrie for each 4k fragment.
>>>
>>>   From that description, it sounds like the SPS bit might be similar to arm64
>>> contiguous bit? Although sounds like you are currently using it in a slightly
>>> different way - telling kernel that the base page is 16K but mapping each 16K
>>> page with 4x 4K entries (plus the SPS bit set)?
>>
>> Yes it's both.
>>
>> When the base page is 16k, there are 4x 4k entries (with SPS bit set) in
>> the page table, and pte_t is a table of 4 'unsigned long'
>>
>> When the base page is 4k, there is a 16k hugepage size, which is the
>> same 4x 4k entries with SPS bit set.
>>
>> So it looks similar to the contiguous bit.
>>
>>
>> And by extension, the same principle is used for 512k hugepages, the bit
>> _PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS,
>> PS being as follows:
>> - 00 Small (4 Kbyte or 16 Kbyte)
>> - 01 512 Kbyte
>> - 10 Reserved
>> - 11 8 Mbyte
>>
>> So as PMD size is 4M, 512k pages are 128 identical consecutive entries
>> in the page table.
>>
>> I which I could have THP with 16k or 512k pages.
> 
> Then you have come to the right place! :)
> 
> https://lore.kernel.org/linux-mm/20231204102027.57185-1-ryan.roberts@arm.com/
> 

That looks great. That series only modifies core mm/ .
No changes needed in arch ? Will it work on powerpc without any 
change/additions to arch code ?

Well, I'll try it soon.

Christophe
Ryan Roberts Dec. 4, 2023, 12:02 p.m. UTC | #24
On 04/12/2023 11:57, Christophe Leroy wrote:
> 
> 
> Le 04/12/2023 à 12:46, Ryan Roberts a écrit :
>> On 04/12/2023 11:25, Christophe Leroy wrote:
>>>
>>>
>>> Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
>>>> On 03/12/2023 13:33, Christophe Leroy wrote:
>>>>>
>>>>>
>>>>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>>>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>>>>
>>>>>>> Thanks Ryan.  Then nothing is needed to be tested if gup is not yet touched
>>>>>>> from your side, afaict.  I'll see whether I can provide some rough numbers
>>>>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>>>>> though, but hopefully that should still reflect mostly the truth).
>>>>>>
>>>>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>>>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>>>>> of Apple M1.
>>>>>>
>>>>>> Frankly that's even less than I expected, considering not only how slow gup
>>>>>> THP used to be, but also on the fact that that's a tight loop over slow
>>>>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>>>>> to fast-gup, while !present goes into a fault following it.  I assume
>>>>>> that's why nobody cared slow gup for THP before.  I think adding cont_pte
>>>>>> support shouldn't be very hard, but that will include making cont_pte idea
>>>>>> global just for arm64 and riscv Svnapot.
>>>>>
>>>>> Is there any documentation on what cont_pte is ? I have always wondered
>>>>> if it could also fit powerpc 8xx need ?
>>>>
>>>> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
>>>> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
>>>> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
>>>> are mapping a physically contiguous and naturally aligned piece of memory. The
>>>> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
>>>> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
>>>> base pages, its 2M (32 PTEs).
>>>>
>>>>>
>>>>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>>>>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>>>>> misses the HW needs one entrie for each 4k fragment.
>>>>
>>>>   From that description, it sounds like the SPS bit might be similar to arm64
>>>> contiguous bit? Although sounds like you are currently using it in a slightly
>>>> different way - telling kernel that the base page is 16K but mapping each 16K
>>>> page with 4x 4K entries (plus the SPS bit set)?
>>>
>>> Yes it's both.
>>>
>>> When the base page is 16k, there are 4x 4k entries (with SPS bit set) in
>>> the page table, and pte_t is a table of 4 'unsigned long'
>>>
>>> When the base page is 4k, there is a 16k hugepage size, which is the
>>> same 4x 4k entries with SPS bit set.
>>>
>>> So it looks similar to the contiguous bit.
>>>
>>>
>>> And by extension, the same principle is used for 512k hugepages, the bit
>>> _PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS,
>>> PS being as follows:
>>> - 00 Small (4 Kbyte or 16 Kbyte)
>>> - 01 512 Kbyte
>>> - 10 Reserved
>>> - 11 8 Mbyte
>>>
>>> So as PMD size is 4M, 512k pages are 128 identical consecutive entries
>>> in the page table.
>>>
>>> I which I could have THP with 16k or 512k pages.
>>
>> Then you have come to the right place! :)
>>
>> https://lore.kernel.org/linux-mm/20231204102027.57185-1-ryan.roberts@arm.com/
>>
> 
> That looks great. That series only modifies core mm/ .
> No changes needed in arch ? Will it work on powerpc without any 
> change/additions to arch code ?

Yes there are also changes needed in arch; I have a separate series for arm64,
which transparently manages the contiguous bit when it sees appropriate PTEs:

https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@arm.com/

> 
> Well, I'll try it soon.
> 
> Christophe
Peter Xu Dec. 4, 2023, 4:48 p.m. UTC | #25
On Mon, Dec 04, 2023 at 11:11:26AM +0000, Ryan Roberts wrote:
> To be honest, while I understand pte_cont() and friends, I don't understand
> their relevance (or at least potential future relevance) to GUP?

GUP in general can be smarter to recognize if a pte/pmd is a cont_pte and
fetch the whole pte/pmd range if the caller specified.  Now it loops over
each pte/pmd.

Fast-gup is better as it at least doesn't take pgtable lock, for cont_pte
it looks inside gup_pte_range() which is good enough, but it'll still do
folio checks for each sub-pte, even though the 2nd+ folio checks should be
mostly the same (if to ignore races when the folio changed within the time
of processing the cont_pte chunk).

Slow-gup (as of what this series is about so far) doesn't do that either,
for each cont_pte whole entry it'll loop N times, frequently taking and
releasing the pgtable lock.  A smarter slow-gup can fundamentallly setup
follow_page_context.page_mask if it sees a cont_pte.  There might be a
challenge on whether holding the head page's refcount would stablize the
whole folio, but that may be another question to ask.

I think I also overlooked that PPC_8XX also has cont_pte support, so we
actually have three users indeed, if not counting potential future archs
adding support to also get that same tlb benefit.

Thanks,
diff mbox series

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 0e00204761d2..424d45e1afb3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2816,11 +2816,6 @@  static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
-	if (!folio_fast_pin_allowed(folio, flags)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
 	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
 		gup_put_folio(folio, refs, flags);
 		return 0;