[00/45] hugetlb pagewalk unification

Message ID 20240704043132.28501-1-osalvador@suse.de (mailing list archive)

Message

Oscar Salvador July 4, 2024, 4:30 a.m. UTC
Hi all,

During Peter's talk at the LSFMM, it was agreed that one of the things
that need to be done in order to further integrate hugetlb into mm core,
is to unify generic and hugetlb pagewalkers.
I started with this one, which unifies hugetlb into the generic
pagewalk instead of having dedicated hugetlb_entry entries.
This means that pmd_entry/pte_entry (for cont-pte) entries will also deal
with hugetlb vmas, and so will new pud_entry entries, since hugetlb can be
pud-mapped (devm pages as well, but we seem not to care about those with
the exception of the hmm code).

The outcome is this RFC.

Before you continue, let me clarify certain points:

This patchset is not yet finished, as there are things that 1) need more
thought, 2) are still broken (like the hmm bits, as I am clueless about
that), and 3) some paths have not been tested at all.

The things I tested were:

 - memory-failure
 - smaps/numa_maps/pagemap (the latter only for pud/pmd, not
   cont-{pmd,pte})
 - mempolicy

on arm64 (for 64KB and 32M hugetlb pages) and on x86_64 (for 2MB and 1GB
hugetlb pages).
More tests need to be conducted, and I plan to borrow a ppc64le machine
to also carry out some tests there, but for now this is what my bandwidth
allowed me to do.

I am well aware that there are two things that might scare people, one
being the number of patches, and the other being the amount of code
added.

For the former, I will by no means ask anyone to review 45 patches, but
since this patchset touches isolated paths (damon, mincore, hmm,
task_mmu, memory-failure, mempolicy), I will point out some people
that might be able to help me out with those different bits:

 - Miaohe for memory-failure bits
 - David for task_mmu bits
 - SeongJae Park for damon bits
 - Jerome for hmm bits
 - feel free to join for the rest

I think that might be a good approach: instead of having
to review 45 patches, one only has to review at most 5 or 6.

For the latter, there is an explanation: hugetlb operates on ptes
(although it allocates puds/pmds, and the operations work at that level too),
which means that now that we will handle PUD/PMD-mapped hugetlb with
{pud,pmd}_* operations, we need to introduce quite a few functions that
do not exist yet but will be needed from now on.

Although I am sending this out, this is not yet RFC-ready material;
as I said, there are still things that need to be improved/fixed/tested,
but I wanted to make it public nevertheless so we can gather some
constructive feedback that helps us move in the right direction, and to
also widen the discussion.

So take this more as a "Hey, let me show you what I am doing; call me out
on things you consider wrong".

Thanks in advance

Oscar Salvador (45):
  arch/x86: Drop own definition of pgd,p4d_leaf
  mm: Add {pmd,pud}_huge_lock helper
  mm/pagewalk: Move vma_pgtable_walk_begin and vma_pgtable_walk_end
    upfront
  mm/pagewalk: Only call pud_entry when we have a pud leaf
  mm/pagewalk: Enable walk_pmd_range to handle cont-pmds
  mm/pagewalk: Do not try to split non-thp pud or pmd leafs
  arch/s390: Enable __s390_enable_skey_pmd to handle hugetlb vmas
  fs/proc: Enable smaps_pmd_entry to handle PMD-mapped hugetlb vmas
  mm: Implement pud-version functions for swap and vm_normal_page_pud
  fs/proc: Create smaps_pud_range to handle PUD-mapped hugetlb vmas
  fs/proc: Enable smaps_pte_entry to handle cont-pte mapped hugetlb vmas
  fs/proc: Enable pagemap_pmd_range to handle hugetlb vmas
  mm: Implement pud-version uffd functions
  fs/proc: Create pagemap_pud_range to handle PUD-mapped hugetlb vmas
  fs/proc: Adjust pte_to_pagemap_entry for hugetlb vmas
  fs/proc: Enable pagemap_scan_pmd_entry to handle hugetlb vmas
  mm: Implement pud-version for pud_mkinvalid and pudp_establish
  fs/proc: Create pagemap_scan_pud_entry to handle PUD-mapped hugetlb
    vmas
  fs/proc: Enable gather_pte_stats to handle hugetlb vmas
  fs/proc: Enable gather_pte_stats to handle cont-pte mapped hugetlb
    vmas
  fs/proc: Create gather_pud_stats to handle PUD-mapped hugetlb pages
  mm/mempolicy: Enable queue_folios_pmd to handle hugetlb vmas
  mm/mempolicy: Create queue_folios_pud to handle PUD-mapped hugetlb
    vmas
  mm/memory_failure: Enable check_hwpoisoned_pmd_entry to handle hugetlb
    vmas
  mm/memory-failure: Create check_hwpoisoned_pud_entry to handle
    PUD-mapped hugetlb vmas
  mm/damon: Enable damon_young_pmd_entry to handle hugetlb vmas
  mm/damon: Create damon_young_pud_entry to handle PUD-mapped hugetlb
    vmas
  mm/damon: Enable damon_mkold_pmd_entry to handle hugetlb vmas
  mm/damon: Create damon_mkold_pud_entry to handle PUD-mapped hugetlb
    vmas
  mm,mincore: Enable mincore_pte_range to handle hugetlb vmas
  mm/mincore: Create mincore_pud_range to handle PUD-mapped hugetlb vmas
  mm/hmm: Enable hmm_vma_walk_pmd, to handle hugetlb vmas
  mm/hmm: Enable hmm_vma_walk_pud to handle PUD-mapped hugetlb vmas
  arch/powerpc: Skip hugetlb vmas in subpage_mark_vma_nohuge
  arch/s390: Skip hugetlb vmas in thp_split_mm
  fs/proc: Make clear_refs_test_walk skip hugetlb vmas
  mm/lock: Make mlock_test_walk skip hugetlb vmas
  mm/madvise: Make swapin_test_walk skip hugetlb vmas
  mm/madvise: Make madvise_cold_test_walk skip hugetlb vmas
  mm/madvise: Make madvise_free_test_walk skip hugetlb vmas
  mm/migrate_device: Make migrate_vma_test_walk skip hugetlb vmas
  mm/memcontrol: Make mem_cgroup_move_test_walk skip hugetlb vmas
  mm/memcontrol: Make mem_cgroup_count_test_walk skip hugetlb vmas
  mm/hugetlb_vmemmap: Make vmemmap_test_walk skip hugetlb vmas
  mm: Delete all hugetlb_entry entries

 arch/arm64/include/asm/pgtable.h             |  19 +
 arch/loongarch/include/asm/pgtable.h         |   8 +
 arch/mips/include/asm/pgtable.h              |   7 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |   8 +-
 arch/powerpc/mm/book3s64/pgtable.c           |  15 +-
 arch/powerpc/mm/book3s64/subpage_prot.c      |   2 +
 arch/riscv/include/asm/pgtable.h             |  15 +
 arch/s390/mm/gmap.c                          |  37 +-
 arch/x86/include/asm/pgtable.h               | 199 +++++----
 fs/proc/task_mmu.c                           | 434 ++++++++++++-------
 include/asm-generic/pgtable_uffd.h           |  30 ++
 include/linux/mm.h                           |   4 +
 include/linux/mm_inline.h                    |  34 ++
 include/linux/pagewalk.h                     |  10 -
 include/linux/pgtable.h                      |  77 +++-
 include/linux/swapops.h                      |  27 ++
 mm/damon/ops-common.c                        |  21 +-
 mm/damon/vaddr.c                             | 173 ++++----
 mm/hmm.c                                     |  69 +--
 mm/hugetlb_vmemmap.c                         |  12 +
 mm/madvise.c                                 |  36 ++
 mm/memcontrol-v1.c                           |  24 +
 mm/memory-failure.c                          |  99 +++--
 mm/memory.c                                  |  51 +++
 mm/mempolicy.c                               | 121 +++---
 mm/migrate_device.c                          |  12 +
 mm/mincore.c                                 |  46 +-
 mm/mlock.c                                   |  12 +
 mm/mprotect.c                                |  10 -
 mm/pagewalk.c                                |  73 +---
 mm/pgtable-generic.c                         |  21 +
 31 files changed, 1089 insertions(+), 617 deletions(-)

Comments

Oscar Salvador July 4, 2024, 10:13 a.m. UTC | #1
On Thu, Jul 04, 2024 at 06:30:47AM +0200, Oscar Salvador wrote:
> Hi all,
> 
> During Peter's talk at the LSFMM, it was agreed that one of the things
> that need to be done in order to further integrate hugetlb into mm core,
> is to unify generic and hugetlb pagewalkers.
> I started with this one, which is unifying hugetlb into generic
> pagewalk, instead of having its hugetlb_entry entries.
> Which means that pmd_entry/pte_entry(for cont-pte) entries will also deal with
> hugetlb vmas as well, and so will new pud_entry entries since hugetlb can be
> pud mapped (devm pages as well but we seem not to care about those with
> the exception of hmm code).
> 
> The outcome is this RFC.

Just dropping the git tree, in case someone finds it more suitable:

https://github.com/leberus/linux/ hugetlb-unification-pagewalk
David Hildenbrand July 4, 2024, 10:44 a.m. UTC | #2
On 04.07.24 06:30, Oscar Salvador wrote:
> Hi all,
> 
> During Peter's talk at the LSFMM, it was agreed that one of the things
> that need to be done in order to further integrate hugetlb into mm core,
> is to unify generic and hugetlb pagewalkers.
> I started with this one, which is unifying hugetlb into generic
> pagewalk, instead of having its hugetlb_entry entries.
> Which means that pmd_entry/pte_entry(for cont-pte) entries will also deal with
> hugetlb vmas as well, and so will new pud_entry entries since hugetlb can be
> pud mapped (devm pages as well but we seem not to care about those with
> the exception of hmm code).
> 
> The outcome is this RFC.

First of all, a good step in the right direction, but maybe not what 
we want long-term. So I'm questioning whether we want this intermediate 
approach. walk_page_range() and friends are simply not a good design 
(e.g., indirect function calls).


There are roughly two categories of page table walkers we have:

1) We actually only want to walk present folios (to be precise, page
    ranges of folios). We should look into moving away from the page
    walker API where possible, and have something better that
    directly gives us the folio (page ranges). Any PTE batching would be
    done internally.

2) We want to deal with non-present folios as well (swp entries and all
    kinds of other stuff). We should maybe implement our custom page
    table walker and move away from walk_page_range(). We are not walking
    "pages" after all but everything else included :)

Then, there is a subset of 1) where we only want to walk to a single 
address (a single folio). I'm working on that right now to get rid of 
follow_page() and some (IIRC 3: KSM an daemon) walk_page_range() users. 
Hugetlb will still remain a bit special, but I'm afraid we cannot hide 
that completely.
Peter Xu July 4, 2024, 2:30 p.m. UTC | #3
Hey, David,

On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
> There are roughly two categories of page table walkers we have:
> 
> 1) We actually only want to walk present folios (to be precise, page
>    ranges of folios). We should look into moving away from the walk the
>    page walker API where possible, and have something better that
>    directly gives us the folio (page ranges). Any PTE batching would be
>    done internally.
> 
> 2) We want to deal with non-present folios as well (swp entries and all
>    kinds of other stuff). We should maybe implement our custom page
>    table walker and move away from walk_page_range(). We are not walking
>    "pages" after all but everything else included :)
> 
> Then, there is a subset of 1) where we only want to walk to a single address
> (a single folio). I'm working on that right now to get rid of follow_page()
> and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still
> remain a bit special, but I'm afraid we cannot hide that completely.

Maybe you are talking about the generic concept of "page table walker", not
walk_page_range() explicitly?

I'd agree if it's about the generic concept. For example, follow_page()
definitely is tailored for getting the page/folio.  But just to mention
Oscar's series is only working on the page_walk API itself.  What I see so
far is most of the walk_page API users aren't described above - most of
them do not fall into category 1) at all, if any. And they either need to
fetch something from the pgtable where having the folio isn't enough, or
modify the pgtable for different reasons.

A generic pgtable walker looks still wanted at some point, but it can be
too involved to be introduced together with this "remove hugetlb_entry"
effort.

To me, that future work is not yet about "get the folio, ignore the
pgtable", but about how to abstract different layers of pgtables, so the
caller may get a generic concept of "one pgtable entry" with the level/size
information attached, and process it at a single place / hook, and perhaps
hopefully even work with a device pgtable, as long as it's a radix tree.

[Adding Jason into the loop too. PS: Oscar, please consider copying Jason
 for this work too; Jason has provided lots of useful discussions in the
 past on relevant topics]

Thanks,
David Hildenbrand July 4, 2024, 3:23 p.m. UTC | #4
On 04.07.24 16:30, Peter Xu wrote:
> Hey, David,
> 

Hi!

> On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
>> There are roughly two categories of page table walkers we have:
>>
>> 1) We actually only want to walk present folios (to be precise, page
>>     ranges of folios). We should look into moving away from the walk the
>>     page walker API where possible, and have something better that
>>     directly gives us the folio (page ranges). Any PTE batching would be
>>     done internally.
>>
>> 2) We want to deal with non-present folios as well (swp entries and all
>>     kinds of other stuff). We should maybe implement our custom page
>>     table walker and move away from walk_page_range(). We are not walking
>>     "pages" after all but everything else included :)
>>
>> Then, there is a subset of 1) where we only want to walk to a single address
>> (a single folio). I'm working on that right now to get rid of follow_page()
>> and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still
>> remain a bit special, but I'm afraid we cannot hide that completely.
> 
> Maybe you are talking about the generic concept of "page table walker", not
> walk_page_range() explicitly?
> 
> I'd agree if it's about the generic concept. For example, follow_page()
> definitely is tailored for getting the page/folio.  But just to mention
> Oscar's series is only working on the page_walk API itself.  What I see so
> far is most of the walk_page API users aren't described above - most of
> them do not fall into category 1) at all, if any. And they either need to
> fetch something from the pgtable where having the folio isn't enough, or
> modify the pgtable for different reasons.

Right, but having 1) does not imply that we won't have access to 
the page table entry in an abstracted form; the folio is simply the 
primary source of information that these users care about. 2) is an 
extension of 1), but walking+exposing all (or most) other page table 
entries as well in some form, which is certainly harder to get right.

Taking a look at some examples:

* madvise_cold_or_pageout_pte_range() only cares about present folios.
* madvise_free_pte_range() only cares about present folios.
* break_ksm_ops() only cares about present folios.
* mlock_walk_ops() only cares about present folios.
* damon_mkold_ops() only cares about present folios.
* damon_young_ops() only cares about present folios.

There are certainly other page_walk API users that are more involved and 
need to do way more magic, which fall into category 2). In particular 
things like swapin_walk_ops(), hmm_walk_ops() and most 
fs/proc/task_mmu.c. Likely there are plenty of them.


Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there 
even is left in using walk_page_range() :)

> 
> A generic pgtable walker looks still wanted at some point, but it can be
> too involved to be introduced together with this "remove hugetlb_entry"
> effort.

My thinking was if "remove hugetlb_entry" cannot wait for "remove 
page_walk", because we found a reasonable way to do it better and 
convert the individual users. Maybe it can't.

I've not given up hope that we can end up with something better and 
clearer than the current page_walk API :)

> 
> To me, that future work is not yet about "get the folio, ignore the
> pgtable", but about how to abstract different layers of pgtables, so the
> caller may get a generic concept of "one pgtable entry" with the level/size
> information attached, and process it at a single place / hook, and perhaps
> hopefully even work with a device pgtable, as long as it's a radix tree.

To me 2) is an extension of 1). My thinking is that we can start with 1) 
without having to care about all the details of 2). If we have to make it 
so generic that we can walk any page table layout out there in this world, 
I'm not so sure.
Peter Xu July 4, 2024, 4:43 p.m. UTC | #5
On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
> On 04.07.24 16:30, Peter Xu wrote:
> > Hey, David,
> > 
> 
> Hi!
> 
> > On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
> > > There are roughly two categories of page table walkers we have:
> > > 
> > > 1) We actually only want to walk present folios (to be precise, page
> > >     ranges of folios). We should look into moving away from the walk the
> > >     page walker API where possible, and have something better that
> > >     directly gives us the folio (page ranges). Any PTE batching would be
> > >     done internally.
> > > 
> > > 2) We want to deal with non-present folios as well (swp entries and all
> > >     kinds of other stuff). We should maybe implement our custom page
> > >     table walker and move away from walk_page_range(). We are not walking
> > >     "pages" after all but everything else included :)
> > > 
> > > Then, there is a subset of 1) where we only want to walk to a single address
> > > (a single folio). I'm working on that right now to get rid of follow_page()
> > > and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still
> > > remain a bit special, but I'm afraid we cannot hide that completely.
> > 
> > Maybe you are talking about the generic concept of "page table walker", not
> > walk_page_range() explicitly?
> > 
> > I'd agree if it's about the generic concept. For example, follow_page()
> > definitely is tailored for getting the page/folio.  But just to mention
> > Oscar's series is only working on the page_walk API itself.  What I see so
> > far is most of the walk_page API users aren't described above - most of
> > them do not fall into category 1) at all, if any. And they either need to
> > fetch something from the pgtable where having the folio isn't enough, or
> > modify the pgtable for different reasons.
> 
> Right, but having 1) does not imply that we won't be having access to the
> page table entry in an abstracted form, the folio is simply the primary
> source of information that these users care about. 2) is an extension of 1),
> but walking+exposing all (or most) other page table entries as well in some
> form, which is certainly harder to get right.
> 
> Taking a look at some examples:
> 
> * madvise_cold_or_pageout_pte_range() only cares about present folios.
> * madvise_free_pte_range() only cares about present folios.
> * break_ksm_ops() only cares about present folios.
> * mlock_walk_ops() only cares about present folios.
> * damon_mkold_ops() only cares about present folios.
> * damon_young_ops() only cares about present folios.
> 
> There are certainly other page_walk API users that are more involved and
> need to do way more magic, which fall into category 2). In particular things
> like swapin_walk_ops(), hmm_walk_ops() and most fs/proc/task_mmu.c. Likely
> there are plenty of them.
> 
> 
> Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there
> even is left in using walk_page_range() :)

Hmm, I need to confess from a quick look I didn't yet see why the current
page_walk API won't work under p4d there.. it could be that I missed some
details.

> 
> > 
> > A generic pgtable walker looks still wanted at some point, but it can be
> > too involved to be introduced together with this "remove hugetlb_entry"
> > effort.
> 
> My thinking was if "remove hugetlb_entry" cannot wait for "remove
> page_walk", because we found a reasonable way to do it better and convert
> the individual users. Maybe it can't.
> 
> I've not given up hope that we can end up with something better and clearer
> than the current page_walk API :)

Oh, so you meant you have plans to rewrite some of the page_walk API
users to use the new API you plan to propose?

It looks fine by me.  I assume anything new will already take hugetlb
folios into account, so it'll "just work" and actually reduce the number
of patches here, am I right?

If it still needs time to land, I think it's also fine that it's done on
top of Oscar's.  So it may boil down to the schedule in that case, and we
may also want to know how Oscar sees this.

> 
> > 
> > To me, that future work is not yet about "get the folio, ignore the
> > pgtable", but about how to abstract different layers of pgtables, so the
> > caller may get a generic concept of "one pgtable entry" with the level/size
> > information attached, and process it at a single place / hook, and perhaps
> > hopefully even work with a device pgtable, as long as it's a radix tree.
> 
> To me 2) is an extension of 1). My thinking is that we can start with 1)
> without having to are about all details of 2). If we have to make it as
> generic that we can walk any page table layout out there in this world, I'm
> not so sure.

I still see hope there; after all, the radix pgtable is indeed a common
abstraction, and it looks to me like a lot of things share that structure.
IIUC one challenge of it is being fast.  So.. I don't know. But I'll be more
than happy to see it come if someone can work it out, and it just sounds
very nice too if some chunk of code can be run the same for mm/, kvm/ and
iommu/.

Thanks,
Oscar Salvador July 8, 2024, 8:18 a.m. UTC | #6
On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
> My thinking was if "remove hugetlb_entry" cannot wait for "remove
> page_walk", because we found a reasonable way to do it better and convert
> the individual users. Maybe it can't.
> 
> I've not given up hope that we can end up with something better and clearer
> than the current page_walk API :)

Hi David,

I agree that the current page_walk might be a bit convoluted, and that the
indirect functions approach is a bit of a hassle.
Having said that, let me clarify something.

Although this patchset touches the page_walk API wrt. getting rid of
hugetlb special casing all over the place, my goal was not as focused on
the page_walk as it was on the hugetlb code gaining the ability to be
interpreted at the PUD/PMD level.

One of the things, among others, that helped create the mess/duplication
we have wrt. hugetlb code vs mm core is that hugetlb __always__ operates
on ptes, which means that we cannot rely on the mm core to do the right
thing, and we need a bunch of hugetlb-pte functions that know about their
thing, so we lean on those.

IMHO, that was a mistake to start with, but I was not around when it was
introduced and maybe there were good reasons to deal with that the way
it is done.
But, the thing is that my ultimate goal, is for hugetlb code to be able
to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
mm core does for THP (PUD is not supported by THP, but you get me), and
that is not that difficult to do, as this patchset tries to prove.

Of course, for hugetlb to gain the ability to operate on PUD/PMD, this
means we need to add a fair amount of code, e.g. code for operating on
hugepages at the PUD level, code for markers at the PUD/PMD level for
uffd/poison stuff (only dealt with on pmd/pte atm AFAIK), swap functions
for PUD (is_swap_pud), etc.
Basically, almost everything we did for the PMD-* stuff we need for PUD
as well, and it will already be around when THP gains support for PUD, if
it ever does (I guess in a few years, if memory capacity keeps increasing).

E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with
     is_hwpoison_entry

Yes, it is a hassle to have more code around, but IMO this new code
will help us 1) move away from __always__ operating on ptes and 2) ease
integrating hugetlb code into mm core.

I will keep working on this patchset not because of the pagewalk savings,
but because I think it will help us make hugetlb more mm-core ready,
since the current pagewalk has to check that a hugetlb page can be
properly read at the PUD/PMD/PTE level no matter what: uffd for hugetlb on
PUD/PMD, hwpoison entries for swp on PUD/PMD, pud invalidation, etc.

If that gets accomplished, I think that a fair amount of code that lives
in hugetlb.c can be deleted/converted as less special casing will be needed.

I might be wrong and maybe I will hit a brick wall, but hopefully not.
Jason Gunthorpe July 8, 2024, 2:28 p.m. UTC | #7
On Mon, Jul 08, 2024 at 10:18:30AM +0200, Oscar Salvador wrote:

> IMHO, that was a mistake to start with, but I was not around when it was
> introduced and maybe there were good reasons to deal with that the way
> it is done.

It is a trade-off: either you have to write out a lot of duplicated
code for every level, or you have this sort of level-agnostic design.

> But, the thing is that my ultimate goal, is for hugetlb code to be able
> to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
> mm core does for THP (PUD is not supported by THP, but you get me), and
> that is not that difficult to do, as this patchset tries to prove.

IMHO we need to get to an API that can understand everything in a page
table. Having two APIs that are both disjoint is the problematic bit.

Improving the pud/pmd/etc API is a good direction

Nobody has explored it, but generalizing to a 'non-level' API could
also be a direction. 'Non-level' means it works more like the huge API,
where the level is not part of the function names but is somehow
encoded by the values/state/something.

This is appealing for things like page_walk where we have all these
per-level ops which are kind of pointless code duplication.

I've been doing some experiments on the iommu page table side on both
these directions and so far I haven't come to something that is really
great :\

> Of course, for hugetlb to gain the hability to operate on PUD/PMD, this
> means we need to add a fairly amount of code. e.g: for operating
> hugepages on PUD level, code for markers on PUD/PMD level for
> uffd/poison stuff (only dealt
> on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc.
> Basically, almost all we did for PMD-* stuff we need it for PUD as well,
> and that will be around when THP gains support for PUD if it ever does
> (I guess that in a few years if memory capacity keeps increasing).

Right, this is the general pain of the mm's design: we have to
duplicate so much stuff N-wise for each level, even though in a lot of
cases it isn't different for each level.

> I will keep working on this patchset not because of pagewalk savings,
> but because I think it will help us in have hugetlb more mm-core ready,
> since the current pagewalk has to test that a hugetlb page can be
> properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD,
> hwpoison entries for swp on PUD/PMD, pud invalidating, etc.

Right, it would be nice if the page walk ops didn't have to touch huge
stuff at all. pagewalk ops, as they are today, should just work with
pud/pmd/pte normal functions in all cases.

Jason
Jason Gunthorpe July 8, 2024, 2:35 p.m. UTC | #8
On Thu, Jul 04, 2024 at 10:30:14AM -0400, Peter Xu wrote:
> Hey, David,
> 
> On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
> > There are roughly two categories of page table walkers we have:
> > 
> > 1) We actually only want to walk present folios (to be precise, page
> >    ranges of folios). We should look into moving away from the walk the
> >    page walker API where possible, and have something better that
> >    directly gives us the folio (page ranges). Any PTE batching would be
> >    done internally.

This seems like a good direction for some users as well to me.

If we can reduce the number of places touching the pud/pmd/pte APIs
that is a nice abstraction to reach toward.

It naturally would remove hugepte users too.

Jason
David Hildenbrand July 10, 2024, 3:52 a.m. UTC | #9
On 08.07.24 10:18, Oscar Salvador wrote:
> On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
>> My thinking was if "remove hugetlb_entry" cannot wait for "remove
>> page_walk", because we found a reasonable way to do it better and convert
>> the individual users. Maybe it can't.
>>
>> I've not given up hope that we can end up with something better and clearer
>> than the current page_walk API :)
> 
> Hi David,
> 

Hi!

> I agree that the current page_walk might be a bit convoluted, and that the
> indirect functions approach is a bit of a hassle.
> Having said that, let me clarify something.
> 
> Although this patchset touches the page_walk API wrt. getting rid of
> hugetlb special casing all over the place, my goal was not as focused on
> the page_walk as it was on the hugetlb code to gain hability to be
> interpreted on PUD/PMD level.

I understand that. And it would all be easier and more straightforward 
if we didn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that 
works similarly to, but differently from, "ordinary" cont-pte for thp.

I'm sure you stumbled over set_huge_pte_at() on arm64, for example. 
If we, at some point, *don't* use these hugetlb functions to modify 
hugetlb entries, we might be in trouble.

That's why I think we should maybe invest our time and effort in having 
a new pagewalker that will just batch such things naturally, and users 
that can operate on that naturally. For example: a hugetlb 
cont-pte-mapped folio will just naturally be reported as a "fully mapped 
folio", just like a THP would be if mapped in a compatible way.

Yes, this requires more work, but as raised in some patches here, 
working on individual PTEs/PMDs for hugetlb is problematic.

You have to batch every operation, to essentially teach ordinary code to 
do what the hugetlb_* special code would have done on cont-pte/cont-pmd 
things.


(as a side note, cont-pte/cont-pmd should primarily be a hint from arch 
code on how many entries we can batch, like we do in folio_pte_batch(); 
point is that we want to batch also on architectures where we don't have 
such bits, and prepare for architectures that implement various sizes of 
batching; IMHO, having cont-pte/cont-pmd checks in common code is likely 
the wrong approach. Again, folio_pte_batch() is where we tackled the 
problem differently from the THP perspective)

> 
> One of the things, among other things, that helped in creating this
> mess/duplication we have wrt. hugetlb code vs mm core is that hugetlb
> __always__ operates on ptes, which means that we cannot rely on the mm
> core to do the right thing, and we need a bunch of hugetlb-pte functions
> that knows about their thing, so we lean on that.
> 
> IMHO, that was a mistake to start with, but I was not around when it was
> introduced and maybe there were good reasons to deal with that the way
> it is done.
> But, the thing is that my ultimate goal, is for hugetlb code to be able
> to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
> mm core does for THP (PUD is not supported by THP, but you get me), and
> that is not that difficult to do, as this patchset tries to prove.
> 
> Of course, for hugetlb to gain the ability to operate on PUD/PMD, this
> means we need to add a fair amount of code, e.g.: code for operating on
> hugepages at PUD level, code for markers at PUD/PMD level for
> uffd/poison stuff (only dealt with
> on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc.
> Basically, almost everything we did for PMD-* stuff we need for PUD as
> well, and that will have to be around when THP gains support for PUD, if
> it ever does (I guess in a few years, if memory capacity keeps increasing).
> 
> E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with
>       is_hwpoison_entry
> 
> Yes, it is a hassle to have more code around, but IMO, this new code
> will help us 1) move away from __always__ operating on ptes and 2)
> ease integrating hugetlb code into mm core.
> 
> I will keep working on this patchset not because of pagewalk savings,
> but because I think it will help us make hugetlb more mm-core ready,
> since the current pagewalk has to test that a hugetlb page can be
> properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD,
> hwpoison entries for swp on PUD/PMD, pud invalidating, etc.
> 
> If that gets accomplished, I think that a fair amount of code that lives
> in hugetlb.c can be deleted/converted as less special casing will be needed.
> 
> I might be wrong and maybe I will hit a brick wall, but hopefully not.

I have an idea for a better page table walker API that would try 
batching most entries (under one PTL), and walkers can just register for 
the types they want. Hoping I will find some time to at least sketch the
user interface soon.

That doesn't mean that this should block your work, but the 
cont-pte/cont-pmd hugetlb stuff is really nasty to handle here, and I
don't particularly like where this is going.
Oscar Salvador July 10, 2024, 11:26 a.m. UTC | #10
On Wed, Jul 10, 2024 at 05:52:43AM +0200, David Hildenbrand wrote:
> I understand that. And it would all be easier+more straight forward if we
> wouldn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that works
> similar, but different to "ordinary" cont-pte for thp.
> 
> I'm sure you stumbled over the set_huge_pte_at() on arm64 for example. If
> we, at one point *don't* use these hugetlb functions right now to modify
> hugetlb entries, we might be in trouble.
> 
> That's why I think we should maybe invest our time and effort in having a
> new pagewalker that will just batch such things naturally, and users that
> can operate on that naturally. For example: a hugetlb cont-pte-mapped folio
> will just naturally be reported as a "fully mapped folio", just like a THP
> would be if mapped in a compatible way.
> 
> Yes, this requires more work, but as raised in some patches here, working on
> individual PTEs/PMDs for hugetlb is problematic.
> 
> You have to batch every operation, to essentially teach ordinary code to do
> what the hugetlb_* special code would have done on cont-pte/cont-pmd things.
> 
> 
> (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code
> on how many entries we can batch, like we do in folio_pte_batch(); point is
> that we want to batch also on architectures where we don't have such bits,
> and prepare for architectures that implement various sizes of batching;
> IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong
> approach. Again, folio_pte_batch() is where we tackled the problem
> differently from the THP perspective)

I must say I did not check folio_pte_batch() and I am totally ignorant
of what/how it does things.
I will have a look.

> I have an idea for a better page table walker API that would try batching
> most entries (under one PTL), and walkers can just register for the types
> they want. Hoping I will find some time to at least sketch the user
> interface soon.
> 
> That doesn't mean that this should block your work, but the
> cont-pte/cont-pmd hugetlb stuff is really nasty to handle here, and I don't
> particularly like where this is going.

Ok, let me take a step back then.
Previous versions of this RFC did not handle cont-{pte,pmd} out in the
open, so let me go back to the drawing board and come up with something
that does not fiddle with cont- stuff in that way.

I might post here a small diff just to see if we are on the same page.

As usual, thanks a lot for your comments David!
David Hildenbrand July 11, 2024, 12:15 a.m. UTC | #11
>> (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code
>> on how many entries we can batch, like we do in folio_pte_batch(); point is
>> that we want to batch also on architectures where we don't have such bits,
>> and prepare for architectures that implement various sizes of batching;
>> IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong
>> approach. Again, folio_pte_batch() is where we tackled the problem
>> differently from the THP perspective)
> 
> I must say I did not check folio_pte_batch() and I am totally ignorant
> of what/how it does things.
> I will have a look.
> 
>> I have an idea for a better page table walker API that would try batching
>> most entries (under one PTL), and walkers can just register for the types
>> they want. Hoping I will find some time to at least sketch the user
>> interface soon.
>>
>> That doesn't mean that this should block your work, but the
>> cont-pte/cont-pmd hugetlb stuff is really nasty to handle here, and I don't
>> particularly like where this is going.
> 
> Ok, let me take a step back then.
> Previous versions of this RFC did not handle cont-{pte,pmd} out in the
> open, so let me go back to the drawing board and come up with something
> that does not fiddle with cont- stuff in that way.
> 
> I might post here a small diff just to see if we are on the same page.
> 
> As usual, thanks a lot for your comments David!

Feel free to reach out to discuss ways forward. I think we should

(a) move to the automatic cont-pte setting as done for THPs via
     set_ptes().
(b) Batching PTE updates at all relevant places, so we get no change in
     behavior: cont-pte bit will remain set.
(c) Likely remove the use of cont-pte bits in hugetlb code for anything
     that is not a present folio (i.e., where automatic cont-pte bit
     setting would never set it). Migration entries might require
     thought (we can easily batch to achieve the same thing, but the
     behavior of hugetlb likely differs to the generic way of handling
     migration entries on multiple ptes: reference the folio vs.
     the respective subpages of the folio).

Then we are essentially replacing what the current hugetlb_ functions do 
with the common way it is being done for THP (which does not exist for 
cont-pmd yet ... ). The only real alternative is special casing hugetlb 
all over the place to still call the hugetlb_* functions.

:( it would all be easier without that hugetlb cont-pte/cont-pmd usage.

CCing Ryan so he's aware.
Oscar Salvador July 11, 2024, 4:48 a.m. UTC | #12
On Thu, Jul 11, 2024 at 02:15:38AM +0200, David Hildenbrand wrote:
> > > (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code
> > > on how many entries we can batch, like we do in folio_pte_batch(); point is
> > > that we want to batch also on architectures where we don't have such bits,
> > > and prepare for architectures that implement various sizes of batching;
> > > IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong
> > > approach. Again, folio_pte_batch() is where we tackled the problem
> > > differently from the THP perspective)
> > 
> > I must say I did not check folio_pte_batch() and I am totally ignorant
> > of what/how it does things.
> > I will have a look.
> > 
> > > I have an idea for a better page table walker API that would try batching
> > > most entries (under one PTL), and walkers can just register for the types
> > > they want. Hoping I will find some time to at least sketch the user
> > > interface soon.
> > > 
> > > That doesn't mean that this should block your work, but the
> > > cont-pte/cont-pmd hugetlb stuff is really nasty to handle here, and I don't
> > > particularly like where this is going.
> > 
> > Ok, let me take a step back then.
> > Previous versions of this RFC did not handle cont-{pte,pmd} out in the
> > open, so let me go back to the drawing board and come up with something
> > that does not fiddle with cont- stuff in that way.
> > 
> > I might post here a small diff just to see if we are on the same page.
> > 
> > As usual, thanks a lot for your comments David!
> 
> Feel free to reach out to discuss ways forward. I think we should
> 
> (a) move to the automatic cont-pte setting as done for THPs via
>     set_ptes().
> (b) Batching PTE updates at all relevant places, so we get no change in
>     behavior: cont-pte bit will remain set.
> (c) Likely remove the use of cont-pte bits in hugetlb code for anything
>     that is not a present folio (i.e., where automatic cont-pte bit
>     setting would never set it). Migration entries might require
>     thought (we can easily batch to achieve the same thing, but the
>     behavior of hugetlb likely differs to the generic way of handling
>     migration entries on multiple ptes: reference the folio vs.
>     the respective subpages of the folio).

Uhm, I see, but I am a bit confused.
Although related, this seems orthogonal to this series and more like for
a next-thing to do, right?

It is true that this series tries to handle cont-{pmd,pte} in the
pagewalk api for hugetlb vmas, but in order to raise fewer eyebrows I
can come up with a way not to do that for now, so we do not fiddle with
cont-stuff in this series.


Or am I misunderstanding you?
David Hildenbrand July 11, 2024, 4:53 a.m. UTC | #13
On 11.07.24 06:48, Oscar Salvador wrote:
> On Thu, Jul 11, 2024 at 02:15:38AM +0200, David Hildenbrand wrote:
>>>> (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code
>>>> on how many entries we can batch, like we do in folio_pte_batch(); point is
>>>> that we want to batch also on architectures where we don't have such bits,
>>>> and prepare for architectures that implement various sizes of batching;
>>>> IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong
>>>> approach. Again, folio_pte_batch() is where we tackled the problem
>>>> differently from the THP perspective)
>>>
>>> I must say I did not check folio_pte_batch() and I am totally ignorant
>>> of what/how it does things.
>>> I will have a look.
>>>
>>>> I have an idea for a better page table walker API that would try batching
>>>> most entries (under one PTL), and walkers can just register for the types
>>>> they want. Hoping I will find some time to at least sketch the user
>>>> interface soon.
>>>>
>>>> That doesn't mean that this should block your work, but the
>>>> cont-pte/cont-pmd hugetlb stuff is really nasty to handle here, and I don't
>>>> particularly like where this is going.
>>>
>>> Ok, let me take a step back then.
>>> Previous versions of this RFC did not handle cont-{pte,pmd} out in the
>>> open, so let me go back to the drawing board and come up with something
>>> that does not fiddle with cont- stuff in that way.
>>>
>>> I might post here a small diff just to see if we are on the same page.
>>>
>>> As usual, thanks a lot for your comments David!
>>
>> Feel free to reach out to discuss ways forward. I think we should
>>
>> (a) move to the automatic cont-pte setting as done for THPs via
>>      set_ptes().
>> (b) Batching PTE updates at all relevant places, so we get no change in
>>      behavior: cont-pte bit will remain set.
>> (c) Likely remove the use of cont-pte bits in hugetlb code for anything
>>      that is not a present folio (i.e., where automatic cont-pte bit
>>      setting would never set it). Migration entries might require
>>      thought (we can easily batch to achieve the same thing, but the
>>      behavior of hugetlb likely differs to the generic way of handling
>>      migration entries on multiple ptes: reference the folio vs.
>>      the respective subpages of the folio).
> 
> Uhm, I see, but I am a bit confused.
> Although related, this seems orthogonal to this series and more like for
> a next-thing to do, right?

Well, yes and no. The thing is that the cont-pte/cont-pmd stuff is not 
as easy to handle as the PMD/PUD stuff, and sorting that out sounds 
like some "pain". That's the ugly part of hugetlb, where it's simply ... 
quite different :(

> 
> It is true that this series tries to handle cont-{pmd,pte} in the
> pagewalk api for hugetlb vmas, but in order to raise fewer eyebrows I
> can come up with a way not to do that for now, so we do not fiddle with
> cont-stuff in this series.
> 
> 
> Or am I misunderstanding you?

I can answer once I know more details about the approach you have in mind :)