[RESEND,v2,0/9] Merge arm64/riscv hugetlbfs contpte support

Message ID 20240508113419.18620-1-alexghiti@rivosinc.com (mailing list archive)

Message

Alexandre Ghiti May 8, 2024, 11:34 a.m. UTC
This patchset intends to merge the contiguous ptes hugetlbfs implementation
of arm64 and riscv.

Both arm64 and riscv support the use of contiguous ptes to map pages that
are larger than the default page table size, respectively called contpte
and svnapot.

The riscv implementation differs from the arm64's in that the LSBs of the
pfn of a svnapot pte are used to store the size of the mapping, allowing
for future sizes to be added (for now only 64KB is supported). That's an
issue for the core mm code which expects to find the *real* pfn a pte points
to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
and restores the size of the mapping when it is written to a page table.
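
To illustrate the encoding issue described above, here is a minimal userspace
sketch, not the kernel implementation. It assumes the Svnapot layout for a 64KB
mapping (N bit at bit 63, PPN field in bits 53:10, low 4 PPN bits set to
0b1000). The names napot_encode(), napot_decode_64k() and demo_pte_pfn() are
hypothetical and exist only for this example.

```c
#include <stdint.h>

#define DEMO_PFN_SHIFT        10                    /* PPN field starts at bit 10 */
#define DEMO_PFN_MASK         (((1ULL << 44) - 1) << DEMO_PFN_SHIFT)
#define DEMO_PTE_NAPOT        (1ULL << 63)          /* Svnapot N bit */
#define DEMO_NAPOT_ORDER_64K  4                     /* 16 x 4KB = 64KB */

/* Extract the raw pfn field, ignoring the N bit and the flag bits. */
static uint64_t demo_pte_pfn(uint64_t pte)
{
	return (pte & DEMO_PFN_MASK) >> DEMO_PFN_SHIFT;
}

/*
 * Encode the mapping size in the pfn: clear the low `order` pfn bits,
 * set bit (order - 1) of the pfn and set the N bit.
 */
static uint64_t napot_encode(uint64_t pte, unsigned int order)
{
	uint64_t low_mask = ((1ULL << order) - 1) << DEMO_PFN_SHIFT;
	uint64_t size_bit = 1ULL << (order - 1 + DEMO_PFN_SHIFT);

	return (pte & ~low_mask) | size_bit | DEMO_PTE_NAPOT;
}

/*
 * Undo the encoding for a 64KB mapping: drop the N bit and the size bits,
 * which yields the pfn of the first page of the 64KB region.
 */
static uint64_t napot_decode_64k(uint64_t pte)
{
	uint64_t low_mask = ((1ULL << DEMO_NAPOT_ORDER_64K) - 1) << DEMO_PFN_SHIFT;

	return pte & ~(low_mask | DEMO_PTE_NAPOT);
}
```

With a 64KB-aligned pfn of 0x12340, the encoded pte carries 0x12348 in its pfn
field. That is the value core mm code must never see, and it is why the pfn has
to be restored on read and re-encoded on write. Note that this sketch only
recovers the base pfn of the region.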

The following patches are just merges of the 2 different implementations
that currently exist in arm64 and riscv which are very similar. It paves
the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
reimplementing the same in riscv.

This patchset was tested by running the libhugetlbfs testsuite with 64KB
and 2MB pages on both architectures (on a 4KB base page size arm64 kernel).

[1] https://lore.kernel.org/linux-arm-kernel/20240215103205.2607016-1-ryan.roberts@arm.com/

Changes in v2:
  - Rebase on top of 6.9-rc3

Alexandre Ghiti (9):
  riscv: Restore the pfn in a NAPOT pte when manipulated by core mm code
  riscv: Safely remove huge_pte_offset() when manipulating NAPOT ptes
  mm: Use common huge_ptep_get() function for riscv/arm64
  mm: Use common set_huge_pte_at() function for riscv/arm64
  mm: Use common huge_pte_clear() function for riscv/arm64
  mm: Use common huge_ptep_get_and_clear() function for riscv/arm64
  mm: Use common huge_ptep_set_access_flags() function for riscv/arm64
  mm: Use common huge_ptep_set_wrprotect() function for riscv/arm64
  mm: Use common huge_ptep_clear_flush() function for riscv/arm64

 arch/arm64/Kconfig                  |   1 +
 arch/arm64/include/asm/pgtable.h    |  56 +++++-
 arch/arm64/mm/hugetlbpage.c         | 291 +---------------------------
 arch/riscv/Kconfig                  |   1 +
 arch/riscv/include/asm/hugetlb.h    |   2 +-
 arch/riscv/include/asm/pgtable-64.h |  11 ++
 arch/riscv/include/asm/pgtable.h    | 153 +++++++++++++--
 arch/riscv/mm/hugetlbpage.c         | 227 ----------------------
 arch/riscv/mm/pgtable.c             |   6 +-
 mm/Kconfig                          |   3 +
 mm/Makefile                         |   1 +
 mm/contpte.c                        | 272 ++++++++++++++++++++++++++
 12 files changed, 480 insertions(+), 544 deletions(-)
 create mode 100644 mm/contpte.c

Comments

Ryan Roberts May 10, 2024, 1:49 p.m. UTC | #1
On 08/05/2024 12:34, Alexandre Ghiti wrote:
> This patchset intends to merge the contiguous ptes hugetlbfs implementation
> of arm64 and riscv.
> 
> Both arm64 and riscv support the use of contiguous ptes to map pages that
> are larger than the default page table size, respectively called contpte
> and svnapot.
> 
> The riscv implementation differs from the arm64's in that the LSBs of the
> pfn of a svnapot pte are used to store the size of the mapping, allowing
> for future sizes to be added (for now only 64KB is supported). That's an
> issue for the core mm code which expects to find the *real* pfn a pte points
> to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
> and restores the size of the mapping when it is written to a page table.
> 
> The following patches are just merges of the 2 different implementations
> that currently exist in arm64 and riscv which are very similar. It paves
> the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
> reimplementing the same in riscv.

Hi Alexandre,

I've skimmed through this series and the one that moves contpte. I can see there
is definitely value in sharing the implementation, and the rough shape of things
seems appropriate. I had some minor concerns about making it harder to implement
potential future arm64 errata workarounds but on reflection, most of the
now-shared code is really just wrapping the primitives that are still arch-specific.

I'm going to need to spend proper time reviewing it to give detailed feedback,
but I'll be out on paternity leave for 3 weeks from end of Monday at the latest.
So realistically I won't be able to do the detailed review until at least the
first week of June.

Some high level thoughts:

 - huge_ptep_* functions could be working on different sized huge ptes - arm64
supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c
appropriate? Perhaps it's better to keep huge_pte and contpte separate? Also, it
only works on arm64 because we can get away with calling the lower-level pte
functions even when the huge_pte is actually a contpmd/pmd/pud, because the
format is the same. That might present challenges to other arches if the format
is different?

 - It might be easier to review if the arm64 stuff is first moved (without
changes) then modified to make it suitable for riscv, then for riscv to be
hooked up. At the moment I'm trying to follow all 3 parts per-function.

Thanks,
Ryan


Alexandre Ghiti May 12, 2024, 5:25 p.m. UTC | #2
Hi Ryan,

On Fri, May 10, 2024 at 3:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 08/05/2024 12:34, Alexandre Ghiti wrote:
> > This patchset intends to merge the contiguous ptes hugetlbfs implementation
> > of arm64 and riscv.
> >
> > Both arm64 and riscv support the use of contiguous ptes to map pages that
> > are larger than the default page table size, respectively called contpte
> > and svnapot.
> >
> > The riscv implementation differs from the arm64's in that the LSBs of the
> > pfn of a svnapot pte are used to store the size of the mapping, allowing
> > for future sizes to be added (for now only 64KB is supported). That's an
> > issue for the core mm code which expects to find the *real* pfn a pte points
> > to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
> > and restores the size of the mapping when it is written to a page table.
> >
> > The following patches are just merges of the 2 different implementations
> > that currently exist in arm64 and riscv which are very similar. It paves
> > the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
> > reimplementing the same in riscv.
>
> Hi Alexandre,
>
> I've skimmed through this series and the one that moves contpte. I can see there
> is definitely value in sharing the implementation, and the rough shape of things
> seems appropriate. I had some minor concerns about making it harder to implement
> potential future arm64 errata workarounds but on reflection, most of the
> now-shared code is really just wrapping the primitives that are still arch-specific.
>
> I'm going to need to spend proper time reviewing it to give detailed feedback,
> but I'll be out on paternity leave for 3 weeks from end of Monday at the latest.

Too bad, I expected to discuss that with you at LSF/MM...But congrats!
Hope your wife is fine :)

> So realistically I won't be able to do the detailed review until at least the
> first week of June.
>
> Some high level thoughts:
>
>  - huge_ptep_* functions could be working on different sized huge ptes - arm64
> supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c
> appropriate?

Hmm indeed, I'll see what I can do.

> Perhaps it's better to keep huge_pte and contpte separate? Also, it
> only works on arm64 because we can get away with calling the lower-level pte
> functions even when the huge_pte is actually a contpmd/pmd/pud, because the
> format is the same. That might present challenges to other arches if the format
> is different?

Yes, but I think that if that happens, we could get away with it by
choosing the right function depending on the size of the mapping?

>
>  - It might be easier to review if the arm64 stuff is first moved (without
> changes) then modified to make it suitable for riscv, then for riscv to be
> hooked up. At the moment I'm trying to follow all 3 parts per-function.

Ok, let me give it a try during your paternity leave!

>
> Thanks,
> Ryan

Thanks,

Alex

Ryan Roberts May 13, 2024, 8:59 a.m. UTC | #3
On 12/05/2024 18:25, Alexandre Ghiti wrote:
> Hi Ryan,
> 
> On Fri, May 10, 2024 at 3:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 08/05/2024 12:34, Alexandre Ghiti wrote:
>>> This patchset intends to merge the contiguous ptes hugetlbfs implementation
>>> of arm64 and riscv.
>>>
>>> Both arm64 and riscv support the use of contiguous ptes to map pages that
>>> are larger than the default page table size, respectively called contpte
>>> and svnapot.
>>>
>>> The riscv implementation differs from the arm64's in that the LSBs of the
>>> pfn of a svnapot pte are used to store the size of the mapping, allowing
>>> for future sizes to be added (for now only 64KB is supported). That's an
>>> issue for the core mm code which expects to find the *real* pfn a pte points
>>> to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
>>> and restores the size of the mapping when it is written to a page table.
>>>
>>> The following patches are just merges of the 2 different implementations
>>> that currently exist in arm64 and riscv which are very similar. It paves
>>> the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
>>> reimplementing the same in riscv.
>>
>> Hi Alexandre,
>>
>> I've skimmed through this series and the one that moves contpte. I can see there
>> is definitely value in sharing the implementation, and the rough shape of things
>> seems appropriate. I had some minor concerns about making it harder to implement
>> potential future arm64 errata workarounds but on reflection, most of the
>> now-shared code is really just wrapping the primitives that are still arch-specific.
>>
>> I'm going to need to spend proper time reviewing it to give detailed feedback,
>> but I'll be out on paternity leave for 3 weeks from end of Monday at the latest.
> 
> Too bad, I expected to discuss that with you at LSF/MM...But congrats!
> Hope your wife is fine :)

Thanks! Yes, it's unfortunate timing - there are a few topics I would have liked
to get involved with. There's always next year...

> 
>> So realistically I won't be able to do the detailed review until at least the
>> first week of June.
>>
>> Some high level thoughts:
>>
>>  - huge_ptep_* functions could be working on different sized huge ptes - arm64
>> supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c
>> appropriate?
> 
> Hmm indeed, I'll see what I can do.
> 
>> Perhaps it's better to keep huge_pte and contpte separate? Also, it
>> only works on arm64 because we can get away with calling the lower-level pte
>> functions even when the huge_pte is actually a contpmd/pmd/pud, because the
>> format is the same. That might present challenges to other arches if the format
>> is different?
> 
> Yes, but I think that if that happens, we could get away with it by
> choosing the right function depending on the size of the mapping?

Yes possibly. One potential future new user of this common code would be arm32
(arch/arm), which also has the contig bit. But AIUI, the pmd and pte formats are
quite different. It's likely that arm would want to opt-in to contpte but not
huge_ptep, so separate selectors may be valuable.

> 
>>
>>  - It might be easier to review if the arm64 stuff is first moved (without
>> changes) then modified to make it suitable for riscv, then for riscv to be
>> hooked up. At the moment I'm trying to follow all 3 parts per-function.
> 
> Ok, let me give it a try during your paternity leave!

Thanks! If it's too difficult then it's not a deal-breaker. Perhaps just an
initial patch to move the existing arm functions to core-mm without change, then
all your existing patches on top of that would do the job?

> 
>>
>> Thanks,
>> Ryan
> 
> Thanks,
> 
> Alex
> 
Alexandre Ghiti May 28, 2024, 8:07 a.m. UTC | #4
Hi Ryan,

On 12/05/2024 19:25, Alexandre Ghiti wrote:
> Hi Ryan,
>
> On Fri, May 10, 2024 at 3:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> On 08/05/2024 12:34, Alexandre Ghiti wrote:
>>> This patchset intends to merge the contiguous ptes hugetlbfs implementation
>>> of arm64 and riscv.
>>>
>>> Both arm64 and riscv support the use of contiguous ptes to map pages that
>>> are larger than the default page table size, respectively called contpte
>>> and svnapot.
>>>
>>> The riscv implementation differs from the arm64's in that the LSBs of the
>>> pfn of a svnapot pte are used to store the size of the mapping, allowing
>>> for future sizes to be added (for now only 64KB is supported). That's an
>>> issue for the core mm code which expects to find the *real* pfn a pte points
>>> to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
>>> and restores the size of the mapping when it is written to a page table.
>>>
>>> The following patches are just merges of the 2 different implementations
>>> that currently exist in arm64 and riscv which are very similar. It paves
>>> the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
>>> reimplementing the same in riscv.
>> Hi Alexandre,
>>
>> I've skimmed through this series and the one that moves contpte. I can see there
>> is definitely value in sharing the implementation, and the rough shape of things
>> seems appropriate. I had some minor concerns about making it harder to implement
>> potential future arm64 errata workarounds but on reflection, most of the
>> now-shared code is really just wrapping the primitives that are still arch-specific.
>>
>> I'm going to need to spend proper time reviewing it to give detailed feedback,
>> but I'll be out on paternity leave for 3 weeks from end of Monday at the latest.
> Too bad, I expected to discuss that with you at LSF/MM...But congrats!
> Hope your wife is fine :)
>
>> So realistically I won't be able to do the detailed review until at least the
>> first week of June.
>>
>> Some high level thoughts:
>>
>>   - huge_ptep_* functions could be working on different sized huge ptes - arm64
>> supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c
>> appropriate?
> Hmm indeed, I'll see what I can do.


So I took a look at that. It amounts to doing the same as what we do for
THP contptes, i.e. having both contpte-aware and "normal" APIs. Taking
huge_ptep_get() as an example, below is what I get. To me it's not that
bad, so I'll implement this unless there is strong opposition.


diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f8efbc128446..869a9aae6c68 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1715,6 +1715,16 @@ static inline void clear_young_dirty_ptes(struct vm_area_struct *vma,
                 contpte_clear_young_dirty_ptes(vma, addr, ptep, nr, flags);
  }

+static inline pte_t huge_ptep_get(pte_t *ptep)
+{
+        pte_t orig_pte = __ptep_get(ptep);
+
+        if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+                return orig_pte;
+
+        return contpte_huge_ptep_get(ptep);
+}
+
  #else /* CONFIG_ARM64_CONTPTE */

  #define ptep_get                               __ptep_get
@@ -1736,6 +1746,8 @@ static inline void clear_young_dirty_ptes(struct vm_area_struct *vma,
  #define ptep_set_access_flags __ptep_set_access_flags
  #define clear_young_dirty_ptes __clear_young_dirty_ptes

+#define huge_ptep_get                          __ptep_get
+
  #endif /* CONFIG_ARM64_CONTPTE */

  #endif /* !__ASSEMBLY__ */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 3f09ac73cce3..aa0ee3f02226 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -127,28 +127,6 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
         return contig_ptes;
  }

-pte_t huge_ptep_get(pte_t *ptep)
-{
-       int ncontig, i;
-       size_t pgsize;
-       pte_t orig_pte = __ptep_get(ptep);
-
-       if (!pte_present(orig_pte) || !pte_cont(orig_pte))
-               return orig_pte;
-
-       ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
-       for (i = 0; i < ncontig; i++, ptep++) {
-               pte_t pte = __ptep_get(ptep);
-
-               if (pte_dirty(pte))
-                       orig_pte = pte_mkdirty(orig_pte);
-
-               if (pte_young(pte))
-                       orig_pte = pte_mkyoung(orig_pte);
-       }
-       return orig_pte;
-}
-
  /*
   * Changing some bits of contiguous entries requires us to follow a
   * Break-Before-Make approach, breaking the whole contiguous set
diff --git a/mm/contpte.c b/mm/contpte.c
new file mode 100644
index 000000000000..4e742cf00b6f
--- /dev/null
+++ b/mm/contpte.c
@@ -0,0 +1,18 @@
+pte_t contpte_huge_ptep_get(pte_t *ptep)
+{
+        int ncontig, i;
+        size_t pgsize;
+        pte_t orig_pte = __ptep_get(ptep);
+
+        ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
+        for (i = 0; i < ncontig; i++, ptep++) {
+                pte_t pte = __ptep_get(ptep);
+
+                if (pte_dirty(pte))
+                        orig_pte = pte_mkdirty(orig_pte);
+
+                if (pte_young(pte))
+                        orig_pte = pte_mkyoung(orig_pte);
+        }
+        return orig_pte;
+}
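
The loop in the function above exists because, for a contiguous range, the
hardware may record the access and dirty state in any one of the entries, so a
reader has to OR them together. Here is a minimal userspace sketch of that
folding. The bit positions and names (DEMO_PTE_YOUNG, DEMO_PTE_DIRTY,
demo_fold_young_dirty()) are hypothetical and do not correspond to any real
architecture.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PTE_YOUNG  (1ULL << 0)     /* hypothetical "accessed" bit */
#define DEMO_PTE_DIRTY  (1ULL << 1)     /* hypothetical "dirty" bit */

/*
 * Return the first entry of a contiguous set, with the young and dirty
 * bits of every entry in the set folded into it.
 */
static uint64_t demo_fold_young_dirty(const uint64_t *ptes, size_t ncontig)
{
	uint64_t folded = ptes[0];
	size_t i;

	for (i = 0; i < ncontig; i++)
		folded |= ptes[i] & (DEMO_PTE_YOUNG | DEMO_PTE_DIRTY);

	return folded;
}
```

All other bits come from the first entry only, which mirrors how the patch
starts from orig_pte and only ever adds dirty/young to it.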

>
>> Perhaps it's better to keep huge_pte and contpte separate? Also, it
>> only works on arm64 because we can get away with calling the lower-level pte
>> functions even when the huge_pte is actually a contpmd/pmd/pud, because the
>> format is the same. That might present challenges to other arches if the format
>> is different?
> Yes, but I think that if that happens, we could get away with it by
> choosing the right function depending on the size of the mapping?
>
>>   - It might be easier to review if the arm64 stuff is first moved (without
>> changes) then modified to make it suitable for riscv, then for riscv to be
>> hooked up. At the moment I'm trying to follow all 3 parts per-function.
> Ok, let me give it a try during your paternity leave!
>
>> Thanks,
>> Ryan
> Thanks,
>
> Alex
>
Ryan Roberts June 24, 2024, 8 a.m. UTC | #5
On 28/05/2024 09:07, Alexandre Ghiti wrote:
> Hi Ryan,
> 
> On 12/05/2024 19:25, Alexandre Ghiti wrote:
>> Hi Ryan,
>>
>> On Fri, May 10, 2024 at 3:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>> On 08/05/2024 12:34, Alexandre Ghiti wrote:
>>>> This patchset intends to merge the contiguous ptes hugetlbfs implementation
>>>> of arm64 and riscv.
>>>>
>>>> Both arm64 and riscv support the use of contiguous ptes to map pages that
>>>> are larger than the default page table size, respectively called contpte
>>>> and svnapot.
>>>>
>>>> The riscv implementation differs from the arm64's in that the LSBs of the
>>>> pfn of a svnapot pte are used to store the size of the mapping, allowing
>>>> for future sizes to be added (for now only 64KB is supported). That's an
>>>> issue for the core mm code which expects to find the *real* pfn a pte points
>>>> to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
>>>> and restores the size of the mapping when it is written to a page table.
>>>>
>>>> The following patches are just merges of the 2 different implementations
>>>> that currently exist in arm64 and riscv which are very similar. It paves
>>>> the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
>>>> reimplementing the same in riscv.
>>> Hi Alexandre,
>>>
>>> I've skimmed through this series and the one that moves contpte. I can see there
>>> is definitely value in sharing the implementation, and the rough shape of things
>>> seems appropriate. I had some minor concerns about making it harder to implement
>>> potential future arm64 errata workarounds but on reflection, most of the
>>> now-shared code is really just wrapping the primitives that are still
>>> arch-specific.
>>>
>>> I'm going to need to spend proper time reviewing it to give detailed feedback,
>>> but I'll be out on paternity leave for 3 weeks from end of Monday at the latest.
>> Too bad, I expected to discuss that with you at LSF/MM...But congrats!
>> Hope your wife is fine :)
>>
>>> So realistically I won't be able to do the detailed review until at least the
>>> first week of June.

Hi Alexandre,

Sorry for the radio silence. I'm back at work now and have some cycles to review
this. Did you ever post a new version based on the suggestions below?

>>>
>>> Some high level thoughts:
>>>
>>>   - huge_ptep_* functions could be working on different sized huge ptes - arm64
>>> supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c
>>> appropriate?
>> Hmm indeed, I'll see what I can do.
> 
> 
> So I took a look at that. It amounts to doing the same as what we do for THP
> contptes, ie having both contpte-aware and "normal" APIs. Let's take for example
> huge_ptep_get(), below is what I get. To me it's not that bad, so I'll implement
> this unless there is strong opposition.

I'm not sure I've understood what you are doing here... see below.

> 
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f8efbc128446..869a9aae6c68 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1715,6 +1715,16 @@ static inline void clear_young_dirty_ptes(struct
> vm_area_struct *vma,
>                 contpte_clear_young_dirty_ptes(vma, addr, ptep, nr, flags);
>  }
> 
> +static inline pte_t huge_ptep_get(pte_t *ptep)
> +{
> +        pte_t orig_pte = __ptep_get(ptep);
> +
> +        if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +                return orig_pte;
> +
> +        return contpte_huge_ptep_get(ptep);

A "huge pte" is not the same as a "cont pte". A huge pte is an abstract thing,
which may be one of a number of different sizes; on arm64 with 4K base pages,
64K, 2M, 32M and 1G are supported. The 64K size is implemented using the
PTE_CONT bit at PTE level. 2M is a single PMD level block, 32M uses PMD_CONT at
PMD level and 1G is 1 PUD block. So I'm not sure it makes sense to tie this up
with "contpte_" functions?
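
The size list above can be sketched as a lookup from huge page size to page
table level and entry count. This is an illustrative userspace sketch for the
4K base page case only, assuming 16 entries per contiguous set at both PTE and
PMD level (which matches 64K = 16 x 4K and 32M = 16 x 2M). The names
struct demo_huge_geometry and demo_huge_geometry_4k() are hypothetical, not
kernel APIs.

```c
#include <stddef.h>

#define DEMO_SZ_4K   ((size_t)1 << 12)
#define DEMO_SZ_64K  ((size_t)1 << 16)
#define DEMO_SZ_2M   ((size_t)1 << 21)
#define DEMO_SZ_32M  ((size_t)1 << 25)
#define DEMO_SZ_1G   ((size_t)1 << 30)

enum demo_level { DEMO_LEVEL_PTE, DEMO_LEVEL_PMD, DEMO_LEVEL_PUD };

struct demo_huge_geometry {
	enum demo_level level;  /* page table level holding the entries */
	int nr_entries;         /* 1 for a block, 16 for a contiguous set */
	size_t entry_size;      /* size mapped by each individual entry */
};

/* Fill *g for a supported huge page size; return -1 for anything else. */
static int demo_huge_geometry_4k(size_t size, struct demo_huge_geometry *g)
{
	switch (size) {
	case DEMO_SZ_64K:       /* contiguous set at PTE level */
		*g = (struct demo_huge_geometry){ DEMO_LEVEL_PTE, 16, DEMO_SZ_4K };
		return 0;
	case DEMO_SZ_2M:        /* single PMD block */
		*g = (struct demo_huge_geometry){ DEMO_LEVEL_PMD, 1, DEMO_SZ_2M };
		return 0;
	case DEMO_SZ_32M:       /* contiguous set at PMD level */
		*g = (struct demo_huge_geometry){ DEMO_LEVEL_PMD, 16, DEMO_SZ_2M };
		return 0;
	case DEMO_SZ_1G:        /* single PUD block */
		*g = (struct demo_huge_geometry){ DEMO_LEVEL_PUD, 1, DEMO_SZ_1G };
		return 0;
	default:
		return -1;
	}
}
```

Only the first case involves contiguous ptes at all, which is the point being
made: a helper named "contpte_" covers one of four huge pte shapes.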

> +}
> +
>  #else /* CONFIG_ARM64_CONTPTE */
> 
>  #define ptep_get                               __ptep_get
> @@ -1736,6 +1746,8 @@ static inline void clear_young_dirty_ptes(struct
> vm_area_struct *vma,
>  #define ptep_set_access_flags __ptep_set_access_flags
>  #define clear_young_dirty_ptes __clear_young_dirty_ptes
> 
> +#define huge_ptep_get                          __ptep_get

I don't quite understand the logic here. huge ptes are needed for hugetlb so
their definition needs to be tied to that, not to ARM64_CONTPTE, which is an
independent feature.

> +
>  #endif /* CONFIG_ARM64_CONTPTE */
> 
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 3f09ac73cce3..aa0ee3f02226 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -127,28 +127,6 @@ static inline int num_contig_ptes(unsigned long size,
> size_t *pgsize)
>         return contig_ptes;
>  }
> 
> -pte_t huge_ptep_get(pte_t *ptep)
> -{
> -       int ncontig, i;
> -       size_t pgsize;
> -       pte_t orig_pte = __ptep_get(ptep);
> -
> -       if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> -               return orig_pte;
> -
> -       ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
> -       for (i = 0; i < ncontig; i++, ptep++) {
> -               pte_t pte = __ptep_get(ptep);
> -
> -               if (pte_dirty(pte))
> -                       orig_pte = pte_mkdirty(orig_pte);
> -
> -               if (pte_young(pte))
> -                       orig_pte = pte_mkyoung(orig_pte);
> -       }
> -       return orig_pte;
> -}
> -
>  /*
>   * Changing some bits of contiguous entries requires us to follow a
>   * Break-Before-Make approach, breaking the whole contiguous set
> diff --git a/mm/contpte.c b/mm/contpte.c
> new file mode 100644
> index 000000000000..4e742cf00b6f
> --- /dev/null
> +++ b/mm/contpte.c
> @@ -0,0 +1,17 @@
> +pte_t contpte_huge_ptep_get(pte_t *ptep)
> +{
> +        int ncontig, i;
> +        size_t pgsize;
> +
> +        ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
> +        for (i = 0; i < ncontig; i++, ptep++) {
> +                pte_t pte = __ptep_get(ptep);
> +
> +                if (pte_dirty(pte))
> +                        orig_pte = pte_mkdirty(orig_pte);
> +
> +                if (pte_young(pte))
> +                        orig_pte = pte_mkyoung(orig_pte);
> +        }
> +        return orig_pte;
> +}

I guess your observation is that contpte_ and hugepte_ code looks similar so it
should be grouped? I think if we can get some actual reuse that might make sense,
but as implemented, this function is completely separate from
contpte_ptep_get(). I wonder if it's simpler just to have contpte.c for contpte_
and hugepte.c for hugepte_; then they can be included in the build independently
based on arch/core Kconfigs (e.g. CONFIG_HUGETLB_PAGE vs CONFIG_ARM64_CONTPTE).
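
For reference, the folding loop in the quoted function can be modelled in userspace as below. This is only a sketch under stated assumptions: `model_pte_t` and `model_huge_ptep_get()` are hypothetical stand-ins, the real pte format and accessors are arch-specific, and the run length is passed in by the caller rather than derived through num_contig_ptes(). Note that the quoted draft of contpte_huge_ptep_get() uses orig_pte without declaring it; the sketch reads the first entry explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified pte: only the bits the loop cares about. */
typedef struct {
	bool present;
	bool cont;
	bool dirty;
	bool young;
} model_pte_t;

/*
 * Model of the folding loop: start from the first entry and OR in the
 * dirty/young state of every entry in the contiguous run.
 */
static model_pte_t model_huge_ptep_get(const model_pte_t *ptep, size_t ncontig)
{
	model_pte_t orig_pte;
	size_t i;

	assert(ptep != NULL && ncontig > 0);
	orig_pte = ptep[0];

	/* Non-present or non-contiguous entries are returned unchanged. */
	if (!orig_pte.present || !orig_pte.cont)
		return orig_pte;

	for (i = 0; i < ncontig; i++) {
		if (ptep[i].dirty)
			orig_pte.dirty = true;
		if (ptep[i].young)
			orig_pte.young = true;
	}
	return orig_pte;
}
```

The model has no dependency on any contpte_ helper, which matches the observation that the function shares no code with contpte_ptep_get().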

> 
>>
>>> Perhaps it's better to keep huge_pte and contpte separate? Also, it
>>> only works on arm64 because we can get away with calling the lower-level pte
>>> functions even when the huge_pte is actually a contpmd/pmd/pud, because the
>>> format is the same. That might present challenges to other arches if the format
>>> is different?
>> Yes, but I think that if that happens, we could get away with it by
>> choosing the right function depending on the size of the mapping?
>>
>>>   - It might be easier to review if the arm64 stuff is first moved (without
>>> changes) then modified to make it suitable for riscv, then for riscv to be
>>> hooked up. At the moment I'm trying to follow all 3 parts per-function.
>> Ok, let me give it a try during your paternity leave!

Review would certainly be easier with this approach!

Thanks,
Ryan

>>
>>> Thanks,
>>> Ryan
>> Thanks,
>>
>> Alex
>>
>>>
>>>> This patchset was tested by running the libhugetlbfs testsuite with 64KB
>>>> and 2MB pages on both architectures (on a 4KB base page size arm64 kernel).
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-arm-kernel/20240215103205.2607016-1-ryan.roberts@arm.com/
>>>>
>>>> Changes in v2:
>>>>    - Rebase on top of 6.9-rc3
>>>>
>>>> Alexandre Ghiti (9):
>>>>    riscv: Restore the pfn in a NAPOT pte when manipulated by core mm code
>>>>    riscv: Safely remove huge_pte_offset() when manipulating NAPOT ptes
>>>>    mm: Use common huge_ptep_get() function for riscv/arm64
>>>>    mm: Use common set_huge_pte_at() function for riscv/arm64
>>>>    mm: Use common huge_pte_clear() function for riscv/arm64
>>>>    mm: Use common huge_ptep_get_and_clear() function for riscv/arm64
>>>>    mm: Use common huge_ptep_set_access_flags() function for riscv/arm64
>>>>    mm: Use common huge_ptep_set_wrprotect() function for riscv/arm64
>>>>    mm: Use common huge_ptep_clear_flush() function for riscv/arm64
>>>>
>>>>   arch/arm64/Kconfig                  |   1 +
>>>>   arch/arm64/include/asm/pgtable.h    |  56 +++++-
>>>>   arch/arm64/mm/hugetlbpage.c         | 291 +---------------------------
>>>>   arch/riscv/Kconfig                  |   1 +
>>>>   arch/riscv/include/asm/hugetlb.h    |   2 +-
>>>>   arch/riscv/include/asm/pgtable-64.h |  11 ++
>>>>   arch/riscv/include/asm/pgtable.h    | 153 +++++++++++++--
>>>>   arch/riscv/mm/hugetlbpage.c         | 227 ----------------------
>>>>   arch/riscv/mm/pgtable.c             |   6 +-
>>>>   mm/Kconfig                          |   3 +
>>>>   mm/Makefile                         |   1 +
>>>>   mm/contpte.c                        | 272 ++++++++++++++++++++++++++
>>>>   12 files changed, 480 insertions(+), 544 deletions(-)
>>>>   create mode 100644 mm/contpte.c
>>>>
>> _______________________________________________
>> linux-riscv mailing list
>> linux-riscv@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-riscv
Alexandre Ghiti July 2, 2024, 7:51 a.m. UTC | #6
Hi Ryan,

On 24/06/2024 10:00, Ryan Roberts wrote:
> On 28/05/2024 09:07, Alexandre Ghiti wrote:
>> Hi Ryan,
>>
>> On 12/05/2024 19:25, Alexandre Ghiti wrote:
>>> Hi Ryan,
>>>
>>> On Fri, May 10, 2024 at 3:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>> On 08/05/2024 12:34, Alexandre Ghiti wrote:
>>>>> This patchset intends to merge the contiguous ptes hugetlbfs implementation
>>>>> of arm64 and riscv.
>>>>>
>>>>> Both arm64 and riscv support the use of contiguous ptes to map pages that
>>>>> are larger than the default page table size, respectively called contpte
>>>>> and svnapot.
>>>>>
>>>>> The riscv implementation differs from the arm64's in that the LSBs of the
>>>>> pfn of a svnapot pte are used to store the size of the mapping, allowing
>>>>> for future sizes to be added (for now only 64KB is supported). That's an
>>>>> issue for the core mm code which expects to find the *real* pfn a pte points
>>>>> to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
>>>>> and restores the size of the mapping when it is written to a page table.
>>>>>
>>>>> The following patches are just merges of the 2 different implementations
>>>>> that currently exist in arm64 and riscv which are very similar. It paves
>>>>> the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
>>>>> reimplementing the same in riscv.
>>>> Hi Alexandre,
>>>>
>>>> I've skimmed through this series and the one that moves contpte. I can see there
>>>> is definitely value in sharing the implementation, and the rough shape of things
>>>> seems appropriate. I had some minor concerns about making it harder to implement
>>>> potential future arm64 errata workarounds but on reflection, most of the
>>>> now-shared code is really just wrapping the primitives that are still
>>>> arch-specific.
>>>>
>>>> I'm going to need to spend proper time reviewing it to give detailed feedback,
>>>> but I'll be out on paternity leave for 3 weeks from end of Monday at the latest.
>>> Too bad, I expected to discuss that with you at LSF/MM...But congrats!
>>> Hope your wife is fine :)
>>>
>>>> So realistically I won't be able to do the detailed review until at least the
>>>> first week of June.
> Hi Alexandre,
>
> Sorry for the radio silence. I'm back at work now and have some cycles to review
> this. Did you ever post a new version based on the suggestions below?


Unfortunately no, other things happened that took all my attention, sorry.


>>>> Some high level thoughts:
>>>>
>>>>    - huge_ptep_* functions could be working on different sized huge ptes - arm64
>>>> supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c
>>>> appropriate?
>>> Hmm indeed, I'll see what I can do.
>>
>> So I took a look at that. It amounts to doing the same as what we do for THP
>> contptes, ie having both contpte-aware and "normal" APIs. Let's take for example
>> huge_ptep_get(), below is what I get. To me it's not that bad, so I'll implement
>> this unless there is strong opposition.
> I'm not sure I've understood what you are doing here... see below.
>
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index f8efbc128446..869a9aae6c68 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1715,6 +1715,16 @@ static inline void clear_young_dirty_ptes(struct
>> vm_area_struct *vma,
>>                  contpte_clear_young_dirty_ptes(vma, addr, ptep, nr, flags);
>>   }
>>
>> +static inline pte_t huge_ptep_get(pte_t *ptep)
>> +{
>> +        pte_t orig_pte = __ptep_get(ptep);
>> +
>> +        if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +                return orig_pte;
>> +
>> +        return contpte_huge_ptep_get(ptep);
> A "huge pte" is not the same as a "cont pte". A huge pte is an abstract thing,
> which may be of a number of different sizes; on arm64 with 4K base pages, 64K,
> 2M, 32M, 1G are supported. The 64K size is implemented using the PTE_CONT bit at
> PTE level. 2M is a single PMD level block, 32M uses PMD_CONT at PMD level and 1G
> is 1 PUD block. So I'm not sure it makes sense to tie this up with "contpte_"
> functions?
>
>> +}
>> +
>>   #else /* CONFIG_ARM64_CONTPTE */
>>
>>   #define ptep_get                               __ptep_get
>> @@ -1736,6 +1746,8 @@ static inline void clear_young_dirty_ptes(struct
>> vm_area_struct *vma,
>>   #define ptep_set_access_flags __ptep_set_access_flags
>>   #define clear_young_dirty_ptes __clear_young_dirty_ptes
>>
>> +#define huge_ptep_get                          __ptep_get
> I don't quite understand the logic here. huge ptes are needed for hugetlb so
> their definition needs to be tied to that, not to ARM64_CONTPTE, which is an
> independent feature.
>
>> +
>>   #endif /* CONFIG_ARM64_CONTPTE */
>>
>>   #endif /* !__ASSEMBLY__ */
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 3f09ac73cce3..aa0ee3f02226 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -127,28 +127,6 @@ static inline int num_contig_ptes(unsigned long size,
>> size_t *pgsize)
>>          return contig_ptes;
>>   }
>>
>> -pte_t huge_ptep_get(pte_t *ptep)
>> -{
>> -       int ncontig, i;
>> -       size_t pgsize;
>> -       pte_t orig_pte = __ptep_get(ptep);
>> -
>> -       if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> -               return orig_pte;
>> -
>> -       ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
>> -       for (i = 0; i < ncontig; i++, ptep++) {
>> -               pte_t pte = __ptep_get(ptep);
>> -
>> -               if (pte_dirty(pte))
>> -                       orig_pte = pte_mkdirty(orig_pte);
>> -
>> -               if (pte_young(pte))
>> -                       orig_pte = pte_mkyoung(orig_pte);
>> -       }
>> -       return orig_pte;
>> -}
>> -
>>   /*
>>    * Changing some bits of contiguous entries requires us to follow a
>>    * Break-Before-Make approach, breaking the whole contiguous set
>> diff --git a/mm/contpte.c b/mm/contpte.c
>> new file mode 100644
>> index 000000000000..4e742cf00b6f
>> --- /dev/null
>> +++ b/mm/contpte.c
>> @@ -0,0 +1,17 @@
>> +pte_t contpte_huge_ptep_get(pte_t *ptep)
>> +{
>> +        int ncontig, i;
>> +        size_t pgsize;
>> +
>> +        ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
>> +        for (i = 0; i < ncontig; i++, ptep++) {
>> +                pte_t pte = __ptep_get(ptep);
>> +
>> +                if (pte_dirty(pte))
>> +                        orig_pte = pte_mkdirty(orig_pte);
>> +
>> +                if (pte_young(pte))
>> +                        orig_pte = pte_mkyoung(orig_pte);
>> +        }
>> +        return orig_pte;
>> +}
> I guess your observation is that contpte_ and hugepte_ code looks similar so it
> should be grouped? I think if we can get some actual reuse that might make sense,
> but as implemented, this function is completely separate from
> contpte_ptep_get(). I wonder if it's simpler just to have contpte.c for contpte_
> and hugepte.c for hugepte_; then they can be included in the build independently
> based on arch/core Kconfigs (e.g. CONFIG_HUGETLB_PAGE vs CONFIG_ARM64_CONTPTE).


Yes, you're right, this was just rambling :)


>
>>>> Perhaps it's better to keep huge_pte and contpte separate? Also, it
>>>> only works on arm64 because we can get away with calling the lower-level pte
>>>> functions even when the huge_pte is actually a contpmd/pmd/pud, because the
>>>> format is the same. That might present challenges to other arches if the format
>>>> is different?
>>> Yes, but I think that if that happens, we could get away with it by
>>> choosing the right function depending on the size of the mapping?
>>>
>>>>    - It might be easier to review if the arm64 stuff is first moved (without
>>>> changes) then modified to make it suitable for riscv, then for riscv to be
>>>> hooked up. At the moment I'm trying to follow all 3 parts per-function.
>>> Ok, let me give it a try during your paternity leave!
> Review would certainly be easier with this approach!


I'll do my best to do that soon.

Hope everything went well for you.

Thanks,

Alex


>
> Thanks,
> Ryan
>
>>>> Thanks,
>>>> Ryan
>>> Thanks,
>>>
>>> Alex
>>>
>>>>> This patchset was tested by running the libhugetlbfs testsuite with 64KB
>>>>> and 2MB pages on both architectures (on a 4KB base page size arm64 kernel).
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-arm-kernel/20240215103205.2607016-1-ryan.roberts@arm.com/
>>>>>
>>>>> Changes in v2:
>>>>>     - Rebase on top of 6.9-rc3
>>>>>
>>>>> Alexandre Ghiti (9):
>>>>>     riscv: Restore the pfn in a NAPOT pte when manipulated by core mm code
>>>>>     riscv: Safely remove huge_pte_offset() when manipulating NAPOT ptes
>>>>>     mm: Use common huge_ptep_get() function for riscv/arm64
>>>>>     mm: Use common set_huge_pte_at() function for riscv/arm64
>>>>>     mm: Use common huge_pte_clear() function for riscv/arm64
>>>>>     mm: Use common huge_ptep_get_and_clear() function for riscv/arm64
>>>>>     mm: Use common huge_ptep_set_access_flags() function for riscv/arm64
>>>>>     mm: Use common huge_ptep_set_wrprotect() function for riscv/arm64
>>>>>     mm: Use common huge_ptep_clear_flush() function for riscv/arm64
>>>>>
>>>>>    arch/arm64/Kconfig                  |   1 +
>>>>>    arch/arm64/include/asm/pgtable.h    |  56 +++++-
>>>>>    arch/arm64/mm/hugetlbpage.c         | 291 +---------------------------
>>>>>    arch/riscv/Kconfig                  |   1 +
>>>>>    arch/riscv/include/asm/hugetlb.h    |   2 +-
>>>>>    arch/riscv/include/asm/pgtable-64.h |  11 ++
>>>>>    arch/riscv/include/asm/pgtable.h    | 153 +++++++++++++--
>>>>>    arch/riscv/mm/hugetlbpage.c         | 227 ----------------------
>>>>>    arch/riscv/mm/pgtable.c             |   6 +-
>>>>>    mm/Kconfig                          |   3 +
>>>>>    mm/Makefile                         |   1 +
>>>>>    mm/contpte.c                        | 272 ++++++++++++++++++++++++++
>>>>>    12 files changed, 480 insertions(+), 544 deletions(-)
>>>>>    create mode 100644 mm/contpte.c
>>>>>
Alexandre Ghiti Aug. 1, 2024, 8:57 a.m. UTC | #7
Hi Ryan,

On Tue, Jul 2, 2024 at 9:51 AM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Hi Ryan,
>
> On 24/06/2024 10:00, Ryan Roberts wrote:
> > On 28/05/2024 09:07, Alexandre Ghiti wrote:
> >> Hi Ryan,
> >>
> >> On 12/05/2024 19:25, Alexandre Ghiti wrote:
> >>> Hi Ryan,
> >>>
> >>> On Fri, May 10, 2024 at 3:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>> On 08/05/2024 12:34, Alexandre Ghiti wrote:
> >>>>> This patchset intends to merge the contiguous ptes hugetlbfs implementation
> >>>>> of arm64 and riscv.
> >>>>>
> >>>>> Both arm64 and riscv support the use of contiguous ptes to map pages that
> >>>>> are larger than the default page table size, respectively called contpte
> >>>>> and svnapot.
> >>>>>
> >>>>> The riscv implementation differs from the arm64's in that the LSBs of the
> >>>>> pfn of a svnapot pte are used to store the size of the mapping, allowing
> >>>>> for future sizes to be added (for now only 64KB is supported). That's an
> >>>>> issue for the core mm code which expects to find the *real* pfn a pte points
> >>>>> to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
> >>>>> and restores the size of the mapping when it is written to a page table.
> >>>>>
> >>>>> The following patches are just merges of the 2 different implementations
> >>>>> that currently exist in arm64 and riscv which are very similar. It paves
> >>>>> the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
> >>>>> reimplementing the same in riscv.
> >>>> Hi Alexandre,
> >>>>
> >>>> I've skimmed through this series and the one that moves contpte. I can see there
> >>>> is definitely value in sharing the implementation, and the rough shape of things
> >>>> seems appropriate. I had some minor concerns about making it harder to implement
> >>>> potential future arm64 errata workarounds but on reflection, most of the
> >>>> now-shared code is really just wrapping the primitives that are still
> >>>> arch-specific.
> >>>>
> >>>> I'm going to need to spend proper time reviewing it to give detailed feedback,
> >>>> but I'll be out on paternity leave for 3 weeks from end of Monday at the latest.
> >>> Too bad, I expected to discuss that with you at LSF/MM...But congrats!
> >>> Hope your wife is fine :)
> >>>
> >>>> So realistically I won't be able to do the detailed review until at least the
> >>>> first week of June.
> > Hi Alexandre,
> >
> > Sorry for the radio silence. I'm back at work now and have some cycles to review
> > this. Did you ever post a new version based on the suggestions below?
>
>
> Unfortunately no, other things happened that took all my attention, sorry.
>
>
> >>>> Some high level thoughts:
> >>>>
> >>>>    - huge_ptep_* functions could be working on different sized huge ptes - arm64
> >>>> supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c
> >>>> appropriate?
> >>> Hmm indeed, I'll see what I can do.
> >>
> >> So I took a look at that. It amounts to doing the same as what we do for THP
> >> contptes, ie having both contpte-aware and "normal" APIs. Let's take for example
> >> huge_ptep_get(), below is what I get. To me it's not that bad, so I'll implement
> >> this unless there is strong opposition.
> > I'm not sure I've understood what you are doing here... see below.
> >
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index f8efbc128446..869a9aae6c68 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -1715,6 +1715,16 @@ static inline void clear_young_dirty_ptes(struct
> >> vm_area_struct *vma,
> >>                  contpte_clear_young_dirty_ptes(vma, addr, ptep, nr, flags);
> >>   }
> >>
> >> +static inline pte_t huge_ptep_get(pte_t *ptep)
> >> +{
> >> +        pte_t orig_pte = __ptep_get(ptep);
> >> +
> >> +        if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> >> +                return orig_pte;
> >> +
> >> +        return contpte_huge_ptep_get(ptep);
> > A "huge pte" is not the same as a "cont pte". A huge pte is an abstract thing,
> > which may be of a number of different sizes; on arm64 with 4K base pages, 64K,
> > 2M, 32M, 1G are supported. The 64K size is implemented using the PTE_CONT bit at
> > PTE level. 2M is a single PMD level block, 32M uses PMD_CONT at PMD level and 1G
> > is 1 PUD block. So I'm not sure it makes sense to tie this up with "contpte_"
> > functions?
> >
> >> +}
> >> +
> >>   #else /* CONFIG_ARM64_CONTPTE */
> >>
> >>   #define ptep_get                               __ptep_get
> >> @@ -1736,6 +1746,8 @@ static inline void clear_young_dirty_ptes(struct
> >> vm_area_struct *vma,
> >>   #define ptep_set_access_flags __ptep_set_access_flags
> >>   #define clear_young_dirty_ptes __clear_young_dirty_ptes
> >>
> >> +#define huge_ptep_get                          __ptep_get
> > I don't quite understand the logic here. huge ptes are needed for hugetlb so
> > their definition needs to be tied to that, not to ARM64_CONTPTE, which is an
> > independent feature.
> >
> >> +
> >>   #endif /* CONFIG_ARM64_CONTPTE */
> >>
> >>   #endif /* !__ASSEMBLY__ */
> >> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> >> index 3f09ac73cce3..aa0ee3f02226 100644
> >> --- a/arch/arm64/mm/hugetlbpage.c
> >> +++ b/arch/arm64/mm/hugetlbpage.c
> >> @@ -127,28 +127,6 @@ static inline int num_contig_ptes(unsigned long size,
> >> size_t *pgsize)
> >>          return contig_ptes;
> >>   }
> >>
> >> -pte_t huge_ptep_get(pte_t *ptep)
> >> -{
> >> -       int ncontig, i;
> >> -       size_t pgsize;
> >> -       pte_t orig_pte = __ptep_get(ptep);
> >> -
> >> -       if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> >> -               return orig_pte;
> >> -
> >> -       ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
> >> -       for (i = 0; i < ncontig; i++, ptep++) {
> >> -               pte_t pte = __ptep_get(ptep);
> >> -
> >> -               if (pte_dirty(pte))
> >> -                       orig_pte = pte_mkdirty(orig_pte);
> >> -
> >> -               if (pte_young(pte))
> >> -                       orig_pte = pte_mkyoung(orig_pte);
> >> -       }
> >> -       return orig_pte;
> >> -}
> >> -
> >>   /*
> >>    * Changing some bits of contiguous entries requires us to follow a
> >>    * Break-Before-Make approach, breaking the whole contiguous set
> >> diff --git a/mm/contpte.c b/mm/contpte.c
> >> new file mode 100644
> >> index 000000000000..4e742cf00b6f
> >> --- /dev/null
> >> +++ b/mm/contpte.c
> >> @@ -0,0 +1,17 @@
> >> +pte_t contpte_huge_ptep_get(pte_t *ptep)
> >> +{
> >> +        int ncontig, i;
> >> +        size_t pgsize;
> >> +
> >> +        ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
> >> +        for (i = 0; i < ncontig; i++, ptep++) {
> >> +                pte_t pte = __ptep_get(ptep);
> >> +
> >> +                if (pte_dirty(pte))
> >> +                        orig_pte = pte_mkdirty(orig_pte);
> >> +
> >> +                if (pte_young(pte))
> >> +                        orig_pte = pte_mkyoung(orig_pte);
> >> +        }
> >> +        return orig_pte;
> >> +}
> > I guess your observation is that contpte_ and hugepte_ code looks similar so it
> > should be grouped? I think if we can get some actual reuse that might make sense,
> > but as implemented, this function is completely separate from
> > contpte_ptep_get(). I wonder if it's simpler just to have contpte.c for contpte_
> > and hugepte.c for hugepte_; then they can be included in the build independently
> > based on arch/core Kconfigs (e.g. CONFIG_HUGETLB_PAGE vs CONFIG_ARM64_CONTPTE).
>
>
> Yes, you're right, this was just rambling :)
>
>
> >
> >>>> Perhaps it's better to keep huge_pte and contpte separate? Also, it
> >>>> only works on arm64 because we can get away with calling the lower-level pte
> >>>> functions even when the huge_pte is actually a contpmd/pmd/pud, because the
> >>>> format is the same. That might present challenges to other arches if the format
> >>>> is different?
> >>> Yes, but I think that if that happens, we could get away with it by
> >>> choosing the right function depending on the size of the mapping?
> >>>
> >>>>    - It might be easier to review if the arm64 stuff is first moved (without
> >>>> changes) then modified to make it suitable for riscv, then for riscv to be
> >>>> hooked up. At the moment I'm trying to follow all 3 parts per-function.
> >>> Ok, let me give it a try during your paternity leave!
> > Review would certainly be easier with this approach!
>
>
> I'll do my best to do that soon.

So I finally have time to rework this, and I tried moving all the
arm64 functions first and then adapting them to riscv. But I can't
modify one function at a time because the riscv build fails: the
not-yet-adapted functions are still arm64-specific. So I'd have to come up
with one big monolithic patch that changes everything at once, which I
think would not help the review process.

So I'll keep the same format and should be back soon with the new version.

Thanks,

Alex

>
> Hope everything went well for you.
>
> Thanks,
>
> Alex
>
>
> >
> > Thanks,
> > Ryan
> >
> >>>> Thanks,
> >>>> Ryan
> >>> Thanks,
> >>>
> >>> Alex
> >>>
> >>>>> This patchset was tested by running the libhugetlbfs testsuite with 64KB
> >>>>> and 2MB pages on both architectures (on a 4KB base page size arm64 kernel).
> >>>>>
> >>>>> [1]
> >>>>> https://lore.kernel.org/linux-arm-kernel/20240215103205.2607016-1-ryan.roberts@arm.com/
> >>>>>
> >>>>> Changes in v2:
> >>>>>     - Rebase on top of 6.9-rc3
> >>>>>
> >>>>> Alexandre Ghiti (9):
> >>>>>     riscv: Restore the pfn in a NAPOT pte when manipulated by core mm code
> >>>>>     riscv: Safely remove huge_pte_offset() when manipulating NAPOT ptes
> >>>>>     mm: Use common huge_ptep_get() function for riscv/arm64
> >>>>>     mm: Use common set_huge_pte_at() function for riscv/arm64
> >>>>>     mm: Use common huge_pte_clear() function for riscv/arm64
> >>>>>     mm: Use common huge_ptep_get_and_clear() function for riscv/arm64
> >>>>>     mm: Use common huge_ptep_set_access_flags() function for riscv/arm64
> >>>>>     mm: Use common huge_ptep_set_wrprotect() function for riscv/arm64
> >>>>>     mm: Use common huge_ptep_clear_flush() function for riscv/arm64
> >>>>>
> >>>>>    arch/arm64/Kconfig                  |   1 +
> >>>>>    arch/arm64/include/asm/pgtable.h    |  56 +++++-
> >>>>>    arch/arm64/mm/hugetlbpage.c         | 291 +---------------------------
> >>>>>    arch/riscv/Kconfig                  |   1 +
> >>>>>    arch/riscv/include/asm/hugetlb.h    |   2 +-
> >>>>>    arch/riscv/include/asm/pgtable-64.h |  11 ++
> >>>>>    arch/riscv/include/asm/pgtable.h    | 153 +++++++++++++--
> >>>>>    arch/riscv/mm/hugetlbpage.c         | 227 ----------------------
> >>>>>    arch/riscv/mm/pgtable.c             |   6 +-
> >>>>>    mm/Kconfig                          |   3 +
> >>>>>    mm/Makefile                         |   1 +
> >>>>>    mm/contpte.c                        | 272 ++++++++++++++++++++++++++
> >>>>>    12 files changed, 480 insertions(+), 544 deletions(-)
> >>>>>    create mode 100644 mm/contpte.c
> >>>>>