
[v4,2/2] arm64: mm: hugetlb: Enable HUGETLB_PAGE_FREE_VMEMMAP for arm64

Message ID 20220331065640.5777-2-songmuchun@bytedance.com (mailing list archive)
State New, archived
Series [v4,1/2] mm: hugetlb_vmemmap: introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP

Commit Message

Muchun Song March 31, 2022, 6:56 a.m. UTC
The feature of minimizing the overhead of struct page associated with each
HugeTLB page aims to free its vmemmap pages (used as struct page) to
save memory, which amounts to ~14GB/16GB per 1TB of HugeTLB pages (2MB/1GB page size).
In short, when a HugeTLB page is allocated or freed, the vmemmap array
representing the range associated with the page will need to be remapped.
When a page is allocated, vmemmap pages are freed after remapping.
When a page is freed, previously discarded vmemmap pages must be
allocated before remapping.  More implementations and details can be
found here [1].

The infrastructure for freeing vmemmap pages associated with each HugeTLB
page is already there, so we can easily enable HUGETLB_PAGE_FREE_VMEMMAP
for arm64; the only thing to be fixed is flush_dcache_page().

flush_dcache_page() needs to be adapted to operate on the head page's
flags, since the tail vmemmap pages are mapped read-only after the
feature is enabled (so clearing a flag on a tail page is not permitted).

There were some discussions about this in the thread [2], but no
conclusion was reached.  I have copied the concerns raised by Anshuman
here and explain why they are unfounded.  It is safe to enable this for
arm64, just as it is for x86_64.

1st concern:
'''
But what happens when a hot remove section's vmemmap area (which is
being teared down) is nearby another vmemmap area which is either created
or being destroyed for HugeTLB alloc/free purpose. As you mentioned
HugeTLB pages inside the hot remove section might be safe. But what about
other HugeTLB areas whose vmemmap area shares page table entries with
vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
/use/free test cycle using memory just adjacent to a memory hotplug area,
which is always added and removed periodically, should be able to expose
this problem.
'''

Answer: At the time memory is removed, all HugeTLB pages have either been
migrated away or dissolved, so there is no race between memory hot remove
and free_huge_page_vmemmap().  Therefore, HugeTLB pages inside the hot-removed
section are safe.  As for the question "what about other
HugeTLB areas whose vmemmap area shares page table entries with vmemmap
entries for a section being hot removed ?", that situation cannot
arise.  The minimal granularity of hotplug memory is 128MB (on
arm64 with 4K base pages).  Any HugeTLB page smaller than 128MB lies within
a single section, so no PTE page tables are shared between HugeTLB pages in
this section and ones in other sections, and a HugeTLB page cannot cross
two sections.  In this case, the section cannot be freed.  Any HugeTLB page
bigger than 128MB (the section size) has vmemmap pages that are an integer
multiple of 2MB (PMD-mapped).  As long as:

  1) HugeTLBs are naturally aligned, power-of-two sizes
  2) The HugeTLB size >= the section size
  3) The HugeTLB size >= the vmemmap leaf mapping size

Then a HugeTLB will not share any leaf page table entries with *anything
else*, but will share intermediate entries.  In this case, at the time memory
is removed, all HugeTLB pages either have been migrated away or dissolved.
So there is also no race between memory hot remove and
free_huge_page_vmemmap().

2nd concern:
'''
differently, not sure if ptdump would require any synchronization.

Dumping an wrong value is probably okay but crashing because a page table
entry is being freed after ptdump acquired the pointer is bad. On arm64,
ptdump() is protected against hotremove via [get|put]_online_mems().
'''

Answer: The ptdump should be fine, since vmemmap_remap_free() only exchanges
PTEs or splits a PMD entry (which means allocating a PTE page table).  Neither
operation frees any page tables (PTE), so ptdump cannot run into a
UAF on any page tables.  The worst case is just dumping a wrong value.

[1] https://lore.kernel.org/all/20210510030027.56044-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/all/20210518091826.36937-1-songmuchun@bytedance.com/

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
v4:
 - Introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP (implemented in the previous
   patch) to enable this feature for arm64.

v3:
 - Rework patch's subject.
 - Clarify in the commit log that the HUGETLB_PAGE_FREE_VMEMMAP infrastructure
   is already there in the current code and can easily be enabled for arm64.
 - Add hugetlb_free_vmemmap_enabled() check into flush_dcache_page().

 Thanks for Barry's suggestions.

v2:
 - Update commit message (Mark Rutland).
 - Fix flush_dcache_page().

 arch/arm64/Kconfig    |  1 +
 arch/arm64/mm/flush.c | 13 +++++++++++++
 2 files changed, 14 insertions(+)

Comments

Barry Song March 31, 2022, 10:31 p.m. UTC | #1
On Thu, Mar 31, 2022 at 7:57 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> The feature of minimizing overhead of struct page associated with each
> HugeTLB page aims to free its vmemmap pages (used as struct page) to
> save memory, where is ~14GB/16GB per 1TB HugeTLB pages (2MB/1GB type).
> In short, when a HugeTLB page is allocated or freed, the vmemmap array
> representing the range associated with the page will need to be remapped.
> When a page is allocated, vmemmap pages are freed after remapping.
> When a page is freed, previously discarded vmemmap pages must be
> allocated before remapping.  More implementations and details can be
> found here [1].
>
> The infrastructure of freeing vmemmap pages associated with each HugeTLB
> page is already there, we can easily enable HUGETLB_PAGE_FREE_VMEMMAP
> for arm64, the only thing to be fixed is flush_dcache_page() .
>
> flush_dcache_page() need to be adapted to operate on the head page's
> flags since the tail vmemmap pages are mapped with read-only after the
> feature is enabled (clear operation is not permitted).
>
> There was some discussions about this in the thread [2], but there was
> no conclusion in the end.  And I copied the concern proposed by Anshuman
> to here and explain why those concern is superfluous.  It is safe to
> enable it for x86_64 as well as arm64.
>
> 1st concern:
> '''
> But what happens when a hot remove section's vmemmap area (which is
> being teared down) is nearby another vmemmap area which is either created
> or being destroyed for HugeTLB alloc/free purpose. As you mentioned
> HugeTLB pages inside the hot remove section might be safe. But what about
> other HugeTLB areas whose vmemmap area shares page table entries with
> vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
> /use/free test cycle using memory just adjacent to a memory hotplug area,
> which is always added and removed periodically, should be able to expose
> this problem.
> '''
>
> Answer: At the time memory is removed, all HugeTLB pages either have been
> migrated away or dissolved.  So there is no race between memory hot remove
> and free_huge_page_vmemmap().  Therefore, HugeTLB pages inside the hot
> remove section is safe.  Let's talk your question "what about other
> HugeTLB areas whose vmemmap area shares page table entries with vmemmap
> entries for a section being hot removed ?", the question is not
> established.  The minimal granularity size of hotplug memory 128MB (on
> arm64, 4k base page), any HugeTLB smaller than 128MB is within a section,
> then, there is no share PTE page tables between HugeTLB in this section
> and ones in other sections and a HugeTLB page could not cross two
> sections.  In this case, the section cannot be freed.  Any HugeTLB bigger
> than 128MB (section size) whose vmemmap pages is an integer multiple of
> 2MB (PMD-mapped).  As long as:
>
>   1) HugeTLBs are naturally aligned, power-of-two sizes
>   2) The HugeTLB size >= the section size
>   3) The HugeTLB size >= the vmemmap leaf mapping size
>
> Then a HugeTLB will not share any leaf page table entries with *anything
> else*, but will share intermediate entries.  In this case, at the time memory
> is removed, all HugeTLB pages either have been migrated away or dissolved.
> So there is also no race between memory hot remove and
> free_huge_page_vmemmap().
>
> 2nd concern:
> '''
> differently, not sure if ptdump would require any synchronization.
>
> Dumping an wrong value is probably okay but crashing because a page table
> entry is being freed after ptdump acquired the pointer is bad. On arm64,
> ptdump() is protected against hotremove via [get|put]_online_mems().
> '''
>
> Answer: The ptdump should be fine since vmemmap_remap_free() only exchanges
> PTEs or splits the PMD entry (which means allocating a PTE page table).  Both
> operations do not free any page tables (PTE), so ptdump cannot run into a
> UAF on any page tables.  The worst case is just dumping an wrong value.
>
> [1] https://lore.kernel.org/all/20210510030027.56044-1-songmuchun@bytedance.com/
> [2] https://lore.kernel.org/all/20210518091826.36937-1-songmuchun@bytedance.com/
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Reviewed-by: Barry Song <baohua@kernel.org>

I ran some testing on this patchset.  On a machine with 4GB of memory,
I set up 2GB of hugepages with "hugepagesz=2m hugepages=1024" and saw:

Before the patch,
/sys# cat kernel/debug/kernel_page_tables
---[ vmemmap start ]---
0xfffffc0000000000-0xfffffc1180000000          70G PUD
0xfffffc1180000000-0xfffffc1181000000          16M PMD
0xfffffc1181000000-0xfffffc1185000000          64M PMD       RW NX SHD
AF NG     BLK UXN    MEM/NORMAL
0xfffffc1185000000-0xfffffc11c0000000         944M PMD
0xfffffc11c0000000-0xfffffc8000000000         441G PUD
0xfffffc8000000000-0xfffffe0000000000        1536G PGD
---[ vmemmap end ]---

After the patch:
---[ vmemmap start ]---
...
0xfffffc27e8090000-0xfffffc27e8091000           4K PTE       RW NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e8091000-0xfffffc27e8098000          28K PTE       ro NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e8098000-0xfffffc27e8099000           4K PTE       RW NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e8099000-0xfffffc27e80a0000          28K PTE       ro NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e80a0000-0xfffffc27e80a1000           4K PTE       RW NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e80a1000-0xfffffc27e80a8000          28K PTE       ro NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e80a8000-0xfffffc27e80a9000           4K PTE       RW NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e80a9000-0xfffffc27e80b0000          28K PTE       ro NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e80b0000-0xfffffc27e80b1000           4K PTE       RW NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e80b1000-0xfffffc27e80b8000          28K PTE       ro NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e80b8000-0xfffffc27e80b9000           4K PTE       RW NX SHD
AF NG         UXN    MEM/NORMAL
0xfffffc27e80b9000-0xfffffc27e80c0000          28K PTE       ro NX SHD
AF NG         UXN    MEM/NORMAL
...

So it works as expected: we see 7 read-only mappings after each 1 RW mapping.

Then I tried to check whether the patch would break 64KB hugepages by setting
"hugepagesz=64k hugepages=32768", and I got:

---[ vmemmap start ]---
0xfffffc0000000000-0xfffffd8000000000        1536G PGD
0xfffffd8000000000-0xfffffd82c0000000          11G PUD
0xfffffd82c0000000-0xfffffd82c3000000          48M PMD
0xfffffd82c3000000-0xfffffd82c7000000          64M PMD       RW NX SHD
AF NG     BLK UXN    MEM/NORMAL
0xfffffd82c7000000-0xfffffd8300000000         912M PMD
0xfffffd8300000000-0xfffffe0000000000         500G PUD
---[ vmemmap end ]---

Obviously it doesn't break this corner case, in which we don't need vmemmap freeing.

> ---
> v4:
>  - Introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP (implemented in the previous
>    patch) to enable this feature for arm64.
>
> v3:
>  - Rework patch's subject.
>  - Clarify the feature of HUGETLB_PAGE_FREE_VMEMMAP is already there in the
>    current code and easyly be enabled for arm64 into commit log.
>  - Add hugetlb_free_vmemmap_enabled() check into flush_dcache_page().
>
>  Thanks for Barry's suggestions.
>
> v2:
>  - Update commit message (Mark Rutland).
>  - Fix flush_dcache_page().
>
>  arch/arm64/Kconfig    |  1 +
>  arch/arm64/mm/flush.c | 13 +++++++++++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index c842878f8133..37f72e3a75d0 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -94,6 +94,7 @@ config ARM64
>         select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>         select ARCH_WANT_FRAME_POINTERS
>         select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> +       select ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP
>         select ARCH_WANT_LD_ORPHAN_WARN
>         select ARCH_WANTS_NO_INSTR
>         select ARCH_HAS_UBSAN_SANITIZE_ALL
> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> index 2aaf950b906c..c67c1ca856c2 100644
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -68,6 +68,19 @@ EXPORT_SYMBOL_GPL(__sync_icache_dcache);
>   */
>  void flush_dcache_page(struct page *page)
>  {
> +       /*
> +        * Only the head page's flags of a HugeTLB page can be cleared since
> +        * the tail vmemmap pages associated with each HugeTLB page are
> +        * mapped read-only when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled
> +        * (see vmemmap_remap_pte() for more details).  Although
> +        * __sync_icache_dcache() only sets the PG_dcache_clean flag on the
> +        * head page struct, some tail page structs can still see the flag
> +        * set, since the head vmemmap page frame is reused (see the comments
> +        * above page_fixed_fake_head() for more details).
> +        */
> +       if (hugetlb_free_vmemmap_enabled() && PageHuge(page))
> +               page = compound_head(page);
> +
>         if (test_bit(PG_dcache_clean, &page->flags))
>                 clear_bit(PG_dcache_clean, &page->flags);
>  }
> --
> 2.11.0
>

Thanks
Barry
Anshuman Khandual April 4, 2022, 9:26 a.m. UTC | #2
Hello Muchun,

On 3/31/22 12:26, Muchun Song wrote:
> The feature of minimizing overhead of struct page associated with each
> HugeTLB page aims to free its vmemmap pages (used as struct page) to
> save memory, where is ~14GB/16GB per 1TB HugeTLB pages (2MB/1GB type).

Enabling this feature saves us around 1.4/1.6 % of memory, but looking at it
the other way around: could the unavailability of vmemmap backing pages
(~1.4GB) when freeing a corresponding HugeTLB page prevent ~1TB of memory
from being used in normal page form (requiring its own struct pages), thus
forcing the HugeTLB pages to remain as such?  Isn't this problematic?

This additional 1TB of memory in normal pages, from HugeTLB dissolution,
could have eased the system's memory pressure without this feature being
enabled.

- Anshuman
Muchun Song April 4, 2022, 12:01 p.m. UTC | #3
On Mon, Apr 4, 2022 at 5:25 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> Hello Muchun,
>
> On 3/31/22 12:26, Muchun Song wrote:
> > The feature of minimizing overhead of struct page associated with each
> > HugeTLB page aims to free its vmemmap pages (used as struct page) to
> > save memory, where is ~14GB/16GB per 1TB HugeTLB pages (2MB/1GB type).
>
> Enabling this feature saves us around 1.4/1.6 % memory but looking from
> other way around, unavailability of vmemmap backing pages (~1.4GB) when
> freeing up a corresponding HugeTLB page, could prevent ~1TB memory from
> being used as normal page form (requiring their own struct pages), thus
> forcing the HugeTLB page to remain as such ? Is not this problematic ?
>
> These additional 1TB memory in normal pages, from a HugeTLB dissolution
> could have eased the system's memory pressure without this feature being
> enabled.

You are right.  If the system is already under heavy memory pressure, it could
prevent the user from freeing HugeTLB pages back to the buddy allocator.  If
the HugeTLB pages are allocated from a non-movable zone, this scenario may not
be problematic: once a HugeTLB page is freed, the system has memory that can
be allocated as vmemmap pages, so subsequent freeing of HugeTLB pages gets
easier.  However, if the HugeTLB pages are allocated from the movable zone,
then things become terrible, which is documented in
Documentation/admin-guide/mm/memory-hotplug.rst.

So there is a cmdline parameter "hugetlb_free_vmemmap" to control whether this
feature is enabled.  Users should enable/disable it depending on their workload.

Thanks.
Anshuman Khandual April 5, 2022, 3:34 a.m. UTC | #4
On 4/4/22 17:31, Muchun Song wrote:
> On Mon, Apr 4, 2022 at 5:25 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>> Hello Muchun,
>>
>> On 3/31/22 12:26, Muchun Song wrote:
>>> The feature of minimizing overhead of struct page associated with each
>>> HugeTLB page aims to free its vmemmap pages (used as struct page) to
>>> save memory, where is ~14GB/16GB per 1TB HugeTLB pages (2MB/1GB type).
>>
>> Enabling this feature saves us around 1.4/1.6 % memory but looking from
>> other way around, unavailability of vmemmap backing pages (~1.4GB) when
>> freeing up a corresponding HugeTLB page, could prevent ~1TB memory from
>> being used as normal page form (requiring their own struct pages), thus
>> forcing the HugeTLB page to remain as such ? Is not this problematic ?
>>
>> These additional 1TB memory in normal pages, from a HugeTLB dissolution
>> could have eased the system's memory pressure without this feature being
>> enabled.
> 
> You are right. If the system is already under heavy memory pressure, it could
> prevent the user from freeing HugeTLB pages to the buddy allocator. If the
> HugeTLB page are allocated from non-movable zone, this scenario may be
> not problematic since once a HugeTLB page is freed, then the system will

But how can even the first HugeTLB page be freed when the vmemmap allocation
it needs is throttled due to lack of sufficient memory?

> have memory to be allocated to be used as vmemmap pages, subsequent
> freeing of HugeTLB pages may be getting easier.  However, if the HUgeTLB
> pages are allocated from the movable zone, then the thing becomes terrible,
> which is documented in Documentation/admin-guide/mm/memory-hotplug.rst.
> 
> So there is a cmdline "hugetlb_free_vmemmap" to control if enabling this
> feature.  The user should enable/disable this depending on their workload.

Should there also be a sysfs interface for this knob?  The system's usage
might change along the way, without requiring a reboot.
Muchun Song April 5, 2022, 3:49 a.m. UTC | #5
On Tue, Apr 5, 2022 at 11:34 AM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 4/4/22 17:31, Muchun Song wrote:
> > On Mon, Apr 4, 2022 at 5:25 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >> Hello Muchun,
> >>
> >> On 3/31/22 12:26, Muchun Song wrote:
> >>> The feature of minimizing overhead of struct page associated with each
> >>> HugeTLB page aims to free its vmemmap pages (used as struct page) to
> >>> save memory, where is ~14GB/16GB per 1TB HugeTLB pages (2MB/1GB type).
> >>
> >> Enabling this feature saves us around 1.4/1.6 % memory but looking from
> >> other way around, unavailability of vmemmap backing pages (~1.4GB) when
> >> freeing up a corresponding HugeTLB page, could prevent ~1TB memory from
> >> being used as normal page form (requiring their own struct pages), thus
> >> forcing the HugeTLB page to remain as such ? Is not this problematic ?
> >>
> >> These additional 1TB memory in normal pages, from a HugeTLB dissolution
> >> could have eased the system's memory pressure without this feature being
> >> enabled.
> >
> > You are right. If the system is already under heavy memory pressure, it could
> > prevent the user from freeing HugeTLB pages to the buddy allocator. If the
> > HugeTLB page are allocated from non-movable zone, this scenario may be
> > not problematic since once a HugeTLB page is freed, then the system will
>
> But how can even the first HugeTLB page be freed without vmemmmap which is
> throttled due to lack of sufficient memory ?

It's unfortunate: we're deadlocked and will have to try again later :-(

>
> > have memory to be allocated to be used as vmemmap pages, subsequent
> > freeing of HugeTLB pages may be getting easier.  However, if the HUgeTLB
> > pages are allocated from the movable zone, then the thing becomes terrible,
> > which is documented in Documentation/admin-guide/mm/memory-hotplug.rst.
> >
> > So there is a cmdline "hugetlb_free_vmemmap" to control if enabling this
> > feature.  The user should enable/disable this depending on their workload.
>
> Should there also be a sysfs interface for this knob as well ? Perhaps the
> system usage might change on the way, without requiring a reboot.

Yep.  I'm working on this [1] and will cc you in the next version.

[1] https://lore.kernel.org/all/20220330153745.20465-1-songmuchun@bytedance.com/

Thanks.
Anshuman Khandual April 5, 2022, 4:45 a.m. UTC | #6
On 3/31/22 12:26, Muchun Song wrote:
> 1st concern:
> '''
> But what happens when a hot remove section's vmemmap area (which is
> being teared down) is nearby another vmemmap area which is either created
> or being destroyed for HugeTLB alloc/free purpose. As you mentioned
> HugeTLB pages inside the hot remove section might be safe. But what about
> other HugeTLB areas whose vmemmap area shares page table entries with
> vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
> /use/free test cycle using memory just adjacent to a memory hotplug area,
> which is always added and removed periodically, should be able to expose
> this problem.
> '''
> 
> Answer: At the time memory is removed, all HugeTLB pages either have been
> migrated away or dissolved.  So there is no race between memory hot remove
> and free_huge_page_vmemmap().  Therefore, HugeTLB pages inside the hot
> remove section is safe.  Let's talk your question "what about other

HugeTLB pages inside the memory range are safe, but the concern is about the
vmemmap mapping for the HugeTLB pages, which might share intermediate entries
with the vmemmap mapping for the memory range/section being removed.

> HugeTLB areas whose vmemmap area shares page table entries with vmemmap
> entries for a section being hot removed ?", the question is not

Right.

> established.  The minimal granularity size of hotplug memory 128MB (on
> arm64, 4k base page), any HugeTLB smaller than 128MB is within a section,
> then, there is no share PTE page tables between HugeTLB in this section

128MB is the hot-removable granularity, but its corresponding vmemmap
range is smaller, i.e. (128MB/4K) * sizeof(struct page).  Could a memory
section being hot removed (its vmemmap mapping being torn down) not
collide with a HugeTLB (in another section) vmemmap remap operation
while inside the vmemmap mapping areas of init_mm?

> and ones in other sections and a HugeTLB page could not cross two
> sections.  In this case, the section cannot be freed.  Any HugeTLB bigger

Right, they don't cross into two different sections.

> than 128MB (section size) whose vmemmap pages is an integer multiple of
> 2MB (PMD-mapped).  As long as:
> 
>   1) HugeTLBs are naturally aligned, power-of-two sizes
>   2) The HugeTLB size >= the section size
>   3) The HugeTLB size >= the vmemmap leaf mapping size
> 
> Then a HugeTLB will not share any leaf page table entries with *anything
> else*, but will share intermediate entries.  In this case, at the time memory
> is removed, all HugeTLB pages either have been migrated away or dissolved.
> So there is also no race between memory hot remove and
> free_huge_page_vmemmap().

If they just share intermediate entries, free_empty_tables() will not free
up page table pages, as there will be valid non-zero entries in them.  But
the problem here is not UAF, it's accessing wrong entries and crashing while
dereferencing the pointer.  Hence I am wondering whether any such scenario
is possible.

> 
> 2nd concern:
> '''
> differently, not sure if ptdump would require any synchronization.
> 
> Dumping an wrong value is probably okay but crashing because a page table
> entry is being freed after ptdump acquired the pointer is bad. On arm64,
> ptdump() is protected against hotremove via [get|put]_online_mems().
> '''
> 
> Answer: The ptdump should be fine since vmemmap_remap_free() only exchanges
> PTEs or splits the PMD entry (which means allocating a PTE page table).  Both
> operations do not free any page tables (PTE), so ptdump cannot run into a
> UAF on any page tables.  The worst case is just dumping an wrong value.

Okay, fair enough. ptdump() might be just okay.
Muchun Song April 5, 2022, 8:38 a.m. UTC | #7
On Tue, Apr 5, 2022 at 12:44 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 3/31/22 12:26, Muchun Song wrote:
> > 1st concern:
> > '''
> > But what happens when a hot remove section's vmemmap area (which is
> > being teared down) is nearby another vmemmap area which is either created
> > or being destroyed for HugeTLB alloc/free purpose. As you mentioned
> > HugeTLB pages inside the hot remove section might be safe. But what about
> > other HugeTLB areas whose vmemmap area shares page table entries with
> > vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
> > /use/free test cycle using memory just adjacent to a memory hotplug area,
> > which is always added and removed periodically, should be able to expose
> > this problem.
> > '''
> >
> > Answer: At the time memory is removed, all HugeTLB pages either have been
> > migrated away or dissolved.  So there is no race between memory hot remove
> > and free_huge_page_vmemmap().  Therefore, HugeTLB pages inside the hot
> > remove section is safe.  Let's talk your question "what about other
>
> HugeTLB pages inside the memory range is safe but concern is about the
> vmemmap mapping for the HugeTLB which might share intermediate entries
> with vmemmap mapping for the memory range/section being removed.

The shared page table levels could only be PMD, PUD and PGD; the PTE
page tables cannot be shared with other sections, and we only exchange
PTEs for the vmemmap mapping.

>
> > HugeTLB areas whose vmemmap area shares page table entries with vmemmap
> > entries for a section being hot removed ?", the question is not
>
> Right.
>
> > established.  The minimal granularity size of hotplug memory 128MB (on
> > arm64, 4k base page), any HugeTLB smaller than 128MB is within a section,
> > then, there is no share PTE page tables between HugeTLB in this section
>
> 128MB is the hot removable granularity but, its corresponding vmemmap
> range is smaller i.e (128MB/4K) * sizeof(struct page). Memory section
> getting hot removed (its vmemmap mapping being teared down) along with
> HugeTLB (on another section) vmemmap remap operation, could not collide
> while inside vmemmap mapping areas on init_mm ?

The boundary address of a section is aligned to 128MB, and its
corresponding vmemmap boundary address is aligned to 2MB, which is
mapped by a separate PTE page table (or a single PMD entry).
Different sections do not share the same PTEs, so there are no conflicts
between a hot-removed section and a vmemmap remapping in another section,
since we are operating on different PTEs.  Right?

>
> > and ones in other sections and a HugeTLB page could not cross two
> > sections.  In this case, the section cannot be freed.  Any HugeTLB bigger
>
> Right, they dont cross into two different sections.
>
> > than 128MB (section size) whose vmemmap pages is an integer multiple of
> > 2MB (PMD-mapped).  As long as:
> >
> >   1) HugeTLBs are naturally aligned, power-of-two sizes
> >   2) The HugeTLB size >= the section size
> >   3) The HugeTLB size >= the vmemmap leaf mapping size
> >
> > Then a HugeTLB will not share any leaf page table entries with *anything
> > else*, but will share intermediate entries.  In this case, at the time memory
> > is removed, all HugeTLB pages either have been migrated away or dissolved.
> > So there is also no race between memory hot remove and
> > free_huge_page_vmemmap().
>
> If they just share intermediate entries, free_empty_tables() will not free
> up page table pages, as there will be valid non-zero entries in them. But

Right.

> the problem here is not UAF, its accessing wrong entries and crashing while
> de-referncing the pointer. Hence I am wondering if no such scenario can be
> possible.
>

What are the wrong entries?  Do you mean the reused vmemmap page entries?
If so, I think free_empty_tables() will not cause any crash.  The hot remove
operation cannot reach those entries, since the addresses mapped by the
reused entries are not included in the range of the hot-removed section.

Thanks.
Anshuman Khandual April 11, 2022, 9:17 a.m. UTC | #8
On 4/5/22 14:08, Muchun Song wrote:
> On Tue, Apr 5, 2022 at 12:44 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 3/31/22 12:26, Muchun Song wrote:
>>> 1st concern:
>>> '''
>>> But what happens when a hot remove section's vmemmap area (which is
>>> being teared down) is nearby another vmemmap area which is either created
>>> or being destroyed for HugeTLB alloc/free purpose. As you mentioned
>>> HugeTLB pages inside the hot remove section might be safe. But what about
>>> other HugeTLB areas whose vmemmap area shares page table entries with
>>> vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
>>> /use/free test cycle using memory just adjacent to a memory hotplug area,
>>> which is always added and removed periodically, should be able to expose
>>> this problem.
>>> '''
>>>
>>> Answer: At the time memory is removed, all HugeTLB pages either have been
>>> migrated away or dissolved.  So there is no race between memory hot remove
>>> and free_huge_page_vmemmap().  Therefore, HugeTLB pages inside the hot
>>> remove section is safe.  Let's talk your question "what about other
>>
>> HugeTLB pages inside the memory range is safe but concern is about the
>> vmemmap mapping for the HugeTLB which might share intermediate entries
>> with vmemmap mapping for the memory range/section being removed.
> 
> The shared page table level only could be PMD, PUD and PGD, the PTE
> page table cannot be shared with other sections, and we only exchange
> PTEs for vmemmap mapping.

Right, the shared entries (if any) are not at the leaf level.

> 
>>
>>> HugeTLB areas whose vmemmap area shares page table entries with vmemmap
>>> entries for a section being hot removed ?", the question is not
>>
>> Right.
>>
>>> established.  The minimal granularity size of hotplug memory 128MB (on
>>> arm64, 4k base page), any HugeTLB smaller than 128MB is within a section,
>>> then, there is no share PTE page tables between HugeTLB in this section
>>
>> 128MB is the hot removable granularity but, its corresponding vmemmap
>> range is smaller i.e (128MB/4K) * sizeof(struct page). Memory section
>> getting hot removed (its vmemmap mapping being teared down) along with
>> HugeTLB (on another section) vmemmap remap operation, could not collide
>> while inside vmemmap mapping areas on init_mm ?
> 
> The boundary address of a section is aligned with 128MB and its
> corresponding vmemmap boundary address is aligned with 2MB
> which is mapped with a separated PTE page table (or a PMD entry).

Even if these PMD entries are split during HugeTLB remapping, they will not
conflict with another memory section being removed simultaneously.  Also,
no shared page table pages will be freed during a memory hot remove
operation, as the vmemmap remap does not delete any entries.

But I am just wondering whether, during the PMD split and the PTE page
table page addition, these PMD entries might be empty, even temporarily?

> Different sections do not share the same PTE page table, so there are no
> conflicts between a hot removed section and a vmemmap remap of another
> section since we are operating on different PTEs. Right?
> 
>>
>>> and ones in other sections, and a HugeTLB page cannot cross two
>>> sections.  In this case, the section cannot be freed.  Any HugeTLB bigger
>>
>> Right, they don't cross into two different sections.
>>
>>> than 128MB (section size) has vmemmap pages whose size is an integer
>>> multiple of 2MB (PMD-mapped).  As long as:
>>>
>>>   1) HugeTLBs are naturally aligned, power-of-two sizes
>>>   2) The HugeTLB size >= the section size
>>>   3) The HugeTLB size >= the vmemmap leaf mapping size
>>>
>>> Then a HugeTLB will not share any leaf page table entries with *anything
>>> else*, but will share intermediate entries.  In this case, at the time memory
>>> is removed, all HugeTLB pages either have been migrated away or dissolved.
>>> So there is also no race between memory hot remove and
>>> free_huge_page_vmemmap().
>>
>> If they just share intermediate entries, free_empty_tables() will not free
>> up page table pages, as there will be valid non-zero entries in them. But
> 
> Right.
> 
>> the problem here is not UAF, it's accessing wrong entries and crashing while
>> dereferencing the pointer. Hence I am wondering whether such a scenario is
>> possible.
>>
> 
> What are the wrong entries? Do you mean the reused vmemmap page entries?
> If so, I think free_empty_tables() will not cause any crash.  The hot remove
> operation couldn't reach those entries since the addresses mapped by the
> reused entries are not included in the range of the hot removed section.

Fair enough. I guess as long as the intermediate PMD entries cannot be empty
(even temporarily) during the split for HugeTLB vmemmap remap, this should be fine.
Anshuman Khandual April 11, 2022, 10:12 a.m. UTC | #9
On 3/31/22 12:26, Muchun Song wrote:
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -68,6 +68,19 @@ EXPORT_SYMBOL_GPL(__sync_icache_dcache);
>   */
>  void flush_dcache_page(struct page *page)
>  {
> +	/*
> +	 * Only the head page's flags of HugeTLB can be cleared since the tail
> +	 * vmemmap pages associated with each HugeTLB page are mapped with
> +	 * read-only when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled (more
> +	 * details can refer to vmemmap_remap_pte()).  Although

Did you see a real permission fault when flush_dcache_page() accessed remapped
tail pages with a read-only vmemmap ? Or is this change from code inspection ?

> +	 * __sync_icache_dcache() only set PG_dcache_clean flag on the head
> +	 * page struct, some tail page structs still can be seen the flag is

Sentence here needs restructuring ....               ^^^^^^^^^^^^^^^^^^
 
> +	 * set since the head vmemmap page frame is reused (more details can
> +	 * refer to the comments above page_fixed_fake_head()).
> +	 */
> +	if (hugetlb_free_vmemmap_enabled() && PageHuge(page))
> +		page = compound_head(page);

This should also be applicable to any other platform with both configs enabled,
i.e. ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE and ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP ?
If yes, then how do we ensure that the platforms change flush_dcache_page() before
subscribing to ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP ?

> +
>  	if (test_bit(PG_dcache_clean, &page->flags))
>  		clear_bit(PG_dcache_clean, &page->flags);
>  }
Muchun Song April 11, 2022, 10:40 a.m. UTC | #10
On Mon, Apr 11, 2022 at 02:47:26PM +0530, Anshuman Khandual wrote:
> 
> 
> On 4/5/22 14:08, Muchun Song wrote:
> > On Tue, Apr 5, 2022 at 12:44 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >>
> >>
> >> On 3/31/22 12:26, Muchun Song wrote:
> >>> 1st concern:
> >>> '''
> >>> But what happens when a hot remove section's vmemmap area (which is
> >>> being teared down) is nearby another vmemmap area which is either created
> >>> or being destroyed for HugeTLB alloc/free purpose. As you mentioned
> >>> HugeTLB pages inside the hot remove section might be safe. But what about
> >>> other HugeTLB areas whose vmemmap area shares page table entries with
> >>> vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
> >>> /use/free test cycle using memory just adjacent to a memory hotplug area,
> >>> which is always added and removed periodically, should be able to expose
> >>> this problem.
> >>> '''
> >>>
> >>> Answer: At the time memory is removed, all HugeTLB pages either have been
> >>> migrated away or dissolved.  So there is no race between memory hot remove
> >>> and free_huge_page_vmemmap().  Therefore, HugeTLB pages inside the hot
> >>> remove section are safe.  Let's talk about your question "what about other
> >>
> >> HugeTLB pages inside the memory range is safe but concern is about the
> >> vmemmap mapping for the HugeTLB which might share intermediate entries
> >> with vmemmap mapping for the memory range/section being removed.
> > 
> > The shared page table level could only be PMD, PUD or PGD; the PTE
> > page table cannot be shared with other sections, and we only exchange
> > PTEs for vmemmap mapping.
> 
> Right, the shared entries (if any) are not at the leaf level.
> 
> > 
> >>
> >>> HugeTLB areas whose vmemmap area shares page table entries with vmemmap
> >>> entries for a section being hot removed ?", the premise of the question
> >>
> >> Right.
> >>
> >>> does not hold.  The minimal granularity of hotplug memory is 128MB (on
> >>> arm64 with 4K base pages), so any HugeTLB smaller than 128MB is within a
> >>> section.  Then there are no shared PTE page tables between HugeTLB pages in this section
> >>
> >> 128MB is the hot removable granularity, but its corresponding vmemmap
> >> range is smaller, i.e. (128MB/4K) * sizeof(struct page). Could a memory
> >> section getting hot removed (its vmemmap mapping being torn down) collide
> >> with a HugeTLB (on another section) vmemmap remap operation while both
> >> are inside vmemmap mapping areas on init_mm ?
> > 
> > The boundary address of a section is aligned with 128MB and its
> > corresponding vmemmap boundary address is aligned with 2MB
> > which is mapped with a separate PTE page table (or a PMD entry).
> 
> Even if these PMD entries are split during HugeTLB remapping, they will not
> conflict with another memory section being removed simultaneously. Also,
> any shared page table pages will not be freed during a memory hot remove
> operation, as vmemmap remap does not delete any entries.
> 
> But just wondering: during the PMD split and PTE page table page addition,
> could these PMD entries be empty, even temporarily ?
>

The pmd entry is either a PTE page table or a PMD leaf entry; it can
never be an empty entry.  For more details, refer to
__split_vmemmap_huge_pmd().

Thanks.
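The split invariant described above can be sketched as a toy model (illustrative Python, not kernel code; names are ours): the PTE table is fully populated first, and only then is the PMD entry repointed in a single step, mirroring what __split_vmemmap_huge_pmd() does, so a walker never observes an empty entry.

```python
# Toy model of the PMD split: build a fully populated PTE table first,
# then replace the leaf entry with the table in one step ("pmd_populate").
# At no point is the PMD entry empty, so a concurrent lookup sees either
# the old leaf mapping or the new table -- never a hole.
PMD_SIZE = 2 * 2**20
PAGE_SIZE = 4096
PTES_PER_PMD = PMD_SIZE // PAGE_SIZE   # 512

def lookup(pmd, offset):
    # walk one level: PMD leaf or PTE table
    if pmd["leaf"]:
        return pmd["pfn"] + offset // PAGE_SIZE
    return pmd["table"][offset // PAGE_SIZE]

def split_pmd(pmd):
    # 1) populate a complete PTE table covering the same 2MB range
    table = [pmd["pfn"] + i for i in range(PTES_PER_PMD)]
    # 2) atomically replace the leaf entry with the table
    pmd.update({"leaf": False, "table": table})

pmd = {"leaf": True, "pfn": 0x1000}
before = lookup(pmd, 5 * PAGE_SIZE)
split_pmd(pmd)
assert lookup(pmd, 5 * PAGE_SIZE) == before   # translation unchanged
```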
Muchun Song April 11, 2022, 11:55 a.m. UTC | #11
On Mon, Apr 11, 2022 at 03:42:52PM +0530, Anshuman Khandual wrote:
> 
> 
> On 3/31/22 12:26, Muchun Song wrote:
> > --- a/arch/arm64/mm/flush.c
> > +++ b/arch/arm64/mm/flush.c
> > @@ -68,6 +68,19 @@ EXPORT_SYMBOL_GPL(__sync_icache_dcache);
> >   */
> >  void flush_dcache_page(struct page *page)
> >  {
> > +	/*
> > +	 * Only the head page's flags of HugeTLB can be cleared since the tail
> > +	 * vmemmap pages associated with each HugeTLB page are mapped with
> > +	 * read-only when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled (more
> > +	 * details can refer to vmemmap_remap_pte()).  Although
> 
> Did you see a real permission fault when flush_dcache_page() accessed remapped
> tail pages with a read-only vmemmap ? Or is this change from code inspection ?
>

Not a real world issue. Just a code-inspection one.
 
> > +	 * __sync_icache_dcache() only set PG_dcache_clean flag on the head
> > +	 * page struct, some tail page structs still can be seen the flag is
> 
> Sentence here needs restructuring ....               ^^^^^^^^^^^^^^^^^^
> 

Will do.

> > +	 * set since the head vmemmap page frame is reused (more details can
> > +	 * refer to the comments above page_fixed_fake_head()).
> > +	 */
> > +	if (hugetlb_free_vmemmap_enabled() && PageHuge(page))
> > +		page = compound_head(page);
> 
> This should also be applicable to any other platform with both configs enabled,
> i.e. ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE and ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP ?

Yes.

> If yes, then how do we ensure that the platforms change flush_dcache_page() before
> subscribing to ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP ?
> 

Whoever enables this feature has to make sure there are no issues
(e.g. they should fix flush_dcache_page() or other related problems) with
the feature enabled.

Thanks.
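The head-page redirection under discussion can also be modelled in a few lines (a toy model with illustrative names, not kernel code): once the tail vmemmap pages alias the head page frame, reading a tail's flags may reflect the head's, and clearing must go through compound_head().

```python
# Toy model: with the vmemmap optimization, tail struct pages are
# read-only aliases of the head vmemmap page, so they "see" the head's
# flags.  flush_dcache_page() therefore clears the flag on the head only.
class HugePage:
    def __init__(self):
        self.flags = {"PG_dcache_clean"}   # stored once, on the head

    def head(self):                        # stands in for compound_head()
        return self

    def tail_flags(self):
        # a tail's flags word aliases the head's (read-only mapping)
        return frozenset(self.flags)

def flush_dcache_page(page):
    head = page.head()                     # redirect to the head page
    head.flags.discard("PG_dcache_clean")  # writable only via the head

hp = HugePage()
assert "PG_dcache_clean" in hp.tail_flags()   # tails appear to carry the flag
flush_dcache_page(hp)
assert "PG_dcache_clean" not in hp.tail_flags()
```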
Anshuman Khandual April 13, 2022, 10:33 a.m. UTC | #12
On 3/31/22 12:26, Muchun Song wrote:
> The feature of minimizing overhead of struct page associated with each
> HugeTLB page aims to free its vmemmap pages (used as struct page) to
> save memory, which is ~14GB/16GB per 1TB of HugeTLB pages (2MB/1GB type).
> In short, when a HugeTLB page is allocated or freed, the vmemmap array
> representing the range associated with the page will need to be remapped.
> When a page is allocated, vmemmap pages are freed after remapping.
> When a page is freed, previously discarded vmemmap pages must be
> allocated before remapping.  More implementation details can be
> found here [1].
> 
> The infrastructure for freeing vmemmap pages associated with each HugeTLB
> page is already there; we can easily enable HUGETLB_PAGE_FREE_VMEMMAP
> for arm64.  The only thing to be fixed is flush_dcache_page().
> 
> flush_dcache_page() needs to be adapted to operate on the head page's
> flags since the tail vmemmap pages are mapped read-only after the
> feature is enabled (clearing flags there is not permitted).
> 
> There was some discussion about this in the thread [2], but there was
> no conclusion in the end.  I have copied the concerns raised by Anshuman
> here and explain why they are unfounded.  It is safe to
> enable it for x86_64 as well as arm64.
> 
> 1st concern:
> '''
> But what happens when a hot remove section's vmemmap area (which is
> being teared down) is nearby another vmemmap area which is either created
> or being destroyed for HugeTLB alloc/free purpose. As you mentioned
> HugeTLB pages inside the hot remove section might be safe. But what about
> other HugeTLB areas whose vmemmap area shares page table entries with
> vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
> /use/free test cycle using memory just adjacent to a memory hotplug area,
> which is always added and removed periodically, should be able to expose
> this problem.
> '''
> 
> Answer: At the time memory is removed, all HugeTLB pages either have been
> migrated away or dissolved.  So there is no race between memory hot remove
> and free_huge_page_vmemmap().  Therefore, HugeTLB pages inside the hot
> remove section are safe.  Let's talk about your question "what about other
> HugeTLB areas whose vmemmap area shares page table entries with vmemmap
> entries for a section being hot removed ?": the premise of the question
> does not hold.  The minimal granularity of hotplug memory is 128MB (on
> arm64 with 4K base pages), so any HugeTLB smaller than 128MB is within a
> section.  Then there are no shared PTE page tables between HugeTLB pages
> in this section and ones in other sections, and a HugeTLB page cannot
> cross two sections.  In this case, the section cannot be freed.  Any
> HugeTLB bigger than 128MB (section size) has vmemmap pages whose size is
> an integer multiple of 2MB (PMD-mapped).  As long as:
> 
>   1) HugeTLBs are naturally aligned, power-of-two sizes
>   2) The HugeTLB size >= the section size
>   3) The HugeTLB size >= the vmemmap leaf mapping size
> 
> Then a HugeTLB will not share any leaf page table entries with *anything
> else*, but will share intermediate entries.  In this case, at the time memory
> is removed, all HugeTLB pages either have been migrated away or dissolved.
> So there is also no race between memory hot remove and
> free_huge_page_vmemmap().
> 
> 2nd concern:
> '''
> differently, not sure if ptdump would require any synchronization.
> 
> Dumping an wrong value is probably okay but crashing because a page table
> entry is being freed after ptdump acquired the pointer is bad. On arm64,
> ptdump() is protected against hotremove via [get|put]_online_mems().
> '''
> 
> Answer: The ptdump should be fine since vmemmap_remap_free() only exchanges
> PTEs or splits the PMD entry (which means allocating a PTE page table).  Neither
> operation frees any page tables (PTE), so ptdump cannot run into a
> UAF on any page tables.  The worst case is just dumping a wrong value.
> 
> [1] https://lore.kernel.org/all/20210510030027.56044-1-songmuchun@bytedance.com/
> [2] https://lore.kernel.org/all/20210518091826.36937-1-songmuchun@bytedance.com/
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
> v4:
>  - Introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP (implemented in the previous
>    patch) to enable this feature for arm64.
> 
> v3:
>  - Rework patch's subject.
>  - Clarify in the commit log that the HUGETLB_PAGE_FREE_VMEMMAP feature is
>    already there in the current code and can easily be enabled for arm64.
>  - Add hugetlb_free_vmemmap_enabled() check into flush_dcache_page().
> 
>  Thanks for Barry's suggestions.
> 
> v2:
>  - Update commit message (Mark Rutland).
>  - Fix flush_dcache_page().
> 
>  arch/arm64/Kconfig    |  1 +
>  arch/arm64/mm/flush.c | 13 +++++++++++++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index c842878f8133..37f72e3a75d0 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -94,6 +94,7 @@ config ARM64
>  	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>  	select ARCH_WANT_FRAME_POINTERS
>  	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> +	select ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP
>  	select ARCH_WANT_LD_ORPHAN_WARN
>  	select ARCH_WANTS_NO_INSTR
>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> index 2aaf950b906c..c67c1ca856c2 100644
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -68,6 +68,19 @@ EXPORT_SYMBOL_GPL(__sync_icache_dcache);
>   */
>  void flush_dcache_page(struct page *page)
>  {
> +	/*
> +	 * Only the head page's flags of HugeTLB can be cleared since the tail
> +	 * vmemmap pages associated with each HugeTLB page are mapped with
> +	 * read-only when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled (more
> +	 * details can refer to vmemmap_remap_pte()).  Although
> +	 * __sync_icache_dcache() only set PG_dcache_clean flag on the head
> +	 * page struct, some tail page structs still can be seen the flag is
> +	 * set since the head vmemmap page frame is reused (more details can
> +	 * refer to the comments above page_fixed_fake_head()).
> +	 */
> +	if (hugetlb_free_vmemmap_enabled() && PageHuge(page))
> +		page = compound_head(page);
> +
>  	if (test_bit(PG_dcache_clean, &page->flags))
>  		clear_bit(PG_dcache_clean, &page->flags);
>  }

With the above code comment inside flush_dcache_page() restructured,

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Muchun Song April 13, 2022, 2:59 p.m. UTC | #13
On Wed, Apr 13, 2022 at 04:03:33PM +0530, Anshuman Khandual wrote:
> 
> 
> On 3/31/22 12:26, Muchun Song wrote:
> > The feature of minimizing overhead of struct page associated with each
> > HugeTLB page aims to free its vmemmap pages (used as struct page) to
> > save memory, which is ~14GB/16GB per 1TB of HugeTLB pages (2MB/1GB type).
> > In short, when a HugeTLB page is allocated or freed, the vmemmap array
> > representing the range associated with the page will need to be remapped.
> > When a page is allocated, vmemmap pages are freed after remapping.
> > When a page is freed, previously discarded vmemmap pages must be
> > allocated before remapping.  More implementation details can be
> > found here [1].
> > 
> > The infrastructure for freeing vmemmap pages associated with each HugeTLB
> > page is already there; we can easily enable HUGETLB_PAGE_FREE_VMEMMAP
> > for arm64.  The only thing to be fixed is flush_dcache_page().
> > 
> > flush_dcache_page() needs to be adapted to operate on the head page's
> > flags since the tail vmemmap pages are mapped read-only after the
> > feature is enabled (clearing flags there is not permitted).
> > 
> > There was some discussion about this in the thread [2], but there was
> > no conclusion in the end.  I have copied the concerns raised by Anshuman
> > here and explain why they are unfounded.  It is safe to
> > enable it for x86_64 as well as arm64.
> > 
> > 1st concern:
> > '''
> > But what happens when a hot remove section's vmemmap area (which is
> > being teared down) is nearby another vmemmap area which is either created
> > or being destroyed for HugeTLB alloc/free purpose. As you mentioned
> > HugeTLB pages inside the hot remove section might be safe. But what about
> > other HugeTLB areas whose vmemmap area shares page table entries with
> > vmemmap entries for a section being hot removed ? Massive HugeTLB alloc
> > /use/free test cycle using memory just adjacent to a memory hotplug area,
> > which is always added and removed periodically, should be able to expose
> > this problem.
> > '''
> > 
> > Answer: At the time memory is removed, all HugeTLB pages either have been
> > migrated away or dissolved.  So there is no race between memory hot remove
> > and free_huge_page_vmemmap().  Therefore, HugeTLB pages inside the hot
> > remove section are safe.  Let's talk about your question "what about other
> > HugeTLB areas whose vmemmap area shares page table entries with vmemmap
> > entries for a section being hot removed ?": the premise of the question
> > does not hold.  The minimal granularity of hotplug memory is 128MB (on
> > arm64 with 4K base pages), so any HugeTLB smaller than 128MB is within a
> > section.  Then there are no shared PTE page tables between HugeTLB pages
> > in this section and ones in other sections, and a HugeTLB page cannot
> > cross two sections.  In this case, the section cannot be freed.  Any
> > HugeTLB bigger than 128MB (section size) has vmemmap pages whose size is
> > an integer multiple of 2MB (PMD-mapped).  As long as:
> > 
> >   1) HugeTLBs are naturally aligned, power-of-two sizes
> >   2) The HugeTLB size >= the section size
> >   3) The HugeTLB size >= the vmemmap leaf mapping size
> > 
> > Then a HugeTLB will not share any leaf page table entries with *anything
> > else*, but will share intermediate entries.  In this case, at the time memory
> > is removed, all HugeTLB pages either have been migrated away or dissolved.
> > So there is also no race between memory hot remove and
> > free_huge_page_vmemmap().
> > 
> > 2nd concern:
> > '''
> > differently, not sure if ptdump would require any synchronization.
> > 
> > Dumping an wrong value is probably okay but crashing because a page table
> > entry is being freed after ptdump acquired the pointer is bad. On arm64,
> > ptdump() is protected against hotremove via [get|put]_online_mems().
> > '''
> > 
> > Answer: The ptdump should be fine since vmemmap_remap_free() only exchanges
> > PTEs or splits the PMD entry (which means allocating a PTE page table).  Neither
> > operation frees any page tables (PTE), so ptdump cannot run into a
> > UAF on any page tables.  The worst case is just dumping a wrong value.
> > 
> > [1] https://lore.kernel.org/all/20210510030027.56044-1-songmuchun@bytedance.com/
> > [2] https://lore.kernel.org/all/20210518091826.36937-1-songmuchun@bytedance.com/
> > 
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> > v4:
> >  - Introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP (implemented in the previous
> >    patch) to enable this feature for arm64.
> > 
> > v3:
> >  - Rework patch's subject.
> >  - Clarify in the commit log that the HUGETLB_PAGE_FREE_VMEMMAP feature is
> >    already there in the current code and can easily be enabled for arm64.
> >  - Add hugetlb_free_vmemmap_enabled() check into flush_dcache_page().
> > 
> >  Thanks for Barry's suggestions.
> > 
> > v2:
> >  - Update commit message (Mark Rutland).
> >  - Fix flush_dcache_page().
> > 
> >  arch/arm64/Kconfig    |  1 +
> >  arch/arm64/mm/flush.c | 13 +++++++++++++
> >  2 files changed, 14 insertions(+)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index c842878f8133..37f72e3a75d0 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -94,6 +94,7 @@ config ARM64
> >  	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> >  	select ARCH_WANT_FRAME_POINTERS
> >  	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> > +	select ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP
> >  	select ARCH_WANT_LD_ORPHAN_WARN
> >  	select ARCH_WANTS_NO_INSTR
> >  	select ARCH_HAS_UBSAN_SANITIZE_ALL
> > diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> > index 2aaf950b906c..c67c1ca856c2 100644
> > --- a/arch/arm64/mm/flush.c
> > +++ b/arch/arm64/mm/flush.c
> > @@ -68,6 +68,19 @@ EXPORT_SYMBOL_GPL(__sync_icache_dcache);
> >   */
> >  void flush_dcache_page(struct page *page)
> >  {
> > +	/*
> > +	 * Only the head page's flags of HugeTLB can be cleared since the tail
> > +	 * vmemmap pages associated with each HugeTLB page are mapped with
> > +	 * read-only when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled (more
> > +	 * details can refer to vmemmap_remap_pte()).  Although
> > +	 * __sync_icache_dcache() only set PG_dcache_clean flag on the head
> > +	 * page struct, some tail page structs still can be seen the flag is
> > +	 * set since the head vmemmap page frame is reused (more details can
> > +	 * refer to the comments above page_fixed_fake_head()).
> > +	 */
> > +	if (hugetlb_free_vmemmap_enabled() && PageHuge(page))
> > +		page = compound_head(page);
> > +
> >  	if (test_bit(PG_dcache_clean, &page->flags))
> >  		clear_bit(PG_dcache_clean, &page->flags);
> >  }
> 
> With the above code comment inside flush_dcache_page() restructured,
> 
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
>

Thanks for your review. Will update soon.
diff mbox series

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c842878f8133..37f72e3a75d0 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -94,6 +94,7 @@  config ARM64
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
+	select ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 2aaf950b906c..c67c1ca856c2 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -68,6 +68,19 @@  EXPORT_SYMBOL_GPL(__sync_icache_dcache);
  */
 void flush_dcache_page(struct page *page)
 {
+	/*
+	 * Only the head page's flags of HugeTLB can be cleared since the tail
+	 * vmemmap pages associated with each HugeTLB page are mapped with
+	 * read-only when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled (more
+	 * details can refer to vmemmap_remap_pte()).  Although
+	 * __sync_icache_dcache() only set PG_dcache_clean flag on the head
+	 * page struct, some tail page structs still can be seen the flag is
+	 * set since the head vmemmap page frame is reused (more details can
+	 * refer to the comments above page_fixed_fake_head()).
+	 */
+	if (hugetlb_free_vmemmap_enabled() && PageHuge(page))
+		page = compound_head(page);
+
 	if (test_bit(PG_dcache_clean, &page->flags))
 		clear_bit(PG_dcache_clean, &page->flags);
 }
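As a closing sanity check, the ~14GB/16GB-per-1TB savings quoted in the commit message follow from simple arithmetic. The sketch below assumes a 64-byte struct page and that exactly one vmemmap page is kept per HugeTLB page, a simplification of the actual remapping scheme:

```python
# Rough arithmetic behind the "~14GB/16GB per 1TB" figure (assumptions:
# 4K base pages, 64-byte struct page, one vmemmap page kept per HugeTLB
# page -- a simplification of the real scheme).
TB = 2**40
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64

def saved_per_tb(hugepage_size):
    # vmemmap pages backing one HugeTLB page's struct pages
    vmemmap_pages = hugepage_size // PAGE_SIZE * STRUCT_PAGE_SIZE // PAGE_SIZE
    freed = vmemmap_pages - 1               # the head vmemmap page is kept
    return (TB // hugepage_size) * freed * PAGE_SIZE

print(saved_per_tb(2 * 2**20) / 2**30)   # 2MB HugeTLB: 14.0 GB per TB
print(saved_per_tb(1 * 2**30) / 2**30)   # 1GB HugeTLB: just under 16 GB per TB
```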