
arm64: mm: hugetlb: add support for free vmemmap pages of HugeTLB

Message ID 20210518091826.36937-1-songmuchun@bytedance.com (mailing list archive)
State New, archived
Series arm64: mm: hugetlb: add support for free vmemmap pages of HugeTLB

Commit Message

Muchun Song May 18, 2021, 9:18 a.m. UTC
The preparation of supporting freeing vmemmap associated with each
HugeTLB page is ready, so we can support this feature for arm64.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/arm64/mm/mmu.c | 5 +++++
 fs/Kconfig          | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

Comments

Anshuman Khandual May 19, 2021, 11:45 a.m. UTC | #1
On 5/18/21 2:48 PM, Muchun Song wrote:
> The preparation of supporting freeing vmemmap associated with each
> HugeTLB page is ready, so we can support this feature for arm64.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  arch/arm64/mm/mmu.c | 5 +++++
>  fs/Kconfig          | 2 +-
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 5d37e461c41f..967b01ce468d 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -23,6 +23,7 @@
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
>  #include <linux/set_memory.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/barrier.h>
>  #include <asm/cputype.h>
> @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  	pmd_t *pmdp;
>  
>  	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> +
> +	if (is_hugetlb_free_vmemmap_enabled() && !altmap)
> +		return vmemmap_populate_basepages(start, end, node, altmap);

Not considering the fact that this will force the kernel to have only
base page size mappings for vmemmap (unless altmap is also requested),
which might reduce performance, it also allows the vmemmap mapping to
be torn down or built up at runtime, which could potentially collide
with other kernel page table walkers like ptdump or the memory
hot-remove operation! How are those possible collisions protected
against right now?

Does this vmemmap operation not increase latency for HugeTLB usage?
Should this runtime enablement not also take into account some other
qualifying information apart from the potential memory saved from
struct page areas? Just wondering.

> +
>  	do {
>  		next = pmd_addr_end(addr, end);
>  
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 6ce6fdac00a3..02c2d3bf1cb8 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -242,7 +242,7 @@ config HUGETLB_PAGE
>  
>  config HUGETLB_PAGE_FREE_VMEMMAP
>  	def_bool HUGETLB_PAGE
> -	depends on X86_64
> +	depends on X86_64 || ARM64
>  	depends on SPARSEMEM_VMEMMAP
>  
>  config MEMFD_CREATE
>
David Hildenbrand May 19, 2021, 12:03 p.m. UTC | #2
On 19.05.21 13:45, Anshuman Khandual wrote:
> 
> 
> On 5/18/21 2:48 PM, Muchun Song wrote:
>> The preparation of supporting freeing vmemmap associated with each
>> HugeTLB page is ready, so we can support this feature for arm64.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>>   arch/arm64/mm/mmu.c | 5 +++++
>>   fs/Kconfig          | 2 +-
>>   2 files changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 5d37e461c41f..967b01ce468d 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -23,6 +23,7 @@
>>   #include <linux/mm.h>
>>   #include <linux/vmalloc.h>
>>   #include <linux/set_memory.h>
>> +#include <linux/hugetlb.h>
>>   
>>   #include <asm/barrier.h>
>>   #include <asm/cputype.h>
>> @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>   	pmd_t *pmdp;
>>   
>>   	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>> +
>> +	if (is_hugetlb_free_vmemmap_enabled() && !altmap)
>> +		return vmemmap_populate_basepages(start, end, node, altmap);
> 
> Not considering the fact that this will force the kernel to have only
> base page size mapping for vmemmap (unless altmap is also requested)
> which might reduce the performance, it also enables vmemmap mapping to
> be teared down or build up at runtime which could potentially collide
> with other kernel page table walkers like ptdump or memory hotremove
> operation ! How those possible collisions are protected right now ?

Hi Anshuman,

Memory hotremove is not an issue IIRC. At the time memory is removed, 
all huge pages either have been migrated away or dissolved; the vmemmap 
is stable.

vmemmap access (accessing the memmap via a virtual address) itself is 
not an issue. Manually walking (vmemmap) page tables might behave 
differently, not sure if ptdump would require any synchronization.

> 
> Does not this vmemmap operation increase latency for HugeTLB usage ?
> Should not this runtime enablement also take into account some other
> qualifying information apart from potential memory save from struct
> page areas. Just wondering.

That's one of the reasons why it explicitly has to be enabled by an admin.
Anshuman Khandual May 19, 2021, 12:36 p.m. UTC | #3
On 5/18/21 2:48 PM, Muchun Song wrote:
> The preparation of supporting freeing vmemmap associated with each
> HugeTLB page is ready, so we can support this feature for arm64.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  arch/arm64/mm/mmu.c | 5 +++++
>  fs/Kconfig          | 2 +-
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 5d37e461c41f..967b01ce468d 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -23,6 +23,7 @@
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
>  #include <linux/set_memory.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/barrier.h>
>  #include <asm/cputype.h>
> @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  	pmd_t *pmdp;
>  
>  	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> +
> +	if (is_hugetlb_free_vmemmap_enabled() && !altmap)
> +		return vmemmap_populate_basepages(start, end, node, altmap);
> +
>  	do {
>  		next = pmd_addr_end(addr, end);
>  
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 6ce6fdac00a3..02c2d3bf1cb8 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -242,7 +242,7 @@ config HUGETLB_PAGE
>  
>  config HUGETLB_PAGE_FREE_VMEMMAP
>  	def_bool HUGETLB_PAGE
> -	depends on X86_64
> +	depends on X86_64 || ARM64
>  	depends on SPARSEMEM_VMEMMAP
>  
>  config MEMFD_CREATE
> 

How does this interact with HugeTLB migration, which might iterate
over the individual constituent struct pages (with the tail pages all
backed by the same vmemmap page when this feature is enabled)? A simple
test involving madvise(ptr, size, MADV_SOFT_OFFLINE) fails on various
HugeTLB page sizes with this patch applied, although I have not
debugged this any further.

Soft offlining pfn 0x101c00 at process virtual address 0xffff7fa00000
soft offline: 0x101c00: hugepage migration failed 1, type bfffc0000010006 
              (referenced|uptodate|head|node=0|zone=2|lastcpupid=0xffff)
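
For reference, a reproducer along these lines could look like the
hypothetical sketch below (not necessarily the exact test used above; it
assumes the 2 MiB default huge page size and requires CAP_SYS_ADMIN plus
CONFIG_MEMORY_FAILURE):

/* Hypothetical sketch: soft-offline one mapped HugeTLB page. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101	/* from <asm-generic/mman-common.h> */
#endif

int main(void)
{
	size_t size = 2UL << 20;	/* assumed default huge page size */
	void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (ptr == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	memset(ptr, 0, size);		/* fault the huge page in */

	/* Migrate the huge page away and soft-offline the old one. */
	if (madvise(ptr, size, MADV_SOFT_OFFLINE))
		perror("madvise(MADV_SOFT_OFFLINE)");

	munmap(ptr, size);
	return 0;
}
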
Muchun Song May 19, 2021, 12:49 p.m. UTC | #4
On Wed, May 19, 2021 at 7:44 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
> On 5/18/21 2:48 PM, Muchun Song wrote:
> > The preparation of supporting freeing vmemmap associated with each
> > HugeTLB page is ready, so we can support this feature for arm64.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  arch/arm64/mm/mmu.c | 5 +++++
> >  fs/Kconfig          | 2 +-
> >  2 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 5d37e461c41f..967b01ce468d 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/set_memory.h>
> > +#include <linux/hugetlb.h>
> >
> >  #include <asm/barrier.h>
> >  #include <asm/cputype.h>
> > @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> >       pmd_t *pmdp;
> >
> >       WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> > +
> > +     if (is_hugetlb_free_vmemmap_enabled() && !altmap)
> > +             return vmemmap_populate_basepages(start, end, node, altmap);
>
> Not considering the fact that this will force the kernel to have only
> base page size mapping for vmemmap (unless altmap is also requested)
> which might reduce the performance, it also enables vmemmap mapping to
> be teared down or build up at runtime which could potentially collide
> with other kernel page table walkers like ptdump or memory hotremove
> operation ! How those possible collisions are protected right now ?

For ptdump, there seems to be no problem. The change of the PTEs does
not seem to affect ptdump, unless I am missing something.

>
> Does not this vmemmap operation increase latency for HugeTLB usage ?
> Should not this runtime enablement also take into account some other
> qualifying information apart from potential memory save from struct
> page areas. Just wondering.

The disadvantage is that we add a PTE-level mapping for the vmemmap
pages; from this point of view, the latency will increase.

But there is an additional benefit: page (un)pinners will
see an improvement, because there are fewer vmemmap pages
and thus the head/tail pages stay in the cache more often.
From this point of view, the latency will decrease.

So if the user cares about the memory usage of struct pages, they
can enable this feature via the cmdline at boot. As David said, "That's
one of the reasons why it explicitly has to be enabled by an admin".
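
For context, the boot-time switch mentioned above is wired up roughly as
in the sketch below; the variable and function names are assumptions
modeled on the core series rather than verbatim kernel code:

/* Sketch of the "hugetlb_free_vmemmap=" boot parameter (assumed names). */
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/string.h>

bool hugetlb_free_vmemmap_enabled;

static int __init early_hugetlb_free_vmemmap_param(char *buf)
{
	if (!buf)
		return -EINVAL;

	if (!strcmp(buf, "on"))
		hugetlb_free_vmemmap_enabled = true;
	else if (!strcmp(buf, "off"))
		hugetlb_free_vmemmap_enabled = false;
	else
		return -EINVAL;

	return 0;
}
early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);

/* is_hugetlb_free_vmemmap_enabled() simply returns this flag. */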

>
> > +
> >       do {
> >               next = pmd_addr_end(addr, end);
> >
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 6ce6fdac00a3..02c2d3bf1cb8 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -242,7 +242,7 @@ config HUGETLB_PAGE
> >
> >  config HUGETLB_PAGE_FREE_VMEMMAP
> >       def_bool HUGETLB_PAGE
> > -     depends on X86_64
> > +     depends on X86_64 || ARM64
> >       depends on SPARSEMEM_VMEMMAP
> >
> >  config MEMFD_CREATE
> >
Muchun Song May 19, 2021, 2:43 p.m. UTC | #5
On Wed, May 19, 2021 at 8:35 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
> On 5/18/21 2:48 PM, Muchun Song wrote:
> > The preparation of supporting freeing vmemmap associated with each
> > HugeTLB page is ready, so we can support this feature for arm64.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  arch/arm64/mm/mmu.c | 5 +++++
> >  fs/Kconfig          | 2 +-
> >  2 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 5d37e461c41f..967b01ce468d 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/set_memory.h>
> > +#include <linux/hugetlb.h>
> >
> >  #include <asm/barrier.h>
> >  #include <asm/cputype.h>
> > @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> >       pmd_t *pmdp;
> >
> >       WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> > +
> > +     if (is_hugetlb_free_vmemmap_enabled() && !altmap)
> > +             return vmemmap_populate_basepages(start, end, node, altmap);
> > +
> >       do {
> >               next = pmd_addr_end(addr, end);
> >
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 6ce6fdac00a3..02c2d3bf1cb8 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -242,7 +242,7 @@ config HUGETLB_PAGE
> >
> >  config HUGETLB_PAGE_FREE_VMEMMAP
> >       def_bool HUGETLB_PAGE
> > -     depends on X86_64
> > +     depends on X86_64 || ARM64
> >       depends on SPARSEMEM_VMEMMAP
> >
> >  config MEMFD_CREATE
> >
>
> How does this interact with HugeTLB migration as such which might iterate
> over individual constituent struct pages (overriding the same struct page
> for all tail pages when this feature is enabled). A simple test involving
> madvise(ptr, size, MADV_SOFT_OFFLINE) fails on various HugeTLB page sizes,
> with this patch applied. Although I have not debugged this any further.

It is weird. Actually, I didn't change the behaviour of page migration.
This feature is off by default. If you want to enable this feature, you can
pass "hugetlb_free_vmemmap=on" on the boot cmdline. Do you mean that the
success rate of page migration will decrease when you enable this feature?
The rate will increase if it is disabled. Right?

Thanks.


>
> Soft offlining pfn 0x101c00 at process virtual address 0xffff7fa00000
> soft offline: 0x101c00: hugepage migration failed 1, type bfffc0000010006
>               (referenced|uptodate|head|node=0|zone=2|lastcpupid=0xffff)
>
Muchun Song May 19, 2021, 3:22 p.m. UTC | #6
On Wed, May 19, 2021 at 10:43 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Wed, May 19, 2021 at 8:35 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
> >
> >
> > On 5/18/21 2:48 PM, Muchun Song wrote:
> > > The preparation of supporting freeing vmemmap associated with each
> > > HugeTLB page is ready, so we can support this feature for arm64.
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > ---
> > >  arch/arm64/mm/mmu.c | 5 +++++
> > >  fs/Kconfig          | 2 +-
> > >  2 files changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > index 5d37e461c41f..967b01ce468d 100644
> > > --- a/arch/arm64/mm/mmu.c
> > > +++ b/arch/arm64/mm/mmu.c
> > > @@ -23,6 +23,7 @@
> > >  #include <linux/mm.h>
> > >  #include <linux/vmalloc.h>
> > >  #include <linux/set_memory.h>
> > > +#include <linux/hugetlb.h>
> > >
> > >  #include <asm/barrier.h>
> > >  #include <asm/cputype.h>
> > > @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> > >       pmd_t *pmdp;
> > >
> > >       WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> > > +
> > > +     if (is_hugetlb_free_vmemmap_enabled() && !altmap)
> > > +             return vmemmap_populate_basepages(start, end, node, altmap);
> > > +
> > >       do {
> > >               next = pmd_addr_end(addr, end);
> > >
> > > diff --git a/fs/Kconfig b/fs/Kconfig
> > > index 6ce6fdac00a3..02c2d3bf1cb8 100644
> > > --- a/fs/Kconfig
> > > +++ b/fs/Kconfig
> > > @@ -242,7 +242,7 @@ config HUGETLB_PAGE
> > >
> > >  config HUGETLB_PAGE_FREE_VMEMMAP
> > >       def_bool HUGETLB_PAGE
> > > -     depends on X86_64
> > > +     depends on X86_64 || ARM64
> > >       depends on SPARSEMEM_VMEMMAP
> > >
> > >  config MEMFD_CREATE
> > >
> >
> > How does this interact with HugeTLB migration as such which might iterate
> > over individual constituent struct pages (overriding the same struct page
> > for all tail pages when this feature is enabled). A simple test involving
> > madvise(ptr, size, MADV_SOFT_OFFLINE) fails on various HugeTLB page sizes,
> > with this patch applied. Although I have not debugged this any further.
>
> It is weird. Actually, I didn't change the behaviour of the page migration.
> This feature is default off. If you want to enable this feature, you can pass
> "hugetlb_free_vmemmap=on" to the boot cmdline. Do you mean that the
> success rate of page migration will decrease when you enable this feature?
> The rate will increase if disbale. Right?

I have done the test and found the issue: unmap_and_move_huge_page()
always returns -EBUSY. I will look into this issue in depth. Thanks for
your report.

The return point is as below:

if (page_private(hpage) && !page_mapping(hpage)) {
        rc = -EBUSY;
        goto out_unlock;
}

>
> Thanks.
>
>
> >
> > Soft offlining pfn 0x101c00 at process virtual address 0xffff7fa00000
> > soft offline: 0x101c00: hugepage migration failed 1, type bfffc0000010006
> >               (referenced|uptodate|head|node=0|zone=2|lastcpupid=0xffff)
> >
Muchun Song May 19, 2021, 4:21 p.m. UTC | #7
On Wed, May 19, 2021 at 11:22 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Wed, May 19, 2021 at 10:43 PM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > On Wed, May 19, 2021 at 8:35 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> > >
> > >
> > > On 5/18/21 2:48 PM, Muchun Song wrote:
> > > > The preparation of supporting freeing vmemmap associated with each
> > > > HugeTLB page is ready, so we can support this feature for arm64.
> > > >
> > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > ---
> > > >  arch/arm64/mm/mmu.c | 5 +++++
> > > >  fs/Kconfig          | 2 +-
> > > >  2 files changed, 6 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > > index 5d37e461c41f..967b01ce468d 100644
> > > > --- a/arch/arm64/mm/mmu.c
> > > > +++ b/arch/arm64/mm/mmu.c
> > > > @@ -23,6 +23,7 @@
> > > >  #include <linux/mm.h>
> > > >  #include <linux/vmalloc.h>
> > > >  #include <linux/set_memory.h>
> > > > +#include <linux/hugetlb.h>
> > > >
> > > >  #include <asm/barrier.h>
> > > >  #include <asm/cputype.h>
> > > > @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> > > >       pmd_t *pmdp;
> > > >
> > > >       WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> > > > +
> > > > +     if (is_hugetlb_free_vmemmap_enabled() && !altmap)
> > > > +             return vmemmap_populate_basepages(start, end, node, altmap);
> > > > +
> > > >       do {
> > > >               next = pmd_addr_end(addr, end);
> > > >
> > > > diff --git a/fs/Kconfig b/fs/Kconfig
> > > > index 6ce6fdac00a3..02c2d3bf1cb8 100644
> > > > --- a/fs/Kconfig
> > > > +++ b/fs/Kconfig
> > > > @@ -242,7 +242,7 @@ config HUGETLB_PAGE
> > > >
> > > >  config HUGETLB_PAGE_FREE_VMEMMAP
> > > >       def_bool HUGETLB_PAGE
> > > > -     depends on X86_64
> > > > +     depends on X86_64 || ARM64
> > > >       depends on SPARSEMEM_VMEMMAP
> > > >
> > > >  config MEMFD_CREATE
> > > >
> > >
> > > How does this interact with HugeTLB migration as such which might iterate
> > > over individual constituent struct pages (overriding the same struct page
> > > for all tail pages when this feature is enabled). A simple test involving
> > > madvise(ptr, size, MADV_SOFT_OFFLINE) fails on various HugeTLB page sizes,
> > > with this patch applied. Although I have not debugged this any further.
> >
> > It is weird. Actually, I didn't change the behaviour of the page migration.
> > This feature is default off. If you want to enable this feature, you can pass
> > "hugetlb_free_vmemmap=on" to the boot cmdline. Do you mean that the
> > success rate of page migration will decrease when you enable this feature?
> > The rate will increase if disbale. Right?
>
> I have done the test and found the issue. Because unmap_and_move_huge_page
> always returns -EBUSY. I will look into this issue in depth. Thanks for your
> report.
>
> The return point is as below:
>
> if (page_private(hpage) && !page_mapping(hpage)) {
>         rc = -EBUSY;
>         goto out_unlock;
> }

I found the issue. It was caused by commit d6995da31122 ("hugetlb:
use page.private for hugetlb specific page flags"). The patch below
fixes it.

diff --git a/mm/migrate.c b/mm/migrate.c
index e7a173da74ec..43419c4bb097 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1290,7 +1290,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
         * page_mapping() set, hugetlbfs specific move page routine will not
         * be called and we could leak usage counts for subpools.
         */
-       if (page_private(hpage) && !page_mapping(hpage)) {
+       if (hugetlb_page_subpool(hpage) && !page_mapping(hpage)) {
                rc = -EBUSY;
                goto out_unlock;
        }
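
For context, after commit d6995da31122 the head page's page.private carries
hugetlb-specific flags rather than the subpool pointer, so a bare
page_private() test can now be a false positive; hugetlb_page_subpool()
reads the actual subpool pointer instead, roughly as sketched below (an
approximation, not verbatim kernel code):

/* Rough sketch of the accessor used in the fix (details may differ). */
static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
{
	/* The subpool pointer now lives in a tail page's private field. */
	return (struct hugepage_subpool *)page_private(hpage + 1);
}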

>
> >
> > Thanks.
> >
> >
> > >
> > > Soft offlining pfn 0x101c00 at process virtual address 0xffff7fa00000
> > > soft offline: 0x101c00: hugepage migration failed 1, type bfffc0000010006
> > >               (referenced|uptodate|head|node=0|zone=2|lastcpupid=0xffff)
> > >
Mike Kravetz May 19, 2021, 10:44 p.m. UTC | #8
On 5/19/21 9:21 AM, Muchun Song wrote:
> On Wed, May 19, 2021 at 11:22 PM Muchun Song <songmuchun@bytedance.com> wrote:
>>
>> On Wed, May 19, 2021 at 10:43 PM Muchun Song <songmuchun@bytedance.com> wrote:
>>>
>>> On Wed, May 19, 2021 at 8:35 PM Anshuman Khandual
>>> <anshuman.khandual@arm.com> wrote:
>>>> How does this interact with HugeTLB migration as such which might iterate
>>>> over individual constituent struct pages (overriding the same struct page
>>>> for all tail pages when this feature is enabled). A simple test involving
>>>> madvise(ptr, size, MADV_SOFT_OFFLINE) fails on various HugeTLB page sizes,
>>>> with this patch applied. Although I have not debugged this any further.
>>>
>>> It is weird. Actually, I didn't change the behaviour of the page migration.
>>> This feature is default off. If you want to enable this feature, you can pass
>>> "hugetlb_free_vmemmap=on" to the boot cmdline. Do you mean that the
>>> success rate of page migration will decrease when you enable this feature?
>>> The rate will increase if disbale. Right?
>>
>> I have done the test and found the issue. Because unmap_and_move_huge_page
>> always returns -EBUSY. I will look into this issue in depth. Thanks for your
>> report.
>>
>> The return point is as below:
>>
>> if (page_private(hpage) && !page_mapping(hpage)) {
>>         rc = -EBUSY;
>>         goto out_unlock;
>> }
> 
> I know the issue. It was caused by commit d6995da31122 ("hugetlb:
> use page.private for hugetlb specific page flags"). The below patch
> can fix this issue.
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e7a173da74ec..43419c4bb097 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1290,7 +1290,7 @@ static int unmap_and_move_huge_page(new_page_t
> get_new_page,
>          * page_mapping() set, hugetlbfs specific move page routine will not
>          * be called and we could leak usage counts for subpools.
>          */
> -       if (page_private(hpage) && !page_mapping(hpage)) {
> +       if (hugetlb_page_subpool(hpage) && !page_mapping(hpage)) {
>                 rc = -EBUSY;
>                 goto out_unlock;
>         }
> 

Thank you Muchun!  That was my bad commit.
Anshuman Khandual May 20, 2021, 11:54 a.m. UTC | #9
On 5/19/21 5:33 PM, David Hildenbrand wrote:
> On 19.05.21 13:45, Anshuman Khandual wrote:
>>
>>
>> On 5/18/21 2:48 PM, Muchun Song wrote:
>>> The preparation of supporting freeing vmemmap associated with each
>>> HugeTLB page is ready, so we can support this feature for arm64.
>>>
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> ---
>>>   arch/arm64/mm/mmu.c | 5 +++++
>>>   fs/Kconfig          | 2 +-
>>>   2 files changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 5d37e461c41f..967b01ce468d 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -23,6 +23,7 @@
>>>   #include <linux/mm.h>
>>>   #include <linux/vmalloc.h>
>>>   #include <linux/set_memory.h>
>>> +#include <linux/hugetlb.h>
>>>     #include <asm/barrier.h>
>>>   #include <asm/cputype.h>
>>> @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>       pmd_t *pmdp;
>>>         WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>> +
>>> +    if (is_hugetlb_free_vmemmap_enabled() && !altmap)
>>> +        return vmemmap_populate_basepages(start, end, node, altmap);
>>
>> Not considering the fact that this will force the kernel to have only
>> base page size mapping for vmemmap (unless altmap is also requested)
>> which might reduce the performance, it also enables vmemmap mapping to
>> be teared down or build up at runtime which could potentially collide
>> with other kernel page table walkers like ptdump or memory hotremove
>> operation ! How those possible collisions are protected right now ?
> 
> Hi Anshuman,
> 
> Memory hotremove is not an issue IIRC. At the time memory is removed, all huge pages either have been migrated away or dissolved; the vmemmap is stable.

But what happens when a hot-removed section's vmemmap area (which is being
torn down) is adjacent to another vmemmap area which is either created or
destroyed for HugeTLB alloc/free purposes? As you mentioned, HugeTLB
pages inside the hot-removed section might be safe. But what about other
HugeTLB areas whose vmemmap shares page table entries with the vmemmap
entries of a section being hot removed? A massive HugeTLB alloc/use/free
test cycle using memory just adjacent to a memory hotplug area, which is
added and removed periodically, should be able to expose this problem.

IIUC, unlike vmalloc(), vmemmap mapping areas in the kernel page table were
always constant unless there are hotplug add or remove operations, which
are protected with a hotplug lock. Now with this change, we could have
simultaneous walking and adding or removing of vmemmap areas without any
synchronization. Is that not problematic?

On arm64, the memory hot-remove operation empties free portions of the vmemmap
table after clearing them. Hence all concurrent walkers (hugetlb_vmemmap,
hot remove, ptdump, etc.) need to be synchronized against hot remove.

From arch/arm64/mm/mmu.c

void vmemmap_free(unsigned long start, unsigned long end,
                struct vmem_altmap *altmap)
{
#ifdef CONFIG_MEMORY_HOTPLUG
        WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));

        unmap_hotplug_range(start, end, true, altmap);
        free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
#endif
}

> 
> vmemmap access (accessing the memmap via a virtual address) itself is not an issue. Manually walking (vmemmap) page tables might behave 

Right.

> differently, not sure if ptdump would require any synchronization.

Dumping a wrong value is probably okay, but crashing because a page table
entry is being freed after ptdump acquired the pointer is bad. On arm64,
ptdump() is protected against hot remove via [get|put]_online_mems().

> 
>>
>> Does not this vmemmap operation increase latency for HugeTLB usage ?
>> Should not this runtime enablement also take into account some other
>> qualifying information apart from potential memory save from struct
>> page areas. Just wondering.
> 
> That's one of the reasons why it explicitly has to be enabled by an admin.
> 

config HUGETLB_PAGE_FREE_VMEMMAP
        def_bool HUGETLB_PAGE
        depends on X86_64 || ARM64
        depends on SPARSEMEM_VMEMMAP

Should this not depend on EXPERT as well? Regardless, there is a
synchronization problem on arm64 if this feature is enabled, as vmemmap
portions can be freed up during a hot-remove operation. But I am wondering
how latency would be impacted if vmemmap_remap_[alloc|free]() added
[get|put]_online_mems().
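
For illustration, the kind of serialization being asked about could look
like the hypothetical sketch below; whether the extra latency of taking
the hotplug lock in the vmemmap remap path is acceptable is exactly the
open question:

#include <linux/memory_hotplug.h>

/*
 * Hypothetical sketch: serialize a vmemmap remap against memory
 * hot-remove using the existing memory-hotplug read lock.
 */
static void vmemmap_remap_with_hotplug_protection(unsigned long start,
						  unsigned long end)
{
	get_online_mems();	/* blocks concurrent memory hot-remove */
	/* ... tear down or rebuild the vmemmap PTEs for [start, end) ... */
	put_online_mems();
}
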
David Hildenbrand May 20, 2021, 11:59 a.m. UTC | #10
On 20.05.21 13:54, Anshuman Khandual wrote:
> 
> On 5/19/21 5:33 PM, David Hildenbrand wrote:
>> On 19.05.21 13:45, Anshuman Khandual wrote:
>>>
>>>
>>> On 5/18/21 2:48 PM, Muchun Song wrote:
>>>> The preparation of supporting freeing vmemmap associated with each
>>>> HugeTLB page is ready, so we can support this feature for arm64.
>>>>
>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>> ---
>>>>    arch/arm64/mm/mmu.c | 5 +++++
>>>>    fs/Kconfig          | 2 +-
>>>>    2 files changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index 5d37e461c41f..967b01ce468d 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -23,6 +23,7 @@
>>>>    #include <linux/mm.h>
>>>>    #include <linux/vmalloc.h>
>>>>    #include <linux/set_memory.h>
>>>> +#include <linux/hugetlb.h>
>>>>      #include <asm/barrier.h>
>>>>    #include <asm/cputype.h>
>>>> @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>        pmd_t *pmdp;
>>>>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>> +
>>>> +    if (is_hugetlb_free_vmemmap_enabled() && !altmap)
>>>> +        return vmemmap_populate_basepages(start, end, node, altmap);
>>>
>>> Not considering the fact that this will force the kernel to have only
>>> base page size mapping for vmemmap (unless altmap is also requested)
>>> which might reduce the performance, it also enables vmemmap mapping to
>>> be teared down or build up at runtime which could potentially collide
>>> with other kernel page table walkers like ptdump or memory hotremove
>>> operation ! How those possible collisions are protected right now ?
>>
>> Hi Anshuman,
>>
>> Memory hotremove is not an issue IIRC. At the time memory is removed, all huge pages either have been migrated away or dissolved; the vmemmap is stable.
> 
> But what happens when a hot remove section's vmemmap area (which is being
> teared down) is nearby another vmemmap area which is either created or
> being destroyed for HugeTLB alloc/free purpose. As you mentioned HugeTLB
> pages inside the hot remove section might be safe. But what about other
> HugeTLB areas whose vmemmap area shares page table entries with vmemmap
> entries for a section being hot removed ? Massive HugeTLB alloc/use/free
> test cycle using memory just adjacent to a memory hotplug area, which is
> always added and removed periodically, should be able to expose this problem.
> 
> IIUC unlike vmalloc(), vmemap mapping areas in the kernel page table were
> always constant unless there are hotplug add or remove operations which
> are protected with a hotplug lock. Now with this change, we could have
> simultaneous walking and add or remove of the vmemap areas without any
> synchronization. Is not this problematic ?
> 
> On arm64 memory hot remove operation empties free portions of the vmemmap
> table after clearing them. Hence all concurrent walkers (hugetlb_vmemmap,
> hot remove, ptdump etc) need to be synchronized against hot remove.
> 
>  From arch/arm64/mm/mmu.c
> 
> void vmemmap_free(unsigned long start, unsigned long end,
>                  struct vmem_altmap *altmap)
> {
> #ifdef CONFIG_MEMORY_HOTPLUG
>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> 
>          unmap_hotplug_range(start, end, true, altmap);
>          free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
> #endif
> }

You are right; however, AFAIR:

1) We always populate base pages, meaning we only modify PTEs and do not 
actually add/remove page tables when creating/destroying a hugetlb page. 
Page table walkers should be fine and not suddenly run into a 
use-after-free.

2) For pfn_to_page() users to never fault, we have to do an atomic 
exchange of PTEs, meaning someone traversing a page table looking for 
pte_none() entries (like free_empty_tables() in your example) should 
never get a false positive.

Makes sense, or am I missing something?

> 
>>
>> vmemmap access (accessing the memmap via a virtual address) itself is not an issue. Manually walking (vmemmap) page tables might behave
> 
> Right.
> 
> differently, not sure if ptdump would require any synchronization.
> 
> Dumping an wrong value is probably okay but crashing because a page table
> entry is being freed after ptdump acquired the pointer is bad. On arm64,
> ptdump() is protected against hotremove via [get|put]_online_mems().

Okay, and as the feature in question only exchanges PTEs, we should be 
fine.
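
For context, the PTE exchange referred to above boils down to a single
set_pte_at() that installs a mapping of the reused head vmemmap page in
place of a tail page's PTE, so a concurrent walker sees either the old or
the new entry but never pte_none(). A rough sketch, with names modeled on
(but not copied verbatim from) the generic helper in the series:

#include <linux/list.h>
#include <linux/mm.h>

/* Assumed walk context (names modeled on the series, not verbatim). */
struct vmemmap_remap_walk {
	struct page		*reuse_page;	/* head vmemmap page kept mapped */
	struct list_head	*vmemmap_pages;	/* collects freed tail vmemmap pages */
};

static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
			      struct vmemmap_remap_walk *walk)
{
	/* Remap to the reused page read-only to catch stray writes to tails. */
	pte_t entry = mk_pte(walk->reuse_page, PAGE_KERNEL_RO);
	struct page *page = pte_page(*pte);

	list_add_tail(&page->lru, walk->vmemmap_pages);
	/* One store replaces the old mapping; no pte_none() window exists. */
	set_pte_at(&init_mm, addr, pte, entry);
}
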
Anshuman Khandual May 20, 2021, noon UTC | #11
On 5/19/21 6:19 PM, Muchun Song wrote:
> On Wed, May 19, 2021 at 7:44 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>> On 5/18/21 2:48 PM, Muchun Song wrote:
>>> The preparation of supporting freeing vmemmap associated with each
>>> HugeTLB page is ready, so we can support this feature for arm64.
>>>
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> ---
>>>  arch/arm64/mm/mmu.c | 5 +++++
>>>  fs/Kconfig          | 2 +-
>>>  2 files changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 5d37e461c41f..967b01ce468d 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -23,6 +23,7 @@
>>>  #include <linux/mm.h>
>>>  #include <linux/vmalloc.h>
>>>  #include <linux/set_memory.h>
>>> +#include <linux/hugetlb.h>
>>>
>>>  #include <asm/barrier.h>
>>>  #include <asm/cputype.h>
>>> @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>       pmd_t *pmdp;
>>>
>>>       WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>> +
>>> +     if (is_hugetlb_free_vmemmap_enabled() && !altmap)
>>> +             return vmemmap_populate_basepages(start, end, node, altmap);
>>
>> Not considering the fact that this will force the kernel to have only
>> base page size mapping for vmemmap (unless altmap is also requested)
>> which might reduce the performance, it also enables vmemmap mapping to
>> be teared down or build up at runtime which could potentially collide
>> with other kernel page table walkers like ptdump or memory hotremove
>> operation ! How those possible collisions are protected right now ?
> 
> For the ptdump, there seems no problem.  The change of pte seems
> not to affect the ptdump unless I miss something.

The worst-case scenario for ptdump would be wrong information being
dumped. But as mentioned earlier, this must be protected against the memory
hot-remove operation, which could free up entries for vmemmap areas in
the kernel page table.
Anshuman Khandual May 21, 2021, 5:02 a.m. UTC | #12
On 5/20/21 5:29 PM, David Hildenbrand wrote:
> On 20.05.21 13:54, Anshuman Khandual wrote:
>>
>> On 5/19/21 5:33 PM, David Hildenbrand wrote:
>>> On 19.05.21 13:45, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 5/18/21 2:48 PM, Muchun Song wrote:
>>>>> The preparation of supporting freeing vmemmap associated with each
>>>>> HugeTLB page is ready, so we can support this feature for arm64.
>>>>>
>>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>>> ---
>>>>>    arch/arm64/mm/mmu.c | 5 +++++
>>>>>    fs/Kconfig          | 2 +-
>>>>>    2 files changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>> index 5d37e461c41f..967b01ce468d 100644
>>>>> --- a/arch/arm64/mm/mmu.c
>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>> @@ -23,6 +23,7 @@
>>>>>    #include <linux/mm.h>
>>>>>    #include <linux/vmalloc.h>
>>>>>    #include <linux/set_memory.h>
>>>>> +#include <linux/hugetlb.h>
>>>>>      #include <asm/barrier.h>
>>>>>    #include <asm/cputype.h>
>>>>> @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>>        pmd_t *pmdp;
>>>>>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>>> +
>>>>> +    if (is_hugetlb_free_vmemmap_enabled() && !altmap)
>>>>> +        return vmemmap_populate_basepages(start, end, node, altmap);
>>>>
>>>> Not considering the fact that this will force the kernel to have only
>>>> base page size mapping for vmemmap (unless altmap is also requested)
>>>> which might reduce the performance, it also enables vmemmap mapping to
>>>> be teared down or build up at runtime which could potentially collide
>>>> with other kernel page table walkers like ptdump or memory hotremove
>>>> operation ! How those possible collisions are protected right now ?
>>>
>>> Hi Anshuman,
>>>
>>> Memory hotremove is not an issue IIRC. At the time memory is removed, all huge pages either have been migrated away or dissolved; the vmemmap is stable.
>>
>> But what happens when a hot remove section's vmemmap area (which is being
>> teared down) is nearby another vmemmap area which is either created or
>> being destroyed for HugeTLB alloc/free purpose. As you mentioned HugeTLB
>> pages inside the hot remove section might be safe. But what about other
>> HugeTLB areas whose vmemmap area shares page table entries with vmemmap
>> entries for a section being hot removed ? Massive HugeTLB alloc/use/free
>> test cycle using memory just adjacent to a memory hotplug area, which is
>> always added and removed periodically, should be able to expose this problem.
>>
>> IIUC unlike vmalloc(), vmemap mapping areas in the kernel page table were
>> always constant unless there are hotplug add or remove operations which
>> are protected with a hotplug lock. Now with this change, we could have
>> simultaneous walking and add or remove of the vmemap areas without any
>> synchronization. Is not this problematic ?
>>
>> On arm64 memory hot remove operation empties free portions of the vmemmap
>> table after clearing them. Hence all concurrent walkers (hugetlb_vmemmap,
>> hot remove, ptdump etc) need to be synchronized against hot remove.
>>
>>  From arch/arm64/mm/mmu.c
>>
>> void vmemmap_free(unsigned long start, unsigned long end,
>>                  struct vmem_altmap *altmap)
>> {
>> #ifdef CONFIG_MEMORY_HOTPLUG
>>          WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>
>>          unmap_hotplug_range(start, end, true, altmap);
>>          free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
>> #endif
>> }
> 
> You are right, however, AFAIR
> 
> 1) We always populate base pages, meaning we only modify PTEs and not actually add/remove page tables when creating/destroying a hugetlb page. Page table walkers should be fine and not suddenly run into a use-after-free.
> 
> 2) For pfn_to_page() users to never fault, we have to do an atomic exchange of PTES, meaning, someone traversing a page table looking for pte_none() entries (like free_empty_tables() in your example) should never get a false positive.
> 
> Makes sense, or am I missing something?
> 
>>
>>>
>>> vmemmap access (accessing the memmap via a virtual address) itself is not an issue. Manually walking (vmemmap) page tables might behave
>>
>> Right.
>>
>> differently, not sure if ptdump would require any synchronization.
>>
>> Dumping an wrong value is probably okay but crashing because a page table
>> entry is being freed after ptdump acquired the pointer is bad. On arm64,
>> ptdump() is protected against hotremove via [get|put]_online_mems().
> 
> Okay, and as the feature in question only exchanges PTEs, we should be fine.

Adding Mark, Catalin and James here in case I might have missed
something regarding possible vmemmap collision.

Patch

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 5d37e461c41f..967b01ce468d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -23,6 +23,7 @@ 
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
+#include <linux/hugetlb.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -1134,6 +1135,10 @@  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	pmd_t *pmdp;
 
 	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+
+	if (is_hugetlb_free_vmemmap_enabled() && !altmap)
+		return vmemmap_populate_basepages(start, end, node, altmap);
+
 	do {
 		next = pmd_addr_end(addr, end);
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 6ce6fdac00a3..02c2d3bf1cb8 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -242,7 +242,7 @@  config HUGETLB_PAGE
 
 config HUGETLB_PAGE_FREE_VMEMMAP
 	def_bool HUGETLB_PAGE
-	depends on X86_64
+	depends on X86_64 || ARM64
 	depends on SPARSEMEM_VMEMMAP
 
 config MEMFD_CREATE