hugetlbfs: add MTE support

Message ID 20240625233717.2769975-1-yang@os.amperecomputing.com (mailing list archive)
State New, archived
Series hugetlbfs: add MTE support

Commit Message

Yang Shi June 25, 2024, 11:37 p.m. UTC
MTE can be supported on RAM-based filesystems and is already supported on tmpfs.
There are use cases for MTE on hugetlbfs as well, so add MTE support.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 fs/hugetlbfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Andrew Morton June 26, 2024, 8:40 p.m. UTC | #1
On Tue, 25 Jun 2024 16:37:17 -0700 Yang Shi <yang@os.amperecomputing.com> wrote:

> MTE can be supported on ram based filesystem. It is supported on tmpfs.
> There is use case to use MTE on hugetlbfs as well, adding MTE support.
> 
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	 * way when do_mmap unwinds (may be important on powerpc
>  	 * and ia64).
>  	 */
> -	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
> +	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
>  	vma->vm_ops = &hugetlb_vm_ops;
>  
>  	ret = seal_check_write(info->seals, vma);

How thoroughly has this been tested?

Can we expect normal linux-next testing to exercise this, or must
testers make special arrangements to get the coverage?
Yang Shi June 26, 2024, 8:45 p.m. UTC | #2
On Wed, Jun 26, 2024 at 1:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 25 Jun 2024 16:37:17 -0700 Yang Shi <yang@os.amperecomputing.com> wrote:
>
> > MTE can be supported on ram based filesystem. It is supported on tmpfs.
> > There is use case to use MTE on hugetlbfs as well, adding MTE support.
> >
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> >        * way when do_mmap unwinds (may be important on powerpc
> >        * and ia64).
> >        */
> > -     vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
> > +     vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
> >       vma->vm_ops = &hugetlb_vm_ops;
> >
> >       ret = seal_check_write(info->seals, vma);
>
> How thoroughly has this been tested?
>
> Can we expect normal linux-next testing to exercise this, or must
> testers make special arangements to get the coverage?

It requires special arrangements. This needs hardware support and a
custom-patched QEMU. We did in-house testing on the AmpereOne platform
with a patched QEMU 8.1.

>
Yang Shi June 26, 2024, 11:43 p.m. UTC | #3
On Wed, Jun 26, 2024 at 1:45 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Jun 26, 2024 at 1:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 25 Jun 2024 16:37:17 -0700 Yang Shi <yang@os.amperecomputing.com> wrote:
> >
> > > MTE can be supported on ram based filesystem. It is supported on tmpfs.
> > > There is use case to use MTE on hugetlbfs as well, adding MTE support.
> > >
> > > --- a/fs/hugetlbfs/inode.c
> > > +++ b/fs/hugetlbfs/inode.c
> > > @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> > >        * way when do_mmap unwinds (may be important on powerpc
> > >        * and ia64).
> > >        */
> > > -     vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
> > > +     vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
> > >       vma->vm_ops = &hugetlb_vm_ops;
> > >
> > >       ret = seal_check_write(info->seals, vma);
> >
> > How thoroughly has this been tested?
> >
> > Can we expect normal linux-next testing to exercise this, or must
> > testers make special arangements to get the coverage?
>
> It requires special arrangements. This needs hardware support and
> custom-patched QEMU. We did in-house test on AmpereOne platform with
> patched QEMU 8.1.

To correct myself: a custom-patched QEMU is not required for a minimal
test, but a special test program is definitely needed. We used the
custom-patched QEMU to test a VM backed by hugetlbfs with MTE enabled.

>
> >
Catalin Marinas July 2, 2024, 12:34 p.m. UTC | #4
On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote:
> MTE can be supported on ram based filesystem. It is supported on tmpfs.
> There is use case to use MTE on hugetlbfs as well, adding MTE support.
> 
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
>  fs/hugetlbfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index ecad73a4f713..c34faef62daf 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	 * way when do_mmap unwinds (may be important on powerpc
>  	 * and ia64).
>  	 */
> -	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
> +	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
>  	vma->vm_ops = &hugetlb_vm_ops;

Last time I checked, about a year ago, this was not sufficient. One
issue is that there's no arch_clear_hugetlb_flags() implemented by your
patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
initially tried to do this only on the head page but this did not go
well with the folio_copy() -> copy_highpage() which expects the
PG_arch_* flags on each individual page. The alternative was for
arch_clear_hugetlb_flags() to iterate over all the pages in a folio.

I'd also like to see some tests added to
tools/testing/selftests/arm64/mte to exercise MAP_HUGETLB with PROT_MTE:
write/read tags, a series of mman+munmap (mostly to check if old page
flags are still around), force some copy on write. I don't think we
should merge the patch without proper tests.
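
For illustration only, here is a rough, untested userspace sketch of such a
test (not one of the existing selftests; it assumes a 2MB default hugetlb
page size with pages reserved in advance, uses the PROT_MTE and prctl()
definitions from the mainline uapi headers, and omits the kselftest harness
and the actual tag read/write checks):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PROT_MTE
#define PROT_MTE	0x20	/* arm64-specific mmap() protection flag */
#endif

int main(void)
{
	size_t len = 2UL << 20;	/* one 2MB hugetlb page */
	char *p;

	/* Enable tagged addressing with synchronous tag check faults */
	if (prctl(PR_SET_TAGGED_ADDR_CTRL,
		  PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC, 0, 0, 0)) {
		perror("prctl");
		return 1;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_MTE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Touch the whole mapping; a real test would also set and read back
	 * allocation tags (e.g. via STG/LDG or the __arm_mte_* intrinsics),
	 * repeat mmap+munmap to catch stale page flags, and force
	 * copy-on-write after fork().
	 */
	memset(p, 0xab, len);

	munmap(p, len);
	return 0;
}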

An untested hunk on top of your changes:

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 3954cbd2ff56..5357b00b9087 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -20,7 +20,19 @@ extern bool arch_hugetlb_migration_supported(struct hstate *h);
 
 static inline void arch_clear_hugetlb_flags(struct folio *folio)
 {
-	clear_bit(PG_dcache_clean, &folio->flags);
+	unsigned long i, nr_pages = folio_nr_pages(folio);
+	const unsigned long clear_flags = BIT(PG_dcache_clean) |
+		BIT(PG_arch_2) | BIT(PG_arch_3);
+
+	if (!system_supports_mte()) {
+		clear_bit(PG_dcache_clean, &folio->flags);
+		return;
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page = folio_page(folio, i);
+		page->flags &= ~clear_flags;
+	}
 }
 #define arch_clear_hugetlb_flags arch_clear_hugetlb_flags
 
diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index 5966ee4a6154..304dfc499e68 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -28,7 +28,8 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
 	 * backed by tags-capable memory. The vm_flags may be overridden by a
 	 * filesystem supporting MTE (RAM-based).
 	 */
-	if (system_supports_mte() && (flags & MAP_ANONYMOUS))
+	if (system_supports_mte() &&
+	    (flags & (MAP_ANONYMOUS | MAP_HUGETLB)))
 		return VM_MTE_ALLOWED;
 
 	return 0;
David Hildenbrand July 2, 2024, 1:09 p.m. UTC | #5
On 02.07.24 14:34, Catalin Marinas wrote:
> On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote:
>> MTE can be supported on ram based filesystem. It is supported on tmpfs.
>> There is use case to use MTE on hugetlbfs as well, adding MTE support.
>>
>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>> ---
>>   fs/hugetlbfs/inode.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index ecad73a4f713..c34faef62daf 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>>   	 * way when do_mmap unwinds (may be important on powerpc
>>   	 * and ia64).
>>   	 */
>> -	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
>> +	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
>>   	vma->vm_ops = &hugetlb_vm_ops;
> 
> Last time I checked, about a year ago, this was not sufficient. One
> issue is that there's no arch_clear_hugetlb_flags() implemented by your
> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
> initially tried to do this only on the head page but this did not go
> well with the folio_copy() -> copy_highpage() which expects the
> PG_arch_* flags on each individual page. The alternative was for
> arch_clear_hugetlb_flags() to iterate over all the pages in a folio.

This would likely also add a blocker for ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
on arm64 (no idea if there are ways to move forward with that now, or if we
are still not sure whether we can actually add support), correct?
Yang Shi July 3, 2024, 12:04 a.m. UTC | #6
On 7/2/24 5:34 AM, Catalin Marinas wrote:
> On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote:
>> MTE can be supported on ram based filesystem. It is supported on tmpfs.
>> There is use case to use MTE on hugetlbfs as well, adding MTE support.
>>
>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>> ---
>>   fs/hugetlbfs/inode.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index ecad73a4f713..c34faef62daf 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>>   	 * way when do_mmap unwinds (may be important on powerpc
>>   	 * and ia64).
>>   	 */
>> -	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
>> +	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
>>   	vma->vm_ops = &hugetlb_vm_ops;
> Last time I checked, about a year ago, this was not sufficient. One
> issue is that there's no arch_clear_hugetlb_flags() implemented by your
> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
> initially tried to do this only on the head page but this did not go
> well with the folio_copy() -> copy_highpage() which expects the
> PG_arch_* flags on each individual page. The alternative was for
> arch_clear_hugetlb_flags() to iterate over all the pages in a folio.

Thanks for pointing this out. I did miss this point. I took a quick look
at where the PG_* flags are set. IIUC, they are set by post_alloc_hook()
for order-0 anonymous folios (clearing the page and tags) and by
set_ptes() for others (just clearing tags), for example THP and hugetlb.

I can see THP does set the PG_mte_tagged flag for each subpage, but it
seems this is not done for hugetlb, if I read the code correctly. The
call path is:

hugetlb_fault() ->
  hugetlb_no_page() ->
    set_huge_pte_at() ->
      __set_ptes() ->
        __sync_cache_and_tags()


The __set_ptes() is called in a loop:

if (!pte_present(pte)) {
	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
		__set_ptes(mm, addr, ptep, pte, 1);
	return;
}

The ncontig and pgsize are returned by num_contig_ptes(). For example,
for a 2M hugetlb page, ncontig is 1 and pgsize is 2M, IIUC. So actually
just the head page has PG_mte_tagged set. If so, copy_highpage() will
just copy the old head page's flag, and its tags, to the new head page.
None of the subpages have PG_mte_tagged set.


Is this the expected behavior? I suppose we need tags for every subpage
too, right?

>
> I'd also like to see some tests added to
> tools/testing/selftest/arm64/mte to exercise MAP_HUGETLB with PROT_MTE:
> write/read tags, a series of mman+munmap (mostly to check if old page
> flags are still around), force some copy on write. I don't think we
> should merge the patch without proper tests.
>
> An untested hunk on top of your changes:
>
> diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
> index 3954cbd2ff56..5357b00b9087 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -20,7 +20,19 @@ extern bool arch_hugetlb_migration_supported(struct hstate *h);
>   
>   static inline void arch_clear_hugetlb_flags(struct folio *folio)
>   {
> -	clear_bit(PG_dcache_clean, &folio->flags);
> +	unsigned long i, nr_pages = folio_nr_pages(folio);
> +	const unsigned long clear_flags = BIT(PG_dcache_clean) |
> +		BIT(PG_arch_2) | BIT(PG_arch_3);
> +
> +	if (!system_supports_mte()) {
> +		clear_bit(PG_dcache_clean, &folio->flags);
> +		return;
> +	}
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		struct page *page = folio_page(folio, i);
> +		page->flags &= ~clear_flags;
> +	}
>   }
>   #define arch_clear_hugetlb_flags arch_clear_hugetlb_flags
>   
> diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> index 5966ee4a6154..304dfc499e68 100644
> --- a/arch/arm64/include/asm/mman.h
> +++ b/arch/arm64/include/asm/mman.h
> @@ -28,7 +28,8 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
>   	 * backed by tags-capable memory. The vm_flags may be overridden by a
>   	 * filesystem supporting MTE (RAM-based).
>   	 */
> -	if (system_supports_mte() && (flags & MAP_ANONYMOUS))
> +	if (system_supports_mte() &&
> +	    (flags & (MAP_ANONYMOUS | MAP_HUGETLB)))
>   		return VM_MTE_ALLOWED;

Do we really need this change? IIRC, mmap_region() will call hugetlbfs's
mmap, which sets VM_MTE_ALLOWED in vma->vm_flags, and then update
vma->vm_page_prot with the new vma->vm_flags.

If this change were needed, MTE for tmpfs wouldn't work either, right?

>   
>   	return 0;
>
Yang Shi July 3, 2024, 12:15 a.m. UTC | #7
On 7/2/24 5:04 PM, Yang Shi wrote:
>
>
> On 7/2/24 5:34 AM, Catalin Marinas wrote:
>> On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote:
>>> MTE can be supported on ram based filesystem. It is supported on tmpfs.
>>> There is use case to use MTE on hugetlbfs as well, adding MTE support.
>>>
>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>>> ---
>>>   fs/hugetlbfs/inode.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index ecad73a4f713..c34faef62daf 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file 
>>> *file, struct vm_area_struct *vma)
>>>        * way when do_mmap unwinds (may be important on powerpc
>>>        * and ia64).
>>>        */
>>> -    vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
>>> +    vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
>>>       vma->vm_ops = &hugetlb_vm_ops;
>> Last time I checked, about a year ago, this was not sufficient. One
>> issue is that there's no arch_clear_hugetlb_flags() implemented by your
>> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
>> initially tried to do this only on the head page but this did not go
>> well with the folio_copy() -> copy_highpage() which expects the
>> PG_arch_* flags on each individual page. The alternative was for
>> arch_clear_hugetlb_flags() to iterate over all the pages in a folio.
>
> Thanks for pointing this out. I did miss this point. I took a quick 
> look at when the PG_ flags are set. IIUC, it is set by 
> post_alloc_hook() for order-0 anonymous folio (clearing page and tags) 
> and set_ptes() for others (just clear tags), for example, THP and 
> hugetlb.
>
> I can see THP does set the PG_mte_tagged flag for each sub pages. But 
> it seems it does not do it for hugetlb if I read the code correctly. 
> The call path is:
>
> hugetlb_fault() ->
>   hugetlb_no_page->
>     set_huge_pte_at ->
>       __set_ptes() ->
>         __sync_cache_and_tags() ->
>
>
> The __set_ptes() is called in a loop:
>
> if (!pte_present(pte)) {
>         for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>             __set_ptes(mm, addr, ptep, pte, 1);
>         return;
>     }
>
> The ncontig and pgsize are returned by num_contig_ptes(). For example, 
> 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually 
> just the head page has PG_mte_tagged set. If so the copy_highpage() 
> will just copy the old head page's flag to the new head page, and the 
> tag. All the sub pages don't have PG_mte_tagged set.
>
>
> Is it expected behavior? I'm supposed we need tags for every sub pages 
> too, right?

We would need something like the below to have the tags and page flag set
up for each subpage:

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 3f09ac73cce3..528164deef27 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -228,9 +228,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
         int ncontig;
         unsigned long pfn, dpfn;
         pgprot_t hugeprot;
+       unsigned long nr = sz >> PAGE_SHIFT;

         ncontig = num_contig_ptes(sz, &pgsize);

+       __sync_cache_and_tags(pte, nr);
+
         if (!pte_present(pte)) {
                 for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
                         __set_ptes(mm, addr, ptep, pte, 1);

>
>>
>> I'd also like to see some tests added to
>> tools/testing/selftest/arm64/mte to exercise MAP_HUGETLB with PROT_MTE:
>> write/read tags, a series of mman+munmap (mostly to check if old page
>> flags are still around), force some copy on write. I don't think we
>> should merge the patch without proper tests.
>>
>> An untested hunk on top of your changes:
>>
>> diff --git a/arch/arm64/include/asm/hugetlb.h 
>> b/arch/arm64/include/asm/hugetlb.h
>> index 3954cbd2ff56..5357b00b9087 100644
>> --- a/arch/arm64/include/asm/hugetlb.h
>> +++ b/arch/arm64/include/asm/hugetlb.h
>> @@ -20,7 +20,19 @@ extern bool 
>> arch_hugetlb_migration_supported(struct hstate *h);
>>     static inline void arch_clear_hugetlb_flags(struct folio *folio)
>>   {
>> -    clear_bit(PG_dcache_clean, &folio->flags);
>> +    unsigned long i, nr_pages = folio_nr_pages(folio);
>> +    const unsigned long clear_flags = BIT(PG_dcache_clean) |
>> +        BIT(PG_arch_2) | BIT(PG_arch_3);
>> +
>> +    if (!system_supports_mte()) {
>> +        clear_bit(PG_dcache_clean, &folio->flags);
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < nr_pages; i++) {
>> +        struct page *page = folio_page(folio, i);
>> +        page->flags &= ~clear_flags;
>> +    }
>>   }
>>   #define arch_clear_hugetlb_flags arch_clear_hugetlb_flags
>>   diff --git a/arch/arm64/include/asm/mman.h 
>> b/arch/arm64/include/asm/mman.h
>> index 5966ee4a6154..304dfc499e68 100644
>> --- a/arch/arm64/include/asm/mman.h
>> +++ b/arch/arm64/include/asm/mman.h
>> @@ -28,7 +28,8 @@ static inline unsigned long 
>> arch_calc_vm_flag_bits(unsigned long flags)
>>        * backed by tags-capable memory. The vm_flags may be 
>> overridden by a
>>        * filesystem supporting MTE (RAM-based).
>>        */
>> -    if (system_supports_mte() && (flags & MAP_ANONYMOUS))
>> +    if (system_supports_mte() &&
>> +        (flags & (MAP_ANONYMOUS | MAP_HUGETLB)))
>>           return VM_MTE_ALLOWED;
>
> Do we really need this change? IIRC, the mmap_region() will call 
> hugetlbfs's mmap and set VM_MTE_ALLOWED in vma->vm_flags, then update 
> vma->vm_page_prot with the new vma->vm_flags.
>
> If this is needed, MTE for tmpfs won't work, right?
>
>>         return 0;
>>
>
Yang Shi July 3, 2024, 12:20 a.m. UTC | #8
On 7/2/24 6:09 AM, David Hildenbrand wrote:
> On 02.07.24 14:34, Catalin Marinas wrote:
>> On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote:
>>> MTE can be supported on ram based filesystem. It is supported on tmpfs.
>>> There is use case to use MTE on hugetlbfs as well, adding MTE support.
>>>
>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>>> ---
>>>   fs/hugetlbfs/inode.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index ecad73a4f713..c34faef62daf 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file 
>>> *file, struct vm_area_struct *vma)
>>>        * way when do_mmap unwinds (may be important on powerpc
>>>        * and ia64).
>>>        */
>>> -    vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
>>> +    vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
>>>       vma->vm_ops = &hugetlb_vm_ops;
>>
>> Last time I checked, about a year ago, this was not sufficient. One
>> issue is that there's no arch_clear_hugetlb_flags() implemented by your
>> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
>> initially tried to do this only on the head page but this did not go
>> well with the folio_copy() -> copy_highpage() which expects the
>> PG_arch_* flags on each individual page. The alternative was for
>> arch_clear_hugetlb_flags() to iterate over all the pages in a folio.
>
> This would likely also add a blocker for 
> ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP on arm64 (no idea if there are now 
> ways to move forward with that now, or if we are still not sure if we 
> can actually add support), correct?

IIUC, it is not. We just need to guarantee each subpage has the
PG_mte_tagged flag and allocated tags. HVO just maps the 7 vmemmap pages
for the subpages to the first page, so they still see the flag, and the
space for the tags is not impacted, right? Did I miss something?

>
>
David Hildenbrand July 3, 2024, 10:24 a.m. UTC | #9
On 03.07.24 02:20, Yang Shi wrote:
> 
> 
> On 7/2/24 6:09 AM, David Hildenbrand wrote:
>> On 02.07.24 14:34, Catalin Marinas wrote:
>>> On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote:
>>>> MTE can be supported on ram based filesystem. It is supported on tmpfs.
>>>> There is use case to use MTE on hugetlbfs as well, adding MTE support.
>>>>
>>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>>>> ---
>>>>    fs/hugetlbfs/inode.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>> index ecad73a4f713..c34faef62daf 100644
>>>> --- a/fs/hugetlbfs/inode.c
>>>> +++ b/fs/hugetlbfs/inode.c
>>>> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file
>>>> *file, struct vm_area_struct *vma)
>>>>         * way when do_mmap unwinds (may be important on powerpc
>>>>         * and ia64).
>>>>         */
>>>> -    vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
>>>> +    vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
>>>>        vma->vm_ops = &hugetlb_vm_ops;
>>>
>>> Last time I checked, about a year ago, this was not sufficient. One
>>> issue is that there's no arch_clear_hugetlb_flags() implemented by your
>>> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
>>> initially tried to do this only on the head page but this did not go
>>> well with the folio_copy() -> copy_highpage() which expects the
>>> PG_arch_* flags on each individual page. The alternative was for
>>> arch_clear_hugetlb_flags() to iterate over all the pages in a folio.
>>
>> This would likely also add a blocker for
>> ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP on arm64 (no idea if there are now
>> ways to move forward with that now, or if we are still not sure if we
>> can actually add support), correct?
> 
> IIUC, it is not. We just need to guarantee each subpage has
> PG_mte_tagged flag and allocated tags. The HVO just maps the 7 vmemmap
> pages for sub pages to the first page, they still see the flag and the
> space for tag is not impacted, right? Did I miss something?

In the R/O vmemmap optimization we won't be able to modify the flags of 
the double-mapped vmemmap pages via the double mappings.

Of course, we could find HVO-specific ways to only modify the flags of 
the first vmemmap page, but it does sound wrong ...

Really, the question is whether we can have a per-folio flag for hugetlb
instead and avoid all that?
Catalin Marinas July 3, 2024, 1:57 p.m. UTC | #10
On Wed, Jul 03, 2024 at 12:24:40PM +0200, David Hildenbrand wrote:
> On 03.07.24 02:20, Yang Shi wrote:
> > On 7/2/24 6:09 AM, David Hildenbrand wrote:
> > > On 02.07.24 14:34, Catalin Marinas wrote:
> > > > On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote:
> > > > > MTE can be supported on ram based filesystem. It is supported on tmpfs.
> > > > > There is use case to use MTE on hugetlbfs as well, adding MTE support.
> > > > > 
> > > > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> > > > > ---
> > > > >    fs/hugetlbfs/inode.c | 2 +-
> > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > > > > index ecad73a4f713..c34faef62daf 100644
> > > > > --- a/fs/hugetlbfs/inode.c
> > > > > +++ b/fs/hugetlbfs/inode.c
> > > > > @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file
> > > > > *file, struct vm_area_struct *vma)
> > > > >         * way when do_mmap unwinds (may be important on powerpc
> > > > >         * and ia64).
> > > > >         */
> > > > > -    vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
> > > > > +    vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
> > > > >        vma->vm_ops = &hugetlb_vm_ops;
> > > > 
> > > > Last time I checked, about a year ago, this was not sufficient. One
> > > > issue is that there's no arch_clear_hugetlb_flags() implemented by your
> > > > patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
> > > > initially tried to do this only on the head page but this did not go
> > > > well with the folio_copy() -> copy_highpage() which expects the
> > > > PG_arch_* flags on each individual page. The alternative was for
> > > > arch_clear_hugetlb_flags() to iterate over all the pages in a folio.
> > > 
> > > This would likely also add a blocker for
> > > ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP on arm64 (no idea if there are now
> > > ways to move forward with that now, or if we are still not sure if we
> > > can actually add support), correct?
> > 
> > IIUC, it is not. We just need to guarantee each subpage has
> > PG_mte_tagged flag and allocated tags. The HVO just maps the 7 vmemmap
> > pages for sub pages to the first page, they still see the flag and the
> > space for tag is not impacted, right? Did I miss something?
> 
> In the R/O vmemmap optimization we won't be able to modify the flags of the
> double-mapped vmemmap pages via the double mappings.
> 
> Of course, we could find HVO-specific ways to only modify the flags of the
> first vmemmap page, but it does sound wrong ...
> 
> Really, the question is if we can have a per-folio flag for hugetlb instead
> and avoid all that?

I think it is possible and I have some half-baked changes but got
distracted and never completed. The only issue I came across was
folio_copy() calling copy_highpage() on individual pages that did not
have the original PG_mte_tagged (PG_arch_2) flag. To avoid some races,
we also use PG_mte_lock (PG_arch_3) as some form of locking but for
optimisation we don't clear this flag after copying the tags and setting
PG_mte_tagged. So doing the checks on the head page only confuses the
tail page copying.

Even if we use PG_arch_3 as a proper lock bit and clear it after
tag copying, I'll need to check whether this can race with any
mprotect(PROT_MTE) that could cause losing tags or leaking tags (not
initialising the pages). set_pte_at() relies on the PG_mte_tagged flag
to decide whether to initialise the tags. The arm64 hugetlbfs supports
contiguous ptes, so we'd get multiple set_pte_at() calls.

Anyway, I think with some care it is doable, I just did not have the
time, nor did I see anyone asking for such feature until now.
Catalin Marinas July 4, 2024, 1:44 p.m. UTC | #11
On Tue, Jul 02, 2024 at 05:15:12PM -0700, Yang Shi wrote:
> On 7/2/24 5:04 PM, Yang Shi wrote:
> > On 7/2/24 5:34 AM, Catalin Marinas wrote:
> > > Last time I checked, about a year ago, this was not sufficient. One
> > > issue is that there's no arch_clear_hugetlb_flags() implemented by your
> > > patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
> > > initially tried to do this only on the head page but this did not go
> > > well with the folio_copy() -> copy_highpage() which expects the
> > > PG_arch_* flags on each individual page. The alternative was for
> > > arch_clear_hugetlb_flags() to iterate over all the pages in a folio.
> > 
> > Thanks for pointing this out. I did miss this point. I took a quick look
> > at when the PG_ flags are set. IIUC, it is set by post_alloc_hook() for
> > order-0 anonymous folio (clearing page and tags) and set_ptes() for
> > others (just clear tags), for example, THP and hugetlb.
> > 
> > I can see THP does set the PG_mte_tagged flag for each sub pages. But it
> > seems it does not do it for hugetlb if I read the code correctly. The
> > call path is:
> > 
> > hugetlb_fault() ->
> >   hugetlb_no_page->
> >     set_huge_pte_at ->
> >       __set_ptes() ->
> >         __sync_cache_and_tags() ->
> > 
> > 
> > The __set_ptes() is called in a loop:
> > 
> > if (!pte_present(pte)) {
> >         for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
> >             __set_ptes(mm, addr, ptep, pte, 1);
> >         return;
> >     }
> > 
> > The ncontig and pgsize are returned by num_contig_ptes(). For example,
> > 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually
> > just the head page has PG_mte_tagged set. If so the copy_highpage() will
> > just copy the old head page's flag to the new head page, and the tag.
> > All the sub pages don't have PG_mte_tagged set.
> > 
> > Is it expected behavior? I'm supposed we need tags for every sub pages
> > too, right?
> 
> We should need something like the below to have tags and page flag set up
> for each sub page:
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 3f09ac73cce3..528164deef27 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -228,9 +228,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned
> long addr,
>         int ncontig;
>         unsigned long pfn, dpfn;
>         pgprot_t hugeprot;
> +       unsigned long nr = sz >> PAGE_SHIFT;
> 
>         ncontig = num_contig_ptes(sz, &pgsize);
> 
> +       __sync_cache_and_tags(pte, nr);
> +
>         if (!pte_present(pte)) {
>                 for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>                         __set_ptes(mm, addr, ptep, pte, 1);

We only need this for the last loop when we have a present pte. I'd also
avoid calling __set_ptes(...., 1) if we call the __sync_cache_and_tags()
here already. Basically just unroll __set_ptes() in set_huge_pte_at()
and call __set_pte() directly.
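
For illustration, a rough, untested sketch of set_huge_pte_at() along those
lines (an assumption of the shape only, not a complete patch; the
clear_flush()/break-before-make handling for present contiguous entries is
omitted here):

void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
		     pte_t *ptep, pte_t pte, unsigned long sz)
{
	size_t pgsize;
	int i, ncontig;
	unsigned long pfn, dpfn;
	pgprot_t hugeprot;

	ncontig = num_contig_ptes(sz, &pgsize);

	if (!pte_present(pte)) {
		for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
			__set_ptes(mm, addr, ptep, pte, 1);
		return;
	}

	/* Sync the cache and MTE tags once for the whole huge page */
	__sync_cache_and_tags(pte, sz >> PAGE_SHIFT);

	if (!pte_cont(pte)) {
		__set_pte(ptep, pte);
		return;
	}

	pfn = pte_pfn(pte);
	dpfn = pgsize >> PAGE_SHIFT;
	hugeprot = pte_pgprot(pte);

	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
		__set_pte(ptep, pfn_pte(pfn, hugeprot));
}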

It might be better to convert those page flag checks to only happen on
the head page. My stashed changes from over a year ago (before we had
more folio conversions) below. However, as I mentioned, I got stuck on
folio_copy() which also does a cond_resched() between copy_highpage().

---------8<--------------------------------
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index f5bcb0dc6267..a84ad0e46f12 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -49,7 +49,9 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
 		return;
 
 	if (try_page_mte_tagging(page)) {
-		mte_clear_page_tags(page_address(page));
+		unsigned long i, nr_pages = compound_nr(page);
+		for (i = 0; i < nr_pages; i++)
+			mte_clear_page_tags(page_address(page + i));
 		set_page_mte_tagged(page);
 	}
 }
@@ -57,22 +59,23 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
 void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
 	struct page *page = pte_page(pte);
-	long i, nr_pages = compound_nr(page);
-	bool check_swap = nr_pages == 1;
+	bool check_swap = !PageCompound(page);
 	bool pte_is_tagged = pte_tagged(pte);
 
 	/* Early out if there's nothing to do */
 	if (!check_swap && !pte_is_tagged)
 		return;
 
+	/*
+	 * HugeTLB pages are always fully mapped, so only setting head page's
+	 * PG_mte_* flags is enough.
+	 */
+	if (PageHuge(page))
+		page = compound_head(page);
+
 	/* if PG_mte_tagged is set, tags have already been initialised */
-	for (i = 0; i < nr_pages; i++, page++) {
-		if (!page_mte_tagged(page)) {
-			mte_sync_page_tags(page, old_pte, check_swap,
-					   pte_is_tagged);
-			set_page_mte_tagged(page);
-		}
-	}
+	if (!page_mte_tagged(page))
+		mte_sync_page_tags(page, old_pte, check_swap, pte_is_tagged);
 
 	/* ensure the tags are visible before the PTE is set */
 	smp_wmb();
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5626ddb540ce..692198d8c0d2 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1079,7 +1079,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
 
 			/* uaccess failed, don't leave stale tags */
 			if (num_tags != MTE_GRANULES_PER_PAGE)
-				mte_clear_page_tags(page);
+				mte_clear_page_tags(page_address(page));
 			set_page_mte_tagged(page);
 
 			kvm_release_pfn_dirty(pfn);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 31d7fa4c7c14..b531b1d75cda 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1173,11 +1173,10 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
 	if (!kvm_has_mte(kvm))
 		return;
 
-	for (i = 0; i < nr_pages; i++, page++) {
-		if (try_page_mte_tagging(page)) {
-			mte_clear_page_tags(page_address(page));
-			set_page_mte_tagged(page);
-		}
+	if (try_page_mte_tagging(page)) {
+		for (i = 0; i < nr_pages; i++)
+			mte_clear_page_tags(page_address(page + i));
+		set_page_mte_tagged(page);
 	}
 }
Yang Shi July 9, 2024, 5:42 p.m. UTC | #12
On 7/4/24 6:44 AM, Catalin Marinas wrote:
> On Tue, Jul 02, 2024 at 05:15:12PM -0700, Yang Shi wrote:
>> On 7/2/24 5:04 PM, Yang Shi wrote:
>>> On 7/2/24 5:34 AM, Catalin Marinas wrote:
>>>> Last time I checked, about a year ago, this was not sufficient. One
>>>> issue is that there's no arch_clear_hugetlb_flags() implemented by your
>>>> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
>>>> initially tried to do this only on the head page but this did not go
>>>> well with the folio_copy() -> copy_highpage() which expects the
>>>> PG_arch_* flags on each individual page. The alternative was for
>>>> arch_clear_hugetlb_flags() to iterate over all the pages in a folio.
>>> Thanks for pointing this out. I did miss this point. I took a quick look
>>> at when the PG_ flags are set. IIUC, it is set by post_alloc_hook() for
>>> order-0 anonymous folio (clearing page and tags) and set_ptes() for
>>> others (just clear tags), for example, THP and hugetlb.
>>>
>>> I can see THP does set the PG_mte_tagged flag for each sub pages. But it
>>> seems it does not do it for hugetlb if I read the code correctly. The
>>> call path is:
>>>
>>> hugetlb_fault() ->
>>>    hugetlb_no_page->
>>>      set_huge_pte_at ->
>>>        __set_ptes() ->
>>>          __sync_cache_and_tags() ->
>>>
>>>
>>> The __set_ptes() is called in a loop:
>>>
>>> if (!pte_present(pte)) {
>>>          for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>>>              __set_ptes(mm, addr, ptep, pte, 1);
>>>          return;
>>>      }
>>>
>>> The ncontig and pgsize are returned by num_contig_ptes(). For example,
>>> 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually
>>> just the head page has PG_mte_tagged set. If so the copy_highpage() will
>>> just copy the old head page's flag to the new head page, and the tag.
>>> All the sub pages don't have PG_mte_tagged set.
>>>
>>> Is it expected behavior? I'm supposed we need tags for every sub pages
>>> too, right?
>> We should need something like the below to have tags and page flag set up
>> for each sub page:
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 3f09ac73cce3..528164deef27 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -228,9 +228,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned
>> long addr,
>>          int ncontig;
>>          unsigned long pfn, dpfn;
>>          pgprot_t hugeprot;
>> +       unsigned long nr = sz >> PAGE_SHIFT;
>>
>>          ncontig = num_contig_ptes(sz, &pgsize);
>>
>> +       __sync_cache_and_tags(pte, nr);
>> +
>>          if (!pte_present(pte)) {
>>                  for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>>                          __set_ptes(mm, addr, ptep, pte, 1);
> We only need this for the last loop when we have a present pte. I'd also
> avoid calling __set_ptes(...., 1) if we call the __sync_cache_and_tags()
> here already. Basically just unroll __set_ptes() in set_huge_pte_at()
> and call __set_pte() directly.
>
> It might be better to convert those page flag checks to only happen on
> the head page. My stashed changes from over a year ago (before we had
> more folio conversions) below. However, as I mentioned, I got stuck on
> folio_copy() which also does a cond_resched() between copy_highpage().

We can have the page flags set on the head page only for a hugetlb page.
For copy_highpage(), we should be able to do something like the below:

if  page_is_head && page_is_hugetlb && page_has_mte_tagged
     set page_mte_tagged flags
     copy tags for all sub pages
else // <-- tail page or non-hugetlb page
     current copy_highpage implementation


The hugetlb folio can't go away under us since the migration path should
pin it, so the status of the folio is stable. The preemption caused by
cond_resched() should be fine too because of the pin, and the page table
entry remains a migration entry until the migration is done, so everyone
should just see the migration entry and wait for the migration to finish.

The other user of copy_highpage() of concern is uprobes, but it also pins
the page before doing the copy, and it is called while holding the write
mmap_lock.

IIUC, it should work unless I missed something. This should also have no
impact on HVO. The overhead for the other users of copy_highpage() should
be acceptable as well.
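
For illustration, a rough, untested sketch of what copy_highpage() could
look like with the head-page-only flag scheme (it reuses the existing arm64
MTE helpers and leaves out the KASAN tag handling of the current
implementation):

void copy_highpage(struct page *to, struct page *from)
{
	void *kto = page_address(to);
	void *kfrom = page_address(from);

	copy_page(kto, kfrom);

	if (!system_supports_mte() || !page_mte_tagged(from))
		return;

	/* The destination is a new page, it shouldn't be tagged yet */
	WARN_ON_ONCE(!try_page_mte_tagging(to));

	if (PageHuge(from) && PageHead(from)) {
		/*
		 * Only the head page carries PG_mte_tagged; populate the tags
		 * for all the subpages of the destination folio now. Tags
		 * live in separate tag storage, so copying them before the
		 * tail pages' data has been copied is fine.
		 */
		unsigned long i, nr_pages = compound_nr(from);

		for (i = 0; i < nr_pages; i++)
			mte_copy_page_tags(page_address(to + i),
					   page_address(from + i));
	} else {
		mte_copy_page_tags(kto, kfrom);
	}

	set_page_mte_tagged(to);
}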

>
> ---------8<--------------------------------
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index f5bcb0dc6267..a84ad0e46f12 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -49,7 +49,9 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>   		return;
>   
>   	if (try_page_mte_tagging(page)) {
> -		mte_clear_page_tags(page_address(page));
> +		unsigned long i, nr_pages = compound_nr(page);
> +		for (i = 0; i < nr_pages; i++)
> +			mte_clear_page_tags(page_address(page + i));
>   		set_page_mte_tagged(page);
>   	}
>   }
> @@ -57,22 +59,23 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>   void mte_sync_tags(pte_t old_pte, pte_t pte)
>   {
>   	struct page *page = pte_page(pte);
> -	long i, nr_pages = compound_nr(page);
> -	bool check_swap = nr_pages == 1;
> +	bool check_swap = !PageCompound(page);
>   	bool pte_is_tagged = pte_tagged(pte);
>   
>   	/* Early out if there's nothing to do */
>   	if (!check_swap && !pte_is_tagged)
>   		return;
>   
> +	/*
> +	 * HugeTLB pages are always fully mapped, so only setting head page's
> +	 * PG_mte_* flags is enough.
> +	 */
> +	if (PageHuge(page))
> +		page = compound_head(page);
> +
>   	/* if PG_mte_tagged is set, tags have already been initialised */
> -	for (i = 0; i < nr_pages; i++, page++) {
> -		if (!page_mte_tagged(page)) {
> -			mte_sync_page_tags(page, old_pte, check_swap,
> -					   pte_is_tagged);
> -			set_page_mte_tagged(page);
> -		}
> -	}
> +	if (!page_mte_tagged(page))
> +		mte_sync_page_tags(page, old_pte, check_swap, pte_is_tagged);
>   
>   	/* ensure the tags are visible before the PTE is set */
>   	smp_wmb();
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 5626ddb540ce..692198d8c0d2 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -1079,7 +1079,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>   
>   			/* uaccess failed, don't leave stale tags */
>   			if (num_tags != MTE_GRANULES_PER_PAGE)
> -				mte_clear_page_tags(page);
> +				mte_clear_page_tags(page_address(page));
>   			set_page_mte_tagged(page);
>   
>   			kvm_release_pfn_dirty(pfn);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 31d7fa4c7c14..b531b1d75cda 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1173,11 +1173,10 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>   	if (!kvm_has_mte(kvm))
>   		return;
>   
> -	for (i = 0; i < nr_pages; i++, page++) {
> -		if (try_page_mte_tagging(page)) {
> -			mte_clear_page_tags(page_address(page));
> -			set_page_mte_tagged(page);
> -		}
> +	if (try_page_mte_tagging(page)) {
> +		for (i = 0; i < nr_pages; i++)
> +			mte_clear_page_tags(page_address(page + i));
> +		set_page_mte_tagged(page);
>   	}
>   }
>   
>
Yang Shi Aug. 13, 2024, 5:08 p.m. UTC | #13
On 7/9/24 10:42 AM, Yang Shi wrote:
>
>
> On 7/4/24 6:44 AM, Catalin Marinas wrote:
>> On Tue, Jul 02, 2024 at 05:15:12PM -0700, Yang Shi wrote:
>>> On 7/2/24 5:04 PM, Yang Shi wrote:
>>>> On 7/2/24 5:34 AM, Catalin Marinas wrote:
>>>>> Last time I checked, about a year ago, this was not sufficient. One
>>>>> issue is that there's no arch_clear_hugetlb_flags() implemented by 
>>>>> your
>>>>> patch, leaving PG_arch_{2,3} set on a page. The other issue was 
>>>>> that I
>>>>> initially tried to do this only on the head page but this did not go
>>>>> well with the folio_copy() -> copy_highpage() which expects the
>>>>> PG_arch_* flags on each individual page. The alternative was for
>>>>> arch_clear_hugetlb_flags() to iterate over all the pages in a folio.
>>>> Thanks for pointing this out. I did miss this point. I took a quick 
>>>> look
>>>> at when the PG_ flags are set. IIUC, it is set by post_alloc_hook() 
>>>> for
>>>> order-0 anonymous folio (clearing page and tags) and set_ptes() for
>>>> others (just clear tags), for example, THP and hugetlb.
>>>>
>>>> I can see THP does set the PG_mte_tagged flag for each sub pages. 
>>>> But it
>>>> seems it does not do it for hugetlb if I read the code correctly. The
>>>> call path is:
>>>>
>>>> hugetlb_fault() ->
>>>>    hugetlb_no_page->
>>>>      set_huge_pte_at ->
>>>>        __set_ptes() ->
>>>>          __sync_cache_and_tags() ->
>>>>
>>>>
>>>> The __set_ptes() is called in a loop:
>>>>
>>>> if (!pte_present(pte)) {
>>>>          for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>>>>              __set_ptes(mm, addr, ptep, pte, 1);
>>>>          return;
>>>>      }
>>>>
>>>> The ncontig and pgsize are returned by num_contig_ptes(). For example,
>>>> 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually
>>>> just the head page has PG_mte_tagged set. If so the copy_highpage() 
>>>> will
>>>> just copy the old head page's flag to the new head page, and the tag.
>>>> All the sub pages don't have PG_mte_tagged set.
>>>>
>>>> Is it expected behavior? I'm supposed we need tags for every sub pages
>>>> too, right?
>>> We should need something like the below to have tags and page flag 
>>> set up
>>> for each sub page:
>>>
>>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>>> index 3f09ac73cce3..528164deef27 100644
>>> --- a/arch/arm64/mm/hugetlbpage.c
>>> +++ b/arch/arm64/mm/hugetlbpage.c
>>> @@ -228,9 +228,12 @@ void set_huge_pte_at(struct mm_struct *mm, 
>>> unsigned
>>> long addr,
>>>          int ncontig;
>>>          unsigned long pfn, dpfn;
>>>          pgprot_t hugeprot;
>>> +       unsigned long nr = sz >> PAGE_SHIFT;
>>>
>>>          ncontig = num_contig_ptes(sz, &pgsize);
>>>
>>> +       __sync_cache_and_tags(pte, nr);
>>> +
>>>          if (!pte_present(pte)) {
>>>                  for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>>>                          __set_ptes(mm, addr, ptep, pte, 1);
>> We only need this for the last loop when we have a present pte. I'd also
>> avoid calling __set_ptes(...., 1) if we call the __sync_cache_and_tags()
>> here already. Basically just unroll __set_ptes() in set_huge_pte_at()
>> and call __set_pte() directly.
>>
>> It might be better to convert those page flag checks to only happen on
>> the head page. My stashed changes from over a year ago (before we had
>> more folio conversions) below. However, as I mentioned, I got stuck on
>> folio_copy() which also does a cond_resched() between copy_highpage().
>

Ping...

Does this make sense and sound good? If it does, I will try to
implement it.

> We can have the page flags set for head only for hugetlb page. For 
> copy_highpage(), we should be able to do something like the below:
>
> if  page_is_head && page_is_hugetlb && page_has_mte_tagged
>     set page_mte_tagged flags
>     copy tags for all sub pages
> else // <-- tail page or non-hugetlb page
>     current copy_highpage implementation
>
>
> The hugetlb folio can't go away under us since migration path should 
> pin it so the status of folio is stable. The preemption caused by 
> cond_resched() should be fine too due to the pin and the page table 
> entry keeps being migration entry until migration is done, so every 
> one should just see migration entry and wait for migration is done.
>
> The other concerned user of copy_highpage() is uprobe, but it also 
> pins the page then doing copy and it is called with holding write 
> mmap_lock.
>
> IIUC, it should work if I don't miss something. This also should have 
> no impact on HVO. The overhead for other users of copy_highpage() 
> should be also acceptable.
>
>>
>> ---------8<--------------------------------
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index f5bcb0dc6267..a84ad0e46f12 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -49,7 +49,9 @@ static void mte_sync_page_tags(struct page *page, 
>> pte_t old_pte,
>>           return;
>>         if (try_page_mte_tagging(page)) {
>> -        mte_clear_page_tags(page_address(page));
>> +        unsigned long i, nr_pages = compound_nr(page);
>> +        for (i = 0; i < nr_pages; i++)
>> +            mte_clear_page_tags(page_address(page + i));
>>           set_page_mte_tagged(page);
>>       }
>>   }
>> @@ -57,22 +59,23 @@ static void mte_sync_page_tags(struct page *page, 
>> pte_t old_pte,
>>   void mte_sync_tags(pte_t old_pte, pte_t pte)
>>   {
>>       struct page *page = pte_page(pte);
>> -    long i, nr_pages = compound_nr(page);
>> -    bool check_swap = nr_pages == 1;
>> +    bool check_swap = !PageCompound(page);
>>       bool pte_is_tagged = pte_tagged(pte);
>>         /* Early out if there's nothing to do */
>>       if (!check_swap && !pte_is_tagged)
>>           return;
>>   +    /*
>> +     * HugeTLB pages are always fully mapped, so only setting head 
>> page's
>> +     * PG_mte_* flags is enough.
>> +     */
>> +    if (PageHuge(page))
>> +        page = compound_head(page);
>> +
>>       /* if PG_mte_tagged is set, tags have already been initialised */
>> -    for (i = 0; i < nr_pages; i++, page++) {
>> -        if (!page_mte_tagged(page)) {
>> -            mte_sync_page_tags(page, old_pte, check_swap,
>> -                       pte_is_tagged);
>> -            set_page_mte_tagged(page);
>> -        }
>> -    }
>> +    if (!page_mte_tagged(page))
>> +        mte_sync_page_tags(page, old_pte, check_swap, pte_is_tagged);
>>         /* ensure the tags are visible before the PTE is set */
>>       smp_wmb();
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index 5626ddb540ce..692198d8c0d2 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -1079,7 +1079,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>>                 /* uaccess failed, don't leave stale tags */
>>               if (num_tags != MTE_GRANULES_PER_PAGE)
>> -                mte_clear_page_tags(page);
>> +                mte_clear_page_tags(page_address(page));
>>               set_page_mte_tagged(page);
>>                 kvm_release_pfn_dirty(pfn);
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 31d7fa4c7c14..b531b1d75cda 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1173,11 +1173,10 @@ static void sanitise_mte_tags(struct kvm 
>> *kvm, kvm_pfn_t pfn,
>>       if (!kvm_has_mte(kvm))
>>           return;
>>   -    for (i = 0; i < nr_pages; i++, page++) {
>> -        if (try_page_mte_tagging(page)) {
>> -            mte_clear_page_tags(page_address(page));
>> -            set_page_mte_tagged(page);
>> -        }
>> +    if (try_page_mte_tagging(page)) {
>> +        for (i = 0; i < nr_pages; i++)
>> +            mte_clear_page_tags(page_address(page + i));
>> +        set_page_mte_tagged(page);
>>       }
>>   }
>>
>
Catalin Marinas Aug. 15, 2024, 10:31 a.m. UTC | #14
Sorry for the delay (holidays etc.)

On Tue, Jul 09, 2024 at 10:42:58AM -0700, Yang Shi wrote:
> On 7/4/24 6:44 AM, Catalin Marinas wrote:
> > It might be better to convert those page flag checks to only happen on
> > the head page. My stashed changes from over a year ago (before we had
> > more folio conversions) below. However, as I mentioned, I got stuck on
> > folio_copy() which also does a cond_resched() between copy_highpage().
> 
> We can have the page flags set for head only for hugetlb page. For
> copy_highpage(), we should be able to do something like the below:
> 
> if  page_is_head && page_is_hugetlb && page_has_mte_tagged
>     set page_mte_tagged flags
>     copy tags for all sub pages
> else // <-- tail page or non-hugetlb page
>     current copy_highpage implementation

Ah, so you want the first copy_highpage(), the one for the head page, to
populate the tags for the tail pages. I guess this would work.

> The hugetlb folio can't go away under us since migration path should pin it
> so the status of folio is stable. The preemption caused by cond_resched()
> should be fine too due to the pin and the page table entry keeps being
> migration entry until migration is done, so every one should just see
> migration entry and wait for migration is done.

Yeah, I don't see those pages going away, otherwise folio_copy() would
corrupt data.

> The other concerned user of copy_highpage() is uprobe, but it also pins the
> page then doing copy and it is called with holding write mmap_lock.
> 
> IIUC, it should work if I don't miss something. This also should have no
> impact on HVO. The overhead for other users of copy_highpage() should be
> also acceptable.

I also think so. We also have the copy_user_highpage() on arm64 that
calls copy_highpage() but I think that's also safe.
Yang Shi Aug. 15, 2024, 7:15 p.m. UTC | #15
On 8/15/24 3:31 AM, Catalin Marinas wrote:
> Sorry for the delay (holidays etc.)
>
> On Tue, Jul 09, 2024 at 10:42:58AM -0700, Yang Shi wrote:
>> On 7/4/24 6:44 AM, Catalin Marinas wrote:
>>> It might be better to convert those page flag checks to only happen on
>>> the head page. My stashed changes from over a year ago (before we had
>>> more folio conversions) below. However, as I mentioned, I got stuck on
>>> folio_copy() which also does a cond_resched() between copy_highpage().
>> We can have the page flags set for head only for hugetlb page. For
>> copy_highpage(), we should be able to do something like the below:
>>
>> if  page_is_head && page_is_hugetlb && page_has_mte_tagged
>>      set page_mte_tagged flags
>>      copy tags for all sub pages
>> else // <-- tail page or non-hugetlb page
>>      current copy_highpage implementation
> Ah, so you want in the first copy_highpage() for the head page to
> populate the tags for the tail pages. I guess this would work.

Yes, because we know the order of the hugetlb page, so we know how many
tail pages we need to populate tags for.

A deeper look showed this may be the only way to do it (if we want to have
the MTE page flag on the head page only) because process_huge_page() may
start copying from the middle of the huge page in order to keep the hot
subpages in the cache. process_huge_page() is used by hugetlb fork and COW.
I think it is safe to do so in the fork and COW paths too since the
destination hugetlb page won't be seen by users until the fork or the COW
fault is done.

>
>> The hugetlb folio can't go away under us since migration path should pin it
>> so the status of folio is stable. The preemption caused by cond_resched()
>> should be fine too due to the pin and the page table entry keeps being
>> migration entry until migration is done, so every one should just see
>> migration entry and wait for migration is done.
> Yeah, I don't see those pages going away, otherwise folio_copy() would
> corrupt data.
>
>> The other concerned user of copy_highpage() is uprobe, but it also pins the
>> page then doing copy and it is called with holding write mmap_lock.
>>
>> IIUC, it should work if I don't miss something. This also should have no
>> impact on HVO. The overhead for other users of copy_highpage() should be
>> also acceptable.
> I also think so. We also have the copy_user_highpage() on arm64 that
> calls copy_highpage() but I think that's also safe.

Yes, it is used by fork and by hugetlb COW faults.

>
Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ecad73a4f713..c34faef62daf 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -110,7 +110,7 @@  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
+	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
 	vma->vm_ops = &hugetlb_vm_ops;
 
 	ret = seal_check_write(info->seals, vma);