Message ID: 20240625233717.2769975-1-yang@os.amperecomputing.com (mailing list archive)
State: New, archived
Series: hugetlbfs: add MTE support
On Tue, 25 Jun 2024 16:37:17 -0700 Yang Shi <yang@os.amperecomputing.com> wrote: > MTE can be supported on ram based filesystem. It is supported on tmpfs. > There is use case to use MTE on hugetlbfs as well, adding MTE support. > > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) > * way when do_mmap unwinds (may be important on powerpc > * and ia64). > */ > - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); > + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); > vma->vm_ops = &hugetlb_vm_ops; > > ret = seal_check_write(info->seals, vma); How thoroughly has this been tested? Can we expect normal linux-next testing to exercise this, or must testers make special arrangements to get the coverage?
On Wed, Jun 26, 2024 at 1:40 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Tue, 25 Jun 2024 16:37:17 -0700 Yang Shi <yang@os.amperecomputing.com> wrote: > > > MTE can be supported on ram based filesystem. It is supported on tmpfs. > > There is use case to use MTE on hugetlbfs as well, adding MTE support. > > > > --- a/fs/hugetlbfs/inode.c > > +++ b/fs/hugetlbfs/inode.c > > @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) > > * way when do_mmap unwinds (may be important on powerpc > > * and ia64). > > */ > > - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); > > + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); > > vma->vm_ops = &hugetlb_vm_ops; > > > > ret = seal_check_write(info->seals, vma); > > How thoroughly has this been tested? > > Can we expect normal linux-next testing to exercise this, or must > testers make special arangements to get the coverage? It requires special arrangements. This needs hardware support and a custom-patched QEMU. We did in-house testing on the AmpereOne platform with a patched QEMU 8.1. >
On Wed, Jun 26, 2024 at 1:45 PM Yang Shi <shy828301@gmail.com> wrote: > > On Wed, Jun 26, 2024 at 1:40 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > On Tue, 25 Jun 2024 16:37:17 -0700 Yang Shi <yang@os.amperecomputing.com> wrote: > > > > > MTE can be supported on ram based filesystem. It is supported on tmpfs. > > > There is use case to use MTE on hugetlbfs as well, adding MTE support. > > > > > > --- a/fs/hugetlbfs/inode.c > > > +++ b/fs/hugetlbfs/inode.c > > > @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) > > > * way when do_mmap unwinds (may be important on powerpc > > > * and ia64). > > > */ > > > - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); > > > + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); > > > vma->vm_ops = &hugetlb_vm_ops; > > > > > > ret = seal_check_write(info->seals, vma); > > > > How thoroughly has this been tested? > > > > Can we expect normal linux-next testing to exercise this, or must > > testers make special arangements to get the coverage? > > It requires special arrangements. This needs hardware support and > custom-patched QEMU. We did in-house test on AmpereOne platform with > patched QEMU 8.1. To correct myself: a custom-patched QEMU is not required for a minimal test, but a special test program is definitely needed. We used the custom-patched QEMU to test a VM backed by hugetlbfs with MTE enabled. > > >
On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote: > MTE can be supported on ram based filesystem. It is supported on tmpfs. > There is use case to use MTE on hugetlbfs as well, adding MTE support. > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > --- > fs/hugetlbfs/inode.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index ecad73a4f713..c34faef62daf 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) > * way when do_mmap unwinds (may be important on powerpc > * and ia64). > */ > - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); > + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); > vma->vm_ops = &hugetlb_vm_ops; Last time I checked, about a year ago, this was not sufficient. One issue is that there's no arch_clear_hugetlb_flags() implemented by your patch, leaving PG_arch_{2,3} set on a page. The other issue was that I initially tried to do this only on the head page but this did not go well with the folio_copy() -> copy_highpage() which expects the PG_arch_* flags on each individual page. The alternative was for arch_clear_hugetlb_flags() to iterate over all the pages in a folio. I'd also like to see some tests added to tools/testing/selftests/arm64/mte to exercise MAP_HUGETLB with PROT_MTE: write/read tags, a series of mmap+munmap (mostly to check if old page flags are still around), force some copy on write. I don't think we should merge the patch without proper tests. 
An untested hunk on top of your changes: diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h index 3954cbd2ff56..5357b00b9087 100644 --- a/arch/arm64/include/asm/hugetlb.h +++ b/arch/arm64/include/asm/hugetlb.h @@ -20,7 +20,19 @@ extern bool arch_hugetlb_migration_supported(struct hstate *h); static inline void arch_clear_hugetlb_flags(struct folio *folio) { - clear_bit(PG_dcache_clean, &folio->flags); + unsigned long i, nr_pages = folio_nr_pages(folio); + const unsigned long clear_flags = BIT(PG_dcache_clean) | + BIT(PG_arch_2) | BIT(PG_arch_3); + + if (!system_supports_mte()) { + clear_bit(PG_dcache_clean, &folio->flags); + return; + } + + for (i = 0; i < nr_pages; i++) { + struct page *page = folio_page(folio, i); + page->flags &= ~clear_flags; + } } #define arch_clear_hugetlb_flags arch_clear_hugetlb_flags diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h index 5966ee4a6154..304dfc499e68 100644 --- a/arch/arm64/include/asm/mman.h +++ b/arch/arm64/include/asm/mman.h @@ -28,7 +28,8 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags) * backed by tags-capable memory. The vm_flags may be overridden by a * filesystem supporting MTE (RAM-based). */ - if (system_supports_mte() && (flags & MAP_ANONYMOUS)) + if (system_supports_mte() && + (flags & (MAP_ANONYMOUS | MAP_HUGETLB))) return VM_MTE_ALLOWED; return 0;
On 02.07.24 14:34, Catalin Marinas wrote: > On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote: >> MTE can be supported on ram based filesystem. It is supported on tmpfs. >> There is use case to use MTE on hugetlbfs as well, adding MTE support. >> >> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> >> --- >> fs/hugetlbfs/inode.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >> index ecad73a4f713..c34faef62daf 100644 >> --- a/fs/hugetlbfs/inode.c >> +++ b/fs/hugetlbfs/inode.c >> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) >> * way when do_mmap unwinds (may be important on powerpc >> * and ia64). >> */ >> - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); >> + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); >> vma->vm_ops = &hugetlb_vm_ops; > > Last time I checked, about a year ago, this was not sufficient. One > issue is that there's no arch_clear_hugetlb_flags() implemented by your > patch, leaving PG_arch_{2,3} set on a page. The other issue was that I > initially tried to do this only on the head page but this did not go > well with the folio_copy() -> copy_highpage() which expects the > PG_arch_* flags on each individual page. The alternative was for > arch_clear_hugetlb_flags() to iterate over all the pages in a folio. This would likely also add a blocker for ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP on arm64 (no idea if there are ways to move forward with that now, or if we are still not sure if we can actually add support), correct?
On 7/2/24 5:34 AM, Catalin Marinas wrote: > On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote: >> MTE can be supported on ram based filesystem. It is supported on tmpfs. >> There is use case to use MTE on hugetlbfs as well, adding MTE support. >> >> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> >> --- >> fs/hugetlbfs/inode.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >> index ecad73a4f713..c34faef62daf 100644 >> --- a/fs/hugetlbfs/inode.c >> +++ b/fs/hugetlbfs/inode.c >> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) >> * way when do_mmap unwinds (may be important on powerpc >> * and ia64). >> */ >> - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); >> + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); >> vma->vm_ops = &hugetlb_vm_ops; > Last time I checked, about a year ago, this was not sufficient. One > issue is that there's no arch_clear_hugetlb_flags() implemented by your > patch, leaving PG_arch_{2,3} set on a page. The other issue was that I > initially tried to do this only on the head page but this did not go > well with the folio_copy() -> copy_highpage() which expects the > PG_arch_* flags on each individual page. The alternative was for > arch_clear_hugetlb_flags() to iterate over all the pages in a folio. Thanks for pointing this out. I did miss this point. I took a quick look at when the PG_ flags are set. IIUC, they are set by post_alloc_hook() for order-0 anonymous folios (clearing the page and tags) and by set_ptes() for others (just clearing tags), for example THP and hugetlb. I can see THP does set the PG_mte_tagged flag for each subpage. But it seems it does not do it for hugetlb if I read the code correctly. 
The call path is: hugetlb_fault() -> hugetlb_no_page-> set_huge_pte_at -> __set_ptes() -> __sync_cache_and_tags() -> The __set_ptes() is called in a loop: if (!pte_present(pte)) { for (i = 0; i < ncontig; i++, ptep++, addr += pgsize) __set_ptes(mm, addr, ptep, pte, 1); return; } The ncontig and pgsize are returned by num_contig_ptes(). For example, 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually just the head page has PG_mte_tagged set. If so the copy_highpage() will just copy the old head page's flag to the new head page, and the tag. All the subpages don't have PG_mte_tagged set. Is it expected behavior? I suppose we need tags for every subpage too, right? > > I'd also like to see some tests added to > tools/testing/selftest/arm64/mte to exercise MAP_HUGETLB with PROT_MTE: > write/read tags, a series of mman+munmap (mostly to check if old page > flags are still around), force some copy on write. I don't think we > should merge the patch without proper tests. 
> > An untested hunk on top of your changes: > > diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h > index 3954cbd2ff56..5357b00b9087 100644 > --- a/arch/arm64/include/asm/hugetlb.h > +++ b/arch/arm64/include/asm/hugetlb.h > @@ -20,7 +20,19 @@ extern bool arch_hugetlb_migration_supported(struct hstate *h); > > static inline void arch_clear_hugetlb_flags(struct folio *folio) > { > - clear_bit(PG_dcache_clean, &folio->flags); > + unsigned long i, nr_pages = folio_nr_pages(folio); > + const unsigned long clear_flags = BIT(PG_dcache_clean) | > + BIT(PG_arch_2) | BIT(PG_arch_3); > + > + if (!system_supports_mte()) { > + clear_bit(PG_dcache_clean, &folio->flags); > + return; > + } > + > + for (i = 0; i < nr_pages; i++) { > + struct page *page = folio_page(folio, i); > + page->flags &= ~clear_flags; > + } > } > #define arch_clear_hugetlb_flags arch_clear_hugetlb_flags > > diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h > index 5966ee4a6154..304dfc499e68 100644 > --- a/arch/arm64/include/asm/mman.h > +++ b/arch/arm64/include/asm/mman.h > @@ -28,7 +28,8 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags) > * backed by tags-capable memory. The vm_flags may be overridden by a > * filesystem supporting MTE (RAM-based). > */ > - if (system_supports_mte() && (flags & MAP_ANONYMOUS)) > + if (system_supports_mte() && > + (flags & (MAP_ANONYMOUS | MAP_HUGETLB))) > return VM_MTE_ALLOWED; Do we really need this change? IIRC, the mmap_region() will call hugetlbfs's mmap and set VM_MTE_ALLOWED in vma->vm_flags, then update vma->vm_page_prot with the new vma->vm_flags. If this is needed, MTE for tmpfs won't work, right? > > return 0; >
On 7/2/24 5:04 PM, Yang Shi wrote: > > > On 7/2/24 5:34 AM, Catalin Marinas wrote: >> On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote: >>> MTE can be supported on ram based filesystem. It is supported on tmpfs. >>> There is use case to use MTE on hugetlbfs as well, adding MTE support. >>> >>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> >>> --- >>> fs/hugetlbfs/inode.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>> index ecad73a4f713..c34faef62daf 100644 >>> --- a/fs/hugetlbfs/inode.c >>> +++ b/fs/hugetlbfs/inode.c >>> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file >>> *file, struct vm_area_struct *vma) >>> * way when do_mmap unwinds (may be important on powerpc >>> * and ia64). >>> */ >>> - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); >>> + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); >>> vma->vm_ops = &hugetlb_vm_ops; >> Last time I checked, about a year ago, this was not sufficient. One >> issue is that there's no arch_clear_hugetlb_flags() implemented by your >> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I >> initially tried to do this only on the head page but this did not go >> well with the folio_copy() -> copy_highpage() which expects the >> PG_arch_* flags on each individual page. The alternative was for >> arch_clear_hugetlb_flags() to iterate over all the pages in a folio. > > Thanks for pointing this out. I did miss this point. I took a quick > look at when the PG_ flags are set. IIUC, it is set by > post_alloc_hook() for order-0 anonymous folio (clearing page and tags) > and set_ptes() for others (just clear tags), for example, THP and > hugetlb. > > I can see THP does set the PG_mte_tagged flag for each sub pages. But > it seems it does not do it for hugetlb if I read the code correctly. 
> The call path is: > > hugetlb_fault() -> > hugetlb_no_page-> > set_huge_pte_at -> > __set_ptes() -> > __sync_cache_and_tags() -> > > > The __set_ptes() is called in a loop: > > if (!pte_present(pte)) { > for (i = 0; i < ncontig; i++, ptep++, addr += pgsize) > __set_ptes(mm, addr, ptep, pte, 1); > return; > } > > The ncontig and pgsize are returned by num_contig_ptes(). For example, > 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually > just the head page has PG_mte_tagged set. If so the copy_highpage() > will just copy the old head page's flag to the new head page, and the > tag. All the sub pages don't have PG_mte_tagged set. > > > Is it expected behavior? I'm supposed we need tags for every sub pages > too, right? We would need something like the below to have tags and page flags set up for each subpage: diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c index 3f09ac73cce3..528164deef27 100644 --- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -228,9 +228,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, int ncontig; unsigned long pfn, dpfn; pgprot_t hugeprot; + unsigned long nr = sz >> PAGE_SHIFT; ncontig = num_contig_ptes(sz, &pgsize); + __sync_cache_and_tags(pte, nr); + if (!pte_present(pte)) { for (i = 0; i < ncontig; i++, ptep++, addr += pgsize) __set_ptes(mm, addr, ptep, pte, 1); > >> >> I'd also like to see some tests added to >> tools/testing/selftest/arm64/mte to exercise MAP_HUGETLB with PROT_MTE: >> write/read tags, a series of mman+munmap (mostly to check if old page >> flags are still around), force some copy on write. I don't think we >> should merge the patch without proper tests. 
>> >> An untested hunk on top of your changes: >> >> diff --git a/arch/arm64/include/asm/hugetlb.h >> b/arch/arm64/include/asm/hugetlb.h >> index 3954cbd2ff56..5357b00b9087 100644 >> --- a/arch/arm64/include/asm/hugetlb.h >> +++ b/arch/arm64/include/asm/hugetlb.h >> @@ -20,7 +20,19 @@ extern bool >> arch_hugetlb_migration_supported(struct hstate *h); >> static inline void arch_clear_hugetlb_flags(struct folio *folio) >> { >> - clear_bit(PG_dcache_clean, &folio->flags); >> + unsigned long i, nr_pages = folio_nr_pages(folio); >> + const unsigned long clear_flags = BIT(PG_dcache_clean) | >> + BIT(PG_arch_2) | BIT(PG_arch_3); >> + >> + if (!system_supports_mte()) { >> + clear_bit(PG_dcache_clean, &folio->flags); >> + return; >> + } >> + >> + for (i = 0; i < nr_pages; i++) { >> + struct page *page = folio_page(folio, i); >> + page->flags &= ~clear_flags; >> + } >> } >> #define arch_clear_hugetlb_flags arch_clear_hugetlb_flags >> diff --git a/arch/arm64/include/asm/mman.h >> b/arch/arm64/include/asm/mman.h >> index 5966ee4a6154..304dfc499e68 100644 >> --- a/arch/arm64/include/asm/mman.h >> +++ b/arch/arm64/include/asm/mman.h >> @@ -28,7 +28,8 @@ static inline unsigned long >> arch_calc_vm_flag_bits(unsigned long flags) >> * backed by tags-capable memory. The vm_flags may be >> overridden by a >> * filesystem supporting MTE (RAM-based). >> */ >> - if (system_supports_mte() && (flags & MAP_ANONYMOUS)) >> + if (system_supports_mte() && >> + (flags & (MAP_ANONYMOUS | MAP_HUGETLB))) >> return VM_MTE_ALLOWED; > > Do we really need this change? IIRC, the mmap_region() will call > hugetlbfs's mmap and set VM_MTE_ALLOWED in vma->vm_flags, then update > vma->vm_page_prot with the new vma->vm_flags. > > If this is needed, MTE for tmpfs won't work, right? > >> return 0; >> >
On 7/2/24 6:09 AM, David Hildenbrand wrote: > On 02.07.24 14:34, Catalin Marinas wrote: >> On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote: >>> MTE can be supported on ram based filesystem. It is supported on tmpfs. >>> There is use case to use MTE on hugetlbfs as well, adding MTE support. >>> >>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> >>> --- >>> fs/hugetlbfs/inode.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>> index ecad73a4f713..c34faef62daf 100644 >>> --- a/fs/hugetlbfs/inode.c >>> +++ b/fs/hugetlbfs/inode.c >>> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file >>> *file, struct vm_area_struct *vma) >>> * way when do_mmap unwinds (may be important on powerpc >>> * and ia64). >>> */ >>> - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); >>> + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); >>> vma->vm_ops = &hugetlb_vm_ops; >> >> Last time I checked, about a year ago, this was not sufficient. One >> issue is that there's no arch_clear_hugetlb_flags() implemented by your >> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I >> initially tried to do this only on the head page but this did not go >> well with the folio_copy() -> copy_highpage() which expects the >> PG_arch_* flags on each individual page. The alternative was for >> arch_clear_hugetlb_flags() to iterate over all the pages in a folio. > > This would likely also add a blocker for > ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP on arm64 (no idea if there are now > ways to move forward with that now, or if we are still not sure if we > can actually add support), correct? IIUC, it is not a blocker. We just need to guarantee each subpage has the PG_mte_tagged flag and allocated tags. HVO just maps the 7 vmemmap pages for the subpages to the first page; they still see the flag, and the space for tags is not impacted, right? Did I miss something? > >
On 03.07.24 02:20, Yang Shi wrote: > > > On 7/2/24 6:09 AM, David Hildenbrand wrote: >> On 02.07.24 14:34, Catalin Marinas wrote: >>> On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote: >>>> MTE can be supported on ram based filesystem. It is supported on tmpfs. >>>> There is use case to use MTE on hugetlbfs as well, adding MTE support. >>>> >>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> >>>> --- >>>> fs/hugetlbfs/inode.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>>> index ecad73a4f713..c34faef62daf 100644 >>>> --- a/fs/hugetlbfs/inode.c >>>> +++ b/fs/hugetlbfs/inode.c >>>> @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file >>>> *file, struct vm_area_struct *vma) >>>> * way when do_mmap unwinds (may be important on powerpc >>>> * and ia64). >>>> */ >>>> - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); >>>> + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); >>>> vma->vm_ops = &hugetlb_vm_ops; >>> >>> Last time I checked, about a year ago, this was not sufficient. One >>> issue is that there's no arch_clear_hugetlb_flags() implemented by your >>> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I >>> initially tried to do this only on the head page but this did not go >>> well with the folio_copy() -> copy_highpage() which expects the >>> PG_arch_* flags on each individual page. The alternative was for >>> arch_clear_hugetlb_flags() to iterate over all the pages in a folio. >> >> This would likely also add a blocker for >> ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP on arm64 (no idea if there are now >> ways to move forward with that now, or if we are still not sure if we >> can actually add support), correct? > > IIUC, it is not. We just need to guarantee each subpage has > PG_mte_tagged flag and allocated tags. 
The HVO just maps the 7 vmemmap > pages for sub pages to the first page, they still see the flag and the > space for tag is not impacted, right? Did I miss something? In the R/O vmemmap optimization we won't be able to modify the flags of the double-mapped vmemmap pages via the double mappings. Of course, we could find HVO-specific ways to only modify the flags of the first vmemmap page, but it does sound wrong ... Really, the question is whether we can have a per-folio flag for hugetlb instead and avoid all that.
On Wed, Jul 03, 2024 at 12:24:40PM +0200, David Hildenbrand wrote: > On 03.07.24 02:20, Yang Shi wrote: > > On 7/2/24 6:09 AM, David Hildenbrand wrote: > > > On 02.07.24 14:34, Catalin Marinas wrote: > > > > On Tue, Jun 25, 2024 at 04:37:17PM -0700, Yang Shi wrote: > > > > > MTE can be supported on ram based filesystem. It is supported on tmpfs. > > > > > There is use case to use MTE on hugetlbfs as well, adding MTE support. > > > > > > > > > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > > > > > --- > > > > > fs/hugetlbfs/inode.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > > > > > index ecad73a4f713..c34faef62daf 100644 > > > > > --- a/fs/hugetlbfs/inode.c > > > > > +++ b/fs/hugetlbfs/inode.c > > > > > @@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file > > > > > *file, struct vm_area_struct *vma) > > > > > * way when do_mmap unwinds (may be important on powerpc > > > > > * and ia64). > > > > > */ > > > > > - vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); > > > > > + vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED); > > > > > vma->vm_ops = &hugetlb_vm_ops; > > > > > > > > Last time I checked, about a year ago, this was not sufficient. One > > > > issue is that there's no arch_clear_hugetlb_flags() implemented by your > > > > patch, leaving PG_arch_{2,3} set on a page. The other issue was that I > > > > initially tried to do this only on the head page but this did not go > > > > well with the folio_copy() -> copy_highpage() which expects the > > > > PG_arch_* flags on each individual page. The alternative was for > > > > arch_clear_hugetlb_flags() to iterate over all the pages in a folio. > > > > > > This would likely also add a blocker for > > > ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP on arm64 (no idea if there are now > > > ways to move forward with that now, or if we are still not sure if we > > > can actually add support), correct? 
> > IIUC, it is not. We just need to guarantee each subpage has > > PG_mte_tagged flag and allocated tags. The HVO just maps the 7 vmemmap > > pages for sub pages to the first page, they still see the flag and the > > space for tag is not impacted, right? Did I miss something? > > In the R/O vmemmap optimization we won't be able to modify the flags of the > double-mapped vmemmap pages via the double mappings. > > Of course, we could find HVO-specific ways to only modify the flags of the > first vmemmap page, but it does sound wrong ... > > Really, the question is if we can have a per-folio flag for hugetlb instead > and avoid all that? I think it is possible and I have some half-baked changes but got distracted and never completed them. The only issue I came across was folio_copy() calling copy_highpage() on individual pages that did not have the original PG_mte_tagged (PG_arch_2) flag. To avoid some races, we also use PG_mte_lock (PG_arch_3) as some form of locking but for optimisation we don't clear this flag after copying the tags and setting PG_mte_tagged. So doing the checks on the head page only confuses the tail page copying. Even if we use PG_arch_3 as a proper lock bit and clear it after tag copying, I'll need to check whether this can race with any mprotect(PROT_MTE) that could cause losing tags or leaking tags (not initialising the pages). set_pte_at() relies on the PG_mte_tagged flag to decide whether to initialise the tags. The arm64 hugetlbfs supports contiguous ptes, so we'd get multiple set_pte_at() calls. Anyway, I think with some care it is doable, I just did not have the time, nor did I see anyone asking for such a feature until now.
On Tue, Jul 02, 2024 at 05:15:12PM -0700, Yang Shi wrote: > On 7/2/24 5:04 PM, Yang Shi wrote: > > On 7/2/24 5:34 AM, Catalin Marinas wrote: > > > Last time I checked, about a year ago, this was not sufficient. One > > > issue is that there's no arch_clear_hugetlb_flags() implemented by your > > > patch, leaving PG_arch_{2,3} set on a page. The other issue was that I > > > initially tried to do this only on the head page but this did not go > > > well with the folio_copy() -> copy_highpage() which expects the > > > PG_arch_* flags on each individual page. The alternative was for > > > arch_clear_hugetlb_flags() to iterate over all the pages in a folio. > > > > Thanks for pointing this out. I did miss this point. I took a quick look > > at when the PG_ flags are set. IIUC, it is set by post_alloc_hook() for > > order-0 anonymous folio (clearing page and tags) and set_ptes() for > > others (just clear tags), for example, THP and hugetlb. > > > > I can see THP does set the PG_mte_tagged flag for each sub pages. But it > > seems it does not do it for hugetlb if I read the code correctly. The > > call path is: > > > > hugetlb_fault() -> > > hugetlb_no_page-> > > set_huge_pte_at -> > > __set_ptes() -> > > __sync_cache_and_tags() -> > > > > > > The __set_ptes() is called in a loop: > > > > if (!pte_present(pte)) { > > for (i = 0; i < ncontig; i++, ptep++, addr += pgsize) > > __set_ptes(mm, addr, ptep, pte, 1); > > return; > > } > > > > The ncontig and pgsize are returned by num_contig_ptes(). For example, > > 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually > > just the head page has PG_mte_tagged set. If so the copy_highpage() will > > just copy the old head page's flag to the new head page, and the tag. > > All the sub pages don't have PG_mte_tagged set. > > > > Is it expected behavior? I'm supposed we need tags for every sub pages > > too, right? 
> > We should need something like the below to have tags and page flag set up > for each sub page: > > diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c > index 3f09ac73cce3..528164deef27 100644 > --- a/arch/arm64/mm/hugetlbpage.c > +++ b/arch/arm64/mm/hugetlbpage.c > @@ -228,9 +228,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned > long addr, > int ncontig; > unsigned long pfn, dpfn; > pgprot_t hugeprot; > + unsigned long nr = sz >> PAGE_SHIFT; > > ncontig = num_contig_ptes(sz, &pgsize); > > + __sync_cache_and_tags(pte, nr); > + > if (!pte_present(pte)) { > for (i = 0; i < ncontig; i++, ptep++, addr += pgsize) > __set_ptes(mm, addr, ptep, pte, 1); We only need this for the last loop when we have a present pte. I'd also avoid calling __set_ptes(...., 1) if we call the __sync_cache_and_tags() here already. Basically just unroll __set_ptes() in set_huge_pte_at() and call __set_pte() directly. It might be better to convert those page flag checks to only happen on the head page. My stashed changes from over a year ago (before we had more folio conversions) below. However, as I mentioned, I got stuck on folio_copy() which also does a cond_resched() between copy_highpage(). 
---------8<-------------------------------- diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c index f5bcb0dc6267..a84ad0e46f12 100644 --- a/arch/arm64/kernel/mte.c +++ b/arch/arm64/kernel/mte.c @@ -49,7 +49,9 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte, return; if (try_page_mte_tagging(page)) { - mte_clear_page_tags(page_address(page)); + unsigned long i, nr_pages = compound_nr(page); + for (i = 0; i < nr_pages; i++) + mte_clear_page_tags(page_address(page + i)); set_page_mte_tagged(page); } } @@ -57,22 +59,23 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte, void mte_sync_tags(pte_t old_pte, pte_t pte) { struct page *page = pte_page(pte); - long i, nr_pages = compound_nr(page); - bool check_swap = nr_pages == 1; + bool check_swap = !PageCompound(page); bool pte_is_tagged = pte_tagged(pte); /* Early out if there's nothing to do */ if (!check_swap && !pte_is_tagged) return; + /* + * HugeTLB pages are always fully mapped, so only setting head page's + * PG_mte_* flags is enough. 
+ */ + if (PageHuge(page)) + page = compound_head(page); + /* if PG_mte_tagged is set, tags have already been initialised */ - for (i = 0; i < nr_pages; i++, page++) { - if (!page_mte_tagged(page)) { - mte_sync_page_tags(page, old_pte, check_swap, - pte_is_tagged); - set_page_mte_tagged(page); - } - } + if (!page_mte_tagged(page)) + mte_sync_page_tags(page, old_pte, check_swap, pte_is_tagged); /* ensure the tags are visible before the PTE is set */ smp_wmb(); diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c index 5626ddb540ce..692198d8c0d2 100644 --- a/arch/arm64/kvm/guest.c +++ b/arch/arm64/kvm/guest.c @@ -1079,7 +1079,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm, /* uaccess failed, don't leave stale tags */ if (num_tags != MTE_GRANULES_PER_PAGE) - mte_clear_page_tags(page); + mte_clear_page_tags(page_address(page)); set_page_mte_tagged(page); kvm_release_pfn_dirty(pfn); diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 31d7fa4c7c14..b531b1d75cda 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1173,11 +1173,10 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, if (!kvm_has_mte(kvm)) return; - for (i = 0; i < nr_pages; i++, page++) { - if (try_page_mte_tagging(page)) { - mte_clear_page_tags(page_address(page)); - set_page_mte_tagged(page); - } + if (try_page_mte_tagging(page)) { + for (i = 0; i < nr_pages; i++) + mte_clear_page_tags(page_address(page + i)); + set_page_mte_tagged(page); } }
On 7/4/24 6:44 AM, Catalin Marinas wrote: > On Tue, Jul 02, 2024 at 05:15:12PM -0700, Yang Shi wrote: >> On 7/2/24 5:04 PM, Yang Shi wrote: >>> On 7/2/24 5:34 AM, Catalin Marinas wrote: >>>> Last time I checked, about a year ago, this was not sufficient. One >>>> issue is that there's no arch_clear_hugetlb_flags() implemented by your >>>> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I >>>> initially tried to do this only on the head page but this did not go >>>> well with the folio_copy() -> copy_highpage() which expects the >>>> PG_arch_* flags on each individual page. The alternative was for >>>> arch_clear_hugetlb_flags() to iterate over all the pages in a folio. >>> Thanks for pointing this out. I did miss this point. I took a quick look >>> at when the PG_ flags are set. IIUC, it is set by post_alloc_hook() for >>> order-0 anonymous folio (clearing page and tags) and set_ptes() for >>> others (just clear tags), for example, THP and hugetlb. >>> >>> I can see THP does set the PG_mte_tagged flag for each sub pages. But it >>> seems it does not do it for hugetlb if I read the code correctly. The >>> call path is: >>> >>> hugetlb_fault() -> >>> hugetlb_no_page-> >>> set_huge_pte_at -> >>> __set_ptes() -> >>> __sync_cache_and_tags() -> >>> >>> >>> The __set_ptes() is called in a loop: >>> >>> if (!pte_present(pte)) { >>> for (i = 0; i < ncontig; i++, ptep++, addr += pgsize) >>> __set_ptes(mm, addr, ptep, pte, 1); >>> return; >>> } >>> >>> The ncontig and pgsize are returned by num_contig_ptes(). For example, >>> 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually >>> just the head page has PG_mte_tagged set. If so the copy_highpage() will >>> just copy the old head page's flag to the new head page, and the tag. >>> All the sub pages don't have PG_mte_tagged set. >>> >>> Is it expected behavior? I'm supposed we need tags for every sub pages >>> too, right? 
>> We should need something like the below to have tags and page flag set up
>> for each sub page:
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 3f09ac73cce3..528164deef27 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -228,9 +228,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
>>          int ncontig;
>>          unsigned long pfn, dpfn;
>>          pgprot_t hugeprot;
>> +        unsigned long nr = sz >> PAGE_SHIFT;
>>
>>          ncontig = num_contig_ptes(sz, &pgsize);
>>
>> +        __sync_cache_and_tags(pte, nr);
>> +
>>          if (!pte_present(pte)) {
>>                  for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>>                          __set_ptes(mm, addr, ptep, pte, 1);
> We only need this for the last loop when we have a present pte. I'd also
> avoid calling __set_ptes(...., 1) if we call the __sync_cache_and_tags()
> here already. Basically just unroll __set_ptes() in set_huge_pte_at()
> and call __set_pte() directly.
>
> It might be better to convert those page flag checks to only happen on
> the head page. My stashed changes from over a year ago (before we had
> more folio conversions) below. However, as I mentioned, I got stuck on
> folio_copy() which also does a cond_resched() between copy_highpage().

We can have the page flags set for head only for hugetlb page. For
copy_highpage(), we should be able to do something like the below:

    if page_is_head && page_is_hugetlb && page_has_mte_tagged
        set page_mte_tagged flags
        copy tags for all sub pages
    else // <-- tail page or non-hugetlb page
        current copy_highpage implementation

The hugetlb folio can't go away under us since migration path should
pin it so the status of folio is stable. The preemption caused by
cond_resched() should be fine too due to the pin and the page table
entry keeps being migration entry until migration is done, so every
one should just see migration entry and wait for migration is done.
The other concerned user of copy_highpage() is uprobe, but it also pins
the page then doing copy and it is called with holding write mmap_lock.

IIUC, it should work if I don't miss something. This also should have no
impact on HVO. The overhead for other users of copy_highpage() should be
also acceptable.

>
> ---------8<--------------------------------
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index f5bcb0dc6267..a84ad0e46f12 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -49,7 +49,9 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>  		return;
>
>  	if (try_page_mte_tagging(page)) {
> -		mte_clear_page_tags(page_address(page));
> +		unsigned long i, nr_pages = compound_nr(page);
> +		for (i = 0; i < nr_pages; i++)
> +			mte_clear_page_tags(page_address(page + i));
>  		set_page_mte_tagged(page);
>  	}
>  }
> @@ -57,22 +59,23 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>  void mte_sync_tags(pte_t old_pte, pte_t pte)
>  {
>  	struct page *page = pte_page(pte);
> -	long i, nr_pages = compound_nr(page);
> -	bool check_swap = nr_pages == 1;
> +	bool check_swap = !PageCompound(page);
>  	bool pte_is_tagged = pte_tagged(pte);
>
>  	/* Early out if there's nothing to do */
>  	if (!check_swap && !pte_is_tagged)
>  		return;
>
> +	/*
> +	 * HugeTLB pages are always fully mapped, so only setting head page's
> +	 * PG_mte_* flags is enough.
> +	 */
> +	if (PageHuge(page))
> +		page = compound_head(page);
> +
>  	/* if PG_mte_tagged is set, tags have already been initialised */
> -	for (i = 0; i < nr_pages; i++, page++) {
> -		if (!page_mte_tagged(page)) {
> -			mte_sync_page_tags(page, old_pte, check_swap,
> -					   pte_is_tagged);
> -			set_page_mte_tagged(page);
> -		}
> -	}
> +	if (!page_mte_tagged(page))
> +		mte_sync_page_tags(page, old_pte, check_swap, pte_is_tagged);
>
>  	/* ensure the tags are visible before the PTE is set */
>  	smp_wmb();
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 5626ddb540ce..692198d8c0d2 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -1079,7 +1079,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>
>  		/* uaccess failed, don't leave stale tags */
>  		if (num_tags != MTE_GRANULES_PER_PAGE)
> -			mte_clear_page_tags(page);
> +			mte_clear_page_tags(page_address(page));
>  		set_page_mte_tagged(page);
>
>  		kvm_release_pfn_dirty(pfn);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 31d7fa4c7c14..b531b1d75cda 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1173,11 +1173,10 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>  	if (!kvm_has_mte(kvm))
>  		return;
>
> -	for (i = 0; i < nr_pages; i++, page++) {
> -		if (try_page_mte_tagging(page)) {
> -			mte_clear_page_tags(page_address(page));
> -			set_page_mte_tagged(page);
> -		}
> +	if (try_page_mte_tagging(page)) {
> +		for (i = 0; i < nr_pages; i++)
> +			mte_clear_page_tags(page_address(page + i));
> +		set_page_mte_tagged(page);
>  	}
>  }
>
On 7/9/24 10:42 AM, Yang Shi wrote:
>
>
> On 7/4/24 6:44 AM, Catalin Marinas wrote:
>> On Tue, Jul 02, 2024 at 05:15:12PM -0700, Yang Shi wrote:
>>> On 7/2/24 5:04 PM, Yang Shi wrote:
>>>> On 7/2/24 5:34 AM, Catalin Marinas wrote:
>>>>> Last time I checked, about a year ago, this was not sufficient. One
>>>>> issue is that there's no arch_clear_hugetlb_flags() implemented by your
>>>>> patch, leaving PG_arch_{2,3} set on a page. The other issue was that I
>>>>> initially tried to do this only on the head page but this did not go
>>>>> well with the folio_copy() -> copy_highpage() which expects the
>>>>> PG_arch_* flags on each individual page. The alternative was for
>>>>> arch_clear_hugetlb_flags() to iterate over all the pages in a folio.
>>>> Thanks for pointing this out. I did miss this point. I took a quick look
>>>> at when the PG_ flags are set. IIUC, it is set by post_alloc_hook() for
>>>> order-0 anonymous folio (clearing page and tags) and set_ptes() for
>>>> others (just clear tags), for example, THP and hugetlb.
>>>>
>>>> I can see THP does set the PG_mte_tagged flag for each sub pages. But it
>>>> seems it does not do it for hugetlb if I read the code correctly. The
>>>> call path is:
>>>>
>>>> hugetlb_fault() ->
>>>>   hugetlb_no_page ->
>>>>     set_huge_pte_at ->
>>>>       __set_ptes() ->
>>>>         __sync_cache_and_tags()
>>>>
>>>> The __set_ptes() is called in a loop:
>>>>
>>>> if (!pte_present(pte)) {
>>>>         for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>>>>                 __set_ptes(mm, addr, ptep, pte, 1);
>>>>         return;
>>>> }
>>>>
>>>> The ncontig and pgsize are returned by num_contig_ptes(). For example,
>>>> 2M hugetlb, ncontig is 1 and pgsize is 2M IIUC. So it means actually
>>>> just the head page has PG_mte_tagged set. If so the copy_highpage() will
>>>> just copy the old head page's flag to the new head page, and the tag.
>>>> All the sub pages don't have PG_mte_tagged set.
>>>>
>>>> Is it expected behavior? I'm supposed we need tags for every sub pages
>>>> too, right?
>>> We should need something like the below to have tags and page flag set up
>>> for each sub page:
>>>
>>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>>> index 3f09ac73cce3..528164deef27 100644
>>> --- a/arch/arm64/mm/hugetlbpage.c
>>> +++ b/arch/arm64/mm/hugetlbpage.c
>>> @@ -228,9 +228,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
>>>          int ncontig;
>>>          unsigned long pfn, dpfn;
>>>          pgprot_t hugeprot;
>>> +        unsigned long nr = sz >> PAGE_SHIFT;
>>>
>>>          ncontig = num_contig_ptes(sz, &pgsize);
>>>
>>> +        __sync_cache_and_tags(pte, nr);
>>> +
>>>          if (!pte_present(pte)) {
>>>                  for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>>>                          __set_ptes(mm, addr, ptep, pte, 1);
>> We only need this for the last loop when we have a present pte. I'd also
>> avoid calling __set_ptes(...., 1) if we call the __sync_cache_and_tags()
>> here already. Basically just unroll __set_ptes() in set_huge_pte_at()
>> and call __set_pte() directly.
>>
>> It might be better to convert those page flag checks to only happen on
>> the head page. My stashed changes from over a year ago (before we had
>> more folio conversions) below. However, as I mentioned, I got stuck on
>> folio_copy() which also does a cond_resched() between copy_highpage().

Ping... Does this make sense and sound good? If it is good, I will try
to implement it.

> We can have the page flags set for head only for hugetlb page. For
> copy_highpage(), we should be able to do something like the below:
>
>     if page_is_head && page_is_hugetlb && page_has_mte_tagged
>         set page_mte_tagged flags
>         copy tags for all sub pages
>     else // <-- tail page or non-hugetlb page
>         current copy_highpage implementation
>
> The hugetlb folio can't go away under us since migration path should
> pin it so the status of folio is stable. The preemption caused by
> cond_resched() should be fine too due to the pin and the page table
> entry keeps being migration entry until migration is done, so every
> one should just see migration entry and wait for migration is done.
>
> The other concerned user of copy_highpage() is uprobe, but it also
> pins the page then doing copy and it is called with holding write
> mmap_lock.
>
> IIUC, it should work if I don't miss something. This also should have
> no impact on HVO. The overhead for other users of copy_highpage()
> should be also acceptable.
>
>> ---------8<--------------------------------
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index f5bcb0dc6267..a84ad0e46f12 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -49,7 +49,9 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>>  		return;
>>
>>  	if (try_page_mte_tagging(page)) {
>> -		mte_clear_page_tags(page_address(page));
>> +		unsigned long i, nr_pages = compound_nr(page);
>> +		for (i = 0; i < nr_pages; i++)
>> +			mte_clear_page_tags(page_address(page + i));
>>  		set_page_mte_tagged(page);
>>  	}
>>  }
>> @@ -57,22 +59,23 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>>  void mte_sync_tags(pte_t old_pte, pte_t pte)
>>  {
>>  	struct page *page = pte_page(pte);
>> -	long i, nr_pages = compound_nr(page);
>> -	bool check_swap = nr_pages == 1;
>> +	bool check_swap = !PageCompound(page);
>>  	bool pte_is_tagged = pte_tagged(pte);
>>
>>  	/* Early out if there's nothing to do */
>>  	if (!check_swap && !pte_is_tagged)
>>  		return;
>>
>> +	/*
>> +	 * HugeTLB pages are always fully mapped, so only setting head page's
>> +	 * PG_mte_* flags is enough.
>> +	 */
>> +	if (PageHuge(page))
>> +		page = compound_head(page);
>> +
>>  	/* if PG_mte_tagged is set, tags have already been initialised */
>> -	for (i = 0; i < nr_pages; i++, page++) {
>> -		if (!page_mte_tagged(page)) {
>> -			mte_sync_page_tags(page, old_pte, check_swap,
>> -					   pte_is_tagged);
>> -			set_page_mte_tagged(page);
>> -		}
>> -	}
>> +	if (!page_mte_tagged(page))
>> +		mte_sync_page_tags(page, old_pte, check_swap, pte_is_tagged);
>>
>>  	/* ensure the tags are visible before the PTE is set */
>>  	smp_wmb();
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index 5626ddb540ce..692198d8c0d2 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -1079,7 +1079,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>>
>>  		/* uaccess failed, don't leave stale tags */
>>  		if (num_tags != MTE_GRANULES_PER_PAGE)
>> -			mte_clear_page_tags(page);
>> +			mte_clear_page_tags(page_address(page));
>>  		set_page_mte_tagged(page);
>>
>>  		kvm_release_pfn_dirty(pfn);
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 31d7fa4c7c14..b531b1d75cda 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1173,11 +1173,10 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>>  	if (!kvm_has_mte(kvm))
>>  		return;
>>
>> -	for (i = 0; i < nr_pages; i++, page++) {
>> -		if (try_page_mte_tagging(page)) {
>> -			mte_clear_page_tags(page_address(page));
>> -			set_page_mte_tagged(page);
>> -		}
>> +	if (try_page_mte_tagging(page)) {
>> +		for (i = 0; i < nr_pages; i++)
>> +			mte_clear_page_tags(page_address(page + i));
>> +		set_page_mte_tagged(page);
>>  	}
>>  }
>>
Sorry for the delay (holidays etc.)

On Tue, Jul 09, 2024 at 10:42:58AM -0700, Yang Shi wrote:
> On 7/4/24 6:44 AM, Catalin Marinas wrote:
> > It might be better to convert those page flag checks to only happen on
> > the head page. My stashed changes from over a year ago (before we had
> > more folio conversions) below. However, as I mentioned, I got stuck on
> > folio_copy() which also does a cond_resched() between copy_highpage().
>
> We can have the page flags set for head only for hugetlb page. For
> copy_highpage(), we should be able to do something like the below:
>
>     if page_is_head && page_is_hugetlb && page_has_mte_tagged
>         set page_mte_tagged flags
>         copy tags for all sub pages
>     else // <-- tail page or non-hugetlb page
>         current copy_highpage implementation

Ah, so you want in the first copy_highpage() for the head page to
populate the tags for the tail pages. I guess this would work.

> The hugetlb folio can't go away under us since migration path should pin it
> so the status of folio is stable. The preemption caused by cond_resched()
> should be fine too due to the pin and the page table entry keeps being
> migration entry until migration is done, so every one should just see
> migration entry and wait for migration is done.

Yeah, I don't see those pages going away, otherwise folio_copy() would
corrupt data.

> The other concerned user of copy_highpage() is uprobe, but it also pins the
> page then doing copy and it is called with holding write mmap_lock.
>
> IIUC, it should work if I don't miss something. This also should have no
> impact on HVO. The overhead for other users of copy_highpage() should be
> also acceptable.

I also think so. We also have the copy_user_highpage() on arm64 that
calls copy_highpage() but I think that's also safe.
On 8/15/24 3:31 AM, Catalin Marinas wrote:
> Sorry for the delay (holidays etc.)
>
> On Tue, Jul 09, 2024 at 10:42:58AM -0700, Yang Shi wrote:
>> On 7/4/24 6:44 AM, Catalin Marinas wrote:
>>> It might be better to convert those page flag checks to only happen on
>>> the head page. My stashed changes from over a year ago (before we had
>>> more folio conversions) below. However, as I mentioned, I got stuck on
>>> folio_copy() which also does a cond_resched() between copy_highpage().
>> We can have the page flags set for head only for hugetlb page. For
>> copy_highpage(), we should be able to do something like the below:
>>
>>     if page_is_head && page_is_hugetlb && page_has_mte_tagged
>>         set page_mte_tagged flags
>>         copy tags for all sub pages
>>     else // <-- tail page or non-hugetlb page
>>         current copy_highpage implementation
> Ah, so you want in the first copy_highpage() for the head page to
> populate the tags for the tail pages. I guess this would work.

Yes, because we know the order of the hugetlb page, so we know how many
tail pages we need to populate tags for.

A deeper look showed this may be the only way to do it (if we want to
keep the MTE page flag on the head only) because process_huge_page() may
start copying from the middle of the huge page in order to keep the hot
sub pages in cache. The process_huge_page() is used by hugetlb fork and
COW. I think it is safe to do so in the fork and COW paths too since the
destination hugetlb page won't be seen by the users until the fork or
COW fault is done.

>
>> The hugetlb folio can't go away under us since migration path should pin it
>> so the status of folio is stable. The preemption caused by cond_resched()
>> should be fine too due to the pin and the page table entry keeps being
>> migration entry until migration is done, so every one should just see
>> migration entry and wait for migration is done.
> Yeah, I don't see those pages going away, otherwise folio_copy() would
> corrupt data.
>
>> The other concerned user of copy_highpage() is uprobe, but it also pins the
>> page then doing copy and it is called with holding write mmap_lock.
>>
>> IIUC, it should work if I don't miss something. This also should have no
>> impact on HVO. The overhead for other users of copy_highpage() should be
>> also acceptable.
> I also think so. We also have the copy_user_highpage() on arm64 that
> calls copy_highpage() but I think that's also safe.

Yes, it is used by fork and COW faults for hugetlb.
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ecad73a4f713..c34faef62daf 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -110,7 +110,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
+	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND | VM_MTE_ALLOWED);
 	vma->vm_ops = &hugetlb_vm_ops;

 	ret = seal_check_write(info->seals, vma);
MTE can be supported on RAM-based filesystems, and it is already
supported on tmpfs. There are use cases for MTE on hugetlbfs as well, so
add MTE support.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 fs/hugetlbfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)