Message ID | 20241026054307.3896926-1-wangkefeng.wang@huawei.com (mailing list archive)
---|---
State | New
Series | [v2,1/2] mm: use aligned address in clear_gigantic_page()
Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> When clearing gigantic page, it zeros page from the first page to the
> last page, if directly passing addr_hint which maybe not the address
> of the first page of folio, then some archs could flush the wrong cache
> if it does use the addr_hint as a hint. For non-gigantic page, it
> calculates the base address inside, even passed the wrong addr_hint, it
> only has performance impact as the process_huge_page() wants to process
> target page last to keep its cache lines hot), no functional impact.
>
> Let's pass the real accessed address to folio_zero_user() and use the
> aligned address in clear_gigantic_page() to fix it.
>
> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> v2:
> - update changelog to clarify the impact, per Andrew
>
>  fs/hugetlbfs/inode.c | 2 +-
>  mm/memory.c          | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a4441fb77f7c..a5ea006f403e 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  			error = PTR_ERR(folio);
>  			goto out;
>  		}
> -		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
> +		folio_zero_user(folio, addr);

'addr' is set with the following statement above,

	/* addr is the offset within the file (zero based) */
	addr = index * hpage_size;

So, we just don't need to ALIGN_DOWN() here. Or do I miss something?

>  		__folio_mark_uptodate(folio);
>  		error = hugetlb_add_to_page_cache(folio, mapping, index);
>  		if (unlikely(error)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 75c2dfd04f72..ef47b7ea5ddd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>  	int i;
>
>  	might_sleep();
> +	addr = ALIGN_DOWN(addr, folio_size(folio));
>  	for (i = 0; i < nr_pages; i++) {
>  		cond_resched();
>  		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);

--
Best Regards,
Huang, Ying
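[For reference, the effect the mm/memory.c hunk guards against can be seen with a small userspace sketch: when an unaligned addr_hint is used as the loop base, the per-page virtual addresses that clear_gigantic_page() would hand to clear_user_highpage() no longer correspond to the folio's pages, which is what can mislead architectures that use the address as a cache-flush hint. The folio size, fault address, and the local ALIGN_DOWN macro below are illustrative assumptions, not values from this thread.]

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define FOLIO_SIZE	(512 * PAGE_SIZE)	/* assume a 2 MiB PMD-sized folio */
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))	/* simplified, power-of-two only */

int main(void)
{
	unsigned long addr_hint = 0x40003c000UL;	/* hypothetical fault address inside the folio */
	unsigned long addr = ALIGN_DOWN(addr_hint, FOLIO_SIZE);
	unsigned long i;

	/* Addresses that would be used for the first few subpages. */
	for (i = 0; i < 3; i++)
		printf("page %lu: unaligned hint gives %#lx, aligned base gives %#lx\n",
		       i, addr_hint + i * PAGE_SIZE, addr + i * PAGE_SIZE);
	return 0;
}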
On 2024/10/28 14:17, Huang, Ying wrote: > Kefeng Wang <wangkefeng.wang@huawei.com> writes: > >> When clearing gigantic page, it zeros page from the first page to the >> last page, if directly passing addr_hint which maybe not the address >> of the first page of folio, then some archs could flush the wrong cache >> if it does use the addr_hint as a hint. For non-gigantic page, it >> calculates the base address inside, even passed the wrong addr_hint, it >> only has performance impact as the process_huge_page() wants to process >> target page last to keep its cache lines hot), no functional impact. >> >> Let's pass the real accessed address to folio_zero_user() and use the >> aligned address in clear_gigantic_page() to fix it. >> >> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()") >> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >> --- >> v2: >> - update changelog to clarify the impact, per Andrew >> >> fs/hugetlbfs/inode.c | 2 +- >> mm/memory.c | 1 + >> 2 files changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >> index a4441fb77f7c..a5ea006f403e 100644 >> --- a/fs/hugetlbfs/inode.c >> +++ b/fs/hugetlbfs/inode.c >> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, >> error = PTR_ERR(folio); >> goto out; >> } >> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >> + folio_zero_user(folio, addr); > > 'addr' is set with the following statement above, > > /* addr is the offset within the file (zero based) */ > addr = index * hpage_size; > > So, we just don't need to ALIGN_DOWN() here. Or do I miss something? Yes, it is already aligned, > >> __folio_mark_uptodate(folio); >> error = hugetlb_add_to_page_cache(folio, mapping, index); >> if (unlikely(error)) { >> diff --git a/mm/memory.c b/mm/memory.c >> index 75c2dfd04f72..ef47b7ea5ddd 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr, >> int i; >> >> might_sleep(); >> + addr = ALIGN_DOWN(addr, folio_size(folio)); but for hugetlb_no_page(), we do need to align the addr as it use vmf->real_address, so I move the alignment into the clear_gigantic_page. >> for (i = 0; i < nr_pages; i++) { >> cond_resched(); >> clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE); > > -- > Best Regards, > Huang, Ying >
Kefeng Wang <wangkefeng.wang@huawei.com> writes: > On 2024/10/28 14:17, Huang, Ying wrote: >> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >> >>> When clearing gigantic page, it zeros page from the first page to the >>> last page, if directly passing addr_hint which maybe not the address >>> of the first page of folio, then some archs could flush the wrong cache >>> if it does use the addr_hint as a hint. For non-gigantic page, it >>> calculates the base address inside, even passed the wrong addr_hint, it >>> only has performance impact as the process_huge_page() wants to process >>> target page last to keep its cache lines hot), no functional impact. >>> >>> Let's pass the real accessed address to folio_zero_user() and use the >>> aligned address in clear_gigantic_page() to fix it. >>> >>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()") >>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>> --- >>> v2: >>> - update changelog to clarify the impact, per Andrew >>> >>> fs/hugetlbfs/inode.c | 2 +- >>> mm/memory.c | 1 + >>> 2 files changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>> index a4441fb77f7c..a5ea006f403e 100644 >>> --- a/fs/hugetlbfs/inode.c >>> +++ b/fs/hugetlbfs/inode.c >>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, >>> error = PTR_ERR(folio); >>> goto out; >>> } >>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >>> + folio_zero_user(folio, addr); >> 'addr' is set with the following statement above, >> /* addr is the offset within the file (zero based) */ >> addr = index * hpage_size; >> So, we just don't need to ALIGN_DOWN() here. Or do I miss >> something? > > Yes, it is already aligned, >> >>> __folio_mark_uptodate(folio); >>> error = hugetlb_add_to_page_cache(folio, mapping, index); >>> if (unlikely(error)) { >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 75c2dfd04f72..ef47b7ea5ddd 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr, >>> int i; >>> might_sleep(); >>> + addr = ALIGN_DOWN(addr, folio_size(folio)); > > but for hugetlb_no_page(), we do need to align the addr as it use > vmf->real_address, so I move the alignment into the > clear_gigantic_page. That sounds good. You may need to revise patch description to describe why you make the change. May be something like below? In current kernel, hugetlb_no_page() calls folio_zero_user() with the fault address. Where the fault address may be not aligned with the huge page size. Then, folio_zero_user() may call clear_gigantic_page() with the address, while clear_gigantic_page() requires the address to be huge page size aligned. So, this may cause memory corruption or information leak. >>> for (i = 0; i < nr_pages; i++) { >>> cond_resched(); >>> clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE); -- Best Regards, Huang, Ying
On 2024/10/28 15:03, Huang, Ying wrote: > Kefeng Wang <wangkefeng.wang@huawei.com> writes: > >> On 2024/10/28 14:17, Huang, Ying wrote: >>> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >>> >>>> When clearing gigantic page, it zeros page from the first page to the >>>> last page, if directly passing addr_hint which maybe not the address >>>> of the first page of folio, then some archs could flush the wrong cache >>>> if it does use the addr_hint as a hint. For non-gigantic page, it >>>> calculates the base address inside, even passed the wrong addr_hint, it >>>> only has performance impact as the process_huge_page() wants to process >>>> target page last to keep its cache lines hot), no functional impact. >>>> >>>> Let's pass the real accessed address to folio_zero_user() and use the >>>> aligned address in clear_gigantic_page() to fix it. >>>> >>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()") >>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>> --- >>>> v2: >>>> - update changelog to clarify the impact, per Andrew >>>> >>>> fs/hugetlbfs/inode.c | 2 +- >>>> mm/memory.c | 1 + >>>> 2 files changed, 2 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>>> index a4441fb77f7c..a5ea006f403e 100644 >>>> --- a/fs/hugetlbfs/inode.c >>>> +++ b/fs/hugetlbfs/inode.c >>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, >>>> error = PTR_ERR(folio); >>>> goto out; >>>> } >>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >>>> + folio_zero_user(folio, addr); >>> 'addr' is set with the following statement above, >>> /* addr is the offset within the file (zero based) */ >>> addr = index * hpage_size; >>> So, we just don't need to ALIGN_DOWN() here. Or do I miss >>> something? >> >> Yes, it is already aligned, >>> >>>> __folio_mark_uptodate(folio); >>>> error = hugetlb_add_to_page_cache(folio, mapping, index); >>>> if (unlikely(error)) { >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 75c2dfd04f72..ef47b7ea5ddd 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr, >>>> int i; >>>> might_sleep(); >>>> + addr = ALIGN_DOWN(addr, folio_size(folio)); >> >> but for hugetlb_no_page(), we do need to align the addr as it use >> vmf->real_address, so I move the alignment into the >> clear_gigantic_page. > > That sounds good. You may need to revise patch description to describe > why you make the change. May be something like below? > > In current kernel, hugetlb_no_page() calls folio_zero_user() with the > fault address. Where the fault address may be not aligned with the huge > page size. Then, folio_zero_user() may call clear_gigantic_page() with > the address, while clear_gigantic_page() requires the address to be huge > page size aligned. So, this may cause memory corruption or information > leak. OK, will use it and update all patches, thanks. > >>>> for (i = 0; i < nr_pages; i++) { >>>> cond_resched(); >>>> clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE); > > -- > Best Regards, > Huang, Ying >
On 26.10.24 07:43, Kefeng Wang wrote: > When clearing gigantic page, it zeros page from the first page to the > last page, if directly passing addr_hint which maybe not the address > of the first page of folio, then some archs could flush the wrong cache > if it does use the addr_hint as a hint. For non-gigantic page, it > calculates the base address inside, even passed the wrong addr_hint, it > only has performance impact as the process_huge_page() wants to process > target page last to keep its cache lines hot), no functional impact. > > Let's pass the real accessed address to folio_zero_user() and use the > aligned address in clear_gigantic_page() to fix it. > > Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()") > Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> > --- > v2: > - update changelog to clarify the impact, per Andrew > > fs/hugetlbfs/inode.c | 2 +- > mm/memory.c | 1 + > 2 files changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index a4441fb77f7c..a5ea006f403e 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, > error = PTR_ERR(folio); > goto out; > } > - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); > + folio_zero_user(folio, addr); > __folio_mark_uptodate(folio); > error = hugetlb_add_to_page_cache(folio, mapping, index); > if (unlikely(error)) { > diff --git a/mm/memory.c b/mm/memory.c > index 75c2dfd04f72..ef47b7ea5ddd 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr, > int i; > > might_sleep(); > + addr = ALIGN_DOWN(addr, folio_size(folio)); Right, that's what's effectively done in a very bad way in process_huge_page() unsigned long addr = addr_hint & ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); That should all be cleaned up ... process_huge_page() likely shouldn't be even consuming "nr_pages". In clear_gigantic_page(), can you please rename the "unsigned long addr" parameter to unsigned long "addr_hint" and use an additional "unsigned long addr" ?
On 2024/10/28 18:00, David Hildenbrand wrote: > On 26.10.24 07:43, Kefeng Wang wrote: >> When clearing gigantic page, it zeros page from the first page to the >> last page, if directly passing addr_hint which maybe not the address >> of the first page of folio, then some archs could flush the wrong cache >> if it does use the addr_hint as a hint. For non-gigantic page, it >> calculates the base address inside, even passed the wrong addr_hint, it >> only has performance impact as the process_huge_page() wants to process >> target page last to keep its cache lines hot), no functional impact. >> >> Let's pass the real accessed address to folio_zero_user() and use the >> aligned address in clear_gigantic_page() to fix it. >> >> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to >> folio_zero_user()") >> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >> --- >> v2: >> - update changelog to clarify the impact, per Andrew >> >> fs/hugetlbfs/inode.c | 2 +- >> mm/memory.c | 1 + >> 2 files changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >> index a4441fb77f7c..a5ea006f403e 100644 >> --- a/fs/hugetlbfs/inode.c >> +++ b/fs/hugetlbfs/inode.c >> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, >> int mode, loff_t offset, >> error = PTR_ERR(folio); >> goto out; >> } >> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >> + folio_zero_user(folio, addr); >> __folio_mark_uptodate(folio); >> error = hugetlb_add_to_page_cache(folio, mapping, index); >> if (unlikely(error)) { >> diff --git a/mm/memory.c b/mm/memory.c >> index 75c2dfd04f72..ef47b7ea5ddd 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio >> *folio, unsigned long addr, >> int i; >> might_sleep(); >> + addr = ALIGN_DOWN(addr, folio_size(folio)); > > Right, that's what's effectively done in a very bad way in > process_huge_page() > > unsigned long addr = addr_hint & > ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); > > > That should all be cleaned up ... process_huge_page() likely shouldn't Yes, let's fix the bug firstly, > be even consuming "nr_pages". No sure about this part, it uses nr_pages as the end and calculate the 'base'. > > > In clear_gigantic_page(), can you please rename the "unsigned long addr" > parameter to unsigned long "addr_hint" and use an additional "unsigned > long addr" ? > Sure.
On 28.10.24 13:52, Kefeng Wang wrote: > > > On 2024/10/28 18:00, David Hildenbrand wrote: >> On 26.10.24 07:43, Kefeng Wang wrote: >>> When clearing gigantic page, it zeros page from the first page to the >>> last page, if directly passing addr_hint which maybe not the address >>> of the first page of folio, then some archs could flush the wrong cache >>> if it does use the addr_hint as a hint. For non-gigantic page, it >>> calculates the base address inside, even passed the wrong addr_hint, it >>> only has performance impact as the process_huge_page() wants to process >>> target page last to keep its cache lines hot), no functional impact. >>> >>> Let's pass the real accessed address to folio_zero_user() and use the >>> aligned address in clear_gigantic_page() to fix it. >>> >>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to >>> folio_zero_user()") >>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>> --- >>> v2: >>> - update changelog to clarify the impact, per Andrew >>> >>> fs/hugetlbfs/inode.c | 2 +- >>> mm/memory.c | 1 + >>> 2 files changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>> index a4441fb77f7c..a5ea006f403e 100644 >>> --- a/fs/hugetlbfs/inode.c >>> +++ b/fs/hugetlbfs/inode.c >>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, >>> int mode, loff_t offset, >>> error = PTR_ERR(folio); >>> goto out; >>> } >>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >>> + folio_zero_user(folio, addr); >>> __folio_mark_uptodate(folio); >>> error = hugetlb_add_to_page_cache(folio, mapping, index); >>> if (unlikely(error)) { >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 75c2dfd04f72..ef47b7ea5ddd 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio >>> *folio, unsigned long addr, >>> int i; >>> might_sleep(); >>> + addr = ALIGN_DOWN(addr, folio_size(folio)); >> >> Right, that's what's effectively done in a very bad way in >> process_huge_page() >> >> unsigned long addr = addr_hint & >> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); >> >> >> That should all be cleaned up ... process_huge_page() likely shouldn't > > Yes, let's fix the bug firstly, > >> be even consuming "nr_pages". > > No sure about this part, it uses nr_pages as the end and calculate the > 'base'. It should be using folio_nr_pages().
On 2024/10/28 21:14, David Hildenbrand wrote: > On 28.10.24 13:52, Kefeng Wang wrote: >> >> >> On 2024/10/28 18:00, David Hildenbrand wrote: >>> On 26.10.24 07:43, Kefeng Wang wrote: >>>> When clearing gigantic page, it zeros page from the first page to the >>>> last page, if directly passing addr_hint which maybe not the address >>>> of the first page of folio, then some archs could flush the wrong cache >>>> if it does use the addr_hint as a hint. For non-gigantic page, it >>>> calculates the base address inside, even passed the wrong addr_hint, it >>>> only has performance impact as the process_huge_page() wants to process >>>> target page last to keep its cache lines hot), no functional impact. >>>> >>>> Let's pass the real accessed address to folio_zero_user() and use the >>>> aligned address in clear_gigantic_page() to fix it. >>>> >>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to >>>> folio_zero_user()") >>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>> --- >>>> v2: >>>> - update changelog to clarify the impact, per Andrew >>>> >>>> fs/hugetlbfs/inode.c | 2 +- >>>> mm/memory.c | 1 + >>>> 2 files changed, 2 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>>> index a4441fb77f7c..a5ea006f403e 100644 >>>> --- a/fs/hugetlbfs/inode.c >>>> +++ b/fs/hugetlbfs/inode.c >>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, >>>> int mode, loff_t offset, >>>> error = PTR_ERR(folio); >>>> goto out; >>>> } >>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >>>> + folio_zero_user(folio, addr); >>>> __folio_mark_uptodate(folio); >>>> error = hugetlb_add_to_page_cache(folio, mapping, index); >>>> if (unlikely(error)) { >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 75c2dfd04f72..ef47b7ea5ddd 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio >>>> *folio, unsigned long addr, >>>> int i; >>>> might_sleep(); >>>> + addr = ALIGN_DOWN(addr, folio_size(folio)); >>> >>> Right, that's what's effectively done in a very bad way in >>> process_huge_page() >>> >>> unsigned long addr = addr_hint & >>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); >>> >>> >>> That should all be cleaned up ... process_huge_page() likely shouldn't >> >> Yes, let's fix the bug firstly, >> >>> be even consuming "nr_pages". >> >> No sure about this part, it uses nr_pages as the end and calculate the >> 'base'. > > It should be using folio_nr_pages(). But process_huge_page() without an explicit folio argument, I'd like to move the aligned address calculate into the folio_zero_user and copy_user_large_folio(will rename it to folio_copy_user()) in the following cleanup patches, or do it in the fix patches?
On 28.10.24 14:33, Kefeng Wang wrote: > > > On 2024/10/28 21:14, David Hildenbrand wrote: >> On 28.10.24 13:52, Kefeng Wang wrote: >>> >>> >>> On 2024/10/28 18:00, David Hildenbrand wrote: >>>> On 26.10.24 07:43, Kefeng Wang wrote: >>>>> When clearing gigantic page, it zeros page from the first page to the >>>>> last page, if directly passing addr_hint which maybe not the address >>>>> of the first page of folio, then some archs could flush the wrong cache >>>>> if it does use the addr_hint as a hint. For non-gigantic page, it >>>>> calculates the base address inside, even passed the wrong addr_hint, it >>>>> only has performance impact as the process_huge_page() wants to process >>>>> target page last to keep its cache lines hot), no functional impact. >>>>> >>>>> Let's pass the real accessed address to folio_zero_user() and use the >>>>> aligned address in clear_gigantic_page() to fix it. >>>>> >>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to >>>>> folio_zero_user()") >>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>> --- >>>>> v2: >>>>> - update changelog to clarify the impact, per Andrew >>>>> >>>>> fs/hugetlbfs/inode.c | 2 +- >>>>> mm/memory.c | 1 + >>>>> 2 files changed, 2 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>>>> index a4441fb77f7c..a5ea006f403e 100644 >>>>> --- a/fs/hugetlbfs/inode.c >>>>> +++ b/fs/hugetlbfs/inode.c >>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, >>>>> int mode, loff_t offset, >>>>> error = PTR_ERR(folio); >>>>> goto out; >>>>> } >>>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >>>>> + folio_zero_user(folio, addr); >>>>> __folio_mark_uptodate(folio); >>>>> error = hugetlb_add_to_page_cache(folio, mapping, index); >>>>> if (unlikely(error)) { >>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644 >>>>> --- a/mm/memory.c >>>>> +++ b/mm/memory.c >>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio >>>>> *folio, unsigned long addr, >>>>> int i; >>>>> might_sleep(); >>>>> + addr = ALIGN_DOWN(addr, folio_size(folio)); >>>> >>>> Right, that's what's effectively done in a very bad way in >>>> process_huge_page() >>>> >>>> unsigned long addr = addr_hint & >>>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); >>>> >>>> >>>> That should all be cleaned up ... process_huge_page() likely shouldn't >>> >>> Yes, let's fix the bug firstly, >>> >>>> be even consuming "nr_pages". >>> >>> No sure about this part, it uses nr_pages as the end and calculate the >>> 'base'. >> >> It should be using folio_nr_pages(). > > But process_huge_page() without an explicit folio argument, I'd like to > move the aligned address calculate into the folio_zero_user and > copy_user_large_folio(will rename it to folio_copy_user()) in the > following cleanup patches, or do it in the fix patches? First, why does folio_zero_user() call process_huge_page() for *a small folio*? Because we like or code to be extra complicated to understand? Or am I missing something important? Second, we should be passing the folio to "process_huge_page" and likely rename it to "folio_process_pages()" or sth like that. The function even documents "of the specified huge page", but there is none specified. The copy case might require a rework. I think this code needs a serious cleanup ...
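[A rough sketch of the folio-based interface being suggested here. The name folio_process_pages() and this signature are hypothetical, and this version simply walks the pages sequentially rather than reproducing the hot-target ordering; it only illustrates deriving nr_pages and the base address from the folio itself.]

static int folio_process_pages(struct folio *folio, unsigned long addr_hint,
			       int (*process_page)(unsigned long addr, int idx, void *arg),
			       void *arg)
{
	unsigned int i, nr_pages = folio_nr_pages(folio);
	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
	int ret;

	might_sleep();
	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		ret = process_page(addr + i * PAGE_SIZE, i, arg);
		if (ret)
			return ret;
	}
	return 0;
}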
On 2024/10/28 21:46, David Hildenbrand wrote: > On 28.10.24 14:33, Kefeng Wang wrote: >> >> >> On 2024/10/28 21:14, David Hildenbrand wrote: >>> On 28.10.24 13:52, Kefeng Wang wrote: >>>> >>>> >>>> On 2024/10/28 18:00, David Hildenbrand wrote: >>>>> On 26.10.24 07:43, Kefeng Wang wrote: >>>>>> When clearing gigantic page, it zeros page from the first page to the >>>>>> last page, if directly passing addr_hint which maybe not the address >>>>>> of the first page of folio, then some archs could flush the wrong >>>>>> cache >>>>>> if it does use the addr_hint as a hint. For non-gigantic page, it >>>>>> calculates the base address inside, even passed the wrong >>>>>> addr_hint, it >>>>>> only has performance impact as the process_huge_page() wants to >>>>>> process >>>>>> target page last to keep its cache lines hot), no functional impact. >>>>>> >>>>>> Let's pass the real accessed address to folio_zero_user() and use the >>>>>> aligned address in clear_gigantic_page() to fix it. >>>>>> >>>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to >>>>>> folio_zero_user()") >>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>>> --- >>>>>> v2: >>>>>> - update changelog to clarify the impact, per Andrew >>>>>> >>>>>> fs/hugetlbfs/inode.c | 2 +- >>>>>> mm/memory.c | 1 + >>>>>> 2 files changed, 2 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>>>>> index a4441fb77f7c..a5ea006f403e 100644 >>>>>> --- a/fs/hugetlbfs/inode.c >>>>>> +++ b/fs/hugetlbfs/inode.c >>>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file >>>>>> *file, >>>>>> int mode, loff_t offset, >>>>>> error = PTR_ERR(folio); >>>>>> goto out; >>>>>> } >>>>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >>>>>> + folio_zero_user(folio, addr); >>>>>> __folio_mark_uptodate(folio); >>>>>> error = hugetlb_add_to_page_cache(folio, mapping, index); >>>>>> if (unlikely(error)) { >>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644 >>>>>> --- a/mm/memory.c >>>>>> +++ b/mm/memory.c >>>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio >>>>>> *folio, unsigned long addr, >>>>>> int i; >>>>>> might_sleep(); >>>>>> + addr = ALIGN_DOWN(addr, folio_size(folio)); >>>>> >>>>> Right, that's what's effectively done in a very bad way in >>>>> process_huge_page() >>>>> >>>>> unsigned long addr = addr_hint & >>>>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); >>>>> >>>>> >>>>> That should all be cleaned up ... process_huge_page() likely shouldn't >>>> >>>> Yes, let's fix the bug firstly, >>>> >>>>> be even consuming "nr_pages". >>>> >>>> No sure about this part, it uses nr_pages as the end and calculate the >>>> 'base'. >>> >>> It should be using folio_nr_pages(). >> >> But process_huge_page() without an explicit folio argument, I'd like to >> move the aligned address calculate into the folio_zero_user and >> copy_user_large_folio(will rename it to folio_copy_user()) in the >> following cleanup patches, or do it in the fix patches? > > First, why does folio_zero_user() call process_huge_page() for *a small > folio*? Because we like or code to be extra complicated to understand? > Or am I missing something important? The folio_zero_user() used for PMD-sized THP and HugeTLB before, and after anon mTHP supported, it is used for order-2~order-PMD-order THP and HugeTLB, so it won't process a small folio if I understand correctly. 
> > Second, we should be passing the folio to "process_huge_page" and likely > rename it to "folio_process_pages()" or sth like that. The function even > documents "of the specified huge page", but there is none specified. The > copy case might require a rework. > > I think this code needs a serious cleanup ... > Yes, I'd like to do more cleanup and rework them, also I find some performance issue on my arm64 machine[1] which need to be addressed. [1] https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
On 28.10.24 15:22, Kefeng Wang wrote: > > > On 2024/10/28 21:46, David Hildenbrand wrote: >> On 28.10.24 14:33, Kefeng Wang wrote: >>> >>> >>> On 2024/10/28 21:14, David Hildenbrand wrote: >>>> On 28.10.24 13:52, Kefeng Wang wrote: >>>>> >>>>> >>>>> On 2024/10/28 18:00, David Hildenbrand wrote: >>>>>> On 26.10.24 07:43, Kefeng Wang wrote: >>>>>>> When clearing gigantic page, it zeros page from the first page to the >>>>>>> last page, if directly passing addr_hint which maybe not the address >>>>>>> of the first page of folio, then some archs could flush the wrong >>>>>>> cache >>>>>>> if it does use the addr_hint as a hint. For non-gigantic page, it >>>>>>> calculates the base address inside, even passed the wrong >>>>>>> addr_hint, it >>>>>>> only has performance impact as the process_huge_page() wants to >>>>>>> process >>>>>>> target page last to keep its cache lines hot), no functional impact. >>>>>>> >>>>>>> Let's pass the real accessed address to folio_zero_user() and use the >>>>>>> aligned address in clear_gigantic_page() to fix it. >>>>>>> >>>>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to >>>>>>> folio_zero_user()") >>>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>>>> --- >>>>>>> v2: >>>>>>> - update changelog to clarify the impact, per Andrew >>>>>>> >>>>>>> fs/hugetlbfs/inode.c | 2 +- >>>>>>> mm/memory.c | 1 + >>>>>>> 2 files changed, 2 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >>>>>>> index a4441fb77f7c..a5ea006f403e 100644 >>>>>>> --- a/fs/hugetlbfs/inode.c >>>>>>> +++ b/fs/hugetlbfs/inode.c >>>>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file >>>>>>> *file, >>>>>>> int mode, loff_t offset, >>>>>>> error = PTR_ERR(folio); >>>>>>> goto out; >>>>>>> } >>>>>>> - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); >>>>>>> + folio_zero_user(folio, addr); >>>>>>> __folio_mark_uptodate(folio); >>>>>>> error = hugetlb_add_to_page_cache(folio, mapping, index); >>>>>>> if (unlikely(error)) { >>>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644 >>>>>>> --- a/mm/memory.c >>>>>>> +++ b/mm/memory.c >>>>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio >>>>>>> *folio, unsigned long addr, >>>>>>> int i; >>>>>>> might_sleep(); >>>>>>> + addr = ALIGN_DOWN(addr, folio_size(folio)); >>>>>> >>>>>> Right, that's what's effectively done in a very bad way in >>>>>> process_huge_page() >>>>>> >>>>>> unsigned long addr = addr_hint & >>>>>> ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); >>>>>> >>>>>> >>>>>> That should all be cleaned up ... process_huge_page() likely shouldn't >>>>> >>>>> Yes, let's fix the bug firstly, >>>>> >>>>>> be even consuming "nr_pages". >>>>> >>>>> No sure about this part, it uses nr_pages as the end and calculate the >>>>> 'base'. >>>> >>>> It should be using folio_nr_pages(). >>> >>> But process_huge_page() without an explicit folio argument, I'd like to >>> move the aligned address calculate into the folio_zero_user and >>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>> following cleanup patches, or do it in the fix patches? >> >> First, why does folio_zero_user() call process_huge_page() for *a small >> folio*? Because we like or code to be extra complicated to understand? >> Or am I missing something important? 
> > The folio_zero_user() used for PMD-sized THP and HugeTLB before, and > after anon mTHP supported, it is used for order-2~order-PMD-order THP > and HugeTLB, so it won't process a small folio if I understand correctly. And unfortunately neither the documentation nor the function name expresses that :( I'm happy to review any patches that improve the situation here :)
>>>>>>> >>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>> shouldn't >>>>>> >>>>>> Yes, let's fix the bug firstly, >>>>>> >>>>>>> be even consuming "nr_pages". >>>>>> >>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>> the >>>>>> 'base'. >>>>> >>>>> It should be using folio_nr_pages(). >>>> >>>> But process_huge_page() without an explicit folio argument, I'd like to >>>> move the aligned address calculate into the folio_zero_user and >>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>> following cleanup patches, or do it in the fix patches? >>> >>> First, why does folio_zero_user() call process_huge_page() for *a small >>> folio*? Because we like or code to be extra complicated to understand? >>> Or am I missing something important? >> >> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >> after anon mTHP supported, it is used for order-2~order-PMD-order THP >> and HugeTLB, so it won't process a small folio if I understand correctly. > > And unfortunately neither the documentation nor the function name > expresses that :( > > I'm happy to review any patches that improve the situation here :) > Actually, could we drop the process_huge_page() totally, from my testcase[1], process_huge_page() is not better than clear/copy page from start to last, and sequential clearing/copying maybe more beneficial to the hardware prefetching, and is there a way to let lkp to test to check the performance, since the process_huge_page() was submitted by Ying, what's your opinion? [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/ case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP) case-anon-w-seq-hugetlb (2M PMD HugeTLB) fallocate hugetlb 20G (2M PMD HugeTLB) fallocate tmpfs 20G (2M PMD THP)
On 29.10.24 14:04, Kefeng Wang wrote: >>>>>>>> >>>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>>> shouldn't >>>>>>> >>>>>>> Yes, let's fix the bug firstly, >>>>>>> >>>>>>>> be even consuming "nr_pages". >>>>>>> >>>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>>> the >>>>>>> 'base'. >>>>>> >>>>>> It should be using folio_nr_pages(). >>>>> >>>>> But process_huge_page() without an explicit folio argument, I'd like to >>>>> move the aligned address calculate into the folio_zero_user and >>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>>> following cleanup patches, or do it in the fix patches? >>>> >>>> First, why does folio_zero_user() call process_huge_page() for *a small >>>> folio*? Because we like or code to be extra complicated to understand? >>>> Or am I missing something important? >>> >>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >>> after anon mTHP supported, it is used for order-2~order-PMD-order THP >>> and HugeTLB, so it won't process a small folio if I understand correctly. >> >> And unfortunately neither the documentation nor the function name >> expresses that :( >> >> I'm happy to review any patches that improve the situation here :) >> > > Actually, could we drop the process_huge_page() totally, from my > testcase[1], process_huge_page() is not better than clear/copy page > from start to last, and sequential clearing/copying maybe more > beneficial to the hardware prefetching, and is there a way to let lkp > to test to check the performance, since the process_huge_page() > was submitted by Ying, what's your opinion? I questioned that just recently [1], and Ying assumed that it still applies [2]. c79b57e462b5 ("mm: hugetlb: clear target sub-page last when clearing huge page”) documents the scenario where this matters -- anon-w-seq which you also run below. If there is no performance benefit anymore, we should rip that out. But likely we should check on multiple micro-architectures with multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5 v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment cannot replicate that? [1] https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/ [2] https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/ > > > [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/ > case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP) > case-anon-w-seq-hugetlb (2M PMD HugeTLB) But these are sequential, not random. I'd have thought access + zeroing would be sequentially either way. Did you run with random access as well>
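[For readers following along, the ordering introduced by c79b57e462b5 and debated here can be reproduced with a small userspace simulation: subpages far from the faulting one are processed first, then the remaining ones alternate left/right so the target subpage is touched last and its cache lines stay hot. The index arithmetic mirrors process_huge_page(); the page counts are arbitrary examples.]

#include <stdio.h>

/* Print the order in which process_huge_page() would visit subpages. */
static void dump_order(int nr_pages, int target)
{
	int i, base, l;

	printf("target %d:", target);
	if (2 * target <= nr_pages) {
		/* Target in the first half: process the tail first. */
		base = 0;
		l = target;
		for (i = nr_pages - 1; i >= 2 * target; i--)
			printf(" %d", i);
	} else {
		/* Target in the second half: process the head first. */
		base = nr_pages - 2 * (nr_pages - target);
		l = nr_pages - target;
		for (i = 0; i < base; i++)
			printf(" %d", i);
	}
	/* Converge on the target from both sides; it comes out last. */
	for (i = 0; i < l; i++)
		printf(" %d %d", base + i, base + 2 * l - 1 - i);
	printf("\n");
}

int main(void)
{
	dump_order(8, 0);	/* prints: 7 6 5 4 3 2 1 0 */
	dump_order(8, 5);	/* prints: 0 1 2 7 3 6 4 5 */
	return 0;
}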
David Hildenbrand <david@redhat.com> writes: > On 29.10.24 14:04, Kefeng Wang wrote: >>>>>>>>> >>>>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>>>> shouldn't >>>>>>>> >>>>>>>> Yes, let's fix the bug firstly, >>>>>>>> >>>>>>>>> be even consuming "nr_pages". >>>>>>>> >>>>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>>>> the >>>>>>>> 'base'. >>>>>>> >>>>>>> It should be using folio_nr_pages(). >>>>>> >>>>>> But process_huge_page() without an explicit folio argument, I'd like to >>>>>> move the aligned address calculate into the folio_zero_user and >>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>>>> following cleanup patches, or do it in the fix patches? >>>>> >>>>> First, why does folio_zero_user() call process_huge_page() for *a small >>>>> folio*? Because we like or code to be extra complicated to understand? >>>>> Or am I missing something important? >>>> >>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP >>>> and HugeTLB, so it won't process a small folio if I understand correctly. >>> >>> And unfortunately neither the documentation nor the function name >>> expresses that :( >>> >>> I'm happy to review any patches that improve the situation here :) >>> >> Actually, could we drop the process_huge_page() totally, from my >> testcase[1], process_huge_page() is not better than clear/copy page >> from start to last, and sequential clearing/copying maybe more >> beneficial to the hardware prefetching, and is there a way to let lkp >> to test to check the performance, since the process_huge_page() >> was submitted by Ying, what's your opinion? I don't think that it's a good idea to revert the commit without studying and root causing the issues. I can work together with you on that. If we have solid and well explained data to prove process_huge_page() isn't benefitial, we can revert the commit. > I questioned that just recently [1], and Ying assumed that it still > applies [2]. > > c79b57e462b5 ("mm: hugetlb: clear target > sub-page last when clearing huge page”) documents the scenario where > this matters -- anon-w-seq which you also run below. > > If there is no performance benefit anymore, we should rip that > out. But likely we should check on multiple micro-architectures with > multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5 > v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment > cannot replicate that? > > > [1] > https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/ > [2] > https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/ > >> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/ >> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP) >> case-anon-w-seq-hugetlb (2M PMD HugeTLB) > > But these are sequential, not random. I'd have thought access + > zeroing would be sequentially either way. Did you run with random > access as well> -- Best Regards, Huang, Ying
On 2024/10/30 9:04, Huang, Ying wrote: > David Hildenbrand <david@redhat.com> writes: > >> On 29.10.24 14:04, Kefeng Wang wrote: >>>>>>>>>> >>>>>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>>>>> shouldn't >>>>>>>>> >>>>>>>>> Yes, let's fix the bug firstly, >>>>>>>>> >>>>>>>>>> be even consuming "nr_pages". >>>>>>>>> >>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>>>>> the >>>>>>>>> 'base'. >>>>>>>> >>>>>>>> It should be using folio_nr_pages(). >>>>>>> >>>>>>> But process_huge_page() without an explicit folio argument, I'd like to >>>>>>> move the aligned address calculate into the folio_zero_user and >>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>>>>> following cleanup patches, or do it in the fix patches? >>>>>> >>>>>> First, why does folio_zero_user() call process_huge_page() for *a small >>>>>> folio*? Because we like or code to be extra complicated to understand? >>>>>> Or am I missing something important? >>>>> >>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP >>>>> and HugeTLB, so it won't process a small folio if I understand correctly. >>>> >>>> And unfortunately neither the documentation nor the function name >>>> expresses that :( >>>> >>>> I'm happy to review any patches that improve the situation here :) >>>> >>> Actually, could we drop the process_huge_page() totally, from my >>> testcase[1], process_huge_page() is not better than clear/copy page >>> from start to last, and sequential clearing/copying maybe more >>> beneficial to the hardware prefetching, and is there a way to let lkp >>> to test to check the performance, since the process_huge_page() >>> was submitted by Ying, what's your opinion? > > I don't think that it's a good idea to revert the commit without > studying and root causing the issues. I can work together with you on > that. If we have solid and well explained data to prove > process_huge_page() isn't benefitial, we can revert the commit. 
Take 'fallocate 20G' as an example. Before:

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,118.94 msec task-clock               #    0.999 CPUs utilized
                30      context-switches         #    0.010 K/sec
                 1      cpu-migrations           #    0.000 K/sec
               136      page-faults              #    0.044 K/sec
     8,092,075,873      cycles                   #    2.594 GHz                      (92.82%)
     1,624,587,663      instructions             #    0.20  insn per cycle           (92.83%)
       395,341,850      branches                 #  126.755 M/sec                    (92.82%)
         3,872,302      branch-misses            #    0.98% of all branches          (92.83%)
     1,398,066,701      L1-dcache-loads          #  448.251 M/sec                    (92.82%)
        58,124,626      L1-dcache-load-misses    #    4.16% of all L1-dcache accesses (92.82%)
         1,032,527      LLC-loads                #    0.331 M/sec                    (92.82%)
           498,684      LLC-load-misses          #   48.30% of all LL-cache accesses  (92.84%)
       473,689,004      L1-icache-loads          #  151.875 M/sec                    (92.82%)
           356,721      L1-icache-load-misses    #    0.08% of all L1-icache accesses (92.85%)
     1,947,644,987      dTLB-loads               #  624.458 M/sec                    (92.95%)
            10,185      dTLB-load-misses         #    0.00% of all dTLB cache accesses (92.96%)
       474,622,896      iTLB-loads               #  152.174 M/sec                    (92.95%)
                94      iTLB-load-misses         #    0.00% of all iTLB cache accesses (85.69%)

       3.122844830 seconds time elapsed

       0.000000000 seconds user
       3.107259000 seconds sys

and after (clear from start to end):

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          1,135.53 msec task-clock               #    0.999 CPUs utilized
                10      context-switches         #    0.009 K/sec
                 1      cpu-migrations           #    0.001 K/sec
               137      page-faults              #    0.121 K/sec
     2,946,673,587      cycles                   #    2.595 GHz                      (92.67%)
     1,620,704,205      instructions             #    0.55  insn per cycle           (92.61%)
       394,595,772      branches                 #  347.499 M/sec                    (92.60%)
           130,756      branch-misses            #    0.03% of all branches          (92.84%)
     1,396,726,689      L1-dcache-loads          # 1230.022 M/sec                    (92.96%)
           338,344      L1-dcache-load-misses    #    0.02% of all L1-dcache accesses (92.95%)
           111,737      LLC-loads                #    0.098 M/sec                    (92.96%)
            67,486      LLC-load-misses          #   60.40% of all LL-cache accesses  (92.96%)
       418,198,663      L1-icache-loads          #  368.285 M/sec                    (92.96%)
           173,764      L1-icache-load-misses    #    0.04% of all L1-icache accesses (92.96%)
     2,203,364,632      dTLB-loads               # 1940.385 M/sec                    (92.96%)
            17,195      dTLB-load-misses         #    0.00% of all dTLB cache accesses (92.95%)
       418,198,365      iTLB-loads               #  368.285 M/sec                    (92.96%)
                79      iTLB-load-misses         #    0.00% of all iTLB cache accesses (85.34%)

       1.137015760 seconds time elapsed

       0.000000000 seconds user
       1.131266000 seconds sys

The IPC improved a lot, with fewer LLC-loads and more L1-dcache-loads, but
this depends on the implementation of the microarchitecture.

1) Will run some random-access tests to check the performance difference,
as David suggested.

2) Hope LKP can run more tests, since it is very useful (more test sets
and different machines).

>
>> I questioned that just recently [1], and Ying assumed that it still
>> applies [2].
>>
>> c79b57e462b5 ("mm: hugetlb: clear target sub-page last when clearing
>> huge page") documents the scenario where this matters -- anon-w-seq
>> which you also run below.
>>
>> If there is no performance benefit anymore, we should rip that out.
>> But likely we should check on multiple micro-architectures with
>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
>> cannot replicate that?
>>
>> [1] https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
>> [2] https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>
>>> [1] https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>>
>> But these are sequential, not random. I'd have thought access +
>> zeroing would be sequentially either way. Did you run with random
>> access as well?

Will do.

> --
> Best Regards,
> Huang, Ying
Kefeng Wang <wangkefeng.wang@huawei.com> writes: > On 2024/10/30 9:04, Huang, Ying wrote: >> David Hildenbrand <david@redhat.com> writes: >> >>> On 29.10.24 14:04, Kefeng Wang wrote: >>>>>>>>>>> >>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>>>>>> shouldn't >>>>>>>>>> >>>>>>>>>> Yes, let's fix the bug firstly, >>>>>>>>>> >>>>>>>>>>> be even consuming "nr_pages". >>>>>>>>>> >>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>>>>>> the >>>>>>>>>> 'base'. >>>>>>>>> >>>>>>>>> It should be using folio_nr_pages(). >>>>>>>> >>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to >>>>>>>> move the aligned address calculate into the folio_zero_user and >>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>>>>>> following cleanup patches, or do it in the fix patches? >>>>>>> >>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small >>>>>>> folio*? Because we like or code to be extra complicated to understand? >>>>>>> Or am I missing something important? >>>>>> >>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP >>>>>> and HugeTLB, so it won't process a small folio if I understand correctly. >>>>> >>>>> And unfortunately neither the documentation nor the function name >>>>> expresses that :( >>>>> >>>>> I'm happy to review any patches that improve the situation here :) >>>>> >>>> Actually, could we drop the process_huge_page() totally, from my >>>> testcase[1], process_huge_page() is not better than clear/copy page >>>> from start to last, and sequential clearing/copying maybe more >>>> beneficial to the hardware prefetching, and is there a way to let lkp >>>> to test to check the performance, since the process_huge_page() >>>> was submitted by Ying, what's your opinion? >> I don't think that it's a good idea to revert the commit without >> studying and root causing the issues. I can work together with you on >> that. If we have solid and well explained data to prove >> process_huge_page() isn't benefitial, we can revert the commit. > > > Take 'fallocate 20G' as an example, before > > Performance counter stats for 'taskset -c 10 fallocate -l 20G > /mnt/hugetlbfs/test': IIUC, fallocate will zero pages, but will not touch them at all, right? If so, no cache benefit from clearing referenced page last. 
> 3,118.94 msec task-clock # 0.999 CPUs > utilized > 30 context-switches # 0.010 K/sec > 1 cpu-migrations # 0.000 K/sec > 136 page-faults # 0.044 K/sec > 8,092,075,873 cycles # > 2.594 GHz (92.82%) > 1,624,587,663 instructions # 0.20 insn per > cycle (92.83%) > 395,341,850 branches # 126.755 M/sec > (92.82%) > 3,872,302 branch-misses # 0.98% of all > branches (92.83%) > 1,398,066,701 L1-dcache-loads # 448.251 M/sec > (92.82%) > 58,124,626 L1-dcache-load-misses # 4.16% of all > L1-dcache accesses (92.82%) > 1,032,527 LLC-loads # 0.331 M/sec > (92.82%) > 498,684 LLC-load-misses # 48.30% of all > LL-cache accesses (92.84%) > 473,689,004 L1-icache-loads # 151.875 M/sec > (92.82%) > 356,721 L1-icache-load-misses # 0.08% of all > L1-icache accesses (92.85%) > 1,947,644,987 dTLB-loads # 624.458 M/sec > (92.95%) > 10,185 dTLB-load-misses # 0.00% of all > dTLB cache accesses (92.96%) > 474,622,896 iTLB-loads # 152.174 M/sec > (92.95%) > 94 iTLB-load-misses # 0.00% of all > iTLB cache accesses (85.69%) > > 3.122844830 seconds time elapsed > > 0.000000000 seconds user > 3.107259000 seconds sys > > and after(clear from start to end) > > Performance counter stats for 'taskset -c 10 fallocate -l 20G > /mnt/hugetlbfs/test': > > 1,135.53 msec task-clock # 0.999 CPUs > utilized > 10 context-switches # 0.009 K/sec > 1 cpu-migrations # 0.001 K/sec > 137 page-faults # 0.121 K/sec > 2,946,673,587 cycles # > 2.595 GHz (92.67%) > 1,620,704,205 instructions # 0.55 insn per > cycle (92.61%) > 394,595,772 branches # 347.499 M/sec > (92.60%) > 130,756 branch-misses # 0.03% of all > branches (92.84%) > 1,396,726,689 L1-dcache-loads # 1230.022 M/sec > (92.96%) > 338,344 L1-dcache-load-misses # 0.02% of all > L1-dcache accesses (92.95%) > 111,737 LLC-loads # 0.098 M/sec > (92.96%) > 67,486 LLC-load-misses # 60.40% of all > LL-cache accesses (92.96%) > 418,198,663 L1-icache-loads # 368.285 M/sec > (92.96%) > 173,764 L1-icache-load-misses # 0.04% of all > L1-icache accesses (92.96%) > 2,203,364,632 dTLB-loads # 1940.385 M/sec > (92.96%) > 17,195 dTLB-load-misses # 0.00% of all > dTLB cache accesses (92.95%) > 418,198,365 iTLB-loads # 368.285 M/sec > (92.96%) > 79 iTLB-load-misses # 0.00% of all > iTLB cache accesses (85.34%) > > 1.137015760 seconds time elapsed > > 0.000000000 seconds user > 1.131266000 seconds sys > > The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but > this depends on the implementation of the microarchitecture. Anyway, we need to avoid (or reduce at least) the pure memory clearing performance. Have you double checked whether process_huge_page() is inlined? Perf-profile result can be used to check this too. When you say from start to end, you mean to use clear_gigantic_page() directly, or change process_huge_page() to clear page from start to end? > 1) Will test some rand test to check the different of performance as > David suggested. > > 2) Hope the LKP to run more tests since it is very useful(more test > set and different machines) I'm starting to use LKP to test. -- Best Regards, Huang, Ying > >> >>> I questioned that just recently [1], and Ying assumed that it still >>> applies [2]. >>> >>> c79b57e462b5 ("mm: hugetlb: clear target >>> sub-page last when clearing huge page”) documents the scenario where >>> this matters -- anon-w-seq which you also run below. >>> >>> If there is no performance benefit anymore, we should rip that >>> out. But likely we should check on multiple micro-architectures with >>> multiple #CPU configs that are relevant. 
c79b57e462b5 used a Xeon E5 >>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment >>> cannot replicate that?>> >>> >>> [1] >>> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/ >>> [2] >>> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/ >>> >>>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/ >>>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP) >>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB) >>> >>> But these are sequential, not random. I'd have thought access + >>> zeroing would be sequentially either way. Did you run with random >>> access as well> > > Will do. >> > -- >> Best Regards, >> Huang, Ying >>
On 2024/10/30 11:21, Huang, Ying wrote: > Kefeng Wang <wangkefeng.wang@huawei.com> writes: > >> On 2024/10/30 9:04, Huang, Ying wrote: >>> David Hildenbrand <david@redhat.com> writes: >>> >>>> On 29.10.24 14:04, Kefeng Wang wrote: >>>>>>>>>>>> >>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>>>>>>> shouldn't >>>>>>>>>>> >>>>>>>>>>> Yes, let's fix the bug firstly, >>>>>>>>>>> >>>>>>>>>>>> be even consuming "nr_pages". >>>>>>>>>>> >>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>>>>>>> the >>>>>>>>>>> 'base'. >>>>>>>>>> >>>>>>>>>> It should be using folio_nr_pages(). >>>>>>>>> >>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to >>>>>>>>> move the aligned address calculate into the folio_zero_user and >>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>>>>>>> following cleanup patches, or do it in the fix patches? >>>>>>>> >>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small >>>>>>>> folio*? Because we like or code to be extra complicated to understand? >>>>>>>> Or am I missing something important? >>>>>>> >>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP >>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly. >>>>>> >>>>>> And unfortunately neither the documentation nor the function name >>>>>> expresses that :( >>>>>> >>>>>> I'm happy to review any patches that improve the situation here :) >>>>>> >>>>> Actually, could we drop the process_huge_page() totally, from my >>>>> testcase[1], process_huge_page() is not better than clear/copy page >>>>> from start to last, and sequential clearing/copying maybe more >>>>> beneficial to the hardware prefetching, and is there a way to let lkp >>>>> to test to check the performance, since the process_huge_page() >>>>> was submitted by Ying, what's your opinion? >>> I don't think that it's a good idea to revert the commit without >>> studying and root causing the issues. I can work together with you on >>> that. If we have solid and well explained data to prove >>> process_huge_page() isn't benefitial, we can revert the commit. >> >> >> Take 'fallocate 20G' as an example, before >> >> Performance counter stats for 'taskset -c 10 fallocate -l 20G >> /mnt/hugetlbfs/test': > > IIUC, fallocate will zero pages, but will not touch them at all, right? > If so, no cache benefit from clearing referenced page last. Yes, for this case, only clear page. 
> >> 3,118.94 msec task-clock # 0.999 CPUs >> utilized >> 30 context-switches # 0.010 K/sec >> 1 cpu-migrations # 0.000 K/sec >> 136 page-faults # 0.044 K/sec >> 8,092,075,873 cycles # >> 2.594 GHz (92.82%) >> 1,624,587,663 instructions # 0.20 insn per >> cycle (92.83%) >> 395,341,850 branches # 126.755 M/sec >> (92.82%) >> 3,872,302 branch-misses # 0.98% of all >> branches (92.83%) >> 1,398,066,701 L1-dcache-loads # 448.251 M/sec >> (92.82%) >> 58,124,626 L1-dcache-load-misses # 4.16% of all >> L1-dcache accesses (92.82%) >> 1,032,527 LLC-loads # 0.331 M/sec >> (92.82%) >> 498,684 LLC-load-misses # 48.30% of all >> LL-cache accesses (92.84%) >> 473,689,004 L1-icache-loads # 151.875 M/sec >> (92.82%) >> 356,721 L1-icache-load-misses # 0.08% of all >> L1-icache accesses (92.85%) >> 1,947,644,987 dTLB-loads # 624.458 M/sec >> (92.95%) >> 10,185 dTLB-load-misses # 0.00% of all >> dTLB cache accesses (92.96%) >> 474,622,896 iTLB-loads # 152.174 M/sec >> (92.95%) >> 94 iTLB-load-misses # 0.00% of all >> iTLB cache accesses (85.69%) >> >> 3.122844830 seconds time elapsed >> >> 0.000000000 seconds user >> 3.107259000 seconds sys >> >> and after(clear from start to end) >> >> Performance counter stats for 'taskset -c 10 fallocate -l 20G >> /mnt/hugetlbfs/test': >> >> 1,135.53 msec task-clock # 0.999 CPUs >> utilized >> 10 context-switches # 0.009 K/sec >> 1 cpu-migrations # 0.001 K/sec >> 137 page-faults # 0.121 K/sec >> 2,946,673,587 cycles # >> 2.595 GHz (92.67%) >> 1,620,704,205 instructions # 0.55 insn per >> cycle (92.61%) >> 394,595,772 branches # 347.499 M/sec >> (92.60%) >> 130,756 branch-misses # 0.03% of all >> branches (92.84%) >> 1,396,726,689 L1-dcache-loads # 1230.022 M/sec >> (92.96%) >> 338,344 L1-dcache-load-misses # 0.02% of all >> L1-dcache accesses (92.95%) >> 111,737 LLC-loads # 0.098 M/sec >> (92.96%) >> 67,486 LLC-load-misses # 60.40% of all >> LL-cache accesses (92.96%) >> 418,198,663 L1-icache-loads # 368.285 M/sec >> (92.96%) >> 173,764 L1-icache-load-misses # 0.04% of all >> L1-icache accesses (92.96%) >> 2,203,364,632 dTLB-loads # 1940.385 M/sec >> (92.96%) >> 17,195 dTLB-load-misses # 0.00% of all >> dTLB cache accesses (92.95%) >> 418,198,365 iTLB-loads # 368.285 M/sec >> (92.96%) >> 79 iTLB-load-misses # 0.00% of all >> iTLB cache accesses (85.34%) >> >> 1.137015760 seconds time elapsed >> >> 0.000000000 seconds user >> 1.131266000 seconds sys >> >> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but >> this depends on the implementation of the microarchitecture. > > Anyway, we need to avoid (or reduce at least) the pure memory clearing > performance. Have you double checked whether process_huge_page() is > inlined? Perf-profile result can be used to check this too. > Yes, I'm sure the process_huge_page() is inlined. > When you say from start to end, you mean to use clear_gigantic_page() > directly, or change process_huge_page() to clear page from start to end? > Using clear_gigantic_page() and changing process_huge_page() to clear page from start to end are both good for performance when sequential clearing, but no random test so far. >> 1) Will test some rand test to check the different of performance as >> David suggested. >> >> 2) Hope the LKP to run more tests since it is very useful(more test >> set and different machines) > > I'm starting to use LKP to test. Greet.
Kefeng Wang <wangkefeng.wang@huawei.com> writes: [snip] > >>> 1) Will test some rand test to check the different of performance as >>> David suggested. >>> >>> 2) Hope the LKP to run more tests since it is very useful(more test >>> set and different machines) >> I'm starting to use LKP to test. > > Greet. I have run some tests with LKP to test. Firstly, there's almost no measurable difference between clearing pages from start to end or from end to start on Intel server CPU. I guess that there's some similar optimization for both direction. For multiple processes (same as logical CPU number) vm-scalability/anon-w-seq test case, the benchmark score increases about 22.4%. For multiple processes vm-scalability/anon-w-rand test case, no measurable difference for benchmark score. So, the optimization helps sequential workload mainly. In summary, on x86, process_huge_page() will not introduce any regression. And it helps some workload. However, on ARM64, it does introduce some regression for clearing pages from end to start. That needs to be addressed. I guess that the regression can be resolved via using more clearing from start to end (but not all). For example, can you take a look at the patch below? Which uses the similar framework as before, but clear each small trunk (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when the regression can be restored. WARNING: the patch is only build tested. Best Regards, Huang, Ying -----------------------------------8<---------------------------------------- From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001 From: Huang Ying <ying.huang@intel.com> Date: Thu, 31 Oct 2024 11:13:57 +0800 Subject: [PATCH] mpage clear --- mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 67 insertions(+), 3 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 3ccee51adfbb..1fdc548c4275 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6769,6 +6769,68 @@ static inline int process_huge_page( return 0; } +#define MPAGE_NRPAGES (1<<4) +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES) +static inline int clear_huge_page( + unsigned long addr_hint, unsigned int nr_pages, + int (*process_subpage)(unsigned long addr, int idx, void *arg), + void *arg) +{ + int i, n, base, l, ret; + unsigned long addr = addr_hint & + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE; + + /* Process target subpage last to keep its cache lines hot */ + might_sleep(); + n = (addr_hint - addr) / MPAGE_SIZE; + if (2 * n <= nr_mpages) { + /* If target subpage in first half of huge page */ + base = 0; + l = n; + /* Process subpages at the end of huge page */ + for (i = nr_mpages - 1; i >= 2 * n; i--) { + cond_resched(); + ret = process_subpage(addr + i * MPAGE_SIZE, + i * MPAGE_NRPAGES, arg); + if (ret) + return ret; + } + } else { + /* If target subpage in second half of huge page */ + base = nr_mpages - 2 * (nr_mpages - n); + l = nr_mpages - n; + /* Process subpages at the begin of huge page */ + for (i = 0; i < base; i++) { + cond_resched(); + ret = process_subpage(addr + i * MPAGE_SIZE, + i * MPAGE_NRPAGES, arg); + if (ret) + return ret; + } + } + /* + * Process remaining subpages in left-right-left-right pattern + * towards the target subpage + */ + for (i = 0; i < l; i++) { + int left_idx = base + i; + int right_idx = base + 2 * l - 1 - i; + + cond_resched(); + ret = process_subpage(addr + left_idx * MPAGE_SIZE, + left_idx * MPAGE_NRPAGES, arg); + if (ret) 
+			return ret;
+		cond_resched();
+		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
+				      right_idx * MPAGE_NRPAGES, arg);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
 static void clear_gigantic_page(struct folio *folio, unsigned long addr,
 				unsigned int nr_pages)
 {
@@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
 static int clear_subpage(unsigned long addr, int idx, void *arg)
 {
 	struct folio *folio = arg;
+	int i;
 
-	clear_user_highpage(folio_page(folio, idx), addr);
+	for (i = 0; i < MPAGE_NRPAGES; i++)
+		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
 	return 0;
 }
 
@@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
 	unsigned int nr_pages = folio_nr_pages(folio);
 
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
+	if (unlikely(nr_pages != HPAGE_PMD_NR))
 		clear_gigantic_page(folio, addr_hint, nr_pages);
 	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
 }
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
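For reference, the chunk geometry this prototype implies, assuming 4 KiB base pages (the page size and the tiny program below are illustration-only assumptions, not part of the patch): MPAGE_NRPAGES=16 gives 64 KiB chunks, so a 2 MiB PMD-sized folio is walked as 32 mpages, while MPAGE_NRPAGES=64 (tried later in the thread) gives 256 KiB chunks, or 8 mpages. Each mpage is cleared forward internally while the mpages themselves still follow the old target-last order.

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;		/* assumed base page size */
	unsigned long folio_bytes = 2UL << 20;	/* 2 MiB PMD-sized folio */
	unsigned long nrpages[] = { 16, 64 };	/* MPAGE_NRPAGES values discussed in the thread */

	for (int i = 0; i < 2; i++) {
		unsigned long mpage_size = page_size * nrpages[i];

		printf("MPAGE_NRPAGES=%lu: %lu KiB chunks, %lu per 2 MiB folio\n",
		       nrpages[i], mpage_size >> 10, folio_bytes / mpage_size);
	}
	return 0;
}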
Kefeng Wang <wangkefeng.wang@huawei.com> writes: > On 2024/10/30 11:21, Huang, Ying wrote: >> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >> >>> On 2024/10/30 9:04, Huang, Ying wrote: >>>> David Hildenbrand <david@redhat.com> writes: >>>> >>>>> On 29.10.24 14:04, Kefeng Wang wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>>>>>>>> shouldn't >>>>>>>>>>>> >>>>>>>>>>>> Yes, let's fix the bug firstly, >>>>>>>>>>>> >>>>>>>>>>>>> be even consuming "nr_pages". >>>>>>>>>>>> >>>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>>>>>>>> the >>>>>>>>>>>> 'base'. >>>>>>>>>>> >>>>>>>>>>> It should be using folio_nr_pages(). >>>>>>>>>> >>>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to >>>>>>>>>> move the aligned address calculate into the folio_zero_user and >>>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>>>>>>>> following cleanup patches, or do it in the fix patches? >>>>>>>>> >>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small >>>>>>>>> folio*? Because we like or code to be extra complicated to understand? >>>>>>>>> Or am I missing something important? >>>>>>>> >>>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >>>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP >>>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly. >>>>>>> >>>>>>> And unfortunately neither the documentation nor the function name >>>>>>> expresses that :( >>>>>>> >>>>>>> I'm happy to review any patches that improve the situation here :) >>>>>>> >>>>>> Actually, could we drop the process_huge_page() totally, from my >>>>>> testcase[1], process_huge_page() is not better than clear/copy page >>>>>> from start to last, and sequential clearing/copying maybe more >>>>>> beneficial to the hardware prefetching, and is there a way to let lkp >>>>>> to test to check the performance, since the process_huge_page() >>>>>> was submitted by Ying, what's your opinion? >>>> I don't think that it's a good idea to revert the commit without >>>> studying and root causing the issues. I can work together with you on >>>> that. If we have solid and well explained data to prove >>>> process_huge_page() isn't benefitial, we can revert the commit. >>> >>> >>> Take 'fallocate 20G' as an example, before >>> >>> Performance counter stats for 'taskset -c 10 fallocate -l 20G >>> /mnt/hugetlbfs/test': >> IIUC, fallocate will zero pages, but will not touch them at all, >> right? >> If so, no cache benefit from clearing referenced page last. > > > Yes, for this case, only clear page. 
>> >>> 3,118.94 msec task-clock # 0.999 CPUs >>> utilized >>> 30 context-switches # 0.010 K/sec >>> 1 cpu-migrations # 0.000 K/sec >>> 136 page-faults # 0.044 K/sec >>> 8,092,075,873 cycles # >>> 2.594 GHz (92.82%) >>> 1,624,587,663 instructions # 0.20 insn per >>> cycle (92.83%) >>> 395,341,850 branches # 126.755 M/sec >>> (92.82%) >>> 3,872,302 branch-misses # 0.98% of all >>> branches (92.83%) >>> 1,398,066,701 L1-dcache-loads # 448.251 M/sec >>> (92.82%) >>> 58,124,626 L1-dcache-load-misses # 4.16% of all >>> L1-dcache accesses (92.82%) >>> 1,032,527 LLC-loads # 0.331 M/sec >>> (92.82%) >>> 498,684 LLC-load-misses # 48.30% of all >>> LL-cache accesses (92.84%) >>> 473,689,004 L1-icache-loads # 151.875 M/sec >>> (92.82%) >>> 356,721 L1-icache-load-misses # 0.08% of all >>> L1-icache accesses (92.85%) >>> 1,947,644,987 dTLB-loads # 624.458 M/sec >>> (92.95%) >>> 10,185 dTLB-load-misses # 0.00% of all >>> dTLB cache accesses (92.96%) >>> 474,622,896 iTLB-loads # 152.174 M/sec >>> (92.95%) >>> 94 iTLB-load-misses # 0.00% of all >>> iTLB cache accesses (85.69%) >>> >>> 3.122844830 seconds time elapsed >>> >>> 0.000000000 seconds user >>> 3.107259000 seconds sys >>> >>> and after(clear from start to end) >>> >>> Performance counter stats for 'taskset -c 10 fallocate -l 20G >>> /mnt/hugetlbfs/test': >>> >>> 1,135.53 msec task-clock # 0.999 CPUs >>> utilized >>> 10 context-switches # 0.009 K/sec >>> 1 cpu-migrations # 0.001 K/sec >>> 137 page-faults # 0.121 K/sec >>> 2,946,673,587 cycles # >>> 2.595 GHz (92.67%) >>> 1,620,704,205 instructions # 0.55 insn per >>> cycle (92.61%) >>> 394,595,772 branches # 347.499 M/sec >>> (92.60%) >>> 130,756 branch-misses # 0.03% of all >>> branches (92.84%) >>> 1,396,726,689 L1-dcache-loads # 1230.022 M/sec >>> (92.96%) >>> 338,344 L1-dcache-load-misses # 0.02% of all >>> L1-dcache accesses (92.95%) >>> 111,737 LLC-loads # 0.098 M/sec >>> (92.96%) >>> 67,486 LLC-load-misses # 60.40% of all >>> LL-cache accesses (92.96%) >>> 418,198,663 L1-icache-loads # 368.285 M/sec >>> (92.96%) >>> 173,764 L1-icache-load-misses # 0.04% of all >>> L1-icache accesses (92.96%) >>> 2,203,364,632 dTLB-loads # 1940.385 M/sec >>> (92.96%) >>> 17,195 dTLB-load-misses # 0.00% of all >>> dTLB cache accesses (92.95%) >>> 418,198,365 iTLB-loads # 368.285 M/sec >>> (92.96%) >>> 79 iTLB-load-misses # 0.00% of all >>> iTLB cache accesses (85.34%) >>> >>> 1.137015760 seconds time elapsed >>> >>> 0.000000000 seconds user >>> 1.131266000 seconds sys >>> >>> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but >>> this depends on the implementation of the microarchitecture. >> Anyway, we need to avoid (or reduce at least) the pure memory >> clearing >> performance. Have you double checked whether process_huge_page() is >> inlined? Perf-profile result can be used to check this too. >> > > Yes, I'm sure the process_huge_page() is inlined. > >> When you say from start to end, you mean to use clear_gigantic_page() >> directly, or change process_huge_page() to clear page from start to end? >> > > Using clear_gigantic_page() and changing process_huge_page() to clear > page from start to end are both good for performance when sequential > clearing, but no random test so far. > >>> 1) Will test some rand test to check the different of performance as >>> David suggested. >>> >>> 2) Hope the LKP to run more tests since it is very useful(more test >>> set and different machines) >> I'm starting to use LKP to test. 
https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@oneplus.com/ Just remembered that we have discussed a similar issue for arm64 before. Can you take a look at it? There's more discussion and tests/results in the thread, I think that may be helpful. -- Best Regards, Huang, Ying
On 2024/10/31 16:39, Huang, Ying wrote: > Kefeng Wang <wangkefeng.wang@huawei.com> writes: > > [snip] >> >>>> 1) Will test some rand test to check the different of performance as >>>> David suggested.>>>> >>>> 2) Hope the LKP to run more tests since it is very useful(more test >>>> set and different machines) >>> I'm starting to use LKP to test. >> >> Greet. Sorry for the late, > > I have run some tests with LKP to test. > > Firstly, there's almost no measurable difference between clearing pages > from start to end or from end to start on Intel server CPU. I guess > that there's some similar optimization for both direction. > > For multiple processes (same as logical CPU number) > vm-scalability/anon-w-seq test case, the benchmark score increases > about 22.4%. So process_huge_page is better than clear_gigantic_page() on Intel? Could you test the following case on x86? echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages mkdir -p /hugetlbfs/ mount none /hugetlbfs/ -t hugetlbfs rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test > > For multiple processes vm-scalability/anon-w-rand test case, no > measurable difference for benchmark score. > > So, the optimization helps sequential workload mainly. > > In summary, on x86, process_huge_page() will not introduce any > regression. And it helps some workload. > > However, on ARM64, it does introduce some regression for clearing pages > from end to start. That needs to be addressed. I guess that the > regression can be resolved via using more clearing from start to end > (but not all). For example, can you take a look at the patch below? > Which uses the similar framework as before, but clear each small trunk > (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when > the regression can be restored. > > WARNING: the patch is only build tested. Base: baseline Change1: using clear_gigantic_page() for 2M PMD Change2: your patch with MPAGE_NRPAGES=16 Change3: Case3 + fix[1] Change4: your patch with MPAGE_NRPAGES=64 + fix[1] 1. For rand write, case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference 2. For seq write, 1) case-anon-w-seq-mt: base: real 0m2.490s 0m2.254s 0m2.272s user 1m59.980s 2m23.431s 2m18.739s sys 1m3.675s 1m15.462s 1m15.030s Change1: real 0m2.234s 0m2.225s 0m2.159s user 2m56.105s 2m57.117s 3m0.489s sys 0m17.064s 0m17.564s 0m16.150s Change2: real 0m2.244s 0m2.384s 0m2.370s user 2m39.413s 2m41.990s 2m42.229s sys 0m19.826s 0m18.491s 0m18.053s Change3: // best performance real 0m2.155s 0m2.204s 0m2.194s user 3m2.640s 2m55.837s 3m0.902s sys 0m17.346s 0m17.630s 0m18.197s Change4: real 0m2.287s 0m2.377s 0m2.284s user 2m37.030s 2m52.868s 3m17.593s sys 0m15.445s 0m34.430s 0m45.224s 2) case-anon-w-seq-hugetlb very similar 1), Change4 slightly better than Change3, but not big different. 3) hugetlbfs fallocate 20G Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) < Change2(0m1.275s) < base(0m3.016s) In summary, the Change3 is best and Change1 is good on my arm64 machine. 
> > Best Regards, > Huang, Ying > > -----------------------------------8<---------------------------------------- > From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001 > From: Huang Ying <ying.huang@intel.com> > Date: Thu, 31 Oct 2024 11:13:57 +0800 > Subject: [PATCH] mpage clear > > --- > mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 67 insertions(+), 3 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 3ccee51adfbb..1fdc548c4275 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -6769,6 +6769,68 @@ static inline int process_huge_page( > return 0; > } > > +#define MPAGE_NRPAGES (1<<4) > +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES) > +static inline int clear_huge_page( > + unsigned long addr_hint, unsigned int nr_pages, > + int (*process_subpage)(unsigned long addr, int idx, void *arg), > + void *arg) > +{ > + int i, n, base, l, ret; > + unsigned long addr = addr_hint & > + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); > + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE; > + > + /* Process target subpage last to keep its cache lines hot */ > + might_sleep(); > + n = (addr_hint - addr) / MPAGE_SIZE; > + if (2 * n <= nr_mpages) { > + /* If target subpage in first half of huge page */ > + base = 0; > + l = n; > + /* Process subpages at the end of huge page */ > + for (i = nr_mpages - 1; i >= 2 * n; i--) { > + cond_resched(); > + ret = process_subpage(addr + i * MPAGE_SIZE, > + i * MPAGE_NRPAGES, arg); > + if (ret) > + return ret; > + } > + } else { > + /* If target subpage in second half of huge page */ > + base = nr_mpages - 2 * (nr_mpages - n); > + l = nr_mpages - n; > + /* Process subpages at the begin of huge page */ > + for (i = 0; i < base; i++) { > + cond_resched(); > + ret = process_subpage(addr + i * MPAGE_SIZE, > + i * MPAGE_NRPAGES, arg); > + if (ret) > + return ret; > + } > + } > + /* > + * Process remaining subpages in left-right-left-right pattern > + * towards the target subpage > + */ > + for (i = 0; i < l; i++) { > + int left_idx = base + i; > + int right_idx = base + 2 * l - 1 - i; > + > + cond_resched(); > + ret = process_subpage(addr + left_idx * MPAGE_SIZE, > + left_idx * MPAGE_NRPAGES, arg); > + if (ret) > + return ret; > + cond_resched(); > + ret = process_subpage(addr + right_idx * MPAGE_SIZE, > + right_idx * MPAGE_NRPAGES, arg); > + if (ret) > + return ret; > + } > + return 0; > +} > + > static void clear_gigantic_page(struct folio *folio, unsigned long addr, > unsigned int nr_pages) > { > @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr, > static int clear_subpage(unsigned long addr, int idx, void *arg) > { > struct folio *folio = arg; > + int i; > > - clear_user_highpage(folio_page(folio, idx), addr); > + for (i = 0; i < MPAGE_NRPAGES; i++) > + clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE); > return 0; > } > > @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint) > { > unsigned int nr_pages = folio_nr_pages(folio); > > - if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) > + if (unlikely(nr_pages != HPAGE_PMD_NR)) > clear_gigantic_page(folio, addr_hint, nr_pages); > else > - process_huge_page(addr_hint, nr_pages, clear_subpage, folio); > + clear_huge_page(addr_hint, nr_pages, clear_subpage, folio); > } > > static int copy_user_gigantic_page(struct folio *dst, struct folio *src, [1] fix patch diff --git a/mm/memory.c b/mm/memory.c index 
b22d4b83295b..aee99ede0c4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
 		base = 0;
 		l = n;
 		/* Process subpages at the end of huge page */
-		for (i = nr_mpages - 1; i >= 2 * n; i--) {
+		for (i = 2 * n; i < nr_mpages; i++) {
 			cond_resched();
 			ret = process_subpage(addr + i * MPAGE_SIZE,
 					      i * MPAGE_NRPAGES, arg);
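The behavioural difference fix [1] makes is easiest to see as a visit order. The sketch below prints which mpage chunks the prototype's "target in first half" branch touches, and in what order, before and after the fix; nr_mpages = 32 and a target in the 5th chunk are made-up example values, and the closing left/right walk toward the target chunk is the same in both cases.

/*
 * Illustration only: chunk visit order in the prototype's
 * "target in first half" branch, before and after fix [1].
 */
#include <stdio.h>

static void visit_order(int nr_mpages, int n, int fixed)
{
	int i, base = 0, l = n;

	printf(fixed ? "with fix [1]: " : "prototype:    ");
	if (fixed)
		for (i = 2 * n; i < nr_mpages; i++)		/* tail walked forward */
			printf("%d ", i);
	else
		for (i = nr_mpages - 1; i >= 2 * n; i--)	/* tail walked backward */
			printf("%d ", i);
	for (i = 0; i < l; i++)	/* left/right walk toward the target, unchanged */
		printf("%d %d ", base + i, base + 2 * l - 1 - i);
	printf("\n");
}

int main(void)
{
	visit_order(32, 4, 0);
	visit_order(32, 4, 1);
	return 0;
}

With the fix, everything except the few chunks around the target is cleared in ascending order, which is consistent with Change3 performing best on the arm64 machine above.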
On 2024/11/1 14:18, Huang, Ying wrote: > Kefeng Wang <wangkefeng.wang@huawei.com> writes: > >> On 2024/10/30 11:21, Huang, Ying wrote: >>> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >>> >>>> On 2024/10/30 9:04, Huang, Ying wrote: >>>>> David Hildenbrand <david@redhat.com> writes: >>>>> >>>>>> On 29.10.24 14:04, Kefeng Wang wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>>>>>>>>> shouldn't >>>>>>>>>>>>> >>>>>>>>>>>>> Yes, let's fix the bug firstly, >>>>>>>>>>>>> >>>>>>>>>>>>>> be even consuming "nr_pages". >>>>>>>>>>>>> >>>>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>>>>>>>>> the >>>>>>>>>>>>> 'base'. >>>>>>>>>>>> >>>>>>>>>>>> It should be using folio_nr_pages(). >>>>>>>>>>> >>>>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to >>>>>>>>>>> move the aligned address calculate into the folio_zero_user and >>>>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>>>>>>>>> following cleanup patches, or do it in the fix patches? >>>>>>>>>> >>>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small >>>>>>>>>> folio*? Because we like or code to be extra complicated to understand? >>>>>>>>>> Or am I missing something important? >>>>>>>>> >>>>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >>>>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP >>>>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly. >>>>>>>> >>>>>>>> And unfortunately neither the documentation nor the function name >>>>>>>> expresses that :( >>>>>>>> >>>>>>>> I'm happy to review any patches that improve the situation here :) >>>>>>>> >>>>>>> Actually, could we drop the process_huge_page() totally, from my >>>>>>> testcase[1], process_huge_page() is not better than clear/copy page >>>>>>> from start to last, and sequential clearing/copying maybe more >>>>>>> beneficial to the hardware prefetching, and is there a way to let lkp >>>>>>> to test to check the performance, since the process_huge_page() >>>>>>> was submitted by Ying, what's your opinion? >>>>> I don't think that it's a good idea to revert the commit without >>>>> studying and root causing the issues. I can work together with you on >>>>> that. If we have solid and well explained data to prove >>>>> process_huge_page() isn't benefitial, we can revert the commit. >>>> >>>> >>>> Take 'fallocate 20G' as an example, before >>>> >>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G >>>> /mnt/hugetlbfs/test': >>> IIUC, fallocate will zero pages, but will not touch them at all, >>> right? >>> If so, no cache benefit from clearing referenced page last. >> >> >> Yes, for this case, only clear page. 
>>> >>>> 3,118.94 msec task-clock # 0.999 CPUs >>>> utilized >>>> 30 context-switches # 0.010 K/sec >>>> 1 cpu-migrations # 0.000 K/sec >>>> 136 page-faults # 0.044 K/sec >>>> 8,092,075,873 cycles # >>>> 2.594 GHz (92.82%) >>>> 1,624,587,663 instructions # 0.20 insn per >>>> cycle (92.83%) >>>> 395,341,850 branches # 126.755 M/sec >>>> (92.82%) >>>> 3,872,302 branch-misses # 0.98% of all >>>> branches (92.83%) >>>> 1,398,066,701 L1-dcache-loads # 448.251 M/sec >>>> (92.82%) >>>> 58,124,626 L1-dcache-load-misses # 4.16% of all >>>> L1-dcache accesses (92.82%) >>>> 1,032,527 LLC-loads # 0.331 M/sec >>>> (92.82%) >>>> 498,684 LLC-load-misses # 48.30% of all >>>> LL-cache accesses (92.84%) >>>> 473,689,004 L1-icache-loads # 151.875 M/sec >>>> (92.82%) >>>> 356,721 L1-icache-load-misses # 0.08% of all >>>> L1-icache accesses (92.85%) >>>> 1,947,644,987 dTLB-loads # 624.458 M/sec >>>> (92.95%) >>>> 10,185 dTLB-load-misses # 0.00% of all >>>> dTLB cache accesses (92.96%) >>>> 474,622,896 iTLB-loads # 152.174 M/sec >>>> (92.95%) >>>> 94 iTLB-load-misses # 0.00% of all >>>> iTLB cache accesses (85.69%) >>>> >>>> 3.122844830 seconds time elapsed >>>> >>>> 0.000000000 seconds user >>>> 3.107259000 seconds sys >>>> >>>> and after(clear from start to end) >>>> >>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G >>>> /mnt/hugetlbfs/test': >>>> >>>> 1,135.53 msec task-clock # 0.999 CPUs >>>> utilized >>>> 10 context-switches # 0.009 K/sec >>>> 1 cpu-migrations # 0.001 K/sec >>>> 137 page-faults # 0.121 K/sec >>>> 2,946,673,587 cycles # >>>> 2.595 GHz (92.67%) >>>> 1,620,704,205 instructions # 0.55 insn per >>>> cycle (92.61%) >>>> 394,595,772 branches # 347.499 M/sec >>>> (92.60%) >>>> 130,756 branch-misses # 0.03% of all >>>> branches (92.84%) >>>> 1,396,726,689 L1-dcache-loads # 1230.022 M/sec >>>> (92.96%) >>>> 338,344 L1-dcache-load-misses # 0.02% of all >>>> L1-dcache accesses (92.95%) >>>> 111,737 LLC-loads # 0.098 M/sec >>>> (92.96%) >>>> 67,486 LLC-load-misses # 60.40% of all >>>> LL-cache accesses (92.96%) >>>> 418,198,663 L1-icache-loads # 368.285 M/sec >>>> (92.96%) >>>> 173,764 L1-icache-load-misses # 0.04% of all >>>> L1-icache accesses (92.96%) >>>> 2,203,364,632 dTLB-loads # 1940.385 M/sec >>>> (92.96%) >>>> 17,195 dTLB-load-misses # 0.00% of all >>>> dTLB cache accesses (92.95%) >>>> 418,198,365 iTLB-loads # 368.285 M/sec >>>> (92.96%) >>>> 79 iTLB-load-misses # 0.00% of all >>>> iTLB cache accesses (85.34%) >>>> >>>> 1.137015760 seconds time elapsed >>>> >>>> 0.000000000 seconds user >>>> 1.131266000 seconds sys >>>> >>>> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but >>>> this depends on the implementation of the microarchitecture. >>> Anyway, we need to avoid (or reduce at least) the pure memory >>> clearing >>> performance. Have you double checked whether process_huge_page() is >>> inlined? Perf-profile result can be used to check this too. >>> >> >> Yes, I'm sure the process_huge_page() is inlined. >> >>> When you say from start to end, you mean to use clear_gigantic_page() >>> directly, or change process_huge_page() to clear page from start to end? >>> >> >> Using clear_gigantic_page() and changing process_huge_page() to clear >> page from start to end are both good for performance when sequential >> clearing, but no random test so far. >> >>>> 1) Will test some rand test to check the different of performance as >>>> David suggested. 
>>>> >>>> 2) Hope the LKP to run more tests since it is very useful(more test >>>> set and different machines) >>> I'm starting to use LKP to test. > > https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@oneplus.com/ > > Just remembered that we have discussed a similar issue for arm64 before. > Can you take a look at it? There's more discussion and tests/results in > the thread, I think that may be helpful. > Thanks for the tips, will check it. > -- > Best Regards, > Huang, Ying >
Kefeng Wang <wangkefeng.wang@huawei.com> writes: > On 2024/10/31 16:39, Huang, Ying wrote: >> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >> [snip] >>> >>>>> 1) Will test some rand test to check the different of performance as >>>>> David suggested.>>>> >>>>> 2) Hope the LKP to run more tests since it is very useful(more test >>>>> set and different machines) >>>> I'm starting to use LKP to test. >>> >>> Greet. > > > Sorry for the late, > >> I have run some tests with LKP to test. >> Firstly, there's almost no measurable difference between clearing >> pages >> from start to end or from end to start on Intel server CPU. I guess >> that there's some similar optimization for both direction. >> For multiple processes (same as logical CPU number) >> vm-scalability/anon-w-seq test case, the benchmark score increases >> about 22.4%. > > So process_huge_page is better than clear_gigantic_page() on Intel? For vm-scalability/anon-w-seq test case, it is. Because the performance of forward and backward clearing is almost same, and the user space accessing has cache-hot benefit. > Could you test the following case on x86? > echo 10240 > > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages > mkdir -p /hugetlbfs/ > mount none /hugetlbfs/ -t hugetlbfs > rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate > -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G > /hugetlbfs/test It's not trivial for me to do this test. Because 0day wraps test cases. Do you know which existing test cases provide this? For example, in vm-scalability? >> For multiple processes vm-scalability/anon-w-rand test case, no >> measurable difference for benchmark score. >> So, the optimization helps sequential workload mainly. >> In summary, on x86, process_huge_page() will not introduce any >> regression. And it helps some workload. >> However, on ARM64, it does introduce some regression for clearing >> pages >> from end to start. That needs to be addressed. I guess that the >> regression can be resolved via using more clearing from start to end >> (but not all). For example, can you take a look at the patch below? >> Which uses the similar framework as before, but clear each small trunk >> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when >> the regression can be restored. >> WARNING: the patch is only build tested. > > > Base: baseline > Change1: using clear_gigantic_page() for 2M PMD > Change2: your patch with MPAGE_NRPAGES=16 > Change3: Case3 + fix[1] What is case3? > Change4: your patch with MPAGE_NRPAGES=64 + fix[1] > > 1. For rand write, > case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference > > 2. For seq write, > > 1) case-anon-w-seq-mt: Can you try case-anon-w-seq? That may be more stable. > base: > real 0m2.490s 0m2.254s 0m2.272s > user 1m59.980s 2m23.431s 2m18.739s > sys 1m3.675s 1m15.462s 1m15.030s > > Change1: > real 0m2.234s 0m2.225s 0m2.159s > user 2m56.105s 2m57.117s 3m0.489s > sys 0m17.064s 0m17.564s 0m16.150s > > Change2: > real 0m2.244s 0m2.384s 0m2.370s > user 2m39.413s 2m41.990s 2m42.229s > sys 0m19.826s 0m18.491s 0m18.053s It appears strange. There's no much cache hot benefit even if we clear pages from end to begin (with larger chunk). However, sys time improves a lot. This shows clearing page with large chunk helps on ARM64. 
> Change3: // best performance > real 0m2.155s 0m2.204s 0m2.194s > user 3m2.640s 2m55.837s 3m0.902s > sys 0m17.346s 0m17.630s 0m18.197s > > Change4: > real 0m2.287s 0m2.377s 0m2.284s > user 2m37.030s 2m52.868s 3m17.593s > sys 0m15.445s 0m34.430s 0m45.224s Change4 is essentially same as Change1. I don't know why they are different. Is there some large variation among run to run? Can you further optimize the prototype patch below? I think that it has potential to fix your issue. > 2) case-anon-w-seq-hugetlb > very similar 1), Change4 slightly better than Change3, but not big > different. > > 3) hugetlbfs fallocate 20G > Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) < > Change2(0m1.275s) < base(0m3.016s) > > In summary, the Change3 is best and Change1 is good on my arm64 machine. > >> Best Regards, >> Huang, Ying >> -----------------------------------8<---------------------------------------- >> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001 >> From: Huang Ying <ying.huang@intel.com> >> Date: Thu, 31 Oct 2024 11:13:57 +0800 >> Subject: [PATCH] mpage clear >> --- >> mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++--- >> 1 file changed, 67 insertions(+), 3 deletions(-) >> diff --git a/mm/memory.c b/mm/memory.c >> index 3ccee51adfbb..1fdc548c4275 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -6769,6 +6769,68 @@ static inline int process_huge_page( >> return 0; >> } >> +#define MPAGE_NRPAGES (1<<4) >> +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES) >> +static inline int clear_huge_page( >> + unsigned long addr_hint, unsigned int nr_pages, >> + int (*process_subpage)(unsigned long addr, int idx, void *arg), >> + void *arg) >> +{ >> + int i, n, base, l, ret; >> + unsigned long addr = addr_hint & >> + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); >> + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE; >> + >> + /* Process target subpage last to keep its cache lines hot */ >> + might_sleep(); >> + n = (addr_hint - addr) / MPAGE_SIZE; >> + if (2 * n <= nr_mpages) { >> + /* If target subpage in first half of huge page */ >> + base = 0; >> + l = n; >> + /* Process subpages at the end of huge page */ >> + for (i = nr_mpages - 1; i >= 2 * n; i--) { >> + cond_resched(); >> + ret = process_subpage(addr + i * MPAGE_SIZE, >> + i * MPAGE_NRPAGES, arg); >> + if (ret) >> + return ret; >> + } >> + } else { >> + /* If target subpage in second half of huge page */ >> + base = nr_mpages - 2 * (nr_mpages - n); >> + l = nr_mpages - n; >> + /* Process subpages at the begin of huge page */ >> + for (i = 0; i < base; i++) { >> + cond_resched(); >> + ret = process_subpage(addr + i * MPAGE_SIZE, >> + i * MPAGE_NRPAGES, arg); >> + if (ret) >> + return ret; >> + } >> + } >> + /* >> + * Process remaining subpages in left-right-left-right pattern >> + * towards the target subpage >> + */ >> + for (i = 0; i < l; i++) { >> + int left_idx = base + i; >> + int right_idx = base + 2 * l - 1 - i; >> + >> + cond_resched(); >> + ret = process_subpage(addr + left_idx * MPAGE_SIZE, >> + left_idx * MPAGE_NRPAGES, arg); >> + if (ret) >> + return ret; >> + cond_resched(); >> + ret = process_subpage(addr + right_idx * MPAGE_SIZE, >> + right_idx * MPAGE_NRPAGES, arg); >> + if (ret) >> + return ret; >> + } >> + return 0; >> +} >> + >> static void clear_gigantic_page(struct folio *folio, unsigned long addr, >> unsigned int nr_pages) >> { >> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr, >> static int 
clear_subpage(unsigned long addr, int idx, void *arg) >> { >> struct folio *folio = arg; >> + int i; >> - clear_user_highpage(folio_page(folio, idx), addr); >> + for (i = 0; i < MPAGE_NRPAGES; i++) >> + clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE); >> return 0; >> } >> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, >> unsigned long addr_hint) >> { >> unsigned int nr_pages = folio_nr_pages(folio); >> - if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) >> + if (unlikely(nr_pages != HPAGE_PMD_NR)) >> clear_gigantic_page(folio, addr_hint, nr_pages); >> else >> - process_huge_page(addr_hint, nr_pages, clear_subpage, folio); >> + clear_huge_page(addr_hint, nr_pages, clear_subpage, folio); >> } >> static int copy_user_gigantic_page(struct folio *dst, struct >> folio *src, > > > [1] fix patch > > diff --git a/mm/memory.c b/mm/memory.c > index b22d4b83295b..aee99ede0c4f 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -6816,7 +6816,7 @@ static inline int clear_huge_page( > base = 0; > l = n; > /* Process subpages at the end of huge page */ > - for (i = nr_mpages - 1; i >= 2 * n; i--) { > + for (i = 2 * n; i < nr_mpages; i++) { > cond_resched(); > ret = process_subpage(addr + i * MPAGE_SIZE, > i * MPAGE_NRPAGES, arg);
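For readers unfamiliar with the vm-scalability cases being discussed, a heavily stripped-down analogue of a sequential anonymous-write workload looks roughly like the sketch below (the real case-anon-w-seq* tests use usemem with many processes and far larger sizes; the 1 GiB size and the THP hint here are assumptions). The first store to each huge-page-sized region triggers the fault-time clearing discussed in this thread, and the following stores then walk the freshly cleared memory, which is where clearing the faulting subpage last can keep useful cache lines hot.

#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;	/* 1 GiB, arbitrary */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
#ifdef MADV_HUGEPAGE
	madvise(p, len, MADV_HUGEPAGE);	/* ask for THP; best effort */
#endif
	for (size_t off = 0; off < len; off += 64)
		p[off] = 1;	/* sequential write, one store per cache line */
	munmap(p, len);
	return 0;
}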
On 2024/11/1 16:16, Huang, Ying wrote: > Kefeng Wang <wangkefeng.wang@huawei.com> writes: > >> On 2024/10/31 16:39, Huang, Ying wrote: >>> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >>> [snip] >>>> >>>>>> 1) Will test some rand test to check the different of performance as >>>>>> David suggested.>>>> >>>>>> 2) Hope the LKP to run more tests since it is very useful(more test >>>>>> set and different machines) >>>>> I'm starting to use LKP to test. >>>> >>>> Greet. >> >> >> Sorry for the late, >> >>> I have run some tests with LKP to test. >>> Firstly, there's almost no measurable difference between clearing >>> pages >>> from start to end or from end to start on Intel server CPU. I guess >>> that there's some similar optimization for both direction. >>> For multiple processes (same as logical CPU number) >>> vm-scalability/anon-w-seq test case, the benchmark score increases >>> about 22.4%. >> >> So process_huge_page is better than clear_gigantic_page() on Intel? > > For vm-scalability/anon-w-seq test case, it is. Because the performance > of forward and backward clearing is almost same, and the user space > accessing has cache-hot benefit. > >> Could you test the following case on x86? >> echo 10240 > >> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages >> mkdir -p /hugetlbfs/ >> mount none /hugetlbfs/ -t hugetlbfs >> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate >> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G >> /hugetlbfs/test > > It's not trivial for me to do this test. Because 0day wraps test cases. > Do you know which existing test cases provide this? For example, in > vm-scalability? I don't know the public fallocate test, I will try to find a intel machine to test this case. > >>> For multiple processes vm-scalability/anon-w-rand test case, no >>> measurable difference for benchmark score. >>> So, the optimization helps sequential workload mainly. >>> In summary, on x86, process_huge_page() will not introduce any >>> regression. And it helps some workload. >>> However, on ARM64, it does introduce some regression for clearing >>> pages >>> from end to start. That needs to be addressed. I guess that the >>> regression can be resolved via using more clearing from start to end >>> (but not all). For example, can you take a look at the patch below? >>> Which uses the similar framework as before, but clear each small trunk >>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when >>> the regression can be restored. >>> WARNING: the patch is only build tested. >> >> >> Base: baseline >> Change1: using clear_gigantic_page() for 2M PMD >> Change2: your patch with MPAGE_NRPAGES=16 >> Change3: Case3 + fix[1] > > What is case3? Oh, it is Change2. > >> Change4: your patch with MPAGE_NRPAGES=64 + fix[1] >> >> 1. For rand write, >> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference >> >> 2. For seq write, >> >> 1) case-anon-w-seq-mt: > > Can you try case-anon-w-seq? That may be more stable. > >> base: >> real 0m2.490s 0m2.254s 0m2.272s >> user 1m59.980s 2m23.431s 2m18.739s >> sys 1m3.675s 1m15.462s 1m15.030s >> >> Change1: >> real 0m2.234s 0m2.225s 0m2.159s >> user 2m56.105s 2m57.117s 3m0.489s >> sys 0m17.064s 0m17.564s 0m16.150s >> >> Change2: >> real 0m2.244s 0m2.384s 0m2.370s >> user 2m39.413s 2m41.990s 2m42.229s >> sys 0m19.826s 0m18.491s 0m18.053s > > It appears strange. There's no much cache hot benefit even if we clear > pages from end to begin (with larger chunk). 
> > However, sys time improves a lot. This shows clearing page with large > chunk helps on ARM64. > >> Change3: // best performance >> real 0m2.155s 0m2.204s 0m2.194s >> user 3m2.640s 2m55.837s 3m0.902s >> sys 0m17.346s 0m17.630s 0m18.197s >> >> Change4: >> real 0m2.287s 0m2.377s 0m2.284s >> user 2m37.030s 2m52.868s 3m17.593s >> sys 0m15.445s 0m34.430s 0m45.224s > > Change4 is essentially same as Change1. I don't know why they are > different. Is there some large variation among run to run? As above shown, I test three times, the test results are relatively stable, at least for real, I will try case-anon-w-seq. > > Can you further optimize the prototype patch below? I think that it has > potential to fix your issue. Yes, thanks for you helper, but this will make process_huge_page() a little more complicated :) > >> 2) case-anon-w-seq-hugetlb >> very similar 1), Change4 slightly better than Change3, but not big >> different. >> >> 3) hugetlbfs fallocate 20G >> Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) < >> Change2(0m1.275s) < base(0m3.016s) >> >> In summary, the Change3 is best and Change1 is good on my arm64 machine. >> >>> Best Regards, >>> Huang, Ying >>> -----------------------------------8<---------------------------------------- >>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001 >>> From: Huang Ying <ying.huang@intel.com> >>> Date: Thu, 31 Oct 2024 11:13:57 +0800 >>> Subject: [PATCH] mpage clear >>> --- >>> mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++--- >>> 1 file changed, 67 insertions(+), 3 deletions(-) >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 3ccee51adfbb..1fdc548c4275 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page( >>> return 0; >>> } >>> +#define MPAGE_NRPAGES (1<<4) >>> +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES) >>> +static inline int clear_huge_page( >>> + unsigned long addr_hint, unsigned int nr_pages, >>> + int (*process_subpage)(unsigned long addr, int idx, void *arg), >>> + void *arg) >>> +{ >>> + int i, n, base, l, ret; >>> + unsigned long addr = addr_hint & >>> + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); >>> + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE; >>> + >>> + /* Process target subpage last to keep its cache lines hot */ >>> + might_sleep(); >>> + n = (addr_hint - addr) / MPAGE_SIZE; >>> + if (2 * n <= nr_mpages) { >>> + /* If target subpage in first half of huge page */ >>> + base = 0; >>> + l = n; >>> + /* Process subpages at the end of huge page */ >>> + for (i = nr_mpages - 1; i >= 2 * n; i--) { >>> + cond_resched(); >>> + ret = process_subpage(addr + i * MPAGE_SIZE, >>> + i * MPAGE_NRPAGES, arg); >>> + if (ret) >>> + return ret; >>> + } >>> + } else { >>> + /* If target subpage in second half of huge page */ >>> + base = nr_mpages - 2 * (nr_mpages - n); >>> + l = nr_mpages - n; >>> + /* Process subpages at the begin of huge page */ >>> + for (i = 0; i < base; i++) { >>> + cond_resched(); >>> + ret = process_subpage(addr + i * MPAGE_SIZE, >>> + i * MPAGE_NRPAGES, arg); >>> + if (ret) >>> + return ret; >>> + } >>> + } >>> + /* >>> + * Process remaining subpages in left-right-left-right pattern >>> + * towards the target subpage >>> + */ >>> + for (i = 0; i < l; i++) { >>> + int left_idx = base + i; >>> + int right_idx = base + 2 * l - 1 - i; >>> + >>> + cond_resched(); >>> + ret = process_subpage(addr + left_idx * MPAGE_SIZE, >>> + left_idx * MPAGE_NRPAGES, arg); >>> + if (ret) 
>>> + return ret; >>> + cond_resched(); >>> + ret = process_subpage(addr + right_idx * MPAGE_SIZE, >>> + right_idx * MPAGE_NRPAGES, arg); >>> + if (ret) >>> + return ret; >>> + } >>> + return 0; >>> +} >>> + >>> static void clear_gigantic_page(struct folio *folio, unsigned long addr, >>> unsigned int nr_pages) >>> { >>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr, >>> static int clear_subpage(unsigned long addr, int idx, void *arg) >>> { >>> struct folio *folio = arg; >>> + int i; >>> - clear_user_highpage(folio_page(folio, idx), addr); >>> + for (i = 0; i < MPAGE_NRPAGES; i++) >>> + clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE); >>> return 0; >>> } >>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, >>> unsigned long addr_hint) >>> { >>> unsigned int nr_pages = folio_nr_pages(folio); >>> - if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) >>> + if (unlikely(nr_pages != HPAGE_PMD_NR)) >>> clear_gigantic_page(folio, addr_hint, nr_pages); >>> else >>> - process_huge_page(addr_hint, nr_pages, clear_subpage, folio); >>> + clear_huge_page(addr_hint, nr_pages, clear_subpage, folio); >>> } >>> static int copy_user_gigantic_page(struct folio *dst, struct >>> folio *src, >> >> >> [1] fix patch >> >> diff --git a/mm/memory.c b/mm/memory.c >> index b22d4b83295b..aee99ede0c4f 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page( >> base = 0; >> l = n; >> /* Process subpages at the end of huge page */ >> - for (i = nr_mpages - 1; i >= 2 * n; i--) { >> + for (i = 2 * n; i < nr_mpages; i++) { >> cond_resched(); >> ret = process_subpage(addr + i * MPAGE_SIZE, >> i * MPAGE_NRPAGES, arg); >
Kefeng Wang <wangkefeng.wang@huawei.com> writes: > On 2024/11/1 16:16, Huang, Ying wrote: >> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >> >>> On 2024/10/31 16:39, Huang, Ying wrote: >>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >>>> [snip] >>>>> >>>>>>> 1) Will test some rand test to check the different of performance as >>>>>>> David suggested.>>>> >>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test >>>>>>> set and different machines) >>>>>> I'm starting to use LKP to test. >>>>> >>>>> Greet. >>> >>> >>> Sorry for the late, >>> >>>> I have run some tests with LKP to test. >>>> Firstly, there's almost no measurable difference between clearing >>>> pages >>>> from start to end or from end to start on Intel server CPU. I guess >>>> that there's some similar optimization for both direction. >>>> For multiple processes (same as logical CPU number) >>>> vm-scalability/anon-w-seq test case, the benchmark score increases >>>> about 22.4%. >>> >>> So process_huge_page is better than clear_gigantic_page() on Intel? >> For vm-scalability/anon-w-seq test case, it is. Because the >> performance >> of forward and backward clearing is almost same, and the user space >> accessing has cache-hot benefit. >> >>> Could you test the following case on x86? >>> echo 10240 > >>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages >>> mkdir -p /hugetlbfs/ >>> mount none /hugetlbfs/ -t hugetlbfs >>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate >>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G >>> /hugetlbfs/test >> It's not trivial for me to do this test. Because 0day wraps test >> cases. >> Do you know which existing test cases provide this? For example, in >> vm-scalability? > > I don't know the public fallocate test, I will try to find a intel > machine to test this case. I don't expect it to change much, because we have observed that the performance of forward and backward clearing is similar on Intel. >> >>>> For multiple processes vm-scalability/anon-w-rand test case, no >>>> measurable difference for benchmark score. >>>> So, the optimization helps sequential workload mainly. >>>> In summary, on x86, process_huge_page() will not introduce any >>>> regression. And it helps some workload. >>>> However, on ARM64, it does introduce some regression for clearing >>>> pages >>>> from end to start. That needs to be addressed. I guess that the >>>> regression can be resolved via using more clearing from start to end >>>> (but not all). For example, can you take a look at the patch below? >>>> Which uses the similar framework as before, but clear each small trunk >>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when >>>> the regression can be restored. >>>> WARNING: the patch is only build tested. >>> >>> >>> Base: baseline >>> Change1: using clear_gigantic_page() for 2M PMD >>> Change2: your patch with MPAGE_NRPAGES=16 >>> Change3: Case3 + fix[1] >> What is case3? > > Oh, it is Change2. Got it. >> >>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1] >>> >>> 1. For rand write, >>> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference >>> >>> 2. For seq write, >>> >>> 1) case-anon-w-seq-mt: >> Can you try case-anon-w-seq? That may be more stable. 
>> >>> base: >>> real 0m2.490s 0m2.254s 0m2.272s >>> user 1m59.980s 2m23.431s 2m18.739s >>> sys 1m3.675s 1m15.462s 1m15.030s >>> >>> Change1: >>> real 0m2.234s 0m2.225s 0m2.159s >>> user 2m56.105s 2m57.117s 3m0.489s >>> sys 0m17.064s 0m17.564s 0m16.150s >>> >>> Change2: >>> real 0m2.244s 0m2.384s 0m2.370s >>> user 2m39.413s 2m41.990s 2m42.229s >>> sys 0m19.826s 0m18.491s 0m18.053s >> It appears strange. There's no much cache hot benefit even if we >> clear >> pages from end to begin (with larger chunk). >> However, sys time improves a lot. This shows clearing page with >> large >> chunk helps on ARM64. >> >>> Change3: // best performance >>> real 0m2.155s 0m2.204s 0m2.194s >>> user 3m2.640s 2m55.837s 3m0.902s >>> sys 0m17.346s 0m17.630s 0m18.197s >>> >>> Change4: >>> real 0m2.287s 0m2.377s 0m2.284s >>> user 2m37.030s 2m52.868s 3m17.593s >>> sys 0m15.445s 0m34.430s 0m45.224s >> Change4 is essentially same as Change1. I don't know why they are >> different. Is there some large variation among run to run? > > As above shown, I test three times, the test results are relatively > stable, at least for real, I will try case-anon-w-seq. Can you also show the score of vm-scalability? TBH, I cannot understand your results. For example, why there are measurable difference between Change3 and Change4? In both cases, the kernel clears pages from start to end. >> Can you further optimize the prototype patch below? I think that it >> has >> potential to fix your issue. > > Yes, thanks for you helper, but this will make process_huge_page() a > little more complicated :) IMHO, we should try to root cause it, then try to find the proper solution and optimize (simplifies) it. -- Best Regards, Huang, Ying >> >>> 2) case-anon-w-seq-hugetlb >>> very similar 1), Change4 slightly better than Change3, but not big >>> different. >>> >>> 3) hugetlbfs fallocate 20G >>> Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) < >>> Change2(0m1.275s) < base(0m3.016s) >>> >>> In summary, the Change3 is best and Change1 is good on my arm64 machine. 
>>> >>>> Best Regards, >>>> Huang, Ying >>>> -----------------------------------8<---------------------------------------- >>>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001 >>>> From: Huang Ying <ying.huang@intel.com> >>>> Date: Thu, 31 Oct 2024 11:13:57 +0800 >>>> Subject: [PATCH] mpage clear >>>> --- >>>> mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++--- >>>> 1 file changed, 67 insertions(+), 3 deletions(-) >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 3ccee51adfbb..1fdc548c4275 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page( >>>> return 0; >>>> } >>>> +#define MPAGE_NRPAGES (1<<4) >>>> +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES) >>>> +static inline int clear_huge_page( >>>> + unsigned long addr_hint, unsigned int nr_pages, >>>> + int (*process_subpage)(unsigned long addr, int idx, void *arg), >>>> + void *arg) >>>> +{ >>>> + int i, n, base, l, ret; >>>> + unsigned long addr = addr_hint & >>>> + ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1); >>>> + unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE; >>>> + >>>> + /* Process target subpage last to keep its cache lines hot */ >>>> + might_sleep(); >>>> + n = (addr_hint - addr) / MPAGE_SIZE; >>>> + if (2 * n <= nr_mpages) { >>>> + /* If target subpage in first half of huge page */ >>>> + base = 0; >>>> + l = n; >>>> + /* Process subpages at the end of huge page */ >>>> + for (i = nr_mpages - 1; i >= 2 * n; i--) { >>>> + cond_resched(); >>>> + ret = process_subpage(addr + i * MPAGE_SIZE, >>>> + i * MPAGE_NRPAGES, arg); >>>> + if (ret) >>>> + return ret; >>>> + } >>>> + } else { >>>> + /* If target subpage in second half of huge page */ >>>> + base = nr_mpages - 2 * (nr_mpages - n); >>>> + l = nr_mpages - n; >>>> + /* Process subpages at the begin of huge page */ >>>> + for (i = 0; i < base; i++) { >>>> + cond_resched(); >>>> + ret = process_subpage(addr + i * MPAGE_SIZE, >>>> + i * MPAGE_NRPAGES, arg); >>>> + if (ret) >>>> + return ret; >>>> + } >>>> + } >>>> + /* >>>> + * Process remaining subpages in left-right-left-right pattern >>>> + * towards the target subpage >>>> + */ >>>> + for (i = 0; i < l; i++) { >>>> + int left_idx = base + i; >>>> + int right_idx = base + 2 * l - 1 - i; >>>> + >>>> + cond_resched(); >>>> + ret = process_subpage(addr + left_idx * MPAGE_SIZE, >>>> + left_idx * MPAGE_NRPAGES, arg); >>>> + if (ret) >>>> + return ret; >>>> + cond_resched(); >>>> + ret = process_subpage(addr + right_idx * MPAGE_SIZE, >>>> + right_idx * MPAGE_NRPAGES, arg); >>>> + if (ret) >>>> + return ret; >>>> + } >>>> + return 0; >>>> +} >>>> + >>>> static void clear_gigantic_page(struct folio *folio, unsigned long addr, >>>> unsigned int nr_pages) >>>> { >>>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr, >>>> static int clear_subpage(unsigned long addr, int idx, void *arg) >>>> { >>>> struct folio *folio = arg; >>>> + int i; >>>> - clear_user_highpage(folio_page(folio, idx), addr); >>>> + for (i = 0; i < MPAGE_NRPAGES; i++) >>>> + clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE); >>>> return 0; >>>> } >>>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, >>>> unsigned long addr_hint) >>>> { >>>> unsigned int nr_pages = folio_nr_pages(folio); >>>> - if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) >>>> + if (unlikely(nr_pages != HPAGE_PMD_NR)) >>>> clear_gigantic_page(folio, addr_hint, nr_pages); >>>> else 
>>>> - process_huge_page(addr_hint, nr_pages, clear_subpage, folio); >>>> + clear_huge_page(addr_hint, nr_pages, clear_subpage, folio); >>>> } >>>> static int copy_user_gigantic_page(struct folio *dst, struct >>>> folio *src, >>> >>> >>> [1] fix patch >>> >>> diff --git a/mm/memory.c b/mm/memory.c >>> index b22d4b83295b..aee99ede0c4f 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page( >>> base = 0; >>> l = n; >>> /* Process subpages at the end of huge page */ >>> - for (i = nr_mpages - 1; i >= 2 * n; i--) { >>> + for (i = 2 * n; i < nr_mpages; i++) { >>> cond_resched(); >>> ret = process_subpage(addr + i * MPAGE_SIZE, >>> i * MPAGE_NRPAGES, arg); >>
On 2024/11/4 10:35, Huang, Ying wrote: > Kefeng Wang <wangkefeng.wang@huawei.com> writes: > >> On 2024/11/1 16:16, Huang, Ying wrote: >>> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >>> >>>> On 2024/10/31 16:39, Huang, Ying wrote: >>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes: >>>>> [snip] >>>>>> >>>>>>>> 1) Will test some rand test to check the different of performance as >>>>>>>> David suggested.>>>> >>>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test >>>>>>>> set and different machines) >>>>>>> I'm starting to use LKP to test. >>>>>> >>>>>> Greet. >>>> >>>> >>>> Sorry for the late, >>>> >>>>> I have run some tests with LKP to test. >>>>> Firstly, there's almost no measurable difference between clearing >>>>> pages >>>>> from start to end or from end to start on Intel server CPU. I guess >>>>> that there's some similar optimization for both direction. >>>>> For multiple processes (same as logical CPU number) >>>>> vm-scalability/anon-w-seq test case, the benchmark score increases >>>>> about 22.4%. >>>> >>>> So process_huge_page is better than clear_gigantic_page() on Intel? >>> For vm-scalability/anon-w-seq test case, it is. Because the >>> performance >>> of forward and backward clearing is almost same, and the user space >>> accessing has cache-hot benefit. >>> >>>> Could you test the following case on x86? >>>> echo 10240 > >>>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages >>>> mkdir -p /hugetlbfs/ >>>> mount none /hugetlbfs/ -t hugetlbfs >>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate >>>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G >>>> /hugetlbfs/test >>> It's not trivial for me to do this test. Because 0day wraps test >>> cases. >>> Do you know which existing test cases provide this? For example, in >>> vm-scalability? >> >> I don't know the public fallocate test, I will try to find a intel >> machine to test this case. > > I don't expect it to change much, because we have observed that the > performance of forward and backward clearing is similar on Intel. 
I find a Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz Caches (sum of all): L1d: 1.1 MiB (36 instances) L1i: 1.1 MiB (36 instances) L2: 36 MiB (36 instances) L3: 49.5 MiB (2 instances) NUMA: NUMA node(s): 2 NUMA node0 CPU(s): 0-17,36-53 NUMA node1 CPU(s): 18-35,54-71 Before: Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test': 3,856.93 msec task-clock # 0.997 CPUs utilized 6 context-switches # 1.556 /sec 1 cpu-migrations # 0.259 /sec 132 page-faults # 34.224 /sec 11,520,934,848 cycles # 2.987 GHz (19.95%) 213,731,011 instructions # 0.02 insn per cycle (24.96%) 58,164,361 branches # 15.080 M/sec (24.96%) 262,547 branch-misses # 0.45% of all branches (24.97%) 96,029,321 CPU_CLK_UNHALTED.REF_XCLK # 24.898 M/sec # 0.3 % tma_frontend_bound # 3.3 % tma_retiring # 96.4 % tma_backend_bound # 0.0 % tma_bad_speculation (24.99%) 149,735,020 IDQ_UOPS_NOT_DELIVERED.CORE # 38.822 M/sec (25.01%) 2,486,326 INT_MISC.RECOVERY_CYCLES_ANY # 644.638 K/sec (20.01%) 95,973,482 CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE # 24.883 M/sec (20.01%) 11,526,783,305 CPU_CLK_UNHALTED.THREAD # 2.989 G/sec (20.01%) 1,519,072,911 UOPS_RETIRED.RETIRE_SLOTS # 393.855 M/sec (20.01%) 1,526,020,825 UOPS_ISSUED.ANY # 395.657 M/sec (20.01%) 59,784,189 L1-dcache-loads # 15.500 M/sec (20.01%) 337,479,254 L1-dcache-load-misses # 564.50% of all L1-dcache accesses (20.02%) 175,954 LLC-loads # 45.620 K/sec (20.02%) 51,955 LLC-load-misses # 29.53% of all L1-icache accesses (20.02%) <not supported> L1-icache-loads 2,864,230 L1-icache-load-misses (20.02%) 59,769,391 dTLB-loads # 15.497 M/sec (20.02%) 819 dTLB-load-misses # 0.00% of all dTLB cache accesses (20.02%) 2,459 iTLB-loads # 637.553 /sec (20.01%) 370 iTLB-load-misses # 15.05% of all iTLB cache accesses (19.98%) 3.870393637 seconds time elapsed 0.000000000 seconds user 3.833021000 seconds sys After(using clear_gigantic_page()): Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test': 4,426.18 msec task-clock # 0.994 CPUs utilized 8 context-switches # 1.807 /sec 1 cpu-migrations # 0.226 /sec 131 page-faults # 29.597 /sec 13,221,263,588 cycles # 2.987 GHz (19.98%) 215,924,995 instructions # 0.02 insn per cycle (25.00%) 58,430,182 branches # 13.201 M/sec (25.01%) 279,381 branch-misses # 0.48% of all branches (25.03%) 110,199,114 CPU_CLK_UNHALTED.REF_XCLK # 24.897 M/sec # 0.3 % tma_frontend_bound # 2.9 % tma_retiring # 96.8 % tma_backend_bound # 0.0 % tma_bad_speculation (25.06%) 160,650,548 IDQ_UOPS_NOT_DELIVERED.CORE # 36.296 M/sec (25.07%) 2,559,970 INT_MISC.RECOVERY_CYCLES_ANY # 578.370 K/sec (20.05%) 110,229,402 CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE # 24.904 M/sec (20.05%) 13,227,924,727 CPU_CLK_UNHALTED.THREAD # 2.989 G/sec (20.03%) 1,525,019,287 UOPS_RETIRED.RETIRE_SLOTS # 344.545 M/sec (20.01%) 1,531,307,263 UOPS_ISSUED.ANY # 345.966 M/sec (19.98%) 60,600,471 L1-dcache-loads # 13.691 M/sec (19.96%) 337,576,917 L1-dcache-load-misses # 557.05% of all L1-dcache accesses (19.96%) 177,157 LLC-loads # 40.025 K/sec (19.96%) 48,056 LLC-load-misses # 27.13% of all L1-icache accesses (19.97%) <not supported> L1-icache-loads 2,653,617 L1-icache-load-misses (19.97%) 60,609,241 dTLB-loads # 13.693 M/sec (19.97%) 530 dTLB-load-misses # 0.00% of all dTLB cache accesses (19.97%) 1,952 iTLB-loads # 441.013 /sec (19.97%) 3,059 iTLB-load-misses # 156.71% of all iTLB cache accesses (19.97%) 4.450664421 seconds time elapsed 0.000984000 seconds user 4.397795000 seconds sys This shows the backward is better than forward,at least for this CPU. 
>
>>>
>>>>> For the multiple-process vm-scalability/anon-w-rand test case, there is no
>>>>> measurable difference in the benchmark score.
>>>>> So, the optimization mainly helps sequential workloads.
>>>>> In summary, on x86, process_huge_page() will not introduce any
>>>>> regression. And it helps some workloads.
>>>>> However, on ARM64, it does introduce some regression for clearing pages
>>>>> from end to start. That needs to be addressed. I guess that the
>>>>> regression can be resolved via using more clearing from start to end
>>>>> (but not all). For example, can you take a look at the patch below?
>>>>> It uses a similar framework as before, but clears each small chunk
>>>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>>>> the regression can be restored.
>>>>>
>>>>> WARNING: the patch is only build tested.
>>>>
>>>> Base:    baseline
>>>> Change1: using clear_gigantic_page() for 2M PMD
>>>> Change2: your patch with MPAGE_NRPAGES=16
>>>> Change3: Case3 + fix[1]
>>> What is case3?
>>
>> Oh, it is Change2.
>
> Got it.
>
>>>
>>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>>
>>>> 1. For rand write,
>>>>    case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference
>>>>
>>>> 2. For seq write,
>>>>
>>>> 1) case-anon-w-seq-mt:
>>> Can you try case-anon-w-seq? That may be more stable.
>>>
>>>> base:
>>>> real  0m2.490s   0m2.254s   0m2.272s
>>>> user  1m59.980s  2m23.431s  2m18.739s
>>>> sys   1m3.675s   1m15.462s  1m15.030s
>>>>
>>>> Change1:
>>>> real  0m2.234s   0m2.225s   0m2.159s
>>>> user  2m56.105s  2m57.117s  3m0.489s
>>>> sys   0m17.064s  0m17.564s  0m16.150s
>>>>
>>>> Change2:
>>>> real  0m2.244s   0m2.384s   0m2.370s
>>>> user  2m39.413s  2m41.990s  2m42.229s
>>>> sys   0m19.826s  0m18.491s  0m18.053s
>>> It appears strange. There's not much cache-hot benefit even if we clear
>>> pages from end to beginning (with a larger chunk).
>>> However, sys time improves a lot. This shows that clearing pages in
>>> large chunks helps on ARM64.
>>>
>>>> Change3: // best performance
>>>> real  0m2.155s   0m2.204s   0m2.194s
>>>> user  3m2.640s   2m55.837s  3m0.902s
>>>> sys   0m17.346s  0m17.630s  0m18.197s
>>>>
>>>> Change4:
>>>> real  0m2.287s   0m2.377s   0m2.284s
>>>> user  2m37.030s  2m52.868s  3m17.593s
>>>> sys   0m15.445s  0m34.430s  0m45.224s
>>> Change4 is essentially the same as Change1. I don't know why they are
>>> different. Is there some large variation from run to run?
>>
>> As shown above, I tested three times; the results are relatively
>> stable, at least for real time. I will try case-anon-w-seq.
>
> Can you also show the score of vm-scalability?
>
> TBH, I cannot understand your results. For example, why is there a
> measurable difference between Change3 and Change4? In both cases, the
> kernel clears pages from start to end.

OK, will retest once I can access the machine again.

>
>>> Can you further optimize the prototype patch below? I think that it has
>>> potential to fix your issue.
>>
>> Yes, thanks for your help, but this will make process_huge_page() a
>> little more complicated :)
>
> IMHO, we should try to root cause it, then try to find the proper
> solution and optimize (simplify) it.

From the above fallocate test on Intel, it seems that different
microarchitectures have different performance characteristics on Intel too.
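[Editorial note: the three clearing strategies compared in this thread can be summarized with a small C sketch. This is illustrative only; clear_user_highpage(), folio_page(), cond_resched() and PAGE_SIZE are the real kernel interfaces, while clear_forward(), clear_target_last(), clear_chunk_forward() and the MPAGE_NRPAGES value are hypothetical simplifications, not the upstream code or the prototype patch discussed above.]

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/sched.h>

/* Chunk size (in base pages) a caller of the chunked variant would use; tunable. */
#define MPAGE_NRPAGES	16

/*
 * 1) Forward clearing, as clear_gigantic_page() does: zero subpages from
 *    the first to the last page of the folio. Sequential stores are
 *    friendly to prefetchers and write-streaming, which is where the
 *    arm64 sys-time win comes from.
 */
static void clear_forward(struct folio *folio, unsigned long base,
			  unsigned int nr_pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		clear_user_highpage(folio_page(folio, i), base + i * PAGE_SIZE);
	}
}

/*
 * 2) Target-last clearing, the idea behind process_huge_page(): clear all
 *    subpages except the one that was actually accessed first, and the
 *    accessed subpage last, so its cache lines are still hot when the
 *    fault returns to user space. (The real code walks toward the target
 *    from both ends; this is a simplification.)
 */
static void clear_target_last(struct folio *folio, unsigned long base,
			      unsigned int nr_pages, unsigned int target)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		if (i == target)
			continue;
		cond_resched();
		clear_user_highpage(folio_page(folio, i), base + i * PAGE_SIZE);
	}
	clear_user_highpage(folio_page(folio, target),
			    base + target * PAGE_SIZE);
}

/*
 * 3) The chunked compromise discussed above: keep a target-last order at
 *    chunk granularity, but clear each MPAGE_NRPAGES-sized chunk itself
 *    from start to end, so the hardware still sees forward, sequential
 *    stores within a chunk.
 */
static void clear_chunk_forward(struct folio *folio, unsigned long base,
				unsigned int first_page, unsigned int nr)
{
	unsigned int i;

	for (i = first_page; i < first_page + nr; i++) {
		cond_resched();
		clear_user_highpage(folio_page(folio, i), base + i * PAGE_SIZE);
	}
}

In these terms, Change1 above is roughly (1) applied to 2M PMD faults, while the prototype patch tries to keep the cache-hot benefit of (2) seen in the x86 vm-scalability runs without losing the arm64 sys-time benefit of forward stores, by using (3).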
On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> However, on ARM64, it does introduce some regression for clearing pages
> from end to start. That needs to be addressed. I guess that the
> regression can be resolved via using more clearing from start to end
> (but not all). For example, can you take a look at the patch below?
> It uses a similar framework as before, but clears each small chunk
> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
> the regression can be restored.
>
> WARNING: the patch is only build tested.

I'm holding this patch in mm-unstable because it's unclear (to me) that
this regression has been adequately addressed?
Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> However, on ARM64, it does introduce some regression for clearing pages
>> from end to start. That needs to be addressed. I guess that the
>> regression can be resolved via using more clearing from start to end
>> (but not all). For example, can you take a look at the patch below?
>> It uses a similar framework as before, but clears each small chunk
>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>> the regression can be restored.
>>
>> WARNING: the patch is only build tested.
>
> I'm holding this patch in mm-unstable because it's unclear (to me) that
> this regression has been adequately addressed?

Yes. This patch isn't ready to be merged yet. More work is needed to
root cause the problem and implement the proper fix. I guess that
Kefeng will continue to work on this.

---
Best Regards,
Huang, Ying
Hi Andrew and Ying,

On 2024/12/1 13:37, Huang, Ying wrote:
> Hi, Andrew,
>
> Andrew Morton <akpm@linux-foundation.org> writes:
>
>> On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>>
>>> However, on ARM64, it does introduce some regression for clearing pages
>>> from end to start. That needs to be addressed. I guess that the
>>> regression can be resolved via using more clearing from start to end
>>> (but not all). For example, can you take a look at the patch below?
>>> It uses a similar framework as before, but clears each small chunk
>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>> the regression can be restored.
>>>
>>> WARNING: the patch is only build tested.
>>
>> I'm holding this patch in mm-unstable because it's unclear (to me) that
>> this regression has been adequately addressed?
>
> Yes. This patch isn't ready to be merged yet. More work is needed to
> root cause the problem and implement the proper fix. I guess that
> Kefeng will continue to work on this.

Sorry for the late reply, I've been buried in other tasks.

The two bugfix patches in linux-next should be merged first; they are
not related to the performance issue:

  84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
  9fbd869b19b8 mm: use aligned address in clear_gigantic_page()

Following Ying's suggestion, I am trying to fix the performance issue
on arm64 (it will be in a new series), but I need to confirm there is
no regression on other arm64 machines. I will send the new patches once
the testing is complete.

So Andrew, please help pick up the above two bugfix patches for
v6.13-rcX if there is no objection.

>
> ---
> Best Regards,
> Huang, Ying
>
On Mon, 2 Dec 2024 09:03:45 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:

> >>> WARNING: the patch is only build tested.
> >>
> >> I'm holding this patch in mm-unstable because it's unclear (to me) that
> >> this regression has been adequately addressed?
> >
> > Yes. This patch isn't ready to be merged yet. More work is needed to
> > root cause the problem and implement the proper fix. I guess that
> > Kefeng will continue to work on this.
>
> Sorry for the late reply, I've been buried in other tasks.
>
> The two bugfix patches in linux-next should be merged first; they are
> not related to the performance issue:
>
>   84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
>   9fbd869b19b8 mm: use aligned address in clear_gigantic_page()
>
> Following Ying's suggestion, I am trying to fix the performance issue
> on arm64 (it will be in a new series), but I need to confirm there is
> no regression on other arm64 machines. I will send the new patches once
> the testing is complete.
>
> So Andrew, please help pick up the above two bugfix patches for
> v6.13-rcX if there is no objection.

I added cc:stable to both and moved them into mm-hotfixes-unstable,
targeting an upstream merge next week.

"mm: use aligned address in clear_gigantic_page()":
https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com

"mm: use aligned address in copy_user_gigantic_page()":
https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
On 2024/12/6 9:47, Andrew Morton wrote:
> On Mon, 2 Dec 2024 09:03:45 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>>>>> WARNING: the patch is only build tested.
>>>>
>>>> I'm holding this patch in mm-unstable because it's unclear (to me) that
>>>> this regression has been adequately addressed?
>>>
>>> Yes. This patch isn't ready to be merged yet. More work is needed to
>>> root cause the problem and implement the proper fix. I guess that
>>> Kefeng will continue to work on this.
>>
>> Sorry for the late reply, I've been buried in other tasks.
>>
>> The two bugfix patches in linux-next should be merged first; they are
>> not related to the performance issue:
>>
>>   84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
>>   9fbd869b19b8 mm: use aligned address in clear_gigantic_page()
>>
>> Following Ying's suggestion, I am trying to fix the performance issue
>> on arm64 (it will be in a new series), but I need to confirm there is
>> no regression on other arm64 machines. I will send the new patches once
>> the testing is complete.
>>
>> So Andrew, please help pick up the above two bugfix patches for
>> v6.13-rcX if there is no objection.
>
> I added cc:stable to both and moved them into mm-hotfixes-unstable,
> targeting an upstream merge next week.
>
> "mm: use aligned address in clear_gigantic_page()":
> https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com
>
> "mm: use aligned address in copy_user_gigantic_page()":
> https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
>

Nice, thanks a lot.
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a4441fb77f7c..a5ea006f403e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(folio);
 			goto out;
 		}
-		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
+		folio_zero_user(folio, addr);
 		__folio_mark_uptodate(folio);
 		error = hugetlb_add_to_page_cache(folio, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/memory.c b/mm/memory.c
index 75c2dfd04f72..ef47b7ea5ddd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
 	int i;
 
 	might_sleep();
+	addr = ALIGN_DOWN(addr, folio_size(folio));
 	for (i = 0; i < nr_pages; i++) {
 		cond_resched();
 		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
When clearing a gigantic page, the subpages are zeroed from the first page
to the last page. If addr_hint is passed in directly, and it is not the
address of the first page of the folio, some architectures could flush the
wrong cache lines if they use addr_hint as a hint. For a non-gigantic page,
the base address is calculated internally, so even with a wrong addr_hint
there is only a performance impact (process_huge_page() wants to process
the target page last to keep its cache lines hot), no functional impact.

Let's pass the real accessed address to folio_zero_user() and use the
aligned address in clear_gigantic_page() to fix it.

Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
v2:
- update changelog to clarify the impact, per Andrew

 fs/hugetlbfs/inode.c | 2 +-
 mm/memory.c          | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)
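[Editorial note: a minimal caller-side sketch of the contract after this fix. zero_new_hugetlb_folio() is a hypothetical helper written for illustration; folio_zero_user() and __folio_mark_uptodate() are the real interfaces touched by the patch.]

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * After the fix, callers hand folio_zero_user() the address that was
 * actually accessed, which may point anywhere inside the huge mapping;
 * the gigantic-page path aligns it down to the folio boundary before
 * walking the subpages, so architectures that flush caches by user
 * address see the correct per-subpage addresses.
 */
static void zero_new_hugetlb_folio(struct folio *folio, unsigned long fault_addr)
{
	folio_zero_user(folio, fault_addr);	/* fault_addr need not be folio-aligned */
	__folio_mark_uptodate(folio);
}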