diff mbox series

[v2,1/2] mm: use aligned address in clear_gigantic_page()

Message ID 20241026054307.3896926-1-wangkefeng.wang@huawei.com (mailing list archive)
State New
Headers show
Series [v2,1/2] mm: use aligned address in clear_gigantic_page() | expand

Commit Message

Kefeng Wang Oct. 26, 2024, 5:43 a.m. UTC
When clearing a gigantic page, clear_gigantic_page() zeros the page from the
first subpage to the last. If the addr_hint passed in is not the address of
the first page of the folio, architectures that use addr_hint as a hint for
cache maintenance could flush the wrong cache lines. For a non-gigantic page,
the base address is calculated internally, so even a wrong addr_hint only has
a performance impact (process_huge_page() wants to process the target subpage
last to keep its cache lines hot); there is no functional impact.

Let's pass the real accessed address to folio_zero_user() and use the
aligned address in clear_gigantic_page() to fix it.

Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
v2: 
- update changelog to clarify the impact, per Andrew

 fs/hugetlbfs/inode.c | 2 +-
 mm/memory.c          | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

Comments

Huang, Ying Oct. 28, 2024, 6:17 a.m. UTC | #1
Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> When clearing gigantic page, it zeros page from the first page to the
> last page, if directly passing addr_hint which maybe not the address
> of the first page of folio, then some archs could flush the wrong cache
> if it does use the addr_hint as a hint. For non-gigantic page, it
> calculates the base address inside, even passed the wrong addr_hint, it
> only has performance impact as the process_huge_page() wants to process
> target page last to keep its cache lines hot), no functional impact.
>
> Let's pass the real accessed address to folio_zero_user() and use the
> aligned address in clear_gigantic_page() to fix it.
>
> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> v2: 
> - update changelog to clarify the impact, per Andrew
>
>  fs/hugetlbfs/inode.c | 2 +-
>  mm/memory.c          | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a4441fb77f7c..a5ea006f403e 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  			error = PTR_ERR(folio);
>  			goto out;
>  		}
> -		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
> +		folio_zero_user(folio, addr);

'addr' is set with the following statement above,

		/* addr is the offset within the file (zero based) */
		addr = index * hpage_size;
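In other words (a minimal worked note on the statement above): 'index' counts
whole huge pages from the start of the file, so

		/* addr = index * hpage_size  =>  addr % hpage_size == 0, and
		 * ALIGN_DOWN(addr, hpage_size) == addr -- the rounding is a no-op */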

So, we just don't need to ALIGN_DOWN() here.  Or do I miss something?

>  		__folio_mark_uptodate(folio);
>  		error = hugetlb_add_to_page_cache(folio, mapping, index);
>  		if (unlikely(error)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 75c2dfd04f72..ef47b7ea5ddd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>  	int i;
>  
>  	might_sleep();
> +	addr = ALIGN_DOWN(addr, folio_size(folio));
>  	for (i = 0; i < nr_pages; i++) {
>  		cond_resched();
>  		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);

--
Best Regards,
Huang, Ying
Kefeng Wang Oct. 28, 2024, 6:35 a.m. UTC | #2
On 2024/10/28 14:17, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> When clearing gigantic page, it zeros page from the first page to the
>> last page, if directly passing addr_hint which maybe not the address
>> of the first page of folio, then some archs could flush the wrong cache
>> if it does use the addr_hint as a hint. For non-gigantic page, it
>> calculates the base address inside, even passed the wrong addr_hint, it
>> only has performance impact as the process_huge_page() wants to process
>> target page last to keep its cache lines hot), no functional impact.
>>
>> Let's pass the real accessed address to folio_zero_user() and use the
>> aligned address in clear_gigantic_page() to fix it.
>>
>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>> v2:
>> - update changelog to clarify the impact, per Andrew
>>
>>   fs/hugetlbfs/inode.c | 2 +-
>>   mm/memory.c          | 1 +
>>   2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index a4441fb77f7c..a5ea006f403e 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>   			error = PTR_ERR(folio);
>>   			goto out;
>>   		}
>> -		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>> +		folio_zero_user(folio, addr);
> 
> 'addr' is set with the following statement above,
> 
> 		/* addr is the offset within the file (zero based) */
> 		addr = index * hpage_size;
> 
> So, we just don't need to ALIGN_DOWN() here.  Or do I miss something?

Yes, it is already aligned,
> 
>>   		__folio_mark_uptodate(folio);
>>   		error = hugetlb_add_to_page_cache(folio, mapping, index);
>>   		if (unlikely(error)) {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>   	int i;
>>   
>>   	might_sleep();
>> +	addr = ALIGN_DOWN(addr, folio_size(folio));

but for hugetlb_no_page(), we do need to align the addr since it uses
vmf->real_address, so I moved the alignment into clear_gigantic_page().

>>   	for (i = 0; i < nr_pages; i++) {
>>   		cond_resched();
>>   		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
> 
> --
> Best Regards,
> Huang, Ying
>
Huang, Ying Oct. 28, 2024, 7:03 a.m. UTC | #3
Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/28 14:17, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>> 
>>> When clearing gigantic page, it zeros page from the first page to the
>>> last page, if directly passing addr_hint which maybe not the address
>>> of the first page of folio, then some archs could flush the wrong cache
>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>> calculates the base address inside, even passed the wrong addr_hint, it
>>> only has performance impact as the process_huge_page() wants to process
>>> target page last to keep its cache lines hot), no functional impact.
>>>
>>> Let's pass the real accessed address to folio_zero_user() and use the
>>> aligned address in clear_gigantic_page() to fix it.
>>>
>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>> v2:
>>> - update changelog to clarify the impact, per Andrew
>>>
>>>   fs/hugetlbfs/inode.c | 2 +-
>>>   mm/memory.c          | 1 +
>>>   2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index a4441fb77f7c..a5ea006f403e 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>>   			error = PTR_ERR(folio);
>>>   			goto out;
>>>   		}
>>> -		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>> +		folio_zero_user(folio, addr);
>> 'addr' is set with the following statement above,
>> 		/* addr is the offset within the file (zero based) */
>> 		addr = index * hpage_size;
>> So, we just don't need to ALIGN_DOWN() here.  Or do I miss
>> something?
>
> Yes, it is already aligned,
>> 
>>>   		__folio_mark_uptodate(folio);
>>>   		error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>   		if (unlikely(error)) {
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>   	int i;
>>>     	might_sleep();
>>> +	addr = ALIGN_DOWN(addr, folio_size(folio));
>
> but for hugetlb_no_page(),  we do need to align the addr as it use
> vmf->real_address, so I move the alignment into the
> clear_gigantic_page.

That sounds good.  You may need to revise the patch description to explain
why you make the change.  Maybe something like below?

In the current kernel, hugetlb_no_page() calls folio_zero_user() with the
fault address, which may not be aligned to the huge page size.  Then,
folio_zero_user() may call clear_gigantic_page() with that address, while
clear_gigantic_page() requires the address to be huge-page-size aligned.  So,
this may cause memory corruption or an information leak.
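A minimal sketch of the path in question (simplified; the actual code lives in
mm/hugetlb.c and mm/memory.c, and details may differ):

	/* hugetlb fault path, roughly: vmf->real_address is the faulting
	 * address and may point anywhere inside the huge page */
	folio_zero_user(folio, vmf->real_address);

	/* folio_zero_user() then reaches clear_gigantic_page(), which, before
	 * this patch, treated the unaligned hint as the folio's base user
	 * address, so clear_user_highpage() was handed the wrong address for
	 * each subpage and cache-flushing architectures could flush the
	 * wrong lines */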

>>>   	for (i = 0; i < nr_pages; i++) {
>>>   		cond_resched();
>>>   		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);

--
Best Regards,
Huang, Ying
Kefeng Wang Oct. 28, 2024, 8:35 a.m. UTC | #4
On 2024/10/28 15:03, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/10/28 14:17, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> When clearing gigantic page, it zeros page from the first page to the
>>>> last page, if directly passing addr_hint which maybe not the address
>>>> of the first page of folio, then some archs could flush the wrong cache
>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>> calculates the base address inside, even passed the wrong addr_hint, it
>>>> only has performance impact as the process_huge_page() wants to process
>>>> target page last to keep its cache lines hot), no functional impact.
>>>>
>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>> aligned address in clear_gigantic_page() to fix it.
>>>>
>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> ---
>>>> v2:
>>>> - update changelog to clarify the impact, per Andrew
>>>>
>>>>    fs/hugetlbfs/inode.c | 2 +-
>>>>    mm/memory.c          | 1 +
>>>>    2 files changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>> --- a/fs/hugetlbfs/inode.c
>>>> +++ b/fs/hugetlbfs/inode.c
>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>>>    			error = PTR_ERR(folio);
>>>>    			goto out;
>>>>    		}
>>>> -		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>> +		folio_zero_user(folio, addr);
>>> 'addr' is set with the following statement above,
>>> 		/* addr is the offset within the file (zero based) */
>>> 		addr = index * hpage_size;
>>> So, we just don't need to ALIGN_DOWN() here.  Or do I miss
>>> something?
>>
>> Yes, it is already aligned,
>>>
>>>>    		__folio_mark_uptodate(folio);
>>>>    		error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>>    		if (unlikely(error)) {
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>>    	int i;
>>>>      	might_sleep();
>>>> +	addr = ALIGN_DOWN(addr, folio_size(folio));
>>
>> but for hugetlb_no_page(),  we do need to align the addr as it use
>> vmf->real_address, so I move the alignment into the
>> clear_gigantic_page.
> 
> That sounds good.  You may need to revise patch description to describe
> why you make the change.  May be something like below?
> 
> In current kernel, hugetlb_no_page() calls folio_zero_user() with the
> fault address.  Where the fault address may be not aligned with the huge
> page size.  Then, folio_zero_user() may call clear_gigantic_page() with
> the address, while clear_gigantic_page() requires the address to be huge
> page size aligned.  So, this may cause memory corruption or information
> leak.

OK, will use it and update all patches, thanks.

> 
>>>>    	for (i = 0; i < nr_pages; i++) {
>>>>    		cond_resched();
>>>>    		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
> 
> --
> Best Regards,
> Huang, Ying
>
David Hildenbrand Oct. 28, 2024, 10 a.m. UTC | #5
On 26.10.24 07:43, Kefeng Wang wrote:
> When clearing gigantic page, it zeros page from the first page to the
> last page, if directly passing addr_hint which maybe not the address
> of the first page of folio, then some archs could flush the wrong cache
> if it does use the addr_hint as a hint. For non-gigantic page, it
> calculates the base address inside, even passed the wrong addr_hint, it
> only has performance impact as the process_huge_page() wants to process
> target page last to keep its cache lines hot), no functional impact.
> 
> Let's pass the real accessed address to folio_zero_user() and use the
> aligned address in clear_gigantic_page() to fix it.
> 
> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> v2:
> - update changelog to clarify the impact, per Andrew
> 
>   fs/hugetlbfs/inode.c | 2 +-
>   mm/memory.c          | 1 +
>   2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a4441fb77f7c..a5ea006f403e 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>   			error = PTR_ERR(folio);
>   			goto out;
>   		}
> -		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
> +		folio_zero_user(folio, addr);
>   		__folio_mark_uptodate(folio);
>   		error = hugetlb_add_to_page_cache(folio, mapping, index);
>   		if (unlikely(error)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 75c2dfd04f72..ef47b7ea5ddd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>   	int i;
>   
>   	might_sleep();
> +	addr = ALIGN_DOWN(addr, folio_size(folio));

Right, that's what's effectively done in a very bad way in 
process_huge_page()

unsigned long addr = addr_hint &
		     ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
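Assuming nr_pages there is the power-of-two folio size in pages, that
open-coded mask computes the same base address as the ALIGN_DOWN() in the
hunk above, i.e. roughly:

unsigned long addr = ALIGN_DOWN(addr_hint,
				(unsigned long)nr_pages << PAGE_SHIFT);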


That should all be cleaned up ... process_huge_page() likely shouldn't 
be even consuming "nr_pages".


In clear_gigantic_page(), can you please rename the "unsigned long addr" 
parameter to "unsigned long addr_hint" and use an additional "unsigned 
long addr"?
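Roughly like the following sketch (just illustrating the suggestion, not a
tested patch):

static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
				unsigned int nr_pages)
{
	/* addr_hint may be any address inside the folio (e.g. the fault
	 * address); derive the folio's base user address from it */
	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
	int i;

	might_sleep();
	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
	}
}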
Kefeng Wang Oct. 28, 2024, 12:52 p.m. UTC | #6
On 2024/10/28 18:00, David Hildenbrand wrote:
> On 26.10.24 07:43, Kefeng Wang wrote:
>> When clearing gigantic page, it zeros page from the first page to the
>> last page, if directly passing addr_hint which maybe not the address
>> of the first page of folio, then some archs could flush the wrong cache
>> if it does use the addr_hint as a hint. For non-gigantic page, it
>> calculates the base address inside, even passed the wrong addr_hint, it
>> only has performance impact as the process_huge_page() wants to process
>> target page last to keep its cache lines hot), no functional impact.
>>
>> Let's pass the real accessed address to folio_zero_user() and use the
>> aligned address in clear_gigantic_page() to fix it.
>>
>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to 
>> folio_zero_user()")
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>> v2:
>> - update changelog to clarify the impact, per Andrew
>>
>>   fs/hugetlbfs/inode.c | 2 +-
>>   mm/memory.c          | 1 +
>>   2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index a4441fb77f7c..a5ea006f403e 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file, 
>> int mode, loff_t offset,
>>               error = PTR_ERR(folio);
>>               goto out;
>>           }
>> -        folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>> +        folio_zero_user(folio, addr);
>>           __folio_mark_uptodate(folio);
>>           error = hugetlb_add_to_page_cache(folio, mapping, index);
>>           if (unlikely(error)) {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio 
>> *folio, unsigned long addr,
>>       int i;
>>       might_sleep();
>> +    addr = ALIGN_DOWN(addr, folio_size(folio));
> 
> Right, that's what's effectively done in a very bad way in 
> process_huge_page()
> 
> unsigned long addr = addr_hint &
>               ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
> 
> 
> That should all be cleaned up ... process_huge_page() likely shouldn't 

Yes, let's fix the bug first,

> be even consuming "nr_pages".

Not sure about this part; it uses nr_pages as the end and to calculate the 
'base'.

> 
> 
> In clear_gigantic_page(), can you please rename the "unsigned long addr" 
> parameter to unsigned long "addr_hint" and use an additional "unsigned 
> long addr" ?
> 

Sure.
David Hildenbrand Oct. 28, 2024, 1:14 p.m. UTC | #7
On 28.10.24 13:52, Kefeng Wang wrote:
> 
> 
> On 2024/10/28 18:00, David Hildenbrand wrote:
>> On 26.10.24 07:43, Kefeng Wang wrote:
>>> When clearing gigantic page, it zeros page from the first page to the
>>> last page, if directly passing addr_hint which maybe not the address
>>> of the first page of folio, then some archs could flush the wrong cache
>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>> calculates the base address inside, even passed the wrong addr_hint, it
>>> only has performance impact as the process_huge_page() wants to process
>>> target page last to keep its cache lines hot), no functional impact.
>>>
>>> Let's pass the real accessed address to folio_zero_user() and use the
>>> aligned address in clear_gigantic_page() to fix it.
>>>
>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>> folio_zero_user()")
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>> v2:
>>> - update changelog to clarify the impact, per Andrew
>>>
>>>    fs/hugetlbfs/inode.c | 2 +-
>>>    mm/memory.c          | 1 +
>>>    2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index a4441fb77f7c..a5ea006f403e 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file,
>>> int mode, loff_t offset,
>>>                error = PTR_ERR(folio);
>>>                goto out;
>>>            }
>>> -        folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>> +        folio_zero_user(folio, addr);
>>>            __folio_mark_uptodate(folio);
>>>            error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>            if (unlikely(error)) {
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>> *folio, unsigned long addr,
>>>        int i;
>>>        might_sleep();
>>> +    addr = ALIGN_DOWN(addr, folio_size(folio));
>>
>> Right, that's what's effectively done in a very bad way in
>> process_huge_page()
>>
>> unsigned long addr = addr_hint &
>>                ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>
>>
>> That should all be cleaned up ... process_huge_page() likely shouldn't
> 
> Yes, let's fix the bug firstly,
> 
>> be even consuming "nr_pages".
> 
> No sure about this part, it uses nr_pages as the end and calculate the
> 'base'.

It should be using folio_nr_pages().
Kefeng Wang Oct. 28, 2024, 1:33 p.m. UTC | #8
On 2024/10/28 21:14, David Hildenbrand wrote:
> On 28.10.24 13:52, Kefeng Wang wrote:
>>
>>
>> On 2024/10/28 18:00, David Hildenbrand wrote:
>>> On 26.10.24 07:43, Kefeng Wang wrote:
>>>> When clearing gigantic page, it zeros page from the first page to the
>>>> last page, if directly passing addr_hint which maybe not the address
>>>> of the first page of folio, then some archs could flush the wrong cache
>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>> calculates the base address inside, even passed the wrong addr_hint, it
>>>> only has performance impact as the process_huge_page() wants to process
>>>> target page last to keep its cache lines hot), no functional impact.
>>>>
>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>> aligned address in clear_gigantic_page() to fix it.
>>>>
>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>>> folio_zero_user()")
>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> ---
>>>> v2:
>>>> - update changelog to clarify the impact, per Andrew
>>>>
>>>>    fs/hugetlbfs/inode.c | 2 +-
>>>>    mm/memory.c          | 1 +
>>>>    2 files changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>> --- a/fs/hugetlbfs/inode.c
>>>> +++ b/fs/hugetlbfs/inode.c
>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file,
>>>> int mode, loff_t offset,
>>>>                error = PTR_ERR(folio);
>>>>                goto out;
>>>>            }
>>>> -        folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>> +        folio_zero_user(folio, addr);
>>>>            __folio_mark_uptodate(folio);
>>>>            error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>>            if (unlikely(error)) {
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>>> *folio, unsigned long addr,
>>>>        int i;
>>>>        might_sleep();
>>>> +    addr = ALIGN_DOWN(addr, folio_size(folio));
>>>
>>> Right, that's what's effectively done in a very bad way in
>>> process_huge_page()
>>>
>>> unsigned long addr = addr_hint &
>>>                ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>
>>>
>>> That should all be cleaned up ... process_huge_page() likely shouldn't
>>
>> Yes, let's fix the bug firstly,
>>
>>> be even consuming "nr_pages".
>>
>> No sure about this part, it uses nr_pages as the end and calculate the
>> 'base'.
> 
> It should be using folio_nr_pages().

But process_huge_page() has no explicit folio argument. I'd like to
move the aligned-address calculation into folio_zero_user() and 
copy_user_large_folio() (which will be renamed to folio_copy_user()) in the 
follow-up cleanup patches; or should I do it in the fix patches?
David Hildenbrand Oct. 28, 2024, 1:46 p.m. UTC | #9
On 28.10.24 14:33, Kefeng Wang wrote:
> 
> 
> On 2024/10/28 21:14, David Hildenbrand wrote:
>> On 28.10.24 13:52, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/10/28 18:00, David Hildenbrand wrote:
>>>> On 26.10.24 07:43, Kefeng Wang wrote:
>>>>> When clearing gigantic page, it zeros page from the first page to the
>>>>> last page, if directly passing addr_hint which maybe not the address
>>>>> of the first page of folio, then some archs could flush the wrong cache
>>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>>> calculates the base address inside, even passed the wrong addr_hint, it
>>>>> only has performance impact as the process_huge_page() wants to process
>>>>> target page last to keep its cache lines hot), no functional impact.
>>>>>
>>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>>> aligned address in clear_gigantic_page() to fix it.
>>>>>
>>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>>>> folio_zero_user()")
>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> ---
>>>>> v2:
>>>>> - update changelog to clarify the impact, per Andrew
>>>>>
>>>>>     fs/hugetlbfs/inode.c | 2 +-
>>>>>     mm/memory.c          | 1 +
>>>>>     2 files changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>>> --- a/fs/hugetlbfs/inode.c
>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file *file,
>>>>> int mode, loff_t offset,
>>>>>                 error = PTR_ERR(folio);
>>>>>                 goto out;
>>>>>             }
>>>>> -        folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>>> +        folio_zero_user(folio, addr);
>>>>>             __folio_mark_uptodate(folio);
>>>>>             error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>>>             if (unlikely(error)) {
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>>>> *folio, unsigned long addr,
>>>>>         int i;
>>>>>         might_sleep();
>>>>> +    addr = ALIGN_DOWN(addr, folio_size(folio));
>>>>
>>>> Right, that's what's effectively done in a very bad way in
>>>> process_huge_page()
>>>>
>>>> unsigned long addr = addr_hint &
>>>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>>
>>>>
>>>> That should all be cleaned up ... process_huge_page() likely shouldn't
>>>
>>> Yes, let's fix the bug firstly,
>>>
>>>> be even consuming "nr_pages".
>>>
>>> No sure about this part, it uses nr_pages as the end and calculate the
>>> 'base'.
>>
>> It should be using folio_nr_pages().
> 
> But process_huge_page() without an explicit folio argument, I'd like to
> move the aligned address calculate into the folio_zero_user and
> copy_user_large_folio(will rename it to folio_copy_user()) in the
> following cleanup patches, or do it in the fix patches?

First, why does folio_zero_user() call process_huge_page() for *a small 
folio*? Because we like our code to be extra complicated to understand? 
Or am I missing something important?

Second, we should be passing the folio to "process_huge_page" and likely 
rename it to "folio_process_pages()" or sth like that. The function even 
documents "of the specified huge page", but there is none specified. The 
copy case might require a rework.

I think this code needs a serious cleanup ...
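For illustration only, a folio-based interface along those lines might look
like this (hypothetical name and signature, not existing kernel code):

static void folio_process_pages(struct folio *folio, unsigned long addr_hint,
				void (*process_page)(struct page *page,
						     unsigned long addr))
{
	unsigned long base = ALIGN_DOWN(addr_hint, folio_size(folio));
	long i, nr_pages = folio_nr_pages(folio);	/* derived from the folio */

	might_sleep();
	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		process_page(folio_page(folio, i), base + i * PAGE_SIZE);
	}
}

(The real process_huge_page() additionally reorders the loop so the subpages
around addr_hint are processed last to keep their cache lines hot; that part
is omitted here.)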
Kefeng Wang Oct. 28, 2024, 2:22 p.m. UTC | #10
On 2024/10/28 21:46, David Hildenbrand wrote:
> On 28.10.24 14:33, Kefeng Wang wrote:
>>
>>
>> On 2024/10/28 21:14, David Hildenbrand wrote:
>>> On 28.10.24 13:52, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/10/28 18:00, David Hildenbrand wrote:
>>>>> On 26.10.24 07:43, Kefeng Wang wrote:
>>>>>> When clearing gigantic page, it zeros page from the first page to the
>>>>>> last page, if directly passing addr_hint which maybe not the address
>>>>>> of the first page of folio, then some archs could flush the wrong 
>>>>>> cache
>>>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>>>> calculates the base address inside, even passed the wrong 
>>>>>> addr_hint, it
>>>>>> only has performance impact as the process_huge_page() wants to 
>>>>>> process
>>>>>> target page last to keep its cache lines hot), no functional impact.
>>>>>>
>>>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>>>> aligned address in clear_gigantic_page() to fix it.
>>>>>>
>>>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>>>>> folio_zero_user()")
>>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>> ---
>>>>>> v2:
>>>>>> - update changelog to clarify the impact, per Andrew
>>>>>>
>>>>>>     fs/hugetlbfs/inode.c | 2 +-
>>>>>>     mm/memory.c          | 1 +
>>>>>>     2 files changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>>>> --- a/fs/hugetlbfs/inode.c
>>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file 
>>>>>> *file,
>>>>>> int mode, loff_t offset,
>>>>>>                 error = PTR_ERR(folio);
>>>>>>                 goto out;
>>>>>>             }
>>>>>> -        folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>>>> +        folio_zero_user(folio, addr);
>>>>>>             __folio_mark_uptodate(folio);
>>>>>>             error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>>>>             if (unlikely(error)) {
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>>>>> *folio, unsigned long addr,
>>>>>>         int i;
>>>>>>         might_sleep();
>>>>>> +    addr = ALIGN_DOWN(addr, folio_size(folio));
>>>>>
>>>>> Right, that's what's effectively done in a very bad way in
>>>>> process_huge_page()
>>>>>
>>>>> unsigned long addr = addr_hint &
>>>>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>>>
>>>>>
>>>>> That should all be cleaned up ... process_huge_page() likely shouldn't
>>>>
>>>> Yes, let's fix the bug firstly,
>>>>
>>>>> be even consuming "nr_pages".
>>>>
>>>> No sure about this part, it uses nr_pages as the end and calculate the
>>>> 'base'.
>>>
>>> It should be using folio_nr_pages().
>>
>> But process_huge_page() without an explicit folio argument, I'd like to
>> move the aligned address calculate into the folio_zero_user and
>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>> following cleanup patches, or do it in the fix patches?
> 
> First, why does folio_zero_user() call process_huge_page() for *a small 
> folio*? Because we like or code to be extra complicated to understand? 
> Or am I missing something important?

folio_zero_user() was used for PMD-sized THP and HugeTLB before, and since
anon mTHP support it is used for order-2 up to PMD-order THP as well as
HugeTLB, so it won't process a small folio if I understand correctly.

> 
> Second, we should be passing the folio to "process_huge_page" and likely 
> rename it to "folio_process_pages()" or sth like that. The function even 
> documents "of the specified huge page", but there is none specified. The 
> copy case might require a rework.
> 
> I think this code needs a serious cleanup ...
> 

Yes, I'd like to do more cleanup and rework them; I also found some
performance issues on my arm64 machine[1] which need to be addressed.

[1] 
https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
David Hildenbrand Oct. 28, 2024, 2:24 p.m. UTC | #11
On 28.10.24 15:22, Kefeng Wang wrote:
> 
> 
> On 2024/10/28 21:46, David Hildenbrand wrote:
>> On 28.10.24 14:33, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/10/28 21:14, David Hildenbrand wrote:
>>>> On 28.10.24 13:52, Kefeng Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/10/28 18:00, David Hildenbrand wrote:
>>>>>> On 26.10.24 07:43, Kefeng Wang wrote:
>>>>>>> When clearing gigantic page, it zeros page from the first page to the
>>>>>>> last page, if directly passing addr_hint which maybe not the address
>>>>>>> of the first page of folio, then some archs could flush the wrong
>>>>>>> cache
>>>>>>> if it does use the addr_hint as a hint. For non-gigantic page, it
>>>>>>> calculates the base address inside, even passed the wrong
>>>>>>> addr_hint, it
>>>>>>> only has performance impact as the process_huge_page() wants to
>>>>>>> process
>>>>>>> target page last to keep its cache lines hot), no functional impact.
>>>>>>>
>>>>>>> Let's pass the real accessed address to folio_zero_user() and use the
>>>>>>> aligned address in clear_gigantic_page() to fix it.
>>>>>>>
>>>>>>> Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to
>>>>>>> folio_zero_user()")
>>>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>> ---
>>>>>>> v2:
>>>>>>> - update changelog to clarify the impact, per Andrew
>>>>>>>
>>>>>>>      fs/hugetlbfs/inode.c | 2 +-
>>>>>>>      mm/memory.c          | 1 +
>>>>>>>      2 files changed, 2 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>>>> index a4441fb77f7c..a5ea006f403e 100644
>>>>>>> --- a/fs/hugetlbfs/inode.c
>>>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>>>> @@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct file
>>>>>>> *file,
>>>>>>> int mode, loff_t offset,
>>>>>>>                  error = PTR_ERR(folio);
>>>>>>>                  goto out;
>>>>>>>              }
>>>>>>> -        folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
>>>>>>> +        folio_zero_user(folio, addr);
>>>>>>>              __folio_mark_uptodate(folio);
>>>>>>>              error = hugetlb_add_to_page_cache(folio, mapping, index);
>>>>>>>              if (unlikely(error)) {
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 75c2dfd04f72..ef47b7ea5ddd 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -6821,6 +6821,7 @@ static void clear_gigantic_page(struct folio
>>>>>>> *folio, unsigned long addr,
>>>>>>>          int i;
>>>>>>>          might_sleep();
>>>>>>> +    addr = ALIGN_DOWN(addr, folio_size(folio));
>>>>>>
>>>>>> Right, that's what's effectively done in a very bad way in
>>>>>> process_huge_page()
>>>>>>
>>>>>> unsigned long addr = addr_hint &
>>>>>>                  ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>>>>
>>>>>>
>>>>>> That should all be cleaned up ... process_huge_page() likely shouldn't
>>>>>
>>>>> Yes, let's fix the bug firstly,
>>>>>
>>>>>> be even consuming "nr_pages".
>>>>>
>>>>> No sure about this part, it uses nr_pages as the end and calculate the
>>>>> 'base'.
>>>>
>>>> It should be using folio_nr_pages().
>>>
>>> But process_huge_page() without an explicit folio argument, I'd like to
>>> move the aligned address calculate into the folio_zero_user and
>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>> following cleanup patches, or do it in the fix patches?
>>
>> First, why does folio_zero_user() call process_huge_page() for *a small
>> folio*? Because we like or code to be extra complicated to understand?
>> Or am I missing something important?
> 
> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
> after anon mTHP supported, it is used for order-2~order-PMD-order THP
> and HugeTLB, so it won't process a small folio if I understand correctly.

And unfortunately neither the documentation nor the function name 
expresses that :(

I'm happy to review any patches that improve the situation here :)
Kefeng Wang Oct. 29, 2024, 1:04 p.m. UTC | #12
>>>>>>>
>>>>>>> That should all be cleaned up ... process_huge_page() likely 
>>>>>>> shouldn't
>>>>>>
>>>>>> Yes, let's fix the bug firstly,
>>>>>>
>>>>>>> be even consuming "nr_pages".
>>>>>>
>>>>>> No sure about this part, it uses nr_pages as the end and calculate 
>>>>>> the
>>>>>> 'base'.
>>>>>
>>>>> It should be using folio_nr_pages().
>>>>
>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>> move the aligned address calculate into the folio_zero_user and
>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>> following cleanup patches, or do it in the fix patches?
>>>
>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>> folio*? Because we like or code to be extra complicated to understand?
>>> Or am I missing something important?
>>
>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>> and HugeTLB, so it won't process a small folio if I understand correctly.
> 
> And unfortunately neither the documentation nor the function name 
> expresses that :(
> 
> I'm happy to review any patches that improve the situation here :)
> 

Actually, could we drop process_huge_page() entirely? From my
test cases[1], process_huge_page() is no better than clearing/copying the
pages from first to last, and sequential clearing/copying may be more
beneficial to hardware prefetching. Also, is there a way to have LKP
run tests to check the performance? Since process_huge_page()
was submitted by Ying, what's your opinion?


[1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
case-anon-w-seq-hugetlb (2M PMD HugeTLB)
fallocate hugetlb 20G (2M PMD HugeTLB)
fallocate tmpfs 20G (2M PMD THP)
David Hildenbrand Oct. 29, 2024, 2:04 p.m. UTC | #13
On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>
>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>> shouldn't
>>>>>>>
>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>
>>>>>>>> be even consuming "nr_pages".
>>>>>>>
>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>> the
>>>>>>> 'base'.
>>>>>>
>>>>>> It should be using folio_nr_pages().
>>>>>
>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>> move the aligned address calculate into the folio_zero_user and
>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>> following cleanup patches, or do it in the fix patches?
>>>>
>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>> folio*? Because we like or code to be extra complicated to understand?
>>>> Or am I missing something important?
>>>
>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>
>> And unfortunately neither the documentation nor the function name
>> expresses that :(
>>
>> I'm happy to review any patches that improve the situation here :)
>>
> 
> Actually, could we drop the process_huge_page() totally, from my
> testcase[1], process_huge_page() is not better than clear/copy page
> from start to last, and sequential clearing/copying maybe more
> beneficial to the hardware prefetching, and is there a way to let lkp
> to test to check the performance, since the process_huge_page()
> was submitted by Ying, what's your opinion?

I questioned that just recently [1], and Ying assumed that it still 
applies [2].

c79b57e462b5 ("mm: hugetlb: clear target
sub-page last when clearing huge page”) documents the scenario where 
this matters -- anon-w-seq which you also run below.

If there is no performance benefit anymore, we should rip that out. But 
likely we should check on multiple micro-architectures with multiple 
#CPU configs that are relevant. c79b57e462b5 used a Xeon E5 v3 2699 with 
72 processes on 2 NUMA nodes, maybe your test environment cannot 
replicate that?


[1] 
https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
[2] 
https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/

> 
> 
> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
> case-anon-w-seq-hugetlb (2M PMD HugeTLB)

But these are sequential, not random. I'd have thought access + zeroing 
would be sequential either way. Did you run with random access as well?
Huang, Ying Oct. 30, 2024, 1:04 a.m. UTC | #14
David Hildenbrand <david@redhat.com> writes:

> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>
>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>> shouldn't
>>>>>>>>
>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>
>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>
>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>> the
>>>>>>>> 'base'.
>>>>>>>
>>>>>>> It should be using folio_nr_pages().
>>>>>>
>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>
>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>> Or am I missing something important?
>>>>
>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>
>>> And unfortunately neither the documentation nor the function name
>>> expresses that :(
>>>
>>> I'm happy to review any patches that improve the situation here :)
>>>
>> Actually, could we drop the process_huge_page() totally, from my
>> testcase[1], process_huge_page() is not better than clear/copy page
>> from start to last, and sequential clearing/copying maybe more
>> beneficial to the hardware prefetching, and is there a way to let lkp
>> to test to check the performance, since the process_huge_page()
>> was submitted by Ying, what's your opinion?

I don't think that it's a good idea to revert the commit without
studying and root-causing the issues.  I can work together with you on
that.  If we have solid and well-explained data to prove that
process_huge_page() isn't beneficial, we can revert the commit.

> I questioned that just recently [1], and Ying assumed that it still
> applies [2].
>
> c79b57e462b5 ("mm: hugetlb: clear target
> sub-page last when clearing huge page”) documents the scenario where
> this matters -- anon-w-seq which you also run below.
>
> If there is no performance benefit anymore, we should rip that
> out. But likely we should check on multiple micro-architectures with
> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
> cannot replicate that?
>
>
> [1]
> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
> [2]
> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>
> But these are sequential, not random. I'd have thought access +
> zeroing would be sequentially either way. Did you run with random
> access as well>

--
Best Regards,
Huang, Ying
Kefeng Wang Oct. 30, 2024, 3:04 a.m. UTC | #15
On 2024/10/30 9:04, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>
>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>> shouldn't
>>>>>>>>>
>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>
>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>
>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>> the
>>>>>>>>> 'base'.
>>>>>>>>
>>>>>>>> It should be using folio_nr_pages().
>>>>>>>
>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>
>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>> Or am I missing something important?
>>>>>
>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>
>>>> And unfortunately neither the documentation nor the function name
>>>> expresses that :(
>>>>
>>>> I'm happy to review any patches that improve the situation here :)
>>>>
>>> Actually, could we drop the process_huge_page() totally, from my
>>> testcase[1], process_huge_page() is not better than clear/copy page
>>> from start to last, and sequential clearing/copying maybe more
>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>> to test to check the performance, since the process_huge_page()
>>> was submitted by Ying, what's your opinion?
> 
> I don't think that it's a good idea to revert the commit without
> studying and root causing the issues.  I can work together with you on
> that.  If we have solid and well explained data to prove
> process_huge_page() isn't benefitial, we can revert the commit.


Take 'fallocate 20G' as an example. Before:

Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,118.94 msec task-clock                #    0.999 CPUs utilized
                30      context-switches          #    0.010 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               136      page-faults               #    0.044 K/sec
     8,092,075,873      cycles                    #    2.594 GHz                      (92.82%)
     1,624,587,663      instructions              #    0.20  insn per cycle           (92.83%)
       395,341,850      branches                  #  126.755 M/sec                    (92.82%)
         3,872,302      branch-misses             #    0.98% of all branches          (92.83%)
     1,398,066,701      L1-dcache-loads           #  448.251 M/sec                    (92.82%)
        58,124,626      L1-dcache-load-misses     #    4.16% of all L1-dcache accesses  (92.82%)
         1,032,527      LLC-loads                 #    0.331 M/sec                    (92.82%)
           498,684      LLC-load-misses           #   48.30% of all LL-cache accesses  (92.84%)
       473,689,004      L1-icache-loads           #  151.875 M/sec                    (92.82%)
           356,721      L1-icache-load-misses     #    0.08% of all L1-icache accesses  (92.85%)
     1,947,644,987      dTLB-loads                #  624.458 M/sec                    (92.95%)
            10,185      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (92.96%)
       474,622,896      iTLB-loads                #  152.174 M/sec                    (92.95%)
                94      iTLB-load-misses          #    0.00% of all iTLB cache accesses  (85.69%)

       3.122844830 seconds time elapsed

       0.000000000 seconds user
       3.107259000 seconds sys

and after (clearing from start to end):

Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          1,135.53 msec task-clock                #    0.999 CPUs utilized
                10      context-switches          #    0.009 K/sec
                 1      cpu-migrations            #    0.001 K/sec
               137      page-faults               #    0.121 K/sec
     2,946,673,587      cycles                    #    2.595 GHz                      (92.67%)
     1,620,704,205      instructions              #    0.55  insn per cycle           (92.61%)
       394,595,772      branches                  #  347.499 M/sec                    (92.60%)
           130,756      branch-misses             #    0.03% of all branches          (92.84%)
     1,396,726,689      L1-dcache-loads           # 1230.022 M/sec                    (92.96%)
           338,344      L1-dcache-load-misses     #    0.02% of all L1-dcache accesses  (92.95%)
           111,737      LLC-loads                 #    0.098 M/sec                    (92.96%)
            67,486      LLC-load-misses           #   60.40% of all LL-cache accesses  (92.96%)
       418,198,663      L1-icache-loads           #  368.285 M/sec                    (92.96%)
           173,764      L1-icache-load-misses     #    0.04% of all L1-icache accesses  (92.96%)
     2,203,364,632      dTLB-loads                # 1940.385 M/sec                    (92.96%)
            17,195      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (92.95%)
       418,198,365      iTLB-loads                #  368.285 M/sec                    (92.96%)
                79      iTLB-load-misses          #    0.00% of all iTLB cache accesses  (85.34%)

       1.137015760 seconds time elapsed

       0.000000000 seconds user
       1.131266000 seconds sys

The IPC improved a lot, with fewer LLC-loads and more L1-dcache-loads, but
this depends on the implementation of the microarchitecture.

1) I will run some random-access tests to check the performance difference, 
as David suggested.

2) I hope LKP can run more tests since it is very useful (more test sets 
and different machines).


> 
>> I questioned that just recently [1], and Ying assumed that it still
>> applies [2].
>>
>> c79b57e462b5 ("mm: hugetlb: clear target
>> sub-page last when clearing huge page”) documents the scenario where
>> this matters -- anon-w-seq which you also run below.
>>
>> If there is no performance benefit anymore, we should rip that
>> out. But likely we should check on multiple micro-architectures with
>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
>> cannot replicate that?>>
>>
>> [1]
>> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
>> [2]
>> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>
>>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>>
>> But these are sequential, not random. I'd have thought access +
>> zeroing would be sequentially either way. Did you run with random
>> access as well>

Will do.
> > --
> Best Regards,
> Huang, Ying
>
Huang, Ying Oct. 30, 2024, 3:21 a.m. UTC | #16
Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/30 9:04, Huang, Ying wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>
>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>> shouldn't
>>>>>>>>>>
>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>
>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>
>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>>> the
>>>>>>>>>> 'base'.
>>>>>>>>>
>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>
>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>>
>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>>> Or am I missing something important?
>>>>>>
>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>>
>>>>> And unfortunately neither the documentation nor the function name
>>>>> expresses that :(
>>>>>
>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>
>>>> Actually, could we drop the process_huge_page() totally, from my
>>>> testcase[1], process_huge_page() is not better than clear/copy page
>>>> from start to last, and sequential clearing/copying maybe more
>>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>>> to test to check the performance, since the process_huge_page()
>>>> was submitted by Ying, what's your opinion?
>> I don't think that it's a good idea to revert the commit without
>> studying and root causing the issues.  I can work together with you on
>> that.  If we have solid and well explained data to prove
>> process_huge_page() isn't benefitial, we can revert the commit.
>
>
> Take 'fallocate 20G' as an example, before
>
> Performance counter stats for 'taskset -c 10 fallocate -l 20G
> /mnt/hugetlbfs/test':

IIUC, fallocate will zero pages, but will not touch them at all, right?
If so, no cache benefit from clearing referenced page last.

>           3,118.94 msec task-clock                #    0.999 CPUs utilized
>                 30      context-switches          #    0.010 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                136      page-faults               #    0.044 K/sec
>      8,092,075,873      cycles                    #    2.594 GHz                      (92.82%)
>      1,624,587,663      instructions              #    0.20  insn per cycle           (92.83%)
>        395,341,850      branches                  #  126.755 M/sec                    (92.82%)
>          3,872,302      branch-misses             #    0.98% of all branches          (92.83%)
>      1,398,066,701      L1-dcache-loads           #  448.251 M/sec                    (92.82%)
>         58,124,626      L1-dcache-load-misses     #    4.16% of all L1-dcache accesses  (92.82%)
>          1,032,527      LLC-loads                 #    0.331 M/sec                    (92.82%)
>            498,684      LLC-load-misses           #   48.30% of all LL-cache accesses  (92.84%)
>        473,689,004      L1-icache-loads           #  151.875 M/sec                    (92.82%)
>            356,721      L1-icache-load-misses     #    0.08% of all L1-icache accesses  (92.85%)
>      1,947,644,987      dTLB-loads                #  624.458 M/sec                    (92.95%)
>             10,185      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (92.96%)
>        474,622,896      iTLB-loads                #  152.174 M/sec                    (92.95%)
>                 94      iTLB-load-misses          #    0.00% of all iTLB cache accesses  (85.69%)
>
>        3.122844830 seconds time elapsed
>
>        0.000000000 seconds user
>        3.107259000 seconds sys
>
> and after (clearing from start to end):
>
> Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':
>
>           1,135.53 msec task-clock                #    0.999 CPUs utilized
>                 10      context-switches          #    0.009 K/sec
>                  1      cpu-migrations            #    0.001 K/sec
>                137      page-faults               #    0.121 K/sec
>      2,946,673,587      cycles                    #    2.595 GHz                      (92.67%)
>      1,620,704,205      instructions              #    0.55  insn per cycle           (92.61%)
>        394,595,772      branches                  #  347.499 M/sec                    (92.60%)
>            130,756      branch-misses             #    0.03% of all branches          (92.84%)
>      1,396,726,689      L1-dcache-loads           # 1230.022 M/sec                    (92.96%)
>            338,344      L1-dcache-load-misses     #    0.02% of all L1-dcache accesses  (92.95%)
>            111,737      LLC-loads                 #    0.098 M/sec                    (92.96%)
>             67,486      LLC-load-misses           #   60.40% of all LL-cache accesses  (92.96%)
>        418,198,663      L1-icache-loads           #  368.285 M/sec                    (92.96%)
>            173,764      L1-icache-load-misses     #    0.04% of all L1-icache accesses  (92.96%)
>      2,203,364,632      dTLB-loads                # 1940.385 M/sec                    (92.96%)
>             17,195      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (92.95%)
>        418,198,365      iTLB-loads                #  368.285 M/sec                    (92.96%)
>                 79      iTLB-load-misses          #    0.00% of all iTLB cache accesses  (85.34%)
>
>        1.137015760 seconds time elapsed
>
>        0.000000000 seconds user
>        1.131266000 seconds sys
>
> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but
> this depends on the implementation of the microarchitecture.

Anyway, we need to avoid (or at least reduce) any regression in pure memory
clearing performance.  Have you double-checked whether process_huge_page() is
inlined?  A perf-profile result can be used to check this too.

When you say from start to end, do you mean using clear_gigantic_page()
directly, or changing process_huge_page() to clear pages from start to end?

> 1) Will test some rand test to check the different of performance as
> David suggested.
>
> 2) Hope the LKP to run more tests since it is very useful(more test
> set and different machines)

I'm starting to use LKP to test.

--
Best Regards,
Huang, Ying

>
>> 
>>> I questioned that just recently [1], and Ying assumed that it still
>>> applies [2].
>>>
>>> c79b57e462b5 ("mm: hugetlb: clear target
>>> sub-page last when clearing huge page”) documents the scenario where
>>> this matters -- anon-w-seq which you also run below.
>>>
>>> If there is no performance benefit anymore, we should rip that
>>> out. But likely we should check on multiple micro-architectures with
>>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5
>>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment
>>> cannot replicate that?>>
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@redhat.com/
>>> [2]
>>> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>>
>>>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@huawei.com/
>>>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
>>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB)
>>>
>>> But these are sequential, not random. I'd have thought access +
>>> zeroing would be sequentially either way. Did you run with random
>>> access as well>
>
> Will do.
>> > --
>> Best Regards,
>> Huang, Ying
>>
Kefeng Wang Oct. 30, 2024, 5:05 a.m. UTC | #17
On 2024/10/30 11:21, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/10/30 9:04, Huang, Ying wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>> shouldn't
>>>>>>>>>>>
>>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>>
>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>
>>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>>>> the
>>>>>>>>>>> 'base'.
>>>>>>>>>>
>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>
>>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>>>
>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>>>> Or am I missing something important?
>>>>>>>
>>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>>>
>>>>>> And unfortunately neither the documentation nor the function name
>>>>>> expresses that :(
>>>>>>
>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>
>>>>> Actually, could we drop the process_huge_page() totally, from my
>>>>> testcase[1], process_huge_page() is not better than clear/copy page
>>>>> from start to last, and sequential clearing/copying maybe more
>>>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>>>> to test to check the performance, since the process_huge_page()
>>>>> was submitted by Ying, what's your opinion?
>>> I don't think that it's a good idea to revert the commit without
>>> studying and root causing the issues.  I can work together with you on
>>> that.  If we have solid and well explained data to prove
>>> process_huge_page() isn't benefitial, we can revert the commit.
>>
>>
>> Take 'fallocate 20G' as an example, before
>>
>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>> /mnt/hugetlbfs/test':
> 
> IIUC, fallocate will zero pages, but will not touch them at all, right?
> If so, no cache benefit from clearing referenced page last.


Yes, for this case it only clears pages.
> 
>>            3,118.94 msec task-clock                #    0.999 CPUs
>>            utilized
>>                  30      context-switches          #    0.010 K/sec
>>                  1      cpu-migrations            #    0.000 K/sec
>>                  136      page-faults               #    0.044 K/sec
>>                  8,092,075,873      cycles                    #
>>                  2.594 GHz                (92.82%)
>>       1,624,587,663      instructions              #    0.20  insn per
>>       cycle           (92.83%)
>>         395,341,850      branches                  #  126.755 M/sec
>>         (92.82%)
>>           3,872,302      branch-misses             #    0.98% of all
>>           branches          (92.83%)
>>       1,398,066,701      L1-dcache-loads           #  448.251 M/sec
>>       (92.82%)
>>          58,124,626      L1-dcache-load-misses     #    4.16% of all
>>          L1-dcache accesses  (92.82%)
>>           1,032,527      LLC-loads                 #    0.331 M/sec
>>           (92.82%)
>>             498,684      LLC-load-misses           #   48.30% of all
>>             LL-cache accesses  (92.84%)
>>         473,689,004      L1-icache-loads           #  151.875 M/sec
>>         (92.82%)
>>             356,721      L1-icache-load-misses     #    0.08% of all
>>             L1-icache accesses  (92.85%)
>>       1,947,644,987      dTLB-loads                #  624.458 M/sec
>>       (92.95%)
>>              10,185      dTLB-load-misses          #    0.00% of all
>>              dTLB cache accesses  (92.96%)
>>         474,622,896      iTLB-loads                #  152.174 M/sec
>>         (92.95%)
>>                  94      iTLB-load-misses          #    0.00% of all
>>                  iTLB cache accesses  (85.69%)
>>
>>         3.122844830 seconds time elapsed
>>
>>         0.000000000 seconds user
>>         3.107259000 seconds sys
>>
>> and after(clear from start to end)
>>
>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>> /mnt/hugetlbfs/test':
>>
>>            1,135.53 msec task-clock                #    0.999 CPUs
>>            utilized
>>                  10      context-switches          #    0.009 K/sec
>>                  1      cpu-migrations            #    0.001 K/sec
>>                  137      page-faults               #    0.121 K/sec
>>                  2,946,673,587      cycles                    #
>>                  2.595 GHz                (92.67%)
>>       1,620,704,205      instructions              #    0.55  insn per
>>       cycle           (92.61%)
>>         394,595,772      branches                  #  347.499 M/sec
>>         (92.60%)
>>             130,756      branch-misses             #    0.03% of all
>>             branches          (92.84%)
>>       1,396,726,689      L1-dcache-loads           # 1230.022 M/sec
>>       (92.96%)
>>             338,344      L1-dcache-load-misses     #    0.02% of all
>>             L1-dcache accesses  (92.95%)
>>             111,737      LLC-loads                 #    0.098 M/sec
>>             (92.96%)
>>              67,486      LLC-load-misses           #   60.40% of all
>>              LL-cache accesses  (92.96%)
>>         418,198,663      L1-icache-loads           #  368.285 M/sec
>>         (92.96%)
>>             173,764      L1-icache-load-misses     #    0.04% of all
>>             L1-icache accesses  (92.96%)
>>       2,203,364,632      dTLB-loads                # 1940.385 M/sec
>>       (92.96%)
>>              17,195      dTLB-load-misses          #    0.00% of all
>>              dTLB cache accesses  (92.95%)
>>         418,198,365      iTLB-loads                #  368.285 M/sec
>>         (92.96%)
>>                  79      iTLB-load-misses          #    0.00% of all
>>                  iTLB cache accesses  (85.34%)
>>
>>         1.137015760 seconds time elapsed
>>
>>         0.000000000 seconds user
>>         1.131266000 seconds sys
>>
>> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but
>> this depends on the implementation of the microarchitecture.
> 
> Anyway, we need to avoid (or reduce at least) the pure memory clearing
> performance.  Have you double checked whether process_huge_page() is
> inlined?  Perf-profile result can be used to check this too.
> 

Yes, I'm sure the process_huge_page() is inlined.

> When you say from start to end, you mean to use clear_gigantic_page()
> directly, or change process_huge_page() to clear page from start to end?
> 

Using clear_gigantic_page() and changing process_huge_page() to clear
pages from start to end are both good for performance with sequential
clearing, but I have no random-access test results so far.

>> 1) Will test some rand test to check the different of performance as
>> David suggested.
>>
>> 2) Hope the LKP to run more tests since it is very useful(more test
>> set and different machines)
> 
> I'm starting to use LKP to test.

Great.
Huang, Ying Oct. 31, 2024, 8:39 a.m. UTC | #18
Kefeng Wang <wangkefeng.wang@huawei.com> writes:

[snip]
>
>>> 1) Will test some rand test to check the different of performance as
>>> David suggested.
>>>
>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>> set and different machines)
>> I'm starting to use LKP to test.
>
> Greet.

I have run some tests with LKP.

Firstly, there's almost no measurable difference between clearing pages
from start to end and from end to start on an Intel server CPU.  I guess
that there's some similar optimization for both directions.

For the multi-process (process count equal to the number of logical
CPUs) vm-scalability/anon-w-seq test case, the benchmark score increases
by about 22.4%.

For the multi-process vm-scalability/anon-w-rand test case, there is no
measurable difference in the benchmark score.

So, the optimization mainly helps sequential workloads.

In summary, on x86, process_huge_page() will not introduce any
regression, and it helps some workloads.

However, on ARM64 it does introduce some regression, because pages are
cleared from end to start.  That needs to be addressed.  I guess the
regression can be resolved by doing more (but not all) of the clearing
from start to end.  For example, can you take a look at the patch below?
It uses a similar framework as before, but clears each small chunk
(mpage) from start to end.  You can adjust MPAGE_NRPAGES to check at
which chunk size the regression goes away.
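(For reference, with 4 KiB base pages, MPAGE_NRPAGES = 16 gives 64 KiB
chunks and MPAGE_NRPAGES = 64 gives 256 KiB chunks, so a 2 MiB PMD-sized
huge page is split into 32 or 8 chunks respectively, each cleared from
start to end.)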

WARNING: the patch is only build tested.

Best Regards,
Huang, Ying

-----------------------------------8<----------------------------------------
From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Thu, 31 Oct 2024 11:13:57 +0800
Subject: [PATCH] mpage clear

---
 mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 67 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3ccee51adfbb..1fdc548c4275 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6769,6 +6769,68 @@ static inline int process_huge_page(
 	return 0;
 }
 
+#define MPAGE_NRPAGES	(1<<4)
+#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
+static inline int clear_huge_page(
+	unsigned long addr_hint, unsigned int nr_pages,
+	int (*process_subpage)(unsigned long addr, int idx, void *arg),
+	void *arg)
+{
+	int i, n, base, l, ret;
+	unsigned long addr = addr_hint &
+		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
+	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
+
+	/* Process target subpage last to keep its cache lines hot */
+	might_sleep();
+	n = (addr_hint - addr) / MPAGE_SIZE;
+	if (2 * n <= nr_mpages) {
+		/* If target subpage in first half of huge page */
+		base = 0;
+		l = n;
+		/* Process subpages at the end of huge page */
+		for (i = nr_mpages - 1; i >= 2 * n; i--) {
+			cond_resched();
+			ret = process_subpage(addr + i * MPAGE_SIZE,
+					      i * MPAGE_NRPAGES, arg);
+			if (ret)
+				return ret;
+		}
+	} else {
+		/* If target subpage in second half of huge page */
+		base = nr_mpages - 2 * (nr_mpages - n);
+		l = nr_mpages - n;
+		/* Process subpages at the begin of huge page */
+		for (i = 0; i < base; i++) {
+			cond_resched();
+			ret = process_subpage(addr + i * MPAGE_SIZE,
+					      i * MPAGE_NRPAGES, arg);
+			if (ret)
+				return ret;
+		}
+	}
+	/*
+	 * Process remaining subpages in left-right-left-right pattern
+	 * towards the target subpage
+	 */
+	for (i = 0; i < l; i++) {
+		int left_idx = base + i;
+		int right_idx = base + 2 * l - 1 - i;
+
+		cond_resched();
+		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
+				      left_idx * MPAGE_NRPAGES, arg);
+		if (ret)
+			return ret;
+		cond_resched();
+		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
+				      right_idx * MPAGE_NRPAGES, arg);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
 static void clear_gigantic_page(struct folio *folio, unsigned long addr,
 				unsigned int nr_pages)
 {
@@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
 static int clear_subpage(unsigned long addr, int idx, void *arg)
 {
 	struct folio *folio = arg;
+	int i;
 
-	clear_user_highpage(folio_page(folio, idx), addr);
+	for (i = 0; i < MPAGE_NRPAGES; i++)
+		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
 	return 0;
 }
 
@@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
 	unsigned int nr_pages = folio_nr_pages(folio);
 
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
+	if (unlikely(nr_pages != HPAGE_PMD_NR))
 		clear_gigantic_page(folio, addr_hint, nr_pages);
 	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
 }
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
Huang, Ying Nov. 1, 2024, 6:18 a.m. UTC | #19
Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/30 11:21, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>> 
>>> On 2024/10/30 9:04, Huang, Ying wrote:
>>>> David Hildenbrand <david@redhat.com> writes:
>>>>
>>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>>> shouldn't
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>>>
>>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>>
>>>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>>>>> the
>>>>>>>>>>>> 'base'.
>>>>>>>>>>>
>>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>>
>>>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>>>>
>>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>>>>> Or am I missing something important?
>>>>>>>>
>>>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>>>>
>>>>>>> And unfortunately neither the documentation nor the function name
>>>>>>> expresses that :(
>>>>>>>
>>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>>
>>>>>> Actually, could we drop the process_huge_page() totally, from my
>>>>>> testcase[1], process_huge_page() is not better than clear/copy page
>>>>>> from start to last, and sequential clearing/copying maybe more
>>>>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>>>>> to test to check the performance, since the process_huge_page()
>>>>>> was submitted by Ying, what's your opinion?
>>>> I don't think that it's a good idea to revert the commit without
>>>> studying and root causing the issues.  I can work together with you on
>>>> that.  If we have solid and well explained data to prove
>>>> process_huge_page() isn't benefitial, we can revert the commit.
>>>
>>>
>>> Take 'fallocate 20G' as an example, before
>>>
>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>> /mnt/hugetlbfs/test':
>> IIUC, fallocate will zero pages, but will not touch them at all,
>> right?
>> If so, no cache benefit from clearing referenced page last.
>
>
> Yes, for this case, only clear page.
>> 
>>>            3,118.94 msec task-clock                #    0.999 CPUs
>>>            utilized
>>>                  30      context-switches          #    0.010 K/sec
>>>                  1      cpu-migrations            #    0.000 K/sec
>>>                  136      page-faults               #    0.044 K/sec
>>>                  8,092,075,873      cycles                    #
>>>                  2.594 GHz                (92.82%)
>>>       1,624,587,663      instructions              #    0.20  insn per
>>>       cycle           (92.83%)
>>>         395,341,850      branches                  #  126.755 M/sec
>>>         (92.82%)
>>>           3,872,302      branch-misses             #    0.98% of all
>>>           branches          (92.83%)
>>>       1,398,066,701      L1-dcache-loads           #  448.251 M/sec
>>>       (92.82%)
>>>          58,124,626      L1-dcache-load-misses     #    4.16% of all
>>>          L1-dcache accesses  (92.82%)
>>>           1,032,527      LLC-loads                 #    0.331 M/sec
>>>           (92.82%)
>>>             498,684      LLC-load-misses           #   48.30% of all
>>>             LL-cache accesses  (92.84%)
>>>         473,689,004      L1-icache-loads           #  151.875 M/sec
>>>         (92.82%)
>>>             356,721      L1-icache-load-misses     #    0.08% of all
>>>             L1-icache accesses  (92.85%)
>>>       1,947,644,987      dTLB-loads                #  624.458 M/sec
>>>       (92.95%)
>>>              10,185      dTLB-load-misses          #    0.00% of all
>>>              dTLB cache accesses  (92.96%)
>>>         474,622,896      iTLB-loads                #  152.174 M/sec
>>>         (92.95%)
>>>                  94      iTLB-load-misses          #    0.00% of all
>>>                  iTLB cache accesses  (85.69%)
>>>
>>>         3.122844830 seconds time elapsed
>>>
>>>         0.000000000 seconds user
>>>         3.107259000 seconds sys
>>>
>>> and after(clear from start to end)
>>>
>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>> /mnt/hugetlbfs/test':
>>>
>>>            1,135.53 msec task-clock                #    0.999 CPUs
>>>            utilized
>>>                  10      context-switches          #    0.009 K/sec
>>>                  1      cpu-migrations            #    0.001 K/sec
>>>                  137      page-faults               #    0.121 K/sec
>>>                  2,946,673,587      cycles                    #
>>>                  2.595 GHz                (92.67%)
>>>       1,620,704,205      instructions              #    0.55  insn per
>>>       cycle           (92.61%)
>>>         394,595,772      branches                  #  347.499 M/sec
>>>         (92.60%)
>>>             130,756      branch-misses             #    0.03% of all
>>>             branches          (92.84%)
>>>       1,396,726,689      L1-dcache-loads           # 1230.022 M/sec
>>>       (92.96%)
>>>             338,344      L1-dcache-load-misses     #    0.02% of all
>>>             L1-dcache accesses  (92.95%)
>>>             111,737      LLC-loads                 #    0.098 M/sec
>>>             (92.96%)
>>>              67,486      LLC-load-misses           #   60.40% of all
>>>              LL-cache accesses  (92.96%)
>>>         418,198,663      L1-icache-loads           #  368.285 M/sec
>>>         (92.96%)
>>>             173,764      L1-icache-load-misses     #    0.04% of all
>>>             L1-icache accesses  (92.96%)
>>>       2,203,364,632      dTLB-loads                # 1940.385 M/sec
>>>       (92.96%)
>>>              17,195      dTLB-load-misses          #    0.00% of all
>>>              dTLB cache accesses  (92.95%)
>>>         418,198,365      iTLB-loads                #  368.285 M/sec
>>>         (92.96%)
>>>                  79      iTLB-load-misses          #    0.00% of all
>>>                  iTLB cache accesses  (85.34%)
>>>
>>>         1.137015760 seconds time elapsed
>>>
>>>         0.000000000 seconds user
>>>         1.131266000 seconds sys
>>>
>>> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but
>>> this depends on the implementation of the microarchitecture.
>> Anyway, we need to avoid (or reduce at least) the pure memory
>> clearing
>> performance.  Have you double checked whether process_huge_page() is
>> inlined?  Perf-profile result can be used to check this too.
>> 
>
> Yes, I'm sure the process_huge_page() is inlined.
>
>> When you say from start to end, you mean to use clear_gigantic_page()
>> directly, or change process_huge_page() to clear page from start to end?
>> 
>
> Using clear_gigantic_page() and changing process_huge_page() to clear
> page from start to end are both good for performance when sequential
> clearing, but no random test so far.
>
>>> 1) Will test some rand test to check the different of performance as
>>> David suggested.
>>>
>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>> set and different machines)
>> I'm starting to use LKP to test.

https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@oneplus.com/

I just remembered that we discussed a similar issue for arm64 before.
Can you take a look at it?  There's more discussion and tests/results in
that thread; I think it may be helpful.

--
Best Regards,
Huang, Ying
Kefeng Wang Nov. 1, 2024, 7:43 a.m. UTC | #20
On 2024/10/31 16:39, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
> [snip]
>>
>>>> 1) Will test some rand test to check the different of performance as
>>>> David suggested.>>>>
>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>> set and different machines)
>>> I'm starting to use LKP to test.
>>
>> Greet.


Sorry for the late reply,

> 
> I have run some tests with LKP to test.
> 
> Firstly, there's almost no measurable difference between clearing pages
> from start to end or from end to start on Intel server CPU.  I guess
> that there's some similar optimization for both direction.
> 
> For multiple processes (same as logical CPU number)
> vm-scalability/anon-w-seq test case, the benchmark score increases
> about 22.4%.

So process_huge_page is better than clear_gigantic_page() on Intel?

Could you test the following case on x86?

echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /hugetlbfs/
mount none /hugetlbfs/ -t hugetlbfs
rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test
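
(The first fallocate preallocates the 20G file and "fallocate -d" then
punches the pages back out, so the final, timed fallocate should be
measuring allocating and clearing 20G of hugetlb pages from scratch.)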

> 
> For multiple processes vm-scalability/anon-w-rand test case, no
> measurable difference for benchmark score.
> 
> So, the optimization helps sequential workload mainly.
> 
> In summary, on x86, process_huge_page() will not introduce any
> regression.  And it helps some workload.
> 
> However, on ARM64, it does introduce some regression for clearing pages
> from end to start.  That needs to be addressed.  I guess that the
> regression can be resolved via using more clearing from start to end
> (but not all).  For example, can you take a look at the patch below?
> Which uses the similar framework as before, but clear each small trunk
> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
> the regression can be restored.
> 
> WARNING: the patch is only build tested.


Base: baseline
Change1: using clear_gigantic_page() for 2M PMD (see the sketch after this list)
Change2: your patch with MPAGE_NRPAGES=16
Change3: Case3 + fix[1]
Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
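
Roughly, Change1 is the following (a sketch, not the exact patch tested):
folio_zero_user() just clears the whole folio sequentially via
clear_gigantic_page() instead of calling process_huge_page():

void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
	/* sketch: clear the folio front to back, bypassing process_huge_page() */
	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));

	clear_gigantic_page(folio, addr, folio_nr_pages(folio));
}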

1. For rand write,
    case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference

2. For seq write,

1) case-anon-w-seq-mt:
base:
real    0m2.490s    0m2.254s    0m2.272s
user    1m59.980s   2m23.431s   2m18.739s
sys     1m3.675s    1m15.462s   1m15.030s

Change1:
real    0m2.234s    0m2.225s    0m2.159s
user    2m56.105s   2m57.117s   3m0.489s
sys     0m17.064s   0m17.564s   0m16.150s

Change2:
real	0m2.244s    0m2.384s	0m2.370s
user	2m39.413s   2m41.990s   2m42.229s
sys	0m19.826s   0m18.491s   0m18.053s

Change3:  // best performance
real	0m2.155s    0m2.204s	0m2.194s
user	3m2.640s    2m55.837s   3m0.902s
sys	0m17.346s   0m17.630s   0m18.197s

Change4:
real	0m2.287s    0m2.377s	0m2.284s	
user	2m37.030s   2m52.868s   3m17.593s
sys	0m15.445s   0m34.430s   0m45.224s

2) case-anon-w-seq-hugetlb
   very similar to 1); Change4 is slightly better than Change3, but not
   by much.

3) hugetlbfs fallocate 20G
   Change1 (0m1.136s) = Change3 (0m1.136s) = Change4 (0m1.135s) <
   Change2 (0m1.275s) < base (0m3.016s)

In summary, Change3 is best and Change1 is also good on my arm64 machine.

> 
> Best Regards,
> Huang, Ying
> 
> -----------------------------------8<----------------------------------------
>  From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Thu, 31 Oct 2024 11:13:57 +0800
> Subject: [PATCH] mpage clear
> 
> ---
>   mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ccee51adfbb..1fdc548c4275 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>   	return 0;
>   }
>   
> +#define MPAGE_NRPAGES	(1<<4)
> +#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
> +static inline int clear_huge_page(
> +	unsigned long addr_hint, unsigned int nr_pages,
> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
> +	void *arg)
> +{
> +	int i, n, base, l, ret;
> +	unsigned long addr = addr_hint &
> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
> +
> +	/* Process target subpage last to keep its cache lines hot */
> +	might_sleep();
> +	n = (addr_hint - addr) / MPAGE_SIZE;
> +	if (2 * n <= nr_mpages) {
> +		/* If target subpage in first half of huge page */
> +		base = 0;
> +		l = n;
> +		/* Process subpages at the end of huge page */
> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
> +			cond_resched();
> +			ret = process_subpage(addr + i * MPAGE_SIZE,
> +					      i * MPAGE_NRPAGES, arg);
> +			if (ret)
> +				return ret;
> +		}
> +	} else {
> +		/* If target subpage in second half of huge page */
> +		base = nr_mpages - 2 * (nr_mpages - n);
> +		l = nr_mpages - n;
> +		/* Process subpages at the begin of huge page */
> +		for (i = 0; i < base; i++) {
> +			cond_resched();
> +			ret = process_subpage(addr + i * MPAGE_SIZE,
> +					      i * MPAGE_NRPAGES, arg);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +	/*
> +	 * Process remaining subpages in left-right-left-right pattern
> +	 * towards the target subpage
> +	 */
> +	for (i = 0; i < l; i++) {
> +		int left_idx = base + i;
> +		int right_idx = base + 2 * l - 1 - i;
> +
> +		cond_resched();
> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
> +				      left_idx * MPAGE_NRPAGES, arg);
> +		if (ret)
> +			return ret;
> +		cond_resched();
> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
> +				      right_idx * MPAGE_NRPAGES, arg);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
>   static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>   				unsigned int nr_pages)
>   {
> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>   static int clear_subpage(unsigned long addr, int idx, void *arg)
>   {
>   	struct folio *folio = arg;
> +	int i;
>   
> -	clear_user_highpage(folio_page(folio, idx), addr);
> +	for (i = 0; i < MPAGE_NRPAGES; i++)
> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>   	return 0;
>   }
>   
> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>   {
>   	unsigned int nr_pages = folio_nr_pages(folio);
>   
> -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>   		clear_gigantic_page(folio, addr_hint, nr_pages);
>   	else
> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>   }
>   
>   static int copy_user_gigantic_page(struct folio *dst, struct folio *src,


[1] fix patch

diff --git a/mm/memory.c b/mm/memory.c
index b22d4b83295b..aee99ede0c4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
                 base = 0;
                 l = n;
                 /* Process subpages at the end of huge page */
-               for (i = nr_mpages - 1; i >= 2 * n; i--) {
+               for (i = 2 * n; i < nr_mpages; i++) {
                         cond_resched();
                         ret = process_subpage(addr + i * MPAGE_SIZE,
                                               i * MPAGE_NRPAGES, arg);
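
In other words, the fix only changes the tail loop to walk the trailing
chunks in ascending order (i = 2 * n .. nr_mpages - 1) instead of
descending, so almost all of the huge page ends up being cleared from
start to end, which matches the observation that backward clearing is
what hurts on ARM64.
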
Kefeng Wang Nov. 1, 2024, 7:51 a.m. UTC | #21
On 2024/11/1 14:18, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/10/30 11:21, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/30 9:04, Huang, Ying wrote:
>>>>> David Hildenbrand <david@redhat.com> writes:
>>>>>
>>>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>>>> shouldn't
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>>>
>>>>>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate
>>>>>>>>>>>>> the
>>>>>>>>>>>>> 'base'.
>>>>>>>>>>>>
>>>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>>>
>>>>>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to
>>>>>>>>>>> move the aligned address calculate into the folio_zero_user and
>>>>>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the
>>>>>>>>>>> following cleanup patches, or do it in the fix patches?
>>>>>>>>>>
>>>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>>>> folio*? Because we like or code to be extra complicated to understand?
>>>>>>>>>> Or am I missing something important?
>>>>>>>>>
>>>>>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
>>>>>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP
>>>>>>>>> and HugeTLB, so it won't process a small folio if I understand correctly.
>>>>>>>>
>>>>>>>> And unfortunately neither the documentation nor the function name
>>>>>>>> expresses that :(
>>>>>>>>
>>>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>>>
>>>>>>> Actually, could we drop the process_huge_page() totally, from my
>>>>>>> testcase[1], process_huge_page() is not better than clear/copy page
>>>>>>> from start to last, and sequential clearing/copying maybe more
>>>>>>> beneficial to the hardware prefetching, and is there a way to let lkp
>>>>>>> to test to check the performance, since the process_huge_page()
>>>>>>> was submitted by Ying, what's your opinion?
>>>>> I don't think that it's a good idea to revert the commit without
>>>>> studying and root causing the issues.  I can work together with you on
>>>>> that.  If we have solid and well explained data to prove
>>>>> process_huge_page() isn't benefitial, we can revert the commit.
>>>>
>>>>
>>>> Take 'fallocate 20G' as an example, before
>>>>
>>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>>> /mnt/hugetlbfs/test':
>>> IIUC, fallocate will zero pages, but will not touch them at all,
>>> right?
>>> If so, no cache benefit from clearing referenced page last.
>>
>>
>> Yes, for this case, only clear page.
>>>
>>>>             3,118.94 msec task-clock                #    0.999 CPUs
>>>>             utilized
>>>>                   30      context-switches          #    0.010 K/sec
>>>>                   1      cpu-migrations            #    0.000 K/sec
>>>>                   136      page-faults               #    0.044 K/sec
>>>>                   8,092,075,873      cycles                    #
>>>>                   2.594 GHz                (92.82%)
>>>>        1,624,587,663      instructions              #    0.20  insn per
>>>>        cycle           (92.83%)
>>>>          395,341,850      branches                  #  126.755 M/sec
>>>>          (92.82%)
>>>>            3,872,302      branch-misses             #    0.98% of all
>>>>            branches          (92.83%)
>>>>        1,398,066,701      L1-dcache-loads           #  448.251 M/sec
>>>>        (92.82%)
>>>>           58,124,626      L1-dcache-load-misses     #    4.16% of all
>>>>           L1-dcache accesses  (92.82%)
>>>>            1,032,527      LLC-loads                 #    0.331 M/sec
>>>>            (92.82%)
>>>>              498,684      LLC-load-misses           #   48.30% of all
>>>>              LL-cache accesses  (92.84%)
>>>>          473,689,004      L1-icache-loads           #  151.875 M/sec
>>>>          (92.82%)
>>>>              356,721      L1-icache-load-misses     #    0.08% of all
>>>>              L1-icache accesses  (92.85%)
>>>>        1,947,644,987      dTLB-loads                #  624.458 M/sec
>>>>        (92.95%)
>>>>               10,185      dTLB-load-misses          #    0.00% of all
>>>>               dTLB cache accesses  (92.96%)
>>>>          474,622,896      iTLB-loads                #  152.174 M/sec
>>>>          (92.95%)
>>>>                   94      iTLB-load-misses          #    0.00% of all
>>>>                   iTLB cache accesses  (85.69%)
>>>>
>>>>          3.122844830 seconds time elapsed
>>>>
>>>>          0.000000000 seconds user
>>>>          3.107259000 seconds sys
>>>>
>>>> and after(clear from start to end)
>>>>
>>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>>> /mnt/hugetlbfs/test':
>>>>
>>>>             1,135.53 msec task-clock                #    0.999 CPUs
>>>>             utilized
>>>>                   10      context-switches          #    0.009 K/sec
>>>>                   1      cpu-migrations            #    0.001 K/sec
>>>>                   137      page-faults               #    0.121 K/sec
>>>>                   2,946,673,587      cycles                    #
>>>>                   2.595 GHz                (92.67%)
>>>>        1,620,704,205      instructions              #    0.55  insn per
>>>>        cycle           (92.61%)
>>>>          394,595,772      branches                  #  347.499 M/sec
>>>>          (92.60%)
>>>>              130,756      branch-misses             #    0.03% of all
>>>>              branches          (92.84%)
>>>>        1,396,726,689      L1-dcache-loads           # 1230.022 M/sec
>>>>        (92.96%)
>>>>              338,344      L1-dcache-load-misses     #    0.02% of all
>>>>              L1-dcache accesses  (92.95%)
>>>>              111,737      LLC-loads                 #    0.098 M/sec
>>>>              (92.96%)
>>>>               67,486      LLC-load-misses           #   60.40% of all
>>>>               LL-cache accesses  (92.96%)
>>>>          418,198,663      L1-icache-loads           #  368.285 M/sec
>>>>          (92.96%)
>>>>              173,764      L1-icache-load-misses     #    0.04% of all
>>>>              L1-icache accesses  (92.96%)
>>>>        2,203,364,632      dTLB-loads                # 1940.385 M/sec
>>>>        (92.96%)
>>>>               17,195      dTLB-load-misses          #    0.00% of all
>>>>               dTLB cache accesses  (92.95%)
>>>>          418,198,365      iTLB-loads                #  368.285 M/sec
>>>>          (92.96%)
>>>>                   79      iTLB-load-misses          #    0.00% of all
>>>>                   iTLB cache accesses  (85.34%)
>>>>
>>>>          1.137015760 seconds time elapsed
>>>>
>>>>          0.000000000 seconds user
>>>>          1.131266000 seconds sys
>>>>
>>>> The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but
>>>> this depends on the implementation of the microarchitecture.
>>> Anyway, we need to avoid (or reduce at least) the pure memory
>>> clearing
>>> performance.  Have you double checked whether process_huge_page() is
>>> inlined?  Perf-profile result can be used to check this too.
>>>
>>
>> Yes, I'm sure the process_huge_page() is inlined.
>>
>>> When you say from start to end, you mean to use clear_gigantic_page()
>>> directly, or change process_huge_page() to clear page from start to end?
>>>
>>
>> Using clear_gigantic_page() and changing process_huge_page() to clear
>> page from start to end are both good for performance when sequential
>> clearing, but no random test so far.
>>
>>>> 1) Will test some rand test to check the different of performance as
>>>> David suggested.
>>>>
>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>> set and different machines)
>>> I'm starting to use LKP to test.
> 
> https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@oneplus.com/
> 
> Just remembered that we have discussed a similar issue for arm64 before.
> Can you take a look at it?  There's more discussion and tests/results in
> the thread, I think that may be helpful.
> 

Thanks for the tips, will check it.
> --
> Best Regards,
> Huang, Ying
>
Huang, Ying Nov. 1, 2024, 8:16 a.m. UTC | #22
Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/31 16:39, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>> [snip]
>>>
>>>>> 1) Will test some rand test to check the different of performance as
>>>>> David suggested.>>>>
>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>> set and different machines)
>>>> I'm starting to use LKP to test.
>>>
>>> Greet.
>
>
> Sorry for the late,
>
>> I have run some tests with LKP to test.
>> Firstly, there's almost no measurable difference between clearing
>> pages
>> from start to end or from end to start on Intel server CPU.  I guess
>> that there's some similar optimization for both direction.
>> For multiple processes (same as logical CPU number)
>> vm-scalability/anon-w-seq test case, the benchmark score increases
>> about 22.4%.
>
> So process_huge_page is better than clear_gigantic_page() on Intel?

For the vm-scalability/anon-w-seq test case, it is, because the
performance of forward and backward clearing is almost the same, and the
user-space accesses get the cache-hot benefit.

> Could you test the following case on x86?
> echo 10240 >
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> mkdir -p /hugetlbfs/
> mount none /hugetlbfs/ -t hugetlbfs
> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
> /hugetlbfs/test

It's not trivial for me to run this test, because 0day wraps test cases.
Do you know which existing test case provides this?  For example, in
vm-scalability?

>> For multiple processes vm-scalability/anon-w-rand test case, no
>> measurable difference for benchmark score.
>> So, the optimization helps sequential workload mainly.
>> In summary, on x86, process_huge_page() will not introduce any
>> regression.  And it helps some workload.
>> However, on ARM64, it does introduce some regression for clearing
>> pages
>> from end to start.  That needs to be addressed.  I guess that the
>> regression can be resolved via using more clearing from start to end
>> (but not all).  For example, can you take a look at the patch below?
>> Which uses the similar framework as before, but clear each small trunk
>> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
>> the regression can be restored.
>> WARNING: the patch is only build tested.
>
>
> Base: baseline
> Change1: using clear_gigantic_page() for 2M PMD
> Change2: your patch with MPAGE_NRPAGES=16
> Change3: Case3 + fix[1]

What is case3?

> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>
> 1. For rand write,
>    case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>
> 2. For seq write,
>
> 1) case-anon-w-seq-mt:

Can you try case-anon-w-seq?  That may be more stable.

> base:
> real    0m2.490s    0m2.254s    0m2.272s
> user    1m59.980s   2m23.431s   2m18.739s
> sys     1m3.675s    1m15.462s   1m15.030s
>
> Change1:
> real    0m2.234s    0m2.225s    0m2.159s
> user    2m56.105s   2m57.117s   3m0.489s
> sys     0m17.064s   0m17.564s   0m16.150s
>
> Change2:
> real	0m2.244s    0m2.384s	0m2.370s
> user	2m39.413s   2m41.990s   2m42.229s
> sys	0m19.826s   0m18.491s   0m18.053s

It appears strange.  There's not much cache-hot benefit even if we clear
pages from end to start (with a larger chunk).

However, sys time improves a lot.  This shows that clearing pages in
large chunks helps on ARM64.

> Change3:  // best performance
> real	0m2.155s    0m2.204s	0m2.194s
> user	3m2.640s    2m55.837s   3m0.902s
> sys	0m17.346s   0m17.630s   0m18.197s
>
> Change4:
> real	0m2.287s    0m2.377s	0m2.284s	
> user	2m37.030s   2m52.868s   3m17.593s
> sys	0m15.445s   0m34.430s   0m45.224s

Change4 is essentially the same as Change1.  I don't know why they are
different.  Is there some large variation from run to run?

Can you further optimize the prototype patch below?  I think that it has
potential to fix your issue.

> 2) case-anon-w-seq-hugetlb
>    very similar 1), Change4 slightly better than Change3, but not big
>    different.
>
> 3) hugetlbfs fallocate 20G
>    Change1(0m1.136s) = Change3(0m1.136s) =  Change4(0m1.135s) <
>    Change2(0m1.275s) < base(0m3.016s)
>
> In summary, the Change3 is best and Change1 is good on my arm64 machine.
>
>> Best Regards,
>> Huang, Ying
>> -----------------------------------8<----------------------------------------
>>  From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@intel.com>
>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>> Subject: [PATCH] mpage clear
>> ---
>>   mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 67 insertions(+), 3 deletions(-)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3ccee51adfbb..1fdc548c4275 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>   	return 0;
>>   }
>>   +#define MPAGE_NRPAGES	(1<<4)
>> +#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
>> +static inline int clear_huge_page(
>> +	unsigned long addr_hint, unsigned int nr_pages,
>> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>> +	void *arg)
>> +{
>> +	int i, n, base, l, ret;
>> +	unsigned long addr = addr_hint &
>> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>> +
>> +	/* Process target subpage last to keep its cache lines hot */
>> +	might_sleep();
>> +	n = (addr_hint - addr) / MPAGE_SIZE;
>> +	if (2 * n <= nr_mpages) {
>> +		/* If target subpage in first half of huge page */
>> +		base = 0;
>> +		l = n;
>> +		/* Process subpages at the end of huge page */
>> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>> +			cond_resched();
>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>> +					      i * MPAGE_NRPAGES, arg);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	} else {
>> +		/* If target subpage in second half of huge page */
>> +		base = nr_mpages - 2 * (nr_mpages - n);
>> +		l = nr_mpages - n;
>> +		/* Process subpages at the begin of huge page */
>> +		for (i = 0; i < base; i++) {
>> +			cond_resched();
>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>> +					      i * MPAGE_NRPAGES, arg);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +	/*
>> +	 * Process remaining subpages in left-right-left-right pattern
>> +	 * towards the target subpage
>> +	 */
>> +	for (i = 0; i < l; i++) {
>> +		int left_idx = base + i;
>> +		int right_idx = base + 2 * l - 1 - i;
>> +
>> +		cond_resched();
>> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>> +				      left_idx * MPAGE_NRPAGES, arg);
>> +		if (ret)
>> +			return ret;
>> +		cond_resched();
>> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>> +				      right_idx * MPAGE_NRPAGES, arg);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	return 0;
>> +}
>> +
>>   static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>   				unsigned int nr_pages)
>>   {
>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>   static int clear_subpage(unsigned long addr, int idx, void *arg)
>>   {
>>   	struct folio *folio = arg;
>> +	int i;
>>   -	clear_user_highpage(folio_page(folio, idx), addr);
>> +	for (i = 0; i < MPAGE_NRPAGES; i++)
>> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>   	return 0;
>>   }
>>   @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio,
>> unsigned long addr_hint)
>>   {
>>   	unsigned int nr_pages = folio_nr_pages(folio);
>>   -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>>   		clear_gigantic_page(folio, addr_hint, nr_pages);
>>   	else
>> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>   }
>>     static int copy_user_gigantic_page(struct folio *dst, struct
>> folio *src,
>
>
> [1] fix patch
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b22d4b83295b..aee99ede0c4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>                 base = 0;
>                 l = n;
>                 /* Process subpages at the end of huge page */
> -               for (i = nr_mpages - 1; i >= 2 * n; i--) {
> +               for (i = 2 * n; i < nr_mpages; i++) {
>                         cond_resched();
>                         ret = process_subpage(addr + i * MPAGE_SIZE,
>                                               i * MPAGE_NRPAGES, arg);
Kefeng Wang Nov. 1, 2024, 9:45 a.m. UTC | #23
On 2024/11/1 16:16, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/10/31 16:39, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>> [snip]
>>>>
>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>> David suggested.>>>>
>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>>> set and different machines)
>>>>> I'm starting to use LKP to test.
>>>>
>>>> Greet.
>>
>>
>> Sorry for the late,
>>
>>> I have run some tests with LKP to test.
>>> Firstly, there's almost no measurable difference between clearing
>>> pages
>>> from start to end or from end to start on Intel server CPU.  I guess
>>> that there's some similar optimization for both direction.
>>> For multiple processes (same as logical CPU number)
>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>> about 22.4%.
>>
>> So process_huge_page is better than clear_gigantic_page() on Intel?
> 
> For vm-scalability/anon-w-seq test case, it is.  Because the performance
> of forward and backward clearing is almost same, and the user space
> accessing has cache-hot benefit.
> 
>> Could you test the following case on x86?
>> echo 10240 >
>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> mkdir -p /hugetlbfs/
>> mount none /hugetlbfs/ -t hugetlbfs
>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
>> /hugetlbfs/test
> 
> It's not trivial for me to do this test.  Because 0day wraps test cases.
> Do you know which existing test cases provide this?  For example, in
> vm-scalability?

I don't know of a public fallocate test; I will try to find an Intel
machine to test this case.
> 
>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>> measurable difference for benchmark score.
>>> So, the optimization helps sequential workload mainly.
>>> In summary, on x86, process_huge_page() will not introduce any
>>> regression.  And it helps some workload.
>>> However, on ARM64, it does introduce some regression for clearing
>>> pages
>>> from end to start.  That needs to be addressed.  I guess that the
>>> regression can be resolved via using more clearing from start to end
>>> (but not all).  For example, can you take a look at the patch below?
>>> Which uses the similar framework as before, but clear each small trunk
>>> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
>>> the regression can be restored.
>>> WARNING: the patch is only build tested.
>>
>>
>> Base: baseline
>> Change1: using clear_gigantic_page() for 2M PMD
>> Change2: your patch with MPAGE_NRPAGES=16
>> Change3: Case3 + fix[1]
> 
> What is case3?

Oh, it is Change2.

> 
>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>
>> 1. For rand write,
>>     case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>
>> 2. For seq write,
>>
>> 1) case-anon-w-seq-mt:
> 
> Can you try case-anon-w-seq?  That may be more stable.
> 
>> base:
>> real    0m2.490s    0m2.254s    0m2.272s
>> user    1m59.980s   2m23.431s   2m18.739s
>> sys     1m3.675s    1m15.462s   1m15.030s
>>
>> Change1:
>> real    0m2.234s    0m2.225s    0m2.159s
>> user    2m56.105s   2m57.117s   3m0.489s
>> sys     0m17.064s   0m17.564s   0m16.150s
>>
>> Change2:
>> real	0m2.244s    0m2.384s	0m2.370s
>> user	2m39.413s   2m41.990s   2m42.229s
>> sys	0m19.826s   0m18.491s   0m18.053s
> 
> It appears strange.  There's no much cache hot benefit even if we clear
> pages from end to begin (with larger chunk).
> 
> However, sys time improves a lot.  This shows clearing page with large
> chunk helps on ARM64.
> 
>> Change3:  // best performance
>> real	0m2.155s    0m2.204s	0m2.194s
>> user	3m2.640s    2m55.837s   3m0.902s
>> sys	0m17.346s   0m17.630s   0m18.197s
>>
>> Change4:
>> real	0m2.287s    0m2.377s	0m2.284s	
>> user	2m37.030s   2m52.868s   3m17.593s
>> sys	0m15.445s   0m34.430s   0m45.224s
> 
> Change4 is essentially same as Change1.  I don't know why they are
> different.  Is there some large variation among run to run?

As shown above, I tested three times and the results are relatively
stable, at least for the real time.  I will try case-anon-w-seq.

> 
> Can you further optimize the prototype patch below?  I think that it has
> potential to fix your issue.

Yes, thanks for your help, but this will make process_huge_page() a
little more complicated :)
> 
>> 2) case-anon-w-seq-hugetlb
>>     very similar 1), Change4 slightly better than Change3, but not big
>>     different.
>>
>> 3) hugetlbfs fallocate 20G
>>     Change1(0m1.136s) = Change3(0m1.136s) =  Change4(0m1.135s) <
>>     Change2(0m1.275s) < base(0m3.016s)
>>
>> In summary, the Change3 is best and Change1 is good on my arm64 machine.
>>
>>> Best Regards,
>>> Huang, Ying
>>> -----------------------------------8<----------------------------------------
>>>   From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>>> From: Huang Ying <ying.huang@intel.com>
>>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>>> Subject: [PATCH] mpage clear
>>> ---
>>>    mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>    1 file changed, 67 insertions(+), 3 deletions(-)
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 3ccee51adfbb..1fdc548c4275 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>>    	return 0;
>>>    }
>>>    +#define MPAGE_NRPAGES	(1<<4)
>>> +#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
>>> +static inline int clear_huge_page(
>>> +	unsigned long addr_hint, unsigned int nr_pages,
>>> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>> +	void *arg)
>>> +{
>>> +	int i, n, base, l, ret;
>>> +	unsigned long addr = addr_hint &
>>> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>>> +
>>> +	/* Process target subpage last to keep its cache lines hot */
>>> +	might_sleep();
>>> +	n = (addr_hint - addr) / MPAGE_SIZE;
>>> +	if (2 * n <= nr_mpages) {
>>> +		/* If target subpage in first half of huge page */
>>> +		base = 0;
>>> +		l = n;
>>> +		/* Process subpages at the end of huge page */
>>> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>> +			cond_resched();
>>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>>> +					      i * MPAGE_NRPAGES, arg);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>> +	} else {
>>> +		/* If target subpage in second half of huge page */
>>> +		base = nr_mpages - 2 * (nr_mpages - n);
>>> +		l = nr_mpages - n;
>>> +		/* Process subpages at the begin of huge page */
>>> +		for (i = 0; i < base; i++) {
>>> +			cond_resched();
>>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>>> +					      i * MPAGE_NRPAGES, arg);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>> +	}
>>> +	/*
>>> +	 * Process remaining subpages in left-right-left-right pattern
>>> +	 * towards the target subpage
>>> +	 */
>>> +	for (i = 0; i < l; i++) {
>>> +		int left_idx = base + i;
>>> +		int right_idx = base + 2 * l - 1 - i;
>>> +
>>> +		cond_resched();
>>> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>>> +				      left_idx * MPAGE_NRPAGES, arg);
>>> +		if (ret)
>>> +			return ret;
>>> +		cond_resched();
>>> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>>> +				      right_idx * MPAGE_NRPAGES, arg);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>>    static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>    				unsigned int nr_pages)
>>>    {
>>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>    static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>    {
>>>    	struct folio *folio = arg;
>>> +	int i;
>>>    -	clear_user_highpage(folio_page(folio, idx), addr);
>>> +	for (i = 0; i < MPAGE_NRPAGES; i++)
>>> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>>    	return 0;
>>>    }
>>>    @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio,
>>> unsigned long addr_hint)
>>>    {
>>>    	unsigned int nr_pages = folio_nr_pages(folio);
>>>    -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>>>    		clear_gigantic_page(folio, addr_hint, nr_pages);
>>>    	else
>>> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>    }
>>>      static int copy_user_gigantic_page(struct folio *dst, struct
>>> folio *src,
>>
>>
>> [1] fix patch
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b22d4b83295b..aee99ede0c4f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>>                  base = 0;
>>                  l = n;
>>                  /* Process subpages at the end of huge page */
>> -               for (i = nr_mpages - 1; i >= 2 * n; i--) {
>> +               for (i = 2 * n; i < nr_mpages; i++) {
>>                          cond_resched();
>>                          ret = process_subpage(addr + i * MPAGE_SIZE,
>>                                                i * MPAGE_NRPAGES, arg);
>
Huang, Ying Nov. 4, 2024, 2:35 a.m. UTC | #24
Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/11/1 16:16, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>> 
>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>> [snip]
>>>>>
>>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>>> David suggested.>>>>
>>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>>>> set and different machines)
>>>>>> I'm starting to use LKP to test.
>>>>>
>>>>> Greet.
>>>
>>>
>>> Sorry for the late,
>>>
>>>> I have run some tests with LKP to test.
>>>> Firstly, there's almost no measurable difference between clearing
>>>> pages
>>>> from start to end or from end to start on Intel server CPU.  I guess
>>>> that there's some similar optimization for both direction.
>>>> For multiple processes (same as logical CPU number)
>>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>>> about 22.4%.
>>>
>>> So process_huge_page is better than clear_gigantic_page() on Intel?
>> For vm-scalability/anon-w-seq test case, it is.  Because the
>> performance
>> of forward and backward clearing is almost same, and the user space
>> accessing has cache-hot benefit.
>> 
>>> Could you test the following case on x86?
>>> echo 10240 >
>>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> mkdir -p /hugetlbfs/
>>> mount none /hugetlbfs/ -t hugetlbfs
>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
>>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
>>> /hugetlbfs/test
>> It's not trivial for me to do this test.  Because 0day wraps test
>> cases.
>> Do you know which existing test cases provide this?  For example, in
>> vm-scalability?
>
> I don't know the public fallocate test, I will try to find a intel
> machine to test this case.

I don't expect it to change much, because we have observed that the
performance of forward and backward clearing is similar on Intel.

>> 
>>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>>> measurable difference for benchmark score.
>>>> So, the optimization helps sequential workload mainly.
>>>> In summary, on x86, process_huge_page() will not introduce any
>>>> regression.  And it helps some workload.
>>>> However, on ARM64, it does introduce some regression for clearing
>>>> pages
>>>> from end to start.  That needs to be addressed.  I guess that the
>>>> regression can be resolved via using more clearing from start to end
>>>> (but not all).  For example, can you take a look at the patch below?
>>>> Which uses the similar framework as before, but clear each small trunk
>>>> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
>>>> the regression can be restored.
>>>> WARNING: the patch is only build tested.
>>>
>>>
>>> Base: baseline
>>> Change1: using clear_gigantic_page() for 2M PMD
>>> Change2: your patch with MPAGE_NRPAGES=16
>>> Change3: Case3 + fix[1]
>> What is case3?
>
> Oh, it is Change2.

Got it.

>> 
>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>
>>> 1. For rand write,
>>>     case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>>
>>> 2. For seq write,
>>>
>>> 1) case-anon-w-seq-mt:
>> Can you try case-anon-w-seq?  That may be more stable.
>> 
>>> base:
>>> real    0m2.490s    0m2.254s    0m2.272s
>>> user    1m59.980s   2m23.431s   2m18.739s
>>> sys     1m3.675s    1m15.462s   1m15.030s
>>>
>>> Change1:
>>> real    0m2.234s    0m2.225s    0m2.159s
>>> user    2m56.105s   2m57.117s   3m0.489s
>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>
>>> Change2:
>>> real	0m2.244s    0m2.384s	0m2.370s
>>> user	2m39.413s   2m41.990s   2m42.229s
>>> sys	0m19.826s   0m18.491s   0m18.053s
>> It appears strange.  There's no much cache hot benefit even if we
>> clear
>> pages from end to begin (with larger chunk).
>> However, sys time improves a lot.  This shows clearing page with
>> large
>> chunk helps on ARM64.
>> 
>>> Change3:  // best performance
>>> real	0m2.155s    0m2.204s	0m2.194s
>>> user	3m2.640s    2m55.837s   3m0.902s
>>> sys	0m17.346s   0m17.630s   0m18.197s
>>>
>>> Change4:
>>> real	0m2.287s    0m2.377s	0m2.284s	
>>> user	2m37.030s   2m52.868s   3m17.593s
>>> sys	0m15.445s   0m34.430s   0m45.224s
>> Change4 is essentially same as Change1.  I don't know why they are
>> different.  Is there some large variation among run to run?
>
> As above shown, I test three times, the test results are relatively
> stable, at least for real, I will try case-anon-w-seq.

Can you also show the score of vm-scalability?

TBH, I cannot understand your results.  For example, why is there a
measurable difference between Change3 and Change4?  In both cases, the
kernel clears pages from start to end.

>> Can you further optimize the prototype patch below?  I think that it
>> has
>> potential to fix your issue.
>
> Yes, thanks for you helper, but this will make process_huge_page() a
> little more complicated :)

IMHO, we should try to root cause it, then try to find the proper
solution and optimize (simplify) it.

--
Best Regards,
Huang, Ying

>> 
>>> 2) case-anon-w-seq-hugetlb
>>>     very similar 1), Change4 slightly better than Change3, but not big
>>>     different.
>>>
>>> 3) hugetlbfs fallocate 20G
>>>     Change1(0m1.136s) = Change3(0m1.136s) =  Change4(0m1.135s) <
>>>     Change2(0m1.275s) < base(0m3.016s)
>>>
>>> In summary, the Change3 is best and Change1 is good on my arm64 machine.
>>>
>>>> Best Regards,
>>>> Huang, Ying
>>>> -----------------------------------8<----------------------------------------
>>>>   From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>>>> From: Huang Ying <ying.huang@intel.com>
>>>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>>>> Subject: [PATCH] mpage clear
>>>> ---
>>>>    mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>    1 file changed, 67 insertions(+), 3 deletions(-)
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 3ccee51adfbb..1fdc548c4275 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>>>    	return 0;
>>>>    }
>>>>    +#define MPAGE_NRPAGES	(1<<4)
>>>> +#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
>>>> +static inline int clear_huge_page(
>>>> +	unsigned long addr_hint, unsigned int nr_pages,
>>>> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>> +	void *arg)
>>>> +{
>>>> +	int i, n, base, l, ret;
>>>> +	unsigned long addr = addr_hint &
>>>> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>>>> +
>>>> +	/* Process target subpage last to keep its cache lines hot */
>>>> +	might_sleep();
>>>> +	n = (addr_hint - addr) / MPAGE_SIZE;
>>>> +	if (2 * n <= nr_mpages) {
>>>> +		/* If target subpage in first half of huge page */
>>>> +		base = 0;
>>>> +		l = n;
>>>> +		/* Process subpages at the end of huge page */
>>>> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>>> +			cond_resched();
>>>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>>>> +					      i * MPAGE_NRPAGES, arg);
>>>> +			if (ret)
>>>> +				return ret;
>>>> +		}
>>>> +	} else {
>>>> +		/* If target subpage in second half of huge page */
>>>> +		base = nr_mpages - 2 * (nr_mpages - n);
>>>> +		l = nr_mpages - n;
>>>> +		/* Process subpages at the begin of huge page */
>>>> +		for (i = 0; i < base; i++) {
>>>> +			cond_resched();
>>>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>>>> +					      i * MPAGE_NRPAGES, arg);
>>>> +			if (ret)
>>>> +				return ret;
>>>> +		}
>>>> +	}
>>>> +	/*
>>>> +	 * Process remaining subpages in left-right-left-right pattern
>>>> +	 * towards the target subpage
>>>> +	 */
>>>> +	for (i = 0; i < l; i++) {
>>>> +		int left_idx = base + i;
>>>> +		int right_idx = base + 2 * l - 1 - i;
>>>> +
>>>> +		cond_resched();
>>>> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>>>> +				      left_idx * MPAGE_NRPAGES, arg);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +		cond_resched();
>>>> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>>>> +				      right_idx * MPAGE_NRPAGES, arg);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>>    static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>>    				unsigned int nr_pages)
>>>>    {
>>>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>>    static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>>    {
>>>>    	struct folio *folio = arg;
>>>> +	int i;
>>>>    -	clear_user_highpage(folio_page(folio, idx), addr);
>>>> +	for (i = 0; i < MPAGE_NRPAGES; i++)
>>>> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>>>    	return 0;
>>>>    }
>>>>    @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio,
>>>> unsigned long addr_hint)
>>>>    {
>>>>    	unsigned int nr_pages = folio_nr_pages(folio);
>>>>    -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>>> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>>>>    		clear_gigantic_page(folio, addr_hint, nr_pages);
>>>>    	else
>>>> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>>    }
>>>>      static int copy_user_gigantic_page(struct folio *dst, struct
>>>> folio *src,
>>>
>>>
>>> [1] fix patch
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b22d4b83295b..aee99ede0c4f 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>>>                  base = 0;
>>>                  l = n;
>>>                  /* Process subpages at the end of huge page */
>>> -               for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>> +               for (i = 2 * n; i < nr_mpages; i++) {
>>>                          cond_resched();
>>>                          ret = process_subpage(addr + i * MPAGE_SIZE,
>>>                                                i * MPAGE_NRPAGES, arg);
>>
Kefeng Wang Nov. 5, 2024, 2:06 a.m. UTC | #25
On 2024/11/4 10:35, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/11/1 16:16, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>> [snip]
>>>>>>
>>>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>>>> David suggested.>>>>
>>>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>>>>> set and different machines)
>>>>>>> I'm starting to use LKP to test.
>>>>>>
>>>>>> Greet.
>>>>
>>>>
>>>> Sorry for the late,
>>>>
>>>>> I have run some tests with LKP to test.
>>>>> Firstly, there's almost no measurable difference between clearing
>>>>> pages
>>>>> from start to end or from end to start on Intel server CPU.  I guess
>>>>> that there's some similar optimization for both direction.
>>>>> For multiple processes (same as logical CPU number)
>>>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>>>> about 22.4%.
>>>>
>>>> So process_huge_page is better than clear_gigantic_page() on Intel?
>>> For vm-scalability/anon-w-seq test case, it is.  Because the
>>> performance
>>> of forward and backward clearing is almost same, and the user space
>>> accessing has cache-hot benefit.
>>>
>>>> Could you test the following case on x86?
>>>> echo 10240 >
>>>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>>> mkdir -p /hugetlbfs/
>>>> mount none /hugetlbfs/ -t hugetlbfs
>>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
>>>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
>>>> /hugetlbfs/test
>>> It's not trivial for me to do this test.  Because 0day wraps test
>>> cases.
>>> Do you know which existing test cases provide this?  For example, in
>>> vm-scalability?
>>
>> I don't know the public fallocate test, I will try to find a intel
>> machine to test this case.
> 
> I don't expect it to change much, because we have observed that the
> performance of forward and backward clearing is similar on Intel.

I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz machine:

Caches (sum of all):
   L1d:                    1.1 MiB (36 instances)
   L1i:                    1.1 MiB (36 instances)
   L2:                     36 MiB (36 instances)
   L3:                     49.5 MiB (2 instances)
NUMA:
   NUMA node(s):           2
   NUMA node0 CPU(s):      0-17,36-53
   NUMA node1 CPU(s):      18-35,54-71


Before:

Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,856.93 msec task-clock                       #    0.997 CPUs utilized
                 6      context-switches                 #    1.556 /sec
                 1      cpu-migrations                   #    0.259 /sec
               132      page-faults                      #   34.224 /sec
    11,520,934,848      cycles                           #    2.987 GHz                        (19.95%)
       213,731,011      instructions                     #    0.02 insn per cycle              (24.96%)
        58,164,361      branches                         #   15.080 M/sec                      (24.96%)
           262,547      branch-misses                    #    0.45% of all branches            (24.97%)
        96,029,321      CPU_CLK_UNHALTED.REF_XCLK        #   24.898 M/sec
                                                         #      0.3 % tma_frontend_bound
                                                         #      3.3 % tma_retiring
                                                         #     96.4 % tma_backend_bound
                                                         #      0.0 % tma_bad_speculation      (24.99%)
       149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE      #   38.822 M/sec                      (25.01%)
         2,486,326      INT_MISC.RECOVERY_CYCLES_ANY     #  644.638 K/sec                      (20.01%)
        95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #   24.883 M/sec                    (20.01%)
    11,526,783,305      CPU_CLK_UNHALTED.THREAD          #    2.989 G/sec                      (20.01%)
     1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS        #  393.855 M/sec                      (20.01%)
     1,526,020,825      UOPS_ISSUED.ANY                  #  395.657 M/sec                      (20.01%)
        59,784,189      L1-dcache-loads                  #   15.500 M/sec                      (20.01%)
       337,479,254      L1-dcache-load-misses            #  564.50% of all L1-dcache accesses  (20.02%)
           175,954      LLC-loads                        #   45.620 K/sec                      (20.02%)
            51,955      LLC-load-misses                  #   29.53% of all L1-icache accesses  (20.02%)
   <not supported>      L1-icache-loads
         2,864,230      L1-icache-load-misses                                                  (20.02%)
        59,769,391      dTLB-loads                       #   15.497 M/sec                      (20.02%)
               819      dTLB-load-misses                 #    0.00% of all dTLB cache accesses (20.02%)
             2,459      iTLB-loads                       #  637.553 /sec                       (20.01%)
               370      iTLB-load-misses                 #   15.05% of all iTLB cache accesses (19.98%)

        3.870393637 seconds time elapsed

        0.000000000 seconds user
        3.833021000 seconds sys

After (using clear_gigantic_page()):

Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          4,426.18 msec task-clock                       #    0.994 CPUs utilized
                 8      context-switches                 #    1.807 /sec
                 1      cpu-migrations                   #    0.226 /sec
               131      page-faults                      #   29.597 /sec
    13,221,263,588      cycles                           #    2.987 GHz                        (19.98%)
       215,924,995      instructions                     #    0.02 insn per cycle              (25.00%)
        58,430,182      branches                         #   13.201 M/sec                      (25.01%)
           279,381      branch-misses                    #    0.48% of all branches            (25.03%)
       110,199,114      CPU_CLK_UNHALTED.REF_XCLK        #   24.897 M/sec
                                                         #      0.3 % tma_frontend_bound
                                                         #      2.9 % tma_retiring
                                                         #     96.8 % tma_backend_bound
                                                         #      0.0 % tma_bad_speculation      (25.06%)
       160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE      #   36.296 M/sec                      (25.07%)
         2,559,970      INT_MISC.RECOVERY_CYCLES_ANY     #  578.370 K/sec                      (20.05%)
       110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #   24.904 M/sec                    (20.05%)
    13,227,924,727      CPU_CLK_UNHALTED.THREAD          #    2.989 G/sec                      (20.03%)
     1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS        #  344.545 M/sec                      (20.01%)
     1,531,307,263      UOPS_ISSUED.ANY                  #  345.966 M/sec                      (19.98%)
        60,600,471      L1-dcache-loads                  #   13.691 M/sec                      (19.96%)
       337,576,917      L1-dcache-load-misses            #  557.05% of all L1-dcache accesses  (19.96%)
           177,157      LLC-loads                        #   40.025 K/sec                      (19.96%)
            48,056      LLC-load-misses                  #   27.13% of all L1-icache accesses  (19.97%)
   <not supported>      L1-icache-loads
         2,653,617      L1-icache-load-misses                                                  (19.97%)
        60,609,241      dTLB-loads                       #   13.693 M/sec                      (19.97%)
               530      dTLB-load-misses                 #    0.00% of all dTLB cache accesses (19.97%)
             1,952      iTLB-loads                       #  441.013 /sec                       (19.97%)
             3,059      iTLB-load-misses                 #  156.71% of all iTLB cache accesses (19.97%)

        4.450664421 seconds time elapsed

        0.000984000 seconds user
        4.397795000 seconds sys


This shows that backward clearing is better than forward clearing, at least on this CPU.
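
For what it's worth, below is a minimal userspace sketch of how the direction
effect could be sanity-checked on a given CPU.  This is not the kernel code
path; the 2MB region, the 4KB chunk size, and the mmap/memset usage are
assumptions for illustration only.

/*
 * Compare forward vs. backward clearing of one PMD-sized (2MB) region
 * in 4KB chunks.  Userspace approximation only.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define REGION	(2UL << 20)	/* 2MB, one PMD-sized huge page */
#define CHUNK	(4UL << 10)	/* 4KB, one base page */

static double clear_region(char *buf, int backward)
{
	struct timespec t0, t1;
	unsigned long nr = REGION / CHUNK, i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nr; i++) {
		unsigned long idx = backward ? nr - 1 - i : i;

		memset(buf + idx * CHUNK, 0, CHUNK);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int i;

	if (buf == MAP_FAILED)
		return 1;
	/* Fault the region in once so both directions run on resident memory. */
	memset(buf, 1, REGION);
	for (i = 0; i < 3; i++)
		printf("run %d: forward %.6fs, backward %.6fs\n", i,
		       clear_region(buf, 0), clear_region(buf, 1));
	munmap(buf, REGION);
	return 0;
}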


> 
>>>
>>>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>>>> measurable difference for benchmark score.
>>>>> So, the optimization helps sequential workload mainly.
>>>>> In summary, on x86, process_huge_page() will not introduce any
>>>>> regression.  And it helps some workload.
>>>>> However, on ARM64, it does introduce some regression for clearing
>>>>> pages
>>>>> from end to start.  That needs to be addressed.  I guess that the
>>>>> regression can be resolved via using more clearing from start to end
>>>>> (but not all).  For example, can you take a look at the patch below?
>>>>> Which uses the similar framework as before, but clear each small trunk
>>>>> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
>>>>> the regression can be restored.
>>>>> WARNING: the patch is only build tested.
>>>>
>>>>
>>>> Base: baseline
>>>> Change1: using clear_gigantic_page() for 2M PMD
>>>> Change2: your patch with MPAGE_NRPAGES=16
>>>> Change3: Case3 + fix[1]
>>> What is case3?
>>
>> Oh, it is Change2.
> 
> Got it.
> 
>>>
>>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>>
>>>> 1. For rand write,
>>>>      case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>>>
>>>> 2. For seq write,
>>>>
>>>> 1) case-anon-w-seq-mt:
>>> Can you try case-anon-w-seq?  That may be more stable.
>>>
>>>> base:
>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>> sys     1m3.675s    1m15.462s   1m15.030s
>>>>
>>>> Change1:
>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>
>>>> Change2:
>>>> real	0m2.244s    0m2.384s	0m2.370s
>>>> user	2m39.413s   2m41.990s   2m42.229s
>>>> sys	0m19.826s   0m18.491s   0m18.053s
>>> It appears strange.  There's no much cache hot benefit even if we
>>> clear
>>> pages from end to begin (with larger chunk).
>>> However, sys time improves a lot.  This shows clearing page with
>>> large
>>> chunk helps on ARM64.
>>>
>>>> Change3:  // best performance
>>>> real	0m2.155s    0m2.204s	0m2.194s
>>>> user	3m2.640s    2m55.837s   3m0.902s
>>>> sys	0m17.346s   0m17.630s   0m18.197s
>>>>
>>>> Change4:
>>>> real	0m2.287s    0m2.377s	0m2.284s	
>>>> user	2m37.030s   2m52.868s   3m17.593s
>>>> sys	0m15.445s   0m34.430s   0m45.224s
>>> Change4 is essentially same as Change1.  I don't know why they are
>>> different.  Is there some large variation among run to run?
>>
>> As above shown, I test three times, the test results are relatively
>> stable, at least for real, I will try case-anon-w-seq.
> 
> Can you also show the score of vm-scalability?
> 
> TBH, I cannot understand your results.  For example, why there are
> measurable difference between Change3 and Change4?  In both cases, the
> kernel clears pages from start to end.

OK, will retest once I can access the machine again.

> 
>>> Can you further optimize the prototype patch below?  I think that it
>>> has
>>> potential to fix your issue.
>>
>> Yes, thanks for you helper, but this will make process_huge_page() a
>> little more complicated :)
> 
> IMHO, we should try to root cause it, then try to find the proper
> solution and optimize (simplifies) it.

 From the above fallocate test on Intel, it seems that different
microarchitectures behave differently on Intel too.
Andrew Morton Dec. 1, 2024, 2:15 a.m. UTC | #26
On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> However, on ARM64, it does introduce some regression for clearing pages
> from end to start.  That needs to be addressed.  I guess that the
> regression can be resolved via using more clearing from start to end
> (but not all).  For example, can you take a look at the patch below?
> Which uses the similar framework as before, but clear each small trunk
> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
> the regression can be restored.
> 
> WARNING: the patch is only build tested.

I'm holding this patch in mm-unstable because it's unclear (to me) that
this regression has been adequately addressed?
Huang, Ying Dec. 1, 2024, 5:37 a.m. UTC | #27
Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> However, on ARM64, it does introduce some regression for clearing pages
>> from end to start.  That needs to be addressed.  I guess that the
>> regression can be resolved via using more clearing from start to end
>> (but not all).  For example, can you take a look at the patch below?
>> Which uses the similar framework as before, but clear each small trunk
>> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
>> the regression can be restored.
>> 
>> WARNING: the patch is only build tested.
>
> I'm holding this patch in mm-unstable because it's unclear (to me) that
> this regression has been adequately addressed?

Yes.  This patch isn't ready to be merged yet.  More work is needed to
root cause the problem and implement the proper fix.  I guess that
Kefeng will continue to work on this.

---
Best Regards,
Huang, Ying
Kefeng Wang Dec. 2, 2024, 1:03 a.m. UTC | #28
Hi Andrew and Ying,

On 2024/12/1 13:37, Huang, Ying wrote:
> Hi, Andrew,
> 
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
>> On Thu, 31 Oct 2024 16:39:53 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>>
>>> However, on ARM64, it does introduce some regression for clearing pages
>>> from end to start.  That needs to be addressed.  I guess that the
>>> regression can be resolved via using more clearing from start to end
>>> (but not all).  For example, can you take a look at the patch below?
>>> Which uses the similar framework as before, but clear each small trunk
>>> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
>>> the regression can be restored.
>>>
>>> WARNING: the patch is only build tested.
>>
>> I'm holding this patch in mm-unstable because it's unclear (to me) that
>> this regression has been adequately addressed?
> 
> Yes.  This patch isn't ready to be merged yet.  More works are needed to
> root cause the problem and implement the proper fix.  I guess that
> Kefeng will continue to work on this.

Sorry for the late reply, I've been buried in other tasks.

The two bugfix patches in linux-next should be merged first; they are
not related to the performance issue.

   84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
   9fbd869b19b8 mm: use aligned address in clear_gigantic_page()


and following Ying's suggestion, I am trying to fix the performance
issue on arm64 (it will be a new series), but I need to confirm there is
no regression on other arm64 machines; I will send the new patches once
the testing is complete.

So Andrew, please pick up the above two bugfix patches for v6.13-rcX if
there is no objection.


> 
> ---
> Best Regards,
> Huang, Ying
>
Andrew Morton Dec. 6, 2024, 1:47 a.m. UTC | #29
On Mon, 2 Dec 2024 09:03:45 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:

> >>> WARNING: the patch is only build tested.
> >>
> >> I'm holding this patch in mm-unstable because it's unclear (to me) that
> >> this regression has been adequately addressed?
> > 
> > Yes.  This patch isn't ready to be merged yet.  More works are needed to
> > root cause the problem and implement the proper fix.  I guess that
> > Kefeng will continue to work on this.
> 
> Sorry for the late, I've been buried in other tasks.
> 
> The two bugfix patches in the next should be merged firstly, which is 
> not related about the performance issue.
> 
>    84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
>    9fbd869b19b8 mm: use aligned address in clear_gigantic_page()
> 
> 
> and according to the suggestion of Ying, I am trying to fix the
> performance issue on arm64(will in new series), but I need to check no 
> regression on other arm64's machine, will send the new patches once
> the test is complete.
> 
> So Andrew, please help to pick up the above two bugfix patches to
> v6.13-rcX if no objection.

I added cc:stable to both and moved them into mm-hotfixes-unstable,
targeting an upstream merge next week.

"mm: use aligned address in clear_gigantic_page()":
https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com

"mm: use aligned address in copy_user_gigantic_page()":
https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
Kefeng Wang Dec. 6, 2024, 2:08 a.m. UTC | #30
On 2024/12/6 9:47, Andrew Morton wrote:
> On Mon, 2 Dec 2024 09:03:45 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> 
>>>>> WARNING: the patch is only build tested.
>>>>
>>>> I'm holding this patch in mm-unstable because it's unclear (to me) that
>>>> this regression has been adequately addressed?
>>>
>>> Yes.  This patch isn't ready to be merged yet.  More works are needed to
>>> root cause the problem and implement the proper fix.  I guess that
>>> Kefeng will continue to work on this.
>>
>> Sorry for the late, I've been buried in other tasks.
>>
>> The two bugfix patches in the next should be merged firstly, which is
>> not related about the performance issue.
>>
>>     84ddd1450383 mm: use aligned address in copy_user_gigantic_page()
>>     9fbd869b19b8 mm: use aligned address in clear_gigantic_page()
>>
>>
>> and according to the suggestion of Ying, I am trying to fix the
>> performance issue on arm64(will in new series), but I need to check no
>> regression on other arm64's machine, will send the new patches once
>> the test is complete.
>>
>> So Andrew, please help to pick up the above two bugfix patches to
>> v6.13-rcX if no objection.
> 
> I added cc:stable to both and moved them into mm-hotfixes-unstable,
> targeting an upstream merge next week.
> 
> "mm: use aligned address in clear_gigantic_page()":
> https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com
> 
> "mm: use aligned address in copy_user_gigantic_page()":
> https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
> 

Nice, thanks a lot.

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a4441fb77f7c..a5ea006f403e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -825,7 +825,7 @@  static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(folio);
 			goto out;
 		}
-		folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
+		folio_zero_user(folio, addr);
 		__folio_mark_uptodate(folio);
 		error = hugetlb_add_to_page_cache(folio, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/memory.c b/mm/memory.c
index 75c2dfd04f72..ef47b7ea5ddd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6821,6 +6821,7 @@  static void clear_gigantic_page(struct folio *folio, unsigned long addr,
 	int i;
 
 	might_sleep();
+	addr = ALIGN_DOWN(addr, folio_size(folio));
 	for (i = 0; i < nr_pages; i++) {
 		cond_resched();
 		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
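
For illustration only, here is a small userspace sketch of what the added
ALIGN_DOWN() computes for a non-folio-aligned addr_hint.  The 1GB folio size
and the addresses are made-up values, and ALIGN_DOWN is redefined locally
rather than taken from the kernel headers.

#include <stdio.h>

/* Local stand-in for the kernel's ALIGN_DOWN(); 'a' must be a power of two. */
#define ALIGN_DOWN(x, a)	((x) & ~((unsigned long)(a) - 1))

int main(void)
{
	unsigned long page_size  = 4096UL;
	unsigned long folio_size = 1UL << 30;			/* assume a 1GB gigantic folio */
	unsigned long addr_hint  = 0x40000000UL + 5 * page_size;	/* fault on the 6th subpage */
	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size);

	/* clear_gigantic_page() then clears subpages at addr, addr + PAGE_SIZE, ... */
	printf("addr_hint %#lx -> folio base %#lx\n", addr_hint, addr);
	return 0;
}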