mm: shmem: convert to use folio_zero_range()

Message ID	20241017142504.1170208-2-wangkefeng.wang@huawei.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Kefeng Wang <wangkefeng.wang@huawei.com> To: Andrew Morton <akpm@linux-foundation.org>, Hugh Dickins <hughd@google.com> CC: David Hildenbrand <david@redhat.com>, Baolin Wang <baolin.wang@linux.alibaba.com>, Matthew Wilcox <willy@infradead.org>, <linux-mm@kvack.org>, Kefeng Wang <wangkefeng.wang@huawei.com> Subject: [PATCH] mm: shmem: convert to use folio_zero_range() Date: Thu, 17 Oct 2024 22:25:04 +0800 Message-ID: <20241017142504.1170208-2-wangkefeng.wang@huawei.com> In-Reply-To: <20241017142504.1170208-1-wangkefeng.wang@huawei.com> References: <20241017142504.1170208-1-wangkefeng.wang@huawei.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: shmem: convert to use folio_zero_range() \| expand mm: shmem: convert to use folio_zero_range()

Kefeng Wang Oct. 17, 2024, 2:25 p.m. UTC

Directly use folio_zero_range() to cleanup code.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/shmem.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

Matthew Wilcox (Oracle) Oct. 17, 2024, 3:09 p.m. UTC | #1

On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> Directly use folio_zero_range() to cleanup code.

Are you sure there's no performance regression introduced by this?
clear_highpage() is often optimised in ways that we can't optimise for
a plain memset().  On the other hand, if the folio is large, maybe a
modern CPU will be able to do better than clear-one-page-at-a-time.

IOW, what performance testing have you done with this patch?

>  	if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> -		long i, n = folio_nr_pages(folio);
> -
> -		for (i = 0; i < n; i++)
> -			clear_highpage(folio_page(folio, i));
> -		flush_dcache_folio(folio);
> +		folio_zero_range(folio, 0, folio_size(folio));
>  		folio_mark_uptodate(folio);
>  	}

Kefeng Wang Oct. 18, 2024, 5:20 a.m. UTC | #2

On 2024/10/17 23:09, Matthew Wilcox wrote:
> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>> Directly use folio_zero_range() to cleanup code.
> 
> Are you sure there's no performance regression introduced by this?
> clear_highpage() is often optimised in ways that we can't optimise for
> a plain memset().  On the other hand, if the folio is large, maybe a
> modern CPU will be able to do better than clear-one-page-at-a-time.
> 

Right, I missing this, clear_page might be better than memset, I change 
this one when look at the shmem_writepage(), which already convert to
use folio_zero_range() from clear_highpage(), also I grep 
folio_zero_range(), there are some other to use folio_zero_range().

fs/bcachefs/fs-io-buffered.c:		folio_zero_range(folio, 0, 
folio_size(folio));
fs/bcachefs/fs-io-buffered.c:			folio_zero_range(f, 0, folio_size(f));
fs/bcachefs/fs-io-buffered.c:			folio_zero_range(f, 0, folio_size(f));
fs/libfs.c:	folio_zero_range(folio, 0, folio_size(folio));
fs/ntfs3/frecord.c:		folio_zero_range(folio, 0, folio_size(folio));
mm/page_io.c:	folio_zero_range(folio, 0, folio_size(folio));
mm/shmem.c:		folio_zero_range(folio, 0, folio_size(folio));


> IOW, what performance testing have you done with this patch?

No performance test before, but I write a testcase,

1) allocate N large folios (folio_alloc(PMD_ORDER))
2) then calculate the diff(us) when clear all N folios
    clear_highpage/folio_zero_range/folio_zero_user
3) release N folios

the result(run 5 times) shown below on my machine,

N=1,
	clear_highpage  folio_zero_range    folio_zero_user
   1	  69                   74                 177
   2	  57                   62                 168
   3	  54                   58                 234
   4	  54                   58                 157
   5	  56                   62                 148
avg	  58                   62.8               176.8


N=100
	clear_highpage  folio_zero_range    folio_zero_user
   1	11015                 11309               32833
   2	10385                 11110               49751
   3	10369                 11056               33095
   4	10332                 11017               33106
   5	10483                 11000               49032
avg	10516.8               11098.4             39563.4

N=512
	clear_highpage  folio_zero_range   folio_zero_user
   1	55560                 60055              156876
   2	55485                 60024              157132
   3	55474                 60129              156658
   4	55555                 59867              157259
   5	55528                 59932              157108
avg	55520.4               60001.4            157006.6



folio_zero_user with many cond_resched(), so time fluctuates a lot,
clear_highpage is better folio_zero_range as you said.

Maybe add a new helper to convert all folio_zero_range(folio, 0, 
folio_size(folio))
to use clear_highpage + flush_dcache_folio?


> 
>>   	if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>> -		long i, n = folio_nr_pages(folio);
>> -
>> -		for (i = 0; i < n; i++)
>> -			clear_highpage(folio_page(folio, i));
>> -		flush_dcache_folio(folio);
>> +		folio_zero_range(folio, 0, folio_size(folio));
>>   		folio_mark_uptodate(folio);
>>   	}
>

Barry Song Oct. 18, 2024, 5:23 a.m. UTC | #3

On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/10/17 23:09, Matthew Wilcox wrote:
> > On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> >> Directly use folio_zero_range() to cleanup code.
> >
> > Are you sure there's no performance regression introduced by this?
> > clear_highpage() is often optimised in ways that we can't optimise for
> > a plain memset().  On the other hand, if the folio is large, maybe a
> > modern CPU will be able to do better than clear-one-page-at-a-time.
> >
>
> Right, I missing this, clear_page might be better than memset, I change
> this one when look at the shmem_writepage(), which already convert to
> use folio_zero_range() from clear_highpage(), also I grep
> folio_zero_range(), there are some other to use folio_zero_range().
>
> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> folio_size(folio));
> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 0, folio_size(f));
> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 0, folio_size(f));
> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0, folio_size(folio));
> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>
>
> > IOW, what performance testing have you done with this patch?
>
> No performance test before, but I write a testcase,
>
> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> 2) then calculate the diff(us) when clear all N folios
>     clear_highpage/folio_zero_range/folio_zero_user
> 3) release N folios
>
> the result(run 5 times) shown below on my machine,
>
> N=1,
>         clear_highpage  folio_zero_range    folio_zero_user
>    1      69                   74                 177
>    2      57                   62                 168
>    3      54                   58                 234
>    4      54                   58                 157
>    5      56                   62                 148
> avg       58                   62.8               176.8
>
>
> N=100
>         clear_highpage  folio_zero_range    folio_zero_user
>    1    11015                 11309               32833
>    2    10385                 11110               49751
>    3    10369                 11056               33095
>    4    10332                 11017               33106
>    5    10483                 11000               49032
> avg     10516.8               11098.4             39563.4
>
> N=512
>         clear_highpage  folio_zero_range   folio_zero_user
>    1    55560                 60055              156876
>    2    55485                 60024              157132
>    3    55474                 60129              156658
>    4    55555                 59867              157259
>    5    55528                 59932              157108
> avg     55520.4               60001.4            157006.6
>
>
>
> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> clear_highpage is better folio_zero_range as you said.
>
> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> folio_size(folio))
> to use clear_highpage + flush_dcache_folio?

If this also improves performance for other existing callers of
folio_zero_range(), then that's a positive outcome.

>
>
> >
> >>      if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> >> -            long i, n = folio_nr_pages(folio);
> >> -
> >> -            for (i = 0; i < n; i++)
> >> -                    clear_highpage(folio_page(folio, i));
> >> -            flush_dcache_folio(folio);
> >> +            folio_zero_range(folio, 0, folio_size(folio));
> >>              folio_mark_uptodate(folio);
> >>      }
> >
>
>

Thanks
Barry

Kefeng Wang Oct. 18, 2024, 7:32 a.m. UTC | #4

On 2024/10/18 13:23, Barry Song wrote:
> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>> Directly use folio_zero_range() to cleanup code.
>>>
>>> Are you sure there's no performance regression introduced by this?
>>> clear_highpage() is often optimised in ways that we can't optimise for
>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>
>>
>> Right, I missing this, clear_page might be better than memset, I change
>> this one when look at the shmem_writepage(), which already convert to
>> use folio_zero_range() from clear_highpage(), also I grep
>> folio_zero_range(), there are some other to use folio_zero_range().
>>
>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>> folio_size(folio));
>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 0, folio_size(f));
>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 0, folio_size(f));
>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0, folio_size(folio));
>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>
>>
>>> IOW, what performance testing have you done with this patch?
>>
>> No performance test before, but I write a testcase,
>>
>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>> 2) then calculate the diff(us) when clear all N folios
>>      clear_highpage/folio_zero_range/folio_zero_user
>> 3) release N folios
>>
>> the result(run 5 times) shown below on my machine,
>>
>> N=1,
>>          clear_highpage  folio_zero_range    folio_zero_user
>>     1      69                   74                 177
>>     2      57                   62                 168
>>     3      54                   58                 234
>>     4      54                   58                 157
>>     5      56                   62                 148
>> avg       58                   62.8               176.8
>>
>>
>> N=100
>>          clear_highpage  folio_zero_range    folio_zero_user
>>     1    11015                 11309               32833
>>     2    10385                 11110               49751
>>     3    10369                 11056               33095
>>     4    10332                 11017               33106
>>     5    10483                 11000               49032
>> avg     10516.8               11098.4             39563.4
>>
>> N=512
>>          clear_highpage  folio_zero_range   folio_zero_user
>>     1    55560                 60055              156876
>>     2    55485                 60024              157132
>>     3    55474                 60129              156658
>>     4    55555                 59867              157259
>>     5    55528                 59932              157108
>> avg     55520.4               60001.4            157006.6
>>
>>
>>
>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>> clear_highpage is better folio_zero_range as you said.
>>
>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>> folio_size(folio))
>> to use clear_highpage + flush_dcache_folio?
> 
> If this also improves performance for other existing callers of
> folio_zero_range(), then that's a positive outcome.


rm -f /tmp/test && fallocate -l 20G /tmp/test && fallocate -d -l 20G 
/tmp/test && time fallocate -l 20G /tmp/test

1）mount always(2M folio)
	with patch	without patch
real	0m1.214s	0m1.111s
user	0m0.000s	0m0.000s
sys	0m1.210s	0m1.109s

With this patch, the performance does have regression,
folio_zero_range() is bad than clear_highpage + flush_dcache_folio

with patch

   99.95%     0.00%  fallocate  [kernel.vmlinux]       [k] vfs_fallocate 

    vfs_fallocate 

  - shmem_fallocate 

       98.54% __pi_clear_page 

     - 1.38% shmem_get_folio_gfp 

          filemap_get_entry

without patch
  99.89%     0.00%  fallocate  [kernel.vmlinux]       [k] 
shmem_fallocate
   shmem_fallocate 

- shmem_get_folio_gfp 

      90.12% __memset 

    - 9.42% zero_user_segments.constprop.0 

         8.16% flush_dcache_page 

         1.03% flush_dcache_folio




2）mount  never (4K folio)
real	0m3.159s	0m3.176s
user	0m0.000s	0m0.000s
sys	0m3.150s	0m3.169s

But with this patch, the performance is improved a little,
folio_zero_range() is better than clear_highpage + flush_dcache_folio

with patch
  97.77%     3.37%  fallocate  [kernel.vmlinux]       [k] 
shmem_fallocate
- 94.40% shmem_fallocate 

    - 93.70% shmem_get_folio_gfp 

         66.60% __memset 

       - 7.43% filemap_get_entry 

            3.49% xas_load 

         1.32% zero_user_segments.constprop.0

without patch
   97.82%     3.22%  fallocate  [kernel.vmlinux]       [k] 
shmem_fallocate
  - 94.61% shmem_fallocate 

       68.18% __pi_clear_page 

     - 25.60% shmem_get_folio_gfp 

        - 7.64% filemap_get_entry 

             3.51% xas_load
> 
>>
>>
>>>
>>>>       if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>> -            long i, n = folio_nr_pages(folio);
>>>> -
>>>> -            for (i = 0; i < n; i++)
>>>> -                    clear_highpage(folio_page(folio, i));
>>>> -            flush_dcache_folio(folio);
>>>> +            folio_zero_range(folio, 0, folio_size(folio));
>>>>               folio_mark_uptodate(folio);
>>>>       }
>>>
>>
>>
> 
> Thanks
> Barry

Kefeng Wang Oct. 18, 2024, 7:47 a.m. UTC | #5

On 2024/10/18 15:32, Kefeng Wang wrote:
> 
> 
> On 2024/10/18 13:23, Barry Song wrote:
>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang 
>> <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>> Directly use folio_zero_range() to cleanup code.
>>>>
>>>> Are you sure there's no performance regression introduced by this?
>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>
>>>
>>> Right, I missing this, clear_page might be better than memset, I change
>>> this one when look at the shmem_writepage(), which already convert to
>>> use folio_zero_range() from clear_highpage(), also I grep
>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>
>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>> folio_size(folio));
>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 
>>> 0, folio_size(f));
>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 
>>> 0, folio_size(f));
>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0, 
>>> folio_size(folio));
>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>
>>>
>>>> IOW, what performance testing have you done with this patch?
>>>
>>> No performance test before, but I write a testcase,
>>>
>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>> 2) then calculate the diff(us) when clear all N folios
>>>      clear_highpage/folio_zero_range/folio_zero_user
>>> 3) release N folios
>>>
>>> the result(run 5 times) shown below on my machine,
>>>
>>> N=1,
>>>          clear_highpage  folio_zero_range    folio_zero_user
>>>     1      69                   74                 177
>>>     2      57                   62                 168
>>>     3      54                   58                 234
>>>     4      54                   58                 157
>>>     5      56                   62                 148
>>> avg       58                   62.8               176.8
>>>
>>>
>>> N=100
>>>          clear_highpage  folio_zero_range    folio_zero_user
>>>     1    11015                 11309               32833
>>>     2    10385                 11110               49751
>>>     3    10369                 11056               33095
>>>     4    10332                 11017               33106
>>>     5    10483                 11000               49032
>>> avg     10516.8               11098.4             39563.4
>>>
>>> N=512
>>>          clear_highpage  folio_zero_range   folio_zero_user
>>>     1    55560                 60055              156876
>>>     2    55485                 60024              157132
>>>     3    55474                 60129              156658
>>>     4    55555                 59867              157259
>>>     5    55528                 59932              157108
>>> avg     55520.4               60001.4            157006.6
>>>
>>>
>>>
>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>> clear_highpage is better folio_zero_range as you said.
>>>
>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>> folio_size(folio))
>>> to use clear_highpage + flush_dcache_folio?
>>
>> If this also improves performance for other existing callers of
>> folio_zero_range(), then that's a positive outcome.
> 
> 
> rm -f /tmp/test && fallocate -l 20G /tmp/test && fallocate -d -l 20G / 
> tmp/test && time fallocate -l 20G /tmp/test
> 
> 1）mount always(2M folio)
>      with patch    without patch
> real    0m1.214s    0m1.111s
> user    0m0.000s    0m0.000s
> sys    0m1.210s    0m1.109s
> 
> With this patch, the performance does have regression,
> folio_zero_range() is bad than clear_highpage + flush_dcache_folio
> 
> with patch

Oh, this should without patch since it uses clear_highpage,

> 
>    99.95%     0.00%  fallocate  [kernel.vmlinux]       [k] vfs_fallocate
>     vfs_fallocate
>   - shmem_fallocate
>        98.54% __pi_clear_page
>      - 1.38% shmem_get_folio_gfp
>           filemap_get_entry
> 
and this one is with patch
> without patch
>   99.89%     0.00%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
>    shmem_fallocate
> - shmem_get_folio_gfp
>       90.12% __memset
>     - 9.42% zero_user_segments.constprop.0
>          8.16% flush_dcache_page
>          1.03% flush_dcache_folio
> 
> 
> 
> 
> 2）mount  never (4K folio)
> real    0m3.159s    0m3.176s
> user    0m0.000s    0m0.000s
> sys    0m3.150s    0m3.169s
> 
> But with this patch, the performance is improved a little,
> folio_zero_range() is better than clear_highpage + flush_dcache_folio
> 

For 4K, the result is fluctuating, so maybe no different.

> with patch
>   97.77%     3.37%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
> - 94.40% shmem_fallocate
>     - 93.70% shmem_get_folio_gfp
>          66.60% __memset
>        - 7.43% filemap_get_entry
>             3.49% xas_load
>          1.32% zero_user_segments.constprop.0
> 
> without patch
>    97.82%     3.22%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
>   - 94.61% shmem_fallocate
>        68.18% __pi_clear_page
>      - 25.60% shmem_get_folio_gfp
>         - 7.64% filemap_get_entry
>              3.51% xas_load
>>
>>>
>>>
>>>>
>>>>>       if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>> -            long i, n = folio_nr_pages(folio);
>>>>> -
>>>>> -            for (i = 0; i < n; i++)
>>>>> -                    clear_highpage(folio_page(folio, i));
>>>>> -            flush_dcache_folio(folio);
>>>>> +            folio_zero_range(folio, 0, folio_size(folio));
>>>>>               folio_mark_uptodate(folio);
>>>>>       }
>>>>
>>>
>>>
>>
>> Thanks
>> Barry
> 
>

Barry Song Oct. 21, 2024, 4:15 a.m. UTC | #6

On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/10/18 15:32, Kefeng Wang wrote:
> >
> >
> > On 2024/10/18 13:23, Barry Song wrote:
> >> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
> >> <wangkefeng.wang@huawei.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2024/10/17 23:09, Matthew Wilcox wrote:
> >>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> >>>>> Directly use folio_zero_range() to cleanup code.
> >>>>
> >>>> Are you sure there's no performance regression introduced by this?
> >>>> clear_highpage() is often optimised in ways that we can't optimise for
> >>>> a plain memset().  On the other hand, if the folio is large, maybe a
> >>>> modern CPU will be able to do better than clear-one-page-at-a-time.
> >>>>
> >>>
> >>> Right, I missing this, clear_page might be better than memset, I change
> >>> this one when look at the shmem_writepage(), which already convert to
> >>> use folio_zero_range() from clear_highpage(), also I grep
> >>> folio_zero_range(), there are some other to use folio_zero_range().
> >>>
> >>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> >>> folio_size(folio));
> >>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>> 0, folio_size(f));
> >>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>> 0, folio_size(f));
> >>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> >>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
> >>> folio_size(folio));
> >>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> >>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
> >>>
> >>>
> >>>> IOW, what performance testing have you done with this patch?
> >>>
> >>> No performance test before, but I write a testcase,
> >>>
> >>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> >>> 2) then calculate the diff(us) when clear all N folios
> >>>      clear_highpage/folio_zero_range/folio_zero_user
> >>> 3) release N folios
> >>>
> >>> the result(run 5 times) shown below on my machine,
> >>>
> >>> N=1,
> >>>          clear_highpage  folio_zero_range    folio_zero_user
> >>>     1      69                   74                 177
> >>>     2      57                   62                 168
> >>>     3      54                   58                 234
> >>>     4      54                   58                 157
> >>>     5      56                   62                 148
> >>> avg       58                   62.8               176.8
> >>>
> >>>
> >>> N=100
> >>>          clear_highpage  folio_zero_range    folio_zero_user
> >>>     1    11015                 11309               32833
> >>>     2    10385                 11110               49751
> >>>     3    10369                 11056               33095
> >>>     4    10332                 11017               33106
> >>>     5    10483                 11000               49032
> >>> avg     10516.8               11098.4             39563.4
> >>>
> >>> N=512
> >>>          clear_highpage  folio_zero_range   folio_zero_user
> >>>     1    55560                 60055              156876
> >>>     2    55485                 60024              157132
> >>>     3    55474                 60129              156658
> >>>     4    55555                 59867              157259
> >>>     5    55528                 59932              157108
> >>> avg     55520.4               60001.4            157006.6
> >>>
> >>>
> >>>
> >>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> >>> clear_highpage is better folio_zero_range as you said.
> >>>
> >>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> >>> folio_size(folio))
> >>> to use clear_highpage + flush_dcache_folio?
> >>
> >> If this also improves performance for other existing callers of
> >> folio_zero_range(), then that's a positive outcome.
> >
> >
> > rm -f /tmp/test && fallocate -l 20G /tmp/test && fallocate -d -l 20G /
> > tmp/test && time fallocate -l 20G /tmp/test
> >
> > 1）mount always(2M folio)
> >      with patch    without patch
> > real    0m1.214s    0m1.111s
> > user    0m0.000s    0m0.000s
> > sys    0m1.210s    0m1.109s
> >
> > With this patch, the performance does have regression,
> > folio_zero_range() is bad than clear_highpage + flush_dcache_folio
> >
> > with patch
>
> Oh, this should without patch since it uses clear_highpage,
>
> >
> >    99.95%     0.00%  fallocate  [kernel.vmlinux]       [k] vfs_fallocate
> >     vfs_fallocate
> >   - shmem_fallocate
> >        98.54% __pi_clear_page
> >      - 1.38% shmem_get_folio_gfp
> >           filemap_get_entry
> >
> and this one is with patch
> > without patch
> >   99.89%     0.00%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
> >    shmem_fallocate
> > - shmem_get_folio_gfp
> >       90.12% __memset
> >     - 9.42% zero_user_segments.constprop.0
> >          8.16% flush_dcache_page
> >          1.03% flush_dcache_folio
> >
> >
> >
> >
> > 2）mount  never (4K folio)
> > real    0m3.159s    0m3.176s
> > user    0m0.000s    0m0.000s
> > sys    0m3.150s    0m3.169s
> >
> > But with this patch, the performance is improved a little,
> > folio_zero_range() is better than clear_highpage + flush_dcache_folio
> >
>
> For 4K, the result is fluctuating, so maybe no different.

hi Kefeng,
what's your point? providing a helper like clear_highfolio() or similar?

>
> > with patch
> >   97.77%     3.37%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
> > - 94.40% shmem_fallocate
> >     - 93.70% shmem_get_folio_gfp
> >          66.60% __memset
> >        - 7.43% filemap_get_entry
> >             3.49% xas_load
> >          1.32% zero_user_segments.constprop.0
> >
> > without patch
> >    97.82%     3.22%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
> >   - 94.61% shmem_fallocate
> >        68.18% __pi_clear_page
> >      - 25.60% shmem_get_folio_gfp
> >         - 7.64% filemap_get_entry
> >              3.51% xas_load
> >>
> >>>
> >>>
> >>>>
> >>>>>       if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> >>>>> -            long i, n = folio_nr_pages(folio);
> >>>>> -
> >>>>> -            for (i = 0; i < n; i++)
> >>>>> -                    clear_highpage(folio_page(folio, i));
> >>>>> -            flush_dcache_folio(folio);
> >>>>> +            folio_zero_range(folio, 0, folio_size(folio));
> >>>>>               folio_mark_uptodate(folio);
> >>>>>       }
> >>>>
> >>>
> >>>
> >>
> >> Thanks
> >> Barry
> >
> >
>

Kefeng Wang Oct. 21, 2024, 5:16 a.m. UTC | #7

On 2024/10/21 12:15, Barry Song wrote:
> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/18 15:32, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/10/18 13:23, Barry Song wrote:
>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>
>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>
>>>>>
>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>
>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>> folio_size(folio));
>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>> 0, folio_size(f));
>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>> 0, folio_size(f));
>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>> folio_size(folio));
>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>
>>>>>
>>>>>> IOW, what performance testing have you done with this patch?
>>>>>
>>>>> No performance test before, but I write a testcase,
>>>>>
>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>       clear_highpage/folio_zero_range/folio_zero_user
>>>>> 3) release N folios
>>>>>
>>>>> the result(run 5 times) shown below on my machine,
>>>>>
>>>>> N=1,
>>>>>           clear_highpage  folio_zero_range    folio_zero_user
>>>>>      1      69                   74                 177
>>>>>      2      57                   62                 168
>>>>>      3      54                   58                 234
>>>>>      4      54                   58                 157
>>>>>      5      56                   62                 148
>>>>> avg       58                   62.8               176.8
>>>>>
>>>>>
>>>>> N=100
>>>>>           clear_highpage  folio_zero_range    folio_zero_user
>>>>>      1    11015                 11309               32833
>>>>>      2    10385                 11110               49751
>>>>>      3    10369                 11056               33095
>>>>>      4    10332                 11017               33106
>>>>>      5    10483                 11000               49032
>>>>> avg     10516.8               11098.4             39563.4
>>>>>
>>>>> N=512
>>>>>           clear_highpage  folio_zero_range   folio_zero_user
>>>>>      1    55560                 60055              156876
>>>>>      2    55485                 60024              157132
>>>>>      3    55474                 60129              156658
>>>>>      4    55555                 59867              157259
>>>>>      5    55528                 59932              157108
>>>>> avg     55520.4               60001.4            157006.6
>>>>>
>>>>>
>>>>>
>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>
>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>> folio_size(folio))
>>>>> to use clear_highpage + flush_dcache_folio?
>>>>
>>>> If this also improves performance for other existing callers of
>>>> folio_zero_range(), then that's a positive outcome.
>>>
>>>
>>> rm -f /tmp/test && fallocate -l 20G /tmp/test && fallocate -d -l 20G /
>>> tmp/test && time fallocate -l 20G /tmp/test
>>>
>>> 1）mount always(2M folio)
>>>       with patch    without patch
>>> real    0m1.214s    0m1.111s
>>> user    0m0.000s    0m0.000s
>>> sys    0m1.210s    0m1.109s
>>>
>>> With this patch, the performance does have regression,
>>> folio_zero_range() is bad than clear_highpage + flush_dcache_folio
>>>
>>> with patch
>>
>> Oh, this should without patch since it uses clear_highpage,
>>
>>>
>>>     99.95%     0.00%  fallocate  [kernel.vmlinux]       [k] vfs_fallocate
>>>      vfs_fallocate
>>>    - shmem_fallocate
>>>         98.54% __pi_clear_page
>>>       - 1.38% shmem_get_folio_gfp
>>>            filemap_get_entry
>>>
>> and this one is with patch
>>> without patch
>>>    99.89%     0.00%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
>>>     shmem_fallocate
>>> - shmem_get_folio_gfp
>>>        90.12% __memset
>>>      - 9.42% zero_user_segments.constprop.0
>>>           8.16% flush_dcache_page
>>>           1.03% flush_dcache_folio
>>>
>>>
>>>
>>>
>>> 2）mount  never (4K folio)
>>> real    0m3.159s    0m3.176s
>>> user    0m0.000s    0m0.000s
>>> sys    0m3.150s    0m3.169s
>>>
>>> But with this patch, the performance is improved a little,
>>> folio_zero_range() is better than clear_highpage + flush_dcache_folio
>>>
>>
>> For 4K, the result is fluctuating, so maybe no different.
> 
> hi Kefeng,
> what's your point? providing a helper like clear_highfolio() or similar?

Yes, from above test, using clear_highpage/flush_dcache_folio is better
than using folio_zero_range() for folio zero(especially for large 
folio), so I'd like to add a new helper, maybe name it folio_zero()
since it zero the whole folio.

> 
>>
>>> with patch
>>>    97.77%     3.37%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
>>> - 94.40% shmem_fallocate
>>>      - 93.70% shmem_get_folio_gfp
>>>           66.60% __memset
>>>         - 7.43% filemap_get_entry
>>>              3.49% xas_load
>>>           1.32% zero_user_segments.constprop.0
>>>
>>> without patch
>>>     97.82%     3.22%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
>>>    - 94.61% shmem_fallocate
>>>         68.18% __pi_clear_page
>>>       - 25.60% shmem_get_folio_gfp
>>>          - 7.64% filemap_get_entry
>>>               3.51% xas_load
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>        if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>> -            long i, n = folio_nr_pages(folio);
>>>>>>> -
>>>>>>> -            for (i = 0; i < n; i++)
>>>>>>> -                    clear_highpage(folio_page(folio, i));
>>>>>>> -            flush_dcache_folio(folio);
>>>>>>> +            folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>                folio_mark_uptodate(folio);
>>>>>>>        }
>>>>>>
>>>>>
>>>>>
>>>>
>>>> Thanks
>>>> Barry
>>>
>>>
>>

Barry Song Oct. 21, 2024, 5:38 a.m. UTC | #8

On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/10/21 12:15, Barry Song wrote:
> > On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2024/10/18 15:32, Kefeng Wang wrote:
> >>>
> >>>
> >>> On 2024/10/18 13:23, Barry Song wrote:
> >>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
> >>>> <wangkefeng.wang@huawei.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
> >>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> >>>>>>> Directly use folio_zero_range() to cleanup code.
> >>>>>>
> >>>>>> Are you sure there's no performance regression introduced by this?
> >>>>>> clear_highpage() is often optimised in ways that we can't optimise for
> >>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
> >>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
> >>>>>>
> >>>>>
> >>>>> Right, I missing this, clear_page might be better than memset, I change
> >>>>> this one when look at the shmem_writepage(), which already convert to
> >>>>> use folio_zero_range() from clear_highpage(), also I grep
> >>>>> folio_zero_range(), there are some other to use folio_zero_range().
> >>>>>
> >>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> >>>>> folio_size(folio));
> >>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>> 0, folio_size(f));
> >>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>> 0, folio_size(f));
> >>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> >>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
> >>>>> folio_size(folio));
> >>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> >>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
> >>>>>
> >>>>>
> >>>>>> IOW, what performance testing have you done with this patch?
> >>>>>
> >>>>> No performance test before, but I write a testcase,
> >>>>>
> >>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> >>>>> 2) then calculate the diff(us) when clear all N folios
> >>>>>       clear_highpage/folio_zero_range/folio_zero_user
> >>>>> 3) release N folios
> >>>>>
> >>>>> the result(run 5 times) shown below on my machine,
> >>>>>
> >>>>> N=1,
> >>>>>           clear_highpage  folio_zero_range    folio_zero_user
> >>>>>      1      69                   74                 177
> >>>>>      2      57                   62                 168
> >>>>>      3      54                   58                 234
> >>>>>      4      54                   58                 157
> >>>>>      5      56                   62                 148
> >>>>> avg       58                   62.8               176.8
> >>>>>
> >>>>>
> >>>>> N=100
> >>>>>           clear_highpage  folio_zero_range    folio_zero_user
> >>>>>      1    11015                 11309               32833
> >>>>>      2    10385                 11110               49751
> >>>>>      3    10369                 11056               33095
> >>>>>      4    10332                 11017               33106
> >>>>>      5    10483                 11000               49032
> >>>>> avg     10516.8               11098.4             39563.4
> >>>>>
> >>>>> N=512
> >>>>>           clear_highpage  folio_zero_range   folio_zero_user
> >>>>>      1    55560                 60055              156876
> >>>>>      2    55485                 60024              157132
> >>>>>      3    55474                 60129              156658
> >>>>>      4    55555                 59867              157259
> >>>>>      5    55528                 59932              157108
> >>>>> avg     55520.4               60001.4            157006.6
> >>>>>
> >>>>>
> >>>>>
> >>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> >>>>> clear_highpage is better folio_zero_range as you said.
> >>>>>
> >>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> >>>>> folio_size(folio))
> >>>>> to use clear_highpage + flush_dcache_folio?
> >>>>
> >>>> If this also improves performance for other existing callers of
> >>>> folio_zero_range(), then that's a positive outcome.
> >>>
> >>>
> >>> rm -f /tmp/test && fallocate -l 20G /tmp/test && fallocate -d -l 20G /
> >>> tmp/test && time fallocate -l 20G /tmp/test
> >>>
> >>> 1）mount always(2M folio)
> >>>       with patch    without patch
> >>> real    0m1.214s    0m1.111s
> >>> user    0m0.000s    0m0.000s
> >>> sys    0m1.210s    0m1.109s
> >>>
> >>> With this patch, the performance does have regression,
> >>> folio_zero_range() is bad than clear_highpage + flush_dcache_folio
> >>>
> >>> with patch
> >>
> >> Oh, this should without patch since it uses clear_highpage,
> >>
> >>>
> >>>     99.95%     0.00%  fallocate  [kernel.vmlinux]       [k] vfs_fallocate
> >>>      vfs_fallocate
> >>>    - shmem_fallocate
> >>>         98.54% __pi_clear_page
> >>>       - 1.38% shmem_get_folio_gfp
> >>>            filemap_get_entry
> >>>
> >> and this one is with patch
> >>> without patch
> >>>    99.89%     0.00%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
> >>>     shmem_fallocate
> >>> - shmem_get_folio_gfp
> >>>        90.12% __memset
> >>>      - 9.42% zero_user_segments.constprop.0
> >>>           8.16% flush_dcache_page
> >>>           1.03% flush_dcache_folio
> >>>
> >>>
> >>>
> >>>
> >>> 2）mount  never (4K folio)
> >>> real    0m3.159s    0m3.176s
> >>> user    0m0.000s    0m0.000s
> >>> sys    0m3.150s    0m3.169s
> >>>
> >>> But with this patch, the performance is improved a little,
> >>> folio_zero_range() is better than clear_highpage + flush_dcache_folio
> >>>
> >>
> >> For 4K, the result is fluctuating, so maybe no different.
> >
> > hi Kefeng,
> > what's your point? providing a helper like clear_highfolio() or similar?
>
> Yes, from above test, using clear_highpage/flush_dcache_folio is better
> than using folio_zero_range() for folio zero(especially for large
> folio), so I'd like to add a new helper, maybe name it folio_zero()
> since it zero the whole folio.

we already have a helper like folio_zero_user()?
it is not good enough?

>
> >
> >>
> >>> with patch
> >>>    97.77%     3.37%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
> >>> - 94.40% shmem_fallocate
> >>>      - 93.70% shmem_get_folio_gfp
> >>>           66.60% __memset
> >>>         - 7.43% filemap_get_entry
> >>>              3.49% xas_load
> >>>           1.32% zero_user_segments.constprop.0
> >>>
> >>> without patch
> >>>     97.82%     3.22%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
> >>>    - 94.61% shmem_fallocate
> >>>         68.18% __pi_clear_page
> >>>       - 25.60% shmem_get_folio_gfp
> >>>          - 7.64% filemap_get_entry
> >>>               3.51% xas_load
> >>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>>        if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> >>>>>>> -            long i, n = folio_nr_pages(folio);
> >>>>>>> -
> >>>>>>> -            for (i = 0; i < n; i++)
> >>>>>>> -                    clear_highpage(folio_page(folio, i));
> >>>>>>> -            flush_dcache_folio(folio);
> >>>>>>> +            folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>                folio_mark_uptodate(folio);
> >>>>>>>        }
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> Thanks
> >>>> Barry
> >>>
> >>>
> >>
>

Kefeng Wang Oct. 21, 2024, 6:09 a.m. UTC | #9

On 2024/10/21 13:38, Barry Song wrote:
> On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/21 12:15, Barry Song wrote:
>>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/10/18 15:32, Kefeng Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/10/18 13:23, Barry Song wrote:
>>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>
>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>
>>>>>>>
>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>
>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>> folio_size(folio));
>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>> 0, folio_size(f));
>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>> 0, folio_size(f));
>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>> folio_size(folio));
>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>
>>>>>>>
>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>
>>>>>>> No performance test before, but I write a testcase,
>>>>>>>
>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>        clear_highpage/folio_zero_range/folio_zero_user
>>>>>>> 3) release N folios
>>>>>>>
>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>
>>>>>>> N=1,
>>>>>>>            clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>       1      69                   74                 177
>>>>>>>       2      57                   62                 168
>>>>>>>       3      54                   58                 234
>>>>>>>       4      54                   58                 157
>>>>>>>       5      56                   62                 148
>>>>>>> avg       58                   62.8               176.8
>>>>>>>
>>>>>>>
>>>>>>> N=100
>>>>>>>            clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>       1    11015                 11309               32833
>>>>>>>       2    10385                 11110               49751
>>>>>>>       3    10369                 11056               33095
>>>>>>>       4    10332                 11017               33106
>>>>>>>       5    10483                 11000               49032
>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>
>>>>>>> N=512
>>>>>>>            clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>       1    55560                 60055              156876
>>>>>>>       2    55485                 60024              157132
>>>>>>>       3    55474                 60129              156658
>>>>>>>       4    55555                 59867              157259
>>>>>>>       5    55528                 59932              157108
>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>
>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>> folio_size(folio))
>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>
>>>>>> If this also improves performance for other existing callers of
>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>
...

>>> hi Kefeng,
>>> what's your point? providing a helper like clear_highfolio() or similar?
>>
>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>> than using folio_zero_range() for folio zero(especially for large
>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>> since it zero the whole folio.
> 
> we already have a helper like folio_zero_user()?
> it is not good enough?

Since it is with many cond_resched(), the performance is worst...

Barry Song Oct. 21, 2024, 7:47 a.m. UTC | #10

On Mon, Oct 21, 2024 at 7:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/10/21 13:38, Barry Song wrote:
> > On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2024/10/21 12:15, Barry Song wrote:
> >>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2024/10/18 15:32, Kefeng Wang wrote:
> >>>>>
> >>>>>
> >>>>> On 2024/10/18 13:23, Barry Song wrote:
> >>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
> >>>>>> <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
> >>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> >>>>>>>>> Directly use folio_zero_range() to cleanup code.
> >>>>>>>>
> >>>>>>>> Are you sure there's no performance regression introduced by this?
> >>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
> >>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
> >>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Right, I missing this, clear_page might be better than memset, I change
> >>>>>>> this one when look at the shmem_writepage(), which already convert to
> >>>>>>> use folio_zero_range() from clear_highpage(), also I grep
> >>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
> >>>>>>>
> >>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> >>>>>>> folio_size(folio));
> >>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>>>> 0, folio_size(f));
> >>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>>>> 0, folio_size(f));
> >>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
> >>>>>>> folio_size(folio));
> >>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>
> >>>>>>>
> >>>>>>>> IOW, what performance testing have you done with this patch?
> >>>>>>>
> >>>>>>> No performance test before, but I write a testcase,
> >>>>>>>
> >>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> >>>>>>> 2) then calculate the diff(us) when clear all N folios
> >>>>>>>        clear_highpage/folio_zero_range/folio_zero_user
> >>>>>>> 3) release N folios
> >>>>>>>
> >>>>>>> the result(run 5 times) shown below on my machine,
> >>>>>>>
> >>>>>>> N=1,
> >>>>>>>            clear_highpage  folio_zero_range    folio_zero_user
> >>>>>>>       1      69                   74                 177
> >>>>>>>       2      57                   62                 168
> >>>>>>>       3      54                   58                 234
> >>>>>>>       4      54                   58                 157
> >>>>>>>       5      56                   62                 148
> >>>>>>> avg       58                   62.8               176.8
> >>>>>>>
> >>>>>>>
> >>>>>>> N=100
> >>>>>>>            clear_highpage  folio_zero_range    folio_zero_user
> >>>>>>>       1    11015                 11309               32833
> >>>>>>>       2    10385                 11110               49751
> >>>>>>>       3    10369                 11056               33095
> >>>>>>>       4    10332                 11017               33106
> >>>>>>>       5    10483                 11000               49032
> >>>>>>> avg     10516.8               11098.4             39563.4
> >>>>>>>
> >>>>>>> N=512
> >>>>>>>            clear_highpage  folio_zero_range   folio_zero_user
> >>>>>>>       1    55560                 60055              156876
> >>>>>>>       2    55485                 60024              157132
> >>>>>>>       3    55474                 60129              156658
> >>>>>>>       4    55555                 59867              157259
> >>>>>>>       5    55528                 59932              157108
> >>>>>>> avg     55520.4               60001.4            157006.6
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> >>>>>>> clear_highpage is better folio_zero_range as you said.
> >>>>>>>
> >>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> >>>>>>> folio_size(folio))
> >>>>>>> to use clear_highpage + flush_dcache_folio?
> >>>>>>
> >>>>>> If this also improves performance for other existing callers of
> >>>>>> folio_zero_range(), then that's a positive outcome.
> >>>>>
> ...
>
> >>> hi Kefeng,
> >>> what's your point? providing a helper like clear_highfolio() or similar?
> >>
> >> Yes, from above test, using clear_highpage/flush_dcache_folio is better
> >> than using folio_zero_range() for folio zero(especially for large
> >> folio), so I'd like to add a new helper, maybe name it folio_zero()
> >> since it zero the whole folio.
> >
> > we already have a helper like folio_zero_user()?
> > it is not good enough?
>
> Since it is with many cond_resched(), the performance is worst...

Not exactly? It should have zero cost for a preemptible kernel.
For a non-preemptible kernel, it helps avoid clearing the folio
from occupying the CPU and starving other processes, right?

Barry Song Oct. 21, 2024, 7:55 a.m. UTC | #11

On Mon, Oct 21, 2024 at 8:47 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Oct 21, 2024 at 7:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >
> >
> >
> > On 2024/10/21 13:38, Barry Song wrote:
> > > On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2024/10/21 12:15, Barry Song wrote:
> > >>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 2024/10/18 15:32, Kefeng Wang wrote:
> > >>>>>
> > >>>>>
> > >>>>> On 2024/10/18 13:23, Barry Song wrote:
> > >>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
> > >>>>>> <wangkefeng.wang@huawei.com> wrote:
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
> > >>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> > >>>>>>>>> Directly use folio_zero_range() to cleanup code.
> > >>>>>>>>
> > >>>>>>>> Are you sure there's no performance regression introduced by this?
> > >>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
> > >>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
> > >>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> Right, I missing this, clear_page might be better than memset, I change
> > >>>>>>> this one when look at the shmem_writepage(), which already convert to
> > >>>>>>> use folio_zero_range() from clear_highpage(), also I grep
> > >>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
> > >>>>>>>
> > >>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> > >>>>>>> folio_size(folio));
> > >>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> > >>>>>>> 0, folio_size(f));
> > >>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> > >>>>>>> 0, folio_size(f));
> > >>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> > >>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
> > >>>>>>> folio_size(folio));
> > >>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> > >>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> IOW, what performance testing have you done with this patch?
> > >>>>>>>
> > >>>>>>> No performance test before, but I write a testcase,
> > >>>>>>>
> > >>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> > >>>>>>> 2) then calculate the diff(us) when clear all N folios
> > >>>>>>>        clear_highpage/folio_zero_range/folio_zero_user
> > >>>>>>> 3) release N folios
> > >>>>>>>
> > >>>>>>> the result(run 5 times) shown below on my machine,
> > >>>>>>>
> > >>>>>>> N=1,
> > >>>>>>>            clear_highpage  folio_zero_range    folio_zero_user
> > >>>>>>>       1      69                   74                 177
> > >>>>>>>       2      57                   62                 168
> > >>>>>>>       3      54                   58                 234
> > >>>>>>>       4      54                   58                 157
> > >>>>>>>       5      56                   62                 148
> > >>>>>>> avg       58                   62.8               176.8
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> N=100
> > >>>>>>>            clear_highpage  folio_zero_range    folio_zero_user
> > >>>>>>>       1    11015                 11309               32833
> > >>>>>>>       2    10385                 11110               49751
> > >>>>>>>       3    10369                 11056               33095
> > >>>>>>>       4    10332                 11017               33106
> > >>>>>>>       5    10483                 11000               49032
> > >>>>>>> avg     10516.8               11098.4             39563.4
> > >>>>>>>
> > >>>>>>> N=512
> > >>>>>>>            clear_highpage  folio_zero_range   folio_zero_user
> > >>>>>>>       1    55560                 60055              156876
> > >>>>>>>       2    55485                 60024              157132
> > >>>>>>>       3    55474                 60129              156658
> > >>>>>>>       4    55555                 59867              157259
> > >>>>>>>       5    55528                 59932              157108
> > >>>>>>> avg     55520.4               60001.4            157006.6
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> > >>>>>>> clear_highpage is better folio_zero_range as you said.
> > >>>>>>>
> > >>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> > >>>>>>> folio_size(folio))
> > >>>>>>> to use clear_highpage + flush_dcache_folio?
> > >>>>>>
> > >>>>>> If this also improves performance for other existing callers of
> > >>>>>> folio_zero_range(), then that's a positive outcome.
> > >>>>>
> > ...
> >
> > >>> hi Kefeng,
> > >>> what's your point? providing a helper like clear_highfolio() or similar?
> > >>
> > >> Yes, from above test, using clear_highpage/flush_dcache_folio is better
> > >> than using folio_zero_range() for folio zero(especially for large
> > >> folio), so I'd like to add a new helper, maybe name it folio_zero()
> > >> since it zero the whole folio.
> > >
> > > we already have a helper like folio_zero_user()?
> > > it is not good enough?
> >
> > Since it is with many cond_resched(), the performance is worst...
>
> Not exactly? It should have zero cost for a preemptible kernel.
> For a non-preemptible kernel, it helps avoid clearing the folio
> from occupying the CPU and starving other processes, right?

--- a/mm/shmem.c
+++ b/mm/shmem.c

@@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
*inode, pgoff_t index,
         * it now, lest undo on failure cancel our earlier guarantee.
         */

        if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
-               long i, n = folio_nr_pages(folio);
-
-               for (i = 0; i < n; i++)
-                       clear_highpage(folio_page(folio, i));
+               folio_zero_user(folio, vmf->address);
                flush_dcache_folio(folio);
                folio_mark_uptodate(folio);
        }

Do we perform better or worse with the following?

Kefeng Wang Oct. 21, 2024, 8:14 a.m. UTC | #12

On 2024/10/21 15:55, Barry Song wrote:
> On Mon, Oct 21, 2024 at 8:47 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Mon, Oct 21, 2024 at 7:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/10/21 13:38, Barry Song wrote:
>>>> On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/10/21 12:15, Barry Song wrote:
>>>>>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/10/18 15:32, Kefeng Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/10/18 13:23, Barry Song wrote:
>>>>>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>
>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>
>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>> folio_size(folio));
>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>> folio_size(folio));
>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>
>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>
>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>         clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>> 3) release N folios
>>>>>>>>>>
>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>
>>>>>>>>>> N=1,
>>>>>>>>>>             clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>        1      69                   74                 177
>>>>>>>>>>        2      57                   62                 168
>>>>>>>>>>        3      54                   58                 234
>>>>>>>>>>        4      54                   58                 157
>>>>>>>>>>        5      56                   62                 148
>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> N=100
>>>>>>>>>>             clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>        1    11015                 11309               32833
>>>>>>>>>>        2    10385                 11110               49751
>>>>>>>>>>        3    10369                 11056               33095
>>>>>>>>>>        4    10332                 11017               33106
>>>>>>>>>>        5    10483                 11000               49032
>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>
>>>>>>>>>> N=512
>>>>>>>>>>             clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>        1    55560                 60055              156876
>>>>>>>>>>        2    55485                 60024              157132
>>>>>>>>>>        3    55474                 60129              156658
>>>>>>>>>>        4    55555                 59867              157259
>>>>>>>>>>        5    55528                 59932              157108
>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>
>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>> folio_size(folio))
>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>
>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>
>>> ...
>>>
>>>>>> hi Kefeng,
>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>
>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>> since it zero the whole folio.
>>>>
>>>> we already have a helper like folio_zero_user()?
>>>> it is not good enough?
>>>
>>> Since it is with many cond_resched(), the performance is worst...
>>
>> Not exactly? It should have zero cost for a preemptible kernel.
>> For a non-preemptible kernel, it helps avoid clearing the folio
>> from occupying the CPU and starving other processes, right?
> 
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> 
> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
> *inode, pgoff_t index,
>           * it now, lest undo on failure cancel our earlier guarantee.
>           */
> 
>          if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> -               long i, n = folio_nr_pages(folio);
> -
> -               for (i = 0; i < n; i++)
> -                       clear_highpage(folio_page(folio, i));
> +               folio_zero_user(folio, vmf->address);
>                  flush_dcache_folio(folio);
>                  folio_mark_uptodate(folio);
>          }
> 
> Do we perform better or worse with the following?

Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio, 
0), I think the performance is worse, will retest once I can access 
hardware.

Barry Song Oct. 21, 2024, 9:17 a.m. UTC | #13

On Mon, Oct 21, 2024 at 9:14 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/10/21 15:55, Barry Song wrote:
> > On Mon, Oct 21, 2024 at 8:47 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Mon, Oct 21, 2024 at 7:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2024/10/21 13:38, Barry Song wrote:
> >>>> On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2024/10/21 12:15, Barry Song wrote:
> >>>>>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/10/18 15:32, Kefeng Wang wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 2024/10/18 13:23, Barry Song wrote:
> >>>>>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
> >>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
> >>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> >>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
> >>>>>>>>>>>
> >>>>>>>>>>> Are you sure there's no performance regression introduced by this?
> >>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
> >>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
> >>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
> >>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
> >>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
> >>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
> >>>>>>>>>>
> >>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> >>>>>>>>>> folio_size(folio));
> >>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>>>>>>> 0, folio_size(f));
> >>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>>>>>>> 0, folio_size(f));
> >>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
> >>>>>>>>>> folio_size(folio));
> >>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> IOW, what performance testing have you done with this patch?
> >>>>>>>>>>
> >>>>>>>>>> No performance test before, but I write a testcase,
> >>>>>>>>>>
> >>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> >>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
> >>>>>>>>>>         clear_highpage/folio_zero_range/folio_zero_user
> >>>>>>>>>> 3) release N folios
> >>>>>>>>>>
> >>>>>>>>>> the result(run 5 times) shown below on my machine,
> >>>>>>>>>>
> >>>>>>>>>> N=1,
> >>>>>>>>>>             clear_highpage  folio_zero_range    folio_zero_user
> >>>>>>>>>>        1      69                   74                 177
> >>>>>>>>>>        2      57                   62                 168
> >>>>>>>>>>        3      54                   58                 234
> >>>>>>>>>>        4      54                   58                 157
> >>>>>>>>>>        5      56                   62                 148
> >>>>>>>>>> avg       58                   62.8               176.8
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> N=100
> >>>>>>>>>>             clear_highpage  folio_zero_range    folio_zero_user
> >>>>>>>>>>        1    11015                 11309               32833
> >>>>>>>>>>        2    10385                 11110               49751
> >>>>>>>>>>        3    10369                 11056               33095
> >>>>>>>>>>        4    10332                 11017               33106
> >>>>>>>>>>        5    10483                 11000               49032
> >>>>>>>>>> avg     10516.8               11098.4             39563.4
> >>>>>>>>>>
> >>>>>>>>>> N=512
> >>>>>>>>>>             clear_highpage  folio_zero_range   folio_zero_user
> >>>>>>>>>>        1    55560                 60055              156876
> >>>>>>>>>>        2    55485                 60024              157132
> >>>>>>>>>>        3    55474                 60129              156658
> >>>>>>>>>>        4    55555                 59867              157259
> >>>>>>>>>>        5    55528                 59932              157108
> >>>>>>>>>> avg     55520.4               60001.4            157006.6
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> >>>>>>>>>> clear_highpage is better folio_zero_range as you said.
> >>>>>>>>>>
> >>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> >>>>>>>>>> folio_size(folio))
> >>>>>>>>>> to use clear_highpage + flush_dcache_folio?
> >>>>>>>>>
> >>>>>>>>> If this also improves performance for other existing callers of
> >>>>>>>>> folio_zero_range(), then that's a positive outcome.
> >>>>>>>>
> >>> ...
> >>>
> >>>>>> hi Kefeng,
> >>>>>> what's your point? providing a helper like clear_highfolio() or similar?
> >>>>>
> >>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
> >>>>> than using folio_zero_range() for folio zero(especially for large
> >>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
> >>>>> since it zero the whole folio.
> >>>>
> >>>> we already have a helper like folio_zero_user()?
> >>>> it is not good enough?
> >>>
> >>> Since it is with many cond_resched(), the performance is worst...
> >>
> >> Not exactly? It should have zero cost for a preemptible kernel.
> >> For a non-preemptible kernel, it helps avoid clearing the folio
> >> from occupying the CPU and starving other processes, right?
> >
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> >
> > @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
> > *inode, pgoff_t index,
> >           * it now, lest undo on failure cancel our earlier guarantee.
> >           */
> >
> >          if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> > -               long i, n = folio_nr_pages(folio);
> > -
> > -               for (i = 0; i < n; i++)
> > -                       clear_highpage(folio_page(folio, i));
> > +               folio_zero_user(folio, vmf->address);
> >                  flush_dcache_folio(folio);
> >                  folio_mark_uptodate(folio);
> >          }
> >
> > Do we perform better or worse with the following?
>
> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
> 0), I think the performance is worse, will retest once I can access
> hardware.

Perhaps, since the current code uses clear_hugepage(). Does using
index << PAGE_SHIFT as the addr_hint offer any benefit?

>
>

Kefeng Wang Oct. 21, 2024, 3:33 p.m. UTC | #14

On 2024/10/21 17:17, Barry Song wrote:
> On Mon, Oct 21, 2024 at 9:14 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/21 15:55, Barry Song wrote:
>>> On Mon, Oct 21, 2024 at 8:47 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Mon, Oct 21, 2024 at 7:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/10/21 13:38, Barry Song wrote:
>>>>>> On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/10/21 12:15, Barry Song wrote:
>>>>>>>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/10/18 15:32, Kefeng Wang wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/10/18 13:23, Barry Song wrote:
>>>>>>>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
>>>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>
>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>
>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>
>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>          clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>
>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>
>>>>>>>>>>>> N=1,
>>>>>>>>>>>>              clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>         1      69                   74                 177
>>>>>>>>>>>>         2      57                   62                 168
>>>>>>>>>>>>         3      54                   58                 234
>>>>>>>>>>>>         4      54                   58                 157
>>>>>>>>>>>>         5      56                   62                 148
>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> N=100
>>>>>>>>>>>>              clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>         1    11015                 11309               32833
>>>>>>>>>>>>         2    10385                 11110               49751
>>>>>>>>>>>>         3    10369                 11056               33095
>>>>>>>>>>>>         4    10332                 11017               33106
>>>>>>>>>>>>         5    10483                 11000               49032
>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>
>>>>>>>>>>>> N=512
>>>>>>>>>>>>              clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>         1    55560                 60055              156876
>>>>>>>>>>>>         2    55485                 60024              157132
>>>>>>>>>>>>         3    55474                 60129              156658
>>>>>>>>>>>>         4    55555                 59867              157259
>>>>>>>>>>>>         5    55528                 59932              157108
>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>
>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>
>>>>> ...
>>>>>
>>>>>>>> hi Kefeng,
>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>
>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>> since it zero the whole folio.
>>>>>>
>>>>>> we already have a helper like folio_zero_user()?
>>>>>> it is not good enough?
>>>>>
>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>
>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>> from occupying the CPU and starving other processes, right?
>>>
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>>
>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>> *inode, pgoff_t index,
>>>            * it now, lest undo on failure cancel our earlier guarantee.
>>>            */
>>>
>>>           if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>> -               long i, n = folio_nr_pages(folio);
>>> -
>>> -               for (i = 0; i < n; i++)
>>> -                       clear_highpage(folio_page(folio, i));
>>> +               folio_zero_user(folio, vmf->address);
>>>                   flush_dcache_folio(folio);
>>>                   folio_mark_uptodate(folio);
>>>           }
>>>
>>> Do we perform better or worse with the following?
>>
>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>> 0), I think the performance is worse, will retest once I can access
>> hardware.
> 
> Perhaps, since the current code uses clear_hugepage(). Does using
> index << PAGE_SHIFT as the addr_hint offer any benefit?
> 

when use folio_zero_user(), the performance is vary bad with above
fallocate test(mount huge=always),

       folio_zero_range   clear_highpage         folio_zero_user
real    0m1.214s             0m1.111s              0m3.159s
user    0m0.000s             0m0.000s              0m0.000s
sys     0m1.210s             0m1.109s              0m3.152s

I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.

Barry Song Oct. 21, 2024, 8:32 p.m. UTC | #15

On Tue, Oct 22, 2024 at 4:33 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/10/21 17:17, Barry Song wrote:
> > On Mon, Oct 21, 2024 at 9:14 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2024/10/21 15:55, Barry Song wrote:
> >>> On Mon, Oct 21, 2024 at 8:47 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> On Mon, Oct 21, 2024 at 7:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2024/10/21 13:38, Barry Song wrote:
> >>>>>> On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/10/21 12:15, Barry Song wrote:
> >>>>>>>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 2024/10/18 15:32, Kefeng Wang wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 2024/10/18 13:23, Barry Song wrote:
> >>>>>>>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
> >>>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
> >>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> >>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
> >>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
> >>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
> >>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
> >>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
> >>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
> >>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
> >>>>>>>>>>>>
> >>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> >>>>>>>>>>>> folio_size(folio));
> >>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>>>>>>>>> 0, folio_size(f));
> >>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>>>>>>>>> 0, folio_size(f));
> >>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
> >>>>>>>>>>>> folio_size(folio));
> >>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
> >>>>>>>>>>>>
> >>>>>>>>>>>> No performance test before, but I write a testcase,
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> >>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
> >>>>>>>>>>>>          clear_highpage/folio_zero_range/folio_zero_user
> >>>>>>>>>>>> 3) release N folios
> >>>>>>>>>>>>
> >>>>>>>>>>>> the result(run 5 times) shown below on my machine,
> >>>>>>>>>>>>
> >>>>>>>>>>>> N=1,
> >>>>>>>>>>>>              clear_highpage  folio_zero_range    folio_zero_user
> >>>>>>>>>>>>         1      69                   74                 177
> >>>>>>>>>>>>         2      57                   62                 168
> >>>>>>>>>>>>         3      54                   58                 234
> >>>>>>>>>>>>         4      54                   58                 157
> >>>>>>>>>>>>         5      56                   62                 148
> >>>>>>>>>>>> avg       58                   62.8               176.8
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> N=100
> >>>>>>>>>>>>              clear_highpage  folio_zero_range    folio_zero_user
> >>>>>>>>>>>>         1    11015                 11309               32833
> >>>>>>>>>>>>         2    10385                 11110               49751
> >>>>>>>>>>>>         3    10369                 11056               33095
> >>>>>>>>>>>>         4    10332                 11017               33106
> >>>>>>>>>>>>         5    10483                 11000               49032
> >>>>>>>>>>>> avg     10516.8               11098.4             39563.4
> >>>>>>>>>>>>
> >>>>>>>>>>>> N=512
> >>>>>>>>>>>>              clear_highpage  folio_zero_range   folio_zero_user
> >>>>>>>>>>>>         1    55560                 60055              156876
> >>>>>>>>>>>>         2    55485                 60024              157132
> >>>>>>>>>>>>         3    55474                 60129              156658
> >>>>>>>>>>>>         4    55555                 59867              157259
> >>>>>>>>>>>>         5    55528                 59932              157108
> >>>>>>>>>>>> avg     55520.4               60001.4            157006.6
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> >>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> >>>>>>>>>>>> folio_size(folio))
> >>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
> >>>>>>>>>>>
> >>>>>>>>>>> If this also improves performance for other existing callers of
> >>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
> >>>>>>>>>>
> >>>>> ...
> >>>>>
> >>>>>>>> hi Kefeng,
> >>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
> >>>>>>>
> >>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
> >>>>>>> than using folio_zero_range() for folio zero(especially for large
> >>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
> >>>>>>> since it zero the whole folio.
> >>>>>>
> >>>>>> we already have a helper like folio_zero_user()?
> >>>>>> it is not good enough?
> >>>>>
> >>>>> Since it is with many cond_resched(), the performance is worst...
> >>>>
> >>>> Not exactly? It should have zero cost for a preemptible kernel.
> >>>> For a non-preemptible kernel, it helps avoid clearing the folio
> >>>> from occupying the CPU and starving other processes, right?
> >>>
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>>
> >>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
> >>> *inode, pgoff_t index,
> >>>            * it now, lest undo on failure cancel our earlier guarantee.
> >>>            */
> >>>
> >>>           if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> >>> -               long i, n = folio_nr_pages(folio);
> >>> -
> >>> -               for (i = 0; i < n; i++)
> >>> -                       clear_highpage(folio_page(folio, i));
> >>> +               folio_zero_user(folio, vmf->address);
> >>>                   flush_dcache_folio(folio);
> >>>                   folio_mark_uptodate(folio);
> >>>           }
> >>>
> >>> Do we perform better or worse with the following?
> >>
> >> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
> >> 0), I think the performance is worse, will retest once I can access
> >> hardware.
> >
> > Perhaps, since the current code uses clear_hugepage(). Does using
> > index << PAGE_SHIFT as the addr_hint offer any benefit?
> >
>
> when use folio_zero_user(), the performance is vary bad with above
> fallocate test(mount huge=always),
>
>        folio_zero_range   clear_highpage         folio_zero_user
> real    0m1.214s             0m1.111s              0m3.159s
> user    0m0.000s             0m0.000s              0m0.000s
> sys     0m1.210s             0m1.109s              0m3.152s
>
> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.

Interesting. Does your kernel have preemption disabled or
preemption_debug enabled?

If not, it makes me wonder whether folio_zero_user() in
alloc_anon_folio() is actually improving performance as expected,
compared to the simpler folio_zero() you plan to implement. :-)

Kefeng Wang Oct. 22, 2024, 3:10 p.m. UTC | #16

On 2024/10/22 4:32, Barry Song wrote:
> On Tue, Oct 22, 2024 at 4:33 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/21 17:17, Barry Song wrote:
>>> On Mon, Oct 21, 2024 at 9:14 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/10/21 15:55, Barry Song wrote:
>>>>> On Mon, Oct 21, 2024 at 8:47 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>
>>>>>> On Mon, Oct 21, 2024 at 7:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/10/21 13:38, Barry Song wrote:
>>>>>>>> On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/10/21 12:15, Barry Song wrote:
>>>>>>>>>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2024/10/18 15:32, Kefeng Wang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2024/10/18 13:23, Barry Song wrote:
>>>>>>>>>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
>>>>>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>           clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>               clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>          1      69                   74                 177
>>>>>>>>>>>>>>          2      57                   62                 168
>>>>>>>>>>>>>>          3      54                   58                 234
>>>>>>>>>>>>>>          4      54                   58                 157
>>>>>>>>>>>>>>          5      56                   62                 148
>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>               clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>          1    11015                 11309               32833
>>>>>>>>>>>>>>          2    10385                 11110               49751
>>>>>>>>>>>>>>          3    10369                 11056               33095
>>>>>>>>>>>>>>          4    10332                 11017               33106
>>>>>>>>>>>>>>          5    10483                 11000               49032
>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>               clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>          1    55560                 60055              156876
>>>>>>>>>>>>>>          2    55485                 60024              157132
>>>>>>>>>>>>>>          3    55474                 60129              156658
>>>>>>>>>>>>>>          4    55555                 59867              157259
>>>>>>>>>>>>>>          5    55528                 59932              157108
>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>>>> hi Kefeng,
>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>
>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>> since it zero the whole folio.
>>>>>>>>
>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>> it is not good enough?
>>>>>>>
>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>
>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>> from occupying the CPU and starving other processes, right?
>>>>>
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>>
>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>> *inode, pgoff_t index,
>>>>>             * it now, lest undo on failure cancel our earlier guarantee.
>>>>>             */
>>>>>
>>>>>            if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>> -               long i, n = folio_nr_pages(folio);
>>>>> -
>>>>> -               for (i = 0; i < n; i++)
>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>                    flush_dcache_folio(folio);
>>>>>                    folio_mark_uptodate(folio);
>>>>>            }
>>>>>
>>>>> Do we perform better or worse with the following?
>>>>
>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>> 0), I think the performance is worse, will retest once I can access
>>>> hardware.
>>>
>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>
>>
>> when use folio_zero_user(), the performance is vary bad with above
>> fallocate test(mount huge=always),
>>
>>         folio_zero_range   clear_highpage         folio_zero_user
>> real    0m1.214s             0m1.111s              0m3.159s
>> user    0m0.000s             0m0.000s              0m0.000s
>> sys     0m1.210s             0m1.109s              0m3.152s
>>
>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
> 
> Interesting. Does your kernel have preemption disabled or
> preemption_debug enabled?

ARM64 server, CONFIG_PREEMPT_NONE=y

> 
> If not, it makes me wonder whether folio_zero_user() in
> alloc_anon_folio() is actually improving performance as expected,
> compared to the simpler folio_zero() you plan to implement. :-)

Yes, maybe, the folio_zero_user(was clear_huge_page) is from
47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
other size of large folio, cond_resched is not necessary since we
already have some folio_zero_range() to clear large folio, and no issue
was reported.

Barry Song Oct. 22, 2024, 10:56 p.m. UTC | #17

On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/10/22 4:32, Barry Song wrote:
> > On Tue, Oct 22, 2024 at 4:33 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2024/10/21 17:17, Barry Song wrote:
> >>> On Mon, Oct 21, 2024 at 9:14 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2024/10/21 15:55, Barry Song wrote:
> >>>>> On Mon, Oct 21, 2024 at 8:47 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>
> >>>>>> On Mon, Oct 21, 2024 at 7:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/10/21 13:38, Barry Song wrote:
> >>>>>>>> On Mon, Oct 21, 2024 at 6:16 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 2024/10/21 12:15, Barry Song wrote:
> >>>>>>>>>> On Fri, Oct 18, 2024 at 8:48 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 2024/10/18 15:32, Kefeng Wang wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2024/10/18 13:23, Barry Song wrote:
> >>>>>>>>>>>>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang
> >>>>>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
> >>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> >>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
> >>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
> >>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
> >>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
> >>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
> >>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
> >>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> >>>>>>>>>>>>>> folio_size(folio));
> >>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>>>>>>>>>>> 0, folio_size(f));
> >>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
> >>>>>>>>>>>>>> 0, folio_size(f));
> >>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
> >>>>>>>>>>>>>> folio_size(folio));
> >>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> No performance test before, but I write a testcase,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> >>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
> >>>>>>>>>>>>>>           clear_highpage/folio_zero_range/folio_zero_user
> >>>>>>>>>>>>>> 3) release N folios
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> N=1,
> >>>>>>>>>>>>>>               clear_highpage  folio_zero_range    folio_zero_user
> >>>>>>>>>>>>>>          1      69                   74                 177
> >>>>>>>>>>>>>>          2      57                   62                 168
> >>>>>>>>>>>>>>          3      54                   58                 234
> >>>>>>>>>>>>>>          4      54                   58                 157
> >>>>>>>>>>>>>>          5      56                   62                 148
> >>>>>>>>>>>>>> avg       58                   62.8               176.8
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> N=100
> >>>>>>>>>>>>>>               clear_highpage  folio_zero_range    folio_zero_user
> >>>>>>>>>>>>>>          1    11015                 11309               32833
> >>>>>>>>>>>>>>          2    10385                 11110               49751
> >>>>>>>>>>>>>>          3    10369                 11056               33095
> >>>>>>>>>>>>>>          4    10332                 11017               33106
> >>>>>>>>>>>>>>          5    10483                 11000               49032
> >>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> N=512
> >>>>>>>>>>>>>>               clear_highpage  folio_zero_range   folio_zero_user
> >>>>>>>>>>>>>>          1    55560                 60055              156876
> >>>>>>>>>>>>>>          2    55485                 60024              157132
> >>>>>>>>>>>>>>          3    55474                 60129              156658
> >>>>>>>>>>>>>>          4    55555                 59867              157259
> >>>>>>>>>>>>>>          5    55528                 59932              157108
> >>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> >>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> >>>>>>>>>>>>>> folio_size(folio))
> >>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If this also improves performance for other existing callers of
> >>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
> >>>>>>>>>>>>
> >>>>>>> ...
> >>>>>>>
> >>>>>>>>>> hi Kefeng,
> >>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
> >>>>>>>>>
> >>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
> >>>>>>>>> than using folio_zero_range() for folio zero(especially for large
> >>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
> >>>>>>>>> since it zero the whole folio.
> >>>>>>>>
> >>>>>>>> we already have a helper like folio_zero_user()?
> >>>>>>>> it is not good enough?
> >>>>>>>
> >>>>>>> Since it is with many cond_resched(), the performance is worst...
> >>>>>>
> >>>>>> Not exactly? It should have zero cost for a preemptible kernel.
> >>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
> >>>>>> from occupying the CPU and starving other processes, right?
> >>>>>
> >>>>> --- a/mm/shmem.c
> >>>>> +++ b/mm/shmem.c
> >>>>>
> >>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
> >>>>> *inode, pgoff_t index,
> >>>>>             * it now, lest undo on failure cancel our earlier guarantee.
> >>>>>             */
> >>>>>
> >>>>>            if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> >>>>> -               long i, n = folio_nr_pages(folio);
> >>>>> -
> >>>>> -               for (i = 0; i < n; i++)
> >>>>> -                       clear_highpage(folio_page(folio, i));
> >>>>> +               folio_zero_user(folio, vmf->address);
> >>>>>                    flush_dcache_folio(folio);
> >>>>>                    folio_mark_uptodate(folio);
> >>>>>            }
> >>>>>
> >>>>> Do we perform better or worse with the following?
> >>>>
> >>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
> >>>> 0), I think the performance is worse, will retest once I can access
> >>>> hardware.
> >>>
> >>> Perhaps, since the current code uses clear_hugepage(). Does using
> >>> index << PAGE_SHIFT as the addr_hint offer any benefit?
> >>>
> >>
> >> when use folio_zero_user(), the performance is vary bad with above
> >> fallocate test(mount huge=always),
> >>
> >>         folio_zero_range   clear_highpage         folio_zero_user
> >> real    0m1.214s             0m1.111s              0m3.159s
> >> user    0m0.000s             0m0.000s              0m0.000s
> >> sys     0m1.210s             0m1.109s              0m3.152s
> >>
> >> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
> >
> > Interesting. Does your kernel have preemption disabled or
> > preemption_debug enabled?
>
> ARM64 server, CONFIG_PREEMPT_NONE=y

this explains why the performance is much worse.

>
> >
> > If not, it makes me wonder whether folio_zero_user() in
> > alloc_anon_folio() is actually improving performance as expected,
> > compared to the simpler folio_zero() you plan to implement. :-)
>
> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
> other size of large folio, cond_resched is not necessary since we
> already have some folio_zero_range() to clear large folio, and no issue
> was reported.

probably worth an optimization. calling cond_resched() for each page
seems too aggressive and useless.

diff --git a/mm/memory.c b/mm/memory.c
index 0f614523b9f4..5fc38347d782 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6738,6 +6738,19 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+/*
+ * To prevent process_huge_page() from starving other processes,
+ * we allow other processes a chance with each batch.
+ */
+static inline void batched_cond_resched(int *nr)
+{
+#define BATCHED_PROCESS_NR 64
+	if (*nr++ < BATCHED_PROCESS_NR)
+		return;
+	cond_resched();
+	*nr = 0;
+}
+
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
@@ -6748,7 +6761,7 @@ static inline int process_huge_page(
 	int (*process_subpage)(unsigned long addr, int idx, void *arg),
 	void *arg)
 {
-	int i, n, base, l, ret;
+	int i, n, base, l, ret, processed_nr = 0;
 	unsigned long addr = addr_hint &
 		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
 
@@ -6761,7 +6774,7 @@ static inline int process_huge_page(
 		l = n;
 		/* Process subpages at the end of huge page */
 		for (i = nr_pages - 1; i >= 2 * n; i--) {
-			cond_resched();
+			batched_cond_resched(&processed_nr);
 			ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
 			if (ret)
 				return ret;
@@ -6772,7 +6785,7 @@ static inline int process_huge_page(
 		l = nr_pages - n;
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
-			cond_resched();
+			batched_cond_resched(&processed_nr);
 			ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
 			if (ret)
 				return ret;
@@ -6786,11 +6799,11 @@ static inline int process_huge_page(
 		int left_idx = base + i;
 		int right_idx = base + 2 * l - 1 - i;
 
-		cond_resched();
+		batched_cond_resched(&processed_nr);
 		ret = process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
 		if (ret)
 			return ret;
-		cond_resched();
+		batched_cond_resched(&processed_nr);
 		ret = process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
 		if (ret)
 			return ret;

Kefeng Wang Oct. 24, 2024, 10:10 a.m. UTC | #18

+CC Huang Ying,

On 2024/10/23 6:56, Barry Song wrote:
> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>            clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>                clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>           1      69                   74                 177
>>>>>>>>>>>>>>>>           2      57                   62                 168
>>>>>>>>>>>>>>>>           3      54                   58                 234
>>>>>>>>>>>>>>>>           4      54                   58                 157
>>>>>>>>>>>>>>>>           5      56                   62                 148
>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>                clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>           1    11015                 11309               32833
>>>>>>>>>>>>>>>>           2    10385                 11110               49751
>>>>>>>>>>>>>>>>           3    10369                 11056               33095
>>>>>>>>>>>>>>>>           4    10332                 11017               33106
>>>>>>>>>>>>>>>>           5    10483                 11000               49032
>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>                clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>           1    55560                 60055              156876
>>>>>>>>>>>>>>>>           2    55485                 60024              157132
>>>>>>>>>>>>>>>>           3    55474                 60129              156658
>>>>>>>>>>>>>>>>           4    55555                 59867              157259
>>>>>>>>>>>>>>>>           5    55528                 59932              157108
>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>
>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>
>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>> it is not good enough?
>>>>>>>>>
>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>
>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>
>>>>>>> --- a/mm/shmem.c
>>>>>>> +++ b/mm/shmem.c
>>>>>>>
>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>> *inode, pgoff_t index,
>>>>>>>              * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>              */
>>>>>>>
>>>>>>>             if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>> -
>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>                     flush_dcache_folio(folio);
>>>>>>>                     folio_mark_uptodate(folio);
>>>>>>>             }
>>>>>>>
>>>>>>> Do we perform better or worse with the following?
>>>>>>
>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>> hardware.
>>>>>
>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>
>>>>
>>>> when use folio_zero_user(), the performance is vary bad with above
>>>> fallocate test(mount huge=always),
>>>>
>>>>          folio_zero_range   clear_highpage         folio_zero_user
>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>
>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>
>>> Interesting. Does your kernel have preemption disabled or
>>> preemption_debug enabled?
>>
>> ARM64 server, CONFIG_PREEMPT_NONE=y
> 
> this explains why the performance is much worse.
> 
>>
>>>
>>> If not, it makes me wonder whether folio_zero_user() in
>>> alloc_anon_folio() is actually improving performance as expected,
>>> compared to the simpler folio_zero() you plan to implement. :-)
>>
>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>> other size of large folio, cond_resched is not necessary since we
>> already have some folio_zero_range() to clear large folio, and no issue
>> was reported.
> 
> probably worth an optimization. calling cond_resched() for each page
> seems too aggressive and useless.

After some test, I think the cond_resched() is not the root cause,
no performance gained with batched cond_resched(), even I kill
cond_resched() from process_huge_page, no improvement.

But when I unconditionally use clear_gigantic_page() in
folio_zero_user(patched), there is big improvement with above
fallocate on tmpfs(mount huge=always), also I test some other testcase,


1) case-anon-w-seq-mt: (2M PMD THP)

base:
real    0m2.490s    0m2.254s    0m2.272s
user    1m59.980s   2m23.431s   2m18.739s
sys     1m3.675s    1m15.462s   1m15.030s	

patched:
real    0m2.234s    0m2.225s    0m2.159s
user    2m56.105s   2m57.117s   3m0.489s
sys     0m17.064s   0m17.564s   0m16.150s

Patched kernel win on sys and bad in user, but real is almost same,
maybe a little better than base.

2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)

base:
real    0m5.175s    0m5.117s    0m4.856s
user    5m15.943s   5m7.567s    4m29.273s
sys     2m38.503s   2m21.949s   2m21.252s

patched:
real    0m4.966s    0m4.841s    0m4.561s
user    6m30.123s   6m9.516s    5m49.733s
sys     0m58.503s   0m47.847s   0m46.785s


This case is similar to the case1.

3) fallocate hugetlb 20G (2M PMD HugeTLB)

base:
real    0m3.016s    0m3.019s    0m3.018s
user    0m0.000s    0m0.000s    0m0.000s
sys     0m3.009s    0m3.012s    0m3.010s

patched:

real    0m1.136s    0m1.136s    0m1.136s
user    0m0.000s    0m0.000s    0m0.004s
sys     0m1.133s    0m1.133s    0m1.129s


There is big win on patched kernel, and it is similar to above tmpfs
test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
clear target sub-page last when clearing huge page").

Huang, Ying Oct. 25, 2024, 2:59 a.m. UTC | #19

Hi, Kefeng,

Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> +CC Huang Ying,
>
> On 2024/10/23 6:56, Barry Song wrote:
>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>            clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>                clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>           1      69                   74                 177
>>>>>>>>>>>>>>>>>           2      57                   62                 168
>>>>>>>>>>>>>>>>>           3      54                   58                 234
>>>>>>>>>>>>>>>>>           4      54                   58                 157
>>>>>>>>>>>>>>>>>           5      56                   62                 148
>>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>                clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>           1    11015                 11309               32833
>>>>>>>>>>>>>>>>>           2    10385                 11110               49751
>>>>>>>>>>>>>>>>>           3    10369                 11056               33095
>>>>>>>>>>>>>>>>>           4    10332                 11017               33106
>>>>>>>>>>>>>>>>>           5    10483                 11000               49032
>>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>                clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>           1    55560                 60055              156876
>>>>>>>>>>>>>>>>>           2    55485                 60024              157132
>>>>>>>>>>>>>>>>>           3    55474                 60129              156658
>>>>>>>>>>>>>>>>>           4    55555                 59867              157259
>>>>>>>>>>>>>>>>>           5    55528                 59932              157108
>>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>
>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>> it is not good enough?
>>>>>>>>>>
>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>
>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>
>>>>>>>> --- a/mm/shmem.c
>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>
>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>> *inode, pgoff_t index,
>>>>>>>>              * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>              */
>>>>>>>>
>>>>>>>>             if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>> -
>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>                     flush_dcache_folio(folio);
>>>>>>>>                     folio_mark_uptodate(folio);
>>>>>>>>             }
>>>>>>>>
>>>>>>>> Do we perform better or worse with the following?
>>>>>>>
>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>> hardware.
>>>>>>
>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>
>>>>>
>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>> fallocate test(mount huge=always),
>>>>>
>>>>>          folio_zero_range   clear_highpage         folio_zero_user
>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>
>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>
>>>> Interesting. Does your kernel have preemption disabled or
>>>> preemption_debug enabled?
>>>
>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>> this explains why the performance is much worse.
>> 
>>>
>>>>
>>>> If not, it makes me wonder whether folio_zero_user() in
>>>> alloc_anon_folio() is actually improving performance as expected,
>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>
>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>> other size of large folio, cond_resched is not necessary since we
>>> already have some folio_zero_range() to clear large folio, and no issue
>>> was reported.
>> probably worth an optimization. calling cond_resched() for each page
>> seems too aggressive and useless.
>
> After some test, I think the cond_resched() is not the root cause,
> no performance gained with batched cond_resched(), even I kill
> cond_resched() from process_huge_page, no improvement.
>
> But when I unconditionally use clear_gigantic_page() in
> folio_zero_user(patched), there is big improvement with above
> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>
>
> 1) case-anon-w-seq-mt: (2M PMD THP)
>
> base:
> real    0m2.490s    0m2.254s    0m2.272s
> user    1m59.980s   2m23.431s   2m18.739s
> sys     1m3.675s    1m15.462s   1m15.030s	
>
> patched:
> real    0m2.234s    0m2.225s    0m2.159s
> user    2m56.105s   2m57.117s   3m0.489s
> sys     0m17.064s   0m17.564s   0m16.150s
>
> Patched kernel win on sys and bad in user, but real is almost same,
> maybe a little better than base.

We can find user time difference.  That means the original cache hot
behavior still applies on your system.

However, it appears that the performance to clear page from end to begin
is really bad on your system.

So, I suggest to revise the current implementation to use sequential
clearing as much as possible.

> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>
> base:
> real    0m5.175s    0m5.117s    0m4.856s
> user    5m15.943s   5m7.567s    4m29.273s
> sys     2m38.503s   2m21.949s   2m21.252s
>
> patched:
> real    0m4.966s    0m4.841s    0m4.561s
> user    6m30.123s   6m9.516s    5m49.733s
> sys     0m58.503s   0m47.847s   0m46.785s
>
>
> This case is similar to the case1.
>
> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>
> base:
> real    0m3.016s    0m3.019s    0m3.018s
> user    0m0.000s    0m0.000s    0m0.000s
> sys     0m3.009s    0m3.012s    0m3.010s
>
> patched:
>
> real    0m1.136s    0m1.136s    0m1.136s
> user    0m0.000s    0m0.000s    0m0.004s
> sys     0m1.133s    0m1.133s    0m1.129s
>
>
> There is big win on patched kernel, and it is similar to above tmpfs
> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
> clear target sub-page last when clearing huge page").

--
Best Regards,
Huang, Ying

Kefeng Wang Oct. 25, 2024, 7:42 a.m. UTC | #20

On 2024/10/25 10:59, Huang, Ying wrote:
> Hi, Kefeng,
> 
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> +CC Huang Ying,
>>
>> On 2024/10/23 6:56, Barry Song wrote:
>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>>
>> ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>>             clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>                 clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>            1      69                   74                 177
>>>>>>>>>>>>>>>>>>            2      57                   62                 168
>>>>>>>>>>>>>>>>>>            3      54                   58                 234
>>>>>>>>>>>>>>>>>>            4      54                   58                 157
>>>>>>>>>>>>>>>>>>            5      56                   62                 148
>>>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>                 clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>            1    11015                 11309               32833
>>>>>>>>>>>>>>>>>>            2    10385                 11110               49751
>>>>>>>>>>>>>>>>>>            3    10369                 11056               33095
>>>>>>>>>>>>>>>>>>            4    10332                 11017               33106
>>>>>>>>>>>>>>>>>>            5    10483                 11000               49032
>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>                 clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>            1    55560                 60055              156876
>>>>>>>>>>>>>>>>>>            2    55485                 60024              157132
>>>>>>>>>>>>>>>>>>            3    55474                 60129              156658
>>>>>>>>>>>>>>>>>>            4    55555                 59867              157259
>>>>>>>>>>>>>>>>>>            5    55528                 59932              157108
>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>
>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>
>>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>>
>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>
>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>
>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>               * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>               */
>>>>>>>>>
>>>>>>>>>              if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>> -
>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>                      flush_dcache_folio(folio);
>>>>>>>>>                      folio_mark_uptodate(folio);
>>>>>>>>>              }
>>>>>>>>>
>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>
>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>>> hardware.
>>>>>>>
>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>
>>>>>>
>>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>>> fallocate test(mount huge=always),
>>>>>>
>>>>>>           folio_zero_range   clear_highpage         folio_zero_user
>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>
>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>>
>>>>> Interesting. Does your kernel have preemption disabled or
>>>>> preemption_debug enabled?
>>>>
>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>> this explains why the performance is much worse.
>>>
>>>>
>>>>>
>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>
>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>>> other size of large folio, cond_resched is not necessary since we
>>>> already have some folio_zero_range() to clear large folio, and no issue
>>>> was reported.
>>> probably worth an optimization. calling cond_resched() for each page
>>> seems too aggressive and useless.
>>
>> After some test, I think the cond_resched() is not the root cause,
>> no performance gained with batched cond_resched(), even I kill
>> cond_resched() from process_huge_page, no improvement.
>>
>> But when I unconditionally use clear_gigantic_page() in
>> folio_zero_user(patched), there is big improvement with above
>> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>>
>>
>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>
>> base:
>> real    0m2.490s    0m2.254s    0m2.272s
>> user    1m59.980s   2m23.431s   2m18.739s
>> sys     1m3.675s    1m15.462s   1m15.030s	
>>
>> patched:
>> real    0m2.234s    0m2.225s    0m2.159s
>> user    2m56.105s   2m57.117s   3m0.489s
>> sys     0m17.064s   0m17.564s   0m16.150s
>>
>> Patched kernel win on sys and bad in user, but real is almost same,
>> maybe a little better than base.
> 
> We can find user time difference.  That means the original cache hot
> behavior still applies on your system.
> 
> However, it appears that the performance to clear page from end to begin
> is really bad on your system.
> 
> So, I suggest to revise the current implementation to use sequential
> clearing as much as possible.
> 

I test case-anon-cow-seq-hugetlb for copy_user_large_folio()

base:
real    0m6.259s    0m6.197s    0m6.316s
user    1m31.176s   1m27.195s   1m29.594s
sys     7m44.199s   7m51.490s   8m21.149s

patched(use copy_user_gigantic_page for 2M hugetlb too)
real    0m3.182s    0m3.002s    0m2.963s
user    1m19.456s   1m3.107s    1m6.447s
sys     2m59.222s   3m10.899s   3m1.027s

and sequential copy is better than the current implementation,
so I will use sequential clear and copy.



>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>
>> base:
>> real    0m5.175s    0m5.117s    0m4.856s
>> user    5m15.943s   5m7.567s    4m29.273s
>> sys     2m38.503s   2m21.949s   2m21.252s
>>
>> patched:
>> real    0m4.966s    0m4.841s    0m4.561s
>> user    6m30.123s   6m9.516s    5m49.733s
>> sys     0m58.503s   0m47.847s   0m46.785s
>>
>>
>> This case is similar to the case1.
>>
>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>
>> base:
>> real    0m3.016s    0m3.019s    0m3.018s
>> user    0m0.000s    0m0.000s    0m0.000s
>> sys     0m3.009s    0m3.012s    0m3.010s
>>
>> patched:
>>
>> real    0m1.136s    0m1.136s    0m1.136s
>> user    0m0.000s    0m0.000s    0m0.004s
>> sys     0m1.133s    0m1.133s    0m1.129s
>>
>>
>> There is big win on patched kernel, and it is similar to above tmpfs
>> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
>> clear target sub-page last when clearing huge page").
> 
> --
> Best Regards,
> Huang, Ying
>

Huang, Ying Oct. 25, 2024, 7:47 a.m. UTC | #21

Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/25 10:59, Huang, Ying wrote:
>> Hi, Kefeng,
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>> 
>>> +CC Huang Ying,
>>>
>>> On 2024/10/23 6:56, Barry Song wrote:
>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>
>>>>>
>>> ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>>>             clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>                 clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>            1      69                   74                 177
>>>>>>>>>>>>>>>>>>>            2      57                   62                 168
>>>>>>>>>>>>>>>>>>>            3      54                   58                 234
>>>>>>>>>>>>>>>>>>>            4      54                   58                 157
>>>>>>>>>>>>>>>>>>>            5      56                   62                 148
>>>>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>                 clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>            1    11015                 11309               32833
>>>>>>>>>>>>>>>>>>>            2    10385                 11110               49751
>>>>>>>>>>>>>>>>>>>            3    10369                 11056               33095
>>>>>>>>>>>>>>>>>>>            4    10332                 11017               33106
>>>>>>>>>>>>>>>>>>>            5    10483                 11000               49032
>>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>                 clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>            1    55560                 60055              156876
>>>>>>>>>>>>>>>>>>>            2    55485                 60024              157132
>>>>>>>>>>>>>>>>>>>            3    55474                 60129              156658
>>>>>>>>>>>>>>>>>>>            4    55555                 59867              157259
>>>>>>>>>>>>>>>>>>>            5    55528                 59932              157108
>>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>
>>>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>>>
>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>>
>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>
>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>>               * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>               */
>>>>>>>>>>
>>>>>>>>>>              if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>> -
>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>                      flush_dcache_folio(folio);
>>>>>>>>>>                      folio_mark_uptodate(folio);
>>>>>>>>>>              }
>>>>>>>>>>
>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>
>>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>>>> hardware.
>>>>>>>>
>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>
>>>>>>>
>>>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>>>> fallocate test(mount huge=always),
>>>>>>>
>>>>>>>           folio_zero_range   clear_highpage         folio_zero_user
>>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>>
>>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>>>
>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>> preemption_debug enabled?
>>>>>
>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>> this explains why the performance is much worse.
>>>>
>>>>>
>>>>>>
>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>
>>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>>>> other size of large folio, cond_resched is not necessary since we
>>>>> already have some folio_zero_range() to clear large folio, and no issue
>>>>> was reported.
>>>> probably worth an optimization. calling cond_resched() for each page
>>>> seems too aggressive and useless.
>>>
>>> After some test, I think the cond_resched() is not the root cause,
>>> no performance gained with batched cond_resched(), even I kill
>>> cond_resched() from process_huge_page, no improvement.
>>>
>>> But when I unconditionally use clear_gigantic_page() in
>>> folio_zero_user(patched), there is big improvement with above
>>> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>>>
>>>
>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>
>>> base:
>>> real    0m2.490s    0m2.254s    0m2.272s
>>> user    1m59.980s   2m23.431s   2m18.739s
>>> sys     1m3.675s    1m15.462s   1m15.030s	
>>>
>>> patched:
>>> real    0m2.234s    0m2.225s    0m2.159s
>>> user    2m56.105s   2m57.117s   3m0.489s
>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>
>>> Patched kernel win on sys and bad in user, but real is almost same,
>>> maybe a little better than base.
>> We can find user time difference.  That means the original cache hot
>> behavior still applies on your system.
>> However, it appears that the performance to clear page from end to
>> begin
>> is really bad on your system.
>> So, I suggest to revise the current implementation to use sequential
>> clearing as much as possible.
>> 
>
> I test case-anon-cow-seq-hugetlb for copy_user_large_folio()
>
> base:
> real    0m6.259s    0m6.197s    0m6.316s
> user    1m31.176s   1m27.195s   1m29.594s
> sys     7m44.199s   7m51.490s   8m21.149s
>
> patched(use copy_user_gigantic_page for 2M hugetlb too)
> real    0m3.182s    0m3.002s    0m2.963s
> user    1m19.456s   1m3.107s    1m6.447s
> sys     2m59.222s   3m10.899s   3m1.027s
>
> and sequential copy is better than the current implementation,
> so I will use sequential clear and copy.

Sorry, it appears that you misunderstanding my suggestion.  I suggest to
revise process_huge_page() to use more sequential memory clearing and
copying to improve its performance on your platform.

--
Best Regards,
Huang, Ying

>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>
>>> base:
>>> real    0m5.175s    0m5.117s    0m4.856s
>>> user    5m15.943s   5m7.567s    4m29.273s
>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>
>>> patched:
>>> real    0m4.966s    0m4.841s    0m4.561s
>>> user    6m30.123s   6m9.516s    5m49.733s
>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>
>>>
>>> This case is similar to the case1.
>>>
>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>
>>> base:
>>> real    0m3.016s    0m3.019s    0m3.018s
>>> user    0m0.000s    0m0.000s    0m0.000s
>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>
>>> patched:
>>>
>>> real    0m1.136s    0m1.136s    0m1.136s
>>> user    0m0.000s    0m0.000s    0m0.004s
>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>
>>>
>>> There is big win on patched kernel, and it is similar to above tmpfs
>>> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
>>> clear target sub-page last when clearing huge page").

--
Best Regards,
Huang, Ying

Kefeng Wang Oct. 25, 2024, 10:21 a.m. UTC | #22

On 2024/10/25 15:47, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/10/25 10:59, Huang, Ying wrote:
>>> Hi, Kefeng,
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> +CC Huang Ying,
>>>>
>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>
>>>>>>
>>>> ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>>>>              clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>>                  clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>             1      69                   74                 177
>>>>>>>>>>>>>>>>>>>>             2      57                   62                 168
>>>>>>>>>>>>>>>>>>>>             3      54                   58                 234
>>>>>>>>>>>>>>>>>>>>             4      54                   58                 157
>>>>>>>>>>>>>>>>>>>>             5      56                   62                 148
>>>>>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>                  clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>             1    11015                 11309               32833
>>>>>>>>>>>>>>>>>>>>             2    10385                 11110               49751
>>>>>>>>>>>>>>>>>>>>             3    10369                 11056               33095
>>>>>>>>>>>>>>>>>>>>             4    10332                 11017               33106
>>>>>>>>>>>>>>>>>>>>             5    10483                 11000               49032
>>>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>                  clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>             1    55560                 60055              156876
>>>>>>>>>>>>>>>>>>>>             2    55485                 60024              157132
>>>>>>>>>>>>>>>>>>>>             3    55474                 60129              156658
>>>>>>>>>>>>>>>>>>>>             4    55555                 59867              157259
>>>>>>>>>>>>>>>>>>>>             5    55528                 59932              157108
>>>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>>>>
>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>>>
>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>
>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>>>                * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>>                */
>>>>>>>>>>>
>>>>>>>>>>>               if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>> -
>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>                       flush_dcache_folio(folio);
>>>>>>>>>>>                       folio_mark_uptodate(folio);
>>>>>>>>>>>               }
>>>>>>>>>>>
>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>
>>>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>>>>> hardware.
>>>>>>>>>
>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>
>>>>>>>>
>>>>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>>>>> fallocate test(mount huge=always),
>>>>>>>>
>>>>>>>>            folio_zero_range   clear_highpage         folio_zero_user
>>>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>>>
>>>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>>>>
>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>> preemption_debug enabled?
>>>>>>
>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>> this explains why the performance is much worse.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>
>>>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>>>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>>>>> other size of large folio, cond_resched is not necessary since we
>>>>>> already have some folio_zero_range() to clear large folio, and no issue
>>>>>> was reported.
>>>>> probably worth an optimization. calling cond_resched() for each page
>>>>> seems too aggressive and useless.
>>>>
>>>> After some test, I think the cond_resched() is not the root cause,
>>>> no performance gained with batched cond_resched(), even I kill
>>>> cond_resched() from process_huge_page, no improvement.
>>>>
>>>> But when I unconditionally use clear_gigantic_page() in
>>>> folio_zero_user(patched), there is big improvement with above
>>>> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>>>>
>>>>
>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>
>>>> base:
>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>> sys     1m3.675s    1m15.462s   1m15.030s	
>>>>
>>>> patched:
>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>
>>>> Patched kernel win on sys and bad in user, but real is almost same,
>>>> maybe a little better than base.
>>> We can find user time difference.  That means the original cache hot
>>> behavior still applies on your system.
>>> However, it appears that the performance to clear page from end to
>>> begin
>>> is really bad on your system.
>>> So, I suggest to revise the current implementation to use sequential
>>> clearing as much as possible.
>>>
>>
>> I test case-anon-cow-seq-hugetlb for copy_user_large_folio()
>>
>> base:
>> real    0m6.259s    0m6.197s    0m6.316s
>> user    1m31.176s   1m27.195s   1m29.594s
>> sys     7m44.199s   7m51.490s   8m21.149s
>>
>> patched(use copy_user_gigantic_page for 2M hugetlb too)
>> real    0m3.182s    0m3.002s    0m2.963s
>> user    1m19.456s   1m3.107s    1m6.447s
>> sys     2m59.222s   3m10.899s   3m1.027s
>>
>> and sequential copy is better than the current implementation,
>> so I will use sequential clear and copy.
> 
> Sorry, it appears that you misunderstanding my suggestion.  I suggest to
> revise process_huge_page() to use more sequential memory clearing and
> copying to improve its performance on your platform.
> 
> --
> Best Regards,
> Huang, Ying
> 
>>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>>
>>>> base:
>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>
>>>> patched:
>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>
>>>>
>>>> This case is similar to the case1.
>>>>
>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>
>>>> base:
>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>
>>>> patched:
>>>>
>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>
>>>>
>>>> There is big win on patched kernel, and it is similar to above tmpfs
>>>> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
>>>> clear target sub-page last when clearing huge page").

I tried the following changes,
diff --git a/mm/memory.c b/mm/memory.c
index 66cf855dee3f..e5cc75adfa10 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6777,7 +6777,7 @@ static inline int process_huge_page(
                 base = 0;
                 l = n;
                 /* Process subpages at the end of huge page */
-               for (i = nr_pages - 1; i >= 2 * n; i--) {
+               for (i = 2 * n; i < nr_pages; i++) {
                         cond_resched();
                         ret = process_subpage(addr + i * PAGE_SIZE, i, 
arg);
                         if (ret)

Since n = 0, so the copying is from start to end now, but not
improvement for case-anon-cow-seq-hugetlb,

and if use copy_user_gigantic_pager, the time reduced from 6s to 3s

diff --git a/mm/memory.c b/mm/memory.c
index fe21bd3beff5..2c6532d21d84 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst, 
struct folio *src,
                 .vma = vma,
         };

-       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-               return copy_user_gigantic_page(dst, src, addr_hint, vma, 
nr_pages);
-
-       return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
+       return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
  }






> 
> --
> Best Regards,
> Huang, Ying
>

Huang, Ying Oct. 25, 2024, 12:21 p.m. UTC | #23

Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/25 15:47, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>> 
>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>> Hi, Kefeng,
>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>
>>>>> +CC Huang Ying,
>>>>>
>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>>
>>>>> ...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>>>>>              clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>>>                  clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>             1      69                   74                 177
>>>>>>>>>>>>>>>>>>>>>             2      57                   62                 168
>>>>>>>>>>>>>>>>>>>>>             3      54                   58                 234
>>>>>>>>>>>>>>>>>>>>>             4      54                   58                 157
>>>>>>>>>>>>>>>>>>>>>             5      56                   62                 148
>>>>>>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>                  clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>             1    11015                 11309               32833
>>>>>>>>>>>>>>>>>>>>>             2    10385                 11110               49751
>>>>>>>>>>>>>>>>>>>>>             3    10369                 11056               33095
>>>>>>>>>>>>>>>>>>>>>             4    10332                 11017               33106
>>>>>>>>>>>>>>>>>>>>>             5    10483                 11000               49032
>>>>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>                  clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>             1    55560                 60055              156876
>>>>>>>>>>>>>>>>>>>>>             2    55485                 60024              157132
>>>>>>>>>>>>>>>>>>>>>             3    55474                 60129              156658
>>>>>>>>>>>>>>>>>>>>>             4    55555                 59867              157259
>>>>>>>>>>>>>>>>>>>>>             5    55528                 59932              157108
>>>>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>>>>
>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>>>>                * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>>>                */
>>>>>>>>>>>>
>>>>>>>>>>>>               if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>>> -
>>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>                       flush_dcache_folio(folio);
>>>>>>>>>>>>                       folio_mark_uptodate(folio);
>>>>>>>>>>>>               }
>>>>>>>>>>>>
>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>
>>>>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>>>>>> hardware.
>>>>>>>>>>
>>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>>>>>> fallocate test(mount huge=always),
>>>>>>>>>
>>>>>>>>>            folio_zero_range   clear_highpage         folio_zero_user
>>>>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>>>>
>>>>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>>>>>
>>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>>> preemption_debug enabled?
>>>>>>>
>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>> this explains why the performance is much worse.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>
>>>>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>>>>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>>>>>> other size of large folio, cond_resched is not necessary since we
>>>>>>> already have some folio_zero_range() to clear large folio, and no issue
>>>>>>> was reported.
>>>>>> probably worth an optimization. calling cond_resched() for each page
>>>>>> seems too aggressive and useless.
>>>>>
>>>>> After some test, I think the cond_resched() is not the root cause,
>>>>> no performance gained with batched cond_resched(), even I kill
>>>>> cond_resched() from process_huge_page, no improvement.
>>>>>
>>>>> But when I unconditionally use clear_gigantic_page() in
>>>>> folio_zero_user(patched), there is big improvement with above
>>>>> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>>>>>
>>>>>
>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>
>>>>> base:
>>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>>> sys     1m3.675s    1m15.462s   1m15.030s	
>>>>>
>>>>> patched:
>>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>>
>>>>> Patched kernel win on sys and bad in user, but real is almost same,
>>>>> maybe a little better than base.
>>>> We can find user time difference.  That means the original cache hot
>>>> behavior still applies on your system.
>>>> However, it appears that the performance to clear page from end to
>>>> begin
>>>> is really bad on your system.
>>>> So, I suggest to revise the current implementation to use sequential
>>>> clearing as much as possible.
>>>>
>>>
>>> I test case-anon-cow-seq-hugetlb for copy_user_large_folio()
>>>
>>> base:
>>> real    0m6.259s    0m6.197s    0m6.316s
>>> user    1m31.176s   1m27.195s   1m29.594s
>>> sys     7m44.199s   7m51.490s   8m21.149s
>>>
>>> patched(use copy_user_gigantic_page for 2M hugetlb too)
>>> real    0m3.182s    0m3.002s    0m2.963s
>>> user    1m19.456s   1m3.107s    1m6.447s
>>> sys     2m59.222s   3m10.899s   3m1.027s
>>>
>>> and sequential copy is better than the current implementation,
>>> so I will use sequential clear and copy.
>> Sorry, it appears that you misunderstanding my suggestion.  I
>> suggest to
>> revise process_huge_page() to use more sequential memory clearing and
>> copying to improve its performance on your platform.
>> --
>> Best Regards,
>> Huang, Ying
>> 
>>>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>>>
>>>>> base:
>>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>>
>>>>> patched:
>>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>>
>>>>>
>>>>> This case is similar to the case1.
>>>>>
>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>
>>>>> base:
>>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>>
>>>>> patched:
>>>>>
>>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>>
>>>>>
>>>>> There is big win on patched kernel, and it is similar to above tmpfs
>>>>> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
>>>>> clear target sub-page last when clearing huge page").
>
> I tried the following changes,
> diff --git a/mm/memory.c b/mm/memory.c
> index 66cf855dee3f..e5cc75adfa10 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>                 base = 0;
>                 l = n;
>                 /* Process subpages at the end of huge page */
> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
> +               for (i = 2 * n; i < nr_pages; i++) {
>                         cond_resched();
>                         ret = process_subpage(addr + i * PAGE_SIZE, i,
>                         arg);
>                         if (ret)
>
> Since n = 0, so the copying is from start to end now, but not
> improvement for case-anon-cow-seq-hugetlb,
>
> and if use copy_user_gigantic_pager, the time reduced from 6s to 3s
>
> diff --git a/mm/memory.c b/mm/memory.c
> index fe21bd3beff5..2c6532d21d84 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst,
> struct folio *src,
>                 .vma = vma,
>         };
>
> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
> -               return copy_user_gigantic_page(dst, src, addr_hint,
>                 vma, nr_pages);
> -
> -       return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
> +       return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>  }

It appears that we have code generation issue here.  Can you check it?
Whether code is inlined in the same way?

Maybe we can start with

modified   mm/memory.c
@@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
  * operation.  The target subpage will be processed last to keep its
  * cache lines hot.
  */
-static inline int process_huge_page(
+static __always_inline int process_huge_page(
 	unsigned long addr_hint, unsigned int nr_pages,
 	int (*process_subpage)(unsigned long addr, int idx, void *arg),
 	void *arg)

--
Best Regards,
Huang, Ying

Kefeng Wang Oct. 25, 2024, 1:35 p.m. UTC | #24

On 2024/10/25 20:21, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/10/25 15:47, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>>> Hi, Kefeng,
>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>
>>>>>> +CC Huang Ying,
>>>>>>
>>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>>>>>>               clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>>>>                   clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>              1      69                   74                 177
>>>>>>>>>>>>>>>>>>>>>>              2      57                   62                 168
>>>>>>>>>>>>>>>>>>>>>>              3      54                   58                 234
>>>>>>>>>>>>>>>>>>>>>>              4      54                   58                 157
>>>>>>>>>>>>>>>>>>>>>>              5      56                   62                 148
>>>>>>>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>>                   clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>              1    11015                 11309               32833
>>>>>>>>>>>>>>>>>>>>>>              2    10385                 11110               49751
>>>>>>>>>>>>>>>>>>>>>>              3    10369                 11056               33095
>>>>>>>>>>>>>>>>>>>>>>              4    10332                 11017               33106
>>>>>>>>>>>>>>>>>>>>>>              5    10483                 11000               49032
>>>>>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>>                   clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>              1    55560                 60055              156876
>>>>>>>>>>>>>>>>>>>>>>              2    55485                 60024              157132
>>>>>>>>>>>>>>>>>>>>>>              3    55474                 60129              156658
>>>>>>>>>>>>>>>>>>>>>>              4    55555                 59867              157259
>>>>>>>>>>>>>>>>>>>>>>              5    55528                 59932              157108
>>>>>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>>>>>                 * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>>>>                 */
>>>>>>>>>>>>>
>>>>>>>>>>>>>                if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>>>> -
>>>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>>                        flush_dcache_folio(folio);
>>>>>>>>>>>>>                        folio_mark_uptodate(folio);
>>>>>>>>>>>>>                }
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>>
>>>>>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>>>>>>> hardware.
>>>>>>>>>>>
>>>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>>>>>>> fallocate test(mount huge=always),
>>>>>>>>>>
>>>>>>>>>>             folio_zero_range   clear_highpage         folio_zero_user
>>>>>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>>>>>
>>>>>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>>>>>>
>>>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>>>> preemption_debug enabled?
>>>>>>>>
>>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>>> this explains why the performance is much worse.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>>
>>>>>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>>>>>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>>>>>>> other size of large folio, cond_resched is not necessary since we
>>>>>>>> already have some folio_zero_range() to clear large folio, and no issue
>>>>>>>> was reported.
>>>>>>> probably worth an optimization. calling cond_resched() for each page
>>>>>>> seems too aggressive and useless.
>>>>>>
>>>>>> After some test, I think the cond_resched() is not the root cause,
>>>>>> no performance gained with batched cond_resched(), even I kill
>>>>>> cond_resched() from process_huge_page, no improvement.
>>>>>>
>>>>>> But when I unconditionally use clear_gigantic_page() in
>>>>>> folio_zero_user(patched), there is big improvement with above
>>>>>> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>>>>>>
>>>>>>
>>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>>
>>>>>> base:
>>>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>>>> sys     1m3.675s    1m15.462s   1m15.030s	
>>>>>>
>>>>>> patched:
>>>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>>>
>>>>>> Patched kernel win on sys and bad in user, but real is almost same,
>>>>>> maybe a little better than base.
>>>>> We can find user time difference.  That means the original cache hot
>>>>> behavior still applies on your system.
>>>>> However, it appears that the performance to clear page from end to
>>>>> begin
>>>>> is really bad on your system.
>>>>> So, I suggest to revise the current implementation to use sequential
>>>>> clearing as much as possible.
>>>>>
>>>>
>>>> I test case-anon-cow-seq-hugetlb for copy_user_large_folio()
>>>>
>>>> base:
>>>> real    0m6.259s    0m6.197s    0m6.316s
>>>> user    1m31.176s   1m27.195s   1m29.594s
>>>> sys     7m44.199s   7m51.490s   8m21.149s
>>>>
>>>> patched(use copy_user_gigantic_page for 2M hugetlb too)
>>>> real    0m3.182s    0m3.002s    0m2.963s
>>>> user    1m19.456s   1m3.107s    1m6.447s
>>>> sys     2m59.222s   3m10.899s   3m1.027s
>>>>
>>>> and sequential copy is better than the current implementation,
>>>> so I will use sequential clear and copy.
>>> Sorry, it appears that you misunderstanding my suggestion.  I
>>> suggest to
>>> revise process_huge_page() to use more sequential memory clearing and
>>> copying to improve its performance on your platform.
>>> --
>>> Best Regards,
>>> Huang, Ying
>>>
>>>>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>>>>
>>>>>> base:
>>>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>>>
>>>>>> patched:
>>>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>>>
>>>>>>
>>>>>> This case is similar to the case1.
>>>>>>
>>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>>
>>>>>> base:
>>>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>>>
>>>>>> patched:
>>>>>>
>>>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>>>
>>>>>>
>>>>>> There is big win on patched kernel, and it is similar to above tmpfs
>>>>>> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
>>>>>> clear target sub-page last when clearing huge page").
>>
>> I tried the following changes,
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 66cf855dee3f..e5cc75adfa10 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>>                  base = 0;
>>                  l = n;
>>                  /* Process subpages at the end of huge page */
>> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
>> +               for (i = 2 * n; i < nr_pages; i++) {
>>                          cond_resched();
>>                          ret = process_subpage(addr + i * PAGE_SIZE, i,
>>                          arg);
>>                          if (ret)
>>
>> Since n = 0, so the copying is from start to end now, but not
>> improvement for case-anon-cow-seq-hugetlb,
>>
>> and if use copy_user_gigantic_pager, the time reduced from 6s to 3s
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fe21bd3beff5..2c6532d21d84 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst,
>> struct folio *src,
>>                  .vma = vma,
>>          };
>>
>> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>> -               return copy_user_gigantic_page(dst, src, addr_hint,
>>                  vma, nr_pages);
>> -
>> -       return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
>> +       return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>>   }
> 
> It appears that we have code generation issue here.  Can you check it?
> Whether code is inlined in the same way?
> 

No different, and I checked the asm, both process_huge_page and 
copy_user_gigantic_page are inlined, it is strange...

> Maybe we can start with
> 
> modified   mm/memory.c
> @@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
>    * operation.  The target subpage will be processed last to keep its
>    * cache lines hot.
>    */
> -static inline int process_huge_page(
> +static __always_inline int process_huge_page(
>   	unsigned long addr_hint, unsigned int nr_pages,
>   	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>   	void *arg)
> 
> --
> Best Regards,
> Huang, Ying
>

Huang, Ying Oct. 28, 2024, 2:39 a.m. UTC | #25

Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/25 20:21, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>> 
>>> On 2024/10/25 15:47, Huang, Ying wrote:
>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>
>>>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>>>> Hi, Kefeng,
>>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>>
>>>>>>> +CC Huang Ying,
>>>>>>>
>>>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>>>>>>>               clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>>>>>                   clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>              1      69                   74                 177
>>>>>>>>>>>>>>>>>>>>>>>              2      57                   62                 168
>>>>>>>>>>>>>>>>>>>>>>>              3      54                   58                 234
>>>>>>>>>>>>>>>>>>>>>>>              4      54                   58                 157
>>>>>>>>>>>>>>>>>>>>>>>              5      56                   62                 148
>>>>>>>>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>>>                   clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>              1    11015                 11309               32833
>>>>>>>>>>>>>>>>>>>>>>>              2    10385                 11110               49751
>>>>>>>>>>>>>>>>>>>>>>>              3    10369                 11056               33095
>>>>>>>>>>>>>>>>>>>>>>>              4    10332                 11017               33106
>>>>>>>>>>>>>>>>>>>>>>>              5    10483                 11000               49032
>>>>>>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>>>                   clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>              1    55560                 60055              156876
>>>>>>>>>>>>>>>>>>>>>>>              2    55485                 60024              157132
>>>>>>>>>>>>>>>>>>>>>>>              3    55474                 60129              156658
>>>>>>>>>>>>>>>>>>>>>>>              4    55555                 59867              157259
>>>>>>>>>>>>>>>>>>>>>>>              5    55528                 59932              157108
>>>>>>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>>>>>>                 * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>>>>>                 */
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>>>                        flush_dcache_folio(folio);
>>>>>>>>>>>>>>                        folio_mark_uptodate(folio);
>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>>>>>>>> fallocate test(mount huge=always),
>>>>>>>>>>>
>>>>>>>>>>>             folio_zero_range   clear_highpage         folio_zero_user
>>>>>>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>>>>>>
>>>>>>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>>>>>>>
>>>>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>>>>> preemption_debug enabled?
>>>>>>>>>
>>>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>>>> this explains why the performance is much worse.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>>>
>>>>>>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>>>>>>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>>>>>>>> other size of large folio, cond_resched is not necessary since we
>>>>>>>>> already have some folio_zero_range() to clear large folio, and no issue
>>>>>>>>> was reported.
>>>>>>>> probably worth an optimization. calling cond_resched() for each page
>>>>>>>> seems too aggressive and useless.
>>>>>>>
>>>>>>> After some test, I think the cond_resched() is not the root cause,
>>>>>>> no performance gained with batched cond_resched(), even I kill
>>>>>>> cond_resched() from process_huge_page, no improvement.
>>>>>>>
>>>>>>> But when I unconditionally use clear_gigantic_page() in
>>>>>>> folio_zero_user(patched), there is big improvement with above
>>>>>>> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>>>>>>>
>>>>>>>
>>>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>>>
>>>>>>> base:
>>>>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>>>>> sys     1m3.675s    1m15.462s   1m15.030s	
>>>>>>>
>>>>>>> patched:
>>>>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>>>>
>>>>>>> Patched kernel win on sys and bad in user, but real is almost same,
>>>>>>> maybe a little better than base.
>>>>>> We can find user time difference.  That means the original cache hot
>>>>>> behavior still applies on your system.
>>>>>> However, it appears that the performance to clear page from end to
>>>>>> begin
>>>>>> is really bad on your system.
>>>>>> So, I suggest to revise the current implementation to use sequential
>>>>>> clearing as much as possible.
>>>>>>
>>>>>
>>>>> I test case-anon-cow-seq-hugetlb for copy_user_large_folio()
>>>>>
>>>>> base:
>>>>> real    0m6.259s    0m6.197s    0m6.316s
>>>>> user    1m31.176s   1m27.195s   1m29.594s
>>>>> sys     7m44.199s   7m51.490s   8m21.149s
>>>>>
>>>>> patched(use copy_user_gigantic_page for 2M hugetlb too)
>>>>> real    0m3.182s    0m3.002s    0m2.963s
>>>>> user    1m19.456s   1m3.107s    1m6.447s
>>>>> sys     2m59.222s   3m10.899s   3m1.027s
>>>>>
>>>>> and sequential copy is better than the current implementation,
>>>>> so I will use sequential clear and copy.
>>>> Sorry, it appears that you misunderstanding my suggestion.  I
>>>> suggest to
>>>> revise process_huge_page() to use more sequential memory clearing and
>>>> copying to improve its performance on your platform.
>>>> --
>>>> Best Regards,
>>>> Huang, Ying
>>>>
>>>>>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>>>>>
>>>>>>> base:
>>>>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>>>>
>>>>>>> patched:
>>>>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>>>>
>>>>>>>
>>>>>>> This case is similar to the case1.
>>>>>>>
>>>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>>>
>>>>>>> base:
>>>>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>>>>
>>>>>>> patched:
>>>>>>>
>>>>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>>>>
>>>>>>>
>>>>>>> There is big win on patched kernel, and it is similar to above tmpfs
>>>>>>> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
>>>>>>> clear target sub-page last when clearing huge page").
>>>
>>> I tried the following changes,
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 66cf855dee3f..e5cc75adfa10 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>>>                  base = 0;
>>>                  l = n;
>>>                  /* Process subpages at the end of huge page */
>>> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
>>> +               for (i = 2 * n; i < nr_pages; i++) {
>>>                          cond_resched();
>>>                          ret = process_subpage(addr + i * PAGE_SIZE, i,
>>>                          arg);
>>>                          if (ret)
>>>
>>> Since n = 0, so the copying is from start to end now, but not
>>> improvement for case-anon-cow-seq-hugetlb,
>>>
>>> and if use copy_user_gigantic_pager, the time reduced from 6s to 3s
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index fe21bd3beff5..2c6532d21d84 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst,
>>> struct folio *src,
>>>                  .vma = vma,
>>>          };
>>>
>>> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>> -               return copy_user_gigantic_page(dst, src, addr_hint,
>>>                  vma, nr_pages);
>>> -
>>> -       return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
>>> +       return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>>>   }
>> It appears that we have code generation issue here.  Can you check
>> it?
>> Whether code is inlined in the same way?
>> 
>
> No different, and I checked the asm, both process_huge_page and
> copy_user_gigantic_page are inlined, it is strange...

It's not inlined in my configuration.  And __always_inline below changes
it for me.

If it's already inlined and the code is actually almost same, why
there's difference?  Is it possible for you to do some profile or
further analysis?

>> Maybe we can start with
>> modified   mm/memory.c
>> @@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
>>    * operation.  The target subpage will be processed last to keep its
>>    * cache lines hot.
>>    */
>> -static inline int process_huge_page(
>> +static __always_inline int process_huge_page(
>>   	unsigned long addr_hint, unsigned int nr_pages,
>>   	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>   	void *arg)

--
Best Regards,
Huang, Ying

Kefeng Wang Oct. 28, 2024, 6:37 a.m. UTC | #26

On 2024/10/28 10:39, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/10/25 20:21, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/25 15:47, Huang, Ying wrote:
>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>
>>>>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>>>>> Hi, Kefeng,
>>>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>>>
>>>>>>>> +CC Huang Ying,
>>>>>>>>
>>>>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>>>>>>>>                clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>               1      69                   74                 177
>>>>>>>>>>>>>>>>>>>>>>>>               2      57                   62                 168
>>>>>>>>>>>>>>>>>>>>>>>>               3      54                   58                 234
>>>>>>>>>>>>>>>>>>>>>>>>               4      54                   58                 157
>>>>>>>>>>>>>>>>>>>>>>>>               5      56                   62                 148
>>>>>>>>>>>>>>>>>>>>>>>> avg       58                   62.8               176.8
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>               1    11015                 11309               32833
>>>>>>>>>>>>>>>>>>>>>>>>               2    10385                 11110               49751
>>>>>>>>>>>>>>>>>>>>>>>>               3    10369                 11056               33095
>>>>>>>>>>>>>>>>>>>>>>>>               4    10332                 11017               33106
>>>>>>>>>>>>>>>>>>>>>>>>               5    10483                 11000               49032
>>>>>>>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4             39563.4
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>               1    55560                 60055              156876
>>>>>>>>>>>>>>>>>>>>>>>>               2    55485                 60024              157132
>>>>>>>>>>>>>>>>>>>>>>>>               3    55474                 60129              156658
>>>>>>>>>>>>>>>>>>>>>>>>               4    55555                 59867              157259
>>>>>>>>>>>>>>>>>>>>>>>>               5    55528                 59932              157108
>>>>>>>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4            157006.6
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>>>>>>>                  * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>>>>>>                  */
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                 if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>>>>                         flush_dcache_folio(folio);
>>>>>>>>>>>>>>>                         folio_mark_uptodate(folio);
>>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>>>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>>>>>>>>> fallocate test(mount huge=always),
>>>>>>>>>>>>
>>>>>>>>>>>>              folio_zero_range   clear_highpage         folio_zero_user
>>>>>>>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>>>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>>>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>>>>>>>
>>>>>>>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>>>>>>>>
>>>>>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>>>>>> preemption_debug enabled?
>>>>>>>>>>
>>>>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>>>>> this explains why the performance is much worse.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>>>>
>>>>>>>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>>>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>>>>>>>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>>>>>>>>> other size of large folio, cond_resched is not necessary since we
>>>>>>>>>> already have some folio_zero_range() to clear large folio, and no issue
>>>>>>>>>> was reported.
>>>>>>>>> probably worth an optimization. calling cond_resched() for each page
>>>>>>>>> seems too aggressive and useless.
>>>>>>>>
>>>>>>>> After some test, I think the cond_resched() is not the root cause,
>>>>>>>> no performance gained with batched cond_resched(), even I kill
>>>>>>>> cond_resched() from process_huge_page, no improvement.
>>>>>>>>
>>>>>>>> But when I unconditionally use clear_gigantic_page() in
>>>>>>>> folio_zero_user(patched), there is big improvement with above
>>>>>>>> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>>>>
>>>>>>>> base:
>>>>>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>>>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>>>>>> sys     1m3.675s    1m15.462s   1m15.030s	
>>>>>>>>
>>>>>>>> patched:
>>>>>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>>>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>>>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>>>>>
>>>>>>>> Patched kernel win on sys and bad in user, but real is almost same,
>>>>>>>> maybe a little better than base.
>>>>>>> We can find user time difference.  That means the original cache hot
>>>>>>> behavior still applies on your system.
>>>>>>> However, it appears that the performance to clear page from end to
>>>>>>> begin
>>>>>>> is really bad on your system.
>>>>>>> So, I suggest to revise the current implementation to use sequential
>>>>>>> clearing as much as possible.
>>>>>>>
>>>>>>
>>>>>> I test case-anon-cow-seq-hugetlb for copy_user_large_folio()
>>>>>>
>>>>>> base:
>>>>>> real    0m6.259s    0m6.197s    0m6.316s
>>>>>> user    1m31.176s   1m27.195s   1m29.594s
>>>>>> sys     7m44.199s   7m51.490s   8m21.149s
>>>>>>
>>>>>> patched(use copy_user_gigantic_page for 2M hugetlb too)
>>>>>> real    0m3.182s    0m3.002s    0m2.963s
>>>>>> user    1m19.456s   1m3.107s    1m6.447s
>>>>>> sys     2m59.222s   3m10.899s   3m1.027s
>>>>>>
>>>>>> and sequential copy is better than the current implementation,
>>>>>> so I will use sequential clear and copy.
>>>>> Sorry, it appears that you misunderstanding my suggestion.  I
>>>>> suggest to
>>>>> revise process_huge_page() to use more sequential memory clearing and
>>>>> copying to improve its performance on your platform.
>>>>> --
>>>>> Best Regards,
>>>>> Huang, Ying
>>>>>
>>>>>>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>>>>>>
>>>>>>>> base:
>>>>>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>>>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>>>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>>>>>
>>>>>>>> patched:
>>>>>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>>>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>>>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>>>>>
>>>>>>>>
>>>>>>>> This case is similar to the case1.
>>>>>>>>
>>>>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>>>>
>>>>>>>> base:
>>>>>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>>>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>>>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>>>>>
>>>>>>>> patched:
>>>>>>>>
>>>>>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>>>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>>>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>>>>>
>>>>>>>>
>>>>>>>> There is big win on patched kernel, and it is similar to above tmpfs
>>>>>>>> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
>>>>>>>> clear target sub-page last when clearing huge page").
>>>>
>>>> I tried the following changes,
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 66cf855dee3f..e5cc75adfa10 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>>>>                   base = 0;
>>>>                   l = n;
>>>>                   /* Process subpages at the end of huge page */
>>>> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
>>>> +               for (i = 2 * n; i < nr_pages; i++) {
>>>>                           cond_resched();
>>>>                           ret = process_subpage(addr + i * PAGE_SIZE, i,
>>>>                           arg);
>>>>                           if (ret)
>>>>
>>>> Since n = 0, so the copying is from start to end now, but not
>>>> improvement for case-anon-cow-seq-hugetlb,
>>>>
>>>> and if use copy_user_gigantic_pager, the time reduced from 6s to 3s
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index fe21bd3beff5..2c6532d21d84 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst,
>>>> struct folio *src,
>>>>                   .vma = vma,
>>>>           };
>>>>
>>>> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>>> -               return copy_user_gigantic_page(dst, src, addr_hint,
>>>>                   vma, nr_pages);
>>>> -
>>>> -       return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
>>>> +       return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>>>>    }
>>> It appears that we have code generation issue here.  Can you check
>>> it?
>>> Whether code is inlined in the same way?
>>>
>>
>> No different, and I checked the asm, both process_huge_page and
>> copy_user_gigantic_page are inlined, it is strange...
> 
> It's not inlined in my configuration.  And __always_inline below changes
> it for me.
> 
> If it's already inlined and the code is actually almost same, why
> there's difference?  Is it possible for you to do some profile or
> further analysis?

Yes, will continue to debug this.

> 
>>> Maybe we can start with
>>> modified   mm/memory.c
>>> @@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
>>>     * operation.  The target subpage will be processed last to keep its
>>>     * cache lines hot.
>>>     */
>>> -static inline int process_huge_page(
>>> +static __always_inline int process_huge_page(
>>>    	unsigned long addr_hint, unsigned int nr_pages,
>>>    	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>    	void *arg)
> 
> --
> Best Regards,
> Huang, Ying
>

Kefeng Wang Oct. 28, 2024, 11:41 a.m. UTC | #27

On 2024/10/28 14:37, Kefeng Wang wrote:
> 
> 
> On 2024/10/28 10:39, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>
>>> On 2024/10/25 20:21, Huang, Ying wrote:
>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>
>>>>> On 2024/10/25 15:47, Huang, Ying wrote:
>>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>>
>>>>>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>>>>>> Hi, Kefeng,
>>>>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>>>>
>>>>>>>>> +CC Huang Ying,
>>>>>>>>>
>>>>>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang 
>>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, 
>>>>>>>>>>>>>>>>>>>>>>>>>> Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression 
>>>>>>>>>>>>>>>>>>>>>>>>>> introduced by this?
>>>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways 
>>>>>>>>>>>>>>>>>>>>>>>>>> that we can't optimise for
>>>>>>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the 
>>>>>>>>>>>>>>>>>>>>>>>>>> folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than 
>>>>>>>>>>>>>>>>>>>>>>>>>> clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be 
>>>>>>>>>>>>>>>>>>>>>>>>> better than memset, I change
>>>>>>>>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), 
>>>>>>>>>>>>>>>>>>>>>>>>> which already convert to
>>>>>>>>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), 
>>>>>>>>>>>>>>>>>>>>>>>>> also I grep
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use 
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range().
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:           
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:                   
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0, 
>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0, 
>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:             folio_zero_range(folio, 
>>>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done 
>>>>>>>>>>>>>>>>>>>>>>>>>> with this patch?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> No performance test before, but I write a 
>>>>>>>>>>>>>>>>>>>>>>>>> testcase,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios 
>>>>>>>>>>>>>>>>>>>>>>>>> (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N 
>>>>>>>>>>>>>>>>>>>>>>>>> folios
>>>>>>>>>>>>>>>>>>>>>>>>>                clear_highpage/folio_zero_range/ 
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage 
>>>>>>>>>>>>>>>>>>>>>>>>>  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>>               1      69                   74   
>>>>>>>>>>>>>>>>>>>>>>>>>               177
>>>>>>>>>>>>>>>>>>>>>>>>>               2      57                   62   
>>>>>>>>>>>>>>>>>>>>>>>>>               168
>>>>>>>>>>>>>>>>>>>>>>>>>               3      54                   58   
>>>>>>>>>>>>>>>>>>>>>>>>>               234
>>>>>>>>>>>>>>>>>>>>>>>>>               4      54                   58   
>>>>>>>>>>>>>>>>>>>>>>>>>               157
>>>>>>>>>>>>>>>>>>>>>>>>>               5      56                   62   
>>>>>>>>>>>>>>>>>>>>>>>>>               148
>>>>>>>>>>>>>>>>>>>>>>>>> avg       58                   62.8             
>>>>>>>>>>>>>>>>>>>>>>>>>   176.8
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage 
>>>>>>>>>>>>>>>>>>>>>>>>>  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>>               1    11015                 11309 
>>>>>>>>>>>>>>>>>>>>>>>>>               32833
>>>>>>>>>>>>>>>>>>>>>>>>>               2    10385                 11110 
>>>>>>>>>>>>>>>>>>>>>>>>>               49751
>>>>>>>>>>>>>>>>>>>>>>>>>               3    10369                 11056 
>>>>>>>>>>>>>>>>>>>>>>>>>               33095
>>>>>>>>>>>>>>>>>>>>>>>>>               4    10332                 11017 
>>>>>>>>>>>>>>>>>>>>>>>>>               33106
>>>>>>>>>>>>>>>>>>>>>>>>>               5    10483                 11000 
>>>>>>>>>>>>>>>>>>>>>>>>>               49032
>>>>>>>>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4           
>>>>>>>>>>>>>>>>>>>>>>>>>   39563.4
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage 
>>>>>>>>>>>>>>>>>>>>>>>>>  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>>               1    55560                 60055 
>>>>>>>>>>>>>>>>>>>>>>>>>              156876
>>>>>>>>>>>>>>>>>>>>>>>>>               2    55485                 60024 
>>>>>>>>>>>>>>>>>>>>>>>>>              157132
>>>>>>>>>>>>>>>>>>>>>>>>>               3    55474                 60129 
>>>>>>>>>>>>>>>>>>>>>>>>>              156658
>>>>>>>>>>>>>>>>>>>>>>>>>               4    55555                 59867 
>>>>>>>>>>>>>>>>>>>>>>>>>              157259
>>>>>>>>>>>>>>>>>>>>>>>>>               5    55528                 59932 
>>>>>>>>>>>>>>>>>>>>>>>>>              157108
>>>>>>>>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4           
>>>>>>>>>>>>>>>>>>>>>>>>>  157006.6
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so 
>>>>>>>>>>>>>>>>>>>>>>>>> time fluctuates a lot,
>>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as 
>>>>>>>>>>>>>>>>>>>>>>>>> you said.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all 
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> If this also improves performance for other 
>>>>>>>>>>>>>>>>>>>>>>>> existing callers of
>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>>>>>>> what's your point? providing a helper like 
>>>>>>>>>>>>>>>>>>>>> clear_highfolio() or similar?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes, from above test, using clear_highpage/ 
>>>>>>>>>>>>>>>>>>>> flush_dcache_folio is better
>>>>>>>>>>>>>>>>>>>> than using folio_zero_range() for folio 
>>>>>>>>>>>>>>>>>>>> zero(especially for large
>>>>>>>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name 
>>>>>>>>>>>>>>>>>>>> it folio_zero()
>>>>>>>>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Since it is with many cond_resched(), the performance 
>>>>>>>>>>>>>>>>>> is worst...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible 
>>>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing 
>>>>>>>>>>>>>>>>> the folio
>>>>>>>>>>>>>>>>> from occupying the CPU and starving other processes, 
>>>>>>>>>>>>>>>>> right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int 
>>>>>>>>>>>>>>>> shmem_get_folio_gfp(struct inode
>>>>>>>>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>>>>>>>>                  * it now, lest undo on failure cancel 
>>>>>>>>>>>>>>>> our earlier guarantee.
>>>>>>>>>>>>>>>>                  */
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                 if (sgp != SGP_WRITE && ! 
>>>>>>>>>>>>>>>> folio_test_uptodate(folio)) {
>>>>>>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>>>>>>> -                       clear_highpage(folio_page(folio, 
>>>>>>>>>>>>>>>> i));
>>>>>>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>>>>>                         flush_dcache_folio(folio);
>>>>>>>>>>>>>>>>                         folio_mark_uptodate(folio);
>>>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use 
>>>>>>>>>>>>>>> folio_zero_user(folio,
>>>>>>>>>>>>>>> 0), I think the performance is worse, will retest once I 
>>>>>>>>>>>>>>> can access
>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). 
>>>>>>>>>>>>>> Does using
>>>>>>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> when use folio_zero_user(), the performance is vary bad 
>>>>>>>>>>>>> with above
>>>>>>>>>>>>> fallocate test(mount huge=always),
>>>>>>>>>>>>>
>>>>>>>>>>>>>              folio_zero_range   clear_highpage         
>>>>>>>>>>>>> folio_zero_user
>>>>>>>>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>>>>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>>>>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious 
>>>>>>>>>>>>> different.
>>>>>>>>>>>>
>>>>>>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>>>>>>> preemption_debug enabled?
>>>>>>>>>>>
>>>>>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>>>>>> this explains why the performance is much worse.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>>>>>>> alloc_anon_folio() is actually improving performance as 
>>>>>>>>>>>> expected,
>>>>>>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>>>>>
>>>>>>>>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>>>>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original 
>>>>>>>>>>> clear_huge_page
>>>>>>>>>>> is used in HugeTLB, clear PUD size maybe spend many time, but 
>>>>>>>>>>> for PMD or
>>>>>>>>>>> other size of large folio, cond_resched is not necessary 
>>>>>>>>>>> since we
>>>>>>>>>>> already have some folio_zero_range() to clear large folio, 
>>>>>>>>>>> and no issue
>>>>>>>>>>> was reported.
>>>>>>>>>> probably worth an optimization. calling cond_resched() for 
>>>>>>>>>> each page
>>>>>>>>>> seems too aggressive and useless.
>>>>>>>>>
>>>>>>>>> After some test, I think the cond_resched() is not the root cause,
>>>>>>>>> no performance gained with batched cond_resched(), even I kill
>>>>>>>>> cond_resched() from process_huge_page, no improvement.
>>>>>>>>>
>>>>>>>>> But when I unconditionally use clear_gigantic_page() in
>>>>>>>>> folio_zero_user(patched), there is big improvement with above
>>>>>>>>> fallocate on tmpfs(mount huge=always), also I test some other 
>>>>>>>>> testcase,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>>>>>
>>>>>>>>> base:
>>>>>>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>>>>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>>>>>>> sys     1m3.675s    1m15.462s   1m15.030s
>>>>>>>>>
>>>>>>>>> patched:
>>>>>>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>>>>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>>>>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>>>>>>
>>>>>>>>> Patched kernel win on sys and bad in user, but real is almost 
>>>>>>>>> same,
>>>>>>>>> maybe a little better than base.
>>>>>>>> We can find user time difference.  That means the original cache 
>>>>>>>> hot
>>>>>>>> behavior still applies on your system.
>>>>>>>> However, it appears that the performance to clear page from end to
>>>>>>>> begin
>>>>>>>> is really bad on your system.
>>>>>>>> So, I suggest to revise the current implementation to use 
>>>>>>>> sequential
>>>>>>>> clearing as much as possible.
>>>>>>>>
>>>>>>>
>>>>>>> I test case-anon-cow-seq-hugetlb for copy_user_large_folio()
>>>>>>>
>>>>>>> base:
>>>>>>> real    0m6.259s    0m6.197s    0m6.316s
>>>>>>> user    1m31.176s   1m27.195s   1m29.594s
>>>>>>> sys     7m44.199s   7m51.490s   8m21.149s
>>>>>>>
>>>>>>> patched(use copy_user_gigantic_page for 2M hugetlb too)
>>>>>>> real    0m3.182s    0m3.002s    0m2.963s
>>>>>>> user    1m19.456s   1m3.107s    1m6.447s
>>>>>>> sys     2m59.222s   3m10.899s   3m1.027s
>>>>>>>
>>>>>>> and sequential copy is better than the current implementation,
>>>>>>> so I will use sequential clear and copy.
>>>>>> Sorry, it appears that you misunderstanding my suggestion.  I
>>>>>> suggest to
>>>>>> revise process_huge_page() to use more sequential memory clearing and
>>>>>> copying to improve its performance on your platform.
>>>>>> -- 
>>>>>> Best Regards,
>>>>>> Huang, Ying
>>>>>>
>>>>>>>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>>>>>>>
>>>>>>>>> base:
>>>>>>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>>>>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>>>>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>>>>>>
>>>>>>>>> patched:
>>>>>>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>>>>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>>>>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This case is similar to the case1.
>>>>>>>>>
>>>>>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>>>>>
>>>>>>>>> base:
>>>>>>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>>>>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>>>>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>>>>>>
>>>>>>>>> patched:
>>>>>>>>>
>>>>>>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>>>>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>>>>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is big win on patched kernel, and it is similar to above 
>>>>>>>>> tmpfs
>>>>>>>>> test, so maybe we could revert the commit c79b57e462b5 ("mm: 
>>>>>>>>> hugetlb:
>>>>>>>>> clear target sub-page last when clearing huge page").
>>>>>
>>>>> I tried the following changes,
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 66cf855dee3f..e5cc75adfa10 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>>>>>                   base = 0;
>>>>>                   l = n;
>>>>>                   /* Process subpages at the end of huge page */
>>>>> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
>>>>> +               for (i = 2 * n; i < nr_pages; i++) {
>>>>>                           cond_resched();
>>>>>                           ret = process_subpage(addr + i * 
>>>>> PAGE_SIZE, i,
>>>>>                           arg);
>>>>>                           if (ret)
>>>>>
>>>>> Since n = 0, so the copying is from start to end now, but not
>>>>> improvement for case-anon-cow-seq-hugetlb,
>>>>>
>>>>> and if use copy_user_gigantic_pager, the time reduced from 6s to 3s
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index fe21bd3beff5..2c6532d21d84 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst,
>>>>> struct folio *src,
>>>>>                   .vma = vma,
>>>>>           };
>>>>>
>>>>> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>>>> -               return copy_user_gigantic_page(dst, src, addr_hint,
>>>>>                   vma, nr_pages);
>>>>> -
>>>>> -       return process_huge_page(addr_hint, nr_pages, copy_subpage, 
>>>>> &arg);
>>>>> +       return copy_user_gigantic_page(dst, src, addr_hint, vma, 
>>>>> nr_pages);
>>>>>    }
>>>> It appears that we have code generation issue here.  Can you check
>>>> it?
>>>> Whether code is inlined in the same way?
>>>>
>>>
>>> No different, and I checked the asm, both process_huge_page and
>>> copy_user_gigantic_page are inlined, it is strange...
>>
>> It's not inlined in my configuration.  And __always_inline below changes
>> it for me.
>>
>> If it's already inlined and the code is actually almost same, why
>> there's difference?  Is it possible for you to do some profile or
>> further analysis?
> 
> Yes, will continue to debug this.

My bad, I has some refactor patch before using copy_user_large_folio(),

ba3fda2a7b08 mm: use copy_user_large_page // good performance
a88666ae8f4d mm: call might_sleep() in folio_zero/copy_user()
3ab7d4d405e9 mm: calculate the base address in the folio_zero/copy_user()
7b240664c07d mm: convert to folio_copy_user()  // I made a mistake which 
use dst instead of src in copy_user_gigantic_page()
1a951e310aa9 mm: use aligned address in copy_user_gigantic_page()
e095ce052607 mm: use aligned address in clear_gigantic_page()

so please ignore the copy test result (case-anon-cow-seq-hugetlb)

In summary:
1) for copying, no obvious different between 
copy_user_large_folio/process_huge_page(copying from last to start or 
copying from start to last)

2) for clearing, clear_gigantic_page is better than process_huge_page
from my machine, and after clearing page from start to last(current, it 
process page from last to first), the performance is same to the 
clear_gigantic_page.


> 
>>
>>>> Maybe we can start with
>>>> modified   mm/memory.c
>>>> @@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
>>>>     * operation.  The target subpage will be processed last to keep its
>>>>     * cache lines hot.
>>>>     */
>>>> -static inline int process_huge_page(
>>>> +static __always_inline int process_huge_page(
>>>>        unsigned long addr_hint, unsigned int nr_pages,
>>>>        int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>>        void *arg)
>>
>> -- 
>> Best Regards,
>> Huang, Ying
>>
> 
>

Huang, Ying Oct. 30, 2024, 1:26 a.m. UTC | #28

Hi, Kefeng,

Sorry for late reply, the email is buried in my inbox, just dig it out.

Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/28 14:37, Kefeng Wang wrote:
>> On 2024/10/28 10:39, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/25 20:21, Huang, Ying wrote:
>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>
>>>>>> On 2024/10/25 15:47, Huang, Ying wrote:
>>>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>>>
>>>>>>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>>>>>>> Hi, Kefeng,
>>>>>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>>>>>>
>>>>>>>>>> +CC Huang Ying,
>>>>>>>>>>
>>>>>>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang
>>>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance
>>>>>>>>>>>>>>>>>>>>>>>>>>> regression introduced by this?
>>>>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways
>>>>>>>>>>>>>>>>>>>>>>>>>>> that we can't optimise for
>>>>>>>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if
>>>>>>>>>>>>>>>>>>>>>>>>>>> the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than
>>>>>>>>>>>>>>>>>>>>>>>>>>> clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be
>>>>>>>>>>>>>>>>>>>>>>>>>> better than memset, I change
>>>>>>>>>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(),
>>>>>>>>>>>>>>>>>>>>>>>>>> which already convert to
>>>>>>>>>>>>>>>>>>>>>>>>>> use folio_zero_range() from
>>>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage(), also I grep
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to
>>>>>>>>>>>>>>>>>>>>>>>>>> use folio_zero_range().
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(f,
>>>>>>>>>>>>>>>>>>>>>>>>>> 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:     folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:   folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:            
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done
>>>>>>>>>>>>>>>>>>>>>>>>>>> with this patch?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> No performance test before, but I write a
>>>>>>>>>>>>>>>>>>>>>>>>>> testcase,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios
>>>>>>>>>>>>>>>>>>>>>>>>>> (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear
>>>>>>>>>>>>>>>>>>>>>>>>>> all N folios
>>>>>>>>>>>>>>>>>>>>>>>>>>               
>>>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage/folio_zero_range/
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage
>>>>>>>>>>>>>>>>>>>>>>>>>>  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>>>               1      69                   74
>>>>>>>>>>>>>>>>>>>>>>>>>>               177
>>>>>>>>>>>>>>>>>>>>>>>>>>               2      57                   62
>>>>>>>>>>>>>>>>>>>>>>>>>>               168
>>>>>>>>>>>>>>>>>>>>>>>>>>               3      54                   58
>>>>>>>>>>>>>>>>>>>>>>>>>>               234
>>>>>>>>>>>>>>>>>>>>>>>>>>               4      54                   58
>>>>>>>>>>>>>>>>>>>>>>>>>>               157
>>>>>>>>>>>>>>>>>>>>>>>>>>               5      56                   62
>>>>>>>>>>>>>>>>>>>>>>>>>>               148
>>>>>>>>>>>>>>>>>>>>>>>>>> avg       58                   62.8
>>>>>>>>>>>>>>>>>>>>>>>>>>   176.8
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage
>>>>>>>>>>>>>>>>>>>>>>>>>>  folio_zero_range    folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>>>               1    11015                
>>>>>>>>>>>>>>>>>>>>>>>>>> 11309               32833
>>>>>>>>>>>>>>>>>>>>>>>>>>               2    10385                
>>>>>>>>>>>>>>>>>>>>>>>>>> 11110               49751
>>>>>>>>>>>>>>>>>>>>>>>>>>               3    10369                
>>>>>>>>>>>>>>>>>>>>>>>>>> 11056               33095
>>>>>>>>>>>>>>>>>>>>>>>>>>               4    10332                
>>>>>>>>>>>>>>>>>>>>>>>>>> 11017               33106
>>>>>>>>>>>>>>>>>>>>>>>>>>               5    10483                
>>>>>>>>>>>>>>>>>>>>>>>>>> 11000               49032
>>>>>>>>>>>>>>>>>>>>>>>>>> avg     10516.8               11098.4
>>>>>>>>>>>>>>>>>>>>>>>>>>   39563.4
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>>>>>>                    clear_highpage
>>>>>>>>>>>>>>>>>>>>>>>>>>  folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>>>               1    55560                
>>>>>>>>>>>>>>>>>>>>>>>>>> 60055              156876
>>>>>>>>>>>>>>>>>>>>>>>>>>               2    55485                
>>>>>>>>>>>>>>>>>>>>>>>>>> 60024              157132
>>>>>>>>>>>>>>>>>>>>>>>>>>               3    55474                
>>>>>>>>>>>>>>>>>>>>>>>>>> 60129              156658
>>>>>>>>>>>>>>>>>>>>>>>>>>               4    55555                
>>>>>>>>>>>>>>>>>>>>>>>>>> 59867              157259
>>>>>>>>>>>>>>>>>>>>>>>>>>               5    55528                
>>>>>>>>>>>>>>>>>>>>>>>>>> 59932              157108
>>>>>>>>>>>>>>>>>>>>>>>>>> avg     55520.4               60001.4
>>>>>>>>>>>>>>>>>>>>>>>>>>  157006.6
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so
>>>>>>>>>>>>>>>>>>>>>>>>>> time fluctuates a lot,
>>>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as
>>>>>>>>>>>>>>>>>>>>>>>>>> you said.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> If this also improves performance for other
>>>>>>>>>>>>>>>>>>>>>>>>> existing callers of
>>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>>>>>>>> what's your point? providing a helper like
>>>>>>>>>>>>>>>>>>>>>> clear_highfolio() or similar?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yes, from above test, using clear_highpage/
>>>>>>>>>>>>>>>>>>>>> flush_dcache_folio is better
>>>>>>>>>>>>>>>>>>>>> than using folio_zero_range() for folio
>>>>>>>>>>>>>>>>>>>>> zero(especially for large
>>>>>>>>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe
>>>>>>>>>>>>>>>>>>>>> name it folio_zero()
>>>>>>>>>>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Since it is with many cond_resched(), the
>>>>>>>>>>>>>>>>>>> performance is worst...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Not exactly? It should have zero cost for a
>>>>>>>>>>>>>>>>>> preemptible kernel.
>>>>>>>>>>>>>>>>>> For a non-preemptible kernel, it helps avoid
>>>>>>>>>>>>>>>>>> clearing the folio
>>>>>>>>>>>>>>>>>> from occupying the CPU and starving other processes,
>>>>>>>>>>>>>>>>>> right?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int
>>>>>>>>>>>>>>>>> shmem_get_folio_gfp(struct inode
>>>>>>>>>>>>>>>>> *inode, pgoff_t index,
>>>>>>>>>>>>>>>>>                  * it now, lest undo on failure
>>>>>>>>>>>>>>>>> cancel our earlier guarantee.
>>>>>>>>>>>>>>>>>                  */
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>                 if (sgp != SGP_WRITE && !
>>>>>>>>>>>>>>>>> folio_test_uptodate(folio)) {
>>>>>>>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>>>>>>>> -                      
>>>>>>>>>>>>>>>>> clear_highpage(folio_page(folio, i));
>>>>>>>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>>>>>>                         flush_dcache_folio(folio);
>>>>>>>>>>>>>>>>>                         folio_mark_uptodate(folio);
>>>>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use
>>>>>>>>>>>>>>>> folio_zero_user(folio,
>>>>>>>>>>>>>>>> 0), I think the performance is worse, will retest once
>>>>>>>>>>>>>>>> I can access
>>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps, since the current code uses
>>>>>>>>>>>>>>> clear_hugepage(). Does using
>>>>>>>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> when use folio_zero_user(), the performance is vary bad
>>>>>>>>>>>>>> with above
>>>>>>>>>>>>>> fallocate test(mount huge=always),
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>              folio_zero_range   clear_highpage
>>>>>>>>>>>>>> folio_zero_user
>>>>>>>>>>>>>> real    0m1.214s             0m1.111s              0m3.159s
>>>>>>>>>>>>>> user    0m0.000s             0m0.000s              0m0.000s
>>>>>>>>>>>>>> sys     0m1.210s             0m1.109s              0m3.152s
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no
>>>>>>>>>>>>>> obvious different.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>>>>>>>> preemption_debug enabled?
>>>>>>>>>>>>
>>>>>>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>>>>>>> this explains why the performance is much worse.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>>>>>>>> alloc_anon_folio() is actually improving performance as
>>>>>>>>>>>>> expected,
>>>>>>>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>>>>>>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original
>>>>>>>>>>>> clear_huge_page
>>>>>>>>>>>> is used in HugeTLB, clear PUD size maybe spend many time,
>>>>>>>>>>>> but for PMD or
>>>>>>>>>>>> other size of large folio, cond_resched is not necessary
>>>>>>>>>>>> since we
>>>>>>>>>>>> already have some folio_zero_range() to clear large folio,
>>>>>>>>>>>> and no issue
>>>>>>>>>>>> was reported.
>>>>>>>>>>> probably worth an optimization. calling cond_resched() for
>>>>>>>>>>> each page
>>>>>>>>>>> seems too aggressive and useless.
>>>>>>>>>>
>>>>>>>>>> After some test, I think the cond_resched() is not the root cause,
>>>>>>>>>> no performance gained with batched cond_resched(), even I kill
>>>>>>>>>> cond_resched() from process_huge_page, no improvement.
>>>>>>>>>>
>>>>>>>>>> But when I unconditionally use clear_gigantic_page() in
>>>>>>>>>> folio_zero_user(patched), there is big improvement with above
>>>>>>>>>> fallocate on tmpfs(mount huge=always), also I test some
>>>>>>>>>> other testcase,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>>>>>>
>>>>>>>>>> base:
>>>>>>>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>>>>>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>>>>>>>> sys     1m3.675s    1m15.462s   1m15.030s
>>>>>>>>>>
>>>>>>>>>> patched:
>>>>>>>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>>>>>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>>>>>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>>>>>>>
>>>>>>>>>> Patched kernel win on sys and bad in user, but real is
>>>>>>>>>> almost same,
>>>>>>>>>> maybe a little better than base.
>>>>>>>>> We can find user time difference.  That means the original
>>>>>>>>> cache hot
>>>>>>>>> behavior still applies on your system.
>>>>>>>>> However, it appears that the performance to clear page from end to
>>>>>>>>> begin
>>>>>>>>> is really bad on your system.
>>>>>>>>> So, I suggest to revise the current implementation to use
>>>>>>>>> sequential
>>>>>>>>> clearing as much as possible.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I test case-anon-cow-seq-hugetlb for copy_user_large_folio()
>>>>>>>>
>>>>>>>> base:
>>>>>>>> real    0m6.259s    0m6.197s    0m6.316s
>>>>>>>> user    1m31.176s   1m27.195s   1m29.594s
>>>>>>>> sys     7m44.199s   7m51.490s   8m21.149s
>>>>>>>>
>>>>>>>> patched(use copy_user_gigantic_page for 2M hugetlb too)
>>>>>>>> real    0m3.182s    0m3.002s    0m2.963s
>>>>>>>> user    1m19.456s   1m3.107s    1m6.447s
>>>>>>>> sys     2m59.222s   3m10.899s   3m1.027s
>>>>>>>>
>>>>>>>> and sequential copy is better than the current implementation,
>>>>>>>> so I will use sequential clear and copy.
>>>>>>> Sorry, it appears that you misunderstanding my suggestion.  I
>>>>>>> suggest to
>>>>>>> revise process_huge_page() to use more sequential memory clearing and
>>>>>>> copying to improve its performance on your platform.
>>>>>>> -- Best Regards,
>>>>>>> Huang, Ying
>>>>>>>
>>>>>>>>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>>>>>>>>
>>>>>>>>>> base:
>>>>>>>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>>>>>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>>>>>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>>>>>>>
>>>>>>>>>> patched:
>>>>>>>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>>>>>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>>>>>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This case is similar to the case1.
>>>>>>>>>>
>>>>>>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>>>>>>
>>>>>>>>>> base:
>>>>>>>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>>>>>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>>>>>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>>>>>>>
>>>>>>>>>> patched:
>>>>>>>>>>
>>>>>>>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>>>>>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>>>>>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There is big win on patched kernel, and it is similar to
>>>>>>>>>> above tmpfs
>>>>>>>>>> test, so maybe we could revert the commit c79b57e462b5 ("mm:
>>>>>>>>>> hugetlb:
>>>>>>>>>> clear target sub-page last when clearing huge page").
>>>>>>
>>>>>> I tried the following changes,
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 66cf855dee3f..e5cc75adfa10 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>>>>>>                   base = 0;
>>>>>>                   l = n;
>>>>>>                   /* Process subpages at the end of huge page */
>>>>>> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
>>>>>> +               for (i = 2 * n; i < nr_pages; i++) {
>>>>>>                           cond_resched();
>>>>>>                           ret = process_subpage(addr + i *
>>>>>> PAGE_SIZE, i,
>>>>>>                           arg);
>>>>>>                           if (ret)
>>>>>>
>>>>>> Since n = 0, so the copying is from start to end now, but not
>>>>>> improvement for case-anon-cow-seq-hugetlb,
>>>>>>
>>>>>> and if use copy_user_gigantic_pager, the time reduced from 6s to 3s
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index fe21bd3beff5..2c6532d21d84 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst,
>>>>>> struct folio *src,
>>>>>>                   .vma = vma,
>>>>>>           };
>>>>>>
>>>>>> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>>>>> -               return copy_user_gigantic_page(dst, src, addr_hint,
>>>>>>                   vma, nr_pages);
>>>>>> -
>>>>>> -       return process_huge_page(addr_hint, nr_pages,
>>>>>> copy_subpage, &arg);
>>>>>> +       return copy_user_gigantic_page(dst, src, addr_hint, vma,
>>>>>> nr_pages);
>>>>>>    }
>>>>> It appears that we have code generation issue here.  Can you check
>>>>> it?
>>>>> Whether code is inlined in the same way?
>>>>>
>>>>
>>>> No different, and I checked the asm, both process_huge_page and
>>>> copy_user_gigantic_page are inlined, it is strange...
>>>
>>> It's not inlined in my configuration.  And __always_inline below changes
>>> it for me.
>>>
>>> If it's already inlined and the code is actually almost same, why
>>> there's difference?  Is it possible for you to do some profile or
>>> further analysis?
>> Yes, will continue to debug this.
>
> My bad, I has some refactor patch before using copy_user_large_folio(),
>
> ba3fda2a7b08 mm: use copy_user_large_page // good performance
> a88666ae8f4d mm: call might_sleep() in folio_zero/copy_user()
> 3ab7d4d405e9 mm: calculate the base address in the folio_zero/copy_user()
> 7b240664c07d mm: convert to folio_copy_user()  // I made a mistake
> which use dst instead of src in copy_user_gigantic_page()
> 1a951e310aa9 mm: use aligned address in copy_user_gigantic_page()
> e095ce052607 mm: use aligned address in clear_gigantic_page()
>
> so please ignore the copy test result (case-anon-cow-seq-hugetlb)
>
> In summary:
> 1) for copying, no obvious different between
> copy_user_large_folio/process_huge_page(copying from last to start or
> copying from start to last)
>
> 2) for clearing, clear_gigantic_page is better than process_huge_page
> from my machine, and after clearing page from start to last(current,
> it process page from last to first), the performance is same to the
> clear_gigantic_page.

Can you show detailed data, at least user/sys/real time?  Previously, we
can find user time reduction and sys time increment.

--
Best Regards,
Huang, Ying

>> 
>>>
>>>>> Maybe we can start with
>>>>> modified   mm/memory.c
>>>>> @@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
>>>>>     * operation.  The target subpage will be processed last to keep its
>>>>>     * cache lines hot.
>>>>>     */
>>>>> -static inline int process_huge_page(
>>>>> +static __always_inline int process_huge_page(
>>>>>        unsigned long addr_hint, unsigned int nr_pages,
>>>>>        int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>>>        void *arg)
>>>
>>> -- Best Regards,
>>> Huang, Ying
>>>
>>

mm: shmem: convert to use folio_zero_range()

Commit Message

Comments

Patch