[v2,0/9] x86/clear_huge_page: multi-page clearing

Message ID 20230830184958.2333078-1-ankur.a.arora@oracle.com (mailing list archive)

Message

Ankur Arora Aug. 30, 2023, 6:49 p.m. UTC
This series adds a multi-page clearing primitive, clear_pages(),
which enables more effective use of x86 string instructions by
advertising the real region-size to be cleared. 

Region-size can be used as a hint by uarchs to optimize the
clearing.
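
(For illustration, a minimal sketch of such a primitive -- not the
series' exact implementation -- using a single REP STOSB so the CPU
sees the whole extent rather than PAGE_SIZE-sized chunks:)

	/*
	 * Illustrative only: clear @npages contiguous pages in one
	 * REP STOSB, advertising the full region-size to the CPU.
	 */
	static inline void clear_pages(void *addr, unsigned int npages)
	{
		unsigned long len = (unsigned long)npages << PAGE_SHIFT;

		asm volatile("rep stosb"
			     : "+D" (addr), "+c" (len)
			     : "a" (0)
			     : "memory");
	}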

Also add allow_resched() which marks a code-section as allowing
rescheduling in the irqentry_exit path. This allows clear_pages()
to get by without having to call cond_resched() periodically.
(preempt_model_full() already handles this via
irqentry_exit_cond_resched(), so we handle this similarly for
preempt_model_none() and preempt_model_voluntary().)
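
As a rough usage sketch (clear_contig_region() is from patch 9, but
the body here is an assumption, as is the disallow_resched()
counterpart):

	/*
	 * Hypothetical shape of the preemptible clearing helper: mark
	 * the section with TIF_ALLOW_RESCHED so irqentry_exit() may
	 * reschedule us, then clear the whole extent with no
	 * cond_resched() calls.
	 */
	static void clear_contig_region(struct page *page, unsigned int npages)
	{
		allow_resched();
		clear_pages(page_address(page), npages);
		disallow_resched();	/* assumed counterpart */
	}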

Performance
==

With this, demand-fault performance gets a decent increase:

  *Milan*     mm/clear_huge_page   x86/clear_huge_page   change    
                          (GB/s)                (GB/s)             
                                                                   
  pg-sz=2MB                14.55                 19.29    +32.5%
  pg-sz=1GB                19.34                 49.60   +156.4%  

Milan (and some other AMD Zen uarchs tested) take advantage of the
hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
this optimization seems to be at around region-size > LLC-size so
the pg-sz=2MB load still allocates cachelines.


  *Icelakex*  mm/clear_huge_page   x86/clear_huge_page   change   
                          (GB/s)                (GB/s)            
                                                                  
  pg-sz=2MB                 9.19                 12.94   +40.8%  
  pg-sz=1GB                 9.36                 12.97   +38.5%  

Icelakex sees a decent improvement in performance but continues to
allocate cachelines for both region-sizes.


Negative: there is a downside to clearing in larger chunks: the
current upstream approach clears a page at a time, narrowing towards
the faulting subpage. This has better cache characteristics for
some sequential access workloads where subpages near the faulting
page have a greater likelihood of access.
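
(For reference, a simplified sketch of that narrowing order -- the
real logic lives in process_huge_page() in mm/memory.c; this is an
approximation, not a copy:)

	/*
	 * Clear subpages in decreasing distance from the faulting one,
	 * so the lines nearest the fault are written last and are the
	 * most likely to still be cache-hot on return.
	 */
	static void clear_towards_fault(struct page *page, unsigned long addr,
					unsigned int fault_idx,
					unsigned int nr_subpages)
	{
		unsigned int l = 0, r = nr_subpages - 1;

		while (l <= r) {
			/* clear whichever end is further from the fault */
			if (fault_idx - l >= r - fault_idx) {
				clear_user_highpage(page + l, addr + l * PAGE_SIZE);
				l++;
			} else {
				clear_user_highpage(page + r, addr + r * PAGE_SIZE);
				r--;
			}
		}
	}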

I'm not sure if there are real workloads which care about this access
pattern, but one example is the vm-scalability/case-anon-w-seq-hugetlb
test. This test starts a process for each online CPU, with each
process writing sequentially to its set of hugepages.

The bottleneck here is the memory pipe, so the improvement in
stime is limited; and because the clearing is less cache-optimal
now, utime suffers from worse user cache misses.

  *Icelakex*               mm/clear_huge_page  x86/clear_huge_page  change
  (tasks=128, mem=4GB/task)

  stime                        286.8 +- 3.6%      243.9 +- 4.1%     -14.9%
  utime                        497.7 +- 4.1%      553.5 +- 2.0%     +11.2%
  wall-clock                     6.9 +- 2.8%        7.0 +- 1.4%     + 1.4%


  *Milan*                  mm/clear_huge_page  x86/clear_huge_page  change
  (mem=1GB/task, tasks=512)

  stime                        501.3 +- 1.4%      498.0 +- 0.9%      -0.5%
  utime                        298.7 +- 1.1%      335.0 +- 2.2%     +12.1%
  wall-clock                     3.5 +- 2.8%        3.8 +- 2.6%      +8.5%

The same test performs better if we have a smaller number of processes,
since there is more backend BW available, and thus the improved stime
compensates for the worse utime.

This could be improved by using more circuitous chunking (somewhat
like this:
https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/).
But I'm not sure if it is worth doing. Opinions?

Patches
==

Patches 1, 2, 3:
  "mm/clear_huge_page: allow arch override for clear_huge_page()",
  "mm/huge_page: separate clear_huge_page() and copy_huge_page()",
  "mm/huge_page: cleanup clear_/copy_subpage()"
are minor. The first one allows clear_huge_page() to have an
arch specific version and the other two are mechanical cleanup
patches.

Patches 4, 5, 6:
  "x86/clear_page: extend clear_page*() for multi-page clearing",
  "x86/clear_page: add clear_pages()",
  "x86/clear_huge_page: multi-page clearing"
define the x86-specific clear_pages() and clear_huge_page().

Patches 7, 8:
  "sched: define TIF_ALLOW_RESCHED",
  "irqentry: define irqentry_exit_allow_resched()"
define allow_resched() to demarcate preemptible sections.

This gets used in patch 9:
  "x86/clear_huge_page: make clear_contig_region() preemptible".

Changelog:

v2:
  - Addressed review comments from peterz, tglx.
  - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
  - General code cleanup

Also at:
  github.com/terminus/linux clear-pages.v2

Comments appreciated!

Ankur Arora (9):
  mm/clear_huge_page: allow arch override for clear_huge_page()
  mm/huge_page: separate clear_huge_page() and copy_huge_page()
  mm/huge_page: cleanup clear_/copy_subpage()
  x86/clear_page: extend clear_page*() for multi-page clearing
  x86/clear_page: add clear_pages()
  x86/clear_huge_page: multi-page clearing
  sched: define TIF_ALLOW_RESCHED
  irqentry: define irqentry_exit_allow_resched()
  x86/clear_huge_page: make clear_contig_region() preemptible

 arch/x86/include/asm/page_64.h     |  27 +++--
 arch/x86/include/asm/thread_info.h |   2 +
 arch/x86/lib/clear_page_64.S       |  52 ++++++---
 arch/x86/mm/hugetlbpage.c          |  59 ++++++++++
 include/linux/entry-common.h       |  13 +++
 include/linux/sched.h              |  30 +++++
 kernel/entry/common.c              |  13 ++-
 kernel/sched/core.c                |  32 ++---
 mm/memory.c                        | 181 +++++++++++++++++------------
 9 files changed, 297 insertions(+), 112 deletions(-)

Comments

Mateusz Guzik Sept. 3, 2023, 8:14 a.m. UTC | #1
On Wed, Aug 30, 2023 at 11:49:49AM -0700, Ankur Arora wrote:
> This series adds a multi-page clearing primitive, clear_pages(),
> which enables more effective use of x86 string instructions by
> advertising the real region-size to be cleared. 
> 
> Region-size can be used as a hint by uarchs to optimize the
> clearing.
> 
> Also add allow_resched() which marks a code-section as allowing
> rescheduling in the irqentry_exit path. This allows clear_pages()
> to get by without having to call cond_resched() periodically.
> (preempt_model_full() already handles this via
> irqentry_exit_cond_resched(), so we handle this similarly for
> preempt_model_none() and preempt_model_voluntary().)
> 
> Performance
> ==
> 
> With this, demand-fault performance gets a decent increase:
> 
>   *Milan*     mm/clear_huge_page   x86/clear_huge_page   change    
>                           (GB/s)                (GB/s)             
>                                                                    
>   pg-sz=2MB                14.55                 19.29    +32.5%
>   pg-sz=1GB                19.34                 49.60   +156.4%  
> 
> Milan (and some other AMD Zen uarchs tested) take advantage of the
> hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
> this optimization seems to be at around region-size > LLC-size so
> the pg-sz=2MB load still allocates cachelines.
> 

Have you benchmarked clzero? It is an AMD-specific instruction issuing
non-temporal stores. It is definitely something to try out for 1G pages.

One would think rep stosq has to be at least not worse since the CPU is
explicitly told what to do and is free to optimize it however it sees
fit, but the rep prefix has a long history of underperforming.

I'm not saying it is going to be better, but that this should be tested,
albeit one can easily argue this can be done at a later date.

I would do it myself but my access to AMD CPUs is limited.

> 
>   *Icelakex*  mm/clear_huge_page   x86/clear_huge_page   change   
>                           (GB/s)                (GB/s)            
>                                                                   
>   pg-sz=2MB                 9.19                 12.94   +40.8%  
>   pg-sz=1GB                 9.36                 12.97   +38.5%  
> 
> Icelakex sees a decent improvement in performance but continues to
> allocate cachelines for both region-sizes.
> 
> 
> Negative: there is a downside to clearing in larger chunks: the
> current upstream approach clears a page at a time, narrowing towards
> the faulting subpage. This has better cache characteristics for
> some sequential access workloads where subpages near the faulting
> page have a greater likelihood of access.
> 
> I'm not sure if there are real workloads which care about this access
> pattern, but one example is the vm-scalability/case-anon-w-seq-hugetlb
> test. This test starts a process for each online CPU, with each
> process writing sequentially to its set of hugepages.
> 
> The bottleneck here is the memory pipe, so the improvement in
> stime is limited; and because the clearing is less cache-optimal
> now, utime suffers from worse user cache misses.
> 
>   *Icelakex*               mm/clear_huge_page  x86/clear_huge_page  change
>   (tasks=128, mem=4GB/task)
> 
>   stime                        286.8 +- 3.6%      243.9 +- 4.1%     -14.9%
>   utime                        497.7 +- 4.1%      553.5 +- 2.0%     +11.2%
>   wall-clock                     6.9 +- 2.8%        7.0 +- 1.4%     + 1.4%
> 
> 
>   *Milan*                  mm/clear_huge_page  x86/clear_huge_page  change
>   (mem=1GB/task, tasks=512)
> 
>   stime                        501.3 +- 1.4%      498.0 +- 0.9%      -0.5%
>   utime                        298.7 +- 1.1%      335.0 +- 2.2%     +12.1%
>   wall-clock                     3.5 +- 2.8%        3.8 +- 2.6%      +8.5%
> 
> The same test performs better if we have a smaller number of processes,
> since there is more backend BW available, and thus the improved stime
> compensates for the worse utime.
> 
> This could be improved by using more circuitous chunking (somewhat
> like this:
> https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/).
> But I'm not sure if it is worth doing. Opinions?
> 
> Patches
> ==
> 
> Patches 1, 2, 3:
>   "mm/clear_huge_page: allow arch override for clear_huge_page()",
>   "mm/huge_page: separate clear_huge_page() and copy_huge_page()",
>   "mm/huge_page: cleanup clear_/copy_subpage()"
> are minor. The first one allows clear_huge_page() to have an
> arch specific version and the other two are mechanical cleanup
> patches.
> 
> Patches 4, 5, 6:
>   "x86/clear_page: extend clear_page*() for multi-page clearing",
>   "x86/clear_page: add clear_pages()",
>   "x86/clear_huge_page: multi-page clearing"
> define the x86-specific clear_pages() and clear_huge_page().
> 
> Patches 7, 8:
>   "sched: define TIF_ALLOW_RESCHED",
>   "irqentry: define irqentry_exit_allow_resched()"
> define allow_resched() to demarcate preemptible sections.
> 
> This gets used in patch 9:
>   "x86/clear_huge_page: make clear_contig_region() preemptible".
> 
> Changelog:
> 
> v2:
>   - Addressed review comments from peterz, tglx.
>   - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
>   - General code cleanup
> 
> Also at:
>   github.com/terminus/linux clear-pages.v2
> 
> Comments appreciated!
> 
> Ankur Arora (9):
>   mm/clear_huge_page: allow arch override for clear_huge_page()
>   mm/huge_page: separate clear_huge_page() and copy_huge_page()
>   mm/huge_page: cleanup clear_/copy_subpage()
>   x86/clear_page: extend clear_page*() for multi-page clearing
>   x86/clear_page: add clear_pages()
>   x86/clear_huge_page: multi-page clearing
>   sched: define TIF_ALLOW_RESCHED
>   irqentry: define irqentry_exit_allow_resched()
>   x86/clear_huge_page: make clear_contig_region() preemptible
> 
>  arch/x86/include/asm/page_64.h     |  27 +++--
>  arch/x86/include/asm/thread_info.h |   2 +
>  arch/x86/lib/clear_page_64.S       |  52 ++++++---
>  arch/x86/mm/hugetlbpage.c          |  59 ++++++++++
>  include/linux/entry-common.h       |  13 +++
>  include/linux/sched.h              |  30 +++++
>  kernel/entry/common.c              |  13 ++-
>  kernel/sched/core.c                |  32 ++---
>  mm/memory.c                        | 181 +++++++++++++++++------------
>  9 files changed, 297 insertions(+), 112 deletions(-)
> 
> -- 
> 2.31.1
> 
>
Raghavendra K T Sept. 5, 2023, 1:06 a.m. UTC | #2
On 8/31/2023 12:19 AM, Ankur Arora wrote:
> This series adds a multi-page clearing primitive, clear_pages(),
> which enables more effective use of x86 string instructions by
> advertising the real region-size to be cleared.
> 
> Region-size can be used as a hint by uarchs to optimize the
> clearing.
> 
> Also add allow_resched() which marks a code-section as allowing
> rescheduling in the irqentry_exit path. This allows clear_pages()
> to get by without having to call cond_resched() periodically.
> (preempt_model_full() already handles this via
> irqentry_exit_cond_resched(), so we handle this similarly for
> preempt_model_none() and preempt_model_voluntary().)
> 
> 

Hello Ankur,
Thanks for the patches.

I tried the patches; improvements look similar to v1 (even without
the circuitous chunk optimizations).
Still we see a similar 50-60% improvement for 1G and 2M page sizes.


SUT: Bergamo
     CPU family:          25
     Model:               160
     Thread(s) per core:  2
     Core(s) per socket:  128
     Socket(s):           2

NUMA:
   NUMA node(s):          2
   NUMA node0 CPU(s):     0-127,256-383
   NUMA node1 CPU(s):     128-255,384-511

Test: Use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA node0),
for both base-hugepage-size=2M and 1GB.
Current result is with thp = always, but madv also did not make much
difference.

perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>

time in seconds elapsed (average of 10 runs) (lower = better)

Result:
base: mm/clear_huge_page
patched: x86/clear_huge_page

page-size  base       patched     Improvement %
2M         5.0779     2.50623     50.64
1G         2.50623    1.012439    59.60
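
(For reference, a minimal sketch of this kind of test -- an
illustrative reconstruction, not the actual test program; for
pg-sz=1GB you would additionally pass MAP_HUGE_1GB:)

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>

	#define REGION	(64UL << 30)
	#define STRIDE	(2UL << 20)	/* one write per 2MB hugepage */

	int main(void)
	{
		char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* each write demand-faults one hugepage in the kernel */
		for (unsigned long off = 0; off < REGION; off += STRIDE)
			p[off] = 1;

		return 0;
	}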

More details:

  Performance counter stats for 'mm/map_hugetlb' (10 runs):

           5,058.71 msec task-clock                #    0.996 CPUs utilized             ( +-  0.26% )
                  8      context-switches          #    1.576 /sec                      ( +-  7.23% )
                  0      cpu-migrations            #    0.000 /sec
             32,917      page-faults               #    6.484 K/sec                     ( +-  0.00% )
     15,797,804,067      cycles                    #    3.112 GHz                       ( +-  0.26% )  (35.70%)
          2,073,754      stalled-cycles-frontend   #    0.01% frontend cycles idle      ( +-  1.25% )  (35.71%)
         27,508,977      stalled-cycles-backend    #    0.17% backend cycles idle       ( +-  9.48% )  (35.74%)
      1,143,710,651      instructions              #    0.07  insn per cycle
                                                   #    0.03  stalled cycles per insn   ( +-  0.15% )  (35.76%)
        243,817,330      branches                  #   48.028 M/sec                     ( +-  0.12% )  (35.78%)
            357,760      branch-misses             #    0.15% of all branches           ( +-  1.52% )  (35.75%)
      2,540,733,497      L1-dcache-loads           #  500.483 M/sec                     ( +-  0.04% )  (35.74%)
      1,093,660,557      L1-dcache-load-misses     #   42.98% of all L1-dcache accesses ( +-  0.03% )  (35.71%)
         73,335,478      L1-icache-loads           #   14.446 M/sec                     ( +-  0.08% )  (35.70%)
            878,378      L1-icache-load-misses     #    1.19% of all L1-icache accesses ( +-  2.65% )  (35.68%)
          1,025,714      dTLB-loads                #  202.049 K/sec                     ( +-  2.70% )  (35.69%)
            405,407      dTLB-load-misses          #   37.35% of all dTLB cache accesses ( +-  1.59% )  (35.68%)
                  2      iTLB-loads                #    0.394 /sec                      ( +- 41.63% )  (35.68%)
             40,356      iTLB-load-misses          # 1552153.85% of all iTLB cache accesses ( +-  7.18% )  (35.68%)

             5.0779 +- 0.0132 seconds time elapsed  ( +-  0.26% )

  Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb' (10 runs):

           2,538.40 msec task-clock                #    1.013 CPUs utilized             ( +-  0.27% )
                  4      context-switches          #    1.597 /sec                      ( +-  6.51% )
                  1      cpu-migrations            #    0.399 /sec
             32,916      page-faults               #   13.140 K/sec                     ( +-  0.00% )
      7,901,830,782      cycles                    #    3.154 GHz                       ( +-  0.27% )  (35.67%)
          6,590,473      stalled-cycles-frontend   #    0.08% frontend cycles idle      ( +- 10.31% )  (35.71%)
        329,970,288      stalled-cycles-backend    #    4.23% backend cycles idle       ( +- 13.65% )  (35.74%)
        725,811,962      instructions              #    0.09  insn per cycle
                                                   #    0.80  stalled cycles per insn   ( +-  0.37% )  (35.78%)
        132,182,704      branches                  #   52.767 M/sec                     ( +-  0.26% )  (35.82%)
            254,163      branch-misses             #    0.19% of all branches           ( +-  2.47% )  (35.81%)
      2,382,927,453      L1-dcache-loads           #  951.262 M/sec                     ( +-  0.04% )  (35.77%)
      1,082,022,067      L1-dcache-load-misses     #   45.41% of all L1-dcache accesses ( +-  0.02% )  (35.74%)
         47,164,491      L1-icache-loads           #   18.828 M/sec                     ( +-  0.37% )  (35.70%)
            474,535      L1-icache-load-misses     #    0.99% of all L1-icache accesses ( +-  2.93% )  (35.66%)
          1,477,334      dTLB-loads                #  589.750 K/sec                     ( +-  5.12% )  (35.65%)
            624,125      dTLB-load-misses          #   56.24% of all dTLB cache accesses ( +-  5.66% )  (35.65%)
                  0      iTLB-loads                #    0.000 /sec                      (35.65%)
              1,626      iTLB-load-misses          # 7069.57% of all iTLB cache accesses ( +-283.51% )  (35.65%)

            2.50623 +- 0.00691 seconds time elapsed  ( +-  0.28% )


  Performance counter stats for 'numactl -m 0 -N 0 mm/map_hugetlb_1G' (10 runs):

           2,506.50 msec task-clock                #    0.995 CPUs utilized             ( +-  0.17% )
                  4      context-switches          #    1.589 /sec                      ( +-  9.28% )
                  0      cpu-migrations            #    0.000 /sec
                214      page-faults               #   84.997 /sec                      ( +-  0.13% )
      7,821,519,053      cycles                    #    3.107 GHz                       ( +-  0.17% )  (35.72%)
          2,037,744      stalled-cycles-frontend   #    0.03% frontend cycles idle      ( +- 25.62% )  (35.73%)
          6,578,899      stalled-cycles-backend    #    0.08% backend cycles idle       ( +-  2.65% )  (35.73%)
        468,648,780      instructions              #    0.06  insn per cycle
                                                   #    0.01  stalled cycles per insn   ( +-  0.10% )  (35.73%)
        116,267,370      branches                  #   46.179 M/sec                     ( +-  0.08% )  (35.73%)
            111,966      branch-misses             #    0.10% of all branches           ( +-  2.98% )  (35.72%)
      2,294,727,165      L1-dcache-loads           #  911.424 M/sec                     ( +-  0.02% )  (35.71%)
      1,076,156,463      L1-dcache-load-misses     #   46.88% of all L1-dcache accesses ( +-  0.01% )  (35.70%)
         26,093,151      L1-icache-loads           #   10.364 M/sec                     ( +-  0.21% )  (35.71%)
            132,944      L1-icache-load-misses     #    0.51% of all L1-icache accesses ( +-  0.55% )  (35.70%)
             30,925      dTLB-loads                #   12.283 K/sec                     ( +-  5.70% )  (35.71%)
             27,437      dTLB-load-misses          #   86.22% of all dTLB cache accesses ( +-  1.98% )  (35.70%)
                  0      iTLB-loads                #    0.000 /sec                      (35.71%)
                 11      iTLB-load-misses          #   62.50% of all iTLB cache accesses ( +-140.21% )  (35.70%)

            2.51890 +- 0.00433 seconds time elapsed  ( +-  0.17% )

  Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb_1G' (10 runs):

           1,013.59 msec task-clock                #    1.001 CPUs utilized             ( +-  0.07% )
                  2      context-switches          #    1.978 /sec                      ( +- 12.91% )
                  1      cpu-migrations            #    0.989 /sec
                213      page-faults               #  210.634 /sec                      ( +-  0.17% )
      3,169,391,694      cycles                    #    3.134 GHz                       ( +-  0.07% )  (35.53%)
            109,925      stalled-cycles-frontend   #    0.00% frontend cycles idle      ( +-  5.56% )  (35.63%)
        950,638,913      stalled-cycles-backend    #   30.06% backend cycles idle       ( +-  5.06% )  (35.73%)
         51,189,571      instructions              #    0.02  insn per cycle
                                                   #   21.03  stalled cycles per insn   ( +-  1.22% )  (35.82%)
          9,545,941      branches                  #    9.440 M/sec                     ( +-  1.50% )  (35.92%)
             86,836      branch-misses             #    0.88% of all branches           ( +-  3.74% )  (36.00%)
         46,109,587      L1-dcache-loads           #   45.597 M/sec                     ( +-  3.92% )  (35.96%)
         13,796,172      L1-dcache-load-misses     #   41.77% of all L1-dcache accesses ( +-  4.81% )  (35.85%)
          1,179,166      L1-icache-loads           #    1.166 M/sec                     ( +-  1.22% )  (35.77%)
             21,528      L1-icache-load-misses     #    1.90% of all L1-icache accesses ( +-  1.85% )  (35.66%)
             14,529      dTLB-loads                #   14.368 K/sec                     ( +-  4.65% )  (35.57%)
              8,505      dTLB-load-misses          #   67.88% of all dTLB cache accesses ( +-  5.61% )  (35.52%)
                  0      iTLB-loads                #    0.000 /sec                      (35.52%)
                  8      iTLB-load-misses          #    0.00% of all iTLB cache accesses ( +-267.99% )  (35.52%)

           1.012439 +- 0.000723 seconds time elapsed  ( +-  0.07% )


Please feel free to carry:

Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
for any minor changes.

Thanks and Regards
- Raghu
Ankur Arora Sept. 5, 2023, 7:36 p.m. UTC | #3
Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 8/31/2023 12:19 AM, Ankur Arora wrote:
>> This series adds a multi-page clearing primitive, clear_pages(),
>> which enables more effective use of x86 string instructions by
>> advertising the real region-size to be cleared.
>> Region-size can be used as a hint by uarchs to optimize the
>> clearing.
>> Also add allow_resched() which marks a code-section as allowing
>> rescheduling in the irqentry_exit path. This allows clear_pages()
>> to get by without having to call cond_resched() periodically.
>> (preempt_model_full() already handles this via
>> irqentry_exit_cond_resched(), so we handle this similarly for
>> preempt_model_none() and preempt_model_voluntary().)
>>
>
> Hello Ankur,
> Thansk for the patches.
>
> I tried the patches, Improvements look similar to V1 (even without
> circuitous chunk optimizations.)

Thanks for testing, Raghu.

> STill we see similar 50-60% improvement for 1G and 2M page sizes.
>
> SUT: Bergamo
>     CPU family:          25
>     Model:               160
>     Thread(s) per core:  2
>     Core(s) per socket:  128
>     Socket(s):           2
>
> NUMA:
>   NUMA node(s):          2
>   NUMA node0 CPU(s):     0-127,256-383
>   NUMA node1 CPU(s):     128-255,384-511
>
> Test: Use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA node0), for
> both base-hugepage-size=2M and 1GB.
> Current result is with thp = always, but madv also did not make much difference.
> perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
>
> time in seconds elapsed (average of 10 runs) (lower = better)
>
> Result:
> base: mm/clear_huge_page
> patched: x86/clear_huge_page
>
> page-size  base       patched     Improvement %
> 2M         5.0779     2.50623     50.64
> 1G         2.50623    1.012439    59.60

Seems like Bergamo improves over Milan, both for 4K BW and for
extent=2MB/extent=1GB.

> Please feel free to carry:
>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
> for any minor changes.

Thank you. Will add.

--
ankur
Ankur Arora Sept. 5, 2023, 10:14 p.m. UTC | #4
Mateusz Guzik <mjguzik@gmail.com> writes:

> On Wed, Aug 30, 2023 at 11:49:49AM -0700, Ankur Arora wrote:
>> This series adds a multi-page clearing primitive, clear_pages(),
>> which enables more effective use of x86 string instructions by
>> advertising the real region-size to be cleared.
>>
>> Region-size can be used as a hint by uarchs to optimize the
>> clearing.
>>
>> Also add allow_resched() which marks a code-section as allowing
>> rescheduling in the irqentry_exit path. This allows clear_pages()
>> to get by without having to call cond_resched() periodically.
>> (preempt_model_full() already handles this via
>> irqentry_exit_cond_resched(), so we handle this similarly for
>> preempt_model_none() and preempt_model_voluntary().)
>>
>> Performance
>> ==
>>
>> With this, demand-fault performance gets a decent increase:
>>
>>   *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>>                           (GB/s)                (GB/s)
>>
>>   pg-sz=2MB                14.55                 19.29    +32.5%
>>   pg-sz=1GB                19.34                 49.60   +156.4%
>>
>> Milan (and some other AMD Zen uarchs tested) take advantage of the
>> hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
>> this optimization seems to be at around region-size > LLC-size so
>> the pg-sz=2MB load still allocates cachelines.
>>
>
> Have you benchmarked clzero? It is an AMD-specific instruction issuing
> non-temporal stores. It is definitely something to try out for 1G pages.

Thanks for the suggestion. Been a little while, but see the numbers here:
https://lore.kernel.org/linux-mm/20220606203725.1313715-15-ankur.a.arora@oracle.com/

> One would think rep stosq has to be at least not worse since the CPU is
> explicitly told what to do and is free to optimize it however it sees
> fit, but the rep prefix has a long history of underperforming.

I agree that historically REP variants have been all over the place.
But, if you look at the numbers, REP; STOS and CLZERO are pretty close,
at least for the current generation of AMD uarchs.

Now, current uarch performance is no guarantee for future uarchs, but
if the kernel uses REP; STOS in performance paths, then hopefully
they'll also show up in internal CPU regression benchmarks, which might
mean that the high performance persists.

That said, I think using CLZERO/MOVNT is a good idea -- though as a
fallback option, or where it is better to send an explicit hint while,
say, clearing a 2MB region.


Thanks
Ankur

> I'm not saying it is going to be better, but that this should be tested,
> albeit one can easily argue this can be done at a later date.
>
>
> I would do it myself but my access to AMD CPUs is limited.
>
>>
>>   *Icelakex*  mm/clear_huge_page   x86/clear_huge_page   change
>>                           (GB/s)                (GB/s)
>>
>>   pg-sz=2MB                 9.19                 12.94   +40.8%
>>   pg-sz=1GB                 9.36                 12.97   +38.5%
>>
>> Icelakex sees a decent improvement in performance but continues to
>> allocate cachelines for both region-sizes.
>>
>>
>> Negative: there is a downside to clearing in larger chunks: the
>> current upstream approach clears a page at a time, narrowing towards
>> the faulting subpage. This has better cache characteristics for
>> some sequential access workloads where subpages near the faulting
>> page have a greater likelihood of access.
>>
>> I'm not sure if there are real workloads which care about this access
>> pattern, but one example is the vm-scalability/case-anon-w-seq-hugetlb
>> test. This test starts a process for each online CPU, with each
>> process writing sequentially to its set of hugepages.
>>
>> The bottleneck here is the memory pipe, so the improvement in
>> stime is limited; and because the clearing is less cache-optimal
>> now, utime suffers from worse user cache misses.
>>
>>   *Icelakex*               mm/clear_huge_page  x86/clear_huge_page  change
>>   (tasks=128, mem=4GB/task)
>>
>>   stime                        286.8 +- 3.6%      243.9 +- 4.1%     -14.9%
>>   utime                        497.7 +- 4.1%      553.5 +- 2.0%     +11.2%
>>   wall-clock                     6.9 +- 2.8%        7.0 +- 1.4%     + 1.4%
>>
>>
>>   *Milan*                  mm/clear_huge_page  x86/clear_huge_page  change
>>   (mem=1GB/task, tasks=512)
>>
>>   stime                        501.3 +- 1.4%      498.0 +- 0.9%      -0.5%
>>   utime                        298.7 +- 1.1%      335.0 +- 2.2%     +12.1%
>>   wall-clock                     3.5 +- 2.8%        3.8 +- 2.6%      +8.5%
>>
>> The same test performs better if we have a smaller number of processes,
>> since there is more backend BW available, and thus the improved stime
>> compensates for the worse utime.
>>
>> This could be improved by using more circuitous chunking (somewhat
>> like this:
>> https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/).
>> But I'm not sure if it is worth doing. Opinions?
>>
>> Patches
>> ==
>>
>> Patches 1, 2, 3:
>>   "mm/clear_huge_page: allow arch override for clear_huge_page()",
>>   "mm/huge_page: separate clear_huge_page() and copy_huge_page()",
>>   "mm/huge_page: cleanup clear_/copy_subpage()"
>> are minor. The first one allows clear_huge_page() to have an
>> arch specific version and the other two are mechanical cleanup
>> patches.
>>
>> Patches 4, 5, 6:
>>   "x86/clear_page: extend clear_page*() for multi-page clearing",
>>   "x86/clear_page: add clear_pages()",
>>   "x86/clear_huge_page: multi-page clearing"
>> define the x86-specific clear_pages() and clear_huge_page().
>>
>> Patches 7, 8:
>>   "sched: define TIF_ALLOW_RESCHED",
>>   "irqentry: define irqentry_exit_allow_resched()"
>> define allow_resched() to demarcate preemptible sections.
>>
>> This gets used in patch 9:
>>   "x86/clear_huge_page: make clear_contig_region() preemptible".
>>
>> Changelog:
>>
>> v2:
>>   - Addressed review comments from peterz, tglx.
>>   - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
>>   - General code cleanup
>>
>> Also at:
>>   github.com/terminus/linux clear-pages.v2
>>
>> Comments appreciated!
>>
>> Ankur Arora (9):
>>   mm/clear_huge_page: allow arch override for clear_huge_page()
>>   mm/huge_page: separate clear_huge_page() and copy_huge_page()
>>   mm/huge_page: cleanup clear_/copy_subpage()
>>   x86/clear_page: extend clear_page*() for multi-page clearing
>>   x86/clear_page: add clear_pages()
>>   x86/clear_huge_page: multi-page clearing
>>   sched: define TIF_ALLOW_RESCHED
>>   irqentry: define irqentry_exit_allow_resched()
>>   x86/clear_huge_page: make clear_contig_region() preemptible
>>
>>  arch/x86/include/asm/page_64.h     |  27 +++--
>>  arch/x86/include/asm/thread_info.h |   2 +
>>  arch/x86/lib/clear_page_64.S       |  52 ++++++---
>>  arch/x86/mm/hugetlbpage.c          |  59 ++++++++++
>>  include/linux/entry-common.h       |  13 +++
>>  include/linux/sched.h              |  30 +++++
>>  kernel/entry/common.c              |  13 ++-
>>  kernel/sched/core.c                |  32 ++---
>>  mm/memory.c                        | 181 +++++++++++++++++------------
>>  9 files changed, 297 insertions(+), 112 deletions(-)
>>
>> --
>> 2.31.1
>>
>>


--
ankur
Raghavendra K T Sept. 8, 2023, 2:18 a.m. UTC | #5
On 9/3/2023 1:44 PM, Mateusz Guzik wrote:
> On Wed, Aug 30, 2023 at 11:49:49AM -0700, Ankur Arora wrote:
>> This series adds a multi-page clearing primitive, clear_pages(),
>> which enables more effective use of x86 string instructions by
>> advertising the real region-size to be cleared.
>>
>> Region-size can be used as a hint by uarchs to optimize the
>> clearing.
>>
>> Also add allow_resched() which marks a code-section as allowing
>> rescheduling in the irqentry_exit path. This allows clear_pages()
>> to get by without having to call cond_resched() periodically.
>> (preempt_model_full() already handles this via
>> irqentry_exit_cond_resched(), so we handle this similarly for
>> preempt_model_none() and preempt_model_voluntary().)
>>
>> Performance
>> ==
>>
>> With this, demand-fault performance gets a decent increase:
>>
>>    *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>>                            (GB/s)                (GB/s)
>>                                                                     
>>    pg-sz=2MB                14.55                 19.29    +32.5%
>>    pg-sz=1GB                19.34                 49.60   +156.4%
>>
>> Milan (and some other AMD Zen uarchs tested) take advantage of the
>> hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
>> this optimization seems to be at around region-size > LLC-size so
>> the pg-sz=2MB load still allocates cachelines.
>>
> 
> Have you benchmarked clzero? It is an AMD-specific instruction issuing
> non-temporal stores. It is definitely something to try out for 1G pages.
> 
> One would think rep stosq has to be at least not worse since the CPU is
> explicitly told what to do and is free to optimize it however it sees
> fit, but the rep prefix has a long history of underperforming.
> 
> I'm not saying it is going to be better, but that this should be tested,
> albeit one can easily argue this can be done at a later date.
> 
> I would do it myself but my access to AMD CPUs is limited.
> 

Hello Mateusz,

I plugged in CLZERO unconditionally (even for the coherent path, with
sfence) for my earlier experiments on top of this series.
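
Something like the following (an illustrative sketch of the
experiment, not the actual diff):

	/*
	 * Illustrative only: CLZERO zeroes the cacheline at the address
	 * in rAX with non-temporal stores, so walk the region one
	 * cacheline at a time and fence before handing the page out.
	 */
	static void clear_pages_clzero(void *addr, unsigned int npages)
	{
		void *end = addr + (unsigned long)npages * PAGE_SIZE;
		unsigned int line = boot_cpu_data.x86_clflush_size;

		for (; addr < end; addr += line)
			asm volatile("clzero" : : "a" (addr) : "memory");

		/* order the non-temporal stores */
		asm volatile("sfence" : : : "memory");
	}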

Test: Use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA node0),
for both base-hugepage-size=2M and 1GB.

perf stat -r 10 -d -d numactl -m 0 -N 0 <test>

SUT: AMD Bergamo with 2 nodes/2 sockets, 128 cores per socket.

From that I see the time taken is:
for 2M:  1.092125
for 1G:  0.997661

So overall, the results for the 64GB-region experiment look like this:
Time taken for 64GB region, (lesser = better)

  page-size  base       patched (gain%)    patched-clzero (gain%)
  2M         5.0779     2.50623  (50.64)          1.092125 (78)
  1G         2.50623    1.012439 (59.60)          0.997661 (60)

In summary, I see further improvement even for the 2M base size (2.5x).

Overall CLZERO clearing is promising. But we may need threshold tuning
and hint passing as done in Ankur's earlier series:
https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/
on top of the current series.

I need to experiment further with different chunk sizes as well as
base sizes (both clzero and rep stos).

Thanks and Regards
- Raghu

Run Details:
  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

             996.34 msec task-clock                #    0.999 CPUs utilized             ( +-  0.02% )
                  2      context-switches          #    2.007 /sec                      ( +- 21.34% )
                  0      cpu-migrations            #    0.000 /sec
                212      page-faults               #  212.735 /sec                      ( +-  0.20% )
      3,116,497,471      cycles                    #    3.127 GHz                       ( +-  0.02% )  (35.66%)
            100,343      stalled-cycles-frontend   #    0.00% frontend cycles idle      ( +- 16.85% )  (35.75%)
          1,369,118      stalled-cycles-backend    #    0.04% backend cycles idle       ( +-  3.45% )  (35.86%)
      4,325,987,025      instructions              #    1.39  insn per cycle
                                                   #    0.00  stalled cycles per insn   ( +-  0.02% )  (35.87%)
      1,078,119,163      branches                  #    1.082 G/sec                     ( +-  0.01% )  (35.87%)
             87,907      branch-misses             #    0.01% of all branches           ( +-  5.22% )  (35.83%)
         12,337,100      L1-dcache-loads           #   12.380 M/sec                     ( +-  5.44% )  (35.74%)
            280,300      L1-dcache-load-misses     #    2.48% of all L1-dcache accesses ( +-  5.74% )  (35.64%)
          1,464,549      L1-icache-loads           #    1.470 M/sec                     ( +-  1.61% )  (35.63%)
             30,659      L1-icache-load-misses     #    2.12% of all L1-icache accesses ( +-  3.30% )  (35.62%)
             17,366      dTLB-loads                #   17.426 K/sec                     ( +-  5.52% )  (35.63%)
             11,774      dTLB-load-misses          #   81.79% of all dTLB cache accesses ( +-  7.94% )  (35.63%)
                  0      iTLB-loads                #    0.000 /sec                      (35.63%)
                  2      iTLB-load-misses          #    0.00% of all iTLB cache accesses ( +-342.39% )  (35.64%)

           0.997661 +- 0.000150 seconds time elapsed  ( +-  0.02% )


  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb' (10 runs):

           1,089.97 msec task-clock                #    0.998 CPUs utilized             ( +-  0.03% )
                  3      context-switches          #    2.750 /sec                      ( +- 15.11% )
                  0      cpu-migrations            #    0.000 /sec
             32,917      page-faults               #   30.172 K/sec                     ( +-  0.00% )
      3,408,713,422      cycles                    #    3.124 GHz                       ( +-  0.03% )  (35.60%)
            982,417      stalled-cycles-frontend   #    0.03% frontend cycles idle      ( +-  2.77% )  (35.60%)
          8,495,409      stalled-cycles-backend    #    0.25% backend cycles idle       ( +-  6.12% )  (35.59%)
      4,970,939,278      instructions              #    1.46  insn per cycle
                                                   #    0.00  stalled cycles per insn   ( +-  0.04% )  (35.64%)
      1,196,644,653      branches                  #    1.097 G/sec                     ( +-  0.03% )  (35.73%)
            196,584      branch-misses             #    0.02% of all branches           ( +-  2.79% )  (35.78%)
        226,254,284      L1-dcache-loads           #  207.388 M/sec                     ( +-  0.23% )  (35.78%)
          1,161,607      L1-dcache-load-misses     #    0.52% of all L1-dcache accesses ( +-  3.27% )  (35.78%)
         21,757,775      L1-icache-loads           #   19.943 M/sec                     ( +-  0.66% )  (35.77%)
            165,503      L1-icache-load-misses     #    0.78% of all L1-icache accesses ( +-  3.11% )  (35.78%)
          1,118,573      dTLB-loads                #    1.025 M/sec                     ( +-  1.38% )  (35.78%)
            415,943      dTLB-load-misses          #   37.10% of all dTLB cache accesses ( +-  1.12% )  (35.78%)
                 36      iTLB-loads                #   32.998 /sec                      ( +- 18.47% )  (35.74%)
             49,785      iTLB-load-misses          # 270570.65% of all iTLB cache accesses ( +-  0.34% )  (35.65%)

           1.092125 +- 0.000350 seconds time elapsed  ( +-  0.03% )