
[9/9] x86/clear_huge_page: make clear_contig_region() preemptible

Message ID 20230403052233.1880567-10-ankur.a.arora@oracle.com (mailing list archive)
State New
Series x86/clear_huge_page: multi-page clearing

Commit Message

Ankur Arora April 3, 2023, 5:22 a.m. UTC
clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
chunk. Allow preemption in the irqentry_exit path to make sure we don't
hold on to the CPU for an arbitrarily long period.

Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
hugetlb-2mb region, and then writes sequentially to the region, demand
faulting pages on the way.
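
For illustration only (an editor's sketch, not the actual vm-scalability
harness; the function name and sizes are made up): the access pattern is
roughly an anonymous MAP_HUGETLB mapping written sequentially, so every
new 2MB huge page is demand-faulted and cleared by the kernel on first
touch.

  #define _GNU_SOURCE
  #include <string.h>
  #include <sys/mman.h>

  /* Rough sketch of the benchmark's access pattern; error handling elided. */
  static void touch_hugetlb_region(size_t bytes)
  {
  	char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
  		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

  	if (p == MAP_FAILED)
  		return;

  	/*
  	 * Sequential writes: each store to a not-yet-populated huge page
  	 * triggers a fault, and the kernel clears the whole 2MB page
  	 * (the path this series modifies) before the write completes.
  	 */
  	memset(p, 1, bytes);

  	munmap(p, bytes);
  }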

This test, run with a CONFIG_PREEMPT_VOLUNTARY config, shows the effects
of this change: stime drops (~18% on Icelakex, ~5% on Milan), while utime
goes up (~15% on Icelakex, ~13% on Milan).

  *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
  (mem=4GB/task, tasks=128)

  stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
  utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
  wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%



  *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
  (mem=1GB/task, tasks=512)

  stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
  utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
  wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%

The drop in stime is due to REP; STOS being more efficient for bigger
extents.  The increase in utime is due to the cache effects of that change:
mm/clear_huge_page() clears a page at a time, narrowing towards the
faulting page, whereas x86/clear_huge_page() only optimizes for cache
locality in the local neighbourhood of the faulting address.
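
As an illustrative aside (not part of this patch or series): a minimal,
hypothetical sketch of what clearing an extent with a single REP STOS
looks like, which is roughly why larger extents cost less stime than
page-at-a-time clears. The function name is made up; the series' actual
clearing primitive is the clear_user_pages() call visible in the diff
below.

  /*
   * Illustrative only: clear @len bytes at @dst with one REP STOSB.
   * A single string store over a large extent lets the CPU stay in its
   * fast-string path instead of restarting a clear loop per 4KB page.
   */
  static inline void rep_stosb_clear(void *dst, unsigned long len)
  {
  	asm volatile("rep stosb"
  		     : "+D" (dst), "+c" (len)
  		     : "a" (0)
  		     : "memory");
  }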

This effect on utime is visible in the increased L1-dcache-load-misses
and LLC-load* counts, and in an increased backend boundedness reported by
perf stat --all-user on Icelakex. The effect is slight, but given the
heavy cache pressure generated by the test, it shows up as a drop in
user IPC:

    -  9,455,243,414,829      instructions                     #    2.75  insn per cycle              ( +- 14.14% )  (46.17%)
    -  2,367,920,864,112      L1-dcache-loads                  #    1.054 G/sec                       ( +- 14.14% )  (69.24%)
    -     42,075,182,813      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
    -         20,365,688      LLC-loads                        #    9.064 K/sec                       ( +- 13.98% )  (69.24%)
    -            890,382      LLC-load-misses                  #    7.18% of all LL-cache accesses    ( +- 14.91% )  (69.24%)

    +  9,467,796,660,698      instructions                     #    2.37  insn per cycle              ( +- 14.14% )  (46.16%)
    +  2,369,973,307,561      L1-dcache-loads                  #    1.027 G/sec                       ( +- 14.14% )  (69.24%)
    +     42,155,621,201      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
    +         22,116,300      LLC-loads                        #    9.588 K/sec                       ( +- 14.20% )  (69.24%)
    +          1,355,607      LLC-load-misses                  #   10.29% of all LL-cache accesses    ( +- 15.49% )  (69.25%)

Given that stime improves for all loads using this path, while the
utime regression is load dependent, add this change.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/mm/hugetlbpage.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Peter Zijlstra April 5, 2023, 8:27 p.m. UTC | #1
On Sun, Apr 02, 2023 at 10:22:33PM -0700, Ankur Arora wrote:
> clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
> chunk Allow preemption in the irqentry_exit path to make sure we don't
> hold on to the CPU for an arbitrarily long period.
> 
> Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
> hugetlb-2mb region, and then writes sequentially to the region, demand
> faulting pages on the way.
> 
> This test, with a CONFIG_VOLUNTARY config shows the effects of this
> change: stime drops (~18% on Icelakex, ~5% on Milan), while the utime
> goes up (~15% on Icelakex, ~13% on Milan.)
> 
>   *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
>   (mem=4GB/task, tasks=128)
> 
>   stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
>   utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
>   wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%
> 
> 
> 
>   *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
>   (mem=1GB/task, tasks=512)
> 
>   stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
>   utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
>   wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%
> 
> The drop in stime is due to REP; STOS being more efficient for bigger
> extents.  The increase in utime is due to cache effects of that change:
> mm/clear_huge_page() clears page-at-a-time, while narrowing towards the
> faulting page; while x86/clear_huge_page only optimizes for cache
> locality in the local neighbourhood of the faulting address.
> 
> This effect on utime is visible via the increased L1-dcache-load-misses
> and LLC-load* and an increased backend boundedness for perf user-stat
> --all-user on Icelakex. The effect is slight but given the heavy cache
> pressure generated by the test, shows up in the drop in user IPC:
> 
>     -  9,455,243,414,829      instructions                     #    2.75  insn per cycle              ( +- 14.14% )  (46.17%)
>     -  2,367,920,864,112      L1-dcache-loads                  #    1.054 G/sec                       ( +- 14.14% )  (69.24%)
>     -     42,075,182,813      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
>     -         20,365,688      LLC-loads                        #    9.064 K/sec                       ( +- 13.98% )  (69.24%)
>     -            890,382      LLC-load-misses                  #    7.18% of all LL-cache accesses    ( +- 14.91% )  (69.24%)
> 
>     +  9,467,796,660,698      instructions                     #    2.37  insn per cycle              ( +- 14.14% )  (46.16%)
>     +  2,369,973,307,561      L1-dcache-loads                  #    1.027 G/sec                       ( +- 14.14% )  (69.24%)
>     +     42,155,621,201      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
>     +         22,116,300      LLC-loads                        #    9.588 K/sec                       ( +- 14.20% )  (69.24%)
>     +          1,355,607      LLC-load-misses                  #   10.29% of all LL-cache accesses    ( +- 15.49% )  (69.25%)
> 
> Given the fact that the stime improves for all loads using this path,
> while the utime drop is load dependent add this change.

Either I really need sleep, or *NONE* of the above is actually relevant
to what the patch below actually does!

The above talks about the glories of using large clears, while the patch
allows reschedules which are about latency.

> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/mm/hugetlbpage.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 4294b77c4f18..c8564b0552e5 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>  static void clear_contig_region(struct page *page, unsigned long vaddr,
>  				unsigned int npages)
>  {
> +	might_sleep();
> +
> +	/*
> +	 * We might be clearing a large region.
> +	 * Allow rescheduling.
> +	 */
> +	allow_resched();
>  	clear_user_pages(page_address(page), vaddr, page, npages);
> +	disallow_resched();
> +
> +	cond_resched();
>  }
>  
>  void clear_huge_page(struct page *page,
> -- 
> 2.31.1
>
Ankur Arora April 6, 2023, 5 p.m. UTC | #2
Peter Zijlstra <peterz@infradead.org> writes:

> On Sun, Apr 02, 2023 at 10:22:33PM -0700, Ankur Arora wrote:
>> clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
>> chunk Allow preemption in the irqentry_exit path to make sure we don't
>> hold on to the CPU for an arbitrarily long period.
>>
>> Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
>> hugetlb-2mb region, and then writes sequentially to the region, demand
>> faulting pages on the way.
>>
>> This test, with a CONFIG_VOLUNTARY config shows the effects of this
>> change: stime drops (~18% on Icelakex, ~5% on Milan), while the utime
>> goes up (~15% on Icelakex, ~13% on Milan.)
>>
>>   *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
>>   (mem=4GB/task, tasks=128)
>>
>>   stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
>>   utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
>>   wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%
>>
>>
>>
>>   *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
>>   (mem=1GB/task, tasks=512)
>>
>>   stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
>>   utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
>>   wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%
>>
>> The drop in stime is due to REP; STOS being more efficient for bigger
>> extents.  The increase in utime is due to cache effects of that change:
>> mm/clear_huge_page() clears page-at-a-time, while narrowing towards the
>> faulting page; while x86/clear_huge_page only optimizes for cache
>> locality in the local neighbourhood of the faulting address.
>>
>> This effect on utime is visible via the increased L1-dcache-load-misses
>> and LLC-load* and an increased backend boundedness for perf user-stat
>> --all-user on Icelakex. The effect is slight but given the heavy cache
>> pressure generated by the test, shows up in the drop in user IPC:
>>
>>     -  9,455,243,414,829      instructions                     #    2.75  insn per cycle              ( +- 14.14% )  (46.17%)
>>     -  2,367,920,864,112      L1-dcache-loads                  #    1.054 G/sec                       ( +- 14.14% )  (69.24%)
>>     -     42,075,182,813      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
>>     -         20,365,688      LLC-loads                        #    9.064 K/sec                       ( +- 13.98% )  (69.24%)
>>     -            890,382      LLC-load-misses                  #    7.18% of all LL-cache accesses    ( +- 14.91% )  (69.24%)
>>
>>     +  9,467,796,660,698      instructions                     #    2.37  insn per cycle              ( +- 14.14% )  (46.16%)
>>     +  2,369,973,307,561      L1-dcache-loads                  #    1.027 G/sec                       ( +- 14.14% )  (69.24%)
>>     +     42,155,621,201      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
>>     +         22,116,300      LLC-loads                        #    9.588 K/sec                       ( +- 14.20% )  (69.24%)
>>     +          1,355,607      LLC-load-misses                  #   10.29% of all LL-cache accesses    ( +- 15.49% )  (69.25%)
>>
>> Given the fact that the stime improves for all loads using this path,
>> while the utime drop is load dependent add this change.
>
> Either I really need sleep, or *NONE* of the above is actually relevant
> to what the patch below actually does!

Yeah, you are right about the relevance.

I wanted to provide two sets of stats: the raw memory BW stats and the
relevant vm-scalability workload. The vm-scalability workload needs a
more reasonable scheduling model than what is present before this patch,
so it seemed to make sense to put it here for that reason.
But yeah, it doesn't really fit here.

> The above talks about the glories of using large clears, while the patch
> allows reschedules which are about latency.

Yeah, let me find a more reasonable way to present these.

Ankur

>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/x86/mm/hugetlbpage.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
>> index 4294b77c4f18..c8564b0552e5 100644
>> --- a/arch/x86/mm/hugetlbpage.c
>> +++ b/arch/x86/mm/hugetlbpage.c
>> @@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>>  static void clear_contig_region(struct page *page, unsigned long vaddr,
>>  				unsigned int npages)
>>  {
>> +	might_sleep();
>> +
>> +	/*
>> +	 * We might be clearing a large region.
>> +	 * Allow rescheduling.
>> +	 */
>> +	allow_resched();
>>  	clear_user_pages(page_address(page), vaddr, page, npages);
>> +	disallow_resched();
>> +
>> +	cond_resched();
>>  }
>>
>>  void clear_huge_page(struct page *page,
>> --
>> 2.31.1
>>


--
ankur

Patch

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 4294b77c4f18..c8564b0552e5 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -158,7 +158,17 @@  hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 static void clear_contig_region(struct page *page, unsigned long vaddr,
 				unsigned int npages)
 {
+	might_sleep();
+
+	/*
+	 * We might be clearing a large region.
+	 * Allow rescheduling.
+	 */
+	allow_resched();
 	clear_user_pages(page_address(page), vaddr, page, npages);
+	disallow_resched();
+
+	cond_resched();
 }
 
 void clear_huge_page(struct page *page,
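
As a closing illustrative aside (an editor's sketch, not from the thread):
Peter's point is that the performance numbers argue for large clears, while
the patch itself is about latency. The conventional way to bound
non-preemptible latency without allow_resched() is to clear in bounded
chunks with cond_resched() between them, much as mm/clear_huge_page() does
per subpage. A minimal sketch, reusing the clear_user_pages() signature
from the patch; the chunk size, function name, and loop structure are
assumptions, not part of the series:

  /* Illustrative chunk size, not taken from the series. */
  #define CLEAR_NPAGES_CHUNK	32

  static void clear_contig_region_chunked(struct page *page, unsigned long vaddr,
  					unsigned int npages)
  {
  	unsigned int done = 0;

  	while (done < npages) {
  		unsigned int chunk = min(npages - done,
  					 (unsigned int)CLEAR_NPAGES_CHUNK);

  		/* Clear one bounded chunk with the series' primitive. */
  		clear_user_pages(page_address(page + done),
  				 vaddr + done * PAGE_SIZE,
  				 page + done, chunk);
  		done += chunk;

  		/* Keep the non-preemptible stretch short under voluntary preemption. */
  		cond_resched();
  	}
  }

The trade-off Peter alludes to: chunking like this caps latency but may
give up some of the single-extent REP STOS efficiency on very large clears,
which is what allow_resched() around one big clear_user_pages() call avoids.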