[v2] mm, slub: Use prefetchw instead of prefetch

Message ID

20211011144331.70084-1-42.hyeyoo@gmail.com (mailing list archive)

State

New

Headers

DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 5E5E5603E9
From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
To: linux-mm@kvack.org
Cc: 42.hyeyoo@gmail.com,
	Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2] mm, slub: Use prefetchw instead of prefetch
Date: Mon, 11 Oct 2021 14:43:31 +0000
Message-Id: <20211011144331.70084-1-42.hyeyoo@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

[v2] mm, slub: Use prefetchw instead of prefetch

Commit Message

Hyeonggon Yoo Oct. 11, 2021, 2:43 p.m. UTC

Commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
slab_alloc()") introduced prefetch_freepointer() because, when other CPUs
free objects into a page that the current CPU owns, the freelist link is
hot on the CPUs that freed the objects and possibly very cold on the
current CPU.

But if the freelist link chain is hot on the CPUs that freed the objects,
it is better to invalidate their copies of that chain, because they are
not going to access it again within a short time.

So use prefetchw instead of prefetch. On supported architectures such as
x86 and arm, prefetchw invalidates other cached copies of a cache line
when prefetching it.
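
For illustration only, here is a minimal userspace sketch of the idea, not
the kernel code. It assumes a GCC- or Clang-compatible compiler, where
__builtin_prefetch(addr, 1) is a prefetch-for-write hint. The toy_* names
and struct toy_cache are hypothetical stand-ins for the SLUB freelist
helpers; only the free-pointer offset is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kmem_cache: only the offset of the
 * free pointer inside each object is modelled. */
struct toy_cache {
	size_t offset;
};

/* Read the free pointer stored inside a free object. */
static void *toy_get_freepointer(const struct toy_cache *s, void *object)
{
	void *next;

	memcpy(&next, (char *)object + s->offset, sizeof(next));
	return next;
}

/* Store the free pointer inside a free object. */
static void toy_set_freepointer(const struct toy_cache *s, void *object,
				void *next)
{
	memcpy((char *)object + s->offset, &next, sizeof(next));
}

/* Hint that the line holding the free pointer is about to be written.
 * The second argument of __builtin_prefetch is 0 for read, 1 for write. */
static void toy_prefetch_freepointer(const struct toy_cache *s, void *object)
{
	__builtin_prefetch((char *)object + s->offset, 1);
}

/* Push an object onto the freelist. */
static void toy_free(const struct toy_cache *s, void **freelist, void *object)
{
	assert(object != NULL);
	toy_set_freepointer(s, object, *freelist);
	*freelist = object;
}

/* Pop an object and prefetch the link of the object that follows it. */
static void *toy_alloc(const struct toy_cache *s, void **freelist)
{
	void *object = *freelist;

	if (!object)
		return NULL;
	*freelist = toy_get_freepointer(s, object);
	if (*freelist)
		toy_prefetch_freepointer(s, *freelist);
	return object;
}
```

The prefetch is only a hint: the sketch behaves identically without it,
and whether the write variant helps depends on the CPU and the workload.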

Before:

Time: 91.677

 Performance counter stats for 'hackbench -g 100 -l 10000':
        1462938.07 msec cpu-clock                 #   15.908 CPUs utilized
          18072550      context-switches          #   12.354 K/sec
           1018814      cpu-migrations            #  696.416 /sec
            104558      page-faults               #   71.471 /sec
     1580035699271      cycles                    #    1.080 GHz                      (54.51%)
     2003670016013      instructions              #    1.27  insn per cycle           (54.31%)
        5702204863      branch-misses                                                 (54.28%)
      643368500985      cache-references          #  439.778 M/sec                    (54.26%)
       18475582235      cache-misses              #    2.872 % of all cache refs      (54.28%)
      642206796636      L1-dcache-loads           #  438.984 M/sec                    (46.87%)
       18215813147      L1-dcache-load-misses     #    2.84% of all L1-dcache accesses  (46.83%)
      653842996501      dTLB-loads                #  446.938 M/sec                    (46.63%)
        3227179675      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.85%)
      537531951350      iTLB-loads                #  367.433 M/sec                    (54.33%)
         114750630      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.37%)
      630135543177      L1-icache-loads           #  430.733 M/sec                    (46.80%)
       22923237620      L1-icache-load-misses     #    3.64% of all L1-icache accesses  (46.76%)

      91.964452802 seconds time elapsed

      43.416742000 seconds user
    1422.441123000 seconds sys

After:

Time: 90.220

 Performance counter stats for 'hackbench -g 100 -l 10000':
        1437418.48 msec cpu-clock                 #   15.880 CPUs utilized
          17694068      context-switches          #   12.310 K/sec
            958257      cpu-migrations            #  666.651 /sec
            100604      page-faults               #   69.989 /sec
     1583259429428      cycles                    #    1.101 GHz                      (54.57%)
     2004002484935      instructions              #    1.27  insn per cycle           (54.37%)
        5594202389      branch-misses                                                 (54.36%)
      643113574524      cache-references          #  447.409 M/sec                    (54.39%)
       18233791870      cache-misses              #    2.835 % of all cache refs      (54.37%)
      640205852062      L1-dcache-loads           #  445.386 M/sec                    (46.75%)
       17968160377      L1-dcache-load-misses     #    2.81% of all L1-dcache accesses  (46.79%)
      651747432274      dTLB-loads                #  453.415 M/sec                    (46.59%)
        3127124271      dTLB-load-misses          #    0.48% of all dTLB cache accesses  (46.75%)
      535395273064      iTLB-loads                #  372.470 M/sec                    (54.38%)
         113500056      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
      628871845924      L1-icache-loads           #  437.501 M/sec                    (46.80%)
       22585641203      L1-icache-load-misses     #    3.59% of all L1-icache accesses  (46.79%)

      90.514819303 seconds time elapsed

      43.877656000 seconds user
    1397.176001000 seconds sys
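
To put the two sets of figures side by side, the relative change of a few
counters can be computed as in the sketch below. The helper names
(pct_change, report, report_all) are hypothetical, and the inputs are
copied from the Before and After output above. The figures shown give one
set of numbers per kernel, so the result says nothing about run-to-run
variance.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/* Print one row of the comparison. */
static void report(const char *name, double before, double after)
{
	printf("%-22s %+7.2f%%\n", name, pct_change(before, after));
}

/* Compare selected counters from the perf stat output above. */
static void report_all(void)
{
	report("time elapsed (s)", 91.964452802, 90.514819303);
	report("sys time (s)", 1422.441123, 1397.176001);
	report("cycles", 1580035699271.0, 1583259429428.0);
	report("cache-misses", 18475582235.0, 18233791870.0);
	report("L1-dcache-load-misses", 18215813147.0, 17968160377.0);
}
```

Calling report_all() from a main() prints one line per counter; a
negative percentage means the value is lower with prefetchw.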

Link: https://lkml.org/lkml/2021/10/8/598 
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Hyeonggon Yoo Oct. 16, 2021, 11:38 a.m. UTC | #1

Andrew, can you please update the patch to v2?

Vlastimil Babka Oct. 19, 2021, 7:11 a.m. UTC | #2

On 10/11/21 16:43, Hyeonggon Yoo wrote:
> commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
> slab_alloc()") introduced prefetch_freepointer() because when other cpu(s)
> freed objects into a page that current cpu owns, the freelist link is
> hot on cpu(s) which freed objects and possibly very cold on current cpu.
> 
> But if freelist link chain is hot on cpu(s) which freed objects,
> it's better to invalidate that chain because they're not going to access
> again within a short time.
> 
> So use prefetchw instead of prefetch. On supported architectures like x86
> and arm, it invalidates other copied instances of a cache line when
> prefetching it.
> 
> Before:
> 
> Time: 91.677
> 
>  Performance counter stats for 'hackbench -g 100 -l 10000':
>         1462938.07 msec cpu-clock                 #   15.908 CPUs utilized
>           18072550      context-switches          #   12.354 K/sec
>            1018814      cpu-migrations            #  696.416 /sec
>             104558      page-faults               #   71.471 /sec
>      1580035699271      cycles                    #    1.080 GHz                      (54.51%)
>      2003670016013      instructions              #    1.27  insn per cycle           (54.31%)
>         5702204863      branch-misses                                                 (54.28%)
>       643368500985      cache-references          #  439.778 M/sec                    (54.26%)
>        18475582235      cache-misses              #    2.872 % of all cache refs      (54.28%)
>       642206796636      L1-dcache-loads           #  438.984 M/sec                    (46.87%)
>        18215813147      L1-dcache-load-misses     #    2.84% of all L1-dcache accesses  (46.83%)
>       653842996501      dTLB-loads                #  446.938 M/sec                    (46.63%)
>         3227179675      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.85%)
>       537531951350      iTLB-loads                #  367.433 M/sec                    (54.33%)
>          114750630      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.37%)
>       630135543177      L1-icache-loads           #  430.733 M/sec                    (46.80%)
>        22923237620      L1-icache-load-misses     #    3.64% of all L1-icache accesses  (46.76%)
> 
>       91.964452802 seconds time elapsed
> 
>       43.416742000 seconds user
>     1422.441123000 seconds sys
> 
> After:
> 
> Time: 90.220
> 
>  Performance counter stats for 'hackbench -g 100 -l 10000':
>         1437418.48 msec cpu-clock                 #   15.880 CPUs utilized
>           17694068      context-switches          #   12.310 K/sec
>             958257      cpu-migrations            #  666.651 /sec
>             100604      page-faults               #   69.989 /sec
>      1583259429428      cycles                    #    1.101 GHz                      (54.57%)
>      2004002484935      instructions              #    1.27  insn per cycle           (54.37%)
>         5594202389      branch-misses                                                 (54.36%)
>       643113574524      cache-references          #  447.409 M/sec                    (54.39%)
>        18233791870      cache-misses              #    2.835 % of all cache refs      (54.37%)
>       640205852062      L1-dcache-loads           #  445.386 M/sec                    (46.75%)
>        17968160377      L1-dcache-load-misses     #    2.81% of all L1-dcache accesses  (46.79%)
>       651747432274      dTLB-loads                #  453.415 M/sec                    (46.59%)
>         3127124271      dTLB-load-misses          #    0.48% of all dTLB cache accesses  (46.75%)
>       535395273064      iTLB-loads                #  372.470 M/sec                    (54.38%)
>          113500056      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
>       628871845924      L1-icache-loads           #  437.501 M/sec                    (46.80%)
>        22585641203      L1-icache-load-misses     #    3.59% of all L1-icache accesses  (46.79%)
> 
>       90.514819303 seconds time elapsed
> 
>       43.877656000 seconds user
>     1397.176001000 seconds sys

I wouldn't expect such a noticeable difference. Maybe it would diminish
when repeating the run and taking an average. But I guess it's at least
not worse with prefetchw, so...

> Link: https://lkml.org/lkml/2021/10/8/598 
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/slub.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 3d2025f7163b..ce3d8b11215c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -354,7 +354,7 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
>  
>  static void prefetch_freepointer(const struct kmem_cache *s, void *object)
>  {
> -	prefetch(object + s->offset);
> +	prefetchw(object + s->offset);
>  }
>  
>  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
>

diff --git a/mm/slub.c b/mm/slub.c
index 3d2025f7163b..ce3d8b11215c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -354,7 +354,7 @@  static inline void *get_freepointer(struct kmem_cache *s, void *object)
 
 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
 {
-	prefetch(object + s->offset);
+	prefetchw(object + s->offset);
 }
 
 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
