Message ID | 20211011144331.70084-1-42.hyeyoo@gmail.com (mailing list archive) |
---|---|
State | New |
Series | [v2] mm, slub: Use prefetchw instead of prefetch |
Andrew, can you please update the patch to v2?

On Mon, Oct 11, 2021 at 02:43:31PM +0000, Hyeonggon Yoo wrote:
> [...]
On 10/11/21 16:43, Hyeonggon Yoo wrote:
> commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
> slab_alloc()") introduced prefetch_freepointer() because when other cpu(s)
> freed objects into a page that current cpu owns, the freelist link is
> hot on cpu(s) which freed objects and possibly very cold on current cpu.
>
> But if freelist link chain is hot on cpu(s) which freed objects,
> it's better to invalidate that chain because they're not going to access
> again within a short time.
>
> So use prefetchw instead of prefetch. On supported architectures like x86
> and arm, it invalidates other copied instances of a cache line when
> prefetching it.
>
> Before:
>
> Time: 91.677
>
> Performance counter stats for 'hackbench -g 100 -l 10000':
>   1462938.07 msec cpu-clock # 15.908 CPUs utilized
>   18072550 context-switches # 12.354 K/sec
>   1018814 cpu-migrations # 696.416 /sec
>   104558 page-faults # 71.471 /sec
>   1580035699271 cycles # 1.080 GHz (54.51%)
>   2003670016013 instructions # 1.27 insn per cycle (54.31%)
>   5702204863 branch-misses (54.28%)
>   643368500985 cache-references # 439.778 M/sec (54.26%)
>   18475582235 cache-misses # 2.872 % of all cache refs (54.28%)
>   642206796636 L1-dcache-loads # 438.984 M/sec (46.87%)
>   18215813147 L1-dcache-load-misses # 2.84% of all L1-dcache accesses (46.83%)
>   653842996501 dTLB-loads # 446.938 M/sec (46.63%)
>   3227179675 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.85%)
>   537531951350 iTLB-loads # 367.433 M/sec (54.33%)
>   114750630 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.37%)
>   630135543177 L1-icache-loads # 430.733 M/sec (46.80%)
>   22923237620 L1-icache-load-misses # 3.64% of all L1-icache accesses (46.76%)
>
>   91.964452802 seconds time elapsed
>
>   43.416742000 seconds user
>   1422.441123000 seconds sys
>
> After:
>
> Time: 90.220
>
> Performance counter stats for 'hackbench -g 100 -l 10000':
>   1437418.48 msec cpu-clock # 15.880 CPUs utilized
>   17694068 context-switches # 12.310 K/sec
>   958257 cpu-migrations # 666.651 /sec
>   100604 page-faults # 69.989 /sec
>   1583259429428 cycles # 1.101 GHz (54.57%)
>   2004002484935 instructions # 1.27 insn per cycle (54.37%)
>   5594202389 branch-misses (54.36%)
>   643113574524 cache-references # 447.409 M/sec (54.39%)
>   18233791870 cache-misses # 2.835 % of all cache refs (54.37%)
>   640205852062 L1-dcache-loads # 445.386 M/sec (46.75%)
>   17968160377 L1-dcache-load-misses # 2.81% of all L1-dcache accesses (46.79%)
>   651747432274 dTLB-loads # 453.415 M/sec (46.59%)
>   3127124271 dTLB-load-misses # 0.48% of all dTLB cache accesses (46.75%)
>   535395273064 iTLB-loads # 372.470 M/sec (54.38%)
>   113500056 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%)
>   628871845924 L1-icache-loads # 437.501 M/sec (46.80%)
>   22585641203 L1-icache-load-misses # 3.59% of all L1-icache accesses (46.79%)
>
>   90.514819303 seconds time elapsed
>
>   43.877656000 seconds user
>   1397.176001000 seconds sys

Wouldn't expect such noticeable difference. Maybe it would diminish when
repeating and taking average. But guess it's at least not worse with
prefetchw, so...

> Link: https://lkml.org/lkml/2021/10/8/598
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/slub.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 3d2025f7163b..ce3d8b11215c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -354,7 +354,7 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
>
>  static void prefetch_freepointer(const struct kmem_cache *s, void *object)
>  {
> -	prefetch(object + s->offset);
> +	prefetchw(object + s->offset);
>  }
>
>  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
>
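To put the size of the quoted difference in perspective, the relative change per metric can be worked out directly from the Before/After numbers. The helper below is only an illustrative sketch: the function names (percent_change, report_change, report_hackbench_run) are made up for this note, and the inputs are the single-run values quoted in the patch, so the output says nothing about run-to-run variance.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent (negative = decrease). */
static double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Print one metric's before/after values and their relative change. */
static void report_change(const char *name, double before, double after)
{
    printf("%-14s %18.3f -> %18.3f  (%+.2f%%)\n",
           name, before, after, percent_change(before, after));
}

/* Values copied from the single Before/After hackbench run in the patch. */
static void report_hackbench_run(void)
{
    report_change("Time (s)", 91.677, 90.220);
    report_change("sys time (s)", 1422.441123, 1397.176001);
    report_change("cache-misses", 18475582235.0, 18233791870.0);
    report_change("cycles", 1580035699271.0, 1583259429428.0);
}
```

Calling report_hackbench_run() shows the hackbench time dropping by roughly 1.6% while the cycle count rises slightly. Averaging several runs per kernel, as suggested above, would be needed before reading much into a change of that size.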
diff --git a/mm/slub.c b/mm/slub.c
index 3d2025f7163b..ce3d8b11215c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -354,7 +354,7 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)

 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
 {
-	prefetch(object + s->offset);
+	prefetchw(object + s->offset);
 }

 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
slab_alloc()") introduced prefetch_freepointer() because when other cpu(s)
freed objects into a page that current cpu owns, the freelist link is
hot on cpu(s) which freed objects and possibly very cold on current cpu.

But if freelist link chain is hot on cpu(s) which freed objects,
it's better to invalidate that chain because they're not going to access
again within a short time.

So use prefetchw instead of prefetch. On supported architectures like x86
and arm, it invalidates other copied instances of a cache line when
prefetching it.

Before:

Time: 91.677

Performance counter stats for 'hackbench -g 100 -l 10000':
  1462938.07 msec cpu-clock # 15.908 CPUs utilized
  18072550 context-switches # 12.354 K/sec
  1018814 cpu-migrations # 696.416 /sec
  104558 page-faults # 71.471 /sec
  1580035699271 cycles # 1.080 GHz (54.51%)
  2003670016013 instructions # 1.27 insn per cycle (54.31%)
  5702204863 branch-misses (54.28%)
  643368500985 cache-references # 439.778 M/sec (54.26%)
  18475582235 cache-misses # 2.872 % of all cache refs (54.28%)
  642206796636 L1-dcache-loads # 438.984 M/sec (46.87%)
  18215813147 L1-dcache-load-misses # 2.84% of all L1-dcache accesses (46.83%)
  653842996501 dTLB-loads # 446.938 M/sec (46.63%)
  3227179675 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.85%)
  537531951350 iTLB-loads # 367.433 M/sec (54.33%)
  114750630 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.37%)
  630135543177 L1-icache-loads # 430.733 M/sec (46.80%)
  22923237620 L1-icache-load-misses # 3.64% of all L1-icache accesses (46.76%)

  91.964452802 seconds time elapsed

  43.416742000 seconds user
  1422.441123000 seconds sys

After:

Time: 90.220

Performance counter stats for 'hackbench -g 100 -l 10000':
  1437418.48 msec cpu-clock # 15.880 CPUs utilized
  17694068 context-switches # 12.310 K/sec
  958257 cpu-migrations # 666.651 /sec
  100604 page-faults # 69.989 /sec
  1583259429428 cycles # 1.101 GHz (54.57%)
  2004002484935 instructions # 1.27 insn per cycle (54.37%)
  5594202389 branch-misses (54.36%)
  643113574524 cache-references # 447.409 M/sec (54.39%)
  18233791870 cache-misses # 2.835 % of all cache refs (54.37%)
  640205852062 L1-dcache-loads # 445.386 M/sec (46.75%)
  17968160377 L1-dcache-load-misses # 2.81% of all L1-dcache accesses (46.79%)
  651747432274 dTLB-loads # 453.415 M/sec (46.59%)
  3127124271 dTLB-load-misses # 0.48% of all dTLB cache accesses (46.75%)
  535395273064 iTLB-loads # 372.470 M/sec (54.38%)
  113500056 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%)
  628871845924 L1-icache-loads # 437.501 M/sec (46.80%)
  22585641203 L1-icache-load-misses # 3.59% of all L1-icache accesses (46.79%)

  90.514819303 seconds time elapsed

  43.877656000 seconds user
  1397.176001000 seconds sys

Link: https://lkml.org/lkml/2021/10/8/598
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
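For readers who want to see the pattern outside the kernel, here is a minimal userspace sketch of a freelist whose link is stored inside each free object and prefetched with write intent on allocation. It is not the SLUB implementation: struct toy_cache and the toy_* helpers are hypothetical names, there is no per-cpu handling or locking, and __builtin_prefetch(addr, 1) (the GCC/Clang builtin, which the kernel's generic prefetchw() fallback also uses) stands in for the architecture-specific prefetchw().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy cache: each free object stores the next-free pointer
 * at `offset` bytes from its start, similar in spirit to s->offset. */
struct toy_cache {
    size_t offset;   /* byte offset of the free pointer inside an object */
    void *freelist;  /* first free object, or NULL when empty */
};

static void *toy_get_freepointer(const struct toy_cache *s, void *object)
{
    return *(void **)((char *)object + s->offset);
}

static void toy_set_freepointer(const struct toy_cache *s, void *object, void *next)
{
    *(void **)((char *)object + s->offset) = next;
}

/* Prefetch the link of the object that will be handed out next.
 * rw = 1 requests the cache line with write intent; it is only a hint. */
static void toy_prefetch_freepointer(const struct toy_cache *s, void *object)
{
    __builtin_prefetch((char *)object + s->offset, 1);
}

/* Pop the first free object and warm up the link of the following one. */
static void *toy_alloc(struct toy_cache *s)
{
    void *object = s->freelist;
    void *next;

    if (!object)
        return NULL;
    next = toy_get_freepointer(s, object);
    s->freelist = next;
    if (next)
        toy_prefetch_freepointer(s, next);
    return object;
}

/* Push an object back onto the freelist; this writes the link. */
static void toy_free(struct toy_cache *s, void *object)
{
    assert(object != NULL);
    toy_set_freepointer(s, object, s->freelist);
    s->freelist = object;
}
```

The prefetch only changes cache behaviour, never results, so the list works identically with rw = 0. Whether write intent helps is a property of the workload and the hardware, which is why the patch relies on measurement.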