Message ID | 20211008133602.4963-1-42.hyeyoo@gmail.com (mailing list archive)
---|---
State | New
Series | mm, slub: Use prefetchw instead of prefetch
On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:

> It's certain that an object will be not only read, but also
> written after allocation.
>

Why is it certain? I think perhaps what you meant to say is that if we
are doing any prefetching here, then access will benefit from prefetchw
instead of prefetch. But it's not "certain" that allocated memory will be
accessed at all.

> Use prefetchw instead of prefetchw. On supported architecture

If we're using prefetchw instead of prefetchw, I think the diff would be
0 lines changed :)

> like x86, it helps to invalidate cache line when the object exists
> in other processors' cache.
>
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 3d2025f7163b..2aca7523165e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
>          return freelist_dereference(s, object + s->offset);
>  }
>
> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
>  {
> -        prefetch(object + s->offset);
> +        prefetchw(object + s->offset);
>  }
>
>  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
>                          note_cmpxchg_failure("slab_alloc", s, tid);
>                          goto redo;
>                  }
> -                prefetch_freepointer(s, next_object);
> +                prefetchw_freepointer(s, next_object);
>                  stat(s, ALLOC_FASTPATH);
>          }
> -
>          maybe_wipe_obj_freeptr(s, object);
>
>          init = slab_want_init_on_alloc(gfpflags, s);
>
> --
> 2.27.0
On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:

> It's certain that an object will be not only read, but also
> written after allocation.

get_freepointer is used in multiple code paths, not only in allocation. It
is for example used when scanning through a freelist.

With this change all objects get needlessly dirtied and the cross cpu
cache contention is increased.
On 10/11/21 00:49, David Rientjes wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
>
>> It's certain that an object will be not only read, but also
>> written after allocation.
>>
>
> Why is it certain? I think perhaps what you meant to say is that if we
> are doing any prefetching here, then access will benefit from prefetchw
> instead of prefetch. But it's not "certain" that allocated memory will be
> accessed at all.

I think the primary reason there's a prefetch is freelist traversal. The
cacheline we prefetch will be read during the next allocation, so if we
expect there to be one soon, prefetch might help. That the freepointer is
part of object itself and thus the cache line will be probably accessed also
after the allocation, is secondary.

Yeah this might help some workloads, but perhaps hurt others - these things
might look obvious in theory but be rather unpredictable in practice. At
least some hackbench results would help...

>> Use prefetchw instead of prefetchw. On supported architecture
>
> If we're using prefetchw instead of prefetchw, I think the diff would be
> 0 lines changed :)
>
>> like x86, it helps to invalidate cache line when the object exists
>> in other processors' cache.
>>
>> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> ---
>>  mm/slub.c | 7 +++----
>>  1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 3d2025f7163b..2aca7523165e 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
>>          return freelist_dereference(s, object + s->offset);
>>  }
>>
>> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
>> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)

I wouldn't rename the function itself, unless we have both variants for
different situations (we don't). That it uses prefetchw() is internal detail
at this point.

>>  {
>> -        prefetch(object + s->offset);
>> +        prefetchw(object + s->offset);
>>  }
>>
>>  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
>> @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
>>                          note_cmpxchg_failure("slab_alloc", s, tid);
>>                          goto redo;
>>                  }
>> -                prefetch_freepointer(s, next_object);
>> +                prefetchw_freepointer(s, next_object);
>>                  stat(s, ALLOC_FASTPATH);
>>          }
>> -
>>          maybe_wipe_obj_freeptr(s, object);
>>
>>          init = slab_want_init_on_alloc(gfpflags, s);
>>
>> --
>> 2.27.0
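To make the freelist-traversal point above concrete, here is a minimal user-space sketch of the pattern under discussion. It is not kernel code: the names fake_obj and pop_freelist are made up for illustration, and the write-intent hint is expressed with the GCC/Clang __builtin_prefetch builtin rather than the kernel's prefetch()/prefetchw() helpers. Each "allocation" pops the freelist head and prefetches the location of the next object's free pointer, so a likely follow-up allocation does not start on a cold cache line.

```c
#include <stdio.h>

/* Illustrative only: like SLUB's free pointer, each free object stores a
 * pointer to the next free object inside itself. */
struct fake_obj {
        struct fake_obj *next_free;
};

/* Pop the freelist head and issue a write-intent prefetch for the line
 * holding the next object's free pointer.
 * __builtin_prefetch(addr, rw, locality): rw = 1 is a write hint, roughly
 * what the kernel's prefetchw() expresses on architectures that support it. */
static struct fake_obj *pop_freelist(struct fake_obj **freelist)
{
        struct fake_obj *object = *freelist;

        if (object) {
                *freelist = object->next_free;
                __builtin_prefetch(object->next_free, 1, 3);
        }
        return object;
}

int main(void)
{
        struct fake_obj objs[3] = {
                { .next_free = &objs[1] },
                { .next_free = &objs[2] },
                { .next_free = NULL },
        };
        struct fake_obj *freelist = &objs[0];

        while (pop_freelist(&freelist))
                printf("allocated one object\n");
        return 0;
}
```

Whether the write hint pays off depends on how soon and how the next object is actually touched, which is exactly why benchmark numbers are being asked for above.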
On Sun, Oct 10, 2021 at 03:49:07PM -0700, David Rientjes wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
>
> > It's certain that an object will be not only read, but also
> > written after allocation.
> >
>
> Why is it certain? I think perhaps what you meant to say is that if we
> are doing any prefetching here, then access will benefit from prefetchw
> instead of prefetch. But it's not "certain" that allocated memory will be
> accessed at all.
>

Blame my English skill :( When I wrote it I thought it was okay, but it was
unclear. Thank you for pointing that out!

What I meant was: "When accessing an object, it must be written before it is
read. So there's no situation where the caller only reads an object and does
not write it. Thus it's better to use prefetchw instead of prefetch."

Let's rephrase:

commit 0ad9500e16fe ("slub: prefetch next freelist pointer in slab_alloc()")
introduced prefetch_freepointer() because when other cpu(s) freed objects
into a page that the current cpu owns, the freelist link is hot on the
cpu(s) which freed the objects and possibly very cold on the current cpu.

But if the freelist link chain is hot on the cpu(s) which freed the objects,
it's better to invalidate that chain because they're not going to access it
again within a short time.

So use prefetchw instead of prefetch. On supported architectures like x86,
it invalidates other copied instances of a cache line when prefetching it.

> > Use prefetchw instead of prefetchw. On supported architecture
>
> If we're using prefetchw instead of prefetchw, I think the diff would be
> 0 lines changed :)
>

That was my typo. Thankfully Andrew fixed that.

> > like x86, it helps to invalidate cache line when the object exists
> > in other processors' cache.
> >
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > ---
> >  mm/slub.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 3d2025f7163b..2aca7523165e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
> >          return freelist_dereference(s, object + s->offset);
> >  }
> >
> > -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> > +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
> >  {
> > -        prefetch(object + s->offset);
> > +        prefetchw(object + s->offset);
> >  }
> >
> >  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> > @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
> >                          note_cmpxchg_failure("slab_alloc", s, tid);
> >                          goto redo;
> >                  }
> > -                prefetch_freepointer(s, next_object);
> > +                prefetchw_freepointer(s, next_object);
> >                  stat(s, ALLOC_FASTPATH);
> >          }
> > -
> >          maybe_wipe_obj_freeptr(s, object);
> >
> >          init = slab_want_init_on_alloc(gfpflags, s);
> >
> > --
> > 2.27.0
On Mon, Oct 11, 2021 at 09:20:36AM +0200, Christoph Lameter wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
>
> > It's certain that an object will be not only read, but also
> > written after allocation.
>
> get_freepointer is used in multiple code paths, not only in allocation. It
> is for example used when scanning through a freelist.
>
> With this change all objects get needlessly dirtied and the cross cpu
> cache contention is increased.

I didn't touch get_freepointer, and there's only one caller of
prefetch_freepointer. My change was not adding an additional prefetch in
get_freepointer, but changing the existing prefetch into prefetchw.

The prefetch was introduced by commit 0ad9500e16fe ("slub: prefetch next
freelist pointer in slab_alloc()"), which you ACKed in 2011.

Do you think removing the existing prefetch is better than changing it from
prefetch to prefetchw?
Hello Vlastimil.

On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> On 10/11/21 00:49, David Rientjes wrote:
> > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> >
> >> It's certain that an object will be not only read, but also
> >> written after allocation.
> >>
> >
> > Why is it certain? I think perhaps what you meant to say is that if we
> > are doing any prefetching here, then access will benefit from prefetchw
> > instead of prefetch. But it's not "certain" that allocated memory will be
> > accessed at all.
>
> I think the primary reason there's a prefetch is freelist traversal. The
> cacheline we prefetch will be read during the next allocation, so if we
> expect there to be one soon, prefetch might help.

I agree with that.

> That the freepointer is
> part of object itself and thus the cache line will be probably accessed also
> after the allocation, is secondary.

Right. It depends on the cache line size and whether the first cache line of
an object is frequently accessed or not.

> Yeah this might help some workloads, but
> perhaps hurt others - these things might look obvious in theory but be
> rather unpredictable in practice. At least some hackbench results would help...
>

Below are my measurements. It seems prefetch(w) is not making things worse,
at least on hackbench.

Measured on 16 CPUs (ARM64) / 16G RAM

Without prefetch:

Time: 91.989

Performance counter stats for 'hackbench -g 100 -l 10000':

1467926.03 msec cpu-clock  # 15.907 CPUs utilized
17782076 context-switches  # 12.114 K/sec
957523 cpu-migrations  # 652.296 /sec
104561 page-faults  # 71.230 /sec
1622117569931 cycles  # 1.105 GHz  (54.54%)
2002981132267 instructions  # 1.23 insn per cycle  (54.32%)
5600876429 branch-misses  (54.28%)
642657442307 cache-references  # 437.800 M/sec  (54.27%)
19404890844 cache-misses  # 3.019 % of all cache refs  (54.28%)
640413686039 L1-dcache-loads  # 436.271 M/sec  (46.85%)
19110650580 L1-dcache-load-misses  # 2.98% of all L1-dcache accesses  (46.83%)
651556334841 dTLB-loads  # 443.862 M/sec  (46.63%)
3193647402 dTLB-load-misses  # 0.49% of all dTLB cache accesses  (46.84%)
538927659684 iTLB-loads  # 367.135 M/sec  (54.31%)
118503839 iTLB-load-misses  # 0.02% of all iTLB cache accesses  (54.35%)
625750168840 L1-icache-loads  # 426.282 M/sec  (46.80%)
24348083282 L1-icache-load-misses  # 3.89% of all L1-icache accesses  (46.78%)

92.284351157 seconds time elapsed

44.524693000 seconds user
1426.214006000 seconds sys

With prefetch:

Time: 91.677

Performance counter stats for 'hackbench -g 100 -l 10000':

1462938.07 msec cpu-clock  # 15.908 CPUs utilized
18072550 context-switches  # 12.354 K/sec
1018814 cpu-migrations  # 696.416 /sec
104558 page-faults  # 71.471 /sec
2003670016013 instructions  # 1.27 insn per cycle  (54.31%)
5702204863 branch-misses  (54.28%)
643368500985 cache-references  # 439.778 M/sec  (54.26%)
18475582235 cache-misses  # 2.872 % of all cache refs  (54.28%)
642206796636 L1-dcache-loads  # 438.984 M/sec  (46.87%)
18215813147 L1-dcache-load-misses  # 2.84% of all L1-dcache accesses  (46.83%)
653842996501 dTLB-loads  # 446.938 M/sec  (46.63%)
3227179675 dTLB-load-misses  # 0.49% of all dTLB cache accesses  (46.85%)
537531951350 iTLB-loads  # 367.433 M/sec  (54.33%)
114750630 iTLB-load-misses  # 0.02% of all iTLB cache accesses  (54.37%)
630135543177 L1-icache-loads  # 430.733 M/sec  (46.80%)
22923237620 L1-icache-load-misses  # 3.64% of all L1-icache accesses  (46.76%)

91.964452802 seconds time elapsed

43.416742000 seconds user
1422.441123000 seconds sys

With prefetchw:

Time: 90.220

Performance counter stats for 'hackbench -g 100 -l 10000':

1437418.48 msec cpu-clock  # 15.880 CPUs utilized
17694068 context-switches  # 12.310 K/sec
958257 cpu-migrations  # 666.651 /sec
100604 page-faults  # 69.989 /sec
1583259429428 cycles  # 1.101 GHz  (54.57%)
2004002484935 instructions  # 1.27 insn per cycle  (54.37%)
5594202389 branch-misses  (54.36%)
643113574524 cache-references  # 447.409 M/sec  (54.39%)
18233791870 cache-misses  # 2.835 % of all cache refs  (54.37%)
640205852062 L1-dcache-loads  # 445.386 M/sec  (46.75%)
17968160377 L1-dcache-load-misses  # 2.81% of all L1-dcache accesses  (46.79%)
651747432274 dTLB-loads  # 453.415 M/sec  (46.59%)
3127124271 dTLB-load-misses  # 0.48% of all dTLB cache accesses  (46.75%)
535395273064 iTLB-loads  # 372.470 M/sec  (54.38%)
113500056 iTLB-load-misses  # 0.02% of all iTLB cache accesses  (54.35%)
628871845924 L1-icache-loads  # 437.501 M/sec  (46.80%)
22585641203 L1-icache-load-misses  # 3.59% of all L1-icache accesses  (46.79%)

90.514819303 seconds time elapsed

43.877656000 seconds user
1397.176001000 seconds sys

Thanks,
Hyeonggon
On Mon, Oct 11, 2021 at 10:33:02AM +0000, Hyeonggon Yoo wrote:
> Hello Vlastimil.
>
> On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> > On 10/11/21 00:49, David Rientjes wrote:
> > > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> > >
> > >> It's certain that an object will be not only read, but also
> > >> written after allocation.
> > >>
> > >
> > > Why is it certain? I think perhaps what you meant to say is that if we
> > > are doing any prefetching here, then access will benefit from prefetchw
> > > instead of prefetch. But it's not "certain" that allocated memory will be
> > > accessed at all.
> >
> > I think the primary reason there's a prefetch is freelist traversal. The
> > cacheline we prefetch will be read during the next allocation, so if we
> > expect there to be one soon, prefetch might help.
>
> I agree with that.
>
> > That the freepointer is
> > part of object itself and thus the cache line will be probably accessed also
> > after the allocation, is secondary.
>
> Right. It depends on the cache line size and whether the first cache line of
> an object is frequently accessed or not.

Not necessarily the first cache line, because the free pointer can be in the
middle of the object or outside the object area. My mistake.

> >> Use prefetchw instead of prefetchw. On supported architecture
> >
> > If we're using prefetchw instead of prefetchw, I think the diff would be
> > 0 lines changed :)
>
> >> like x86, it helps to invalidate cache line when the object exists
> >> in other processors' cache.
> >>
> >> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >> ---
> >>  mm/slub.c | 7 +++----
> >>  1 file changed, 3 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 3d2025f7163b..2aca7523165e 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
> >>          return freelist_dereference(s, object + s->offset);
> >>  }
> >>
> >> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> >> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
>
> I wouldn't rename the function itself, unless we have both variants for
> different situations (we don't). That it uses prefetchw() is internal detail
> at this point.

Looks good. That is simpler.
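For reference, applying that review suggestion to the posted hunk amounts to keeping the existing function name and changing only its body, roughly as in the fragment below. This is just the reviewer's comment spelled out against the code already quoted in this thread (it reuses the kernel's kmem_cache and prefetchw() from the posted diff, so it is a fragment, not a standalone program), and is not necessarily the exact form of any final merged commit.

```c
/* Sketch of the suggestion: keep the prefetch_freepointer() name, since the
 * write hint is an internal detail, and switch only prefetch() -> prefetchw(). */
static void prefetch_freepointer(const struct kmem_cache *s, void *object)
{
        prefetchw(object + s->offset);
}
```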
diff --git a/mm/slub.c b/mm/slub.c
index 3d2025f7163b..2aca7523165e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
         return freelist_dereference(s, object + s->offset);
 }
 
-static void prefetch_freepointer(const struct kmem_cache *s, void *object)
+static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
 {
-        prefetch(object + s->offset);
+        prefetchw(object + s->offset);
 }
 
 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
@@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
                         note_cmpxchg_failure("slab_alloc", s, tid);
                         goto redo;
                 }
-                prefetch_freepointer(s, next_object);
+                prefetchw_freepointer(s, next_object);
                 stat(s, ALLOC_FASTPATH);
         }
-
         maybe_wipe_obj_freeptr(s, object);
 
         init = slab_want_init_on_alloc(gfpflags, s);
It's certain that an object will be not only read, but also
written after allocation.

Use prefetchw instead of prefetchw. On supported architecture
like x86, it helps to invalidate cache line when the object exists
in other processors' cache.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)