
mm, slub: Use prefetchw instead of prefetch

Message ID 20211008133602.4963-1-42.hyeyoo@gmail.com (mailing list archive)
State New

Commit Message

Hyeonggon Yoo Oct. 8, 2021, 1:36 p.m. UTC
It's certain that an object will be not only read, but also
written after allocation.

Use prefetchw instead of prefetchw. On supported architecture
like x86, it helps to invalidate cache line when the object exists
in other processors' cache.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)
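
For context, the two hints differ only in the intent they express. A
simplified sketch of the generic fallbacks in include/linux/prefetch.h
(architectures such as x86 override these with their own instructions):

#ifndef ARCH_HAS_PREFETCH
#define prefetch(x)	__builtin_prefetch(x)		/* prefetch for read */
#endif

#ifndef ARCH_HAS_PREFETCHW
#define prefetchw(x)	__builtin_prefetch(x, 1)	/* prefetch with intent to write */
#endif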

Comments

David Rientjes Oct. 10, 2021, 10:49 p.m. UTC | #1
On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:

> It's certain that an object will be not only read, but also
> written after allocation.
> 

Why is it certain?  I think perhaps what you meant to say is that if we 
are doing any prefetching here, then access will benefit from prefetchw 
instead of prefetch.  But it's not "certain" that allocated memory will be 
accessed at all.

> Use prefetchw instead of prefetchw. On supported architecture

If we're using prefetchw instead of prefetchw, I think the diff would be 
0 lines changed :)

> like x86, it helps to invalidate cache line when the object exists
> in other processors' cache.
> 
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 3d2025f7163b..2aca7523165e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
>  	return freelist_dereference(s, object + s->offset);
>  }
>  
> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
>  {
> -	prefetch(object + s->offset);
> +	prefetchw(object + s->offset);
>  }
>  
>  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
>  			note_cmpxchg_failure("slab_alloc", s, tid);
>  			goto redo;
>  		}
> -		prefetch_freepointer(s, next_object);
> +		prefetchw_freepointer(s, next_object);
>  		stat(s, ALLOC_FASTPATH);
>  	}
> -
>  	maybe_wipe_obj_freeptr(s, object);
>  	init = slab_want_init_on_alloc(gfpflags, s);
>  
> -- 
> 2.27.0
> 
>
Christoph Lameter Oct. 11, 2021, 7:20 a.m. UTC | #2
On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:

> It's certain that an object will be not only read, but also
> written after allocation.

get_freepointer is used in multiple code paths, not only in allocation. It
is, for example, used when scanning through a freelist.

With this change all objects get needlessly dirtied and the cross cpu
cache contention is increased.
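
To illustrate that concern (a simplified sketch, not a quote of any
particular function in mm/slub.c): freelist walks of the following shape
only read the links, so a write-intent prefetch issued around them would
pull cache lines into exclusive state for no benefit.

	/* Simplified read-only freelist walk. */
	for (p = page_freelist; p; p = get_freepointer(s, p))
		nr_free++;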
Vlastimil Babka Oct. 11, 2021, 7:21 a.m. UTC | #3
On 10/11/21 00:49, David Rientjes wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> 
>> It's certain that an object will be not only read, but also
>> written after allocation.
>> 
> 
> Why is it certain?  I think perhaps what you meant to say is that if we 
> are doing any prefetching here, then access will benefit from prefetchw 
> instead of prefetch.  But it's not "certain" that allocated memory will be 
> accessed at all.

I think the primary reason there's a prefetch is freelist traversal. The
cacheline we prefetch will be read during the next allocation, so if we
expect there to be one soon, prefetch might help. That the freepointer is
part of the object itself, and thus the cache line will probably also be
accessed after the allocation, is secondary. Yeah this might help some
workloads, but
perhaps hurt others - these things might look obvious in theory but be
rather unpredictable in practice. At least some hackbench results would help...
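
For reference, a condensed sketch of the slab_alloc_node() fastpath around
the prefetch (details omitted, not the exact upstream code) shows which
line the hint targets and when it is read back:

	object = c->freelist;				/* object handed out by allocation N */
	next_object = get_freepointer_safe(s, object);	/* becomes the new freelist head */
	/* ... this_cpu_cmpxchg_double() advances the freelist to next_object ... */
	prefetch_freepointer(s, next_object);		/* warms next_object + s->offset, the
							   line allocation N+1 reads to advance */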

>> Use prefetchw instead of prefetchw. On supported architecture
> 
> If we're using prefetchw instead of prefetchw, I think the diff would be 
> 0 lines changed :)
> 
>> like x86, it helps to invalidate cache line when the object exists
>> in other processors' cache.
>> 
>> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> ---
>>  mm/slub.c | 7 +++----
>>  1 file changed, 3 insertions(+), 4 deletions(-)
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 3d2025f7163b..2aca7523165e 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
>>  	return freelist_dereference(s, object + s->offset);
>>  }
>>  
>> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
>> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)

I wouldn't rename the function itself, unless we have both variants for
different situations (we don't). That it uses prefetchw() is an internal
detail at this point.

>>  {
>> -	prefetch(object + s->offset);
>> +	prefetchw(object + s->offset);
>>  }
>>  
>>  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
>> @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
>>  			note_cmpxchg_failure("slab_alloc", s, tid);
>>  			goto redo;
>>  		}
>> -		prefetch_freepointer(s, next_object);
>> +		prefetchw_freepointer(s, next_object);
>>  		stat(s, ALLOC_FASTPATH);
>>  	}
>> -
>>  	maybe_wipe_obj_freeptr(s, object);
>>  	init = slab_want_init_on_alloc(gfpflags, s);
>>  
>> -- 
>> 2.27.0
>> 
>> 
>
Hyeonggon Yoo Oct. 11, 2021, 7:23 a.m. UTC | #4
On Sun, Oct 10, 2021 at 03:49:07PM -0700, David Rientjes wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> 
> > It's certain that an object will be not only read, but also
> > written after allocation.
> > 
> 
> Why is it certain?  I think perhaps what you meant to say is that if we 
> are doing any prefetching here, then access will benefit from prefetchw 
> instead of prefetch.  But it's not "certain" that allocated memory will be 
> accessed at all.
>

Blame my English skill :(

When I wrote it I thought it was okay, but it was unclear.
Thank you for pointing that out!

What I meant was: "When accessing an object, it must be written before it
is read. So there's no situation where the caller only reads an object and
never writes it. Thus it's better to use prefetchw instead of prefetch."

Let's rephrase:

Commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
slab_alloc()") introduced prefetch_freepointer() because when other cpu(s)
free objects into a page that the current cpu owns, the freelist link is
hot on the cpu(s) which freed the objects and possibly very cold on the
current cpu.

But if the freelist link chain is hot on the cpu(s) which freed the
objects, it's better to invalidate that chain, because those cpus are not
going to access it again within a short time.

So use prefetchw instead of prefetch. On supported architectures like x86,
it invalidates the other cpus' copies of the cache line when prefetching it.
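
To illustrate the difference outside the kernel (a hypothetical user-space
snippet, not part of the patch): the GCC builtin takes the read/write
intent as its second argument, and with write intent the CPU may request
the line for ownership, invalidating copies held by other cpus, rather
than fetching it in a shared state.

	/* Hypothetical user-space illustration, not kernel code. */
	static inline void hint_read(const void *p)
	{
		__builtin_prefetch(p, 0);	/* analogous to prefetch() */
	}

	static inline void hint_write(void *p)
	{
		__builtin_prefetch(p, 1);	/* analogous to prefetchw() */
	}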

> > Use prefetchw instead of prefetchw. On supported architecture
> 
> If we're using prefetchw instead of prefetchw, I think the diff would be 
> 0 lines changed :)
>

That was my typo. Thankfully Andrew fixed it.

> > like x86, it helps to invalidate cache line when the object exists
> > in other processors' cache.
> > 
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > ---
> >  mm/slub.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 3d2025f7163b..2aca7523165e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
> >  	return freelist_dereference(s, object + s->offset);
> >  }
> >  
> > -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> > +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
> >  {
> > -	prefetch(object + s->offset);
> > +	prefetchw(object + s->offset);
> >  }
> >  
> >  static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> > @@ -3195,10 +3195,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
> >  			note_cmpxchg_failure("slab_alloc", s, tid);
> >  			goto redo;
> >  		}
> > -		prefetch_freepointer(s, next_object);
> > +		prefetchw_freepointer(s, next_object);
> >  		stat(s, ALLOC_FASTPATH);
> >  	}
> > -
> >  	maybe_wipe_obj_freeptr(s, object);
> >  	init = slab_want_init_on_alloc(gfpflags, s);
> >  
> > -- 
> > 2.27.0
> > 
> >
Hyeonggon Yoo Oct. 11, 2021, 7:32 a.m. UTC | #5
On Mon, Oct 11, 2021 at 09:20:36AM +0200, Christoph Lameter wrote:
> On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> 
> > It's certain that an object will be not only read, but also
> > written after allocation.
> 
> get_freepointer is used in multiple code paths, not only in allocation. It
> is, for example, used when scanning through a freelist.
> 
> With this change all objects get needlessly dirtied and the cross cpu
> cache contention is increased.

I didn't touch get_freepointer, and there is
only one caller of prefetch_freepointer.

My change does not add an additional prefetch in get_freepointer;
it changes the existing prefetch into prefetchw.

The prefetch was introduced by commit 0ad9500e16fe ("slub: prefetch next
freelist pointer in slab_alloc()"), which you ACKed in 2011.

Do you think removing the existing prefetch is better than changing it
from prefetch to prefetchw?
Hyeonggon Yoo Oct. 11, 2021, 10:33 a.m. UTC | #6
Hello Vlastimil.

On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> On 10/11/21 00:49, David Rientjes wrote:
> > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> > 
> >> It's certain that an object will be not only read, but also
> >> written after allocation.
> >> 
> > 
> > Why is it certain?  I think perhaps what you meant to say is that if we 
> > are doing any prefetching here, then access will benefit from prefetchw 
> > instead of prefetch.  But it's not "certain" that allocated memory will be 
> > accessed at all.
> 
> I think the primary reason there's a prefetch is freelist traversal. The
> cacheline we prefetch will be read during the next allocation, so if we
> expect there to be one soon, prefetch might help.

I agree with that.

> That the freepointer is
> part of the object itself, and thus the cache line will probably also be
> accessed after the allocation, is secondary.

Right. It depends on the cache line size and whether the first cache line
of an object is frequently accessed or not.

> Yeah this might help some workloads, but
> perhaps hurt others - these things might look obvious in theory but be
> rather unpredictable in practice. At least some hackbench results would help...
>

Below are my measurements. It seems prefetch(w) does not make things worse,
at least on hackbench.

Measured on 16 CPUs (ARM64) / 16G RAM
Without prefetch:

Time: 91.989
 Performance counter stats for 'hackbench -g 100 -l 10000':
        1467926.03 msec cpu-clock                 #   15.907 CPUs utilized          
          17782076      context-switches          #   12.114 K/sec                  
            957523      cpu-migrations            #  652.296 /sec                   
            104561      page-faults               #   71.230 /sec                   
     1622117569931      cycles                    #    1.105 GHz                      (54.54%)
     2002981132267      instructions              #    1.23  insn per cycle           (54.32%)
        5600876429      branch-misses                                                 (54.28%)
      642657442307      cache-references          #  437.800 M/sec                    (54.27%)
       19404890844      cache-misses              #    3.019 % of all cache refs      (54.28%)
      640413686039      L1-dcache-loads           #  436.271 M/sec                    (46.85%)
       19110650580      L1-dcache-load-misses     #    2.98% of all L1-dcache accesses  (46.83%)
      651556334841      dTLB-loads                #  443.862 M/sec                    (46.63%)
        3193647402      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.84%)
      538927659684      iTLB-loads                #  367.135 M/sec                    (54.31%)
         118503839      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
      625750168840      L1-icache-loads           #  426.282 M/sec                    (46.80%)
       24348083282      L1-icache-load-misses     #    3.89% of all L1-icache accesses  (46.78%)

      92.284351157 seconds time elapsed

      44.524693000 seconds user
    1426.214006000 seconds sys

With prefetch:

Time: 91.677

 Performance counter stats for 'hackbench -g 100 -l 10000':
        1462938.07 msec cpu-clock                 #   15.908 CPUs utilized          
          18072550      context-switches          #   12.354 K/sec                  
           1018814      cpu-migrations            #  696.416 /sec                   
            104558      page-faults               #   71.471 /sec                   
     2003670016013      instructions              #    1.27  insn per cycle           (54.31%)
        5702204863      branch-misses                                                 (54.28%)
      643368500985      cache-references          #  439.778 M/sec                    (54.26%)
       18475582235      cache-misses              #    2.872 % of all cache refs      (54.28%)
      642206796636      L1-dcache-loads           #  438.984 M/sec                    (46.87%)
       18215813147      L1-dcache-load-misses     #    2.84% of all L1-dcache accesses  (46.83%)
      653842996501      dTLB-loads                #  446.938 M/sec                    (46.63%)
        3227179675      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.85%)
      537531951350      iTLB-loads                #  367.433 M/sec                    (54.33%)
         114750630      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.37%)
      630135543177      L1-icache-loads           #  430.733 M/sec                    (46.80%)
       22923237620      L1-icache-load-misses     #    3.64% of all L1-icache accesses  (46.76%)
 
      91.964452802 seconds time elapsed

      43.416742000 seconds user
    1422.441123000 seconds sys
	
With prefetchw:

Time: 90.220

 Performance counter stats for 'hackbench -g 100 -l 10000':
        1437418.48 msec cpu-clock                 #   15.880 CPUs utilized          
          17694068      context-switches          #   12.310 K/sec                  
            958257      cpu-migrations            #  666.651 /sec                   
            100604      page-faults               #   69.989 /sec                   
     1583259429428      cycles                    #    1.101 GHz                      (54.57%)
     2004002484935      instructions              #    1.27  insn per cycle           (54.37%)
        5594202389      branch-misses                                                 (54.36%)
      643113574524      cache-references          #  447.409 M/sec                    (54.39%)
       18233791870      cache-misses              #    2.835 % of all cache refs      (54.37%)
      640205852062      L1-dcache-loads           #  445.386 M/sec                    (46.75%)
       17968160377      L1-dcache-load-misses     #    2.81% of all L1-dcache accesses  (46.79%)
      651747432274      dTLB-loads                #  453.415 M/sec                    (46.59%)
        3127124271      dTLB-load-misses          #    0.48% of all dTLB cache accesses  (46.75%)
      535395273064      iTLB-loads                #  372.470 M/sec                    (54.38%)
         113500056      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
      628871845924      L1-icache-loads           #  437.501 M/sec                    (46.80%)
       22585641203      L1-icache-load-misses     #    3.59% of all L1-icache accesses  (46.79%)
 
      90.514819303 seconds time elapsed
 
      43.877656000 seconds user
    1397.176001000 seconds sys

Thanks,
Hyeonggon
Hyeonggon Yoo Oct. 11, 2021, 1:49 p.m. UTC | #7
On Mon, Oct 11, 2021 at 10:33:02AM +0000, Hyeonggon Yoo wrote:
> Hello Vlastimil.
> 
> On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> > On 10/11/21 00:49, David Rientjes wrote:
> > > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> > > 
> > >> It's certain that an object will be not only read, but also
> > >> written after allocation.
> > >> 
> > > 
> > > Why is it certain?  I think perhaps what you meant to say is that if we 
> > > are doing any prefetching here, then access will benefit from prefetchw 
> > > instead of prefetch.  But it's not "certain" that allocated memory will be 
> > > accessed at all.
> > 
> > I think the primary reason there's a prefetch is freelist traversal. The
> > cacheline we prefetch will be read during the next allocation, so if we
> > expect there to be one soon, prefetch might help.
> 
> I agree with that.
> 
> > That the freepointer is
> > part of the object itself, and thus the cache line will probably also be
> > accessed after the allocation, is secondary.
> 
> Right. It depends on the cache line size and whether the first cache line
> of an object is frequently accessed or not.

Not the first cache line, because the free pointer is in the middle of the
object, or outside the object area. My mistake.
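
A rough layout sketch (not exact; the free pointer's location depends on
s->offset and on debugging/hardening options):

	/*
	 *   object                       object + s->offset
	 *   |                            |
	 *   v                            v
	 *   +----------------------------+--------------+--------------+
	 *   | payload ...                | free pointer | payload ...  |
	 *   +----------------------------+--------------+--------------+
	 *
	 * Since the free pointer is usually not in the first cache line,
	 * the prefetched line is not necessarily the first line the new
	 * owner of the object touches.
	 */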

> >> Use prefetchw instead of prefetchw. On supported architecture
> >
> > If we're using prefetchw instead of prefetchw, I think the diff would be
> > 0 lines changed :)
> >
> >> like x86, it helps to invalidate cache line when the object exists
> >> in other processors' cache.
> >>
> >> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >> ---
> >>  mm/slub.c | 7 +++----
> >>  1 file changed, 3 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 3d2025f7163b..2aca7523165e 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -352,9 +352,9 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
> >>  	return freelist_dereference(s, object + s->offset);
> >>  }
> >>
> >> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> >> +static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
>
> I wouldn't rename the function itself, unless we have both variants for
> different situations (we don't). That it uses prefetchw() is an internal
> detail at this point.

Looks good. That is simpler.

Patch

diff --git a/mm/slub.c b/mm/slub.c
index 3d2025f7163b..2aca7523165e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -352,9 +352,9 @@  static inline void *get_freepointer(struct kmem_cache *s, void *object)
 	return freelist_dereference(s, object + s->offset);
 }
 
-static void prefetch_freepointer(const struct kmem_cache *s, void *object)
+static void prefetchw_freepointer(const struct kmem_cache *s, void *object)
 {
-	prefetch(object + s->offset);
+	prefetchw(object + s->offset);
 }
 
 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
@@ -3195,10 +3195,9 @@  static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;
 		}
-		prefetch_freepointer(s, next_object);
+		prefetchw_freepointer(s, next_object);
 		stat(s, ALLOC_FASTPATH);
 	}
-
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);