Message ID | 20240308031325.218676-1-kent.overstreet@linux.dev (mailing list archive)
---|---
State | New
Series | mm/slub: Make __ksize() faster
On Thu, Mar 07, 2024 at 10:13:22PM -0500, Kent Overstreet wrote:
> This steals it to make __ksize() faster; it's now a single dependent
> load, instead of two. This is going to be important for tracking the
> amount of memory stranded by RCU, which we want to be able to do if
> we're going to be freeing all pagecache folios (and perhaps all folios)
> via RCU.

Do you have any measurements?  I don't necessarily object, it's just
that I would have thought we were already bringing slab->slab_cache
into cache as part of the freeing operation.
On Fri, Mar 08, 2024 at 04:48:11AM +0000, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 10:13:22PM -0500, Kent Overstreet wrote:
> > This steals it to make __ksize() faster; it's now a single dependent
> > load, instead of two. This is going to be important for tracking the
> > amount of memory stranded by RCU, which we want to be able to do if
> > we're going to be freeing all pagecache folios (and perhaps all folios)
> > via RCU.
>
> Do you have any measurements?  I don't necessarily object, it's just
> that I would have thought we were already bringing slab->slab_cache
> into cache as part of the freeing operation.

for kfree() yes, but for kfree_rcu() no - that'll happen significantly
later.

I don't have measurements that it /matters/, but since we have the u32
there now this seems like a good use for it.
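For readers following along, the pointer chase being discussed looks roughly like this (a simplified sketch; the _sketch types are invented here for illustration and are not kernel identifiers):

struct kmem_cache_sketch {
	unsigned int object_size;	/* size handed back by ksize() */
};

struct slab_sketch {
	struct kmem_cache_sketch *slab_cache;
	unsigned int object_size;	/* copy cached at allocation time */
};

/* Before the patch: two dependent loads - first slab->slab_cache,
 * then slab_cache->object_size, each potentially a cache miss. */
static unsigned int ksize_two_loads(const struct slab_sketch *slab)
{
	return slab->slab_cache->object_size;
}

/* After the patch: a single dependent load from struct slab itself.
 * For kfree_rcu(), struct slab is touched long after the allocation
 * went cold, so avoiding the second miss is the point of the patch. */
static unsigned int ksize_one_load(const struct slab_sketch *slab)
{
	return slab->object_size;
}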
On Fri, Mar 08, 2024 at 12:16:09AM -0500, Kent Overstreet wrote:
> On Fri, Mar 08, 2024 at 04:48:11AM +0000, Matthew Wilcox wrote:
> > On Thu, Mar 07, 2024 at 10:13:22PM -0500, Kent Overstreet wrote:
> > > This steals it to make __ksize() faster; it's now a single dependent
> > > load, instead of two. This is going to be important for tracking the
> > > amount of memory stranded by RCU, which we want to be able to do if
> > > we're going to be freeing all pagecache folios (and perhaps all folios)
> > > via RCU.
> >
> > Do you have any measurements?  I don't necessarily object, it's just
> > that I would have thought we were already bringing slab->slab_cache
> > into cache as part of the freeing operation.
>
> for kfree() yes, but for kfree_rcu() no - that'll happen significantly
> later.
>
> I don't have measurements that it /matters/, but since we have the u32
> there now this seems like a good use for it.

There are potentially better uses for those bits.  We could turn
folio_test_slab() into a PageType test, freeing up a page flag.
On Fri, Mar 08, 2024 at 02:58:48PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 08, 2024 at 12:16:09AM -0500, Kent Overstreet wrote:
> > On Fri, Mar 08, 2024 at 04:48:11AM +0000, Matthew Wilcox wrote:
> > > On Thu, Mar 07, 2024 at 10:13:22PM -0500, Kent Overstreet wrote:
> > > > This steals it to make __ksize() faster; it's now a single dependent
> > > > load, instead of two. This is going to be important for tracking the
> > > > amount of memory stranded by RCU, which we want to be able to do if
> > > > we're going to be freeing all pagecache folios (and perhaps all folios)
> > > > via RCU.
> > >
> > > Do you have any measurements?  I don't necessarily object, it's just
> > > that I would have thought we were already bringing slab->slab_cache
> > > into cache as part of the freeing operation.
> >
> > for kfree() yes, but for kfree_rcu() no - that'll happen significantly
> > later.
> >
> > I don't have measurements that it /matters/, but since we have the u32
> > there now this seems like a good use for it.
>
> There are potentially better uses for those bits.  We could turn
> folio_test_slab() into a PageType test, freeing up a page flag.

They overlap _mapcount, did you figure out how to use that for a
PageType enum?
On Fri, Mar 08, 2024 at 11:27:32AM -0500, Kent Overstreet wrote:
> On Fri, Mar 08, 2024 at 02:58:48PM +0000, Matthew Wilcox wrote:
> > There are potentially better uses for those bits.  We could turn
> > folio_test_slab() into a PageType test, freeing up a page flag.
>
> They overlap _mapcount, did you figure out how to use that for a
> PageType enum?

In 2018 ... 6e292b9be7f4358985ce33ae1f59ab30a8c09e08
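For reference, the scheme that commit introduced looks roughly like this (a sketch from memory of include/linux/page-flags.h as of 2018; consult the tree for the authoritative definitions and the current set of type bits):

/* page_type overlays _mapcount, which is -1 (0xffffffff) for pages
 * that are not mapped to userspace.  Values whose top nibble is 0xf
 * can never be a legitimate mapcount, so those bit patterns are free
 * to encode a page "type" instead. */
#define PAGE_TYPE_BASE	0xf0000000
/* Low bits stay reserved so mapcount over/underflow is still caught. */
#define PAGE_MAPCOUNT_RESERVE	-128
#define PG_buddy	0x00000080
#define PG_offline	0x00000100

/* A type is set by *clearing* its bit: the test checks that the base
 * pattern is intact and the flag bit is zero. */
#define PageType(page, flag)					\
	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)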
> On Fri, Mar 08, 2024 at 11:27:32AM -0500, Kent Overstreet wrote:
>> On Fri, Mar 08, 2024 at 02:58:48PM +0000, Matthew Wilcox wrote:
>>> There are potentially better uses for those bits.  We could turn
>>> folio_test_slab() into a PageType test, freeing up a page flag.
>>
>> They overlap _mapcount, did you figure out how to use that for a
>> PageType enum?
>
> In 2018 ... 6e292b9be7f4358985ce33ae1f59ab30a8c09e08
>

This seems to be a 32-bit field. We could segment that into two unsigned
shorts. In fact any operation on a slab larger than 2*PAGE_SIZE is
directly turned into a page allocator call, bypassing SLUB. So you only
need 0 ... 2 * PAGE_SIZE for the range of the int.
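A sketch of the split Christoph is proposing (hypothetical; this is not from the patch):

struct slab_field_split {
	union {
		unsigned int __unused;		/* the free u32 today */
		struct {
			/* Hypothetical: halve the field so the object size
			 * and some future user each get 16 bits.  Values up
			 * to 2 * PAGE_SIZE must then fit in 16 bits, which
			 * already fails for 64K pages - exactly what the
			 * reply below pokes at. */
			unsigned short object_size;
			unsigned short spare;
		};
	};
};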
On Fri, Mar 08, 2024 at 09:12:06AM -0800, Christoph Lameter (Ampere) wrote:
>
> > On Fri, Mar 08, 2024 at 11:27:32AM -0500, Kent Overstreet wrote:
> > > On Fri, Mar 08, 2024 at 02:58:48PM +0000, Matthew Wilcox wrote:
> > > > There are potentially better uses for those bits.  We could turn
> > > > folio_test_slab() into a PageType test, freeing up a page flag.
> > >
> > > They overlap _mapcount, did you figure out how to use that for a
> > > PageType enum?
> >
> > In 2018 ... 6e292b9be7f4358985ce33ae1f59ab30a8c09e08
> >
>
> This seems to be a 32-bit field. We could segment that into two unsigned
> shorts. In fact any operation on a slab larger than 2*PAGE_SIZE is
> directly turned into a page allocator call, bypassing SLUB. So you only
> need 0 ... 2 * PAGE_SIZE for the range of the int.

Are there any CPUs with PAGE_SIZE > 65535?  ;-)

It could, just about, be done.  Although not on Hexagon with its crazy
256kB page.  Right now, I reserve 0xf000007f to catch over/underflow,
although this is perhaps excessive and I could get away with just
0x8000007f.  That leaves 24 bits.  We've currently got 4 in use, and I
want to add two more (Slab and HugeTLB), so there's 18 bits remaining.

So is it really worth burning all the remaining bits on implementing
ksize with one fewer pointer dereference, given that struct kmem_cache
is read-mostly and should live in the CPU cache quite well?
On Fri, Mar 08, 2024 at 06:26:11PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 08, 2024 at 09:12:06AM -0800, Christoph Lameter (Ampere) wrote:
> >
> > > On Fri, Mar 08, 2024 at 11:27:32AM -0500, Kent Overstreet wrote:
> > > > On Fri, Mar 08, 2024 at 02:58:48PM +0000, Matthew Wilcox wrote:
> > > > > There are potentially better uses for those bits.  We could turn
> > > > > folio_test_slab() into a PageType test, freeing up a page flag.
> > > >
> > > > They overlap _mapcount, did you figure out how to use that for a
> > > > PageType enum?
> > >
> > > In 2018 ... 6e292b9be7f4358985ce33ae1f59ab30a8c09e08
> > >
> >
> > This seems to be a 32-bit field. We could segment that into two unsigned
> > shorts. In fact any operation on a slab larger than 2*PAGE_SIZE is
> > directly turned into a page allocator call, bypassing SLUB. So you only
> > need 0 ... 2 * PAGE_SIZE for the range of the int.
>
> Are there any CPUs with PAGE_SIZE > 65535?  ;-)
>
> It could, just about, be done.  Although not on Hexagon with its crazy
> 256kB page.  Right now, I reserve 0xf000007f to catch over/underflow,
> although this is perhaps excessive and I could get away with just
> 0x8000007f.  That leaves 24 bits.  We've currently got 4 in use, and I
> want to add two more (Slab and HugeTLB), so there's 18 bits remaining.
>
> So is it really worth burning all the remaining bits on implementing
> ksize with one fewer pointer dereference, given that struct kmem_cache
> is read-mostly and should live in the CPU cache quite well?

OK, this is the first I'd seen of these PageType shenanigans :)

That definitely takes priority over shaving cycles here; but could we
clean that up and expand it? Does it really need to be a bitmask? And
really, enums should be enums (that we can give a proper type name to),
not #defines and raw ints.

If we did that, and killed PG_slab and made it part of the PageType
enum, that frees up a few bits.
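A sketch of the direction Kent is gesturing at (purely illustrative; none of this exists in the tree):

/* If no page ever needs two types at once, the field could hold an
 * enumerated value rather than a bitmask, so each new type costs one
 * value instead of one bit out of the remaining budget. */
enum page_type {
	PT_mapcount,	/* field is a mapcount, not a type */
	PT_buddy,
	PT_offline,
	PT_table,
	PT_guard,
	PT_slab,	/* would replace the PG_slab page flag */
	PT_hugetlb,
};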
On Fri, 8 Mar 2024, Matthew Wilcox wrote:

> On Fri, Mar 08, 2024 at 09:12:06AM -0800, Christoph Lameter (Ampere) wrote:
>>
>>> On Fri, Mar 08, 2024 at 11:27:32AM -0500, Kent Overstreet wrote:
>>>> On Fri, Mar 08, 2024 at 02:58:48PM +0000, Matthew Wilcox wrote:
>>>>> There are potentially better uses for those bits.  We could turn
>>>>> folio_test_slab() into a PageType test, freeing up a page flag.
>>>>
>>>> They overlap _mapcount, did you figure out how to use that for a
>>>> PageType enum?
>>>
>>> In 2018 ... 6e292b9be7f4358985ce33ae1f59ab30a8c09e08
>>>
>>
>> This seems to be a 32-bit field. We could segment that into two unsigned
>> shorts. In fact any operation on a slab larger than 2*PAGE_SIZE is
>> directly turned into a page allocator call, bypassing SLUB. So you only
>> need 0 ... 2 * PAGE_SIZE for the range of the int.
>
> Are there any CPUs with PAGE_SIZE > 65535?  ;-)

Yes. PPC has something like that, I believe. Otherwise 64K is popular on
ARM64, f.e.

> So is it really worth burning all the remaining bits on implementing
> ksize with one fewer pointer dereference, given that struct kmem_cache
> is read-mostly and should live in the CPU cache quite well?

No idea. Yes, struct kmem_cache must be cache hot anyway for operations
on a cache. So make sure to move the object size to the beginning of
kmem_cache so that it shares a cacheline with the other hot-path fields
(this should already be the case??). Then there should really be no win
from relocating the size.

But we have 32 precious bits to play around with. That is an important
discussion on what to do with them.
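For context, the field in question does sit near the top of SLUB's struct kmem_cache (a heavily trimmed sketch; the authoritative layout is in the tree and varies with config options such as CONFIG_SLUB_TINY):

/* Trimmed: hot fields, including the object size, lead the struct, so a
 * ksize() that dereferences slab->slab_cache touches the same cacheline
 * the allocator's fast path keeps warm. */
struct kmem_cache {
	struct kmem_cache_cpu __percpu *cpu_slab;
	slab_flags_t flags;
	unsigned long min_partial;
	unsigned int size;		/* object size including metadata */
	unsigned int object_size;	/* object size without metadata */
	/* ... many colder fields follow ... */
};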
diff --git a/mm/slab.h b/mm/slab.h
index 54deeb0428c6..64f06431cc97 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -84,7 +84,7 @@ struct slab {
 		};
 		struct rcu_head rcu_head;
 	};
-	unsigned int __unused;
+	unsigned int object_size;
 
 	atomic_t __page_refcount;
 #ifdef CONFIG_MEMCG
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 6ec0f6543f34..f209b8cf4965 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -963,13 +963,12 @@ size_t __ksize(const void *object)
 		if (WARN_ON(object != folio_address(folio)))
 			return 0;
 		return folio_size(folio);
-	}
-
+	} else {
 #ifdef CONFIG_SLUB_DEBUG
-	skip_orig_size_check(folio_slab(folio)->slab_cache, object);
+		skip_orig_size_check(folio_slab(folio)->slab_cache, object);
 #endif
-
-	return slab_ksize(folio_slab(folio)->slab_cache);
+		return folio_slab(folio)->object_size;
+	}
 }
 
 /**
diff --git a/mm/slub.c b/mm/slub.c
index 2ef88bbf56a3..37fe5774c110 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2366,6 +2366,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	}
 
 	slab->objects = oo_objects(oo);
+	slab->object_size = slab_ksize(slab->slab_cache);
 	slab->inuse = 0;
 	slab->frozen = 0;
this is part of a larger in-progress patch series to add tracking of
the amount of memory stranded via RCU. I want to base this on willy's
series that rearranges folio batch freeing - that's got __folio_put()
cleanups I want. Once he's got that up in a git tree I'll do that and
finish the rest.

-- >8 --

With SLAB gone, we now have a free u32 in struct slab.

This steals it to make __ksize() faster; it's now a single dependent
load, instead of two. This is going to be important for tracking the
amount of memory stranded by RCU, which we want to be able to do if
we're going to be freeing all pagecache folios (and perhaps all folios)
via RCU.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-mm@kvack.org
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/slab.h        | 2 +-
 mm/slab_common.c | 9 ++++-----
 mm/slub.c        | 1 +
 3 files changed, 6 insertions(+), 6 deletions(-)
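To make the motivation concrete, the kind of accounting this patch enables might look like the following (entirely hypothetical; the counter and helpers are invented here and are not part of this patch or the kernel):

static atomic_long_t rcu_stranded_bytes;	/* hypothetical counter */

/* When an object is queued for RCU freeing, its memory is "stranded"
 * until the grace period expires.  With the size cached in struct slab,
 * charging it costs one dependent load instead of two. */
static void rcu_strand_account(const void *obj)
{
	atomic_long_add(__ksize(obj), &rcu_stranded_bytes);
}

static void rcu_strand_unaccount(const void *obj)
{
	atomic_long_sub(__ksize(obj), &rcu_stranded_bytes);
}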