diff mbox series

[v2,6/6] alloc_tag: config to store page allocation tag refs in page flags

Message ID 20240902044128.664075-7-surenb@google.com (mailing list archive)
State Handled Elsewhere
Headers show
Series page allocation tag compression | expand

Checks

Context Check Description
mdraidci/vmtest-modules-next-PR fail merge-conflict

Commit Message

Suren Baghdasaryan Sept. 2, 2024, 4:41 a.m. UTC
Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
references directly in the page flags. This removes dependency on
page_ext and results in better performance for page allocations as
well as reduced page_ext memory overhead.
CONFIG_PGALLOC_TAG_REF_BITS controls the number of bits required
to be available in the page flags to store the references. If the
number of page flag bits is insufficient, the build will fail and
either CONFIG_PGALLOC_TAG_REF_BITS would have to be lowered or
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS should be disabled.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mmzone.h            |  3 ++
 include/linux/page-flags-layout.h | 10 +++++--
 include/linux/pgalloc_tag.h       | 48 +++++++++++++++++++++++++++++++
 lib/Kconfig.debug                 | 27 +++++++++++++++--
 lib/alloc_tag.c                   |  4 +++
 mm/page_ext.c                     |  2 +-
 6 files changed, 89 insertions(+), 5 deletions(-)

Comments

Andrew Morton Sept. 2, 2024, 5:16 a.m. UTC | #1
On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> references directly in the page flags. This removes dependency on
> page_ext and results in better performance for page allocations as
> well as reduced page_ext memory overhead.
> CONFIG_PGALLOC_TAG_REF_BITS controls the number of bits required
> to be available in the page flags to store the references. If the
> number of page flag bits is insufficient, the build will fail and
> either CONFIG_PGALLOC_TAG_REF_BITS would have to be lowered or
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS should be disabled.
> 
> ...
>
> +config PGALLOC_TAG_USE_PAGEFLAGS
> +	bool "Use pageflags to encode page allocation tag reference"
> +	default n
> +	depends on MEM_ALLOC_PROFILING
> +	help
> +	  When set, page allocation tag references are encoded inside page
> +	  flags, otherwise they are encoded in page extensions.
> +
> +	  Setting this flag reduces memory and performance overhead of memory
> +	  allocation profiling but also limits how many allocations can be
> +	  tagged. The number of bits is set by PGALLOC_TAG_USE_PAGEFLAGS and
> +	  they must fit in the page flags field.

Again.  Please put yourself in the position of one of the all-minus-two
people in this world who aren't kernel-memory-profiling-developers. 
How the heck are they to decide whether or not to enable this?  OK, 59%
of them are likely to say "yes" because reasons.  But then what?  How
are they to determine whether it was the correct choice for them?  If
we don't tell them, who will?

>  config PGALLOC_TAG_REF_BITS
>  	int "Number of bits for page allocation tag reference (10-64)"
>  	range 10 64
> -	default "64"
> +	default "16" if PGALLOC_TAG_USE_PAGEFLAGS
> +	default "64" if !PGALLOC_TAG_USE_PAGEFLAGS
>  	depends on MEM_ALLOC_PROFILING
>  	help
>  	  Number of bits used to encode a page allocation tag reference.
> @@ -1011,6 +1027,13 @@ config PGALLOC_TAG_REF_BITS
>  	  Smaller number results in less memory overhead but limits the number of
>  	  allocations which can be tagged (including allocations from modules).
>  
> +	  If PGALLOC_TAG_USE_PAGEFLAGS is set, the number of requested bits should
> +	  fit inside the page flags.

What does "should fit" mean?  "It is your responsibility to make it
fit"?  "We think it will fit but we aren't really sure"?

> +	  If PGALLOC_TAG_USE_PAGEFLAGS is not set, the number of bits used to store
> +	  a reference is rounded up to the closest basic type. If set higher than 32,
> +	  a direct pointer to the allocation tag is stored for performance reasons.
> +

We shouldn't be offering things like this to our users.  If we cannot decide, how
can they?
Suren Baghdasaryan Sept. 3, 2024, 6:19 p.m. UTC | #2
On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> > references directly in the page flags. This removes dependency on
> > page_ext and results in better performance for page allocations as
> > well as reduced page_ext memory overhead.
> > CONFIG_PGALLOC_TAG_REF_BITS controls the number of bits required
> > to be available in the page flags to store the references. If the
> > number of page flag bits is insufficient, the build will fail and
> > either CONFIG_PGALLOC_TAG_REF_BITS would have to be lowered or
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS should be disabled.
> >
> > ...
> >
> > +config PGALLOC_TAG_USE_PAGEFLAGS
> > +     bool "Use pageflags to encode page allocation tag reference"
> > +     default n
> > +     depends on MEM_ALLOC_PROFILING
> > +     help
> > +       When set, page allocation tag references are encoded inside page
> > +       flags, otherwise they are encoded in page extensions.
> > +
> > +       Setting this flag reduces memory and performance overhead of memory
> > +       allocation profiling but also limits how many allocations can be
> > +       tagged. The number of bits is set by PGALLOC_TAG_USE_PAGEFLAGS and
> > +       they must fit in the page flags field.
>
> Again.  Please put yourself in the position of one of the all-minus-two
> people in this world who aren't kernel-memory-profiling-developers.
> How the heck are they to decide whether or not to enable this?  OK, 59%
> of them are likely to say "yes" because reasons.  But then what?  How
> are they to determine whether it was the correct choice for them?  If
> we don't tell them, who will?

Fair point. I think one would want to enable this config unless there
aren't enough unused bits if the page flags to address all page
allocation tags. That last part of determining how many bits we need
is a bit tricky.
If we put aside loadable modules for now, there are 3 cases:

1. The number of unused page flag bits is enough to address all page
allocations.
2. The number of unused page flag bits is enough if we push
last_cpupid out of page flags. In that case we get the warning at
https://elixir.bootlin.com/linux/v6.11-rc6/source/mm/mm_init.c#L124.
3. The number of unused page flag bits is not enough even if we push
last_cpupid out of page flags. In that case we get the  "Not enough
bits in page flags" build time error.

So, maybe I should make this option "default y" when
CONFIG_MEM_ALLOC_PROFILING=y and let the user disable it if they hit
case #3 or (case #2 and performance hit is unacceptable)?

For loadable modules, if we hit the limit when loading a module at
runtime, we could issue a warning and disable allocation tagging via
the static key. Another option is to fail to load the module with a
proper warning but that IMO would be less appealing.

>
> >  config PGALLOC_TAG_REF_BITS
> >       int "Number of bits for page allocation tag reference (10-64)"
> >       range 10 64
> > -     default "64"
> > +     default "16" if PGALLOC_TAG_USE_PAGEFLAGS
> > +     default "64" if !PGALLOC_TAG_USE_PAGEFLAGS
> >       depends on MEM_ALLOC_PROFILING
> >       help
> >         Number of bits used to encode a page allocation tag reference.
> > @@ -1011,6 +1027,13 @@ config PGALLOC_TAG_REF_BITS
> >         Smaller number results in less memory overhead but limits the number of
> >         allocations which can be tagged (including allocations from modules).
> >
> > +       If PGALLOC_TAG_USE_PAGEFLAGS is set, the number of requested bits should
> > +       fit inside the page flags.
>
> What does "should fit" mean?  "It is your responsibility to make it
> fit"?  "We think it will fit but we aren't really sure"?

This is the case #3 I described above, the user will get a "Not enough
bits in page flags" build time error. If we stick with this config, I
can clarify that in this description.

>
> > +       If PGALLOC_TAG_USE_PAGEFLAGS is not set, the number of bits used to store
> > +       a reference is rounded up to the closest basic type. If set higher than 32,
> > +       a direct pointer to the allocation tag is stored for performance reasons.
> > +
>
> We shouldn't be offering things like this to our users.  If we cannot decide, how
> can they?

Thinking about the ease of use, the CONFIG_PGALLOC_TAG_REF_BITS is the
hardest one to set. The user does not know how many page allocations
are there. I think I can simplify this by trying to use all unused
page flag bits for addressing the tags. Then, after compilation we can
follow the rules I mentioned before:
- If the available bits are not enough to address all kernel page
allocations, we issue an error. The user should disable
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.
- If there are enough unused bits but we have to push last_cpupid out
of page flags, we issue a warning and continue. The user can disable
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
flags.
- If we run out of addressing space during module loading, we disable
allocation tagging and continue. The user should disable
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.

This leaves one outstanding case:
- If we run out of addressing space during module loading but we would
not run out of space if we pushed last_cpupid out of page flags during
compilation.
In this case I would want the user to have an option to request a
larger addressing space for page allocation tags at compile time.
Maybe I can keep CONFIG_PGALLOC_TAG_REF_BITS for such explicit
requests for a larger space? This would limit the use of
CONFIG_PGALLOC_TAG_REF_BITS to this case only. In all other cases the
number of bits would be set automatically. WDYT?
John Hubbard Sept. 4, 2024, 1:25 a.m. UTC | #3
On 9/3/24 11:19 AM, Suren Baghdasaryan wrote:
> On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
...
>> We shouldn't be offering things like this to our users.  If we cannot decide, how
>> can they?
> 
> Thinking about the ease of use, the CONFIG_PGALLOC_TAG_REF_BITS is the
> hardest one to set. The user does not know how many page allocations
> are there. I think I can simplify this by trying to use all unused
> page flag bits for addressing the tags. Then, after compilation we can
> follow the rules I mentioned before:
> - If the available bits are not enough to address all kernel page
> allocations, we issue an error. The user should disable
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.
> - If there are enough unused bits but we have to push last_cpupid out
> of page flags, we issue a warning and continue. The user can disable
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
> flags.
> - If we run out of addressing space during module loading, we disable
> allocation tagging and continue. The user should disable
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.

If the computer already knows what to do, it should do it, rather than
prompting the user to disable a deeply mystifying config parameter.

> 
> This leaves one outstanding case:
> - If we run out of addressing space during module loading but we would
> not run out of space if we pushed last_cpupid out of page flags during
> compilation.
> In this case I would want the user to have an option to request a
> larger addressing space for page allocation tags at compile time.
> Maybe I can keep CONFIG_PGALLOC_TAG_REF_BITS for such explicit
> requests for a larger space? This would limit the use of
> CONFIG_PGALLOC_TAG_REF_BITS to this case only. In all other cases the
> number of bits would be set automatically. WDYT?

Manually dealing with something like this is just not going to work.

The more I read this story, the clearer it becomes that this should be
entirely done by the build system: set it, or don't set it, automatically.

And if you can make it not even a kconfig item at all, that's probably even
better.

And if there is no way to set it automatically, then that probably means
that the feature is still too raw to unleash upon the world.

thanks,
John Hubbard Sept. 4, 2024, 2:05 a.m. UTC | #4
On 9/3/24 6:25 PM, John Hubbard wrote:
> On 9/3/24 11:19 AM, Suren Baghdasaryan wrote:
>> On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> ...
>>> We shouldn't be offering things like this to our users.  If we cannot decide, how
>>> can they?
>>
>> Thinking about the ease of use, the CONFIG_PGALLOC_TAG_REF_BITS is the
>> hardest one to set. The user does not know how many page allocations

I should probably clarify my previous reply, so here is the more detailed
version:

>> are there. I think I can simplify this by trying to use all unused
>> page flag bits for addressing the tags. Then, after compilation we can

Yes.

>> follow the rules I mentioned before:
>> - If the available bits are not enough to address all kernel page
>> allocations, we issue an error. The user should disable
>> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.

The configuration should disable itself, in this case. But if that is
too big of a change for now, I suppose we could fall back to an error
message to the effect of, "please disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
because the kernel build system is still too primitive to do that for you". :)


>> - If there are enough unused bits but we have to push last_cpupid out
>> of page flags, we issue a warning and continue. The user can disable
>> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
>> flags.

Let's try to decide now, what that tradeoff should be. Just pick one based
on what some of us perceive to be the expected usefulness and frequency of
use between last_cpuid and these tag refs.

If someone really needs to change the tradeoff for that one bit, then that
someone is also likely able to hack up a change for it.

thanks,
Matthew Wilcox (Oracle) Sept. 4, 2024, 2:18 a.m. UTC | #5
On Tue, Sep 03, 2024 at 06:25:52PM -0700, John Hubbard wrote:
> The more I read this story, the clearer it becomes that this should be
> entirely done by the build system: set it, or don't set it, automatically.
> 
> And if you can make it not even a kconfig item at all, that's probably even
> better.
> 
> And if there is no way to set it automatically, then that probably means
> that the feature is still too raw to unleash upon the world.

I'd suggest that this implementation is just too whack.

What if you use a maple tree for this?  For each allocation range, you
can store a pointer to a tag instead of storing an index in each folio.
Kent Overstreet Sept. 4, 2024, 2:41 a.m. UTC | #6
On Tue, Sep 03, 2024 at 06:25:52PM GMT, John Hubbard wrote:
> On 9/3/24 11:19 AM, Suren Baghdasaryan wrote:
> > On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> ...
> > > We shouldn't be offering things like this to our users.  If we cannot decide, how
> > > can they?
> > 
> > Thinking about the ease of use, the CONFIG_PGALLOC_TAG_REF_BITS is the
> > hardest one to set. The user does not know how many page allocations
> > are there. I think I can simplify this by trying to use all unused
> > page flag bits for addressing the tags. Then, after compilation we can
> > follow the rules I mentioned before:
> > - If the available bits are not enough to address all kernel page
> > allocations, we issue an error. The user should disable
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.
> > - If there are enough unused bits but we have to push last_cpupid out
> > of page flags, we issue a warning and continue. The user can disable
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
> > flags.
> > - If we run out of addressing space during module loading, we disable
> > allocation tagging and continue. The user should disable
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.
> 
> If the computer already knows what to do, it should do it, rather than
> prompting the user to disable a deeply mystifying config parameter.
> 
> > 
> > This leaves one outstanding case:
> > - If we run out of addressing space during module loading but we would
> > not run out of space if we pushed last_cpupid out of page flags during
> > compilation.
> > In this case I would want the user to have an option to request a
> > larger addressing space for page allocation tags at compile time.
> > Maybe I can keep CONFIG_PGALLOC_TAG_REF_BITS for such explicit
> > requests for a larger space? This would limit the use of
> > CONFIG_PGALLOC_TAG_REF_BITS to this case only. In all other cases the
> > number of bits would be set automatically. WDYT?
> 
> Manually dealing with something like this is just not going to work.
> 
> The more I read this story, the clearer it becomes that this should be
> entirely done by the build system: set it, or don't set it, automatically.

The difficulty is that we don't know how many modules are going to be
loaded, and an allyesconfig or allmodconfig is going to require quite a
few bits - it's not something we can really predict or calculate.

So there is some genuine trickyness here.
Suren Baghdasaryan Sept. 4, 2024, 4:08 p.m. UTC | #7
On Tue, Sep 3, 2024 at 7:06 PM 'John Hubbard' via kernel-team
<kernel-team@android.com> wrote:
>
> On 9/3/24 6:25 PM, John Hubbard wrote:
> > On 9/3/24 11:19 AM, Suren Baghdasaryan wrote:
> >> On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > ...
> >>> We shouldn't be offering things like this to our users.  If we cannot decide, how
> >>> can they?
> >>
> >> Thinking about the ease of use, the CONFIG_PGALLOC_TAG_REF_BITS is the
> >> hardest one to set. The user does not know how many page allocations
>
> I should probably clarify my previous reply, so here is the more detailed
> version:
>
> >> are there. I think I can simplify this by trying to use all unused
> >> page flag bits for addressing the tags. Then, after compilation we can
>
> Yes.
>
> >> follow the rules I mentioned before:
> >> - If the available bits are not enough to address all kernel page
> >> allocations, we issue an error. The user should disable
> >> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.
>
> The configuration should disable itself, in this case. But if that is
> too big of a change for now, I suppose we could fall back to an error
> message to the effect of, "please disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
> because the kernel build system is still too primitive to do that for you". :)

I don't think we can detect this at build time. We need to know how
many page allocations there are, which we find out only after we build
the kernel image (from the section size that holds allocation tags).
Therefore it would have to be a post-build check. So I think the best
we can do is to generate the error like the one you suggested after we
build the image.
Dependency on CONFIG_PAGE_EXTENSION is yet another complexity because
if we auto-disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS, we would have to
also auto-enable CONFIG_PAGE_EXTENSION if it's not already enabled.

I'll dig around some more to see if there is a better way.

>
>
> >> - If there are enough unused bits but we have to push last_cpupid out
> >> of page flags, we issue a warning and continue. The user can disable
> >> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
> >> flags.
>
> Let's try to decide now, what that tradeoff should be. Just pick one based
> on what some of us perceive to be the expected usefulness and frequency of
> use between last_cpuid and these tag refs.
>
> If someone really needs to change the tradeoff for that one bit, then that
> someone is also likely able to hack up a change for it.

Yeah, from all the feedback, I realize that by pursuing the maximum
flexibility I made configuring this mechanism close to impossible. I
think the first step towards simplifying this would be to identify
usable configurations. From that POV, I can see 3 useful modes:

1. Page flags are not used. In this mode we will use direct pointer
references and page extensions, like we do today. This mode is used
when we don't have enough page flags. This can be a safe default which
keeps things as they are today and should always work.
2. Page flags are used but not forced. This means we will try to use
all free page flags bits (up to a reasonable limit of 16) without
pushing out last_cpupid.
3. Page flags are forced. This means we will try to use all free page
flags bits after pushing last_cpupid out of page flags. This mode
could be used if the user cares about memory profiling more than the
performance overhead caused by last_cpupid.

I'm not 100% sure (3) is needed, so I think we can skip it until
someone asks for it. It should be easy to add that in the future.
If we detect at build time that we don't have enough page flag bits to
cover kernel allocations for modes (2) or (3), we issue an error
prompting the user to reconfigure to mode (1).

Ideally, I would like to have (2) as default mode and automatically
fall back to (1) when it's impossible but as I mentioned before, I
don't yet see a way to do that automatically.

For loadable modules, I think my earlier suggestion should work fine.
If a module causes us to run out of space for tags, we disable memory
profiling at runtime and log a warning for the user stating that we
disabled memory profiling and if the user needs it they should
configure mode (1). I *think* I can even disable profiling only for
that module and not globally but I need to try that first.

I can start with modes (1) and (2) support which requires only
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS defaulted to N. Any user can try
enabling this config and if that builds fine then keeping it for
better performance and memory usage. Does that sound acceptable?
Thanks,
Suren.

>
> thanks,
> --
> John Hubbard
>
> >> - If we run out of addressing space during module loading, we disable
> >> allocation tagging and continue. The user should disable
> >> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.
> >
> > If the computer already knows what to do, it should do it, rather than
> > prompting the user to disable a deeply mystifying config parameter.
> >
> >>
> >> This leaves one outstanding case:
> >> - If we run out of addressing space during module loading but we would
> >> not run out of space if we pushed last_cpupid out of page flags during
> >> compilation.
> >> In this case I would want the user to have an option to request a
> >> larger addressing space for page allocation tags at compile time.
> >> Maybe I can keep CONFIG_PGALLOC_TAG_REF_BITS for such explicit
> >> requests for a larger space? This would limit the use of
> >> CONFIG_PGALLOC_TAG_REF_BITS to this case only. In all other cases the
> >> number of bits would be set automatically. WDYT?
> >
> > Manually dealing with something like this is just not going to work.
> >
> > The more I read this story, the clearer it becomes that this should be
> > entirely done by the build system: set it, or don't set it, automatically.
> >
> > And if you can make it not even a kconfig item at all, that's probably even
> > better.
> >
> > And if there is no way to set it automatically, then that probably means
> > that the feature is still too raw to unleash upon the world.
> >
> > thanks,
>
>
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Suren Baghdasaryan Sept. 4, 2024, 4:18 p.m. UTC | #8
On Tue, Sep 3, 2024 at 7:19 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Sep 03, 2024 at 06:25:52PM -0700, John Hubbard wrote:
> > The more I read this story, the clearer it becomes that this should be
> > entirely done by the build system: set it, or don't set it, automatically.
> >
> > And if you can make it not even a kconfig item at all, that's probably even
> > better.
> >
> > And if there is no way to set it automatically, then that probably means
> > that the feature is still too raw to unleash upon the world.
>
> I'd suggest that this implementation is just too whack.
>
> What if you use a maple tree for this?  For each allocation range, you
> can store a pointer to a tag instead of storing an index in each folio.

I'm not sure I understand your suggestion, Matthew. We allocate a
folio and need to store a reference to the tag associated with the
code that allocated that folio. We are not operating with ranges here.
Are you suggesting to use a maple tree instead of page_ext to store
this reference?
Kent Overstreet Sept. 4, 2024, 4:21 p.m. UTC | #9
On Wed, Sep 04, 2024 at 09:18:01AM GMT, Suren Baghdasaryan wrote:
> On Tue, Sep 3, 2024 at 7:19 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Sep 03, 2024 at 06:25:52PM -0700, John Hubbard wrote:
> > > The more I read this story, the clearer it becomes that this should be
> > > entirely done by the build system: set it, or don't set it, automatically.
> > >
> > > And if you can make it not even a kconfig item at all, that's probably even
> > > better.
> > >
> > > And if there is no way to set it automatically, then that probably means
> > > that the feature is still too raw to unleash upon the world.
> >
> > I'd suggest that this implementation is just too whack.
> >
> > What if you use a maple tree for this?  For each allocation range, you
> > can store a pointer to a tag instead of storing an index in each folio.
> 
> I'm not sure I understand your suggestion, Matthew. We allocate a
> folio and need to store a reference to the tag associated with the
> code that allocated that folio. We are not operating with ranges here.
> Are you suggesting to use a maple tree instead of page_ext to store
> this reference?

yeah, I don't think the maple tree idea makes any sense either

we already have a way of going from index -> alloc tag, this is just
about how we store the index
Matthew Wilcox (Oracle) Sept. 4, 2024, 4:35 p.m. UTC | #10
On Wed, Sep 04, 2024 at 09:18:01AM -0700, Suren Baghdasaryan wrote:
> I'm not sure I understand your suggestion, Matthew. We allocate a
> folio and need to store a reference to the tag associated with the
> code that allocated that folio. We are not operating with ranges here.
> Are you suggesting to use a maple tree instead of page_ext to store
> this reference?

I'm saying that a folio has a physical address.  So you can use a physical
address as an index into a maple tree to store additional information
instead of using page_ext or trying to hammer the additional information
into struct page somewhere.
Kent Overstreet Sept. 4, 2024, 4:37 p.m. UTC | #11
On Wed, Sep 04, 2024 at 05:35:49PM GMT, Matthew Wilcox wrote:
> On Wed, Sep 04, 2024 at 09:18:01AM -0700, Suren Baghdasaryan wrote:
> > I'm not sure I understand your suggestion, Matthew. We allocate a
> > folio and need to store a reference to the tag associated with the
> > code that allocated that folio. We are not operating with ranges here.
> > Are you suggesting to use a maple tree instead of page_ext to store
> > this reference?
> 
> I'm saying that a folio has a physical address.  So you can use a physical
> address as an index into a maple tree to store additional information
> instead of using page_ext or trying to hammer the additional information
> into struct page somewhere.

Ah, thanks, that makes more sense.

But it would add a lot of overhead to the page alloc/free paths...
Suren Baghdasaryan Sept. 4, 2024, 4:39 p.m. UTC | #12
On Wed, Sep 4, 2024 at 9:37 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Wed, Sep 04, 2024 at 05:35:49PM GMT, Matthew Wilcox wrote:
> > On Wed, Sep 04, 2024 at 09:18:01AM -0700, Suren Baghdasaryan wrote:
> > > I'm not sure I understand your suggestion, Matthew. We allocate a
> > > folio and need to store a reference to the tag associated with the
> > > code that allocated that folio. We are not operating with ranges here.
> > > Are you suggesting to use a maple tree instead of page_ext to store
> > > this reference?
> >
> > I'm saying that a folio has a physical address.  So you can use a physical
> > address as an index into a maple tree to store additional information
> > instead of using page_ext or trying to hammer the additional information
> > into struct page somewhere.
>
> Ah, thanks, that makes more sense.
>
> But it would add a lot of overhead to the page alloc/free paths...

Yeah, inserting into a maple_tree in the fast path of page allocation
would introduce considerable performance overhead.
John Hubbard Sept. 4, 2024, 6:58 p.m. UTC | #13
On 9/4/24 9:08 AM, Suren Baghdasaryan wrote:
> On Tue, Sep 3, 2024 at 7:06 PM 'John Hubbard' via kernel-team
> <kernel-team@android.com> wrote:
>> On 9/3/24 6:25 PM, John Hubbard wrote:
>>> On 9/3/24 11:19 AM, Suren Baghdasaryan wrote:
>>>> On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>>> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
...
>> The configuration should disable itself, in this case. But if that is
>> too big of a change for now, I suppose we could fall back to an error
>> message to the effect of, "please disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
>> because the kernel build system is still too primitive to do that for you". :)
> 
> I don't think we can detect this at build time. We need to know how
> many page allocations there are, which we find out only after we build
> the kernel image (from the section size that holds allocation tags).
> Therefore it would have to be a post-build check. So I think the best
> we can do is to generate the error like the one you suggested after we
> build the image.
> Dependency on CONFIG_PAGE_EXTENSION is yet another complexity because
> if we auto-disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS, we would have to
> also auto-enable CONFIG_PAGE_EXTENSION if it's not already enabled.
> 
> I'll dig around some more to see if there is a better way.
>>
>>>> - If there are enough unused bits but we have to push last_cpupid out
>>>> of page flags, we issue a warning and continue. The user can disable
>>>> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
>>>> flags.
>>
>> Let's try to decide now, what that tradeoff should be. Just pick one based
>> on what some of us perceive to be the expected usefulness and frequency of
>> use between last_cpuid and these tag refs.
>>
>> If someone really needs to change the tradeoff for that one bit, then that
>> someone is also likely able to hack up a change for it.
> 
> Yeah, from all the feedback, I realize that by pursuing the maximum
> flexibility I made configuring this mechanism close to impossible. I
> think the first step towards simplifying this would be to identify
> usable configurations. From that POV, I can see 3 useful modes:
> 
> 1. Page flags are not used. In this mode we will use direct pointer
> references and page extensions, like we do today. This mode is used
> when we don't have enough page flags. This can be a safe default which
> keeps things as they are today and should always work.

Definitely my favorite so far.

> 2. Page flags are used but not forced. This means we will try to use
> all free page flags bits (up to a reasonable limit of 16) without
> pushing out last_cpupid.

This is a logical next step, agreed.

> 3. Page flags are forced. This means we will try to use all free page
> flags bits after pushing last_cpupid out of page flags. This mode
> could be used if the user cares about memory profiling more than the
> performance overhead caused by last_cpupid.
> 
> I'm not 100% sure (3) is needed, so I think we can skip it until
> someone asks for it. It should be easy to add that in the future.

Right.

> If we detect at build time that we don't have enough page flag bits to
> cover kernel allocations for modes (2) or (3), we issue an error
> prompting the user to reconfigure to mode (1).
> 
> Ideally, I would like to have (2) as default mode and automatically
> fall back to (1) when it's impossible but as I mentioned before, I
> don't yet see a way to do that automatically.
> 
> For loadable modules, I think my earlier suggestion should work fine.
> If a module causes us to run out of space for tags, we disable memory
> profiling at runtime and log a warning for the user stating that we
> disabled memory profiling and if the user needs it they should
> configure mode (1). I *think* I can even disable profiling only for
> that module and not globally but I need to try that first.
> 
> I can start with modes (1) and (2) support which requires only
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS defaulted to N. Any user can try
> enabling this config and if that builds fine then keeping it for
> better performance and memory usage. Does that sound acceptable?
> Thanks,
> Suren.
> 

How badly do we need (2)? Because this is really expensive:

    a) It adds complexity to a complex,delicate core part of mm.

    b) It adds constraints, which prevent possible future features.

It's not yet clear that (2) is valuable enough (compared to (1))
to compensate, at least from what I've read. Unless I missed
something big.


thanks,
Suren Baghdasaryan Sept. 4, 2024, 9:07 p.m. UTC | #14
On Wed, Sep 4, 2024 at 11:58 AM 'John Hubbard' via kernel-team
<kernel-team@android.com> wrote:
>
> On 9/4/24 9:08 AM, Suren Baghdasaryan wrote:
> > On Tue, Sep 3, 2024 at 7:06 PM 'John Hubbard' via kernel-team
> > <kernel-team@android.com> wrote:
> >> On 9/3/24 6:25 PM, John Hubbard wrote:
> >>> On 9/3/24 11:19 AM, Suren Baghdasaryan wrote:
> >>>> On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>>>> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> ...
> >> The configuration should disable itself, in this case. But if that is
> >> too big of a change for now, I suppose we could fall back to an error
> >> message to the effect of, "please disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
> >> because the kernel build system is still too primitive to do that for you". :)
> >
> > I don't think we can detect this at build time. We need to know how
> > many page allocations there are, which we find out only after we build
> > the kernel image (from the section size that holds allocation tags).
> > Therefore it would have to be a post-build check. So I think the best
> > we can do is to generate the error like the one you suggested after we
> > build the image.
> > Dependency on CONFIG_PAGE_EXTENSION is yet another complexity because
> > if we auto-disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS, we would have to
> > also auto-enable CONFIG_PAGE_EXTENSION if it's not already enabled.
> >
> > I'll dig around some more to see if there is a better way.
> >>
> >>>> - If there are enough unused bits but we have to push last_cpupid out
> >>>> of page flags, we issue a warning and continue. The user can disable
> >>>> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
> >>>> flags.
> >>
> >> Let's try to decide now, what that tradeoff should be. Just pick one based
> >> on what some of us perceive to be the expected usefulness and frequency of
> >> use between last_cpuid and these tag refs.
> >>
> >> If someone really needs to change the tradeoff for that one bit, then that
> >> someone is also likely able to hack up a change for it.
> >
> > Yeah, from all the feedback, I realize that by pursuing the maximum
> > flexibility I made configuring this mechanism close to impossible. I
> > think the first step towards simplifying this would be to identify
> > usable configurations. From that POV, I can see 3 useful modes:
> >
> > 1. Page flags are not used. In this mode we will use direct pointer
> > references and page extensions, like we do today. This mode is used
> > when we don't have enough page flags. This can be a safe default which
> > keeps things as they are today and should always work.
>
> Definitely my favorite so far.
>
> > 2. Page flags are used but not forced. This means we will try to use
> > all free page flags bits (up to a reasonable limit of 16) without
> > pushing out last_cpupid.
>
> This is a logical next step, agreed.
>
> > 3. Page flags are forced. This means we will try to use all free page
> > flags bits after pushing last_cpupid out of page flags. This mode
> > could be used if the user cares about memory profiling more than the
> > performance overhead caused by last_cpupid.
> >
> > I'm not 100% sure (3) is needed, so I think we can skip it until
> > someone asks for it. It should be easy to add that in the future.
>
> Right.
>
> > If we detect at build time that we don't have enough page flag bits to
> > cover kernel allocations for modes (2) or (3), we issue an error
> > prompting the user to reconfigure to mode (1).
> >
> > Ideally, I would like to have (2) as default mode and automatically
> > fall back to (1) when it's impossible but as I mentioned before, I
> > don't yet see a way to do that automatically.
> >
> > For loadable modules, I think my earlier suggestion should work fine.
> > If a module causes us to run out of space for tags, we disable memory
> > profiling at runtime and log a warning for the user stating that we
> > disabled memory profiling and if the user needs it they should
> > configure mode (1). I *think* I can even disable profiling only for
> > that module and not globally but I need to try that first.
> >
> > I can start with modes (1) and (2) support which requires only
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS defaulted to N. Any user can try
> > enabling this config and if that builds fine then keeping it for
> > better performance and memory usage. Does that sound acceptable?
> > Thanks,
> > Suren.
> >
>
> How badly do we need (2)? Because this is really expensive:
>
>     a) It adds complexity to a complex,delicate core part of mm.
>
>     b) It adds constraints, which prevent possible future features.
>
> It's not yet clear that (2) is valuable enough (compared to (1))
> to compensate, at least from what I've read. Unless I missed
> something big.

(1) is what we already have today.
(2) would allow us to drop page_ext dependency, so there is
considerable memory saving and performance improvement (no need to
lookup the page extension on each tag reference). The benefits are
described in the cover letter at:
https://lore.kernel.org/all/20240902044128.664075-1-surenb@google.com/.


>
>
> thanks,
> --
> John Hubbard
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
diff mbox series

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..0dd2b42f7cb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1085,6 +1085,7 @@  static inline bool zone_is_empty(struct zone *zone)
 #define KASAN_TAG_PGOFF		(LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
 #define LRU_GEN_PGOFF		(KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
 #define LRU_REFS_PGOFF		(LRU_GEN_PGOFF - LRU_REFS_WIDTH)
+#define ALLOC_TAG_REF_PGOFF	(LRU_REFS_PGOFF - ALLOC_TAG_REF_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -1096,6 +1097,7 @@  static inline bool zone_is_empty(struct zone *zone)
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
 #define LAST_CPUPID_PGSHIFT	(LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
 #define KASAN_TAG_PGSHIFT	(KASAN_TAG_PGOFF * (KASAN_TAG_WIDTH != 0))
+#define ALLOC_TAG_REF_PGSHIFT	(ALLOC_TAG_REF_PGOFF * (ALLOC_TAG_REF_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -1116,6 +1118,7 @@  static inline bool zone_is_empty(struct zone *zone)
 #define LAST_CPUPID_MASK	((1UL << LAST_CPUPID_SHIFT) - 1)
 #define KASAN_TAG_MASK		((1UL << KASAN_TAG_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
+#define ALLOC_TAG_REF_MASK	((1UL << ALLOC_TAG_REF_WIDTH) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
 {
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 7d79818dc065..21bba7c8c965 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -5,6 +5,12 @@ 
 #include <linux/numa.h>
 #include <generated/bounds.h>
 
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+#define ALLOC_TAG_REF_WIDTH	CONFIG_PGALLOC_TAG_REF_BITS
+#else
+#define ALLOC_TAG_REF_WIDTH	0
+#endif
+
 /*
  * When a memory allocation must conform to specific limitations (such
  * as being suitable for DMA) the caller will pass in hints to the
@@ -91,7 +97,7 @@ 
 #endif
 
 #if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
-	KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+	KASAN_TAG_WIDTH + ALLOC_TAG_REF_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
@@ -102,7 +108,7 @@ 
 #endif
 
 #if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
-	KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
+	KASAN_TAG_WIDTH + ALLOC_TAG_REF_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
 
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index a7f8f00c118f..dcb6706dee15 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -118,6 +118,52 @@  static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
 void __init alloc_tag_sec_init(void);
 
 #endif /* PGALLOC_TAG_DIRECT_REF */
+
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+
+typedef struct page	*pgtag_ref_handle;
+
+/* Should be called only if mem_alloc_profiling_enabled() */
+static inline pgtag_ref_handle get_page_tag_ref(struct page *page,
+						union codetag_ref *ref)
+{
+	if (page) {
+		pgalloc_tag_ref pgref;
+
+		pgref = (page->flags >> ALLOC_TAG_REF_PGSHIFT) & ALLOC_TAG_REF_MASK;
+		read_pgref(&pgref, ref);
+		return page;
+	}
+
+	return NULL;
+}
+
+static inline void put_page_tag_ref(pgtag_ref_handle page)
+{
+	WARN_ON(!page);
+}
+
+static inline void update_page_tag_ref(pgtag_ref_handle page, union codetag_ref *ref)
+{
+	unsigned long old_flags, flags, val;
+	pgalloc_tag_ref pgref;
+
+	if (WARN_ON(!page || !ref))
+		return;
+
+	write_pgref(&pgref, ref);
+	val = (unsigned long)pgref;
+	val = (val & ALLOC_TAG_REF_MASK) << ALLOC_TAG_REF_PGSHIFT;
+	do {
+		old_flags = READ_ONCE(page->flags);
+		flags = old_flags;
+		flags &= ~(ALLOC_TAG_REF_MASK << ALLOC_TAG_REF_PGSHIFT);
+		flags |= val;
+	} while (unlikely(!try_cmpxchg(&page->flags, &old_flags, flags)));
+}
+
+#else /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
 #include <linux/page_ext.h>
 
 extern struct page_ext_operations page_alloc_tagging_ops;
@@ -166,6 +212,8 @@  static inline void update_page_tag_ref(pgtag_ref_handle pgref, union codetag_ref
 	write_pgref(pgref, ref);
 }
 
+#endif /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
 static inline void clear_page_tag_ref(struct page *page)
 {
 	if (mem_alloc_profiling_enabled()) {
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 253f9c2028da..9fc8c1981f27 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -979,7 +979,7 @@  config MEM_ALLOC_PROFILING
 	depends on PROC_FS
 	depends on !DEBUG_FORCE_WEAK_PER_CPU
 	select CODE_TAGGING
-	select PAGE_EXTENSION
+	select PAGE_EXTENSION if !PGALLOC_TAG_USE_PAGEFLAGS
 	select SLAB_OBJ_EXT
 	help
 	  Track allocation source code and record total allocation size
@@ -1000,10 +1000,26 @@  config MEM_ALLOC_PROFILING_DEBUG
 	  Adds warnings with helpful error messages for memory allocation
 	  profiling.
 
+config PGALLOC_TAG_USE_PAGEFLAGS
+	bool "Use pageflags to encode page allocation tag reference"
+	default n
+	depends on MEM_ALLOC_PROFILING
+	help
+	  When set, page allocation tag references are encoded inside page
+	  flags, otherwise they are encoded in page extensions.
+
+	  Setting this flag reduces memory and performance overhead of memory
+	  allocation profiling but also limits how many allocations can be
+	  tagged. The number of bits is set by PGALLOC_TAG_USE_PAGEFLAGS and
+	  they must fit in the page flags field.
+
+	  Say N if unsure.
+
 config PGALLOC_TAG_REF_BITS
 	int "Number of bits for page allocation tag reference (10-64)"
 	range 10 64
-	default "64"
+	default "16" if PGALLOC_TAG_USE_PAGEFLAGS
+	default "64" if !PGALLOC_TAG_USE_PAGEFLAGS
 	depends on MEM_ALLOC_PROFILING
 	help
 	  Number of bits used to encode a page allocation tag reference.
@@ -1011,6 +1027,13 @@  config PGALLOC_TAG_REF_BITS
 	  Smaller number results in less memory overhead but limits the number of
 	  allocations which can be tagged (including allocations from modules).
 
+	  If PGALLOC_TAG_USE_PAGEFLAGS is set, the number of requested bits should
+	  fit inside the page flags.
+
+	  If PGALLOC_TAG_USE_PAGEFLAGS is not set, the number of bits used to store
+	  a reference is rounded up to the closest basic type. If set higher than 32,
+	  a direct pointer to the allocation tag is stored for performance reasons.
+
 source "lib/Kconfig.kasan"
 source "lib/Kconfig.kfence"
 source "lib/Kconfig.kmsan"
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 73791aa55ab6..34820650186d 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -476,6 +476,8 @@  static int __init setup_early_mem_profiling(char *str)
 }
 early_param("sysctl.vm.mem_profiling", setup_early_mem_profiling);
 
+#ifndef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+
 static __init bool need_page_alloc_tagging(void)
 {
 	return mem_profiling_support;
@@ -492,6 +494,8 @@  struct page_ext_operations page_alloc_tagging_ops = {
 };
 EXPORT_SYMBOL(page_alloc_tagging_ops);
 
+#endif /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
 #ifdef CONFIG_SYSCTL
 static struct ctl_table memory_allocation_profiling_sysctls[] = {
 	{
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 641d93f6af4c..5f993c271ee7 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -83,7 +83,7 @@  static struct page_ext_operations *page_ext_ops[] __initdata = {
 #if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
-#ifdef CONFIG_MEM_ALLOC_PROFILING
+#if defined(CONFIG_MEM_ALLOC_PROFILING) && !defined(CONFIG_PGALLOC_TAG_USE_PAGEFLAGS)
 	&page_alloc_tagging_ops,
 #endif
 #ifdef CONFIG_PAGE_TABLE_CHECK