diff mbox series

[v3,5/5] alloc_tag: config to store page allocation tag refs in page flags

Message ID 20241014203646.1952505-6-surenb@google.com (mailing list archive)
State Handled Elsewhere
Headers show
Series page allocation tag compression | expand

Commit Message

Suren Baghdasaryan Oct. 14, 2024, 8:36 p.m. UTC
Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
references directly in the page flags. This eliminates memory
overhead caused by page_ext and results in better performance
for page allocations.
If the number of available page flag bits is insufficient to
address all kernel allocations, profiling falls back to using
page extensions with an appropriate warning to disable this
config.
If dynamically loaded modules add enough tags that they can't
be addressed anymore with available page flag bits, memory
profiling gets disabled and a warning is issued.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/alloc_tag.h         |  10 +-
 include/linux/codetag.h           |   3 +
 include/linux/page-flags-layout.h |   7 ++
 include/linux/pgalloc_tag.h       | 184 +++++++++++++++++++++++++++++-
 lib/Kconfig.debug                 |  19 +++
 lib/alloc_tag.c                   |  90 ++++++++++++++-
 lib/codetag.c                     |   4 +-
 mm/mm_init.c                      |   5 +-
 8 files changed, 310 insertions(+), 12 deletions(-)

Comments

Yosry Ahmed Oct. 14, 2024, 11:48 p.m. UTC | #1
On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> references directly in the page flags. This eliminates memory
> overhead caused by page_ext and results in better performance
> for page allocations.
> If the number of available page flag bits is insufficient to
> address all kernel allocations, profiling falls back to using
> page extensions with an appropriate warning to disable this
> config.
> If dynamically loaded modules add enough tags that they can't
> be addressed anymore with available page flag bits, memory
> profiling gets disabled and a warning is issued.

Just curious, why do we need a config option? If there are enough bits
in page flags, why not use them automatically or fallback to page_ext
otherwise?

Is the reason that dynamically loadable modules, where the user may
know in advance that the page flags won't be enough with the modules
loaded?
John Hubbard Oct. 14, 2024, 11:53 p.m. UTC | #2
On 10/14/24 4:48 PM, Yosry Ahmed wrote:
> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
>> references directly in the page flags. This eliminates memory
>> overhead caused by page_ext and results in better performance
>> for page allocations.
>> If the number of available page flag bits is insufficient to
>> address all kernel allocations, profiling falls back to using
>> page extensions with an appropriate warning to disable this
>> config.
>> If dynamically loaded modules add enough tags that they can't
>> be addressed anymore with available page flag bits, memory
>> profiling gets disabled and a warning is issued.
> 
> Just curious, why do we need a config option? If there are enough bits
> in page flags, why not use them automatically or fallback to page_ext
> otherwise?

Or better yet, *always* fall back to page_ext, thus leaving the
scarce and valuable page flags available for other features?

Sorry Suren, to keep coming back to this suggestion, I know
I'm driving you crazy here! But I just keep thinking it through
and failing to see why this feature deserves to consume so
many page flags.

thanks,
Yosry Ahmed Oct. 14, 2024, 11:56 p.m. UTC | #3
On Mon, Oct 14, 2024 at 4:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/14/24 4:48 PM, Yosry Ahmed wrote:
> > On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> >> references directly in the page flags. This eliminates memory
> >> overhead caused by page_ext and results in better performance
> >> for page allocations.
> >> If the number of available page flag bits is insufficient to
> >> address all kernel allocations, profiling falls back to using
> >> page extensions with an appropriate warning to disable this
> >> config.
> >> If dynamically loaded modules add enough tags that they can't
> >> be addressed anymore with available page flag bits, memory
> >> profiling gets disabled and a warning is issued.
> >
> > Just curious, why do we need a config option? If there are enough bits
> > in page flags, why not use them automatically or fallback to page_ext
> > otherwise?
>
> Or better yet, *always* fall back to page_ext, thus leaving the
> scarce and valuable page flags available for other features?
>
> Sorry Suren, to keep coming back to this suggestion, I know
> I'm driving you crazy here! But I just keep thinking it through
> and failing to see why this feature deserves to consume so
> many page flags.

I think we already always use page_ext today. My understanding is that
the purpose of this series is to give the option to avoid using
page_ext if there are enough unused page flags anyway, which reduces
memory waste and improves performance.

My question is just why not have that be the default behavior with a
config option, use page flags if there are enough unused bits,
otherwise use page_ext.
John Hubbard Oct. 15, 2024, 12:03 a.m. UTC | #4
On 10/14/24 4:56 PM, Yosry Ahmed wrote:
> On Mon, Oct 14, 2024 at 4:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 10/14/24 4:48 PM, Yosry Ahmed wrote:
>>> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>
>>>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
>>>> references directly in the page flags. This eliminates memory
>>>> overhead caused by page_ext and results in better performance
>>>> for page allocations.
>>>> If the number of available page flag bits is insufficient to
>>>> address all kernel allocations, profiling falls back to using
>>>> page extensions with an appropriate warning to disable this
>>>> config.
>>>> If dynamically loaded modules add enough tags that they can't
>>>> be addressed anymore with available page flag bits, memory
>>>> profiling gets disabled and a warning is issued.
>>>
>>> Just curious, why do we need a config option? If there are enough bits
>>> in page flags, why not use them automatically or fallback to page_ext
>>> otherwise?
>>
>> Or better yet, *always* fall back to page_ext, thus leaving the
>> scarce and valuable page flags available for other features?
>>
>> Sorry Suren, to keep coming back to this suggestion, I know
>> I'm driving you crazy here! But I just keep thinking it through
>> and failing to see why this feature deserves to consume so
>> many page flags.
> 
> I think we already always use page_ext today. My understanding is that
> the purpose of this series is to give the option to avoid using
> page_ext if there are enough unused page flags anyway, which reduces
> memory waste and improves performance.
> 
> My question is just why not have that be the default behavior with a
> config option, use page flags if there are enough unused bits,
> otherwise use page_ext.

I agree that if you're going to implement this feature at all, then
keying off of CONFIG_MEM_ALLOC_PROFILING seems sufficient, and no
need to add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS on top of that.

thanks,
Matthew Wilcox (Oracle) Oct. 15, 2024, 1:40 a.m. UTC | #5
On Mon, Oct 14, 2024 at 05:03:32PM -0700, John Hubbard wrote:
> > > Or better yet, *always* fall back to page_ext, thus leaving the
> > > scarce and valuable page flags available for other features?
> > > 
> > > Sorry Suren, to keep coming back to this suggestion, I know
> > > I'm driving you crazy here! But I just keep thinking it through
> > > and failing to see why this feature deserves to consume so
> > > many page flags.
> > 
> > I think we already always use page_ext today. My understanding is that
> > the purpose of this series is to give the option to avoid using
> > page_ext if there are enough unused page flags anyway, which reduces
> > memory waste and improves performance.
> > 
> > My question is just why not have that be the default behavior with a
> > config option, use page flags if there are enough unused bits,
> > otherwise use page_ext.
> 
> I agree that if you're going to implement this feature at all, then
> keying off of CONFIG_MEM_ALLOC_PROFILING seems sufficient, and no
> need to add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS on top of that.

Maybe the right idea is to use all the bits which are unused in this
configuration for the first N callsites, then use page_ext for all the
ones larger than N.  It doesn't save any memory, but it does have the
performance improvement.
Suren Baghdasaryan Oct. 15, 2024, 1:58 a.m. UTC | #6
On Mon, Oct 14, 2024 at 5:03 PM 'John Hubbard' via kernel-team
<kernel-team@android.com> wrote:
>
> On 10/14/24 4:56 PM, Yosry Ahmed wrote:
> > On Mon, Oct 14, 2024 at 4:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> On 10/14/24 4:48 PM, Yosry Ahmed wrote:
> >>> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>>
> >>>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> >>>> references directly in the page flags. This eliminates memory
> >>>> overhead caused by page_ext and results in better performance
> >>>> for page allocations.
> >>>> If the number of available page flag bits is insufficient to
> >>>> address all kernel allocations, profiling falls back to using
> >>>> page extensions with an appropriate warning to disable this
> >>>> config.
> >>>> If dynamically loaded modules add enough tags that they can't
> >>>> be addressed anymore with available page flag bits, memory
> >>>> profiling gets disabled and a warning is issued.
> >>>
> >>> Just curious, why do we need a config option? If there are enough bits
> >>> in page flags, why not use them automatically or fallback to page_ext
> >>> otherwise?
> >>
> >> Or better yet, *always* fall back to page_ext, thus leaving the
> >> scarce and valuable page flags available for other features?
> >>
> >> Sorry Suren, to keep coming back to this suggestion, I know
> >> I'm driving you crazy here! But I just keep thinking it through
> >> and failing to see why this feature deserves to consume so
> >> many page flags.
> >
> > I think we already always use page_ext today. My understanding is that
> > the purpose of this series is to give the option to avoid using
> > page_ext if there are enough unused page flags anyway, which reduces
> > memory waste and improves performance.
> >
> > My question is just why not have that be the default behavior with a
> > config option, use page flags if there are enough unused bits,
> > otherwise use page_ext.
>
> I agree that if you're going to implement this feature at all, then
> keying off of CONFIG_MEM_ALLOC_PROFILING seems sufficient, and no
> need to add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS on top of that.

Yosry's original guess was correct. If not for loadable modules we
could get away with having no CONFIG_PGALLOC_TAG_USE_PAGEFLAGS. We
could try to fit codetag references into page flags and if they do not
fit we could fall back to page_ext. That works fine when we have a
predetermined number of tags. But loadable modules make this number
variable at runtime and there is a possibility we run out of page flag
bits at runtime. In that case, the current patchset disables memory
profiling and issues a warning that the user should disable
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to avoid this situation. I expect
this to be a rare case but it can happen and we have to provide a way
for a user to avoid running out of bits, which is where
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS would be used.
Thanks,
Suren.

>
> thanks,
> --
> John Hubbard
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Suren Baghdasaryan Oct. 15, 2024, 2:03 a.m. UTC | #7
On Mon, Oct 14, 2024 at 6:40 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 14, 2024 at 05:03:32PM -0700, John Hubbard wrote:
> > > > Or better yet, *always* fall back to page_ext, thus leaving the
> > > > scarce and valuable page flags available for other features?
> > > >
> > > > Sorry Suren, to keep coming back to this suggestion, I know
> > > > I'm driving you crazy here! But I just keep thinking it through
> > > > and failing to see why this feature deserves to consume so
> > > > many page flags.
> > >
> > > I think we already always use page_ext today. My understanding is that
> > > the purpose of this series is to give the option to avoid using
> > > page_ext if there are enough unused page flags anyway, which reduces
> > > memory waste and improves performance.
> > >
> > > My question is just why not have that be the default behavior with a
> > > config option, use page flags if there are enough unused bits,
> > > otherwise use page_ext.
> >
> > I agree that if you're going to implement this feature at all, then
> > keying off of CONFIG_MEM_ALLOC_PROFILING seems sufficient, and no
> > need to add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS on top of that.
>
> Maybe the right idea is to use all the bits which are unused in this
> configuration for the first N callsites, then use page_ext for all the
> ones larger than N.  It doesn't save any memory, but it does have the
> performance improvement.

Thanks for the idea! This would be more complicated and likely would
require additional branching instructions in the allocation path. I'll
think more about details but would prefer to get the simple solution
first before adding more complexity.
David Hildenbrand Oct. 15, 2024, 7:32 a.m. UTC | #8
On 15.10.24 01:53, John Hubbard wrote:
> On 10/14/24 4:48 PM, Yosry Ahmed wrote:
>> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>>
>>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
>>> references directly in the page flags. This eliminates memory
>>> overhead caused by page_ext and results in better performance
>>> for page allocations.
>>> If the number of available page flag bits is insufficient to
>>> address all kernel allocations, profiling falls back to using
>>> page extensions with an appropriate warning to disable this
>>> config.
>>> If dynamically loaded modules add enough tags that they can't
>>> be addressed anymore with available page flag bits, memory
>>> profiling gets disabled and a warning is issued.
>>
>> Just curious, why do we need a config option? If there are enough bits
>> in page flags, why not use them automatically or fallback to page_ext
>> otherwise?
> 
> Or better yet, *always* fall back to page_ext, thus leaving the
> scarce and valuable page flags available for other features?
> 
> Sorry Suren, to keep coming back to this suggestion, I know
> I'm driving you crazy here! But I just keep thinking it through
> and failing to see why this feature deserves to consume so
> many page flags.

My 2 cents: there is nothing wrong about consuming unused page flags in 
a configuration. No need to let them stay unused in a configuration :)

The real issue starts once another feature wants to make use of some of 
them ... in such configuration there would be less available for 
allocation tags and the performance of allocations tags might 
consequently get worse again.
Yosry Ahmed Oct. 15, 2024, 8:10 a.m. UTC | #9
On Mon, Oct 14, 2024 at 6:58 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Oct 14, 2024 at 5:03 PM 'John Hubbard' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > On 10/14/24 4:56 PM, Yosry Ahmed wrote:
> > > On Mon, Oct 14, 2024 at 4:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > >>
> > >> On 10/14/24 4:48 PM, Yosry Ahmed wrote:
> > >>> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >>>>
> > >>>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> > >>>> references directly in the page flags. This eliminates memory
> > >>>> overhead caused by page_ext and results in better performance
> > >>>> for page allocations.
> > >>>> If the number of available page flag bits is insufficient to
> > >>>> address all kernel allocations, profiling falls back to using
> > >>>> page extensions with an appropriate warning to disable this
> > >>>> config.
> > >>>> If dynamically loaded modules add enough tags that they can't
> > >>>> be addressed anymore with available page flag bits, memory
> > >>>> profiling gets disabled and a warning is issued.
> > >>>
> > >>> Just curious, why do we need a config option? If there are enough bits
> > >>> in page flags, why not use them automatically or fallback to page_ext
> > >>> otherwise?
> > >>
> > >> Or better yet, *always* fall back to page_ext, thus leaving the
> > >> scarce and valuable page flags available for other features?
> > >>
> > >> Sorry Suren, to keep coming back to this suggestion, I know
> > >> I'm driving you crazy here! But I just keep thinking it through
> > >> and failing to see why this feature deserves to consume so
> > >> many page flags.
> > >
> > > I think we already always use page_ext today. My understanding is that
> > > the purpose of this series is to give the option to avoid using
> > > page_ext if there are enough unused page flags anyway, which reduces
> > > memory waste and improves performance.
> > >
> > > My question is just why not have that be the default behavior with a
> > > config option, use page flags if there are enough unused bits,
> > > otherwise use page_ext.
> >
> > I agree that if you're going to implement this feature at all, then
> > keying off of CONFIG_MEM_ALLOC_PROFILING seems sufficient, and no
> > need to add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS on top of that.
>
> Yosry's original guess was correct. If not for loadable modules we
> could get away with having no CONFIG_PGALLOC_TAG_USE_PAGEFLAGS. We
> could try to fit codetag references into page flags and if they do not
> fit we could fall back to page_ext. That works fine when we have a
> predetermined number of tags. But loadable modules make this number
> variable at runtime and there is a possibility we run out of page flag
> bits at runtime. In that case, the current patchset disables memory
> profiling and issues a warning that the user should disable
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to avoid this situation. I expect
> this to be a rare case but it can happen and we have to provide a way
> for a user to avoid running out of bits, which is where
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS would be used.

If this is in fact a rare case, I think it may make more sense for the
config to be on by default, and the config description would clearly
state that it is intended to be turned off only if the loadable
modules warning fires.
Suren Baghdasaryan Oct. 15, 2024, 2:59 p.m. UTC | #10
On Tue, Oct 15, 2024 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.10.24 01:53, John Hubbard wrote:
> > On 10/14/24 4:48 PM, Yosry Ahmed wrote:
> >> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>
> >>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> >>> references directly in the page flags. This eliminates memory
> >>> overhead caused by page_ext and results in better performance
> >>> for page allocations.
> >>> If the number of available page flag bits is insufficient to
> >>> address all kernel allocations, profiling falls back to using
> >>> page extensions with an appropriate warning to disable this
> >>> config.
> >>> If dynamically loaded modules add enough tags that they can't
> >>> be addressed anymore with available page flag bits, memory
> >>> profiling gets disabled and a warning is issued.
> >>
> >> Just curious, why do we need a config option? If there are enough bits
> >> in page flags, why not use them automatically or fallback to page_ext
> >> otherwise?
> >
> > Or better yet, *always* fall back to page_ext, thus leaving the
> > scarce and valuable page flags available for other features?
> >
> > Sorry Suren, to keep coming back to this suggestion, I know
> > I'm driving you crazy here! But I just keep thinking it through
> > and failing to see why this feature deserves to consume so
> > many page flags.
>
> My 2 cents: there is nothing wrong about consuming unused page flags in
> a configuration. No need to let them stay unused in a configuration :)
>
> The real issue starts once another feature wants to make use of some of
> them ... in such configuration there would be less available for
> allocation tags and the performance of allocations tags might
> consequently get worse again.

Thanks for the input and indeed this is the case. If this happens, we
will get a warning telling us that page flags could not be used and
page_ext will be used instead. I think that's the best I can do given
that page flag bits is a limited resource.

>
> --
> Cheers,
>
> David / dhildenb
>
Suren Baghdasaryan Oct. 15, 2024, 3:06 p.m. UTC | #11
On Tue, Oct 15, 2024 at 1:10 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Oct 14, 2024 at 6:58 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Oct 14, 2024 at 5:03 PM 'John Hubbard' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > On 10/14/24 4:56 PM, Yosry Ahmed wrote:
> > > > On Mon, Oct 14, 2024 at 4:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > > >>
> > > >> On 10/14/24 4:48 PM, Yosry Ahmed wrote:
> > > >>> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >>>>
> > > >>>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> > > >>>> references directly in the page flags. This eliminates memory
> > > >>>> overhead caused by page_ext and results in better performance
> > > >>>> for page allocations.
> > > >>>> If the number of available page flag bits is insufficient to
> > > >>>> address all kernel allocations, profiling falls back to using
> > > >>>> page extensions with an appropriate warning to disable this
> > > >>>> config.
> > > >>>> If dynamically loaded modules add enough tags that they can't
> > > >>>> be addressed anymore with available page flag bits, memory
> > > >>>> profiling gets disabled and a warning is issued.
> > > >>>
> > > >>> Just curious, why do we need a config option? If there are enough bits
> > > >>> in page flags, why not use them automatically or fallback to page_ext
> > > >>> otherwise?
> > > >>
> > > >> Or better yet, *always* fall back to page_ext, thus leaving the
> > > >> scarce and valuable page flags available for other features?
> > > >>
> > > >> Sorry Suren, to keep coming back to this suggestion, I know
> > > >> I'm driving you crazy here! But I just keep thinking it through
> > > >> and failing to see why this feature deserves to consume so
> > > >> many page flags.
> > > >
> > > > I think we already always use page_ext today. My understanding is that
> > > > the purpose of this series is to give the option to avoid using
> > > > page_ext if there are enough unused page flags anyway, which reduces
> > > > memory waste and improves performance.
> > > >
> > > > My question is just why not have that be the default behavior with a
> > > > config option, use page flags if there are enough unused bits,
> > > > otherwise use page_ext.
> > >
> > > I agree that if you're going to implement this feature at all, then
> > > keying off of CONFIG_MEM_ALLOC_PROFILING seems sufficient, and no
> > > need to add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS on top of that.
> >
> > Yosry's original guess was correct. If not for loadable modules we
> > could get away with having no CONFIG_PGALLOC_TAG_USE_PAGEFLAGS. We
> > could try to fit codetag references into page flags and if they do not
> > fit we could fall back to page_ext. That works fine when we have a
> > predetermined number of tags. But loadable modules make this number
> > variable at runtime and there is a possibility we run out of page flag
> > bits at runtime. In that case, the current patchset disables memory
> > profiling and issues a warning that the user should disable
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to avoid this situation. I expect
> > this to be a rare case but it can happen and we have to provide a way
> > for a user to avoid running out of bits, which is where
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS would be used.
>
> If this is in fact a rare case, I think it may make more sense for the
> config to be on by default, and the config description would clearly
> state that it is intended to be turned off only if the loadable
> modules warning fires.

Just to be clear, I think running out of pageflag bits at runtime (as
a result of module loading) will be a rare case. That's because the
core kernel has many more allocations than a typical module. Not
having enough bits for the core kernel might be the most prevalent
case, at least for large systems.
That said, your suggestion makes sense. Since we have to keep
CONFIG_PAGE_EXTENSION dependency (for the cases when we have to fall
back to page_ext), there is no advantage of keeping
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS disabled. I'll change the default in
the next version. Thanks for the suggestion!
David Hildenbrand Oct. 15, 2024, 3:42 p.m. UTC | #12
On 15.10.24 16:59, Suren Baghdasaryan wrote:
> On Tue, Oct 15, 2024 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 15.10.24 01:53, John Hubbard wrote:
>>> On 10/14/24 4:48 PM, Yosry Ahmed wrote:
>>>> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>>
>>>>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
>>>>> references directly in the page flags. This eliminates memory
>>>>> overhead caused by page_ext and results in better performance
>>>>> for page allocations.
>>>>> If the number of available page flag bits is insufficient to
>>>>> address all kernel allocations, profiling falls back to using
>>>>> page extensions with an appropriate warning to disable this
>>>>> config.
>>>>> If dynamically loaded modules add enough tags that they can't
>>>>> be addressed anymore with available page flag bits, memory
>>>>> profiling gets disabled and a warning is issued.
>>>>
>>>> Just curious, why do we need a config option? If there are enough bits
>>>> in page flags, why not use them automatically or fallback to page_ext
>>>> otherwise?
>>>
>>> Or better yet, *always* fall back to page_ext, thus leaving the
>>> scarce and valuable page flags available for other features?
>>>
>>> Sorry Suren, to keep coming back to this suggestion, I know
>>> I'm driving you crazy here! But I just keep thinking it through
>>> and failing to see why this feature deserves to consume so
>>> many page flags.
>>
>> My 2 cents: there is nothing wrong about consuming unused page flags in
>> a configuration. No need to let them stay unused in a configuration :)
>>
>> The real issue starts once another feature wants to make use of some of
>> them ... in such configuration there would be less available for
>> allocation tags and the performance of allocations tags might
>> consequently get worse again.
> 
> Thanks for the input and indeed this is the case. If this happens, we
> will get a warning telling us that page flags could not be used and
> page_ext will be used instead. I think that's the best I can do given
> that page flag bits is a limited resource.

Right, I think what John is concerned about (and me as well) is that 
once a new feature really needs a page flag, there will be objection 
like "no you can't, we need them for allocation tags otherwise that 
feature will be degraded".

So a "The Lord has given, and the Lord has taken away!" mentality might 
be required when consuming that many scarce resources, meaning, as long 
as they are actually unused, use them, but it should not block other 
features that really need them.

Does that make sense?
Suren Baghdasaryan Oct. 15, 2024, 3:58 p.m. UTC | #13
On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.10.24 16:59, Suren Baghdasaryan wrote:
> > On Tue, Oct 15, 2024 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 15.10.24 01:53, John Hubbard wrote:
> >>> On 10/14/24 4:48 PM, Yosry Ahmed wrote:
> >>>> On Mon, Oct 14, 2024 at 1:37 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>>>
> >>>>> Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> >>>>> references directly in the page flags. This eliminates memory
> >>>>> overhead caused by page_ext and results in better performance
> >>>>> for page allocations.
> >>>>> If the number of available page flag bits is insufficient to
> >>>>> address all kernel allocations, profiling falls back to using
> >>>>> page extensions with an appropriate warning to disable this
> >>>>> config.
> >>>>> If dynamically loaded modules add enough tags that they can't
> >>>>> be addressed anymore with available page flag bits, memory
> >>>>> profiling gets disabled and a warning is issued.
> >>>>
> >>>> Just curious, why do we need a config option? If there are enough bits
> >>>> in page flags, why not use them automatically or fallback to page_ext
> >>>> otherwise?
> >>>
> >>> Or better yet, *always* fall back to page_ext, thus leaving the
> >>> scarce and valuable page flags available for other features?
> >>>
> >>> Sorry Suren, to keep coming back to this suggestion, I know
> >>> I'm driving you crazy here! But I just keep thinking it through
> >>> and failing to see why this feature deserves to consume so
> >>> many page flags.
> >>
> >> My 2 cents: there is nothing wrong about consuming unused page flags in
> >> a configuration. No need to let them stay unused in a configuration :)
> >>
> >> The real issue starts once another feature wants to make use of some of
> >> them ... in such configuration there would be less available for
> >> allocation tags and the performance of allocations tags might
> >> consequently get worse again.
> >
> > Thanks for the input and indeed this is the case. If this happens, we
> > will get a warning telling us that page flags could not be used and
> > page_ext will be used instead. I think that's the best I can do given
> > that page flag bits is a limited resource.
>
> Right, I think what John is concerned about (and me as well) is that
> once a new feature really needs a page flag, there will be objection
> like "no you can't, we need them for allocation tags otherwise that
> feature will be degraded".

I do understand your concern but IMHO the possibility of degrading a
feature should not be a reason to always operate at degraded capacity
(which is what we have today). If one is really concerned about
possible future regression they can set
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's
why I'm strongly advocating that we do need
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how
this scarce resource is used.

>
> So a "The Lord has given, and the Lord has taken away!" mentality might
> be required when consuming that many scarce resources, meaning, as long
> as they are actually unused, use them, but it should not block other
> features that really need them.

I agree and I think that's what I implemented here. If there are
enough page flag bits we use them, otherwise we automatically fall
back to page_ext.

>
> Does that make sense?

Absolutely and thank you all for the feedback.

>
> --
> Cheers,
>
> David / dhildenb
>
Michal Hocko Oct. 18, 2024, 1:03 p.m. UTC | #14
On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote:
> On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand <david@redhat.com> wrote:
[...]
> > Right, I think what John is concerned about (and me as well) is that
> > once a new feature really needs a page flag, there will be objection
> > like "no you can't, we need them for allocation tags otherwise that
> > feature will be degraded".
> 
> I do understand your concern but IMHO the possibility of degrading a
> feature should not be a reason to always operate at degraded capacity
> (which is what we have today). If one is really concerned about
> possible future regression they can set
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's
> why I'm strongly advocating that we do need
> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how
> this scarce resource is used.

I really do not think users will know how/why to setup this and I wouldn't
even bother them thinking about that at all TBH. 

This is an implementation detail. It is fine to reuse unused flags space
as a storage as a performance optimization but why do you want users to
bother with that? Why would they ever want to say N here?
Suren Baghdasaryan Oct. 18, 2024, 4:04 p.m. UTC | #15
On Fri, Oct 18, 2024 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote:
> > On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand <david@redhat.com> wrote:
> [...]
> > > Right, I think what John is concerned about (and me as well) is that
> > > once a new feature really needs a page flag, there will be objection
> > > like "no you can't, we need them for allocation tags otherwise that
> > > feature will be degraded".
> >
> > I do understand your concern but IMHO the possibility of degrading a
> > feature should not be a reason to always operate at degraded capacity
> > (which is what we have today). If one is really concerned about
> > possible future regression they can set
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's
> > why I'm strongly advocating that we do need
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how
> > this scarce resource is used.
>
> I really do not think users will know how/why to setup this and I wouldn't
> even bother them thinking about that at all TBH.
>
> This is an implementation detail. It is fine to reuse unused flags space
> as a storage as a performance optimization but why do you want users to
> bother with that? Why would they ever want to say N here?

In this patch you can find a couple of warnings that look like this:

pr_warn("With module %s there are too many tags to fit in %d page flag
bits. Memory profiling is disabled!\n", mod->name,
NR_UNUSED_PAGEFLAG_BITS);
emitted when we run out of page flag bits during a module loading,

pr_err("%s: alignment %lu is incompatible with allocation tag
indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS",  mod->name,
align);
emitted when the arch-specific section alignment is incompatible with
alloc_tag indexing.

I'll change the first one to also specifically guide the user to
disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.
When these happen, memory profiling gets disabled automatically. These
two cases would be the main ones when the user would want to disable
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to keep memory profiling enabled.
I also think when we auto-disable memory profiling at runtime like
that, I should make /proc/allocinfo empty so that it's apparent it is
disabled and the user does not use stale data.


> --
> Michal Hocko
> SUSE Labs
Michal Hocko Oct. 18, 2024, 5:08 p.m. UTC | #16
On Fri 18-10-24 09:04:24, Suren Baghdasaryan wrote:
> On Fri, Oct 18, 2024 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote:
> > > On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand <david@redhat.com> wrote:
> > [...]
> > > > Right, I think what John is concerned about (and me as well) is that
> > > > once a new feature really needs a page flag, there will be objection
> > > > like "no you can't, we need them for allocation tags otherwise that
> > > > feature will be degraded".
> > >
> > > I do understand your concern but IMHO the possibility of degrading a
> > > feature should not be a reason to always operate at degraded capacity
> > > (which is what we have today). If one is really concerned about
> > > possible future regression they can set
> > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's
> > > why I'm strongly advocating that we do need
> > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how
> > > this scarce resource is used.
> >
> > I really do not think users will know how/why to setup this and I wouldn't
> > even bother them thinking about that at all TBH.
> >
> > This is an implementation detail. It is fine to reuse unused flags space
> > as a storage as a performance optimization but why do you want users to
> > bother with that? Why would they ever want to say N here?
> 
> In this patch you can find a couple of warnings that look like this:
> 
> pr_warn("With module %s there are too many tags to fit in %d page flag
> bits. Memory profiling is disabled!\n", mod->name,
> NR_UNUSED_PAGEFLAG_BITS);
> emitted when we run out of page flag bits during a module loading,
> 
> pr_err("%s: alignment %lu is incompatible with allocation tag
> indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS",  mod->name,
> align);
> emitted when the arch-specific section alignment is incompatible with
> alloc_tag indexing.

You are asking users to workaround implementation issue by configuration
which sounds like a really bad idea. Why cannot you make the fallback
automatic?
Suren Baghdasaryan Oct. 18, 2024, 5:45 p.m. UTC | #17
On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 18-10-24 09:04:24, Suren Baghdasaryan wrote:
> > On Fri, Oct 18, 2024 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote:
> > > > On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand <david@redhat.com> wrote:
> > > [...]
> > > > > Right, I think what John is concerned about (and me as well) is that
> > > > > once a new feature really needs a page flag, there will be objection
> > > > > like "no you can't, we need them for allocation tags otherwise that
> > > > > feature will be degraded".
> > > >
> > > > I do understand your concern but IMHO the possibility of degrading a
> > > > feature should not be a reason to always operate at degraded capacity
> > > > (which is what we have today). If one is really concerned about
> > > > possible future regression they can set
> > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's
> > > > why I'm strongly advocating that we do need
> > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how
> > > > this scarce resource is used.
> > >
> > > I really do not think users will know how/why to setup this and I wouldn't
> > > even bother them thinking about that at all TBH.
> > >
> > > This is an implementation detail. It is fine to reuse unused flags space
> > > as a storage as a performance optimization but why do you want users to
> > > bother with that? Why would they ever want to say N here?
> >
> > In this patch you can find a couple of warnings that look like this:
> >
> > pr_warn("With module %s there are too many tags to fit in %d page flag
> > bits. Memory profiling is disabled!\n", mod->name,
> > NR_UNUSED_PAGEFLAG_BITS);
> > emitted when we run out of page flag bits during a module loading,
> >
> > pr_err("%s: alignment %lu is incompatible with allocation tag
> > indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS",  mod->name,
> > align);
> > emitted when the arch-specific section alignment is incompatible with
> > alloc_tag indexing.
>
> You are asking users to workaround implementation issue by configuration
> which sounds like a really bad idea. Why cannot you make the fallback
> automatic?

Automatic fallback is possible during boot, when we decide whether to
enable page extensions or not. So, if during boot we decide to disable
page extensions and use page flags, we can't go back and re-enable
page extensions after boot is complete. Since there is a possibility
that we run out of page flags at runtime when we load a new module,
this leaves this case when we can't reference the module tags and we
can't fall back to page extensions, so we have to disable memory
profiling.
I could keep page extensions always on just in case this happens but
that's a lot of memory waste to handle a rare case...

>
> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Oct. 18, 2024, 9:57 p.m. UTC | #18
On Fri, Oct 18, 2024 at 10:45 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 18-10-24 09:04:24, Suren Baghdasaryan wrote:
> > > On Fri, Oct 18, 2024 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote:
> > > > > On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand <david@redhat.com> wrote:
> > > > [...]
> > > > > > Right, I think what John is concerned about (and me as well) is that
> > > > > > once a new feature really needs a page flag, there will be objection
> > > > > > like "no you can't, we need them for allocation tags otherwise that
> > > > > > feature will be degraded".
> > > > >
> > > > > I do understand your concern but IMHO the possibility of degrading a
> > > > > feature should not be a reason to always operate at degraded capacity
> > > > > (which is what we have today). If one is really concerned about
> > > > > possible future regression they can set
> > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's
> > > > > why I'm strongly advocating that we do need
> > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how
> > > > > this scarce resource is used.
> > > >
> > > > I really do not think users will know how/why to setup this and I wouldn't
> > > > even bother them thinking about that at all TBH.
> > > >
> > > > This is an implementation detail. It is fine to reuse unused flags space
> > > > as a storage as a performance optimization but why do you want users to
> > > > bother with that? Why would they ever want to say N here?
> > >
> > > In this patch you can find a couple of warnings that look like this:
> > >
> > > pr_warn("With module %s there are too many tags to fit in %d page flag
> > > bits. Memory profiling is disabled!\n", mod->name,
> > > NR_UNUSED_PAGEFLAG_BITS);
> > > emitted when we run out of page flag bits during a module loading,
> > >
> > > pr_err("%s: alignment %lu is incompatible with allocation tag
> > > indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS",  mod->name,
> > > align);
> > > emitted when the arch-specific section alignment is incompatible with
> > > alloc_tag indexing.
> >
> > You are asking users to workaround implementation issue by configuration
> > which sounds like a really bad idea. Why cannot you make the fallback
> > automatic?
>
> Automatic fallback is possible during boot, when we decide whether to
> enable page extensions or not. So, if during boot we decide to disable
> page extensions and use page flags, we can't go back and re-enable
> page extensions after boot is complete. Since there is a possibility
> that we run out of page flags at runtime when we load a new module,
> this leaves this case when we can't reference the module tags and we
> can't fall back to page extensions, so we have to disable memory
> profiling.
> I could keep page extensions always on just in case this happens but
> that's a lot of memory waste to handle a rare case...

After thinking more about this, I suggest a couple of changes that
IMHO would make configuration simpler:
1. Change the CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to an early boot
parameter. Today we have a "mem_profiling" parameter to enable/disable
memory profiling. I suggest adding "mem_profiling_use_pgflags" to
switch the current behavior of using page extensions to use page
flags. We keep the current behavior of using page extensions as
default (mem_profiling_use_pgflags=0) because it always works even
though it has higher overhead.
2. No auto-fallback. If mem_profiling_use_pgflags=1 and we don't have
enough page flags (at boot time or later when we load a module), we
simply disable memory profiling with a warning.

I think this would be less confusing for users and will avoid David's
example when performance is unexpectedly degraded. The default option
will work for everyone as it does today and advanced users who want to
minimize the overhead can set mem_profiling_use_pgflags=1 to check if
that works for their system.
WDYT?

>
> >
> > --
> > Michal Hocko
> > SUSE Labs
Michal Hocko Oct. 21, 2024, 7:21 a.m. UTC | #19
On Fri 18-10-24 10:45:39, Suren Baghdasaryan wrote:
> On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 18-10-24 09:04:24, Suren Baghdasaryan wrote:
> > > On Fri, Oct 18, 2024 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote:
> > > > > On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand <david@redhat.com> wrote:
> > > > [...]
> > > > > > Right, I think what John is concerned about (and me as well) is that
> > > > > > once a new feature really needs a page flag, there will be objection
> > > > > > like "no you can't, we need them for allocation tags otherwise that
> > > > > > feature will be degraded".
> > > > >
> > > > > I do understand your concern but IMHO the possibility of degrading a
> > > > > feature should not be a reason to always operate at degraded capacity
> > > > > (which is what we have today). If one is really concerned about
> > > > > possible future regression they can set
> > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's
> > > > > why I'm strongly advocating that we do need
> > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how
> > > > > this scarce resource is used.
> > > >
> > > > I really do not think users will know how/why to setup this and I wouldn't
> > > > even bother them thinking about that at all TBH.
> > > >
> > > > This is an implementation detail. It is fine to reuse unused flags space
> > > > as a storage as a performance optimization but why do you want users to
> > > > bother with that? Why would they ever want to say N here?
> > >
> > > In this patch you can find a couple of warnings that look like this:
> > >
> > > pr_warn("With module %s there are too many tags to fit in %d page flag
> > > bits. Memory profiling is disabled!\n", mod->name,
> > > NR_UNUSED_PAGEFLAG_BITS);
> > > emitted when we run out of page flag bits during a module loading,
> > >
> > > pr_err("%s: alignment %lu is incompatible with allocation tag
> > > indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS",  mod->name,
> > > align);
> > > emitted when the arch-specific section alignment is incompatible with
> > > alloc_tag indexing.
> >
> > You are asking users to workaround implementation issue by configuration
> > which sounds like a really bad idea. Why cannot you make the fallback
> > automatic?
> 
> Automatic fallback is possible during boot, when we decide whether to
> enable page extensions or not. So, if during boot we decide to disable
> page extensions and use page flags, we can't go back and re-enable
> page extensions after boot is complete. Since there is a possibility
> that we run out of page flags at runtime when we load a new module,
> this leaves this case when we can't reference the module tags and we
> can't fall back to page extensions, so we have to disable memory
> profiling.

Right, I do understand (I guess) the challenge. I am just arguing that
it makes really no sense to tell user to recompile the kernel with a
CONFIG_FOO to workaround this limitation. Please note that many users of
this feature will simply use a precompiled (e.g. distribution) kernels.
Once you force somebody to recompile with
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n they are not going back to a more
memory optimal implementation.

Just my 2cents
Michal Hocko Oct. 21, 2024, 7:26 a.m. UTC | #20
On Fri 18-10-24 14:57:26, Suren Baghdasaryan wrote:
> On Fri, Oct 18, 2024 at 10:45 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 18-10-24 09:04:24, Suren Baghdasaryan wrote:
> > > > On Fri, Oct 18, 2024 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 15-10-24 08:58:59, Suren Baghdasaryan wrote:
> > > > > > On Tue, Oct 15, 2024 at 8:42 AM David Hildenbrand <david@redhat.com> wrote:
> > > > > [...]
> > > > > > > Right, I think what John is concerned about (and me as well) is that
> > > > > > > once a new feature really needs a page flag, there will be objection
> > > > > > > like "no you can't, we need them for allocation tags otherwise that
> > > > > > > feature will be degraded".
> > > > > >
> > > > > > I do understand your concern but IMHO the possibility of degrading a
> > > > > > feature should not be a reason to always operate at degraded capacity
> > > > > > (which is what we have today). If one is really concerned about
> > > > > > possible future regression they can set
> > > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS=n and keep what we have today. That's
> > > > > > why I'm strongly advocating that we do need
> > > > > > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS so that the user has control over how
> > > > > > this scarce resource is used.
> > > > >
> > > > > I really do not think users will know how/why to setup this and I wouldn't
> > > > > even bother them thinking about that at all TBH.
> > > > >
> > > > > This is an implementation detail. It is fine to reuse unused flags space
> > > > > as a storage as a performance optimization but why do you want users to
> > > > > bother with that? Why would they ever want to say N here?
> > > >
> > > > In this patch you can find a couple of warnings that look like this:
> > > >
> > > > pr_warn("With module %s there are too many tags to fit in %d page flag
> > > > bits. Memory profiling is disabled!\n", mod->name,
> > > > NR_UNUSED_PAGEFLAG_BITS);
> > > > emitted when we run out of page flag bits during a module loading,
> > > >
> > > > pr_err("%s: alignment %lu is incompatible with allocation tag
> > > > indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS",  mod->name,
> > > > align);
> > > > emitted when the arch-specific section alignment is incompatible with
> > > > alloc_tag indexing.
> > >
> > > You are asking users to workaround implementation issue by configuration
> > > which sounds like a really bad idea. Why cannot you make the fallback
> > > automatic?
> >
> > Automatic fallback is possible during boot, when we decide whether to
> > enable page extensions or not. So, if during boot we decide to disable
> > page extensions and use page flags, we can't go back and re-enable
> > page extensions after boot is complete. Since there is a possibility
> > that we run out of page flags at runtime when we load a new module,
> > this leaves this case when we can't reference the module tags and we
> > can't fall back to page extensions, so we have to disable memory
> > profiling.
> > I could keep page extensions always on just in case this happens but
> > that's a lot of memory waste to handle a rare case...
> 
> After thinking more about this, I suggest a couple of changes that
> IMHO would make configuration simpler:
> 1. Change the CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to an early boot
> parameter.

This makes much more sense!

> Today we have a "mem_profiling" parameter to enable/disable
> memory profiling. I suggest adding "mem_profiling_use_pgflags" to
> switch the current behavior of using page extensions to use page
> flags.

I do not want to bikeshed about this but to me it would make more sense
to have an extension paramater to mem_profiling and call it something
like compress or similar so that page flags are not really carved into
naming. The docuemntation then can explain that the copression cannot be
always guaranteed and it might fail so this is more of a optimistic and
potentially failing optimization that might need to be dropped in some
usege scenarios.

> We keep the current behavior of using page extensions as
> default (mem_profiling_use_pgflags=0) because it always works even
> though it has higher overhead.

Yes this seems to be a safe default.

> 2. No auto-fallback. If mem_profiling_use_pgflags=1 and we don't have
> enough page flags (at boot time or later when we load a module), we
> simply disable memory profiling with a warning.

No strong opinion on this.
David Hildenbrand Oct. 21, 2024, 9:13 a.m. UTC | #21
Am 21.10.24 um 09:26 schrieb Michal Hocko:
> On Fri 18-10-24 14:57:26, Suren Baghdasaryan wrote:
>> On Fri, Oct 18, 2024 at 10:45 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>
>>> On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
>>>
>>> Automatic fallback is possible during boot, when we decide whether to
>>> enable page extensions or not. So, if during boot we decide to disable
>>> page extensions and use page flags, we can't go back and re-enable
>>> page extensions after boot is complete. Since there is a possibility
>>> that we run out of page flags at runtime when we load a new module,
>>> this leaves this case when we can't reference the module tags and we
>>> can't fall back to page extensions, so we have to disable memory
>>> profiling.
>>> I could keep page extensions always on just in case this happens but
>>> that's a lot of memory waste to handle a rare case...
>>
>> After thinking more about this, I suggest a couple of changes that
>> IMHO would make configuration simpler:
>> 1. Change the CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to an early boot
>> parameter.
> 
> This makes much more sense!
> 
>> Today we have a "mem_profiling" parameter to enable/disable
>> memory profiling. I suggest adding "mem_profiling_use_pgflags" to
>> switch the current behavior of using page extensions to use page
>> flags.
> 
> I do not want to bikeshed about this but to me it would make more sense
> to have an extension paramater to mem_profiling and call it something
> like compress or similar so that page flags are not really carved into
> naming. The docuemntation then can explain that the copression cannot be
> always guaranteed and it might fail so this is more of a optimistic and
> potentially failing optimization that might need to be dropped in some
> usege scenarios.

Maybe we can reuse the existing parameter (e.g., tristate). Only makes sense if 
we don't expect too many other modes though :)

> 
>> We keep the current behavior of using page extensions as
>> default (mem_profiling_use_pgflags=0) because it always works even
>> though it has higher overhead.
> 
> Yes this seems to be a safe default.

Agreed.

> 
>> 2. No auto-fallback. If mem_profiling_use_pgflags=1 and we don't have
>> enough page flags (at boot time or later when we load a module), we
>> simply disable memory profiling with a warning.

Sounds reasonable to me.
Suren Baghdasaryan Oct. 21, 2024, 3:05 p.m. UTC | #22
On Mon, Oct 21, 2024 at 2:13 AM David Hildenbrand <david@redhat.com> wrote:
>
>
>
> Am 21.10.24 um 09:26 schrieb Michal Hocko:
> > On Fri 18-10-24 14:57:26, Suren Baghdasaryan wrote:
> >> On Fri, Oct 18, 2024 at 10:45 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>
> >>> On Fri, Oct 18, 2024 at 10:08 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>
> >>> Automatic fallback is possible during boot, when we decide whether to
> >>> enable page extensions or not. So, if during boot we decide to disable
> >>> page extensions and use page flags, we can't go back and re-enable
> >>> page extensions after boot is complete. Since there is a possibility
> >>> that we run out of page flags at runtime when we load a new module,
> >>> this leaves this case when we can't reference the module tags and we
> >>> can't fall back to page extensions, so we have to disable memory
> >>> profiling.
> >>> I could keep page extensions always on just in case this happens but
> >>> that's a lot of memory waste to handle a rare case...
> >>
> >> After thinking more about this, I suggest a couple of changes that
> >> IMHO would make configuration simpler:
> >> 1. Change the CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to an early boot
> >> parameter.
> >
> > This makes much more sense!
> >
> >> Today we have a "mem_profiling" parameter to enable/disable
> >> memory profiling. I suggest adding "mem_profiling_use_pgflags" to
> >> switch the current behavior of using page extensions to use page
> >> flags.
> >
> > I do not want to bikeshed about this but to me it would make more sense
> > to have an extension paramater to mem_profiling and call it something
> > like compress or similar so that page flags are not really carved into
> > naming. The docuemntation then can explain that the copression cannot be
> > always guaranteed and it might fail so this is more of a optimistic and
> > potentially failing optimization that might need to be dropped in some
> > usege scenarios.
>
> Maybe we can reuse the existing parameter (e.g., tristate). Only makes sense if
> we don't expect too many other modes though :)

Yeah, I thought about adding new values to "mem_profiling" but it's a
bit complicated. Today it's a tristate:

mem_profiling=0|1|never

0/1 means we disable/enable memory profiling by default but the user
can enable it at runtime using a sysctl. This means that we enable
page_ext at boot even when it's set to 0.
"never" means we do not enable page_ext, memory profiling is disabled
and sysctl to enable it will not be exposed. Used when a distribution
has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
not want to waste memory on enabling page_ext.

I can add another option like "pgflags" but then it also needs to
specify whether we should enable or disable profiling by default
(similar to 0|1 for page_ext mode). IOW we will need to encode also
the default state we want. Something like this:

mem_profiling=0|1|never|pgflags_on|pgflags_off

Would this be acceptable?


>
> >
> >> We keep the current behavior of using page extensions as
> >> default (mem_profiling_use_pgflags=0) because it always works even
> >> though it has higher overhead.
> >
> > Yes this seems to be a safe default.
>
> Agreed.
>
> >
> >> 2. No auto-fallback. If mem_profiling_use_pgflags=1 and we don't have
> >> enough page flags (at boot time or later when we load a module), we
> >> simply disable memory profiling with a warning.
>
> Sounds reasonable to me.
>
> --
> Cheers,
>
> David / dhildenb
>
Michal Hocko Oct. 21, 2024, 3:34 p.m. UTC | #23
On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote:
[...]
> Yeah, I thought about adding new values to "mem_profiling" but it's a
> bit complicated. Today it's a tristate:
> 
> mem_profiling=0|1|never
> 
> 0/1 means we disable/enable memory profiling by default but the user
> can enable it at runtime using a sysctl. This means that we enable
> page_ext at boot even when it's set to 0.
> "never" means we do not enable page_ext, memory profiling is disabled
> and sysctl to enable it will not be exposed. Used when a distribution
> has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
> not want to waste memory on enabling page_ext.
> 
> I can add another option like "pgflags" but then it also needs to
> specify whether we should enable or disable profiling by default
> (similar to 0|1 for page_ext mode). IOW we will need to encode also
> the default state we want. Something like this:
> 
> mem_profiling=0|1|never|pgflags_on|pgflags_off
> 
> Would this be acceptable?

Isn't this overcomplicating it? Why cannot you simply go with
mem_profiling={0|never|1}[,$YOUR_OPTIONS]

While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply
or just be ignored if that is not applicable.
Suren Baghdasaryan Oct. 21, 2024, 3:41 p.m. UTC | #24
On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote:
> [...]
> > Yeah, I thought about adding new values to "mem_profiling" but it's a
> > bit complicated. Today it's a tristate:
> >
> > mem_profiling=0|1|never
> >
> > 0/1 means we disable/enable memory profiling by default but the user
> > can enable it at runtime using a sysctl. This means that we enable
> > page_ext at boot even when it's set to 0.
> > "never" means we do not enable page_ext, memory profiling is disabled
> > and sysctl to enable it will not be exposed. Used when a distribution
> > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
> > not want to waste memory on enabling page_ext.
> >
> > I can add another option like "pgflags" but then it also needs to
> > specify whether we should enable or disable profiling by default
> > (similar to 0|1 for page_ext mode). IOW we will need to encode also
> > the default state we want. Something like this:
> >
> > mem_profiling=0|1|never|pgflags_on|pgflags_off
> >
> > Would this be acceptable?
>
> Isn't this overcomplicating it? Why cannot you simply go with
> mem_profiling={0|never|1}[,$YOUR_OPTIONS]
>
> While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply
> or just be ignored if that is not applicable.

Oh, you mean having 2 parts in the parameter with supported options being:

mem_profiling=never
mem_profiling=0
mem_profiling=1
mem_profiling=0,pgflags
mem_profiling=1,pgflags

Did I understand correctly? If so then yes, this should work.

> --
> Michal Hocko
> SUSE Labs
David Hildenbrand Oct. 21, 2024, 3:49 p.m. UTC | #25
On 21.10.24 17:41, Suren Baghdasaryan wrote:
> On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote:
>> [...]
>>> Yeah, I thought about adding new values to "mem_profiling" but it's a
>>> bit complicated. Today it's a tristate:
>>>
>>> mem_profiling=0|1|never
>>>
>>> 0/1 means we disable/enable memory profiling by default but the user
>>> can enable it at runtime using a sysctl. This means that we enable
>>> page_ext at boot even when it's set to 0.
>>> "never" means we do not enable page_ext, memory profiling is disabled
>>> and sysctl to enable it will not be exposed. Used when a distribution
>>> has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
>>> not want to waste memory on enabling page_ext.
>>>
>>> I can add another option like "pgflags" but then it also needs to
>>> specify whether we should enable or disable profiling by default
>>> (similar to 0|1 for page_ext mode). IOW we will need to encode also
>>> the default state we want. Something like this:
>>>
>>> mem_profiling=0|1|never|pgflags_on|pgflags_off
>>>
>>> Would this be acceptable?
>>
>> Isn't this overcomplicating it? Why cannot you simply go with
>> mem_profiling={0|never|1}[,$YOUR_OPTIONS]
>>
>> While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply
>> or just be ignored if that is not applicable.
> 
> Oh, you mean having 2 parts in the parameter with supported options being:
> 
> mem_profiling=never
> mem_profiling=0
> mem_profiling=1
> mem_profiling=0,pgflags
> mem_profiling=1,pgflags
> 

That's also a viable solution indeed.
Michal Hocko Oct. 21, 2024, 3:57 p.m. UTC | #26
On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote:
> On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote:
> > [...]
> > > Yeah, I thought about adding new values to "mem_profiling" but it's a
> > > bit complicated. Today it's a tristate:
> > >
> > > mem_profiling=0|1|never
> > >
> > > 0/1 means we disable/enable memory profiling by default but the user
> > > can enable it at runtime using a sysctl. This means that we enable
> > > page_ext at boot even when it's set to 0.
> > > "never" means we do not enable page_ext, memory profiling is disabled
> > > and sysctl to enable it will not be exposed. Used when a distribution
> > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
> > > not want to waste memory on enabling page_ext.
> > >
> > > I can add another option like "pgflags" but then it also needs to
> > > specify whether we should enable or disable profiling by default
> > > (similar to 0|1 for page_ext mode). IOW we will need to encode also
> > > the default state we want. Something like this:
> > >
> > > mem_profiling=0|1|never|pgflags_on|pgflags_off
> > >
> > > Would this be acceptable?
> >
> > Isn't this overcomplicating it? Why cannot you simply go with
> > mem_profiling={0|never|1}[,$YOUR_OPTIONS]
> >
> > While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply
> > or just be ignored if that is not applicable.
> 
> Oh, you mean having 2 parts in the parameter with supported options being:
> 
> mem_profiling=never
> mem_profiling=0
> mem_profiling=1
> mem_profiling=0,pgflags
> mem_profiling=1,pgflags
> 
> Did I understand correctly? If so then yes, this should work.

yes. I would just not call it pgflags because that just doesn't really
tell what the option is to anybody but kernel developers. You could also
have an option to override the default (disable profiling) failure strategy.
Suren Baghdasaryan Oct. 21, 2024, 4:16 p.m. UTC | #27
On Mon, Oct 21, 2024 at 8:57 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote:
> > On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote:
> > > [...]
> > > > Yeah, I thought about adding new values to "mem_profiling" but it's a
> > > > bit complicated. Today it's a tristate:
> > > >
> > > > mem_profiling=0|1|never
> > > >
> > > > 0/1 means we disable/enable memory profiling by default but the user
> > > > can enable it at runtime using a sysctl. This means that we enable
> > > > page_ext at boot even when it's set to 0.
> > > > "never" means we do not enable page_ext, memory profiling is disabled
> > > > and sysctl to enable it will not be exposed. Used when a distribution
> > > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
> > > > not want to waste memory on enabling page_ext.
> > > >
> > > > I can add another option like "pgflags" but then it also needs to
> > > > specify whether we should enable or disable profiling by default
> > > > (similar to 0|1 for page_ext mode). IOW we will need to encode also
> > > > the default state we want. Something like this:
> > > >
> > > > mem_profiling=0|1|never|pgflags_on|pgflags_off
> > > >
> > > > Would this be acceptable?
> > >
> > > Isn't this overcomplicating it? Why cannot you simply go with
> > > mem_profiling={0|never|1}[,$YOUR_OPTIONS]
> > >
> > > While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply
> > > or just be ignored if that is not applicable.
> >
> > Oh, you mean having 2 parts in the parameter with supported options being:
> >
> > mem_profiling=never
> > mem_profiling=0
> > mem_profiling=1
> > mem_profiling=0,pgflags
> > mem_profiling=1,pgflags
> >
> > Did I understand correctly? If so then yes, this should work.
>
> yes. I would just not call it pgflags because that just doesn't really
> tell what the option is to anybody but kernel developers. You could also
> have an option to override the default (disable profiling) failure strategy.

Ok, how about "compressed" instead? Like this:

mem_profiling=0,compressed

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Oct. 21, 2024, 4:23 p.m. UTC | #28
On Mon 21-10-24 09:16:14, Suren Baghdasaryan wrote:
> On Mon, Oct 21, 2024 at 8:57 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote:
> > > On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > Yeah, I thought about adding new values to "mem_profiling" but it's a
> > > > > bit complicated. Today it's a tristate:
> > > > >
> > > > > mem_profiling=0|1|never
> > > > >
> > > > > 0/1 means we disable/enable memory profiling by default but the user
> > > > > can enable it at runtime using a sysctl. This means that we enable
> > > > > page_ext at boot even when it's set to 0.
> > > > > "never" means we do not enable page_ext, memory profiling is disabled
> > > > > and sysctl to enable it will not be exposed. Used when a distribution
> > > > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
> > > > > not want to waste memory on enabling page_ext.
> > > > >
> > > > > I can add another option like "pgflags" but then it also needs to
> > > > > specify whether we should enable or disable profiling by default
> > > > > (similar to 0|1 for page_ext mode). IOW we will need to encode also
> > > > > the default state we want. Something like this:
> > > > >
> > > > > mem_profiling=0|1|never|pgflags_on|pgflags_off
> > > > >
> > > > > Would this be acceptable?
> > > >
> > > > Isn't this overcomplicating it? Why cannot you simply go with
> > > > mem_profiling={0|never|1}[,$YOUR_OPTIONS]
> > > >
> > > > While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply
> > > > or just be ignored if that is not applicable.
> > >
> > > Oh, you mean having 2 parts in the parameter with supported options being:
> > >
> > > mem_profiling=never
> > > mem_profiling=0
> > > mem_profiling=1
> > > mem_profiling=0,pgflags
> > > mem_profiling=1,pgflags
> > >
> > > Did I understand correctly? If so then yes, this should work.
> >
> > yes. I would just not call it pgflags because that just doesn't really
> > tell what the option is to anybody but kernel developers. You could also
> > have an option to override the default (disable profiling) failure strategy.
> 
> Ok, how about "compressed" instead? Like this:
> 
> mem_profiling=0,compressed

Sounds good to me. And just to repeat, I do not really care about
specific name but let's just stay away from something as specific as
page flags because that is really not helping to understand the purpose
but rather the underlying mechanism which is not telling much to most
users outside of kernel developers.
Suren Baghdasaryan Oct. 21, 2024, 4:32 p.m. UTC | #29
On Mon, Oct 21, 2024 at 9:23 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-10-24 09:16:14, Suren Baghdasaryan wrote:
> > On Mon, Oct 21, 2024 at 8:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote:
> > > > On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > Yeah, I thought about adding new values to "mem_profiling" but it's a
> > > > > > bit complicated. Today it's a tristate:
> > > > > >
> > > > > > mem_profiling=0|1|never
> > > > > >
> > > > > > 0/1 means we disable/enable memory profiling by default but the user
> > > > > > can enable it at runtime using a sysctl. This means that we enable
> > > > > > page_ext at boot even when it's set to 0.
> > > > > > "never" means we do not enable page_ext, memory profiling is disabled
> > > > > > and sysctl to enable it will not be exposed. Used when a distribution
> > > > > > has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
> > > > > > not want to waste memory on enabling page_ext.
> > > > > >
> > > > > > I can add another option like "pgflags" but then it also needs to
> > > > > > specify whether we should enable or disable profiling by default
> > > > > > (similar to 0|1 for page_ext mode). IOW we will need to encode also
> > > > > > the default state we want. Something like this:
> > > > > >
> > > > > > mem_profiling=0|1|never|pgflags_on|pgflags_off
> > > > > >
> > > > > > Would this be acceptable?
> > > > >
> > > > > Isn't this overcomplicating it? Why cannot you simply go with
> > > > > mem_profiling={0|never|1}[,$YOUR_OPTIONS]
> > > > >
> > > > > While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply
> > > > > or just be ignored if that is not applicable.
> > > >
> > > > Oh, you mean having 2 parts in the parameter with supported options being:
> > > >
> > > > mem_profiling=never
> > > > mem_profiling=0
> > > > mem_profiling=1
> > > > mem_profiling=0,pgflags
> > > > mem_profiling=1,pgflags
> > > >
> > > > Did I understand correctly? If so then yes, this should work.
> > >
> > > yes. I would just not call it pgflags because that just doesn't really
> > > tell what the option is to anybody but kernel developers. You could also
> > > have an option to override the default (disable profiling) failure strategy.
> >
> > Ok, how about "compressed" instead? Like this:
> >
> > mem_profiling=0,compressed
>
> Sounds good to me. And just to repeat, I do not really care about
> specific name but let's just stay away from something as specific as
> page flags because that is really not helping to understand the purpose
> but rather the underlying mechanism which is not telling much to most
> users outside of kernel developers.

Understood. Ok, I'll start changing my patchset to incorporate this
feedback and will post the new version this week.
Thanks for the input everyone!

>
> --
> Michal Hocko
> SUSE Labs
John Hubbard Oct. 21, 2024, 6:12 p.m. UTC | #30
On 10/21/24 9:32 AM, Suren Baghdasaryan wrote:
> On Mon, Oct 21, 2024 at 9:23 AM Michal Hocko <mhocko@suse.com> wrote:
>> On Mon 21-10-24 09:16:14, Suren Baghdasaryan wrote:
>>> On Mon, Oct 21, 2024 at 8:57 AM Michal Hocko <mhocko@suse.com> wrote:
>>>> On Mon 21-10-24 08:41:00, Suren Baghdasaryan wrote:
>>>>> On Mon, Oct 21, 2024 at 8:34 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>>> On Mon 21-10-24 08:05:16, Suren Baghdasaryan wrote:
>>>>>> [...]
>>>>>>> Yeah, I thought about adding new values to "mem_profiling" but it's a
>>>>>>> bit complicated. Today it's a tristate:
>>>>>>>
>>>>>>> mem_profiling=0|1|never
>>>>>>>
>>>>>>> 0/1 means we disable/enable memory profiling by default but the user
>>>>>>> can enable it at runtime using a sysctl. This means that we enable
>>>>>>> page_ext at boot even when it's set to 0.
>>>>>>> "never" means we do not enable page_ext, memory profiling is disabled
>>>>>>> and sysctl to enable it will not be exposed. Used when a distribution
>>>>>>> has CONFIG_MEM_ALLOC_PROFILING=y but the user does not use it and does
>>>>>>> not want to waste memory on enabling page_ext.
>>>>>>>
>>>>>>> I can add another option like "pgflags" but then it also needs to
>>>>>>> specify whether we should enable or disable profiling by default
>>>>>>> (similar to 0|1 for page_ext mode). IOW we will need to encode also
>>>>>>> the default state we want. Something like this:
>>>>>>>
>>>>>>> mem_profiling=0|1|never|pgflags_on|pgflags_off
>>>>>>>
>>>>>>> Would this be acceptable?
>>>>>>
>>>>>> Isn't this overcomplicating it? Why cannot you simply go with
>>>>>> mem_profiling={0|never|1}[,$YOUR_OPTIONS]
>>>>>>
>>>>>> While $YOUR_OPTIONS could be compress,fallback,ponies and it would apply
>>>>>> or just be ignored if that is not applicable.
>>>>>
>>>>> Oh, you mean having 2 parts in the parameter with supported options being:
>>>>>
>>>>> mem_profiling=never
>>>>> mem_profiling=0
>>>>> mem_profiling=1
>>>>> mem_profiling=0,pgflags
>>>>> mem_profiling=1,pgflags
>>>>>
>>>>> Did I understand correctly? If so then yes, this should work.
>>>>
>>>> yes. I would just not call it pgflags because that just doesn't really
>>>> tell what the option is to anybody but kernel developers. You could also
>>>> have an option to override the default (disable profiling) failure strategy.
>>>
>>> Ok, how about "compressed" instead? Like this:
>>>
>>> mem_profiling=0,compressed

Yes. The configuration options all fit together nicely now, and the
naming seems exactly right as well. And no more "you must rebuild your
kernel" messages. Great!

thanks,
diff mbox series

Patch

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 7431757999c5..4f811ec0ffe0 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -30,8 +30,16 @@  struct alloc_tag {
 	struct alloc_tag_counters __percpu	*counters;
 } __aligned(8);
 
+struct alloc_tag_kernel_section {
+	struct alloc_tag *first_tag;
+	unsigned long count;
+};
+
 struct alloc_tag_module_section {
-	unsigned long start_addr;
+	union {
+		unsigned long start_addr;
+		struct alloc_tag *first_tag;
+	};
 	unsigned long end_addr;
 	/* used size */
 	unsigned long size;
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index fb4e7adfa746..401fc297eeda 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -13,6 +13,9 @@  struct codetag_module;
 struct seq_buf;
 struct module;
 
+#define CODETAG_SECTION_START_PREFIX	"__start_"
+#define CODETAG_SECTION_STOP_PREFIX	"__stop_"
+
 /*
  * An instance of this structure is created in a special ELF section at every
  * code location being tagged.  At runtime, the special section is treated as
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 7d79818dc065..4f5c9e979bb9 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -111,5 +111,12 @@ 
 			    ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \
 			    NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH)
 
+#define NR_NON_PAGEFLAG_BITS	(SECTIONS_WIDTH + NODES_WIDTH + ZONES_WIDTH + \
+				LAST_CPUPID_SHIFT + KASAN_TAG_WIDTH + \
+				LRU_GEN_WIDTH + LRU_REFS_WIDTH)
+
+#define NR_UNUSED_PAGEFLAG_BITS	(BITS_PER_LONG - \
+				(NR_NON_PAGEFLAG_BITS + NR_PAGEFLAGS))
+
 #endif
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index bc65710ee1f9..69976cf747a6 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -9,6 +9,93 @@ 
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
 
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+
+typedef u16	pgalloc_tag_ref;
+
+extern struct alloc_tag_kernel_section kernel_tags;
+
+#define CODETAG_ID_NULL		0
+#define CODETAG_ID_EMPTY	1
+#define CODETAG_ID_FIRST	2
+
+#ifdef CONFIG_MODULES
+
+extern struct alloc_tag_module_section module_tags;
+
+static inline struct codetag *get_module_ct(pgalloc_tag_ref pgref)
+{
+	return &module_tags.first_tag[pgref - kernel_tags.count].ct;
+}
+
+static inline pgalloc_tag_ref get_module_pgref(struct alloc_tag *tag)
+{
+	return CODETAG_ID_FIRST + kernel_tags.count + (tag - module_tags.first_tag);
+}
+
+#else /* CONFIG_MODULES */
+
+static inline struct codetag *get_module_ct(pgalloc_tag_ref pgref)
+{
+	pr_warn("invalid page tag reference %lu\n", (unsigned long)pgref);
+	return NULL;
+}
+
+static inline pgalloc_tag_ref get_module_pgref(struct alloc_tag *tag)
+{
+	pr_warn("invalid page tag 0x%lx\n", (unsigned long)tag);
+	return CODETAG_ID_NULL;
+}
+
+#endif /* CONFIG_MODULES */
+
+static inline void read_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+	pgalloc_tag_ref pgref_val = *pgref;
+
+	switch (pgref_val) {
+	case (CODETAG_ID_NULL):
+		ref->ct = NULL;
+		break;
+	case (CODETAG_ID_EMPTY):
+		set_codetag_empty(ref);
+		break;
+	default:
+		pgref_val -= CODETAG_ID_FIRST;
+		ref->ct = pgref_val < kernel_tags.count ?
+			&kernel_tags.first_tag[pgref_val].ct :
+			get_module_ct(pgref_val);
+		break;
+	}
+}
+
+static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+	struct alloc_tag *tag;
+
+	if (!ref->ct) {
+		*pgref = CODETAG_ID_NULL;
+		return;
+	}
+
+	if (is_codetag_empty(ref)) {
+		*pgref = CODETAG_ID_EMPTY;
+		return;
+	}
+
+	tag = ct_to_alloc_tag(ref->ct);
+	if (tag >= kernel_tags.first_tag && tag < kernel_tags.first_tag + kernel_tags.count) {
+		*pgref = CODETAG_ID_FIRST + (tag - kernel_tags.first_tag);
+		return;
+	}
+
+	*pgref = get_module_pgref(tag);
+}
+
+void __init alloc_tag_sec_init(void);
+
+#else /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
 typedef union codetag_ref	pgalloc_tag_ref;
 
 static inline void read_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
@@ -21,8 +108,13 @@  static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
 	pgref->ct = ref->ct;
 }
 
+static inline void alloc_tag_sec_init(void) {}
+
+#endif /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
 union pgtag_ref_handle {
 	pgalloc_tag_ref *pgref;	/* reference in page extension */
+	struct page *page;	/* reference in page flags */
 };
 
 #include <linux/page_ext.h>
@@ -40,8 +132,8 @@  static inline struct page_ext *page_ext_from_pgref(pgalloc_tag_ref *pgref)
 }
 
 /* Should be called only if mem_alloc_profiling_enabled() */
-static inline bool get_page_tag_ref(struct page *page, union codetag_ref *ref,
-				    union pgtag_ref_handle *handle)
+static inline bool page_ext_get_page_tag_ref(struct page *page, union codetag_ref *ref,
+					     union pgtag_ref_handle *handle)
 {
 	struct page_ext *page_ext;
 	pgalloc_tag_ref *pgref;
@@ -59,7 +151,7 @@  static inline bool get_page_tag_ref(struct page *page, union codetag_ref *ref,
 	return true;
 }
 
-static inline void put_page_tag_ref(union pgtag_ref_handle handle)
+static inline void page_ext_put_page_tag_ref(union pgtag_ref_handle handle)
 {
 	if (WARN_ON(!handle.pgref))
 		return;
@@ -67,8 +159,8 @@  static inline void put_page_tag_ref(union pgtag_ref_handle handle)
 	page_ext_put(page_ext_from_pgref(handle.pgref));
 }
 
-static inline void update_page_tag_ref(union pgtag_ref_handle handle,
-				       union codetag_ref *ref)
+static inline void page_ext_update_page_tag_ref(union pgtag_ref_handle handle,
+						union codetag_ref *ref)
 {
 	if (WARN_ON(!handle.pgref || !ref))
 		return;
@@ -76,6 +168,87 @@  static inline void update_page_tag_ref(union pgtag_ref_handle handle,
 	write_pgref(handle.pgref, ref);
 }
 
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+
+DECLARE_STATIC_KEY_TRUE(mem_profiling_use_pageflags);
+extern unsigned long alloc_tag_ref_mask;
+extern int alloc_tag_ref_offs;
+
+/* Should be called only if mem_alloc_profiling_enabled() */
+static inline bool get_page_tag_ref(struct page *page, union codetag_ref *ref,
+				    union pgtag_ref_handle *handle)
+{
+	pgalloc_tag_ref pgref;
+
+	if (!static_key_enabled(&mem_profiling_use_pageflags))
+		return page_ext_get_page_tag_ref(page, ref, handle);
+
+	if (!page)
+		return false;
+
+	pgref = (page->flags >> alloc_tag_ref_offs) & alloc_tag_ref_mask;
+	read_pgref(&pgref, ref);
+	handle->page = page;
+	return true;
+}
+
+static inline void put_page_tag_ref(union pgtag_ref_handle handle)
+{
+	if (!static_key_enabled(&mem_profiling_use_pageflags)) {
+		page_ext_put_page_tag_ref(handle);
+		return;
+	}
+
+	WARN_ON(!handle.page);
+}
+
+static inline void update_page_tag_ref(union pgtag_ref_handle handle, union codetag_ref *ref)
+{
+	unsigned long old_flags, flags, val;
+	struct page *page = handle.page;
+	pgalloc_tag_ref pgref;
+
+	if (!static_key_enabled(&mem_profiling_use_pageflags)) {
+		page_ext_update_page_tag_ref(handle, ref);
+		return;
+	}
+
+	if (WARN_ON(!page || !ref))
+		return;
+
+	write_pgref(&pgref, ref);
+	val = (unsigned long)pgref;
+	val = (val & alloc_tag_ref_mask) << alloc_tag_ref_offs;
+	do {
+		old_flags = READ_ONCE(page->flags);
+		flags = old_flags;
+		flags &= ~(alloc_tag_ref_mask << alloc_tag_ref_offs);
+		flags |= val;
+	} while (unlikely(!try_cmpxchg(&page->flags, &old_flags, flags)));
+}
+
+#else /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
+/* Should be called only if mem_alloc_profiling_enabled() */
+static inline bool get_page_tag_ref(struct page *page, union codetag_ref *ref,
+				    union pgtag_ref_handle *handle)
+{
+	return page_ext_get_page_tag_ref(page, ref, handle);
+}
+
+static inline void put_page_tag_ref(union pgtag_ref_handle handle)
+{
+	page_ext_put_page_tag_ref(handle);
+}
+
+static inline void update_page_tag_ref(union pgtag_ref_handle handle,
+				       union codetag_ref *ref)
+{
+	page_ext_update_page_tag_ref(handle, ref);
+}
+
+#endif /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
 static inline void clear_page_tag_ref(struct page *page)
 {
 	if (mem_alloc_profiling_enabled()) {
@@ -152,6 +325,7 @@  static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
 static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {}
 static inline struct alloc_tag *pgalloc_tag_get(struct page *page) { return NULL; }
 static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {}
+static inline void alloc_tag_sec_init(void) {}
 
 #endif /* CONFIG_MEM_ALLOC_PROFILING */
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7315f643817a..f4c272c30719 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1017,6 +1017,25 @@  config MEM_ALLOC_PROFILING_DEBUG
 	  Adds warnings with helpful error messages for memory allocation
 	  profiling.
 
+config PGALLOC_TAG_USE_PAGEFLAGS
+	bool "Use pageflags to encode page allocation tag reference"
+	default n
+	depends on MEM_ALLOC_PROFILING
+	help
+	  When set, page allocation tag references are encoded inside page
+	  flags if they fit, otherwise they are encoded in page extensions.
+
+	  Setting this flag reduces memory and performance overhead of memory
+	  allocation profiling but it requires enough unused page flag bits to
+	  address all kernel allocations. If enabled but there are not enough
+	  unused page flag bits, profiling will fall back to using page
+	  extensions and a warning will be issued to disable this config.
+	  If dynamically loaded modules add enough tags that they can't be
+	  addressed anymore with available page flag bits, profiling for such
+	  modules will be disabled and a warning will be issued.
+
+	  Say N if unsure.
+
 source "lib/Kconfig.kasan"
 source "lib/Kconfig.kfence"
 source "lib/Kconfig.kmsan"
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 2ef762acc203..341a5c826449 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -3,6 +3,7 @@ 
 #include <linux/execmem.h>
 #include <linux/fs.h>
 #include <linux/gfp.h>
+#include <linux/kallsyms.h>
 #include <linux/module.h>
 #include <linux/page_ext.h>
 #include <linux/pgalloc_tag.h>
@@ -152,6 +153,42 @@  static void __init procfs_init(void)
 	proc_create_seq("allocinfo", 0400, NULL, &allocinfo_seq_op);
 }
 
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+
+DEFINE_STATIC_KEY_TRUE(mem_profiling_use_pageflags);
+unsigned long alloc_tag_ref_mask;
+int alloc_tag_ref_offs;
+
+#define SECTION_START(NAME)	(CODETAG_SECTION_START_PREFIX NAME)
+#define SECTION_STOP(NAME)	(CODETAG_SECTION_STOP_PREFIX NAME)
+
+struct alloc_tag_kernel_section kernel_tags = { NULL, 0 };
+
+void __init alloc_tag_sec_init(void)
+{
+	struct alloc_tag *last_codetag;
+
+	kernel_tags.first_tag = (struct alloc_tag *)kallsyms_lookup_name(
+					SECTION_START(ALLOC_TAG_SECTION_NAME));
+	last_codetag = (struct alloc_tag *)kallsyms_lookup_name(
+					SECTION_STOP(ALLOC_TAG_SECTION_NAME));
+	kernel_tags.count = last_codetag - kernel_tags.first_tag;
+
+	/* Check if kernel tags fit into page flags */
+	if (kernel_tags.count > (1UL << NR_UNUSED_PAGEFLAG_BITS)) {
+		static_branch_disable(&mem_profiling_use_pageflags);
+		pr_warn("%lu kernel tags do not fit into %d available page flag bits, page extensions will be used instead!\n",
+			kernel_tags.count, NR_UNUSED_PAGEFLAG_BITS);
+	} else {
+		alloc_tag_ref_offs = (LRU_REFS_PGOFF - NR_UNUSED_PAGEFLAG_BITS);
+		alloc_tag_ref_mask = ((1UL << NR_UNUSED_PAGEFLAG_BITS) - 1);
+		pr_info("All unused page flags (%d bits) are used to store page tag references!\n",
+			NR_UNUSED_PAGEFLAG_BITS);
+	}
+}
+
+#endif /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
 #ifdef CONFIG_MODULES
 
 static struct maple_tree mod_area_mt = MTREE_INIT(mod_area_mt, MT_FLAGS_ALLOC_RANGE);
@@ -161,7 +198,29 @@  static struct module unloaded_mod;
 /* A dummy object used to indicate a module prepended area */
 static struct module prepend_mod;
 
-static struct alloc_tag_module_section module_tags;
+struct alloc_tag_module_section module_tags;
+
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+static inline unsigned long alloc_tag_align(unsigned long val)
+{
+	if (val % sizeof(struct alloc_tag) == 0)
+		return val;
+	return ((val / sizeof(struct alloc_tag)) + 1) * sizeof(struct alloc_tag);
+}
+
+static inline bool tags_addressable(void)
+{
+	unsigned long tag_idx_count = CODETAG_ID_FIRST + kernel_tags.count +
+					module_tags.size / sizeof(struct alloc_tag);
+
+	return tag_idx_count < (1UL << NR_UNUSED_PAGEFLAG_BITS);
+}
+
+#else /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
+
+static inline bool tags_addressable(void) { return true; }
+
+#endif /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
 
 static bool needs_section_mem(struct module *mod, unsigned long size)
 {
@@ -237,6 +296,21 @@  static void *reserve_module_tags(struct module *mod, unsigned long size,
 	if (!align)
 		align = 1;
 
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+	/*
+	 * If alloc_tag size is not a multiple of required alignment tag
+	 * indexing does not work.
+	 */
+	if (!IS_ALIGNED(sizeof(struct alloc_tag), align)) {
+		pr_err("%s: alignment %lu is incompatible with allocation tag indexing, disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS",
+			mod->name, align);
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* Ensure prepend consumes multiple of alloc_tag-sized blocks */
+	if (prepend)
+		prepend = alloc_tag_align(prepend);
+#endif /* CONFIG_PGALLOC_TAG_USE_PAGEFLAGS */
 	mas_lock(&mas);
 repeat:
 	/* Try finding exact size and hope the start is aligned */
@@ -305,6 +379,12 @@  static void *reserve_module_tags(struct module *mod, unsigned long size,
 		unsigned long phys_size = vm_module_tags->nr_pages << PAGE_SHIFT;
 
 		module_tags.size = offset + size;
+		if (!tags_addressable() && mem_alloc_profiling_enabled()) {
+			static_branch_disable(&mem_alloc_profiling_key);
+			pr_warn("With module %s there are too many tags to fit in %d page flag bits. Memory profiling is disabled!\n",
+				mod->name, NR_UNUSED_PAGEFLAG_BITS);
+		}
+
 		if (phys_size < module_tags.size) {
 			int grow_res;
 
@@ -406,6 +486,10 @@  static int __init alloc_mod_tags_mem(void)
 
 	module_tags.start_addr = (unsigned long)vm_module_tags->addr;
 	module_tags.end_addr = module_tags.start_addr + MODULE_ALLOC_TAG_VMAP_SIZE;
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+	/* Ensure the base is alloc_tag aligned */
+	module_tags.start_addr = alloc_tag_align(module_tags.start_addr);
+#endif
 
 	return 0;
 }
@@ -467,6 +551,10 @@  early_param("sysctl.vm.mem_profiling", setup_early_mem_profiling);
 
 static __init bool need_page_alloc_tagging(void)
 {
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+	if (static_key_enabled(&mem_profiling_use_pageflags))
+		return false;
+#endif
 	return mem_profiling_support;
 }
 
diff --git a/lib/codetag.c b/lib/codetag.c
index 3f3a4f4ca72d..cda5a852ef30 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -149,8 +149,8 @@  static struct codetag_range get_section_range(struct module *mod,
 					      const char *section)
 {
 	return (struct codetag_range) {
-		get_symbol(mod, "__start_", section),
-		get_symbol(mod, "__stop_", section),
+		get_symbol(mod, CODETAG_SECTION_START_PREFIX, section),
+		get_symbol(mod, CODETAG_SECTION_STOP_PREFIX, section),
 	};
 }
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4ba5607aaf19..1c205b0a86ed 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -83,8 +83,7 @@  void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = BITS_PER_LONG;
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
-		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
+	width = shift - NR_NON_PAGEFLAG_BITS;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
 		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
 		SECTIONS_WIDTH,
@@ -2639,7 +2638,7 @@  void __init mm_core_init(void)
 	BUILD_BUG_ON(MAX_ZONELISTS > 2);
 	build_all_zonelists(NULL);
 	page_alloc_init_cpuhp();
-
+	alloc_tag_sec_init();
 	/*
 	 * page_ext requires contiguous pages,
 	 * bigger than MAX_PAGE_ORDER unless SPARSEMEM.