
[0/2] Introduce panic function when slub leaks

Message ID 20240925032256.1782-1-fangzheng.zhang@unisoc.com (mailing list archive)

Message

Fangzheng Zhang Sept. 25, 2024, 3:22 a.m. UTC
Hi all,

This series adds a method to detect slub leaks by monitoring slab usage
in real time on the slub page allocation path. When slab occupancy
exceeds a user-set threshold, the slab is considered to be leaking
and a panic is triggered immediately.

Fangzheng Zhang (2):
  mm/slub: Add panic function when slub leaks
  Documentation: admin-guide: kernel-parameters: Add parameter
    description for slub_leak_panic function

 .../admin-guide/kernel-parameters.txt         | 15 ++++
 mm/Kconfig                                    | 11 ++++++++
 mm/slub.c                                     | 76 +++++++++++++++++++

 3 files changed, 102 insertions(+)

Comments

Hyeonggon Yoo Sept. 25, 2024, 1:18 p.m. UTC | #1
On Wed, Sep 25, 2024 at 12:23 PM Fangzheng Zhang
<fangzheng.zhang@unisoc.com> wrote:
>
> Hi all,

Hi Fangzheng,

> A method to detect slub leaks by monitoring its usage in real time
> on the page allocation path of the slub. When the slub occupancy
> exceeds the user-set value, it is considered that the slub is leaking
> at this time

I'm not sure why this should be a kernel feature. Why not write a user
script that parses the MemTotal: and Slab: fields of /proc/meminfo and
generates a log entry or an alarm?
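
A minimal sketch of such a user script, in Python, could look like the
following. It assumes the usual "Name:   value kB" layout of /proc/meminfo;
the function names and the 30% default limit are hypothetical choices made
for illustration and do not come from the patch series.

```python
import sys


def parse_meminfo(text):
    """Return {field: value} for each 'Name:  value [kB]' line of /proc/meminfo."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def slab_percent(fields):
    """Slab usage as a percentage of MemTotal."""
    return 100.0 * fields["Slab"] / fields["MemTotal"]


def check(text, limit_percent):
    """Return a warning string if slab usage exceeds the limit, else None."""
    pct = slab_percent(parse_meminfo(text))
    if pct > limit_percent:
        return f"slab usage {pct:.1f}% exceeds limit {limit_percent}%"
    return None


def main(limit_percent=30.0, path="/proc/meminfo"):
    """Read meminfo once and log a warning; the 30% default is an assumption."""
    with open(path) as f:
        message = check(f.read(), limit_percent)
    if message:
        print(message, file=sys.stderr)
        return 1
    return 0
```

Calling main() periodically, for example from cron or a systemd timer,
yields a log entry or a non-zero exit status instead of a kernel panic.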

> and a panic operation will be triggered immediately.

I don't think it would be a good idea to panic unnecessarily.
IMO it is not proper to panic when the kernel can still run.

Any thoughts?

Thanks,
Hyeonggon

Vlastimil Babka Sept. 26, 2024, 12:30 p.m. UTC | #2
On 9/25/24 15:18, Hyeonggon Yoo wrote:
> On Wed, Sep 25, 2024 at 12:23 PM Fangzheng Zhang
> <fangzheng.zhang@unisoc.com> wrote:
>>
>> Hi all,
> 
> Hi Fangzheng,
> 
>> A method to detect slub leaks by monitoring its usage in real time
>> on the page allocation path of the slub. When the slub occupancy
>> exceeds the user-set value, it is considered that the slub is leaking
>> at this time
> 
> I'm not sure why this should be a kernel feature. Why not write a user
> script that parses
> MemTotal: and Slab: part of /proc/meminfo file and generates a log
> entry or an alarm?

Yes very much agreed. It seems rather arbitrary. Why slab, why not any other
kernel-specific counter in /proc/meminfo? Why include NR_SLAB_RECLAIMABLE_B
when that's used by caches with shrinkers?
A userspace solution should be straightforward and universal - easily
configurable for different scenarios.
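
To illustrate what "universal and configurable" might mean in practice,
here is a hedged Python sketch that checks any set of /proc/meminfo
counters against per-field limits. The rule table and its percentages are
hypothetical; SUnreclaim, KernelStack and PageTables are existing
/proc/meminfo fields, and watching SUnreclaim rather than Slab leaves out
the reclaimable caches that shrinkers can free.

```python
def parse_meminfo(text):
    """Return {field: value} for each 'Name:  value [kB]' line of /proc/meminfo."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def evaluate(fields, rules):
    """Return [(field, percent_of_MemTotal)] for every rule that is exceeded.

    rules maps a meminfo field name to its maximum allowed share of
    MemTotal, in percent. Fields missing from this kernel are skipped.
    """
    total = fields["MemTotal"]
    exceeded = []
    for name, limit in rules.items():
        if name not in fields:
            continue
        pct = 100.0 * fields[name] / total
        if pct > limit:
            exceeded.append((name, round(pct, 1)))
    return exceeded


# Hypothetical per-scenario configuration; tune the limits per device.
EXAMPLE_RULES = {
    "SUnreclaim": 20.0,
    "KernelStack": 5.0,
    "PageTables": 10.0,
}
```

A different test scenario then only needs a different rule table, with no
kernel rebuild or new boot parameter.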

>> and a panic operation will be triggered immediately.
> 
> I don't think it would be a good idea to panic unnecessarily.
> IMO it is not proper to panic when the kernel can still run.

Yes, these days it's practically impossible to add a BUG_ON() even for
conditions more serious than this.

Please don't post new versions addressing specific implementation details
until this fundamental issue is addressed.

Thanks,
Vlastimil

zhang fangzheng Sept. 27, 2024, 7:28 a.m. UTC | #3
On Thu, Sep 26, 2024 at 8:30 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 9/25/24 15:18, Hyeonggon Yoo wrote:
> > On Wed, Sep 25, 2024 at 12:23 PM Fangzheng Zhang
> > <fangzheng.zhang@unisoc.com> wrote:
> >>
> >> Hi all,
> >
> > Hi Fangzheng,
> >
> >> A method to detect slub leaks by monitoring its usage in real time
> >> on the page allocation path of the slub. When the slub occupancy
> >> exceeds the user-set value, it is considered that the slub is leaking
> >> at this time
> >
> > I'm not sure why this should be a kernel feature. Why not write a user
> > script that parses
> > MemTotal: and Slab: part of /proc/meminfo file and generates a log
> > entry or an alarm?
>
> Yes very much agreed. It seems rather arbitrary. Why slab, why not any other
> kernel-specific counter in /proc/meminfo? Why include NR_SLAB_RECLAIMABLE_B
> when that's used by caches with shrinkers?

OK, the reason is that we specifically want to track the memory usage
of the slab module.
In stability tests (i.e., monkey tests), when an ANR or a reboot
occurs, memory analysis very often shows that slab occupancy is high.
Besides monitoring for leaks directly in the allocation path, this
approach also makes it convenient to record the allocation stack
information when the exception occurs.

Hyeonggon Yoo Sept. 27, 2024, 8:01 a.m. UTC | #4
On Fri, Sep 27, 2024 at 4:28 PM zhang fangzheng
<fangzheng.zhang1003@gmail.com> wrote:
>
> On Thu, Sep 26, 2024 at 8:30 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 9/25/24 15:18, Hyeonggon Yoo wrote:
> > > On Wed, Sep 25, 2024 at 12:23 PM Fangzheng Zhang
> > > <fangzheng.zhang@unisoc.com> wrote:
> > >>
> > >> Hi all,
> > >
> > > Hi Fangzheng,
> > >
> > >> A method to detect slub leaks by monitoring its usage in real time
> > >> on the page allocation path of the slub. When the slub occupancy
> > >> exceeds the user-set value, it is considered that the slub is leaking
> > >> at this time
> > >
> > > I'm not sure why this should be a kernel feature. Why not write a user
> > > script that parses
> > > MemTotal: and Slab: part of /proc/meminfo file and generates a log
> > > entry or an alarm?
> >
> > Yes very much agreed. It seems rather arbitrary. Why slab, why not any other
> > kernel-specific counter in /proc/meminfo? Why include NR_SLAB_RECLAIMABLE_B
> > when that's used by caches with shrinkers?
>
> Ok, this is because the current consideration is to specifically
> track the memory usage of the slab module.
> In the stability test, ie, monkey test,
> the anr or reboot problem occurs, there is a high probability
> that the slab occupancy is high when it comes to memory analysis.
> In addition to directly monitoring leaks in the allocation path, it is
> also convenient to record the allocation stack information
> when an exception occurs.

[+Cc Memory Allocation Profiling maintainers]

For recording allocation information, I think CONFIG_MEM_ALLOC_PROFILING [1] [2]
could be used to track the allocation sites that contribute to memory leaks,
instead of making the kernel panic or print a WARNING.

Or, with higher overhead, slub_debug=U [3], if this is not meant to
run in production.

[1] https://docs.kernel.org/mm/allocation-profiling.html
[2] https://lwn.net/Articles/974380
[3] https://docs.kernel.org/mm/slub.html#debugfs-files-for-slub
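
As a rough illustration of how the profiling data might be consumed from
userspace, the Python sketch below ranks allocation sites by bytes
currently allocated. It assumes the "<size> <calls> <tag info>" line layout
of /proc/allocinfo described in [1] and a kernel built with
CONFIG_MEM_ALLOC_PROFILING; the function name is hypothetical.

```python
def top_allocation_sites(text, count=5):
    """Return the `count` largest (bytes, calls, tag) entries from allocinfo text."""
    sites = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        # Skip header or malformed lines: the first two columns must be numbers.
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        sites.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    sites.sort(key=lambda site: site[0], reverse=True)
    return sites[:count]


def read_top_sites(path="/proc/allocinfo", count=5):
    """Read the file (typically requires root) and rank its entries."""
    with open(path) as f:
        return top_allocation_sites(f.read(), count)
```

Sampling this periodically and comparing snapshots would show which call
sites keep growing, which is the information a leak investigation needs.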

Best,
Hyeonggon

韩玉明 (Yuming Han) Oct. 9, 2024, 1:25 a.m. UTC | #5
+loop shuo.tian