
[RFC] mm: Add reclaim type to memory.reclaim

Message ID 20240225114204.50459-1-laoar.shao@gmail.com (mailing list archive)
State New
Series [RFC] mm: Add reclaim type to memory.reclaim

Commit Message

Yafang Shao Feb. 25, 2024, 11:42 a.m. UTC
In our container environment, we've observed that certain containers may
accumulate more than 40GB of slabs, predominantly negative dentries. These
negative dentries remain unreclaimed unless there is memory pressure. Even
after the containers exit, these negative dentries persist. To manage disk
storage efficiently, we employ an agent that identifies container images
eligible for destruction once all instances of that image exit.

However, during destruction, dealing with directories containing numerous
negative dentries can significantly impact performance. To mitigate this
issue, we aim to proactively reclaim these dentries using a user agent.
Extending the memory.reclaim functionality to specifically target slabs
aligns with our requirements.

We propose adding a "type=" parameter to memory.reclaim to allow
reclamation of pagecache pages only, slabs only, or both:

  - type=1: Reclaim pagecache pages only
  - type=2: Reclaim slabs only
  - type=3: Reclaim both pagecache pages and slabs

For instance:

  echo "1M type=1" > /sys/fs/cgroup/test/memory.reclaim

will perform the reclaim on the 'test' memcg to reclaim pagecache pages
only. Please note that, due to the deferred freeing of slabs, the amount of
reclaimed slab memory may be higher than 1M during this process.
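
Similarly, to reclaim slabs only (illustrative values, same interface):

  echo "1M type=2" > /sys/fs/cgroup/test/memory.reclaim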

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 12 ++++++++++++
 include/linux/swap.h                    |  2 ++
 mm/memcontrol.c                         | 22 +++++++++++++++++++++-
 mm/vmscan.c                             | 13 ++++++++++---
 4 files changed, 45 insertions(+), 4 deletions(-)

Comments

Matthew Wilcox Feb. 26, 2024, 4:17 a.m. UTC | #1
On Sun, Feb 25, 2024 at 07:42:04PM +0800, Yafang Shao wrote:
> In our container environment, we've observed that certain containers may
> accumulate more than 40GB of slabs, predominantly negative dentries. These
> negative dentries remain unreclaimed unless there is memory pressure. Even
> after the containers exit, these negative dentries persist. To manage disk
> storage efficiently, we employ an agent that identifies container images
> eligible for destruction once all instances of that image exit.

I understand why you've written this patch, but we really do need to fix
this for non-container workloads.  See also:

https://lore.kernel.org/all/20220402072103.5140-1-hdanton@sina.com/

https://lore.kernel.org/linux-fsdevel/1611235185-1685-1-git-send-email-gautham.ananthakrishna@oracle.com/

https://lore.kernel.org/all/YjDvRPuxPN0GsxLB@casper.infradead.org/

I'm sure there have been many other threads on this over the years.
Yafang Shao Feb. 26, 2024, 12:34 p.m. UTC | #2
On Mon, Feb 26, 2024 at 12:17 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Feb 25, 2024 at 07:42:04PM +0800, Yafang Shao wrote:
> > In our container environment, we've observed that certain containers may
> > accumulate more than 40GB of slabs, predominantly negative dentries. These
> > negative dentries remain unreclaimed unless there is memory pressure. Even
> > after the containers exit, these negative dentries persist. To manage disk
> > storage efficiently, we employ an agent that identifies container images
> > eligible for destruction once all instances of that image exit.
>
> I understand why you've written this patch, but we really do need to fix
> this for non-container workloads.  See also:
>
> https://lore.kernel.org/all/20220402072103.5140-1-hdanton@sina.com/
>
> https://lore.kernel.org/linux-fsdevel/1611235185-1685-1-git-send-email-gautham.ananthakrishna@oracle.com/
>
> https://lore.kernel.org/all/YjDvRPuxPN0GsxLB@casper.infradead.org/
>
> I'm sure theer have been many other threads on this over the years.

Thank you for sharing your insights. I've reviewed the proposals and
related discussions. It appears that a consensus has not yet been
reached on how to tackle the issue. While I may not fully comprehend
all aspects of the discussions, it seems that the challenges stemming
from slab shrinking can be distilled into four key questions:

- When should the shrinker be triggered?
- Which task is responsible for performing the shrinking?
- Which slab should be reclaimed?
- How many slabs should be reclaimed?

Addressing all these questions within the kernel might introduce
unnecessary complexity. Instead, one potential approach could be to
extend the functionality of memory.reclaim or introduce a new
interface, such as memory.shrinker, and delegate decision-making to
userspace based on the workload. Since memory.reclaim is also
supported in the root memcg, it can effectively address issues outside
of container environments. Here's a rough idea, which needs
validation:

1. Expose detailed shrinker information via debugfs
We've already exposed details of the slab through
/sys/kernel/debug/slab, so extending this to include shrinker details
shouldn't be too challenging. For example, for the dentry shrinker, we
could expose /sys/kernel/debug/shrinker/super_cache_scan/{shrinker_id,
kmem_cache, ...}.

2. Shrink specific slabs with a specific count
This could be implemented by extending memory.reclaim with parameters
like "shrinker_id=" and "scan_count=". Currently, memory.reclaim is
byte-based, which isn't ideal for shrinkers due to the deferred
freeing of slabs. Using scan_count to specify the number of slabs to
reclaim could be more effective.
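
A purely hypothetical sketch of the syntax (neither the debugfs file nor
the "shrinker_id="/"scan_count=" keys exist today):

  cat /sys/kernel/debug/shrinker/super_cache_scan/shrinker_id
  echo "shrinker_id=18 scan_count=100000" > /sys/fs/cgroup/test/memory.reclaim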

These are preliminary ideas, and I welcome any feedback.

Additionally, since this patch offers a straightforward solution to
address the issue in container environments, would it be feasible to
apply this patch initially?
Matthew Wilcox Feb. 26, 2024, 1:16 p.m. UTC | #3
On Mon, Feb 26, 2024 at 08:34:07PM +0800, Yafang Shao wrote:
> Additionally, since this patch offers a straightforward solution to
> address the issue in container environments, would it be feasible to
> apply this patch initially?

There are lots of other problems with this patch, but since I disagree
with the idea of this patch, I didn't enumerate them.
Yafang Shao Feb. 26, 2024, 1:56 p.m. UTC | #4
On Mon, Feb 26, 2024 at 9:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 26, 2024 at 08:34:07PM +0800, Yafang Shao wrote:
> > Additionally, since this patch offers a straightforward solution to
> > address the issue in container environments, would it be feasible to
> > apply this patch initially?
>
> There are lot sof other problems with this patch, but since I disagree
> with the idea of this patch, I didn't enumerate them.

Could you please provide an enumeration of the issues you've
identified?  One issue is that the reclaimed byte count may not be
ideal for slab shrinking, particularly because the freeing of many
dentries is deferred.
Michal Hocko Feb. 26, 2024, 2:04 p.m. UTC | #5
On Sun 25-02-24 19:42:04, Yafang Shao wrote:
> In our container environment, we've observed that certain containers may
> accumulate more than 40GB of slabs, predominantly negative dentries. These
> negative dentries remain unreclaimed unless there is memory pressure. Even
> after the containers exit, these negative dentries persist.

Have you considered using the memory.high limit to trigger that memory
reclaim? That is surely not focused on the neg. dentry cache but it
should keep the overall memory consumption under control.
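
For instance (illustrative value, using the existing interface):

  echo 20G > /sys/fs/cgroup/test/memory.high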

> To manage disk
> storage efficiently, we employ an agent that identifies container images
> eligible for destruction once all instances of that image exit.
>
> However, during destruction, dealing with directories containing numerous
> negative dentries can significantly impact performance.

Performance of what? I have to say I am kind of lost here. Are we
talking about memory or disk storage?

> To mitigate this
> issue, we aim to proactively reclaim these dentries using a user agent.
> Extending the memory.reclaim functionality to specifically target slabs
> aligns with our requirements.

Matthew has already pointed out that this has been proposed several
times already and rejected. Dedicated slab shrinking interface is
especially tricky because you would need a way to tell which shrinkers
to invoke and that would be very kernel version specific.
Yafang Shao Feb. 27, 2024, 5:48 a.m. UTC | #6
On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 25-02-24 19:42:04, Yafang Shao wrote:
> > In our container environment, we've observed that certain containers may
> > accumulate more than 40GB of slabs, predominantly negative dentries. These
> > negative dentries remain unreclaimed unless there is memory pressure. Even
> > after the containers exit, these negative dentries persist.
>
> Have you considered using memory.high limit to trigger that memory
> reclaim? That is surely not focused on the neg. dentry cache but it
> should keep the overal memory consumption under control.

Upon reaching the high watermark, both pagecache and slabs are
reclaimed. However, reclaim tends to prioritize pagecache pages over
slab pages, so the slab consumption persists. Setting memory.high too
low may also negatively impact the working set.

>
> > To manage disk
> > storage efficiently, we employ an agent that identifies container images
> > eligible for destruction once all instances of that image exit.
> >
> > However, during destruction, dealing with directories containing numerous
> > negative dentries can significantly impact performance.
>
> Performance of what. I have to say I am kind of lost here. We are
> talking about memory or a disk storage?

Removing an empty directory with numerous dentries can significantly
prolong the process of freeing associated dentries, leading to high
system CPU usage that adversely affects overall system performance.

>
> > To mitigate this
> > issue, we aim to proactively reclaim these dentries using a user agent.
> > Extending the memory.reclaim functionality to specifically target slabs
> > aligns with our requirements.
>
> Matthew has already pointed out that this has been proposed several
> times already and rejected.

With that being said, we haven't come up with any superior solutions
compared to the proposals mentioned.

> Dedicated slab shrinking interface is
> especially tricky because you would need a way to tell which shrinkers
> to invoke and that would be very kernel version specific.

The persistence of this issue over several years without any
discernible progress suggests that we might be heading in the wrong
direction. Perhaps we could consider providing a kernel interface to
users, allowing them to tailor the reclamation process based on their
workload requirements.

--
Regards
Yafang
Michal Hocko Feb. 27, 2024, 9:05 a.m. UTC | #7
On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > To manage disk
> > > storage efficiently, we employ an agent that identifies container images
> > > eligible for destruction once all instances of that image exit.
> > >
> > > However, during destruction, dealing with directories containing numerous
> > > negative dentries can significantly impact performance.
> >
> > Performance of what. I have to say I am kind of lost here. We are
> > talking about memory or a disk storage?
> 
> Removing an empty directory with numerous dentries can significantly
> prolong the process of freeing associated dentries, leading to high
> system CPU usage that adversely affects overall system performance.

Is there anything that prevents you from reclaiming the memcg you are
about to remove? We do have interfaces for that.
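
For instance (illustrative), proactive reclaim on the memcg that is
about to be removed:

  echo "40G" > /sys/fs/cgroup/test/memory.reclaim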

> > > To mitigate this
> > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > Extending the memory.reclaim functionality to specifically target slabs
> > > aligns with our requirements.
> >
> > Matthew has already pointed out that this has been proposed several
> > times already and rejected.
> 
> With that being said, we haven't come up with any superior solutions
> compared to the proposals mentioned.
> 
> > Dedicated slab shrinking interface is
> > especially tricky because you would need a way to tell which shrinkers
> > to invoke and that would be very kernel version specific.
> 
> The persistence of this issue over several years without any
> discernible progress suggests that we might be heading in the wrong
> direction. Perhaps we could consider providing a kernel interface to
> users, allowing them to tailor the reclamation process based on their
> workload requirements.

There are clear problems identified with type specific reclaim and those
might easily strike back with future changes. Once we put an interface
in place we won't be able remove it and that could lead to problems with
future changes in the memory reclaim.
Yafang Shao Feb. 27, 2024, 9:39 a.m. UTC | #8
On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > To manage disk
> > > > storage efficiently, we employ an agent that identifies container images
> > > > eligible for destruction once all instances of that image exit.
> > > >
> > > > However, during destruction, dealing with directories containing numerous
> > > > negative dentries can significantly impact performance.
> > >
> > > Performance of what. I have to say I am kind of lost here. We are
> > > talking about memory or a disk storage?
> >
> > Removing an empty directory with numerous dentries can significantly
> > prolong the process of freeing associated dentries, leading to high
> > system CPU usage that adversely affects overall system performance.
>
> Is there anything that prevents you from reclaiming the memcg you are
> about to remove? We do have interfaces for that.

Reclaiming numerous dentries through force_empty can also lead to
potential issues, which is why we attempt to shrink the slab gradually
to mitigate them. However, it's important to note that the underlying
causes of the issues in force_empty and rmdir are not identical, as
they involve different locks.

>
> > > > To mitigate this
> > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > aligns with our requirements.
> > >
> > > Matthew has already pointed out that this has been proposed several
> > > times already and rejected.
> >
> > With that being said, we haven't come up with any superior solutions
> > compared to the proposals mentioned.
> >
> > > Dedicated slab shrinking interface is
> > > especially tricky because you would need a way to tell which shrinkers
> > > to invoke and that would be very kernel version specific.
> >
> > The persistence of this issue over several years without any
> > discernible progress suggests that we might be heading in the wrong
> > direction. Perhaps we could consider providing a kernel interface to
> > users, allowing them to tailor the reclamation process based on their
> > workload requirements.
>
> There are clear problems identified with type specific reclaim and those
> might easily strike back with future changes. Once we put an interface
> in place we won't be able remove it and that could lead to problems with
> future changes in the memory reclaim.

That shouldn't deter us from actively seeking a resolution to an issue
that has persisted for many years.
As observed, numerous memcg interfaces have been deprecated in recent years.
Yosry Ahmed Feb. 27, 2024, 9:43 a.m. UTC | #9
On Tue, Feb 27, 2024 at 1:39 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > To manage disk
> > > > > storage efficiently, we employ an agent that identifies container images
> > > > > eligible for destruction once all instances of that image exit.
> > > > >
> > > > > However, during destruction, dealing with directories containing numerous
> > > > > negative dentries can significantly impact performance.
> > > >
> > > > Performance of what. I have to say I am kind of lost here. We are
> > > > talking about memory or a disk storage?
> > >
> > > Removing an empty directory with numerous dentries can significantly
> > > prolong the process of freeing associated dentries, leading to high
> > > system CPU usage that adversely affects overall system performance.
> >
> > Is there anything that prevents you from reclaiming the memcg you are
> > about to remove? We do have interfaces for that.
>
> Reclaiming numerous dentries through force_empty can also lead to
> potential issues, which is why we attempt to shrink the slab gradually
> to mitigate them. However, it's important to note that the underlying
> causes of the issues in force_empty and rmdir are not identical, as
> they involve different locks.
>
> >
> > > > > To mitigate this
> > > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > > aligns with our requirements.
> > > >
> > > > Matthew has already pointed out that this has been proposed several
> > > > times already and rejected.
> > >
> > > With that being said, we haven't come up with any superior solutions
> > > compared to the proposals mentioned.
> > >
> > > > Dedicated slab shrinking interface is
> > > > especially tricky because you would need a way to tell which shrinkers
> > > > to invoke and that would be very kernel version specific.
> > >
> > > The persistence of this issue over several years without any
> > > discernible progress suggests that we might be heading in the wrong
> > > direction. Perhaps we could consider providing a kernel interface to
> > > users, allowing them to tailor the reclamation process based on their
> > > workload requirements.
> >
> > There are clear problems identified with type specific reclaim and those
> > might easily strike back with future changes. Once we put an interface
> > in place we won't be able remove it and that could lead to problems with
> > future changes in the memory reclaim.
>
> That shouldn't deter us from actively seeking a resolution to an issue
> that has persisted for tens of years.
> As observed, numerous memcg interfaces have been deprecated in recent years.

There has been recent work to add a swappiness= argument to
memory.reclaim to balance between anon and file pages. Adding a type=
argument together with that is a recipe for eternal confusion. *If* we
want to support this, we need to have a way to combine these two into
something more user-friendly.
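
For instance (illustrative), the swappiness= form

  echo "1G swappiness=0" > /sys/fs/cgroup/test/memory.reclaim

already biases reclaim toward file-backed pages, so adding type= on top
of it would need clearly defined combined semantics.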
Michal Hocko Feb. 27, 2024, 9:45 a.m. UTC | #10
On Tue 27-02-24 17:39:01, Yafang Shao wrote:
> On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > To manage disk
> > > > > storage efficiently, we employ an agent that identifies container images
> > > > > eligible for destruction once all instances of that image exit.
> > > > >
> > > > > However, during destruction, dealing with directories containing numerous
> > > > > negative dentries can significantly impact performance.
> > > >
> > > > Performance of what. I have to say I am kind of lost here. We are
> > > > talking about memory or a disk storage?
> > >
> > > Removing an empty directory with numerous dentries can significantly
> > > prolong the process of freeing associated dentries, leading to high
> > > system CPU usage that adversely affects overall system performance.
> >
> > Is there anything that prevents you from reclaiming the memcg you are
> > about to remove? We do have interfaces for that.
> 
> Reclaiming numerous dentries through force_empty can also lead to
> potential issues, which is why we attempt to shrink the slab gradually
> to mitigate them. However, it's important to note that the underlying
> causes of the issues in force_empty and rmdir are not identical, as
> they involve different locks.

Please be more specific about those issues.

> > > > > To mitigate this
> > > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > > aligns with our requirements.
> > > >
> > > > Matthew has already pointed out that this has been proposed several
> > > > times already and rejected.
> > >
> > > With that being said, we haven't come up with any superior solutions
> > > compared to the proposals mentioned.
> > >
> > > > Dedicated slab shrinking interface is
> > > > especially tricky because you would need a way to tell which shrinkers
> > > > to invoke and that would be very kernel version specific.
> > >
> > > The persistence of this issue over several years without any
> > > discernible progress suggests that we might be heading in the wrong
> > > direction. Perhaps we could consider providing a kernel interface to
> > > users, allowing them to tailor the reclamation process based on their
> > > workload requirements.
> >
> > There are clear problems identified with type specific reclaim and those
> > might easily strike back with future changes. Once we put an interface
> > in place we won't be able remove it and that could lead to problems with
> > future changes in the memory reclaim.
> 
> That shouldn't deter us from actively seeking a resolution to an issue
> that has persisted for tens of years.

Right, I do not believe we would deter anybody from doing that. This is
just not an easy problem to tackle. So either you find solid arguments
that previous conclusions do not hold anymore or you need to look into
options which haven't been discussed so far. I do realize that chasing
previous discussions in email archives is not fun but maybe a good (re)start
would be documenting those problems somewhere under Documentation/.

> As observed, numerous memcg interfaces have been deprecated in recent years.

yes, for very good reasons
Yafang Shao Feb. 27, 2024, 10:06 a.m. UTC | #11
On Tue, Feb 27, 2024 at 5:45 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 27-02-24 17:39:01, Yafang Shao wrote:
> > On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > > > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > > To manage disk
> > > > > > storage efficiently, we employ an agent that identifies container images
> > > > > > eligible for destruction once all instances of that image exit.
> > > > > >
> > > > > > However, during destruction, dealing with directories containing numerous
> > > > > > negative dentries can significantly impact performance.
> > > > >
> > > > > Performance of what. I have to say I am kind of lost here. We are
> > > > > talking about memory or a disk storage?
> > > >
> > > > Removing an empty directory with numerous dentries can significantly
> > > > prolong the process of freeing associated dentries, leading to high
> > > > system CPU usage that adversely affects overall system performance.
> > >
> > > Is there anything that prevents you from reclaiming the memcg you are
> > > about to remove? We do have interfaces for that.
> >
> > Reclaiming numerous dentries through force_empty can also lead to
> > potential issues, which is why we attempt to shrink the slab gradually
> > to mitigate them. However, it's important to note that the underlying
> > causes of the issues in force_empty and rmdir are not identical, as
> > they involve different locks.
>
> Please be more specific about those issues.

Both of these issues stem from lock contention:

- rmdir
When executing rmdir, the lock of the inode of the empty directory is
held. If this directory contains numerous negative dentries, this lock
is held for an extended duration. Consequently, if other processes
attempt to acquire this lock, they are blocked.
A simple reproducer involves:

1. Generating numerous negative dentries in an empty directory.
2. Running `rmdir ~/test` and `ls ~/` concurrently.

This setup demonstrates that ls takes a significant amount of time to
complete due to lock contention.
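
A minimal sketch of that reproducer (hypothetical paths and counts):

  mkdir ~/test
  # stat'ing non-existent names creates negative dentries
  for i in $(seq 1 1000000); do stat ~/test/no-such-file-$i 2>/dev/null; done
  rmdir ~/test &   # holds the directory inode lock while pruning the dentries
  ls ~/            # stalls on the contended lock until rmdir finishes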

- force_empty
Force_empty holds the lock of super_block->dentry_list. However, I
haven't yet had the opportunity to produce a specific example to
illustrate this issue.


>
> > > > > > To mitigate this
> > > > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > > > aligns with our requirements.
> > > > >
> > > > > Matthew has already pointed out that this has been proposed several
> > > > > times already and rejected.
> > > >
> > > > With that being said, we haven't come up with any superior solutions
> > > > compared to the proposals mentioned.
> > > >
> > > > > Dedicated slab shrinking interface is
> > > > > especially tricky because you would need a way to tell which shrinkers
> > > > > to invoke and that would be very kernel version specific.
> > > >
> > > > The persistence of this issue over several years without any
> > > > discernible progress suggests that we might be heading in the wrong
> > > > direction. Perhaps we could consider providing a kernel interface to
> > > > users, allowing them to tailor the reclamation process based on their
> > > > workload requirements.
> > >
> > > There are clear problems identified with type specific reclaim and those
> > > might easily strike back with future changes. Once we put an interface
> > > in place we won't be able remove it and that could lead to problems with
> > > future changes in the memory reclaim.
> >
> > That shouldn't deter us from actively seeking a resolution to an issue
> > that has persisted for tens of years.
>
> Right, I do not believe we would deter anybody from doing that. This is
> just not an easy problem to tackle. So either you find solid arguments
> that previous conclusions do not hold anymore or you need to look into
> options which haven't been discussed so far. I do realize that chasing
> previous discussions in email archives is not fun but maybe a good (re)start
> would be documenting those problems somewhere under Documentation/.
>
> > As observed, numerous memcg interfaces have been deprecated in recent years.
>
> yes, for very good reasons
Yafang Shao Feb. 27, 2024, 11:37 a.m. UTC | #12
On Tue, Feb 27, 2024 at 5:44 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Feb 27, 2024 at 1:39 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > > > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > > To manage disk
> > > > > > storage efficiently, we employ an agent that identifies container images
> > > > > > eligible for destruction once all instances of that image exit.
> > > > > >
> > > > > > However, during destruction, dealing with directories containing numerous
> > > > > > negative dentries can significantly impact performance.
> > > > >
> > > > > Performance of what. I have to say I am kind of lost here. We are
> > > > > talking about memory or a disk storage?
> > > >
> > > > Removing an empty directory with numerous dentries can significantly
> > > > prolong the process of freeing associated dentries, leading to high
> > > > system CPU usage that adversely affects overall system performance.
> > >
> > > Is there anything that prevents you from reclaiming the memcg you are
> > > about to remove? We do have interfaces for that.
> >
> > Reclaiming numerous dentries through force_empty can also lead to
> > potential issues, which is why we attempt to shrink the slab gradually
> > to mitigate them. However, it's important to note that the underlying
> > causes of the issues in force_empty and rmdir are not identical, as
> > they involve different locks.
> >
> > >
> > > > > > To mitigate this
> > > > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > > > aligns with our requirements.
> > > > >
> > > > > Matthew has already pointed out that this has been proposed several
> > > > > times already and rejected.
> > > >
> > > > With that being said, we haven't come up with any superior solutions
> > > > compared to the proposals mentioned.
> > > >
> > > > > Dedicated slab shrinking interface is
> > > > > especially tricky because you would need a way to tell which shrinkers
> > > > > to invoke and that would be very kernel version specific.
> > > >
> > > > The persistence of this issue over several years without any
> > > > discernible progress suggests that we might be heading in the wrong
> > > > direction. Perhaps we could consider providing a kernel interface to
> > > > users, allowing them to tailor the reclamation process based on their
> > > > workload requirements.
> > >
> > > There are clear problems identified with type specific reclaim and those
> > > might easily strike back with future changes. Once we put an interface
> > > in place we won't be able remove it and that could lead to problems with
> > > future changes in the memory reclaim.
> >
> > That shouldn't deter us from actively seeking a resolution to an issue
> > that has persisted for tens of years.
> > As observed, numerous memcg interfaces have been deprecated in recent years.
>
> There has been recent work to add a swapiness= argument to
> memory.reclaim to balance between anon and file pages. Adding a type=
> argument together with that is a recipe for eternal confusion. *If* we
> want to support this, we need to have a way to combine these two into
> something more user-friendly.

What if we introduce a new file, like memory.shrink? This could serve
as a foundation for potential future extensions, allowing us to shrink
specific slabs with specific counts.
Yosry Ahmed Feb. 27, 2024, 11:40 a.m. UTC | #13
On Tue, Feb 27, 2024 at 3:37 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 5:44 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Feb 27, 2024 at 1:39 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > > > > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > [...]
> > > > > > > To manage disk
> > > > > > > storage efficiently, we employ an agent that identifies container images
> > > > > > > eligible for destruction once all instances of that image exit.
> > > > > > >
> > > > > > > However, during destruction, dealing with directories containing numerous
> > > > > > > negative dentries can significantly impact performance.
> > > > > >
> > > > > > Performance of what. I have to say I am kind of lost here. We are
> > > > > > talking about memory or a disk storage?
> > > > >
> > > > > Removing an empty directory with numerous dentries can significantly
> > > > > prolong the process of freeing associated dentries, leading to high
> > > > > system CPU usage that adversely affects overall system performance.
> > > >
> > > > Is there anything that prevents you from reclaiming the memcg you are
> > > > about to remove? We do have interfaces for that.
> > >
> > > Reclaiming numerous dentries through force_empty can also lead to
> > > potential issues, which is why we attempt to shrink the slab gradually
> > > to mitigate them. However, it's important to note that the underlying
> > > causes of the issues in force_empty and rmdir are not identical, as
> > > they involve different locks.
> > >
> > > >
> > > > > > > To mitigate this
> > > > > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > > > > aligns with our requirements.
> > > > > >
> > > > > > Matthew has already pointed out that this has been proposed several
> > > > > > times already and rejected.
> > > > >
> > > > > With that being said, we haven't come up with any superior solutions
> > > > > compared to the proposals mentioned.
> > > > >
> > > > > > Dedicated slab shrinking interface is
> > > > > > especially tricky because you would need a way to tell which shrinkers
> > > > > > to invoke and that would be very kernel version specific.
> > > > >
> > > > > The persistence of this issue over several years without any
> > > > > discernible progress suggests that we might be heading in the wrong
> > > > > direction. Perhaps we could consider providing a kernel interface to
> > > > > users, allowing them to tailor the reclamation process based on their
> > > > > workload requirements.
> > > >
> > > > There are clear problems identified with type specific reclaim and those
> > > > might easily strike back with future changes. Once we put an interface
> > > > in place we won't be able remove it and that could lead to problems with
> > > > future changes in the memory reclaim.
> > >
> > > That shouldn't deter us from actively seeking a resolution to an issue
> > > that has persisted for tens of years.
> > > As observed, numerous memcg interfaces have been deprecated in recent years.
> >
> > There has been recent work to add a swapiness= argument to
> > memory.reclaim to balance between anon and file pages. Adding a type=
> > argument together with that is a recipe for eternal confusion. *If* we
> > want to support this, we need to have a way to combine these two into
> > something more user-friendly.
>
> What if we introduce a new file, like memory.shrink? This could serve
> as a foundation for potential future extensions, allowing us to shrink
> specific slabs with specific counts.

Shrinking specific slabs is something that shouldn't be exposed as an
interface, as this is a kernel implementation detail. Also,
memory.reclaim and memory.shrink would still have overlapping
functionalities.
Yafang Shao Feb. 27, 2024, 11:58 a.m. UTC | #14
On Tue, Feb 27, 2024 at 7:40 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Feb 27, 2024 at 3:37 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Feb 27, 2024 at 5:44 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 1:39 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > > > > > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > [...]
> > > > > > > > To manage disk
> > > > > > > > storage efficiently, we employ an agent that identifies container images
> > > > > > > > eligible for destruction once all instances of that image exit.
> > > > > > > >
> > > > > > > > However, during destruction, dealing with directories containing numerous
> > > > > > > > negative dentries can significantly impact performance.
> > > > > > >
> > > > > > > Performance of what. I have to say I am kind of lost here. We are
> > > > > > > talking about memory or a disk storage?
> > > > > >
> > > > > > Removing an empty directory with numerous dentries can significantly
> > > > > > prolong the process of freeing associated dentries, leading to high
> > > > > > system CPU usage that adversely affects overall system performance.
> > > > >
> > > > > Is there anything that prevents you from reclaiming the memcg you are
> > > > > about to remove? We do have interfaces for that.
> > > >
> > > > Reclaiming numerous dentries through force_empty can also lead to
> > > > potential issues, which is why we attempt to shrink the slab gradually
> > > > to mitigate them. However, it's important to note that the underlying
> > > > causes of the issues in force_empty and rmdir are not identical, as
> > > > they involve different locks.
> > > >
> > > > >
> > > > > > > > To mitigate this
> > > > > > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > > > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > > > > > aligns with our requirements.
> > > > > > >
> > > > > > > Matthew has already pointed out that this has been proposed several
> > > > > > > times already and rejected.
> > > > > >
> > > > > > With that being said, we haven't come up with any superior solutions
> > > > > > compared to the proposals mentioned.
> > > > > >
> > > > > > > Dedicated slab shrinking interface is
> > > > > > > especially tricky because you would need a way to tell which shrinkers
> > > > > > > to invoke and that would be very kernel version specific.
> > > > > >
> > > > > > The persistence of this issue over several years without any
> > > > > > discernible progress suggests that we might be heading in the wrong
> > > > > > direction. Perhaps we could consider providing a kernel interface to
> > > > > > users, allowing them to tailor the reclamation process based on their
> > > > > > workload requirements.
> > > > >
> > > > > There are clear problems identified with type specific reclaim and those
> > > > > might easily strike back with future changes. Once we put an interface
> > > > > in place we won't be able remove it and that could lead to problems with
> > > > > future changes in the memory reclaim.
> > > >
> > > > That shouldn't deter us from actively seeking a resolution to an issue
> > > > that has persisted for tens of years.
> > > > As observed, numerous memcg interfaces have been deprecated in recent years.
> > >
> > > There has been recent work to add a swapiness= argument to
> > > memory.reclaim to balance between anon and file pages. Adding a type=
> > > argument together with that is a recipe for eternal confusion. *If* we
> > > want to support this, we need to have a way to combine these two into
> > > something more user-friendly.
> >
> > What if we introduce a new file, like memory.shrink? This could serve
> > as a foundation for potential future extensions, allowing us to shrink
> > specific slabs with specific counts.
>
> Shrinking specific slabs is something that shouldn't be exposed as an
> interface, as this is a kernel implementation detail. Also,

If that's the case, why was slab info initially exposed through
/proc/slabinfo? Isn't that level of detail considered a kernel
implementation detail? Currently, users can identify which slab is
consuming the most memory but lack the ability to take action based on
that information. This suggests a flaw in the kernel implementation.

> memory.reclaim and memory.shrink would still have overlapping
> functionalities.
Yafang Shao Feb. 27, 2024, 12:09 p.m. UTC | #15
On Tue, Feb 27, 2024 at 7:58 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 7:40 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Feb 27, 2024 at 3:37 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 5:44 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Tue, Feb 27, 2024 at 1:39 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > > > > > > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > [...]
> > > > > > > > > To manage disk
> > > > > > > > > storage efficiently, we employ an agent that identifies container images
> > > > > > > > > eligible for destruction once all instances of that image exit.
> > > > > > > > >
> > > > > > > > > However, during destruction, dealing with directories containing numerous
> > > > > > > > > negative dentries can significantly impact performance.
> > > > > > > >
> > > > > > > > Performance of what. I have to say I am kind of lost here. We are
> > > > > > > > talking about memory or a disk storage?
> > > > > > >
> > > > > > > Removing an empty directory with numerous dentries can significantly
> > > > > > > prolong the process of freeing associated dentries, leading to high
> > > > > > > system CPU usage that adversely affects overall system performance.
> > > > > >
> > > > > > Is there anything that prevents you from reclaiming the memcg you are
> > > > > > about to remove? We do have interfaces for that.
> > > > >
> > > > > Reclaiming numerous dentries through force_empty can also lead to
> > > > > potential issues, which is why we attempt to shrink the slab gradually
> > > > > to mitigate them. However, it's important to note that the underlying
> > > > > causes of the issues in force_empty and rmdir are not identical, as
> > > > > they involve different locks.
> > > > >
> > > > > >
> > > > > > > > > To mitigate this
> > > > > > > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > > > > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > > > > > > aligns with our requirements.
> > > > > > > >
> > > > > > > > Matthew has already pointed out that this has been proposed several
> > > > > > > > times already and rejected.
> > > > > > >
> > > > > > > With that being said, we haven't come up with any superior solutions
> > > > > > > compared to the proposals mentioned.
> > > > > > >
> > > > > > > > Dedicated slab shrinking interface is
> > > > > > > > especially tricky because you would need a way to tell which shrinkers
> > > > > > > > to invoke and that would be very kernel version specific.
> > > > > > >
> > > > > > > The persistence of this issue over several years without any
> > > > > > > discernible progress suggests that we might be heading in the wrong
> > > > > > > direction. Perhaps we could consider providing a kernel interface to
> > > > > > > users, allowing them to tailor the reclamation process based on their
> > > > > > > workload requirements.
> > > > > >
> > > > > > There are clear problems identified with type specific reclaim and those
> > > > > > might easily strike back with future changes. Once we put an interface
> > > > > > in place we won't be able remove it and that could lead to problems with
> > > > > > future changes in the memory reclaim.
> > > > >
> > > > > That shouldn't deter us from actively seeking a resolution to an issue
> > > > > that has persisted for tens of years.
> > > > > As observed, numerous memcg interfaces have been deprecated in recent years.
> > > >
> > > > There has been recent work to add a swapiness= argument to
> > > > memory.reclaim to balance between anon and file pages. Adding a type=
> > > > argument together with that is a recipe for eternal confusion. *If* we
> > > > want to support this, we need to have a way to combine these two into
> > > > something more user-friendly.
> > >
> > > What if we introduce a new file, like memory.shrink? This could serve
> > > as a foundation for potential future extensions, allowing us to shrink
> > > specific slabs with specific counts.
> >
> > Shrinking specific slabs is something that shouldn't be exposed as an
> > interface, as this is a kernel implementation detail. Also,
>
> If that's the case, why was slabs info initially exposed through
> /proc/slabinfo? Isn't that level of detail considered a kernel
> implementation detail? Currently, users can identify which slab is
> consuming the most memory but lack the ability to take action based on
> that information. This suggests a flaw in the kernel implementation.

BTW, we even expose more detailed kernel implementation details
through /sys/kernel/slab.
That is really confusing...
Yafang Shao Feb. 27, 2024, 12:12 p.m. UTC | #16
On Tue, Feb 27, 2024 at 8:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 7:58 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Feb 27, 2024 at 7:40 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 3:37 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 27, 2024 at 5:44 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Tue, Feb 27, 2024 at 1:39 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Feb 27, 2024 at 5:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Tue 27-02-24 13:48:31, Yafang Shao wrote:
> > > > > > > > On Mon, Feb 26, 2024 at 10:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > [...]
> > > > > > > > > > To manage disk
> > > > > > > > > > storage efficiently, we employ an agent that identifies container images
> > > > > > > > > > eligible for destruction once all instances of that image exit.
> > > > > > > > > >
> > > > > > > > > > However, during destruction, dealing with directories containing numerous
> > > > > > > > > > negative dentries can significantly impact performance.
> > > > > > > > >
> > > > > > > > > Performance of what. I have to say I am kind of lost here. We are
> > > > > > > > > talking about memory or a disk storage?
> > > > > > > >
> > > > > > > > Removing an empty directory with numerous dentries can significantly
> > > > > > > > prolong the process of freeing associated dentries, leading to high
> > > > > > > > system CPU usage that adversely affects overall system performance.
> > > > > > >
> > > > > > > Is there anything that prevents you from reclaiming the memcg you are
> > > > > > > about to remove? We do have interfaces for that.
> > > > > >
> > > > > > Reclaiming numerous dentries through force_empty can also lead to
> > > > > > potential issues, which is why we attempt to shrink the slab gradually
> > > > > > to mitigate them. However, it's important to note that the underlying
> > > > > > causes of the issues in force_empty and rmdir are not identical, as
> > > > > > they involve different locks.
> > > > > >
> > > > > > >
> > > > > > > > > > To mitigate this
> > > > > > > > > > issue, we aim to proactively reclaim these dentries using a user agent.
> > > > > > > > > > Extending the memory.reclaim functionality to specifically target slabs
> > > > > > > > > > aligns with our requirements.
> > > > > > > > >
> > > > > > > > > Matthew has already pointed out that this has been proposed several
> > > > > > > > > times already and rejected.
> > > > > > > >
> > > > > > > > With that being said, we haven't come up with any superior solutions
> > > > > > > > compared to the proposals mentioned.
> > > > > > > >
> > > > > > > > > Dedicated slab shrinking interface is
> > > > > > > > > especially tricky because you would need a way to tell which shrinkers
> > > > > > > > > to invoke and that would be very kernel version specific.
> > > > > > > >
> > > > > > > > The persistence of this issue over several years without any
> > > > > > > > discernible progress suggests that we might be heading in the wrong
> > > > > > > > direction. Perhaps we could consider providing a kernel interface to
> > > > > > > > users, allowing them to tailor the reclamation process based on their
> > > > > > > > workload requirements.
> > > > > > >
> > > > > > > There are clear problems identified with type specific reclaim and those
> > > > > > > might easily strike back with future changes. Once we put an interface
> > > > > > > in place we won't be able remove it and that could lead to problems with
> > > > > > > future changes in the memory reclaim.
> > > > > >
> > > > > > That shouldn't deter us from actively seeking a resolution to an issue
> > > > > > that has persisted for tens of years.
> > > > > > As observed, numerous memcg interfaces have been deprecated in recent years.
> > > > >
> > > > > There has been recent work to add a swapiness= argument to
> > > > > memory.reclaim to balance between anon and file pages. Adding a type=
> > > > > argument together with that is a recipe for eternal confusion. *If* we
> > > > > want to support this, we need to have a way to combine these two into
> > > > > something more user-friendly.
> > > >
> > > > What if we introduce a new file, like memory.shrink? This could serve
> > > > as a foundation for potential future extensions, allowing us to shrink
> > > > specific slabs with specific counts.
> > >
> > > Shrinking specific slabs is something that shouldn't be exposed as an
> > > interface, as this is a kernel implementation detail. Also,
> >
> > If that's the case, why was slabs info initially exposed through
> > /proc/slabinfo? Isn't that level of detail considered a kernel
> > implementation detail? Currently, users can identify which slab is
> > consuming the most memory but lack the ability to take action based on
> > that information. This suggests a flaw in the kernel implementation.
>
> BTW, we even expose more detailed kernel implementation details
> through /sys/kernel/slab.
> That is really confusing...

There is even a /sys/kernel/slab/dentry/shrink ....
oh please...
Michal Hocko Feb. 27, 2024, 1:17 p.m. UTC | #17
On Tue 27-02-24 20:12:27, Yafang Shao wrote:
> On Tue, Feb 27, 2024 at 8:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
[...]
> > > If that's the case, why was slabs info initially exposed through
> > > /proc/slabinfo?

because that helps to better understand the memory consumption by slab
consumers.

> > > Isn't that level of detail considered a kernel
> > > implementation detail? Currently, users can identify which slab is
> > > consuming the most memory but lack the ability to take action based on
> > > that information. This suggests a flaw in the kernel implementation.

I disagree!

> > BTW, we even expose more detailed kernel implementation details
> > through /sys/kernel/slab.
> > That is really confusing...
> 
> There is even a /sys/kernel/slab/dentry/shrink ....
> oh please...

We also have /proc/sys/vm/drop_caches and we have learned those are
really terrible interfaces and we have good reasons to not replicate
those into memcg interfaces. Using bad interfaces as an example is not
the way to argue for new ones.
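
For reference, the global knob in question is:

  echo 2 > /proc/sys/vm/drop_caches   # drop reclaimable slab objects (dentries and inodes)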
Michal Hocko Feb. 27, 2024, 1:20 p.m. UTC | #18
On Tue 27-02-24 18:06:05, Yafang Shao wrote:
> On Tue, Feb 27, 2024 at 5:45 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > Reclaiming numerous dentries through force_empty can also lead to
> > > potential issues, which is why we attempt to shrink the slab gradually
> > > to mitigate them. However, it's important to note that the underlying
> > > causes of the issues in force_empty and rmdir are not identical, as
> > > they involve different locks.
> >
> > Please be more specific about those issues.
> 
> Both of these issues stem from lock contention:
> 
> - rmdir
> When executing rmdir, the lock of the inode of the empty directory is
> held. If this directory contains numerous negative dentries, this lock
> is held for an extended duration. Consequently, if other processes
> attempt to acquire this lock, they are blocked.
> A simple reproducer involves:
> 
> 1. Generating numerous negative dentries in an empty directory.
> 2. Running `rmdir ~/test` and `ls ~/` concurrently.

I fail to see how this is relevant to memcg reclaim.

> This setup demonstrates that ls takes a significant amount of time to
> complete due to lock contention.
> 
> - force_empty
> Force_empty holds the lock of super_block->dentry_list. However, I
> haven't yet had the opportunity to produce a specific example to
> illustrate this issue.

OK, get back to us once you can identify an actual problem. We might be
talking about different things here though. My question is directed at
existing memcg interfaces to reclaim the memory. That would be legacy
(and effectively deprecated) force_empty and memory.reclaim that we
have.
Yafang Shao Feb. 27, 2024, 2:03 p.m. UTC | #19
On Tue, Feb 27, 2024 at 9:20 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 27-02-24 18:06:05, Yafang Shao wrote:
> > On Tue, Feb 27, 2024 at 5:45 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > Reclaiming numerous dentries through force_empty can also lead to
> > > > potential issues, which is why we attempt to shrink the slab gradually
> > > > to mitigate them. However, it's important to note that the underlying
> > > > causes of the issues in force_empty and rmdir are not identical, as
> > > > they involve different locks.
> > >
> > > Please be more specific about those issues.
> >
> > Both of these issues stem from lock contention:
> >
> > - rmdir
> > When executing rmdir, the lock of the inode of the empty directory is
> > held. If this directory contains numerous negative dentries, this lock
> > is held for an extended duration. Consequently, if other processes
> > attempt to acquire this lock, they are blocked.
> > A simple reproducer involves:
> >
> > 1. Generating numerous negative dentries in an empty directory.
> > 2. Running `rmdir ~/test` and `ls ~/` concurrently.
>
> I fail to see how is this relevant to memcg reclaim


It appears there might still be some misunderstanding regarding the
issue at hand.

The numerous negative dentries are generated within a memcg. I
simplified the explanation by omitting the memcg context.

>
> > This setup demonstrates that ls takes a significant amount of time to
> > complete due to lock contention.
> >
> > - force_empty
> > Force_empty holds the lock of super_block->dentry_list. However, I
> > haven't yet had the opportunity to produce a specific example to
> > illustrate this issue.
>
> OK, get back to us once you can identify an actual problem. We might be

Pls. take a look at the force_empty->prune_dcache_sb->shrink_dentry_list.

> talking about different things here though. My question is directed at
> existing memcg interfaces to reclaim the memory. That would be legacy
> (and effectively deprecated) force_empty and memory.reclaim that we
> have.
Yafang Shao Feb. 27, 2024, 2:05 p.m. UTC | #20
On Tue, Feb 27, 2024 at 9:17 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 27-02-24 20:12:27, Yafang Shao wrote:
> > On Tue, Feb 27, 2024 at 8:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> [...]
> > > > If that's the case, why was slabs info initially exposed through
> > > > /proc/slabinfo?
>
> because that helps to better understand the memory consumption by slab
> consumers.
>
> > > > Isn't that level of detail considered a kernel
> > > > implementation detail? Currently, users can identify which slab is
> > > > consuming the most memory but lack the ability to take action based on
> > > > that information. This suggests a flaw in the kernel implementation.
>
> I disgree!
>
> > > BTW, we even expose more detailed kernel implementation details
> > > through /sys/kernel/slab.
> > > That is really confusing...
> >
> > There is even a /sys/kernel/slab/dentry/shrink ....
> > oh please...
>
> We also have /proc/sys/vm/drop_caches and we have learned those are
> really terrible interfaces and we have good reasons to not replicate
> those into memcg interfaces. Using bad interfaces as an example is not
> the way argue for new ones.

And then we introduced a similar memory.reclaim ...
Yafang Shao Feb. 27, 2024, 2:51 p.m. UTC | #21
On Tue, Feb 27, 2024 at 10:03 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 9:20 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 27-02-24 18:06:05, Yafang Shao wrote:
> > > On Tue, Feb 27, 2024 at 5:45 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > Reclaiming numerous dentries through force_empty can also lead to
> > > > > potential issues, which is why we attempt to shrink the slab gradually
> > > > > to mitigate them. However, it's important to note that the underlying
> > > > > causes of the issues in force_empty and rmdir are not identical, as
> > > > > they involve different locks.
> > > >
> > > > Please be more specific about those issues.
> > >
> > > Both of these issues stem from lock contention:
> > >
> > > - rmdir
> > > When executing rmdir, the lock of the inode of the empty directory is
> > > held. If this directory contains numerous negative dentries, this lock
> > > is held for an extended duration. Consequently, if other processes
> > > attempt to acquire this lock, they are blocked.
> > > A simple reproducer involves:
> > >
> > > 1. Generating numerous negative dentries in an empty directory.
> > > 2. Running `rmdir ~/test` and `ls ~/` concurrently.
> >
> > I fail to see how is this relevant to memcg reclaim
>
>
> It appears there might still be some misunderstanding regarding the
> issue at hand.
>
> The numerous negative dentries are generated within a memcg. I
> simplified the explanation by omitting the memcg context.
>
> >
> > > This setup demonstrates that ls takes a significant amount of time to
> > > complete due to lock contention.
> > >
> > > - force_empty
> > > Force_empty holds the lock of super_block->dentry_list. However, I
> > > haven't yet had the opportunity to produce a specific example to
> > > illustrate this issue.
> >
> > OK, get back to us once you can identify an actual problem. We might be
>
> Pls. take a look at the force_empty->prune_dcache_sb->shrink_dentry_list.

Please note that the spinlock is located within list_lru_walk_one.
Additionally, the issue persists even with the newly introduced
memory.reclaim, as it functions similarly to force_empty.

>
> > talking about different things here though. My question is directed at
> > existing memcg interfaces to reclaim the memory. That would be legacy
> > (and effectively deprecated) force_empty and memory.reclaim that we
> > have.
>
>
>
> --
> Regards
> Yafang

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0270517ade47..6807d0fa197d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1322,6 +1322,18 @@  The following nested keys are defined.
 	same semantics as vm.swappiness applied to memcg reclaim with
 	all the existing limitations and potential future extensions.
 
+	  ====                  ==============================
+	  type                  Type of memory to reclaim with
+	  ====                  ==============================
+
+        Specifying a memory type value instructs the kernel to perform
+        the reclaim with that memory type. The current supported
+        values are:
+
+          1 - Reclaim pagecache pages only
+          2 - Reclaim slabs  only
+          3 - Reclaim both pagecache pages and slabs
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 41e4b484bc34..27c432101032 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -404,6 +404,8 @@  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
+#define MEMCG_RECLAIM_PAGECACHE_ONLY (1 << 3)
+#define MEMCG_RECLAIM_SLAB_ONLY (1 << 4)
 #define MIN_SWAPPINESS 0
 #define MAX_SWAPPINESS 200
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4070ba84b508..3dfdbf5782c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,6 +54,7 @@ 
 #include <linux/seq_file.h>
 #include <linux/parser.h>
 #include <linux/vmpressure.h>
+#include <linux/parser.h>
 #include <linux/memremap.h>
 #include <linux/mm_inline.h>
 #include <linux/swap_cgroup.h>
@@ -6930,11 +6931,13 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 
 enum {
 	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_TYPE = 1,
 	MEMORY_RECLAIM_NULL,
 };
 
 static const match_table_t tokens = {
 	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_TYPE, "type=%d"},
 	{ MEMORY_RECLAIM_NULL, NULL },
 };
 
@@ -6944,7 +6947,7 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
-	int swappiness = -1;
+	int swappiness = -1, type = 0;
 	unsigned int reclaim_options;
 	char *old_buf, *start;
 	substring_t args[MAX_OPT_ARGS];
@@ -6968,12 +6971,29 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
 				return -EINVAL;
 			break;
+		case MEMORY_RECLAIM_TYPE:
+			if (match_int(&args[0], &type))
+				return -EINVAL;
+			if (type > 3 || type <= 0)
+				return -EINVAL;
+			break;
 		default:
 			return -EINVAL;
 		}
 	}
 
 	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+	switch (type) {
+	case 1:
+		reclaim_options |= MEMCG_RECLAIM_PAGECACHE_ONLY;
+		break;
+	case 2:
+		reclaim_options |= MEMCG_RECLAIM_SLAB_ONLY;
+		break;
+	default:
+		break;
+	}
+
 	while (nr_reclaimed < nr_to_reclaim) {
 		/* Will converge on zero, but reclaim enforces a minimum */
 		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4b1a609755bb..53cea01a1742 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -141,6 +141,9 @@  struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+	unsigned int pagecache_only:1;
+	unsigned int slab_only:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -5881,10 +5884,12 @@  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		reclaimed = sc->nr_reclaimed;
 		scanned = sc->nr_scanned;
 
-		shrink_lruvec(lruvec, sc);
+		if (!sc->slab_only)
+			shrink_lruvec(lruvec, sc);
 
-		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
-			    sc->priority);
+		if (!sc->pagecache_only)
+			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+				    sc->priority);
 
 		/* Record the group's reclaim efficiency */
 		if (!sc->proactive)
@@ -6522,6 +6527,8 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.pagecache_only = !!(reclaim_options & MEMCG_RECLAIM_PAGECACHE_ONLY),
+		.slab_only = !!(reclaim_options & MEMCG_RECLAIM_SLAB_ONLY),
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put