[v6,0/2] Add swappiness argument to memory.reclaim

Message ID 20240103164841.2800183-1-schatzberg.dan@gmail.com (mailing list archive)

Message

Dan Schatzberg Jan. 3, 2024, 4:48 p.m. UTC
Changes since V5:
  * Made the scan_control behavior limited to proactive reclaim explicitly
  * Created sc_swappiness helper to reduce chance of misuse

Changes since V4:
  * Fixed some initialization bugs by reverting back to a pointer for swappiness
  * Added some more caveats to the behavior of swappiness in documentation

Changes since V3:
  * Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
  * Added explicit calls to mem_cgroup_swappiness

Changes since V2:
  * No functional change
  * Used int consistently rather than a pointer

Changes since V1:
  * Added documentation

This patch proposes augmenting the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that instance
of proactive reclaim.
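
As an illustration of the proposed interface, a userspace reclaimer might issue a request along these lines. This is a minimal sketch, not code from the series: the helper names and the cgroup directory argument are hypothetical, and the 0..200 bounds are an assumption about the values behind the MIN_SWAPPINESS/MAX_SWAPPINESS defines added in patch 1.

```python
import os

# Assumed to mirror the kernel's MIN_SWAPPINESS / MAX_SWAPPINESS defines.
MIN_SWAPPINESS = 0
MAX_SWAPPINESS = 200


def build_reclaim_request(nbytes, swappiness=None):
    """Return the string to write to memory.reclaim (hypothetical helper)."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    if swappiness is None:
        # No override: this request uses the regular swappiness setting.
        return str(nbytes)
    if not MIN_SWAPPINESS <= swappiness <= MAX_SWAPPINESS:
        raise ValueError("swappiness out of range")
    return f"{nbytes} swappiness={swappiness}"


def proactive_reclaim(cgroup_dir, nbytes, swappiness=None):
    """Write one reclaim request to <cgroup_dir>/memory.reclaim.

    The write may raise OSError, for example if the kernel could not
    reclaim the requested amount.
    """
    path = os.path.join(cgroup_dir, "memory.reclaim")
    with open(path, "w") as f:
        f.write(build_reclaim_request(nbytes, swappiness))
```

For example, proactive_reclaim("/sys/fs/cgroup/workload.slice", 1 << 30, swappiness=10) would request 1 GiB of reclaim biased toward file pages for that one call (the cgroup path is a placeholder), leaving vm.swappiness untouched.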

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim. The memory.reclaim interface offers no way to affect the balance of
file vs anon during proactive reclaim. The only approach is to adjust the
vm.swappiness setting. However, there are a few reasons we want to control
the balance of file vs anon during proactive reclaim, separately from reactive
reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
  situations we are fine with lots of swap-out to avoid OOMs. As these are
  typically rare events, they have relatively little impact on write endurance.
  However, proactive reclaim runs continuously and so its impact on SSD write
  endurance is more significant. Therefore it is desirable to control swap-out
  for proactive reclaim separately from reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
  exhaustion. This makes sense if the swap exhaustion is triggered due to
  reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
  one could see OOMs when free memory is ample but anon is just particularly
  cold). Therefore, it's desirable to have proactive reclaim reduce or stop
  swap-out before the threshold at which OOM killing occurs.
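
The second point could translate into a policy like the following in a proactive reclaimer. This is only a sketch under stated assumptions: the function name, the 90% kill threshold, the 10% safety margin, and the base swappiness of 60 are all hypothetical, and the swap usage figures are assumed to come from somewhere like SwapTotal/SwapFree in /proc/meminfo.

```python
def pick_swappiness(swap_used, swap_total, kill_pct=90, margin_pct=10, base=60):
    """Choose a per-request swappiness that tapers off as swap fills up.

    kill_pct   -- swap usage (percent) at which the OOM killer is assumed to act
    margin_pct -- how far below kill_pct proactive swap-out should stop
    base       -- swappiness to use when swap is empty
    All defaults are illustrative assumptions, not values from the series.
    """
    stop_pct = kill_pct - margin_pct
    if swap_total <= 0 or stop_pct <= 0:
        return 0  # no swap, or no headroom: do not swap proactively
    used_scaled = max(swap_used, 0) * 100
    limit_scaled = stop_pct * swap_total
    if used_scaled >= limit_scaled:
        return 0  # at or past the stop point: file-only proactive reclaim
    # Integer linear ramp from `base` (swap empty) down to 0 (stop point).
    return base * (limit_scaled - used_scaled) // limit_scaled
```

The returned value would then be passed as swappiness=<val> on the next write to memory.reclaim, so reactive reclaim keeps using the regular setting.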

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before
writes to memory.reclaim[2]. This has been in production for nearly two years
and has addressed our needs to control proactive vs reactive reclaim behavior
but is still not ideal for a number of reasons:

* vm.swappiness is a global setting; adjusting it can race/interfere with other
  system administration that wishes to control vm.swappiness. In our case, we
  need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave a
  misconfigured setting. This requires some additional management to record the
  "desired" setting and ensure Senpai always adjusts to it.
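
To make these two downsides concrete, the global workaround might look roughly like this. It is a simplified sketch, not Senpai's actual code; the context manager name is hypothetical, and only the /proc/sys/vm/swappiness path is the real sysctl file.

```python
from contextlib import contextmanager


@contextmanager
def temporary_swappiness(value, sysctl_path="/proc/sys/vm/swappiness"):
    """Temporarily override the global vm.swappiness (hypothetical helper).

    Two weaknesses are visible here:
      * any other writer of the sysctl between save and restore is clobbered;
      * if the process is killed inside the block, the restore never runs
        and the override stays in place.
    """
    with open(sysctl_path) as f:
        saved = f.read().strip()
    with open(sysctl_path, "w") as f:
        f.write(str(value))
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, but not on SIGKILL or a crash.
        with open(sysctl_path, "w") as f:
            f.write(saved)
```

A reclaimer would wrap each write to memory.reclaim in such a block. A per-request swappiness= argument removes the need for both the save and the restore.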

With this patch, we avoid these downsides of adjusting vm.swappiness globally.

Previously, this exact interface addition was proposed by Yosry[3]. In response,
Roman proposed instead an interface to specify precise file/anon/slab reclaim
amounts[4]. More recently Huan proposed this as well[5], and others
similarly questioned whether this was the proper interface.

Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive reclaim with
different settings to control file vs anon reclaim (e.g. to only reclaim anon
from some applications). Responses argued that adjusting swappiness is a poor
interface for custom reclaim.

In contrast, I argue in favor of a swappiness setting not as a way to implement
custom reclaim algorithms but rather to bias the balance of anon vs file due to
differences of proactive vs reactive reclaim. In this context, swappiness is the
existing interface for controlling this balance and this patch simply allows for
it to be configured differently for proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
for this purpose. Proactive reclaimers are unaware of the relative age of file
vs anon for a cgroup which makes it difficult to manage proactive reclaim of
different memory pools. A proactive reclaimer would need some amount of anon
reclaim attempts separate from the amount of file reclaim attempts which seems
brittle given that it's difficult to observe the impact.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/

Dan Schatzberg (2):
  mm: add defines for min/max swappiness
  mm: add swappiness= arg to memory.reclaim

 Documentation/admin-guide/cgroup-v2.rst | 18 +++++---
 include/linux/swap.h                    |  5 ++-
 mm/memcontrol.c                         | 58 ++++++++++++++++++++-----
 mm/vmscan.c                             | 39 ++++++++++++-----
 4 files changed, 90 insertions(+), 30 deletions(-)

Comments

Shakeel Butt June 11, 2024, 7:25 p.m. UTC | #1
Hi folks,

This series has been in mm-unstable for several months. Are there any
remaining concerns here? Otherwise, can we please put this in the
mm-stable branch to be merged in the next Linux release?

Yosry Ahmed June 11, 2024, 7:31 p.m. UTC | #2
On Tue, Jun 11, 2024 at 12:25 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Hi folks,
>
> This series has been in mm-unstable for several months. Are there any
> remaining concerns here? Otherwise, can we please put this in the
> mm-stable branch to be merged in the next Linux release?

+Yu Zhao

I don't think Yu Zhao was correctly CC'd on this :)

Andrew Morton June 11, 2024, 7:48 p.m. UTC | #3
On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Hi folks,
> 
> This series has been in mm-unstable for several months. Are there any
> remaining concerns here? Otherwise, can we please put this in the
> mm-stable branch to be merged in the next Linux release?

The review didn't go terribly well so I parked the series awaiting more
clarity.  Although on rereading, it seems that Yu Zhao isn't seeing any
blocking issues?
Shakeel Butt June 11, 2024, 10:50 p.m. UTC | #4
On Tue, Jun 11, 2024 at 12:48:07PM GMT, Andrew Morton wrote:
> On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > Hi folks,
> > 
> > This series has been in mm-unstable for several months. Are there any
> > remaining concerns here? Otherwise, can we please put this in the
> > mm-stable branch to be merged in the next Linux release?
> 
> The review didn't go terribly well so I parked the series awaiting more
> clarity.  Although on rereading, it seems that Yu Zhao isn't seeing any
> blocking issues?
> 

Yu, do you have any strong concerns about merging this series?
Yu Zhao June 11, 2024, 11:10 p.m. UTC | #5
On Tue, Jun 11, 2024 at 4:50 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jun 11, 2024 at 12:48:07PM GMT, Andrew Morton wrote:
> > On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > > Hi folks,
> > >
> > > This series has been in mm-unstable for several months. Are there any
> > > remaining concerns here? Otherwise, can we please put this in the
> > > mm-stable branch to be merged in the next Linux release?
> >
> > The review didn't go terribly well so I parked the series awaiting more
> > clarity.  Although on rereading, it seems that Yu Zhao isn't seeing any
> > blocking issues?
> >
>
> Yu, do you have any strong concerns about merging this series?

I don't remember having any strong concerns. In fact, I don't remember
what I commented on.

Let me go back to the previous discussion and see why it was stalled.
Will get back to you soon.