diff mbox series

mm/vmscan: add vm_swappiness configuration knobs

Message ID BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com (mailing list archive)
State New, archived
Headers show
Series mm/vmscan: add vm_swappiness configuration knobs | expand

Commit Message

Ivan Teterevkov March 11, 2020, 5:45 p.m. UTC
This patch adds a couple of knobs:

- The configuration option (CONFIG_VM_SWAPPINESS).
- The command line parameter (vm_swappiness).

The default value is preserved, but now defined by CONFIG_VM_SWAPPINESS.

Historically, the default swappiness is set to the well-known value 60,
and this works well for the majority of cases. The vm_swappiness is also
exposed as the kernel parameter that can be changed at runtime too, e.g.
with sysctl.

This approach might not suit well some configurations, e.g. systemd-based
distros, where systemd is put in charge of the cgroup controllers,
including the memory one. In such cases, the default swappiness 60
is copied across the cgroup subtrees early at startup, when systemd
is arranging the slices for its services, before the sysctl.conf
or tmpfiles.d/*.conf changes are applied.

One could run a script to traverse the cgroup trees later and set the
desired memory.swappiness individually in each occurrence when the runtime
is set up, but this would require some amount of work to implement
properly. Instead, why not set the default swappiness as early as possible?

Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
---
 .../admin-guide/kernel-parameters.txt         |  4 ++++
 mm/Kconfig                                    | 10 ++++++++
 mm/vmscan.c                                   | 24 ++++++++++++++++++-
 3 files changed, 37 insertions(+), 1 deletion(-)

Comments

David Rientjes March 11, 2020, 7:31 p.m. UTC | #1
On Wed, 11 Mar 2020, Ivan Teterevkov wrote:

> This patch adds a couple of knobs:
> 
> - The configuration option (CONFIG_VM_SWAPPINESS).
> - The command line parameter (vm_swappiness).
> 
> The default value is preserved, but now defined by CONFIG_VM_SWAPPINESS.
> 
> Historically, the default swappiness is set to the well-known value 60,
> and this works well for the majority of cases. The vm_swappiness is also
> exposed as the kernel parameter that can be changed at runtime too, e.g.
> with sysctl.
> 
> This approach might not suit well some configurations, e.g. systemd-based
> distros, where systemd is put in charge of the cgroup controllers,
> including the memory one. In such cases, the default swappiness 60
> is copied across the cgroup subtrees early at startup, when systemd
> is arranging the slices for its services, before the sysctl.conf
> or tmpfiles.d/*.conf changes are applied.
> 

Seems like something that can be fully handled by an initscript that would 
set the sysctl and then iterate the memcg hierarchy propagating the 
non-default value.  I don't think that's too much of an ask if userspace 
wants to manipulate the swappiness value.

Or maybe we can be more clever: have memcg->swappiness store -1 by default 
unless it is changed by the user explicitly and then have 
mem_cgroup_swappiness() return vm_swappiness for this value.  If the user 
overwrites it, it's intended.

So there are a couple options here but I don't think one of them is to add 
a new config option or kernel command line option.

> One could run a script to traverse the cgroup trees later and set the
> desired memory.swappiness individually in each occurrence when the runtime
> is set up, but this would require some amount of work to implement
> properly. Instead, why not set the default swappiness as early as possible?
> 
> Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  4 ++++
>  mm/Kconfig                                    | 10 ++++++++
>  mm/vmscan.c                                   | 24 ++++++++++++++++++-
>  3 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index c07815d230bc..5d54a4303522 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5317,6 +5317,10 @@
>  			  P	Enable page structure init time poisoning
>  			  -	Disable all of the above options
>  
> +	vm_swappiness=	[KNL]
> +			Sets the default vm_swappiness.
> +			Ranges from 0 to 100, the default value is 60.
> +
>  	vmalloc=nn[KMG]	[KNL,BOOT] Forces the vmalloc area to have an exact
>  			size of <nn>. This can be used to increase the
>  			minimum size (128MB on x86). It can also be used to diff --git a/mm/Kconfig b/mm/Kconfig index ab80933be65f..ec59c19e578e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -739,4 +739,14 @@ config ARCH_HAS_HUGEPD  config MAPPING_DIRTY_HELPERS
>          bool
>  
> +config VM_SWAPPINESS
> +	int "Default memory swappiness"
> +	default 60
> +	range 0 100
> +	help
> +	  Sets the default vm_swappiness, that could be changed later
> +	  in the runtime, e.g. kernel command line, sysctl, etc.
> +
> +	  Higher value means more swappy. Historically, defaults to 60.
> +
>  endmenu
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 876370565455..7d2d3550f698 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -163,7 +163,29 @@ struct scan_control {
>  /*
>   * From 0 .. 100.  Higher means more swappy.
>   */
> -int vm_swappiness = 60;
> +int vm_swappiness = CONFIG_VM_SWAPPINESS;
> +
> +static int __init swappiness_cmdline(char *str) {
> +	int val, err;
> +
> +	if (!str)
> +		return -EINVAL;
> +
> +	err = kstrtoint(str, 10, &val);
> +	if (err)
> +		return -EINVAL;
> +
> +	if (val < 0 || val > 100)
> +		return -EINVAL;
> +
> +	vm_swappiness = val;
> +
> +	return 0;
> +}
> +
> +early_param("vm_swappiness", swappiness_cmdline);
> +
>  /*
>   * The total number of pages which are beyond the high watermark within all
>   * zones.
Michal Hocko March 12, 2020, 9:25 a.m. UTC | #2
On Wed 11-03-20 17:45:58, Ivan Teterevkov wrote:
> This patch adds a couple of knobs:
> 
> - The configuration option (CONFIG_VM_SWAPPINESS).
> - The command line parameter (vm_swappiness).
> 
> The default value is preserved, but now defined by CONFIG_VM_SWAPPINESS.
> 
> Historically, the default swappiness is set to the well-known value 60,
> and this works well for the majority of cases. The vm_swappiness is also
> exposed as the kernel parameter that can be changed at runtime too, e.g.
> with sysctl.
> 
> This approach might not suit well some configurations, e.g. systemd-based
> distros, where systemd is put in charge of the cgroup controllers,
> including the memory one. In such cases, the default swappiness 60
> is copied across the cgroup subtrees early at startup, when systemd
> is arranging the slices for its services, before the sysctl.conf
> or tmpfiles.d/*.conf changes are applied.
> 
> One could run a script to traverse the cgroup trees later and set the
> desired memory.swappiness individually in each occurrence when the runtime
> is set up, but this would require some amount of work to implement
> properly. Instead, why not set the default swappiness as early as possible?

I have to say I am not a great fan of more tunning for swappiness as
this is quite a poor tunning for many years already. It essentially does
nothing in many cases because the reclaim process ignores to value in
many cases (have a look a get_scan_count. I have seen quite some
reports that setting a specific value for vmswappiness didn't make any
change. The knob itself has a terrible semantic to begin with because
there is no way to express I really prefer to swap rather than page
cache reclaim.

This all makes me think that swappiness is a historical mistake that we
should rather make obsolete than promote even further.
Ivan Teterevkov March 12, 2020, 12:48 p.m. UTC | #3
On Wed, 11 Mar 2020, David Rientjes wrote:

> On Wed, 11 Mar 2020, Ivan Teterevkov wrote:
> 
> > This patch adds a couple of knobs:
> >
> > - The configuration option (CONFIG_VM_SWAPPINESS).
> > - The command line parameter (vm_swappiness).
> >
> > The default value is preserved, but now defined by CONFIG_VM_SWAPPINESS.
> >
> > Historically, the default swappiness is set to the well-known value 60,
> > and this works well for the majority of cases. The vm_swappiness is also
> > exposed as the kernel parameter that can be changed at runtime too, e.g.
> > with sysctl.
> >
> > This approach might not suit well some configurations, e.g. systemd-based
> > distros, where systemd is put in charge of the cgroup controllers,
> > including the memory one. In such cases, the default swappiness 60
> > is copied across the cgroup subtrees early at startup, when systemd
> > is arranging the slices for its services, before the sysctl.conf
> > or tmpfiles.d/*.conf changes are applied.
> >
> 
> Seems like something that can be fully handled by an initscript that would
> set the sysctl and then iterate the memcg hierarchy propagating the
> non-default value.  I don't think that's too much of an ask if userspace
> wants to manipulate the swappiness value.
> 

This is exactly what I'm trying to avoid: in some distros there is no way
to tackle the configuration early enough, e.g. in systemd-based systems
the systemd is the process that starts first and arranges memcg in a way
it's configured, but unfortunately, it doesn't offer the swappiness knob.

There could be a script to iterate the memcg later, but there would be a
race condition with the system entity that's put in charge of the memcg
because the configuration can't be changed atomically, e.g. a possible
script could iterate the memcg tree and update each memory.swappiness
while systemd is creating another slice or scope subtree.

> Or maybe we can be more clever: have memcg->swappiness store -1 by default
> unless it is changed by the user explicitly and then have
> mem_cgroup_swappiness() return vm_swappiness for this value.  If the user
> overwrites it, it's intended.
> 

Does it mean that -1 would become a reference to the vm_swappiness
or the parent's memory.swappiness? It sounds interesting and if so then
it would address my issues with the swappiness but would also change
the existing memcg behaviour: if the referred-to value changed, would
the memory.swappiness backed by -1 also change?

> So there are a couple options here but I don't think one of them is to add
> a new config option or kernel command line option.
> 

The vm_swappiness starts its lifespan in the kernel and thus
why not to facilitate it with a simple "constructor" there?

> > One could run a script to traverse the cgroup trees later and set the
> > desired memory.swappiness individually in each occurrence when the runtime
> > is set up, but this would require some amount of work to implement
> > properly. Instead, why not set the default swappiness as early as possible?
> >
> > Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |  4 ++++
> >  mm/Kconfig                                    | 10 ++++++++
> >  mm/vmscan.c                                   | 24 ++++++++++++++++++-
> >  3 files changed, 37 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index c07815d230bc..5d54a4303522 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5317,6 +5317,10 @@
> >  			  P	Enable page structure init time poisoning
> >  			  -	Disable all of the above options
> >
> > +	vm_swappiness=	[KNL]
> > +			Sets the default vm_swappiness.
> > +			Ranges from 0 to 100, the default value is 60.
> > +
> >  	vmalloc=nn[KMG]	[KNL,BOOT] Forces the vmalloc area to have an
> exact
> >  			size of <nn>. This can be used to increase the
> >  			minimum size (128MB on x86). It can also be used to
> diff --git a/mm/Kconfig b/mm/Kconfig index ab80933be65f..ec59c19e578e
> 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -739,4 +739,14 @@ config ARCH_HAS_HUGEPD  config
> MAPPING_DIRTY_HELPERS
> >          bool
> >
> > +config VM_SWAPPINESS
> > +	int "Default memory swappiness"
> > +	default 60
> > +	range 0 100
> > +	help
> > +	  Sets the default vm_swappiness, that could be changed later
> > +	  in the runtime, e.g. kernel command line, sysctl, etc.
> > +
> > +	  Higher value means more swappy. Historically, defaults to 60.
> > +
> >  endmenu
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 876370565455..7d2d3550f698 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -163,7 +163,29 @@ struct scan_control {
> >  /*
> >   * From 0 .. 100.  Higher means more swappy.
> >   */
> > -int vm_swappiness = 60;
> > +int vm_swappiness = CONFIG_VM_SWAPPINESS;
> > +
> > +static int __init swappiness_cmdline(char *str) {
> > +	int val, err;
> > +
> > +	if (!str)
> > +		return -EINVAL;
> > +
> > +	err = kstrtoint(str, 10, &val);
> > +	if (err)
> > +		return -EINVAL;
> > +
> > +	if (val < 0 || val > 100)
> > +		return -EINVAL;
> > +
> > +	vm_swappiness = val;
> > +
> > +	return 0;
> > +}
> > +
> > +early_param("vm_swappiness", swappiness_cmdline);
> > +
> >  /*
> >   * The total number of pages which are beyond the high watermark within all
> >   * zones.
Ivan Teterevkov March 12, 2020, 12:54 p.m. UTC | #4
On Thurs, 12 Mar 2020, Michal Hocko wrote:

> On Wed 11-03-20 17:45:58, Ivan Teterevkov wrote:
> > This patch adds a couple of knobs:
> >
> > - The configuration option (CONFIG_VM_SWAPPINESS).
> > - The command line parameter (vm_swappiness).
> >
> > The default value is preserved, but now defined by CONFIG_VM_SWAPPINESS.
> >
> > Historically, the default swappiness is set to the well-known value
> > 60, and this works well for the majority of cases. The vm_swappiness
> > is also exposed as the kernel parameter that can be changed at runtime too,
> e.g.
> > with sysctl.
> >
> > This approach might not suit well some configurations, e.g.
> > systemd-based distros, where systemd is put in charge of the cgroup
> > controllers, including the memory one. In such cases, the default
> > swappiness 60 is copied across the cgroup subtrees early at startup,
> > when systemd is arranging the slices for its services, before the
> > sysctl.conf or tmpfiles.d/*.conf changes are applied.
> >
> > One could run a script to traverse the cgroup trees later and set the
> > desired memory.swappiness individually in each occurrence when the
> > runtime is set up, but this would require some amount of work to
> > implement properly. Instead, why not set the default swappiness as early as
> possible?
> 
> I have to say I am not a great fan of more tunning for swappiness as this is quite
> a poor tunning for many years already. It essentially does nothing in many cases
> because the reclaim process ignores to value in many cases (have a look a
> get_scan_count. I have seen quite some reports that setting a specific value for
> vmswappiness didn't make any change. The knob itself has a terrible semantic to
> begin with because there is no way to express I really prefer to swap rather than
> page cache reclaim.
> 
> This all makes me think that swappiness is a historical mistake that we should
> rather make obsolete than promote even further.

Absolutely agree, the semantics of the vm_swappiness is perplexing.
Moreover, the same get_scan_count treats vm_swappiness and cgroups
memory.swappiness differently, in particular, 0 disables the memcg swap.

Certainly, the patch adds some additional exposure to a parameter that
is not trivial to tackle but it's already getting created with a magic
number which is also confusing. Is there any harm to be done by the patch
considering the already existing sysctl interface to that knob?

> --
> Michal Hocko
> SUSE Labs
Michal Hocko March 12, 2020, 1:26 p.m. UTC | #5
On Thu 12-03-20 12:54:19, Ivan Teterevkov wrote:
> On Thurs, 12 Mar 2020, Michal Hocko wrote:
> 
> > On Wed 11-03-20 17:45:58, Ivan Teterevkov wrote:
> > > This patch adds a couple of knobs:
> > >
> > > - The configuration option (CONFIG_VM_SWAPPINESS).
> > > - The command line parameter (vm_swappiness).
> > >
> > > The default value is preserved, but now defined by CONFIG_VM_SWAPPINESS.
> > >
> > > Historically, the default swappiness is set to the well-known value
> > > 60, and this works well for the majority of cases. The vm_swappiness
> > > is also exposed as the kernel parameter that can be changed at runtime too,
> > e.g.
> > > with sysctl.
> > >
> > > This approach might not suit well some configurations, e.g.
> > > systemd-based distros, where systemd is put in charge of the cgroup
> > > controllers, including the memory one. In such cases, the default
> > > swappiness 60 is copied across the cgroup subtrees early at startup,
> > > when systemd is arranging the slices for its services, before the
> > > sysctl.conf or tmpfiles.d/*.conf changes are applied.
> > >
> > > One could run a script to traverse the cgroup trees later and set the
> > > desired memory.swappiness individually in each occurrence when the
> > > runtime is set up, but this would require some amount of work to
> > > implement properly. Instead, why not set the default swappiness as early as
> > possible?
> > 
> > I have to say I am not a great fan of more tunning for swappiness as this is quite
> > a poor tunning for many years already. It essentially does nothing in many cases
> > because the reclaim process ignores to value in many cases (have a look a
> > get_scan_count. I have seen quite some reports that setting a specific value for
> > vmswappiness didn't make any change. The knob itself has a terrible semantic to
> > begin with because there is no way to express I really prefer to swap rather than
> > page cache reclaim.
> > 
> > This all makes me think that swappiness is a historical mistake that we should
> > rather make obsolete than promote even further.
> 
> Absolutely agree, the semantics of the vm_swappiness is perplexing.
> Moreover, the same get_scan_count treats vm_swappiness and cgroups
> memory.swappiness differently, in particular, 0 disables the memcg swap.
> 
> Certainly, the patch adds some additional exposure to a parameter that
> is not trivial to tackle but it's already getting created with a magic
> number which is also confusing. Is there any harm to be done by the patch
> considering the already existing sysctl interface to that knob?

Like any other config option/kernel parameter. It is adding the the
overall config space size problem and unless this is really needed I
would rather not make it worse.
Matthew Wilcox March 12, 2020, 1:36 p.m. UTC | #6
On Thu, Mar 12, 2020 at 12:48:22PM +0000, Ivan Teterevkov wrote:
> On Wed, 11 Mar 2020, David Rientjes wrote:
> > On Wed, 11 Mar 2020, Ivan Teterevkov wrote:
> > > This patch adds a couple of knobs:
> > >
> > > - The configuration option (CONFIG_VM_SWAPPINESS).
> > > - The command line parameter (vm_swappiness).
> > >
> > > The default value is preserved, but now defined by CONFIG_VM_SWAPPINESS.
> > >
> > > Historically, the default swappiness is set to the well-known value 60,
> > > and this works well for the majority of cases. The vm_swappiness is also
> > > exposed as the kernel parameter that can be changed at runtime too, e.g.
> > > with sysctl.
> > >
> > > This approach might not suit well some configurations, e.g. systemd-based
> > > distros, where systemd is put in charge of the cgroup controllers,
> > > including the memory one. In such cases, the default swappiness 60
> > > is copied across the cgroup subtrees early at startup, when systemd
> > > is arranging the slices for its services, before the sysctl.conf
> > > or tmpfiles.d/*.conf changes are applied.
> > 
> > Seems like something that can be fully handled by an initscript that would
> > set the sysctl and then iterate the memcg hierarchy propagating the
> > non-default value.  I don't think that's too much of an ask if userspace
> > wants to manipulate the swappiness value.
> > 
> 
> This is exactly what I'm trying to avoid: in some distros there is no way
> to tackle the configuration early enough, e.g. in systemd-based systems
> the systemd is the process that starts first and arranges memcg in a way
> it's configured, but unfortunately, it doesn't offer the swappiness knob.

This sounds like a systemd problem.  Have you talked to the systemd
people about fixing it in systemd?
Chris Down March 12, 2020, 2:03 p.m. UTC | #7
Matthew Wilcox writes:
>On Thu, Mar 12, 2020 at 12:48:22PM +0000, Ivan Teterevkov wrote:
>> This is exactly what I'm trying to avoid: in some distros there is no way
>> to tackle the configuration early enough, e.g. in systemd-based systems
>> the systemd is the process that starts first and arranges memcg in a way
>> it's configured, but unfortunately, it doesn't offer the swappiness knob.
>
>This sounds like a systemd problem.  Have you talked to the systemd
>people about fixing it in systemd?

Hi there ;-)

In general most of us maintaining cgroups in systemd run with cgroup v2, so 
this isn't a problem we run into in production. The swappiness controls in 
general don't make a whole lot of sense being distributed hierarchically, so 
they've been phased out entirely in cgroup v2.

If there had been a patch years ago implementing this in systemd we'd probably 
have accepted it, but cgroup v1 is dying and I am really not in favour of 
adding more code to massage its rough edges. We already have enough problems 
generated by it already.

However, the following kludge in tmpfiles.d should work to solve your immediate 
problem:

	w /sys/fs/cgroup/memory/system.slice/memory.swappiness - - - - value

Taking my systemd hat off and putting my -mm hat on: let's not add more hacky 
APIs at cgroup v1's behest, or we'll be here until we're pushing up the 
daisies.

Thanks,

Chris
Ivan Teterevkov March 13, 2020, 10:49 a.m. UTC | #8
On Thurs, 12 Mar 2020, Chris Down wrote:

> Matthew Wilcox writes:
> >On Thu, Mar 12, 2020 at 12:48:22PM +0000, Ivan Teterevkov wrote:
> >> This is exactly what I'm trying to avoid: in some distros there is no
> >> way to tackle the configuration early enough, e.g. in systemd-based
> >> systems the systemd is the process that starts first and arranges
> >> memcg in a way it's configured, but unfortunately, it doesn't offer the
> swappiness knob.
> >
> >This sounds like a systemd problem.  Have you talked to the systemd
> >people about fixing it in systemd?
> 
> Hi there ;-)
> 
> In general most of us maintaining cgroups in systemd run with cgroup v2, so this
> isn't a problem we run into in production. The swappiness controls in general
> don't make a whole lot of sense being distributed hierarchically, so they've been
> phased out entirely in cgroup v2.
> 
> If there had been a patch years ago implementing this in systemd we'd probably
> have accepted it, but cgroup v1 is dying and I am really not in favour of adding
> more code to massage its rough edges. We already have enough problems
> generated by it already.
> 
> However, the following kludge in tmpfiles.d should work to solve your
> immediate
> problem:
> 
> 	w /sys/fs/cgroup/memory/system.slice/memory.swappiness - - - - value
> 
> Taking my systemd hat off and putting my -mm hat on: let's not add more hacky
> APIs at cgroup v1's behest, or we'll be here until we're pushing up the daisies.
> 
> Thanks,
> 
> Chris

The above approach doesn't work for me in el7 with systemd 219 or ubuntu 19
with systemd 242 because presumably the systemd-tmpfiles services start too
late. Please find the snippet at the bottom of the email.

The only approach that seems to work is to set up a service to run:

$ find /sys/fs/cgroup/memory/ -name memory.swappiness | while read -r name; do echo 0 > "${name}"; done

I think this is quite ugly because there might be a race condition with
the systemd that could be creating the slices while the find is running.
One could suggest constraining the depth and going from top to the
bottom of the memcg but this still looks inherently unstable.

This is why I ended up with the vm_swappiness patch (which I don't
quite like myself) but this appears to be the only rock solid option
unless I've missed anything obvious. There is no doubt that cgroup v1
is due for replacement and vm_swappiness is frightening but they still
have certain advantages to employ until they are history.

$ systemctl --version
systemd 242 (242)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=19.10
DISTRIB_CODENAME=eoan
DISTRIB_DESCRIPTION="Ubuntu 19.10"

$ uname -a
Linux ubuntu 5.6.0-rc5-custom #1 SMP Wed Mar 11 14:59:15 GMT 2020 x86_64 x86_64 x86_64 GNU/Linux

$ tail -1 /etc/sysctl.conf
vm.swappiness=10

$ cat /etc/tmpfiles.d/10-swap.conf
w /sys/fs/cgroup/memory/system.slice/memory.swappiness - - - - 20
w /sys/fs/cgroup/memory/user.slice/memory.swappiness   - - - - 30

$ find /sys/fs/cgroup/memory -name memory.swappiness | while read -r name; do cat "${name}"; done | sort | uniq -c
      1 10
     32 20
      6 30
     21 60

$ find /sys/fs/cgroup/memory -name memory.swappiness | while read -r name; do echo "${name}"; cat "${name}"; done | grep --before-context=1 60
/sys/fs/cgroup/memory/system.slice/systemd-udevd.service/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/sys-fs-fuse-connections.mount/memory.swappiness
60
/sys/fs/cgroup/memory/system.slice/snap-gnome\x2d3\x2d28\x2d1804-116.mount/memory.swappiness
60
/sys/fs/cgroup/memory/system.slice/snap-gnome\x2dlogs-81.mount/memory.swappiness
60
/sys/fs/cgroup/memory/system.slice/sys-kernel-config.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-core-7917.mount/memory.swappiness
60
/sys/fs/cgroup/memory/system.slice/sys-kernel-debug.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-gnome\x2dcharacters-399.mount/memory.swappiness
60
/sys/fs/cgroup/memory/system.slice/swapfile.swap/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-gtk\x2dcommon\x2dthemes-1440.mount/memory.swappiness
60
/sys/fs/cgroup/memory/system.slice/snap-gnome\x2dcharacters-317.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/systemd-journald.service/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/dev-mqueue.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-gtk\x2dcommon\x2dthemes-1353.mount/memory.swappiness
60
/sys/fs/cgroup/memory/system.slice/snap-core-8689.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-gnome\x2d3\x2d28\x2d1804-71.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-core18-1668.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-gnome\x2dcalculator-501.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/dev-hugepages.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-gnome\x2dcalculator-544.mount/memory.swappiness
60
--
/sys/fs/cgroup/memory/system.slice/snap-core18-1223.mount/memory.swappiness
60
David Rientjes March 13, 2020, 9:50 p.m. UTC | #9
On Fri, 13 Mar 2020, Ivan Teterevkov wrote:

> The above approach doesn't work for me in el7 with systemd 219 or ubuntu 19
> with systemd 242 because presumably the systemd-tmpfiles services start too
> late. Please find the snippet at the bottom of the email.
> 
> The only approach that seems to work is to set up a service to run:
> 
> $ find /sys/fs/cgroup/memory/ -name memory.swappiness | while read -r name; do echo 0 > "${name}"; done
> 
> I think this is quite ugly because there might be a race condition with
> the systemd that could be creating the slices while the find is running.
> One could suggest constraining the depth and going from top to the
> bottom of the memcg but this still looks inherently unstable.
> 

I'll renew my suggestion of defaulting memcg->swappiness to -1 and then 
letting mem_cgroup_swappiness() return vm_swappiness when 
memcg->swappiness == -1.

I don't think the act of creating a memcg and not initializing the 
swappiness value implies that the existing value meets your expectation.

> This is why I ended up with the vm_swappiness patch (which I don't
> quite like myself) but this appears to be the only rock solid option
> unless I've missed anything obvious. There is no doubt that cgroup v1
> is due for replacement and vm_swappiness is frightening but they still
> have certain advantages to employ until they are history.
> 
> $ systemctl --version
> systemd 242 (242)
> +PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
> 
> $ cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=19.10
> DISTRIB_CODENAME=eoan
> DISTRIB_DESCRIPTION="Ubuntu 19.10"
> 
> $ uname -a
> Linux ubuntu 5.6.0-rc5-custom #1 SMP Wed Mar 11 14:59:15 GMT 2020 x86_64 x86_64 x86_64 GNU/Linux
> 
> $ tail -1 /etc/sysctl.conf
> vm.swappiness=10
> 
> $ cat /etc/tmpfiles.d/10-swap.conf
> w /sys/fs/cgroup/memory/system.slice/memory.swappiness - - - - 20
> w /sys/fs/cgroup/memory/user.slice/memory.swappiness   - - - - 30
> 
> $ find /sys/fs/cgroup/memory -name memory.swappiness | while read -r name; do cat "${name}"; done | sort | uniq -c
>       1 10
>      32 20
>       6 30
>      21 60
> 
> $ find /sys/fs/cgroup/memory -name memory.swappiness | while read -r name; do echo "${name}"; cat "${name}"; done | grep --before-context=1 60
> /sys/fs/cgroup/memory/system.slice/systemd-udevd.service/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/sys-fs-fuse-connections.mount/memory.swappiness
> 60
> /sys/fs/cgroup/memory/system.slice/snap-gnome\x2d3\x2d28\x2d1804-116.mount/memory.swappiness
> 60
> /sys/fs/cgroup/memory/system.slice/snap-gnome\x2dlogs-81.mount/memory.swappiness
> 60
> /sys/fs/cgroup/memory/system.slice/sys-kernel-config.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-core-7917.mount/memory.swappiness
> 60
> /sys/fs/cgroup/memory/system.slice/sys-kernel-debug.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-gnome\x2dcharacters-399.mount/memory.swappiness
> 60
> /sys/fs/cgroup/memory/system.slice/swapfile.swap/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-gtk\x2dcommon\x2dthemes-1440.mount/memory.swappiness
> 60
> /sys/fs/cgroup/memory/system.slice/snap-gnome\x2dcharacters-317.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/systemd-journald.service/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/dev-mqueue.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-gtk\x2dcommon\x2dthemes-1353.mount/memory.swappiness
> 60
> /sys/fs/cgroup/memory/system.slice/snap-core-8689.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-gnome\x2d3\x2d28\x2d1804-71.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-core18-1668.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-gnome\x2dcalculator-501.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/dev-hugepages.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-gnome\x2dcalculator-544.mount/memory.swappiness
> 60
> --
> /sys/fs/cgroup/memory/system.slice/snap-core18-1223.mount/memory.swappiness
> 60
>
Vlastimil Babka March 16, 2020, 2:53 p.m. UTC | #10
On 3/12/20 2:26 PM, Michal Hocko wrote:
> On Thu 12-03-20 12:54:19, Ivan Teterevkov wrote:
>> 
>> Absolutely agree, the semantics of the vm_swappiness is perplexing.
>> Moreover, the same get_scan_count treats vm_swappiness and cgroups
>> memory.swappiness differently, in particular, 0 disables the memcg swap.
>> 
>> Certainly, the patch adds some additional exposure to a parameter that
>> is not trivial to tackle but it's already getting created with a magic
>> number which is also confusing. Is there any harm to be done by the patch
>> considering the already existing sysctl interface to that knob?
> 
> Like any other config option/kernel parameter. It is adding the the
> overall config space size problem and unless this is really needed I
> would rather not make it worse.

Setting the vm_swappiness specific case aside, I wonder if if would be
useful to be able to emulate any sysctl with a kernel parameter,
i.e. boot the kernel with sysctl.vm.swappiness=X
There are already some options that provide kernel parameter as well
as sysctl, why not just support all.
Quick and dirty proof of concept:

----8<-----
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 02fa84493f23..62ae963a5c0c 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -206,6 +206,7 @@ struct ctl_table_header *register_sysctl_paths(const struct ctl_path *path,
 void unregister_sysctl_table(struct ctl_table_header * table);
 
 extern int sysctl_init(void);
+int process_sysctl_arg(char *param, char *val, const char *unused, void *arg);
 
 extern struct ctl_table sysctl_mount_point[];
 
diff --git a/init/main.c b/init/main.c
index ee4947af823f..c1544ff4ec5b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1345,6 +1345,23 @@ void __weak free_initmem(void)
 	free_initmem_default(POISON_FREE_INITMEM);
 }
 
+static void do_sysctl_args(void)
+{
+	size_t len = strlen(saved_command_line) + 1;
+	char *command_line;
+
+	command_line = kzalloc(len, GFP_KERNEL);
+	if (!command_line)
+		panic("%s: Failed to allocate %zu bytes\n", __func__, len);
+
+	strcpy(command_line, saved_command_line);
+
+	parse_args("Setting sysctl args", command_line,
+		   NULL, 0, -1, -1, NULL, process_sysctl_arg);
+
+	kfree(command_line);
+}
+
 static int __ref kernel_init(void *unused)
 {
 	int ret;
@@ -1367,6 +1384,8 @@ static int __ref kernel_init(void *unused)
 
 	rcu_end_inkernel_boot();
 
+	do_sysctl_args();
+
 	if (ramdisk_execute_command) {
 		ret = run_init_process(ramdisk_execute_command);
 		if (!ret)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ad5b88a53c5a..5b3b520d29a8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1980,6 +1980,66 @@ int __init sysctl_init(void)
 	return 0;
 }
 
+int process_sysctl_arg(char *param, char *val,
+			       const char *unused, void *arg)
+{
+	size_t count;
+	char *tmp;
+	int err;
+	loff_t ppos = 0;
+	struct ctl_table *base, *child = NULL, *found = NULL;
+
+	if (strncmp(param, "sysctl.", sizeof("sysctl.") - 1))
+		return 0;
+
+	param += (sizeof("sysctl.") - 1);
+
+	pr_notice("sysctl: %s=%s", param, val);
+
+	tmp = strchr(param, '.');
+	if (!tmp) {
+		pr_notice("invalid sysctl param '%s' on command line", param);
+		return 0;
+	}
+
+	*tmp = '\0';
+
+	for (base = &sysctl_base_table[0]; base->procname != 0; base++) {
+		if (strcmp(param, base->procname) == 0) {
+			child = base->child;
+			break;
+		}
+	}
+
+	if (!child) {
+		pr_notice("unknown sysctl prefix '%s' on command line", param);
+		return 0;
+	}
+
+	tmp++;
+
+	for (; child->procname != 0; child++) {
+		if (strcmp(tmp, child->procname) == 0) {
+			found = child;
+			break;
+		}
+	}
+
+	if (!found) {
+		pr_notice("unknown sysctl param '%s.%s' on command line", param, tmp);
+		return 0;
+	}
+
+	count = strlen(val);
+	err = found->proc_handler(found, 1, val, &count, &ppos);
+
+	if (err)
+		pr_notice("error %d setting sysctl '%s.%s' from command line",
+				err, param, tmp);
+
+	return 0;
+}
+
 #endif /* CONFIG_SYSCTL */
 
 /*
Ivan Teterevkov March 16, 2020, 4:03 p.m. UTC | #11
On Fri, 13 Mar 2020, David Rientjes wrote:

> I'll renew my suggestion of defaulting memcg->swappiness to -1 and then
> letting mem_cgroup_swappiness() return vm_swappiness when
> memcg->swappiness == -1.
> 
> I don't think the act of creating a memcg and not initializing the
> swappiness value implies that the existing value meets your expectation.
> 
Thanks, David, I haven't considered this point.

This point has made me realising that the existing behaviour, where
the swappiness is inherited from the parent, is not documented and,
apparently, just appears to be implemented that way.

In this case, the -1 approach looks less harmful in comparison with
the original patch, unless I underestimate a number of users "affected"
by the assumptions about how memcg->swappiness is getting propagated.

I'll send another patch implementing the -1 approach shortly,
perhaps with some clarifications in the "swappiness" documentation.
Ivan Teterevkov March 16, 2020, 4:14 p.m. UTC | #12
On Mon, 16 Mar 2020, Vlastimil Babka wrote:
> 
> Setting the vm_swappiness specific case aside, I wonder if if would be
> useful to be able to emulate any sysctl with a kernel parameter,
> i.e. boot the kernel with sysctl.vm.swappiness=X
> There are already some options that provide kernel parameter as well
> as sysctl, why not just support all.
>
It looks like a missing feature to me but I've always assumed that
there is some unobvious philosophical reasoning behind not having
such interfaces in the kernel. The feature-wise I'd be one of the
first users of the sysctl command-line parameters.
Michal Hocko March 17, 2020, 8:29 a.m. UTC | #13
On Mon 16-03-20 15:53:21, Vlastimil Babka wrote:
> On 3/12/20 2:26 PM, Michal Hocko wrote:
> > On Thu 12-03-20 12:54:19, Ivan Teterevkov wrote:
> >> 
> >> Absolutely agree, the semantics of the vm_swappiness is perplexing.
> >> Moreover, the same get_scan_count treats vm_swappiness and cgroups
> >> memory.swappiness differently, in particular, 0 disables the memcg swap.
> >> 
> >> Certainly, the patch adds some additional exposure to a parameter that
> >> is not trivial to tackle but it's already getting created with a magic
> >> number which is also confusing. Is there any harm to be done by the patch
> >> considering the already existing sysctl interface to that knob?
> > 
> > Like any other config option/kernel parameter. It is adding the the
> > overall config space size problem and unless this is really needed I
> > would rather not make it worse.
> 
> Setting the vm_swappiness specific case aside, I wonder if if would be
> useful to be able to emulate any sysctl with a kernel parameter,
> i.e. boot the kernel with sysctl.vm.swappiness=X
> There are already some options that provide kernel parameter as well
> as sysctl, why not just support all.

This sounds like a much better idea than a case by case handling. I
wouldn't be surprised if some kernel parameters would duplicate sysctl
values. I cannot judge the implementation unfortunately but from a quick
glance it makes sense to me. Especially where you hooked it into because
not all handlers are simple setters for a global value. Some of them
have a much more complicated logic which requires a lot of
infrastructure to be set up already. So putting do_sysctl_args right
before the init is executed should be good enough.

Care to post an RFC to a larger audience? Let's see where we get from
there.

> Quick and dirty proof of concept:
> @@ -1367,6 +1384,8 @@ static int __ref kernel_init(void *unused)
>  
>  	rcu_end_inkernel_boot();
>  
> +	do_sysctl_args();
> +
>  	if (ramdisk_execute_command) {
>  		ret = run_init_process(ramdisk_execute_command);
>  		if (!ret)
Vlastimil Babka March 17, 2020, 2:51 p.m. UTC | #14
On 3/17/20 9:29 AM, Michal Hocko wrote:
> On Mon 16-03-20 15:53:21, Vlastimil Babka wrote:
>> On 3/12/20 2:26 PM, Michal Hocko wrote:
> 
> This sounds like a much better idea than a case by case handling. I
> wouldn't be surprised if some kernel parameters would duplicate sysctl
> values. I cannot judge the implementation unfortunately but from a quick
> glance it makes sense to me. Especially where you hooked it into because
> not all handlers are simple setters for a global value. Some of them
> have a much more complicated logic which requires a lot of
> infrastructure to be set up already. So putting do_sysctl_args right
> before the init is executed should be good enough.
> 
> Care to post an RFC to a larger audience? Let's see where we get from
> there.

FYI done: (I didn't CC everybody from this thread)

https://lore.kernel.org/linux-api/20200317132105.24555-1-vbabka@suse.cz/
diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c07815d230bc..5d54a4303522 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5317,6 +5317,10 @@ 
 			  P	Enable page structure init time poisoning
 			  -	Disable all of the above options
 
+	vm_swappiness=	[KNL]
+			Sets the default vm_swappiness.
+			Ranges from 0 to 100, the default value is 60.
+
 	vmalloc=nn[KMG]	[KNL,BOOT] Forces the vmalloc area to have an exact
 			size of <nn>. This can be used to increase the
 			minimum size (128MB on x86). It can also be used to diff --git a/mm/Kconfig b/mm/Kconfig index ab80933be65f..ec59c19e578e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -739,4 +739,14 @@  config ARCH_HAS_HUGEPD  config MAPPING_DIRTY_HELPERS
         bool
 
+config VM_SWAPPINESS
+	int "Default memory swappiness"
+	default 60
+	range 0 100
+	help
+	  Sets the default vm_swappiness, that could be changed later
+	  in the runtime, e.g. kernel command line, sysctl, etc.
+
+	  Higher value means more swappy. Historically, defaults to 60.
+
 endmenu
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 876370565455..7d2d3550f698 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -163,7 +163,29 @@  struct scan_control {
 /*
  * From 0 .. 100.  Higher means more swappy.
  */
-int vm_swappiness = 60;
+int vm_swappiness = CONFIG_VM_SWAPPINESS;
+
+static int __init swappiness_cmdline(char *str) {
+	int val, err;
+
+	if (!str)
+		return -EINVAL;
+
+	err = kstrtoint(str, 10, &val);
+	if (err)
+		return -EINVAL;
+
+	if (val < 0 || val > 100)
+		return -EINVAL;
+
+	vm_swappiness = val;
+
+	return 0;
+}
+
+early_param("vm_swappiness", swappiness_cmdline);
+
 /*
  * The total number of pages which are beyond the high watermark within all
  * zones.
--
2.25.0