[RFC,v2,00/19] mm: process/cgroup ksm support

Message ID

20230210215023.2740545-1-shr@devkernel.io (mailing list archive)

Headers

Return-Path: <owner-linux-mm@kvack.org>
From: Stefan Roesch <shr@devkernel.io>
To: kernel-team@fb.com
Cc: shr@devkernel.io,
	linux-mm@kvack.org,
	riel@surriel.com,
	mhocko@suse.com,
	david@redhat.com,
	linux-kselftest@vger.kernel.org,
	linux-doc@vger.kernel.org,
	akpm@linux-foundation.org
Subject: [RFC PATCH v2 00/19] mm: process/cgroup ksm support
Date: Fri, 10 Feb 2023 13:50:04 -0800
Message-Id: <20230210215023.2740545-1-shr@devkernel.io>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

mm: process/cgroup ksm support


Message

Stefan Roesch Feb. 10, 2023, 9:50 p.m. UTC

So far KSM can only be enabled by calling madvise for memory regions. To
enable KSM for more workloads, it must be possible to enable / disable it at
the process / cgroup level.

Use case:
The madvise call is not available in the programming language. An example of
this is programs with forked workloads written in a garbage-collected language
without pointers. In such a language madvise cannot be made available.

In addition, the addresses of objects change as they are garbage collected.
KSM sharing needs to be enabled "from the outside" for these types of
workloads.

Experiments with UKSM have shown a capacity increase of around 20%.


1. New options for prctl system call
This patch series adds two new options to the prctl system call. The first
one enables KSM at the process level and the second one queries the
setting.

The setting will be inherited by child processes.

With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
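
For illustration, a minimal userspace sketch of how the two prctl options
might be used. The constant names and values below (PR_SET_MEMORY_MERGE = 67,
PR_GET_MEMORY_MERGE = 68) are the ones found in mainline <linux/prctl.h> and
are an assumption relative to this RFC revision, which may name or number
them differently. The helper names ksm_set_process() and ksm_get_process()
are hypothetical.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Assumption: names and values as in mainline <linux/prctl.h>.
 * This RFC revision may use different ones.
 */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Enable (non-zero) or disable (0) KSM for the calling process.
 * Returns 0 on success or -errno on failure, for example -EINVAL on a
 * kernel that does not know the option.
 */
static int ksm_set_process(int enable)
{
	if (prctl(PR_SET_MEMORY_MERGE, enable ? 1UL : 0UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/*
 * Query the per-process KSM setting.
 * Returns 1 if enabled, 0 if disabled, or -errno if the query fails.
 */
static int ksm_get_process(void)
{
	int ret = prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

A seed process would call ksm_set_process(1) once before forking its
workers. Callers should treat a negative return value as "not supported
here" and carry on without KSM.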

2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all
the VMAs and enable KSM for the eligible VMAs.

When forking a process that has KSM enabled, the setting will be inherited by
the new child process.

In addition, when KSM is disabled for a process, KSM will be disabled for the
VMAs where KSM has been enabled.

3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation, but not
calculated. This adds the general profit metric to /sys/kernel/mm/ksm.
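
As a worked sketch of the metric: as I read
Documentation/admin-guide/mm/ksm.rst, the approximation is
general_profit =~ pages_sharing * sizeof(page) - all_rmap_items *
sizeof(rmap_item). The code below only restates that formula in userspace.
The function names are hypothetical, and the page size and rmap_item size
are passed in by the caller because both depend on the architecture.

```c
#include <stdio.h>

/*
 * Approximate KSM profit in bytes: memory saved by shared pages minus
 * the memory spent on rmap_item bookkeeping. All sizes are in bytes.
 */
static long long ksm_general_profit(long long pages_sharing,
				    long long all_rmap_items,
				    long long page_size,
				    long long rmap_item_size)
{
	return pages_sharing * page_size - all_rmap_items * rmap_item_size;
}

/*
 * Read one signed integer from a file such as the general_profit knob.
 * Returns 0 on success and -1 if the file is missing or malformed.
 */
static int read_counter(const char *path, long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%lld", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For example, 1000 shared pages of 4096 bytes tracked by 1500 rmap_items of
an assumed 64 bytes each give 1000 * 4096 - 1500 * 64 = 4000000 bytes. A
negative result would mean the bookkeeping costs more than merging saves.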

4. Add more metrics to ksm_stat
This adds the process profit and KSM merge type metrics to
/proc/<pid>/ksm_stat.
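
A small sketch of how a tool might pick one metric out of the ksm_stat text.
It assumes the file consists of "name value" lines separated by a single
space, which is how mainline prints ksm_rmap_items and ksm_merging_pages.
The exact field names added by this series are not shown in the cover
letter, so any name passed in is the caller's assumption, and
ksm_stat_lookup() is a hypothetical helper.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Find the line "<key> <number>" in text and store the number in *value.
 * The key must match the whole field name, not just a prefix.
 * Returns 0 on success and -1 if the key is absent or has no number.
 */
static int ksm_stat_lookup(const char *text, const char *key, long long *value)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
			const char *num = line + klen + 1;
			char *end;
			long long v = strtoll(num, &end, 10);

			if (end == num)
				return -1;	/* no digits after the key */
			*value = v;
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A caller would read /proc/<pid>/ksm_stat into a NUL-terminated buffer and
then look up the fields it cares about one by one.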

5. Add more tests to ksm_tests
This adds an option to ksm_tests to specify the merge type, which allows
testing both madvise and prctl KSM. It also adds a new option to query whether
prctl KSM has been enabled, and a fork test to verify that the KSM process
setting is inherited by child processes.
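
The idea behind the fork test can be sketched in userspace as follows. This
is not the selftest from the series, only an illustration of the check it
describes. The prctl constants are the mainline names and values and are an
assumption for this RFC revision. ksm_fork_inherits() is a hypothetical
helper.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: names and values as in mainline <linux/prctl.h>. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Enable KSM in the parent, fork, and let the child report what it sees.
 * Returns 1 if the child sees KSM enabled, 0 if it does not, and -1 if
 * the check could not be run (for example, no kernel support).
 */
static int ksm_fork_inherits(void)
{
	pid_t pid;
	int status;

	if (prctl(PR_SET_MEMORY_MERGE, 1UL, 0UL, 0UL, 0UL) == -1)
		return -1;

	pid = fork();
	if (pid == 0) {
		/* Child: pass the inherited setting back via the exit code. */
		int ret = prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

		_exit(ret == 1 ? 0 : 1);
	}

	/* Parent: restore the previous state before looking at the result. */
	prctl(PR_SET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

	if (pid == -1)
		return -1;
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

The child queries the setting without ever enabling it, so a result of 1 can
only come from inheritance across fork.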


Changes:
- V2:
  - Added use cases to the cover letter
  - Removed the tracing patch from the patch series and posted it as an
    individual patch
  - Refreshed repo



Stefan Roesch (19):
  mm: add new flag to enable ksm per process
  mm: add flag to __ksm_enter
  mm: add flag to __ksm_exit call
  mm: invoke madvise for all vmas in scan_get_next_rmap_item
  mm: support disabling of ksm for a process
  mm: add new prctl option to get and set ksm for a process
  mm: split off pages_volatile function
  mm: expose general_profit metric
  docs: document general_profit sysfs knob
  mm: calculate ksm process profit metric
  mm: add ksm_merge_type() function
  mm: expose ksm process profit metric in ksm_stat
  mm: expose ksm merge type in ksm_stat
  docs: document new procfs ksm knobs
  tools: add new prctl flags to prctl in tools dir
  selftests/vm: add KSM prctl merge test
  selftests/vm: add KSM get merge type test
  selftests/vm: add KSM fork test
  selftests/vm: add two functions for debugging merge outcome

 Documentation/ABI/testing/sysfs-kernel-mm-ksm |   8 +
 Documentation/admin-guide/mm/ksm.rst          |   8 +-
 fs/proc/base.c                                |   5 +
 include/linux/ksm.h                           |  19 +-
 include/linux/sched/coredump.h                |   1 +
 include/uapi/linux/prctl.h                    |   2 +
 kernel/sys.c                                  |  29 ++
 mm/ksm.c                                      | 114 +++++++-
 tools/include/uapi/linux/prctl.h              |   2 +
 tools/testing/selftests/mm/Makefile           |   3 +-
 tools/testing/selftests/mm/ksm_tests.c        | 254 +++++++++++++++---
 11 files changed, 389 insertions(+), 56 deletions(-)


base-commit: 234a68e24b120b98875a8b6e17a9dead277be16a

Comments

Matthew Wilcox Feb. 10, 2023, 11:23 p.m. UTC | #1

On Fri, Feb 10, 2023 at 01:50:04PM -0800, Stefan Roesch wrote:
> So far KSM can only be enabled by calling madvise for memory regions. What is
> required to enable KSM for more workloads is to enable / disable it at the
> process / cgroup level.
> 
> Use case:
> The madvise call is not available in the programming language. An example for
> this are programs with forked workloads using a garbage collected language without
> pointers. In such a language madvise cannot be made available.
> 
> In addition the addresses of objects get moved around as they are garbage
> collected. KSM sharing needs to be enabled "from the outside" for these type of
> workloads.

Don't you have source code to the interpreter for this mysterious
language?  Usually that would be where we put calls to madvise()

Rik van Riel Feb. 11, 2023, 2:41 a.m. UTC | #2

On Fri, 2023-02-10 at 23:23 +0000, Matthew Wilcox wrote:
> On Fri, Feb 10, 2023 at 01:50:04PM -0800, Stefan Roesch wrote:
> > So far KSM can only be enabled by calling madvise for memory
> > regions. What is
> > required to enable KSM for more workloads is to enable / disable it
> > at the
> > process / cgroup level.
> > 
> > Use case:
> > The madvise call is not available in the programming language. An
> > example for
> > this are programs with forked workloads using a garbage collected
> > language without
> > pointers. In such a language madvise cannot be made available.
> > 
> > In addition the addresses of objects get moved around as they are
> > garbage
> > collected. KSM sharing needs to be enabled "from the outside" for
> > these type of
> > workloads.
> 
> Don't you have source code to the interpreter for this mysterious
> language?  Usually that would be where we put calls to madvise()

That same interpreter is also used for workloads where
KSM brings no benefit, and we don't want the overhead
of KSM.

It really would be useful to have the ability to enable
this on a per workload basis, for programming languages
that do not support madvise.

Johannes Weiner Feb. 21, 2023, 4:10 p.m. UTC | #3

Hi Stefan,

On Fri, Feb 10, 2023 at 01:50:04PM -0800, Stefan Roesch wrote:
> So far KSM can only be enabled by calling madvise for memory regions. What is
> required to enable KSM for more workloads is to enable / disable it at the
> process / cgroup level.
> 
> Use case:
> The madvise call is not available in the programming language. An example for
> this are programs with forked workloads using a garbage collected language without
> pointers. In such a language madvise cannot be made available.
> 
> In addition the addresses of objects get moved around as they are garbage
> collected. KSM sharing needs to be enabled "from the outside" for these type of
> workloads.

It would be good to expand on the argument that Rik made about the
interpreter being used for things where there are no merging
opportunities, and the KSM scanning overhead isn't amortized.

There is a fundamental mismatch in scopes. madvise() is a
workload-local decision, whereas sizable sharing opportunities may or
may not exist across multiple workloads. Only a higher-level entity
like a job scheduler can know for certain whether it's running one or
more instances of a job. That job scheduler in turn doesn't have the
necessary knowledge of the workload's internals to make targeted and
well-timed advise calls with, say, process_madvise().

This also applies to the security concerns brought up in previous
threads. An individual workload doesn't know what else is running on
the machine, so it needs to be highly conservative about what it can
give up for system-wide merging. However, if the system is dedicated
to running multiple jobs within the same security domain, it's the job
scheduler that knows that sharing isn't a problem, and even desirable.

So I think this series makes sense, but it would be good to expand a
bit on the reasoning and address the security aspect in the cover/doc.

> Stefan Roesch (19):
>   mm: add new flag to enable ksm per process
>   mm: add flag to __ksm_enter
>   mm: add flag to __ksm_exit call
>   mm: invoke madvise for all vmas in scan_get_next_rmap_item
>   mm: support disabling of ksm for a process
>   mm: add new prctl option to get and set ksm for a process

The implementation looks sound to me as well.

I think it would be a bit easier to review if you folded these ^^^
patches, the tools patch below, and the prctl selftests, all into one
single commit. It's one logical change. This way the new flags and
helper functions can be reviewed against the new users and callsites
without having to jump back and forth between emails.

>   mm: split off pages_volatile function
>   mm: expose general_profit metric
>   docs: document general_profit sysfs knob
>   mm: calculate ksm process profit metric
>   mm: add ksm_merge_type() function
>   mm: expose ksm process profit metric in ksm_stat
>   mm: expose ksm merge type in ksm_stat
>   docs: document new procfs ksm knobs

Same with the new knobs/stats and their documentation.

Logical splitting is easier to follow than geographical splitting.

Thanks!

Stefan Roesch Feb. 21, 2023, 5:59 p.m. UTC | #4

Johannes Weiner <hannes@cmpxchg.org> writes:

> Hi Stefan,
>
> On Fri, Feb 10, 2023 at 01:50:04PM -0800, Stefan Roesch wrote:
>> So far KSM can only be enabled by calling madvise for memory regions. What is
>> required to enable KSM for more workloads is to enable / disable it at the
>> process / cgroup level.
>>
>> Use case:
>> The madvise call is not available in the programming language. An example for
>> this are programs with forked workloads using a garbage collected language without
>> pointers. In such a language madvise cannot be made available.
>>
>> In addition the addresses of objects get moved around as they are garbage
>> collected. KSM sharing needs to be enabled "from the outside" for these type of
>> workloads.
>
> It would be good to expand on the argument that Rik made about the
> interpreter being used for things where there are no merging
> opportunities, and the KSM scanning overhead isn't amortized.
>
> There is a fundamental mismatch in scopes. madvise() is a
> workload-local decision, whereas sizable sharing opportunities may or
> may not exist across multiple workloads. Only a higher-level entity
> like a job scheduler can know for certain whether it's running one or
> more instances of a job. That job scheduler in turn doesn't have the
> necessary knowledge of the workload's internals to make targeted and
> well-timed advise calls with, say, process_madvise().
>
> This also applies to the security concerns brought up in previous
> threads. An individual workload doesn't know what else is running on
> the machine, so it needs to be highly conservative about what it can
> give up for system-wide merging. However, if the system is dedicated
> to running multiple jobs within the same security domain, it's the job
> scheduler that knows that sharing isn't a problem, and even desirable.
>
> So I think this series makes sense, but it would be good to expand a
> bit on the reasoning and address the security aspect in the cover/doc.
>

These are good points, Johannes. I'll elaborate on them in the next
version of the patch series.

>> Stefan Roesch (19):
>>   mm: add new flag to enable ksm per process
>>   mm: add flag to __ksm_enter
>>   mm: add flag to __ksm_exit call
>>   mm: invoke madvise for all vmas in scan_get_next_rmap_item
>>   mm: support disabling of ksm for a process
>>   mm: add new prctl option to get and set ksm for a process
>
> The implementation looks sound to me as well.
>
> I think it would be a bit easier to review if you folded these ^^^
> patches, the tools patch below, and the prctl selftests, all into one
> single commit. It's one logical change. This way the new flags and
> helper functions can be reviewed against the new users and callsites
> without having to jump back and forth between emails.
>

I'll fold them in the next version.

>>   mm: split off pages_volatile function
>>   mm: expose general_profit metric
>>   docs: document general_profit sysfs knob
>>   mm: calculate ksm process profit metric
>>   mm: add ksm_merge_type() function
>>   mm: expose ksm process profit metric in ksm_stat
>>   mm: expose ksm merge type in ksm_stat
>>   docs: document new procfs ksm knobs
>
> Same with the new knobs/stats and their documentation.
>

I'll fold them in the next version.

> Logical splitting is easier to follow than geographical splitting.
>
> Thanks!