[v3,0/3] mm: process/cgroup ksm support

Message ID 20230224044000.3084046-1-shr@devkernel.io

Message

Stefan Roesch Feb. 24, 2023, 4:39 a.m. UTC
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

Use case 1:
The madvise call is not available in the programming language. Examples of
this are programs with forked workloads using a garbage-collected language
without pointers. In such a language madvise cannot be made available.

In addition, the addresses of objects get moved around as they are garbage
collected. KSM sharing needs to be enabled "from the outside" for these types
of workloads.

Use case 2:
The same interpreter can also be used for workloads where KSM brings no
benefit or even adds overhead. We'd like to be able to enable KSM on a
workload-by-workload basis.

Use case 3:
With the madvise call, sharing opportunities are only enabled for the current
process: it is a workload-local decision. A considerable number of sharing
opportunities may exist across multiple workloads or jobs. Only a higher-level
entity like a job scheduler or container can know for certain if it is running
one or more instances of a job. That job scheduler, however, doesn't have the
necessary internal workload knowledge to make targeted madvise calls.

Security concerns:
In previous discussions security concerns have been brought up. The problem is
that an individual workload does not have knowledge about what else is running
on a machine. Therefore it has to be very conservative about which memory areas
can be shared. However, if the system is dedicated to running multiple jobs
within the same security domain, it is the job scheduler that has the knowledge
that sharing can be safely enabled and is even desirable.

Performance:
Experiments with UKSM have shown a capacity increase of around 20%.


1. New options for the prctl system call
This patch series adds two new options to the prctl system call. The first
one enables KSM at the process level and the second one queries the
setting.

The setting will be inherited by child processes.

With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
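
As an illustration, here is a minimal userspace sketch of the new interface
(the PR_SET_MEMORY_MERGE / PR_GET_MEMORY_MERGE names follow the series'
uapi additions; the numeric values are assumptions in case installed headers
don't have them yet):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67	/* assumed value */
#define PR_GET_MEMORY_MERGE 68	/* assumed value */
#endif

int main(void)
{
	/* Enable KSM for this process; eligible VMAs become mergeable. */
	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
		perror("PR_SET_MEMORY_MERGE");

	/* Query the setting; children forked from here inherit it. */
	printf("ksm enabled: %d\n", prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
	return 0;
}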

2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all
the VMAs and enable KSM for the eligible VMAs.

When forking a process that has KSM enabled, the setting will be inherited by
the new child process.

In addition, when KSM is disabled for a process, KSM will be disabled for the
VMAs where KSM had been enabled.

3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation, but not
calculated. This adds the general profit metric to /sys/kernel/mm/ksm.
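
For reference, ksm.rst describes the metric roughly as follows (reproduced
here as a sketch of the documented formula, not the exact implementation):

  general_profit =~ pages_sharing * sizeof(page) -
                    all_rmap_items * sizeof(rmap_item)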

4. Add more metrics to ksm_stat
This adds the process profit and ksm type metrics to /proc/<pid>/ksm_stat.
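
For illustration, reading the file could then look like this (field names
follow this series; the values below are made up):

  $ cat /proc/<pid>/ksm_stat
  ksm_rmap_items 1000
  ksm_merging_pages 800
  ksm_process_profit 3180800
  ksm_merge_type process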

5. Add more tests to ksm_tests
This adds an option to specify the merge type to the ksm_tests. This allows
testing both madvise and prctl KSM. It also adds a new option to query whether
prctl KSM has been enabled, and a fork test to verify that the KSM process
setting is inherited by child processes (a rough sketch of that check follows
below).
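
As a rough userspace sketch of what the fork test verifies (this is not the
selftest's actual code; the prctl constants are the series' additions, with
values assumed as in the earlier sketch):

#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67	/* assumed value */
#define PR_GET_MEMORY_MERGE 68	/* assumed value */
#endif

int main(void)
{
	pid_t pid;
	int status;

	/* Enable process-level KSM in the parent... */
	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0)) {
		perror("PR_SET_MEMORY_MERGE");
		return 1;
	}

	pid = fork();
	if (pid == 0) {
		/* ...and expect the child to inherit the setting. */
		exit(prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0) == 1 ? 0 : 1);
	}

	waitpid(pid, &status, 0);
	printf("setting inherited: %s\n",
	       WEXITSTATUS(status) == 0 ? "yes" : "no");
	return 0;
}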


Changes:
- V3:
  - folded patches 1 - 6
  - folded patches 7 - 14
  - folded patches 15 - 19
  - Expanded on the use cases in the cover letter
  - Added a section on security concerns to the cover letter

- V2:
  - Added use cases to the cover letter
  - Removed the tracing patch from the patch series and posted it as an
    individual patch
  - Refreshed repo



Stefan Roesch (3):
  mm: add new api to enable ksm per process
  mm: add new KSM process and sysfs knobs
  selftests/mm: add new selftests for KSM

 Documentation/ABI/testing/sysfs-kernel-mm-ksm |   8 +
 Documentation/admin-guide/mm/ksm.rst          |   8 +-
 fs/proc/base.c                                |   5 +
 include/linux/ksm.h                           |  19 +-
 include/linux/sched/coredump.h                |   1 +
 include/uapi/linux/prctl.h                    |   2 +
 kernel/sys.c                                  |  29 ++
 mm/ksm.c                                      | 114 +++++++-
 tools/include/uapi/linux/prctl.h              |   2 +
 tools/testing/selftests/mm/Makefile           |   3 +-
 tools/testing/selftests/mm/ksm_tests.c        | 254 +++++++++++++++---
 11 files changed, 389 insertions(+), 56 deletions(-)


base-commit: 234a68e24b120b98875a8b6e17a9dead277be16a

Comments

Andrew Morton Feb. 26, 2023, 5:08 a.m. UTC | #1
On Thu, 23 Feb 2023 20:39:57 -0800 Stefan Roesch <shr@devkernel.io> wrote:

> So far KSM can only be enabled by calling madvise for memory regions. To
> be able to use KSM for more workloads, KSM needs to have the ability to be
> enabled / disabled at the process / cgroup level.

I'll toss this in for integration and testing, but I'd like to see
reviewer input before proceeding further.

Please plan on adding suitable user-facing documentation?  Presumably a
patch for the prctl manpage?
Stefan Roesch Feb. 27, 2023, 5:13 p.m. UTC | #2
Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 23 Feb 2023 20:39:57 -0800 Stefan Roesch <shr@devkernel.io> wrote:
>
>> So far KSM can only be enabled by calling madvise for memory regions. To
>> be able to use KSM for more workloads, KSM needs to have the ability to be
>> enabled / disabled at the process / cgroup level.
>
> I'll toss this in for integration and testing, but I'd like to see
> reviewer input before proceeding further.
>
> Please plan on adding suitable user-facing documentation?  Presumably a
> patch for the prctl manpage?

I'll work on a patch for the prctl manpage.
Stefan Roesch March 7, 2023, 6:48 p.m. UTC | #3
Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 23 Feb 2023 20:39:57 -0800 Stefan Roesch <shr@devkernel.io> wrote:
>
>> So far KSM can only be enabled by calling madvise for memory regions. To
>> be able to use KSM for more workloads, KSM needs to have the ability to be
>> enabled / disabled at the process / cgroup level.
>
> I'll toss this in for integration and testing, but I'd like to see
> reviewer input before proceeding further.
>
> Please plan on adding suitable user-facing documentation?  Presumably a
> patch for the prctl manpage?

The doc patch has been posted:
https://lore.kernel.org/linux-man/20230227220206.436662-1-shr@devkernel.io/
David Hildenbrand March 8, 2023, 5:01 p.m. UTC | #4
For some reason gmail thought it would be a good idea to move this into
the SPAM folder, so I only saw the recent replies just now.

I'm going to have a look at this soonish.


One point that popped up in the past and that I raised on the last RFC: 
we should think about letting processes *opt out/disable* KSM on their 
own. Either completely, or for selected VMAs.

The reasoning is that you may have an application that really doesn't want
some memory regions to be applicable to KSM (memory de-duplication attacks?
knowing that KSM on some regions will be counter-productive).

For example, remembering if MADV_UNMERGEABLE was called and not only 
clearing the VMA flag. So even if KSM would be force-enabled by some 
tooling after the process started, such regions would not get considered 
for KSM.

It would be a bit like how we handle THP.


On 24.02.23 05:39, Stefan Roesch wrote:
> So far KSM can only be enabled by calling madvise for memory regions. To
> be able to use KSM for more workloads, KSM needs to have the ability to be
> enabled / disabled at the process / cgroup level.
> 
> Use case 1:
> The madvise call is not available in the programming language. Examples of
> this are programs with forked workloads using a garbage-collected language
> without pointers. In such a language madvise cannot be made available.
> 
> In addition, the addresses of objects get moved around as they are garbage
> collected. KSM sharing needs to be enabled "from the outside" for these types
> of workloads.
> 
> Use case 2:
> The same interpreter can also be used for workloads where KSM brings no
> benefit or even adds overhead. We'd like to be able to enable KSM on a
> workload-by-workload basis.
> 
> Use case 3:
> With the madvise call, sharing opportunities are only enabled for the current
> process: it is a workload-local decision. A considerable number of sharing
> opportunities may exist across multiple workloads or jobs. Only a higher-level
> entity like a job scheduler or container can know for certain if it is running
> one or more instances of a job. That job scheduler, however, doesn't have the
> necessary internal workload knowledge to make targeted madvise calls.
> 
> Security concerns:
> In previous discussions security concerns have been brought up. The problem is
> that an individual workload does not have knowledge about what else is running
> on a machine. Therefore it has to be very conservative about which memory areas
> can be shared. However, if the system is dedicated to running multiple jobs
> within the same security domain, it is the job scheduler that has the knowledge
> that sharing can be safely enabled and is even desirable.

Note that there are some papers about why limiting memory deduplication
attacks to single security domains is not sufficient. In particular, the
remote deduplication attacks fall into that category IIRC.
Johannes Weiner March 8, 2023, 5:30 p.m. UTC | #5
Hey David,

On Wed, Mar 08, 2023 at 06:01:14PM +0100, David Hildenbrand wrote:
> For some reason gmail thought it would be a good idea to move this into the
> SPAM folder, so I only saw the recent replies just now.
> 
> I'm going to have a look at this soonish.

Thanks! More eyes are always helpful.

> One point that popped up in the past and that I raised on the last RFC: we
> should think about letting processes *opt out/disable* KSM on their own.
> Either completely, or for selected VMAs.
> 
> The reasoning is that you may have an application that really doesn't want
> some memory regions to be applicable to KSM (memory de-duplication attacks?
> knowing that KSM on some regions will be counter-productive).
> 
> For example, remembering if MADV_UNMERGEABLE was called and not only
> clearing the VMA flag. So even if KSM would be force-enabled by some tooling
> after the process started, such regions would not get considered for KSM.
> 
> It would be a bit like how we handle THP.

I'm not sure the THP comparison is apt. THP is truly a local
optimization that depends on the workload's access patterns. The
environment isn't a true factor. It makes some sense that if there is
a global policy to generally use THP the workload be able to opt out
based on known sparse access patterns. At least until THP allocation
strategy inside the kernel becomes smarter!

Merging opportunities and security questions are trickier. The
application might know which data is sensitive, but it doesn't know
whether its environment is safe or subject to memory attacks, so it
cannot make that decision purely from inside.

There is a conceivable usecase where multiple instances of the same
job are running inside a safe shared security domain and using the
same sensitive data.

There is a conceivable usecase where the system and the workload
collaborate to merge insensitive data across security domains.

I'm honestly not sure which usecase is more likely. My gut feeling is
the first one, simply because of broader concerns of multiple security
domains sharing kernel instances or physical hardware.

> On 24.02.23 05:39, Stefan Roesch wrote:
> > So far KSM can only be enabled by calling madvise for memory regions. To
> > be able to use KSM for more workloads, KSM needs to have the ability to be
> > enabled / disabled at the process / cgroup level.
> > 
> > Use case 1:
> > The madvise call is not available in the programming language. Examples of
> > this are programs with forked workloads using a garbage-collected language
> > without pointers. In such a language madvise cannot be made available.
> > 
> > In addition, the addresses of objects get moved around as they are garbage
> > collected. KSM sharing needs to be enabled "from the outside" for these types
> > of workloads.
> > 
> > Use case 2:
> > The same interpreter can also be used for workloads where KSM brings no
> > benefit or even adds overhead. We'd like to be able to enable KSM on a
> > workload-by-workload basis.
> > 
> > Use case 3:
> > With the madvise call, sharing opportunities are only enabled for the current
> > process: it is a workload-local decision. A considerable number of sharing
> > opportunities may exist across multiple workloads or jobs. Only a higher-level
> > entity like a job scheduler or container can know for certain if it is running
> > one or more instances of a job. That job scheduler, however, doesn't have the
> > necessary internal workload knowledge to make targeted madvise calls.
> > 
> > Security concerns:
> > In previous discussions security concerns have been brought up. The problem is
> > that an individual workload does not have knowledge about what else is running
> > on a machine. Therefore it has to be very conservative about which memory areas
> > can be shared. However, if the system is dedicated to running multiple jobs
> > within the same security domain, it is the job scheduler that has the knowledge
> > that sharing can be safely enabled and is even desirable.
> 
> Note that there are some papers about why limiting memory deduplication
> attacks to single security domains is not sufficient. In particular, the
> remote deduplication attacks fall into that category IIRC.

I think it would be good to elaborate on that and include any caveats
in the documentation.

Ultimately, the bar isn't whether there are attack vectors on a subset
of possible usecases, but whether there are usecases where this can be
used safely, which is obviously true.
David Hildenbrand March 8, 2023, 6:41 p.m. UTC | #6
>> One point that popped up in the past and that I raised on the last RFC: we
>> should think about letting processes *opt out/disable* KSM on their own.
>> Either completely, or for selected VMAs.
>>
>> The reasoning is that you may have an application that really doesn't want
>> some memory regions to be applicable to KSM (memory de-duplication attacks?
>> knowing that KSM on some regions will be counter-productive).
>>
>> For example, remembering if MADV_UNMERGEABLE was called and not only
>> clearing the VMA flag. So even if KSM would be force-enabled by some tooling
>> after the process started, such regions would not get considered for KSM.
>>
>> It would be a bit like how we handle THP.
> 
> I'm not sure the THP comparison is apt. THP is truly a local
> optimization that depends on the workload's access patterns. The
> environment isn't a true factor. It makes some sense that if there is
> a global policy to generally use THP the workload be able to opt out
> based on known sparse access patterns. At least until THP allocation
> strategy inside the kernel becomes smarter!

Yes, and some features really don't want THP, at least for some period 
of time (e.g., userfaultfd), because they are to some degree 
incompatible with the idea of THP populating more memory than was accessed.

Page pinning + KSM was one of the remaining cases where force-enabling 
KSM could have made a real difference (IOW buggy) that we discussed the 
last time this was proposed. That should be fixed now. I guess besides 
that, most features should be compatible with KSM nowadays. So 
force-enabling it should not result in actual issues I guess.

> 
> Merging opportunities and security questions are trickier. The
> application might know which data is sensitive, but it doesn't know
> whether its environment is safe or subject to memory attacks, so it
> cannot make that decision purely from inside.

I agree regarding security. Regarding merging opportunities, I am not so 
sure. There are certainly examples where an application knows best that 
memory deduplication is mostly a lost bet (if a lot of randomization or 
pointers are involved most probably).

> 
> There is a conceivable usecase where multiple instances of the same
> job are running inside a safe shared security domain and using the
> same sensitive data.

Yes. IMHO, such special applications could just enable KSM manually, 
though, instead of enabling it for each and every last piece of 
anonymous memory that doesn't make sense to get deduplicated :)

But of course, I see the simplicity in just enabling it globally.

> 
> There is a conceivable usecase where the system and the workload
> collaborate to merge insensitive data across security domains.
> 
> I'm honestly not sure which usecase is more likely. My gut feeling is
> the first one, simply because of broader concerns of multiple security
> domains sharing kernel instances or physical hardware.
> 

See my side note below.

>> On 24.02.23 05:39, Stefan Roesch wrote:
>>> So far KSM can only be enabled by calling madvise for memory regions. To
>>> be able to use KSM for more workloads, KSM needs to have the ability to be
>>> enabled / disabled at the process / cgroup level.
>>>
>>> Use case 1:
>>> The madvise call is not available in the programming language. Examples of
>>> this are programs with forked workloads using a garbage-collected language
>>> without pointers. In such a language madvise cannot be made available.
>>>
>>> In addition, the addresses of objects get moved around as they are garbage
>>> collected. KSM sharing needs to be enabled "from the outside" for these types
>>> of workloads.
>>>
>>> Use case 2:
>>> The same interpreter can also be used for workloads where KSM brings no
>>> benefit or even adds overhead. We'd like to be able to enable KSM on a
>>> workload-by-workload basis.
>>>
>>> Use case 3:
>>> With the madvise call, sharing opportunities are only enabled for the current
>>> process: it is a workload-local decision. A considerable number of sharing
>>> opportunities may exist across multiple workloads or jobs. Only a higher-level
>>> entity like a job scheduler or container can know for certain if it is running
>>> one or more instances of a job. That job scheduler, however, doesn't have the
>>> necessary internal workload knowledge to make targeted madvise calls.
>>>
>>> Security concerns:
>>> In previous discussions security concerns have been brought up. The problem is
>>> that an individual workload does not have knowledge about what else is running
>>> on a machine. Therefore it has to be very conservative about which memory areas
>>> can be shared. However, if the system is dedicated to running multiple jobs
>>> within the same security domain, it is the job scheduler that has the knowledge
>>> that sharing can be safely enabled and is even desirable.
>>
>> Note that there are some papers about why limiting memory deduplication
>> attacks to single security domains is not sufficient. In particular, the
>> remote deduplication attacks fall into that category IIRC.
> 
> I think it would be good to elaborate on that and include any caveats
> in the documentation.

Yes. The main point I would make is that we should encourage eventual 
users to think twice instead of blindly enabling this feature. Good 
documentation is certainly helpful.

> 
> Ultimately, the bar isn't whether there are attack vectors on a subset
> of possible usecases, but whether there are usecases where this can be
> used safely, which is obviously true.

I agree. But still I have to raise that the security implications might 
be rather subtle and surprising (e.g., single security domain). Sure, 
there are setups that certainly don't care, I totally agree.



Side note:


Of course, I wonder how many workloads would place identical data into
anonymous memory where it would have to get deduplicated, instead of, say,
mmaping a file.

In the VM world it all makes sense to me, because the kernel, libraries, 
...executables may be identical and loaded into guest memory (-> 
anonymous memory) where we'd just wish to deduplicate them. In ordinary
processes, I'm not so sure how much deduplication potential there really
is once pointers etc. are involved and memory allocators go crazy on 
placing unrelated data into the same page. There is one prime example, 
though, that might be different, which is the shared zeropage I guess.


I'd be curious which data the mentioned 20% actually deduplicate:
according to [1], some workloads mostly only deduplicate the shared
zeropage (in their Microsoft Edge scenario, 84% -- 93% of all
deduplicated pages are the zeropage). Deduplicating the shared zeropage is
obviously less security-relevant, and one could easily optimize KSM to
only try deduplicating that and avoid a lot of unstable nodes.

Of course, just a thought on memory deduplication on process level.


[1] https://ieeexplore.ieee.org/document/7546546