mbox series

[RFC,v2,0/4] mm/ksm: add option to automerge VMAs

Message ID 20190514131654.25463-1-oleksandr@redhat.com (mailing list archive)
Headers show
Series mm/ksm: add option to automerge VMAs | expand

Message

Oleksandr Natalenko May 14, 2019, 1:16 p.m. UTC
By default, KSM works only on memory that is marked by madvise(). And the
only way to get around that is to either:

  * use LD_PRELOAD; or
  * patch the kernel with something like UKSM or PKSM.

Instead, lets implement a sysfs knob, which allows marking VMAs as
mergeable. This can be used manually on some task in question or by some
small userspace helper daemon.

The knob is named "force_madvise", and it is write-only. It accepts a PID
to act on. To mark the VMAs as mergeable, use:

   # echo PID > /sys/kernel/mm/ksm/force_madvise

To unmerge all the VMAs, use the same approach, prepending the PID with
the "minus" sign:

   # echo -PID > /sys/kernel/mm/ksm/force_madvise

This patchset is based on earlier Timofey's submission [1], but it doesn't
use dedicated kthread to walk through the list of tasks/VMAs. Instead,
it is up to userspace to traverse all the tasks in /proc if needed.

The previous suggestion [2] was based on amending do_anonymous_page()
handler to implement fully automatic mode, but this approach was
incorrect due to improper locking and not desired due to excessive
complexity.

The current approach just implements minimal interface and leaves the
decision on how and when to act to userspace.

Thanks.

[1] https://lore.kernel.org/patchwork/patch/1012142/
[2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html

Oleksandr Natalenko (4):
  mm/ksm: introduce ksm_enter() helper
  mm/ksm: introduce ksm_leave() helper
  mm/ksm: introduce force_madvise knob
  mm/ksm: add force merging/unmerging documentation

 Documentation/admin-guide/mm/ksm.rst |  11 ++
 mm/ksm.c                             | 160 +++++++++++++++++++++------
 2 files changed, 137 insertions(+), 34 deletions(-)

Comments

Michal Hocko May 14, 2019, 2:41 p.m. UTC | #1
[This is adding a new user visible interface so you should be CCing
linux-api mailing list. Also CC Hugh for KSM in general. Done now]

On Tue 14-05-19 15:16:50, Oleksandr Natalenko wrote:
> By default, KSM works only on memory that is marked by madvise(). And the
> only way to get around that is to either:
> 
>   * use LD_PRELOAD; or
>   * patch the kernel with something like UKSM or PKSM.
> 
> Instead, lets implement a sysfs knob, which allows marking VMAs as
> mergeable. This can be used manually on some task in question or by some
> small userspace helper daemon.
> 
> The knob is named "force_madvise", and it is write-only. It accepts a PID
> to act on. To mark the VMAs as mergeable, use:
> 
>    # echo PID > /sys/kernel/mm/ksm/force_madvise
> 
> To unmerge all the VMAs, use the same approach, prepending the PID with
> the "minus" sign:
> 
>    # echo -PID > /sys/kernel/mm/ksm/force_madvise
> 
> This patchset is based on earlier Timofey's submission [1], but it doesn't
> use dedicated kthread to walk through the list of tasks/VMAs. Instead,
> it is up to userspace to traverse all the tasks in /proc if needed.
> 
> The previous suggestion [2] was based on amending do_anonymous_page()
> handler to implement fully automatic mode, but this approach was
> incorrect due to improper locking and not desired due to excessive
> complexity.
> 
> The current approach just implements minimal interface and leaves the
> decision on how and when to act to userspace.

Please make sure to describe a usecase that warrants adding a new
interface we have to maintain for ever.

> 
> Thanks.
> 
> [1] https://lore.kernel.org/patchwork/patch/1012142/
> [2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html
> 
> Oleksandr Natalenko (4):
>   mm/ksm: introduce ksm_enter() helper
>   mm/ksm: introduce ksm_leave() helper
>   mm/ksm: introduce force_madvise knob
>   mm/ksm: add force merging/unmerging documentation
> 
>  Documentation/admin-guide/mm/ksm.rst |  11 ++
>  mm/ksm.c                             | 160 +++++++++++++++++++++------
>  2 files changed, 137 insertions(+), 34 deletions(-)
> 
> -- 
> 2.21.0
>
Michal Hocko May 14, 2019, 2:51 p.m. UTC | #2
[Forgot Hugh]

On Tue 14-05-19 16:41:05, Michal Hocko wrote:
> [This is adding a new user visible interface so you should be CCing
> linux-api mailing list. Also CC Hugh for KSM in general. Done now]
> 
> On Tue 14-05-19 15:16:50, Oleksandr Natalenko wrote:
> > By default, KSM works only on memory that is marked by madvise(). And the
> > only way to get around that is to either:
> > 
> >   * use LD_PRELOAD; or
> >   * patch the kernel with something like UKSM or PKSM.
> > 
> > Instead, lets implement a sysfs knob, which allows marking VMAs as
> > mergeable. This can be used manually on some task in question or by some
> > small userspace helper daemon.
> > 
> > The knob is named "force_madvise", and it is write-only. It accepts a PID
> > to act on. To mark the VMAs as mergeable, use:
> > 
> >    # echo PID > /sys/kernel/mm/ksm/force_madvise
> > 
> > To unmerge all the VMAs, use the same approach, prepending the PID with
> > the "minus" sign:
> > 
> >    # echo -PID > /sys/kernel/mm/ksm/force_madvise
> > 
> > This patchset is based on earlier Timofey's submission [1], but it doesn't
> > use dedicated kthread to walk through the list of tasks/VMAs. Instead,
> > it is up to userspace to traverse all the tasks in /proc if needed.
> > 
> > The previous suggestion [2] was based on amending do_anonymous_page()
> > handler to implement fully automatic mode, but this approach was
> > incorrect due to improper locking and not desired due to excessive
> > complexity.
> > 
> > The current approach just implements minimal interface and leaves the
> > decision on how and when to act to userspace.
> 
> Please make sure to describe a usecase that warrants adding a new
> interface we have to maintain for ever.
> 
> > 
> > Thanks.
> > 
> > [1] https://lore.kernel.org/patchwork/patch/1012142/
> > [2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html
> > 
> > Oleksandr Natalenko (4):
> >   mm/ksm: introduce ksm_enter() helper
> >   mm/ksm: introduce ksm_leave() helper
> >   mm/ksm: introduce force_madvise knob
> >   mm/ksm: add force merging/unmerging documentation
> > 
> >  Documentation/admin-guide/mm/ksm.rst |  11 ++
> >  mm/ksm.c                             | 160 +++++++++++++++++++++------
> >  2 files changed, 137 insertions(+), 34 deletions(-)
> > 
> > -- 
> > 2.21.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
Oleksandr Natalenko May 15, 2019, 6:25 a.m. UTC | #3
Hi.

On Tue, May 14, 2019 at 04:51:22PM +0200, Michal Hocko wrote:
> [Forgot Hugh]
> 
> On Tue 14-05-19 16:41:05, Michal Hocko wrote:
> > [This is adding a new user visible interface so you should be CCing
> > linux-api mailing list. Also CC Hugh for KSM in general. Done now]

Right, thanks for taking care of this.

> > On Tue 14-05-19 15:16:50, Oleksandr Natalenko wrote:
> > > By default, KSM works only on memory that is marked by madvise(). And the
> > > only way to get around that is to either:
> > > 
> > >   * use LD_PRELOAD; or
> > >   * patch the kernel with something like UKSM or PKSM.
> > > 
> > > Instead, lets implement a sysfs knob, which allows marking VMAs as
> > > mergeable. This can be used manually on some task in question or by some
> > > small userspace helper daemon.
> > > 
> > > The knob is named "force_madvise", and it is write-only. It accepts a PID
> > > to act on. To mark the VMAs as mergeable, use:
> > > 
> > >    # echo PID > /sys/kernel/mm/ksm/force_madvise
> > > 
> > > To unmerge all the VMAs, use the same approach, prepending the PID with
> > > the "minus" sign:
> > > 
> > >    # echo -PID > /sys/kernel/mm/ksm/force_madvise
> > > 
> > > This patchset is based on earlier Timofey's submission [1], but it doesn't
> > > use dedicated kthread to walk through the list of tasks/VMAs. Instead,
> > > it is up to userspace to traverse all the tasks in /proc if needed.
> > > 
> > > The previous suggestion [2] was based on amending do_anonymous_page()
> > > handler to implement fully automatic mode, but this approach was
> > > incorrect due to improper locking and not desired due to excessive
> > > complexity.
> > > 
> > > The current approach just implements minimal interface and leaves the
> > > decision on how and when to act to userspace.
> > 
> > Please make sure to describe a usecase that warrants adding a new
> > interface we have to maintain for ever.

I think of two major consumers of this interface:

1) hosts, that run containers, especially similar ones and especially in
a trusted environment;

2) heavy applications, that can be run in multiple instances, not
limited to opensource ones like Firefox, but also those that cannot be
modified.

I'll add this justification to the cover letter once I send another
iteration of this submission if necessary.

Thank you.

> > 
> > > 
> > > Thanks.
> > > 
> > > [1] https://lore.kernel.org/patchwork/patch/1012142/
> > > [2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html
> > > 
> > > Oleksandr Natalenko (4):
> > >   mm/ksm: introduce ksm_enter() helper
> > >   mm/ksm: introduce ksm_leave() helper
> > >   mm/ksm: introduce force_madvise knob
> > >   mm/ksm: add force merging/unmerging documentation
> > > 
> > >  Documentation/admin-guide/mm/ksm.rst |  11 ++
> > >  mm/ksm.c                             | 160 +++++++++++++++++++++------
> > >  2 files changed, 137 insertions(+), 34 deletions(-)
> > > 
> > > -- 
> > > 2.21.0
> > > 
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko May 15, 2019, 6:53 a.m. UTC | #4
On Wed 15-05-19 08:25:23, Oleksandr Natalenko wrote:
[...]
> > > Please make sure to describe a usecase that warrants adding a new
> > > interface we have to maintain for ever.
> 
> I think of two major consumers of this interface:
> 
> 1) hosts, that run containers, especially similar ones and especially in
> a trusted environment;
> 
> 2) heavy applications, that can be run in multiple instances, not
> limited to opensource ones like Firefox, but also those that cannot be
> modified.

This is way too generic. Please provide something more specific. Ideally
with numbers. Why those usecases cannot use an existing interfaces.
Remember you are trying to add a new user interface which we will have
to maintain for ever.

I will try to comment on the interface itself later. But I have to say
that I am not impressed. Abusing sysfs for per process features is quite
gross to be honest.
Oleksandr Natalenko May 15, 2019, 7:37 a.m. UTC | #5
Hi.

On Wed, May 15, 2019 at 08:53:11AM +0200, Michal Hocko wrote:
> On Wed 15-05-19 08:25:23, Oleksandr Natalenko wrote:
> [...]
> > > > Please make sure to describe a usecase that warrants adding a new
> > > > interface we have to maintain for ever.
> > 
> > I think of two major consumers of this interface:
> > 
> > 1) hosts, that run containers, especially similar ones and especially in
> > a trusted environment;
> > 
> > 2) heavy applications, that can be run in multiple instances, not
> > limited to opensource ones like Firefox, but also those that cannot be
> > modified.
> 
> This is way too generic. Please provide something more specific. Ideally
> with numbers. Why those usecases cannot use an existing interfaces.
> Remember you are trying to add a new user interface which we will have
> to maintain for ever.

For my current setup with 2 Firefox instances I get 100 to 200 MiB saved
for the second instance depending on the amount of tabs.

1 FF instance with 15 tabs:

$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
410

2 FF instances, second one has 12 tabs (all the tabs are different):

$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
592

At the very moment I do not have specific numbers for containerised
workload, but those should be similar in case the containers share
similar/same runtime (like multiple Node.js containers etc).

Answering your question regarding using existing interfaces, since
there's only one, madvise(2), this requires modifying all the
applications one wants to de-duplicate. In case of containers with
arbitrary content or in case of binary-only apps this is pretty hard if
not impossible to do properly.

> I will try to comment on the interface itself later. But I have to say
> that I am not impressed. Abusing sysfs for per process features is quite
> gross to be honest.

Sure, please do.

Thanks for your time and inputs.
Michal Hocko May 15, 2019, 8:33 a.m. UTC | #6
On Wed 15-05-19 09:37:23, Oleksandr Natalenko wrote:
[...]
> > This is way too generic. Please provide something more specific. Ideally
> > with numbers. Why those usecases cannot use an existing interfaces.
> > Remember you are trying to add a new user interface which we will have
> > to maintain for ever.
> 
> For my current setup with 2 Firefox instances I get 100 to 200 MiB saved
> for the second instance depending on the amount of tabs.

What does prevent Firefox (an opensource project) to be updated to use
the explicit merging?

[...]

> Answering your question regarding using existing interfaces, since
> there's only one, madvise(2), this requires modifying all the
> applications one wants to de-duplicate. In case of containers with
> arbitrary content or in case of binary-only apps this is pretty hard if
> not impossible to do properly.

OK, this makes more sense. Please note that there are other people who
would like to see certain madvise operations to be done on a remote
process - e.g. to allow external memory management (Android would like
to control memory aging so something like MADV_DONTNEED without loosing
content and more probably) and potentially other madvise operations.
Or maybe we need a completely new interface other than madvise.

In general, having a more generic API that would cover more usecases is
definitely much more preferable than one ad-hoc API that handles a very
specific usecase. So please try to think about a more generic
Oleksandr Natalenko May 15, 2019, 8:51 a.m. UTC | #7
On Wed, May 15, 2019 at 10:33:21AM +0200, Michal Hocko wrote:
> > For my current setup with 2 Firefox instances I get 100 to 200 MiB saved
> > for the second instance depending on the amount of tabs.
> 
> What does prevent Firefox (an opensource project) to be updated to use
> the explicit merging?

This was rather an example of a big project. Other big projects may be
closed source, of course.

And yes, with regard to FF specifically I think nothing prevents it from
being modified appropriately.

> > Answering your question regarding using existing interfaces, since
> > there's only one, madvise(2), this requires modifying all the
> > applications one wants to de-duplicate. In case of containers with
> > arbitrary content or in case of binary-only apps this is pretty hard if
> > not impossible to do properly.
> 
> OK, this makes more sense. Please note that there are other people who
> would like to see certain madvise operations to be done on a remote
> process - e.g. to allow external memory management (Android would like
> to control memory aging so something like MADV_DONTNEED without loosing
> content and more probably) and potentially other madvise operations.
> Or maybe we need a completely new interface other than madvise.

I didn't know about those intentions. Could you please point me to a
relevant discussion so that I can check the details?

> In general, having a more generic API that would cover more usecases is
> definitely much more preferable than one ad-hoc API that handles a very
> specific usecase. So please try to think about a more generic

Yup, I see now. Since you are aware of ongoing intentions, please do Cc
those people then and/or let me know about previous discussions please.
That way thinking of how a new API should be implemented (be it a sysfs
file or something else) should be easier and more visible.

Thanks.
Michal Hocko May 15, 2019, 2:24 p.m. UTC | #8
On Wed 15-05-19 10:51:58, Oleksandr Natalenko wrote:
> On Wed, May 15, 2019 at 10:33:21AM +0200, Michal Hocko wrote:
> > > For my current setup with 2 Firefox instances I get 100 to 200 MiB saved
> > > for the second instance depending on the amount of tabs.
> > 
> > What does prevent Firefox (an opensource project) to be updated to use
> > the explicit merging?
> 
> This was rather an example of a big project. Other big projects may be
> closed source, of course.

Again, specific examples are usually considered a much better
justification than "something might use the feature".

[...]

> > OK, this makes more sense. Please note that there are other people who
> > would like to see certain madvise operations to be done on a remote
> > process - e.g. to allow external memory management (Android would like
> > to control memory aging so something like MADV_DONTNEED without loosing
> > content and more probably) and potentially other madvise operations.
> > Or maybe we need a completely new interface other than madvise.
> 
> I didn't know about those intentions. Could you please point me to a
> relevant discussion so that I can check the details?

I am sorry I do not have any specific links to patches under discussion.
We have discussed that topic at LSFMM this year
(https://lwn.net/Articles/787217/) and Google guys should be sending
something soon.
Michal Hocko May 15, 2019, 2:51 p.m. UTC | #9
[Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]

On Wed 15-05-19 08:53:11, Michal Hocko wrote:
[...]
> I will try to comment on the interface itself later. But I have to say
> that I am not impressed. Abusing sysfs for per process features is quite
> gross to be honest.

I have already commented on this in other email. I consider sysfs an
unsuitable interface for per-process API. Not to mention this particular
one is very KSM specific while the question about setting different
hints on memory of a remote process is a more generic question. As
already mentioned there are usecases where people would like to say
that a certain memory is cold from outside of the process context (e.g.
monitor application). So essentially a form of a user space memory
management. And this usecase sounds a bit similar to me and having a
common api sounds more sensible to me.

One thing we were discussing at LSFMM this year was a way to either
provide madvise_remote(pid, addr, length, advice) or a fadvise
alternative over /proc/<pid>/map_vmas/<range> file descriptors
(essentially resembling the existing map_files api) to achieve such a
functionality. This is still a very rough idea but the api would sound
much more generic to me and it would allow much wider range of usecases.

But maybe I am completely wrong and this is just opens a can of worms
that we do not want.
Greg KH May 15, 2019, 3:15 p.m. UTC | #10
On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote:
> [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]
> 
> On Wed 15-05-19 08:53:11, Michal Hocko wrote:
> [...]
> > I will try to comment on the interface itself later. But I have to say
> > that I am not impressed. Abusing sysfs for per process features is quite
> > gross to be honest.
> 
> I have already commented on this in other email. I consider sysfs an
> unsuitable interface for per-process API.

Wait, what?  A new sysfs file/directory per process?  That's crazy, no
one must have benchmarked it :)

And I agree, sysfs is not for that, please don't abuse it.  Proc is for
process apis.

thanks,

greg k-h
Michal Hocko May 16, 2019, 7:47 a.m. UTC | #11
On Wed 15-05-19 17:15:57, Greg KH wrote:
> On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote:
> > [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]
> > 
> > On Wed 15-05-19 08:53:11, Michal Hocko wrote:
> > [...]
> > > I will try to comment on the interface itself later. But I have to say
> > > that I am not impressed. Abusing sysfs for per process features is quite
> > > gross to be honest.
> > 
> > I have already commented on this in other email. I consider sysfs an
> > unsuitable interface for per-process API.
> 
> Wait, what?  A new sysfs file/directory per process?  That's crazy, no
> one must have benchmarked it :)

Just to clarify, that was not a per process file but rather per process API.
Essentially echo $PID > $SYSFS_SPECIAL_FILE
Greg KH May 16, 2019, 7:53 a.m. UTC | #12
On Thu, May 16, 2019 at 09:47:13AM +0200, Michal Hocko wrote:
> On Wed 15-05-19 17:15:57, Greg KH wrote:
> > On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote:
> > > [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]
> > > 
> > > On Wed 15-05-19 08:53:11, Michal Hocko wrote:
> > > [...]
> > > > I will try to comment on the interface itself later. But I have to say
> > > > that I am not impressed. Abusing sysfs for per process features is quite
> > > > gross to be honest.
> > > 
> > > I have already commented on this in other email. I consider sysfs an
> > > unsuitable interface for per-process API.
> > 
> > Wait, what?  A new sysfs file/directory per process?  That's crazy, no
> > one must have benchmarked it :)
> 
> Just to clarify, that was not a per process file but rather per process API.
> Essentially echo $PID > $SYSFS_SPECIAL_FILE

Ick, no, that's not ok either.  sysfs files are not a replacement for
syscalls :)

thanks,

greg k-h