mbox series

[RFC,0/4] mm/ksm: add option to automerge VMAs

Message ID 20190510072125.18059-1-oleksandr@redhat.com (mailing list archive)
Headers show
Series mm/ksm: add option to automerge VMAs | expand

Message

Oleksandr Natalenko May 10, 2019, 7:21 a.m. UTC
By default, KSM works only on memory that is marked by madvise(). And the
only way to get around that is to either:

  * use LD_PRELOAD; or
  * patch the kernel with something like UKSM or PKSM.

Instead, lets implement a so-called "always" mode, which allows marking
VMAs as mergeable on do_anonymous_page() call automatically.

The submission introduces a new sysctl knob as well as kernel cmdline option
to control which mode to use. The default mode is to maintain old
(madvise-based) behaviour.

Due to security concerns, this submission also introduces VM_UNMERGEABLE
vmaflag for apps to explicitly opt out of automerging. Because of adding
a new vmaflag, the whole work is available for 64-bit architectures only.

This patchset is based on earlier Timofey's submission [1], but it doesn't
use dedicated kthread to walk through the list of tasks/VMAs.

For my laptop it saves up to 300 MiB of RAM for usual workflow (browser,
terminal, player, chats etc). Timofey's submission also mentions
containerised workload that benefits from automerging too.

Open questions:

  * once "always" mode is activated, should re-scan of all VMAs be
    triggered to find eligible ones for automerging?

Thanks.

[1] https://lore.kernel.org/patchwork/patch/1012142/

Oleksandr Natalenko (4):
  mm/ksm: introduce ksm_enter() helper
  mm/ksm: introduce VM_UNMERGEABLE
  mm/ksm: allow anonymous memory automerging
  mm/ksm: add automerging documentation

 .../admin-guide/kernel-parameters.txt         |   7 +
 Documentation/admin-guide/mm/ksm.rst          |   7 +
 fs/proc/task_mmu.c                            |   3 +
 include/linux/ksm.h                           |   5 +
 include/linux/mm.h                            |   6 +
 include/trace/events/mmflags.h                |   7 +
 mm/ksm.c                                      | 142 ++++++++++++++----
 mm/memory.c                                   |   6 +
 8 files changed, 157 insertions(+), 26 deletions(-)

Comments

Kirill Tkhai May 13, 2019, 10:38 a.m. UTC | #1
Hi, Oleksandr,

On 10.05.2019 10:21, Oleksandr Natalenko wrote:
> By default, KSM works only on memory that is marked by madvise(). And the
> only way to get around that is to either:
> 
>   * use LD_PRELOAD; or
>   * patch the kernel with something like UKSM or PKSM.
>
> Instead, lets implement a so-called "always" mode, which allows marking
> VMAs as mergeable on do_anonymous_page() call automatically.
>
> The submission introduces a new sysctl knob as well as kernel cmdline option
> to control which mode to use. The default mode is to maintain old
> (madvise-based) behaviour.
>
> Due to security concerns, this submission also introduces VM_UNMERGEABLE
> vmaflag for apps to explicitly opt out of automerging. Because of adding
> a new vmaflag, the whole work is available for 64-bit architectures only.
>> This patchset is based on earlier Timofey's submission [1], but it doesn't
> use dedicated kthread to walk through the list of tasks/VMAs.
> 
> For my laptop it saves up to 300 MiB of RAM for usual workflow (browser,
> terminal, player, chats etc). Timofey's submission also mentions
> containerised workload that benefits from automerging too.

This all approach looks complicated for me, and I'm not sure the shown profit
for desktop is big enough to introduce contradictory vma flags, boot option
and advance page fault handler. Also, 32/64bit defines do not look good for
me. I had tried something like this on my laptop some time ago, and
the result was bad even in absolute (not in memory percentage) meaning.
Isn't LD_PRELOAD trick enough to desktop? Your workload is same all the time,
so you may statically insert correct preload to /etc/profile and replace
your mmap forever.

Speaking about containers, something like this may have a sense, I think.
The probability of that several containers have the same pages are higher,
than that desktop applications have the same pages; also LD_PRELOAD for
containers is not applicable. 

But 1)this could be made for trusted containers only (are there similar
issues with KSM like with hardware side-channel attacks?!); 2) the most
shared data for containers in my experience is file cache, which is not
supported by KSM.

There are good results by the link [1], but it's difficult to analyze
them without knowledge about what happens inside them there.

Some of tests have "VM" prefix. What the reason the hypervisor don't mark
their VMAs as mergeable? Can't this be fixed in hypervisor? What is the
generic reason that VMAs are not marked in all the tests?

In case of there is a fundamental problem of calling madvise, can't we
just implement an easier workaround like a new write-only file:

#echo $task > /sys/kernel/mm/ksm/force_madvise

which will mark all anon VMAs as mergeable for a passed task's mm?

A small userspace daemon may write mergeable tasks there from time to time.

Then we won't need to introduce additional vm flags and to change
anon pagefault handler, and the changes will be small and only
related to mm/ksm.c, and good enough for both 32 and 64 bit machines.

Thanks,
Kirill
Oleksandr Natalenko May 13, 2019, 11:33 a.m. UTC | #2
Hi.

On Mon, May 13, 2019 at 01:38:43PM +0300, Kirill Tkhai wrote:
> On 10.05.2019 10:21, Oleksandr Natalenko wrote:
> > By default, KSM works only on memory that is marked by madvise(). And the
> > only way to get around that is to either:
> > 
> >   * use LD_PRELOAD; or
> >   * patch the kernel with something like UKSM or PKSM.
> >
> > Instead, lets implement a so-called "always" mode, which allows marking
> > VMAs as mergeable on do_anonymous_page() call automatically.
> >
> > The submission introduces a new sysctl knob as well as kernel cmdline option
> > to control which mode to use. The default mode is to maintain old
> > (madvise-based) behaviour.
> >
> > Due to security concerns, this submission also introduces VM_UNMERGEABLE
> > vmaflag for apps to explicitly opt out of automerging. Because of adding
> > a new vmaflag, the whole work is available for 64-bit architectures only.
> >> This patchset is based on earlier Timofey's submission [1], but it doesn't
> > use dedicated kthread to walk through the list of tasks/VMAs.
> > 
> > For my laptop it saves up to 300 MiB of RAM for usual workflow (browser,
> > terminal, player, chats etc). Timofey's submission also mentions
> > containerised workload that benefits from automerging too.
> 
> This all approach looks complicated for me, and I'm not sure the shown profit
> for desktop is big enough to introduce contradictory vma flags, boot option
> and advance page fault handler. Also, 32/64bit defines do not look good for
> me. I had tried something like this on my laptop some time ago, and
> the result was bad even in absolute (not in memory percentage) meaning.
> Isn't LD_PRELOAD trick enough to desktop? Your workload is same all the time,
> so you may statically insert correct preload to /etc/profile and replace
> your mmap forever.
>
> Speaking about containers, something like this may have a sense, I think.
> The probability of that several containers have the same pages are higher,
> than that desktop applications have the same pages; also LD_PRELOAD for
> containers is not applicable. 

Yes, I get your point. But the intention is to avoid another hacky trick
(LD_PRELOAD), thus *something* should *preferably* be done on the
kernel level instead.

> But 1)this could be made for trusted containers only (are there similar
> issues with KSM like with hardware side-channel attacks?!);

Regarding side-channel attacks, yes, I think so. Were those openssl guys
who complained about it?..

> 2) the most
> shared data for containers in my experience is file cache, which is not
> supported by KSM.
> 
> There are good results by the link [1], but it's difficult to analyze
> them without knowledge about what happens inside them there.
> 
> Some of tests have "VM" prefix. What the reason the hypervisor don't mark
> their VMAs as mergeable? Can't this be fixed in hypervisor? What is the
> generic reason that VMAs are not marked in all the tests?

Timofey, could you please address this?

Also, just for the sake of another piece of stats here:

$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
526

> In case of there is a fundamental problem of calling madvise, can't we
> just implement an easier workaround like a new write-only file:
> 
> #echo $task > /sys/kernel/mm/ksm/force_madvise
> 
> which will mark all anon VMAs as mergeable for a passed task's mm?
> 
> A small userspace daemon may write mergeable tasks there from time to time.
> 
> Then we won't need to introduce additional vm flags and to change
> anon pagefault handler, and the changes will be small and only
> related to mm/ksm.c, and good enough for both 32 and 64 bit machines.

Yup, looks appealing. Two concerns, though:

1) we are falling back to scanning through the list of tasks (I guess
this is what we wanted to avoid, although this time it happens in the
userspace);

2) what kinds of opt-out we should maintain? Like, what if force_madvise
is called, but the task doesn't want some VMAs to be merged? This will
required new flag anyway, it seems. And should there be another
write-only file to unmerge everything forcibly for specific task?

Thanks.

P.S. Cc'ing Pavel properly this time.
Timofey Titovets May 13, 2019, 11:48 a.m. UTC | #3
пн, 13 мая 2019 г. в 14:33, Oleksandr Natalenko <oleksandr@redhat.com>:
>
> Hi.
>
> On Mon, May 13, 2019 at 01:38:43PM +0300, Kirill Tkhai wrote:
> > On 10.05.2019 10:21, Oleksandr Natalenko wrote:
> > > By default, KSM works only on memory that is marked by madvise(). And the
> > > only way to get around that is to either:
> > >
> > >   * use LD_PRELOAD; or
> > >   * patch the kernel with something like UKSM or PKSM.
> > >
> > > Instead, lets implement a so-called "always" mode, which allows marking
> > > VMAs as mergeable on do_anonymous_page() call automatically.
> > >
> > > The submission introduces a new sysctl knob as well as kernel cmdline option
> > > to control which mode to use. The default mode is to maintain old
> > > (madvise-based) behaviour.
> > >
> > > Due to security concerns, this submission also introduces VM_UNMERGEABLE
> > > vmaflag for apps to explicitly opt out of automerging. Because of adding
> > > a new vmaflag, the whole work is available for 64-bit architectures only.
> > >> This patchset is based on earlier Timofey's submission [1], but it doesn't
> > > use dedicated kthread to walk through the list of tasks/VMAs.
> > >
> > > For my laptop it saves up to 300 MiB of RAM for usual workflow (browser,
> > > terminal, player, chats etc). Timofey's submission also mentions
> > > containerised workload that benefits from automerging too.
> >
> > This all approach looks complicated for me, and I'm not sure the shown profit
> > for desktop is big enough to introduce contradictory vma flags, boot option
> > and advance page fault handler. Also, 32/64bit defines do not look good for
> > me. I had tried something like this on my laptop some time ago, and
> > the result was bad even in absolute (not in memory percentage) meaning.
> > Isn't LD_PRELOAD trick enough to desktop? Your workload is same all the time,
> > so you may statically insert correct preload to /etc/profile and replace
> > your mmap forever.
> >
> > Speaking about containers, something like this may have a sense, I think.
> > The probability of that several containers have the same pages are higher,
> > than that desktop applications have the same pages; also LD_PRELOAD for
> > containers is not applicable.
>
> Yes, I get your point. But the intention is to avoid another hacky trick
> (LD_PRELOAD), thus *something* should *preferably* be done on the
> kernel level instead.
>
> > But 1)this could be made for trusted containers only (are there similar
> > issues with KSM like with hardware side-channel attacks?!);
>
> Regarding side-channel attacks, yes, I think so. Were those openssl guys
> who complained about it?..
>
> > 2) the most
> > shared data for containers in my experience is file cache, which is not
> > supported by KSM.
> >
> > There are good results by the link [1], but it's difficult to analyze
> > them without knowledge about what happens inside them there.
> >
> > Some of tests have "VM" prefix. What the reason the hypervisor don't mark
> > their VMAs as mergeable? Can't this be fixed in hypervisor? What is the
> > generic reason that VMAs are not marked in all the tests?
>
> Timofey, could you please address this?

That's just a describe of machine,
only to show difference in deduplication for application in small VM
and real big server
i.e. KSM enabled in VM for containers, not for hypervisor.

> Also, just for the sake of another piece of stats here:
>
> $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> 526

IIRC, for calculate saving you must use (pages_shared - pages_sharing)

> > In case of there is a fundamental problem of calling madvise, can't we
> > just implement an easier workaround like a new write-only file:
> >
> > #echo $task > /sys/kernel/mm/ksm/force_madvise
> >
> > which will mark all anon VMAs as mergeable for a passed task's mm?
> >
> > A small userspace daemon may write mergeable tasks there from time to time.
> >
> > Then we won't need to introduce additional vm flags and to change
> > anon pagefault handler, and the changes will be small and only
> > related to mm/ksm.c, and good enough for both 32 and 64 bit machines.
>
> Yup, looks appealing. Two concerns, though:
>
> 1) we are falling back to scanning through the list of tasks (I guess
> this is what we wanted to avoid, although this time it happens in the
> userspace);
>
> 2) what kinds of opt-out we should maintain? Like, what if force_madvise
> is called, but the task doesn't want some VMAs to be merged? This will
> required new flag anyway, it seems. And should there be another
> write-only file to unmerge everything forcibly for specific task?
>
> Thanks.
>
> P.S. Cc'ing Pavel properly this time.
>
> --
>   Best regards,
>     Oleksandr Natalenko (post-factum)
>     Senior Software Maintenance Engineer
Oleksandr Natalenko May 13, 2019, 12:01 p.m. UTC | #4
On Mon, May 13, 2019 at 02:48:29PM +0300, Timofey Titovets wrote:
> > Also, just for the sake of another piece of stats here:
> >
> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> > 526
> 
> IIRC, for calculate saving you must use (pages_shared - pages_sharing)

Based on Documentation/ABI/testing/sysfs-kernel-mm-ksm:

	pages_shared: how many shared pages are being used.

	pages_sharing: how many more sites are sharing them i.e. how
	much saved.

and unless I'm missing something, this must be already accounted:

[~]$ echo "$(cat /sys/kernel/mm/ksm/pages_shared) * 4 / 1024" | bc
69

[~]$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
563
Oleksandr Natalenko May 13, 2019, 12:06 p.m. UTC | #5
On Mon, May 13, 2019 at 02:01:17PM +0200, Oleksandr Natalenko wrote:
> On Mon, May 13, 2019 at 02:48:29PM +0300, Timofey Titovets wrote:
> > > Also, just for the sake of another piece of stats here:
> > >
> > > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> > > 526
> > 
> > IIRC, for calculate saving you must use (pages_shared - pages_sharing)
> 
> Based on Documentation/ABI/testing/sysfs-kernel-mm-ksm:
> 
> 	pages_shared: how many shared pages are being used.
> 
> 	pages_sharing: how many more sites are sharing them i.e. how
> 	much saved.
> 
> and unless I'm missing something, this must be already accounted:
> 
> [~]$ echo "$(cat /sys/kernel/mm/ksm/pages_shared) * 4 / 1024" | bc
> 69
> 
> [~]$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> 563

Yup. To expand on this,

 246 /* The number of nodes in the stable tree */
 247 static unsigned long ksm_pages_shared;
 248 
 249 /* The number of page slots additionally sharing those nodes */
 250 static unsigned long ksm_pages_sharing;

2037     if (rmap_item->hlist.next)
2038         ksm_pages_sharing++;
2039     else
2040         ksm_pages_shared++;

IOW, first item is accounter in "shared", the rest will go to "sharing".
Kirill Tkhai May 13, 2019, 12:37 p.m. UTC | #6
On 13.05.2019 14:33, Oleksandr Natalenko wrote:
> Hi.
> 
> On Mon, May 13, 2019 at 01:38:43PM +0300, Kirill Tkhai wrote:
>> On 10.05.2019 10:21, Oleksandr Natalenko wrote:
>>> By default, KSM works only on memory that is marked by madvise(). And the
>>> only way to get around that is to either:
>>>
>>>   * use LD_PRELOAD; or
>>>   * patch the kernel with something like UKSM or PKSM.
>>>
>>> Instead, lets implement a so-called "always" mode, which allows marking
>>> VMAs as mergeable on do_anonymous_page() call automatically.
>>>
>>> The submission introduces a new sysctl knob as well as kernel cmdline option
>>> to control which mode to use. The default mode is to maintain old
>>> (madvise-based) behaviour.
>>>
>>> Due to security concerns, this submission also introduces VM_UNMERGEABLE
>>> vmaflag for apps to explicitly opt out of automerging. Because of adding
>>> a new vmaflag, the whole work is available for 64-bit architectures only.
>>>> This patchset is based on earlier Timofey's submission [1], but it doesn't
>>> use dedicated kthread to walk through the list of tasks/VMAs.
>>>
>>> For my laptop it saves up to 300 MiB of RAM for usual workflow (browser,
>>> terminal, player, chats etc). Timofey's submission also mentions
>>> containerised workload that benefits from automerging too.
>>
>> This all approach looks complicated for me, and I'm not sure the shown profit
>> for desktop is big enough to introduce contradictory vma flags, boot option
>> and advance page fault handler. Also, 32/64bit defines do not look good for
>> me. I had tried something like this on my laptop some time ago, and
>> the result was bad even in absolute (not in memory percentage) meaning.
>> Isn't LD_PRELOAD trick enough to desktop? Your workload is same all the time,
>> so you may statically insert correct preload to /etc/profile and replace
>> your mmap forever.
>>
>> Speaking about containers, something like this may have a sense, I think.
>> The probability of that several containers have the same pages are higher,
>> than that desktop applications have the same pages; also LD_PRELOAD for
>> containers is not applicable. 
> 
> Yes, I get your point. But the intention is to avoid another hacky trick
> (LD_PRELOAD), thus *something* should *preferably* be done on the
> kernel level instead.

I don't think so. Does userspace hack introduce some overhead? It does not
look so. Why should we think about mergeable VMAs in page fault handler?!
This is the last thing we want to think in page fault handler.

Also, there is difficult synchronization in page fault handlers, and it's
easy to make a mistake. So, there is a mistake in [3/4], and you call
ksm_enter() with mmap_sem read locked, while normal way is to call it
with write lock (see madvise_need_mmap_write()).

So, let's don't touch this path. Small optimization for unlikely case will
introduce problems in optimization for likely case in the future.

>> But 1)this could be made for trusted containers only (are there similar
>> issues with KSM like with hardware side-channel attacks?!);
> 
> Regarding side-channel attacks, yes, I think so. Were those openssl guys
> who complained about it?..
> 
>> 2) the most
>> shared data for containers in my experience is file cache, which is not
>> supported by KSM.
>>
>> There are good results by the link [1], but it's difficult to analyze
>> them without knowledge about what happens inside them there.
>>
>> Some of tests have "VM" prefix. What the reason the hypervisor don't mark
>> their VMAs as mergeable? Can't this be fixed in hypervisor? What is the
>> generic reason that VMAs are not marked in all the tests?
> 
> Timofey, could you please address this?
> 
> Also, just for the sake of another piece of stats here:
> 
> $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> 526

This all requires attentive analysis. The number looks pretty big for me.
What are the pages you get merged there? This may be just zero pages,
you have identical.

E.g., your browser want to work fast. It introduces smart schemes,
and preallocates many pages in background (mmap + write 1 byte to a page),
so in further it save some time (no page fault + alloc), when page is
really needed. But your change merges these pages and kills this
optimization. Sounds not good, does this?

I think, we should not think we know and predict better than application
writers, what they want from kernel. Let's people decide themselves
in dependence of their workload. The only exception is some buggy
or old applications, which impossible to change, so force madvise
workaround may help. But only in case there are really such applications...

I'd researched what pages you have duplicated in these 526 MB. Maybe
you find, no action is required or a report to userspace application
to use madvise is needed.

>> In case of there is a fundamental problem of calling madvise, can't we
>> just implement an easier workaround like a new write-only file:
>>
>> #echo $task > /sys/kernel/mm/ksm/force_madvise
>>
>> which will mark all anon VMAs as mergeable for a passed task's mm?
>>
>> A small userspace daemon may write mergeable tasks there from time to time.
>>
>> Then we won't need to introduce additional vm flags and to change
>> anon pagefault handler, and the changes will be small and only
>> related to mm/ksm.c, and good enough for both 32 and 64 bit machines.
> 
> Yup, looks appealing. Two concerns, though:
> 
> 1) we are falling back to scanning through the list of tasks (I guess
> this is what we wanted to avoid, although this time it happens in the
> userspace);

IMO, this should be made only for tasks, which are known to be buggy
(which can't call madvise). Yes, scanning will be required.

> 2) what kinds of opt-out we should maintain? Like, what if force_madvise
> is called, but the task doesn't want some VMAs to be merged? This will
> required new flag anyway, it seems. And should there be another
> write-only file to unmerge everything forcibly for specific task?

For example,

Merge:
#echo $task > /sys/kernel/mm/ksm/force_madvise

Unmerge:
#echo -$task > /sys/kernel/mm/ksm/force_madvise

In case of task don't want to merge some VMA, we just should skip it at all.

But firstly we probably should check, that we really need this, and why
existing applications don't call madvise directly. Now we just don't know,
what happens.

Kirill

P.S. This all above is my opinion. Let's wait, what other people think.
Oleksandr Natalenko May 14, 2019, 6:30 a.m. UTC | #7
Hi.

On Mon, May 13, 2019 at 03:37:56PM +0300, Kirill Tkhai wrote:
> > Yes, I get your point. But the intention is to avoid another hacky trick
> > (LD_PRELOAD), thus *something* should *preferably* be done on the
> > kernel level instead.
> 
> I don't think so. Does userspace hack introduce some overhead? It does not
> look so. Why should we think about mergeable VMAs in page fault handler?!
> This is the last thing we want to think in page fault handler.
> 
> Also, there is difficult synchronization in page fault handlers, and it's
> easy to make a mistake. So, there is a mistake in [3/4], and you call
> ksm_enter() with mmap_sem read locked, while normal way is to call it
> with write lock (see madvise_need_mmap_write()).
> 
> So, let's don't touch this path. Small optimization for unlikely case will
> introduce problems in optimization for likely case in the future.

Yup, you're right, I've missed the fact that write lock is needed there.
Re-vamping locking there is not my intention, so lets find another
solution.

> > Also, just for the sake of another piece of stats here:
> > 
> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> > 526
> 
> This all requires attentive analysis. The number looks pretty big for me.
> What are the pages you get merged there? This may be just zero pages,
> you have identical.
> 
> E.g., your browser want to work fast. It introduces smart schemes,
> and preallocates many pages in background (mmap + write 1 byte to a page),
> so in further it save some time (no page fault + alloc), when page is
> really needed. But your change merges these pages and kills this
> optimization. Sounds not good, does this?
> 
> I think, we should not think we know and predict better than application
> writers, what they want from kernel. Let's people decide themselves
> in dependence of their workload. The only exception is some buggy
> or old applications, which impossible to change, so force madvise
> workaround may help. But only in case there are really such applications...
> 
> I'd researched what pages you have duplicated in these 526 MB. Maybe
> you find, no action is required or a report to userspace application
> to use madvise is needed.

OK, I agree, this is a good argument to move decision to userspace.

> > 2) what kinds of opt-out we should maintain? Like, what if force_madvise
> > is called, but the task doesn't want some VMAs to be merged? This will
> > required new flag anyway, it seems. And should there be another
> > write-only file to unmerge everything forcibly for specific task?
> 
> For example,
> 
> Merge:
> #echo $task > /sys/kernel/mm/ksm/force_madvise

Immediate question: what should be actually done on this? I see 2
options:

1) mark all VMAs as mergeable + set some flag for mmap() to mark all
further allocations as mergeable as well;
2) just mark all the VMAs as mergeable; userspace can call this
periodically to mark new VMAs.

My prediction is that 2) is less destructive, and the decision is
preserved predominantly to userspace, thus it would be a desired option.

> Unmerge:
> #echo -$task > /sys/kernel/mm/ksm/force_madvise

Okay.

> In case of task don't want to merge some VMA, we just should skip it at all.

This way we lose some flexibility, IMO, but I get you point.

Thanks.
Kirill Tkhai May 14, 2019, 9:12 a.m. UTC | #8
On 14.05.2019 09:30, Oleksandr Natalenko wrote:
> Hi.
> 
> On Mon, May 13, 2019 at 03:37:56PM +0300, Kirill Tkhai wrote:
>>> Yes, I get your point. But the intention is to avoid another hacky trick
>>> (LD_PRELOAD), thus *something* should *preferably* be done on the
>>> kernel level instead.
>>
>> I don't think so. Does userspace hack introduce some overhead? It does not
>> look so. Why should we think about mergeable VMAs in page fault handler?!
>> This is the last thing we want to think in page fault handler.
>>
>> Also, there is difficult synchronization in page fault handlers, and it's
>> easy to make a mistake. So, there is a mistake in [3/4], and you call
>> ksm_enter() with mmap_sem read locked, while normal way is to call it
>> with write lock (see madvise_need_mmap_write()).
>>
>> So, let's don't touch this path. Small optimization for unlikely case will
>> introduce problems in optimization for likely case in the future.
> 
> Yup, you're right, I've missed the fact that write lock is needed there.
> Re-vamping locking there is not my intention, so lets find another
> solution.
> 
>>> Also, just for the sake of another piece of stats here:
>>>
>>> $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>>> 526
>>
>> This all requires attentive analysis. The number looks pretty big for me.
>> What are the pages you get merged there? This may be just zero pages,
>> you have identical.
>>
>> E.g., your browser want to work fast. It introduces smart schemes,
>> and preallocates many pages in background (mmap + write 1 byte to a page),
>> so in further it save some time (no page fault + alloc), when page is
>> really needed. But your change merges these pages and kills this
>> optimization. Sounds not good, does this?
>>
>> I think, we should not think we know and predict better than application
>> writers, what they want from kernel. Let's people decide themselves
>> in dependence of their workload. The only exception is some buggy
>> or old applications, which impossible to change, so force madvise
>> workaround may help. But only in case there are really such applications...
>>
>> I'd researched what pages you have duplicated in these 526 MB. Maybe
>> you find, no action is required or a report to userspace application
>> to use madvise is needed.
> 
> OK, I agree, this is a good argument to move decision to userspace.
> 
>>> 2) what kinds of opt-out we should maintain? Like, what if force_madvise
>>> is called, but the task doesn't want some VMAs to be merged? This will
>>> required new flag anyway, it seems. And should there be another
>>> write-only file to unmerge everything forcibly for specific task?
>>
>> For example,
>>
>> Merge:
>> #echo $task > /sys/kernel/mm/ksm/force_madvise
> 
> Immediate question: what should be actually done on this? I see 2
> options:
> 
> 1) mark all VMAs as mergeable + set some flag for mmap() to mark all
> further allocations as mergeable as well;
> 2) just mark all the VMAs as mergeable; userspace can call this
> periodically to mark new VMAs.
> 
> My prediction is that 2) is less destructive, and the decision is
> preserved predominantly to userspace, thus it would be a desired option.

Let's see, how we use KSM now. It's good for virtual machines: people
install the same distribution in several VMs, and they have the same
packages and the same files. When you read a file inside VM, its pages
are file cache for the VM, but they are anonymous pages for host kernel.

Hypervisor marks VM memory as mergeable, and host KSM merges the same
anonymous pages together. Many of file cache inside VM is constant
content, so we have good KSM compression on such the file pages.
The result we have is explainable and expected.

But we don't know anything about pages, you have merged on your laptop.
We can't make any assumptions before analysis of applications, which
produce such the pages. Let's check what happens before we try to implement
some specific design (if we really need something to implement).

The rest is just technical details. We may implement everything we need
on top of this (even implement a polling of /proc/[pid]/maps and write
a task and address of vma to force_madvise or similar file).

Kirill
Oleksandr Natalenko May 14, 2019, 1:33 p.m. UTC | #9
Hi.

On Tue, May 14, 2019 at 12:12:16PM +0300, Kirill Tkhai wrote:
> > Immediate question: what should be actually done on this? I see 2
> > options:
> > 
> > 1) mark all VMAs as mergeable + set some flag for mmap() to mark all
> > further allocations as mergeable as well;
> > 2) just mark all the VMAs as mergeable; userspace can call this
> > periodically to mark new VMAs.
> > 
> > My prediction is that 2) is less destructive, and the decision is
> > preserved predominantly to userspace, thus it would be a desired option.
> 
> Let's see, how we use KSM now. It's good for virtual machines: people
> install the same distribution in several VMs, and they have the same
> packages and the same files. When you read a file inside VM, its pages
> are file cache for the VM, but they are anonymous pages for host kernel.
> 
> Hypervisor marks VM memory as mergeable, and host KSM merges the same
> anonymous pages together. Many of file cache inside VM is constant
> content, so we have good KSM compression on such the file pages.
> The result we have is explainable and expected.

Yup, correct.

> But we don't know anything about pages, you have merged on your laptop.
> We can't make any assumptions before analysis of applications, which
> produce such the pages. Let's check what happens before we try to implement
> some specific design (if we really need something to implement).
> 
> The rest is just technical details. We may implement everything we need
> on top of this (even implement a polling of /proc/[pid]/maps and write
> a task and address of vma to force_madvise or similar file).

I'm not sure that reviewing all the applications falls under the scope
of this and/or similar submission. Personally I do not feel comfortable
reviewing Firefox code, for example.

But I do run 2 instances of FF, one for work stuff, one for personal stuff,
so merging its memory would be definitely beneficial for me. I believe I'm
not the only one doing this, and things are not limited to Firefox only, of
course.

Please consider checking a v2 submission I've just posted. It implements
your suggestion on "force_madvise" knob, and I find your feedback very
relevant and useful.

Thanks.