Message ID | 20190514131654.25463-1-oleksandr@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | mm/ksm: add option to automerge VMAs | expand |
[This is adding a new user visible interface so you should be CCing linux-api mailing list. Also CC Hugh for KSM in general. Done now] On Tue 14-05-19 15:16:50, Oleksandr Natalenko wrote: > By default, KSM works only on memory that is marked by madvise(). And the > only way to get around that is to either: > > * use LD_PRELOAD; or > * patch the kernel with something like UKSM or PKSM. > > Instead, lets implement a sysfs knob, which allows marking VMAs as > mergeable. This can be used manually on some task in question or by some > small userspace helper daemon. > > The knob is named "force_madvise", and it is write-only. It accepts a PID > to act on. To mark the VMAs as mergeable, use: > > # echo PID > /sys/kernel/mm/ksm/force_madvise > > To unmerge all the VMAs, use the same approach, prepending the PID with > the "minus" sign: > > # echo -PID > /sys/kernel/mm/ksm/force_madvise > > This patchset is based on earlier Timofey's submission [1], but it doesn't > use dedicated kthread to walk through the list of tasks/VMAs. Instead, > it is up to userspace to traverse all the tasks in /proc if needed. > > The previous suggestion [2] was based on amending do_anonymous_page() > handler to implement fully automatic mode, but this approach was > incorrect due to improper locking and not desired due to excessive > complexity. > > The current approach just implements minimal interface and leaves the > decision on how and when to act to userspace. Please make sure to describe a usecase that warrants adding a new interface we have to maintain for ever. > > Thanks. > > [1] https://lore.kernel.org/patchwork/patch/1012142/ > [2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html > > Oleksandr Natalenko (4): > mm/ksm: introduce ksm_enter() helper > mm/ksm: introduce ksm_leave() helper > mm/ksm: introduce force_madvise knob > mm/ksm: add force merging/unmerging documentation > > Documentation/admin-guide/mm/ksm.rst | 11 ++ > mm/ksm.c | 160 +++++++++++++++++++++------ > 2 files changed, 137 insertions(+), 34 deletions(-) > > -- > 2.21.0 >
[Forgot Hugh] On Tue 14-05-19 16:41:05, Michal Hocko wrote: > [This is adding a new user visible interface so you should be CCing > linux-api mailing list. Also CC Hugh for KSM in general. Done now] > > On Tue 14-05-19 15:16:50, Oleksandr Natalenko wrote: > > By default, KSM works only on memory that is marked by madvise(). And the > > only way to get around that is to either: > > > > * use LD_PRELOAD; or > > * patch the kernel with something like UKSM or PKSM. > > > > Instead, lets implement a sysfs knob, which allows marking VMAs as > > mergeable. This can be used manually on some task in question or by some > > small userspace helper daemon. > > > > The knob is named "force_madvise", and it is write-only. It accepts a PID > > to act on. To mark the VMAs as mergeable, use: > > > > # echo PID > /sys/kernel/mm/ksm/force_madvise > > > > To unmerge all the VMAs, use the same approach, prepending the PID with > > the "minus" sign: > > > > # echo -PID > /sys/kernel/mm/ksm/force_madvise > > > > This patchset is based on earlier Timofey's submission [1], but it doesn't > > use dedicated kthread to walk through the list of tasks/VMAs. Instead, > > it is up to userspace to traverse all the tasks in /proc if needed. > > > > The previous suggestion [2] was based on amending do_anonymous_page() > > handler to implement fully automatic mode, but this approach was > > incorrect due to improper locking and not desired due to excessive > > complexity. > > > > The current approach just implements minimal interface and leaves the > > decision on how and when to act to userspace. > > Please make sure to describe a usecase that warrants adding a new > interface we have to maintain for ever. > > > > > Thanks. > > > > [1] https://lore.kernel.org/patchwork/patch/1012142/ > > [2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html > > > > Oleksandr Natalenko (4): > > mm/ksm: introduce ksm_enter() helper > > mm/ksm: introduce ksm_leave() helper > > mm/ksm: introduce force_madvise knob > > mm/ksm: add force merging/unmerging documentation > > > > Documentation/admin-guide/mm/ksm.rst | 11 ++ > > mm/ksm.c | 160 +++++++++++++++++++++------ > > 2 files changed, 137 insertions(+), 34 deletions(-) > > > > -- > > 2.21.0 > > > > -- > Michal Hocko > SUSE Labs
Hi. On Tue, May 14, 2019 at 04:51:22PM +0200, Michal Hocko wrote: > [Forgot Hugh] > > On Tue 14-05-19 16:41:05, Michal Hocko wrote: > > [This is adding a new user visible interface so you should be CCing > > linux-api mailing list. Also CC Hugh for KSM in general. Done now] Right, thanks for taking care of this. > > On Tue 14-05-19 15:16:50, Oleksandr Natalenko wrote: > > > By default, KSM works only on memory that is marked by madvise(). And the > > > only way to get around that is to either: > > > > > > * use LD_PRELOAD; or > > > * patch the kernel with something like UKSM or PKSM. > > > > > > Instead, lets implement a sysfs knob, which allows marking VMAs as > > > mergeable. This can be used manually on some task in question or by some > > > small userspace helper daemon. > > > > > > The knob is named "force_madvise", and it is write-only. It accepts a PID > > > to act on. To mark the VMAs as mergeable, use: > > > > > > # echo PID > /sys/kernel/mm/ksm/force_madvise > > > > > > To unmerge all the VMAs, use the same approach, prepending the PID with > > > the "minus" sign: > > > > > > # echo -PID > /sys/kernel/mm/ksm/force_madvise > > > > > > This patchset is based on earlier Timofey's submission [1], but it doesn't > > > use dedicated kthread to walk through the list of tasks/VMAs. Instead, > > > it is up to userspace to traverse all the tasks in /proc if needed. > > > > > > The previous suggestion [2] was based on amending do_anonymous_page() > > > handler to implement fully automatic mode, but this approach was > > > incorrect due to improper locking and not desired due to excessive > > > complexity. > > > > > > The current approach just implements minimal interface and leaves the > > > decision on how and when to act to userspace. > > > > Please make sure to describe a usecase that warrants adding a new > > interface we have to maintain for ever. I think of two major consumers of this interface: 1) hosts, that run containers, especially similar ones and especially in a trusted environment; 2) heavy applications, that can be run in multiple instances, not limited to opensource ones like Firefox, but also those that cannot be modified. I'll add this justification to the cover letter once I send another iteration of this submission if necessary. Thank you. > > > > > > > > Thanks. > > > > > > [1] https://lore.kernel.org/patchwork/patch/1012142/ > > > [2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html > > > > > > Oleksandr Natalenko (4): > > > mm/ksm: introduce ksm_enter() helper > > > mm/ksm: introduce ksm_leave() helper > > > mm/ksm: introduce force_madvise knob > > > mm/ksm: add force merging/unmerging documentation > > > > > > Documentation/admin-guide/mm/ksm.rst | 11 ++ > > > mm/ksm.c | 160 +++++++++++++++++++++------ > > > 2 files changed, 137 insertions(+), 34 deletions(-) > > > > > > -- > > > 2.21.0 > > > > > > > -- > > Michal Hocko > > SUSE Labs > > -- > Michal Hocko > SUSE Labs
On Wed 15-05-19 08:25:23, Oleksandr Natalenko wrote: [...] > > > Please make sure to describe a usecase that warrants adding a new > > > interface we have to maintain for ever. > > I think of two major consumers of this interface: > > 1) hosts, that run containers, especially similar ones and especially in > a trusted environment; > > 2) heavy applications, that can be run in multiple instances, not > limited to opensource ones like Firefox, but also those that cannot be > modified. This is way too generic. Please provide something more specific. Ideally with numbers. Why those usecases cannot use an existing interfaces. Remember you are trying to add a new user interface which we will have to maintain for ever. I will try to comment on the interface itself later. But I have to say that I am not impressed. Abusing sysfs for per process features is quite gross to be honest.
Hi. On Wed, May 15, 2019 at 08:53:11AM +0200, Michal Hocko wrote: > On Wed 15-05-19 08:25:23, Oleksandr Natalenko wrote: > [...] > > > > Please make sure to describe a usecase that warrants adding a new > > > > interface we have to maintain for ever. > > > > I think of two major consumers of this interface: > > > > 1) hosts, that run containers, especially similar ones and especially in > > a trusted environment; > > > > 2) heavy applications, that can be run in multiple instances, not > > limited to opensource ones like Firefox, but also those that cannot be > > modified. > > This is way too generic. Please provide something more specific. Ideally > with numbers. Why those usecases cannot use an existing interfaces. > Remember you are trying to add a new user interface which we will have > to maintain for ever. For my current setup with 2 Firefox instances I get 100 to 200 MiB saved for the second instance depending on the amount of tabs. 1 FF instance with 15 tabs: $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc 410 2 FF instances, second one has 12 tabs (all the tabs are different): $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc 592 At the very moment I do not have specific numbers for containerised workload, but those should be similar in case the containers share similar/same runtime (like multiple Node.js containers etc). Answering your question regarding using existing interfaces, since there's only one, madvise(2), this requires modifying all the applications one wants to de-duplicate. In case of containers with arbitrary content or in case of binary-only apps this is pretty hard if not impossible to do properly. > I will try to comment on the interface itself later. But I have to say > that I am not impressed. Abusing sysfs for per process features is quite > gross to be honest. Sure, please do. Thanks for your time and inputs.
On Wed 15-05-19 09:37:23, Oleksandr Natalenko wrote: [...] > > This is way too generic. Please provide something more specific. Ideally > > with numbers. Why those usecases cannot use an existing interfaces. > > Remember you are trying to add a new user interface which we will have > > to maintain for ever. > > For my current setup with 2 Firefox instances I get 100 to 200 MiB saved > for the second instance depending on the amount of tabs. What does prevent Firefox (an opensource project) to be updated to use the explicit merging? [...] > Answering your question regarding using existing interfaces, since > there's only one, madvise(2), this requires modifying all the > applications one wants to de-duplicate. In case of containers with > arbitrary content or in case of binary-only apps this is pretty hard if > not impossible to do properly. OK, this makes more sense. Please note that there are other people who would like to see certain madvise operations to be done on a remote process - e.g. to allow external memory management (Android would like to control memory aging so something like MADV_DONTNEED without loosing content and more probably) and potentially other madvise operations. Or maybe we need a completely new interface other than madvise. In general, having a more generic API that would cover more usecases is definitely much more preferable than one ad-hoc API that handles a very specific usecase. So please try to think about a more generic
On Wed, May 15, 2019 at 10:33:21AM +0200, Michal Hocko wrote: > > For my current setup with 2 Firefox instances I get 100 to 200 MiB saved > > for the second instance depending on the amount of tabs. > > What does prevent Firefox (an opensource project) to be updated to use > the explicit merging? This was rather an example of a big project. Other big projects may be closed source, of course. And yes, with regard to FF specifically I think nothing prevents it from being modified appropriately. > > Answering your question regarding using existing interfaces, since > > there's only one, madvise(2), this requires modifying all the > > applications one wants to de-duplicate. In case of containers with > > arbitrary content or in case of binary-only apps this is pretty hard if > > not impossible to do properly. > > OK, this makes more sense. Please note that there are other people who > would like to see certain madvise operations to be done on a remote > process - e.g. to allow external memory management (Android would like > to control memory aging so something like MADV_DONTNEED without loosing > content and more probably) and potentially other madvise operations. > Or maybe we need a completely new interface other than madvise. I didn't know about those intentions. Could you please point me to a relevant discussion so that I can check the details? > In general, having a more generic API that would cover more usecases is > definitely much more preferable than one ad-hoc API that handles a very > specific usecase. So please try to think about a more generic Yup, I see now. Since you are aware of ongoing intentions, please do Cc those people then and/or let me know about previous discussions please. That way thinking of how a new API should be implemented (be it a sysfs file or something else) should be easier and more visible. Thanks.
On Wed 15-05-19 10:51:58, Oleksandr Natalenko wrote: > On Wed, May 15, 2019 at 10:33:21AM +0200, Michal Hocko wrote: > > > For my current setup with 2 Firefox instances I get 100 to 200 MiB saved > > > for the second instance depending on the amount of tabs. > > > > What does prevent Firefox (an opensource project) to be updated to use > > the explicit merging? > > This was rather an example of a big project. Other big projects may be > closed source, of course. Again, specific examples are usually considered a much better justification than "something might use the feature". [...] > > OK, this makes more sense. Please note that there are other people who > > would like to see certain madvise operations to be done on a remote > > process - e.g. to allow external memory management (Android would like > > to control memory aging so something like MADV_DONTNEED without loosing > > content and more probably) and potentially other madvise operations. > > Or maybe we need a completely new interface other than madvise. > > I didn't know about those intentions. Could you please point me to a > relevant discussion so that I can check the details? I am sorry I do not have any specific links to patches under discussion. We have discussed that topic at LSFMM this year (https://lwn.net/Articles/787217/) and Google guys should be sending something soon.
[Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com] On Wed 15-05-19 08:53:11, Michal Hocko wrote: [...] > I will try to comment on the interface itself later. But I have to say > that I am not impressed. Abusing sysfs for per process features is quite > gross to be honest. I have already commented on this in other email. I consider sysfs an unsuitable interface for per-process API. Not to mention this particular one is very KSM specific while the question about setting different hints on memory of a remote process is a more generic question. As already mentioned there are usecases where people would like to say that a certain memory is cold from outside of the process context (e.g. monitor application). So essentially a form of a user space memory management. And this usecase sounds a bit similar to me and having a common api sounds more sensible to me. One thing we were discussing at LSFMM this year was a way to either provide madvise_remote(pid, addr, length, advice) or a fadvise alternative over /proc/<pid>/map_vmas/<range> file descriptors (essentially resembling the existing map_files api) to achieve such a functionality. This is still a very rough idea but the api would sound much more generic to me and it would allow much wider range of usecases. But maybe I am completely wrong and this is just opens a can of worms that we do not want.
On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote: > [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com] > > On Wed 15-05-19 08:53:11, Michal Hocko wrote: > [...] > > I will try to comment on the interface itself later. But I have to say > > that I am not impressed. Abusing sysfs for per process features is quite > > gross to be honest. > > I have already commented on this in other email. I consider sysfs an > unsuitable interface for per-process API. Wait, what? A new sysfs file/directory per process? That's crazy, no one must have benchmarked it :) And I agree, sysfs is not for that, please don't abuse it. Proc is for process apis. thanks, greg k-h
On Wed 15-05-19 17:15:57, Greg KH wrote: > On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote: > > [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com] > > > > On Wed 15-05-19 08:53:11, Michal Hocko wrote: > > [...] > > > I will try to comment on the interface itself later. But I have to say > > > that I am not impressed. Abusing sysfs for per process features is quite > > > gross to be honest. > > > > I have already commented on this in other email. I consider sysfs an > > unsuitable interface for per-process API. > > Wait, what? A new sysfs file/directory per process? That's crazy, no > one must have benchmarked it :) Just to clarify, that was not a per process file but rather per process API. Essentially echo $PID > $SYSFS_SPECIAL_FILE
On Thu, May 16, 2019 at 09:47:13AM +0200, Michal Hocko wrote: > On Wed 15-05-19 17:15:57, Greg KH wrote: > > On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote: > > > [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com] > > > > > > On Wed 15-05-19 08:53:11, Michal Hocko wrote: > > > [...] > > > > I will try to comment on the interface itself later. But I have to say > > > > that I am not impressed. Abusing sysfs for per process features is quite > > > > gross to be honest. > > > > > > I have already commented on this in other email. I consider sysfs an > > > unsuitable interface for per-process API. > > > > Wait, what? A new sysfs file/directory per process? That's crazy, no > > one must have benchmarked it :) > > Just to clarify, that was not a per process file but rather per process API. > Essentially echo $PID > $SYSFS_SPECIAL_FILE Ick, no, that's not ok either. sysfs files are not a replacement for syscalls :) thanks, greg k-h