Message ID | 20200302193630.68771-8-minchan@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | introduce memory hinting API for external process | expand |
On 3/2/20 8:36 PM, Minchan Kim wrote: > From: Oleksandr Natalenko <oleksandr@redhat.com> > > It all began with the fact that KSM works only on memory that is marked > by madvise(). And the only way to get around that is to either: > > * use LD_PRELOAD; or > * patch the kernel with something like UKSM or PKSM. > > (i skip ptrace can of worms here intentionally) > > To overcome this restriction, lets employ a new remote madvise API. This > can be used by some small userspace helper daemon that will do auto-KSM > job for us. > > I think of two major consumers of remote KSM hints: > > * hosts, that run containers, especially similar ones and especially in > a trusted environment, sharing the same runtime like Node.js; > > * heavy applications, that can be run in multiple instances, not > limited to opensource ones like Firefox, but also those that cannot be > modified since they are binary-only and, maybe, statically linked. > > Speaking of statistics, more numbers can be found in the very first > submission, that is related to this one [1]. For my current setup with > two Firefox instances I get 100 to 200 MiB saved for the second instance > depending on the amount of tabs. > > 1 FF instance with 15 tabs: > > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc > 410 > > 2 FF instances, second one has 12 tabs (all the tabs are different): > > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc > 592 > > At the very moment I do not have specific numbers for containerised > workload, but those should be comparable in case the containers share > similar/same runtime. > > [1] https://lore.kernel.org/patchwork/patch/1012142/ > > Reviewed-by: SeongJae Park <sjpark@amazon.de> > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> > Signed-off-by: Minchan Kim <minchan@kernel.org> This will lead to one process calling unmerge_ksm_pages() of another. There's a (signal_pending(current)) test there, should it check also the other task, analogically to task 3? Then break_ksm() is fine as it is, as ksmd also calls it, right? > --- > mm/madvise.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/mm/madvise.c b/mm/madvise.c > index e77c6c1fad34..f4fa962ee74d 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior) > switch (behavior) { > case MADV_COLD: > case MADV_PAGEOUT: > +#ifdef CONFIG_KSM > + case MADV_MERGEABLE: > + case MADV_UNMERGEABLE: > +#endif > return true; > default: > return false; >
On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote: > On 3/2/20 8:36 PM, Minchan Kim wrote: > > From: Oleksandr Natalenko <oleksandr@redhat.com> > > > > It all began with the fact that KSM works only on memory that is marked > > by madvise(). And the only way to get around that is to either: > > > > * use LD_PRELOAD; or > > * patch the kernel with something like UKSM or PKSM. > > > > (i skip ptrace can of worms here intentionally) > > > > To overcome this restriction, lets employ a new remote madvise API. This > > can be used by some small userspace helper daemon that will do auto-KSM > > job for us. > > > > I think of two major consumers of remote KSM hints: > > > > * hosts, that run containers, especially similar ones and especially in > > a trusted environment, sharing the same runtime like Node.js; > > > > * heavy applications, that can be run in multiple instances, not > > limited to opensource ones like Firefox, but also those that cannot be > > modified since they are binary-only and, maybe, statically linked. > > > > Speaking of statistics, more numbers can be found in the very first > > submission, that is related to this one [1]. For my current setup with > > two Firefox instances I get 100 to 200 MiB saved for the second instance > > depending on the amount of tabs. > > > > 1 FF instance with 15 tabs: > > > > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc > > 410 > > > > 2 FF instances, second one has 12 tabs (all the tabs are different): > > > > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc > > 592 > > > > At the very moment I do not have specific numbers for containerised > > workload, but those should be comparable in case the containers share > > similar/same runtime. > > > > [1] https://lore.kernel.org/patchwork/patch/1012142/ > > > > Reviewed-by: SeongJae Park <sjpark@amazon.de> > > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > This will lead to one process calling unmerge_ksm_pages() of another. There's a > (signal_pending(current)) test there, should it check also the other task, > analogically to task 3? Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else? > Then break_ksm() is fine as it is, as ksmd also calls it, right? I think break_ksm() cares only about mmap_sem protection, so we should be fine here. > > > --- > > mm/madvise.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index e77c6c1fad34..f4fa962ee74d 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior) > > switch (behavior) { > > case MADV_COLD: > > case MADV_PAGEOUT: > > +#ifdef CONFIG_KSM > > + case MADV_MERGEABLE: > > + case MADV_UNMERGEABLE: > > +#endif > > return true; > > default: > > return false; > > >
On 3/6/20 2:41 PM, Oleksandr Natalenko wrote: > On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote: >> On 3/2/20 8:36 PM, Minchan Kim wrote: >> > From: Oleksandr Natalenko <oleksandr@redhat.com> >> > >> > It all began with the fact that KSM works only on memory that is marked >> > by madvise(). And the only way to get around that is to either: >> > >> > * use LD_PRELOAD; or >> > * patch the kernel with something like UKSM or PKSM. >> > >> > (i skip ptrace can of worms here intentionally) >> > >> > To overcome this restriction, lets employ a new remote madvise API. This >> > can be used by some small userspace helper daemon that will do auto-KSM >> > job for us. >> > >> > I think of two major consumers of remote KSM hints: >> > >> > * hosts, that run containers, especially similar ones and especially in >> > a trusted environment, sharing the same runtime like Node.js; Ah, I forgot to ask, given the discussion of races in patch 2 (Question 2), where android can stop the tasks to apply the madvise hints in a race-free manner, how does that work for remote KSM hints in your scenarios, especially the one above? >> > >> > * heavy applications, that can be run in multiple instances, not >> > limited to opensource ones like Firefox, but also those that cannot be >> > modified since they are binary-only and, maybe, statically linked. >> > >> > Speaking of statistics, more numbers can be found in the very first >> > submission, that is related to this one [1]. For my current setup with >> > two Firefox instances I get 100 to 200 MiB saved for the second instance >> > depending on the amount of tabs. >> > >> > 1 FF instance with 15 tabs: >> > >> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc >> > 410 >> > >> > 2 FF instances, second one has 12 tabs (all the tabs are different): >> > >> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc >> > 592 >> > >> > At the very moment I do not have specific numbers for containerised >> > workload, but those should be comparable in case the containers share >> > similar/same runtime. >> > >> > [1] https://lore.kernel.org/patchwork/patch/1012142/ >> > >> > Reviewed-by: SeongJae Park <sjpark@amazon.de> >> > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> >> > Signed-off-by: Minchan Kim <minchan@kernel.org> >> >> This will lead to one process calling unmerge_ksm_pages() of another. There's a >> (signal_pending(current)) test there, should it check also the other task, >> analogically to task 3? > > Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else? Dunno, it's nice to react to signals quickly, for any proces that gets them, no? >> Then break_ksm() is fine as it is, as ksmd also calls it, right? > > I think break_ksm() cares only about mmap_sem protection, so we should > be fine here. > >> >> > --- >> > mm/madvise.c | 4 ++++ >> > 1 file changed, 4 insertions(+) >> > >> > diff --git a/mm/madvise.c b/mm/madvise.c >> > index e77c6c1fad34..f4fa962ee74d 100644 >> > --- a/mm/madvise.c >> > +++ b/mm/madvise.c >> > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior) >> > switch (behavior) { >> > case MADV_COLD: >> > case MADV_PAGEOUT: >> > +#ifdef CONFIG_KSM >> > + case MADV_MERGEABLE: >> > + case MADV_UNMERGEABLE: >> > +#endif >> > return true; >> > default: >> > return false; >> > >> >
On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote: > On 3/6/20 2:41 PM, Oleksandr Natalenko wrote: > > On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote: > >> On 3/2/20 8:36 PM, Minchan Kim wrote: > >> > From: Oleksandr Natalenko <oleksandr@redhat.com> > >> > > >> > It all began with the fact that KSM works only on memory that is marked > >> > by madvise(). And the only way to get around that is to either: > >> > > >> > * use LD_PRELOAD; or > >> > * patch the kernel with something like UKSM or PKSM. > >> > > >> > (i skip ptrace can of worms here intentionally) > >> > > >> > To overcome this restriction, lets employ a new remote madvise API. This > >> > can be used by some small userspace helper daemon that will do auto-KSM > >> > job for us. > >> > > >> > I think of two major consumers of remote KSM hints: > >> > > >> > * hosts, that run containers, especially similar ones and especially in > >> > a trusted environment, sharing the same runtime like Node.js; > > Ah, I forgot to ask, given the discussion of races in patch 2 (Question 2), > where android can stop the tasks to apply the madvise hints in a race-free > manner, how does that work for remote KSM hints in your scenarios, especially > the one above? We have cgroup.freeze for that. > > >> > > >> > * heavy applications, that can be run in multiple instances, not > >> > limited to opensource ones like Firefox, but also those that cannot be > >> > modified since they are binary-only and, maybe, statically linked. > >> > > >> > Speaking of statistics, more numbers can be found in the very first > >> > submission, that is related to this one [1]. For my current setup with > >> > two Firefox instances I get 100 to 200 MiB saved for the second instance > >> > depending on the amount of tabs. > >> > > >> > 1 FF instance with 15 tabs: > >> > > >> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc > >> > 410 > >> > > >> > 2 FF instances, second one has 12 tabs (all the tabs are different): > >> > > >> > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc > >> > 592 > >> > > >> > At the very moment I do not have specific numbers for containerised > >> > workload, but those should be comparable in case the containers share > >> > similar/same runtime. > >> > > >> > [1] https://lore.kernel.org/patchwork/patch/1012142/ > >> > > >> > Reviewed-by: SeongJae Park <sjpark@amazon.de> > >> > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> > >> > Signed-off-by: Minchan Kim <minchan@kernel.org> > >> > >> This will lead to one process calling unmerge_ksm_pages() of another. There's a > >> (signal_pending(current)) test there, should it check also the other task, > >> analogically to task 3? > > > > Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else? > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no? So, do you mean something like this? === diff --git a/mm/ksm.c b/mm/ksm.c index 363ec8189561..b39c237cfcf4 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma, for (addr = start; addr < end && !err; addr += PAGE_SIZE) { if (ksm_test_exit(vma->vm_mm)) break; - if (signal_pending(current)) + if (signal_pending(current) || + signal_pending(rcu_dereference(vma->vm_mm->owner))) err = -ERESTARTSYS; else err = break_ksm(vma, addr); === BTW, this won't work with !CONFIG_MEMCG, so probably task_struct should be passed through instead. IIUC, this would also require amending struct mm_slot in order to share the same code path with ksmd. I'm not sure I've seen such a culprit anywhere else, so I'm in doubt this would be a correct thing to do. Ideas? > > >> Then break_ksm() is fine as it is, as ksmd also calls it, right? > > > > I think break_ksm() cares only about mmap_sem protection, so we should > > be fine here. > > > >> > >> > --- > >> > mm/madvise.c | 4 ++++ > >> > 1 file changed, 4 insertions(+) > >> > > >> > diff --git a/mm/madvise.c b/mm/madvise.c > >> > index e77c6c1fad34..f4fa962ee74d 100644 > >> > --- a/mm/madvise.c > >> > +++ b/mm/madvise.c > >> > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior) > >> > switch (behavior) { > >> > case MADV_COLD: > >> > case MADV_PAGEOUT: > >> > +#ifdef CONFIG_KSM > >> > + case MADV_MERGEABLE: > >> > + case MADV_UNMERGEABLE: > >> > +#endif > >> > return true; > >> > default: > >> > return false; > >> > > >> > > >
On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote: > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote: [...] > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no? > > So, do you mean something like this? > > === > diff --git a/mm/ksm.c b/mm/ksm.c > index 363ec8189561..b39c237cfcf4 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma, > for (addr = start; addr < end && !err; addr += PAGE_SIZE) { > if (ksm_test_exit(vma->vm_mm)) > break; > - if (signal_pending(current)) > + if (signal_pending(current) || > + signal_pending(rcu_dereference(vma->vm_mm->owner))) > err = -ERESTARTSYS; > else > err = break_ksm(vma, addr); > === This is broken because mm might be attached to different tasks. AFAIU this check is meant to allow quick backoff of the _calling_ process so that it doesn't waste time when the context is killed already. I do not understand why should we care about any other context here? What is the actual problem this would solve?
On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote: > On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote: > > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote: > [...] > > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no? > > > > So, do you mean something like this? > > > > === > > diff --git a/mm/ksm.c b/mm/ksm.c > > index 363ec8189561..b39c237cfcf4 100644 > > --- a/mm/ksm.c > > +++ b/mm/ksm.c > > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma, > > for (addr = start; addr < end && !err; addr += PAGE_SIZE) { > > if (ksm_test_exit(vma->vm_mm)) > > break; > > - if (signal_pending(current)) > > + if (signal_pending(current) || > > + signal_pending(rcu_dereference(vma->vm_mm->owner))) > > err = -ERESTARTSYS; > > else > > err = break_ksm(vma, addr); > > === > > This is broken because mm might be attached to different tasks. > AFAIU this check is meant to allow quick backoff of the _calling_ > process so that it doesn't waste time when the context is killed > already. I do not understand why should we care about any other context > here? What is the actual problem this would solve? I agree with you, but still trying to understand what does Vlastimil mean :). > > -- > Michal Hocko > SUSE Labs >
On 3/9/20 4:19 PM, Oleksandr Natalenko wrote: > On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote: >> On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote: >> > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote: >> [...] >> > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no? >> > >> > So, do you mean something like this? >> > >> > === >> > diff --git a/mm/ksm.c b/mm/ksm.c >> > index 363ec8189561..b39c237cfcf4 100644 >> > --- a/mm/ksm.c >> > +++ b/mm/ksm.c >> > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma, >> > for (addr = start; addr < end && !err; addr += PAGE_SIZE) { >> > if (ksm_test_exit(vma->vm_mm)) >> > break; >> > - if (signal_pending(current)) >> > + if (signal_pending(current) || >> > + signal_pending(rcu_dereference(vma->vm_mm->owner))) >> > err = -ERESTARTSYS; >> > else >> > err = break_ksm(vma, addr); >> > === >> >> This is broken because mm might be attached to different tasks. >> AFAIU this check is meant to allow quick backoff of the _calling_ >> process so that it doesn't waste time when the context is killed >> already. I do not understand why should we care about any other context >> here? What is the actual problem this would solve? > > I agree with you, but still trying to understand what does Vlastimil mean > :). Well you wondered if we should stop caring about current, and I said that probably wouldn't be nice. As for caring about the other task, patch 3/7 does that for (MADV_COLD|MADV_PAGEOUT) so I just pointed out that the KSM case doesn't. AFAIU if we don't check the signals, we might be blocking the killed task from exiting? >> >> -- >> Michal Hocko >> SUSE Labs >> >
On Mon 09-03-20 16:42:43, Vlastimil Babka wrote: > On 3/9/20 4:19 PM, Oleksandr Natalenko wrote: > > On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote: > >> On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote: > >> > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote: > >> [...] > >> > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no? > >> > > >> > So, do you mean something like this? > >> > > >> > === > >> > diff --git a/mm/ksm.c b/mm/ksm.c > >> > index 363ec8189561..b39c237cfcf4 100644 > >> > --- a/mm/ksm.c > >> > +++ b/mm/ksm.c > >> > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma, > >> > for (addr = start; addr < end && !err; addr += PAGE_SIZE) { > >> > if (ksm_test_exit(vma->vm_mm)) > >> > break; > >> > - if (signal_pending(current)) > >> > + if (signal_pending(current) || > >> > + signal_pending(rcu_dereference(vma->vm_mm->owner))) > >> > err = -ERESTARTSYS; > >> > else > >> > err = break_ksm(vma, addr); > >> > === > >> > >> This is broken because mm might be attached to different tasks. > >> AFAIU this check is meant to allow quick backoff of the _calling_ > >> process so that it doesn't waste time when the context is killed > >> already. I do not understand why should we care about any other context > >> here? What is the actual problem this would solve? > > > > I agree with you, but still trying to understand what does Vlastimil mean > > :). > > Well you wondered if we should stop caring about current, and I said that > probably wouldn't be nice. > As for caring about the other task, patch 3/7 does that for > (MADV_COLD|MADV_PAGEOUT) so I just pointed out that the KSM case doesn't. AFAIU > if we don't check the signals, we might be blocking the killed task from exiting? I would have to double check but I do not think this would be a problem because the remote task should take mmget to prevent address space to vanish under its feet. That should also rule out the exclusive mmap_sem usage from the exit path.
On Mon, Mar 2, 2020 at 8:36 PM Minchan Kim <minchan@kernel.org> wrote: > From: Oleksandr Natalenko <oleksandr@redhat.com> > > It all began with the fact that KSM works only on memory that is marked > by madvise(). And the only way to get around that is to either: [...] > To overcome this restriction, lets employ a new remote madvise API. This > can be used by some small userspace helper daemon that will do auto-KSM > job for us. > > I think of two major consumers of remote KSM hints: [...] > * heavy applications, that can be run in multiple instances, not > limited to opensource ones like Firefox, but also those that cannot be > modified since they are binary-only and, maybe, statically linked. Just as a note, since you're mentioning Firefox as a usecase: Memory deduplication between browser renderers creates new side channels and is a questionable idea from a security standpoint. Memory deduplication is (mostly) fine if either all involved processes are trusted or no involved processes contain secrets, but browsers usually run tons of untrusted code while at the same time containing lots of valuable secrets.
diff --git a/mm/madvise.c b/mm/madvise.c index e77c6c1fad34..f4fa962ee74d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior) switch (behavior) { case MADV_COLD: case MADV_PAGEOUT: +#ifdef CONFIG_KSM + case MADV_MERGEABLE: + case MADV_UNMERGEABLE: +#endif return true; default: return false;