diff mbox series

[v2,2/5] mm: introduce external memory hinting API

Message ID 20200116235953.163318-3-minchan@kernel.org (mailing list archive)
State New, archived
Headers show
Series introduce memory hinting API for external process | expand

Commit Message

Minchan Kim Jan. 16, 2020, 11:59 p.m. UTC
There is usecase that System Management Software(SMS) want to give
a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
in the case of Android, it is the ActivityManagerService.

It's similar in spirit to madvise(MADV_WONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon(ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.

To solve the issue, this patch introduces new syscall process_madvise(2).
It uses pidfd of an external processs to give the hint.

 int process_madvise(int pidfd, void *addr, size_t length, int advise,
			unsigned long flag);

Since it could affect other process's address range, only privileged
process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
gives it the right to ptrace the process could use it successfully.
The flag argument is reserved for future use if we need to extend the
API.

I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky. Because we are not sure all hints make
sense from external process and implementation for the hint may rely on
the caller being in the current context so it could be error-prone.
Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

If someone want to add other hints, we could hear hear the usecase and
review it for each hint. It's more safe for maintainace rather than
introducing a buggy syscall but hard to fix it later.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd.h             |  2 +-
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  5 +-
 kernel/sys_ni.c                             |  1 +
 mm/madvise.c                                | 63 +++++++++++++++++++++
 21 files changed, 88 insertions(+), 2 deletions(-)

Comments

Michal Hocko Jan. 17, 2020, 11:52 a.m. UTC | #1
On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> There is usecase that System Management Software(SMS) want to give
> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> in the case of Android, it is the ActivityManagerService.
> 
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
> 
> To solve the issue, this patch introduces new syscall process_madvise(2).
> It uses pidfd of an external processs to give the hint.
> 
>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> 			unsigned long flag);
> 
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrace the process could use it successfully.
> The flag argument is reserved for future use if we need to extend the
> API.
> 
> I think supporting all hints madvise has/will supported/support to
> process_madvise is rather risky. Because we are not sure all hints make
> sense from external process and implementation for the hint may rely on
> the caller being in the current context so it could be error-prone.
> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> 
> If someone want to add other hints, we could hear hear the usecase and
> review it for each hint. It's more safe for maintainace rather than
> introducing a buggy syscall but hard to fix it later.

I have brought this up when we discussed this in the past but there is
no reflection on that here so let me bring that up again. 

I believe that the interface has an inherent problem that it is racy.
The external entity needs to know the address space layout of the target
process to do anyhing useful on it. The address space is however under
the full control of the target process though and the external entity
has no means to find out that the layout has changed. So
time-to-check-time-to-act is an inherent problem.

This is a serious design flaw and it should be explained why it doesn't
matter or how to use the interface properly to prevent that problem.
Kirill A. Shutemov Jan. 17, 2020, 3:58 p.m. UTC | #2
On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > There is usecase that System Management Software(SMS) want to give
> > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > in the case of Android, it is the ActivityManagerService.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> > 
> > To solve the issue, this patch introduces new syscall process_madvise(2).
> > It uses pidfd of an external processs to give the hint.
> > 
> >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > 			unsigned long flag);
> > 
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrace the process could use it successfully.
> > The flag argument is reserved for future use if we need to extend the
> > API.
> > 
> > I think supporting all hints madvise has/will supported/support to
> > process_madvise is rather risky. Because we are not sure all hints make
> > sense from external process and implementation for the hint may rely on
> > the caller being in the current context so it could be error-prone.
> > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > 
> > If someone want to add other hints, we could hear hear the usecase and
> > review it for each hint. It's more safe for maintainace rather than
> > introducing a buggy syscall but hard to fix it later.
> 
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again. 
> 
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
> 
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

I agree, it looks flawed.

Also I don't see what System Management Software can generically do on
sub-process level. I mean how can it decide which part of address space is
less important than other.

I see how a manager can indicate that this process (or a group of
processes) is less important than other, but on per-addres-range basis?
Minchan Kim Jan. 17, 2020, 5:25 p.m. UTC | #3
On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > There is usecase that System Management Software(SMS) want to give
> > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > in the case of Android, it is the ActivityManagerService.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> > 
> > To solve the issue, this patch introduces new syscall process_madvise(2).
> > It uses pidfd of an external processs to give the hint.
> > 
> >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > 			unsigned long flag);
> > 
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrace the process could use it successfully.
> > The flag argument is reserved for future use if we need to extend the
> > API.
> > 
> > I think supporting all hints madvise has/will supported/support to
> > process_madvise is rather risky. Because we are not sure all hints make
> > sense from external process and implementation for the hint may rely on
> > the caller being in the current context so it could be error-prone.
> > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > 
> > If someone want to add other hints, we could hear hear the usecase and
> > review it for each hint. It's more safe for maintainace rather than
> > introducing a buggy syscall but hard to fix it later.
> 
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again. 
> 
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
> 
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

Sorry for the missing that part.

It's not a particular problem of this API because other APIs already have
done with that(e.g., move_pages, process_vm_writev).

Point is userspace has several ways for the control of target process
like SIGSTOP, cgroup freezer or even no need to control since platform
is already aware of that the process will never run until he grant it
or it's resilient even though the race happens.

In future, if we want to support more fine-grained consistency model
like memory layout, we could provide some API to get cookie(e.g.,
seq count which is updated whenever vma of the process changes).  And then
we could feed the cookie to process_madvise's last argument so that
it can fail if founds it's not matched.
For that API, Daniel already posted RFC - process_getinfo[1].
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
Minchan Kim Jan. 17, 2020, 5:32 p.m. UTC | #4
On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > There is usecase that System Management Software(SMS) want to give
> > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > in the case of Android, it is the ActivityManagerService.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > > 
> > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > It uses pidfd of an external processs to give the hint.
> > > 
> > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > 			unsigned long flag);
> > > 
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrace the process could use it successfully.
> > > The flag argument is reserved for future use if we need to extend the
> > > API.
> > > 
> > > I think supporting all hints madvise has/will supported/support to
> > > process_madvise is rather risky. Because we are not sure all hints make
> > > sense from external process and implementation for the hint may rely on
> > > the caller being in the current context so it could be error-prone.
> > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > 
> > > If someone want to add other hints, we could hear hear the usecase and
> > > review it for each hint. It's more safe for maintainace rather than
> > > introducing a buggy syscall but hard to fix it later.
> > 
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again. 
> > 
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> > 
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
> 
> I agree, it looks flawed.
> 
> Also I don't see what System Management Software can generically do on
> sub-process level. I mean how can it decide which part of address space is
> less important than other.
> 
> I see how a manager can indicate that this process (or a group of
> processes) is less important than other, but on per-addres-range basis?

For example, memory ranges shared by several processes or critical for the
latency, we could avoid those ranges to be cold/pageout to prevent
unncecessary CPU burning/paging.

I also think people don't want to give an KSM hint to non-mergeable area.
Kirill A. Shutemov Jan. 17, 2020, 9:26 p.m. UTC | #5
On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > There is usecase that System Management Software(SMS) want to give
> > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > in the case of Android, it is the ActivityManagerService.
> > > > 
> > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > required to make the reclaim decision is not known to the app. Instead,
> > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > and that daemon must be able to initiate reclaim on its own without
> > > > any app involvement.
> > > > 
> > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > It uses pidfd of an external processs to give the hint.
> > > > 
> > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > 			unsigned long flag);
> > > > 
> > > > Since it could affect other process's address range, only privileged
> > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > gives it the right to ptrace the process could use it successfully.
> > > > The flag argument is reserved for future use if we need to extend the
> > > > API.
> > > > 
> > > > I think supporting all hints madvise has/will supported/support to
> > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > sense from external process and implementation for the hint may rely on
> > > > the caller being in the current context so it could be error-prone.
> > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > 
> > > > If someone want to add other hints, we could hear hear the usecase and
> > > > review it for each hint. It's more safe for maintainace rather than
> > > > introducing a buggy syscall but hard to fix it later.
> > > 
> > > I have brought this up when we discussed this in the past but there is
> > > no reflection on that here so let me bring that up again. 
> > > 
> > > I believe that the interface has an inherent problem that it is racy.
> > > The external entity needs to know the address space layout of the target
> > > process to do anyhing useful on it. The address space is however under
> > > the full control of the target process though and the external entity
> > > has no means to find out that the layout has changed. So
> > > time-to-check-time-to-act is an inherent problem.
> > > 
> > > This is a serious design flaw and it should be explained why it doesn't
> > > matter or how to use the interface properly to prevent that problem.
> > 
> > I agree, it looks flawed.
> > 
> > Also I don't see what System Management Software can generically do on
> > sub-process level. I mean how can it decide which part of address space is
> > less important than other.
> > 
> > I see how a manager can indicate that this process (or a group of
> > processes) is less important than other, but on per-addres-range basis?
> 
> For example, memory ranges shared by several processes or critical for the
> latency, we could avoid those ranges to be cold/pageout to prevent
> unncecessary CPU burning/paging.

Hmm.. I still don't see why any external entity has a better (or any)
knowledge about the matter. The process has to do this, no?

> I also think people don't want to give an KSM hint to non-mergeable area.

And how the manager knows which data is mergable?

If you are intimate enough with the process' internal state feel free to
inject syscall into the process with ptrace. Why bother with half-measures?
Sandeep Patil Jan. 19, 2020, 4:14 p.m. UTC | #6
On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > There is usecase that System Management Software(SMS) want to give
> > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > in the case of Android, it is the ActivityManagerService.
> > > > > 
> > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > any app involvement.
> > > > > 
> > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > It uses pidfd of an external processs to give the hint.
> > > > > 
> > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > 			unsigned long flag);
> > > > > 
> > > > > Since it could affect other process's address range, only privileged
> > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > gives it the right to ptrace the process could use it successfully.
> > > > > The flag argument is reserved for future use if we need to extend the
> > > > > API.
> > > > > 
> > > > > I think supporting all hints madvise has/will supported/support to
> > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > sense from external process and implementation for the hint may rely on
> > > > > the caller being in the current context so it could be error-prone.
> > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > 
> > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > introducing a buggy syscall but hard to fix it later.
> > > > 
> > > > I have brought this up when we discussed this in the past but there is
> > > > no reflection on that here so let me bring that up again. 
> > > > 
> > > > I believe that the interface has an inherent problem that it is racy.
> > > > The external entity needs to know the address space layout of the target
> > > > process to do anyhing useful on it. The address space is however under
> > > > the full control of the target process though and the external entity
> > > > has no means to find out that the layout has changed. So
> > > > time-to-check-time-to-act is an inherent problem.
> > > > 
> > > > This is a serious design flaw and it should be explained why it doesn't
> > > > matter or how to use the interface properly to prevent that problem.
> > > 
> > > I agree, it looks flawed.
> > > 
> > > Also I don't see what System Management Software can generically do on
> > > sub-process level. I mean how can it decide which part of address space is
> > > less important than other.
> > > 
> > > I see how a manager can indicate that this process (or a group of
> > > processes) is less important than other, but on per-addres-range basis?
> > 
> > For example, memory ranges shared by several processes or critical for the
> > latency, we could avoid those ranges to be cold/pageout to prevent
> > unncecessary CPU burning/paging.
> 
> Hmm.. I still don't see why any external entity has a better (or any)
> knowledge about the matter. The process has to do this, no?

FWIW, I totally agree with the time-to-check-time-to-react problem. However,
I'd like to clarify the ActivityManager/SystemServer case (I'll call it
SystemServer from now on)

For Android, every application (including the special SystemServer) are forked
from Zygote. The reason ofcourse is to share as many libraries and classes between
the two as possible to benefit from the preloading during boot.

After applications start, (almost) all of the APIs  end up calling into this
SystemServer process over IPC (binder) and back to the application.

In a fully running system, the SystemServer monitors every single process
periodically to calculate their PSS / RSS and also decides which process is
"important" to the user for interactivity.

So, because of how these processes start _and_ the fact that the SystemServer
is looping to monitor each process, it does tend to *know* which address
range of the application is not used / useful.

Besides, we can never rely on applications to clean things up themselves.
We've had the "hey app1, the system is low on memory, please trim your
memory usage down" notifications for a long time[1]. They rely on
applications honoring the broadcasts and very few do.

So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant memory in
these applications will be useful.


- ssp

1. https://developer.android.com/topic/performance/memory


> 
> > I also think people don't want to give an KSM hint to non-mergeable area.
> 
> And how the manager knows which data is mergable?
> 
> If you are intimate enough with the process' internal state feel free to
> inject syscall into the process with ptrace. Why bother with half-measures?
> 
> -- 
>  Kirill A. Shutemov
Michal Hocko Jan. 20, 2020, 7:58 a.m. UTC | #7
On Sun 19-01-20 08:14:31, sspatil@google.com wrote:
> On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > > There is usecase that System Management Software(SMS) want to give
> > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > > in the case of Android, it is the ActivityManagerService.
> > > > > > 
> > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > > any app involvement.
> > > > > > 
> > > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > > It uses pidfd of an external processs to give the hint.
> > > > > > 
> > > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > > 			unsigned long flag);
> > > > > > 
> > > > > > Since it could affect other process's address range, only privileged
> > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > > gives it the right to ptrace the process could use it successfully.
> > > > > > The flag argument is reserved for future use if we need to extend the
> > > > > > API.
> > > > > > 
> > > > > > I think supporting all hints madvise has/will supported/support to
> > > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > > sense from external process and implementation for the hint may rely on
> > > > > > the caller being in the current context so it could be error-prone.
> > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > > 
> > > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > > introducing a buggy syscall but hard to fix it later.
> > > > > 
> > > > > I have brought this up when we discussed this in the past but there is
> > > > > no reflection on that here so let me bring that up again. 
> > > > > 
> > > > > I believe that the interface has an inherent problem that it is racy.
> > > > > The external entity needs to know the address space layout of the target
> > > > > process to do anyhing useful on it. The address space is however under
> > > > > the full control of the target process though and the external entity
> > > > > has no means to find out that the layout has changed. So
> > > > > time-to-check-time-to-act is an inherent problem.
> > > > > 
> > > > > This is a serious design flaw and it should be explained why it doesn't
> > > > > matter or how to use the interface properly to prevent that problem.
> > > > 
> > > > I agree, it looks flawed.
> > > > 
> > > > Also I don't see what System Management Software can generically do on
> > > > sub-process level. I mean how can it decide which part of address space is
> > > > less important than other.
> > > > 
> > > > I see how a manager can indicate that this process (or a group of
> > > > processes) is less important than other, but on per-addres-range basis?
> > > 
> > > For example, memory ranges shared by several processes or critical for the
> > > latency, we could avoid those ranges to be cold/pageout to prevent
> > > unncecessary CPU burning/paging.
> > 
> > Hmm.. I still don't see why any external entity has a better (or any)
> > knowledge about the matter. The process has to do this, no?
> 
> FWIW, I totally agree with the time-to-check-time-to-react problem. However,
> I'd like to clarify the ActivityManager/SystemServer case (I'll call it
> SystemServer from now on)
> 
> For Android, every application (including the special SystemServer) are forked
> from Zygote. The reason ofcourse is to share as many libraries and classes between
> the two as possible to benefit from the preloading during boot.
> 
> After applications start, (almost) all of the APIs  end up calling into this
> SystemServer process over IPC (binder) and back to the application.
> 
> In a fully running system, the SystemServer monitors every single process
> periodically to calculate their PSS / RSS and also decides which process is
> "important" to the user for interactivity.
> 
> So, because of how these processes start _and_ the fact that the SystemServer
> is looping to monitor each process, it does tend to *know* which address
> range of the application is not used / useful.
> 
> Besides, we can never rely on applications to clean things up themselves.
> We've had the "hey app1, the system is low on memory, please trim your
> memory usage down" notifications for a long time[1]. They rely on
> applications honoring the broadcasts and very few do.
> 
> So, if we want to avoid the inevitable killing of the application and
> restarting it, some way to be able to tell the OS about unimportant memory in
> these applications will be useful.

This is a useful information that should be a part of the changelog. I
do see how the current form of the API might fit into Android model
without many problems. But we are not designing an API for a single
usecase, right? In a highly cooperative environments you can use ptrace
code injection as mentioned by Kirill. Or is there any fundamental
problem about that?

The interface really has to be robust to future potential usecases.
Michal Hocko Jan. 20, 2020, 8:03 a.m. UTC | #8
On Fri 17-01-20 09:25:42, Minchan Kim wrote:
> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > There is usecase that System Management Software(SMS) want to give
> > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > in the case of Android, it is the ActivityManagerService.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > > 
> > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > It uses pidfd of an external processs to give the hint.
> > > 
> > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > 			unsigned long flag);
> > > 
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrace the process could use it successfully.
> > > The flag argument is reserved for future use if we need to extend the
> > > API.
> > > 
> > > I think supporting all hints madvise has/will supported/support to
> > > process_madvise is rather risky. Because we are not sure all hints make
> > > sense from external process and implementation for the hint may rely on
> > > the caller being in the current context so it could be error-prone.
> > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > 
> > > If someone want to add other hints, we could hear hear the usecase and
> > > review it for each hint. It's more safe for maintainace rather than
> > > introducing a buggy syscall but hard to fix it later.
> > 
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again. 
> > 
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> > 
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
> 
> Sorry for the missing that part.
> 
> It's not a particular problem of this API because other APIs already have
> done with that(e.g., move_pages, process_vm_writev).

I am sorry but this is not really an argument.

> Point is userspace has several ways for the control of target process
> like SIGSTOP, cgroup freezer or even no need to control since platform
> is already aware of that the process will never run until he grant it
> or it's resilient even though the race happens.

If you have that level of control then you can simply inject the code
via ptrace and you do not need a new syscall in the first place.

> In future, if we want to support more fine-grained consistency model
> like memory layout, we could provide some API to get cookie(e.g.,
> seq count which is updated whenever vma of the process changes).  And then
> we could feed the cookie to process_madvise's last argument so that
> it can fail if founds it's not matched.
> For that API, Daniel already posted RFC - process_getinfo[1].
> https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

So why do not we start with a clean API since the beginning? I do agree
that a remote madvise is an interesting feature and it opens gates to
all sorts of userspace memory management which is not possible this
days. But the syscall has to have a reasonable semantic to allow that.
We cannot simply start with a half proken symantic first based on an
Android usecase and then hit the wall as soon as others with a different
user space model want to use it as well.
Kirill Tkhai Jan. 20, 2020, 10:24 a.m. UTC | #9
On 17.01.2020 14:52, Michal Hocko wrote:
> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
>> There is usecase that System Management Software(SMS) want to give
>> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
>> in the case of Android, it is the ActivityManagerService.
>>
>> It's similar in spirit to madvise(MADV_WONTNEED), but the information
>> required to make the reclaim decision is not known to the app. Instead,
>> it is known to the centralized userspace daemon(ActivityManagerService),
>> and that daemon must be able to initiate reclaim on its own without
>> any app involvement.
>>
>> To solve the issue, this patch introduces new syscall process_madvise(2).
>> It uses pidfd of an external processs to give the hint.
>>
>>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
>> 			unsigned long flag);
>>
>> Since it could affect other process's address range, only privileged
>> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
>> gives it the right to ptrace the process could use it successfully.
>> The flag argument is reserved for future use if we need to extend the
>> API.
>>
>> I think supporting all hints madvise has/will supported/support to
>> process_madvise is rather risky. Because we are not sure all hints make
>> sense from external process and implementation for the hint may rely on
>> the caller being in the current context so it could be error-prone.
>> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
>>
>> If someone want to add other hints, we could hear hear the usecase and
>> review it for each hint. It's more safe for maintainace rather than
>> introducing a buggy syscall but hard to fix it later.
> 
> I have brought this up when we discussed this in the past but there is
> no reflection on that here so let me bring that up again. 
> 
> I believe that the interface has an inherent problem that it is racy.
> The external entity needs to know the address space layout of the target
> process to do anyhing useful on it. The address space is however under
> the full control of the target process though and the external entity
> has no means to find out that the layout has changed. So
> time-to-check-time-to-act is an inherent problem.
> 
> This is a serious design flaw and it should be explained why it doesn't
> matter or how to use the interface properly to prevent that problem.

Really, any address space manipulation, where more than one process is
involved, is racy. Even two threads on common memory need a synchronization
to manage mappings in a sane way. Managing memory from two processes
is the same in principle, and the only difference is that another level
of synchronization is required.

I'm not fan and user for this feature (at least for now, nobody knows
what will be in the future), but design flaw argument does not look
fully valid.

Regards,
Kirill
Kirill Tkhai Jan. 20, 2020, 10:39 a.m. UTC | #10
On 20.01.2020 10:58, Michal Hocko wrote:
> On Sun 19-01-20 08:14:31, sspatil@google.com wrote:
>> On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
>>> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
>>>> On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
>>>>> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
>>>>>> On Thu 16-01-20 15:59:50, Minchan Kim wrote:
>>>>>>> There is usecase that System Management Software(SMS) want to give
>>>>>>> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
>>>>>>> in the case of Android, it is the ActivityManagerService.
>>>>>>>
>>>>>>> It's similar in spirit to madvise(MADV_WONTNEED), but the information
>>>>>>> required to make the reclaim decision is not known to the app. Instead,
>>>>>>> it is known to the centralized userspace daemon(ActivityManagerService),
>>>>>>> and that daemon must be able to initiate reclaim on its own without
>>>>>>> any app involvement.
>>>>>>>
>>>>>>> To solve the issue, this patch introduces new syscall process_madvise(2).
>>>>>>> It uses pidfd of an external processs to give the hint.
>>>>>>>
>>>>>>>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
>>>>>>> 			unsigned long flag);
>>>>>>>
>>>>>>> Since it could affect other process's address range, only privileged
>>>>>>> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
>>>>>>> gives it the right to ptrace the process could use it successfully.
>>>>>>> The flag argument is reserved for future use if we need to extend the
>>>>>>> API.
>>>>>>>
>>>>>>> I think supporting all hints madvise has/will supported/support to
>>>>>>> process_madvise is rather risky. Because we are not sure all hints make
>>>>>>> sense from external process and implementation for the hint may rely on
>>>>>>> the caller being in the current context so it could be error-prone.
>>>>>>> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
>>>>>>>
>>>>>>> If someone want to add other hints, we could hear hear the usecase and
>>>>>>> review it for each hint. It's more safe for maintainace rather than
>>>>>>> introducing a buggy syscall but hard to fix it later.
>>>>>>
>>>>>> I have brought this up when we discussed this in the past but there is
>>>>>> no reflection on that here so let me bring that up again. 
>>>>>>
>>>>>> I believe that the interface has an inherent problem that it is racy.
>>>>>> The external entity needs to know the address space layout of the target
>>>>>> process to do anyhing useful on it. The address space is however under
>>>>>> the full control of the target process though and the external entity
>>>>>> has no means to find out that the layout has changed. So
>>>>>> time-to-check-time-to-act is an inherent problem.
>>>>>>
>>>>>> This is a serious design flaw and it should be explained why it doesn't
>>>>>> matter or how to use the interface properly to prevent that problem.
>>>>>
>>>>> I agree, it looks flawed.
>>>>>
>>>>> Also I don't see what System Management Software can generically do on
>>>>> sub-process level. I mean how can it decide which part of address space is
>>>>> less important than other.
>>>>>
>>>>> I see how a manager can indicate that this process (or a group of
>>>>> processes) is less important than other, but on per-addres-range basis?
>>>>
>>>> For example, memory ranges shared by several processes or critical for the
>>>> latency, we could avoid those ranges to be cold/pageout to prevent
>>>> unncecessary CPU burning/paging.
>>>
>>> Hmm.. I still don't see why any external entity has a better (or any)
>>> knowledge about the matter. The process has to do this, no?
>>
>> FWIW, I totally agree with the time-to-check-time-to-react problem. However,
>> I'd like to clarify the ActivityManager/SystemServer case (I'll call it
>> SystemServer from now on)
>>
>> For Android, every application (including the special SystemServer) are forked
>> from Zygote. The reason ofcourse is to share as many libraries and classes between
>> the two as possible to benefit from the preloading during boot.
>>
>> After applications start, (almost) all of the APIs  end up calling into this
>> SystemServer process over IPC (binder) and back to the application.
>>
>> In a fully running system, the SystemServer monitors every single process
>> periodically to calculate their PSS / RSS and also decides which process is
>> "important" to the user for interactivity.
>>
>> So, because of how these processes start _and_ the fact that the SystemServer
>> is looping to monitor each process, it does tend to *know* which address
>> range of the application is not used / useful.
>>
>> Besides, we can never rely on applications to clean things up themselves.
>> We've had the "hey app1, the system is low on memory, please trim your
>> memory usage down" notifications for a long time[1]. They rely on
>> applications honoring the broadcasts and very few do.
>>
>> So, if we want to avoid the inevitable killing of the application and
>> restarting it, some way to be able to tell the OS about unimportant memory in
>> these applications will be useful.
> 
> This is a useful information that should be a part of the changelog. I
> do see how the current form of the API might fit into Android model
> without many problems. But we are not designing an API for a single
> usecase, right? In a highly cooperative environments you can use ptrace
> code injection as mentioned by Kirill. Or is there any fundamental
> problem about that?

There could be only problems with multi-threads applications, which
poll the state of another threads (and exit with error, when they
found that someone is traced). But this may be workarounded by freezer
cgroup making all the thread group is frozen. It should guarantee
consistent state of the whole process after attaching.

Kirill
Michal Hocko Jan. 20, 2020, 11:27 a.m. UTC | #11
On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> On 17.01.2020 14:52, Michal Hocko wrote:
> > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> >> There is usecase that System Management Software(SMS) want to give
> >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> >> in the case of Android, it is the ActivityManagerService.
> >>
> >> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> >> required to make the reclaim decision is not known to the app. Instead,
> >> it is known to the centralized userspace daemon(ActivityManagerService),
> >> and that daemon must be able to initiate reclaim on its own without
> >> any app involvement.
> >>
> >> To solve the issue, this patch introduces new syscall process_madvise(2).
> >> It uses pidfd of an external processs to give the hint.
> >>
> >>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> >> 			unsigned long flag);
> >>
> >> Since it could affect other process's address range, only privileged
> >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> >> gives it the right to ptrace the process could use it successfully.
> >> The flag argument is reserved for future use if we need to extend the
> >> API.
> >>
> >> I think supporting all hints madvise has/will supported/support to
> >> process_madvise is rather risky. Because we are not sure all hints make
> >> sense from external process and implementation for the hint may rely on
> >> the caller being in the current context so it could be error-prone.
> >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> >>
> >> If someone want to add other hints, we could hear hear the usecase and
> >> review it for each hint. It's more safe for maintainace rather than
> >> introducing a buggy syscall but hard to fix it later.
> > 
> > I have brought this up when we discussed this in the past but there is
> > no reflection on that here so let me bring that up again. 
> > 
> > I believe that the interface has an inherent problem that it is racy.
> > The external entity needs to know the address space layout of the target
> > process to do anyhing useful on it. The address space is however under
> > the full control of the target process though and the external entity
> > has no means to find out that the layout has changed. So
> > time-to-check-time-to-act is an inherent problem.
> > 
> > This is a serious design flaw and it should be explained why it doesn't
> > matter or how to use the interface properly to prevent that problem.
> 
> Really, any address space manipulation, where more than one process is
> involved, is racy.

They are, indeed. But that is not the point I wanted to make.

> Even two threads on common memory need a synchronization
> to manage mappings in a sane way. Managing memory from two processes
> is the same in principle, and the only difference is that another level
> of synchronization is required.

Well, not really. The operation might simply attempt to perform an
operation on a specific memory area and get a failure if it doesn't
reference the same object anymore. What I think we need is some form of
a handle to operate on. In the past we have discussed several
directions. I was proposing /proc/self/map_anon/ (analogous to
map_files) where you could inspect anonymous memory and get a file
handle for it. madvise would then operate on the fd and then there
shouldn't be a real problem to revalidate that the object is still
valid. But there was no general enthusiasm about that approach. There
are likely some land mines on the way.

There was another approach mentioned by Minchan in other email in this
thread.

What I want to say is that I believe a remove madvise can have a
sensible semantic even without a strong synchronization between the
monitor and the target task. We just have to make sure that the monitor
never operates on a different object then it believes it acts on.

On the other hand, there will never be any way to make this interface
reasonable for destructive operations though because the content of the
memory needs a strong synchronization IMHO.
Kirill A. Shutemov Jan. 20, 2020, 12:39 p.m. UTC | #12
On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> > On 17.01.2020 14:52, Michal Hocko wrote:
> > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > >> There is usecase that System Management Software(SMS) want to give
> > >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > >> in the case of Android, it is the ActivityManagerService.
> > >>
> > >> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > >> required to make the reclaim decision is not known to the app. Instead,
> > >> it is known to the centralized userspace daemon(ActivityManagerService),
> > >> and that daemon must be able to initiate reclaim on its own without
> > >> any app involvement.
> > >>
> > >> To solve the issue, this patch introduces new syscall process_madvise(2).
> > >> It uses pidfd of an external processs to give the hint.
> > >>
> > >>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > >> 			unsigned long flag);
> > >>
> > >> Since it could affect other process's address range, only privileged
> > >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > >> gives it the right to ptrace the process could use it successfully.
> > >> The flag argument is reserved for future use if we need to extend the
> > >> API.
> > >>
> > >> I think supporting all hints madvise has/will supported/support to
> > >> process_madvise is rather risky. Because we are not sure all hints make
> > >> sense from external process and implementation for the hint may rely on
> > >> the caller being in the current context so it could be error-prone.
> > >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > >>
> > >> If someone want to add other hints, we could hear hear the usecase and
> > >> review it for each hint. It's more safe for maintainace rather than
> > >> introducing a buggy syscall but hard to fix it later.
> > > 
> > > I have brought this up when we discussed this in the past but there is
> > > no reflection on that here so let me bring that up again. 
> > > 
> > > I believe that the interface has an inherent problem that it is racy.
> > > The external entity needs to know the address space layout of the target
> > > process to do anyhing useful on it. The address space is however under
> > > the full control of the target process though and the external entity
> > > has no means to find out that the layout has changed. So
> > > time-to-check-time-to-act is an inherent problem.
> > > 
> > > This is a serious design flaw and it should be explained why it doesn't
> > > matter or how to use the interface properly to prevent that problem.
> > 
> > Really, any address space manipulation, where more than one process is
> > involved, is racy.
> 
> They are, indeed. But that is not the point I wanted to make.
> 
> > Even two threads on common memory need a synchronization
> > to manage mappings in a sane way. Managing memory from two processes
> > is the same in principle, and the only difference is that another level
> > of synchronization is required.
> 
> Well, not really. The operation might simply attempt to perform an
> operation on a specific memory area and get a failure if it doesn't
> reference the same object anymore. What I think we need is some form of
> a handle to operate on. In the past we have discussed several
> directions. I was proposing /proc/self/map_anon/ (analogous to
> map_files) where you could inspect anonymous memory and get a file
> handle for it. madvise would then operate on the fd and then there
> shouldn't be a real problem to revalidate that the object is still
> valid. But there was no general enthusiasm about that approach. There
> are likely some land mines on the way.

Converting anon memory to file-backed is bad idea and going to backfire.
It will break many things around rmap (for THP in particular). We have
assumption that an anon page cannot be mapped twice in the same process.
Breaking rules around memory inheritance is not helpful too.

It is a major re-design of the subsystem and I don't see any real need for
this.
Michal Hocko Jan. 20, 2020, 1:24 p.m. UTC | #13
On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
[...]
> > > Even two threads on common memory need a synchronization
> > > to manage mappings in a sane way. Managing memory from two processes
> > > is the same in principle, and the only difference is that another level
> > > of synchronization is required.
> > 
> > Well, not really. The operation might simply attempt to perform an
> > operation on a specific memory area and get a failure if it doesn't
> > reference the same object anymore. What I think we need is some form of
> > a handle to operate on. In the past we have discussed several
> > directions. I was proposing /proc/self/map_anon/ (analogous to
> > map_files) where you could inspect anonymous memory and get a file
> > handle for it. madvise would then operate on the fd and then there
> > shouldn't be a real problem to revalidate that the object is still
> > valid. But there was no general enthusiasm about that approach. There
> > are likely some land mines on the way.
> 
> Converting anon memory to file-backed is bad idea and going to backfire.

I didn't mean to convert. I meant to expose that information via proc
the same way we do for file backed mappings. That shouldn't really
require to re-design the way how anonymous vma work IMO. But I haven't
tried that so there might be many gotchas there.

There are obvious things to think about though. Such fd cannot be sent
to other processes (SCM stuff), mmap of the file would have to be
disallowed and many others I am not aware of. I am not even pushing this
direction because I am not convinced about how viable it is myself. But
it would sound like a nice extension of the existing mechanism we have
and a file based madvise sounds attractive to me as well because we
already have that.
Kirill A. Shutemov Jan. 20, 2020, 2:21 p.m. UTC | #14
On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> [...]
> > > > Even two threads on common memory need a synchronization
> > > > to manage mappings in a sane way. Managing memory from two processes
> > > > is the same in principle, and the only difference is that another level
> > > > of synchronization is required.
> > > 
> > > Well, not really. The operation might simply attempt to perform an
> > > operation on a specific memory area and get a failure if it doesn't
> > > reference the same object anymore. What I think we need is some form of
> > > a handle to operate on. In the past we have discussed several
> > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > map_files) where you could inspect anonymous memory and get a file
> > > handle for it. madvise would then operate on the fd and then there
> > > shouldn't be a real problem to revalidate that the object is still
> > > valid. But there was no general enthusiasm about that approach. There
> > > are likely some land mines on the way.
> > 
> > Converting anon memory to file-backed is bad idea and going to backfire.
> 
> I didn't mean to convert. I meant to expose that information via proc
> the same way we do for file backed mappings. That shouldn't really
> require to re-design the way how anonymous vma work IMO. But I haven't
> tried that so there might be many gotchas there.
> 
> There are obvious things to think about though. Such fd cannot be sent
> to other processes (SCM stuff), mmap of the file would have to be
> disallowed and many others I am not aware of. I am not even pushing this
> direction because I am not convinced about how viable it is myself. But
> it would sound like a nice extension of the existing mechanism we have
> and a file based madvise sounds attractive to me as well because we
> already have that.

If the fd cannot be passed around or mmaped what does it represent?
And how is it different from plain address?
Michal Hocko Jan. 20, 2020, 3:44 p.m. UTC | #15
On Mon 20-01-20 17:21:32, Kirill A. Shutemov wrote:
> On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> > On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> > [...]
> > > > > Even two threads on common memory need a synchronization
> > > > > to manage mappings in a sane way. Managing memory from two processes
> > > > > is the same in principle, and the only difference is that another level
> > > > > of synchronization is required.
> > > > 
> > > > Well, not really. The operation might simply attempt to perform an
> > > > operation on a specific memory area and get a failure if it doesn't
> > > > reference the same object anymore. What I think we need is some form of
> > > > a handle to operate on. In the past we have discussed several
> > > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > > map_files) where you could inspect anonymous memory and get a file
> > > > handle for it. madvise would then operate on the fd and then there
> > > > shouldn't be a real problem to revalidate that the object is still
> > > > valid. But there was no general enthusiasm about that approach. There
> > > > are likely some land mines on the way.
> > > 
> > > Converting anon memory to file-backed is bad idea and going to backfire.
> > 
> > I didn't mean to convert. I meant to expose that information via proc
> > the same way we do for file backed mappings. That shouldn't really
> > require to re-design the way how anonymous vma work IMO. But I haven't
> > tried that so there might be many gotchas there.
> > 
> > There are obvious things to think about though. Such fd cannot be sent
> > to other processes (SCM stuff), mmap of the file would have to be
> > disallowed and many others I am not aware of. I am not even pushing this
> > direction because I am not convinced about how viable it is myself. But
> > it would sound like a nice extension of the existing mechanism we have
> > and a file based madvise sounds attractive to me as well because we
> > already have that.
> 
> If the fd cannot be passed around or mmaped what does it represent?

Because it is a cookie maintained by the kernel.

> And how is it different from plain address?

Kernel can revalidate that the given fd is denoting the vma it was
created for and simply fail with ENOENT or whatever suits if somebody
did unmap and mmap to the same address.
Minchan Kim Jan. 21, 2020, 6:11 p.m. UTC | #16
On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > There is usecase that System Management Software(SMS) want to give
> > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > in the case of Android, it is the ActivityManagerService.
> > > > > 
> > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > any app involvement.
> > > > > 
> > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > It uses pidfd of an external processs to give the hint.
> > > > > 
> > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > 			unsigned long flag);
> > > > > 
> > > > > Since it could affect other process's address range, only privileged
> > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > gives it the right to ptrace the process could use it successfully.
> > > > > The flag argument is reserved for future use if we need to extend the
> > > > > API.
> > > > > 
> > > > > I think supporting all hints madvise has/will supported/support to
> > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > sense from external process and implementation for the hint may rely on
> > > > > the caller being in the current context so it could be error-prone.
> > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > 
> > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > introducing a buggy syscall but hard to fix it later.
> > > > 
> > > > I have brought this up when we discussed this in the past but there is
> > > > no reflection on that here so let me bring that up again. 
> > > > 
> > > > I believe that the interface has an inherent problem that it is racy.
> > > > The external entity needs to know the address space layout of the target
> > > > process to do anyhing useful on it. The address space is however under
> > > > the full control of the target process though and the external entity
> > > > has no means to find out that the layout has changed. So
> > > > time-to-check-time-to-act is an inherent problem.
> > > > 
> > > > This is a serious design flaw and it should be explained why it doesn't
> > > > matter or how to use the interface properly to prevent that problem.
> > > 
> > > I agree, it looks flawed.
> > > 
> > > Also I don't see what System Management Software can generically do on
> > > sub-process level. I mean how can it decide which part of address space is
> > > less important than other.
> > > 
> > > I see how a manager can indicate that this process (or a group of
> > > processes) is less important than other, but on per-addres-range basis?
> > 
> > For example, memory ranges shared by several processes or critical for the
> > latency, we could avoid those ranges to be cold/pageout to prevent
> > unncecessary CPU burning/paging.
> 
> Hmm.. I still don't see why any external entity has a better (or any)
> knowledge about the matter. The process has to do this, no?

I think Sandeep already gave enough information in other thread.

> 
> > I also think people don't want to give an KSM hint to non-mergeable area.
> 
> And how the manager knows which data is mergable?

Oleksandr, could you say your thought why you need address range based
API?

> 
> If you are intimate enough with the process' internal state feel free to
> inject syscall into the process with ptrace. Why bother with half-measures?

Concern is we want to act the hint in caller's context, not calle because
calle is usually very limited in cpuset/cgroups or even freezed state so
they couldn't act by themselves quick enough, which makes many problems.
One of efforts to solve the issue was "Expedited memory reclaim"

	https://lwn.net/Articles/785709/

That could be also a good candidate for process_madvise API.
Minchan Kim Jan. 21, 2020, 6:32 p.m. UTC | #17
On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> On Sun 19-01-20 08:14:31, sspatil@google.com wrote:
> > On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote:
> > > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote:
> > > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote:
> > > > > > > There is usecase that System Management Software(SMS) want to give
> > > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and
> > > > > > > in the case of Android, it is the ActivityManagerService.
> > > > > > > 
> > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > > > > required to make the reclaim decision is not known to the app. Instead,
> > > > > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > > > > and that daemon must be able to initiate reclaim on its own without
> > > > > > > any app involvement.
> > > > > > > 
> > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2).
> > > > > > > It uses pidfd of an external processs to give the hint.
> > > > > > > 
> > > > > > >  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> > > > > > > 			unsigned long flag);
> > > > > > > 
> > > > > > > Since it could affect other process's address range, only privileged
> > > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > > > > > gives it the right to ptrace the process could use it successfully.
> > > > > > > The flag argument is reserved for future use if we need to extend the
> > > > > > > API.
> > > > > > > 
> > > > > > > I think supporting all hints madvise has/will supported/support to
> > > > > > > process_madvise is rather risky. Because we are not sure all hints make
> > > > > > > sense from external process and implementation for the hint may rely on
> > > > > > > the caller being in the current context so it could be error-prone.
> > > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
> > > > > > > 
> > > > > > > If someone want to add other hints, we could hear hear the usecase and
> > > > > > > review it for each hint. It's more safe for maintainace rather than
> > > > > > > introducing a buggy syscall but hard to fix it later.
> > > > > > 
> > > > > > I have brought this up when we discussed this in the past but there is
> > > > > > no reflection on that here so let me bring that up again. 
> > > > > > 
> > > > > > I believe that the interface has an inherent problem that it is racy.
> > > > > > The external entity needs to know the address space layout of the target
> > > > > > process to do anyhing useful on it. The address space is however under
> > > > > > the full control of the target process though and the external entity
> > > > > > has no means to find out that the layout has changed. So
> > > > > > time-to-check-time-to-act is an inherent problem.
> > > > > > 
> > > > > > This is a serious design flaw and it should be explained why it doesn't
> > > > > > matter or how to use the interface properly to prevent that problem.
> > > > > 
> > > > > I agree, it looks flawed.
> > > > > 
> > > > > Also I don't see what System Management Software can generically do on
> > > > > sub-process level. I mean how can it decide which part of address space is
> > > > > less important than other.
> > > > > 
> > > > > I see how a manager can indicate that this process (or a group of
> > > > > processes) is less important than other, but on per-addres-range basis?
> > > > 
> > > > For example, memory ranges shared by several processes or critical for the
> > > > latency, we could avoid those ranges to be cold/pageout to prevent
> > > > unncecessary CPU burning/paging.
> > > 
> > > Hmm.. I still don't see why any external entity has a better (or any)
> > > knowledge about the matter. The process has to do this, no?
> > 
> > FWIW, I totally agree with the time-to-check-time-to-react problem. However,
> > I'd like to clarify the ActivityManager/SystemServer case (I'll call it
> > SystemServer from now on)
> > 
> > For Android, every application (including the special SystemServer) are forked
> > from Zygote. The reason ofcourse is to share as many libraries and classes between
> > the two as possible to benefit from the preloading during boot.
> > 
> > After applications start, (almost) all of the APIs  end up calling into this
> > SystemServer process over IPC (binder) and back to the application.
> > 
> > In a fully running system, the SystemServer monitors every single process
> > periodically to calculate their PSS / RSS and also decides which process is
> > "important" to the user for interactivity.
> > 
> > So, because of how these processes start _and_ the fact that the SystemServer
> > is looping to monitor each process, it does tend to *know* which address
> > range of the application is not used / useful.
> > 
> > Besides, we can never rely on applications to clean things up themselves.
> > We've had the "hey app1, the system is low on memory, please trim your
> > memory usage down" notifications for a long time[1]. They rely on
> > applications honoring the broadcasts and very few do.
> > 
> > So, if we want to avoid the inevitable killing of the application and
> > restarting it, some way to be able to tell the OS about unimportant memory in
> > these applications will be useful.

Thanks for adding more useful description, Sandeep.

> 
> This is a useful information that should be a part of the changelog. I

I will incldue it in next respin.

> do see how the current form of the API might fit into Android model
> without many problems. But we are not designing an API for a single
> usecase, right? In a highly cooperative environments you can use ptrace
> code injection as mentioned by Kirill. Or is there any fundamental
> problem about that?

I replied it at Kirill's thread so let's discuss there.

> 
> The interface really has to be robust to future potential usecases.

I do understand your concern but for me, it's chicken and egg problem.
We usually do best effort to make something perfect as far as possible
but we also don't do over-engineering without real usecase from the
beginning.

I already told you how we could synchronize among processes and potential
way to be extended Daniel suggested(That's why current API has extra field
for the cookie) even though we don't need it right now.

If you want to suggest the other way, please explain why your idea is
better and why we need it at this moment. I don't think that is a
blocker for the progress of this API since we already have several ways
to synchronize processes.
Minchan Kim Jan. 21, 2020, 6:43 p.m. UTC | #18
On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote:
> On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote:
> > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote:
> > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote:
> [...]
> > > > Even two threads on common memory need a synchronization
> > > > to manage mappings in a sane way. Managing memory from two processes
> > > > is the same in principle, and the only difference is that another level
> > > > of synchronization is required.
> > > 
> > > Well, not really. The operation might simply attempt to perform an
> > > operation on a specific memory area and get a failure if it doesn't
> > > reference the same object anymore. What I think we need is some form of
> > > a handle to operate on. In the past we have discussed several
> > > directions. I was proposing /proc/self/map_anon/ (analogous to
> > > map_files) where you could inspect anonymous memory and get a file
> > > handle for it. madvise would then operate on the fd and then there
> > > shouldn't be a real problem to revalidate that the object is still
> > > valid. But there was no general enthusiasm about that approach. There
> > > are likely some land mines on the way.
> > 
> > Converting anon memory to file-backed is bad idea and going to backfire.
> 
> I didn't mean to convert. I meant to expose that information via proc
> the same way we do for file backed mappings. That shouldn't really
> require to re-design the way how anonymous vma work IMO. But I haven't
> tried that so there might be many gotchas there.
> 
> There are obvious things to think about though. Such fd cannot be sent
> to other processes (SCM stuff), mmap of the file would have to be
> disallowed and many others I am not aware of. I am not even pushing this
> direction because I am not convinced about how viable it is myself. But
> it would sound like a nice extension of the existing mechanism we have
> and a file based madvise sounds attractive to me as well because we
> already have that.

I am not a fan of fd based approach but I already reserved last argument
of the API as extendable field so we could use the field as "fd" when we
really need that kinds of fine-grained synchronization model if it's not
enough with SGISTOP, freezer and so.
Michal Hocko Jan. 22, 2020, 8:28 a.m. UTC | #19
On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
[...]
> > The interface really has to be robust to future potential usecases.
> 
> I do understand your concern but for me, it's chicken and egg problem.
> We usually do best effort to make something perfect as far as possible
> but we also don't do over-engineering without real usecase from the
> beginning.
> 
> I already told you how we could synchronize among processes and potential
> way to be extended Daniel suggested(That's why current API has extra field
> for the cookie) even though we don't need it right now.

If you can synchronize with the target task then you do not need a
remote interface. Just use ptrace and you are done with it.

> If you want to suggest the other way, please explain why your idea is
> better and why we need it at this moment.

I believe I have explained my concerns and why they matter. All you are
saying is that you do not care because your particular usecase doesn't
care. And that is a first signal of a future disaster when we end up
with a broken and unfixable interface we have to maintain for ever.

I will not go as far as to nack this but you should seriously think
about other potential usecases and how they would work and what we are
going to do when a first non-cooperative userspace memory management
usecase materializes.
Michal Hocko Jan. 22, 2020, 10:02 a.m. UTC | #20
On Wed 22-01-20 10:36:24, SeongJae Park wrote:
> On Wed, 22 Jan 2020 09:28:53 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> > [...]
> > > > The interface really has to be robust to future potential usecases.
> > > 
> > > I do understand your concern but for me, it's chicken and egg problem.
> > > We usually do best effort to make something perfect as far as possible
> > > but we also don't do over-engineering without real usecase from the
> > > beginning.
> > > 
> > > I already told you how we could synchronize among processes and potential
> > > way to be extended Daniel suggested(That's why current API has extra field
> > > for the cookie) even though we don't need it right now.
> > 
> > If you can synchronize with the target task then you do not need a
> > remote interface. Just use ptrace and you are done with it.
> > 
> > > If you want to suggest the other way, please explain why your idea is
> > > better and why we need it at this moment.
> > 
> > I believe I have explained my concerns and why they matter. All you are
> > saying is that you do not care because your particular usecase doesn't
> > care. And that is a first signal of a future disaster when we end up
> > with a broken and unfixable interface we have to maintain for ever.
> > 
> > I will not go as far as to nack this but you should seriously think
> > about other potential usecases and how they would work and what we are
> > going to do when a first non-cooperative userspace memory management
> > usecase materializes.
> 
> Beside of the specific environment of Android, I think there are many ways to
> know the address space layout and access patterns of other processes.  The
> idle_page_tracking might be an example that widelay available.
> 
> Of course, the information might not strictly correct due to the timing issue,
> but could be still worth to be used under some extreme situations, such as
> memory pressure or fragmentation.  For the same reason, ptrace() would not be
> sufficient, as we have no perfect control, but only some level of control that
> would be useful under specific situations.

I am not sure I see your point. I am talking about races where a remote
task is operating on a completely different object because the one it
checked for has been unmapped and new one mapped over it. Memory
pressure or a fragmentation will not change the object itself. Sure the
memory might be reclaimed but that should be completely OK unless I am
missing something.

> I assume the users of this systemcall would understand the tradeoff and make
> decisions.

I disagree. My experience tells me that users tend to squeeze the
maximum and beyond and hope they get what they want.

> Also, as the users already have the right to do the tradeoff, I
> think it's fair.  In other words, I think the caller has both the power and the
> responsibility to deal with the time-to-check-time-to-react problem.
> 
> Nonetheless, I also agree this is important concern and the patch would be
> better if it adds more detailed documentation regarding this issue.

If there is _really_ a strong consensus that the racy interface is
reasonable then it absolutely has to be described with a clearly state
that those races might result in hard to predict behavior unless all
tasks sharing the address space are blocked between the check and the
madvise call.
Oleksandr Natalenko Jan. 22, 2020, 10:44 a.m. UTC | #21
Hello.

On Tue, Jan 21, 2020 at 10:11:13AM -0800, Minchan Kim wrote:
> > > I also think people don't want to give an KSM hint to non-mergeable area.
> > 
> > And how the manager knows which data is mergable?
> 
> Oleksandr, could you say your thought why you need address range based
> API?

It seems I've overlooked an important piece of this submission: one
cannot apply the hint to all the anonymous mapping regardless of address
range. For KSM I'd rather either have a possibility to hint all the
anonymous mappings, or, as it was suggested previously, be able to iterate
over existing mappings using some (fd-based?) API.
Minchan Kim Jan. 23, 2020, 1:41 a.m. UTC | #22
On Wed, Jan 22, 2020 at 09:28:53AM +0100, Michal Hocko wrote:
> On Tue 21-01-20 10:32:12, Minchan Kim wrote:
> > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote:
> [...]
> > > The interface really has to be robust to future potential usecases.
> > 
> > I do understand your concern but for me, it's chicken and egg problem.
> > We usually do best effort to make something perfect as far as possible
> > but we also don't do over-engineering without real usecase from the
> > beginning.
> > 
> > I already told you how we could synchronize among processes and potential
> > way to be extended Daniel suggested(That's why current API has extra field
> > for the cookie) even though we don't need it right now.
> 
> If you can synchronize with the target task then you do not need a
> remote interface. Just use ptrace and you are done with it.

As I mentioned in other reply, we want to do in caller's context, not
callee's one because target processes stay in little cores, which are
much slower than the core the manager lives in.
The other reason is the apps are already freezed so they couldn't response
by ptrace.

> 
> > If you want to suggest the other way, please explain why your idea is
> > better and why we need it at this moment.
> 
> I believe I have explained my concerns and why they matter. All you are
> saying is that you do not care because your particular usecase doesn't
> care. And that is a first signal of a future disaster when we end up
> with a broken and unfixable interface we have to maintain for ever.

We already had suggested cookie and fd based approaches so I reserved a
argument for that to make the API extendable.
Thing is currently it's a just optimization idea since we have several
ways to sychronize processes(e.g., signal, cgroup freezer, userfaultfd
and so). It's a just matter of granularity, not necessary one we should
introduce it from the beginnig.
If someone needs that kinds of fine-grained consistency, we could extend
it then. And that's the usual way we make progress when we couldn't
know the future.

What do you want to see further?
Minchan Kim Jan. 23, 2020, 1:43 a.m. UTC | #23
On Wed, Jan 22, 2020 at 11:44:24AM +0100, Oleksandr Natalenko wrote:
> Hello.
> 
> On Tue, Jan 21, 2020 at 10:11:13AM -0800, Minchan Kim wrote:
> > > > I also think people don't want to give an KSM hint to non-mergeable area.
> > > 
> > > And how the manager knows which data is mergable?
> > 
> > Oleksandr, could you say your thought why you need address range based
> > API?
> 
> It seems I've overlooked an important piece of this submission: one
> cannot apply the hint to all the anonymous mapping regardless of address
> range. For KSM I'd rather either have a possibility to hint all the
> anonymous mappings, or, as it was suggested previously, be able to iterate
> over existing mappings using some (fd-based?) API.

Thing is how you could identify a certan range is better for KSM than
others from external process?
Oleksandr Natalenko Jan. 23, 2020, 7:29 a.m. UTC | #24
On Wed, Jan 22, 2020 at 05:43:16PM -0800, Minchan Kim wrote:
> > It seems I've overlooked an important piece of this submission: one
> > cannot apply the hint to all the anonymous mapping regardless of address
> > range. For KSM I'd rather either have a possibility to hint all the
> > anonymous mappings, or, as it was suggested previously, be able to iterate
> > over existing mappings using some (fd-based?) API.
> 
> Thing is how you could identify a certan range is better for KSM than
> others from external process?

I think the info like this is kinda available via /proc/pid/smaps. It
lists the ranges and the vmflags. But using it raises 2 concerns: one is
the absence of guarantee the mappings won't change after smaps is read
and the second one is that there's no separate vmflag for marking a vma
as non-meregable (and IIRC from previous attempts on addressing this,
we've already exhausted all the flags on 32-bit arches, so it is not
something that can be trivially addressed).
Michal Hocko Jan. 23, 2020, 9:13 a.m. UTC | #25
On Wed 22-01-20 17:41:31, Minchan Kim wrote:
[...]
> What do you want to see further?

Either a consensus that it is sufficient to have an inherently racy
interface that requires some form of external sychronization or
a robust interface that can cope with races in a sensible way.
And no, having a flag for future extension is definitely not the
way how a new API should be added. Seriously!
diff mbox series

Patch

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index e56950f23b49..776c61803315 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -477,3 +477,4 @@ 
 # 545 reserved for clone3
 546	common	watch_devices			sys_watch_devices
 547	common	openat2				sys_openat2
+548	common	process_madvise			sys_process_madvise
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 7fb2f4d59210..a43381542276 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -451,3 +451,4 @@ 
 435	common	clone3				sys_clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 8aa00ccb0b96..b722e47377a5 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@ 
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		438
+#define __NR_compat_syscalls		439
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 31f0ce25719e..e3643d7fecc3 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -883,6 +883,8 @@  __SYSCALL(__NR_clone3, sys_clone3)
 __SYSCALL(__NR_watch_devices, sys_watch_devices)
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
+#define __NR_process_madvise 438
+__SYSCALL(__NR_process_madvise, process_madvise)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index b9aa59931905..c156abc9a298 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -358,3 +358,4 @@ 
 # 435 reserved for clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 868c1ef89d35..5b6034b6650f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@ 
 # 435 reserved for clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 544b4cef18b3..4bef584af09c 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@ 
 435	common	clone3				sys_clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 05e8aee5dae7..7061b2103438 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -376,3 +376,4 @@ 
 435	n32	clone3				__sys_clone3
 436	n32	watch_devices			sys_watch_devices
 437	n32	openat2				sys_openat2
+438	n32	process_madivse			sys_process_madvise
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 24d6c01328fb..84042d57fbfb 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -352,3 +352,4 @@ 
 435	n64	clone3				__sys_clone3
 436	n64	watch_devices			sys_watch_devices
 437	n64	openat2				sys_openat2
+438	n64	process_madvise			sys_process_madvise
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 4b5f77a4e1a2..5bfd359c7e6f 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@ 
 435	common	clone3				sys_clone3_wrapper
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 9716dc85a517..ffa0e679aca0 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -519,3 +519,4 @@ 
 435	nospu	clone3				ppc_clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 7da330f8b03e..c301717216ca 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@ 
 435  common	clone3			sys_clone3			sys_clone3
 436  common	watch_devices		sys_watch_devices		sys_watch_devices
 437  common	openat2			sys_openat2			sys_openat2
+438  common	process_madvise		sys_process_madvise		sys_process_madvise
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index bb7e68e25337..b8f15701f69f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@ 
 # 435 reserved for clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 646a1fad7218..7ea95f37b222 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -483,3 +483,4 @@ 
 # 435 reserved for clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2			sys_openat2
+438	common	process_madvise		sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 57c53acee290..76a2c266fe7e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,3 +442,4 @@ 
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
 436	i386	watch_devices		sys_watch_devices		__ia32_sys_watch_devices
 437	i386	openat2			sys_openat2			__ia32_sys_openat2
+438	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1dd8d21f6500..b697cd8620cb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@ 
 435	common	clone3			__x64_sys_clone3/ptregs
 436	common	watch_devices		__x64_sys_watch_devices
 437	common	openat2			__x64_sys_openat2
+438	common	process_madvise		__x64_sys_process_madvise
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 0f48ab7bd75b..2e9813ecfd7d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -408,3 +408,4 @@ 
 435	common	clone3				sys_clone3
 436	common	watch_devices			sys_watch_devices
 437	common	openat2				sys_openat2
+438	common	process_madvise			sys_process_madvise
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 433c8c85636e..1b58a11ff49f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -877,6 +877,8 @@  asmlinkage long sys_munlockall(void);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
+			size_t len, int behavior, unsigned long flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 33f3856a9c3c..4a49fbaea013 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -856,8 +856,11 @@  __SYSCALL(__NR_watch_devices, sys_watch_devices)
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
 
+#define __NR_process_madvise 438
+__SYSCALL(__NR_process_madvise, sys_process_madvise)
+
 #undef __NR_syscalls
-#define __NR_syscalls 438
+#define __NR_syscalls 439
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0e9b275260f8..10ce5eac8b4b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -281,6 +281,7 @@  COND_SYSCALL(mlockall);
 COND_SYSCALL(munlockall);
 COND_SYSCALL(mincore);
 COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 0c901de531e4..59b5cc99ef61 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -17,6 +17,7 @@ 
 #include <linux/falloc.h>
 #include <linux/fadvise.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/ksm.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -993,6 +994,18 @@  madvise_behavior_valid(int behavior)
 	}
 }
 
+static bool
+process_madvise_behavior_valid(int behavior)
+{
+	switch (behavior) {
+	case MADV_COLD:
+	case MADV_PAGEOUT:
+		return true;
+	default:
+		return false;
+	}
+}
+
 /*
  * madvise_common - request behavior hint to address range of the target process
  *
@@ -1151,6 +1164,10 @@  static int madvise_common(struct task_struct *task, struct mm_struct *mm,
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *  MADV_COLD - the application uses the memory less so the kernel can deactivate
+ *  		the memory to evict them quickly when the memory pressure happen.
+ *  MADV_PAGEOUT - the application uses the memroy very rarely so kernel can
+ *  		page out the memory instantly.
  *
  * return values:
  *  zero    - success
@@ -1169,3 +1186,49 @@  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
 	return madvise_common(current, current->mm, start, len_in, behavior);
 }
+
+SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
+		size_t, len_in, int, behavior, unsigned long, flags)
+{
+	int ret;
+	struct fd f;
+	struct pid *pid;
+	struct task_struct *task;
+	struct mm_struct *mm;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	if (!process_madvise_behavior_valid(behavior))
+		return -EINVAL;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_pid(f.file);
+	if (IS_ERR(pid)) {
+		ret = PTR_ERR(pid);
+		goto fdput;
+	}
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!task) {
+		ret = -ESRCH;
+		goto fdput;
+	}
+
+	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+	if (IS_ERR_OR_NULL(mm)) {
+		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+		goto release_task;
+	}
+
+	ret = madvise_common(task, mm, start, len_in, behavior);
+	mmput(mm);
+release_task:
+	put_task_struct(task);
+fdput:
+	fdput(f);
+	return ret;
+}