Message ID | 20200116235953.163318-3-minchan@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | introduce memory hinting API for external process | expand |
On Thu 16-01-20 15:59:50, Minchan Kim wrote: > There is usecase that System Management Software(SMS) want to give > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > in the case of Android, it is the ActivityManagerService. > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > required to make the reclaim decision is not known to the app. Instead, > it is known to the centralized userspace daemon(ActivityManagerService), > and that daemon must be able to initiate reclaim on its own without > any app involvement. > > To solve the issue, this patch introduces new syscall process_madvise(2). > It uses pidfd of an external processs to give the hint. > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > unsigned long flag); > > Since it could affect other process's address range, only privileged > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > gives it the right to ptrace the process could use it successfully. > The flag argument is reserved for future use if we need to extend the > API. > > I think supporting all hints madvise has/will supported/support to > process_madvise is rather risky. Because we are not sure all hints make > sense from external process and implementation for the hint may rely on > the caller being in the current context so it could be error-prone. > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > If someone want to add other hints, we could hear hear the usecase and > review it for each hint. It's more safe for maintainace rather than > introducing a buggy syscall but hard to fix it later. I have brought this up when we discussed this in the past but there is no reflection on that here so let me bring that up again. I believe that the interface has an inherent problem that it is racy. The external entity needs to know the address space layout of the target process to do anyhing useful on it. The address space is however under the full control of the target process though and the external entity has no means to find out that the layout has changed. So time-to-check-time-to-act is an inherent problem. This is a serious design flaw and it should be explained why it doesn't matter or how to use the interface properly to prevent that problem.
On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > There is usecase that System Management Software(SMS) want to give > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > in the case of Android, it is the ActivityManagerService. > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > required to make the reclaim decision is not known to the app. Instead, > > it is known to the centralized userspace daemon(ActivityManagerService), > > and that daemon must be able to initiate reclaim on its own without > > any app involvement. > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > It uses pidfd of an external processs to give the hint. > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > unsigned long flag); > > > > Since it could affect other process's address range, only privileged > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > gives it the right to ptrace the process could use it successfully. > > The flag argument is reserved for future use if we need to extend the > > API. > > > > I think supporting all hints madvise has/will supported/support to > > process_madvise is rather risky. Because we are not sure all hints make > > sense from external process and implementation for the hint may rely on > > the caller being in the current context so it could be error-prone. > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > If someone want to add other hints, we could hear hear the usecase and > > review it for each hint. It's more safe for maintainace rather than > > introducing a buggy syscall but hard to fix it later. > > I have brought this up when we discussed this in the past but there is > no reflection on that here so let me bring that up again. > > I believe that the interface has an inherent problem that it is racy. > The external entity needs to know the address space layout of the target > process to do anyhing useful on it. The address space is however under > the full control of the target process though and the external entity > has no means to find out that the layout has changed. So > time-to-check-time-to-act is an inherent problem. > > This is a serious design flaw and it should be explained why it doesn't > matter or how to use the interface properly to prevent that problem. I agree, it looks flawed. Also I don't see what System Management Software can generically do on sub-process level. I mean how can it decide which part of address space is less important than other. I see how a manager can indicate that this process (or a group of processes) is less important than other, but on per-addres-range basis?
On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > There is usecase that System Management Software(SMS) want to give > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > in the case of Android, it is the ActivityManagerService. > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > required to make the reclaim decision is not known to the app. Instead, > > it is known to the centralized userspace daemon(ActivityManagerService), > > and that daemon must be able to initiate reclaim on its own without > > any app involvement. > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > It uses pidfd of an external processs to give the hint. > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > unsigned long flag); > > > > Since it could affect other process's address range, only privileged > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > gives it the right to ptrace the process could use it successfully. > > The flag argument is reserved for future use if we need to extend the > > API. > > > > I think supporting all hints madvise has/will supported/support to > > process_madvise is rather risky. Because we are not sure all hints make > > sense from external process and implementation for the hint may rely on > > the caller being in the current context so it could be error-prone. > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > If someone want to add other hints, we could hear hear the usecase and > > review it for each hint. It's more safe for maintainace rather than > > introducing a buggy syscall but hard to fix it later. > > I have brought this up when we discussed this in the past but there is > no reflection on that here so let me bring that up again. > > I believe that the interface has an inherent problem that it is racy. > The external entity needs to know the address space layout of the target > process to do anyhing useful on it. The address space is however under > the full control of the target process though and the external entity > has no means to find out that the layout has changed. So > time-to-check-time-to-act is an inherent problem. > > This is a serious design flaw and it should be explained why it doesn't > matter or how to use the interface properly to prevent that problem. Sorry for the missing that part. It's not a particular problem of this API because other APIs already have done with that(e.g., move_pages, process_vm_writev). Point is userspace has several ways for the control of target process like SIGSTOP, cgroup freezer or even no need to control since platform is already aware of that the process will never run until he grant it or it's resilient even though the race happens. In future, if we want to support more fine-grained consistency model like memory layout, we could provide some API to get cookie(e.g., seq count which is updated whenever vma of the process changes). And then we could feed the cookie to process_madvise's last argument so that it can fail if founds it's not matched. For that API, Daniel already posted RFC - process_getinfo[1]. https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote: > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > > There is usecase that System Management Software(SMS) want to give > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > > in the case of Android, it is the ActivityManagerService. > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > required to make the reclaim decision is not known to the app. Instead, > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > and that daemon must be able to initiate reclaim on its own without > > > any app involvement. > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > > It uses pidfd of an external processs to give the hint. > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > > unsigned long flag); > > > > > > Since it could affect other process's address range, only privileged > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > > gives it the right to ptrace the process could use it successfully. > > > The flag argument is reserved for future use if we need to extend the > > > API. > > > > > > I think supporting all hints madvise has/will supported/support to > > > process_madvise is rather risky. Because we are not sure all hints make > > > sense from external process and implementation for the hint may rely on > > > the caller being in the current context so it could be error-prone. > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > > > If someone want to add other hints, we could hear hear the usecase and > > > review it for each hint. It's more safe for maintainace rather than > > > introducing a buggy syscall but hard to fix it later. > > > > I have brought this up when we discussed this in the past but there is > > no reflection on that here so let me bring that up again. > > > > I believe that the interface has an inherent problem that it is racy. > > The external entity needs to know the address space layout of the target > > process to do anyhing useful on it. The address space is however under > > the full control of the target process though and the external entity > > has no means to find out that the layout has changed. So > > time-to-check-time-to-act is an inherent problem. > > > > This is a serious design flaw and it should be explained why it doesn't > > matter or how to use the interface properly to prevent that problem. > > I agree, it looks flawed. > > Also I don't see what System Management Software can generically do on > sub-process level. I mean how can it decide which part of address space is > less important than other. > > I see how a manager can indicate that this process (or a group of > processes) is less important than other, but on per-addres-range basis? For example, memory ranges shared by several processes or critical for the latency, we could avoid those ranges to be cold/pageout to prevent unncecessary CPU burning/paging. I also think people don't want to give an KSM hint to non-mergeable area.
On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote: > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote: > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > > > There is usecase that System Management Software(SMS) want to give > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > > > in the case of Android, it is the ActivityManagerService. > > > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > > required to make the reclaim decision is not known to the app. Instead, > > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > > and that daemon must be able to initiate reclaim on its own without > > > > any app involvement. > > > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > > > It uses pidfd of an external processs to give the hint. > > > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > > > unsigned long flag); > > > > > > > > Since it could affect other process's address range, only privileged > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > > > gives it the right to ptrace the process could use it successfully. > > > > The flag argument is reserved for future use if we need to extend the > > > > API. > > > > > > > > I think supporting all hints madvise has/will supported/support to > > > > process_madvise is rather risky. Because we are not sure all hints make > > > > sense from external process and implementation for the hint may rely on > > > > the caller being in the current context so it could be error-prone. > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > > > > > If someone want to add other hints, we could hear hear the usecase and > > > > review it for each hint. It's more safe for maintainace rather than > > > > introducing a buggy syscall but hard to fix it later. > > > > > > I have brought this up when we discussed this in the past but there is > > > no reflection on that here so let me bring that up again. > > > > > > I believe that the interface has an inherent problem that it is racy. > > > The external entity needs to know the address space layout of the target > > > process to do anyhing useful on it. The address space is however under > > > the full control of the target process though and the external entity > > > has no means to find out that the layout has changed. So > > > time-to-check-time-to-act is an inherent problem. > > > > > > This is a serious design flaw and it should be explained why it doesn't > > > matter or how to use the interface properly to prevent that problem. > > > > I agree, it looks flawed. > > > > Also I don't see what System Management Software can generically do on > > sub-process level. I mean how can it decide which part of address space is > > less important than other. > > > > I see how a manager can indicate that this process (or a group of > > processes) is less important than other, but on per-addres-range basis? > > For example, memory ranges shared by several processes or critical for the > latency, we could avoid those ranges to be cold/pageout to prevent > unncecessary CPU burning/paging. Hmm.. I still don't see why any external entity has a better (or any) knowledge about the matter. The process has to do this, no? > I also think people don't want to give an KSM hint to non-mergeable area. And how the manager knows which data is mergable? If you are intimate enough with the process' internal state feel free to inject syscall into the process with ptrace. Why bother with half-measures?
On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote: > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote: > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote: > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > > > > There is usecase that System Management Software(SMS) want to give > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > > > > in the case of Android, it is the ActivityManagerService. > > > > > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > > > required to make the reclaim decision is not known to the app. Instead, > > > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > > > and that daemon must be able to initiate reclaim on its own without > > > > > any app involvement. > > > > > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > > > > It uses pidfd of an external processs to give the hint. > > > > > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > > > > unsigned long flag); > > > > > > > > > > Since it could affect other process's address range, only privileged > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > > > > gives it the right to ptrace the process could use it successfully. > > > > > The flag argument is reserved for future use if we need to extend the > > > > > API. > > > > > > > > > > I think supporting all hints madvise has/will supported/support to > > > > > process_madvise is rather risky. Because we are not sure all hints make > > > > > sense from external process and implementation for the hint may rely on > > > > > the caller being in the current context so it could be error-prone. > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > > > > > > > If someone want to add other hints, we could hear hear the usecase and > > > > > review it for each hint. It's more safe for maintainace rather than > > > > > introducing a buggy syscall but hard to fix it later. > > > > > > > > I have brought this up when we discussed this in the past but there is > > > > no reflection on that here so let me bring that up again. > > > > > > > > I believe that the interface has an inherent problem that it is racy. > > > > The external entity needs to know the address space layout of the target > > > > process to do anyhing useful on it. The address space is however under > > > > the full control of the target process though and the external entity > > > > has no means to find out that the layout has changed. So > > > > time-to-check-time-to-act is an inherent problem. > > > > > > > > This is a serious design flaw and it should be explained why it doesn't > > > > matter or how to use the interface properly to prevent that problem. > > > > > > I agree, it looks flawed. > > > > > > Also I don't see what System Management Software can generically do on > > > sub-process level. I mean how can it decide which part of address space is > > > less important than other. > > > > > > I see how a manager can indicate that this process (or a group of > > > processes) is less important than other, but on per-addres-range basis? > > > > For example, memory ranges shared by several processes or critical for the > > latency, we could avoid those ranges to be cold/pageout to prevent > > unncecessary CPU burning/paging. > > Hmm.. I still don't see why any external entity has a better (or any) > knowledge about the matter. The process has to do this, no? FWIW, I totally agree with the time-to-check-time-to-react problem. However, I'd like to clarify the ActivityManager/SystemServer case (I'll call it SystemServer from now on) For Android, every application (including the special SystemServer) are forked from Zygote. The reason ofcourse is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp 1. https://developer.android.com/topic/performance/memory > > > I also think people don't want to give an KSM hint to non-mergeable area. > > And how the manager knows which data is mergable? > > If you are intimate enough with the process' internal state feel free to > inject syscall into the process with ptrace. Why bother with half-measures? > > -- > Kirill A. Shutemov
On Sun 19-01-20 08:14:31, sspatil@google.com wrote: > On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote: > > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote: > > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote: > > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > > > > > There is usecase that System Management Software(SMS) want to give > > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > > > > > in the case of Android, it is the ActivityManagerService. > > > > > > > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > > > > required to make the reclaim decision is not known to the app. Instead, > > > > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > > > > and that daemon must be able to initiate reclaim on its own without > > > > > > any app involvement. > > > > > > > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > > > > > It uses pidfd of an external processs to give the hint. > > > > > > > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > > > > > unsigned long flag); > > > > > > > > > > > > Since it could affect other process's address range, only privileged > > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > > > > > gives it the right to ptrace the process could use it successfully. > > > > > > The flag argument is reserved for future use if we need to extend the > > > > > > API. > > > > > > > > > > > > I think supporting all hints madvise has/will supported/support to > > > > > > process_madvise is rather risky. Because we are not sure all hints make > > > > > > sense from external process and implementation for the hint may rely on > > > > > > the caller being in the current context so it could be error-prone. > > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > > > > > > > > > If someone want to add other hints, we could hear hear the usecase and > > > > > > review it for each hint. It's more safe for maintainace rather than > > > > > > introducing a buggy syscall but hard to fix it later. > > > > > > > > > > I have brought this up when we discussed this in the past but there is > > > > > no reflection on that here so let me bring that up again. > > > > > > > > > > I believe that the interface has an inherent problem that it is racy. > > > > > The external entity needs to know the address space layout of the target > > > > > process to do anyhing useful on it. The address space is however under > > > > > the full control of the target process though and the external entity > > > > > has no means to find out that the layout has changed. So > > > > > time-to-check-time-to-act is an inherent problem. > > > > > > > > > > This is a serious design flaw and it should be explained why it doesn't > > > > > matter or how to use the interface properly to prevent that problem. > > > > > > > > I agree, it looks flawed. > > > > > > > > Also I don't see what System Management Software can generically do on > > > > sub-process level. I mean how can it decide which part of address space is > > > > less important than other. > > > > > > > > I see how a manager can indicate that this process (or a group of > > > > processes) is less important than other, but on per-addres-range basis? > > > > > > For example, memory ranges shared by several processes or critical for the > > > latency, we could avoid those ranges to be cold/pageout to prevent > > > unncecessary CPU burning/paging. > > > > Hmm.. I still don't see why any external entity has a better (or any) > > knowledge about the matter. The process has to do this, no? > > FWIW, I totally agree with the time-to-check-time-to-react problem. However, > I'd like to clarify the ActivityManager/SystemServer case (I'll call it > SystemServer from now on) > > For Android, every application (including the special SystemServer) are forked > from Zygote. The reason ofcourse is to share as many libraries and classes between > the two as possible to benefit from the preloading during boot. > > After applications start, (almost) all of the APIs end up calling into this > SystemServer process over IPC (binder) and back to the application. > > In a fully running system, the SystemServer monitors every single process > periodically to calculate their PSS / RSS and also decides which process is > "important" to the user for interactivity. > > So, because of how these processes start _and_ the fact that the SystemServer > is looping to monitor each process, it does tend to *know* which address > range of the application is not used / useful. > > Besides, we can never rely on applications to clean things up themselves. > We've had the "hey app1, the system is low on memory, please trim your > memory usage down" notifications for a long time[1]. They rely on > applications honoring the broadcasts and very few do. > > So, if we want to avoid the inevitable killing of the application and > restarting it, some way to be able to tell the OS about unimportant memory in > these applications will be useful. This is a useful information that should be a part of the changelog. I do see how the current form of the API might fit into Android model without many problems. But we are not designing an API for a single usecase, right? In a highly cooperative environments you can use ptrace code injection as mentioned by Kirill. Or is there any fundamental problem about that? The interface really has to be robust to future potential usecases.
On Fri 17-01-20 09:25:42, Minchan Kim wrote: > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > > There is usecase that System Management Software(SMS) want to give > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > > in the case of Android, it is the ActivityManagerService. > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > required to make the reclaim decision is not known to the app. Instead, > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > and that daemon must be able to initiate reclaim on its own without > > > any app involvement. > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > > It uses pidfd of an external processs to give the hint. > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > > unsigned long flag); > > > > > > Since it could affect other process's address range, only privileged > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > > gives it the right to ptrace the process could use it successfully. > > > The flag argument is reserved for future use if we need to extend the > > > API. > > > > > > I think supporting all hints madvise has/will supported/support to > > > process_madvise is rather risky. Because we are not sure all hints make > > > sense from external process and implementation for the hint may rely on > > > the caller being in the current context so it could be error-prone. > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > > > If someone want to add other hints, we could hear hear the usecase and > > > review it for each hint. It's more safe for maintainace rather than > > > introducing a buggy syscall but hard to fix it later. > > > > I have brought this up when we discussed this in the past but there is > > no reflection on that here so let me bring that up again. > > > > I believe that the interface has an inherent problem that it is racy. > > The external entity needs to know the address space layout of the target > > process to do anyhing useful on it. The address space is however under > > the full control of the target process though and the external entity > > has no means to find out that the layout has changed. So > > time-to-check-time-to-act is an inherent problem. > > > > This is a serious design flaw and it should be explained why it doesn't > > matter or how to use the interface properly to prevent that problem. > > Sorry for the missing that part. > > It's not a particular problem of this API because other APIs already have > done with that(e.g., move_pages, process_vm_writev). I am sorry but this is not really an argument. > Point is userspace has several ways for the control of target process > like SIGSTOP, cgroup freezer or even no need to control since platform > is already aware of that the process will never run until he grant it > or it's resilient even though the race happens. If you have that level of control then you can simply inject the code via ptrace and you do not need a new syscall in the first place. > In future, if we want to support more fine-grained consistency model > like memory layout, we could provide some API to get cookie(e.g., > seq count which is updated whenever vma of the process changes). And then > we could feed the cookie to process_madvise's last argument so that > it can fail if founds it's not matched. > For that API, Daniel already posted RFC - process_getinfo[1]. > https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 So why do not we start with a clean API since the beginning? I do agree that a remote madvise is an interesting feature and it opens gates to all sorts of userspace memory management which is not possible this days. But the syscall has to have a reasonable semantic to allow that. We cannot simply start with a half proken symantic first based on an Android usecase and then hit the wall as soon as others with a different user space model want to use it as well.
On 17.01.2020 14:52, Michal Hocko wrote: > On Thu 16-01-20 15:59:50, Minchan Kim wrote: >> There is usecase that System Management Software(SMS) want to give >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and >> in the case of Android, it is the ActivityManagerService. >> >> It's similar in spirit to madvise(MADV_WONTNEED), but the information >> required to make the reclaim decision is not known to the app. Instead, >> it is known to the centralized userspace daemon(ActivityManagerService), >> and that daemon must be able to initiate reclaim on its own without >> any app involvement. >> >> To solve the issue, this patch introduces new syscall process_madvise(2). >> It uses pidfd of an external processs to give the hint. >> >> int process_madvise(int pidfd, void *addr, size_t length, int advise, >> unsigned long flag); >> >> Since it could affect other process's address range, only privileged >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) >> gives it the right to ptrace the process could use it successfully. >> The flag argument is reserved for future use if we need to extend the >> API. >> >> I think supporting all hints madvise has/will supported/support to >> process_madvise is rather risky. Because we are not sure all hints make >> sense from external process and implementation for the hint may rely on >> the caller being in the current context so it could be error-prone. >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. >> >> If someone want to add other hints, we could hear hear the usecase and >> review it for each hint. It's more safe for maintainace rather than >> introducing a buggy syscall but hard to fix it later. > > I have brought this up when we discussed this in the past but there is > no reflection on that here so let me bring that up again. > > I believe that the interface has an inherent problem that it is racy. > The external entity needs to know the address space layout of the target > process to do anyhing useful on it. The address space is however under > the full control of the target process though and the external entity > has no means to find out that the layout has changed. So > time-to-check-time-to-act is an inherent problem. > > This is a serious design flaw and it should be explained why it doesn't > matter or how to use the interface properly to prevent that problem. Really, any address space manipulation, where more than one process is involved, is racy. Even two threads on common memory need a synchronization to manage mappings in a sane way. Managing memory from two processes is the same in principle, and the only difference is that another level of synchronization is required. I'm not fan and user for this feature (at least for now, nobody knows what will be in the future), but design flaw argument does not look fully valid. Regards, Kirill
On 20.01.2020 10:58, Michal Hocko wrote: > On Sun 19-01-20 08:14:31, sspatil@google.com wrote: >> On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote: >>> On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote: >>>> On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote: >>>>> On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: >>>>>> On Thu 16-01-20 15:59:50, Minchan Kim wrote: >>>>>>> There is usecase that System Management Software(SMS) want to give >>>>>>> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and >>>>>>> in the case of Android, it is the ActivityManagerService. >>>>>>> >>>>>>> It's similar in spirit to madvise(MADV_WONTNEED), but the information >>>>>>> required to make the reclaim decision is not known to the app. Instead, >>>>>>> it is known to the centralized userspace daemon(ActivityManagerService), >>>>>>> and that daemon must be able to initiate reclaim on its own without >>>>>>> any app involvement. >>>>>>> >>>>>>> To solve the issue, this patch introduces new syscall process_madvise(2). >>>>>>> It uses pidfd of an external processs to give the hint. >>>>>>> >>>>>>> int process_madvise(int pidfd, void *addr, size_t length, int advise, >>>>>>> unsigned long flag); >>>>>>> >>>>>>> Since it could affect other process's address range, only privileged >>>>>>> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) >>>>>>> gives it the right to ptrace the process could use it successfully. >>>>>>> The flag argument is reserved for future use if we need to extend the >>>>>>> API. >>>>>>> >>>>>>> I think supporting all hints madvise has/will supported/support to >>>>>>> process_madvise is rather risky. Because we are not sure all hints make >>>>>>> sense from external process and implementation for the hint may rely on >>>>>>> the caller being in the current context so it could be error-prone. >>>>>>> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. >>>>>>> >>>>>>> If someone want to add other hints, we could hear hear the usecase and >>>>>>> review it for each hint. It's more safe for maintainace rather than >>>>>>> introducing a buggy syscall but hard to fix it later. >>>>>> >>>>>> I have brought this up when we discussed this in the past but there is >>>>>> no reflection on that here so let me bring that up again. >>>>>> >>>>>> I believe that the interface has an inherent problem that it is racy. >>>>>> The external entity needs to know the address space layout of the target >>>>>> process to do anyhing useful on it. The address space is however under >>>>>> the full control of the target process though and the external entity >>>>>> has no means to find out that the layout has changed. So >>>>>> time-to-check-time-to-act is an inherent problem. >>>>>> >>>>>> This is a serious design flaw and it should be explained why it doesn't >>>>>> matter or how to use the interface properly to prevent that problem. >>>>> >>>>> I agree, it looks flawed. >>>>> >>>>> Also I don't see what System Management Software can generically do on >>>>> sub-process level. I mean how can it decide which part of address space is >>>>> less important than other. >>>>> >>>>> I see how a manager can indicate that this process (or a group of >>>>> processes) is less important than other, but on per-addres-range basis? >>>> >>>> For example, memory ranges shared by several processes or critical for the >>>> latency, we could avoid those ranges to be cold/pageout to prevent >>>> unncecessary CPU burning/paging. >>> >>> Hmm.. I still don't see why any external entity has a better (or any) >>> knowledge about the matter. The process has to do this, no? >> >> FWIW, I totally agree with the time-to-check-time-to-react problem. However, >> I'd like to clarify the ActivityManager/SystemServer case (I'll call it >> SystemServer from now on) >> >> For Android, every application (including the special SystemServer) are forked >> from Zygote. The reason ofcourse is to share as many libraries and classes between >> the two as possible to benefit from the preloading during boot. >> >> After applications start, (almost) all of the APIs end up calling into this >> SystemServer process over IPC (binder) and back to the application. >> >> In a fully running system, the SystemServer monitors every single process >> periodically to calculate their PSS / RSS and also decides which process is >> "important" to the user for interactivity. >> >> So, because of how these processes start _and_ the fact that the SystemServer >> is looping to monitor each process, it does tend to *know* which address >> range of the application is not used / useful. >> >> Besides, we can never rely on applications to clean things up themselves. >> We've had the "hey app1, the system is low on memory, please trim your >> memory usage down" notifications for a long time[1]. They rely on >> applications honoring the broadcasts and very few do. >> >> So, if we want to avoid the inevitable killing of the application and >> restarting it, some way to be able to tell the OS about unimportant memory in >> these applications will be useful. > > This is a useful information that should be a part of the changelog. I > do see how the current form of the API might fit into Android model > without many problems. But we are not designing an API for a single > usecase, right? In a highly cooperative environments you can use ptrace > code injection as mentioned by Kirill. Or is there any fundamental > problem about that? There could be only problems with multi-threads applications, which poll the state of another threads (and exit with error, when they found that someone is traced). But this may be workarounded by freezer cgroup making all the thread group is frozen. It should guarantee consistent state of the whole process after attaching. Kirill
On Mon 20-01-20 13:24:35, Kirill Tkhai wrote: > On 17.01.2020 14:52, Michal Hocko wrote: > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > >> There is usecase that System Management Software(SMS) want to give > >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > >> in the case of Android, it is the ActivityManagerService. > >> > >> It's similar in spirit to madvise(MADV_WONTNEED), but the information > >> required to make the reclaim decision is not known to the app. Instead, > >> it is known to the centralized userspace daemon(ActivityManagerService), > >> and that daemon must be able to initiate reclaim on its own without > >> any app involvement. > >> > >> To solve the issue, this patch introduces new syscall process_madvise(2). > >> It uses pidfd of an external processs to give the hint. > >> > >> int process_madvise(int pidfd, void *addr, size_t length, int advise, > >> unsigned long flag); > >> > >> Since it could affect other process's address range, only privileged > >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > >> gives it the right to ptrace the process could use it successfully. > >> The flag argument is reserved for future use if we need to extend the > >> API. > >> > >> I think supporting all hints madvise has/will supported/support to > >> process_madvise is rather risky. Because we are not sure all hints make > >> sense from external process and implementation for the hint may rely on > >> the caller being in the current context so it could be error-prone. > >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > >> > >> If someone want to add other hints, we could hear hear the usecase and > >> review it for each hint. It's more safe for maintainace rather than > >> introducing a buggy syscall but hard to fix it later. > > > > I have brought this up when we discussed this in the past but there is > > no reflection on that here so let me bring that up again. > > > > I believe that the interface has an inherent problem that it is racy. > > The external entity needs to know the address space layout of the target > > process to do anyhing useful on it. The address space is however under > > the full control of the target process though and the external entity > > has no means to find out that the layout has changed. So > > time-to-check-time-to-act is an inherent problem. > > > > This is a serious design flaw and it should be explained why it doesn't > > matter or how to use the interface properly to prevent that problem. > > Really, any address space manipulation, where more than one process is > involved, is racy. They are, indeed. But that is not the point I wanted to make. > Even two threads on common memory need a synchronization > to manage mappings in a sane way. Managing memory from two processes > is the same in principle, and the only difference is that another level > of synchronization is required. Well, not really. The operation might simply attempt to perform an operation on a specific memory area and get a failure if it doesn't reference the same object anymore. What I think we need is some form of a handle to operate on. In the past we have discussed several directions. I was proposing /proc/self/map_anon/ (analogous to map_files) where you could inspect anonymous memory and get a file handle for it. madvise would then operate on the fd and then there shouldn't be a real problem to revalidate that the object is still valid. But there was no general enthusiasm about that approach. There are likely some land mines on the way. There was another approach mentioned by Minchan in other email in this thread. What I want to say is that I believe a remove madvise can have a sensible semantic even without a strong synchronization between the monitor and the target task. We just have to make sure that the monitor never operates on a different object then it believes it acts on. On the other hand, there will never be any way to make this interface reasonable for destructive operations though because the content of the memory needs a strong synchronization IMHO.
On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote: > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote: > > On 17.01.2020 14:52, Michal Hocko wrote: > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > >> There is usecase that System Management Software(SMS) want to give > > >> a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > >> in the case of Android, it is the ActivityManagerService. > > >> > > >> It's similar in spirit to madvise(MADV_WONTNEED), but the information > > >> required to make the reclaim decision is not known to the app. Instead, > > >> it is known to the centralized userspace daemon(ActivityManagerService), > > >> and that daemon must be able to initiate reclaim on its own without > > >> any app involvement. > > >> > > >> To solve the issue, this patch introduces new syscall process_madvise(2). > > >> It uses pidfd of an external processs to give the hint. > > >> > > >> int process_madvise(int pidfd, void *addr, size_t length, int advise, > > >> unsigned long flag); > > >> > > >> Since it could affect other process's address range, only privileged > > >> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > >> gives it the right to ptrace the process could use it successfully. > > >> The flag argument is reserved for future use if we need to extend the > > >> API. > > >> > > >> I think supporting all hints madvise has/will supported/support to > > >> process_madvise is rather risky. Because we are not sure all hints make > > >> sense from external process and implementation for the hint may rely on > > >> the caller being in the current context so it could be error-prone. > > >> Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > >> > > >> If someone want to add other hints, we could hear hear the usecase and > > >> review it for each hint. It's more safe for maintainace rather than > > >> introducing a buggy syscall but hard to fix it later. > > > > > > I have brought this up when we discussed this in the past but there is > > > no reflection on that here so let me bring that up again. > > > > > > I believe that the interface has an inherent problem that it is racy. > > > The external entity needs to know the address space layout of the target > > > process to do anyhing useful on it. The address space is however under > > > the full control of the target process though and the external entity > > > has no means to find out that the layout has changed. So > > > time-to-check-time-to-act is an inherent problem. > > > > > > This is a serious design flaw and it should be explained why it doesn't > > > matter or how to use the interface properly to prevent that problem. > > > > Really, any address space manipulation, where more than one process is > > involved, is racy. > > They are, indeed. But that is not the point I wanted to make. > > > Even two threads on common memory need a synchronization > > to manage mappings in a sane way. Managing memory from two processes > > is the same in principle, and the only difference is that another level > > of synchronization is required. > > Well, not really. The operation might simply attempt to perform an > operation on a specific memory area and get a failure if it doesn't > reference the same object anymore. What I think we need is some form of > a handle to operate on. In the past we have discussed several > directions. I was proposing /proc/self/map_anon/ (analogous to > map_files) where you could inspect anonymous memory and get a file > handle for it. madvise would then operate on the fd and then there > shouldn't be a real problem to revalidate that the object is still > valid. But there was no general enthusiasm about that approach. There > are likely some land mines on the way. Converting anon memory to file-backed is bad idea and going to backfire. It will break many things around rmap (for THP in particular). We have assumption that an anon page cannot be mapped twice in the same process. Breaking rules around memory inheritance is not helpful too. It is a major re-design of the subsystem and I don't see any real need for this.
On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote: > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote: > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote: [...] > > > Even two threads on common memory need a synchronization > > > to manage mappings in a sane way. Managing memory from two processes > > > is the same in principle, and the only difference is that another level > > > of synchronization is required. > > > > Well, not really. The operation might simply attempt to perform an > > operation on a specific memory area and get a failure if it doesn't > > reference the same object anymore. What I think we need is some form of > > a handle to operate on. In the past we have discussed several > > directions. I was proposing /proc/self/map_anon/ (analogous to > > map_files) where you could inspect anonymous memory and get a file > > handle for it. madvise would then operate on the fd and then there > > shouldn't be a real problem to revalidate that the object is still > > valid. But there was no general enthusiasm about that approach. There > > are likely some land mines on the way. > > Converting anon memory to file-backed is bad idea and going to backfire. I didn't mean to convert. I meant to expose that information via proc the same way we do for file backed mappings. That shouldn't really require to re-design the way how anonymous vma work IMO. But I haven't tried that so there might be many gotchas there. There are obvious things to think about though. Such fd cannot be sent to other processes (SCM stuff), mmap of the file would have to be disallowed and many others I am not aware of. I am not even pushing this direction because I am not convinced about how viable it is myself. But it would sound like a nice extension of the existing mechanism we have and a file based madvise sounds attractive to me as well because we already have that.
On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote: > On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote: > > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote: > > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote: > [...] > > > > Even two threads on common memory need a synchronization > > > > to manage mappings in a sane way. Managing memory from two processes > > > > is the same in principle, and the only difference is that another level > > > > of synchronization is required. > > > > > > Well, not really. The operation might simply attempt to perform an > > > operation on a specific memory area and get a failure if it doesn't > > > reference the same object anymore. What I think we need is some form of > > > a handle to operate on. In the past we have discussed several > > > directions. I was proposing /proc/self/map_anon/ (analogous to > > > map_files) where you could inspect anonymous memory and get a file > > > handle for it. madvise would then operate on the fd and then there > > > shouldn't be a real problem to revalidate that the object is still > > > valid. But there was no general enthusiasm about that approach. There > > > are likely some land mines on the way. > > > > Converting anon memory to file-backed is bad idea and going to backfire. > > I didn't mean to convert. I meant to expose that information via proc > the same way we do for file backed mappings. That shouldn't really > require to re-design the way how anonymous vma work IMO. But I haven't > tried that so there might be many gotchas there. > > There are obvious things to think about though. Such fd cannot be sent > to other processes (SCM stuff), mmap of the file would have to be > disallowed and many others I am not aware of. I am not even pushing this > direction because I am not convinced about how viable it is myself. But > it would sound like a nice extension of the existing mechanism we have > and a file based madvise sounds attractive to me as well because we > already have that. If the fd cannot be passed around or mmaped what does it represent? And how is it different from plain address?
On Mon 20-01-20 17:21:32, Kirill A. Shutemov wrote: > On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote: > > On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote: > > > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote: > > > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote: > > [...] > > > > > Even two threads on common memory need a synchronization > > > > > to manage mappings in a sane way. Managing memory from two processes > > > > > is the same in principle, and the only difference is that another level > > > > > of synchronization is required. > > > > > > > > Well, not really. The operation might simply attempt to perform an > > > > operation on a specific memory area and get a failure if it doesn't > > > > reference the same object anymore. What I think we need is some form of > > > > a handle to operate on. In the past we have discussed several > > > > directions. I was proposing /proc/self/map_anon/ (analogous to > > > > map_files) where you could inspect anonymous memory and get a file > > > > handle for it. madvise would then operate on the fd and then there > > > > shouldn't be a real problem to revalidate that the object is still > > > > valid. But there was no general enthusiasm about that approach. There > > > > are likely some land mines on the way. > > > > > > Converting anon memory to file-backed is bad idea and going to backfire. > > > > I didn't mean to convert. I meant to expose that information via proc > > the same way we do for file backed mappings. That shouldn't really > > require to re-design the way how anonymous vma work IMO. But I haven't > > tried that so there might be many gotchas there. > > > > There are obvious things to think about though. Such fd cannot be sent > > to other processes (SCM stuff), mmap of the file would have to be > > disallowed and many others I am not aware of. I am not even pushing this > > direction because I am not convinced about how viable it is myself. But > > it would sound like a nice extension of the existing mechanism we have > > and a file based madvise sounds attractive to me as well because we > > already have that. > > If the fd cannot be passed around or mmaped what does it represent? Because it is a cookie maintained by the kernel. > And how is it different from plain address? Kernel can revalidate that the given fd is denoting the vma it was created for and simply fail with ENOENT or whatever suits if somebody did unmap and mmap to the same address.
On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote: > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote: > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote: > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > > > > There is usecase that System Management Software(SMS) want to give > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > > > > in the case of Android, it is the ActivityManagerService. > > > > > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > > > required to make the reclaim decision is not known to the app. Instead, > > > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > > > and that daemon must be able to initiate reclaim on its own without > > > > > any app involvement. > > > > > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > > > > It uses pidfd of an external processs to give the hint. > > > > > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > > > > unsigned long flag); > > > > > > > > > > Since it could affect other process's address range, only privileged > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > > > > gives it the right to ptrace the process could use it successfully. > > > > > The flag argument is reserved for future use if we need to extend the > > > > > API. > > > > > > > > > > I think supporting all hints madvise has/will supported/support to > > > > > process_madvise is rather risky. Because we are not sure all hints make > > > > > sense from external process and implementation for the hint may rely on > > > > > the caller being in the current context so it could be error-prone. > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > > > > > > > If someone want to add other hints, we could hear hear the usecase and > > > > > review it for each hint. It's more safe for maintainace rather than > > > > > introducing a buggy syscall but hard to fix it later. > > > > > > > > I have brought this up when we discussed this in the past but there is > > > > no reflection on that here so let me bring that up again. > > > > > > > > I believe that the interface has an inherent problem that it is racy. > > > > The external entity needs to know the address space layout of the target > > > > process to do anyhing useful on it. The address space is however under > > > > the full control of the target process though and the external entity > > > > has no means to find out that the layout has changed. So > > > > time-to-check-time-to-act is an inherent problem. > > > > > > > > This is a serious design flaw and it should be explained why it doesn't > > > > matter or how to use the interface properly to prevent that problem. > > > > > > I agree, it looks flawed. > > > > > > Also I don't see what System Management Software can generically do on > > > sub-process level. I mean how can it decide which part of address space is > > > less important than other. > > > > > > I see how a manager can indicate that this process (or a group of > > > processes) is less important than other, but on per-addres-range basis? > > > > For example, memory ranges shared by several processes or critical for the > > latency, we could avoid those ranges to be cold/pageout to prevent > > unncecessary CPU burning/paging. > > Hmm.. I still don't see why any external entity has a better (or any) > knowledge about the matter. The process has to do this, no? I think Sandeep already gave enough information in other thread. > > > I also think people don't want to give an KSM hint to non-mergeable area. > > And how the manager knows which data is mergable? Oleksandr, could you say your thought why you need address range based API? > > If you are intimate enough with the process' internal state feel free to > inject syscall into the process with ptrace. Why bother with half-measures? Concern is we want to act the hint in caller's context, not calle because calle is usually very limited in cpuset/cgroups or even freezed state so they couldn't act by themselves quick enough, which makes many problems. One of efforts to solve the issue was "Expedited memory reclaim" https://lwn.net/Articles/785709/ That could be also a good candidate for process_madvise API.
On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote: > On Sun 19-01-20 08:14:31, sspatil@google.com wrote: > > On Sat, Jan 18, 2020 at 12:26:53AM +0300, Kirill A. Shutemov wrote: > > > On Fri, Jan 17, 2020 at 09:32:39AM -0800, Minchan Kim wrote: > > > > On Fri, Jan 17, 2020 at 06:58:37PM +0300, Kirill A. Shutemov wrote: > > > > > On Fri, Jan 17, 2020 at 12:52:25PM +0100, Michal Hocko wrote: > > > > > > On Thu 16-01-20 15:59:50, Minchan Kim wrote: > > > > > > > There is usecase that System Management Software(SMS) want to give > > > > > > > a memory hint like MADV_[COLD|PAGEEOUT] to other processes and > > > > > > > in the case of Android, it is the ActivityManagerService. > > > > > > > > > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > > > > > required to make the reclaim decision is not known to the app. Instead, > > > > > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > > > > > and that daemon must be able to initiate reclaim on its own without > > > > > > > any app involvement. > > > > > > > > > > > > > > To solve the issue, this patch introduces new syscall process_madvise(2). > > > > > > > It uses pidfd of an external processs to give the hint. > > > > > > > > > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise, > > > > > > > unsigned long flag); > > > > > > > > > > > > > > Since it could affect other process's address range, only privileged > > > > > > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) > > > > > > > gives it the right to ptrace the process could use it successfully. > > > > > > > The flag argument is reserved for future use if we need to extend the > > > > > > > API. > > > > > > > > > > > > > > I think supporting all hints madvise has/will supported/support to > > > > > > > process_madvise is rather risky. Because we are not sure all hints make > > > > > > > sense from external process and implementation for the hint may rely on > > > > > > > the caller being in the current context so it could be error-prone. > > > > > > > Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. > > > > > > > > > > > > > > If someone want to add other hints, we could hear hear the usecase and > > > > > > > review it for each hint. It's more safe for maintainace rather than > > > > > > > introducing a buggy syscall but hard to fix it later. > > > > > > > > > > > > I have brought this up when we discussed this in the past but there is > > > > > > no reflection on that here so let me bring that up again. > > > > > > > > > > > > I believe that the interface has an inherent problem that it is racy. > > > > > > The external entity needs to know the address space layout of the target > > > > > > process to do anyhing useful on it. The address space is however under > > > > > > the full control of the target process though and the external entity > > > > > > has no means to find out that the layout has changed. So > > > > > > time-to-check-time-to-act is an inherent problem. > > > > > > > > > > > > This is a serious design flaw and it should be explained why it doesn't > > > > > > matter or how to use the interface properly to prevent that problem. > > > > > > > > > > I agree, it looks flawed. > > > > > > > > > > Also I don't see what System Management Software can generically do on > > > > > sub-process level. I mean how can it decide which part of address space is > > > > > less important than other. > > > > > > > > > > I see how a manager can indicate that this process (or a group of > > > > > processes) is less important than other, but on per-addres-range basis? > > > > > > > > For example, memory ranges shared by several processes or critical for the > > > > latency, we could avoid those ranges to be cold/pageout to prevent > > > > unncecessary CPU burning/paging. > > > > > > Hmm.. I still don't see why any external entity has a better (or any) > > > knowledge about the matter. The process has to do this, no? > > > > FWIW, I totally agree with the time-to-check-time-to-react problem. However, > > I'd like to clarify the ActivityManager/SystemServer case (I'll call it > > SystemServer from now on) > > > > For Android, every application (including the special SystemServer) are forked > > from Zygote. The reason ofcourse is to share as many libraries and classes between > > the two as possible to benefit from the preloading during boot. > > > > After applications start, (almost) all of the APIs end up calling into this > > SystemServer process over IPC (binder) and back to the application. > > > > In a fully running system, the SystemServer monitors every single process > > periodically to calculate their PSS / RSS and also decides which process is > > "important" to the user for interactivity. > > > > So, because of how these processes start _and_ the fact that the SystemServer > > is looping to monitor each process, it does tend to *know* which address > > range of the application is not used / useful. > > > > Besides, we can never rely on applications to clean things up themselves. > > We've had the "hey app1, the system is low on memory, please trim your > > memory usage down" notifications for a long time[1]. They rely on > > applications honoring the broadcasts and very few do. > > > > So, if we want to avoid the inevitable killing of the application and > > restarting it, some way to be able to tell the OS about unimportant memory in > > these applications will be useful. Thanks for adding more useful description, Sandeep. > > This is a useful information that should be a part of the changelog. I I will incldue it in next respin. > do see how the current form of the API might fit into Android model > without many problems. But we are not designing an API for a single > usecase, right? In a highly cooperative environments you can use ptrace > code injection as mentioned by Kirill. Or is there any fundamental > problem about that? I replied it at Kirill's thread so let's discuss there. > > The interface really has to be robust to future potential usecases. I do understand your concern but for me, it's chicken and egg problem. We usually do best effort to make something perfect as far as possible but we also don't do over-engineering without real usecase from the beginning. I already told you how we could synchronize among processes and potential way to be extended Daniel suggested(That's why current API has extra field for the cookie) even though we don't need it right now. If you want to suggest the other way, please explain why your idea is better and why we need it at this moment. I don't think that is a blocker for the progress of this API since we already have several ways to synchronize processes.
On Mon, Jan 20, 2020 at 02:24:05PM +0100, Michal Hocko wrote: > On Mon 20-01-20 15:39:35, Kirill A. Shutemov wrote: > > On Mon, Jan 20, 2020 at 12:27:22PM +0100, Michal Hocko wrote: > > > On Mon 20-01-20 13:24:35, Kirill Tkhai wrote: > [...] > > > > Even two threads on common memory need a synchronization > > > > to manage mappings in a sane way. Managing memory from two processes > > > > is the same in principle, and the only difference is that another level > > > > of synchronization is required. > > > > > > Well, not really. The operation might simply attempt to perform an > > > operation on a specific memory area and get a failure if it doesn't > > > reference the same object anymore. What I think we need is some form of > > > a handle to operate on. In the past we have discussed several > > > directions. I was proposing /proc/self/map_anon/ (analogous to > > > map_files) where you could inspect anonymous memory and get a file > > > handle for it. madvise would then operate on the fd and then there > > > shouldn't be a real problem to revalidate that the object is still > > > valid. But there was no general enthusiasm about that approach. There > > > are likely some land mines on the way. > > > > Converting anon memory to file-backed is bad idea and going to backfire. > > I didn't mean to convert. I meant to expose that information via proc > the same way we do for file backed mappings. That shouldn't really > require to re-design the way how anonymous vma work IMO. But I haven't > tried that so there might be many gotchas there. > > There are obvious things to think about though. Such fd cannot be sent > to other processes (SCM stuff), mmap of the file would have to be > disallowed and many others I am not aware of. I am not even pushing this > direction because I am not convinced about how viable it is myself. But > it would sound like a nice extension of the existing mechanism we have > and a file based madvise sounds attractive to me as well because we > already have that. I am not a fan of fd based approach but I already reserved last argument of the API as extendable field so we could use the field as "fd" when we really need that kinds of fine-grained synchronization model if it's not enough with SGISTOP, freezer and so.
On Tue 21-01-20 10:32:12, Minchan Kim wrote: > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote: [...] > > The interface really has to be robust to future potential usecases. > > I do understand your concern but for me, it's chicken and egg problem. > We usually do best effort to make something perfect as far as possible > but we also don't do over-engineering without real usecase from the > beginning. > > I already told you how we could synchronize among processes and potential > way to be extended Daniel suggested(That's why current API has extra field > for the cookie) even though we don't need it right now. If you can synchronize with the target task then you do not need a remote interface. Just use ptrace and you are done with it. > If you want to suggest the other way, please explain why your idea is > better and why we need it at this moment. I believe I have explained my concerns and why they matter. All you are saying is that you do not care because your particular usecase doesn't care. And that is a first signal of a future disaster when we end up with a broken and unfixable interface we have to maintain for ever. I will not go as far as to nack this but you should seriously think about other potential usecases and how they would work and what we are going to do when a first non-cooperative userspace memory management usecase materializes.
On Wed 22-01-20 10:36:24, SeongJae Park wrote: > On Wed, 22 Jan 2020 09:28:53 +0100 Michal Hocko <mhocko@kernel.org> wrote: > > > On Tue 21-01-20 10:32:12, Minchan Kim wrote: > > > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote: > > [...] > > > > The interface really has to be robust to future potential usecases. > > > > > > I do understand your concern but for me, it's chicken and egg problem. > > > We usually do best effort to make something perfect as far as possible > > > but we also don't do over-engineering without real usecase from the > > > beginning. > > > > > > I already told you how we could synchronize among processes and potential > > > way to be extended Daniel suggested(That's why current API has extra field > > > for the cookie) even though we don't need it right now. > > > > If you can synchronize with the target task then you do not need a > > remote interface. Just use ptrace and you are done with it. > > > > > If you want to suggest the other way, please explain why your idea is > > > better and why we need it at this moment. > > > > I believe I have explained my concerns and why they matter. All you are > > saying is that you do not care because your particular usecase doesn't > > care. And that is a first signal of a future disaster when we end up > > with a broken and unfixable interface we have to maintain for ever. > > > > I will not go as far as to nack this but you should seriously think > > about other potential usecases and how they would work and what we are > > going to do when a first non-cooperative userspace memory management > > usecase materializes. > > Beside of the specific environment of Android, I think there are many ways to > know the address space layout and access patterns of other processes. The > idle_page_tracking might be an example that widelay available. > > Of course, the information might not strictly correct due to the timing issue, > but could be still worth to be used under some extreme situations, such as > memory pressure or fragmentation. For the same reason, ptrace() would not be > sufficient, as we have no perfect control, but only some level of control that > would be useful under specific situations. I am not sure I see your point. I am talking about races where a remote task is operating on a completely different object because the one it checked for has been unmapped and new one mapped over it. Memory pressure or a fragmentation will not change the object itself. Sure the memory might be reclaimed but that should be completely OK unless I am missing something. > I assume the users of this systemcall would understand the tradeoff and make > decisions. I disagree. My experience tells me that users tend to squeeze the maximum and beyond and hope they get what they want. > Also, as the users already have the right to do the tradeoff, I > think it's fair. In other words, I think the caller has both the power and the > responsibility to deal with the time-to-check-time-to-react problem. > > Nonetheless, I also agree this is important concern and the patch would be > better if it adds more detailed documentation regarding this issue. If there is _really_ a strong consensus that the racy interface is reasonable then it absolutely has to be described with a clearly state that those races might result in hard to predict behavior unless all tasks sharing the address space are blocked between the check and the madvise call.
Hello. On Tue, Jan 21, 2020 at 10:11:13AM -0800, Minchan Kim wrote: > > > I also think people don't want to give an KSM hint to non-mergeable area. > > > > And how the manager knows which data is mergable? > > Oleksandr, could you say your thought why you need address range based > API? It seems I've overlooked an important piece of this submission: one cannot apply the hint to all the anonymous mapping regardless of address range. For KSM I'd rather either have a possibility to hint all the anonymous mappings, or, as it was suggested previously, be able to iterate over existing mappings using some (fd-based?) API.
On Wed, Jan 22, 2020 at 09:28:53AM +0100, Michal Hocko wrote: > On Tue 21-01-20 10:32:12, Minchan Kim wrote: > > On Mon, Jan 20, 2020 at 08:58:25AM +0100, Michal Hocko wrote: > [...] > > > The interface really has to be robust to future potential usecases. > > > > I do understand your concern but for me, it's chicken and egg problem. > > We usually do best effort to make something perfect as far as possible > > but we also don't do over-engineering without real usecase from the > > beginning. > > > > I already told you how we could synchronize among processes and potential > > way to be extended Daniel suggested(That's why current API has extra field > > for the cookie) even though we don't need it right now. > > If you can synchronize with the target task then you do not need a > remote interface. Just use ptrace and you are done with it. As I mentioned in other reply, we want to do in caller's context, not callee's one because target processes stay in little cores, which are much slower than the core the manager lives in. The other reason is the apps are already freezed so they couldn't response by ptrace. > > > If you want to suggest the other way, please explain why your idea is > > better and why we need it at this moment. > > I believe I have explained my concerns and why they matter. All you are > saying is that you do not care because your particular usecase doesn't > care. And that is a first signal of a future disaster when we end up > with a broken and unfixable interface we have to maintain for ever. We already had suggested cookie and fd based approaches so I reserved a argument for that to make the API extendable. Thing is currently it's a just optimization idea since we have several ways to sychronize processes(e.g., signal, cgroup freezer, userfaultfd and so). It's a just matter of granularity, not necessary one we should introduce it from the beginnig. If someone needs that kinds of fine-grained consistency, we could extend it then. And that's the usual way we make progress when we couldn't know the future. What do you want to see further?
On Wed, Jan 22, 2020 at 11:44:24AM +0100, Oleksandr Natalenko wrote: > Hello. > > On Tue, Jan 21, 2020 at 10:11:13AM -0800, Minchan Kim wrote: > > > > I also think people don't want to give an KSM hint to non-mergeable area. > > > > > > And how the manager knows which data is mergable? > > > > Oleksandr, could you say your thought why you need address range based > > API? > > It seems I've overlooked an important piece of this submission: one > cannot apply the hint to all the anonymous mapping regardless of address > range. For KSM I'd rather either have a possibility to hint all the > anonymous mappings, or, as it was suggested previously, be able to iterate > over existing mappings using some (fd-based?) API. Thing is how you could identify a certan range is better for KSM than others from external process?
On Wed, Jan 22, 2020 at 05:43:16PM -0800, Minchan Kim wrote: > > It seems I've overlooked an important piece of this submission: one > > cannot apply the hint to all the anonymous mapping regardless of address > > range. For KSM I'd rather either have a possibility to hint all the > > anonymous mappings, or, as it was suggested previously, be able to iterate > > over existing mappings using some (fd-based?) API. > > Thing is how you could identify a certan range is better for KSM than > others from external process? I think the info like this is kinda available via /proc/pid/smaps. It lists the ranges and the vmflags. But using it raises 2 concerns: one is the absence of guarantee the mappings won't change after smaps is read and the second one is that there's no separate vmflag for marking a vma as non-meregable (and IIRC from previous attempts on addressing this, we've already exhausted all the flags on 32-bit arches, so it is not something that can be trivially addressed).
On Wed 22-01-20 17:41:31, Minchan Kim wrote:
[...]
> What do you want to see further?
Either a consensus that it is sufficient to have an inherently racy
interface that requires some form of external sychronization or
a robust interface that can cope with races in a sensible way.
And no, having a flag for future extension is definitely not the
way how a new API should be added. Seriously!
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index e56950f23b49..776c61803315 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -477,3 +477,4 @@ # 545 reserved for clone3 546 common watch_devices sys_watch_devices 547 common openat2 sys_openat2 +548 common process_madvise sys_process_madvise diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 7fb2f4d59210..a43381542276 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -451,3 +451,4 @@ 435 common clone3 sys_clone3 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 8aa00ccb0b96..b722e47377a5 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -38,7 +38,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 438 +#define __NR_compat_syscalls 439 #endif #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 31f0ce25719e..e3643d7fecc3 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -883,6 +883,8 @@ __SYSCALL(__NR_clone3, sys_clone3) __SYSCALL(__NR_watch_devices, sys_watch_devices) #define __NR_openat2 437 __SYSCALL(__NR_openat2, sys_openat2) +#define __NR_process_madvise 438 +__SYSCALL(__NR_process_madvise, process_madvise) /* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index b9aa59931905..c156abc9a298 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -358,3 +358,4 @@ # 435 reserved for clone3 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 868c1ef89d35..5b6034b6650f 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -437,3 +437,4 @@ # 435 reserved for clone3 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 544b4cef18b3..4bef584af09c 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -443,3 +443,4 @@ 435 common clone3 sys_clone3 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 05e8aee5dae7..7061b2103438 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -376,3 +376,4 @@ 435 n32 clone3 __sys_clone3 436 n32 watch_devices sys_watch_devices 437 n32 openat2 sys_openat2 +438 n32 process_madivse sys_process_madvise diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index 24d6c01328fb..84042d57fbfb 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -352,3 +352,4 @@ 435 n64 clone3 __sys_clone3 436 n64 watch_devices sys_watch_devices 437 n64 openat2 sys_openat2 +438 n64 process_madvise sys_process_madvise diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 4b5f77a4e1a2..5bfd359c7e6f 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -435,3 +435,4 @@ 435 common clone3 sys_clone3_wrapper 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 9716dc85a517..ffa0e679aca0 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -519,3 +519,4 @@ 435 nospu clone3 ppc_clone3 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 7da330f8b03e..c301717216ca 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -440,3 +440,4 @@ 435 common clone3 sys_clone3 sys_clone3 436 common watch_devices sys_watch_devices sys_watch_devices 437 common openat2 sys_openat2 sys_openat2 +438 common process_madvise sys_process_madvise sys_process_madvise diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index bb7e68e25337..b8f15701f69f 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -440,3 +440,4 @@ # 435 reserved for clone3 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 646a1fad7218..7ea95f37b222 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -483,3 +483,4 @@ # 435 reserved for clone3 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 57c53acee290..76a2c266fe7e 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -442,3 +442,4 @@ 435 i386 clone3 sys_clone3 __ia32_sys_clone3 436 i386 watch_devices sys_watch_devices __ia32_sys_watch_devices 437 i386 openat2 sys_openat2 __ia32_sys_openat2 +438 i386 process_madvise sys_process_madvise __ia32_sys_process_madvise diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 1dd8d21f6500..b697cd8620cb 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -359,6 +359,7 @@ 435 common clone3 __x64_sys_clone3/ptregs 436 common watch_devices __x64_sys_watch_devices 437 common openat2 __x64_sys_openat2 +438 common process_madvise __x64_sys_process_madvise # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 0f48ab7bd75b..2e9813ecfd7d 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -408,3 +408,4 @@ 435 common clone3 sys_clone3 436 common watch_devices sys_watch_devices 437 common openat2 sys_openat2 +438 common process_madvise sys_process_madvise diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 433c8c85636e..1b58a11ff49f 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -877,6 +877,8 @@ asmlinkage long sys_munlockall(void); asmlinkage long sys_mincore(unsigned long start, size_t len, unsigned char __user * vec); asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); +asmlinkage long sys_process_madvise(int pidfd, unsigned long start, + size_t len, int behavior, unsigned long flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 33f3856a9c3c..4a49fbaea013 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -856,8 +856,11 @@ __SYSCALL(__NR_watch_devices, sys_watch_devices) #define __NR_openat2 437 __SYSCALL(__NR_openat2, sys_openat2) +#define __NR_process_madvise 438 +__SYSCALL(__NR_process_madvise, sys_process_madvise) + #undef __NR_syscalls -#define __NR_syscalls 438 +#define __NR_syscalls 439 /* * 32 bit systems traditionally used different diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 0e9b275260f8..10ce5eac8b4b 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -281,6 +281,7 @@ COND_SYSCALL(mlockall); COND_SYSCALL(munlockall); COND_SYSCALL(mincore); COND_SYSCALL(madvise); +COND_SYSCALL(process_madvise); COND_SYSCALL(remap_file_pages); COND_SYSCALL(mbind); COND_SYSCALL_COMPAT(mbind); diff --git a/mm/madvise.c b/mm/madvise.c index 0c901de531e4..59b5cc99ef61 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -17,6 +17,7 @@ #include <linux/falloc.h> #include <linux/fadvise.h> #include <linux/sched.h> +#include <linux/sched/mm.h> #include <linux/ksm.h> #include <linux/fs.h> #include <linux/file.h> @@ -993,6 +994,18 @@ madvise_behavior_valid(int behavior) } } +static bool +process_madvise_behavior_valid(int behavior) +{ + switch (behavior) { + case MADV_COLD: + case MADV_PAGEOUT: + return true; + default: + return false; + } +} + /* * madvise_common - request behavior hint to address range of the target process * @@ -1151,6 +1164,10 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm, * MADV_DONTDUMP - the application wants to prevent pages in the given range * from being included in its core dump. * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. + * MADV_COLD - the application uses the memory less so the kernel can deactivate + * the memory to evict them quickly when the memory pressure happen. + * MADV_PAGEOUT - the application uses the memroy very rarely so kernel can + * page out the memory instantly. * * return values: * zero - success @@ -1169,3 +1186,49 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) { return madvise_common(current, current->mm, start, len_in, behavior); } + +SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start, + size_t, len_in, int, behavior, unsigned long, flags) +{ + int ret; + struct fd f; + struct pid *pid; + struct task_struct *task; + struct mm_struct *mm; + + if (flags != 0) + return -EINVAL; + + if (!process_madvise_behavior_valid(behavior)) + return -EINVAL; + + f = fdget(pidfd); + if (!f.file) + return -EBADF; + + pid = pidfd_pid(f.file); + if (IS_ERR(pid)) { + ret = PTR_ERR(pid); + goto fdput; + } + + task = get_pid_task(pid, PIDTYPE_PID); + if (!task) { + ret = -ESRCH; + goto fdput; + } + + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); + if (IS_ERR_OR_NULL(mm)) { + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; + goto release_task; + } + + ret = madvise_common(task, mm, start, len_in, behavior); + mmput(mm); +release_task: + put_task_struct(task); +fdput: + fdput(f); + return ret; +}
There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. It's similar in spirit to madvise(MADV_WONTNEED), but the information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces new syscall process_madvise(2). It uses pidfd of an external processs to give the hint. int process_madvise(int pidfd, void *addr, size_t length, int advise, unsigned long flag); Since it could affect other process's address range, only privileged process(CAP_SYS_PTRACE) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear hear the usecase and review it for each hint. It's more safe for maintainace rather than introducing a buggy syscall but hard to fix it later. Signed-off-by: Minchan Kim <minchan@kernel.org> --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 5 +- kernel/sys_ni.c | 1 + mm/madvise.c | 63 +++++++++++++++++++++ 21 files changed, 88 insertions(+), 2 deletions(-)