Message ID | 20200302193630.68771-6-minchan@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | introduce memory hinting API for external process | expand |
On 3/2/20 8:36 PM, Minchan Kim wrote: > There is a demand[1] to support pid as well pidfd for process_madvise > to reduce unnecessary syscall to get pidfd if the user has control of > the target process(ie, they could guarantee the process is not gone > or pid is not reused). > > This patch aims for supporting both options like waitid(2). So, the > syscall is currently, > > int process_madvise(int which, pid_t pid, void *addr, > size_t length, int advise, unsigned long flag); This is again halfway between kernel and userspace description, so if we stick to userspace then it's: int process_madvise(idtype_t idtype, id_t id, void *addr, size_t length, int advice, unsigned long flags); > @which is actually idtype_t for userspace libray and currently, > it supports P_PID and P_PIDFD. > > [1] https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@virtuozzo.com/ > > Cc: Christian Brauner <christian@brauner.io> > Reviewed-by: Suren Baghdasaryan <surenb@google.com> > Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com> > Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
On Fri, Mar 06, 2020 at 12:14:19PM +0100, Vlastimil Babka wrote: > On 3/2/20 8:36 PM, Minchan Kim wrote: > > There is a demand[1] to support pid as well pidfd for process_madvise > > to reduce unnecessary syscall to get pidfd if the user has control of > > the target process(ie, they could guarantee the process is not gone > > or pid is not reused). > > > > This patch aims for supporting both options like waitid(2). So, the > > syscall is currently, > > > > int process_madvise(int which, pid_t pid, void *addr, > > size_t length, int advise, unsigned long flag); > > This is again halfway between kernel and userspace description, so if we stick > to userspace then it's: > > int process_madvise(idtype_t idtype, id_t id, void *addr, > size_t length, int advice, unsigned long flags); Yub. > > > > @which is actually idtype_t for userspace libray and currently, > > it supports P_PID and P_PIDFD. > > > > [1] https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@virtuozzo.com/ > > > > Cc: Christian Brauner <christian@brauner.io> > > Reviewed-by: Suren Baghdasaryan <surenb@google.com> > > Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com> > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Thanks!
On Tue, Mar 10, 2020 at 05:42:51PM -0700, Minchan Kim wrote: > On Fri, Mar 06, 2020 at 12:14:19PM +0100, Vlastimil Babka wrote: > > On 3/2/20 8:36 PM, Minchan Kim wrote: > > > There is a demand[1] to support pid as well pidfd for process_madvise > > > to reduce unnecessary syscall to get pidfd if the user has control of > > > the target process(ie, they could guarantee the process is not gone > > > or pid is not reused). > > > > > > This patch aims for supporting both options like waitid(2). So, the > > > syscall is currently, > > > > > > int process_madvise(int which, pid_t pid, void *addr, > > > size_t length, int advise, unsigned long flag); > > > > This is again halfway between kernel and userspace description, so if we stick > > to userspace then it's: > > > > int process_madvise(idtype_t idtype, id_t id, void *addr, > > size_t length, int advice, unsigned long flags); > > Yub. > Hi Andrew, Per Vlastimil's request, I changed "which and advise" with "idtype and advice" in function prototype of description. Could you replace the part in the description? Code is never changed. Thanks. From f11cfd023746ae67b89f2d84d960706ba6c5c911 Mon Sep 17 00:00:00 2001 From: Minchan Kim <minchan@kernel.org> Date: Wed, 6 May 2020 13:54:40 +0000 Subject: [PATCH] mm/madvise: support both pid and pidfd for process_madvise There is a demand[1] to support pid as well pidfd for process_madvise to reduce unnecessary syscall to get pidfd if the user has control of the target process(ie, they could guarantee the process is not gone or pid is not reused). This patch aims for supporting both options like waitid(2). So, the syscall is currently, int process_madvise(idtype_t idtype, id_t id, void *addr, size_t length, int advice, unsigned long flags); @which is actually idtype_t for userspace libray and currently, it supports P_PID and P_PIDFD. [1] https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@virtuozzo.com/ Link: http://lkml.kernel.org/r/20200302193630.68771-6-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <christian@brauner.io> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: <linux-man@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On Fri, 8 May 2020 11:36:53 -0700 Minchan Kim <minchan@kernel.org> wrote: > > ... > > Per Vlastimil's request, I changed "which and advise" with "idtype and > advice" in function prototype of description. > Could you replace the part in the description? Code is never changed. > Done, but... > > ... > > There is a demand[1] to support pid as well pidfd for process_madvise to > reduce unnecessary syscall to get pidfd if the user has control of the > target process(ie, they could guarantee the process is not gone or pid is > not reused). > > This patch aims for supporting both options like waitid(2). So, the > syscall is currently, > > int process_madvise(idtype_t idtype, id_t id, void *addr, > size_t length, int advice, unsigned long flags); > > @which is actually idtype_t for userspace libray and currently, it > supports P_PID and P_PIDFD. What does "@which is actually idtype_t for userspace libray" mean? Can you clarify and expand? Also, does this userspace library exist? If so, where is it? > > ... >
On Fri, May 08, 2020 at 04:04:15PM -0700, Andrew Morton wrote: > On Fri, 8 May 2020 11:36:53 -0700 Minchan Kim <minchan@kernel.org> wrote: > > > > > ... > > > > Per Vlastimil's request, I changed "which and advise" with "idtype and > > advice" in function prototype of description. > > Could you replace the part in the description? Code is never changed. > > > > Done, but... > > > > > ... > > > > There is a demand[1] to support pid as well pidfd for process_madvise to > > reduce unnecessary syscall to get pidfd if the user has control of the > > target process(ie, they could guarantee the process is not gone or pid is > > not reused). > > > > This patch aims for supporting both options like waitid(2). So, the > > syscall is currently, > > > > int process_madvise(idtype_t idtype, id_t id, void *addr, > > size_t length, int advice, unsigned long flags); > > > > @which is actually idtype_t for userspace libray and currently, it > > supports P_PID and P_PIDFD. > > What does "@which is actually idtype_t for userspace libray" mean? Can > you clarify and expand? If I may clarify, the only case where we've supported both pidfd and pid in the same system call is waitid() to avoid adding a dedicated system call for waiting and because waitid() already had this (imho insane) argument type switching. The idtype_t thing comes from waitid() and is located int sys/wait.h and is defined as "The type idtype_t is defined as an enumeration type whose possible values include at least the following: P_ALL P_PID P_PGID " int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options); If idtype is P_PID, waitid() shall wait for the child with a process ID equal to (pid_t)id. If idtype is P_PGID, waitid() shall wait for any child with a process group ID equal to (pid_t)id. If idtype is P_ALL, waitid() shall wait for any children and id is ignored. I'm personally not a fan of this idtype_t thing and think this should just have been > > int pidfd_madvise(int pidfd, void *addr, > > size_t length, int advice, unsigned long flags); and call it a day. Also, if I may ask, why is the flag argument "unsigned long"? That's pretty unorthodox. The expectation is that flag arguments are not word-size dependent and should usually use "unsigned int". All new system calls follow this pattern too. The current syscall layout will mean that on 64 bit systems you have 64 flag bits and on 32 bit you have 32 flag bits, I think. That has just recently led to some problems with the clone() syscall (fixed in [1] which I'm sending Monday) which has the same weird word-size-dependent flag argument layout. If a system does sign-extension and a userspace api or glibc uses e.g. an int for the flag argument in the system call wrapper - which is fairly common - you can get sign extended and then you end up with garbage in the upper 32 bits of your system call. > > Also, does this userspace library exist? If so, where is it? [1]: https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/commit/?h=fixes&id=3f2c788a13143620c5471ac96ac4f033fc9ac3f3 Christian
Hi Christian, On Sat, May 09, 2020 at 02:48:17PM +0200, Christian Brauner wrote: > On Fri, May 08, 2020 at 04:04:15PM -0700, Andrew Morton wrote: > > On Fri, 8 May 2020 11:36:53 -0700 Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > ... > > > > > > Per Vlastimil's request, I changed "which and advise" with "idtype and > > > advice" in function prototype of description. > > > Could you replace the part in the description? Code is never changed. > > > > > > > Done, but... > > > > > > > > ... > > > > > > There is a demand[1] to support pid as well pidfd for process_madvise to > > > reduce unnecessary syscall to get pidfd if the user has control of the > > > target process(ie, they could guarantee the process is not gone or pid is > > > not reused). > > > > > > This patch aims for supporting both options like waitid(2). So, the > > > syscall is currently, > > > > > > int process_madvise(idtype_t idtype, id_t id, void *addr, > > > size_t length, int advice, unsigned long flags); > > > > > > @which is actually idtype_t for userspace libray and currently, it > > > supports P_PID and P_PIDFD. > > > > What does "@which is actually idtype_t for userspace libray" mean? Can > > you clarify and expand? > > If I may clarify, the only case where we've supported both pidfd and pid > in the same system call is waitid() to avoid adding a dedicated system > call for waiting and because waitid() already had this (imho insane) > argument type switching. The idtype_t thing comes from waitid() and is > located int sys/wait.h and is defined as > > "The type idtype_t is defined as an enumeration type whose possible > values include at least the following: > > P_ALL > P_PID > P_PGID > " > > int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options); > If idtype is P_PID, waitid() shall wait for the child with a process ID equal to (pid_t)id. > If idtype is P_PGID, waitid() shall wait for any child with a process group ID equal to (pid_t)id. > If idtype is P_ALL, waitid() shall wait for any children and id is ignored. > > I'm personally not a fan of this idtype_t thing and think this should > just have been > > > int pidfd_madvise(int pidfd, void *addr, > > > size_t length, int advice, unsigned long flags); > and call it a day. That was the argument at that time, Daniel and I didn't want to have pid along with pidfd even though Kirill strongly wanted to have it. However you said " Overall, I don't particularly care how or if you integrate pidfd here." at that time. https://lore.kernel.org/linux-mm/20200113104256.5ujbplyec2sk4onn@wittgenstein/ I asked a question to Kirll at that time. " > Sounds like that you want to support both options for every upcoming API > which deals with pid. I'm not sure how it's critical for process_madvise > API this case. In general, we sacrifice some performance for the nicer one > and later, once it's reported as hurdle for some workload, we could fix it > via introducing new flag. What I don't like at this moment is to make > syscall complicated with potential scenarios without real workload. Yes, I suggest allowing both options for every new process api " https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@virtuozzo.com/ You didn't give the opinion at that time, either(I expected you will make some voice then). What I could do to proceed work was separate it as different patch like this one to get more attention in future. And now it works. Let me clarify my side: I still don't like to introduce pid for new API since we have pidfd. Since you just brought this issue again, I want to hear *opinions* from others, again. > > Also, if I may ask, why is the flag argument "unsigned long"? > That's pretty unorthodox. The expectation is that flag arguments are > not word-size dependent and should usually use "unsigned int". All new > system calls follow this pattern too. Nothing special in this flag: Let me change it as "unsigned int". I will send the change once we have an agreement on "pidfd" argument. Thanks for the review, Christian!
On Sat, May 9, 2020 at 4:14 PM Minchan Kim <minchan@kernel.org> wrote: > > Hi Christian, > > On Sat, May 09, 2020 at 02:48:17PM +0200, Christian Brauner wrote: > > On Fri, May 08, 2020 at 04:04:15PM -0700, Andrew Morton wrote: > > > On Fri, 8 May 2020 11:36:53 -0700 Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > > > ... > > > > > > > > Per Vlastimil's request, I changed "which and advise" with "idtype and > > > > advice" in function prototype of description. > > > > Could you replace the part in the description? Code is never changed. > > > > > > > > > > Done, but... > > > > > > > > > > > ... > > > > > > > > There is a demand[1] to support pid as well pidfd for process_madvise to > > > > reduce unnecessary syscall to get pidfd if the user has control of the > > > > target process(ie, they could guarantee the process is not gone or pid is > > > > not reused). > > > > > > > > This patch aims for supporting both options like waitid(2). So, the > > > > syscall is currently, > > > > > > > > int process_madvise(idtype_t idtype, id_t id, void *addr, > > > > size_t length, int advice, unsigned long flags); > > > > > > > > @which is actually idtype_t for userspace libray and currently, it > > > > supports P_PID and P_PIDFD. > > > > > > What does "@which is actually idtype_t for userspace libray" mean? Can > > > you clarify and expand? > > > > If I may clarify, the only case where we've supported both pidfd and pid > > in the same system call is waitid() to avoid adding a dedicated system > > call for waiting and because waitid() already had this (imho insane) > > argument type switching. The idtype_t thing comes from waitid() and is > > located int sys/wait.h and is defined as > > > > "The type idtype_t is defined as an enumeration type whose possible > > values include at least the following: > > > > P_ALL > > P_PID > > P_PGID > > " > > > > int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options); > > If idtype is P_PID, waitid() shall wait for the child with a process ID equal to (pid_t)id. > > If idtype is P_PGID, waitid() shall wait for any child with a process group ID equal to (pid_t)id. > > If idtype is P_ALL, waitid() shall wait for any children and id is ignored. > > > > I'm personally not a fan of this idtype_t thing and think this should > > just have been > > > > int pidfd_madvise(int pidfd, void *addr, > > > > size_t length, int advice, unsigned long flags); > > and call it a day. > > That was the argument at that time, Daniel and I didn't want to have > pid along with pidfd even though Kirill strongly wanted to have it. > However you said " Overall, I don't particularly care how or if you > integrate pidfd here." at that time. > > https://lore.kernel.org/linux-mm/20200113104256.5ujbplyec2sk4onn@wittgenstein/ > > I asked a question to Kirll at that time. > > " > > Sounds like that you want to support both options for every upcoming API > > which deals with pid. I'm not sure how it's critical for process_madvise > > API this case. In general, we sacrifice some performance for the nicer one > > and later, once it's reported as hurdle for some workload, we could fix it > > via introducing new flag. What I don't like at this moment is to make > > syscall complicated with potential scenarios without real workload. > > Yes, I suggest allowing both options for every new process api > " > https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@virtuozzo.com/ > > You didn't give the opinion at that time, either(I expected you will > make some voice then). What I could do to proceed work was separate it > as different patch like this one to get more attention in future. > And now it works. > > Let me clarify my side: I still don't like to introduce pid for new API > since we have pidfd. Since you just brought this issue again, I want to > hear *opinions* from others, again. IIRC Kirill's main complaint was that if we support only pidfds and userspace has a pid of the process then it would have to convert that pid into pidfd before calling process_madvise, which involves additional syscall(s). The overhead would be more tangible if there are multiple processes needing to be madvised. I'm not sure how often such a need arises to madvise multiple processes in a bulk like that and how critical is the overhead of obtaining pidfd. With pid reuse possibility pid-based API will still have the issue of possibly sending the request to a wrong process, so this pidfd obtaining overhead arguably makes the usage more robust and therefore is not a pure loss. I don't have a real strong opinion against supporting pid in this syscall but I think API maintainers should decide going forward whether new APIs should support pid along with pidfd or switch to pidfd only. Thanks! > > > > > Also, if I may ask, why is the flag argument "unsigned long"? > > That's pretty unorthodox. The expectation is that flag arguments are > > not word-size dependent and should usually use "unsigned int". All new > > system calls follow this pattern too. > > Nothing special in this flag: Let me change it as "unsigned int". > I will send the change once we have an agreement on "pidfd" argument. > > Thanks for the review, Christian!
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e4cd2c2f8bb4..f5ada20e2943 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -876,7 +876,8 @@ asmlinkage long sys_munlockall(void); asmlinkage long sys_mincore(unsigned long start, size_t len, unsigned char __user * vec); asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); -asmlinkage long sys_process_madvise(int pidfd, unsigned long start, + +asmlinkage long sys_process_madvise(int which, pid_t pid, unsigned long start, size_t len, int behavior, unsigned long flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, diff --git a/mm/madvise.c b/mm/madvise.c index 6543f2bfc3d8..e794367f681e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1182,11 +1182,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) return do_madvise(current, current->mm, start, len_in, behavior); } -SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start, +SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start, size_t, len_in, int, behavior, unsigned long, flags) { int ret; - struct fd f; struct pid *pid; struct task_struct *task; struct mm_struct *mm; @@ -1197,20 +1196,31 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start, if (!process_madvise_behavior_valid(behavior)) return -EINVAL; - f = fdget(pidfd); - if (!f.file) - return -EBADF; + switch (which) { + case P_PID: + if (upid <= 0) + return -EINVAL; + + pid = find_get_pid(upid); + if (!pid) + return -ESRCH; + break; + case P_PIDFD: + if (upid < 0) + return -EINVAL; - pid = pidfd_pid(f.file); - if (IS_ERR(pid)) { - ret = PTR_ERR(pid); - goto fdput; + pid = pidfd_get_pid(upid); + if (IS_ERR(pid)) + return PTR_ERR(pid); + break; + default: + return -EINVAL; } task = get_pid_task(pid, PIDTYPE_PID); if (!task) { ret = -ESRCH; - goto fdput; + goto put_pid; } mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); @@ -1223,7 +1233,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start, mmput(mm); release_task: put_task_struct(task); -fdput: - fdput(f); +put_pid: + put_pid(pid); return ret; }