
[v2,1/2] mm/madvise: introduce PR_MADV_SELF flag to process_madvise()

Message ID 1ecf2692b3bcdd693ad61d510ce0437abb43a1bd.1727176176.git.lorenzo.stoakes@oracle.com (mailing list archive)
State Awaiting Upstream
Series unrestrict process_madvise() for current process

Commit Message

Lorenzo Stoakes Sept. 24, 2024, 11:16 a.m. UTC
process_madvise() was conceived as a useful means for performing a vector
of madvise() operations on a remote process's address space.

However it's useful to be able to do so on the current process also. It is
currently rather clunky to do this (requiring a pidfd to be opened for the
current process) and introduces unnecessary overhead in incrementing
reference counts for the task and mm.
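
For illustration only, the existing route looks roughly like the userspace
sketch below. It assumes headers new enough to define SYS_pidfd_open and
SYS_process_madvise; madvise_self_via_pidfd() is a hypothetical helper name
and is not part of this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: apply 'advice' to [addr, addr + len) in our own
 * address space by opening a pidfd for ourselves first.
 * Returns 0 if process_madvise() did not fail, -1 with errno set otherwise.
 */
static int madvise_self_via_pidfd(void *addr, size_t len, int advice)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	int saved_errno;
	long ret;
	int pidfd;

	assert(addr != NULL && len > 0);

	/* Step 1: the extra syscall (and fd) needed just to name ourselves. */
	pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
	if (pidfd < 0)
		return -1;

	/* Step 2: the actual vectored madvise, here with a single range. */
	ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL, advice, 0U);

	/* Step 3: clean up the fd while preserving errno for the caller. */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;

	return ret < 0 ? -1 : 0;
}
```

Which advice values succeed depends on the running kernel, so callers
should be prepared for -1 with errno set.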

Avoid all of this by providing a PR_MADV_SELF flag, which causes
process_madvise() to simply ignore the pidfd parameter and instead apply
the operation to the current process.

Since we are operating on our own process, no restrictions need be applied
on behaviors we can perform, so do not limit these in that case.

Also extend the case of a user specifying the current process via pidfd to
not be restricted on behaviors which can be performed.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  2 +
 arch/mips/include/uapi/asm/mman.h      |  2 +
 arch/parisc/include/uapi/asm/mman.h    |  2 +
 arch/xtensa/include/uapi/asm/mman.h    |  2 +
 include/uapi/asm-generic/mman-common.h |  2 +
 mm/madvise.c                           | 66 ++++++++++++++++++--------
 6 files changed, 56 insertions(+), 20 deletions(-)

Comments

Pedro Falcato Sept. 24, 2024, 12:51 p.m. UTC | #1
On Tue, Sep 24, 2024 at 12:16:27PM GMT, Lorenzo Stoakes wrote:
> process_madvise() was conceived as a useful means for performing a vector
> of madvise() operations on a remote process's address space.
> 
> However it's useful to be able to do so on the current process also. It is
> currently rather clunky to do this (requiring a pidfd to be opened for the
> current process) and introduces unnecessary overhead in incrementing
> reference counts for the task and mm.
> 
> Avoid all of this by providing a PR_MADV_SELF flag, which causes
> process_madvise() to simply ignore the pidfd parameter and instead apply
> the operation to the current process.
> 

How about simply defining a pseudo-fd PIDFD_SELF in the negative int space?
There's precedent for it in the fs space (AT_FDCWD). I think it's more ergonomic
and if you take out the errno space we have around 2^31 - 4096 available sentinel
values.

e.g:

/* AT_FDCWD = -100, -1 is dangerous, pick a different value */
#define PIDFD_SELF   -11

int pidfd = target_pid == getpid() ? PIDFD_SELF : pidfd_open(...);
process_madvise(pidfd, ...);


What do you think?
Lorenzo Stoakes Sept. 24, 2024, 1:12 p.m. UTC | #2
On Tue, Sep 24, 2024 at 01:51:11PM GMT, Pedro Falcato wrote:
> On Tue, Sep 24, 2024 at 12:16:27PM GMT, Lorenzo Stoakes wrote:
> > process_madvise() was conceived as a useful means for performing a vector
> > of madvise() operations on a remote process's address space.
> >
> > However it's useful to be able to do so on the current process also. It is
> > currently rather clunky to do this (requiring a pidfd to be opened for the
> > current process) and introduces unnecessary overhead in incrementing
> > reference counts for the task and mm.
> >
> > Avoid all of this by providing a PR_MADV_SELF flag, which causes
> > process_madvise() to simply ignore the pidfd parameter and instead apply
> > the operation to the current process.
> >
>
> How about simply defining a pseudo-fd PIDFD_SELF in the negative int space?
> There's precedent for it in the fs space (AT_FDCWD). I think it's more ergonomic
> and if you take out the errno space we have around 2^31 - 4096 available sentinel
> values.
>
> e.g:
>
> /* AT_FDCWD = -100, -1 is dangerous, pick a different value */
> #define PIDFD_SELF   -11
>
> int pidfd = target_pid == getpid() ? PIDFD_SELF : pidfd_open(...);
> process_madvise(pidfd, ...);
>
>
> What do you think?

I like the way you're thinking, but I don't think this is something we can
do in the context of this series.

I mean, I totally accept using a flag here and ignoring the pidfd field is
_ugly_, no question. But I'm trying to find the smallest change that
achieves what we want.

To add such a sentinel would be a change to the pidfd mechanism as a whole,
and we'd be left in the awkward situation that no other user of the pidfd
mechanism would be implementing this, but we'd have to expose this as a
general sentinel value for all pidfd users.

One nice thing with doing this as a flag is that, later, if somebody is
willing to do the larger task of having a special sentinel pidfd value to
mean 'the current process', we could use this in process_madvise() and
deprecate this flag :)

>
> --
> Pedro
Shakeel Butt Sept. 25, 2024, 2:02 p.m. UTC | #3
Cced Christian

On Tue, Sep 24, 2024 at 02:12:49PM GMT, Lorenzo Stoakes wrote:
> On Tue, Sep 24, 2024 at 01:51:11PM GMT, Pedro Falcato wrote:
> > On Tue, Sep 24, 2024 at 12:16:27PM GMT, Lorenzo Stoakes wrote:
> > > process_madvise() was conceived as a useful means for performing a vector
> > > of madvise() operations on a remote process's address space.
> > >
> > > However it's useful to be able to do so on the current process also. It is
> > > currently rather clunky to do this (requiring a pidfd to be opened for the
> > > current process) and introduces unnecessary overhead in incrementing
> > > reference counts for the task and mm.
> > >
> > > Avoid all of this by providing a PR_MADV_SELF flag, which causes
> > > process_madvise() to simply ignore the pidfd parameter and instead apply
> > > the operation to the current process.
> > >
> >
> > How about simply defining a pseudo-fd PIDFD_SELF in the negative int space?
> > There's precedent for it in the fs space (AT_FDCWD). I think it's more ergonomic
> > and if you take out the errno space we have around 2^31 - 4096 available sentinel
> > values.
> >
> > e.g:
> >
> > /* AT_FDCWD = -100, -1 is dangerous, pick a different value */
> > #define PIDFD_SELF   -11
> >
> > int pidfd = target_pid == getpid() ? PIDFD_SELF : pidfd_open(...);
> > process_madvise(pidfd, ...);
> >
> >
> > What do you think?
> 
> I like the way you're thinking, but I don't think this is something we can
> do in the context of this series.
> 
> I mean, I totally accept using a flag here and ignoring the pidfd field is
> _ugly_, no question. But I'm trying to find the smallest change that
> achieves what we want.

I don't think "smallest change" should be the target. We are changing
user API and we should aim to make it as robust as possible against
possible misuse or making unintended assumptions.

The proposed implementation opens the door for applications to
provide a dummy pidfd if PR_MADV_SELF is used. You definitely need to
restrict it to some known value, like the -1 used by the mmap() syscall.

> 
> To add such a sentinel would be a change to the pidfd mechanism as a whole,
> and we'd be left in the awkward situation that no other user of the pidfd
> mechanism would be implementing this, but we'd have to expose this as a
> general sentinel value for all pidfd users.

There might be future users which can take advantage of this. I can even
imagine pidfd_send_signal() can use PIDFD_SELF as well.

> 
> One nice thing with doing this as a flag is that, later, if somebody is
> willing to do the larger task of having a special sentinel pidfd value to
> mean 'the current process', we could use this in process_madvise() and
> deprecate this flag :)
> 

Once something is added to an API, particularly syscalls, the removal
is almost impossible.

Anyways, I don't have a very strong opinion one way or the other, but
whatever we decide, let's make it robust.
Lorenzo Stoakes Sept. 25, 2024, 2:48 p.m. UTC | #4
On Wed, Sep 25, 2024 at 07:02:59AM GMT, Shakeel Butt wrote:
> Cced Christian
>
> On Tue, Sep 24, 2024 at 02:12:49PM GMT, Lorenzo Stoakes wrote:
> > On Tue, Sep 24, 2024 at 01:51:11PM GMT, Pedro Falcato wrote:
> > > On Tue, Sep 24, 2024 at 12:16:27PM GMT, Lorenzo Stoakes wrote:
> > > > process_madvise() was conceived as a useful means for performing a vector
> > > > of madvise() operations on a remote process's address space.
> > > >
> > > > However it's useful to be able to do so on the current process also. It is
> > > > currently rather clunky to do this (requiring a pidfd to be opened for the
> > > > current process) and introduces unnecessary overhead in incrementing
> > > > reference counts for the task and mm.
> > > >
> > > > Avoid all of this by providing a PR_MADV_SELF flag, which causes
> > > > process_madvise() to simply ignore the pidfd parameter and instead apply
> > > > the operation to the current process.
> > > >
> > >
> > > How about simply defining a pseudo-fd PIDFD_SELF in the negative int space?
> > > There's precedent for it in the fs space (AT_FDCWD). I think it's more ergonomic
> > > and if you take out the errno space we have around 2^31 - 4096 available sentinel
> > > values.
> > >
> > > e.g:
> > >
> > > /* AT_FDCWD = -100, -1 is dangerous, pick a different value */
> > > #define PIDFD_SELF   -11
> > >
> > > int pidfd = target_pid == getpid() ? PIDFD_SELF : pidfd_open(...);
> > > process_madvise(pidfd, ...);
> > >
> > >
> > > What do you think?
> >
> > I like the way you're thinking, but I don't think this is something we can
> > do in the context of this series.
> >
> > I mean, I totally accept using a flag here and ignoring the pidfd field is
> > _ugly_, no question. But I'm trying to find the smallest change that
> > achieves what we want.
>
> I don't think "smallest change" should be the target. We are changing
> user API and we should aim to make it as robust as possible against
> possible misuse or making unintended assumptions.

I think introducing a new pidfd sentinel that isn't used anywhere else is
far more liable to mistakes than adding an explicit flag.

Could you provide examples of possible misuse of this flag or unintended
assumptions it confers (other than the -1 thing addressed below)?

The flag is explicitly 'target this process, ignore pidfd'. We can document
it as such (I will patch manpages too).

>
> The proposed implementation opens the door for applications to
> provide a dummy pidfd if PR_MADV_SELF is used. You definitely need to
> restrict it to some known value, like the -1 used by the mmap() syscall.

Why?

mmap() is special in that you have a 'dual' situation with shmem that is
both file-backed and private and of course you can do MAP_SHARED |
MAP_ANONYMOUS and have mmap() transparently assign something to you, etc.

Here we explicitly have a flag whose semantics are 'ignore pidfd, target
self'.

If you choose to use a brand new flag that explicitly states this and
provide a 'dummy' pidfd which then has nothing done to it - what exactly is
the problem?

I mean if you feel strongly, we can enforce this, but I'm not sure -1
implying a special case for pidfd is a thing either.

On the other hand it would be _weird_ and broken for the user to provide a
valid pidfd so maybe we should as it is easy to do and the user has clearly
done something wrong.

So fine, agreed, I'll add that.
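
To make the agreed check concrete, here is a small userspace model of the
intended validation. It is a sketch only: SKETCH_MADV_SELF is a stand-in
bit for PR_MADV_SELF (the real value lives in the patch, not here), and
resolve_madvise_target() is a hypothetical helper, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for PR_MADV_SELF; the bit position is an assumption. */
#define SKETCH_MADV_SELF (1U << 0)

/*
 * Decide which process a process_madvise()-style call targets.
 * On success returns 0 and sets *is_self; otherwise returns -errno.
 */
static int resolve_madvise_target(int pidfd, unsigned int flags, bool *is_self)
{
	assert(is_self != NULL);

	/* Unknown flag bits are rejected outright. */
	if (flags & ~SKETCH_MADV_SELF)
		return -EINVAL;

	if (flags & SKETCH_MADV_SELF) {
		/* With the flag set, only the fixed dummy value is accepted. */
		if (pidfd != -1)
			return -EINVAL;
		*is_self = true;
		return 0;
	}

	/* Without the flag, a negative pidfd can never be valid. */
	if (pidfd < 0)
		return -EBADF;

	*is_self = false;
	return 0;
}
```

The point of insisting on -1 is that any other value passed alongside the
flag is reported as an error rather than silently ignored.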

>
> >
> > To add such a sentinel would be a change to the pidfd mechanism as a whole,
> > and we'd be left in the awkward situation that no other user of the pidfd
> > mechanism would be implementing this, but we'd have to expose this as a
> > general sentinel value for all pidfd users.
>
> There might be future users which can take advantage of this. I can even
> imagine pidfd_send_signal() can use PIDFD_SELF as well.

I'm confused by this comment - I mean absolutely, as I said I like the
idea, but this just proves the point that you'd have to go around and
implement this everywhere that uses a pidfd?

That is a big undertaking, and not blocked by this change. Nor is
maintaining the flag proposed here egregious.

Blocking a useful feature because we may in future possibly add a new means
of doing the same thing seems a little silly to me.

> >
> > One nice thing with doing this as a flag is that, later, if somebody is
> > willing to do the larger task of having a special sentinel pidfd value to
> > mean 'the current process', we could use this in process_madvise() and
> > deprecate this flag :)
> >
>
> Once something is added to an API, particularly syscalls, the removal
> is almost impossible.

And why would it be such a problem to have this flag remain? I said
deprecate not remove. And only in the sense that 'you may as well use the
sentinel'.

The flag is very clear in its meaning, and confers no special problem in
remaining supported. It is a private flag that overlaps no others.

I mean it'd in effect be a change to a single line 'if pidfd is sentinel
or flag is used'. If we can't support that going forward, then we should
give up this kernel stuff and frolic in the fields joyously instead...

Again, if you can tell me why it'd be such a problem then fine we can
address that.

But blocking a series and demanding a change to an entire other feature
just to support something I'd say requires some pretty specific reasons as
to why you have a problem with the change.

>
> Anyways, I don't have a very strong opinion one way or the other, but
> whatever we decide, let's make it robust.

I mean... err... it sounds like you do kinda have pretty strong opinions ;)

But anyway - as to robustness, again, could you please provide examples of
possible misuse of this flag or unintended assumptions it confers (other
than the -1 thing addressed above)? I would be happy to address them.

If not then let's move forward with this useful feature?
Shakeel Butt Sept. 25, 2024, 4:19 p.m. UTC | #5
I have no idea what makes you think I am blocking the feature that you
respond in a weird tone but let me be upfront about what I am asking: Let's
collectively decide which is the better option (in terms of
maintainability and extensibility) and move forward.

On Wed, Sep 25, 2024 at 03:48:07PM GMT, Lorenzo Stoakes wrote:
> On Wed, Sep 25, 2024 at 07:02:59AM GMT, Shakeel Butt wrote:
> > Cced Christian
> >
> > On Tue, Sep 24, 2024 at 02:12:49PM GMT, Lorenzo Stoakes wrote:
> > > On Tue, Sep 24, 2024 at 01:51:11PM GMT, Pedro Falcato wrote:
> > > > On Tue, Sep 24, 2024 at 12:16:27PM GMT, Lorenzo Stoakes wrote:
> > > > > process_madvise() was conceived as a useful means for performing a vector
> > > > > of madvise() operations on a remote process's address space.
> > > > >
> > > > > However it's useful to be able to do so on the current process also. It is
> > > > > currently rather clunky to do this (requiring a pidfd to be opened for the
> > > > > current process) and introduces unnecessary overhead in incrementing
> > > > > reference counts for the task and mm.
> > > > >
> > > > > Avoid all of this by providing a PR_MADV_SELF flag, which causes
> > > > > process_madvise() to simply ignore the pidfd parameter and instead apply
> > > > > the operation to the current process.
> > > > >
> > > >
> > > > How about simply defining a pseudo-fd PIDFD_SELF in the negative int space?
> > > > There's precedent for it in the fs space (AT_FDCWD). I think it's more ergonomic
> > > > and if you take out the errno space we have around 2^31 - 4096 available sentinel
> > > > values.
> > > >
> > > > e.g:
> > > >
> > > > /* AT_FDCWD = -100, -1 is dangerous, pick a different value */
> > > > #define PIDFD_SELF   -11
> > > >
> > > > int pidfd = target_pid == getpid() ? PIDFD_SELF : pidfd_open(...);
> > > > process_madvise(pidfd, ...);
> > > >
> > > >
> > > > What do you think?
> > >
> > > I like the way you're thinking, but I don't think this is something we can
> > > do in the context of this series.
> > >
> > > I mean, I totally accept using a flag here and ignoring the pidfd field is
> > > _ugly_, no question. But I'm trying to find the smallest change that
> > > achieves what we want.
> >
> > I don't think "smallest change" should be the target. We are changing
> > user API and we should aim to make it as robust as possible against
> > possible misuse or making unintended assumptions.
> 
> I think introducing a new pidfd sentinel that isn't used anywhere else is
> far more liable to mistakes than adding an explicit flag.
> 
> Could you provide examples of possible misuse of this flag or unintended
> assumptions it confers (other than the -1 thing addressed below)?
> 
> The flag is explicitly 'target this process, ignore pidfd'. We can document
> it as such (I will patch manpages too).
> 
> >
> > The proposed implementation opens the door for applications to
> > provide a dummy pidfd if PR_MADV_SELF is used. You definitely need to
> > restrict it to some known value, like the -1 used by the mmap() syscall.
> 
> Why?
> 
> mmap() is special in that you have a 'dual' situation with shmem that is
> both file-backed and private and of course you can do MAP_SHARED |
> MAP_ANONYMOUS and have mmap() transparently assign something to you, etc.
> 
> Here we explicitly have a flag whose semantics are 'ignore pidfd, target
> self'.
> 
> If you choose to use a brand new flag that explicitly states this and
> provide a 'dummy' pidfd which then has nothing done to it - what exactly is
> the problem?

IMHO having a fixed dummy would allow the kernel more flexibility in
future for evolving the API.

> 
> I mean if you feel strongly, we can enforce this, but I'm not sure -1
> implying a special case for pidfd is a thing either.
> 
> On the other hand it would be _weird_ and broken for the user to provide a
> valid pidfd so maybe we should as it is easy to do and the user has clearly
> done something wrong.
> 
> So fine, agreed, I'll add that.
> 

No, don't just agree. A response like "-1 is not good for such-and-such
reasons" is totally fine, and my request would be to add that reasoning to
the commit message. My only request is that we think through the
alternatives and document the reasoning behind the chosen approach.

> >
> > >
> > > To add such a sentinel would be a change to the pidfd mechanism as a whole,
> > > and we'd be left in the awkward situation that no other user of the pidfd
> > > mechanism would be implementing this, but we'd have to expose this as a
> > > general sentinel value for all pidfd users.
> >
> > There might be future users which can take advantage of this. I can even
> > imagine pidfd_send_signal() can use PIDFD_SELF as well.
> 
> I'm confused by this comment - I mean absolutely, as I said I like the
> idea, but this just proves the point that you'd have to go around and
> implement this everywhere that uses a pidfd?
> 
> That is a big undertaking, and not blocked by this change. Nor is
> maintaining the flag proposed here egregious.

By big undertaking, do you mean other syscalls that take pidfd
(pidfd_getfd, pidfd_send_signal & process_mrelease) to handle PIDFD_SELF
or something else?

> 
> Blocking a useful feature because we may in future possibly add a new means
> of doing the same thing seems a little silly to me.
> 

Hah!!

> > >
> > > One nice thing with doing this as a flag is that, later, if somebody is
> > > willing to do the larger task of having a special sentinel pidfd value to
> > > mean 'the current process', we could use this in process_madvise() and
> > > deprecate this flag :)
> > >
> >
> > Once something is added to an API, particularly syscalls, the removal
> > is almost impossible.
> 
> And why would it be such a problem to have this flag remain? I said
> deprecate not remove. And only in the sense that 'you may as well use the
> sentinel'.
> 

My point was to aim for a solution where we can avoid such a scenario,
but it is totally understandable and acceptable if we still have to go
through a deprecation process in future.

> The flag is very clear in its meaning, and confers no special problem in
> remaining supported. It is a private flag that overlaps no others.
> 
> I mean it'd in effect be a change to a single line 'if pidfd is sentinel
> or flag is used'. If we can't support that going forward, then we should
> give up this kernel stuff and frolic in the fields joyously instead...
> 
> Again, if you can tell me why it'd be such a problem then fine we can
> address that.
> 
> But blocking a series and demanding a change to an entire other feature
> just to support something I'd say requires some pretty specific reasons as
> to why you have a problem with the change.
> 
> >
> > Anyways, I don't have a very strong opinion one way or the other, but
> > whatever we decide, let's make it robust.
> 
> I mean... err... it sounds like you do kinda have pretty strong opinions ;)

I am not sure how much more explicit I have to be, but I am hoping it is
clearer now than before.

Shakeel
Lorenzo Stoakes Sept. 25, 2024, 5:04 p.m. UTC | #6
On Wed, Sep 25, 2024 at 09:19:17AM GMT, Shakeel Butt wrote:
> I have no idea what makes you think I am blocking the feature that you
> respond in a weird tone but let me be upfront about what I am asking: Let's
> collectively decide which is the better option (in terms of
> maintainability and extensibility) and move forward.

I'm not sure what you mean by 'weird tone'... perhaps a miscommunication?

To summarise in my view - a suggestion was made that, rather than providing
the proposed flag, a pidfd sentinel should be introduced.

Simply introducing a sentinel that represents 'the current process' without
changing interfaces that accept a pidfd would be broken - so implementing
this implies that _all_ pidfd interfaces are updated, as well as tests.

I suggest doing so is, of course, entirely out of the scope of this
change. Therefore if we were to require that here - it would block the
feature while I go work on that.

I think this is pretty clear right? And I also suggest that doing so is
likely to take quite some time, and may not even have a positive outcome.

So it's not a case of 'shall we take approach A or approach B?' but rather
'should we take approach A or entirely implement a new feature B, then once
that is done, use it'.

So as to your 'collectively decide what is the better option' - in my
previous response I argued that the best approach between 'use an
unimplemented suggested entirely new feature of pidfd' vs. 'implement a
flag that would in no way block the prior approach' - a flag works better.

If you can provide specific arguments as to why I'm wrong then by all means
I'm happy to hear them.

>
> On Wed, Sep 25, 2024 at 03:48:07PM GMT, Lorenzo Stoakes wrote:
> > On Wed, Sep 25, 2024 at 07:02:59AM GMT, Shakeel Butt wrote:
> > > Cced Christian
> > >
> > > On Tue, Sep 24, 2024 at 02:12:49PM GMT, Lorenzo Stoakes wrote:
> > > > On Tue, Sep 24, 2024 at 01:51:11PM GMT, Pedro Falcato wrote:
> > > > > On Tue, Sep 24, 2024 at 12:16:27PM GMT, Lorenzo Stoakes wrote:
> > > > > > process_madvise() was conceived as a useful means for performing a vector
> > > > > > of madvise() operations on a remote process's address space.
> > > > > >
> > > > > > However it's useful to be able to do so on the current process also. It is
> > > > > > currently rather clunky to do this (requiring a pidfd to be opened for the
> > > > > > current process) and introduces unnecessary overhead in incrementing
> > > > > > reference counts for the task and mm.
> > > > > >
> > > > > > Avoid all of this by providing a PR_MADV_SELF flag, which causes
> > > > > > process_madvise() to simply ignore the pidfd parameter and instead apply
> > > > > > the operation to the current process.
> > > > > >
> > > > >
> > > > > How about simply defining a pseudo-fd PIDFD_SELF in the negative int space?
> > > > > There's precedent for it in the fs space (AT_FDCWD). I think it's more ergonomic
> > > > > and if you take out the errno space we have around 2^31 - 4096 available sentinel
> > > > > values.
> > > > >
> > > > > e.g:
> > > > >
> > > > > /* AT_FDCWD = -100, -1 is dangerous, pick a different value */
> > > > > #define PIDFD_SELF   -11
> > > > >
> > > > > int pidfd = target_pid == getpid() ? PIDFD_SELF : pidfd_open(...);
> > > > > process_madvise(pidfd, ...);
> > > > >
> > > > >
> > > > > What do you think?
> > > >
> > > > I like the way you're thinking, but I don't think this is something we can
> > > > do in the context of this series.
> > > >
> > > > I mean, I totally accept using a flag here and ignoring the pidfd field is
> > > > _ugly_, no question. But I'm trying to find the smallest change that
> > > > achieves what we want.
> > >
> > > I don't think "smallest change" should be the target. We are changing
> > > user API and we should aim to make it as robust as possible against
> > > possible misuse or making unintended assumptions.
> >
> > I think introducing a new pidfd sentinel that isn't used anywhere else is
> > far more liable to mistakes than adding an explicit flag.
> >
> > Could you provide examples of possible misuse of this flag or unintended
> > assumptions it confers (other than the -1 thing addressed below)?
> >
> > The flag is explicitly 'target this process, ignore pidfd'. We can document
> > it as such (I will patch manpages too).
> >
> > >
> > > The proposed implementation opens the door for applications to
> > > provide a dummy pidfd if PR_MADV_SELF is used. You definitely need to
> > > restrict it to some known value, like the -1 used by the mmap() syscall.
> >
> > Why?
> >
> > mmap() is special in that you have a 'dual' situation with shmem that is
> > both file-backed and private and of course you can do MAP_SHARED |
> > MAP_ANONYMOUS and have mmap() transparently assign something to you, etc.
> >
> > Here we explicitly have a flag whose semantics are 'ignore pidfd, target
> > self'.
> >
> > If you choose to use a brand new flag that explicitly states this and
> > provide a 'dummy' pidfd which then has nothing done to it - what exactly is
> > the problem?
>
> IMHO having a fixed dummy would allow the kernel more flexibility in
> future for evolving the API.

OK. I agree with having a fixed dummy value as stated.

>
> >
> > I mean if you feel strongly, we can enforce this, but I'm not sure -1
> > implying a special case for pidfd is a thing either.
> >
> > On the other hand it would be _weird_ and broken for the user to provide a
> > valid pidfd so maybe we should as it is easy to do and the user has clearly
> > done something wrong.
> >
> > So fine, agreed, I'll add that.
> >
>
> No, don't just agree. A response like "-1 is not good for such-and-such
> reasons" is totally fine, and my request would be to add that reasoning to
> the commit message. My only request is that we think through the
> alternatives and document the reasoning behind the chosen approach.

I didn't just agree, as I said, my reasoning is:

	On the other hand it would be _weird_ and broken for the user to
	provide a valid pidfd so maybe we should as it is easy to do and
	the user has clearly done something wrong.

If we're in alignment with that then all good!

>
> > >
> > > >
> > > > To add such a sentinel would be a change to the pidfd mechanism as a whole,
> > > > and we'd be left in the awkward situation that no other user of the pidfd
> > > > mechanism would be implementing this, but we'd have to expose this as a
> > > > general sentinel value for all pidfd users.
> > >
> > > There might be future users which can take advantage of this. I can even
> > > imagine pidfd_send_signal() can use PIDFD_SELF as well.
> >
> > I'm confused by this comment - I mean absolutely, as I said I like the
> > idea, but this just proves the point that you'd have to go around and
> > implement this everywhere that uses a pidfd?
> >
> > That is a big undertaking, and not blocked by this change. Nor is
> > maintaining the flag proposed here egregious.
>
> By big undertaking, do you mean other syscalls that take pidfd
> (pidfd_getfd, pidfd_send_signal & process_mrelease) to handle PIDFD_SELF
> or something else?

I mean if you add a pidfd sentinel that represents 'the current process' it
may get passed to any interface that accepts a pidfd, so all of them have
to handle it _somehow_.

Also you'll want to update tests accordingly and clearly need to get
community buy-in for that feature.

You may want to just add a bunch of:

if (pidfd == SENTINEL)
	return -EINVAL;

So it's not impossible my instincts are off and we can get away with simply
doing that.

On the other hand, would that be confusing? Wouldn't we need to update
documentation, manpages, etc. to say explicitly 'hey this sentinel is just
not supported'?

Again totally fine with the idea, like it actually, just my instincts are
it will involve some work. I may be wrong.

>
> >
> > Blocking a useful feature because we may in future possibly add a new means
> > of doing the same thing seems a little silly to me.
> >
>
> Hah!!

See top of mail.

>
> > > >
> > > > One nice thing with doing this as a flag is that, later, if somebody is
> > > > willing to do the larger task of having a special sentinel pidfd value to
> > > > mean 'the current process', we could use this in process_madvise() and
> > > > deprecate this flag :)
> > > >
> > >
> > > Once something is added to an API, particularly syscalls, the removal
> > > is almost impossible.
> >
> > And why would it be such a problem to have this flag remain? I said
> > deprecate not remove. And only in the sense that 'you may as well use the
> > sentinel'.
> >
>
> My point was to aim for a solution where we can avoid such a scenario,
> but it is totally understandable and acceptable if we still have to go
> through a deprecation process in future.
>
> > The flag is very clear in its meaning, and confers no special problem in
> > remaining supported. It is a private flag that overlaps no others.
> >
> > I mean it'd in effect be a change to a single line 'if pidfd is sentinel
> > or flag is used'. If we can't support that going forward, then we should
> > give up this kernel stuff and frolic in the fields joyously instead...
> >
> > Again, if you can tell me why it'd be such a problem then fine we can
> > address that.
> >
> > But blocking a series and demanding a change to an entire other feature
> > just to support something I'd say requires some pretty specific reasons as
> > to why you have a problem with the change.
> >
> > >
> > > Anyways, I don't have a very strong opinion one way or the other, but
> > > whatever we decide, let's make it robust.
> >
> > I mean... err... it sounds like you do kinda have pretty strong opinions ;)
>
> I am not sure how much more explicit I have to be, but I am hoping it is
> clearer now than before.

I mean perhaps I misinterpreted you as strongly advocating for the sentinel
and your intent was rather to provide argument on that side also so the
community can decide as you say - sure.

But with you indifferent as you say as to which way to go, and my having
provided arguments for the flags (again happy to hear push-back of course)
- I suggest we go forward with the series as-is, other than a fixpatch I'll
send for the -1 thing.

>
> Shakeel

Thanks for your review!
Pedro Falcato Sept. 25, 2024, 6:14 p.m. UTC | #7
On Wed, Sep 25, 2024 at 06:04:59PM GMT, Lorenzo Stoakes wrote:
> On Wed, Sep 25, 2024 at 09:19:17AM GMT, Shakeel Butt wrote:
> > I have no idea what makes you think I am blocking the feature that you
> > respond in a weird tone but let me be upfront about what I am asking: Let's
> > collectively decide which is the better option (in terms of
> > maintainability and extensibility) and move forward.
> 
> I'm not sure what you mean by 'weird tone'... perhaps a miscommunication?
> 
> To summarise in my view - a suggestion was made that, rather than providing
> the proposed flag, a pidfd sentinel should be introduced.
> 
> Simply introducing a sentinel that represents 'the current process' without
> changing interfaces that accept a pidfd would be broken - so implementing
> this implies that _all_ pidfd interfaces are updated, as well as tests.
>

While I suggested PIDFD_SELF, I never meant that we should change every interface,
but rather adopt a sound, consistent strategy for pidfd interfaces and stick with
it for the foreseeable future.

In this case, we'd adapt process_madvise, then possibly later pidfd_send_signal, etc.
There are plenty of pidfd interfaces that don't make sense with a PIDFD_SELF. Various
other interfaces will probably never want to adopt it at all (select _can't_, other
fs syscalls such as read/write/poll/whatever would require awful handholding from
various kernel subsystems, while in that sense we would definitely require a proper
struct file/inode/whatever for each pseudo-fd, which is not exactly what we want).

> I suggest doing so is, of course, entirely out of the scope of this
> change. Therefore if we were to require that here - it would block the
> feature while I go work on that.
> 
> I think this is pretty clear right? And I also suggest that doing so is
> likely to take quite some time, and may not even have a positive outcome.
> 
> So it's not a case of 'shall we take approach A or approach B?' but rather
> 'should we take approach A or entirely implement a new feature B, then once
> that is done, use it'.
> 
> So as to your 'collectively decide what is the better option' - in my
> previous response I argued that the best approach between 'use an
> unimplemented suggested entirely new feature of pidfd' vs. 'implement a
> flag that would in no way block the prior approach' - a flag works better.

I just don't think it's a new feature, just an established, future-proof way
of doing things :) Your patch should remain mostly similar apart from switching
the flag check into an fd check.

> >
> > By big undertaking, do you mean other syscalls that take pidfd
> > (pidfd_getfd, pidfd_send_signal & process_mrelease) to handle PIDFD_SELF
> > or something else?
> 
> I mean if you add a pidfd sentinel that represents 'the current process' it
> may get passed to any interface that accepts a pidfd, so all of them have
> to handle it _somehow_.
> 
> Also you'll want to update tests accordingly and clearly need to get
> community buy-in for that feature.
> 
> You may want to just add a bunch of:
> 
> if (pidfd == SENTINEL)
> 	return -EINVAL;

It should already be there in the form of an -EBADF.

> 
> So it's not impossible my instincts are off and we can get away with simply
> doing that.
> 
> On the other hand, would that be confusing? Wouldn't we need to update
> documentation, manpages, etc. to say explicitly 'hey this sentinel is just
> not supported'?

This is a fair point, but we could also just... not :) which I don't feel is too
wrong, since the fd works kind of like a flag here.
Lorenzo Stoakes Sept. 25, 2024, 6:31 p.m. UTC | #8
On Wed, Sep 25, 2024 at 07:14:51PM GMT, Pedro Falcato wrote:
> On Wed, Sep 25, 2024 at 06:04:59PM GMT, Lorenzo Stoakes wrote:
> > On Wed, Sep 25, 2024 at 09:19:17AM GMT, Shakeel Butt wrote:
> > > I have no idea what makes you think I am blocking the feature that you
> > > respond in a weird tone but let me be upfront what I am asking: Let's
> > > collectively decide which is the better option (in terms of
> > > maintainability and extensibility) and move forward.
> >
> > I'm not sure what you mean by 'weird tone'... perhaps a miscommunication?
> >
> > To summarise in my view - a suggestion was made to, rather than provide the
> > proposed flag - a pidfd sentinel should be introduced.
> >
> > Simply introducing a sentinel that represents 'the current process' without
> > changing interfaces that accept a pidfd would be broken - so implementing
> > this implies that _all_ pidfd interfaces are updated, as well as tests.
> >
>
> While I suggested PIDFD_SELF, I never meant that we should change every interface,
> but rather adopt a sound, consistent strategy for pidfd interfaces and stick with
> it for the foreseeable future.

If we add this to a public uapi-facing header and document it as 'refer to
self', we will introduce something that users will receive and be confused
if they are unable to use anywhere else.

This is the fundamental problem with this. It's a fundamental, permanent
uAPI change which either introduces confusion - because it only works for
process_madvise() - or will need work along with test updates etc.

And it is very much out of scope for this series as a result.

I may just back out of doing this and replace this with something simpler
that causes less push-back (and fixes existing broken behaviour in
process_madvise()) that lets me do what I need to do with the guard
pages. I don't think anybody can object to a self-pidfd having unrestricted
access to madvise flags...

I may in parallel just try to implement the pidfd sentinel idea and let it
take that long time and update process_madvise() later, as I can't really
let this block the guard page work.

>
> In this case, we'd adapt process_madvise, then possibly later pidfd_send_signal, etc.
> There are plenty of pidfd interfaces that don't make sense with a PIDFD_SELF. Various
> other interfaces will probably never want to adopt it at all (select _can't_, other
> fs syscalls such as read/write/poll/whatever would require awful handholding from
> various kernel subsystems, while in that sense we would definitely require a proper
> struct file/inode/whatever for each pseudo-fd, which is not exactly what we want).
>

And arguably, you'd have to audit all of them and decide. I mean I think
this kind of sums up my point really right?

Again, I don't object to the idea, I object to the suggestion that you
don't need to do this other work.

> > I suggest doing so is, of course, entirely out of the scope of this
> > change. Therefore if we were to require that here - it would block the
> > feature while I go work on that.
> >
> > I think this is pretty clear right? And I also suggest that doing so is
> > likely to take quite some time, and may not even have a positive outcome.
> >
> > So it's not a case of 'shall we take approach A or approach B?' but rather
> > 'should we take approach A or entirely implement a new feature B, then once
> > that is done, use it'.
> >
> > So as to your 'collectively decide what is the better option' - in my
> > previous response I argued that the best approach between 'use an
> > unimplemented suggested entirely new feature of pidfd' vs. 'implement a
> > flag that would in no way block the prior approach' - a flag works better.
>
> I just don't think it's a new feature, just an established, future-proof way
> of doing things :) Your patch should remain mostly similar apart from switching
> the flag check into an fd check.

It is absolutely a new feature, you're adding an entirely new UAPI-visible
flag that is applicable to all pidfd's, unless we make it
process_madvise()-specific in the flag, and I'm not sure that's going to
get accepted.

It also needs to be separately accepted by the maintainers of the relevant
file header etc.

>
> > >
> > > By big undertaking, do you mean other syscalls that take pidfd
> > > (pidfd_getfd, pidfd_send_signal & process_mrelease) to handle PIDFD_SELF
> > > or something else?
> >
> > I mean if you add a pidfd sentinel that represents 'the current process' it
> > may get passed to any interface that accepts a pidfd, so all of them have
> > to handle it _somehow_.
> >
> > Also you'll want to update tests accordingly and clearly need to get
> > community buy-in for that feature.
> >
> > You may want to just add a bunch of:
> >
> > if (pidfd == SENTINEL)
> > 	return -EINVAL;
>
> It should already be there in the form of an -EBADF.

:) this is pseudo-code. And I'd want to check that all pidfd interfaces
handle it correctly. I mean you'd think so right...

>
> >
> > So it's not impossible my instincts are off and we can get away with simply
> > doing that.
> >
> > On the other hand, would that be confusing? Wouldn't we need to update
> > documentation, manpages, etc. to say explicitly 'hey this sentinel is just
> > not supported'?
>
> This is a fair point, but we could also just... not :) which I don't feel is too
> wrong, since the fd works kind of like a flag here.

Yeah, no, I think you would have to. It's not specific to
process_madvise(), it's a general fd flag.

And like I said, if we did introduce this we'd need additional assessment
and review from those guys.

>
> --
> Pedro
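
The point about a self-pidfd having unrestricted access to madvise
behaviours can be modelled in a few lines. This is a sketch under stated
assumptions, not the kernel code: check_behavior() and model_remote_valid()
are hypothetical names, the remote allow-list is an illustrative subset
rather than the real list in mm/madvise.c, and the MADV_* fallbacks use the
asm-generic values.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* asm-generic values, defined here only so the sketch is self-contained */
#ifndef MADV_WILLNEED
#define MADV_WILLNEED 3
#endif
#ifndef MADV_DONTNEED
#define MADV_DONTNEED 4
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* Illustrative subset of behaviours permitted on a remote mm. */
static bool model_remote_valid(int behavior)
{
	switch (behavior) {
	case MADV_COLD:
	case MADV_PAGEOUT:
	case MADV_WILLNEED:
		return true;
	default:
		return false;
	}
}

/* Only a remote target is restricted; the caller's own mm is not. */
static int check_behavior(bool target_is_self, int behavior)
{
	assert(behavior >= 0);
	if (!target_is_self && !model_remote_valid(behavior))
		return -EINVAL;
	return 0;
}
```

Under this model a destructive hint such as MADV_DONTNEED is rejected for a
remote target but accepted when the target is the caller itself.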
Shakeel Butt Sept. 25, 2024, 9:37 p.m. UTC | #9
On Wed, Sep 25, 2024 at 06:04:59PM GMT, Lorenzo Stoakes wrote:
> On Wed, Sep 25, 2024 at 09:19:17AM GMT, Shakeel Butt wrote:
> > I have no idea what makes you think I am blocking the feature that you
> > respond in a weird tone but let me be upfront what I am asking: Let's
> > collectively decide which is the better option (in terms of
> > maintainability and extensibility) and move forward.
> 
> I'm not sure what you mean by 'weird tone'... perhaps a miscommunication?
> 
> To summarise in my view - a suggestion was made to, rather than provide the
> proposed flag - a pidfd sentinel should be introduced.
> 
> Simply introducing a sentinel that represents 'the current process' without
> changing interfaces that accept a pidfd would be broken - so implementing
> this implies that _all_ pidfd interfaces are updated, as well as tests.
> 
> I suggest doing so is, of course, entirely out of the scope of this
> change. Therefore if we were to require that here - it would block the
> feature while I go work on that.
> 
> I think this is pretty clear right? And I also suggest that doing so is
> likely to take quite some time, and may not even have a positive outcome.

If you have some concrete example of how this may not have a positive
outcome then it will make your case much stronger.

> 
> So it's not a case of 'shall we take approach A or approach B?' but rather
> 'should we take approach A or entirely implement a new feature B, then once
> that is done, use it'.

The "entire new feature" is a bit too strong IMHO. (though no pushback
from me).

> 
> So as to your 'collectively decide what is the better option' - in my
> previous response I argued that the best approach between 'use an
> unimplemented suggested entirely new feature of pidfd' vs. 'implement a
> flag that would in no way block the prior approach' - a flag works better.
> 
> If you can provide specific arguments as to why I'm wrong then by all means
> I'm happy to hear them.
> 
> >
> > On Wed, Sep 25, 2024 at 03:48:07PM GMT, Lorenzo Stoakes wrote:
> > > On Wed, Sep 25, 2024 at 07:02:59AM GMT, Shakeel Butt wrote:
> > > > Cced Christian
> > > >
> > > > On Tue, Sep 24, 2024 at 02:12:49PM GMT, Lorenzo Stoakes wrote:
> > > > > On Tue, Sep 24, 2024 at 01:51:11PM GMT, Pedro Falcato wrote:
> > > > > > On Tue, Sep 24, 2024 at 12:16:27PM GMT, Lorenzo Stoakes wrote:
> > > > > > > process_madvise() was conceived as a useful means for performing a vector
> > > > > > > of madvise() operations on a remote process's address space.
> > > > > > >
> > > > > > > However it's useful to be able to do so on the current process also. It is
> > > > > > > currently rather clunky to do this (requiring a pidfd to be opened for the
> > > > > > > current process) and introduces unnecessary overhead in incrementing
> > > > > > > reference counts for the task and mm.
> > > > > > >
> > > > > > > Avoid all of this by providing a PR_MADV_SELF flag, which causes
> > > > > > > process_madvise() to simply ignore the pidfd parameter and instead apply
> > > > > > > the operation to the current process.
> > > > > > >
> > > > > >
> > > > > > How about simply defining a pseudo-fd PIDFD_SELF in the negative int space?
> > > > > > There's precedent for it in the fs space (AT_FDCWD). I think it's more ergonomic
> > > > > > and if you take out the errno space we have around 2^31 - 4096 available sentinel
> > > > > > values.
> > > > > >
> > > > > > e.g:
> > > > > >
> > > > > > /* AT_FDCWD = -10, -1 is dangerous, pick a different value */
> > > > > > #define PIDFD_SELF   -11
> > > > > >
> > > > > > int pidfd = target_pid == getpid() ? PIDFD_SELF : pidfd_open(...);
> > > > > > process_madvise(pidfd, ...);
> > > > > >
> > > > > >
> > > > > > What do you think?
> > > > >
> > > > > I like the way you're thinking, but I don't think this is something we can
> > > > > do in the context of this series.
> > > > >
> > > > > I mean, I totally accept using a flag here and ignoring the pidfd field is
> > > > > _ugly_, no question. But I'm trying to find the smallest change that
> > > > > achieves what we want.
> > > >
> > > > I don't think "smallest change" should be the target. We are changing
> > > > user API and we should aim to make it as robust as possible against
> > > > possible misuse or making unintended assumptions.
> > >
> > > I think introducing a new pidfd sentinel that isn't used anywhere else is
> > > far more liable to mistakes than adding an explicit flag.
> > >
> > > Could you provide examples of possible misuse of this flag or unintended
> > > assumptions it confers (other than the -1 thing addressed below).
> > >
> > > The flag is explicitly 'target this process, ignore pidfd'. We can document
> > > it as such (I will patch manpages too).
> > >
> > > >
> > > > The proposed implementation opened the door for the applications to
> > > > provide dummy pidfd if PR_MADV_SELF is used. You definitely need to
> > > > restrict it to some known value like -1 used by mmap() syscall.
> > >
> > > Why?
> > >
> > > mmap() is special in that you have a 'dual' situation with shmem that is
> > > both file-backed and private and of course you can do MAP_SHARED |
> > > MAP_PRIVATE and have mmap() transparently assign something to you, etc.
> > >
> > > Here we explicitly have a flag whose semantics are 'ignore pidfd, target
> > > self'.
> > >
> > > If you choose to use a brand new flag that explicitly states this and
> > > provide a 'dummy' pidfd which then has nothing done to it - what exactly is
> > > the problem?
> >
> > IMHO having a fixed dummy would allow the kernel more flexibility in
> > future for evolving the API.
> 
> OK. I agree with having a fixed dummy value as stated.
> 
> >
> > >
> > > I mean if you feel strongly, we can enforce this, but I'm not sure -1
> > > implying a special case for pidfd is a thing either.
> > >
> > > On the other hand it would be _weird_ and broken for the user to provide a
> > > valid pidfd so maybe we should as it is easy to do and the user has clearly
> > > done something wrong.
> > >
> > > So fine, agreed, I'll add that.
> > >
> >
> > No, don't just agree. The response like "-1 is not good for so and so
> > reasons" is totally fine and my request would be add that reasoning in
> > the commit message. My only request is that we have thought through
> > alternatives and document the reasoning behind the decided approach.
> 
> I didn't just agree, as I said, my reasoning is:
> 
> 	On the other hand it would be _weird_ and broken for the user to
> 	provide a valid pidfd so maybe we should as it is easy to do and
> 	the user has clearly done something wrong.
> 
> If we're in alignment with that then all good!
> 
> >
> > > >
> > > > >
> > > > > To add such a sentinel would be a change to the pidfd mechanism as a whole,
> > > > > and we'd be left in the awkward situation that no other user of the pidfd
> > > > > mechanism would be implementing this, but we'd have to expose this as a
> > > > > general sentinel value for all pidfd users.
> > > >
> > > > There might be future users which can take advantage of this. I can even
> > > > imagine pidfd_send_signal() can use PIDFD_SELF as well.
> > >
> > > I'm confused by this comment - I mean absolutely, as I said I like the
> > > idea, but this just proves the point that you'd have to go around and
> > > implement this everywhere that uses a pidfd?
> > >
> > > That is a big undertaking, and not blocked by this change. Nor is
> > > maintaining the flag proposed here egregious.
> >
> > By big undertaking, do you mean other syscalls that take pidfd
> > (pidfd_getfd, pidfd_send_signal & process_mrelease) to handle PIDFD_SELF
> > or something else?
> 
> I mean if you add a pidfd sentinel that represents 'the current process' it
> may get passed to any interface that accepts a pidfd, so all of them have
> to handle it _somehow_.
> 
> Also you'll want to update tests accordingly and clearly need to get
> community buy-in for that feature.
> 
> You may want to just add a bunch of:
> 
> if (pidfd == SENTINEL)
> 	return -EINVAL;
> 
> So it's not impossible my instincts are off and we can get away with simply
> doing that.
> 
> On the other hand, would that be confusing? Wouldn't we need to update
> documentation, manpages, etc. to say explicitly 'hey this sentinel is just
> not supported'?
> 
> Again totally fine with the idea, like it actually, just my instincts are
> it will involve some work. I may be wrong.
> 
> >
> > >
> > > Blocking a useful feature because we may in future possibly add a new means
> > > of doing the same thing seems a little silly to me.
> > >
> >
> > Hah!!
> 
> See top of mail.
> 
> >
> > > > >
> > > > > One nice thing with doing this as a flag is that, later, if somebody is
> > > > > willing to do the larger task of having a special sentinel pidfd value to
> > > > > mean 'the current process', we could use this in process_madvise() and
> > > > > deprecate this flag :)
> > > > >
> > > >
> > > > Once something is added to an API, particularly syscalls, the removal
> > > > is almost impossible.
> > >
> > > And why would it be such a problem to have this flag remain? I said
> > > deprecate not remove. And only in the sense that 'you may as well use the
> > > sentinel'.
> > >
> >
> > My point was to aim for the solution where we can avoid such scenario
> > but it is totally understandable and acceptable that we still have to go
> > through deprecation process in future.
> >
> > > The flag is very clear in its meaning, and confers no special problem in
> > > remaining supported. It is a private flag that overlaps no others.
> > >
> > > I mean it'd in effect be a change to a single line 'if pidfd is sentinel
> > > or flag is used'. If we can't support that going forward, then we should
> > > give up this kernel stuff and frolic in the fields joyously instead...
> > >
> > > Again, if you can tell me why it'd be such a problem then fine we can
> > > address that.
> > >
> > > But blocking a series and demanding a change to an entire other feature
> > > just to support something I'd say requires some pretty specific reasons as
> > > to why you have a problem with the change.
> > >
> > > >
> > > > Anyways, I don't have very strong opinion one way or other but whatever
> > > > we decide, let's make it robust.
> > >
> > > I mean... err... it sounds like you do kinda have pretty strong opinions ;)
> >
> > I am not sure how much more explicit I have to be, but I am hoping it is
> > clearer now than before.
> 
> I mean perhaps I misinterpreted you as strongly advocating for the sentinel
> and your intent was rather to provide argument on that side also so the
> community can decide as you say - sure.
> 
> But with you indifferent as you say as to which way to go, and my having
> provided arguments for the flags (again happy to hear push-back of course)
> - I suggest we go forward with the series as-is, other than a fixpatch I'll
> send for the -1 thing.
>

My only request would be to add all these points in the commit message
i.e. why we took this approach rather than the alternative.

> >
> > Shakeel
> 
> Thanks for your review!
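
For readers following the patch in this series, the return-value convention
of the vector_madvise() helper it introduces can be modelled in user space.
This is only a sketch: model_vector_madvise() and fake_do_madvise() are
hypothetical stand-ins, and the fake fails with -EINVAL on a NULL base purely
to exercise the error path. The model mirrors the patch's
`(total_len - remaining) ? : ret` expression: any progress is reported as a
byte count, and the error is returned only if nothing was advised.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for do_madvise(): rejects a NULL base address. */
static int fake_do_madvise(const struct iovec *v)
{
	return v->iov_base == NULL ? -EINVAL : 0;
}

/* Apply the fake advice to each range, stopping at the first failure. */
static ssize_t model_vector_madvise(const struct iovec *vec, size_t vlen)
{
	size_t done = 0;
	ssize_t ret = 0;

	assert(vec != NULL || vlen == 0);

	for (size_t i = 0; i < vlen; i++) {
		ret = fake_do_madvise(&vec[i]);
		if (ret < 0)
			break;
		done += vec[i].iov_len;
	}

	/* Partial progress wins over the error, as in the patch. */
	return done ? (ssize_t)done : ret;
}
```

A caller therefore has to compare the returned count with the total length it
submitted to detect a failure partway through the vector.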

Patch

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..0148e6de35ab 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -86,4 +86,6 @@ 
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+#define PR_MADV_SELF	(1<<0)		/* process_madvise() flag - apply to self */
+
 #endif /* __ALPHA_MMAN_H__ */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 9c48d9a21aa0..acb4c3bc92b2 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -113,4 +113,6 @@ 
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+#define PR_MADV_SELF	(1<<0)		/* process_madvise() flag - apply to self */
+
 #endif /* _ASM_MMAN_H */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..0f839b2cad13 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -83,4 +83,6 @@ 
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+#define PR_MADV_SELF	(1<<0)		/* process_madvise() flag - apply to self */
+
 #endif /* __PARISC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..37dd27d09251 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -121,4 +121,6 @@ 
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+#define PR_MADV_SELF	(1<<0)		/* process_madvise() flag - apply to self */
+
 #endif /* _XTENSA_MMAN_H */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..8f59f23dee09 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -87,4 +87,6 @@ 
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+#define PR_MADV_SELF	(1<<0)		/* process_madvise() flag - apply to self */
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index ff139e57cca2..49d12f98b677 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1208,7 +1208,8 @@  madvise_behavior_valid(int behavior)
 	}
 }
 
-static bool process_madvise_behavior_valid(int behavior)
+/* Can we invoke process_madvise() on a remote mm for the specified behavior? */
+static bool process_madvise_remote_valid(int behavior)
 {
 	switch (behavior) {
 	case MADV_COLD:
@@ -1477,6 +1478,28 @@  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	return do_madvise(current->mm, start, len_in, behavior);
 }
 
+/* Perform an madvise operation over a vector of addresses and lengths. */
+static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
+			      int behavior)
+{
+	ssize_t ret = 0;
+	size_t total_len;
+
+	total_len = iov_iter_count(iter);
+
+	while (iov_iter_count(iter)) {
+		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
+				 iter_iov_len(iter), behavior);
+		if (ret < 0)
+			break;
+		iov_iter_advance(iter, iter_iov_len(iter));
+	}
+
+	ret = (total_len - iov_iter_count(iter)) ? : ret;
+
+	return ret;
+}
+
 SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		size_t, vlen, int, behavior, unsigned int, flags)
 {
@@ -1486,10 +1509,9 @@  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	struct iov_iter iter;
 	struct task_struct *task;
 	struct mm_struct *mm;
-	size_t total_len;
 	unsigned int f_flags;
 
-	if (flags != 0) {
+	if (flags & ~PR_MADV_SELF) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -1498,17 +1520,21 @@  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	if (ret < 0)
 		goto out;
 
+	/*
+	 * Perform an madvise operation on the current process. No restrictions
+	 * need be applied, nor do we need to pin the task or mm_struct.
+	 */
+	if (flags & PR_MADV_SELF) {
+		ret = vector_madvise(current->mm, &iter, behavior);
+		goto free_iov;
+	}
+
 	task = pidfd_get_task(pidfd, &f_flags);
 	if (IS_ERR(task)) {
 		ret = PTR_ERR(task);
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
-		ret = -EINVAL;
-		goto release_task;
-	}
-
 	/* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
 	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
 	if (IS_ERR_OR_NULL(mm)) {
@@ -1516,26 +1542,26 @@  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto release_task;
 	}
 
+	/*
+	 * We need only perform this check if we are attempting to manipulate a
+	 * remote process's address space.
+	 */
+	if (mm != current->mm && !process_madvise_remote_valid(behavior)) {
+		ret = -EINVAL;
+		goto release_mm;
+	}
+
 	/*
 	 * Require CAP_SYS_NICE for influencing process performance. Note that
-	 * only non-destructive hints are currently supported.
+	 * only non-destructive hints are currently supported for remote
+	 * processes.
 	 */
 	if (mm != current->mm && !capable(CAP_SYS_NICE)) {
 		ret = -EPERM;
 		goto release_mm;
 	}
 
-	total_len = iov_iter_count(&iter);
-
-	while (iov_iter_count(&iter)) {
-		ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
-					iter_iov_len(&iter), behavior);
-		if (ret < 0)
-			break;
-		iov_iter_advance(&iter, iter_iov_len(&iter));
-	}
-
-	ret = (total_len - iov_iter_count(&iter)) ? : ret;
+	ret = vector_madvise(mm, &iter, behavior);
 
 release_mm:
 	mmput(mm);