diff mbox series

[v5,10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()

Message ID 20220504214437.2850685-11-zokeefe@google.com (mailing list archive)
State New
Headers show
Series mm: userspace hugepage collapse | expand

Commit Message

Zach O'Keefe May 4, 2022, 9:44 p.m. UTC
Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
CAP_SYS_ADMIN or is requesting collapse of it's own memory.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/madvise.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Rongwei Wang May 11, 2022, 12:49 a.m. UTC | #1
Hi, Zach

Thanks for your great patchset!
Recently, We also try to collapse THP in this way, likes performance
degradation due to using too much hugepages in our scenes.

And there is a doubt about process_madvise(MADV_COLLAPSE) when we test 
this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', 
process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss 
something, please let me know.

If so, how about introducing process_madvise(MADV_HUGEPAGE) or 
process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
vma with 'hg', and the collapse process can be finished completely with 
the help of other processes. the latter could let some special vma avoid
collapsing when setting 'THP=always'.

Best regards,
-wrw

On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>   mm/madvise.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 638517952bd2..08c11217025a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
>   }
>   
>   static bool
> -process_madvise_behavior_valid(int behavior)
> +process_madvise_behavior_valid(int behavior, struct task_struct *task)
>   {
>   	switch (behavior) {
>   	case MADV_COLD:
>   	case MADV_PAGEOUT:
>   	case MADV_WILLNEED:
>   		return true;
> +	case MADV_COLLAPSE:
> +		return task == current || capable(CAP_SYS_ADMIN);
>   	default:
>   		return false;
>   	}
> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>   		goto free_iov;
>   	}
>   
> -	if (!process_madvise_behavior_valid(behavior)) {
> +	if (!process_madvise_behavior_valid(behavior, task)) {
>   		ret = -EINVAL;
>   		goto release_task;
>   	}
Zach O'Keefe May 11, 2022, 3:34 p.m. UTC | #2
Hey Rongwei,

Thanks for taking the time to review!

On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
> Hi, Zach
>
> Thanks for your great patchset!
> Recently, We also try to collapse THP in this way, likes performance
> degradation due to using too much hugepages in our scenes.
>
> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
> this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
> something, please let me know.
>

I tried to have MADV_COLLAPSE follow the same THP eligibility
semantics as khugepaged and at-fault: either THP=always, or
THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
out.

If I understand you correctly, the usefulness of
process_madvise(MADV_COLLAPSE) is limited in the case where
THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
behalf of another process since they don't have a way to mark the
target memory as eligible (which requires VM_HUGEPAGE).

If so, I think that's a valid point, and your suggestion below of a
supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
the sake of exploring all options, I'll mention that there was also a
previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
VM_HUGEPAGE[1].

Since it's possible supporting MADV_[NO]HUGEPAGE for
process_madivse(2) has applications outside a subsequent
MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
be in a hot path, I'd vote in favor of your suggestion and include
process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.

Thanks again for your review and your suggestion!
Zach

[1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/

> If so, how about introducing process_madvise(MADV_HUGEPAGE) or
> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
> vma with 'hg', and the collapse process can be finished completely with
> the help of other processes. the latter could let some special vma avoid
> collapsing when setting 'THP=always'.
>
> Best regards,
> -wrw
>
> On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >   mm/madvise.c | 6 ++++--
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 638517952bd2..08c11217025a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
> >   }
> >
> >   static bool
> > -process_madvise_behavior_valid(int behavior)
> > +process_madvise_behavior_valid(int behavior, struct task_struct *task)
> >   {
> >       switch (behavior) {
> >       case MADV_COLD:
> >       case MADV_PAGEOUT:
> >       case MADV_WILLNEED:
> >               return true;
> > +     case MADV_COLLAPSE:
> > +             return task == current || capable(CAP_SYS_ADMIN);
> >       default:
> >               return false;
> >       }
> > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >               goto free_iov;
> >       }
> >
> > -     if (!process_madvise_behavior_valid(behavior)) {
> > +     if (!process_madvise_behavior_valid(behavior, task)) {
> >               ret = -EINVAL;
> >               goto release_task;
> >       }
Rongwei Wang May 12, 2022, 3:53 p.m. UTC | #3
On 5/11/22 11:34 PM, Zach O'Keefe wrote:
> Hey Rongwei,
> 
> Thanks for taking the time to review!
> 
> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
>>
>> Hi, Zach
>>
>> Thanks for your great patchset!
>> Recently, We also try to collapse THP in this way, likes performance
>> degradation due to using too much hugepages in our scenes.
>>
>> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
>> this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
>> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
>> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
>> something, please let me know.
>>
> 
> I tried to have MADV_COLLAPSE follow the same THP eligibility
> semantics as khugepaged and at-fault: either THP=always, or
> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
> out.
> 
> If I understand you correctly, the usefulness of
> process_madvise(MADV_COLLAPSE) is limited in the case where
> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
> behalf of another process since they don't have a way to mark the
> target memory as eligible (which requires VM_HUGEPAGE).
Hi Zach

Yes, that's my means. It seems that MADV_[NO]HUGEPAGE for 
process_madivse(2) just needs little codes.

Anyway, looking forward to your next patchset.

Thanks
-wrw

> 
> If so, I think that's a valid point, and your suggestion below of a
> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
> the sake of exploring all options, I'll mention that there was also a
> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
> VM_HUGEPAGE[1].
> 
> Since it's possible supporting MADV_[NO]HUGEPAGE for
> process_madivse(2) has applications outside a subsequent
> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
> be in a hot path, I'd vote in favor of your suggestion and include
> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
> 
> Thanks again for your review and your suggestion!
> Zach
> 
> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
> 
>> If so, how about introducing process_madvise(MADV_HUGEPAGE) or
>> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
>> vma with 'hg', and the collapse process can be finished completely with
>> the help of other processes. the latter could let some special vma avoid
>> collapsing when setting 'THP=always'.
>>
>> Best regards,
>> -wrw
>>
>> On 5/5/22 5:44 AM, Zach O'Keefe wrote:
>>> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
>>> CAP_SYS_ADMIN or is requesting collapse of it's own memory.
>>>
>>> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
>>> ---
>>>    mm/madvise.c | 6 ++++--
>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>> index 638517952bd2..08c11217025a 100644
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
>>>    }
>>>
>>>    static bool
>>> -process_madvise_behavior_valid(int behavior)
>>> +process_madvise_behavior_valid(int behavior, struct task_struct *task)
>>>    {
>>>        switch (behavior) {
>>>        case MADV_COLD:
>>>        case MADV_PAGEOUT:
>>>        case MADV_WILLNEED:
>>>                return true;
>>> +     case MADV_COLLAPSE:
>>> +             return task == current || capable(CAP_SYS_ADMIN);
>>>        default:
>>>                return false;
>>>        }
>>> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>>                goto free_iov;
>>>        }
>>>
>>> -     if (!process_madvise_behavior_valid(behavior)) {
>>> +     if (!process_madvise_behavior_valid(behavior, task)) {
>>>                ret = -EINVAL;
>>>                goto release_task;
>>>        }
David Rientjes May 12, 2022, 8:03 p.m. UTC | #4
On Wed, 4 May 2022, Zach O'Keefe wrote:

> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Acked-by: David Rientjes <rientjes@google.com>
David Rientjes May 12, 2022, 8:03 p.m. UTC | #5
On Wed, 11 May 2022, Zach O'Keefe wrote:

> Hey Rongwei,
> 
> Thanks for taking the time to review!
> 
> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
> >
> > Hi, Zach
> >
> > Thanks for your great patchset!
> > Recently, We also try to collapse THP in this way, likes performance
> > degradation due to using too much hugepages in our scenes.
> >

Rongwei, could you elaborate on this?  I can understand undesired overhead 
for allocation of a hugepage at the time of fault if there is not enough 
benefit derived by the hugepages over the long term (like for a database 
workload), is this the performance degradation you're referring to?

Otherwise I'm unfamiliar with performance degradation for using too much 
hugepages after they have been allocated :)  Maybe RSS grows too much and 
we run into memory pressure?

It would be good to know more if you can share details here.

> > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
> > this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
> > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
> > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
> > something, please let me know.
> >
> 
> I tried to have MADV_COLLAPSE follow the same THP eligibility
> semantics as khugepaged and at-fault: either THP=always, or
> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
> out.
> 
> If I understand you correctly, the usefulness of
> process_madvise(MADV_COLLAPSE) is limited in the case where
> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
> behalf of another process since they don't have a way to mark the
> target memory as eligible (which requires VM_HUGEPAGE).
> 
> If so, I think that's a valid point, and your suggestion below of a
> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
> the sake of exploring all options, I'll mention that there was also a
> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
> VM_HUGEPAGE[1].
> 

If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary 
to need to do MADV_HUGEPAGE before that regardless of system-wide settings 
that it may not control?

Same point for a root user doing this on behalf of another user.  It could 
either do MADV_HUGEPAGE and change the behavior that the user perhaps 
requested behind its back (not desired) or it could temporarily set the 
system-wide setting to allow THP always before doing the MADV_COLLAPSE 
(it's root).

We can simply allow MADV_COLLAPSE to always collapse in either context 
regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?

> Since it's possible supporting MADV_[NO]HUGEPAGE for
> process_madivse(2) has applications outside a subsequent
> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
> be in a hot path, I'd vote in favor of your suggestion and include
> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
> 
> Thanks again for your review and your suggestion!
> Zach
> 
> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
> 
> > If so, how about introducing process_madvise(MADV_HUGEPAGE) or
> > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
> > vma with 'hg', and the collapse process can be finished completely with
> > the help of other processes. the latter could let some special vma avoid
> > collapsing when setting 'THP=always'.
> >
> > Best regards,
> > -wrw
> >
> > On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > > CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >   mm/madvise.c | 6 ++++--
> > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 638517952bd2..08c11217025a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
> > >   }
> > >
> > >   static bool
> > > -process_madvise_behavior_valid(int behavior)
> > > +process_madvise_behavior_valid(int behavior, struct task_struct *task)
> > >   {
> > >       switch (behavior) {
> > >       case MADV_COLD:
> > >       case MADV_PAGEOUT:
> > >       case MADV_WILLNEED:
> > >               return true;
> > > +     case MADV_COLLAPSE:
> > > +             return task == current || capable(CAP_SYS_ADMIN);
> > >       default:
> > >               return false;
> > >       }
> > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > >               goto free_iov;
> > >       }
> > >
> > > -     if (!process_madvise_behavior_valid(behavior)) {
> > > +     if (!process_madvise_behavior_valid(behavior, task)) {
> > >               ret = -EINVAL;
> > >               goto release_task;
> > >       }
>
Zach O'Keefe May 13, 2022, 9:06 p.m. UTC | #6
On Thu, May 12, 2022 at 1:03 PM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 11 May 2022, Zach O'Keefe wrote:
>
> > Hey Rongwei,
> >
> > Thanks for taking the time to review!
> >
> > On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
> > <rongwei.wang@linux.alibaba.com> wrote:
> > >
> > > Hi, Zach
> > >
> > > Thanks for your great patchset!
> > > Recently, We also try to collapse THP in this way, likes performance
> > > degradation due to using too much hugepages in our scenes.
> > >
>
> Rongwei, could you elaborate on this?  I can understand undesired overhead
> for allocation of a hugepage at the time of fault if there is not enough
> benefit derived by the hugepages over the long term (like for a database
> workload), is this the performance degradation you're referring to?
>
> Otherwise I'm unfamiliar with performance degradation for using too much
> hugepages after they have been allocated :)  Maybe RSS grows too much and
> we run into memory pressure?
>
> It would be good to know more if you can share details here.
>
> > > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
> > > this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
> > > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
> > > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
> > > something, please let me know.
> > >
> >
> > I tried to have MADV_COLLAPSE follow the same THP eligibility
> > semantics as khugepaged and at-fault: either THP=always, or
> > THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
> > out.
> >
> > If I understand you correctly, the usefulness of
> > process_madvise(MADV_COLLAPSE) is limited in the case where
> > THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
> > behalf of another process since they don't have a way to mark the
> > target memory as eligible (which requires VM_HUGEPAGE).
> >
> > If so, I think that's a valid point, and your suggestion below of a
> > supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
> > the sake of exploring all options, I'll mention that there was also a
> > previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
> > VM_HUGEPAGE[1].
> >
>
> If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary
> to need to do MADV_HUGEPAGE before that regardless of system-wide settings
> that it may not control?
>

If THP=always, this isn't necessary - but yes, if THP=madvise then, as
proposed, we'd need MADV_HUGEPAGE to opt-in to THPs, just like
at-fault and khugepaged.

If MADV_COLLAPSE also sets VM_HUGEPAGE, then in THP=madvise mode, we
also opt-in that range to khugepaged scanning. Most likely this is
what we want anyways (if the collapsed range is split after
MADV_COLLAPSE, khugepaged could come along and fix things) but it does
couple the two more. We can always MADV_NOHUGEPAGE the range after
MADV_COLLAPSE if this was ever a concern, however.

> Same point for a root user doing this on behalf of another user.  It could
> either do MADV_HUGEPAGE and change the behavior that the user perhaps
> requested behind its back (not desired) or it could temporarily set the
> system-wide setting to allow THP always before doing the MADV_COLLAPSE
> (it's root).
>
> We can simply allow MADV_COLLAPSE to always collapse in either context
> regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?
>

I think we need to respect VM_NOHUGEPAGE as it can be set from
non-madvise code, as David H mentioned for kvm on s390 in the PATCH
RFC[1]. It's not clear to me how this would break, however. If
MADV_HUGEPAGE'ing the areas marked by
arch/s390/mm/gmap.c:thp_split_mm() would also break kvm (assuming
subsequent collapse by khugepaged), then MADV_COLLAPSE ignoring
VM_NOHUGEPAGE doesn't really increase the failure space much. Else, I
think we should really be respecting VM_NOHUGEPAGE.

[1] https://lore.kernel.org/linux-mm/30571216-5a6a-7a11-3b2c-77d914025f6d@redhat.com/


> > Since it's possible supporting MADV_[NO]HUGEPAGE for
> > process_madivse(2) has applications outside a subsequent
> > MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
> > be in a hot path, I'd vote in favor of your suggestion and include
> > process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
> >
> > Thanks again for your review and your suggestion!
> > Zach
> >
> > [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
> >
> > > If so, how about introducing process_madvise(MADV_HUGEPAGE) or
> > > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
> > > vma with 'hg', and the collapse process can be finished completely with
> > > the help of other processes. the latter could let some special vma avoid
> > > collapsing when setting 'THP=always'.
> > >
> > > Best regards,
> > > -wrw
> > >
> > > On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> > > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > > > CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >   mm/madvise.c | 6 ++++--
> > > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 638517952bd2..08c11217025a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
> > > >   }
> > > >
> > > >   static bool
> > > > -process_madvise_behavior_valid(int behavior)
> > > > +process_madvise_behavior_valid(int behavior, struct task_struct *task)
> > > >   {
> > > >       switch (behavior) {
> > > >       case MADV_COLD:
> > > >       case MADV_PAGEOUT:
> > > >       case MADV_WILLNEED:
> > > >               return true;
> > > > +     case MADV_COLLAPSE:
> > > > +             return task == current || capable(CAP_SYS_ADMIN);
> > > >       default:
> > > >               return false;
> > > >       }
> > > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > > >               goto free_iov;
> > > >       }
> > > >
> > > > -     if (!process_madvise_behavior_valid(behavior)) {
> > > > +     if (!process_madvise_behavior_valid(behavior, task)) {
> > > >               ret = -EINVAL;
> > > >               goto release_task;
> > > >       }
> >
Rongwei Wang May 16, 2022, 3:56 a.m. UTC | #7
On 5/13/22 4:03 AM, David Rientjes wrote:
> On Wed, 11 May 2022, Zach O'Keefe wrote:
> 
>> Hey Rongwei,
>>
>> Thanks for taking the time to review!
>>
>> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
>> <rongwei.wang@linux.alibaba.com> wrote:
>>>
>>> Hi, Zach
>>>
>>> Thanks for your great patchset!
>>> Recently, We also try to collapse THP in this way, likes performance
>>> degradation due to using too much hugepages in our scenes.
>>>
> 
> Rongwei, could you elaborate on this?  I can understand undesired overhead
> for allocation of a hugepage at the time of fault if there is not enough
> benefit derived by the hugepages over the long term (like for a database
> workload), is this the performance degradation you're referring to?
> 
Hi David

Sorry for the late reply.

Actually, I'm not sure that the degradation is caused by using too much 
hugepages. Plus, I also referrd to Google's doc[1], and the TLB pressure 
mentioned here. Maybe the process_madvise(MADV_COLLAPSE) can help us to 
solve these issues.

[1] 
https://dyninst.github.io/scalable_tools_workshop/petascale2018/assets/slides/CSCADS%202018%20perf_events%20status%20update.pdf

-wrw

> Otherwise I'm unfamiliar with performance degradation for using too much
> hugepages after they have been allocated :)  Maybe RSS grows too much and
> we run into memory pressure?
> 
> It would be good to know more if you can share details here.
> 
>>> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
>>> this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
>>> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
>>> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
>>> something, please let me know.
>>>
>>
>> I tried to have MADV_COLLAPSE follow the same THP eligibility
>> semantics as khugepaged and at-fault: either THP=always, or
>> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
>> out.
>>
>> If I understand you correctly, the usefulness of
>> process_madvise(MADV_COLLAPSE) is limited in the case where
>> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
>> behalf of another process since they don't have a way to mark the
>> target memory as eligible (which requires VM_HUGEPAGE).
>>
>> If so, I think that's a valid point, and your suggestion below of a
>> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
>> the sake of exploring all options, I'll mention that there was also a
>> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
>> VM_HUGEPAGE[1].
>>
> 
> If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary
> to need to do MADV_HUGEPAGE before that regardless of system-wide settings
> that it may not control?
> 
> Same point for a root user doing this on behalf of another user.  It could
> either do MADV_HUGEPAGE and change the behavior that the user perhaps
> requested behind its back (not desired) or it could temporarily set the
> system-wide setting to allow THP always before doing the MADV_COLLAPSE
> (it's root).
> 
> We can simply allow MADV_COLLAPSE to always collapse in either context
> regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?
> 
>> Since it's possible supporting MADV_[NO]HUGEPAGE for
>> process_madivse(2) has applications outside a subsequent
>> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
>> be in a hot path, I'd vote in favor of your suggestion and include
>> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
>>
>> Thanks again for your review and your suggestion!
>> Zach
>>
>> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
>>
>>> If so, how about introducing process_madvise(MADV_HUGEPAGE) or
>>> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
>>> vma with 'hg', and the collapse process can be finished completely with
>>> the help of other processes. the latter could let some special vma avoid
>>> collapsing when setting 'THP=always'.
>>>
>>> Best regards,
>>> -wrw
>>>
>>> On 5/5/22 5:44 AM, Zach O'Keefe wrote:
>>>> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
>>>> CAP_SYS_ADMIN or is requesting collapse of it's own memory.
>>>>
>>>> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
>>>> ---
>>>>    mm/madvise.c | 6 ++++--
>>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 638517952bd2..08c11217025a 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
>>>>    }
>>>>
>>>>    static bool
>>>> -process_madvise_behavior_valid(int behavior)
>>>> +process_madvise_behavior_valid(int behavior, struct task_struct *task)
>>>>    {
>>>>        switch (behavior) {
>>>>        case MADV_COLD:
>>>>        case MADV_PAGEOUT:
>>>>        case MADV_WILLNEED:
>>>>                return true;
>>>> +     case MADV_COLLAPSE:
>>>> +             return task == current || capable(CAP_SYS_ADMIN);
>>>>        default:
>>>>                return false;
>>>>        }
>>>> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>>>                goto free_iov;
>>>>        }
>>>>
>>>> -     if (!process_madvise_behavior_valid(behavior)) {
>>>> +     if (!process_madvise_behavior_valid(behavior, task)) {
>>>>                ret = -EINVAL;
>>>>                goto release_task;
>>>>        }
>>
diff mbox series

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 638517952bd2..08c11217025a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1168,13 +1168,15 @@  madvise_behavior_valid(int behavior)
 }
 
 static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 		return true;
+	case MADV_COLLAPSE:
+		return task == current || capable(CAP_SYS_ADMIN);
 	default:
 		return false;
 	}
@@ -1452,7 +1454,7 @@  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
+	if (!process_madvise_behavior_valid(behavior, task)) {
 		ret = -EINVAL;
 		goto release_task;
 	}