Message ID | 20220504214437.2850685-11-zokeefe@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: userspace hugepage collapse | expand |
Hi, Zach Thanks for your great patchset! Recently, We also try to collapse THP in this way, likes performance degradation due to using too much hugepages in our scenes. And there is a doubt about process_madvise(MADV_COLLAPSE) when we test this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss something, please let me know. If so, how about introducing process_madvise(MADV_HUGEPAGE) or process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target vma with 'hg', and the collapse process can be finished completely with the help of other processes. the latter could let some special vma avoid collapsing when setting 'THP=always'. Best regards, -wrw On 5/5/22 5:44 AM, Zach O'Keefe wrote: > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > CAP_SYS_ADMIN or is requesting collapse of it's own memory. > > Signed-off-by: Zach O'Keefe <zokeefe@google.com> > --- > mm/madvise.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 638517952bd2..08c11217025a 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) > } > > static bool > -process_madvise_behavior_valid(int behavior) > +process_madvise_behavior_valid(int behavior, struct task_struct *task) > { > switch (behavior) { > case MADV_COLD: > case MADV_PAGEOUT: > case MADV_WILLNEED: > return true; > + case MADV_COLLAPSE: > + return task == current || capable(CAP_SYS_ADMIN); > default: > return false; > } > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > goto free_iov; > } > > - if (!process_madvise_behavior_valid(behavior)) { > + if (!process_madvise_behavior_valid(behavior, task)) { > ret = -EINVAL; > goto release_task; > }
Hey Rongwei, Thanks for taking the time to review! On Tue, May 10, 2022 at 5:49 PM Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > Hi, Zach > > Thanks for your great patchset! > Recently, We also try to collapse THP in this way, likes performance > degradation due to using too much hugepages in our scenes. > > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test > this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss > something, please let me know. > I tried to have MADV_COLLAPSE follow the same THP eligibility semantics as khugepaged and at-fault: either THP=always, or THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point out. If I understand you correctly, the usefulness of process_madvise(MADV_COLLAPSE) is limited in the case where THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of behalf of another process since they don't have a way to mark the target memory as eligible (which requires VM_HUGEPAGE). If so, I think that's a valid point, and your suggestion below of a supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For the sake of exploring all options, I'll mention that there was also a previous idea suggested by Yang Shi where MADV_COLLAPSE could also set VM_HUGEPAGE[1]. Since it's possible supporting MADV_[NO]HUGEPAGE for process_madivse(2) has applications outside a subsequent MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to be in a hot path, I'd vote in favor of your suggestion and include process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object. Thanks again for your review and your suggestion! Zach [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/ > If so, how about introducing process_madvise(MADV_HUGEPAGE) or > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target > vma with 'hg', and the collapse process can be finished completely with > the help of other processes. the latter could let some special vma avoid > collapsing when setting 'THP=always'. > > Best regards, > -wrw > > On 5/5/22 5:44 AM, Zach O'Keefe wrote: > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > > CAP_SYS_ADMIN or is requesting collapse of it's own memory. > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com> > > --- > > mm/madvise.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 638517952bd2..08c11217025a 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) > > } > > > > static bool > > -process_madvise_behavior_valid(int behavior) > > +process_madvise_behavior_valid(int behavior, struct task_struct *task) > > { > > switch (behavior) { > > case MADV_COLD: > > case MADV_PAGEOUT: > > case MADV_WILLNEED: > > return true; > > + case MADV_COLLAPSE: > > + return task == current || capable(CAP_SYS_ADMIN); > > default: > > return false; > > } > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > > goto free_iov; > > } > > > > - if (!process_madvise_behavior_valid(behavior)) { > > + if (!process_madvise_behavior_valid(behavior, task)) { > > ret = -EINVAL; > > goto release_task; > > }
On 5/11/22 11:34 PM, Zach O'Keefe wrote: > Hey Rongwei, > > Thanks for taking the time to review! > > On Tue, May 10, 2022 at 5:49 PM Rongwei Wang > <rongwei.wang@linux.alibaba.com> wrote: >> >> Hi, Zach >> >> Thanks for your great patchset! >> Recently, We also try to collapse THP in this way, likes performance >> degradation due to using too much hugepages in our scenes. >> >> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test >> this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on >> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', >> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss >> something, please let me know. >> > > I tried to have MADV_COLLAPSE follow the same THP eligibility > semantics as khugepaged and at-fault: either THP=always, or > THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point > out. > > If I understand you correctly, the usefulness of > process_madvise(MADV_COLLAPSE) is limited in the case where > THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of > behalf of another process since they don't have a way to mark the > target memory as eligible (which requires VM_HUGEPAGE). Hi Zach Yes, that's my means. It seems that MADV_[NO]HUGEPAGE for process_madivse(2) just needs little codes. Anyway, looking forward to your next patchset. Thanks -wrw > > If so, I think that's a valid point, and your suggestion below of a > supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For > the sake of exploring all options, I'll mention that there was also a > previous idea suggested by Yang Shi where MADV_COLLAPSE could also set > VM_HUGEPAGE[1]. > > Since it's possible supporting MADV_[NO]HUGEPAGE for > process_madivse(2) has applications outside a subsequent > MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to > be in a hot path, I'd vote in favor of your suggestion and include > process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object. > > Thanks again for your review and your suggestion! > Zach > > [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/ > >> If so, how about introducing process_madvise(MADV_HUGEPAGE) or >> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target >> vma with 'hg', and the collapse process can be finished completely with >> the help of other processes. the latter could let some special vma avoid >> collapsing when setting 'THP=always'. >> >> Best regards, >> -wrw >> >> On 5/5/22 5:44 AM, Zach O'Keefe wrote: >>> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has >>> CAP_SYS_ADMIN or is requesting collapse of it's own memory. >>> >>> Signed-off-by: Zach O'Keefe <zokeefe@google.com> >>> --- >>> mm/madvise.c | 6 ++++-- >>> 1 file changed, 4 insertions(+), 2 deletions(-) >>> >>> diff --git a/mm/madvise.c b/mm/madvise.c >>> index 638517952bd2..08c11217025a 100644 >>> --- a/mm/madvise.c >>> +++ b/mm/madvise.c >>> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) >>> } >>> >>> static bool >>> -process_madvise_behavior_valid(int behavior) >>> +process_madvise_behavior_valid(int behavior, struct task_struct *task) >>> { >>> switch (behavior) { >>> case MADV_COLD: >>> case MADV_PAGEOUT: >>> case MADV_WILLNEED: >>> return true; >>> + case MADV_COLLAPSE: >>> + return task == current || capable(CAP_SYS_ADMIN); >>> default: >>> return false; >>> } >>> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, >>> goto free_iov; >>> } >>> >>> - if (!process_madvise_behavior_valid(behavior)) { >>> + if (!process_madvise_behavior_valid(behavior, task)) { >>> ret = -EINVAL; >>> goto release_task; >>> }
On Wed, 4 May 2022, Zach O'Keefe wrote: > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > CAP_SYS_ADMIN or is requesting collapse of it's own memory. > > Signed-off-by: Zach O'Keefe <zokeefe@google.com> Acked-by: David Rientjes <rientjes@google.com>
On Wed, 11 May 2022, Zach O'Keefe wrote: > Hey Rongwei, > > Thanks for taking the time to review! > > On Tue, May 10, 2022 at 5:49 PM Rongwei Wang > <rongwei.wang@linux.alibaba.com> wrote: > > > > Hi, Zach > > > > Thanks for your great patchset! > > Recently, We also try to collapse THP in this way, likes performance > > degradation due to using too much hugepages in our scenes. > > Rongwei, could you elaborate on this? I can understand undesired overhead for allocation of a hugepage at the time of fault if there is not enough benefit derived by the hugepages over the long term (like for a database workload), is this the performance degradation you're referring to? Otherwise I'm unfamiliar with performance degradation for using too much hugepages after they have been allocated :) Maybe RSS grows too much and we run into memory pressure? It would be good to know more if you can share details here. > > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test > > this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on > > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', > > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss > > something, please let me know. > > > > I tried to have MADV_COLLAPSE follow the same THP eligibility > semantics as khugepaged and at-fault: either THP=always, or > THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point > out. > > If I understand you correctly, the usefulness of > process_madvise(MADV_COLLAPSE) is limited in the case where > THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of > behalf of another process since they don't have a way to mark the > target memory as eligible (which requires VM_HUGEPAGE). > > If so, I think that's a valid point, and your suggestion below of a > supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For > the sake of exploring all options, I'll mention that there was also a > previous idea suggested by Yang Shi where MADV_COLLAPSE could also set > VM_HUGEPAGE[1]. > If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary to need to do MADV_HUGEPAGE before that regardless of system-wide settings that it may not control? Same point for a root user doing this on behalf of another user. It could either do MADV_HUGEPAGE and change the behavior that the user perhaps requested behind its back (not desired) or it could temporarily set the system-wide setting to allow THP always before doing the MADV_COLLAPSE (it's root). We can simply allow MADV_COLLAPSE to always collapse in either context regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings? > Since it's possible supporting MADV_[NO]HUGEPAGE for > process_madivse(2) has applications outside a subsequent > MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to > be in a hot path, I'd vote in favor of your suggestion and include > process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object. > > Thanks again for your review and your suggestion! > Zach > > [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/ > > > If so, how about introducing process_madvise(MADV_HUGEPAGE) or > > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target > > vma with 'hg', and the collapse process can be finished completely with > > the help of other processes. the latter could let some special vma avoid > > collapsing when setting 'THP=always'. > > > > Best regards, > > -wrw > > > > On 5/5/22 5:44 AM, Zach O'Keefe wrote: > > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > > > CAP_SYS_ADMIN or is requesting collapse of it's own memory. > > > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com> > > > --- > > > mm/madvise.c | 6 ++++-- > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > index 638517952bd2..08c11217025a 100644 > > > --- a/mm/madvise.c > > > +++ b/mm/madvise.c > > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) > > > } > > > > > > static bool > > > -process_madvise_behavior_valid(int behavior) > > > +process_madvise_behavior_valid(int behavior, struct task_struct *task) > > > { > > > switch (behavior) { > > > case MADV_COLD: > > > case MADV_PAGEOUT: > > > case MADV_WILLNEED: > > > return true; > > > + case MADV_COLLAPSE: > > > + return task == current || capable(CAP_SYS_ADMIN); > > > default: > > > return false; > > > } > > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > > > goto free_iov; > > > } > > > > > > - if (!process_madvise_behavior_valid(behavior)) { > > > + if (!process_madvise_behavior_valid(behavior, task)) { > > > ret = -EINVAL; > > > goto release_task; > > > } >
On Thu, May 12, 2022 at 1:03 PM David Rientjes <rientjes@google.com> wrote: > > On Wed, 11 May 2022, Zach O'Keefe wrote: > > > Hey Rongwei, > > > > Thanks for taking the time to review! > > > > On Tue, May 10, 2022 at 5:49 PM Rongwei Wang > > <rongwei.wang@linux.alibaba.com> wrote: > > > > > > Hi, Zach > > > > > > Thanks for your great patchset! > > > Recently, We also try to collapse THP in this way, likes performance > > > degradation due to using too much hugepages in our scenes. > > > > > Rongwei, could you elaborate on this? I can understand undesired overhead > for allocation of a hugepage at the time of fault if there is not enough > benefit derived by the hugepages over the long term (like for a database > workload), is this the performance degradation you're referring to? > > Otherwise I'm unfamiliar with performance degradation for using too much > hugepages after they have been allocated :) Maybe RSS grows too much and > we run into memory pressure? > > It would be good to know more if you can share details here. > > > > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test > > > this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on > > > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', > > > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss > > > something, please let me know. > > > > > > > I tried to have MADV_COLLAPSE follow the same THP eligibility > > semantics as khugepaged and at-fault: either THP=always, or > > THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point > > out. > > > > If I understand you correctly, the usefulness of > > process_madvise(MADV_COLLAPSE) is limited in the case where > > THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of > > behalf of another process since they don't have a way to mark the > > target memory as eligible (which requires VM_HUGEPAGE). > > > > If so, I think that's a valid point, and your suggestion below of a > > supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For > > the sake of exploring all options, I'll mention that there was also a > > previous idea suggested by Yang Shi where MADV_COLLAPSE could also set > > VM_HUGEPAGE[1]. > > > > If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary > to need to do MADV_HUGEPAGE before that regardless of system-wide settings > that it may not control? > If THP=always, this isn't necessary - but yes, if THP=madvise then, as proposed, we'd need MADV_HUGEPAGE to opt-in to THPs, just like at-fault and khugepaged. If MADV_COLLAPSE also sets VM_HUGEPAGE, then in THP=madvise mode, we also opt-in that range to khugepaged scanning. Most likely this is what we want anyways (if the collapsed range is split after MADV_COLLAPSE, khugepaged could come along and fix things) but it does couple the two more. We can always MADV_NOHUGEPAGE the range after MADV_COLLAPSE if this was ever a concern, however. > Same point for a root user doing this on behalf of another user. It could > either do MADV_HUGEPAGE and change the behavior that the user perhaps > requested behind its back (not desired) or it could temporarily set the > system-wide setting to allow THP always before doing the MADV_COLLAPSE > (it's root). > > We can simply allow MADV_COLLAPSE to always collapse in either context > regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings? > I think we need to respect VM_NOHUGEPAGE as it can be set from non-madvise code, as David H mentioned for kvm on s390 in the PATCH RFC[1]. It's not clear to me how this would break, however. If MADV_HUGEPAGE'ing the areas marked by arch/s390/mm/gmap.c:thp_split_mm() would also break kvm (assuming subsequent collapse by khugepaged), then MADV_COLLAPSE ignoring VM_NOHUGEPAGE doesn't really increase the failure space much. Else, I think we should really be respecting VM_NOHUGEPAGE. [1] https://lore.kernel.org/linux-mm/30571216-5a6a-7a11-3b2c-77d914025f6d@redhat.com/ > > Since it's possible supporting MADV_[NO]HUGEPAGE for > > process_madivse(2) has applications outside a subsequent > > MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to > > be in a hot path, I'd vote in favor of your suggestion and include > > process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object. > > > > Thanks again for your review and your suggestion! > > Zach > > > > [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/ > > > > > If so, how about introducing process_madvise(MADV_HUGEPAGE) or > > > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target > > > vma with 'hg', and the collapse process can be finished completely with > > > the help of other processes. the latter could let some special vma avoid > > > collapsing when setting 'THP=always'. > > > > > > Best regards, > > > -wrw > > > > > > On 5/5/22 5:44 AM, Zach O'Keefe wrote: > > > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > > > > CAP_SYS_ADMIN or is requesting collapse of it's own memory. > > > > > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com> > > > > --- > > > > mm/madvise.c | 6 ++++-- > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > > index 638517952bd2..08c11217025a 100644 > > > > --- a/mm/madvise.c > > > > +++ b/mm/madvise.c > > > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) > > > > } > > > > > > > > static bool > > > > -process_madvise_behavior_valid(int behavior) > > > > +process_madvise_behavior_valid(int behavior, struct task_struct *task) > > > > { > > > > switch (behavior) { > > > > case MADV_COLD: > > > > case MADV_PAGEOUT: > > > > case MADV_WILLNEED: > > > > return true; > > > > + case MADV_COLLAPSE: > > > > + return task == current || capable(CAP_SYS_ADMIN); > > > > default: > > > > return false; > > > > } > > > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > > > > goto free_iov; > > > > } > > > > > > > > - if (!process_madvise_behavior_valid(behavior)) { > > > > + if (!process_madvise_behavior_valid(behavior, task)) { > > > > ret = -EINVAL; > > > > goto release_task; > > > > } > >
On 5/13/22 4:03 AM, David Rientjes wrote: > On Wed, 11 May 2022, Zach O'Keefe wrote: > >> Hey Rongwei, >> >> Thanks for taking the time to review! >> >> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang >> <rongwei.wang@linux.alibaba.com> wrote: >>> >>> Hi, Zach >>> >>> Thanks for your great patchset! >>> Recently, We also try to collapse THP in this way, likes performance >>> degradation due to using too much hugepages in our scenes. >>> > > Rongwei, could you elaborate on this? I can understand undesired overhead > for allocation of a hugepage at the time of fault if there is not enough > benefit derived by the hugepages over the long term (like for a database > workload), is this the performance degradation you're referring to? > Hi David Sorry for the late reply. Actually, I'm not sure that the degradation is caused by using too much hugepages. Plus, I also referrd to Google's doc[1], and the TLB pressure mentioned here. Maybe the process_madvise(MADV_COLLAPSE) can help us to solve these issues. [1] https://dyninst.github.io/scalable_tools_workshop/petascale2018/assets/slides/CSCADS%202018%20perf_events%20status%20update.pdf -wrw > Otherwise I'm unfamiliar with performance degradation for using too much > hugepages after they have been allocated :) Maybe RSS grows too much and > we run into memory pressure? > > It would be good to know more if you can share details here. > >>> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test >>> this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on >>> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', >>> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss >>> something, please let me know. >>> >> >> I tried to have MADV_COLLAPSE follow the same THP eligibility >> semantics as khugepaged and at-fault: either THP=always, or >> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point >> out. >> >> If I understand you correctly, the usefulness of >> process_madvise(MADV_COLLAPSE) is limited in the case where >> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of >> behalf of another process since they don't have a way to mark the >> target memory as eligible (which requires VM_HUGEPAGE). >> >> If so, I think that's a valid point, and your suggestion below of a >> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For >> the sake of exploring all options, I'll mention that there was also a >> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set >> VM_HUGEPAGE[1]. >> > > If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary > to need to do MADV_HUGEPAGE before that regardless of system-wide settings > that it may not control? > > Same point for a root user doing this on behalf of another user. It could > either do MADV_HUGEPAGE and change the behavior that the user perhaps > requested behind its back (not desired) or it could temporarily set the > system-wide setting to allow THP always before doing the MADV_COLLAPSE > (it's root). > > We can simply allow MADV_COLLAPSE to always collapse in either context > regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings? > >> Since it's possible supporting MADV_[NO]HUGEPAGE for >> process_madivse(2) has applications outside a subsequent >> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to >> be in a hot path, I'd vote in favor of your suggestion and include >> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object. >> >> Thanks again for your review and your suggestion! >> Zach >> >> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/ >> >>> If so, how about introducing process_madvise(MADV_HUGEPAGE) or >>> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target >>> vma with 'hg', and the collapse process can be finished completely with >>> the help of other processes. the latter could let some special vma avoid >>> collapsing when setting 'THP=always'. >>> >>> Best regards, >>> -wrw >>> >>> On 5/5/22 5:44 AM, Zach O'Keefe wrote: >>>> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has >>>> CAP_SYS_ADMIN or is requesting collapse of it's own memory. >>>> >>>> Signed-off-by: Zach O'Keefe <zokeefe@google.com> >>>> --- >>>> mm/madvise.c | 6 ++++-- >>>> 1 file changed, 4 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/mm/madvise.c b/mm/madvise.c >>>> index 638517952bd2..08c11217025a 100644 >>>> --- a/mm/madvise.c >>>> +++ b/mm/madvise.c >>>> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) >>>> } >>>> >>>> static bool >>>> -process_madvise_behavior_valid(int behavior) >>>> +process_madvise_behavior_valid(int behavior, struct task_struct *task) >>>> { >>>> switch (behavior) { >>>> case MADV_COLD: >>>> case MADV_PAGEOUT: >>>> case MADV_WILLNEED: >>>> return true; >>>> + case MADV_COLLAPSE: >>>> + return task == current || capable(CAP_SYS_ADMIN); >>>> default: >>>> return false; >>>> } >>>> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, >>>> goto free_iov; >>>> } >>>> >>>> - if (!process_madvise_behavior_valid(behavior)) { >>>> + if (!process_madvise_behavior_valid(behavior, task)) { >>>> ret = -EINVAL; >>>> goto release_task; >>>> } >>
diff --git a/mm/madvise.c b/mm/madvise.c index 638517952bd2..08c11217025a 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) } static bool -process_madvise_behavior_valid(int behavior) +process_madvise_behavior_valid(int behavior, struct task_struct *task) { switch (behavior) { case MADV_COLD: case MADV_PAGEOUT: case MADV_WILLNEED: return true; + case MADV_COLLAPSE: + return task == current || capable(CAP_SYS_ADMIN); default: return false; } @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, goto free_iov; } - if (!process_madvise_behavior_valid(behavior)) { + if (!process_madvise_behavior_valid(behavior, task)) { ret = -EINVAL; goto release_task; }
Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has CAP_SYS_ADMIN or is requesting collapse of it's own memory. Signed-off-by: Zach O'Keefe <zokeefe@google.com> --- mm/madvise.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)